YouTube Ingestion

Ingest transcripts from YouTube videos, playlists, and channels to make video content searchable and queryable in your applications.

YouTube ingestion uses AI-powered extraction to produce a transcript for each video. Playlists and channels are automatically expanded into individual videos for processing.

Overview

SourceSync supports ingesting content from various YouTube sources:

Individual Videos - Extract transcripts from specific YouTube videos
Playlists - Automatically expand playlists and process each video individually
Channels - Extract videos from channels and process transcripts for each video

All YouTube content is processed to extract high-quality transcripts that can be chunked, embedded, and made available for semantic search and retrieval.

Supported URL Formats

SourceSync supports all standard YouTube URL formats:

Video URLs

https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
https://www.youtube.com/v/VIDEO_ID

Playlist URLs

https://www.youtube.com/playlist?list=PLAYLIST_ID

Channel URLs

https://www.youtube.com/channel/CHANNEL_ID
https://www.youtube.com/c/CHANNEL_NAME
https://www.youtube.com/user/USER_NAME
https://www.youtube.com/@HANDLE
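
If you want to check URLs on the client before submitting them, the formats listed above can be recognized with the standard library. The following is a minimal sketch, not SourceSync's own detection logic: the function name classify_youtube_url and the "unknown" fallback are hypothetical, and only the URL shapes listed in this section are handled.

```python
from urllib.parse import urlparse, parse_qs


def classify_youtube_url(url: str) -> str:
    """Return 'video', 'playlist', 'channel', or 'unknown' for a URL.

    Hypothetical client-side helper covering only the formats listed in
    this guide; the service may accept other variants.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Non-empty path segments, e.g. "/embed/VIDEO_ID" -> ["embed", "VIDEO_ID"]
    segments = [s for s in parsed.path.split("/") if s]
    query = parse_qs(parsed.query)

    if host == "youtu.be":
        return "video" if segments else "unknown"
    if host != "youtube.com" or not segments:
        return "unknown"

    head = segments[0]
    if head == "watch":
        return "video" if "v" in query else "unknown"
    if head in ("embed", "v"):
        return "video" if len(segments) > 1 else "unknown"
    if head == "playlist":
        return "playlist" if "list" in query else "unknown"
    if head in ("channel", "c", "user"):
        return "channel" if len(segments) > 1 else "unknown"
    if head.startswith("@") and len(head) > 1:
        return "channel"
    return "unknown"
```

A check like this is useful mainly for catching typos early; the ingestion API remains the authority on which URLs are accepted.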

How It Works

1. URL Detection and Expansion

When you submit YouTube URLs to the ingestion API:

Video URLs are processed directly
Playlist URLs are expanded to extract all video URLs in the playlist
Channel URLs are expanded to extract recent video URLs from the channel

2. Transcript Extraction

For each video URL:

The system extracts high-quality transcripts using AI-powered transcript extraction
Transcripts are processed in plain text format for optimal chunking
Video metadata (title, description) can be included via the metadata parameter

3. Content Processing

Transcripts are chunked according to your chunkConfig settings
Text chunks are embedded using your namespace's embedding model
Vectors are stored in your configured vector database for semantic search
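
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It illustrates how chunkSize and chunkOverlap interact and is not SourceSync's actual chunker. The function name chunk_tokens is hypothetical, and splitting on whitespace is an assumption standing in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=200):
    """Split a token list into overlapping windows.

    Each chunk holds up to chunk_size tokens and repeats the last
    chunk_overlap tokens of the previous chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the transcript
    return chunks


# Assumption: whitespace-separated words approximate tokens.
transcript = "welcome to the course today we cover gradient descent"
chunks = chunk_tokens(transcript.split(), chunk_size=4, chunk_overlap=1)
```

With these toy settings each chunk starts three tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.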

API Usage

Basic Example

curl -X POST https://api.sourcesync.ai/v1/ingest/youtube \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_123",
    "ingestConfig": {
      "source": "YOUTUBE",
      "config": {
        "urls": [
          "https://www.youtube.com/watch?v=DL82mGde6wo"
        ]
      }
    }
  }'

Mixed URL Types

curl -X POST https://api.sourcesync.ai/v1/ingest/youtube \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_123",
    "ingestConfig": {
      "source": "YOUTUBE",
      "config": {
        "urls": [
          "https://www.youtube.com/watch?v=DL82mGde6wo",
          "https://www.youtube.com/playlist?list=PLrAXtmRdnEQy6nuLwvJhvJKgJo8e_D5mD",
          "https://www.youtube.com/@OpenAI"
        ],
        "metadata": {
          "category": "educational",
          "topic": "AI"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 200
      }
    }
  }'
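
The same request can be assembled from application code. The sketch below mirrors the curl examples using only the Python standard library. The endpoint, headers, and field names are taken from those examples; the helper name build_ingest_request and its parameters are hypothetical. It builds the request without sending it.

```python
import json
import os
import urllib.request

# Endpoint as shown in the curl examples.
INGEST_URL = "https://api.sourcesync.ai/v1/ingest/youtube"


def build_ingest_request(namespace_id, urls, metadata=None,
                         chunk_config=None, api_key=None):
    """Build (but do not send) a YouTube ingestion request."""
    config = {"urls": list(urls)}
    if metadata:
        config["metadata"] = metadata

    ingest_config = {"source": "YOUTUBE", "config": config}
    if chunk_config:
        ingest_config["chunkConfig"] = chunk_config

    body = {"namespaceId": namespace_id, "ingestConfig": ingest_config}
    # Fall back to the environment variable used in the curl examples.
    key = api_key or os.environ.get("SOURCE_SYNC_API_KEY", "")
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and read the JSON response body.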

Configuration Options

Required Parameters

namespaceId (string, required) - The unique identifier of the namespace where content will be stored.
ingestConfig.source (string, required) - Must be set to "YOUTUBE" for YouTube ingestion.
ingestConfig.config.urls (array<string>, required) - Array of YouTube URLs (videos, playlists, or channels) to ingest.

Optional Parameters

ingestConfig.config.metadata (object, optional) - Custom metadata to associate with all ingested documents. This metadata is applied to every video document created from the provided URLs.
ingestConfig.chunkConfig (object, optional) - Configuration for text chunking. If not provided, namespace defaults are used.
  chunkSize (number, optional) - Size of each chunk in tokens (default: 1000).
  chunkOverlap (number, optional) - Number of tokens to overlap between chunks (default: 200).
syncFrequency (string, optional) - How often to re-sync the content. Options: NEVER, DAILY, WEEKLY, MONTHLY. Default is NEVER.
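
Before sending a request, it can help to check option values locally. The sketch below is a hypothetical pre-flight check, not server-side validation: the function name validate_ingest_options is invented for illustration, and the rule that chunkOverlap must be smaller than chunkSize is an assumption rather than a documented constraint. The allowed syncFrequency values come from the list above.

```python
ALLOWED_SYNC_FREQUENCIES = {"NEVER", "DAILY", "WEEKLY", "MONTHLY"}


def validate_ingest_options(urls, chunk_size=1000, chunk_overlap=200,
                            sync_frequency="NEVER"):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if not urls:
        problems.append("urls must contain at least one YouTube URL")
    if chunk_size <= 0:
        problems.append("chunkSize must be a positive number of tokens")
    if chunk_overlap < 0:
        problems.append("chunkOverlap must not be negative")
    # Assumption: an overlap as large as the chunk itself makes no progress.
    if chunk_size > 0 and chunk_overlap >= chunk_size:
        problems.append("chunkOverlap should be smaller than chunkSize")
    if sync_frequency not in ALLOWED_SYNC_FREQUENCIES:
        problems.append(
            "syncFrequency must be one of: "
            + ", ".join(sorted(ALLOWED_SYNC_FREQUENCIES))
        )
    return problems
```

Returning a list of messages instead of raising on the first problem lets you report every issue in one pass.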

Response Format

The API returns an ingestion job run ID that you can use to track the progress of your YouTube ingestion:

{
  "success": true,
  "message": "Added your YouTube ingestion request to the queue successfully",
  "data": {
    "ingestJobRunId": "job_run_123",
    "documents": []
  }
}

Use the Ingest Job Run Status API to monitor the progress of your ingestion.
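
As a minimal sketch of handling this response, the helper below pulls out the job run ID so it can be passed on to status tracking. The field names match the sample response above. The helper name extract_job_run_id is hypothetical, and treating "success": false with a "message" field as the failure shape is an assumption.

```python
import json


def extract_job_run_id(response_text: str) -> str:
    """Return ingestJobRunId from an ingestion response body."""
    payload = json.loads(response_text)
    # Assumption: failed requests set "success" to false and include a message.
    if not payload.get("success"):
        raise RuntimeError(payload.get("message", "ingestion request failed"))
    return payload["data"]["ingestJobRunId"]


sample = """
{
  "success": true,
  "message": "Added your YouTube ingestion request to the queue successfully",
  "data": {"ingestJobRunId": "job_run_123", "documents": []}
}
"""
job_run_id = extract_job_run_id(sample)
```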

Best Practices

Optimal Chunking for Video Transcripts

Video transcripts often contain natural speaking patterns. Consider these chunking strategies:

Larger chunks (1000-1500 tokens) - Better for preserving context and complete thoughts
Moderate overlap (200-300 tokens) - Ensures continuity between chunks
Topic-based metadata - Add relevant metadata to improve search relevance
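
To see how these settings trade off, you can estimate how many chunks a transcript will produce. The sketch below assumes simple fixed-size windows that advance by chunkSize minus chunkOverlap tokens; the actual chunker may split differently, so treat the numbers as rough estimates. The function name estimate_chunk_count is hypothetical.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Estimate the chunk count for fixed-size windows with overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    step = chunk_size - chunk_overlap  # new tokens covered by each extra chunk
    return max(1, math.ceil((total_tokens - chunk_overlap) / step))


# A 10,000-token transcript under two settings from the ranges above.
smaller = estimate_chunk_count(10_000, chunk_size=1000, chunk_overlap=200)
larger = estimate_chunk_count(10_000, chunk_size=1500, chunk_overlap=300)
```

Under this assumption the 1000/200 setting yields 13 chunks and the 1500/300 setting yields 9, so larger chunks mean fewer vectors to store and search, each carrying more context.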

Handling Large Playlists

For playlists with many videos:

Monitor job progress - Use the job run status API to track processing
Set appropriate timeouts - Large playlists may take time to process
Consider batch processing - Split very large playlists into smaller batches
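
Batch processing can be sketched as follows, assuming you already have the individual video URLs for the playlist. Each batch would then be submitted as its own ingestion request. The function name batch_urls and the batch size are hypothetical; this guide does not prescribe a particular batch size.

```python
def batch_urls(urls, batch_size):
    """Split a list of video URLs into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Placeholder IDs for illustration only.
video_urls = [f"https://www.youtube.com/watch?v=VIDEO_{n}" for n in range(7)]
batches = batch_urls(video_urls, batch_size=3)
```

Submitting batches separately gives each one its own job run ID, which makes it easier to see which part of a large playlist is still processing.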

Metadata Best Practices

Include relevant metadata for better search and filtering:

{
  "metadata": {
    "category": "educational",
    "topic": "machine learning",
    "speaker": "Andrew Ng",
    "course": "CS229",
    "difficulty": "intermediate"
  }
}

Limitations

Language Support: Multiple languages are supported, but transcript extraction works best with English content
Video Availability: Private or restricted videos cannot be processed
Rate Limits: API rate limits apply to prevent abuse
Content Filtering: Age-restricted or inappropriate content may be filtered out

Troubleshooting

Common Issues

Video not found: Ensure the video is publicly accessible and the URL is correct.

Transcript unavailable: Some videos may not have transcripts available. The system will skip these videos and continue processing others.

Playlist expansion failed: Verify the playlist is public and accessible.

Getting Help

If you encounter issues with YouTube ingestion:

Check the API Reference for detailed parameter information
Monitor your ingestion job status using the job run ID
Review error messages in the API response for specific guidance
Contact support if you need assistance with large-scale ingestion

Next Steps

After ingesting YouTube content:

Test search functionality - Query your ingested transcripts using the Search API
Build applications - Use the embedded transcript data in your AI applications
Monitor usage - Track your ingestion and search usage in the dashboard
Explore integrations - Connect with your preferred vector database and embedding models
