YouTube Ingestion
Ingest transcripts from YouTube videos, playlists, and channels to make video content searchable and queryable in your applications.
YouTube ingestion uses AI-powered extraction to pull a transcript from each video. Playlists and channels are automatically expanded into individual videos before processing.
Overview
SourceSync supports ingesting content from various YouTube sources:
- Individual Videos - Extract transcripts from specific YouTube videos
- Playlists - Automatically expand playlists and process each video individually
- Channels - Expand channels into their recent videos and process the transcript of each video
All YouTube content is processed to extract high-quality transcripts that can be chunked, embedded, and made available for semantic search and retrieval.
Supported URL Formats
SourceSync supports all standard YouTube URL formats:
Video URLs
https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
https://www.youtube.com/v/VIDEO_ID
Playlist URLs
https://www.youtube.com/playlist?list=PLAYLIST_ID
Channel URLs
https://www.youtube.com/channel/CHANNEL_ID
https://www.youtube.com/c/CHANNEL_NAME
https://www.youtube.com/user/USER_NAME
https://www.youtube.com/@HANDLE
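The URL shapes listed above can be told apart on the client side before submitting them, for example to log how many videos, playlists, and channels a request contains. The following is a minimal sketch using only the Python standard library. The function name `classify_youtube_url` is hypothetical, it only recognizes the hosts and paths listed in this section, and it is not a description of how SourceSync itself detects URL types.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlparse


def classify_youtube_url(url: str) -> Tuple[str, str]:
    """Return (kind, identifier), where kind is 'video', 'playlist', or 'channel'.

    Hypothetical helper: it covers only the URL formats listed in this section.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    query = parse_qs(parsed.query)

    # Short links: https://youtu.be/VIDEO_ID
    if host == "youtu.be" and len(segments) == 1:
        return "video", segments[0]
    if host != "www.youtube.com":
        raise ValueError(f"Not a recognized YouTube URL: {url}")

    # https://www.youtube.com/watch?v=VIDEO_ID
    if segments == ["watch"] and "v" in query:
        return "video", query["v"][0]
    # https://www.youtube.com/embed/VIDEO_ID and /v/VIDEO_ID
    if len(segments) == 2 and segments[0] in ("embed", "v"):
        return "video", segments[1]
    # https://www.youtube.com/playlist?list=PLAYLIST_ID
    if segments == ["playlist"] and "list" in query:
        return "playlist", query["list"][0]
    # https://www.youtube.com/channel/ID, /c/NAME, /user/NAME
    if len(segments) == 2 and segments[0] in ("channel", "c", "user"):
        return "channel", segments[1]
    # https://www.youtube.com/@HANDLE
    if len(segments) == 1 and segments[0].startswith("@"):
        return "channel", segments[0]

    raise ValueError(f"Unsupported YouTube URL format: {url}")
```

Rejecting unrecognized URLs with a `ValueError` before calling the API is a design choice of this sketch; the service may accept or reject such URLs differently.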
How It Works
1. URL Detection and Expansion
When you submit YouTube URLs to the ingestion API:
- Video URLs are processed directly
- Playlist URLs are expanded to extract all video URLs in the playlist
- Channel URLs are expanded to extract recent video URLs from the channel
2. Transcript Extraction
For each video URL:
- The system extracts a high-quality transcript using AI-powered extraction
- Transcripts are processed in plain text format for optimal chunking
- Video metadata (title, description) can be included via the metadata parameter
3. Content Processing
- Transcripts are chunked according to your chunkConfig settings
- Text chunks are embedded using your namespace's embedding model
- Vectors are stored in your configured vector database for semantic search
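To make the effect of chunkSize and chunkOverlap concrete, here is a minimal sliding-window sketch in Python. It is an illustration only: the tokenizer and chunking algorithm SourceSync actually uses are not specified here, so the function `chunk_tokens` is hypothetical and the usage example approximates tokens by splitting on whitespace.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1000,
                 chunk_overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens.

    Hypothetical illustration; not the service's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the transcript
    return chunks


# Assumption: whitespace-separated words stand in for real tokens.
transcript = "welcome to the course today we cover gradient descent basics"
word_chunks = chunk_tokens(transcript.split(), chunk_size=4, chunk_overlap=1)
```

With a chunk size of 4 and an overlap of 1, each chunk starts with the last word of the previous one, which is the continuity that overlap is meant to provide.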
API Usage
Basic Example
curl -X POST https://api.sourcesync.ai/v1/ingest/youtube \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_123",
    "ingestConfig": {
      "source": "YOUTUBE",
      "config": {
        "urls": [
          "https://www.youtube.com/watch?v=DL82mGde6wo"
        ]
      }
    }
  }'
Mixed URL Types
curl -X POST https://api.sourcesync.ai/v1/ingest/youtube \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_123",
    "ingestConfig": {
      "source": "YOUTUBE",
      "config": {
        "urls": [
          "https://www.youtube.com/watch?v=DL82mGde6wo",
          "https://www.youtube.com/playlist?list=PLrAXtmRdnEQy6nuLwvJhvJKgJo8e_D5mD",
          "https://www.youtube.com/@OpenAI"
        ],
        "metadata": {
          "category": "educational",
          "topic": "AI"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 200
      }
    }
  }'
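The same request can be issued from application code. The sketch below mirrors the curl examples using only the Python standard library. The endpoint, headers, and JSON field names come from those examples; the helper names (`build_ingest_payload`, `build_request`, `submit`) are hypothetical and are not part of any official SourceSync SDK.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

INGEST_URL = "https://api.sourcesync.ai/v1/ingest/youtube"  # taken from the curl examples


def build_ingest_payload(namespace_id: str, urls: List[str],
                         metadata: Optional[Dict[str, Any]] = None,
                         chunk_size: Optional[int] = None,
                         chunk_overlap: Optional[int] = None) -> Dict[str, Any]:
    """Assemble the JSON body in the shape used by the curl examples."""
    config: Dict[str, Any] = {"urls": list(urls)}
    if metadata:
        config["metadata"] = dict(metadata)
    ingest_config: Dict[str, Any] = {"source": "YOUTUBE", "config": config}

    chunk_config: Dict[str, int] = {}
    if chunk_size is not None:
        chunk_config["chunkSize"] = chunk_size
    if chunk_overlap is not None:
        chunk_config["chunkOverlap"] = chunk_overlap
    if chunk_config:  # omit chunkConfig entirely when nothing is overridden
        ingest_config["chunkConfig"] = chunk_config

    return {"namespaceId": namespace_id, "ingestConfig": ingest_config}


def build_request(payload: Dict[str, Any], api_key: str) -> urllib.request.Request:
    """Create the POST request without sending it."""
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(payload: Dict[str, Any], api_key: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(payload, api_key), timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `submit(build_ingest_payload("ns_123", ["https://www.youtube.com/watch?v=DL82mGde6wo"]), os.environ["SOURCE_SYNC_API_KEY"])` would then correspond to the basic example, assuming `os` is imported and the environment variable is set.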
Configuration Options
Required Parameters
- Name: namespaceId
  Type: string (required)
  Description: The unique identifier of the namespace where content will be stored.
- Name: ingestConfig.source
  Type: string (required)
  Description: Must be set to "YOUTUBE" for YouTube ingestion.
- Name: ingestConfig.config.urls
  Type: array<string> (required)
  Description: Array of YouTube URLs (videos, playlists, or channels) to ingest.
Optional Parameters
- Name: ingestConfig.config.metadata
  Type: object (optional)
  Description: Custom metadata to associate with all ingested documents. This metadata will be applied to every video document created from the provided URLs.
- Name: ingestConfig.chunkConfig
  Type: object (optional)
  Description: Configuration for text chunking. If not provided, namespace defaults are used.
- Name: chunkSize
  Type: number (optional)
  Description: Size of each chunk in tokens (default: 1000).
- Name: chunkOverlap
  Type: number (optional)
  Description: Number of tokens to overlap between chunks (default: 200).
- Name: syncFrequency
  Type: string (optional)
  Description: How often to re-sync the content. Options: NEVER, DAILY, WEEKLY, MONTHLY. Default is NEVER.
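Checking a request body locally before sending it can catch missing required fields early. The sketch below encodes the parameters described above as a small validator. The function `validate_ingest_request` is hypothetical; the rule that chunkOverlap must be smaller than chunkSize is an assumption rather than a documented constraint; and because this page does not state where syncFrequency sits in the request body, the sketch takes it as a separate argument.

```python
from typing import Any, Dict, List, Optional

SYNC_FREQUENCIES = {"NEVER", "DAILY", "WEEKLY", "MONTHLY"}  # options listed above


def validate_ingest_request(body: Dict[str, Any],
                            sync_frequency: Optional[str] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors: List[str] = []

    namespace_id = body.get("namespaceId")
    if not isinstance(namespace_id, str) or not namespace_id:
        errors.append("namespaceId must be a non-empty string")

    ingest = body.get("ingestConfig") or {}
    if ingest.get("source") != "YOUTUBE":
        errors.append('ingestConfig.source must be "YOUTUBE"')

    urls = (ingest.get("config") or {}).get("urls")
    if not isinstance(urls, list) or not urls or not all(isinstance(u, str) for u in urls):
        errors.append("ingestConfig.config.urls must be a non-empty array of strings")

    chunk = ingest.get("chunkConfig") or {}
    size = chunk.get("chunkSize", 1000)       # documented default
    overlap = chunk.get("chunkOverlap", 200)  # documented default
    if not isinstance(size, int) or size <= 0:
        errors.append("chunkSize must be a positive number")
    elif not isinstance(overlap, int) or overlap < 0 or overlap >= size:
        # Assumption: an overlap as large as the chunk itself is not meaningful.
        errors.append("chunkOverlap must be >= 0 and smaller than chunkSize")

    if sync_frequency is not None and sync_frequency not in SYNC_FREQUENCIES:
        errors.append("syncFrequency must be one of NEVER, DAILY, WEEKLY, MONTHLY")

    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in one pass.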
Response Format
The API returns an ingestion job run ID that you can use to track the progress of your YouTube ingestion:
{
  "success": true,
  "message": "Added your YouTube ingestion request to the queue successfully",
  "data": {
    "ingestJobRunId": "job_run_123",
    "documents": []
  }
}
Use the Ingest Job Run Status API to monitor the progress of your ingestion.
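A client typically needs only the job run ID from this response. The following sketch pulls it out of the body shown above. The helper `extract_job_run_id` and the `IngestionError` class are hypothetical, and the assumption that a failed request returns "success": false together with a "message" field is not confirmed by this page.

```python
import json
from typing import Any, Dict, Union


class IngestionError(RuntimeError):
    """Raised when the response does not describe a queued ingestion job."""


def extract_job_run_id(response_body: Union[str, bytes, Dict[str, Any]]) -> str:
    """Return data.ingestJobRunId from a raw or already-decoded response body."""
    if isinstance(response_body, (str, bytes)):
        body = json.loads(response_body)
    else:
        body = response_body

    # Assumption: failures are signalled with "success": false and a "message".
    if not body.get("success"):
        raise IngestionError(body.get("message", "ingestion request was not accepted"))

    job_run_id = (body.get("data") or {}).get("ingestJobRunId")
    if not job_run_id:
        raise IngestionError("response did not include data.ingestJobRunId")
    return job_run_id
```

The returned ID is the value to pass to the Ingest Job Run Status API when monitoring progress.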
Best Practices
Optimal Chunking for Video Transcripts
Video transcripts often contain natural speaking patterns. Consider these chunking strategies:
- Larger chunks (1000-1500 tokens) - Better for preserving context and complete thoughts
- Moderate overlap (200-300 tokens) - Ensures continuity between chunks
- Topic-based metadata - Add relevant metadata to improve search relevance
Handling Large Playlists
For playlists with many videos:
- Monitor job progress - Use the job run status API to track processing
- Set appropriate timeouts - Large playlists may take time to process
- Consider batch processing - Split very large playlists into smaller batches
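One way to apply the batching suggestion is to submit individual video URLs in fixed-size groups, one ingestion request per group. The sketch below shows only the splitting step. The function name `batch_urls` and the default batch size of 25 are arbitrary choices for illustration, not limits documented by SourceSync.

```python
from typing import List, Sequence


def batch_urls(urls: Sequence[str], batch_size: int = 25) -> List[List[str]]:
    """Split a list of video URLs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(urls[start:start + batch_size])
            for start in range(0, len(urls), batch_size)]
```

Each batch can then be sent as the urls array of a separate request, which yields one job run ID per batch to monitor.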
Metadata Best Practices
Include relevant metadata for better search and filtering:
{
  "metadata": {
    "category": "educational",
    "topic": "machine learning",
    "speaker": "Andrew Ng",
    "course": "CS229",
    "difficulty": "intermediate"
  }
}
Limitations
- Language Support: Transcript extraction supports multiple languages but works best with English content
- Video Availability: Private or restricted videos cannot be processed
- Rate Limits: API rate limits apply to prevent abuse
- Content Filtering: Age-restricted or inappropriate content may be filtered out
Troubleshooting
Common Issues
Video not found: Ensure the video is publicly accessible and the URL is correct.
Transcript unavailable: Some videos may not have transcripts available. The system will skip these videos and continue processing others.
Playlist expansion failed: Verify the playlist is public and accessible.
Getting Help
If you encounter issues with YouTube ingestion:
- Check the API Reference for detailed parameter information
- Monitor your ingestion job status using the job run ID
- Review error messages in the API response for specific guidance
- Contact support if you need assistance with large-scale ingestion
Next Steps
After ingesting YouTube content:
- Test search functionality - Query your ingested transcripts using the Search API
- Build applications - Use the embedded transcript data in your AI applications
- Monitor usage - Track your ingestion and search usage in the dashboard
- Explore integrations - Connect with your preferred vector database and embedding models