YouTube Ingestion

Ingest transcripts from YouTube videos, playlists, and channels to make video content searchable and queryable in your applications.


Overview

SourceSync can ingest content from three types of YouTube source:

  • Individual Videos - Extract transcripts from specific YouTube videos
  • Playlists - Automatically expand playlists and process each video individually
  • Channels - Extract videos from channels and process transcripts for each video

All YouTube content is processed to extract high-quality transcripts that can be chunked, embedded, and made available for semantic search and retrieval.


Supported URL Formats

SourceSync supports all standard YouTube URL formats:

Video URLs

https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
https://www.youtube.com/v/VIDEO_ID

Playlist URLs

https://www.youtube.com/playlist?list=PLAYLIST_ID

Channel URLs

https://www.youtube.com/channel/CHANNEL_ID
https://www.youtube.com/c/CHANNEL_NAME
https://www.youtube.com/user/USER_NAME
https://www.youtube.com/@HANDLE
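
To check a URL list before submitting it, the formats above can be matched on the client side. The following is a minimal sketch using only the Python standard library. The helper name classify_youtube_url is hypothetical and not part of SourceSync, which performs its own URL detection on the server.

```python
from urllib.parse import urlparse, parse_qs

# Path prefixes taken from the channel URL formats listed above.
CHANNEL_PREFIXES = ("/channel/", "/c/", "/user/", "/@")


def classify_youtube_url(url: str) -> str:
    """Return 'video', 'playlist', 'channel', or 'unknown'.

    Hypothetical client-side helper; it only mirrors the URL formats
    documented above and is not SourceSync's own detection logic.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parsed.path
    query = parse_qs(parsed.query)

    if host == "youtu.be":
        # Short links carry the video ID as the path: youtu.be/VIDEO_ID
        return "video" if len(path) > 1 else "unknown"
    if host != "youtube.com":
        return "unknown"
    if path == "/watch" and "v" in query:
        return "video"
    if path.startswith(("/embed/", "/v/")):
        return "video"
    if path == "/playlist" and "list" in query:
        return "playlist"
    if path.startswith(CHANNEL_PREFIXES):
        return "channel"
    return "unknown"
```

Filtering out URLs that come back as 'unknown' before calling the API avoids queuing a job that contains entries the formats above do not cover.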

How It Works

1. URL Detection and Expansion

When you submit YouTube URLs to the ingestion API:

  1. Video URLs are processed directly
  2. Playlist URLs are expanded to extract all video URLs in the playlist
  3. Channel URLs are expanded to extract recent video URLs from the channel

2. Transcript Extraction

For each video URL:

  1. The system extracts a high-quality transcript using AI-powered extraction
  2. The transcript is converted to plain text so that it chunks cleanly
  3. Video metadata (title, description) can be included via the metadata parameter

3. Content Processing

  1. Transcripts are chunked according to your chunkConfig settings
  2. Text chunks are embedded using your namespace's embedding model
  3. Vectors are stored in your configured vector database for semantic search
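
To make the effect of chunkSize and chunkOverlap concrete, here is a minimal sketch of sliding-window chunking. It is an illustration only: it assumes whitespace-separated words as a stand-in for tokens, and the function name chunk_tokens is hypothetical. SourceSync's actual tokenizer and chunking logic are not described here and may differ.

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=200):
    """Split a token list into overlapping windows.

    Each window holds up to chunk_size tokens and starts
    (chunk_size - chunk_overlap) tokens after the previous one,
    so consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # This window already reaches the end of the transcript.
            break
    return chunks


# Assumption: words approximate tokens for the purpose of this example.
transcript_tokens = ("word " * 2500).split()
chunks = chunk_tokens(transcript_tokens, chunk_size=1000, chunk_overlap=200)
```

With a 2,500-token transcript, a chunk size of 1000, and an overlap of 200, each window starts 800 tokens after the previous one, which yields three chunks of 1000, 1000, and 900 tokens.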

API Usage

Basic Example

curl -X POST https://api.sourcesync.ai/v1/ingest/youtube \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_123",
    "ingestConfig": {
      "source": "YOUTUBE",
      "config": {
        "urls": [
          "https://www.youtube.com/watch?v=DL82mGde6wo"
        ]
      }
    }
  }'

Mixed URL Types

curl -X POST https://api.sourcesync.ai/v1/ingest/youtube \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_123",
    "ingestConfig": {
      "source": "YOUTUBE",
      "config": {
        "urls": [
          "https://www.youtube.com/watch?v=DL82mGde6wo",
          "https://www.youtube.com/playlist?list=PLrAXtmRdnEQy6nuLwvJhvJKgJo8e_D5mD",
          "https://www.youtube.com/@OpenAI"
        ],
        "metadata": {
          "category": "educational",
          "topic": "AI"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 200
      }
    }
  }'
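
The same request can be assembled from application code. The sketch below builds it with Python's standard library and mirrors the endpoint, headers, and body fields of the curl examples above. The function name build_ingest_request is hypothetical and this is not an official SourceSync client. The block only constructs the request and does not send it.

```python
import json
import urllib.request

# Endpoint taken from the curl examples above.
API_URL = "https://api.sourcesync.ai/v1/ingest/youtube"


def build_ingest_request(namespace_id, urls, api_key,
                         metadata=None, chunk_config=None):
    """Build (but do not send) a YouTube ingestion request.

    Hypothetical helper; field names follow the curl examples.
    """
    config = {"urls": list(urls)}
    if metadata:
        config["metadata"] = metadata

    ingest_config = {"source": "YOUTUBE", "config": config}
    if chunk_config:
        # Omitted when not given, so namespace defaults apply.
        ingest_config["chunkConfig"] = chunk_config

    body = {"namespaceId": namespace_id, "ingestConfig": ingest_config}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_ingest_request(
    "ns_123",
    ["https://www.youtube.com/watch?v=DL82mGde6wo"],
    api_key="YOUR_API_KEY",  # placeholder; read the real key from your environment
    metadata={"category": "educational", "topic": "AI"},
    chunk_config={"chunkSize": 1000, "chunkOverlap": 200},
)
```

To send the request, pass it to urllib.request.urlopen and decode the JSON response body. Reading the key from the SOURCE_SYNC_API_KEY environment variable, as the curl examples do, keeps it out of source code.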

Configuration Options

Required Parameters

  • Name
    namespaceId
    Type
    string(required)
    Description

    The unique identifier of your namespace where content will be stored.

  • Name
    ingestConfig.source
    Type
    string(required)
    Description

    Must be set to "YOUTUBE" for YouTube ingestion.

  • Name
    ingestConfig.config.urls
    Type
    array<string>(required)
    Description

    Array of YouTube URLs (videos, playlists, or channels) to ingest.

Optional Parameters

  • Name
    ingestConfig.config.metadata
    Type
    object(optional)
    Description

    Custom metadata to associate with all ingested documents. This metadata will be applied to every video document created from the provided URLs.

  • Name
    ingestConfig.chunkConfig
    Type
    object(optional)
    Description

    Configuration for text chunking. If not provided, namespace defaults are used.

    • Name
      chunkSize
      Type
      number(optional)
      Description

      Size of each chunk in tokens (default: 1000).

    • Name
      chunkOverlap
      Type
      number(optional)
      Description

      Number of tokens to overlap between chunks (default: 200).

  • Name
    syncFrequency
    Type
    string(optional)
    Description

    How often to re-sync the content. Options: NEVER, DAILY, WEEKLY, MONTHLY. Default is NEVER.


Response Format

The API returns an ingestion job run ID that you can use to track the progress of your YouTube ingestion:

{
  "success": true,
  "message": "Added your YouTube ingestion request to the queue successfully",
  "data": {
    "ingestJobRunId": "job_run_123",
    "documents": []
  }
}

Use the Ingest Job Run Status API to monitor the progress of your ingestion.
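
A minimal sketch of pulling the job run ID out of the response shown above follows. The function name extract_job_run_id is hypothetical. The failure branch assumes that an unsuccessful response sets "success" to false and carries a "message" field, which the example above does not confirm.

```python
import json


def extract_job_run_id(response_text):
    """Return data.ingestJobRunId from an ingestion response body.

    Assumption: failed requests set "success" to false and include
    a "message"; only the success shape is documented above.
    """
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError(payload.get("message", "ingestion request failed"))
    return payload["data"]["ingestJobRunId"]


# The sample response from this section.
sample_response = """
{
  "success": true,
  "message": "Added your YouTube ingestion request to the queue successfully",
  "data": {
    "ingestJobRunId": "job_run_123",
    "documents": []
  }
}
"""
job_run_id = extract_job_run_id(sample_response)
```

Store the returned ID alongside the URLs you submitted so that each status check can be traced back to the request that produced it.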


Best Practices

Optimal Chunking for Video Transcripts

Video transcripts follow natural speech, so a single idea often runs across several sentences without clear paragraph breaks. Consider these chunking strategies:

  • Larger chunks (1000-1500 tokens) - Better for preserving context and complete thoughts
  • Moderate overlap (200-300 tokens) - Ensures continuity between chunks
  • Topic-based metadata - Add relevant metadata to improve search relevance

Handling Large Playlists

For playlists with many videos:

  • Monitor job progress - Use the job run status API to track processing
  • Set appropriate timeouts - Large playlists may take time to process
  • Consider batch processing - Split very large playlists into smaller batches
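
One way to apply the batching advice is to submit the video URLs in fixed-size groups, with one ingestion request per group. The sketch below is a hypothetical helper. The batch size of 25 is an arbitrary illustration, not a documented SourceSync limit.

```python
def batch_urls(urls, batch_size=25):
    """Split a list of video URLs into consecutive batches.

    batch_size=25 is an arbitrary example value, not a SourceSync limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# VIDEO_0 ... VIDEO_59 are placeholder IDs for illustration.
video_urls = [f"https://www.youtube.com/watch?v=VIDEO_{n}" for n in range(60)]
batches = batch_urls(video_urls, batch_size=25)
```

Each batch then becomes the urls array of its own request, and each request returns its own job run ID to monitor.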

Metadata Best Practices

Include relevant metadata for better search and filtering:

{
  "metadata": {
    "category": "educational",
    "topic": "machine learning",
    "speaker": "Andrew Ng",
    "course": "CS229",
    "difficulty": "intermediate"
  }
}

Limitations

  • Language Support: Transcript extraction supports multiple languages but works best with English content
  • Video Availability: Private or restricted videos cannot be processed
  • Rate Limits: API rate limits apply to prevent abuse
  • Content Filtering: Age-restricted or inappropriate content may be filtered out

Troubleshooting

Common Issues

Getting Help

If you encounter issues with YouTube ingestion:

  1. Check the API Reference for detailed parameter information
  2. Monitor your ingestion job status using the job run ID
  3. Review error messages in the API response for specific guidance
  4. Contact support if you need assistance with large-scale ingestion

Next Steps

After ingesting YouTube content:

  • Test search functionality - Query your ingested transcripts using the Search API
  • Build applications - Use the embedded transcript data in your AI applications
  • Monitor usage - Track your ingestion and search usage in the dashboard
  • Explore integrations - Connect with your preferred vector database and embedding models