Web Scraping
Ingest content from websites into your RAG application using our web scraping capabilities.
Overview
Web scraping in SourceSync allows you to:
- Ingest content from multiple URLs
- Process entire sitemaps
- Crawl websites with configurable rules
- Extract content with specific selectors
Getting Started
To use web scraping in SourceSync, you need to configure a web scraper provider. We support three options:
- Firecrawl: advanced web scraping provider; free tier available for testing
- Jina: free web scraping alternative; simple to get started
- ScrapingBee: reliable web scraping with JavaScript rendering; free tier available for testing
Choose your preferred provider and update your namespace configuration with the appropriate API key:
# The "provider" field can be:
# - "FIRECRAWL"
# - "JINA"
# - "SCRAPINGBEE"
curl -X PATCH https://api.sourcesync.ai/v1/namespaces/ns_123 \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "webScraperConfig": {
      "provider": "FIRECRAWL",
      "apiKey": "new-api-key"
    }
  }'
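To confirm the change, you can read the namespace back and inspect its webScraperConfig. The request below is a sketch that assumes the namespace resource also supports GET; provider API keys are typically masked or omitted in responses.
# Sketch: read the namespace back to confirm the configured web scraper provider
# (assumes GET is supported on the namespace resource).
curl https://api.sourcesync.ai/v1/namespaces/ns_123 \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json"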
Ingestion Methods
SourceSync provides three methods for web content ingestion:
1. URL List Ingestion
Ingest content from a list of specific URLs:
curl -X POST https://api.sourcesync.ai/v1/ingest/urls \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "website",
          "category": "docs"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
2. Sitemap Processing
Ingest all URLs from a sitemap:
curl -X POST https://api.sourcesync.ai/v1/ingest/sitemap \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "maxLinks": 1000,
        "includePaths": ["/docs"],
        "excludePaths": ["/docs/internal"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
3. Website Crawling
Crawl a website with custom rules:
curl -X POST https://api.sourcesync.ai/v1/ingest/website \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog/posts"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
Configuration Options
Scraping Options
Control which parts of the HTML to extract:
- includeSelectors: CSS selectors for content to include, e.g. { "includeSelectors": ["article", "main", ".content"] }
- excludeSelectors: CSS selectors for content to exclude, e.g. { "excludeSelectors": [".ads", ".navigation", ".footer"] }
Sitemap Processing Options
Configure how the sitemap is processed:
- maxLinks: Maximum number of URLs to process
- includePaths: URL paths to include (e.g., ["/docs", "/blog"])
- excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
Website Crawling Options
Configure how the website is crawled:
- maxDepth: How many links deep to crawl (1-10)
- maxLinks: Maximum number of URLs to process
- includePaths: URL paths to include (e.g., ["/docs", "/blog"])
- excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
Chunking Options
Control how content is split:
{
  "chunkConfig": {
    "chunkSize": 1000,
    "chunkOverlap": 100
  }
}
Tracking Progress
All ingestion methods are asynchronous and return an ingestJobRunId:
{
  "success": true,
  "message": "Added your urls ingestion request to the queue",
  "data": {
    "ingestJobRunId": "ijr_abc123"
  }
}
Check ingestion status:
curl "https://api.sourcesync.ai/v1/ingest-job-runs/ijr_abc123?namespaceId=your-namespace-identifier" \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY"
Response:
{
  "success": true,
  "message": "Fetched ingest job run details successfully",
  "data": {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
      "queued": [...],
      "processing": [...],
      "completed": [...],
      "failed": [...]
    }
  }
}
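Since ingestion runs asynchronously, a simple polling loop is usually enough to wait for a job to finish. The sketch below assumes jq is installed; it also assumes QUEUED, COMPLETED, and FAILED status values, since only PROCESSING appears in the response above.
# Minimal polling sketch: check the job status every 10 seconds until it leaves the
# QUEUED/PROCESSING states. Assumes jq is installed; QUEUED, COMPLETED, and FAILED
# are assumed status values (only PROCESSING is shown in the response above).
JOB_ID="ijr_abc123"
NAMESPACE_ID="your-namespace-identifier"

while true; do
  STATUS=$(curl -s "https://api.sourcesync.ai/v1/ingest-job-runs/$JOB_ID?namespaceId=$NAMESPACE_ID" \
    -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" | jq -r '.data.status')
  echo "Job $JOB_ID status: $STATUS"
  if [ "$STATUS" != "QUEUED" ] && [ "$STATUS" != "PROCESSING" ]; then
    break
  fi
  sleep 10
done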
Best Practices
Content Selection
- Use specific CSS selectors to target main content
- Exclude navigation, footers, and ads
- Test selectors on sample pages first
URL Management
- Start with small URL lists (see the batching sketch after this list)
- Use path patterns to focus crawling
- Set reasonable maxLinks limits
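If you have a long list of URLs, one way to keep each run small is to split the list into batches and submit them separately. The sketch below reuses the /v1/ingest/urls endpoint shown earlier; the urls.txt file name and batch size are illustrative, scrapeOptions and chunkConfig are omitted for brevity, and jq is assumed to be installed.
# Sketch: split urls.txt (one URL per line) into batches of 20 and ingest each batch separately.
# Reuses the /v1/ingest/urls endpoint shown above; file name and batch size are illustrative.
split -l 20 urls.txt batch_

for f in batch_*; do
  # Build the request body from the batch file, dropping any empty lines.
  PAYLOAD=$(jq -n --rawfile lines "$f" '{
    namespaceId: "your-namespace-identifier",
    ingestConfig: {
      source: "URLS_LIST",
      config: { urls: ($lines | split("\n") | map(select(length > 0))) }
    }
  }')
  curl -s -X POST https://api.sourcesync.ai/v1/ingest/urls \
    -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
done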
Resource Usage
- Monitor ingestion job status
- Use appropriate chunk sizes
- Consider rate limits and quotas