Web Scraping
Ingest content from websites into your RAG application using our web scraping capabilities.
Overview
Web scraping in SourceSync allows you to:
- Ingest content from multiple URLs
- Process entire sitemaps
- Crawl websites with configurable rules
- Extract content with specific selectors
Getting Started
To use web scraping in SourceSync, you need to configure a web scraper provider. We support three options:
- Firecrawl: advanced web scraping provider; free tier available for testing
- Jina: free web scraping alternative; simple to get started
- ScrapingBee: reliable web scraping with JavaScript rendering; free tier available for testing
Choose your preferred provider and update your namespace configuration with the appropriate API key:
# The "provider" field can be:
# - "FIRECRAWL"
# - "JINA"
# - "SCRAPINGBEE"
curl -X PATCH https://api.sourcesync.ai/v1/namespaces/ns_123 \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "webScraperConfig": {
      "provider": "FIRECRAWL",
      "apiKey": "new-api-key"
    }
  }'
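To confirm the change, you can read the namespace back and inspect its webScraperConfig. The request below is a sketch that assumes the namespace resource also supports GET; provider API keys are typically masked or omitted in responses.
# Sketch: read the namespace back to confirm the configured web scraper provider
# (assumes GET is supported on the namespace resource).
curl https://api.sourcesync.ai/v1/namespaces/ns_123 \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json"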
Ingestion Methods
SourceSync provides three methods for web content ingestion:
1. URL List Ingestion
Ingest content from a list of specific URLs:
curl -X POST https://api.sourcesync.ai/v1/ingest/urls \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        },
        "metadata": {
          "source": "website",
          "category": "docs"
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
2. Sitemap Processing
Ingest all URLs from a sitemap:
curl -X POST https://api.sourcesync.ai/v1/ingest/sitemap \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "maxLinks": 1000,
        "includePaths": ["/docs"],
        "excludePaths": ["/docs/internal"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
3. Website Crawling
Crawl a website with custom rules:
curl -X POST https://api.sourcesync.ai/v1/ingest/website \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "your-namespace-identifier",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog/posts"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      },
      "chunkConfig": {
        "chunkSize": 1000,
        "chunkOverlap": 100
      }
    }
  }'
Configuration Options
Scraping Options
Control which parts of the HTML to extract:
- includeSelectors: CSS selectors for content to include, e.g. { "includeSelectors": ["article", "main", ".content"] }
- excludeSelectors: CSS selectors for content to exclude, e.g. { "excludeSelectors": [".ads", ".navigation", ".footer"] }
Sitemap Processing Options
Configure how the sitemap is processed:
- maxLinks: Maximum number of URLs to process
- includePaths: URL paths to include (e.g., ["/docs", "/blog"])
- excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
Website Crawling Options
Configure how the website is crawled:
- maxDepth: How many links deep to crawl (1-10)
- maxLinks: Maximum number of URLs to process
- includePaths: URL paths to include (e.g., ["/docs", "/blog"])
- excludePaths: URL paths to exclude (e.g., ["/admin", "/private"])
Chunking Options
Control how content is split:
{
  "chunkConfig": {
    "chunkSize": 1000,
    "chunkOverlap": 100
  }
}
Tracking Progress
All ingestion methods are asynchronous and return an ingestJobRunId:
{
  "success": true,
  "message": "Added your urls ingestion request to the queue",
  "data": {
    "ingestJobRunId": "ijr_abc123"
  }
}
Check ingestion status:
curl "https://api.sourcesync.ai/v1/ingest-job-runs/ijr_abc123?namespaceId=your-namespace-identifier" \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY"
Response:
{
  "success": true,
  "message": "Fetched ingest job run details successfully",
  "data": {
    "id": "ijr_abc123",
    "status": "PROCESSING",
    "documents": {
      "queued": [...],
      "processing": [...],
      "completed": [...],
      "failed": [...]
    }
  }
}
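Since ingestion runs asynchronously, a simple polling loop is usually enough to wait for a job to finish. The sketch below assumes jq is installed; it also assumes QUEUED, COMPLETED, and FAILED status values, since only PROCESSING appears in the response above.
# Minimal polling sketch: check the job status every 10 seconds until it leaves the
# QUEUED/PROCESSING states. Assumes jq is installed; QUEUED, COMPLETED, and FAILED
# are assumed status values (only PROCESSING is shown in the response above).
JOB_ID="ijr_abc123"
NAMESPACE_ID="your-namespace-identifier"

while true; do
  STATUS=$(curl -s "https://api.sourcesync.ai/v1/ingest-job-runs/$JOB_ID?namespaceId=$NAMESPACE_ID" \
    -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" | jq -r '.data.status')
  echo "Job $JOB_ID status: $STATUS"
  if [ "$STATUS" != "QUEUED" ] && [ "$STATUS" != "PROCESSING" ]; then
    break
  fi
  sleep 10
done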
Best Practices
Content Selection
- Use specific CSS selectors to target main content
- Exclude navigation, footers, and ads
- Test selectors on sample pages first
URL Management
- Start with small URL lists (see the batching sketch after this list)
- Use path patterns to focus crawling
- Set reasonable maxLinks limits
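If you have a long list of URLs, one way to keep each run small is to split the list into batches and submit them separately. The sketch below reuses the /v1/ingest/urls endpoint shown earlier; the urls.txt file name and batch size are illustrative, scrapeOptions and chunkConfig are omitted for brevity, and jq is assumed to be installed.
# Sketch: split urls.txt (one URL per line) into batches of 20 and ingest each batch separately.
# Reuses the /v1/ingest/urls endpoint shown above; file name and batch size are illustrative.
split -l 20 urls.txt batch_

for f in batch_*; do
  # Build the request body from the batch file, dropping any empty lines.
  PAYLOAD=$(jq -n --rawfile lines "$f" '{
    namespaceId: "your-namespace-identifier",
    ingestConfig: {
      source: "URLS_LIST",
      config: { urls: ($lines | split("\n") | map(select(length > 0))) }
    }
  }')
  curl -s -X POST https://api.sourcesync.ai/v1/ingest/urls \
    -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
done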
Resource Usage
- Monitor ingestion job status
- Use appropriate chunk sizes
- Consider rate limits and quotas