Changelog

Stay up to date with all the latest changes and improvements to SourceSync.

January 23, 2025

Enhanced document management with improved metadata handling and resync capabilities.

🚀 Features

Added array metadata support:
- Store arrays of values in document metadata
- Flexible array operations:
  - Replace arrays using $set
  - Add values using $append
  - Remove values using $remove
- Smart array search:
  - OR condition within array values
  - AND condition between different metadata keys
  - Automatic deduplication of values
Enhanced document model:
- Added name field for better document identification
- Added mimeType tracking in document properties
- Added document source statistics in fetch response
Added document resync support:
- New endpoint to trigger document resyncs
- Automatic reprocessing of existing documents
- Support for all document types except text and local files

🔄 API Changes

Added /v1/documents/resync endpoint (docs)

Enhanced /v1/documents response with source statistics:

{
  "stats": [
    {
      "source": "TEXT",
      "totalCount": 50
    },
    {
      "source": "LOCAL_FILE",
      "totalCount": 25
    },
    {
      "source": "URLS_LIST",
      "totalCount": 100
    },
    {
      "source": "SITEMAP",
      "totalCount": 75
    }
  ]
}
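
As a small illustration of consuming this response, the sketch below indexes the `stats` array by source and sums the counts. It uses the sample payload above verbatim; the variable names are illustrative, and it assumes `stats` sits at the top level of the response exactly as shown.

```python
# Sample payload copied from the changelog entry above.
response = {
    "stats": [
        {"source": "TEXT", "totalCount": 50},
        {"source": "LOCAL_FILE", "totalCount": 25},
        {"source": "URLS_LIST", "totalCount": 100},
        {"source": "SITEMAP", "totalCount": 75},
    ]
}

# Index the per-source counts by source name for quick lookup.
counts_by_source = {entry["source"]: entry["totalCount"] for entry in response["stats"]}

# Total number of documents across all sources.
total_documents = sum(counts_by_source.values())

print(counts_by_source)
print(total_documents)
```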

Updated metadata operations in document updates:

{
  "$metadata": {
    "$set": { "tags": ["important", "urgent"] },
    "$append": { "categories": ["new-category"] },
    "$remove": { "labels": ["outdated"] }
  }
}
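
The three operators can be read as plain list operations. The sketch below is a local model of the behavior described in this release, not SourceSync's implementation. The function name `apply_metadata_ops`, the order in which the operators run, and the order-preserving deduplication are all assumptions made for illustration.

```python
from typing import Any


def apply_metadata_ops(
    metadata: dict[str, list[Any]],
    ops: dict[str, dict[str, list[Any]]],
) -> dict[str, list[Any]]:
    """Return a new metadata dict with $set, $append, and $remove applied.

    Assumption: operators run in the order $set, $append, $remove, and
    values are deduplicated while keeping their first-seen order.
    """
    result = {key: list(values) for key, values in metadata.items()}

    # $set replaces the whole array for each key.
    for key, values in ops.get("$set", {}).items():
        result[key] = list(dict.fromkeys(values))

    # $append adds values to the end, skipping ones already present.
    for key, values in ops.get("$append", {}).items():
        result[key] = list(dict.fromkeys(result.get(key, []) + list(values)))

    # $remove drops the listed values and leaves the rest untouched.
    for key, values in ops.get("$remove", {}).items():
        if key in result:
            result[key] = [v for v in result[key] if v not in values]

    return result


current = {"tags": ["draft"], "categories": ["tech"], "labels": ["outdated", "public"]}
ops = {
    "$set": {"tags": ["important", "urgent"]},
    "$append": {"categories": ["new-category", "tech"]},
    "$remove": {"labels": ["outdated"]},
}
updated = apply_metadata_ops(current, ops)
print(updated)
```

Here `tags` is replaced outright, `categories` gains only the value it did not already hold, and `labels` loses `outdated` while keeping `public`.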

Enhanced search endpoint to support array metadata filtering:

{
  "query": "search query",
  "filter": {
    "metadata": {
      "tags": ["important", "urgent"], // OR condition: matches if document has either tag
      "categories": ["tech"] // AND condition: must match with above filter
    }
  }
}
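
The OR/AND rule above can be expressed in a few lines. This is a sketch of the matching logic only, evaluated locally against an in-memory dict; the function name `matches_filter` is hypothetical, and treating a scalar metadata value as a one-element array is an assumption.

```python
from typing import Any


def matches_filter(doc_metadata: dict[str, Any], metadata_filter: dict[str, list[Any]]) -> bool:
    """Return True when the document satisfies every key in the filter.

    Within one key, any listed value may match (OR).
    Across keys, every key must match (AND).
    """
    for key, wanted_values in metadata_filter.items():
        stored = doc_metadata.get(key, [])
        # Assumption: a scalar metadata value behaves like a one-element array.
        if not isinstance(stored, list):
            stored = [stored]
        if not any(value in stored for value in wanted_values):
            return False
    return True


metadata_filter = {"tags": ["important", "urgent"], "categories": ["tech"]}

doc_a = {"tags": ["urgent"], "categories": ["tech", "news"]}   # one tag matches, category matches
doc_b = {"tags": ["important", "urgent"], "categories": ["finance"]}  # tags match, category does not
doc_c = {"categories": ["tech"]}  # no tags at all

print(matches_filter(doc_a, metadata_filter))
print(matches_filter(doc_b, metadata_filter))
print(matches_filter(doc_c, metadata_filter))
```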

📝 Documentation

Updated document management guide with resync examples (docs)
Added array metadata handling guide (docs)
Enhanced search documentation with array metadata examples (docs)

January 22, 2025

Added native multitenancy support for enhanced data isolation.

🚀 Features

Added native multitenancy support:
- Virtual tenant separation within namespaces
- Automatic tenant-based data isolation
- Support for X-Tenant-ID header across all endpoints
- Tenant-scoped document management and search

🔄 API Changes

Added X-Tenant-ID header support for all namespace-related endpoints:
- Document ingestion endpoints
- Search endpoints
- Document management endpoints
- Connection management endpoints
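
A minimal sketch of attaching the tenant header on the client side. Only the header name `X-Tenant-ID` comes from this release; the helper name `with_tenant` and the rule that rejects blank tenant IDs are assumptions. The resulting dict can be passed to whichever HTTP client you already use.

```python
def with_tenant(headers: dict[str, str], tenant_id: str) -> dict[str, str]:
    """Return a copy of `headers` with the X-Tenant-ID header set.

    The input dict is not modified, so one base header set can be
    reused for several tenants.
    """
    if not tenant_id.strip():
        # Assumption: an empty tenant ID is a client-side mistake.
        raise ValueError("tenant_id must be a non-empty string")
    scoped = dict(headers)
    scoped["X-Tenant-ID"] = tenant_id
    return scoped


base_headers = {"Content-Type": "application/json"}
tenant_a_headers = with_tenant(base_headers, "tenant-a")
tenant_b_headers = with_tenant(base_headers, "tenant-b")
print(tenant_a_headers)
```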

📝 Documentation

Updated API reference to include tenant header usage (docs)
Added multitenancy guide with best practices (docs)

January 21, 2025

Enhanced connection management and ingestion capabilities.

🚀 Features

Added connection revocation support:
- New endpoint to revoke connections (/connections/:connectionId/revoke)
- Preserves existing ingested documents
- Automatic token refresh during ingestion
Improved ingestion response:
- Added document IDs in ingestion responses for immediate tracking
- Enhanced error handling and token refresh logic

🔄 API Changes

Added /v1/connections/:connectionId/revoke endpoint (docs)
Enhanced ingestion response format:
- Added documentIds in response for text, file, and URL ingestion
- Updated background job status tracking for sitemap and website ingestion

January 20, 2025

Major improvements to document management and ingestion capabilities.

🚀 Features

Added cursor-based pagination for document retrieval:
- Configurable page size (default: 20, max: 100)
- Consistent ordering by creation date
- Efficient navigation with cursor support
Enhanced document management:
- Automatic bulk operations for updates and deletes
- Synchronized storage and vector database operations
Improved error handling for S3-compatible storage
Enhanced sitemap ingestion with path filtering and a configurable limit on the number of links to ingest
Added connection tracking for documents

🔄 API Changes

Updated /v1/documents endpoint:
- Added pagination parameter with pageSize and cursor
- Enhanced response with returnedCount, hasNextPage, and nextCursor
Added maxLinks, includePaths, and excludePaths for sitemap ingestion
Added connectionId to document responses
Added clientRedirectUrl to connection endpoints
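
A sketch of walking all pages with the new cursor fields. The names `pageSize`, `cursor`, `returnedCount`, `hasNextPage`, and `nextCursor` come from this release. The `documents` key, the helper `iter_documents`, and the in-memory `fake_fetch_page` stand-in (used here instead of a real HTTP call) are assumptions for illustration.

```python
from typing import Callable, Iterator, Optional

FetchPage = Callable[[int, Optional[str]], dict]


def iter_documents(fetch_page: FetchPage, page_size: int = 20) -> Iterator[dict]:
    """Yield documents page by page until hasNextPage is false."""
    if not 1 <= page_size <= 100:
        raise ValueError("pageSize must be between 1 and 100")
    cursor: Optional[str] = None
    while True:
        page = fetch_page(page_size, cursor)
        # Assumption: the page's documents live under a "documents" key.
        yield from page["documents"]
        if not page.get("hasNextPage"):
            break
        cursor = page["nextCursor"]


def fake_fetch_page(page_size: int, cursor: Optional[str]) -> dict:
    """In-memory stand-in for the documents endpoint (hypothetical)."""
    data = [{"id": f"doc-{i}"} for i in range(5)]
    start = int(cursor) if cursor else 0
    chunk = data[start:start + page_size]
    end = start + len(chunk)
    has_next = end < len(data)
    return {
        "documents": chunk,
        "returnedCount": len(chunk),
        "hasNextPage": has_next,
        "nextCursor": str(end) if has_next else None,
    }


all_ids = [doc["id"] for doc in iter_documents(fake_fetch_page, page_size=2)]
print(all_ids)
```

In real use, `fetch_page` would wrap a call to /v1/documents and pass `pageSize` and `cursor` inside the `pagination` parameter.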

📝 Documentation

Updated document management guide with pagination examples (docs)
Enhanced API reference for document operations (docs)

January 19, 2025

Improved web content processing for better search results.

🚀 Features

Enhanced HTML content processing:
- Automatic removal of non-text elements
- Cleaner markdown output for LLM consumption
- Improved content relevance for search

🔄 API Changes

Enhanced web ingestion endpoints to remove:
- Script tags
- Style elements
- Head section
- Meta tags
- iframes
- Other non-content elements
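
To illustrate the kind of cleanup listed above, here is a rough sketch built on Python's standard `html.parser`. It is not SourceSync's processing pipeline: it emits plain text rather than markdown, and the class name `TextExtractor` and the exact set of skipped tags are assumptions.

```python
from html.parser import HTMLParser

# Elements whose entire contents are dropped. Meta tags carry no text,
# so they disappear without special handling.
SKIPPED_CONTAINERS = {"script", "style", "head", "iframe"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping non-content elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTAINERS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTAINERS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>T</title><meta charset='utf-8'>"
    "<style>p{color:red}</style></head>"
    "<body><script>alert(1)</script><h1>Hello</h1><p>World</p>"
    "<iframe src='x'></iframe></body></html>"
)
cleaned = extract_text(sample)
print(cleaned)
```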

January 18, 2025

API optimization and security improvements.

🚀 Features

Enhanced GET request handling:
- Request bodies on GET requests are ignored automatically
- Improved security and performance
- Better adherence to HTTP standards

January 17, 2025

Performance improvements and simplified namespace handling.

🚀 Features

Improved search performance:
- Faster /search endpoint response
- Enhanced /search/hybrid endpoint speed
Simplified namespace management:
- Using user-provided namespace identifiers
- Eliminated need to store SourceSync-generated IDs
Made ingestion parameters optional:
- Optional metadata
- Optional chunk configuration
- Optional OCR configuration with sensible defaults

🔄 API Changes

Updated namespace handling in all endpoints
Made metadata, chunkConfig, ocrConfig optional in /ingest/file
Set BASIC_PARSER as default OCR strategy

January 16, 2025

Added OCR support for enhanced document processing.

🚀 Features

Added OCR support for document processing:
- Support for scanned text documents
- Image text extraction
- Configurable OCR strategy
- Integration with existing document processing

🔄 API Changes

Added ocrConfig parameter to file ingestion:
- Optional strategy field
- Support for STANDARD_OCR and BASIC_PARSER
- Default to BASIC_PARSER when not specified
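
The defaulting rule can be sketched as a small client-side helper. The strategy names and the `BASIC_PARSER` default come from this release; the function name `resolve_ocr_config` and the choice to raise on unknown strategies are assumptions, not documented API behavior.

```python
from typing import Optional

ALLOWED_STRATEGIES = {"STANDARD_OCR", "BASIC_PARSER"}
DEFAULT_STRATEGY = "BASIC_PARSER"


def resolve_ocr_config(ocr_config: Optional[dict] = None) -> dict:
    """Fill in the default strategy when ocrConfig or its strategy is omitted."""
    strategy = (ocr_config or {}).get("strategy", DEFAULT_STRATEGY)
    if strategy not in ALLOWED_STRATEGIES:
        # Assumption: unknown strategies are rejected before sending the request.
        raise ValueError(f"Unsupported OCR strategy: {strategy!r}")
    return {"strategy": strategy}


print(resolve_ocr_config())
print(resolve_ocr_config({"strategy": "STANDARD_OCR"}))
```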

📝 Documentation

Updated data ingestion guide with OCR configuration (docs)
Added OCR processing examples and best practices (docs)

January 13, 2025

Bug fixes for text file processing and improvements to document management and statistics.

🐛 Bug Fixes

Fixed plain text and markdown file processing:
- Correctly preserves MIME types during file uploads
- Fixed content type detection for .txt and .md files
- Improved error handling for text-based files
Enhanced document metadata updates:
- Update endpoint now syncs metadata changes to vector storage
- Delete endpoint removes files from file storage and vectors from vector storage
- Ensures data consistency across all storage layers

🚀 Features

Added document statistics tracking:
- File size tracking
- Character and token count metrics
- Embedding count tracking
- Enhanced document configuration storage
Added API Logs Dashboard (Alpha):
- Real-time monitoring of API requests
- Filter by status, method, and endpoint
- Detailed request/response inspection
- Performance metrics and error tracking
- Coming soon to all customers

🔄 API Changes

Enhanced document model with new properties:
- Added documentProperties for tracking statistics:
  - fileSize: Size of the original document
  - characterCount: Total character count
  - tokenCount: Number of tokens processed
  - embeddingCount: Number of embeddings generated
- Added embeddingConfig for configuration tracking:
  - provider: Embedding provider used (OPENAI, COHERE, JINA)
  - model: Specific model used
  - dimensions: Embedding dimensions
  - chunkSize: Document chunk size
  - chunkOverlap: Chunk overlap settings
- Added providers configuration tracking:
  - fileStorage: Storage provider (S3_COMPATIBLE)
  - vectorStorage: Vector store provider (PINECONE)
  - embeddingModel: Embedding model provider
  - webScraper: Web scraping provider

📝 Documentation

Updated file upload guide with improved MIME type handling (docs)
Enhanced documents API reference to reflect metadata syncing (docs)

January 12, 2025

Added Box integration to connect and search your Box enterprise content.

🚀 Features

Added Box connector to connect your Box with SourceSync (docs):
- New endpoint to add a new Box connection
- Secure OAuth2 flow for Box access and file selection
- Support for PDF, CSV, DOCX, TXT, MD, and PPTX files
- Integration with enterprise content management
Updated namespace:
- Added Box configuration to namespace

🔄 API Changes

Updated /v1/connections endpoint for creating and managing Box connections (docs)
Added /v1/ingest/box endpoint for ingesting selected Box files (docs)
Updated namespace configuration to support Box settings (docs)

📝 Documentation

New guide on how to use the Box connector (docs)
Updated the namespace reference to include Box configuration (docs)
Added new pages to sitemap.xml for improved SEO

January 11, 2025

Added Jina Reader API and ScrapingBee as new web scraping providers for enhanced content ingestion capabilities.

🚀 Features

Added Jina Reader API as a free web scraping provider (docs)
Added ScrapingBee as a web scraping provider with JavaScript rendering support (docs)

🔄 API Changes

Added Jina and ScrapingBee web scraper configurations in namespace settings (docs)

📝 Documentation

Updated web scraping guide with Jina and ScrapingBee configurations (docs)

January 10, 2025

Enhanced pricing transparency and added API logging capabilities.

🚀 Features

Added API request logging:
- Basic request metadata (timestamp, endpoint, method)
- Response status and timing
- Error tracking
- Preparing for upcoming API dashboard feature

🔄 API Changes

Added request logging to all API endpoints:
- No changes to request/response format
- No performance impact
- Tiered log retention (7-90 days based on plan)
- No sensitive data or payload contents stored

📝 Documentation

Enhanced pricing page clarity (docs):
- Added detailed processing limits with overage costs
- Clarified retrieval call limits and pricing
- Improved rate limit explanations
- Added log retention periods by plan
- Enhanced enterprise plan feature list
Added security documentation (docs):
- Data privacy commitments
- Infrastructure details
- Compliance roadmap

January 9, 2025

Added OneDrive integration and Jina AI embedding models for enhanced content search capabilities.

🚀 Features

Added OneDrive connector to connect your OneDrive with SourceSync (docs):
- New endpoint to add a new OneDrive connection
- Secure OAuth2 flow for OneDrive access and file selection
- Support for PDF, CSV, DOCX, TXT, MD, and PPTX files
- Integration with both personal and business accounts
- Required Microsoft Graph permissions: Files.Read.All, offline_access, openid, User.Read
Updated namespace:
- Added OneDrive configuration to namespace
Added Jina AI as a new embedding model provider (docs):
- High-performance jina-embeddings-v3 model with 1024 dimensions

🔄 API Changes

Updated /v1/connections endpoint for creating and managing OneDrive connections (docs)
Added /v1/ingest/onedrive endpoint for ingesting selected OneDrive files (docs)
Updated namespace configuration to support:
- OneDrive connector settings (docs)
- Jina embedding model in embeddingModelConfig (docs)

📝 Documentation

New guide on how to use the OneDrive connector (docs)
Added Jina embedding models to supported models list with configuration examples (docs)
Updated the namespace reference to include new configurations (docs)

January 8, 2025

Added Dropbox integration to connect and search your Dropbox content.

🚀 Features

Added Dropbox connector to connect your Dropbox with SourceSync (docs):
- New endpoint to add a new Dropbox connection
- Secure OAuth2 flow for Dropbox access and file selection
- Support for PDF, CSV, DOCX, TXT, MD, and PPTX files
Updated namespace:
- Added Dropbox configuration to namespace

🔄 API Changes

Updated /v1/connections endpoint for creating and managing Dropbox connections (docs)
Added /v1/ingest/dropbox endpoint for ingesting selected Dropbox files (docs)
Updated namespace configuration to support Dropbox settings (docs)

📝 Documentation

New guide on how to use the Dropbox connector (docs)
Updated the namespace reference to include Dropbox configuration (docs)
Added new pages to sitemap.xml for improved SEO

January 7, 2025

Added Google Drive integration to connect and search your Drive content.

🚀 Features

Added Google Drive connector to connect your Drive with SourceSync (docs):
- New endpoint to add a new Drive connection
- Secure OAuth2 flow for Drive access and file selection
- Support for Google Docs, Sheets, and native files
Updated namespace:
- Added Google Drive configuration to namespace

🔄 API Changes

Updated /v1/connections endpoint for creating and managing Drive connections (docs)
Added /v1/ingest/google-drive endpoint for ingesting selected Drive files (docs)
Updated namespace configuration to support Drive settings (docs)

📝 Documentation

New guide on how to use the Google Drive connector (docs)
Updated the namespace reference to include Drive configuration (docs)
Added new pages to sitemap.xml for improved SEO

January 6, 2025

Major improvements to document connectivity with the addition of the Notion connector.

🚀 Features

Added Notion connector to connect your Notion workspace with SourceSync (docs):
- New endpoint to add a new Notion connection
- Secure OAuth2 flow for Notion workspace access and content ingestion
Updated namespace:
- Added Notion configuration to namespace

🔄 API Changes

Added /v1/connections endpoint for creating and fetching connections (docs)
Added /v1/connections/:connectionId endpoint for getting and managing a particular connection (docs)
Added /v1/ingest/notion endpoint for ingesting Notion content from all the pages you select during the OAuth flow (docs)
Updated namespace configuration to support Notion settings (docs)

📝 Documentation

New guide on how to use the Notion connector (docs)
New endpoint reference for connectors (docs)
Updated the namespace reference to include Notion configuration (docs)
Added new pages to sitemap.xml for improved SEO and discoverability

January 5, 2025

Major improvements to document ingestion capabilities, introducing direct file uploads and API standardization.

🚀 Features

Added direct file upload support with multiple formats (docs):
- Documents: .pdf, .docx, .pptx, .xlsx
- Text-based formats: .csv, .json, .xml, .html
- Upload files directly through the API without needing public URLs
Enhanced document model with ingestion tracking (docs):
- Added ingestJob field
- Added ingestJobRun field

🔄 API Changes

Added new /ingest/file endpoint for direct file uploads (docs)
Added web scraper configuration support in namespaces
Standardized document filter parameters to camelCase (docs):
- document_ids → documentIds
- document_external_ids → documentExternalIds
- document_types → documentTypes
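
If existing client code still builds filters with the old snake_case keys, a small mapping is enough to migrate them. The three renames below are taken from this release; the helper name `migrate_filter` is hypothetical, and unknown keys are passed through unchanged by assumption.

```python
# Old key -> new key, as listed in this release.
RENAMED_KEYS = {
    "document_ids": "documentIds",
    "document_external_ids": "documentExternalIds",
    "document_types": "documentTypes",
}


def migrate_filter(old_filter: dict) -> dict:
    """Return a copy of the filter with snake_case keys renamed to camelCase."""
    return {RENAMED_KEYS.get(key, key): value for key, value in old_filter.items()}


legacy = {"document_ids": ["a", "b"], "document_types": ["TEXT"], "metadata": {"team": "docs"}}
migrated = migrate_filter(legacy)
print(migrated)
```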

📝 Documentation

Added sitemap.xml for improved SEO and discoverability
Updated RAG flow diagrams in "What is RAG?" guide:
- Added light mode versions
- Improved contrast and readability
- Enhanced visual consistency across themes

January 4, 2025 🚀

We're excited to announce the official launch of SourceSync - a privacy-focused, self-serve platform for building AI-powered search and Q&A applications!

🚀 Core Features

Privacy-First Architecture:
- Bring your own S3-compatible storage (docs)
- Bring your own vector database (docs)
- Use your own LLM API keys (docs)
- No data retention or training on your content
- EU-based metadata storage
Document Processing:
- Raw text ingestion (docs)
- Web scraping capabilities (docs):
  - URLs: Process web pages and files available through public URLs
  - Sitemaps: Automatically process all pages in a sitemap
  - Websites: Intelligently crawl websites with custom rules
- Custom chunking configuration (docs)
Search & Retrieval:
- Hybrid search combining semantic and keyword approaches (docs)
- Support for multiple embedding models (docs):
  - OpenAI: State-of-the-art embeddings
  - Cohere: Multilingual support
- Basic metadata filtering (key=value pairs) (docs)
- Multilingual search support (docs)
Developer Experience:
- Comprehensive API access (docs)
- Detailed documentation (docs)
- Multiple code examples (cURL, JavaScript, Python)

📈 Launch Plans

Pilot ($99/month):
- 5,000 monthly ingestion pages
- 25,000 monthly retrieval calls
- ~50 requests/minute
- Email support
Pro ($299/month):
- 25,000 monthly ingestion pages
- 100,000 monthly retrieval calls
- ~200 requests/minute
- Standard support
Team ($999/month):
- 100,000 monthly ingestion pages
- 500,000 monthly retrieval calls
- ~500 requests/minute
- Priority support
Enterprise (Custom pricing):
- Custom usage limits
- Custom rate limits
- Dedicated SLA
- White-glove support

🗺️ Roadmap

Enhanced Processing:
- Direct file uploads (coming tomorrow!)
- OCR for tables & images in PDFs
- Advanced metadata filtering
- Additional embedding models (Voyage, Mistral)
- Additional vector databases (Qdrant, Weaviate)
Advanced Features:
- Advanced RAG pipeline with reranking
- Multi-step retrieval
- Real-time content updates
- Custom embeddings & fine-tuning
Integrations:
- Google Drive
- Notion
- SharePoint
- Dropbox
- And more connectors
Developer Tools:
- Webhooks for real-time notifications
- Advanced analytics
- Usage pattern monitoring
- Search quality metrics
