Changelog

Stay up to date with all the latest changes and improvements to SourceSync.


January 23, 2025

Enhanced document management with improved metadata handling and resync capabilities.

🚀 Features

  • Added array metadata support:
    • Store arrays of values in document metadata
    • Flexible array operations:
      • Replace arrays using $set
      • Add values using $append
      • Remove values using $remove
    • Smart array search:
      • OR condition within array values
      • AND condition between different metadata keys
      • Automatic deduplication of values
  • Enhanced document model:
    • Added name field for better document identification
    • Added mimeType tracking in document properties
    • Added document source statistics in fetch response
  • Added document resync support:
    • New endpoint to trigger document resyncs
    • Automatic reprocessing of existing documents
    • Support for all document types except text and local files
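
The array operations listed above can be modeled locally to see how they combine. This is a minimal Python sketch, not SourceSync's implementation: the function name `apply_metadata_ops` is hypothetical, and the exact semantics (order preservation, deduplication on append, ignoring absent values on remove) are assumptions made for illustration.

```python
from typing import Any, Dict, List

Metadata = Dict[str, List[Any]]


def apply_metadata_ops(metadata: Metadata, ops: Dict[str, Metadata]) -> Metadata:
    """Return a new metadata dict with $set, $append, and $remove applied.

    Assumed semantics (not confirmed by the changelog):
    - $set replaces the whole array for a key
    - $append adds only values that are not already present
    - $remove drops matching values and ignores values that are absent
    """
    # Copy each list so the caller's metadata is never mutated.
    result = {key: list(values) for key, values in metadata.items()}

    for key, values in ops.get("$set", {}).items():
        # dict.fromkeys keeps first-seen order while dropping duplicates.
        result[key] = list(dict.fromkeys(values))

    for key, values in ops.get("$append", {}).items():
        current = result.setdefault(key, [])
        for value in values:
            if value not in current:
                current.append(value)

    for key, values in ops.get("$remove", {}).items():
        if key in result:
            result[key] = [v for v in result[key] if v not in values]

    return result


updated = apply_metadata_ops(
    {"categories": ["tech"], "labels": ["outdated", "draft"]},
    {
        "$set": {"tags": ["important", "urgent"]},
        "$append": {"categories": ["new-category"]},
        "$remove": {"labels": ["outdated"]},
    },
)
print(updated)
```

The sketch returns a new dict rather than editing in place, which keeps the before and after states easy to compare.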

🔄 API Changes

  • Added /v1/documents/resync endpoint (docs)
  • Enhanced /v1/documents response with source statistics:
    {
      "stats": [
        {
          "source": "TEXT",
          "totalCount": 50
        },
        {
          "source": "LOCAL_FILE",
          "totalCount": 25
        },
        {
          "source": "URLS_LIST",
          "totalCount": 100
        },
        {
          "source": "SITEMAP",
          "totalCount": 75
        }
      ]
    }
    
  • Updated metadata operations in document updates:
    {
      "$metadata": {
        "$set": { "tags": ["important", "urgent"] },
        "$append": { "categories": ["new-category"] },
        "$remove": { "labels": ["outdated"] }
      }
    }
    
  • Enhanced search endpoint to support array metadata filtering:
    {
      "query": "search query",
      "filter": {
        "metadata": {
          "tags": ["important", "urgent"], // OR condition: matches if document has either tag
          "categories": ["tech"] // AND condition: must match with above filter
        }
      }
    }
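
The filter rules in the search example (OR within one key's array, AND across keys) can be expressed as a small predicate. This is a hedged sketch of the matching logic only: `matches_filter` is a hypothetical helper, and it assumes metadata values are hashable (for example strings) and that an empty filter matches every document.

```python
from typing import Any, Dict, List


def matches_filter(
    doc_metadata: Dict[str, List[Any]],
    metadata_filter: Dict[str, List[Any]],
) -> bool:
    """Return True if the document satisfies every key in the filter.

    Within one key, any shared value is enough (OR).
    Across keys, every key must match (AND).
    """
    for key, wanted in metadata_filter.items():
        have = set(doc_metadata.get(key, []))
        if not have.intersection(wanted):
            return False
    return True


search_filter = {"tags": ["important", "urgent"], "categories": ["tech"]}

doc_a = {"tags": ["urgent"], "categories": ["tech", "news"]}
doc_b = {"tags": ["important"], "categories": ["finance"]}

print(matches_filter(doc_a, search_filter))
print(matches_filter(doc_b, search_filter))
```

Here `doc_a` matches because it has one of the requested tags and the requested category, while `doc_b` fails the `categories` condition.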
    

📝 Documentation

  • Updated document management guide with resync examples (docs)
  • Added array metadata handling guide (docs)
  • Enhanced search documentation with array metadata examples (docs)

January 22, 2025

Added native multitenancy support for enhanced data isolation.

🚀 Features

  • Added native multitenancy support:
    • Virtual tenant separation within namespaces
    • Automatic tenant-based data isolation
    • Support for X-Tenant-ID header across all endpoints
    • Tenant-scoped document management and search

🔄 API Changes

  • Added X-Tenant-ID header support for all namespace-related endpoints:
    • Document ingestion endpoints
    • Search endpoints
    • Document management endpoints
    • Connection management endpoints
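
As an illustration of tenant scoping, a client can attach the `X-Tenant-ID` header to whatever headers it already sends. This is a minimal sketch under assumptions: `with_tenant` is a hypothetical helper, the tenant value `tenant-a` is made up, and authentication headers are deliberately left out because the changelog does not describe them.

```python
from typing import Dict, Mapping, Optional


def with_tenant(headers: Mapping[str, str], tenant_id: Optional[str]) -> Dict[str, str]:
    """Return a copy of the headers, scoped to a tenant when one is given."""
    scoped = dict(headers)
    if tenant_id:
        # Header name taken from the changelog entry above.
        scoped["X-Tenant-ID"] = tenant_id
    return scoped


base_headers = {"Content-Type": "application/json"}
print(with_tenant(base_headers, "tenant-a"))
print(with_tenant(base_headers, None))
```

Building the headers in one place makes it harder to forget the tenant header on a single endpoint.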

📝 Documentation

  • Updated API reference to include tenant header usage (docs)
  • Added multitenancy guide with best practices (docs)

January 21, 2025

Enhanced connection management and ingestion capabilities.

🚀 Features

  • Added connection revocation support:
    • New endpoint to revoke connections (/connections/:connectionId/revoke)
    • Preserves existing ingested documents
    • Automatic token refresh during ingestion
  • Improved ingestion response:
    • Added document IDs in ingestion responses for immediate tracking
    • Enhanced error handling and token refresh logic

🔄 API Changes

  • Added /v1/connections/:connectionId/revoke endpoint (docs)
  • Enhanced ingestion response format:
    • Added documentIds in response for text, file, and URL ingestion
    • Updated background job status tracking for sitemap and website ingestion

January 20, 2025

Major improvements to document management and ingestion capabilities.

🚀 Features

  • Added cursor-based pagination for document retrieval:
    • Configurable page size (default: 20, max: 100)
    • Consistent ordering by creation date
    • Efficient navigation with cursor support
  • Enhanced document management:
    • Automatic bulk operations for updates and deletes
    • Synchronized storage and vector database operations
  • Improved error handling for S3-compatible storage
  • Enhanced sitemap ingestion with path filtering and a configurable limit on the number of links to ingest
  • Added connection tracking for documents
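
The sitemap path filtering and link limit can be pictured as a simple selection step over the URLs found in a sitemap. The sketch below is illustrative only: `select_links` is a hypothetical function, its arguments mirror the `maxLinks`, `includePaths`, and `excludePaths` parameters listed under API Changes, and prefix matching on the URL path is an assumption, since the changelog does not say how paths are matched.

```python
from typing import Iterable, List, Optional, Sequence
from urllib.parse import urlparse


def select_links(
    urls: Iterable[str],
    include_paths: Optional[Sequence[str]] = None,
    exclude_paths: Optional[Sequence[str]] = None,
    max_links: Optional[int] = None,
) -> List[str]:
    """Keep URLs that pass the path filters, up to max_links entries."""
    selected: List[str] = []
    for url in urls:
        if max_links is not None and len(selected) >= max_links:
            break
        path = urlparse(url).path or "/"
        # Assumed rule: a path matches when it starts with the given prefix.
        if include_paths and not any(path.startswith(p) for p in include_paths):
            continue
        if exclude_paths and any(path.startswith(p) for p in exclude_paths):
            continue
        selected.append(url)
    return selected


sitemap_urls = [
    "https://example.com/docs/a",
    "https://example.com/docs/private/b",
    "https://example.com/blog/c",
    "https://example.com/docs/d",
]
print(select_links(sitemap_urls, ["/docs"], ["/docs/private"], max_links=2))
```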

🔄 API Changes

  • Updated /v1/documents endpoint:
    • Added pagination parameter with pageSize and cursor
    • Enhanced response with returnedCount, hasNextPage, and nextCursor
  • Added maxLinks, includePaths, and excludePaths for sitemap ingestion
  • Added connectionId to document responses
  • Added clientRedirectUrl to connection endpoints
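
A client can walk all pages by following `nextCursor` until `hasNextPage` is false. The following is a sketch under stated assumptions rather than an official client: `iter_documents` and `fake_fetch_page` are hypothetical, the `documents` key in the response is assumed, and the in-memory fetcher stands in for a real call to /v1/documents.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]
FetchPage = Callable[[int, Optional[str]], Page]


def iter_documents(fetch_page: FetchPage, page_size: int = 20) -> Iterator[Dict[str, Any]]:
    """Yield every document by following nextCursor until hasNextPage is false."""
    if not 1 <= page_size <= 100:
        # Limits from the changelog: default 20, max 100.
        raise ValueError("page_size must be between 1 and 100")
    cursor: Optional[str] = None
    while True:
        page = fetch_page(page_size, cursor)
        yield from page["documents"]  # "documents" is an assumed response key
        if not page["hasNextPage"]:
            break
        cursor = page["nextCursor"]


def fake_fetch_page(page_size: int, cursor: Optional[str]) -> Page:
    """In-memory stand-in for the API; the cursor here is just a list offset."""
    all_docs = [{"id": f"doc-{i}"} for i in range(45)]
    start = int(cursor) if cursor else 0
    chunk = all_docs[start:start + page_size]
    end = start + len(chunk)
    has_next = end < len(all_docs)
    return {
        "documents": chunk,
        "returnedCount": len(chunk),
        "hasNextPage": has_next,
        "nextCursor": str(end) if has_next else None,
    }


ids = [doc["id"] for doc in iter_documents(fake_fetch_page, page_size=20)]
print(len(ids))
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP library, so the same generator works against a real client.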

📝 Documentation

  • Updated document management guide with pagination examples (docs)
  • Enhanced API reference for document operations (docs)

January 19, 2025

Improved web content processing for better search results.

🚀 Features

  • Enhanced HTML content processing:
    • Automatic removal of non-text elements
    • Cleaner markdown output for LLM consumption
    • Improved content relevance for search

🔄 API Changes

  • Enhanced web ingestion endpoints to remove:
    • Script tags
    • Style elements
    • Head section
    • Meta tags
    • iframes
    • Other non-content elements
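
The kind of cleanup described above can be approximated with Python's standard `html.parser`. This is a rough sketch, not the pipeline SourceSync uses: `TextExtractor` and `extract_text` are hypothetical names, the output is plain text rather than markdown, and meta tags drop out simply because they carry no text.

```python
from html.parser import HTMLParser
from typing import List

# Elements whose entire content is discarded.
SKIPPED_TAGS = {"script", "style", "head", "iframe"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside skipped elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><title>T</title><meta charset='utf-8'>"
    "<style>p {color: red}</style></head>"
    "<body><script>var x = 1;</script><h1>Hello</h1>"
    "<iframe src='x'></iframe><p>World</p></body></html>"
)
print(extract_text(page))
```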

January 18, 2025

API optimization and security improvements.

🚀 Features

  • Enhanced GET request handling:
    • Automatic ignoring of request body
    • Improved security and performance
    • Better adherence to HTTP standards

January 17, 2025

Performance improvements and simplified namespace handling.

🚀 Features

  • Improved search performance:
    • Faster /search endpoint response
    • Enhanced /search/hybrid endpoint speed
  • Simplified namespace management:
    • Using user-provided namespace identifiers
    • Eliminated need to store SourceSync-generated IDs
  • Made ingestion parameters optional:
    • Optional metadata
    • Optional chunk configuration
    • Optional OCR configuration with sensible defaults

🔄 API Changes

  • Updated namespace handling in all endpoints
  • Made metadata, chunkConfig, ocrConfig optional in /ingest/file
  • Set BASIC_PARSER as default OCR strategy

January 16, 2025

Added OCR support for enhanced document processing.

🚀 Features

  • Added OCR support for document processing:
    • Support for scanned text documents
    • Image text extraction
    • Configurable OCR strategy
    • Integration with existing document processing

🔄 API Changes

  • Added ocrConfig parameter to file ingestion:
    • Optional strategy field
    • Support for STANDARD_OCR and BASIC_PARSER
    • Default to BASIC_PARSER when not specified
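
The defaulting rule for `ocrConfig` can be sketched as a small resolver. The helper `resolve_ocr_strategy` is hypothetical, and rejecting unknown strategy names with an error is an assumption; the changelog only states the two supported values and the default.

```python
from typing import Any, Mapping, Optional

OCR_STRATEGIES = ("BASIC_PARSER", "STANDARD_OCR")
DEFAULT_OCR_STRATEGY = "BASIC_PARSER"


def resolve_ocr_strategy(ocr_config: Optional[Mapping[str, Any]] = None) -> str:
    """Return the strategy to use, falling back to BASIC_PARSER when unset."""
    strategy = (ocr_config or {}).get("strategy") or DEFAULT_OCR_STRATEGY
    if strategy not in OCR_STRATEGIES:
        # Assumed behavior: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported OCR strategy: {strategy!r}")
    return strategy


print(resolve_ocr_strategy())
print(resolve_ocr_strategy({"strategy": "STANDARD_OCR"}))
```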

📝 Documentation

  • Updated data ingestion guide with OCR configuration (docs)
  • Added OCR processing examples and best practices (docs)

January 13, 2025

Bug fixes for text file processing and improvements to document management and statistics.

🐛 Bug Fixes

  • Fixed plain text and markdown file processing:
    • Correctly preserves MIME types during file uploads
    • Fixed content type detection for .txt and .md files
    • Improved error handling for text-based files
  • Enhanced document metadata updates:
    • Update endpoint now syncs metadata changes to vector storage
    • Delete endpoint removes files from file storage and vectors from vector storage
    • Ensures data consistency across all storage layers

🚀 Features

  • Added document statistics tracking:
    • File size tracking
    • Character and token count metrics
    • Embedding count tracking
    • Enhanced document configuration storage
  • Added API Logs Dashboard (Alpha):
    • Real-time monitoring of API requests
    • Filter by status, method, and endpoint
    • Detailed request/response inspection
    • Performance metrics and error tracking
    • Coming soon to all customers

🔄 API Changes

  • Enhanced document model with new properties:
    • Added documentProperties for tracking statistics:
      • fileSize: Size of the original document
      • characterCount: Total character count
      • tokenCount: Number of tokens processed
      • embeddingCount: Number of embeddings generated
    • Added embeddingConfig for configuration tracking:
      • provider: Embedding provider used (OPENAI, COHERE, JINA)
      • model: Specific model used
      • dimensions: Embedding dimensions
      • chunkSize: Document chunk size
      • chunkOverlap: Chunk overlap settings
    • Added providers configuration tracking:
      • fileStorage: Storage provider (S3_COMPATIBLE)
      • vectorStorage: Vector store provider (PINECONE)
      • embeddingModel: Embedding model provider
      • webScraper: Web scraping provider
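
For client code, the new statistics and configuration fields could be modeled as typed records. The sketch below is an assumption-heavy illustration: the field names come from the list above, but the Python types, the unit of `fileSize` (bytes), and every sample value except the Jina model name and its 1024 dimensions are made up for the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DocumentProperties:
    # Field names mirror the API; types and units are assumed.
    fileSize: int  # assumed to be bytes
    characterCount: int
    tokenCount: int
    embeddingCount: int


@dataclass(frozen=True)
class EmbeddingConfig:
    provider: str  # one of OPENAI, COHERE, JINA
    model: str
    dimensions: int
    chunkSize: int
    chunkOverlap: int


props = DocumentProperties(
    fileSize=204_800,
    characterCount=12_000,
    tokenCount=3_000,
    embeddingCount=8,
)
config = EmbeddingConfig(
    provider="JINA",
    model="jina-embeddings-v3",
    dimensions=1024,
    chunkSize=512,  # illustrative value
    chunkOverlap=64,  # illustrative value
)
print(asdict(props))
print(asdict(config))
```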

📝 Documentation

  • Updated file upload guide with improved MIME type handling (docs)
  • Enhanced documents API reference to reflect metadata syncing (docs)

January 12, 2025

Added Box integration to connect and search your Box enterprise content.

🚀 Features

  • Added Box connector to connect your Box account with SourceSync (docs):
    • New endpoint to add a new Box connection
    • Secure OAuth2 flow for Box access and file selection
    • Support for PDF, CSV, DOCX, TXT, MD, and PPTX files
    • Integration with enterprise content management
  • Updated namespace:
    • Added Box configuration to namespace

🔄 API Changes

  • Updated /v1/connections endpoint for creating and managing Box connections (docs)
  • Added /v1/ingest/box endpoint for ingesting selected Box files (docs)
  • Updated namespace configuration to support Box settings (docs)

📝 Documentation

  • New guide on how to use the Box connector (docs)
  • Updated the namespace reference to include Box configuration (docs)
  • Added new pages to sitemap.xml for improved SEO

January 11, 2025

Added Jina Reader API and ScrapingBee as new web scraping providers for enhanced content ingestion capabilities.

🚀 Features

  • Added Jina Reader API as a free web scraping provider (docs)
  • Added ScrapingBee as a web scraping provider with JavaScript rendering support (docs)

🔄 API Changes

  • Added Jina and ScrapingBee web scraper configurations in namespace settings (docs)

📝 Documentation

  • Updated web scraping guide with Jina and ScrapingBee configurations (docs)

January 10, 2025

Enhanced pricing transparency and added API logging capabilities.

🚀 Features

  • Added API request logging:
    • Basic request metadata (timestamp, endpoint, method)
    • Response status and timing
    • Error tracking
    • Preparing for upcoming API dashboard feature

🔄 API Changes

  • Added request logging to all API endpoints:
    • No changes to request/response format
    • No performance impact
    • Tiered log retention (7-90 days based on plan)
    • No sensitive data or payload contents stored

📝 Documentation

  • Enhanced pricing page clarity (docs):
    • Added detailed processing limits with overage costs
    • Clarified retrieval call limits and pricing
    • Improved rate limit explanations
    • Added log retention periods by plan
    • Enhanced enterprise plan feature list
  • Added security documentation (docs):
    • Data privacy commitments
    • Infrastructure details
    • Compliance roadmap

January 9, 2025

Added OneDrive integration and Jina AI embedding models for enhanced content search capabilities.

🚀 Features

  • Added OneDrive connector to connect your OneDrive with SourceSync (docs):
    • New endpoint to add a new OneDrive connection
    • Secure OAuth2 flow for OneDrive access and file selection
    • Support for PDF, CSV, DOCX, TXT, MD, and PPTX files
    • Integration with both personal and business accounts
    • Required Microsoft Graph permissions: Files.Read.All, offline_access, openid, User.Read
  • Updated namespace:
    • Added OneDrive configuration to namespace
  • Added Jina AI as a new embedding model provider (docs):
    • High-performance jina-embeddings-v3 model with 1024 dimensions

🔄 API Changes

  • Updated /v1/connections endpoint for creating and managing OneDrive connections (docs)
  • Added /v1/ingest/onedrive endpoint for ingesting selected OneDrive files (docs)
  • Updated namespace configuration to support:
    • OneDrive connector settings (docs)
    • Jina embedding model in embeddingModelConfig (docs)

📝 Documentation

  • New guide on how to use the OneDrive connector (docs)
  • Added Jina embedding models to supported models list with configuration examples (docs)
  • Updated the namespace reference to include new configurations (docs)

January 8, 2025

Added Dropbox integration to connect and search your Dropbox content.

🚀 Features

  • Added Dropbox connector to connect your Dropbox with SourceSync (docs):
    • New endpoint to add a new Dropbox connection
    • Secure OAuth2 flow for Dropbox access and file selection
    • Support for PDF, CSV, DOCX, TXT, MD, and PPTX files
  • Updated namespace:
    • Added Dropbox configuration to namespace

🔄 API Changes

  • Updated /v1/connections endpoint for creating and managing Dropbox connections (docs)
  • Added /v1/ingest/dropbox endpoint for ingesting selected Dropbox files (docs)
  • Updated namespace configuration to support Dropbox settings (docs)

📝 Documentation

  • New guide on how to use the Dropbox connector (docs)
  • Updated the namespace reference to include Dropbox configuration (docs)
  • Added new pages to sitemap.xml for improved SEO

January 7, 2025

Added Google Drive integration to connect and search your Drive content.

🚀 Features

  • Added Google Drive connector to connect your Drive with SourceSync (docs):
    • New endpoint to add a new Drive connection
    • Secure OAuth2 flow for Drive access and file selection
    • Support for Google Docs, Sheets, and native files
  • Updated namespace:
    • Added Google Drive configuration to namespace

🔄 API Changes

  • Updated /v1/connections endpoint for creating and managing Drive connections (docs)
  • Added /v1/ingest/google-drive endpoint for ingesting selected Drive files (docs)
  • Updated namespace configuration to support Drive settings (docs)

📝 Documentation

  • New guide on how to use the Google Drive connector (docs)
  • Updated the namespace reference to include Drive configuration (docs)
  • Added new pages to sitemap.xml for improved SEO

January 6, 2025

Major improvements to document connectivity with the addition of the Notion connector.

🚀 Features

  • Added Notion connector to connect your Notion workspace with SourceSync (docs):
    • New endpoint to add a new Notion connection
    • Secure OAuth2 flow for Notion workspace access and content ingestion
  • Updated namespace:
    • Added Notion configuration to namespace

🔄 API Changes

  • Added /v1/connections endpoint for creating and fetching connections (docs)
  • Added /v1/connections/:connectionId endpoint for getting and managing a particular connection (docs)
  • Added /v1/ingest/notion endpoint for ingesting Notion content from all the pages you select during the OAuth flow (docs)
  • Updated namespace configuration to support Notion settings (docs)

📝 Documentation

  • New guide on how to use the Notion connector (docs)
  • New endpoint reference for connectors (docs)
  • Updated the namespace reference to include Notion configuration (docs)
  • Added new pages to sitemap.xml for improved SEO and discoverability

January 5, 2025

Major improvements to document ingestion capabilities, introducing direct file uploads and API standardization.

🚀 Features

  • Added direct file upload support with multiple formats (docs):
    • Documents: .pdf, .docx, .pptx, .xlsx
    • Text-based formats: .csv, .json, .xml, .html
    • Upload files directly through the API without needing public URLs
  • Enhanced document model with ingestion tracking (docs):
    • Added ingestJob field
    • Added ingestJobRun field

🔄 API Changes

  • Added new /ingest/file endpoint for direct file uploads (docs)
  • Added web scraper configuration support in namespaces
  • Standardized document filter parameters to camelCase (docs):
    • document_ids → documentIds
    • document_external_ids → documentExternalIds
    • document_types → documentTypes
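
Clients that still build filters with the old snake_case names can convert them mechanically. This is a small sketch with hypothetical helper names (`snake_to_camel`, `migrate_filter_keys`); it only renames top-level keys and leaves values untouched.

```python
from typing import Any, Dict


def snake_to_camel(name: str) -> str:
    """Convert a snake_case identifier to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def migrate_filter_keys(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the filter dict with camelCase top-level keys."""
    return {snake_to_camel(key): value for key, value in filters.items()}


old_filter = {
    "document_ids": ["doc-1"],
    "document_external_ids": ["ext-9"],
    "document_types": ["TEXT"],
}
print(migrate_filter_keys(old_filter))
```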

📝 Documentation

  • Added sitemap.xml for improved SEO and discoverability
  • Updated RAG flow diagrams in "What is RAG?" guide:
    • Added light mode versions
    • Improved contrast and readability
    • Enhanced visual consistency across themes

January 4, 2025 🚀

We're excited to announce the official launch of SourceSync - a privacy-focused, self-serve platform for building AI-powered search and Q&A applications!

🚀 Core Features

  • Privacy-First Architecture:
    • Bring your own S3-compatible storage (docs)
    • Bring your own vector database (docs)
    • Use your own LLM API keys (docs)
    • No data retention or training on your content
    • EU-based metadata storage
  • Document Processing:
    • Raw text ingestion (docs)
    • Web scraping capabilities (docs):
      • URLs: Process web pages and files available through public URLs
      • Sitemaps: Automatically process all pages in a sitemap
      • Websites: Intelligently crawl websites with custom rules
    • Custom chunking configuration (docs)
  • Search & Retrieval:
    • Hybrid search combining semantic and keyword approaches (docs)
    • Support for multiple embedding models (docs):
      • OpenAI: State-of-the-art embeddings
      • Cohere: Multilingual support
    • Basic metadata filtering (key=value pairs) (docs)
    • Multilingual search support (docs)
  • Developer Experience:
    • Comprehensive API access (docs)
    • Detailed documentation (docs)
    • Multiple code examples (cURL, JavaScript, Python)

📈 Launch Plans

  • Pilot ($99/month):
    • 5,000 monthly ingestion pages
    • 25,000 monthly retrieval calls
    • ~50 requests/minute
    • Email support
  • Pro ($299/month):
    • 25,000 monthly ingestion pages
    • 100,000 monthly retrieval calls
    • ~200 requests/minute
    • Standard support
  • Team ($999/month):
    • 100,000 monthly ingestion pages
    • 500,000 monthly retrieval calls
    • ~500 requests/minute
    • Priority support
  • Enterprise (Custom pricing):
    • Custom usage limits
    • Custom rate limits
    • Dedicated SLA
    • White-glove support

🗺️ Roadmap

  • Enhanced Processing:
    • Direct file uploads (coming tomorrow!)
    • OCR for tables & images in PDFs
    • Advanced metadata filtering
    • Additional embedding models (Voyage, Mistral)
    • Additional vector databases (Qdrant, Weaviate)
  • Advanced Features:
    • Advanced RAG pipeline with reranking
    • Multi-step retrieval
    • Real-time content updates
    • Custom embeddings & fine-tuning
  • Integrations:
    • Google Drive
    • Notion
    • SharePoint
    • Dropbox
    • And more connectors
  • Developer Tools:
    • Webhooks for real-time notifications
    • Advanced analytics
    • Usage pattern monitoring
    • Search quality metrics