Document Management

Learn how to manage and retrieve your documents effectively in SourceSync.

Overview

SourceSync provides a powerful document management system that allows you to:

  • Track all ingested documents
  • Filter documents by various criteria
  • Update document metadata
  • Organize content across namespaces

Document Structure

Each document in SourceSync has:

  • A unique identifier (id) - Generated by SourceSync
  • An external identifier (externalId) - Assigned by your external system
  • A document type (e.g., TEXT, PDF, DOCX) - Based on ingestion source
  • Custom metadata - Your organization-specific data
  • Ingestion status - Current processing state
  • Creation and update timestamps

Example document:

{
  "id": "doc_123",
  "name": "https://example.com",
  "externalId": "external_123",
  "documentType": "URL",
  "ingestionSource": "WEBSITE",
  "ingestionStatus": "SUCCESS",
  "ingestionError": null,
  "ingestJob": {
    "id": "job_123"
  },
  "ingestJobRun": {
    "id": "job_run_123"
  },
  "connection": {
    "id": "conn_123"
  },
  "documentProperties": {
    "mimeType": "text/html",
    "fileSize": 1347,
    "characterCount": 1335,
    "tokenCount": 340,
    "embeddingCount": 1
  },
  "embeddingConfig": {
    "provider": "OPENAI",
    "model": "text-embedding-3-small",
    "dimensions": 1536,
    "chunkSize": 1024,
    "chunkOverlap": 256
  },
  "providers": {
    "fileStorage": "S3_COMPATIBLE",
    "vectorStorage": "PINECONE",
    "embeddingModel": "OPENAI",
    "webScraper": "FIRECRAWL"
  },
  "metadata": {
    "category": "security",
    "status": "published"
  },
  "namespace": {
    "identifier": "ns_123"
  },
  "organization": {
    "id": "org_123"
  },
  "createdAt": {
    "isoString": "2024-01-01T00:00:00Z"
  },
  "updatedAt": {
    "isoString": "2024-01-01T00:00:00Z"
  }
}
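
As a rough illustration of how these fields might be consumed, the sketch below condenses a document object into a small summary. The helper name summarizeDocument is hypothetical and not part of any SourceSync SDK. It relies only on fields shown in the example above, and it treats any ingestionStatus other than "SUCCESS" as "not ready", because no other status values are listed in this guide.

```javascript
// Hypothetical helper (not part of any SDK): condenses a document object
// shaped like the example above into the fields most often checked.
function summarizeDocument(doc) {
  return {
    id: doc.id,
    externalId: doc.externalId ?? null,
    type: doc.documentType,
    // "SUCCESS" is the only status value shown in this guide; every other
    // value is simply treated as "not ready" here.
    ingested: doc.ingestionStatus === 'SUCCESS',
    error: doc.ingestionError ?? null,
    // Timestamps arrive wrapped in an object with an isoString field.
    createdAt: new Date(doc.createdAt.isoString),
    tokenCount: doc.documentProperties?.tokenCount ?? 0,
  }
}
```

Calling summarizeDocument on the example document would give an object whose ingested field is true and whose createdAt is a Date for 1 January 2024 (UTC).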

Using Metadata

Metadata is a set of custom key-value pairs that you can use to organize and filter your documents. For example:

{
  "metadata": {
    "department": "engineering",
    "docType": "api-spec",
    "version": "2.0",
    "status": "published",
    "platform": "mobile",
    "language": "en",
    "lastReviewer": "john.smith",
    "lastReviewedAt": "2024-01-15T10:00:00Z"
  }
}

Best practices for metadata:

  • Use consistent keys and naming conventions
  • Follow standard date formats (ISO 8601)
  • Include relevant identifiers for filtering
  • Keep values standardized for effective searching
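
One way to apply these practices is to normalize metadata in your own code before sending it. The sketch below is an assumption about how that could look, not a SourceSync feature: normalizeMetadata and toCamelCase are hypothetical helpers that convert keys to camelCase, trim string values, drop empty values, and serialize Date objects as ISO 8601 strings.

```javascript
// Hypothetical helpers, not part of SourceSync: normalize metadata on the
// client so keys and dates follow one convention.
function normalizeMetadata(raw) {
  const normalized = {}
  for (const [key, value] of Object.entries(raw)) {
    if (value === undefined || value === null) continue // drop empty values
    let v = value
    if (v instanceof Date) {
      v = v.toISOString() // ISO 8601 in UTC, with milliseconds
    } else if (typeof v === 'string') {
      v = v.trim()
    }
    normalized[toCamelCase(key)] = v
  }
  return normalized
}

// Converts "last_reviewed_at", "last-reviewed-at" or "Last Reviewed At"
// to "lastReviewedAt".
function toCamelCase(key) {
  return key
    .trim()
    .replace(/[-_\s]+(.)?/g, (_, c) => (c ? c.toUpperCase() : ''))
    .replace(/^[A-Z]/, (c) => c.toLowerCase())
}
```

The choice of camelCase simply mirrors the keys used in the example above (docType, lastReviewedAt); any convention works as long as it is applied consistently.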

Retrieving Documents

Fetch documents using various filters:

curl -X POST https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "documentIds": ["doc_123", "doc_456"],
      "documentExternalIds": ["contract-2024-01"],
      "documentConnectionIds": ["conn_123"],
      "documentTypes": ["URL", "GOOGLE_DRIVE_DOCUMENT"],
      "metadata": {
        "department": "engineering",
        "status": "published"
      }
    }
  }'
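
The same request can be assembled in JavaScript. The sketch below only builds the arguments for fetch() and sends nothing. buildListRequest is a hypothetical helper, and dropping empty filters is a precaution chosen for this example, since this guide does not say how the API treats empty arrays or objects.

```javascript
// Base URL taken from the curl example above.
const API_BASE = 'https://api.sourcesync.ai/v1'

// Hypothetical helper: returns { url, options } ready to pass to fetch().
function buildListRequest(apiKey, namespaceId, filters = {}) {
  // Keep only filters that were actually provided, so the request body
  // contains no empty arrays or empty objects.
  const filterConfig = {}
  for (const [key, value] of Object.entries(filters)) {
    const isEmpty =
      value == null ||
      (Array.isArray(value) && value.length === 0) ||
      (typeof value === 'object' && !Array.isArray(value) && Object.keys(value).length === 0)
    if (!isEmpty) filterConfig[key] = value
  }
  return {
    url: `${API_BASE}/documents`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        Accept: 'application/json',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ namespaceId, filterConfig }),
    },
  }
}
```

A call such as buildListRequest(key, 'ns_abc123', { documentTypes: ['URL'], metadata: { status: 'published' } }) returns an object whose url and options can be passed straight to fetch(url, options).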

Pagination

SourceSync uses cursor-based pagination to efficiently handle large document sets. This approach ensures consistent results even when documents are being added or modified between requests.

How Pagination Works

  1. Cursor-Based Navigation

    • Instead of using page numbers, we use a cursor that points to your position in the result set
    • Results are ordered by creation date (newest first)
    • The cursor is an opaque string that encodes the position information
  2. Request Parameters

    {
      "pagination": {
        "pageSize": 10, // Optional, default: 100, max: 100
        "cursor": "eyJjcmVhdGVkQXQiOi..." // Optional, from previous response
      }
    }
    
  3. Response Format

    {
      "data": {
        "itemsReturned": 7,      // Documents returned in current page
        "hasNextPage": true,     // More documents available
        "nextCursor": "eyJjcmVhdGVkQXQiOi...",  // Use this for next page
        "documents": [...]       // Array of documents
      }
    }
    

Smart Pagination

SourceSync implements smart pagination to provide a better user experience:

  1. Automatic Page Filling

    • If filtering leaves fewer matching documents than the requested page size, the system automatically fetches more documents to fill the page
    • You'll always get the maximum possible number of documents, up to your requested page size
  2. Efficient Filtering

    • Document ID and external ID filters are applied first using efficient indexes
    • Document type and metadata filters are applied next
    • All filtering happens at the database level for optimal performance
  3. Consistent Results

    • The cursor ensures you won't miss documents or see duplicates
    • Works reliably even when documents are being added or modified
    • Maintains proper ordering by creation date
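
To see why a cursor avoids duplicates and gaps, here is a small in-memory model of cursor (keyset) pagination. It is purely illustrative and is not SourceSync's implementation: the cursor layout, the tie-break on id, and the names listPage, encodeCursor, decodeCursor, and compareDocs are all invented for this sketch. Real cursors returned by the API should be treated as opaque.

```javascript
// Illustrative model only. Uses Node.js Buffer for base64 encoding.
function encodeCursor(doc) {
  return Buffer.from(JSON.stringify({ createdAt: doc.createdAt, id: doc.id })).toString('base64')
}

function decodeCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, 'base64').toString('utf8'))
}

// Newest first. createdAt values are ISO 8601 UTC strings in one format,
// so string comparison matches chronological order. id breaks ties.
function compareDocs(a, b) {
  if (a.createdAt !== b.createdAt) return a.createdAt < b.createdAt ? 1 : -1
  return a.id < b.id ? 1 : a.id > b.id ? -1 : 0
}

function listPage(store, { pageSize = 100, cursor = null, filter = () => true } = {}) {
  const sorted = [...store].sort(compareDocs)
  const after = cursor ? decodeCursor(cursor) : null
  // Resume strictly after the cursor position, then apply the filter,
  // then take up to pageSize matches so filtered pages are still full.
  const remaining = sorted
    .filter((doc) => (after ? compareDocs(after, doc) < 0 : true))
    .filter(filter)
  const documents = remaining.slice(0, pageSize)
  const hasNextPage = remaining.length > pageSize
  return {
    itemsReturned: documents.length,
    hasNextPage,
    nextCursor: hasNextPage ? encodeCursor(documents[documents.length - 1]) : null,
    documents,
  }
}
```

Because the cursor records a position in the ordering rather than an offset, a document created between two requests sorts before the cursor and does not shift the next page. With page numbers or offsets, the same insertion would push an already-seen document onto the next page.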

Example: Fetching All Documents

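// Assumes the API key is available as an environment variable (Node.js 18+,
// which provides a global fetch).
const RAGAAS_API_KEY = process.env.RAGAAS_API_KEY

async function fetchAllDocuments(namespaceId, filterConfig) {
  let cursor = null
  let hasNextPage = true
  const allDocuments = []

  while (hasNextPage) {
    const response = await fetch('https://api.sourcesync.ai/v1/documents', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${RAGAAS_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        namespaceId,
        filterConfig,
        pagination: {
          pageSize: 10,
          cursor,
        },
      }),
    })

    // Stop on HTTP errors instead of parsing an error body as a page
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`)
    }

    const { data } = await response.json()
    allDocuments.push(...data.documents)

    // Continue from where this page ended
    hasNextPage = data.hasNextPage
    cursor = data.nextCursor
  }

  return allDocuments
}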
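
For large namespaces, collecting every document in one array may use more memory than necessary. A possible variant, sketched below, yields one page at a time through an async generator. The name iterateDocumentPages and the options apiKey, maxPages, and fetchImpl are assumptions added for this example: maxPages caps the number of requests, and fetchImpl lets you substitute a stub for fetch in tests. Leaving the cursor out of the first request is also a choice made here, based on the cursor being documented as optional.

```javascript
// Sketch: yields the documents array of each page in turn.
async function* iterateDocumentPages(
  namespaceId,
  filterConfig,
  { apiKey, pageSize = 100, maxPages = Infinity, fetchImpl = fetch } = {}
) {
  let cursor = null
  for (let page = 0; page < maxPages; page++) {
    const response = await fetchImpl('https://api.sourcesync.ai/v1/documents', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        namespaceId,
        filterConfig,
        // Omit the cursor on the first request; send it on every later one.
        pagination: cursor ? { pageSize, cursor } : { pageSize },
      }),
    })
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`)
    }
    const { data } = await response.json()
    yield data.documents
    if (!data.hasNextPage) return
    cursor = data.nextCursor
  }
}
```

A caller would consume it with for await (const documents of iterateDocumentPages('ns_abc123', filterConfig, { apiKey })) and can break out of the loop early once it has found what it needs.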

Managing Documents

Updating Documents

Update metadata for every document that matches filterConfig. The data object carries the new metadata values:

curl -X PATCH https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "documentTypes": ["URL", "GOOGLE_DRIVE_DOCUMENT"],
      "metadata": {
        "department": "legal",
        "status": "pending"
      }
    },
    "data": {
      "metadata": {
        "status": "reviewed",
        "reviewedBy": "john.doe",
        "reviewedAt": "2024-01-15T10:00:00Z"
      }
    }
  }'
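
In JavaScript, the same bulk update might be prepared as shown below. Both helpers are hypothetical. buildUpdateRequest mirrors the curl example and only builds the arguments for fetch(), and buildReviewRequest shows one use of it, stamping reviewedAt with an ISO 8601 timestamp. This guide does not state whether the update merges with or replaces existing metadata, so the sketch makes no assumption about that.

```javascript
// Hypothetical helper: builds a PATCH request that sets metadata on every
// document matching filterConfig. Nothing is sent here.
function buildUpdateRequest(apiKey, namespaceId, filterConfig, metadata) {
  if (!metadata || Object.keys(metadata).length === 0) {
    throw new Error('metadata must contain at least one key to update')
  }
  return {
    url: 'https://api.sourcesync.ai/v1/documents',
    options: {
      method: 'PATCH',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        Accept: 'application/json',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ namespaceId, filterConfig, data: { metadata } }),
    },
  }
}

// Example use: mark pending legal documents as reviewed. The `now` parameter
// exists so the timestamp can be fixed in tests.
function buildReviewRequest(apiKey, namespaceId, reviewer, now = new Date()) {
  return buildUpdateRequest(
    apiKey,
    namespaceId,
    { metadata: { department: 'legal', status: 'pending' } },
    { status: 'reviewed', reviewedBy: reviewer, reviewedAt: now.toISOString() }
  )
}
```

The returned url and options can be passed to fetch(url, options) when you are ready to send the update.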

Deleting Documents

Delete every document that matches filterConfig:

curl -X DELETE https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "metadata": {
        "status": "archived"
      }
    }
  }'
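
Because a filter-based delete can affect many documents at once, you may want a client-side guard before sending it. The sketch below refuses to build a DELETE request when no filter is set. This is a precaution chosen for the example, not a documented API rule: this guide does not say what the API does with an empty filterConfig. The names hasAnyFilter and buildDeleteRequest are hypothetical.

```javascript
// Returns true when at least one filter carries a usable value.
function hasAnyFilter(filterConfig) {
  if (!filterConfig) return false
  return Object.values(filterConfig).some((value) => {
    if (value == null) return false
    if (Array.isArray(value)) return value.length > 0
    if (typeof value === 'object') return Object.keys(value).length > 0
    return true
  })
}

// Hypothetical helper: builds the DELETE request from the curl example,
// but only when the filter is non-empty. Nothing is sent here.
function buildDeleteRequest(apiKey, namespaceId, filterConfig) {
  if (!hasAnyFilter(filterConfig)) {
    throw new Error('Refusing to delete without at least one non-empty filter')
  }
  return {
    url: 'https://api.sourcesync.ai/v1/documents',
    options: {
      method: 'DELETE',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        Accept: 'application/json',
        'Content-Type': 'application/json',
      },
      // JSON.stringify always emits valid JSON, which avoids mistakes such
      // as trailing commas in hand-written request bodies.
      body: JSON.stringify({ namespaceId, filterConfig }),
    },
  }
}
```

Running the same filterConfig through the retrieval request first is another way to check which documents a delete would affect.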

Best Practices

  1. Organization

    • Use separate namespaces for different projects/environments
    • Define a consistent metadata schema
  2. Performance

    • Use specific filters to reduce result sets
    • Batch updates when possible