Document Management

Learn how to manage and retrieve your documents effectively in SourceSync.

Overview

SourceSync provides a powerful document management system that allows you to:

Track all ingested documents
Filter documents by various criteria
Update document metadata
Organize content across namespaces

All document operations are namespace-scoped. Make sure you have the correct namespaceId before making any requests.

Document Structure

Each document in SourceSync has:

A unique identifier (id) - System-generated unique ID
An external identifier (externalId) - Assigned by your external system
A document type (e.g., TEXT, PDF, DOCX) - Based on ingestion source
Custom metadata - Your organization-specific data
Ingestion status - Current processing state
Creation and update timestamps

Example document:

{
  "id": "doc_123",
  "name": "https://example.com",
  "externalId": "external_123",
  "documentType": "URL",
  "ingestionSource": "WEBSITE",
  "ingestionStatus": "SUCCESS",
  "ingestionError": null,
  "ingestJob": {
    "id": "job_123"
  },
  "ingestJobRun": {
    "id": "job_run_123"
  },
  "connection": {
    "id": "conn_123"
  },
  "documentProperties": {
    "mimeType": "text/html",
    "fileSize": 1347,
    "characterCount": 1335,
    "tokenCount": 340,
    "embeddingCount": 1,
    "ocrPagesCount": 0
  },
  "embeddingConfig": {
    "provider": "OPENAI",
    "model": "text-embedding-3-small",
    "dimensions": 1536,
    "chunkSize": 1024,
    "chunkOverlap": 256
  },
  "providers": {
    "fileStorage": "S3_COMPATIBLE",
    "vectorStorage": "PINECONE",
    "embeddingModel": "OPENAI",
    "webScraper": "FIRECRAWL"
  },
  "metadata": {
    "category": "security",
    "status": "published"
  },
  "namespace": {
    "identifier": "ns_123"
  },
  "organization": {
    "id": "org_123"
  },
  "createdAt": {
    "isoString": "2024-01-01T00:00:00Z"
  },
  "updatedAt": {
    "isoString": "2024-01-01T00:00:00Z"
  }
}

Using Metadata

Metadata helps organize and filter your documents. Here's a comprehensive example:

{
  "metadata": {
    "department": "engineering",
    "docType": "api-spec",
    "version": "2.0",
    "status": "published",
    "platform": "mobile",
    "language": "en",
    "lastReviewer": "john.smith",
    "lastReviewedAt": "2024-01-15T10:00:00Z"
  }
}

Best practices for metadata:

Use consistent keys and naming conventions
Follow standard date formats (ISO 8601)
Include relevant identifiers for filtering
Keep values standardized for effective searching
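One way to apply these practices is to normalize metadata in your own code before sending it. The following is a minimal sketch, not part of the SourceSync API: normalizeMetadata is a hypothetical helper, and the camelCase keys and lowercased values are assumed conventions that you can swap for your own schema.

```javascript
// Hypothetical helper (not part of the SourceSync API): normalizes a metadata
// object before it is sent with an ingest or update request.
const ISO_TIMESTAMP = /^\d{4}-\d{2}-\d{2}T/

function normalizeMetadata(raw) {
  const normalized = {}
  for (const [key, value] of Object.entries(raw)) {
    // Consistent keys: trim, then convert "last reviewer" or "last_reviewer"
    // style keys to camelCase ("lastReviewer").
    const cleanKey = key
      .trim()
      .replace(/[\s_-]+(\w)/g, (_, ch) => ch.toUpperCase())
      .replace(/^\w/, (ch) => ch.toLowerCase())

    if (value instanceof Date) {
      // Standard date format: ISO 8601 in UTC.
      normalized[cleanKey] = value.toISOString()
    } else if (typeof value === 'string') {
      // Standardized values: trimmed and lowercased (an assumed convention).
      // Strings that already look like ISO timestamps keep their casing.
      const trimmed = value.trim()
      normalized[cleanKey] = ISO_TIMESTAMP.test(trimmed) ? trimmed : trimmed.toLowerCase()
    } else {
      // Numbers, booleans, and other values pass through unchanged.
      normalized[cleanKey] = value
    }
  }
  return normalized
}
```

Running every metadata object through one function like this keeps keys and values predictable, which matters if metadata filters compare values exactly.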

Retrieving Documents

Fetch documents using various filters:

curl -X POST https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "documentIds": ["doc_123", "doc_456"],
      "documentExternalIds": ["contract-2024-01"],
      "documentConnectionIds": ["conn_123"],
      "documentTypes": ["URL", "GOOGLE_DRIVE_DOCUMENT"],
      "metadata": {
        "department": "engineering",
        "status": "published"
      }
    }
  }'

Document filtering works in two stages:

First, documents are selected if they match ANY of the provided IDs (documentIds OR documentExternalIds OR documentConnectionIds)

Then, these documents are filtered to match ALL other criteria (documentTypes AND metadata)
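To make the two stages concrete, here is a small in-memory sketch of the same logic. It is illustrative only: the real filtering happens on the server, matchesFilter is a hypothetical name, and the sketch assumes that an omitted or empty filter imposes no constraint and that metadata values are compared by exact equality.

```javascript
// Illustrative only: mirrors the documented two-stage filter semantics on a
// single in-memory document object.
function matchesFilter(doc, filterConfig = {}) {
  const {
    documentIds = [],
    documentExternalIds = [],
    documentConnectionIds = [],
    documentTypes = [],
    metadata = {},
  } = filterConfig

  // Stage 1: if any ID list is provided, the document must match at least one
  // of them (documentIds OR documentExternalIds OR documentConnectionIds).
  const hasIdFilters =
    documentIds.length + documentExternalIds.length + documentConnectionIds.length > 0
  const matchesAnyId =
    documentIds.includes(doc.id) ||
    documentExternalIds.includes(doc.externalId) ||
    documentConnectionIds.includes(doc.connection?.id)
  if (hasIdFilters && !matchesAnyId) return false

  // Stage 2: every remaining criterion must hold (documentTypes AND metadata).
  const typeOk = documentTypes.length === 0 || documentTypes.includes(doc.documentType)
  const metadataOk = Object.entries(metadata).every(
    ([key, value]) => doc.metadata?.[key] === value
  )
  return typeOk && metadataOk
}
```

For example, a document with connection ID conn_123 passes stage one even if its own ID is not listed in documentIds, but it is still dropped in stage two if its documentType or metadata does not match.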

Pagination

SourceSync uses cursor-based pagination to efficiently handle large document sets. This approach ensures consistent results even when documents are being added or modified between requests.

How Pagination Works

Cursor-Based Navigation
- Instead of using page numbers, we use a cursor that points to your position in the result set
- Results are ordered by creation date (newest first)
- The cursor is an opaque string that encodes the position information

Request Parameters

{
  "pagination": {
    "pageSize": 10, // Optional, default: 100, max: 100
    "cursor": "eyJjcmVhdGVkQXQiOi..." // Optional, from previous response
  }
}

Response Format

{
  "data": {
    "itemsReturned": 7,      // Documents returned in current page
    "hasNextPage": true,     // More documents available
    "nextCursor": "eyJjcmVhdGVkQXQiOi...",  // Use this for next page
    "documents": [...]       // Array of documents
  }
}
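The request parameters above can be assembled with a small helper. This is a sketch under stated assumptions: buildListRequest is a hypothetical function, and clamping pageSize on the client is a convenience rather than documented API behavior. The limit of 100 comes from the parameter description above.

```javascript
// Documented default and maximum page size.
const MAX_PAGE_SIZE = 100

// Hypothetical helper: builds the JSON body for a document list request.
function buildListRequest(namespaceId, { filterConfig, pageSize = MAX_PAGE_SIZE, cursor } = {}) {
  // All document operations are namespace-scoped, so fail early without one.
  if (!namespaceId) throw new Error('namespaceId is required')
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new Error('pageSize must be a positive integer')
  }

  // Clamp to the documented maximum instead of sending an oversized value.
  const pagination = { pageSize: Math.min(pageSize, MAX_PAGE_SIZE) }
  // Only include the cursor when continuing from a previous response.
  if (cursor) pagination.cursor = cursor

  const body = { namespaceId, pagination }
  if (filterConfig) body.filterConfig = filterConfig
  return body
}
```

Pass the returned object to JSON.stringify and send it as the request body. For the first page, omit cursor; for later pages, pass nextCursor from the previous response unchanged.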

Smart Pagination

SourceSync implements smart pagination to provide a better user experience:

Automatic Page Filling
- If filters reduce the result set below the requested page size, the system automatically fetches more documents to fill the page
- You'll always get the maximum possible documents up to your requested page size
Efficient Filtering
- Document ID, external ID, and connection ID filters are applied first using efficient indexes
- Document type and metadata filters are applied next
- All filtering happens at the database level for optimal performance
Consistent Results
- The cursor ensures you won't miss documents or see duplicates
- Works reliably even when documents are being added or modified
- Maintains proper ordering by creation date

Example: Fetching All Documents

async function fetchAllDocuments(namespaceId, filterConfig) {
  let cursor = null
  let hasNextPage = true
  const allDocuments = []

  while (hasNextPage) {
    const response = await fetch('https://api.sourcesync.ai/v1/documents', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.SOURCE_SYNC_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        namespaceId,
        filterConfig,
        pagination: {
          pageSize: 10,
          cursor,
        },
      }),
    })

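    // Stop on HTTP errors instead of trying to read documents from an error body
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`)
    }

    const { data } = await response.json()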
    allDocuments.push(...data.documents)

    hasNextPage = data.hasNextPage
    cursor = data.nextCursor
  }

  return allDocuments
}

The cursor is an opaque string - never try to modify or construct cursors manually. Always use the cursor exactly as provided in the response.
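If the result set is too large to hold in memory, the same cursor loop can be written as an async generator that yields documents one at a time. This is a sketch, not an official client: iterateDocuments and fetchPage are hypothetical names, and fetchPage is a function you supply that takes a cursor (or null for the first page) and resolves to the data object of a list response.

```javascript
// Hypothetical helper: walks every page and yields documents one by one.
// fetchPage(cursor) must resolve to { documents, hasNextPage, nextCursor }.
async function* iterateDocuments(fetchPage) {
  let cursor = null
  do {
    const data = await fetchPage(cursor)
    yield* data.documents
    // Pass the cursor back unchanged. The loop ends when the response reports
    // no next page, or when no cursor is returned.
    cursor = data.hasNextPage ? data.nextCursor : null
  } while (cursor)
}
```

Consume it with for await (const doc of iterateDocuments(fetchPage)) and process each document as it arrives. Keeping the HTTP call inside fetchPage also lets you substitute a stub when testing.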

Managing Documents

Updating Documents

Update metadata for multiple documents using filters:

curl -X PATCH https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "documentTypes": ["URL", "GOOGLE_DRIVE_DOCUMENT"],
      "metadata": {
        "department": "legal",
        "status": "pending"
      }
    },
    "data": {
      "metadata": {
        "status": "reviewed",
        "reviewedBy": "john.doe",
        "reviewedAt": "2024-01-15T10:00:00Z"
      }
    }
  }'

Deleting Documents

Delete multiple documents using filters:

curl -X DELETE https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $SOURCE_SYNC_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "metadata": {
        "status": "archived"
      }
    }
  }'

Document deletion is permanent. All associated vectors and embeddings are also removed. Consider using soft deletion via metadata if you need to preserve history.
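A soft delete can be expressed as a metadata update rather than a DELETE call. The sketch below builds a request body for PATCH /v1/documents in the same shape as the update example above. buildSoftDeleteRequest is a hypothetical helper, and the status value "deleted" and the deletedAt key are assumptions; use whatever fits your metadata schema. Whether an update merges with or replaces existing metadata is not covered here, so verify that on a test document first.

```javascript
// Hypothetical helper: builds a PATCH /v1/documents body that marks matching
// documents as deleted in metadata instead of removing them.
function buildSoftDeleteRequest(namespaceId, filterConfig, now = new Date()) {
  if (!namespaceId) throw new Error('namespaceId is required')
  return {
    namespaceId,
    filterConfig,
    data: {
      metadata: {
        // Assumed convention: a status value your queries can exclude.
        status: 'deleted',
        // ISO 8601 timestamp recording when the document was soft-deleted.
        deletedAt: now.toISOString(),
      },
    },
  }
}
```

Soft-deleted documents remain stored, along with their vectors, so your retrieval requests need a metadata filter that selects only the status values you want to see.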

Best Practices

Organization
- Use separate namespaces for different projects/environments
- Define a consistent metadata schema
Performance
- Use specific filters to reduce result sets
- Batch updates when possible
