Document Management
Learn how to manage and retrieve your documents effectively in SourceSync.
Overview
SourceSync provides a powerful document management system that allows you to:
- Track all ingested documents
- Filter documents by various criteria
- Update document metadata
- Organize content across namespaces
All document operations are namespace-scoped. Make sure you have the correct namespaceId before making any requests.
Document Structure
Each document in SourceSync has:
- A unique identifier (id) - System-generated unique ID
- An external identifier (externalId) - Generated by an external system
- A document type (e.g., TEXT, PDF, DOCX) - Based on ingestion source
- Custom metadata - Your organization-specific data
- Ingestion status - Current processing state
- Creation and update timestamps
Example document:
{
  "id": "doc_123",
  "name": "https://example.com",
  "externalId": "external_123",
  "documentType": "URL",
  "ingestionSource": "WEBSITE",
  "ingestionStatus": "SUCCESS",
  "ingestionError": null,
  "ingestJob": {
    "id": "job_123"
  },
  "ingestJobRun": {
    "id": "job_run_123"
  },
  "connection": {
    "id": "conn_123"
  },
  "documentProperties": {
    "mimeType": "text/html",
    "fileSize": 1347,
    "characterCount": 1335,
    "tokenCount": 340,
    "embeddingCount": 1
  },
  "embeddingConfig": {
    "provider": "OPENAI",
    "model": "text-embedding-3-small",
    "dimensions": 1536,
    "chunkSize": 1024,
    "chunkOverlap": 256
  },
  "providers": {
    "fileStorage": "S3_COMPATIBLE",
    "vectorStorage": "PINECONE",
    "embeddingModel": "OPENAI",
    "webScraper": "FIRECRAWL"
  },
  "metadata": {
    "category": "security",
    "status": "published"
  },
  "namespace": {
    "identifier": "ns_123"
  },
  "organization": {
    "id": "org_123"
  },
  "createdAt": {
    "isoString": "2024-01-01T00:00:00Z"
  },
  "updatedAt": {
    "isoString": "2024-01-01T00:00:00Z"
  }
}
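A few of these fields are nested (for example, timestamps are wrapped in an object with an isoString field), which is easy to miss when reading a document in code. The following is a minimal sketch of a helper that extracts commonly used fields from an object shaped like the example above. The name summarizeDocument and the isReady flag are illustrative and not part of any SourceSync SDK; treating "SUCCESS" as the completed state is an assumption based only on the example.

```javascript
// Illustrative helper (hypothetical, not a SourceSync API): extract a few
// commonly used fields from a document shaped like the example above.
function summarizeDocument(doc) {
  return {
    id: doc.id,
    externalId: doc.externalId ?? null,
    type: doc.documentType,
    status: doc.ingestionStatus,
    // Timestamps in the example are objects with an isoString field.
    createdAt: new Date(doc.createdAt.isoString),
    tokenCount: doc.documentProperties?.tokenCount ?? 0,
    // Assumption: "SUCCESS" with no ingestionError means the document is usable.
    isReady: doc.ingestionStatus === 'SUCCESS' && doc.ingestionError == null,
  }
}
```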
Using Metadata
Metadata helps organize and filter your documents. Here's a comprehensive example:
{
  "metadata": {
    "department": "engineering",
    "docType": "api-spec",
    "version": "2.0",
    "status": "published",
    "platform": "mobile",
    "language": "en",
    "lastReviewer": "john.smith",
    "lastReviewedAt": "2024-01-15T10:00:00Z"
  }
}
Best practices for metadata:
- Use consistent keys and naming conventions
- Follow standard date formats (ISO 8601)
- Include relevant identifiers for filtering
- Keep values standardized for effective searching
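One way to enforce these practices is to normalize metadata on the client before sending it. The sketch below is an assumption about how you might do that, not something SourceSync provides: normalizeMetadata and toCamelCase are hypothetical helpers that convert keys to camelCase, serialize Date values as ISO 8601 strings, trim string values, and drop null or undefined entries.

```javascript
// Hypothetical helper: convert "last_reviewed_at" or "Doc Type" to camelCase.
function toCamelCase(key) {
  const words = key.trim().split(/[\s_-]+/).filter(Boolean)
  return words
    .map((word, index) =>
      index === 0
        ? word[0].toLowerCase() + word.slice(1)
        : word[0].toUpperCase() + word.slice(1)
    )
    .join('')
}

// Hypothetical helper: return a cleaned copy of a metadata object.
function normalizeMetadata(raw) {
  const normalized = {}
  for (const [key, value] of Object.entries(raw)) {
    // Skip empty values so they never become filterable noise.
    if (value === undefined || value === null) continue
    const cleanKey = toCamelCase(key)
    if (value instanceof Date) {
      // ISO 8601, as recommended above.
      normalized[cleanKey] = value.toISOString()
    } else if (typeof value === 'string') {
      normalized[cleanKey] = value.trim()
    } else {
      normalized[cleanKey] = value
    }
  }
  return normalized
}
```

Running every metadata object through one function like this keeps keys and date formats identical across ingestion and update calls, which matters if metadata filters compare values exactly.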
Retrieving Documents
Fetch documents using various filters:
curl -X POST https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "documentIds": ["doc_123", "doc_456"],
      "documentExternalIds": ["contract-2024-01"],
      "documentConnectionIds": ["conn_123"],
      "documentTypes": ["URL", "GOOGLE_DRIVE_DOCUMENT"],
      "metadata": {
        "department": "engineering",
        "status": "published"
      }
    }
  }'
Filters are combined in two stages:
- First, documents are selected if they match ANY of the provided IDs (documentIds OR documentExternalIds OR documentConnectionIds)
- Then, these documents are filtered to match ALL other criteria (documentTypes AND metadata)
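The two-stage logic above can be mirrored locally, for example to unit-test a filterConfig before sending it. This is a sketch of the documented semantics, not the server's implementation. matchesFilter is a hypothetical helper, and three behaviours are assumptions: the ID stage is skipped when no ID lists are provided, a document matches documentTypes if its type is any of the listed ones, and metadata values are compared by strict equality.

```javascript
// Hypothetical local mirror of the filter semantics described above.
function matchesFilter(doc, filterConfig = {}) {
  const {
    documentIds = [],
    documentExternalIds = [],
    documentConnectionIds = [],
    documentTypes = [],
    metadata = {},
  } = filterConfig

  // Stage 1: the document must match ANY of the identifier lists.
  // Assumption: this stage is skipped when no identifier list is given.
  const hasIdFilters =
    documentIds.length + documentExternalIds.length + documentConnectionIds.length > 0
  const matchesAnyId =
    documentIds.includes(doc.id) ||
    documentExternalIds.includes(doc.externalId) ||
    documentConnectionIds.includes(doc.connection?.id)
  if (hasIdFilters && !matchesAnyId) return false

  // Stage 2: the document must satisfy ALL remaining criteria.
  if (documentTypes.length > 0 && !documentTypes.includes(doc.documentType)) {
    return false
  }
  // Assumption: every metadata key must match exactly.
  return Object.entries(metadata).every(
    ([key, value]) => doc.metadata?.[key] === value
  )
}
```

With the example document from earlier, a filter of documentIds ["doc_999"] plus documentExternalIds ["external_123"] passes the first stage because one identifier matches, and the result then depends only on the type and metadata criteria.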
Pagination
SourceSync uses cursor-based pagination to efficiently handle large document sets. This approach ensures consistent results even when documents are being added or modified between requests.
How Pagination Works
- Cursor-Based Navigation
  - Instead of using page numbers, we use a cursor that points to your position in the result set
  - Results are ordered by creation date (newest first)
  - The cursor is an opaque string that encodes the position information
- Request Parameters
  {
    "pagination": {
      "pageSize": 10, // Optional, default: 100, max: 100
      "cursor": "eyJjcmVhdGVkQXQiOi..." // Optional, from previous response
    }
  }
- Response Format
  {
    "data": {
      "itemsReturned": 7, // Documents returned in current page
      "hasNextPage": true, // More documents available
      "nextCursor": "eyJjcmVhdGVkQXQiOi...", // Use this for next page
      "documents": [...] // Array of documents
    }
  }
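As a sketch of how the request parameters above might be assembled in code, the following hypothetical buildListRequest helper validates pageSize against the documented maximum of 100 and only includes the optional fields when they are set. The lower bound of 1 is an assumption; the document states only the default and the maximum.

```javascript
const MAX_PAGE_SIZE = 100 // documented maximum (and default) page size

// Hypothetical helper: build the JSON body for a document listing request.
function buildListRequest(namespaceId, { filterConfig, pageSize, cursor } = {}) {
  if (!namespaceId) throw new Error('namespaceId is required')

  const body = { namespaceId }
  if (filterConfig) body.filterConfig = filterConfig

  const pagination = {}
  if (pageSize !== undefined) {
    // Assumption: page sizes below 1 are invalid.
    if (!Number.isInteger(pageSize) || pageSize < 1 || pageSize > MAX_PAGE_SIZE) {
      throw new RangeError(`pageSize must be an integer between 1 and ${MAX_PAGE_SIZE}`)
    }
    pagination.pageSize = pageSize
  }
  // Pass the cursor through untouched; it is opaque to the client.
  if (cursor) pagination.cursor = cursor

  if (Object.keys(pagination).length > 0) body.pagination = pagination
  return body
}
```

The returned object can be passed to JSON.stringify and used as the body of the POST request shown earlier.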
Smart Pagination
SourceSync implements smart pagination to provide a better user experience:
- Automatic Page Filling
  - If filters reduce the result set below the requested page size, the system automatically fetches more documents to fill the page
  - You'll always get the maximum possible documents up to your requested page size
- Efficient Filtering
  - Document ID and external ID filters are applied first using efficient indexes
  - Document type and metadata filters are applied next
  - All filtering happens at the database level for optimal performance
- Consistent Results
  - The cursor ensures you won't miss documents or see duplicates
  - Works reliably even when documents are being added or modified
  - Maintains proper ordering by creation date
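To see why a cursor gives stable pages while documents are being added, it can help to simulate the idea in memory. The sketch below is purely illustrative and is not SourceSync's implementation: listPage is a hypothetical function, createdAt is a plain number, and the cursor format (base64-encoded JSON holding the last createdAt and id) is invented for the demo. It assumes Node.js for Buffer.

```javascript
// Illustrative in-memory model of cursor (keyset) pagination, newest first.
function listPage(docs, { pageSize = 100, cursor = null, filter = () => true } = {}) {
  // Total order: createdAt descending, ties broken by id ascending.
  const ordered = [...docs].sort(
    (a, b) => b.createdAt - a.createdAt || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)
  )
  // Demo-only cursor format; real cursors must be treated as opaque.
  const after = cursor
    ? JSON.parse(Buffer.from(cursor, 'base64').toString('utf8'))
    : null

  const remaining = ordered
    // Keep only documents strictly after the cursor position.
    .filter(
      (d) =>
        !after ||
        d.createdAt < after.createdAt ||
        (d.createdAt === after.createdAt && d.id > after.id)
    )
    // Filtering happens before slicing, so pages are filled with matches.
    .filter(filter)

  const page = remaining.slice(0, pageSize)
  const hasNextPage = remaining.length > pageSize
  const last = page[page.length - 1]
  const nextCursor = hasNextPage
    ? Buffer.from(JSON.stringify({ createdAt: last.createdAt, id: last.id })).toString('base64')
    : null

  return { itemsReturned: page.length, hasNextPage, nextCursor, documents: page }
}
```

Because the cursor records a position in the ordering rather than an offset, a document created after the first request sorts ahead of that position and does not shift or duplicate the items on later pages.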
Example: Fetching All Documents
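async function fetchAllDocuments(namespaceId, filterConfig) {
  // Read the API key from the environment (Node.js).
  const apiKey = process.env.RAGAAS_API_KEY
  let cursor = null
  let hasNextPage = true
  const allDocuments = []

  while (hasNextPage) {
    const response = await fetch('https://api.sourcesync.ai/v1/documents', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        Accept: 'application/json',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        namespaceId,
        filterConfig,
        // Omit the cursor on the first request; afterwards pass it back unchanged.
        pagination: cursor ? { pageSize: 10, cursor } : { pageSize: 10 },
      }),
    })

    // Stop on HTTP errors instead of treating an error body as a page.
    if (!response.ok) {
      throw new Error(`Request failed with HTTP ${response.status}`)
    }

    const { data } = await response.json()
    allDocuments.push(...data.documents)
    hasNextPage = data.hasNextPage
    cursor = data.nextCursor
  }

  return allDocuments
}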
The cursor is an opaque string. Never modify or construct cursors manually; always pass the cursor back exactly as it appears in the response.
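Collecting every document into one array, as fetchAllDocuments does, can use a lot of memory for large namespaces. As an alternative sketch, an async generator can yield documents page by page. Here iterateDocuments and its fetchPage parameter are hypothetical: fetchPage is assumed to be a function you supply that takes the request body and resolves to the data object of the response (itemsReturned, hasNextPage, nextCursor, documents).

```javascript
// Hypothetical streaming variant: yields documents one at a time.
// fetchPage(body) is assumed to resolve to the response's `data` object.
async function* iterateDocuments(fetchPage, namespaceId, filterConfig, pageSize = 100) {
  let cursor = null
  while (true) {
    const data = await fetchPage({
      namespaceId,
      filterConfig,
      // Send the cursor only once the server has provided one.
      pagination: cursor ? { pageSize, cursor } : { pageSize },
    })
    yield* data.documents
    // Stop when the server reports no further pages.
    if (!data.hasNextPage || !data.nextCursor) return
    cursor = data.nextCursor
  }
}
```

A caller would consume it with for await (const doc of iterateDocuments(fetchPage, namespaceId, filterConfig)) and can break out of the loop early without fetching the remaining pages.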
Managing Documents
Updating Documents
Update metadata for multiple documents using filters:
curl -X PATCH https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "documentTypes": ["URL", "GOOGLE_DRIVE_DOCUMENT"],
      "metadata": {
        "department": "legal",
        "status": "pending"
      }
    },
    "data": {
      "metadata": {
        "status": "reviewed",
        "reviewedBy": "john.doe",
        "reviewedAt": "2024-01-15T10:00:00Z"
      }
    }
  }'
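The same request can be issued from JavaScript. The sketch below mirrors the curl call above; updateDocumentMetadata is a hypothetical wrapper, the fetchImpl parameter exists only so the function can be tested without network access, and the empty-filter guard is a client-side precaution added here, not documented API behaviour. The shape of the response body is not described in this document, so it is returned unparsed beyond JSON decoding.

```javascript
// Hypothetical wrapper around the PATCH request shown above.
async function updateDocumentMetadata({
  apiKey,
  namespaceId,
  filterConfig,
  metadata,
  fetchImpl = globalThis.fetch, // injectable for tests
}) {
  // Client-side precaution (assumption): never send an update without a filter.
  if (!filterConfig || Object.keys(filterConfig).length === 0) {
    throw new Error('Refusing to update documents without a filterConfig')
  }

  const response = await fetchImpl('https://api.sourcesync.ai/v1/documents', {
    method: 'PATCH',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      Accept: 'application/json',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ namespaceId, filterConfig, data: { metadata } }),
  })

  if (!response.ok) {
    throw new Error(`Update failed with HTTP ${response.status}`)
  }
  return response.json()
}
```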
Deleting Documents
Delete multiple documents using filters:
curl -X DELETE https://api.sourcesync.ai/v1/documents \
  -H "Authorization: Bearer $RAGAAS_API_KEY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "ns_abc123",
    "filterConfig": {
      "metadata": {
        "status": "archived"
      }
    }
  }'
Document deletion is permanent. All associated vectors and embeddings are also removed. Consider using soft deletion via metadata if you need to preserve history.
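Soft deletion can be expressed as a metadata update rather than a DELETE. The following sketch builds a PATCH body that marks specific documents as archived. buildSoftDeleteRequest is a hypothetical helper, and the status value "archived" and the archivedAt key are conventions chosen for this example, not fields defined by SourceSync.

```javascript
// Hypothetical helper: build a PATCH body that soft-deletes documents
// by tagging them in metadata instead of removing them.
function buildSoftDeleteRequest(namespaceId, documentIds, now = new Date()) {
  if (!Array.isArray(documentIds) || documentIds.length === 0) {
    throw new Error('At least one document ID is required')
  }
  return {
    namespaceId,
    filterConfig: { documentIds },
    data: {
      metadata: {
        // Example convention only; pick keys that fit your own schema.
        status: 'archived',
        archivedAt: now.toISOString(),
      },
    },
  }
}
```

Sending this body with PATCH to /v1/documents keeps the documents and their vectors in place. A later DELETE filtered on the same metadata value, as in the curl example above, can then remove them permanently once the history is no longer needed.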
Best Practices
- Organization
  - Use separate namespaces for different projects/environments
  - Define a consistent metadata schema
- Performance
  - Use specific filters to reduce result sets
  - Batch updates when possible