Getting started with custom datasources in Glean

The Glean Indexing API provides a flexible way to bring content from internal tools, legacy systems, and other custom-built applications into the Glean search index. By using this API, organizations can surface information not available through native connectors and create a more complete, unified search experience for their teams.

The API supports a wide range of content types, including unstructured documents and structured entities. It applies Glean’s permission model to ensure users only see the information they are authorized to access. To support development, Glean offers official client libraries for Python, TypeScript/JavaScript, Go, and Java.

This discussion covers the core concepts behind custom datasources and explains how to create and index your first one. It also introduces the key configuration parameters that shape how your datasource behaves, preparing you for the best practices and troubleshooting guidance in Part 2.

Foundational concepts: Understanding the indexing architecture

A successful implementation of the Indexing API begins with a clear understanding of its core architectural components. These concepts form the building blocks for any custom integration.

Datasource: The datasource is the highest-level container for a specific set of content within Glean. When using the Indexing API, these are referred to as "Push API Datasources," "Indexing API Datasources," or "Custom Datasources." Every datasource is defined by a configuration that includes a unique name for API calls, a user-facing displayName, and a datasourceCategory that provides a strong signal for search relevance.
Document: This is the most fundamental unit of content you can index. A document is defined by a DocumentDefinition object, which contains essential fields such as a unique id (within its datasource), a title, the main content or body, and a viewURL that links back to the source system.
Custom entity: A Custom Entity is a special type of content designed for structured, key-value data rather than the text-heavy format of a typical document. This is ideal for indexing objects like product specifications, project profiles, or definitions from an internal glossary. Custom entities are indexed using the same API endpoints as documents, but the datasource they belong to must be configured with the isEntityDatasource flag set to true.
Permissions model: The API allows for the precise definition of access controls for each indexed document. Permissions can be granted to individual users, groups, or both. A critical configuration parameter, isUserReferencedByEmail, dictates whether the system should map permissions using a user's email address or a datasource-specific unique ID.
Bulk vs. incremental indexing: It is critically important to understand the two primary modes of indexing content.
- Incremental indexing: This is performed using the /indexdocument or /indexdocuments endpoints, which add new documents or update existing ones. This method is best suited for real-time or frequent, small-scale updates.
- Bulk indexing: This is performed via the /bulkindexdocuments endpoint, which allows uploading a large number of documents in a single process. This endpoint functions as a replacement or "full sync" mechanism. When a bulk upload completes for a datasource, any documents previously indexed in that datasource but not included in the new upload will be automatically deleted. This behavior is a common source of error for new users and can lead to unintentional data loss if not handled with care. It should be used for periodic, full-synchronization jobs where the uploaded batch represents the complete and current state of the content.
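
The replacement behavior of bulk indexing can be checked locally before an upload is started. The sketch below is not part of the Glean API: the helper names and the 25% threshold are hypothetical, and it assumes you keep your own list of the document IDs you indexed previously. It only computes which IDs would disappear if the batch were treated as the complete state of the datasource.

```python
def ids_dropped_by_full_sync(previously_indexed, uploaded_batch):
    """Return the IDs that a full-sync upload would remove, in sorted order."""
    return sorted(set(previously_indexed) - set(uploaded_batch))


def is_safe_full_sync(previously_indexed, uploaded_batch, max_dropped_fraction=0.25):
    """Hypothetical guard: reject batches that would drop too much content."""
    known = set(previously_indexed)
    if not known:
        return True  # Nothing indexed yet, so nothing can be lost.
    dropped = ids_dropped_by_full_sync(known, uploaded_batch)
    return len(dropped) / len(known) <= max_dropped_fraction


previous_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5"]
complete_batch = ["doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]
partial_batch = ["doc-6"]  # An incremental update sent to the wrong endpoint.

print(ids_dropped_by_full_sync(previous_ids, complete_batch))  # ['doc-1']
print(is_safe_full_sync(previous_ids, complete_batch))         # True
print(is_safe_full_sync(previous_ids, partial_batch))          # False
```

A guard like this is cheap to run in a sync job and catches the common mistake of sending a handful of changed documents through the bulk endpoint.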

Getting started: Creating and indexing your first datasource

This section provides a practical, step-by-step guide to configuring a new custom datasource and indexing your first document.

Prerequisite: Authentication

All requests to the Indexing API must be authenticated using a bearer token. This token must be generated by a user with the appropriate permissions within the Glean Admin console. It should be treated as a secret and stored securely. The official Glean API client libraries provide a simple method for configuring this token, which is automatically included in all subsequent API calls.
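
If you are not using a client library, the bearer token can be attached by hand. The following is a minimal sketch using only the Python standard library. The base URL follows the cURL example in this article, while the GLEAN_INDEXING_TOKEN environment variable and the build_indexing_request helper are assumptions made for illustration, not names defined by Glean.

```python
import json
import os
import urllib.request

# Assumed base URL; replace "customer" with your own instance name.
BASE_URL = "https://customer-be.glean.com/api/index/v1"


def build_indexing_request(endpoint, payload, token):
    """Build (but do not send) an authenticated POST request."""
    if not token:
        raise ValueError("An indexing token is required")
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Read the secret from the environment instead of hard-coding it.
token = os.environ.get("GLEAN_INDEXING_TOKEN", "")
if token:
    request = build_indexing_request("/indexdocument", {"document": {}}, token)
    print(request.full_url)
    # To send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to unit test the headers without touching the network.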

Method 1: Datasource setup via the Admin console (UI)

For initial setup, or for administrators who prefer a graphical interface, the Glean Admin console provides a user-friendly way to create a custom datasource.

1. Navigate to the Admin console and select Data sources from the menu.
2. Click the Add data source button and choose Custom from the list of available app types.
3. Fill in the required "Data source basics," including the Name, Display name, and an Icon for the datasource.
4. Configure any other desired settings, such as URL patterns or quick actions.
5. Click Publish to create the datasource and make it available for indexing.

Method 2 (preferred): Datasource setup via the API (`/adddatasource`)

For programmatic or automated environments, creating and managing datasources via the API is the recommended approach. The /adddatasource endpoint is used to submit a CustomDatasourceConfig object that defines the datasource's properties.

Key `CustomDatasourceConfig` parameters

The following table details the most important parameters in the CustomDatasourceConfig object.

Parameter	Type	Required?	Description
`name`	string	Yes	Unique identifier for the datasource. Used in all subsequent API calls.
`displayName`	string	No	User-facing name in the UI. Falls back to a title-cased name if omitted.
`datasourceCategory`	string	No	The type of content in the datasource, which is a strong signal for search relevance. Cannot be UNCATEGORIZED.
`urlRegex`	string	Yes (for docs)	A regex to match document URLs. Not required if isEntityDatasource is true.
`isEntityDatasource`	boolean	No	Set to true if this datasource will hold structured entities instead of documents. Defaults to false.
`isTestDatasource`	boolean	No	Set to true to create a test datasource that does not affect production search rankings. Defaults to false.
`isUserReferencedByEmail`	boolean	No	Set to true if user permissions will be identified by email address.
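
The rules in the table can be enforced before the configuration is submitted. The sketch below builds a CustomDatasourceConfig-shaped dictionary using the parameter names from the table. The build_datasource_config helper, the example regex, and the PUBLISHED_CONTENT category value are assumptions for illustration, so check the category names against the API reference for your instance.

```python
def build_datasource_config(name, url_regex=None, display_name=None,
                            category=None, is_entity=False, is_test=False,
                            user_by_email=True):
    """Validate the table's rules and return a config dictionary."""
    if not name:
        raise ValueError("name is required")
    if not is_entity and not url_regex:
        raise ValueError("urlRegex is required unless isEntityDatasource is true")
    if category == "UNCATEGORIZED":
        raise ValueError("datasourceCategory cannot be UNCATEGORIZED")

    config = {
        "name": name,
        "isEntityDatasource": is_entity,
        "isTestDatasource": is_test,
        "isUserReferencedByEmail": user_by_email,
    }
    # Optional fields are only included when a value was provided.
    if display_name:
        config["displayName"] = display_name
    if category:
        config["datasourceCategory"] = category
    if url_regex:
        config["urlRegex"] = url_regex
    return config


config = build_datasource_config(
    name="internal-docs",
    display_name="Internal Docs",
    category="PUBLISHED_CONTENT",  # Assumed category value.
    url_regex=r"https://internal\.company\.com/docs/.*",
    is_test=True,  # Start as a test datasource while experimenting.
)
print(config["name"], config["isTestDatasource"])
```

The resulting dictionary would be serialized as JSON and sent to the /adddatasource endpoint with the same bearer token used for indexing.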

Indexing a document (`/indexdocument`)

Once the datasource has been configured, content can be added or updated using the /indexdocument endpoint. The following example demonstrates how to index a basic document using cURL.

curl -X POST https://customer-be.glean.com/api/index/v1/indexdocument \
  -H 'Authorization: Bearer <your_indexing_token>' \
  -H 'Content-Type: application/json' \
  -d '{
    "document": {
      "datasource": "internal-docs",
      "objectType": "Document",
      "id": "getting-started-guide",
      "title": "Getting Started with Our Tool",
      "viewURL": "https://internal.company.com/docs/getting-started",
      "body": {
        "mimeType": "text/plain",
        "textContent": "This is the full text content of the document..."
      },
      "permissions": {
        "users": ["user@example.com"],
        "groups": ["everyone@example.com"]
      }
    }
  }'
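
When documents come from another system, it is usually easier to build this request body in code than to hand-write JSON. The sketch below mirrors the field names of the cURL example above, including its permissions block. The build_document helper is hypothetical, and the field names should be verified against the Indexing API reference before use.

```python
import json


def build_document(datasource, doc_id, title, view_url, text,
                   users=(), groups=()):
    """Return an /indexdocument request body shaped like the cURL example."""
    required = (("datasource", datasource), ("id", doc_id),
                ("title", title), ("viewURL", view_url))
    for field_name, value in required:
        if not value:
            raise ValueError(f"{field_name} is required")
    return {
        "document": {
            "datasource": datasource,
            "objectType": "Document",
            "id": doc_id,
            "title": title,
            "viewURL": view_url,
            "body": {"mimeType": "text/plain", "textContent": text},
            "permissions": {"users": list(users), "groups": list(groups)},
        }
    }


payload = build_document(
    datasource="internal-docs",
    doc_id="getting-started-guide",
    title="Getting Started with Our Tool",
    view_url="https://internal.company.com/docs/getting-started",
    text="This is the full text content of the document...",
    users=["user@example.com"],
    groups=["everyone@example.com"],
)
# json.dumps guarantees balanced braces and correct string escaping.
print(json.dumps(payload, indent=2))
```

Generating the body this way avoids the unbalanced braces and stray whitespace that tend to creep into hand-edited JSON.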

Wrapping up

With these core concepts and setup steps in place, you now have the foundation needed to start integrating your internal systems with the Glean Indexing API.
In Part 2, "Advanced custom datasource guidance for Glean," we’ll go deeper into best practices, advanced capabilities, and the most common issues teams encounter as they expand their integrations.
