Conceptual Search
Conceptual Search is an experimental feature. The API and functionality may change in future releases.
Conceptual Search is an AI-powered feature that allows you to generate a data product plan from a natural language query. It bridges the gap between a high-level business question and a concrete, executable data product definition.
Overview
At its core, Conceptual Search uses a sophisticated, two-stage process orchestrated by AI agents and knowledge graphs:
- Knowledge Graphs: The system builds knowledge graphs for both database tables and columns. Nodes represent tables/columns, and edges connect conceptually related items based on semantic similarity and shared concepts extracted by an LLM.
- Graph-Based Retrievers: When you search, the system uses a hybrid approach of vector search and graph traversal to find relevant tables and columns, even if they are not direct keyword matches.
- AI Agents: The process is managed by two LangChain/LangGraph agents: a DataProductPlannerAgent and a DataProductBuilderAgent.
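The hybrid retrieval idea can be pictured with a minimal, self-contained sketch. The tables, embeddings, and graph below are invented purely for illustration and this is not the library's implementation: candidate tables are first ranked by vector similarity to the query, and the top hits are then expanded with their graph neighbours so that conceptually related tables surface even without a direct match.

import numpy as np

# Toy embeddings and a toy concept graph -- purely illustrative data.
embeddings = {
    "customers": np.array([0.9, 0.1, 0.0]),
    "orders": np.array([0.7, 0.6, 0.1]),
    "web_logs": np.array([0.1, 0.2, 0.9]),
}
# Edges connect conceptually related tables (as an LLM might link them).
graph = {
    "customers": {"orders"},
    "orders": {"customers"},
    "web_logs": set(),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_search(query_vec, top_k=2):
    # Vector search: rank tables by similarity to the query embedding.
    ranked = sorted(embeddings, key=lambda t: cosine(query_vec, embeddings[t]), reverse=True)
    seeds = ranked[:top_k]
    # Graph traversal: pull in neighbours of the top hits.
    expanded = set(seeds)
    for table in seeds:
        expanded |= graph[table]
    return expanded

print(hybrid_search(np.array([0.8, 0.5, 0.0])))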
The Two-Stage Workflow
Stage 1: Planning
The goal of this stage is to convert a vague user request (e.g., "customer churn metrics") into a structured DataProductPlan, which is a well-defined list of dimensions and measures.
- Input: A natural language query.
- Agent's Task: The DataProductPlannerAgent uses its tools to find relevant database tables and existing data products.
- Output: The agent produces a DataProductPlan object, which can be reviewed and modified by the user.
The DataProductPlan generated by the AI is a starting point. It is crucial to review and validate this plan to ensure it aligns with your business requirements before proceeding to the building stage.
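Conceptually, a plan is just a list of named attributes, each classified as a dimension or a measure and carrying a description. The dataclass below is a hypothetical stand-in used only to show that shape; the real DataProductPlan has its own fields and the modification methods described later on this page.

from dataclasses import dataclass

# Hypothetical stand-in for illustration -- not intugle's actual class.
@dataclass
class PlannedAttribute:
    name: str
    description: str
    classification: str  # "Dimension" or "Measure"
    active: bool = True

# What a plan for "customer churn metrics" might conceptually contain.
example_plan = [
    PlannedAttribute("customer id", "Unique identifier for each customer", "Dimension"),
    PlannedAttribute("churn rate", "Share of customers lost in a period", "Measure"),
    PlannedAttribute("tenure months", "Months since the customer signed up", "Measure"),
]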
Stage 2: Building
This stage takes the abstract DataProductPlan and maps each attribute to a specific, physical database column, defining its logic (e.g., aggregation for measures).
- Input: The DataProductPlan from Stage 1.
- Agent's Task: The DataProductBuilderAgent iterates through each attribute in the plan, using the graph-based column retriever to find the most relevant physical column.
- Output: The collected mappings are assembled into a final ETLModel, which is a complete, machine-readable definition of the data product, ready to be used to generate a SQL query.
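To picture what the builder produces, each planned attribute ends up paired with a physical column and, for measures, an aggregation. The mapping below uses invented column names and a plain-dict structure for illustration only; the actual output is the ETLModel object.

# Illustrative only -- the real ETLModel is assembled by the DataProductBuilderAgent.
attribute_mappings = [
    {"attribute": "customer id", "column": "crm.customers.customer_id", "aggregation": None},
    {"attribute": "Total Spend", "column": "sales.orders.order_amount", "aggregation": "SUM"},
]
# From mappings like these, a SQL query can be generated, e.g.
# SELECT customer_id, SUM(order_amount) AS total_spend ... GROUP BY customer_id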
Usage Example
from intugle import DataProduct
dp = DataProduct()
# 1. Generate a plan from a natural language query
plan = await dp.plan(query="top 10 customers by their total purchase amount")
# 2. Review and modify the plan
print("Original Plan:")
plan.display()
plan.rename_attribute("total purchase amount", "Total Spend")
plan.disable_attribute("customer address") # Assuming this was in the plan
print("\nModified Plan:")
plan.display()
# 3. Create the ETL model from the modified plan
etl_model = await dp.create_etl_model_from_plan(plan)
# 4. Build the data product
result_dataset = dp.build(etl=etl_model)
# 5. Access the results
print(result_dataset.to_df())
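The calls to plan() and create_etl_model_from_plan() are awaited, so the snippet above runs as-is in a notebook or any environment with an active event loop. In a plain Python script, wrap them in an async function, for example:

import asyncio

from intugle import DataProduct

async def main():
    dp = DataProduct()
    plan = await dp.plan(query="top 10 customers by their total purchase amount")
    etl_model = await dp.create_etl_model_from_plan(plan)
    return dp.build(etl=etl_model)

result_dataset = asyncio.run(main())
print(result_dataset.to_df())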
Modifying the Data Product Plan
The DataProductPlan is not a static output; it is an interactive object you can modify to refine the AI's suggestions, correcting any misunderstandings or adding your own domain knowledge to the plan.
Here are the available methods to modify the plan:
| Method | Description | Example |
|---|---|---|
| rename_attribute(old, new) | Renames an existing attribute. | plan.rename_attribute('Customer ID', 'Client Identifier') |
| set_attribute_description(name, desc) | Updates the description of an attribute. | plan.set_attribute_description('Client Identifier', 'The unique ID for each client') |
| set_attribute_classification(name, class) | Changes the classification to 'Dimension' or 'Measure'. | plan.set_attribute_classification('Total Sales', 'Measure') |
| disable_attribute(name) | Deactivates an attribute so it won't be included in the final data product. | plan.disable_attribute('Customer Address') |
| enable_attribute(name) | Reactivates a previously disabled attribute. | plan.enable_attribute('Customer Address') |
| to_df() | Returns the final plan as a pandas DataFrame with only active attributes. | final_plan_df = plan.to_df() |
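Continuing from the usage example above, a typical review loop inspects the active attributes and then applies corrections before building:

# Inspect the current plan as a pandas DataFrame (active attributes only).
print(plan.to_df())

# Apply corrections based on the review.
plan.rename_attribute("Customer ID", "Client Identifier")
plan.set_attribute_description("Client Identifier", "The unique ID for each client")
plan.set_attribute_classification("Total Sales", "Measure")
plan.disable_attribute("Customer Address")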
Qdrant Server Requirement
Conceptual Search utilizes Qdrant as its vector database for efficient retrieval of relevant tables and columns. Therefore, a running Qdrant instance is required.
You can easily set up a Qdrant server using Docker:
docker run -d -p 6333:6333 -p 6334:6334 \
-v qdrant_storage:/qdrant/storage:z \
--name qdrant qdrant/qdrant
After starting the Qdrant server, you need to configure its URL and API key (if authorization is used) in your environment variables:
export QDRANT_URL="http://localhost:6333"
export QDRANT_API_KEY="your-qdrant-api-key" # if authorization is used
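To confirm the instance is reachable before running a search, you can optionally use the official qdrant-client package; the quick check below is illustrative and not required by Conceptual Search itself.

import os

from qdrant_client import QdrantClient

# Reads the same environment variables shown above.
client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),  # None if no authorization is configured
)
print(client.get_collections())  # Returns without raising if the server is up.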
Enhancing Performance with Tavily Web Search
For better performance and more contextually aware data product plans, it is recommended to enable the Tavily web search tool. This allows the planning agent to research industry best practices and common metrics related to your query.
To enable this feature, get an API key from Tavily and set it as an environment variable:
export TAVILY_API_KEY="your-tavily-api-key"
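If you prefer configuring this from Python (for example, in a notebook), setting the variable via os.environ before creating the DataProduct is equivalent to the shell export above:

import os

# Equivalent to the shell export above; set before creating DataProduct.
os.environ["TAVILY_API_KEY"] = "your-tavily-api-key"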