Link prediction

Link prediction is one of the most powerful features of the Intugle Data Tools library. It's the process of automatically discovering meaningful relationships and potential join keys between different, isolated datasets. This turns a collection of separate tables into a connected semantic graph, which is the foundation for building unified data products.
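
To make the idea of a "potential join key" concrete, the sketch below scores column pairs by how many of one column's distinct values also occur in the other. It is a simplified illustration in plain Python with invented data, not the algorithm the library uses; `containment`, `customers`, and `orders` are hypothetical names.

```python
def containment(child_values, parent_values):
    """Fraction of distinct child values that also appear in the parent column."""
    child = set(child_values)
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


# Hypothetical in-memory tables, stored as column name -> list of values.
customers = {"id": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cy", "Di"]}
orders = {
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 3],
    "amount": [5.5, 7.5, 2.5],
}

# Score every (orders column, customers column) pair.
scores = {
    (o_col, c_col): containment(o_vals, c_vals)
    for o_col, o_vals in orders.items()
    for c_col, c_vals in customers.items()
}

# The highest-scoring pair is the most plausible join key in this toy data.
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Value overlap alone produces false positives on real data (for example, two unrelated integer ID columns), so a score like this is only a starting signal.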

The LinkPredictor class

The core component responsible for this process is the LinkPredictor class. The KnowledgeBuilder normally runs it for you, but you can also use the LinkPredictor directly for more granular control.

Accessing the LinkPredictor

After running the predict_links() or build() method on a KnowledgeBuilder instance, you can access the underlying LinkPredictor instance via the link_predictor attribute.

# After running the pipeline on a KnowledgeBuilder instance named `kb`...
predictor_instance = kb.link_predictor

# Now you can use all the methods of the LinkPredictor
links_list = predictor_instance.links

Manual usage

To use the LinkPredictor manually, you must give it a list of fully profiled DataSet objects.

from intugle.analysis.models import DataSet
from intugle.link_predictor.predictor import LinkPredictor


# 1. Initialize and fully profile your DataSet objects first
customers_data = {"path": "path/to/customers.csv", "type": "csv"}
orders_data = {"path": "path/to/orders.csv", "type": "csv"}

customers_dataset = DataSet(customers_data, name="customers")
customers_dataset.profile().identify_datatypes().identify_keys()

orders_dataset = DataSet(orders_data, name="orders")
orders_dataset.profile().identify_datatypes().identify_keys()

# 2. Initialize the LinkPredictor with the processed datasets
predictor = LinkPredictor([customers_dataset, orders_dataset])

# 3. Run the prediction
predictor.predict(save=True)

# 4. Access the results
# The discovered links are stored as a list of PredictedLink objects in the `links` attribute
links_list = predictor.links
for link in links_list:
    print(f"Found link from {link.from_dataset}.{link.from_column} to {link.to_dataset}.{link.to_column}")

Caching mechanism

The predict() method avoids redundant work. It saves its results to a __relationships__.yml file and only re-runs the analysis if it detects that any of the underlying dataset analyses have changed since the last run.
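
One common way to implement this kind of change detection is to fingerprint the inputs and compare the result with a fingerprint stored next to the cached output. The sketch below shows that general pattern using only the standard library. It is an assumption about the approach, not the library's actual code: the helper names, the JSON cache file (used here instead of YAML to avoid a dependency), and the `fake_analysis` stand-in are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(analyses):
    """Stable hash of the per-dataset analysis results."""
    payload = json.dumps(analyses, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def predict_with_cache(analyses, cache_file, run_analysis):
    """Return cached links unless the inputs changed since the last run."""
    current = fingerprint(analyses)
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if cached["fingerprint"] == current:
            return cached["links"], True  # cache hit
    links = run_analysis(analyses)
    cache_file.write_text(json.dumps({"fingerprint": current, "links": links}))
    return links, False  # cache miss: the analysis was re-run


def fake_analysis(analyses):
    """Stand-in for the expensive link analysis."""
    return [["orders.customer_id", "customers.id"]]


cache_path = Path(tempfile.mkdtemp()) / "relationships_cache.json"
analyses = {"customers": {"rows": 4}, "orders": {"rows": 3}}

_, hit_first = predict_with_cache(analyses, cache_path, fake_analysis)
_, hit_second = predict_with_cache(analyses, cache_path, fake_analysis)
analyses["orders"]["rows"] = 5  # simulate a changed dataset analysis
_, hit_third = predict_with_cache(analyses, cache_path, fake_analysis)
print(hit_first, hit_second, hit_third)
```

The first call computes and stores the result, the second reuses it, and the third recomputes because the input fingerprint no longer matches.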

Useful methods and attributes

links

The links attribute is the primary way to access the results. It holds a list of PredictedLink Pydantic objects, giving you structured access to the discovered relationships.
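
Because each link is a structured object, ordinary Python is enough to filter or group the results. The sketch below uses a stand-in dataclass with the four fields used in the manual usage example (`from_dataset`, `from_column`, `to_dataset`, `to_column`) so that it runs without the library installed. The real PredictedLink may carry additional fields, and the sample links are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkStandIn:
    """Hypothetical stand-in for PredictedLink with only four fields."""
    from_dataset: str
    from_column: str
    to_dataset: str
    to_column: str


links_list = [
    LinkStandIn("orders", "customer_id", "customers", "id"),
    LinkStandIn("payments", "order_id", "orders", "id"),
    LinkStandIn("orders", "product_id", "products", "id"),
]

# Group outgoing links by the dataset they start from.
links_by_source = defaultdict(list)
for link in links_list:
    target = f"{link.from_column} -> {link.to_dataset}.{link.to_column}"
    links_by_source[link.from_dataset].append(target)

for dataset, targets in links_by_source.items():
    print(dataset, targets)
```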

get_links_df()

A utility method that converts the links list into a pandas DataFrame. This is useful for quick exploration, analysis, or display in a notebook environment.

# Get the results as a DataFrame for easy viewing
links_df = predictor.get_links_df()

# Display the DataFrame
# columns: from_dataset, from_column, to_dataset, to_column
print(links_df)

save_yaml() and load_from_yaml()

You can manually save the state of the predictor or load results from a specific file.

# Save the discovered links to a custom file
predictor.save_yaml("my_custom_links.yml")

# Load links from a file
predictor.load_from_yaml("my_custom_links.yml")

show_graph()

After running the prediction, you can visualize the discovered relationships as a graph. This is an excellent way to understand the overall structure of your connected data.

# This will render a graph of the relationships
predictor.show_graph()
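
When a graphical display is not available, the same structure can be inspected as an adjacency map. The following sketch builds one from (from_dataset, to_dataset) pairs and lists the groups of datasets that are connected to each other. It is independent of how show_graph() renders the graph, and the sample edges are invented.

```python
from collections import defaultdict, deque

# Hypothetical (from_dataset, to_dataset) pairs taken from predicted links.
edges = [("orders", "customers"), ("payments", "orders"), ("reviews", "products")]

# Undirected adjacency map: a join can be traversed in either direction.
adjacency = defaultdict(set)
for src, dst in edges:
    adjacency[src].add(dst)
    adjacency[dst].add(src)


def connected_groups(adjacency):
    """Return sorted groups of datasets that are reachable from one another."""
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        queue, group = deque([start]), set()
        while queue:  # breadth-first traversal from `start`
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            queue.extend(adjacency[node] - seen)
        groups.append(sorted(group))
    return groups


groups = connected_groups(adjacency)
print(groups)
```

A dataset that appears in no group has no predicted links, which is often the first thing worth checking after a run.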