Skip to main content
Version: 1.0.0

Link prediction

Link prediction is one of the most powerful features of the Intugle Data Tools library. It's the process of automatically discovering meaningful relationships and potential join keys between different, isolated datasets. This turns a collection of separate tables into a connected semantic graph, which is the foundation for building unified data products.

The LinkPredictor class

The core part responsible for this process is the LinkPredictor. While the SemanticModel manages this process for you, you can also use the LinkPredictor directly for more granular control.

Accessing the LinkPredictor

After running the predict_links() or build() method on a SemanticModel instance, you can access the underlying LinkPredictor instance via the link_predictor attribute.

# After running the pipeline...
predictor_instance = sm.link_predictor

# Now you can use all the methods of the LinkPredictor
links_list = predictor_instance.links

Manual usage

To use the LinkPredictor manually, you must give it a list of fully profiled DataSet objects.

from intugle.analysis.models import DataSet,
from intugle.link_predictor.predictor import LinkPredictor


# 1. Initialize and fully profile your DataSet objects first
customers_data = {"path": "path/to/customers.csv", "type": "csv"}
orders_data = {"path": "path/to/orders.csv", "type": "csv"}

customers_dataset = DataSet(customers_data, name="customers")
customers_dataset.profile().identify_datatypes().identify_keys()

orders_dataset = DataSet(orders_data, name="orders")
orders_dataset.profile().identify_datatypes().identify_keys()

# 2. Initialize the LinkPredictor with the processed datasets
predictor = LinkPredictor([customers_dataset, orders_dataset])

# 3. Run the prediction
predictor.predict(save=True)

# 4. Access the results
# The discovered links are stored as a list of PredictedLink objects in the `links` attribute
links_list = predictor.links
for link in links_list:
print(f"Found link from {link.from_dataset}.{link.from_column} to {link.to_dataset}.{link.to_column}")

Caching mechanism

The predict() method avoids redundant work. It saves its results to a __relationships__.yml file and only re-runs the analysis if it detects that any of the underlying dataset analyses have changed since the last run.

Useful methods and attributes

The primary way to access the results. This attribute holds a list of PredictedLink Pydantic objects, giving you structured access to the discovered relationships.

A utility function that converts the links list into a Pandas DataFrame. This is useful for quick exploration, analysis, or display in a notebook environment.

# Get the results as a DataFrame for easy viewing
links_df = predictor.get_links_df()

# Display the DataFrame
# columns: from_dataset, from_column, to_dataset, to_column
print(links_df)

save_yaml() and load_from_yaml()

You can manually save the state of the predictor or load results from a specific file.

# Save the discovered links to a custom file
predictor.save_yaml("my_custom_links.yml")

# Load links from a file
predictor.load_from_yaml("my_custom_links.yml")

show_graph()

After running the prediction, you can visualize the discovered relationships as a graph. This is an excellent way to understand the overall structure of your connected data.

# This will render a graph of the relationships
predictor.show_graph()
For Non-Interactive Environments

The show_graph() method is designed for interactive environments like Jupyter notebooks where a GUI is available to display the plot.

If you run this method in a non-interactive script, you will see a UserWarning because no interactive backend is available. To save the graph to a file instead, you can use matplotlib:

import matplotlib.pyplot as plt

# This will generate the plot and raise a warning
predictor.show_graph()

# This will save the generated plot to a file
plt.savefig("link_prediction_graph.png")