Duplicate lineage inserted into Azure Purview

Sri Lakshman Velugubantla 20 Reputation points
2025-01-02T17:40:17.5266667+00:00

Hi Microsoft Team,

I have registered and scanned the Databricks Unity Catalog in Azure Purview. Initially, the metadata for all tables and views was imported into Purview, but the lineage was not. After I enabled the access schema in Unity Catalog, which includes the table_lineage and column_lineage tables containing lineage information, the lineage data was extracted and pulled into Azure Purview.

However, I am now encountering an issue with duplicate lineages for the same table in Azure Purview. This is because the table_lineage and column_lineage tables contain historical lineage data. We run notebooks daily, and each run stores lineage data in these tables with different entity_ids, resulting in duplicate lineages for each table.

I cannot clean up the lineage data in the Databricks tables as it would affect other aspects of the Databricks catalog.

Could you please help me resolve this issue? Is there a way for Purview to take only the latest lineage from these tables? Can I create a new table with only the latest lineage data and scan that table instead? Will this approach work?

If this works, which columns does Purview require to create a lineage process? Please let me know, and I will create a table with those columns, insert the latest lineage data, and scan again.

Microsoft Purview
A Microsoft data governance service that helps manage and govern on-premises, multicloud, and software-as-a-service data. Previously known as Azure Purview.

1 answer

  1. David Broggy 6,071 Reputation points MVP
    2025-01-02T19:40:05.4733333+00:00

    Hi @sri lakshman,

    I'm not aware of a feature for Purview to read just the latest lineage.

    You would need to create another table which contains only the latest lineage and have Purview scan that.

    I appreciate that would mean you need to maintain a new table but that's my recommended solution.

    Good luck.
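
As a rough illustration of the "latest lineage only" table suggested above, here is a minimal Python sketch of the deduplication step. It assumes each lineage record carries source_table_full_name, target_table_full_name, entity_id and event_time fields, and that one row per source/target pair is what you want to keep; the sample rows and the latest_lineage helper are hypothetical. The thread does not say which columns Purview needs or whether Purview will build lineage from a custom table, so treat this only as the filtering logic.

```python
from datetime import datetime


def latest_lineage(rows):
    """Keep only the most recent row for each (source, target) table pair."""
    latest = {}
    for row in rows:
        # Assumption: a source/target pair identifies one logical lineage edge.
        key = (row["source_table_full_name"], row["target_table_full_name"])
        if key not in latest or row["event_time"] > latest[key]["event_time"]:
            latest[key] = row
    # Sort for a stable, predictable order.
    return sorted(latest.values(), key=lambda r: (
        r["source_table_full_name"], r["target_table_full_name"]))


# Hypothetical sample: two daily runs of the same notebook plus one other edge.
rows = [
    {"source_table_full_name": "main.raw.orders",
     "target_table_full_name": "main.curated.orders",
     "entity_id": "run-1", "event_time": datetime(2025, 1, 1, 6, 0)},
    {"source_table_full_name": "main.raw.orders",
     "target_table_full_name": "main.curated.orders",
     "entity_id": "run-2", "event_time": datetime(2025, 1, 2, 6, 0)},
    {"source_table_full_name": "main.raw.customers",
     "target_table_full_name": "main.curated.customers",
     "entity_id": "run-3", "event_time": datetime(2025, 1, 2, 6, 5)},
]

deduped = latest_lineage(rows)
for row in deduped:
    print(row["source_table_full_name"], "->",
          row["target_table_full_name"], row["entity_id"])
```

The same idea carries over to a Databricks job that rewrites the new table after each daily run, so the original lineage tables stay untouched.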

