Documenting metadata lineage and physical implementation information

Our metadata crew is in the middle of some serious design work on how best to document our data holdings, and I was wondering whether there are any examples I could see of data lineage being represented in Aristotle?

Suppose we harvest information from our internal systems showing that tables W and X in system A feed into a landing table Y in system B, and that table Y’s content is then pulled into table Z in system C. How could we best represent this information?

As far as I know, these tables themselves would be represented in Aristotle as Distributions; and we even have a custom field in Distributions to represent the Physical Table Name of each table W, X, Y and Z. Each physical column of each table would be represented in these Aristotle distributions by the field path_name (or for the API it’s ‘logical_path’ inside of distributiondataelementpath_set inside of distribution).
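To make the nesting concrete, here is a minimal sketch of pulling the column names out of a single distribution record. The key names `distributiondataelementpath_set` and `logical_path` are the ones mentioned above; the overall shape of the record and the sample values are assumptions made up for illustration, not real API output.

```python
def column_names(distribution):
    """Collect the logical_path of every data element path in a distribution record."""
    # Assumed shape: a list of dicts, each carrying a 'logical_path' string.
    paths = distribution.get("distributiondataelementpath_set", [])
    return [p["logical_path"] for p in paths if p.get("logical_path")]


# Made-up example record, not real API output.
sample = {
    "name": "Table W",
    "distributiondataelementpath_set": [
        {"logical_path": "customer_id"},
        {"logical_path": "start_date"},
    ],
}

print(column_names(sample))
```

If the real payload wraps these fields differently, only the two lookups inside `column_names` would need to change.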

But what about the data flow itself, from distributions W and X feeding into distribution Y, which in turn feeds distribution Z? How is this shown?
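For illustration, the flow I am describing could be written down as a simple edge list before deciding how it maps into Aristotle. This is only a sketch of the information we would harvest: the `system.table` naming and the `upstream` helper are hypothetical and say nothing about how Aristotle itself stores lineage.

```python
from collections import defaultdict

# Hypothetical edge list: (source table, target table), named as "system.table".
EDGES = [
    ("A.W", "B.Y"),
    ("A.X", "B.Y"),
    ("B.Y", "C.Z"),
]


def upstream(target, edges):
    """Return every table that feeds `target`, directly or indirectly."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("C.Z", EDGES)))
```

Asking for everything upstream of `C.Z` walks back through `B.Y` to both tables in system A, which is the question we would eventually want the lineage metadata to answer.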

Also, System A, B, and C … how are they best documented?

My feeling is that we would use links, such as the API call http://dss.aristotlecloud.io/api/v4/links/ , but I have not tested this.

Lastly, in order to locate table W in Aristotle, are we able to search via API according to Distribution physical name? (Physical name being a custom field we have set up).

So to sum up, my questions are:

  1. How do we best represent data flow across tables across different data platforms?
  2. Confirm that Path name is the best place to document the physical column name of a physical table distribution? (I’ve even seen a csv file documented as a Distribution, and Path name was then used to specify cell values eg: B1:B10)
  3. How best to locate whether a physical table is actually documented in Aristotle when all I know right now is the physical name? Is this a GraphQL call, or must we cycle through all distributions via v4’s metadata_distribution_list?

We have quite a few examples of this in the Knowledge Base:

And in the Aristotle.Cloud user guide examples: Collections | Data Lineage Examples - Aristotle.Cloud

At the moment, we only support data lineage at the Dataset level, using provenance records. These can be used to specify a source dataset, along with associated information.

We are currently developing an extension to Aristotle, based on feedback from several clients, to support linkages at the Distribution and logical path levels, allowing direct linkage between data tables.

In response to your summary questions:

  1. How do we best represent data flow across tables across different data platforms?

    • At the moment, using the provenance records for datasets. In future this can be done between tables as well.
  2. Confirm that logical_path is the best place to document the physical column name of a physical distribution? (we’ve always done it this way, but someone’s tried to tell me otherwise recently)

    • Yes, the logical_path is where I would recommend storing the column name information. This field is designed to map a field in a data file (column name, XML element name, etc.) to a data element, so this is the perfect place for the column name to go.
  3. How best to locate whether a physical table is actually documented in Aristotle when all I know right now is the physical name?

    • All custom fields are searchable using the search in Aristotle, so if the Physical name is stored against the item, searching for it should return the right results.
    • Also, because we’ve now added the ability to capture alternative names, Services Australia can start using this to define and record Physical Names against distributions. In an upcoming update, we are also going to add the ability to search and filter based on alternative names in the updated browse pages.
      🐜 Alternative Name Types are now available for all users!

Thanks Sam.

Sorry, what I meant to ask is how to search for a distribution using the API when all I know is its physical name.

Although if we switched to Alternative Names, and I could search on those via the API, this would solve the problem.
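In the meantime, the fallback I have in mind is to cycle through the distribution records and match on the custom field client-side. Here is a minimal sketch of that matching step, assuming the records have already been fetched. The `custom_fields` key and the sample records are hypothetical, since I have not confirmed how custom fields appear in the API response.

```python
def find_by_physical_name(distributions, physical_name, field="Physical Table Name"):
    """Return the distribution records whose custom field matches the physical name."""
    wanted = physical_name.strip().lower()
    matches = []
    for dist in distributions:
        # 'custom_fields' is an assumed key; adjust to the real payload shape.
        value = (dist.get("custom_fields") or {}).get(field) or ""
        if value.strip().lower() == wanted:
            matches.append(dist)
    return matches


# Made-up records standing in for already-fetched API results.
records = [
    {"name": "Table W", "custom_fields": {"Physical Table Name": "SYS_A.TBL_W"}},
    {"name": "Table X", "custom_fields": {"Physical Table Name": "SYS_A.TBL_X"}},
    {"name": "Undocumented", "custom_fields": None},
]

print([d["name"] for d in find_by_physical_name(records, "sys_a.tbl_w")])
```

The comparison ignores case and surrounding whitespace because physical names are often typed inconsistently across systems. A server-side search on Alternative Names would make this loop unnecessary.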