I’ve started testing out the OpenAI ChatBot feature after seeing demos online. I’m strongly in favour of human experts being involved in (and responsible for) defining and updating metadata. But after seeing a few examples, this seems like it could be a pretty powerful way to help document large datasets quickly.
Intriguing… No, I haven’t used it, but keen to find out more. How does it work? Even used only as a starting point for definition development, it could significantly reduce the time required to research new definitions and add detail that might not otherwise be documented.
It does sound like an interesting source for definitions and ideas. As you said though, we would need to fully review everything it says. It can still give answers that are:
- Nonsense (obviously wrong just from the text it has provided)
- Wrong (but not obviously wrong just from the text it provided)
@skew That was my thinking, yes - it would be an interesting value-add, but certainly not a replacement for content developed by people.
As @AndrewB mentioned, AI can be troublesome, especially when it is trained on incomplete entries, and I doubt that much “metadata” from a registry would have gone into training ChatGPT.
If we were to implement it, it would follow principles similar to those we already have in the registry, where autocreated definitions are flagged as such.
I’ve been able to successfully teach OpenAI how to understand ISO 11179 and it is able to do some basic interpretation of columns to build out Data Element names with good rationales. What is most interesting is that even OpenAI acknowledges the need for good documentation:
It’s worth noting that abbreviations and acronyms are often used in column names to reduce space, and these can make it difficult to understand what the data element is called. In these cases, it’s often necessary to refer to documentation or metadata to understand the full names of the Object Class, Property, and Value Domain.
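For anyone curious what this could look like in practice, here is a minimal Python sketch of the prompt-and-parse flow. The prompt wording, the sample column name, and the three-line reply format are all my own assumptions (not Aristotle features), and the actual API call is left out so the example stands alone:

```python
# Hypothetical sketch: asking a chat model to decompose a column name
# into ISO/IEC 11179 parts (Object Class, Property, Value Domain).
# Prompt wording and reply format are assumptions for illustration.

PROMPT_TEMPLATE = (
    "You are a metadata analyst familiar with ISO/IEC 11179. "
    "Decompose the column name '{column}' into a Data Element name.\n"
    "Reply with exactly three lines:\n"
    "Object Class: ...\nProperty: ...\nValue Domain: ..."
)

def build_prompt(column: str) -> str:
    """Fill the template for one column name."""
    return PROMPT_TEMPLATE.format(column=column)

def parse_reply(text: str) -> dict:
    """Parse the model's three-line reply into a dict of components."""
    parts = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            parts[key.strip()] = value.strip()
    return parts

# A reply of the kind the model might return for a column named 'pt_dob'
# (this sample is invented; a real call would go through the OpenAI API):
sample = (
    "Object Class: Patient\n"
    "Property: Date of Birth\n"
    "Value Domain: Date (YYYY-MM-DD)"
)
element = parse_reply(sample)
print(element["Object Class"], "-", element["Property"])  # → Patient - Date of Birth
```

The reviewed output of something like this could then be loaded into the registry as draft content, flagged as autocreated per the principle mentioned above.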
@sam That is really cool. We could load descriptions of data sets, distributions and fields/columns into Aristotle and have it spit out data elements and other components (as well as descriptions). Then we could start the process of reviewing and refining the content.