Feature feedback: Updates to datasets to improve discovery and grouping

Hey all,

We have an upcoming change to datasets to improve how they are grouped. We’ve also been working with people from the New South Wales Department of Communities and Justice, who have given us approval to use an example from their registry to showcase the improvements.

Because this is a major change to dataset visualisation, we are gathering feedback to ensure this is the ideal solution across all clients and that all options are explored. If you have questions, recommendations or feedback, or would like to offer your approval of the change, please comment below.


In summary, they have a large dataset called the “Human Services Dataset (HSDS)” that includes over 100 distributions, and are looking for solutions to capture a data model for these tables and files.

However, the two-layer Dataset / Distribution model from DCAT and ISO11179 isn’t sufficient for recording the complexity of some large data assets.

The challenge:

The HSDS currently contains 136 distributions shown sequentially and takes between 5-10 to load.

To resolve this, the Family and Community Services Insights and Analytics area has developed a collection that holds the hierarchy of the dataset to make it easier to browse. However, this puts critical information about the dataset outside the governance process.

A further challenge is that when viewing a dataset within a Tablion Data Portal, the hierarchy cannot be synced from a collection.


The suggestion: adding a dataset grouping browser

We are going to add a “Dataset grouping” hierarchy to Datasets, similar to Data Set Specifications, to capture a hierarchy of distributions in a way that is easy to manage.

This will add a hierarchical grouping that allows distributions to be structured within a Dataset, capturing the semantics of the structure within the page. This will also improve discovery of distributions and improve overall page load speeds and page size.

Proposed change: Screenshot 1: The HSDS is now able to bring all details into the description within a unified item page.

Proposed change: Screenshot 2: This will move the distributions further down the page into its own “browse panel” within the dataset. Each grouping can have its own metadata, such as a description of a subtable, view or database, and viewing an individual grouping will show all containing groups and distributions.
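
To make the grouping idea concrete, here is a minimal sketch of how a hierarchy of groupings and distributions could be modelled. This is not the registry's actual data model: the class names (`Grouping`, `Distribution`), the field names and the sample data are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    # Hypothetical stand-in for a registry distribution (a table or file).
    name: str


@dataclass
class Grouping:
    # Hypothetical grouping node: its own metadata plus child groups
    # and the distributions attached directly to it.
    name: str
    description: str = ""
    groups: list["Grouping"] = field(default_factory=list)
    distributions: list[Distribution] = field(default_factory=list)


def browse(group: Grouping) -> dict:
    """Return what a browse panel might show: only the direct children."""
    return {
        "name": group.name,
        "description": group.description,
        "groups": [g.name for g in group.groups],
        "distributions": [d.name for d in group.distributions],
    }


def count_distributions(group: Grouping) -> int:
    """Count every distribution at or below this grouping."""
    nested = sum(count_distributions(g) for g in group.groups)
    return len(group.distributions) + nested


# Illustrative data only; these names are not taken from the HSDS.
root = Grouping(
    name="Example dataset",
    groups=[
        Grouping(
            "Health",
            "Health-related tables",
            distributions=[Distribution("admissions"), Distribution("episodes")],
        ),
        Grouping(
            "Housing",
            "Housing-related tables",
            distributions=[Distribution("tenancies")],
        ),
    ],
    distributions=[Distribution("master_index")],
)
```

In this sketch, `browse(root)` returns only the top-level group names and the single top-level distribution, so a page would only need to render one level at a time rather than every distribution at once.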

Proposed change: Screenshot 3: This will also allow single distributions to be rendered on their own, making pages shorter and page loads faster.

Proposed change: Screenshot 4: Adding a hierarchy to the dataset will also allow this to be synced across to a Tablion Data Portal to improve variable selection and discovery:

Alternate options

Alternate option 1: Nested Datasets

DCAT and ISO11179 allow datasets to have relations to other datasets to build a hierarchy. However, we have decided not to implement this for two major reasons.

Firstly, nested Datasets would introduce the usability challenge of managing and syncing governance, registration and permissions between parent and child datasets. Adding this relation would make it difficult to know who controls the hierarchy and what is within the tree. Secondly, this would add technical complexity when syncing between systems, as different datasets may not be able to be synced.

Hi Sam,

I agree the current approach doesn’t work well when there are many distributions. It is a bit slow to load, but also hard for people to find the distribution they want.

However I wonder whether people who don’t have experience with the Data Set will still find it difficult to find the distribution they want in the hierarchy.

Alternative option

An alternative may be to list the distributions in a table in the Data Set (similar to how the Data Elements are listed in Screenshot 3).

People could either:

  • Use the hierarchy to filter the Distributions in the table.
  • Use a search box to filter the Distributions in the table.
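
As a rough sketch of how those two filters could combine over one table, assuming each row carries the grouping path it sits under (the `Row` type, `filter_rows` function and sample rows are hypothetical, not part of the registry):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical table row: a distribution name plus the grouping
    # path it sits under in the hierarchy.
    name: str
    path: tuple[str, ...]


def filter_rows(rows, group_path=(), search=""):
    """Keep rows under `group_path` whose name contains `search`.

    The search is case-insensitive; empty arguments mean "no filter".
    """
    prefix = tuple(group_path)
    needle = search.strip().lower()
    return [
        row
        for row in rows
        if row.path[: len(prefix)] == prefix and needle in row.name.lower()
    ]


# Illustrative rows only.
rows = [
    Row("admissions", ("Health",)),
    Row("episodes", ("Health", "Hospital")),
    Row("tenancies", ("Housing",)),
]

under_health = filter_rows(rows, group_path=("Health",))
matching_text = filter_rows(rows, search="TEN")
```

Selecting a node in the hierarchy would set `group_path`, and typing in the search box would set `search`, so both controls narrow the same table.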

Regards,
Andrew

What if we added a toggle option so a user can switch between browsing the groupings, and seeing a scrollable list of all distributions in the dataset?

We are also looking at filtering as a future enhancement for this as well.

Hi Sam, I think the proposed change is a good one! Just having the collapsible distributions toggle has already made a huge difference with navigating our large datasets. But this capability would make it even easier.

My main question would be around ensuring there is a quick/easy process to bulk update the dataset / dataset groups with many distributions? And editing the groupings afterwards.

Some other thoughts:

  • Would the summary hierarchy be present for all datasets in the registry, or would you enable it only if there were many distributions?

  • If you click on the highest level in the hierarchy, would it allow you to load/display all distributions in the dataset? This might be desirable in some cases.

  • And will downloads give you the option of downloading the entire dataset as well as particular groups only?

Cheers,
Shane

Good questions Shane, see my answers below:

We are looking at how to improve the bulk uploading and management of datasets, but that won’t be included in this release. What will be included is an easy tree-management view for adding and updating groups.

We’re going to enable it by default across all datasets on registries. We had considered a togglable view, but by making this the default it will make all datasets consistent, and we are working to make this a better view for datasets, regardless of how many distributions they contain.

That is the togglable option I mentioned above in response to Andrew. Users will be able to switch between “hierarchical view” and a “flat list view”. The reason here is there might be 1 or 2 ‘master’ distributions that need to be attached at the top level.
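
To illustrate the toggle, here is a minimal sketch that produces both views from the same nested structure. The dictionary shape (`name`, `distributions`, `groups`) and the function names are assumptions for illustration only, not the registry's implementation.

```python
def flat_list(group):
    """Depth-first list of every distribution name, ignoring the grouping."""
    names = list(group.get("distributions", []))
    for child in group.get("groups", []):
        names.extend(flat_list(child))
    return names


def hierarchical_lines(group, depth=0):
    """Indented lines: each grouping, then its distributions, then subgroups."""
    lines = ["  " * depth + group["name"] + "/"]
    for name in group.get("distributions", []):
        lines.append("  " * (depth + 1) + name)
    for child in group.get("groups", []):
        lines.extend(hierarchical_lines(child, depth + 1))
    return lines


def render(dataset, view="hierarchical"):
    """Switch between the two views based on the user's toggle."""
    if view == "flat":
        return flat_list(dataset)
    if view == "hierarchical":
        return hierarchical_lines(dataset)
    raise ValueError(f"unknown view: {view!r}")


# Illustrative data only, with one 'master' distribution at the top level.
dataset = {
    "name": "Example dataset",
    "distributions": ["master_index"],
    "groups": [
        {"name": "Health", "distributions": ["admissions", "episodes"]},
        {"name": "Housing", "distributions": ["tenancies"]},
    ],
}
```

Here the top-level `master_index` appears directly under the dataset in the hierarchical view and first in the flat view, which matches the case of a 'master' distribution attached at the top level.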

We will start with downloads still downloading the full item, but will look at how to add more configurability in future releases.


That would be great thanks Sam.

Hi Sam,

The changes look really interesting. What would the changes to the Data Set / Distribution model look like under the hood?

If I use the API to call for a Data Set or a Distribution, what changes can I expect? Optional groups of distributions within a given distribution?

Is there an example on a test server somewhere that I can query via a browser to have a peek? :slight_smile: