Everything you do with data depends on your ability to access metadata. Put simply, metadata is data that describes other data, and it’s the metadata that tells you what’s in a particular database, or table within a database. Without that, you’re sailing blind in a massive sea of data. With that in mind, let’s start our discussion of a core component of a well-functioning enterprise data management system: the data catalog.
There are a number of systems used to organize and manage data--databases, data lakes, data warehouses, etc.--which may reside on-premises in your own facilities or in rented infrastructure provided by cloud services.
Generally speaking, if you think of a data warehouse as being like a library, then a database is like a book within that library, and the tables within each database are like chapters in the book. You could have every great work ever printed collected in a library, but without an organized system for cataloging its contents it might take months to hunt down a specific title. This is where the data catalog comes in--it's like the library catalog that captures the high-level information about each dataset and where it is located.
At a high level, it’s as simple as that. A data catalog provides a systemized way of assigning metadata to the contents of all of your organization’s data repositories so that you can quickly locate and extract the datasets that you need for your analyses. By doing so, it makes it possible for analysts, data scientists, developers, business users, and anyone else who needs data to find it.
Applying the metadata: the ‘what, why, when and where’ of the data asset
So what do you do with a data catalog once you’ve got one? You use it to search for data, and also to add metadata to your data assets. As the ability to search for data depends on good use of metadata, let’s talk a little bit about metadata first.
Metadata falls into two broad categories: technical and business. Technical metadata has to do with the storage system where the dataset resides, and also the ‘schema’ of the database, which describes how the database is logically structured and the ‘rules’ set forth for the data that resides in it. So, technical metadata might include things like:
The name of the data asset as well as a description of it
For a relational database, table names, column data types, views, etc.
Information about where it is stored (details about the server, cloud vendor, data lake, etc.)
Technical metadata, not surprisingly, tends to be structured according to strict rules, and much of it is assigned when the data asset is created and stored. Business metadata, on the other hand, may be a little less formal, and much of it may be generated as users interact with the data itself.
Business metadata might be described in “tags” that provide details regarding the history and use of the data asset. The data catalog consolidates all this information--business and technical metadata--in a single location, and makes it discoverable.
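To make this a bit more concrete, here is a minimal sketch of what a single catalog entry might look like once both kinds of metadata are attached. The field names (asset_name, location, tags, and so on) and values are purely illustrative and don't reflect the format of any particular catalog product:

```python
# A hypothetical catalog entry for one dataset, combining technical metadata
# (set when the asset is created and stored) with business metadata
# (tags and ownership details that accumulate as people use the data).
catalog_entry = {
    # --- Technical metadata ---
    "asset_name": "customer_orders",
    "description": "One row per order placed through the web storefront",
    "location": {"system": "snowflake", "database": "sales", "schema": "prod"},
    "columns": [
        {"name": "order_id", "type": "INTEGER"},
        {"name": "customer_id", "type": "INTEGER"},
        {"name": "order_date", "type": "DATE"},
        {"name": "total_amount", "type": "DECIMAL(10,2)"},
    ],
    # --- Business metadata ---
    "owner": "sales-analytics team",
    "tags": ["orders", "revenue", "refreshed-daily", "contains-pii"],
    "last_updated": "2024-01-15",
}
```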
Making the data discoverable through search
As mentioned previously, once a data asset is cataloged with proper metadata it becomes ‘searchable’ or findable by the user. To be clear, most data catalogs don’t index the data itself--just the metadata that describes it. Just like you don’t normally pull a book out of a library catalog, you don’t necessarily retrieve the dataset itself from the data catalog, just the information about it (including its location). This, however, is something that is changing, and if you want to learn how virtualization is making it possible to skip steps and retrieve data directly from a data catalog platform, check out this article.
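Conceptually, a catalog search is just a scan over those metadata records. The toy function below (a sketch, not any vendor's API) matches a keyword against names, descriptions, and tags, and returns the location of each hit rather than the data itself:

```python
def search_catalog(entries, keyword):
    """Return (name, location) pairs for entries whose metadata mentions the keyword.

    Only the metadata is scanned; the underlying tables are never read.
    """
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        haystack = " ".join(
            [entry["asset_name"], entry["description"], *entry.get("tags", [])]
        ).lower()
        if keyword in haystack:
            hits.append((entry["asset_name"], entry["location"]))
    return hits

# Usage with the catalog_entry sketched earlier: the result tells you where
# the data lives, not the data itself.
# search_catalog([catalog_entry], "revenue")
# -> [('customer_orders', {'system': 'snowflake', 'database': 'sales', 'schema': 'prod'})]
```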
What are the minimum expectations of a data catalog?
There are a lot of solutions on the market, so a fair question to ask is, what are the minimum requirements I should expect from a data catalog vendor? At a minimum, the platform should:
Make data searchable, both by specific dataset, and by topic
Compile metadata into a user-friendly format that can be leveraged by technical and non-technical users alike to explore and understand the data sources
Provide a logical path to the data so that the user can easily connect to it
Support data governance, including the identification of sensitive information
Of course, you don’t want to solely focus on minimum requirements--you want to think about the future, and the expandability of your data architecture. So ask yourself questions like:
Does the system add sufficient value to the data? For example, crowdsourcing can now be used to enrich metadata in a way that greatly enhances not only the discoverability of the data, but the ability to quickly tell whether the dataset will suit the purposes of the analytics question.
Does the system reach across all data sources, regardless of the system or geographical location? As we mentioned before, traditional data catalogs have been thought of as systems that tell you where the data is and what’s in it, but don’t necessarily deliver the data for you. The next phase in data catalog evolution, however, involves creating a unified virtual data layer that sits on all your data repositories--Hadoop, Oracle, Snowflake, Teradata, etc.--and allows you to retrieve the data as quickly as you locate it. Ideally, this should be done without actually having to move the data--once again, the magic of virtualization.
Is it scalable? This is always a critical question in the era of big data, where data assets are piling up faster than most organizations can handle.
Does it allow analysts to quickly connect with the analytics tools of their choice? Let’s face it, data analytics is now a race, and the company that can make fast decisions based on sound data will have a major leg up on the competition. Single click integration with popular analytics tools like Looker, Tableau, etc., as well as machine learning environments that allow you to leverage R and Python, will greatly speed this process.
And finally, what exactly does the platform mean by 'search'? A basic solution will provide keyword search, but this does nothing more than rudimentary keyword matching and will bring up a lot of useless information. Moving forward, think about the kind of semantic search experience that Google provides, and how it integrates natural language processing to discern context and intent. Even better, think about Generative AI-based platforms like ChatGPT, which allow you to communicate with a system just like you'd talk to a person.
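To see why rudimentary keyword matching falls short, here's a deliberately simple illustration with made-up dataset descriptions: a literal match on 'revenue' misses a table described as 'sales income', even though it's exactly what the analyst is after.

```python
descriptions = {
    "customer_orders": "One row per order placed through the web storefront",
    "monthly_financials": "Aggregated sales income and expenses by month",
}

# Rudimentary keyword matching: only literal substring hits are returned,
# so the financials table is missed even though "sales income" means revenue.
hits = [name for name, desc in descriptions.items() if "revenue" in desc.lower()]
print(hits)  # [] -- a semantic or NLP-aware search would surface monthly_financials
```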
In summary, a data catalog at its most basic level is a central place where you collect all the information, or ‘metadata’, about all of your organization’s datasets--it’s where a user goes to find out the ‘where, why, when, what’ of the data. But it’s also a space that’s rapidly expanding. The solutions that will meet tomorrow’s business needs will also allow you to quickly access and analyze data that might be scattered across highly distributed and fragmented data architectures.
If you'd like to learn about how Promethium is taking data catalogs to the next level with Generative AI, read on.