Data Fabric vs. Data Catalog

Organizations use a few different models to manage access to data, the most dynamic of which is the data fabric. To understand the unique advantages this approach offers, it’s helpful to understand how it fundamentally differs from one of the more common attempts to gain control over the unwieldy volume of data that organizations are currently collecting: the data catalog.

The data catalog emerged partly in response to the limitations of data warehouses, which attempt to collect all the data that’s relevant to the organization in a single location and then organize it so that it’s easy for analysts to find. The problem with this approach was that it required too much structure. A data warehouse is something like a public library, and it demands the attention to detail and process that one looks for in a librarian. When companies started grappling with Big Data, meticulously curating each data set before making it available in a central repository proved too slow, and uncurated data piled up.

The data catalog took a slightly different approach. Instead of collecting and neatly curating data, it attempted to overlay some structure on the data architecture that already existed within the organization by offering an automated means to apply metadata, or descriptive data, to the data assets. This metadata comes in two forms--technical metadata, like schemas and column names, and business metadata, such as the purpose for the dataset’s existence and its history of usage.
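To make the two forms of metadata concrete, here is a minimal sketch of how a single catalog entry might be modeled. The class names, field names, and sample values are all assumptions made for illustration; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    """Structural facts that can usually be read from the source system."""
    schema: str
    columns: list[str]


@dataclass
class BusinessMetadata:
    """Context about why the dataset exists and how it has been used."""
    purpose: str
    usage_history: list[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    """One data asset as a catalog might describe it (hypothetical shape)."""
    name: str
    location: str
    technical: TechnicalMetadata
    business: BusinessMetadata


# Made-up example values, purely illustrative.
entry = CatalogEntry(
    name="orders",
    location="warehouse.sales.orders",
    technical=TechnicalMetadata(
        schema="sales",
        columns=["order_id", "customer_id", "total"],
    ),
    business=BusinessMetadata(
        purpose="Revenue reporting",
        usage_history=["quarterly revenue dashboard"],
    ),
)
```

Note that the entry only describes the asset and points to where it lives; the data itself stays in the existing architecture.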

If a data warehouse is like a library, a data catalog is more like the search engine on a library website--it allows more flexible search, and also greatly expands what you can access so that you’re not just limited to information that’s been collected and curated. A data fabric takes it a step further, and acts more like the internet, which is not an actual collection of data, but rather a virtual access layer for info that might be located anywhere in the world. This distinction is important, as many organizations now have data that’s collected in siloed repositories scattered all over the world.

Let’s discuss the transition between data catalog and data fabric by exploring some of the similarities and differences between the two approaches.

How data catalogs and data fabrics are similar

Data catalogs and data fabrics have a few things in common, both conceptually and functionally. For example, both of these systems:

Collect as well as assign metadata for the data assets they contain
Make data assets searchable, both by the specific dataset, and by topic
Facilitate the ability to connect to the data sources, once they’re located
Are designed to support regulatory compliance by allowing easier identification and control over information that is sensitive, or otherwise likely to be required by a regulatory agency or lawsuit
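As a rough illustration of the shared capabilities listed above, the sketch below looks up assets by dataset name and by topic, and lists the assets flagged as sensitive. The in-memory catalog, its keys, and the function names are hypothetical; a real system would read this metadata from its own store.

```python
# Hypothetical in-memory catalog; every value here is made up.
CATALOG = [
    {"name": "orders", "topics": ["sales", "revenue"], "sensitive": False},
    {"name": "customers", "topics": ["sales", "crm"], "sensitive": True},
    {"name": "web_logs", "topics": ["traffic"], "sensitive": False},
]


def find_by_name(catalog, name):
    """Return the assets whose name matches exactly."""
    return [asset for asset in catalog if asset["name"] == name]


def find_by_topic(catalog, topic):
    """Return the assets tagged with the given topic."""
    return [asset for asset in catalog if topic in asset["topics"]]


def sensitive_assets(catalog):
    """Return the names of assets flagged as sensitive, e.g. for compliance review."""
    return [asset["name"] for asset in catalog if asset["sensitive"]]
```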

How data catalogs and data fabrics are different

While both of these systems leverage the capturing and applying of descriptive metadata to organize data assets and make them retrievable, they also differ in fundamental ways:

Passive vs. Active Metadata. A data catalog leverages a process for making sense of the passive metadata that’s either a part of the schema or added to a dataset by users from its inception forward. A data fabric, on the other hand, applies active metadata based on user behavior and input as the system evolves. As it does so, it discovers relationships and potential integration points between datasets that exist in different repositories, and catalogs these for future use.
No ETL. Many data catalog solutions require a significant amount of ETL (extract, transform, load), but a data fabric is a unified virtual data layer that overlays all your data repositories--Hadoop, Oracle, Snowflake, Teradata, etc.--and allows you to use the data where it exists, without requiring you to move it or upload it into a new system.
Covering all data architecture. While data catalogs tend to specialize in a few platforms, a data fabric provides the ability to integrate data through all standard data delivery methods, including streaming, ETL, replication, messaging, virtualization, and microservices.
Advanced visualization. While many data catalogs do provide the ability to visualize relationships in data assets, the data fabric model takes this a step further by offering a user-friendly way to both visualize and manipulate those relationships, based not just on columns and rows, but also additional information about how the data assets have been used in the past. It also provides more in-depth information about the completeness of the data, which allows the user to make more informed decisions. Furthermore, a data fabric offers the ability to visualize the data before loading it into an external analysis platform.
Unified access. The data fabric takes a step further toward true data democratization by unifying access to all data that exists within the organization. Any authorized user can access data assets without having to copy or move them.
Google-like search. Because a data fabric retrieves information about data that might be anywhere within the organization, it requires a more advanced approach to search. While a data catalog typically uses keyword-based search, a data fabric relies on semantic search of the kind used by major search engines such as Google, and integrates natural language processing to discern context and intent.
Central vs. department-level data management. While a data catalog is a major leap forward in terms of allowing access to a wider range of data assets, it still fundamentally relies on the traditional central data model, where a small team of experts is responsible for the entirety of an organization’s data. A data fabric, on the other hand, due to the fact that it’s a virtual layer over existing architecture, opens the possibility of departmental oversight of data assets. This has many advantages, not the least of which is that the central data team is freed to deal with bigger issues.
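The idea of active metadata, where relationships are inferred from user behavior rather than declared up front, can be sketched in a few lines. The example below assumes a hypothetical usage log in which each entry is the set of datasets a single query touched, and it treats pairs of datasets that are frequently used together as candidate integration points. The log format, the dataset names, and the threshold are assumptions for illustration only, not a description of how any specific data fabric works.

```python
from collections import Counter
from itertools import combinations

# Hypothetical usage log: each entry lists the datasets one query touched.
QUERY_LOG = [
    {"oracle.orders", "snowflake.customers"},
    {"oracle.orders", "snowflake.customers"},
    {"oracle.orders", "hadoop.web_logs"},
    {"snowflake.customers"},
]


def co_usage(log):
    """Count how often each pair of datasets appears in the same query."""
    pairs = Counter()
    for datasets in log:
        # Sort so that each pair is always recorded in the same order.
        for pair in combinations(sorted(datasets), 2):
            pairs[pair] += 1
    return pairs


def suggested_links(log, min_count=2):
    """Return pairs used together at least min_count times."""
    return [pair for pair, count in co_usage(log).items() if count >= min_count]
```

With the sample log, only the orders and customers datasets are used together often enough to be suggested as related, even though they live in different repositories.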

In sum, the data fabric can be viewed as the next step in the evolution of a data catalog. While there is admittedly some overlap in the continuum between data catalogs and data fabrics, you can differentiate between the two by looking at how they apply AI and NLP to allow more advanced classification of data, and also more advanced and user-friendly search features. If the system is limited to only certain data platforms, uses keyword search, can apply only the metadata that the schema and users supply about a data asset, and requires a degree of ETL, it’s a data catalog.

On the other hand, if it uses AI and NLP to apply metadata based on user behavior and intent, allows virtualized access to all data no matter where it resides, uses semantic search, and never requires data to be moved, it’s a data fabric.
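The rule of thumb above can be written down as a small checklist. This is only a sketch of the criteria as stated in this article; the trait names are invented here, and real products may sit anywhere along the continuum.

```python
from dataclasses import dataclass


@dataclass
class SystemTraits:
    """Hypothetical summary of a system, using the criteria from this article."""
    covers_all_platforms: bool  # virtualized access to data wherever it resides
    semantic_search: bool       # semantic/NLP search rather than keyword search
    active_metadata: bool       # metadata inferred from user behavior and intent
    requires_etl: bool          # data must be moved or loaded before use


def classify(traits):
    """Label a system as a data catalog, a data fabric, or something in between."""
    fabric_like = [
        traits.covers_all_platforms,
        traits.semantic_search,
        traits.active_metadata,
        not traits.requires_etl,
    ]
    if all(fabric_like):
        return "data fabric"
    if not any(fabric_like):
        return "data catalog"
    return "somewhere in between"
```

The third outcome reflects the overlap acknowledged above: a system that meets some but not all of the criteria does not fit cleanly into either category.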

If you’d like to learn more about how a data fabric can benefit your organization, read on.
