First, let’s define what we mean by a ‘data fabric’. As this is often confused with a related topic, the ‘data mesh’, we’ll define the data fabric by exploring the similarities and differences between the two concepts.
What is a real data fabric, and why is it important?
Both the data fabric and data mesh were designed to deal with the reality of highly distributed and siloed data within large organizations. These datasets are distributed both in terms of governance and authority, meaning that they are produced and managed by different entities within an organization, and held to standards that might differ based on the nature of the department that owns the data.
All of this creates a major obstacle for organizations’ efforts to become data-driven. When a question is posed, data analysts spend most of their time locating and assembling the data. They are forced to be ‘data detectives’ first, and analysts second. Often, by the time they’ve got the data-driven answer, the question has changed substantially.
With that said, the data fabric and the data mesh can be distinguished from each other in several ways.
Technology solution vs. architectural approach. A data mesh is best understood as an architectural guidebook and organizational approach for a design that is essentially technology agnostic. It holds that data should be owned locally rather than at the enterprise level, based on the idea that the departments most likely to use the data are best equipped to be its caretakers. This local ownership is guided by overarching organization-wide rules that promote the governance and sharing of the data between departments.
A data fabric, on the other hand, is a technology solution for creating a metadata-driven virtualized layer that sits on top of a highly distributed or federated data architecture. This is sometimes referred to as an orchestration layer, as it is essentially organizing elements that are already in existence.
A data fabric starts by observing existing metadata, logs, and history of user interaction, in order to define the data’s usage. It then utilizes this broad set of information to inform recommendation and orchestration. As such, it’s not technology agnostic. To the contrary, it relies on the selection of technology with very specific features, which we’ll discuss shortly.
Iterative historical approach vs. original intent. In line with its localization-focused approach, a data mesh treats the source from which the data originates as the main factor in determining how the data products built on that source evolve.
This sort of bottom-up approach differs from the data fabric, which assumes that the experience and expertise of the users across many departments who may be accessing, using, and reusing the datasets is just as valuable as the data’s source of origin. Data is defined over time based on what people are actually doing with it, and this informs how the system makes its recommendations over the long term. The fabric continually tracks data use cases, treating them as contributions to an ever more refined understanding of the data.
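To make the idea of usage-defined data a little more concrete, here is a minimal sketch of how a system might derive recommendations from observed behavior. Everything in it is an assumption for illustration: the access log format, the dataset names, and the `co_usage_counts` and `recommend` helpers are hypothetical, and a real fabric would draw on far richer signals than same-session co-usage.

```python
from collections import Counter
from itertools import combinations

# Hypothetical access log: (user, session_id, dataset) tuples.
ACCESS_LOG = [
    ("ana", 1, "sales.orders"),
    ("ana", 1, "crm.customers"),
    ("ben", 2, "sales.orders"),
    ("ben", 2, "crm.customers"),
    ("ben", 2, "finance.invoices"),
    ("cho", 3, "hr.headcount"),
    ("cho", 3, "finance.invoices"),
]


def co_usage_counts(log):
    """Count how often each pair of datasets is used in the same session."""
    sessions = {}
    for _user, session_id, dataset in log:
        sessions.setdefault(session_id, set()).add(dataset)
    pairs = Counter()
    for datasets in sessions.values():
        for a, b in combinations(sorted(datasets), 2):
            pairs[(a, b)] += 1
    return pairs


def recommend(dataset, log, top_n=3):
    """Return the datasets most often used alongside `dataset`, best first."""
    scores = Counter()
    for (a, b), count in co_usage_counts(log).items():
        if a == dataset:
            scores[b] += count
        elif b == dataset:
            scores[a] += count
    # Sort by count (descending), then by name, so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _count in ranked][:top_n]


print(recommend("sales.orders", ACCESS_LOG))
```

The point of the sketch is that nothing about the relationship between `sales.orders` and `crm.customers` was declared at the source; it emerges from what users actually did with the data.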
Automation vs. manual design. A data mesh can be best understood as a framework for the manual design and ongoing maintenance of a data orchestration system. At its core, it’s really a set of policies and practices for the management of data within a large organization. A data fabric, on the other hand, relies on augmented data management and automated orchestration technologies, seeking to minimize the need for human intervention at all levels.
What elements do you need in a data fabric?
As previously noted, the data fabric relies heavily on augmented data management and automation at all levels of design. While a data mesh can be considered technology agnostic, the kinds of technologies and features employed in a data fabric are critical. Here are a few of the top features of a data fabric:
An AI/machine-learning-augmented process for accumulating passive metadata, that is, for making sense of the metadata that was initially applied to each dataset. From that point, the fabric begins to actively apply metadata based on user behavior and input as the system evolves. As it does so, it discovers relationships and potential integration points between datasets that exist in different repositories, and catalogs these for future use
Ability to integrate data through all standard data delivery methods, including streaming, ETL, replication, messaging, virtualization or microservices
Capabilities for visualizing relationships between data sources in a user-friendly manner
Unified data access: authorized users from any department need to be able to access data from any part of the organization, without having to copy it or move it
Semantic search: users need to be able to search for datasets just like they might search for a local movie theater on Google
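As a rough illustration of the search capability in the list above, the sketch below ranks catalog entries against a plain-language query. It is deliberately a toy: the catalog, the synonym table, and the `tokens` and `search` helpers are all hypothetical, and simple keyword overlap with a hand-written synonym map stands in for the NLP that genuine semantic search would require.

```python
import re

# Hypothetical catalog entries: dataset name -> free-text description.
CATALOG = {
    "sales.orders": "Customer purchase orders with revenue and order date",
    "crm.customers": "Customer contact details and account owner",
    "hr.headcount": "Employee headcount by department and month",
}

# Hypothetical synonym table standing in for real language understanding.
SYNONYMS = {"staff": "employee", "clients": "customer", "income": "revenue"}


def tokens(text):
    """Lowercase, split on non-letters, and map synonyms to a canonical word."""
    words = re.findall(r"[a-z]+", text.lower())
    return {SYNONYMS.get(word, word) for word in words}


def search(query, catalog):
    """Rank datasets by how many query terms match their name or description."""
    query_terms = tokens(query)
    scored = []
    for name, description in catalog.items():
        overlap = len(query_terms & tokens(name + " " + description))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically by dataset name.
    return [name for _overlap, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


print(search("staff by department", CATALOG))
print(search("clients income", CATALOG))
```

Note that the user never needs to know that headcount data lives in a repository called `hr`, or that the column vocabulary says ‘employee’ rather than ‘staff’.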
As you may have guessed, the data fabric and data mesh concepts aren’t mutually exclusive. Interestingly, it has been asserted that, to be successful, a data mesh needs to leverage the same principles of discovery and analysis that a fabric leverages. That may be another way of saying that, if a data mesh is to work, it needs to leverage a data fabric.
Why some products don’t qualify as data fabrics
Of the previously listed capabilities, the last two are perhaps the most problematic for most attempts at providing a true data fabric.
If you need to move the data, it isn’t a ‘fabric’. Regarding unified access, while it is true that many data catalogs and other software platforms offer a single access point for the entirety of a company’s data, the tricky part is doing that without making users copy or port the data into a new platform. There are a number of reasons why moving data is problematic, such as the time it takes and the risk of ending up with multiple non-synchronized versions of the same data (there goes your ‘single version of the truth’). Accessing the data involves retrieving it, but it shouldn’t involve moving it. In other words, ‘ETL’ should not be required.
If you need SQL programming skills to get data, it isn’t a ‘fabric’. Regarding the need for semantic search, the truth is that SQL, while vital, is also a major obstacle to data access.
Of course the obvious point is that only those with SQL chops have the keys to the data kingdom. But it doesn’t just create problems for people who can’t code. Even people with serious SQL skills may have difficulty coding up a federated query that joins data from multiple repositories. Solutions that don’t heavily leverage NLP in the process of searching and querying data won’t provide the simplification needed to make a data fabric effective and practical.
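For a sense of what even a simple federated query demands of the person writing it, consider the sketch below. It is only an illustration under stated assumptions: two in-memory SQLite databases attached to one connection stand in for separate repositories, and the `crm` and `sales` schemas, their tables, and the sample rows are all hypothetical. Real federated queries typically span different engines and are considerably harder to write.

```python
import sqlite3

# Two in-memory SQLite databases stand in for separate repositories.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

# Hypothetical tables and sample rows, one table per "repository".
conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)

# The author must know both schemas, the join key, and the SQL itself.
FEDERATED_SQL = """
SELECT c.name, SUM(o.amount) AS total
FROM crm.customers AS c
JOIN sales.orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total DESC
"""

rows = conn.execute(FEDERATED_SQL).fetchall()
print(rows)
```

Even in this toy case, the query only works if the author already knows where each table lives and that `customer_id` in one repository corresponds to `id` in the other, which is exactly the kind of knowledge a fabric is meant to supply on the user’s behalf.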
If you’d like to learn more about how NLP, semantic search, and ETL-free technology can be leveraged to power a data fabric, read on.