Monday, November 21, 2016

Data Catalog


Not to be confused with a “catalogue,” which is some form of ancient paper-based device, a “catalog” is a collection of metadata.  It is a directory of information that describes where a data set, file, or database entity is located.  It may also include additional information about the data, such as the producer, content, quality, condition, and any other pertinent characteristic.  It is a tool that allows analysts to find the data they need.  There may be solutions hidden in your data.  A data catalog will, at the least, tell you where to look.
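To make the idea concrete, here is a minimal sketch of a catalog as a list of metadata records that can be searched by keyword.  The class name, field names, and sample locations are assumptions made up for illustration; they are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One catalog record: metadata about a data source, not the data itself."""
    name: str
    location: str          # where the data lives (hypothetical URIs below)
    producer: str = ""
    description: str = ""
    quality: str = ""


def find_entries(catalog, keyword):
    """Return entries whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


# Illustrative entries only; the sources and paths are invented.
catalog = [
    CatalogEntry(
        name="retail_customers",
        location="sqlserver://branch-db/customers",
        producer="Branch Operations",
        description="Customer contact details used by bank managers",
    ),
    CatalogEntry(
        name="customer_profitability",
        location="file://finance-share/profitability.csv",
        producer="Finance Department",
        description="Per-customer revenue and cost figures",
        quality="reviewed quarterly",
    ),
]

matches = find_entries(catalog, "customer")
for entry in matches:
    print(entry.name, "->", entry.location)
```

The search returns locations and descriptions rather than the data, which is the point: the catalog tells you where to look.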


In any organization, data is collected and stored across different departments, in multiple databases, and in a variety of formats.  In banking, for example, the customer information that a bank manager sees isn’t the same as what the Finance Department sees.  In fact, the bank manager is likely unaware that a separate data source about their clients exists at all.  Registering these sources in a catalog makes people aware of data they may find useful.

Suppose you are at the library and you want to hold in your hands a map with information about Hole-in-the-Wall Falls in Oregon.  You could look at numerous maps and not find anything.  The first map you pick up may be a highway map.  If the catalog you are looking at has the map descriptions, it will save you a lot of searching.  The catalog may describe the map you are looking for as a topographical map showing hydrology for the state of Oregon, with the map being located at a specific library.  Now, instead of travelling from library to library looking through a variety of maps of Oregon, you can focus your attention on tracking down this single map with the information you need.

Microsoft’s Azure Data Catalog (“ADC”) is a fully managed service.  With ADC, when you register a data source, you can point to the source of that data and ADC will automatically extract structural metadata.  The source of the data does not have to be in the cloud.
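As a rough illustration of what “structural metadata” means, the sketch below reads table and column names from a database without touching the rows themselves.  It uses Python's built-in sqlite3 module against an in-memory database purely as a stand-in; it is not how ADC works internally, and the function name and sample table are assumptions.

```python
import sqlite3


def extract_structure(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    structure = {}
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        structure[table] = [(col[1], col[2]) for col in columns]
    return structure


# Stand-in data source; the table is invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, branch TEXT)")

structure = extract_structure(conn)
print(structure)
```

Only the shape of the data is collected here: which tables exist and what columns they have.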

Once registered, the catalog card can be used by anyone with access.  Others can then annotate it to enrich the metadata.  ADC allows for crowdsourcing of metadata in order to provide a catalog rich in details.  Tags can include, for example, descriptions of how the registered data can be used, pointing others toward solutions that might otherwise remain obscure.
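Crowdsourced annotation can be pictured as several people adding tags to the same entry, with the catalog keeping track of who contributed what.  The sketch below is a toy model under that assumption; the class, method names, and users are hypothetical and do not reflect ADC's actual interface.

```python
from collections import defaultdict


class AnnotatedEntry:
    """A catalog entry whose tags are contributed by many users."""

    def __init__(self, name):
        self.name = name
        self.tags = defaultdict(set)  # tag -> set of users who added it

    def add_tag(self, user, tag):
        # Normalize so "Finance" and "finance" count as the same tag.
        self.tags[tag.strip().lower()].add(user)

    def all_tags(self):
        return sorted(self.tags)


entry = AnnotatedEntry("customer_profitability")
entry.add_tag("alice", "Finance")
entry.add_tag("bob", "finance")
entry.add_tag("bob", "churn analysis")

print(entry.all_tags())
```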

Because the source of the data is registered in the Catalog, a user can connect directly to that data source through the catalog.  If the data is such that it shouldn’t be freely shared throughout an organization, ADC will allow the registrant to restrict access by defining ownership of the data and authorization requirements for access.
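The idea of ownership and authorization can be sketched as a check performed before the catalog hands out connection details.  This is a simplified illustration only; the class, its fields, and the user names are assumptions, and ADC's real permission model is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class RestrictedEntry:
    """A catalog entry with an owner and an explicit list of authorized users."""
    name: str
    location: str
    owner: str
    allowed_users: set = field(default_factory=set)

    def connection_for(self, user):
        """Return the data source location if the user is authorized."""
        if user == self.owner or user in self.allowed_users:
            return self.location
        raise PermissionError(f"{user} is not authorized to access {self.name}")


# Invented entry and users for illustration.
entry = RestrictedEntry(
    name="customer_profitability",
    location="file://finance-share/profitability.csv",
    owner="carol",
    allowed_users={"dave"},
)

print(entry.connection_for("dave"))

try:
    entry.connection_for("erin")
except PermissionError as err:
    print(err)
```

In this sketch the entry stays visible to everyone, so people still learn that the data exists; only the connection is gated.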

Organizations produce data at an enormous rate.  Storage for that data is likely to run the full gamut from an individual computer to the cloud, with locations anywhere on the planet.  This rapid growth of data and data sources makes a data catalog a valuable tool for putting that data within reach of everyone in the organization.  Through the use of ADC, you can actually find that needle in the haystack.

Some links to get you started:


You can find a series of “how to” links at the end of this Data Catalog intro article: https://azure.microsoft.com/en-us/documentation/articles/data-catalog-what-is-data-catalog/

