BLOG: Data Management Using IBM Cloud Pak for Data and AI

Brad Miller
Practice Director, Information & Analytics

Back in the 1990s, when physical data warehouses expanded rapidly alongside the growth of enterprise data, the vision was often to move or replicate all relevant information into a single source of truth. With the advent of more diverse data sources, more complex applications, and the need to process information in near real time, legacy data warehouses and data lakes are adapting to the times. It is not a new concept that, to make the most of an organization’s data assets, one must have a common understanding and definition of concepts and terms and of how they apply to the actual physical assets.

Minimizing Data Replication

With the advent of logical data warehouses and data hubs, organizations can minimize the data being replicated into data analytics environments. The challenge is that knowledge of the physical assets is now more distributed and requires a broader understanding of data and business context for success. Leveraging the IBM Cloud Pak for Data – Data Virtualization solution, many organizations have been able to cut integration and maintenance time significantly. While many have succeeded, others have struggled to realize the envisioned value proposition. The organizations that struggle share one common theme: the inability to align the organization around a common understanding of the key business terms and, thus, of the data assets. The challenge comes down to semantics.

Semantics: Clearly Defining Business Terms for Data Management

Semantics by definition is focused on the meaning behind a word or concept. Modern business concepts that are often assumed to be simple constructs, like “Customer”, are often much more complex than they look on the surface. Understanding an organization’s definitions is key at any level, but particularly within the data analytics and artificial intelligence (AI) realms of data science. For example, in banking, a retail banking customer may be different from a customer in the wealth management or loan divisions. If an executive asks for the number of active customers, clarification is needed to determine the correct breadth of customers, not to mention the parameters of “active”.

By clearly defining business terms and arriving at a master data management strategy, organizations can mitigate confusion, lack of trust, and ultimately the loss of value of the organization’s information assets. This is not the most glamorous endeavor, and it often takes time and effort. It then requires care and feeding, because once a foundation is established, the business may change in some way that forces some portion of the semantic layer to be refactored. In the past, this was a barrier to the adoption of master data management. It often caused organizations to rush through the process, or to stop maintaining the metadata and let it decay over time, quickly diminishing its value.

Mainline has created a number of accelerators to help customers build this metadata layer and then maintain it with minimal investment and ongoing management. New and innovative solutions such as IBM Cloud Pak for Data have drastically reduced this challenge by providing a unified data platform that can collect and manage the organization’s data assets and metadata and apply workflows and business rules to them. It can do so with the same code base whether the ecosystem runs on private cloud, public cloud, or multi-cloud. This means you could have parts of the ecosystem running on Salesforce, AWS, Azure, Google, and other private third parties and still maintain a unified code and skill base, thus reducing complexity and cost.

Semantic Enablement – IBM Cloud Pak for Data

It is true that no solution or platform can remove all of the complexity and human interaction from a business semantic layer and data modeling process. With embedded AI and machine learning, however, IBM tools automate significant parts of building and maintaining the data ecosystem that are still manual with other vendor solutions.

The IBM Cloud Pak for Data Platform includes functionality such as:

Auto Discovery, Auto Tagging, Classification, and Industry Accelerator content, which empower users to identify, categorize, define, and maintain business and technical objects and attributes. The automation is powered by internal Watson AI processes that continually learn and evolve to better meet business needs.
Monitoring of changes to the metadata as well as to the data itself, with alerts to key organizational stakeholders when changes occur.
Application of policies and rules, many of which are prebuilt within the solution and can be applied automatically.
Embedded AI (Watson) that learns the organizational nuances in terms and data artifacts, allowing the system to become more and more accurate in both onboarding and maintaining the data ecosystem, metadata, and business terms.

Some may be thinking, “I thought this was a topic just related to business glossaries, terms, and applying the right meaning to connect the different perspectives within an organization.” This is a correct assumption, but context is often one of the keys to correctly defining terms like “Customer” or “Product” or “Account ID.” To shorten the long and drawn-out process of hashing out definitions for every data artifact, whether of high or lesser importance, IBM has leveraged AI and contextual awareness, looking at the technical metadata and the data itself to help automate a larger portion of the work.

Use Case: How Embedded Artificial Intelligence Saves Time and Money

The following is an example of suggestions for an attribute called “Utility.” There are no terms currently assigned, but you can see the AI suggestions on the right-hand side along with its confidence in each match. In this case, 71% is not high confidence, but if there were other attributes such as “Utility Name”, “Utility”, “Utilities”, etc., these would show up as well, and with a higher confidence in the match. The user can set a threshold above which matches are auto-assigned and then manually approve the matches that fall below the threshold.
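The threshold behavior described above can be illustrated with a short sketch. This is not the Cloud Pak for Data API. The `TermSuggestion` type, the `triage` function, the “Utility Provider” term, the 0.93 score, and the 0.80 threshold are all hypothetical values chosen for illustration. Only the 71% confidence for “Utility” comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermSuggestion:
    """A hypothetical AI suggestion linking an attribute to a business term."""
    attribute: str     # attribute (column) name found during discovery
    term: str          # suggested business term
    confidence: float  # match confidence between 0.0 and 1.0


def triage(suggestions, auto_assign_threshold):
    """Split suggestions into auto-assigned matches and matches needing review."""
    auto_assigned, needs_review = [], []
    for suggestion in suggestions:
        if suggestion.confidence >= auto_assign_threshold:
            auto_assigned.append(suggestion)
        else:
            needs_review.append(suggestion)
    return auto_assigned, needs_review


# Illustrative data: only the 0.71 score is taken from the example in the text.
suggestions = [
    TermSuggestion("Utility", "Utility Provider", 0.71),
    TermSuggestion("Utility Name", "Utility Provider", 0.93),
]

auto_assigned, needs_review = triage(suggestions, auto_assign_threshold=0.80)
```

With these assumed values, the “Utility Name” match would be assigned automatically, while the 71% match for “Utility” would wait for a data steward to approve or reject it.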

Here is a view of the Auto Discovery functionality within IBM Cloud Pak for Data. The Cloud Pak for Data solution gives the user the ability to assess the object (in this case a Db2 table) at different levels, from Column and Term to the actual Data Profiling and Quality analysis.

The Auto Discovery within Cloud Pak for Data leverages internal AI programming (Watson) to score its assumptions about each attribute and entity based on the data housed within. Where a lack of context or an inability to determine data types prevents the AI from categorizing or aligning the data, the assignment can still be made manually, and an incorrect suggestion can be overridden. Even partially automating the process saves time, effort, and organizational cost.

Often the challenge of identifying and managing data and metadata is based on different divisions, subsidiaries, or outside entities that leverage shared terms but apply different meanings. Within the IBM Cloud Pak for Data solution, the terms have contextual relevance such that a term may have a different meaning or classification between two divisions or groups/data domains within an organization.
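One way to picture context-dependent terms is a glossary keyed by both term and data domain. The sketch below is a conceptual illustration only, not how Cloud Pak for Data stores its glossary. The `GLOSSARY` contents, the domain names, and the `resolve_term` helper are hypothetical.

```python
from typing import Optional

# Hypothetical glossary: (term, domain) -> definition.
# A domain of None holds the enterprise-wide default definition.
GLOSSARY = {
    ("Customer", None): "Any party holding at least one open account.",
    ("Customer", "Retail Banking"): "An individual with an open deposit account.",
    ("Customer", "Wealth Management"): "A household with assets under management.",
}


def resolve_term(term: str, domain: Optional[str] = None) -> str:
    """Return the domain-specific definition, falling back to the default."""
    if (term, domain) in GLOSSARY:
        return GLOSSARY[(term, domain)]
    if (term, None) in GLOSSARY:
        return GLOSSARY[(term, None)]
    raise KeyError(f"No definition found for term {term!r}")


retail_definition = resolve_term("Customer", "Retail Banking")
wealth_definition = resolve_term("Customer", "Wealth Management")
loans_definition = resolve_term("Customer", "Loans")  # falls back to the default
```

The fallback to an enterprise-wide default is a design choice in this sketch: a division without its own definition inherits the shared one rather than failing.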

Instilling Trust in Master Data

Cloud Pak for Data lineage allows for tracing terms and artifacts through their business and data processing lifecycle and back to their source(s). The user can see the grouping of associated categories, terms, and technical artifacts, understand the business context, and then trace back to the source of record, thus validating that the data asset is being used in the correct context. This leads to expanded trust in and usage of data within the organization and allows conflicts to be resolved based on a detailed understanding of the origin of each data element. This functionality also allows the data consumer community to get more involved in maintaining the ecosystem and spreads the workload that AI cannot manage across a much wider group. Many hands make lighter work. This group often has a deeper understanding of the data artifact or term and, in the end, provides a higher quality data ecosystem.
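Conceptually, tracing an artifact back to its source of record is a walk over a graph of upstream dependencies. The following minimal sketch assumes a hypothetical `UPSTREAM` mapping and made-up artifact names. It illustrates the idea only and does not reflect the product’s lineage model or API.

```python
# Hypothetical lineage graph: each artifact maps to the artifacts it is built from.
UPSTREAM = {
    "report.active_customers": ["mart.customer_summary"],
    "mart.customer_summary": ["crm.customer", "core_banking.account"],
    "crm.customer": [],
    "core_banking.account": [],
}


def trace_sources(artifact, upstream):
    """Return the set of root sources that feed the given artifact."""
    sources = set()
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against revisiting shared or cyclic dependencies
        seen.add(node)
        parents = upstream.get(node, [])
        if parents:
            stack.extend(parents)
        else:
            sources.add(node)  # nothing upstream, so this is a source of record
    return sources


sources = trace_sources("report.active_customers", UPSTREAM)
```

In this assumed graph, the report traces back to two systems of record, which is the kind of answer a data consumer needs to confirm that a figure is being used in the right context.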

Here is a visualization of the mixed business and technical artifacts within a lineage model. Terms, categories, relationships, and business rules would be visualized within the dynamic and interactive diagram.

Data Virtualization

The Cloud Pak for Data platform also brings an embedded data virtualization solution to an organization, which couples the business terms and insights back to the physical data artifacts. This combination of semantics and data virtualization is key to driving democratization of analytics and AI within an organization. A comprehensive and unified solution is optimal for building and maintaining information insights in this new world. Whether your focus is Artificial Intelligence, Data Governance, Data Warehousing, Big Data, Analytics, and/or hardening Open Source code within the organization, IBM Cloud Pak for Data is the comprehensive solution. IBM Cloud Pak for Data is designed so that one code base can be deployed no matter where the workload needs to run (private cloud, public cloud, or even a multi-cloud deployment). To differentiate the platform further, Watson intelligence is continually learning and automating more of the workload, so organizations spend less and less time and cost maintaining the value of the data artifacts and relationships. IBM has set a new bar in the data ecosystem and AI platform arena with Cloud Pak for Data. With its continued evolution and the Red Hat OpenShift foundation backing the platform, IBM further distances itself from the competition.

More Information:

Mainline offers a comprehensive portfolio of data management, governance, integration, business analytics, AI, and machine learning solutions. For more information, contact your Mainline Account Executive directly or click here to contact us with any questions.
