BLOG: Simplifying Data for AI with IBM Spectrum Discover

May 7th, 2021
Laura Tuller
Senior Systems Engineer – Storage Solutions


What is today’s most precious resource? A common answer is “oil”, but it can be argued that “data” is equally valuable. The two resources are similar: once you find a new “vein” (of oil or, in this case, data), it takes time and effort to mine and refine it before it is usable. Today’s world is digital and data-driven. IDC estimates that in 2025 the world will create and replicate 163 ZB of data (that’s 163,000,000,000,000 gigabytes), 10 times the amount created in 2016. Think for a minute about the number of smart and mobile devices creating videos, audio clips, emails, and images, and the ever-increasing resolution at which we capture them. Although data is an invaluable resource, businesses find the process of sorting, aggregating, and analyzing it so onerous and time-consuming that they struggle to use it effectively for making decisions and pivoting business strategies.


IBM Spectrum Discover – Simplifying Data Mining


It’s not about the data search…it is about the discovery that comes as a result of the search.

Announced in October 2018, IBM Spectrum Discover connects to file and object storage, both on-premises and in the cloud, to rapidly ingest, consolidate, and index metadata for billions of objects. Spectrum Discover goes beyond system-level metadata and allows users to create custom metadata that opens up ‘a whole new world’ for tagging and searching. Custom tagging enables data scientists, storage admins, and data stewards to classify, manage, and gain insights from massive amounts of unstructured data more efficiently. Users can add new tags as policies and can map filters to a key or a tag.

This isn’t a blog post on how to set anything up; rather, it showcases the simplicity of data mining. In just a few clicks, you can go from creating a tag to filtering the data you want that tag applied to. For example, to look for new video files, we could create a policy with a filter for files newer than seven days whose file type matches a video format. When a file meets this condition, the tag “newVideo” is set to “true” in the file’s metadata. Along with custom tagging, Spectrum Discover also provides an easy-to-use search engine and GUI (I’ll mention updates to these later), content inspection, a policy engine, and an application plugin API and SDK.
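To make the “newVideo” policy concrete, here is a minimal Python sketch of the same filter-then-tag logic. It does not use the Spectrum Discover API; the record layout, the `apply_new_video_policy` function, and the list of video extensions are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from pathlib import PurePath

# Hypothetical set of extensions treated as "video" in this sketch.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}
MAX_AGE = timedelta(days=7)


def apply_new_video_policy(records, now):
    """Set the tag newVideo="true" on each record that passes the filter."""
    for record in records:
        is_video = PurePath(record["path"]).suffix.lower() in VIDEO_EXTENSIONS
        is_new = now - record["mtime"] < MAX_AGE  # newer than seven days
        if is_video and is_new:
            record.setdefault("tags", {})["newVideo"] = "true"
    return records


# Made-up metadata records standing in for an indexed catalog.
now = datetime(2021, 5, 7)
records = [
    {"path": "/media/launch.MP4", "mtime": datetime(2021, 5, 5)},
    {"path": "/media/archive.mov", "mtime": datetime(2021, 1, 10)},
    {"path": "/docs/notes.txt", "mtime": datetime(2021, 5, 6)},
]
tagged = apply_new_video_policy(records, now)
```

Only the first record is both a video and newer than seven days, so it is the only one that receives the tag.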


Managing Sensitive Data


Finding and identifying files and objects that contain sensitive data is also becoming a bigger concern with the implementation of privacy regulations such as GDPR and California’s CCPA, which give individuals the right to have their personal data deleted (the ‘right to be forgotten’). How can a company be sure it has deleted every instance of a person’s data, including copies in backups? IBM Spectrum Discover can connect to IBM Spectrum Protect and IBM Spectrum Archive to discover, index, and label files in backups and archives. It can then detect and extract information from more than a thousand different file types and use pattern matching to detect Personally Identifiable Information (PII) such as social security numbers, credit card numbers, or email addresses, giving you greater confidence that you’re in compliance with the law.
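As a rough illustration of what pattern matching for PII can look like, the Python sketch below scans text with regular expressions. The patterns are deliberately simplified assumptions, not the rules Spectrum Discover ships with, and the `find_pii` helper is hypothetical.

```python
import re

# Simplified patterns for illustration only; production rules are stricter.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_pii(text):
    """Return a dict mapping each PII type to the matches found in text."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


# Made-up sample text containing one instance of each PII type.
sample = "Contact jane.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
found = find_pii(sample)
```

A real scanner would add validation on top of this (for example, a checksum test on card numbers) to cut down on false positives.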


Data Optimization


Companies can also keep control of their “spend and sprawl” by utilizing the data optimization portion of Spectrum Discover. This includes organizing the data, tiering the data based on frequency of access (i.e., dictating by policy where to store “hot” data versus “cold” data), and global deduplication to ultimately reduce the size of the data set. NetApp, Dell EMC Isilon, Windows SMB, and Amazon S3 storage can all be scanned by Spectrum Discover to provide a multi-cloud, multi-storage catalog and index repository of information. Why have five copies of last year’s grocery list (hand sanitizer, Clorox wipes, and TP) sitting on top of a pile of current bills that are urgent and need to be paid, when instead you can keep just one copy and move it to the back of a drawer, in a cabinet, in case you ever really want to access it and relive 2020 again?
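The two ideas in this section, tiering by access frequency and spotting duplicate copies, can be sketched in a few lines of Python. This is not how Spectrum Discover implements them; the 90-day “cold” threshold, the helper names, and the in-memory file contents are assumptions chosen to keep the example self-contained.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta

COLD_AFTER = timedelta(days=90)  # assumed threshold for "cold" data


def choose_tier(last_access, now):
    """Classify data as "hot" or "cold" based on time since last access."""
    return "cold" if now - last_access > COLD_AFTER else "hot"


def find_duplicates(files):
    """Group paths by SHA-256 of their content; keep groups with 2+ paths."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


# Made-up files: two identical grocery lists and one unrelated file.
files = {
    "/a/list.txt": b"hand sanitizer, wipes, TP",
    "/b/list_copy.txt": b"hand sanitizer, wipes, TP",
    "/c/bills.txt": b"electric bill",
}
duplicates = find_duplicates(files)
tier = choose_tier(datetime(2020, 4, 1), datetime(2021, 5, 7))
```

Here the two grocery lists are reported as one duplicate group, and a file last touched in April 2020 lands in the cold tier.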


New Features of IBM Spectrum Discover


2021 has brought some big announcements from IBM on new features and functionality of Spectrum Discover that make it a much more competitive product in the marketplace:

  • Updated GUI with new “hamburger” menu for better flow/organization of menu tiers
  • Deployment on Red Hat OpenShift, which can reduce memory requirements by as much as 50% compared to traditional VM deployments
  • Upgraded searching with Query Builder. This is a search bar that builds queries in real time as a user clicks items to add. It’s visible throughout the user’s entire journey, and the query can be copied or referenced in the future. This allows for a more natural-language search without the need for a programming/query language. There’s also a query summary showing a user’s tag value selection count and potential record count, which can help to further fine-tune or broaden a search.
  • Ability to import tags from COS/S3 in order to extend records with new information or when migrating from another data management system so you don’t need to reinvent the wheel
  • Data movement with IBM Spectrum Scale Active File Management (AFM). Spectrum Discover metadata and policies provide more efficient data movement between Spectrum Scale and IBM Cloud Object Storage (iCOS)
  • Moonwalk integration!!!!! (Yes, all of those exclamation points are intentional and necessary.) Saving the best for last, Moonwalk is the first third-party data mover to be certified for use with Spectrum Discover. It provides the ability to move data from third-party storage, object stores, and cloud endpoints. Discover calls the shots, Moonwalk moves the data.


Harness the Value of your Data


Data management tools should not just be part of a data governance strategy but should also be part of a long-term business roadmap. Don’t feel like you’re drowning in data – let Spectrum Discover throw you a life vest so you can get back in control of your ship and steer the direction of your company.

Our storage and data specialists at Mainline can work with you to see how Spectrum Discover can be used to harness the value of unstructured data for competitive advantage and help you to achieve critical data insights. Reach out to your Mainline Account Executive directly or contact us for a demo or to answer any questions.

