Many customers keep their data of record on the mainframe, and moving that data to distributed applications is both difficult and expensive. Products must be installed to extract and transform the mainframe data, and precious mainframe capacity (MIPS) is consumed extracting it and delivering it to distributed platforms. The data is then transferred between systems, sometimes multiple times. With each transfer the data ages, and aggregations become susceptible to errors. Staff are also needed to maintain the products and the constantly changing transformation logic.
The data lake has emerged to address this problem. It is used to augment the data warehouse: data lakes provide a quicker means of delivering data, and the premise behind them is to land data in its raw format.
What is Apache Spark™?
Apache Spark™ is an in-memory analytics engine coupled with an optimized data access and abstraction service. It has emerged from the Apache Software Foundation as a lightning-fast cluster computing engine. Per the Apache Spark™ open source website, more than 1,000 organizations are using Apache Spark™. Its proven speed far exceeds that of Hadoop, and it enables highly iterative analysis across many data sources and large volumes of data.
The Marriage between Mainframe Data Access and Apache Spark
So how does this all fit together? IBM and Rocket Software have collaborated, using Apache Spark™ and a proprietary “optimized data layer” from Rocket Software, to enable access to mainframe data such as DB2, CICS, VSAM, IDMS, sequential files and other data types, as well as distributed data such as Oracle and SQL Server. It can also access Hadoop data, enabling a complete view of the customer. This data is exposed as SQL result sets. With the z/OS Platform for Apache Spark (referred to in this blog as Spark), the data is left on the mainframe until it is queried. Access is more efficient when the data is queried directly from data lakes, reporting tools and data marts.
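As a rough illustration of what "exposed as SQL result sets" can look like from the Spark side, the sketch below reads a virtual table through Spark's generic JDBC data source. It is a minimal sketch under assumptions, not the product's documented interface: the JDBC URL, driver class name, table name and credentials are hypothetical placeholders, and the real values would come from the data layer's own documentation.

```python
from typing import Dict


def jdbc_options(url: str, table: str, driver: str,
                 user: str, password: str, fetch_size: int = 1000) -> Dict[str, str]:
    """Build the option map for Spark's built-in JDBC data source."""
    return {
        "url": url,
        "dbtable": table,      # table (here: a virtual table) to read
        "driver": driver,      # JDBC driver class name
        "user": user,
        "password": password,
        "fetchsize": str(fetch_size),  # rows fetched per round trip
    }


def read_virtual_table(spark, options: Dict[str, str]):
    """Return a DataFrame over the table; Spark fetches rows lazily,
    when an action such as show() or count() runs."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical placeholder values, for illustration only.
opts = jdbc_options(
    url="jdbc:example://zos-host.example.com:1200",
    table="CUSTOMER_VSAM",
    driver="com.example.jdbc.Driver",
    user="SPARKUSR",
    password="change-me",
)
```

With an active SparkSession named spark and a suitable driver on the classpath, read_virtual_table(spark, opts).show(5) would display the first few rows.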
Spark runs on the Java Virtual Machine (JVM), which makes it a good fit for the mainframe architecture. The new mainframes include chip-level hardware features that enable fast Java processing. Java, and therefore Spark, takes advantage of zIIPs, SIMD, SMT, zEDC and sysplexes.
Rocket Software provides the glue with its product, the Optimized Data Integration Layer. This product eliminates the need to write complex programs or customized data access routines. It executes primarily on the zIIP, which minimizes the impact on mainframe MIPS (and MSUs) when these functions are active.
An Eclipse-based graphical tool, Data Service Studio, is used to discover data designs, build and test models, and generate the virtual table maps. Models are built using metadata tied to the actual data sources. Data Scientists and Data Engineers use this tool to create, test and validate their models.
The Jupyter Notebook is a very popular graphical collaboration tool for sharing models between Data Scientists. Jupyter Notebooks can communicate with Spark and use these models to access the mainframe data, which minimizes the learning curve for Data Scientists.
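A notebook cell for this kind of work might look like the sketch below. It assumes a SparkSession and two DataFrames already exist in the notebook: one loaded from a mainframe virtual table and one from a distributed source. The view names, the column names (CUST_ID, BALANCE, CLICKS) and both helper functions are hypothetical, chosen only for illustration.

```python
def customer_view_sql(mainframe_view: str, distributed_view: str, key: str) -> str:
    """Build a Spark SQL statement joining the two views on a shared key column."""
    return (
        f"SELECT m.{key}, m.BALANCE, d.CLICKS "
        f"FROM {mainframe_view} m "
        f"JOIN {distributed_view} d ON m.{key} = d.{key}"
    )


def combined_customer_view(spark, mainframe_df, distributed_df):
    """Register both DataFrames as temporary views, then join them with Spark SQL."""
    mainframe_df.createOrReplaceTempView("mf_customers")
    distributed_df.createOrReplaceTempView("web_activity")
    return spark.sql(customer_view_sql("mf_customers", "web_activity", "CUST_ID"))
```

Because the join is expressed in ordinary Spark SQL, a Data Scientist works with the mainframe data the same way as with any other DataFrame in the notebook.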
Summary
Apache Spark™ is a powerful and popular analytics runtime framework. It now runs on z/OS as a free product with optional Subscription and Support (S&S) charges. The Optimized Data Integration Layer provides easy access to the mainframe data. This new architecture delivers operational and performance benefits, eliminates the costs associated with moving data throughout the organization, improves data integrity and provides near real-time data currency.
Next Steps
Are you wondering how this fits into your environment? Would you consider a Proof of Technology (POT)? IBM, Rocket Software and Mainline are here to help you with a POT. Our team is available to help you plan the project, design your use case scenarios, loan equipment on your mainframe, including zIIPs and memory, and train you on the use of Spark and the Optimized Data Integration Layer. Bring your Data Scientists, Architects, Data Engineers and mainframe systems teams together and see how easy it is to implement this new technology.
Additional Blogs:
1) https://mainline.com/linuxone-value-smt/
2) https://mainline.com/ibm-wave-zvm-changing-way-work-linux-z-systems/
3) https://mainline.com/linuxone-oracle-perfect-marriage/
For further information, reach out to Marianne Eggett at marianne.eggett@mainline.com or your local Mainline Account Executive to learn how z/OS Platform for Apache Spark can help your IT infrastructure.