Infinidat’s Preeminent Caching

September 18th, 2017

Robert Young
Senior Systems Engineer
Data Storage Services
Mainline Information Systems

 

For his next spin on a hyper-car, Christian von Koenigsegg cranked it up to 11 and created Koenigsegg Direct Drive (KDD) for the Regera. The main purpose was to eliminate the transmission – and perhaps to enjoy pulling alongside a Ferrari on the Autobahn at 100 MPH and leaving it behind in a cloud of tire smoke while accelerating to over 240 MPH. Infinidat engineering had more practical reasons to create smoking-fast IO in its Infinibox cache management design. In fact – as we will see – a new way of managing IO and the underlying data, in conjunction with caching, was necessary for the success of the overall design.

First, let’s talk about the underlying design philosophy, and why best-in-class caching is necessary. Cache – up to 3 TB of DRAM plus an SSD layer – serves IO, while data (the backing store) resides on long-term 7,200 RPM SATA storage. To make up for the slowness of SATA, stellar caching is required. In the “warm” SSD spillover layer, blocks of data are only copies of the underlying backing store. Because a failed SSD loses nothing that is not already protected on the backing store, the SSD layer is built from single drives without RAID protection. This allows a much larger amount of SSD (up to 200 TB) to be presented at a cost savings.
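To make the tiering concrete, here is a minimal Python sketch of such a read path. The class and method names are my own illustration, not Infinidat’s implementation; the point is simply that the SSD tier holds only copies, so losing an SSD never loses data.

```python
# Hypothetical sketch of a tiered read path: DRAM first, then the SSD
# spillover cache, then SATA backing store. Names are illustrative only.

class TieredCache:
    def __init__(self, dram, ssd, sata):
        self.dram = dram   # small, fastest tier (dict: block_id -> data)
        self.ssd = ssd     # large "warm" tier; holds copies only, so no RAID needed
        self.sata = sata   # RAID-protected backing store; the only durable copy

    def read(self, block_id):
        # 1. Hot data is served straight from DRAM.
        if block_id in self.dram:
            return self.dram[block_id]
        # 2. Warm data is served from the SSD spillover layer and promoted to DRAM.
        if block_id in self.ssd:
            data = self.ssd[block_id]
            self.dram[block_id] = data
            return data
        # 3. Cold data comes from SATA; the authoritative copy always lives here,
        #    which is why the SSD tier needs no RAID of its own.
        data = self.sata[block_id]
        self.dram[block_id] = data
        return data
```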

Next, let’s focus on how caching works for writes. Incoming writes for a volume arrive at all three controller nodes; Infinidat’s Infinibox has no concept of a LUN or volume being online to a given controller – there is no owning node. As writes are combined into 64K objects, metadata tracks what Infinidat commonly calls an Activity Vector. Fields within the Activity Vector help determine how hot a given object is: during an Activity Period, if write activity exceeds a certain Activity Threshold, the object is considered hot and is not destaged. A Large Maker of Enterprise Storage (LMES) reports that, across thousands of storage subsystems it has studied, 30% to 50% of block writes are re-writes. In the Infinibox, after 5 minutes the hotter blocks (frequent re-writes) are compressed and destaged in a log-based fashion – as the cooler blocks were earlier – into a 14+2 stripe on the SATA backing store, with care taken to place data of like activity together. Writes are not only destaged to backing store; all but the coolest blocks are also copied, uncompressed, to the SSD spillover cache. Compression is not in the critical path.
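As a rough illustration of that hot/cold decision, here is a short Python sketch. The Activity Vector field names, the 5-minute period constant, and the threshold value are assumptions for illustration; only the overall flow – hot objects stay in cache, cooled objects are compressed to the SATA stripe and copied uncompressed to SSD – follows the description above.

```python
import time
import zlib

ACTIVITY_PERIOD_SECS = 300   # the 5-minute window mentioned above
ACTIVITY_THRESHOLD = 4       # illustrative: re-writes per period that keep an object "hot"

class ActivityVector:
    """Illustrative per-object activity tracking; field names are assumptions."""
    def __init__(self):
        self.period_start = time.time()
        self.writes_this_period = 0

    def record_write(self):
        now = time.time()
        if now - self.period_start >= ACTIVITY_PERIOD_SECS:
            self.period_start = now        # roll into a new Activity Period
            self.writes_this_period = 0
        self.writes_this_period += 1

    def is_hot(self):
        return self.writes_this_period >= ACTIVITY_THRESHOLD

def destage(objects, sata_stripe, ssd_cache):
    """Destage cooled 64K objects: compress to the SATA backing store and keep an
    uncompressed copy in the SSD spillover cache for all but the coolest objects."""
    for obj_id, (vector, data) in list(objects.items()):
        if vector.is_hot():
            continue                               # hot objects stay in DRAM; re-writes are likely
        sata_stripe[obj_id] = zlib.compress(data)  # compression happens here, off the write path
        if vector.writes_this_period > 0:          # the very coolest objects skip the SSD copy
            ssd_cache[obj_id] = data
        del objects[obj_id]
```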

But writes are boring. The magic is in Infinidat’s beautiful engineering of read pre-fetching. A segment (stripe) carries the activity metadata fields described below.

The Activity Timestamp is a 16-bit field recording the time of first access to any data in the segment during the current Activity Period. T1 through T3 are 16-bit fields recording the time of first access in each of the previous three Activity Periods. Infinidat engineers apply known algebraic concepts [1] to pre-fetch – based on activity over time rising above (or falling below) a threshold, within segments or across virtual buckets or volumes – quite successfully.
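Here is a minimal Python sketch of those per-segment fields and of one plausible way a threshold test could drive a pre-fetch decision. The structure layout and the threshold rule are assumptions for illustration only; Infinidat’s actual pre-fetch algorithm is not public in this detail.

```python
import ctypes

class SegmentActivity(ctypes.Structure):
    """Per-segment activity fields as described above; exact layout is assumed."""
    _fields_ = [
        ("activity_timestamp", ctypes.c_uint16),  # first access in the current Activity Period
        ("t1", ctypes.c_uint16),                  # first access, one period back
        ("t2", ctypes.c_uint16),                  # first access, two periods back
        ("t3", ctypes.c_uint16),                  # first access, three periods back
    ]

def should_prefetch(seg, min_active_periods=3):
    """Illustrative rule: if the segment was touched in enough recent Activity
    Periods (non-zero timestamps), read it into cache ahead of demand."""
    recent = (seg.activity_timestamp, seg.t1, seg.t2, seg.t3)
    return sum(1 for t in recent if t != 0) >= min_active_periods

# Example: touched in three of the last four periods, so pre-fetch it.
seg = SegmentActivity(activity_timestamp=120, t1=95, t2=88, t3=0)
print(should_prefetch(seg))  # True
```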
Demonstrating how effective this is, the owner of a 1 PB Infinibox running OLTP plus analytics saw the following read cache activity during the busiest time of year:

The larger dips are sequential backup IO that bypasses the read cache. The top of the graph is 100%, and the combined DRAM plus SSD hits show a 97% to 100% read cache hit ratio in most timeframes. The “LMES” mentioned above – again measuring thousands of storage subsystems – reports an average DRAM read cache hit rate in the mid-50% range, with fewer than 5% of its customers achieving read cache hit rates above 90%. Infinidat claims that, with 1.7 exabytes deployed worldwide (and discounting the 5% of outliers that pull down the overall numbers), its customers obtain an average DRAM read cache hit rate of over 95%.

In a follow-up post, we’ll show why the KDD of Caching was critical to allowing SATA storage to serve as the back end, and why SATA storage itself is crucial to the overall design.

Please contact your Mainline Account Executive directly, or click here to contact us with any questions.

[1] The “algebraic concepts” are implemented with a trie. A trie (from the word reTRIEval) is a digital tree structure, also known as a prefix or radix tree. In basic terms, it is an extremely fast and compact data structure in which a node’s key is encoded by the node’s position in the tree. The name was coined by Edward Fredkin at MIT in 1960, and tries saw renewed interest in the 2000s when Google employed them for its autocomplete feature – the mechanism that completes a search string as the user types, which requires searching a web-scale index as fast as the user can type. Infinidat adapted the structure for storage virtualization, specifically to provide extremely efficient, high-speed mappings between virtualization address layers.
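For readers unfamiliar with the structure, here is a toy Python trie keyed on the nibbles of a virtual block address and mapping to a physical location. It is a sketch of the general idea only – the address width, stride, and the (stripe, offset) mapping are invented for illustration and are not Infinidat’s on-disk format.

```python
class TrieNode:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = {}
        self.value = None

class AddressTrie:
    """Toy trie keyed on the nibbles of a virtual address, mapping to a
    physical location. Illustrative only."""
    def __init__(self, bits=32, stride=4):
        self.root = TrieNode()
        self.levels = bits // stride
        self.stride = stride
        self.mask = (1 << stride) - 1

    def _path(self, addr):
        # Walk the address from the most significant chunk downward; the key
        # is encoded in the path through the tree, not stored in the nodes.
        for level in range(self.levels - 1, -1, -1):
            yield (addr >> (level * self.stride)) & self.mask

    def insert(self, virtual_addr, physical_loc):
        node = self.root
        for chunk in self._path(virtual_addr):
            node = node.children.setdefault(chunk, TrieNode())
        node.value = physical_loc

    def lookup(self, virtual_addr):
        node = self.root
        for chunk in self._path(virtual_addr):
            node = node.children.get(chunk)
            if node is None:
                return None
        return node.value

# Example: map a virtual block address to a hypothetical (stripe, offset) pair.
mapping = AddressTrie()
mapping.insert(0x0001F4A0, ("stripe-42", 128))
print(mapping.lookup(0x0001F4A0))   # -> ('stripe-42', 128)
```

Because the key is encoded in the path, a lookup touches at most a fixed number of nodes no matter how many mappings exist, which is what makes the structure attractive for address translation between virtualization layers.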
