Tuesday, May 02, 2006

Improving Back-End Storage Resiliency

Some of the most important factors in achieving High Availability are redundancy, fault tolerance, and resiliency. While almost all high-end disk subsystems provide these capabilities, the same cannot be said for all mid-tier disk arrays.

A typical mid-tier array consists of a couple of Storage Controllers (or Storage Processors) connected to one or more Disk Enclosures through Fibre Channel Arbitrated Loop (FC-AL). Inside each Disk Enclosure is a backplane to which the disks connect. In addition, each Disk Enclosure is connected to dual independent FC-AL loops and contains a Port Bypass Circuit (PBC). This type of Disk Enclosure is called a JBOD (Just a Bunch of Disks).

The purpose of the PBC is to detect faulty drives and isolate them from the loop without breaking it. In general, the PBC works well when a drive has failed outright. But what if the drive is merely misbehaving? In this scenario the PBC is useless, because it cannot detect and isolate misbehaving drives. As a result, a single drive can easily take out both loops and render its Disk Enclosure, as well as the Disk Enclosures below it, useless. Keep in mind that even though dual independent loops connect to each Disk Enclosure, these loops have something in common: the drive's logic. A failure in the drive's logic can therefore have widespread effects, severely impacting disk subsystem and application availability.

JBODs also have performance, latency, and scalability issues. The bandwidth of the loop is shared among all elements in the loop, and only one controller can communicate with a disk at a time. Latency and arbitration overhead also grow as more devices enter the loop. To keep these issues to a minimum, most disk subsystem vendors impose a maximum on the number of drives in a loop. That number is around 50-60 drives, which is a little less than half of the maximum number of elements that can be connected to an FC-AL loop.
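
The shared-bandwidth point can be made concrete with a little arithmetic. The sketch below is only illustrative: the function names are hypothetical, the 200 MB/s loop figure is an assumed nominal rate for a 2 Gb/s loop rather than a number from any particular array, and an even split across drives is a simplification (real traffic is bursty and arbitration adds overhead). The 126-port limit is the number of node ports addressable on one arbitrated loop.

```python
# Rough model of how a shared FC-AL loop divides its bandwidth.
# Assumptions (not from any vendor spec sheet): 200 MB/s nominal loop
# bandwidth and an even split across all drives on the loop.

FC_AL_MAX_NODE_PORTS = 126  # node ports (NL_Ports) addressable on one loop


def per_drive_share_mb_s(loop_mb_s: float, drives: int) -> float:
    """Even share of loop bandwidth if every drive were equally busy."""
    if drives <= 0:
        raise ValueError("drives must be a positive integer")
    return loop_mb_s / drives


def loop_fill_fraction(drives: int) -> float:
    """Fraction of the loop's addressable node ports used by drives alone."""
    if drives < 0 or drives > FC_AL_MAX_NODE_PORTS:
        raise ValueError("drive count outside the addressable range")
    return drives / FC_AL_MAX_NODE_PORTS


if __name__ == "__main__":
    assumed_loop_mb_s = 200.0  # assumption: nominal 2 Gb/s loop
    for drives in (14, 28, 56):
        share = per_drive_share_mb_s(assumed_loop_mb_s, drives)
        fill = loop_fill_fraction(drives)
        print(f"{drives:3d} drives: {share:6.2f} MB/s each, "
              f"{fill:.0%} of loop addresses")
```

Even under these generous assumptions, the per-drive share shrinks quickly as enclosures are added, which is one way to see why vendors cap a loop well below its addressing limit.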

Innovation allows us to solve old problems as well as new ones. Around 2001, a new technology surfaced in the form of embedded FC switches (also called Switch on a Chip) in the Disk Enclosures that addresses the above issues. This type of enclosure is called an SBOD (Switched Bunch of Disks). The embedded switches give each drive an individual FC-AL connection in which it is the only element in the loop, providing faster arbitration and lower latency. In addition, the embedded switches can isolate drive chatter without endangering the loop. Performance and scalability are also direct beneficiaries of this technology. An SBOD can deliver up to 50% more performance than a JBOD, mostly because the bandwidth is not shared and because of the crossbar architecture deployed inside the embedded switches. Scalability has also improved outside of the Disk Enclosure, since more drives and Disk Enclosures can be connected to the FC-AL loop(s) on the Storage Controller.

Network Appliance was one of the first vendors to recognize the benefits of this technology and has been providing our customers with increased storage resiliency across our entire product line for almost three years now.
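
The difference in fault isolation between a JBOD and an SBOD can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real PBC or embedded switch is implemented: the `DriveState` values and both functions are hypothetical names, and the model assumes that a misbehaving drive left in a shared loop disrupts every other drive on it, while a switched enclosure can cut off just that one port.

```python
from enum import Enum
from typing import List


class DriveState(Enum):
    HEALTHY = "healthy"
    FAILED = "failed"            # dead drive: a PBC can detect and bypass it
    MISBEHAVING = "misbehaving"  # still on the loop, but disrupting it


def reachable_in_jbod(drives: List[DriveState]) -> List[int]:
    """Toy JBOD: the PBC bypasses FAILED drives only.

    A MISBEHAVING drive is not detected, stays in the shared loop, and
    (in this model) makes every drive on the loop unreachable.
    """
    in_loop = [i for i, state in enumerate(drives)
               if state is not DriveState.FAILED]
    if any(drives[i] is DriveState.MISBEHAVING for i in in_loop):
        return []
    return in_loop


def reachable_in_sbod(drives: List[DriveState]) -> List[int]:
    """Toy SBOD: each drive sits on its own switch port.

    The switch isolates any drive that is not healthy, so the remaining
    drives stay reachable.
    """
    return [i for i, state in enumerate(drives)
            if state is DriveState.HEALTHY]


if __name__ == "__main__":
    shelf = [DriveState.HEALTHY, DriveState.FAILED,
             DriveState.HEALTHY, DriveState.MISBEHAVING]
    print("JBOD reachable drives:", reachable_in_jbod(shelf))
    print("SBOD reachable drives:", reachable_in_sbod(shelf))
```

In this model a failed drive costs both designs exactly one drive, while a single misbehaving drive costs the JBOD the whole shelf and the SBOD only the offending slot.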
