Wed 25 Oct 2006
I knew it couldn’t possibly end there …
I recently sent Jon Toigo (DrunkenData) a formal response to questions he and some of his readers had about Data ONTAP GX, NetApp’s scale-out storage architecture.
In a follow-on comment, one of Jon Toigo’s readers, who failed to indicate his affiliation, but whose name links to LeftHand Networks :-), wrote:
“It’s nice to see NetApp being open about its GX architecture. This being said, the architectural deficiencies were clearly spun in a positive light with marketing hype.”
So, in the interest of educating that reader (and perhaps some other readers) about scale-out storage and the scale-out computing revolution, I thought I’d take the opportunity here to share some of my experience and personal observations.
And … sorry to disappoint, but no “spin” here.
In that post, Jon’s reader missed two key points, imho.
First, the *primary* purpose of almost all scale-out storage architectures is to provide scalable AGGREGATE performance for a large number of clients - not to speed single stream performance to a single host. This is true both in technical computing apps and in large-scale enterprise deployments.
In fact, most hosts in cluster computing configurations (or enterprise server environments) have only a single GbE (or 2Gb FC) into the client fabric - so you’ll never get more than a wire’s worth of bandwidth (90 or 100 MB/s for GbE) to a single host anyway.
For higher performance interconnects - including 10GbE and IB, it *is* possible for single hosts to obtain higher throughputs, but you still won’t exceed the single wire (or single “logical wire” if you’re using and can leverage link agg) performance from host to storage subsystem. So the fact that a host goes through an N-blade in the GX architecture to get the data is irrelevant as long as the N-blade can deliver the single session bandwidth expected by the host - which, of course, it can.
That having been said, technical computing apps requiring high aggregate I/O for “single problem” scenarios do exist. But those applications are structured in one of two ways to leverage parallel computing - either as a set of “embarrassingly parallel” processes, or as part of a large parallel job (using, e.g. MPI).
In the former case, this is simply an aggregate I/O problem with lots of hosts accessing lots of files on the shared storage system (perhaps in the same directory, perhaps in multiple directories). In the latter case, the individual “workers” in an MPI communication group (node/process set) will EACH perform their own I/O - either accessing individual files or accessing (typically) disjoint portions of large files (perhaps through a parallel I/O layer such as MPI-IO). For either case, GX does just as it is supposed to - it spreads the accesses across *BOTH* the N-blades and the D-blades to provide scalable I/O.
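To make the "disjoint portions of large files" pattern concrete, here is a minimal sketch of how each worker in a parallel job might compute its own byte range of a shared file. The function name and the even-split policy are assumptions for illustration only; real MPI-IO applications usually describe their partitioning through file views rather than hand-computed offsets.

```python
def rank_byte_range(file_size, n_ranks, rank):
    """Return the half-open byte range [start, end) owned by one worker.

    Hypothetical helper: splits the file as evenly as possible, giving
    the first (file_size % n_ranks) workers one extra byte each.
    """
    if not 0 <= rank < n_ranks:
        raise ValueError("rank must be in [0, n_ranks)")
    base, extra = divmod(file_size, n_ranks)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# Each worker reads only its own slice, so the ranges never overlap
# and together they cover the whole file.
ranges = [rank_byte_range(1000, 4, r) for r in range(4)]
```

Because every worker issues its own I/O against its own range, the load looks to the storage system like many independent clients, which is exactly the aggregate case described above.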
This is basically the manner in which *all* parallel / clustered file systems are employed for scale-out computing applications. There are URLs to many of the companies developing such systems in my aforementioned post on The Scale-out Evolution.
As I indicated in my response to Jon, pNFS is an NFSv4 proposed extension that will effectively allow you to take the VLDB functionality “out-of-band”, so that a pNFS-savvy client can get a “map” of the locations of all of the file segments through the metadata server (at time of open or in response to file extend callbacks), then go directly to those segments for its “data access” commands (read, write) - in essence, eliminate the “hop” that Jon initially asked about.
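The "map" idea can be sketched as a toy lookup table. This is not the pNFS wire protocol or any real client API; the Segment and Layout names, and the server labels, are hypothetical. It only shows the principle: fetch the layout once, then resolve each offset to the server holding it without an intermediate hop.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    offset: int   # starting byte of this segment within the file
    length: int   # segment size in bytes
    server: str   # hypothetical label for the node holding the data


class Layout:
    """Toy stand-in for a layout handed out by a metadata server."""

    def __init__(self, segments):
        self.segments = sorted(segments, key=lambda s: s.offset)
        self._starts = [s.offset for s in self.segments]

    def server_for(self, offset):
        # Find the last segment whose start is <= offset.
        i = bisect_right(self._starts, offset) - 1
        if i < 0:
            raise ValueError("offset precedes the first segment")
        seg = self.segments[i]
        if offset >= seg.offset + seg.length:
            raise ValueError("offset is not covered by the layout")
        return seg.server


MIB = 1 << 20
layout = Layout([
    Segment(0 * MIB, MIB, "d1"),
    Segment(1 * MIB, MIB, "d2"),
    Segment(2 * MIB, MIB, "d3"),
])
```

A client holding such a layout would send a read at offset 1 MiB straight to "d2" rather than through whichever node it happened to mount.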
There are other performance/efficiency options that can be effected in such architectures as well - both in an out-of-band metadata scenario AND in the current N-Blade implementation. These have to do with how those components may choose to do read-ahead across distributed segments of a large shared file - effectively issuing requests for multiple portions of the file at the same time to “fill” a larger pipe back to the requestor. With a pNFS client, the multiple segment read-ahead could be effected by the host. With the current GX implementation, by the N-Blade.
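As a rough illustration of multi-segment read-ahead, the sketch below issues requests for several consecutive segments at once and reassembles them in order. The fetch_segment callable is a placeholder supplied by the caller; nothing here reflects how the N-Blade or a pNFS client actually implements it.

```python
from concurrent.futures import ThreadPoolExecutor


def read_ahead(fetch_segment, first_segment, depth, max_workers=4):
    """Fetch `depth` consecutive segments concurrently, return them joined.

    fetch_segment: hypothetical callable taking a segment index and
    returning that segment's bytes (e.g. a request to one storage node).
    """
    segment_ids = range(first_segment, first_segment + depth)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the bytes
        # come back in file order even if requests finish out of order.
        chunks = list(pool.map(fetch_segment, segment_ids))
    return b"".join(chunks)


def fake_fetch(i):
    # Stand-in for a network read: segment i is four bytes of value i.
    return bytes([i]) * 4


data = read_ahead(fake_fetch, first_segment=2, depth=3)
```

The point is only that several outstanding requests, each served by a different node, can keep a larger pipe to the requestor full.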
Bottom line … yes, it’s true that a single host’s I/O will be limited by the *smaller* of: (a) its network ingress capability (typically 1 GbE); (b) the network egress capability of the N-Blade to which the host session (mount) is connected. Not a problem for most apps I’ve ever seen given the hardware on which GX currently runs.
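The "smaller of (a) and (b)" rule is simple arithmetic, sketched below. The function name is made up, and the 0.8 efficiency factor is an assumption chosen to land near the 90 to 100 MB/s GbE figure quoted earlier; real protocol overhead varies.

```python
def single_host_ceiling_mb_s(host_link_gbps, nblade_link_gbps, efficiency=0.8):
    """Estimate the best-case throughput one host can see, in MB/s.

    The slower of the two links is the bottleneck. `efficiency` is an
    assumed fraction of raw line rate left after protocol overhead.
    """
    bottleneck_gbps = min(host_link_gbps, nblade_link_gbps)
    raw_mb_s = bottleneck_gbps * 1000 / 8  # gigabits/s -> megabytes/s
    return raw_mb_s * efficiency


# A host on 1 GbE stays at roughly GbE speed no matter how fast
# the N-Blade's own port is.
gbe_host = single_host_ceiling_mb_s(1, 10)
```

Under these assumptions a GbE-attached host tops out at roughly 100 MB/s, so the intermediate node only matters once the host's own link is faster than the N-Blade's.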
The reader’s second point has to do with the functionality that is part of the current ONTAP GX release. The initial release of ONTAP GX, as I had indicated, is targeted at technical computing applications, for which the set of features currently provided are critical.
Technical computing applications work almost exclusively against files and filesystems. What I think Jon’s reader is missing is the difference between a block storage platform (in which you can add bricks and controllers under a single management domain and provide striped LUNs - stuff many of the vendors’ volume managers have been doing for years) and a scale-out FILESYSTEM, which ONTAP GX presents.
As many users of parallel and clustered filesystems know, this is a very important distinction. Large storage systems can effectively be crippled by a filesystem that hasn’t been adequately designed for scalability. The old “metadata bottleneck” in SAN filesystems.
ONTAP GX provides both a scale-out storage platform (distributed storage “bricks”, distributed controllers, single management domain) as well as a parallel filesystem with DISTRIBUTED metadata - alleviating file system bottlenecks that would otherwise prevent full utilization of the underlying architecture.
What you’re seeing at NetApp, and I believe elsewhere in the storage industry, is the gradual maturation of scale-out storage technology (and particularly parallel/clustered filesystems) - across the industry, as companies look to leverage these innovations first in production technical computing applications (digital animation, EDA, bioinformatics, automotive/aerospace design, …), and subsequently as highly scalable enterprise storage systems (e.g. for enterprise grid deployments).
The truth of the matter, despite what some marketing departments will tell you, is that (a) building parallel/clustered filesystems is hard (especially in delivering scalable, full featured, robust, enterprise-class services); and (b) we (as an industry) are very much in the early days of the scale-out storage revolution. Just ask someone who’s tried to deploy any of these systems, or do an in-depth survey of the scale-out storage solutions and ask them (and their customers) about feature sets and robustness.
From my (admittedly biased) perspective, the HUGE advantage NetApp has is a very successful enterprise business based on a proven scale-up approach to unified storage (SAN, iSAN, NAS) with Data ONTAP, and an emerging scale-out storage platform based on the Spinnaker legacy — all based on the same physical hardware components, same core storage “container” architecture (WAFL), and same data management services (snapshots, mirrors, etc.).
Getting all of that right requires significant innovation, and takes time. Hence the evolutionary, multi-release rollout of ONTAP GX.
November 6th, 2006 at 10:31 am
Bruce,
I just wanted to thank you personally for spending the time to write such lengthy answers to the questions we requested on my blog. It speaks well of NetApp that they will dedicate expensive cycles of key players to respond to inquiries.
You are on my Christmas Card list.
Thanks again.
Jon Toigo
November 10th, 2006 at 9:06 am
ONTAP GX is really just a clustered namespace implementation, isn’t it? Since that doesn’t get more NAS heads on the SAME data, how is that called scalability? I see how it allows more heads to access the same filesystem (virtualized at the CNS layer), but what if your application has one really hot file such as is often the case with databases?
I’ll be blogging more about scalable NAS, but there are some papers on the matter on my blogsite as well:
Kevin Closson’s Blog
November 22nd, 2006 at 12:03 am
Grid Guy,
It will not end here either.
It’s satisfying to see someone understanding parts of the technology behind scale-out distributed computing, albeit clearly with some religion around NetApp’s “ho-hum” GX architecture. I’ve been researching Distributed File Systems for many years and it appears to me that GX does not have any “revolutionary” technology when compared to what’s been developed by dozens of Universities around the world, and is shipping from other companies today as well. I guess my expectations were much higher having paid $300M for the Spinnaker technology.
I believe the architecture may be stifled by NetApp’s legacy WAFL file system and clustered pair redundancy scheme, which was never designed to be distributed in the first place, so the challenge became: how do you integrate a scale-out architecture with an existing scale-up architecture without losing its rich set of functionality? In fact, it appears they had to rip out a significant amount of functionality and eliminate features that are key incentives for customers that buy NetApp today.
When high end clustered storage solutions get relegated to “Technical Computing Applications”, it’s just a nice way of saying, “we can’t deliver our scalable system with all of these features businesses require, so let’s sell it into the HPC space.” I’ve researched just about every distributed systems technology on the planet, and I can honestly say I’m not impressed. Twenty node limit, no asynchronous mirroring/DR (SnapMirror & SnapVault), limited CIFS, no NDMP Mirroring, no iSCSI, no FC, no SNMP management support, no DFM, no SnapManager, etc., etc.
Ok, on to the distributed systems architectural discussion:
So, in the interest of further “educating the reader about scale-out storage and the scale-out computing revolution”, I thought I’d take the opportunity to clarify and elaborate on some of your points, and no “spin” here either.
I agree that the “primary purpose of almost all scale-out storage architectures is to provide scalable AGGREGATE performance for a large number of clients - not to speed single stream performance to a single host.” Imagine 100s of hosts performing random I/O operations, rendering filer head caches useless, and having to resolve and forward I/O’s. The more nodes there are in the cluster, the higher the probability that I/O’s have to be forwarded. Now tell me what this does to the AGGREGATE latency of millions of I/O operations from hundreds of servers? It’s not just about large sequential I/O and bandwidth. Although while we’re on this point, and with 10 Gig right around the corner, combined with I/OAT and multi-Quad-core processors, the AGGREGATE I/O “hops” are going to significantly reduce the “AGGREGATE” performance possible from the servers. It’s not just about how much data a single host can push, but how high can the number of hosts and nodes scale before the system loses its “linear scalability”.
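The forwarding claim can be put in numbers with a deliberately simple model. It assumes requests land on a uniformly random node and the target data is uniformly spread across all nodes, so only 1 in N requests is served locally. Neither assumption comes from NetApp documentation, and real placement and mount patterns may differ considerably.

```python
def forward_probability(n_nodes):
    """Chance that a request must be forwarded to another node.

    Simplified model (an assumption, not a measured GX figure): the
    receiving node and the node holding the data are independent and
    uniformly distributed over the cluster.
    """
    if n_nodes < 1:
        raise ValueError("cluster needs at least one node")
    return (n_nodes - 1) / n_nodes


def mean_latency_ms(n_nodes, local_ms, hop_ms):
    """Expected per-I/O latency when each forward adds one fixed hop."""
    return local_ms + forward_probability(n_nodes) * hop_ms


# Probability of a forward for a few cluster sizes.
table = {n: forward_probability(n) for n in (1, 2, 4, 20)}
```

In this model the forwarding probability climbs quickly but flattens as N grows (0.5 at 2 nodes, 0.75 at 4, 0.95 at 20), so the added latency per I/O approaches one hop rather than growing without bound.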
Yes, “GX does just as it is supposed to - it spreads the accesses across *BOTH* the N-blades and the D-blades to provide scalable I/O.” Ok, let’s look at a scenario you may not have considered. If files are striped evenly across all N-blades and underlying “D-blades” (assuming the files are big enough to be striped across all blades given their chunk size), and let’s say you have a 20 node system (GX limit; embarrassing!), you could theoretically get a nice multi-path spread of 20 I/O’s going to 20 N-Nodes simultaneously, but then it’s possible that all 20 I/Os will result in another 20 I/O requests to the actual N-Blades containing the data. This situation gets worse as more nodes are added to the system, but with only 20, they are limiting the latency impact from I/O forwarding.
Yes, “pNFS is an NFSv4 proposed extension that will effectively allow you to take the VLDB functionality “out-of-band”, so that a pNFS-savvy client can get a “map” of the locations of all of the file segments through the metadata server (at time of open or in response to file extend callbacks), then go directly to those segments for its “data access” commands (read, write) - in essence, eliminate the “hop” that Jon initially asked about.”
This is where I really start to get excited. LeftHand is shipping “pNFS-like” data locality awareness functionality for the host today, and it’s not a fat, non industry standard, “Panasas-like” file system driver, but a simple, standard, Microsoft approved, Patent Pending DSM plug-in for Microsoft’s MPIO stack, so that the masses can get highly scalable clustering without paying an arm and a leg for it, and without sacrificing those features that make the IT guys & gals happy.
I’m not missing the difference between block and file. Some of us believe that a File System is just another application on a pool of block level storage. For example, if you move NetApp’s N-Node software to the client, this is effectively what you have. Having the scale-out functionality at the block level has the advantage of being file system agnostic, which allows it to address a broader range of applications.
Keep in mind that there is a significant amount of metadata that needs to be managed in a clustered scale-out multi-node block system as well, and most of the same distributed system challenges exist. Solving the scalability problem with metadata management and its fault tolerance is often considered the “holy grail” of Distributed File Systems. Just to set the record straight, LeftHand has all the technology in SAN/iQ for a Distributed File System, including a distributed lock manager, a critical piece of technology required for Distributed File Systems. LeftHand chose to enter the market at the SMB level and move up instead of entering the HPC market and trying to move down. Block gives LeftHand more of a universal storage solution for the SMB market, but doesn’t preclude them from delivering many more related goodies in the near future, and moving up market….
Actually, building a “parallel/clustered filesystem” is rather straightforward these days, and one can be downloaded for free from many of the popular Linux sites. What’s difficult to build is a highly reliable, scalable, fault tolerant system that can sustain network faults, disk faults, multiple node failures, on-the-fly storage expansion, all simultaneously while data is being accessed, snapshotted, replicated and deleted, without the applications going off-line or experiencing severe degradation in performance. This is where most of these “free” Distributed File Systems break down and end up in the HPC graveyard or University lab.
I know you’re excited for NetApp, and I am too. I’m excited that they are helping lead the charge for the next generation of scale-out storage technology, and helping leave the legacy, monolithic scale-up architectures from those Evil Machine Companies in the dust.
John