While the Solid State Drive (SSD) is one of the biggest buzzwords of the past year, it is not by any means a new technology. SSD has been with us in some form or another for quite some time. So the real question is, why is it gaining momentum now?
In enterprise environments, RAM-based SSDs have existed for over a decade to accommodate a niche high-performance market that not many of us had much to do with. Over the course of that decade, traditional disks kept growing in size but not in speed. As a result, the balance between capacity and speed has warped, confronting us with scenarios where we need to put many more drives in our storage systems than capacity alone would dictate, just to get the speed we need. (And we most certainly need speed. Consolidation of our server farms, applications, and desktops on computing platforms where Moore's law still applies has pushed traditional arrays to their limits.) Does this mean we're going to replace our traditional magnetic drives with SSD technology? Not quite, but giving some serious thought to where SSD technology fits in a modern storage landscape is something all of us should do.
Over the last two years, NAND-based SSDs have been making an appearance in the enterprise market—and they are doing well. However, they have their limitations.
As with any storage design, which SSD should be used and where it should be positioned all depends on your application landscape and your budget. Most likely you won't require SSD technology for your entire application set.
So let's take a serious look at what SSD is and what it can do for you, beginning with where you use it.
Where to position SSD technology to best accommodate your storage requirements depends on a number of factors. Each of the ways SSDs are used in the enterprise has pros and cons. Since your applications determine which of these points matter in your environment, we'll discuss at a high level which applications could benefit from a certain type of SSD. SSDs are currently used mainly in the following ways:
- Flash-cached (performance acceleration of magnetic disks by extending cache)
- Flash-tiered (performance acceleration of magnetic disks by using a tiered approach)
- Hybrid file systems (performance acceleration of magnetic disks by using a tiered approach)
- Flash-only arrays (using SSDs in a purpose-built all-flash array)
- Server side flash (connecting SSD directly to the PCI bus of the server)
- Network flash (a combination of flash-only arrays and server-side flash)
Most, if not all, storage arrays have some level of caching available within the system, sometimes just for reads or just for writes, though ideally for both. As a rule this is RAM-based, battery-backed cache sized in proportion to the platform. It's generally only expandable on the larger systems, where flash cache cards can extend the internal RAM-based cache with an extra layer of SSD.
With read caching, predictive algorithms load data that is expected to be read, or that has been read several times recently, into cache so it can be served faster.
Write caching is generally used to buffer incoming writes and to optimize them before writing them to disk, for example by using full stripe writes, where enough data is cached to write across all disks in an underlying stripe set. In modern arrays the use of SSD technology allows vendors to expand these caches to beyond what is normally available in the system.
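The full-stripe write optimization described above can be sketched in a few lines. This is a minimal illustrative model, not any vendor's implementation; the class name, disk counts, and chunk size are invented for the example. Incoming writes are buffered until a full stripe's worth of data has accumulated, so the array can issue one aligned write per disk instead of a costly read-modify-write of parity.

```python
# Minimal sketch of full-stripe write buffering (all names and sizes invented).
class StripeWriteBuffer:
    def __init__(self, data_disks, chunk_size):
        self.stripe_size = data_disks * chunk_size  # bytes in one full stripe
        self.buffer = bytearray()                   # cached partial writes
        self.flushed_stripes = []                   # stands in for disk I/O

    def write(self, data: bytes):
        self.buffer.extend(data)
        # De-stage only complete stripes; partial data stays in cache.
        while len(self.buffer) >= self.stripe_size:
            stripe = bytes(self.buffer[:self.stripe_size])
            del self.buffer[:self.stripe_size]
            self.flushed_stripes.append(stripe)     # one full-stripe write

buf = StripeWriteBuffer(data_disks=4, chunk_size=64 * 1024)  # 256 KiB stripe
buf.write(b"x" * 100 * 1024)   # not enough for a full stripe yet
buf.write(b"y" * 200 * 1024)   # crosses the stripe boundary: one flush
```

The point of the sketch is the `while` loop: nothing smaller than a full stripe ever reaches the disks, which is exactly why a write cache helps bursty workloads.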
When caching writes, the advantage of using SSDs is immediately obvious, but it's only suitable for bursts. Sustained writes at a higher rate than the underlying disks can handle will fill up the cache, and at that point you're at the mercy of the performance of the underlying disks.
With regard to read caching, an SSD cache needs to 'warm up' for a few days so the algorithms in the array can determine which blocks are 'hot' and therefore need caching. Depending on the access pattern of the data, this can yield anything from a slight to a significant performance increase.
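The warm-up behaviour can be modelled with a toy hot-block cache. The promotion threshold and class name here are hypothetical, chosen only to show why hit rates start low and climb as the frequency counters fill: a block is served from the SSD cache only after it has been read often enough to be considered hot.

```python
# Toy model of read-cache warm-up (threshold and names are hypothetical).
from collections import Counter

class WarmupReadCache:
    def __init__(self, promote_after=3):
        self.reads = Counter()       # access-frequency counter per block
        self.cache = set()           # blocks currently held on SSD
        self.hits = self.misses = 0
        self.promote_after = promote_after

    def read(self, block):
        if block in self.cache:
            self.hits += 1           # served from the SSD cache
        else:
            self.misses += 1         # served from spinning disk
        self.reads[block] += 1
        if self.reads[block] >= self.promote_after:
            self.cache.add(block)    # block is now considered 'hot'

cache = WarmupReadCache()
for _ in range(5):
    cache.read("blk-42")  # early reads miss; later reads hit once promoted
```

In a real array the counters cover millions of blocks, which is why the warm-up phase takes days rather than milliseconds, and why losing the cache (card failure, head failover) means starting the count from zero.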
Pros and Cons
Strong points of this technology are that it can decrease the number of spindles required in your existing array, while very little training is required for your support staff. In the case of a read cache card, a disadvantage is that if the card fails, or a failover to another head occurs, the cache will need a few days to warm up before it reaches the same efficiency levels as before.
A dataset consisting of VDI desktops, which has a high de-dupe ratio, will perform very well in a de-dupe-aware cache. With a completely random access pattern on a volume with millions of small files, performance may not see such a big improvement. In a scenario like that you could, however, still cache your file metadata to achieve a performance boost.
But if you can get a workable average out of your measurements that your magnetic disks can accommodate, extending the cache of your existing array for burst handling can be a good idea. No requirement to train staff on a new platform, a low-impact implementation, and relatively easy migration of data are all additional bonuses here.
An example of a flash-cached array would be NetApp with its Flash Cache cards. (Note that these cards only cache reads.)
In flash-tiered arrays, SSD is used as a Tier 0, where a form of tiering moves data either down to underlying traditional disks or up from traditional disks to SSD. This can be either a dynamic or a scheduled process, and it is important to note that, depending on the implementation, blocks as large as 1 GB and even entire volumes can be subject to relocation, even when the hot data is only a few KB in size. Generally these are the more traditional arrays that are not necessarily optimized for the use of SSDs.
Pros and Cons
Wear-leveling can be a serious issue if the array is not optimized to write to flash in a sequential way. Controllers may not be fully optimized for the higher throughput and IOPS rates that flash can provide, which is why quite a few startups have redesigned their arrays from the ground up to take full advantage of the performance flash drives can deliver. Tiering algorithms can also significantly increase the load on the system.
If you have a significant investment in an existing enterprise platform then, depending on your requirements, adding SSD as a Tier 0 to your array could be a good idea. It would address the issue of having to manage multiple arrays and would allow for more than 100 TB under a single pane of glass, which currently seems to be the limit for all-flash arrays. Examples of flash-tiered arrays are Hitachi's VSP and EMC's VNX platforms.
Hybrid File Systems
Hybrid file systems describe a technology that allows for DRAM, SSD and traditional spinning disks to be part of the same file system. Data written to the storage system is stored into battery protected DRAM or SSD after which an acknowledgement of the write is sent to the client. The file system subsequently de-stages the data to the backend disks but is also capable of promoting data that is getting hot. It is in effect a level of tiering within the file system.
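The write path described above can be condensed into a small model. This is a simplified sketch, not WAFL or ZFS internals; the class and method names are invented. The essential behaviour is that the client's write is acknowledged as soon as it lands in the protected DRAM/SSD log, and a separate background de-stage step later pushes it down to the spinning-disk tier.

```python
# Simplified model of a hybrid file system's write path (names invented).
class HybridFS:
    def __init__(self):
        self.log = []    # battery-protected DRAM / SSD write log
        self.disk = {}   # slow spinning-disk backend tier

    def write(self, path, data):
        self.log.append((path, data))
        return "ack"     # the client sees the write as durable here

    def destage(self):
        # Background task: flush the log out to the backend disks.
        for path, data in self.log:
            self.disk[path] = data
        self.log.clear()

fs = HybridFS()
status = fs.write("/var/db/table", b"row")  # acknowledged before hitting disk
fs.destage()                                # data now on the backend tier
```

The interesting property is that write latency is decoupled from disk latency: the client only ever waits for the fast tier, while the slow tier absorbs the data asynchronously.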
Pros and Cons
Applying this kind of tiering requires a file system with enough intelligence to do the work transparently from the client's perspective.
File systems like WAFL and ZFS support this level of intelligence. NetApp has announced Flash Pools, the SUN/Oracle 7000 series uses ZFS in this way, as does Nexenta.
Flash-Only Arrays
The architecture of flash-only arrays resembles traditional storage systems in the sense that there is a central array with FC, iSCSI, or similar connectivity. Flash-only arrays have some advantages over flash-cached/tiered arrays: they are generally designed from the ground up to take full advantage of the speed that SSDs can provide, and they have advanced wear-leveling and management capabilities.
Pros and Cons
The price/capacity ratio is a lot less favorable than that of traditional disks, but through thin provisioning, (inline) de-dupe, compression, zeroing, and so on, this form of SSD approaches the price per usable/effective GB that traditional Tier 1 arrays offer, only many times faster.
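The effect of data reduction on cost can be shown with back-of-the-envelope arithmetic. All figures below are hypothetical, chosen only to illustrate the mechanism: the raw $/GB of flash is divided by the combined data-reduction factor to get an effective cost per usable GB.

```python
# Back-of-the-envelope effective cost per GB (all figures hypothetical).
def effective_cost_per_gb(raw_cost_per_gb, dedupe_ratio, compression_ratio):
    """Raw $/GB divided by the combined data-reduction factor."""
    return raw_cost_per_gb / (dedupe_ratio * compression_ratio)

# Example: $10/GB raw flash with 4:1 de-dupe and 1.5:1 compression
# gives a 6x reduction factor, i.e. $10 / 6 per effective GB.
cost = effective_cost_per_gb(10.0, dedupe_ratio=4.0, compression_ratio=1.5)
```

The catch, of course, is that the reduction ratios are workload-dependent: a VDI estate may de-dupe heavily, while pre-compressed media data may barely reduce at all, so the effective $/GB has to be measured per dataset rather than taken from a datasheet.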
Only recently have players in this segment started to offer HA within the array as an option. Without it, these systems can't be used to support business-critical applications without organizing fail-over/redundancy at higher (application/OS) levels.
Network latency is still an issue here, as the compute layer is separated from the storage layer by some sort of network, be it FC or Ethernet. That means that for sub-millisecond response times, a more localized solution is required.
Cost can also be an issue, depending on what you need. If you don't need space but IOPS, a system like this will give you the best $/IOPS price; if, however, you also need a substantial amount of space, it might be cheaper to go with a few more spindles in a traditional array. The serious players in this segment are starting to sell their systems as competitive with traditional Tier 1 storage on a $/GB basis, thanks to added features like compression, de-duplication, zeroing out, and so on.
Customers requiring a consistently high rate of IOPS or throughput may benefit from a system in this category. It will require some extra training, a more complicated migration path and of course an implementation will need to be scheduled. A new VDI implementation would be a good example where this technology could do well.
Solutions like Whiptail use SSDs with MLC cells, but they coalesce I/Os so that data gets written a full block at a time, turning all the incoming random writes into a single sequential write and effectively eliminating write amplification.
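Why coalescing matters can be seen from the arithmetic of write amplification. The figures below are illustrative, not Whiptail's actual geometry: updating 4 KiB in place forces the flash to rewrite a whole erase block, while batching many small writes into one full-block write brings the amplification factor back to 1.

```python
# Rough illustration of write amplification (sizes hypothetical).
ERASE_BLOCK = 256 * 1024   # assumed flash erase-block size
IO_SIZE = 4 * 1024         # typical small random write

# Naive in-place update: every 4 KiB write triggers a read-erase-rewrite
# of a full 256 KiB erase block, a 64x write amplification.
amplification_in_place = ERASE_BLOCK / IO_SIZE

# Coalesced: 64 random 4 KiB writes are gathered and programmed as one
# full 256 KiB block, so bytes written to flash equal bytes received.
ios_per_block = ERASE_BLOCK // IO_SIZE
amplification_coalesced = ERASE_BLOCK / (ios_per_block * IO_SIZE)
```

Besides performance, this directly affects endurance: on MLC cells with a limited number of program/erase cycles, a 64x amplification factor wears the drive out 64 times faster than a coalesced write path would.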
Other vendors, like Violin Memory, use PCI flash cards and have a net storage capacity that varies with I/O load. They allocate more of the gross space to allow garbage collection to create empty cells: the lower the I/O load, the more net space you get.
Pure Storage has just finished customer try-outs, has an excellent team of technicians at work and, not unimportant for a startup, has some very strong VC backing.
Nimbus Data Systems, already a financially healthy company, offers HA, unified access, and a decent set of included software covering replication, snapshots, and more.
Texas Memory Systems’ RAMSAN product is the grandfather of modern-day SSD arrays. These days not all their products are RAM-based; they have MLC arrays as well.
Server Side Flash
Server side Flash describes a solution where SSDs are connected to the local PCI bus of a server platform.
Pros and Cons
Local flash directly on the PCI bus has no network latencies to contend with, so this is the fastest variant in the field, with access times of microseconds as opposed to milliseconds. It does have some limitations, at least until flash cards can talk directly to each other over some kind of inter-system PCI framework (something some blade servers already provide). If, however, you need a synchronous multi-site solution, network latency rears its ugly head again and you would have to go for a Hadoop-esque solution to be able to use these cards.
There is a reason that 57% of FusionIO's total 2011 turnover came from just two customers, Apple and Facebook. They don't see network latency as an issue because they have solved the problem at the application level: Facebook's Hadoop systems operate on a share-nothing principle and don't require shared storage functionality, while Apple uses Oracle's Data Guard, which optimizes the network traffic between the nodes.
For those of us not using this technology, this can be a serious limitation if you want to use, for example, VMware HA. DB systems that you want to mirror must be mirrored at the application level; otherwise Ethernet latencies will increase response times considerably. However, more and more hypervisor and database manufacturers are adding this level of intelligence; MS SQL Server 2012, for instance, now ships with 'AlwaysOn' functionality. At present the technology is primarily used for accelerating temporary datasets like business intelligence servers, non-persistent virtual desktops, or the temp space and log files of applications.
There are a few manufacturers that have decided to build systems where the servers hosting the PCI flash cards are interconnected through a common enclosure. Nutanix and Kaminario use standard (blade) form factors to connect PCI buses together, where each individual server has a FusionIO card with NAND-based flash and/or magnetic disk behind it. HP and HDS blade servers now support similar functionality, where PCI buses can be interlinked; adding a FusionIO card to each server would create a system where the compute and storage layers are connected via PCI.
Currently, however, server-based Flash is intended for those datasets that have a temporary nature or that have application level redundancy combined with a demand for high IOPS and low latency.
Network Flash
Network flash is a hybrid solution where server-side flash talks to a flash-tiered, flash-cached, or flash-only array. The key point to note is that this technology involves network latency.
Pros and Cons
This approach creates a bottleneck within itself since the server-side Flash card will need to communicate with the Flash array over some kind of network.
Technologies like compression, inline de-duplication, and perhaps a form of tiering that spans the internal card and the central array will alleviate this issue somewhat. With FC speeds going up to 16 and later 32 Gbit/s, and Ethernet going from 40 to 100 Gbit/s, we come a little closer to the 300+ Gbit/s that PCIe 3.0 x32 can accommodate, but if you are looking at a solution at the time of writing, this is a limitation to consider.
EMC entered this network flash space with its Thunder and Lightning projects, and NetApp has also announced that it will enter this space. Currently this technology is not ready for production.
So what can we expect from NAND flash memory in the near future? The fact is that cell degradation will keep haunting it. Cell sizes can't shrink much further without this degradation causing problems, and density is exactly what flash needs to gain more market share.
Also, within the next couple of years, Heat-Assisted Magnetic Recording (HAMR) for spinning disks will mature, and we'll start to see hard disks hundreds of TB in size. IOPS might still be a problem for these disks, but it will be a long time, if ever, before flash memory comes near that many GB per dollar.
It's more likely we'll see a mix of the two: hybrid solutions that do real hot-block auto-tiering. Current tiering solutions, where background processes move large blocks of data (sometimes 100s of MB at a time) to and from SSD as a form of cache a couple of times a day, are only viable in specific use cases. With files, a single-byte change will move the whole file up to SSD (something that's rather pointless when it's a database file or a VDI disk image). Only inline auto-tiering at the block level will be able to do the job, and no solutions exist today that offer that.
But there are other alternatives that may eventually render flash memory obsolete altogether. Techniques like phase-change memory (PRAM) have working prototypes that are 50-100 times faster than flash memory. On current outlooks, the data (bit-)switching times may end up only a few times slower than our current RAM, making it an excellent alternative for all storage solutions.
Other contestants are Ferroelectric RAM (FeRAM) and Magnetoresistive RAM (MRAM). These are comparable to DRAM in speed but are non-volatile, so they don't require power to retain data. Currently they lack density and have much higher costs, but in the near future one of them may prove to be the one universal memory that runs and stores all data. Until these technologies are ready for mass production and can be purchased at a reasonable price, however, RAM and NAND flash are what we have to work with to close the IOPS gap created by ever-growing traditional spinning disks stuck at 15,000 RPM.
It is worth noting that the 5 to 7 years we expect it will take before HAMR- and persistent-RAM-based solutions take over is a significantly shorter timespan than the 40+ years traditional magnetic disks have dominated as the preferred data storage medium.
Is flash therefore an intermediary solution and not worthy of investment? Not entirely. If you have a high IOPS requirement over the next 5-7 years, you may very well not have a choice.
And let's not forget that although PRAM solutions might be technically superior to NAND flash, that doesn't necessarily mean they will be adopted straight away. NAND flash is being produced at an incredible rate and will continue to get cheaper, just as wear-leveling techniques will keep getting better. Advances in these fields will very likely slow the adoption of future technologies, particularly since manufacturers have made significant investments in NAND flash production. Capital, unlike data, can be pretty slow to move.
Whatever you decide you need, look before you leap. Measure the impact of your applications on your storage; don't stop at IOPS, block sizes, and projected growth, but also look at the level of availability, latency, and intelligence you require at the storage level.
The credits for this information go to my colleagues Herco van Brug and Marcel Kleine; follow Herco and Marcel on Twitter. Comments or feedback? Please let us know! (firstname.lastname@example.org; www.twitter.com/rspruijt)
The whitepaper with even more content and technical details can be downloaded here (no registration required).