AI storage platforms meet machine learning, data analysis needs
There are many routes an organization can follow when buying an AI storage platform. But one important goal should be to find a product that will enable the company to more effectively collect data and perform machine learning and AI tasks.
Some of the key issues involved in evaluating and selecting AI data storage products include the following:
- The storage platform must offer high performance and scalability and manage costs effectively.
- Performance must encompass both delivering high throughput and achieving low latency.
- Producing good AI models means collating many terabytes or petabytes of data, which can be costly. Organizations must be aware of the overall cost of managing a machine learning and AI platform.
In deep learning, where algorithms can operate unsupervised, the I/O profile is highly random, because successive layers of the deep learning algorithm process multiple levels of data analysis. Machine learning and AI training typically runs in batch mode: data scientists create machine learning and AI models, test them against data and refine the models over time. This approach requires low latency to ensure quick execution, as a shorter model testing time means more iterations and a better model.
So, the specific storage product an organization chooses should be based on the type of work it does and the machine learning and AI training required. In either case, the cost-to-performance ratio of storage introduces some compromises.
The multi-tier approach
Cost versus performance is a key consideration when purchasing any storage product. Given the option, most companies will purchase the fastest storage possible. However, performance comes at a price, and typically, high-performance systems don't scale into the multi-petabyte range. Add the assumption that the working set of data being analyzed at any time will be a subset of the overall data assets, and it's easy to see that storage tiering is a necessary part of designing machine learning and AI data storage.
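The cost argument for tiering can be illustrated with some simple arithmetic. The sketch below is not taken from any vendor's pricing: the per-terabyte prices and the function names are hypothetical placeholders, and it assumes the capacity tier holds all the data while the performance tier holds only the working set.

```python
# Hypothetical cost comparison: all data on fast storage vs. a two-tier design.
# Prices are illustrative placeholders, not real quotes.

def all_performance_cost(total_tb, perf_cost_per_tb):
    """Cost of keeping every terabyte on the performance tier."""
    return total_tb * perf_cost_per_tb

def two_tier_cost(total_tb, working_set_tb, perf_cost_per_tb, cap_cost_per_tb):
    """Cost when the capacity tier stores all data and the performance
    tier is sized only for the working set."""
    if working_set_tb > total_tb:
        raise ValueError("working set cannot exceed total data")
    return working_set_tb * perf_cost_per_tb + total_tb * cap_cost_per_tb

if __name__ == "__main__":
    total_tb, working_tb = 1000, 100   # assumed data volumes
    perf, cap = 500, 50                # assumed cost per TB for each tier
    print("all performance:", all_performance_cost(total_tb, perf))
    print("two tier:", two_tier_cost(total_tb, working_tb, perf, cap))
```

The gap between the two figures narrows as the working set approaches the size of the full data set, which is why the working-set assumption matters to the design.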
What exactly does tiering mean in the context of machine learning and AI? Traditional tiering products evolved from fixed storage pools to become complex systems that optimized the placement of individual storage blocks, based on usage frequency and available pool capacities. But this approach doesn't work well with machine learning and AI requirements because of how data is processed.
Automated tiering products assume data goes through a lifecycle of importance to the business. New data is highly important and will be accessed frequently. Over time, the value of data diminishes, and it can be moved to lower-cost, lower-performing storage.
Data used for machine learning and AI analysis is different. Entire data sets become active and are used for analysis, and the whole of the data may be required at any one time. This means the data in use has to sit on a tier of storage with consistent performance, because any variability in access will affect tasks such as model training.
The random nature of data processing in machine learning and AI model development means reactive storage platform algorithms that try to dynamically rebalance data over time won't work. These algorithms assume a small and relatively static working set that changes gradually over time. In machine learning and AI, data access profiles will be much more random, making it hard to predict which data to cache and how to size the cache or faster tiers.
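A small simulation can show why caching struggles with random access. The sketch below uses a least recently used (LRU) cache as a stand-in for a reactive tiering algorithm; that substitution, the block counts and the function names are all assumptions made for illustration, not a model of any particular product.

```python
import random
from collections import OrderedDict

def lru_hit_rate(accesses, cache_blocks):
    """Replay a sequence of block IDs through an LRU cache and
    return the fraction of accesses served from the cache."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0

def uniform_accesses(rng, total_blocks, count):
    """Every block is equally likely: the random profile described above."""
    return [rng.randrange(total_blocks) for _ in range(count)]

def skewed_accesses(rng, total_blocks, hot_blocks, hot_fraction, count):
    """A small, static hot set receives most accesses: the profile
    that traditional tiering assumes."""
    out = []
    for _ in range(count):
        if rng.random() < hot_fraction:
            out.append(rng.randrange(hot_blocks))
        else:
            out.append(rng.randrange(hot_blocks, total_blocks))
    return out

if __name__ == "__main__":
    rng = random.Random(7)
    print("uniform:", lru_hit_rate(uniform_accesses(rng, 1000, 20000), 100))
    print("skewed:", lru_hit_rate(skewed_accesses(rng, 1000, 50, 0.9, 20000), 100))
```

With uniformly random access, the hit rate tends toward the ratio of cache size to data set size, so a cache only helps in proportion to how much of the data set it can hold.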
The two-tier storage model
An obvious way to implement storage for machine learning and AI workloads is to simply use a two-tier model. The performance tier offers as much performance and the lowest latency possible, while being sized for the maximum data set the system is expected to process.
High-performance flash is expensive, and as the market moves toward capacity flash products like triple- and quadruple-level cell, a new market is emerging at the high-performance end, with low-latency flash products such as Samsung Z-NAND and Toshiba XL-Flash. By offering low-latency I/O, these complement the storage-class memory products under development. Vast Data, for example, uses both quadruple-level cell and Intel Optane technology to deliver a high-performance, scalable store for unstructured data, with both NFS and S3 API support.
These tier 0 products use NVMe devices for connectivity, either internally or across a storage network. NVMe optimizes the I/O stack or I/O protocol, compared to legacy SAS and SATA. The result is lower latency and greater throughput, but also much higher platform utilization as server processors aren't waiting as long for I/O to complete.
Products such as Pure Storage AIRI, IBM Spectrum Storage for AI and NetApp All Flash FAS A800 all use NVMe internally to obtain the highest possible media performance. Dell EMC and DataDirect Networks use scale-out file system products from their product lines to support machine learning and AI reference architectures.
The capacity tier needs to safely store all AI model data for extended periods of time, typically months or years. As a result, scalable platforms that offer high degrees of durability are essential to manage the volumes of data required for machine learning and AI. The object storage market has evolved to produce a range of AI storage products that are highly scalable and durable.
What exactly is durability?
In a typical storage system, data is protected using a scheme that builds redundancy into the data stored on disk. If an individual component fails, the redundant data is used to recover from the loss and rebuild data once the failed components are replaced. Although RAID 5 and above provide protection for drive failures, additional systems are needed to protect against large-scale disasters, such as data center outages. Durability, or the mitigation of data loss, is expensive to implement as traditional systems scale.
Erasure coding builds redundancy into data, so that the loss of drives, servers or even entire data centers doesn't cause data loss. The dispersed nature of erasure coded data means storage systems can be built to scale multiple petabytes with local and geographic data protection, without the expense and overhead of managing multiple systems.
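The idea can be sketched with the simplest possible erasure code: a single XOR parity fragment, which tolerates the loss of any one fragment. This is an illustrative assumption only. Production object stores generally use stronger codes that survive multiple simultaneous losses, and the function names here are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments):
    """Return the data fragments plus one XOR parity fragment.
    All fragments must have the same length."""
    parity = reduce(xor_bytes, fragments)
    return list(fragments) + [parity]

def recover(stored, lost_index):
    """Rebuild the fragment at lost_index by XORing all survivors."""
    survivors = [f for i, f in enumerate(stored) if i != lost_index]
    return reduce(xor_bytes, survivors)

def raw_capacity_overhead(k, m):
    """Raw capacity needed per unit of usable data for a code with
    k data fragments and m redundancy fragments."""
    return (k + m) / k

if __name__ == "__main__":
    stored = encode([b"AAAA", b"BBBB", b"CCCC"])
    # Pretend the fragment at index 1 was on a failed drive or site.
    print(recover(stored, 1))
    print(raw_capacity_overhead(10, 4))
```

Placing each fragment on a different drive, server or site is what lets a single system ride out the loss of any one of them. The overhead function shows the appeal over keeping full copies: a hypothetical 10+4 layout needs far less raw capacity than three replicas.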
Object stores offer scalability and durability for data that must be retained over long periods, typically multiple years. However, to gain the cost benefit, object storage products are built with inexpensive storage based on hard drives with some caching capability. This makes them less suitable for daily processing of machine learning and AI data, but excellent for long-term retention.
A geo-dispersed object store also enables data to be ingested from multiple locations and sources and accessed from multiple locations. This can be valuable if, for example, data processing uses a mix of on-premises and public cloud infrastructure. Geo-dispersal is a feature of the Scality Ring platform, which integrates with Hewlett Packard Enterprise and WekaIO Inc. products to create a two-tier storage architecture.
Hybrid storage architectures
The challenge for businesses is how to implement a hybrid architecture that includes both highly scalable and high-performance storage. Object storage systems enable organizations to store most data, while some offerings use performance nodes that store active data on servers with high-performance flash. The advantage of this approach is that either capacity or performance nodes can be added to products to scale in either direction. Cloudian, for example, offers hardware appliances that provide either scalability or performance capabilities.
Systems that are built from high-performance storage must be designed to scale for the entire data set being processed. In these scenarios, data is moved to and from the high-performance platform, as multiple AI data sets are processed over time.
The storage architecture must provide enough network bandwidth both to move data to and from the AI storage product and to meet the requirements of the AI platform. Products such as the Nvidia DGX-1 and DGX-2 platforms can consume tens of gigabytes of data per second. As a result, to keep up, the connectivity between compute and storage in AI data storage products must be low-latency InfiniBand or 100 Gigabit Ethernet.
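A rough sizing calculation makes the bandwidth requirement concrete. The sketch below is back-of-envelope arithmetic only: the 90% link efficiency, the 20 GBps target and the function names are assumptions for illustration, and it uses decimal units (1 TB = 1,000 GB).

```python
import math

def usable_gbytes_per_sec(link_gbits, efficiency):
    """Convert a link's line rate in gigabits per second to usable
    gigabytes per second, after an assumed protocol efficiency."""
    return link_gbits / 8 * efficiency

def links_needed(target_gbytes_per_sec, link_gbits=100, efficiency=0.9):
    """Number of network links needed to sustain the target throughput."""
    per_link = usable_gbytes_per_sec(link_gbits, efficiency)
    return math.ceil(target_gbytes_per_sec / per_link)

def transfer_seconds(dataset_tb, gbytes_per_sec):
    """Time to move a data set between tiers at a sustained rate."""
    return dataset_tb * 1000 / gbytes_per_sec

if __name__ == "__main__":
    # Assumed figures: a compute node consuming 20 GBps, a 100 TB data set.
    print("100 GbE links:", links_needed(20))
    print("seconds to stage 100 TB:", transfer_seconds(100, 20))
```

The same arithmetic applies to staging: the time to load a data set onto the performance tier is bounded by the slower of the network and the capacity tier's read throughput.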
Software-defined storage for AI products
Building storage for machine learning and AI doesn't have to mean deploying an appliance. New high-performance AI storage products are available that are essentially software-defined storage (SDS). These products take advantage of the performance of new media -- including NVMe and, in some cases, persistent memory or storage-class memory.
One benefit of SDS products is their applicability to public cloud, as they can be instantiated and scaled dynamically across public cloud infrastructure. This model of operation can be appealing when the amount of infrastructure isn't known or is required for only short periods of time.
WekaIO offers its Matrix software-based, scale-out storage platform that can be deployed on premises on servers with NVMe drives or in AWS public cloud with NVMe-enabled Elastic Compute Cloud instances. Excelero NVMesh is another SDS product that scales performance linearly across multiple servers and storage, and is typically combined with IBM Spectrum Scale to create a scale-out file system.
Data mobility
Combining the capacity and performance tiers into a single product requires manual or automated processes to move data between the two tiers, along with metadata to track data as it's moved around. Some AI storage products can be integrated directly with object storage, simplifying this process. The public cloud can be a powerful option for machine learning and AI development, as data moved between internal cloud services generates no storage egress charges. WekaIO Matrix, for example, can replicate data on and off premises and archive it to object storage.
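The tracking metadata described above can be pictured as a small catalog that records which data sets are currently staged on the performance tier. The sketch below is a hypothetical illustration rather than the design of any product named here: the `TierCatalog` class and its methods are invented, it assumes the capacity tier always keeps the authoritative copy, and it evicts the oldest staged data set first when space runs out.

```python
from dataclasses import dataclass, field

@dataclass
class TierCatalog:
    """Hypothetical metadata catalog for a two-tier store."""
    performance_capacity_tb: float
    sizes_tb: dict = field(default_factory=dict)
    staged: list = field(default_factory=list)  # oldest staged first

    def register(self, name, size_tb):
        """Record a data set that lives on the capacity tier."""
        self.sizes_tb[name] = size_tb

    def used_tb(self):
        """Performance tier space consumed by staged data sets."""
        return sum(self.sizes_tb[n] for n in self.staged)

    def location(self, name):
        """Fastest tier on which the data set can currently be read."""
        return "performance" if name in self.staged else "capacity"

    def stage(self, name):
        """Stage a data set on the performance tier, evicting the oldest
        staged data sets if needed. Returns the evicted names."""
        size = self.sizes_tb[name]
        if size > self.performance_capacity_tb:
            raise ValueError("data set larger than performance tier")
        if name in self.staged:
            return []
        evicted = []
        while self.used_tb() + size > self.performance_capacity_tb:
            # The capacity tier still holds the data, so eviction
            # only drops the staged copy.
            evicted.append(self.staged.pop(0))
        self.staged.append(name)
        return evicted

if __name__ == "__main__":
    catalog = TierCatalog(performance_capacity_tb=100)
    catalog.register("images", 60)
    catalog.register("logs", 50)
    catalog.stage("images")
    print(catalog.stage("logs"), catalog.location("images"))
```

A real implementation would also have to copy the data itself and cope with failures partway through a move. The catalog only captures the bookkeeping side of the problem.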
Putting it all together
Businesses that want to implement on-premises storage for machine learning and AI workloads must consider capacity and performance. For the performance tier, they can either build from scratch or deploy a packaged product, effectively converged infrastructure for machine learning. With the build option, businesses can deploy an on-premises appliance or use SDS. SDS enables organizations to implement storage as a separate layer or build out a hyper-converged infrastructure. If data will be retained on premises, then the organization can use appliances or follow the software-defined route to deploy a capacity tier using object storage.
Turning to the public cloud, IT organizations can use native services, such as object storage and block storage. File storage products still have a long way to go in terms of reaching the low latencies machine learning and AI applications need. Instead, organizations are likely to use block storage, especially in conjunction with SDS or AI storage products that add a file-services layer to native block resources.