NVIDIA, IBM, and universities develop GPU technology that boosts performance by connecting GPUs directly to SSDs instead of relying on the CPU
Big accelerator Memory, or BaM, is an intriguing effort to reduce the dependence of NVIDIA GPUs and comparable hardware accelerators on a standard CPU for tasks such as accessing storage, which should improve both performance and capacity. NVIDIA is the most prominent member of the BaM team, using its extensive resources for inventive projects such as moving routine CPU-focused tasks to GPU cores.

Instead of depending on virtual address translation, page-fault-based on-demand data loading, and other standard CPU-based mechanisms for managing large amounts of data, BaM provides a software and hardware architecture that lets NVIDIA graphics processors fetch data directly from memory and storage and process it without relying solely on CPU cores. (Source: the BaM design paper written by the researchers.)

Dissecting BaM, we see two prominent features. The first is a software-managed cache in GPU memory. Moving data between storage and the graphics card is handled by threads running on the GPU cores, using RDMA, PCI Express interfaces, and custom Linux kernel drivers that allow the SSDs to read and write GPU memory directly when required. The second is a software library that lets GPU threads request data directly from NVMe SSDs by communicating with those drives. GPU threads prepare drive commands only when the requested data is not already in the software-managed cache. Algorithms running on the graphics processor to complete heavy workloads can then access the data they need efficiently and, most importantly, in a way that is optimized for their specific data access patterns.

Researchers from the three groups experimented on a prototype Linux-based system using BaM with standard GPUs and NVMe SSDs to demonstrate that the design is a viable alternative to the current approach, in which the CPU directs all storage access.
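The hit-or-miss behavior of the software-managed cache described above can be illustrated with a small simulation. This is only a sketch in plain Python, not BaM code: the class name `SoftwareCache`, the parameters `backing_store` and `capacity_blocks`, and the least-recently-used eviction policy are all assumptions made for illustration, and a dictionary stands in for the NVMe SSD. The point it shows is that a storage request is issued only when the requested block is missing from the cache.

```python
from collections import OrderedDict


class SoftwareCache:
    """Toy model of a software-managed cache in front of a slow block store.

    Hypothetical illustration only; names and the LRU policy are assumptions.
    """

    def __init__(self, backing_store, capacity_blocks):
        self.backing_store = backing_store      # stands in for the NVMe SSD
        self.capacity_blocks = capacity_blocks  # stands in for reserved GPU memory
        self.blocks = OrderedDict()             # block_id -> data, oldest first
        self.storage_reads = 0                  # simulated storage commands issued

    def read(self, block_id):
        if block_id in self.blocks:
            # Hit: serve from the cache; no storage command is issued.
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Miss: only now is a storage request prepared and issued.
        data = self.backing_store[block_id]
        self.storage_reads += 1
        if len(self.blocks) >= self.capacity_blocks:
            self.blocks.popitem(last=False)     # evict the least recently used block
        self.blocks[block_id] = data
        return data


store = {i: f"block-{i}" for i in range(8)}
cache = SoftwareCache(store, capacity_blocks=2)
for block_id in [0, 1, 0, 2, 1]:
    cache.read(block_id)
print(cache.storage_reads)  # 4: the second read of block 0 is the only hit
```

In this access sequence, four of the five reads go to the simulated drive. On real hardware many GPU threads would run this path concurrently, which the single-threaded sketch does not attempt to model.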
The researchers explain that storage accesses can be issued in parallel, that synchronization limitations are removed, and that I/O bandwidth is used to boost application performance much more efficiently than before.

"BaM provides a user-level library of highly concurrent NVMe submission/completion queues in GPU memory that enables GPU threads whose on-demand accesses miss from the software cache to make storage accesses in a high-throughput manner," NVIDIA's chief scientist Bill Dally, who previously led Stanford's computer science department, and the other authors write in the paper. "This user-level approach incurs little software overhead for each storage access and supports a high-degree of thread-level parallelism."

Details of the BaM design, covering both hardware and software optimizations, will be open-sourced so that other companies can create similar designs of their own. AMD offered comparable functionality with its Radeon Solid State Graphics card, which placed flash storage next to the graphics processor.

News Source: The Register