Architectural and System Implications of Disaggregated Memory

Disaggregated memory offers a compelling solution to the constraints of conventional memory architectures by decoupling memory from individual compute nodes and pooling it across an entire data center. This enables processors across hosts to share a unified memory pool, improving memory utilization, lowering cost, and significantly expanding capacity for memory-intensive applications.

Compute Express Link (CXL) is an open-standard interconnect originally introduced by Intel and now governed by the CXL Consortium, whose members include AMD, NVIDIA, and Micron. CXL allows processors to access memory devices over the PCIe bus, either within a single host or across multiple hosts interconnected via a PCIe switch. Crucially, CXL maintains the conventional shared-memory programming model, permitting unmodified applications to interact with disaggregated memory in a byte-addressable fashion using standard load/store instructions. As CXL technology evolves, commercial devices are emerging in diverse forms, such as FPGA-based memory expanders, ASIC-based modules for single-host memory expansion, and CXL-enabled servers for multi-host memory pooling.

Combined with traditional local DRAM attached directly to the processor, CXL-enabled remote memory facilitates a tiered memory system whose components vary in speed, capacity, power consumption, and cost. Since the advent of commercial CXL memory solutions, research studies have sought to characterize whether the increased memory latency and performance heterogeneity introduced by CXL memory can undermine the effectiveness and efficiency of existing processor designs, which are built on the assumption of low and relatively uniform memory latency in traditional memory hierarchies.
It was found that unlocking the full potential of tiered memory systems relies on a multi-pronged approach:

- Adaptive memory scheduling: schedulers account for differences in latency, bandwidth, and parallelism between local DDR and remote CXL memory, maintaining system balance based on workload criticality and contention.
- Load balancing: intelligent interleaving of requests across multiple CXL devices, with dynamic queue management, prevents any single tier from becoming a bottleneck.
- Enhanced hardware-level parallelism: multi-ported designs, parallelism-aware controllers, and improved interconnect integration between the memory controller and the CXL fabric help reduce protocol latency.
- Cache and prefetcher optimization: refining Last-Level Cache (LLC) policies with tier-aware caching, dynamic cache partitioning, and enhanced prefetch mechanisms, supported by OS/runtime hints, reduces cache miss penalties.
- System-level software optimizations: tuning the OS and applications through adaptive memory allocation, profiling, and dynamic adjustments ensures that latency-critical tasks map to local DDR while less time-sensitive data can use CXL memory.

# https://lnkd.in/eMwMb4Hr
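The first two points above (latency-aware scheduling and load balancing across tiers) can be sketched as a toy dispatcher that sends each memory request to whichever tier currently has the lowest estimated latency. The tier names and latency figures are illustrative assumptions, not measurements of any real device.

```python
# Minimal sketch of latency- and queue-aware scheduling across memory tiers.
# Latency numbers below are invented for illustration.
from dataclasses import dataclass

@dataclass
class MemoryTier:
    name: str
    base_latency_ns: float   # unloaded access latency
    per_request_ns: float    # extra queueing delay per outstanding request
    queue: int = 0           # outstanding requests

    def estimated_latency(self) -> float:
        return self.base_latency_ns + self.queue * self.per_request_ns

def dispatch(tiers, n_requests):
    """Send each request to the tier with the lowest estimated latency."""
    placements = []
    for _ in range(n_requests):
        tier = min(tiers, key=lambda t: t.estimated_latency())
        tier.queue += 1
        placements.append(tier.name)
    return placements

ddr = MemoryTier("local_ddr", base_latency_ns=100, per_request_ns=20)
cxl = MemoryTier("cxl_mem",  base_latency_ns=250, per_request_ns=20)

placements = dispatch([ddr, cxl], 12)
print(placements)
```

Early requests all land on local DDR; once its queueing delay pushes its estimated latency past the CXL tier's, the dispatcher begins interleaving across both, which is the basic idea behind keeping any one tier from becoming a bottleneck.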
Common Memory System Challenges in Data Centers
Summary
Common memory system challenges in data centers refer to the difficulties faced in managing, expanding, and efficiently using memory resources across servers. As modern applications demand more memory and faster access speeds, data centers must adapt to overcome limitations such as memory rigidity, latency, and underutilized resources.
- Address memory rigidity: Use memory pooling and expansion technologies like CXL to allow servers to borrow unused memory and adapt to changing workload needs without expensive hardware upgrades.
- Reduce latency bottlenecks: Improve system design by adopting adaptive scheduling, smarter cache policies, and parallel processing to minimize slowdowns when accessing remote or tiered memory.
- Increase resource utilization: Dynamically allocate memory to workloads based on real-time needs to prevent idle memory and cut down on unnecessary server expenses.
💥 Why Every Cloud Vendor Is Betting on CXL

✅ The world before CXL: memory is rigid

Every CPU socket has:
- A fixed number of DIMM slots
- A fixed maximum memory capacity
- A tightly coupled memory controller

So if a server supports 8 DIMMs per socket, that's it. You cannot:
- Add more memory than the motherboard supports
- Borrow unused memory from another server
- Flexibly allocate RAM between workloads

This creates serious problems in cloud data centers:

🔺 Stranded memory: some servers use only 40–50% of their RAM, but that memory sits idle.
🔺 Fixed "instance types": cloud vendors must sell compute-optimized, memory-optimized, and storage-optimized SKUs because the hardware is rigid.
🔺 Expensive upgrades: if a workload needs more memory, you must swap DIMMs, buy a bigger server, or scale horizontally (extra cost and complexity).

This rigidity is exactly what CXL fixes.

✅ What is CXL (Compute Express Link)?

CXL lets the CPU access extra memory both inside the server (via CXL memory expanders) and outside the server (via CXL memory pools), all over a PCIe-based link. It adds cache-coherent, low-latency extensions on top of PCIe. There are three sub-protocols:
- CXL.io: same as PCIe config/MMIO
- CXL.cache: the device caches host memory
- CXL.mem: the device exposes memory to the CPU

For cloud vendors today, CXL.mem is the real game changer. CXL.io and CXL.cache exist, but they're not the focus of this post.

✅ First use-case: CXL Memory Expansion (inside a single server)

Imagine your server has 8 DIMM slots, all filled, and you want more memory. Before CXL? Impossible. With CXL? Just plug in a CXL memory module. Examples:
- A CXL memory expander card in a PCIe slot
- A CXL Type-3 memory module (a pure-memory device) in a special CXL DIMM slot

The CPU sees this new CXL memory as additional memory regions, just slower than local DRAM.
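The "slower than local DRAM" point is what makes tiering decisions necessary: with limited DRAM and a larger CXL expander, hot data should live in DRAM and cold data can spill to CXL. A minimal sketch of that placement policy, with made-up page names, access counts, and capacities:

```python
# Minimal sketch of DRAM + CXL tiered placement: the most-accessed pages
# occupy the limited local DRAM; everything else spills to the CXL expander.
# All inputs here are invented for illustration.

def place_pages(access_counts, dram_capacity):
    """Return {page: tier}, keeping the most-accessed pages in DRAM."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return {page: ("dram" if i < dram_capacity else "cxl")
            for i, page in enumerate(ranked)}

counts = {"A": 900, "B": 40, "C": 700, "D": 5}   # page -> access count
placement = place_pages(counts, dram_capacity=2)  # DRAM holds only 2 pages
print(placement)
```

Real tiering systems (e.g. OS-level page promotion/demotion) track hotness continuously and migrate pages over time, but the ranking-by-access-frequency idea is the same.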
So CXL enables:
🔹 More memory than the motherboard physically allows
🔹 Memory upgrades without touching DIMMs
🔹 Mixing DRAM + CXL memory as "tiered memory" (hot data in DRAM, cold data in CXL)

✅ Second use-case: Memory Pooling (shared memory across servers)

This is CXL 2.0+. Now we add a CXL switch:
- Multiple servers connect to a CXL switching layer
- Behind it sits a box full of CXL memory modules
- Each server can borrow slices of that memory dynamically

This solves stranded memory.

✅ Why cloud vendors need CXL.mem

🔹 They want fewer fixed SKUs. Instead of compute-optimized and memory-optimized instance types, they want flexible instances: "give this VM exactly 128 GB more memory from the pool".
🔹 They want higher resource utilization. Today's cloud servers run memory under capacity; CXL enables safe oversubscription and much higher utilization.
🔹 Modern workloads demand far more memory. Databases, analytics engines, vector DBs, and AI inference are all becoming memory-hungry; local DIMMs can't keep up.
🔹 They want to reduce CAPEX. Less stranded memory = fewer servers needed → massive savings.

#CXL #PCIe
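The borrow-and-return behavior of a CXL 2.0-style pool can be sketched as a simple bookkeeping class: hosts lease slices from a shared pool behind the switch and return them when done, so released capacity is immediately available to other hosts instead of sitting stranded. Pool size and host names below are invented.

```python
# Minimal sketch of a shared CXL memory pool: servers borrow and return
# slices dynamically. Capacities and host names are illustrative only.

class CxlMemoryPool:
    def __init__(self, total_gb):
        self.free_gb = total_gb
        self.leases = {}          # host -> GB currently borrowed

    def borrow(self, host, gb):
        """Grant a slice if the pool has capacity; return True on success."""
        if gb > self.free_gb:
            return False
        self.free_gb -= gb
        self.leases[host] = self.leases.get(host, 0) + gb
        return True

    def release(self, host, gb):
        """Return a slice to the pool so other hosts can use it."""
        gb = min(gb, self.leases.get(host, 0))
        self.leases[host] -= gb
        self.free_gb += gb

pool = CxlMemoryPool(total_gb=1024)
pool.borrow("server-a", 128)      # "give this VM exactly 128 GB more"
pool.borrow("server-b", 512)
pool.release("server-a", 128)     # returned memory is no longer stranded
print(pool.free_gb)
```

A real pool manager must also handle address mapping, isolation, and failure of the pool device, but the core economics (shared capacity instead of per-server stranded DIMMs) are captured by this lease model.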
This computing challenge was laid bare almost 10 years ago by some of the brightest minds in the field. Today, hundreds of researchers and engineers are still banging their heads. The problem: hardware and software designed for millisecond-scale I/O (e.g., disks, Ethernet) become sluggish for modern hardware (e.g., RDMA, GPUs).

When reading a packet from the network took 100 ms, it was OK to have a program call a blocking read() operation and yield the processor to another program. The few microseconds of overhead were easily recouped by the vast utilization gains. (Utilization is a MAJOR economic and environmental concern in datacenters.) But when an RDMA read takes 2 microseconds, 10 or 100 microseconds represent humongous overheads. Where does that come from?
- Copying bytes from user space to kernel space
- Thrashing shared caches
- Waiting for a notification (e.g., a posted interrupt)
- Slamming a full-fledged TCP/IP stack on top of RDMA devices
... and so on.

In the past decade, a lot of work has been done, both in hardware and software, to reduce microsecond-scale overheads. But today, in the age of "AI factories", where NVLink 5.0 can blast 1.8 TB/s between processing units, we are nowhere near done optimizing the stack to improve datacenter utilization and address the big challenges of the 21st century.

p.s.: in the hypothetical scenario where one were to process 1 KB requests at 1.8 TB/s, one would have about 0.56 nanoseconds to process each request. A hypothetical 1-microsecond context switch tax represents a ~1800x overhead...
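The closing back-of-the-envelope numbers can be checked directly. Taking 1 KB as 1000 bytes, the per-request time budget at NVLink 5.0's 1.8 TB/s and the relative cost of a 1-microsecond context switch work out as follows:

```python
# Recomputing the post's back-of-the-envelope numbers.
LINK_BANDWIDTH = 1.8e12          # bytes/second (NVLink 5.0 per the post)
REQUEST_SIZE = 1_000             # bytes (1 KB)
CONTEXT_SWITCH = 1e-6            # seconds (hypothetical 1 us tax)

budget = REQUEST_SIZE / LINK_BANDWIDTH   # time budget per request, seconds
overhead = CONTEXT_SWITCH / budget       # context-switch cost vs. budget

print(f"{budget * 1e9:.2f} ns per request")   # ~0.56 ns
print(f"{overhead:.0f}x overhead")            # ~1800x
```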