Compute Express Link (CXL) is an open standard interconnect built on PCIe that maintains a unified memory space and high-speed communication between a CPU, also known as a host processor, and attached accelerator devices.
This communication can also take place between an accelerator device and a virtual machine (VM) or a container.
CXL performs offloading by allowing a device to request and access resources from another device directly, without involving the host processor or the main memory.
CXL allows the CPU and device, or device and virtual machine or container, to load/store from each other's memory, minimizing CPU involvement and redundant data movements between components.
CXL keeps all data coherent, meaning it appears the same no matter which component accesses it. It also allows a device to cache host memory, temporarily keeping a local copy of program instructions and frequently used data close to where they are needed, which speeds up data retrieval.
This feature is important for offloading, as it enables multiple devices to access the same data without causing data corruption or inconsistencies.
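To make the load/store idea concrete, here is a minimal C sketch, assuming a Linux host that exposes the device's memory as a DAX character device; the /dev/dax0.0 path is a placeholder and will differ from system to system.

```c
/*
 * Minimal sketch: direct load/store into CXL-attached memory from the host.
 * Assumes a Linux system that exposes the device memory as a DAX character
 * device; the /dev/dax0.0 path is a placeholder and varies per system.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t map_len = 2UL * 1024 * 1024;        /* map 2 MiB of device memory */

    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t *mem = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    mem[0] = 0xC0FFEE;                               /* plain store into device memory  */
    printf("read back: 0x%" PRIx64 "\n", mem[0]);    /* plain load, no DMA staging copy */

    munmap(mem, map_len);
    close(fd);
    return 0;
}
```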
CXL is currently in its first iteration, CXL 1.1, but CXL 2.0 and CXL 3.0 specs have been announced. (More on that later.)
When the CPU needs to perform a task that is best suited for an accelerator device, it sends a request over the CXL interface to the accelerator device. The accelerator device then retrieves the necessary data from memory or storage, performs the task, and sends the results back to the CPU over the CXL interface.
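A conceptual sketch of that request/response flow, written against a coherent shared buffer, is shown below; the descriptor layout, field names, and polling loop are illustrative assumptions, not structures defined by the CXL specification.

```c
/*
 * Conceptual sketch of the offload handshake over a coherent shared buffer.
 * The descriptor layout, field names, and polling loop are illustrative
 * assumptions, not structures defined by the CXL specification.
 */
#include <stdatomic.h>
#include <stdint.h>

struct offload_desc {
    uint64_t    src;       /* where the input data lives                  */
    uint64_t    dst;       /* where the accelerator should write results  */
    uint32_t    opcode;    /* which task the accelerator should run       */
    atomic_uint doorbell;  /* host sets to 1 to hand the work off         */
    atomic_uint done;      /* device sets to 1 when the results are ready */
};

/*
 * Host side: fill in the request, ring the doorbell, wait for completion.
 * Because the descriptor sits in cache-coherent memory, both sides observe
 * each other's stores without explicit copies or cache flushes.
 */
static void submit_and_wait(struct offload_desc *d,
                            uint64_t src, uint64_t dst, uint32_t opcode)
{
    d->src = src;
    d->dst = dst;
    d->opcode = opcode;
    atomic_store_explicit(&d->done, 0, memory_order_relaxed);
    atomic_store_explicit(&d->doorbell, 1, memory_order_release);

    while (!atomic_load_explicit(&d->done, memory_order_acquire))
        ;  /* a real driver would sleep or wait for an interrupt instead */
}
```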
In the case of a virtual machine (VM), the CXL interface can be used to offload tasks from the CPU to an accelerator device that is assigned to the VM. The accelerator device communicates with the VM over the CXL interface, performing tasks and exchanging data as needed.
In the case of a container, the CXL interface can be used to allow multiple containers to share access to the same accelerator device. The CPU communicates with the accelerator device over the CXL interface, and the accelerator device communicates with the containers in a similar manner, performing tasks and exchanging data as needed.
In all cases, the CXL interface provides a fast, efficient way for the CPU, accelerator devices, VMs, and containers to exchange data and offload tasks, enhancing overall system performance and reducing latency.
As the amount of available data increases, data centers must adapt to handle increasingly complex and demanding workloads.
Server architecture, largely unchanged for decades, is being reworked so that high-performance computing systems can handle the massive amounts of data generated by AI/ML/DL applications.
What does this look like? Data centers are shifting from a model where each server has its own processing, memory, networking devices, and accelerators to a disaggregated pooling model that matches resources and tasks based on workload requirements.
This is where CXL comes into play: it enables efficient resource sharing/pooling for higher performance, reduces the need for complex software, and lowers overall system costs.
CXL supports three protocols: CXL.io, CXL.cache, and CXL.memory.
Together, these protocols enable coherent memory sharing between devices, and each one also serves its own more specific use cases.
Let's take a look at each:
CXL.io is the foundational CXL protocol and is functionally equivalent to PCIe 5.0.
This protocol has a non-coherent load/store interface for I/O devices, and it is used for initialization, device discovery, and connection to the device.
CXL.cache allows CXL devices to access and cache host memory with low latency.
CXL.mem enables a host to access device-attached memory using load/store commands, and it is applicable to both volatile and non-volatile memory architectures.
If needed, CXL.mem can add more RAM (random access memory) to a system through an empty PCIe 5.0 slot. Even though this may slightly lower performance and increase latency, it gives a server far more memory capacity than its DIMM slots alone could provide.
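As a sketch of what that looks like in practice: on Linux, a CXL memory expander is typically surfaced as a CPU-less NUMA node that software can allocate from like ordinary RAM. The node number in the example below is an assumption; a real system's layout would be checked with numactl --hardware.

```c
/*
 * Sketch: using CXL-attached memory as extra RAM. On Linux, a CXL memory
 * expander is typically surfaced as a CPU-less NUMA node; the node number
 * below is an assumption for this example. Build with -lnuma.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    const int    cxl_node = 1;          /* assumed node for the CXL expander */
    const size_t len      = 64UL << 20; /* 64 MiB */

    /* The allocation is backed by the CXL pool but used like ordinary RAM. */
    char *buf = numa_alloc_onnode(len, cxl_node);
    if (buf == NULL) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }

    memset(buf, 0, len);
    printf("64 MiB allocated from NUMA node %d\n", cxl_node);

    numa_free(buf, len);
    return 0;
}
```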
There are three types of CXL devices: Type 1, Type 2, and Type 3.
Let's take a look at each in a little more detail:
Type 1 devices include accelerators such as SmartNICs, which usually do not have local memory, but they can use CXL.io and CXL.cache to communicate with the host processor's DDR (double data rate) memory.
Type 2 devices include GPUs, ASICs, and FPGAs, which are all equipped with DDR or HBM (high-bandwidth memory).
They use CXL.io, CXL.cache, and CXL.mem to make the host processor's memory locally available to the accelerator and the accelerator's memory locally available to the processor.
Host and accelerator memory then sit in the same cache-coherent domain, which helps accelerate a range of workloads.
Type 3 devices use CXL.io and CXL.mem for memory expansion and resource pooling.
For example, a buffer attached to the CXL bus could be used to enable DRAM capacity expansion, increase memory bandwidth, or add persistent memory without the loss of DRAM slots.
This means that high-speed, low-latency storage devices that would previously have displaced DRAM can instead complement it as CXL-enabled devices.
CXL builds upon PCIe 5.0 to establish coherency, a simplified software stack, and compatibility with existing standards.
CXL uses a PCIe 5.0 feature that allows alternate protocols to use the physical PCIe layer. When a CXL-enabled accelerator is plugged into a x16 slot, the device negotiates with the host processor's port at 2.5 GT/s (gigatransfers per second), which is the regular PCIe 1.0 transfer rate.
Alignment with PCIe 5.0 means that both CXL and PCIe 5.0 devices can transfer data at 32 GT/s, or up to 64 GB/s in each direction over a 16-lane link. In addition, the performance demands of CXL are likely to highlight the need to adopt PCIe 6.0.
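As a rough check on the 64 GB/s figure (an approximation that ignores link-layer overhead): 32 GT/s per lane × 16 lanes gives 512 Gb/s of raw signaling, and after PCIe 5.0's 128b/130b encoding that works out to roughly 63 GB/s, commonly rounded to 64 GB/s, in each direction.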
CXL transaction protocols are activated only if both sides support CXL; otherwise, the devices simply operate as standard PCIe devices.
CXL offers a variety of benefits for both businesses and data center operators.
By reducing the software stack and overhead, CXL enables faster and more efficient communication between CPUs, accelerators, and memory devices.
One of the main ways that CXL reduces software stack and overhead is by implementing a cache-coherent protocol, which allows multiple devices to share a common memory space. This means that data can be transferred between devices without the need for additional memory copies or translation, reducing the number of instructions that need to be executed and the amount of time it takes to transfer data.
Additionally, CXL supports direct memory access (DMA), which allows devices to access system memory without involving the CPU. This reduces the overhead associated with managing memory and enables faster data transfers.
CXL also supports a variety of other features that reduce software overhead, such as hardware-based address translation and virtualization support. These features allow devices to access memory and other system resources more efficiently, without the need for additional software layers.
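A minimal illustration of the copy-avoidance point, with made-up buffer names: the copy-based path stages results through a bounce buffer, while the coherent path simply loads them in place.

```c
/*
 * Illustrative contrast between a copy-based path and a coherent path.
 * The buffer names are made up for this example.
 */
#include <stdint.h>
#include <string.h>

/*
 * Without coherent shared memory: results are staged in a bounce buffer and
 * then copied into the application's buffer, costing extra instructions and
 * memory bandwidth.
 */
static void read_results_with_copy(const uint8_t *bounce_buf,
                                   uint8_t *app_buf, size_t len)
{
    memcpy(app_buf, bounce_buf, len);   /* the extra copy the CPU must perform */
}

/*
 * With a cache-coherent shared mapping: the application reads the device's
 * results in place; no staging copy, and no extra software translation layer.
 */
static uint64_t read_result_in_place(const volatile uint64_t *shared_results)
{
    return shared_results[0];           /* plain load straight from shared memory */
}
```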
CXL 1.1 is currently available, and CXL 2.0 and 3.0 have been announced as future iterations of CXL.
Because CXL is closely tied to PCIe, new versions of CXL depend on new versions of PCIe, with about a two-year gap between releases of PCIe and an even longer gap between the release of new specifications and products coming to market.
Let's take a closer look at CXL 2.0 and CXL 3.0:
CXL 2.0 introduces four new features: switching, memory pooling, an "as needed" memory model, and integrity and data encryption (IDE).
Let's dive into each in more detail:
CXL switches connect multiple hosts and memory devices, allowing the number of devices connected on a CXL network to grow.
A multiple logical device (MLD) feature allows numerous hosts and devices to be linked together, with the resulting network of resources managed and controlled by a fabric manager.
Switching is incorporated into the memory devices via a crossbar in the CXL memory pooling chip, which reduces latency but requires a more powerful chip, as the chip is now responsible for forwarding data.
Attached CXL devices can use DDR DRAM to expand the host's main memory.
CXL devices can segment themselves into multiple pieces and provide access to different hosts. Hardware can be dynamically assigned and shifted between various hosts without restarts.
CXL switches also allow for partitioning or provisioning of resources based on current demands. This can be done flexibly, as the host is able to access all or just a portion of the capacity of as many devices as needed.
As a whole, switching helps data centers boost performance through main memory expansion, and it increases efficiency and reduces total cost of ownership through memory pooling.
Switching enables memory pooling: memory devices are connected to a memory fabric and allocated to host devices by a fabric manager.
With a CXL switch, a host can access one or more devices from a memory pool. The host must be CXL 2.0-enabled to do this, but the memory devices can be a mix of CXL 1.1- and 2.0-enabled hardware.
A CXL 1.1 device can act only as a single logical device accessed by one host at a time; a CXL 2.0 device, however, can be partitioned into multiple logical devices, allowing up to 16 hosts to access different portions of its memory.
For example, a single host can use half the memory in one device, and a quarter of the memory in another to match the memory requirements of its workloads to the available memory capacity in the pool.
The remaining capacity of these devices can be used by the other hosts.
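Here is a toy model of that allocation pattern, sketched in C; the structures, device names, and capacities are purely illustrative and not a real fabric-manager API.

```c
/*
 * Toy model of CXL 2.0-style pooling: each pooled device can be carved into
 * logical slices that a fabric manager grants to hosts. The structures and
 * capacities are illustrative only, not a real fabric-manager interface.
 * (A CXL 2.0 device can expose up to 16 logical devices.)
 */
#include <inttypes.h>
#include <stdio.h>

struct pooled_device {
    const char *name;
    uint64_t    capacity_gib;
    uint64_t    allocated_gib;
};

/* Grant want_gib from a device to a host if enough capacity remains. */
static int allocate_slice(struct pooled_device *dev, const char *host,
                          uint64_t want_gib)
{
    if (dev->allocated_gib + want_gib > dev->capacity_gib)
        return -1;                                  /* this device is exhausted */
    dev->allocated_gib += want_gib;
    printf("%s gets %" PRIu64 " GiB from %s\n", host, want_gib, dev->name);
    return 0;
}

int main(void)
{
    struct pooled_device d1 = { "mem-device-1", 256, 0 };
    struct pooled_device d2 = { "mem-device-2", 512, 0 };

    /* The example above: one host takes half of device 1 and a quarter of
     * device 2; whatever is left stays available to other hosts. */
    allocate_slice(&d1, "host-A", d1.capacity_gib / 2);
    allocate_slice(&d2, "host-A", d2.capacity_gib / 4);
    allocate_slice(&d1, "host-B", d1.capacity_gib / 2);
    return 0;
}
```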
CXL 2.0 allocates memory to hosts on an "as needed" basis, resulting in greater memory utilization and efficiency.
This allows server main memory to be provisioned for regular workloads, with the memory pool tapped only when demanding workloads require it, which in turn reduces total cost of ownership.
Disaggregation, meaning the separation of server architecture components, increases vulnerability to cyberattacks.
This is why CXL is designed with advanced cybersecurity measures in mind.
All three CXL protocols are protected by Integrity and Data Encryption (IDE), which is implemented in secure protocol engines inside of the CXL host and device chips to meet the high-speed data rate requirements without increasing latency.
CXL chips and systems themselves need protections against cyberattacks. A hardware root of trust implemented in the CXL chips can provide the foundation for security and support requirements for secure boot and secure firmware download.
CXL 3.0 further disaggregates a server's architecture by allowing processors, storage, networking, and other accelerators to be pooled and addressed dynamically by multiple hosts and accelerators, just like the memory in CXL 2.0.
CXL 3.0 also allows for direct communication between devices/components over a switch or across a switch fabric. For example, two GPUs could talk to each other without using the network or getting the host CPU and memory involved.
The CXL Consortium is an open industry standard group formed to create technical specifications that enhance performance for emerging usage models while supporting an open ecosystem for data center accelerators and other high-speed system enhancements.
CXL plays a critical role in enabling data centers to tackle complex and demanding workloads that result from increasing amounts of available data.
CXL's various protocols and devices all work together to enhance bidirectional communication and reduce latency, a key component of cross-domain solutions.
Within a military environment, this ensures that critical intelligence is kept safe and delivered within a matter of seconds, increasing situational awareness and reducing response times at the tactical edge.
Through more efficient resource sharing (especially of memory), a simplified compute architecture, interconnectivity between components, and strict security measures, CXL helps enhance performance while reducing total cost of ownership.