
Portable Hardware Locality (hwloc) Documentation: v2.11.0

Heterogeneous Memory

Heterogeneous memory hardware exposes different NUMA nodes for different memory technologies. On the image below, a dual-socket server has both HBM (high bandwidth memory) and usual DRAM connected to each socket, as well as some CXL memory connected to the entire machine.

The hardware usually exposes "normal" memory first because it is where "normal" data buffers should be allocated by default. However, there is no guarantee about the order in which HBM, NVM or CXL nodes appear after it. Hence memory technologies and performance must be made explicit to help users decide where to allocate.

Memory Tiers

hwloc builds Memory Tiers to identify different kinds of NUMA nodes. On the above machine, the first tier would contain both HBM NUMA nodes (L#1 and L#3), while the second tier would contain both DRAM nodes (L#0 and L#2), and the CXL memory (L#4) would be in the third tier. NUMA nodes are then annotated accordingly:

  • Each node object has its subtype field set to HBM, DRAM or CXL-DRAM (see other possible values in Normal attributes).
  • Each node also has a string info attribute with name MemoryTier and value 0 for the first tier, 1 for the second, etc.

Tiers are built using two kinds of information:

  • First hwloc looks into operating system information to find out whether a node is non-volatile, CXL, special-purpose, etc.
  • Then it combines that knowledge with performance metrics exposed by the hardware to guess what's actually DRAM, HBM, etc. These metrics, for instance read and write bandwidth and latency, are also exposed as hwloc Memory Attributes. For more details, see Memory Attributes and "Comparing memory node attributes for finding where to allocate on".

Once nodes with similar or different characteristics are identified, they are placed in tiers. Tiers are then sorted by bandwidth so that the highest bandwidth is ranked first, etc.

If hwloc fails to build tiers properly, see HWLOC_MEMTIERS and HWLOC_MEMTIERS_GUESS in Environment Variables.

Using Heterogeneous Memory from the command-line

Tiers may be specified in location filters when using NUMA nodes in hwloc command-line tools. For instance, binding memory on the first HBM node (numa[hbm]:0) is actually equivalent to binding on the second node (numa:1) on our example platform:

$ hwloc-bind --membind 'numa[hbm]:0' -- myprogram
$ hwloc-bind --membind 'numa:1' -- myprogram

To count DRAM nodes in the first CPU package, or all nodes:

$ hwloc-calc -N 'numa[dram]' package:0
1
$ hwloc-calc -N 'numa' package:0
2

To list the physical indexes of all Tier-0 NUMA nodes (here the HBM nodes P#2 and P#3; physical indexes are not shown on the figure):

$ hwloc-calc -I 'numa[tier=0]' -p all
2,3

The number of tiers may be retrieved by looking at topology attributes in the root object:

$ hwloc-info --get-attr "info MemoryTiersNr" topology
2

hwloc-calc and hwloc-bind also have options such as --local-memory and --best-memattr to select the best NUMA node among the local ones. For instance, the following commands report that, among the nodes near node:0 (DRAM L#0), the best one for latency is node:0 itself, while the best one for bandwidth is node:1 (HBM L#1).

$ hwloc-calc --best-memattr latency node:0
0
$ hwloc-calc --best-memattr bandwidth node:0
1

Using Heterogeneous Memory from the C API

There are two major changes introduced by heterogeneous memory when looking at the hierarchical tree of objects.

  • First, there may be multiple memory children attached at the same place. For instance, each Package in the above image has two memory children, one for the DRAM NUMA node, and another one for the HBM node.
  • Second, memory children may be attached at different levels. In the above image, CXL memory is attached to the root Machine object instead of below a Package.

Hence, one may have to rethink the way NUMA nodes are selected.

Iterating over the list of (heterogeneous) NUMA nodes

A common need consists in iterating over the list of NUMA nodes (e.g. using hwloc_get_next_obj_by_type()). This is useful for counting some domains before partitioning a job, or for finding a node that is local to some objects. With heterogeneous memory, one should remember that multiple nodes may now have the same locality (HBM and DRAM above) or overlapping localities (e.g. DRAM and CXL above). Checking NUMA node subtype or tier attributes is a good way to avoid this issue by ignoring nodes of different kinds.

Another solution consists in ignoring nodes whose cpuset overlaps the previously selected ones. For instance, in the above example, one could first select DRAM L#0 but ignore HBM L#1 (because it overlaps with DRAM L#0), then select DRAM L#2 but ignore HBM L#3 and CXL L#4 (which overlap with DRAM L#2).


It is also possible to iterate over the memory parents (e.g. Packages in our example) and select only one memory child for each of them. hwloc_get_memory_parents_depth() may be used to find the depth of these parents. However, this method only works if all memory parents are at the same level. It would fail in our example: the root Machine object also has a memory child (CXL), hence hwloc_get_memory_parents_depth() would return HWLOC_TYPE_DEPTH_MULTIPLE.

Iterating over local (heterogeneous) NUMA nodes

Another common need is to find NUMA nodes that are local to some objects (e.g. a Core). A basic solution consists in looking at the Core nodeset and iterating over NUMA nodes to select those whose nodeset is included in it. A nicer solution is to walk up the tree to find ancestors with a memory child. With heterogeneous memory, multiple such ancestors may exist (Package and Machine in our example) and they may have multiple memory children.

Both these methods may be replaced with hwloc_get_local_numanode_objs() which provides a convenient and flexible way to retrieve local NUMA nodes. One may then iterate over the returned array to select the appropriate one(s) depending on their subtype, tier or performance attributes.


hwloc_memattr_get_best_target() is also a convenient way to select the best local NUMA node according to performance metrics. See also "Comparing memory node attributes for finding where to allocate on".