Portable Hardware Locality (hwloc) Documentation: v2.11.2

Topology Attributes: Distances, Memory Attributes and CPU Kinds

Besides the hierarchy of objects and individual object attributes (see Object attributes), hwloc may also expose finer-grained information about the hardware organization.

Distances

A machine with 4 CPUs may have identical links between every pair of CPUs, or those CPUs may only be connected through a ring. In the ring case, accessing the memory of a nearby CPU is slower than accessing local memory, but it is still faster than accessing the memory of the CPU on the opposite side of the ring. Such details cannot be expressed in the hwloc hierarchy, which is why hwloc also exposes distances.

Distances are matrices of values between sets of objects, usually latencies or bandwidths. By default, hwloc tries to get a matrix of relative latencies between NUMA nodes when exposed by the hardware.

In the aforementioned ring case, the matrix could report 10 for the latency between a NUMA node and itself, 20 for nearby nodes, and 30 for nodes on opposite sides of the ring. These are theoretical values exposed by hardware vendors (in the ACPI System Locality Distance Information Table, SLIT) rather than measured physical latencies. They are mostly meant for comparing the relative distances between nodes.
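
Such a matrix may be retrieved by name ("NUMALatency" below) and printed with the distances API; this is a minimal sketch, assuming a topology has already been loaded, with error handling mostly omitted:

  #include <stdio.h>
  #include <hwloc.h>
  #include <hwloc/distances.h>

  /* Print the relative NUMA latencies, if the hardware exposes them. */
  static void print_numa_latencies(hwloc_topology_t topology)
  {
    struct hwloc_distances_s *dist;
    unsigned nr = 1, i, j;

    /* request at most one distances structure named "NUMALatency" */
    if (hwloc_distances_get_by_name(topology, "NUMALatency", &nr, &dist, 0) < 0
        || nr == 0) {
      printf("no NUMALatency matrix\n");
      return;
    }

    for (i = 0; i < dist->nbobjs; i++) {
      for (j = 0; j < dist->nbobjs; j++)
        /* value from objs[i] to objs[j] */
        printf(" %llu", (unsigned long long) dist->values[i * dist->nbobjs + j]);
      printf("\n");
    }
    hwloc_distances_release(topology, dist);
  }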

Distances structures currently created by hwloc are:

NUMALatency (Linux, Solaris, FreeBSD)
This is the matrix of theoretical latencies described above.
XGMIBandwidth (RSMI)

This is the matrix of unidirectional XGMI bandwidths between AMD GPUs (in MB/s). It contains 0 when there is no direct XGMI link between objects. Values on the diagonal are artificially set to a very high value so that local access always appears faster than remote access.

GPUs are identified by RSMI OS devices such as "rsmi0". They may be converted into the corresponding OpenCL or PCI devices using hwloc_get_obj_with_same_locality() or the hwloc-annotate tool.

hwloc_distances_transform() or hwloc-annotate may also be used to transform this matrix into something more convenient, for instance by replacing bandwidths with numbers of links between peers.

XGMIHops (RSMI)
This matrix lists the number of XGMI hops between AMD GPUs. It reports 1 when there is a direct link between two distinct GPUs. If there is no XGMI route between them, the value is 0. The number of hops between a GPU and itself (on the diagonal) is 0 as well.
XeLinkBandwidth (LevelZero)

This is the matrix of unidirectional XeLink bandwidths between Intel GPUs (in MB/s). It contains 0 when there is no direct XeLink between objects. When there are multiple links, their bandwidth is aggregated.

Values on the diagonal are artificially set to a very high value so that local access always appears faster than remote access. This includes bandwidths between a (sub)device and itself, between a subdevice and its parent device, or between two subdevices of the same parent.

The matrix covers all LevelZero devices and subdevices (if any), even if some of them have no link at all.

The bandwidths of links between subdevices are accumulated into the bandwidth between their parent devices.

NVLinkBandwidth (NVML)

This is the matrix of unidirectional NVLink bandwidths between NVIDIA GPUs (in MB/s). It contains 0 when there is no direct NVLink between objects. When there are multiple links, their bandwidth is aggregated. Values on the diagonal are artificially set to a very high value so that local access always appears faster than remote access.

On POWER platforms, NVLinks may also connect GPUs to CPUs. On NVIDIA platforms such as the DGX-2, an NVSwitch may interconnect GPUs through NVLinks. In these cases, the distances structure is heterogeneous: GPUs always appear first in the matrix (as NVML OS devices such as "nvml0"), while non-GPU objects may appear at the end (a Package for POWER processors, a PCI device for the NVSwitch).

NVML OS devices may be converted into the corresponding CUDA, OpenCL or PCI devices using hwloc_get_obj_with_same_locality() or the hwloc-annotate tool.

hwloc_distances_transform() or hwloc-annotate may also be used to transform this matrix into something more convenient, for instance by removing switches or CPU ports, or by replacing bandwidths with numbers of links between peers.

When an NVSwitch interconnects GPUs, only the links between each GPU and the different NVSwitch ports are reported. They may be merged into a single switch port with hwloc_distances_transform() or hwloc-annotate, or a transitive closure may be applied to report the bandwidth between GPUs across the NVSwitch, as in the sketch below.
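
As an illustration, the following sketch retrieves the NVLinkBandwidth matrix, merges the NVSwitch ports and applies a transitive closure with hwloc_distances_transform(), then converts each GPU OS device into its PCI device with hwloc_get_obj_with_same_locality(). It only does something on machines where hwloc actually reports such a matrix, and error handling is mostly omitted:

  #include <stdio.h>
  #include <hwloc.h>
  #include <hwloc/distances.h>

  /* Merge NVSwitch ports, close the matrix transitively, and map GPUs to PCI devices. */
  static void print_nvlink_gpus(hwloc_topology_t topology)
  {
    struct hwloc_distances_s *dist;
    unsigned nr = 1, i;

    if (hwloc_distances_get_by_name(topology, "NVLinkBandwidth", &nr, &dist, 0) < 0
        || nr == 0)
      return;

    /* merge the per-port NVSwitch entries into a single switch port... */
    hwloc_distances_transform(topology, dist,
                              HWLOC_DISTANCES_TRANSFORM_MERGE_SWITCH_PORTS, NULL, 0);
    /* ...and report GPU-to-GPU bandwidths across the switch */
    hwloc_distances_transform(topology, dist,
                              HWLOC_DISTANCES_TRANSFORM_TRANSITIVE_CLOSURE, NULL, 0);

    for (i = 0; i < dist->nbobjs; i++) {
      hwloc_obj_t osdev = dist->objs[i], pci;
      if (osdev->type != HWLOC_OBJ_OS_DEVICE)
        continue; /* e.g. Package for POWER CPUs, PCI device for the NVSwitch */
      /* convert the NVML OS device (e.g. "nvml0") into the corresponding PCI device */
      pci = hwloc_get_obj_with_same_locality(topology, osdev,
                                             HWLOC_OBJ_PCI_DEVICE, NULL, NULL, 0);
      if (pci)
        printf("%s is PCI device %04x:%02x:%02x.%01x\n", osdev->name,
               pci->attr->pcidev.domain, pci->attr->pcidev.bus,
               pci->attr->pcidev.dev, pci->attr->pcidev.func);
    }
    hwloc_distances_release(topology, dist);
  }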

Users may also specify their own matrices between any set of objects, even if these objects are of different types (e.g. bandwidths between GPUs and CPUs).
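
A sketch of adding such a user-defined matrix is shown below; "MyBandwidth" and the bandwidth values are purely hypothetical, and the matrix here connects the first two packages:

  #include <hwloc.h>
  #include <hwloc/distances.h>

  /* Add a custom bandwidth matrix between the first two packages (hypothetical values). */
  static int add_package_bandwidths(hwloc_topology_t topology)
  {
    hwloc_obj_t objs[2];
    hwloc_uint64_t values[4] = {
      100000, 20000,   /* package 0 -> {0,1}, in MB/s (made-up values) */
       20000, 100000   /* package 1 -> {0,1}, in MB/s (made-up values) */
    };
    hwloc_distances_add_handle_t handle;

    objs[0] = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PACKAGE, 0);
    objs[1] = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PACKAGE, 1);
    if (!objs[0] || !objs[1])
      return -1;

    /* "MyBandwidth" is an arbitrary user-chosen name */
    handle = hwloc_distances_add_create(topology, "MyBandwidth",
                                        HWLOC_DISTANCES_KIND_FROM_USER
                                        | HWLOC_DISTANCES_KIND_MEANS_BANDWIDTH, 0);
    if (!handle)
      return -1;
    if (hwloc_distances_add_values(topology, handle, 2, objs, values, 0) < 0)
      return -1;
    return hwloc_distances_add_commit(topology, handle, 0);
  }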

The entire API is located in hwloc/distances.h. See also Retrieve distances between objects, as well as Helpers for consulting distance matrices and Add distances between objects.

Memory Attributes

Machines with heterogeneous memory, for instance high-bandwidth memory (HBM), normal memory (DDR), and/or high-capacity slow memory (such as non-volatile memory DIMMs, NVDIMMs), require applications to allocate buffers in the appropriate target memory depending on their performance and capacity needs. Those target nodes may be exposed in the hwloc hierarchy as different memory children, but performance information is needed to select the appropriate one.

hwloc memory attributes are designed to expose memory information such as latency, bandwidth, etc. Users may also specify their own attributes and values.

The memory attributes API is located in hwloc/memattrs.h; see Comparing memory node attributes for finding where to allocate on and Managing memory attributes for details. See also the example in doc/examples/memory-attributes.c in the source tree.
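
As a minimal sketch, assuming the platform actually reports bandwidth information, the NUMA node with the best bandwidth from a given CPU location may be queried like this:

  #include <stdio.h>
  #include <hwloc.h>
  #include <hwloc/memattrs.h>

  /* Return the NUMA node with the best bandwidth as seen from the given cpuset. */
  static hwloc_obj_t best_bandwidth_node(hwloc_topology_t topology, hwloc_cpuset_t cpuset)
  {
    struct hwloc_location initiator;
    hwloc_obj_t best = NULL;
    hwloc_uint64_t value;

    initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
    initiator.location.cpuset = cpuset;

    if (hwloc_memattr_get_best_target(topology, HWLOC_MEMATTR_ID_BANDWIDTH,
                                      &initiator, 0, &best, &value) < 0)
      return NULL; /* no bandwidth information available */

    printf("best NUMA node is logical #%u (bandwidth value %llu)\n",
           best->logical_index, (unsigned long long) value);
    return best;
  }

The nodeset of the returned node may then be used with memory binding functions such as hwloc_alloc_membind().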

Memory attributes are the low-level solution to selecting target memory. hwloc uses them internally to build Memory Tiers which provide an easy way to distinguish NUMA nodes of different kinds, as explained in Heterogeneous Memory.

CPU Kinds

Hybrid CPUs may contain different kinds of cores. The CPU kinds API in hwloc/cpukinds.h provides a way to list the sets of PUs in each kind and get some optional information about their hardware characteristics and efficiency.

If the operating system provides efficiency information (e.g. Windows 10, Mac OS X / Darwin, and some Linux kernels), it is used to rank hwloc CPU kinds by efficiency. Otherwise, hwloc implements several heuristics based on frequencies and core types (see HWLOC_CPUKINDS_RANKING in Environment Variables).

The ranking lists energy-efficient CPUs first and high-performance, power-hungry cores last.

These CPU kinds may be annotated with the following native attributes:

FrequencyMaxMHz (Linux)
The maximal operating frequency of the core, as reported by cpufreq drivers on Linux.
FrequencyBaseMHz (Linux)
The base/nominal operating frequency of the core, as reported by some cpufreq or ACPI drivers on Linux (e.g. cpufreq_cppc or intel_pstate).
CoreType (x86)
A string describing the kind of core, currently IntelAtom or IntelCore, as reported by the x86 CPUID instruction and Linux PMU on some Intel processors.
LinuxCapacity (Linux)
The Linux-specific CPU capacity found in sysfs, as reported by the Linux kernel on some recent platforms. Higher values usually mean that the Linux scheduler considers the core as high-performance rather than energy-efficient.
LinuxCPUType (Linux)
The Linux-specific CPU type found in sysfs, such as intel_atom_0, as reported by future Linux kernels on some Intel processors.
DarwinCompatible (Darwin / Mac OS X)
The compatibility attribute of the CPUs as found in the IO registry on Darwin / Mac OS X. For instance apple,icestorm;ARM,v8 for energy-efficient cores and apple,firestorm;ARM,v8 for performance cores on the Apple M1 CPU.
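
As a sketch, the kinds, their efficiency ranking, and the attributes above may be listed as follows (error handling is mostly omitted):

  #include <stdio.h>
  #include <stdlib.h>
  #include <hwloc.h>
  #include <hwloc/cpukinds.h>

  /* List CPU kinds with their efficiency ranking and native info attributes. */
  static void print_cpu_kinds(hwloc_topology_t topology)
  {
    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    int i, nr = hwloc_cpukinds_get_nr(topology, 0);
    unsigned j;

    for (i = 0; i < nr; i++) {
      int efficiency;
      unsigned nr_infos;
      struct hwloc_info_s *infos;
      char *str;

      if (hwloc_cpukinds_get_info(topology, i, cpuset, &efficiency,
                                  &nr_infos, &infos, 0) < 0)
        continue;
      hwloc_bitmap_asprintf(&str, cpuset);
      /* efficiency is hwloc's abstracted ranking, or -1 if unknown */
      printf("kind #%d: efficiency %d, PUs %s\n", i, efficiency, str);
      free(str);
      for (j = 0; j < nr_infos; j++)
        printf("  %s = %s\n", infos[j].name, infos[j].value);
    }
    hwloc_bitmap_free(cpuset);
  }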

See Kinds of CPU cores for details.