Skip to main content

139. Host-Specific Docker Configuration

Status: Accepted Date: 2025-07-06

Context

Our server fleet includes hosts with different hardware capabilities. The hecate server is equipped with a powerful NVIDIA GPU for machine learning tasks, while other servers are standard CPUs. To leverage the GPU within Docker containers on hecate, the Docker daemon must be configured to use the NVIDIA Container Toolkit. Other hosts do not have this hardware and do not need this configuration.

We need a way to apply the correct Docker configuration to the correct host.

Decision

The 08_docker Ansible role will use host-specific configuration files for the Docker daemon (daemon.json).

The role will have two template files for daemon.json:

  1. daemon.json.gpu.j2: This template includes the necessary configuration to set the default runtime to nvidia, enabling GPU access for containers.
  2. daemon.json.standard.j2: A standard configuration for all other hosts without GPUs.

The Ansible task will use a when condition based on the inventory_hostname to select the correct template. If the hostname is hecate, it will use the gpu template; for all other hosts, it will use the standard template.

This allows us to tailor the Docker environment to the specific hardware capabilities of each host while keeping the logic contained and explicit within a single role.

Consequences

Positive:

  • Hardware Optimization: Ensures that we can fully leverage the expensive GPU hardware on hecate for our containerized machine learning workloads.
  • Clean Separation: Keeps the standard Docker configuration clean and free from unnecessary or non-functional settings on hosts that don't have a GPU.
  • Explicit Configuration: The use of separate, clearly named template files makes the different configurations very explicit and easy to understand and modify.

Negative:

  • Requires Host-Specific Logic: Introduces host-specific conditional logic into the Ansible role. This makes the role slightly more complex than a purely generic one.

Mitigation:

  • Necessary Complexity: This is essential complexity. The hardware is different, so the configuration must be different. The when condition is the standard and correct way to handle this in Ansible.
  • Contained Logic: All the host-specific logic is contained within this single, dedicated 08_docker role. It doesn't leak into other parts of the automation. The logic is simple and easy to understand (if host == hecate, use gpu_config, else use standard_config).