liamvdv

Add DataDog to Kamal


Adding DataDog to Kamal #

I just spent 20 hours setting up DataDog APM (Tracing), Logs and DogStatsD with Kamal Deploy, formerly known as MRSK. In this post I’ll share how you can do so in ~10 minutes.

DataDog is top-notch when it comes to adding observability to your systems. We’ll set up DataDog APM including tracing and profiling, System Metrics, Custom Metrics via DogStatsD and Logs.

BotBrains was recently accepted into AWS Activate, a program by AWS that helps startups get off the ground by giving them $2k in credits for 2 years and plenty of partner offers (90% off HubSpot!). This made us eligible for DataDog for Startups, giving us $100k in credits for 1 year on the DataDog platform.

The DataDog Agent runs in a Docker container and receives traces on :8126 and DogStatsD metrics on :8125. Setting up such networking is easy if you use docker-compose.yml, but painful in Kamal: defining custom Docker networks fails in all kinds of edge cases. Installing the DataDog Agent on the host means we cannot automate it via Kamal, and it still requires workarounds to let containers talk to the host. You must break the Kamal abstractions via options: and continuously inspect the deployment and Traefik restart logs to understand the exact docker commands being run. I tried all options, and the following works best.

The final config is actually easy to understand. Instead of using Docker networks and the network stack, we use Unix Domain Sockets (UDS), which are sockets with a filesystem name. DataDog Agent 7 supports them natively. We can now simply define volumes that are mounted into the Traefik, application and agent containers to set up monitoring.

service: api
image: ...

env:
  clear:
    # DataDog - ddtrace-run Python config
    DD_ENV: <%= ENV["STAGE"] %>
    DD_VERSION: <%= ENV["VERSION"] %>
    DD_SITE: "datadoghq.eu"
    DD_TRACE_AGENT_URL: "unix:///var/run/datadog/apm.socket"
    DD_DOGSTATSD_URL: "unix:///var/run/datadog/dsd.socket"
    # Optional
    DD_LOGS_INJECTION: "true"
    DD_RUNTIME_METRICS_ENABLED: "true"
    DD_TRACE_SAMPLE_RATE: "1"
    DD_PROFILING_ENABLED: "true"

# DataDog APM labels
# https://docs.datadoghq.com/tracing/guide/tutorial-enable-python-container-agent-host/
labels:
  com.datadoghq.tags.service: api
  com.datadoghq.tags.version: <%= ENV["VERSION"] %>
  com.datadoghq.tags.env: <%= ENV["STAGE"] %>

# DataDog Unix Socket
volumes:
  - "/var/run/datadog:/var/run/datadog"

accessories:
  monitoring:
    image: gcr.io/datadoghq/agent:7
    volumes:
      # Explanation on volumes:
      #   https://docs.datadoghq.com/containers/docker/log/?tab=dockercompose#installation
      # DataDog APM
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "/proc/:/host/proc/:ro"
      - "/sys/fs/cgroup/:/host/sys/fs/cgroup:ro"
      # DataDog Unix Socket Support
      #   client speaks StatsD only?
      #   https://docs.datadoghq.com/developers/dogstatsd/unix_socket/?tab=docker#socat-proxy
      #   socat -s -u UDP-RECV:8125 UNIX-SENDTO:/var/run/datadog/dsd.socket
      - "/var/run/datadog:/var/run/datadog"
      # DataDog Logs
      - "/var/lib/docker/containers:/var/lib/docker/containers:ro"
      - "/opt/api-monitoring/run:/opt/api-monitoring/run:rw" # NOTE(liamvdv): careful, {service}-monitoring will be kamal given container name.
    options:
      pid: host
      cgroupns: host
    env:
      clear:
        # DataDog APM
        DD_APM_ENABLED: "true"
        DD_APM_RECEIVER_SOCKET: "/var/run/datadog/apm.socket"
        DD_APM_NON_LOCAL_TRAFFIC: "true"

        # DataDog Logs
        DD_LOGS_ENABLED: "true"
        DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL: "true"
        DD_CONTAINER_EXCLUDE_LOGS: "image:datadog/agent:*"

        # DataDog DogStatsD Metrics
        DD_DOGSTATSD_SOCKET: "/var/run/datadog/dsd.socket"
        DD_DOGSTATSD_NON_LOCAL_TRAFFIC: "true"
        DD_DOGSTATSD_ORIGIN_DETECTION: "true"
        DD_PROCESS_AGENT_ENABLED: "true" # https://docs.datadoghq.com/containers/docker/?tab=standard#optional-collection-agents

        # DataDog General Config
        DD_SITE: "datadoghq.eu"
      secret:
        - DD_API_KEY
        - DD_ENV
    # Install as a sidecar on hosts in these roles
    roles:
      - web

# Configure custom arguments for Traefik
traefik:
  # We need v2.10+ to enable localAgentSocket
  # https://github.com/traefik/traefik/pull/9714
  image: botbrains/traefiksocat:v2.10
  env:
    DD_DOGSTATSD_SOCKET: "/var/run/datadog/dsd.socket"
  args:
    accesslog: true
    log.level: "ERROR"
    # https://github.com/traefik/traefik/pull/9714/files
    tracing.datadog.localAgentSocket: "/var/run/datadog/apm.socket"
    # Lacking support for localAgentSocket for metrics, tracking with
    # https://github.com/traefik/traefik/issues/10184
    metrics.datadog.address: "127.0.0.1:8125"
    metrics.datadog.addRoutersLabels: "true"
    metrics.datadog.prefix: traefik
  options:
    volume:
      - "/var/run/datadog:/var/run/datadog"

Note that you need to keep DD_SITE, DD_ENV and DD_VERSION correctly set for your use case. I’m based in the EU, so I need to have my data here for compliance reasons. If you’re using datadoghq.com, you can simply not set DD_SITE, as it defaults to the .com domain.

To enable monitoring on other Docker containers, make sure you mount "/var/run/datadog:/var/run/datadog" and configure the other applications to talk to the UDS. Depending on the DataDog client library, that means setting DD_TRACE_AGENT_URL and DD_DOGSTATSD_URL as in the config above, or DD_APM_RECEIVER_SOCKET and DD_DOGSTATSD_SOCKET.
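To make the UDS path less of a black box, here is a minimal Python sketch of what a DogStatsD client does under the hood: it formats a plain-text datagram and sends it to a Unix datagram socket. The helper names (format_metric, send_metric, demo) and the metric api.signup are made up for illustration, and the demo binds its own throwaway socket in a temp directory to stand in for the agent. In a real application you would normally let the DataDog client library do this for you.

```python
import os
import socket
import tempfile
from typing import Dict, Optional


def format_metric(name: str, value: int, metric_type: str = "c",
                  tags: Optional[Dict[str, str]] = None) -> bytes:
    """Build a DogStatsD datagram: <name>:<value>|<type>[|#key:value,...]."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return payload.encode("utf-8")


def send_metric(socket_path: str, datagram: bytes) -> None:
    """Send one datagram to a Unix datagram socket, e.g. the agent's dsd.socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(datagram, socket_path)


def demo() -> bytes:
    """Bind a stand-in 'agent' socket so the sketch runs without DataDog."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dsd.socket")
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as fake_agent:
            fake_agent.bind(path)
            fake_agent.settimeout(2)
            send_metric(path, format_metric("api.signup", 1, "c", {"env": "staging"}))
            return fake_agent.recv(1024)


if __name__ == "__main__":
    print(demo())
```

Inside a container that mounts /var/run/datadog, the same send_metric call would target "/var/run/datadog/dsd.socket" instead of the temp path.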

Why the custom traefik image? #

There are two reasons for a custom Docker image:

  1. Kamal uses traefik:v2.9 by default, but support for --tracing.datadog.localAgentSocket was only released in v2.10, so we need to use traefik:v2.10.

  2. Traefik v2.10 does not yet support --metrics.datadog.localAgentSocket, only --metrics.datadog.address 127.0.0.1:8125, to emit DogStatsD metrics. To work around this, traefiksocat runs socat as a background process that forwards all traffic arriving on :8125 to the Unix socket given by DD_DOGSTATSD_SOCKET, by default /var/run/datadog/dsd.socket. It’s the official workaround by DataDog.

    The issue is tracked here https://github.com/traefik/traefik/issues/10184 and the traefiksocat code can be found here https://github.com/liamvdv/traefiksocat.
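To illustrate what the socat proxy does conceptually, here is a rough Python sketch that receives one UDP datagram and resends it unchanged to a Unix datagram socket. This is not the traefiksocat code, just the same idea in a few lines. The function names are hypothetical, and the demo uses an ephemeral UDP port and a temp-directory socket instead of :8125 and /var/run/datadog/dsd.socket so it can run anywhere.

```python
import os
import socket
import tempfile


def forward_once(udp_sock: socket.socket, uds_path: str, bufsize: int = 8192) -> int:
    """Receive one UDP datagram and resend it unchanged to a Unix datagram socket."""
    data, _addr = udp_sock.recvfrom(bufsize)
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as out:
        out.sendto(data, uds_path)
    return len(data)


def demo() -> bytes:
    """Wire up a fake sender, the forwarder and a fake agent, then pass one metric through."""
    with tempfile.TemporaryDirectory() as tmp:
        uds_path = os.path.join(tmp, "dsd.socket")
        fake_agent = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        udp_in = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            fake_agent.bind(uds_path)
            fake_agent.settimeout(2)
            udp_in.bind(("127.0.0.1", 0))  # ephemeral port; the real proxy listens on 8125
            udp_in.settimeout(2)
            # Plays the role of Traefik emitting a DogStatsD counter over UDP.
            sender.sendto(b"traefik.router.request.total:1|c", udp_in.getsockname())
            forward_once(udp_in, uds_path)
            return fake_agent.recv(8192)
        finally:
            for s in (fake_agent, udp_in, sender):
                s.close()


if __name__ == "__main__":
    print(demo())
```

A real proxy would call forward_once in a loop. socat does that, and handles errors, for you, which is why the image simply ships it.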

Please let me know if you would be interested in learning how to set up the DataDog client library ddtrace in a Gunicorn + FastAPI Python web application for function-level tracing and profiling.

How to deploy? #

First deploy the agent, then reboot the newly configured Traefik.

bundle exec kamal accessory boot monitoring # add -d destination if you have stages
bundle exec kamal traefik reboot # add -d destination if you have stages

Finally, add the client library to your application containers if you wish and redeploy them. Make sure that the environment variables are updated (kamal envify).

You should see metrics in the DataDog Metrics Explorer for the host and the containers, as well as Traefik-specific metrics, e.g. traefik.router.request.total.
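If nothing shows up, one cheap thing to rule out is that the socket files are missing from the mounted directory. The following Python sketch checks whether apm.socket and dsd.socket exist and really are Unix sockets. The helper names are hypothetical, and it assumes the socket file names from the config above.

```python
import os
import socket
import stat
import tempfile
from typing import Dict, Iterable


def is_unix_socket(path: str) -> bool:
    """Return True only if path exists and is a Unix socket file."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False


def check_sockets(directory: str,
                  names: Iterable[str] = ("apm.socket", "dsd.socket")) -> Dict[str, bool]:
    """Map each expected socket file name to whether it is present in directory."""
    return {name: is_unix_socket(os.path.join(directory, name)) for name in names}


def demo() -> Dict[str, bool]:
    """Create only dsd.socket in a temp dir to show a partially working setup."""
    with tempfile.TemporaryDirectory() as tmp:
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
            s.bind(os.path.join(tmp, "dsd.socket"))
            return check_sockets(tmp)


if __name__ == "__main__":
    print(check_sockets("/var/run/datadog"))
```

Run from inside the application or Traefik container, a False entry suggests that the volume is not mounted or that the agent has not created that socket yet.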

Now, Go Build!