Adding DataDog to Kamal #
I just spent 20 hours setting up DataDog APM (Tracing), Logs and DogStatsD with Kamal Deploy, formerly known as mrsk. In this post I’ll share how you can do so in ~10 minutes.
DataDog is top-notch when it comes to adding observability to your systems. We’ll set up DataDog APM including tracing and profiling, System Metrics, Custom Metrics via DogStatsD and Logs.
BotBrains was recently accepted into AWS Activate, a program by AWS that helps startups get off the ground by giving them $2k in credits for 2 years and plenty of partner offers (90% off HubSpot!). This made us eligible for DataDog for Startups, giving us $100k in credits for 1 year on the DataDog platform.
The DataDog Agent runs in a Docker container and receives traces on :8126 and DogStatsD metrics on :8125. Setting up such networking is easy if you use docker-compose.yml, but painful in Kamal: defining custom Docker networks fails in all kinds of edge cases. Installing the DataDog Agent on the host means we cannot automate it via Kamal, and it still requires workarounds to let containers talk to the host. Either way, you must break the Kamal abstractions via options: and continuously inspect the deployment and Traefik restart logs to understand the exact docker commands. I tried all of these approaches, and the following works best.
The final config is actually easy to understand. Instead of using Docker networks and the network stack, we use Unix Domain Sockets (UDS), which are sockets with a filesystem name. The DataDog Agent 7 natively supports UDS. We can now simply define volumes that are mounted into the Traefik, application and agent containers to set up monitoring.
service: api
image: ...
env:
  clear:
    # DataDog - ddtrace-run Python config
    DD_ENV: <%= ENV["STAGE"] %>
    DD_VERSION: <%= ENV["VERSION"] %>
    DD_SITE: "datadoghq.eu"
    DD_TRACE_AGENT_URL: "unix:///var/run/datadog/apm.socket"
    DD_DOGSTATSD_URL: "unix:///var/run/datadog/dsd.socket"
    # Optional
    DD_LOGS_INJECTION: "true"
    DD_RUNTIME_METRICS_ENABLED: "true"
    DD_TRACE_SAMPLE_RATE: "1"
    DD_PROFILING_ENABLED: "true"
# DataDog APM labels
# https://docs.datadoghq.com/tracing/guide/tutorial-enable-python-container-agent-host/
labels:
  com.datadoghq.tags.service: api
  com.datadoghq.tags.version: <%= ENV["VERSION"] %>
  com.datadoghq.tags.env: <%= ENV["STAGE"] %>
# DataDog Unix Socket
volumes:
  - "/var/run/datadog:/var/run/datadog"
accessories:
  monitoring:
    image: gcr.io/datadoghq/agent:7
    volumes:
      # Explanation on volumes:
      # https://docs.datadoghq.com/containers/docker/log/?tab=dockercompose#installation
      # DataDog APM
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "/proc/:/host/proc/:ro"
      - "/sys/fs/cgroup/:/host/sys/fs/cgroup:ro"
      # DataDog Unix Socket Support
      # client speaks StatsD only?
      # https://docs.datadoghq.com/developers/dogstatsd/unix_socket/?tab=docker#socat-proxy
      # socat -s -u UDP-RECV:8125 UNIX-SENDTO:/var/run/datadog/dsd.socket
      - "/var/run/datadog:/var/run/datadog"
      # DataDog Logs
      - "/var/lib/docker/containers:/var/lib/docker/containers:ro"
      - "/opt/api-monitoring/run:/opt/api-monitoring/run:rw" # NOTE(liamvdv): careful, {service}-monitoring will be kamal given container name.
    options:
      pid: host
      cgroupns: host
    env:
      clear:
        # DataDog APM
        DD_APM_ENABLED: "true"
        DD_APM_RECEIVER_SOCKET: "/var/run/datadog/apm.socket"
        DD_APM_NON_LOCAL_TRAFFIC: "true"
        # DataDog Logs
        DD_LOGS_ENABLED: "true"
        DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL: "true"
        DD_CONTAINER_EXCLUDE_LOGS: "image:datadog/agent:*"
        # DataDog DogStatsD Metrics
        DD_DOGSTATSD_SOCKET: "/var/run/datadog/dsd.socket"
        DD_DOGSTATSD_NON_LOCAL_TRAFFIC: "true"
        DD_DOGSTATSD_ORIGIN_DETECTION: "true"
        DD_PROCESS_AGENT_ENABLED: "true" # https://docs.datadoghq.com/containers/docker/?tab=standard#optional-collection-agents
        # DataDog General Config
        DD_SITE: "datadoghq.eu"
      secret:
        - DD_API_KEY
        - DD_ENV
    # install as sidecar on hosts in
    roles:
      - web
# Configure custom arguments for Traefik
traefik:
  # We need v2.10+ to enable localAgentSocket
  # https://github.com/traefik/traefik/pull/9714
  image: botbrains/traefiksocat:v2.10
  env:
    DD_DOGSTATSD_SOCKET: "/var/run/datadog/dsd.socket"
  args:
    accesslog: true
    log.level: "ERROR"
    # https://github.com/traefik/traefik/pull/9714/files
    tracing.datadog.localAgentSocket: "/var/run/datadog/apm.socket"
    # Lacking support for localAgentSocket for metrics, tracking with
    # https://github.com/traefik/traefik/issues/10184
    metrics.datadog.address: "127.0.0.1:8125"
    metrics.datadog.addRoutersLabels: "true"
    metrics.datadog.prefix: traefik
  options:
    volume:
      - "/var/run/datadog:/var/run/datadog"
Note that you need to keep DD_SITE, DD_ENV and DD_VERSION correctly set for your use case. I’m based in the EU, so I need to keep my data here for compliance reasons. If you’re using datadoghq.com, you can simply leave DD_SITE unset, as it defaults to the .com domain.
To enable monitoring on other Docker containers, make sure you mount "/var/run/datadog:/var/run/datadog" and configure the other applications to talk to the sockets. For the ddtrace client in the config above, that means pointing DD_TRACE_AGENT_URL and DD_DOGSTATSD_URL at the socket paths the agent exposes via DD_APM_RECEIVER_SOCKET and DD_DOGSTATSD_SOCKET.
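To make the socket approach concrete, here is a minimal Python sketch of what a client does when it sends a DogStatsD counter over UDS. It uses only the standard library, and it assumes the agent listens on a Unix datagram socket at the path from the config above. The function names and the metric name example.requests are made up for illustration; in a real application the DataDog client library does this for you.

```python
import os
import socket
from urllib.parse import urlparse


def dogstatsd_socket_path(default="/var/run/datadog/dsd.socket"):
    """Return the socket path from DD_DOGSTATSD_URL (unix:///path), else the default."""
    parsed = urlparse(os.environ.get("DD_DOGSTATSD_URL", ""))
    if parsed.scheme == "unix" and parsed.path:
        return parsed.path
    return default


def format_counter(name, value=1, tags=()):
    """Build a DogStatsD counter datagram: <name>:<value>|c|#<tag1>,<tag2>"""
    payload = f"{name}:{value}|c"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload.encode("utf-8")


def send_counter(name, value=1, tags=(), path=None):
    """Send one counter to the agent over a Unix datagram socket."""
    path = path or dogstatsd_socket_path()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value, tags), path)
```

Calling send_counter("example.requests", tags=("env:dev",)) from inside a container only works if /var/run/datadog is mounted into it, which is why the volume appears on every container in the config.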
Why the custom traefik image? #
There are 2 reasons for a custom Docker image:
1. Kamal by default uses traefik:v2.9, but support for --tracing.datadog.localAgentSocket was only released in v2.10. So we need to use traefik:v2.10.
2. Traefik v2.10 does not yet support --metrics.datadog.localAgentSocket but only --metrics.datadog.address 127.0.0.1:8125 to emit DogStatsD metrics. To combat this, traefiksocat runs a background process called socat that proxies all traffic sent to :8125 to the UNIX socket specified in DD_DOGSTATSD_SOCKET, by default /var/run/datadog/dsd.socket. It’s the official workaround by DataDog.
The issue is tracked here https://github.com/traefik/traefik/issues/10184 and the traefiksocat code can be found here https://github.com/liamvdv/traefiksocat.
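If you are curious what the socat proxy does, the following Python sketch mirrors the idea of socat -s -u UDP-RECV:8125 UNIX-SENDTO:/var/run/datadog/dsd.socket: read datagrams from a UDP port and forward them, one way, to the agent's Unix socket. This is an illustration only, not the code inside traefiksocat. The function names and the max_datagrams parameter are hypothetical, and the error handling is only a rough stand-in for socat's -s flag.

```python
import socket


def listen_udp(host="127.0.0.1", port=8125):
    """Bind the UDP socket that Traefik sends DogStatsD metrics to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def forward_udp_to_unix(udp_sock, unix_path, max_datagrams=None):
    """Forward datagrams from udp_sock to the Unix datagram socket at unix_path.

    max_datagrams is a hypothetical knob so the loop can stop;
    None means forward forever, like a long-running proxy process.
    Returns the number of datagrams forwarded.
    """
    forwarded = 0
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as out:
        while max_datagrams is None or forwarded < max_datagrams:
            data, _addr = udp_sock.recvfrom(65535)
            try:
                out.sendto(data, unix_path)
            except OSError:
                # Keep running if the agent socket is briefly unavailable;
                # the datagram is dropped, which is acceptable for metrics.
                continue
            forwarded += 1
    return forwarded
```

Running forward_udp_to_unix(listen_udp(), "/var/run/datadog/dsd.socket") would block and proxy indefinitely. In practice socat is the better tool for this job; the sketch is only meant to show that nothing more than a datagram relay sits between Traefik and the agent.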
Please let me know if you would be interested in learning how we set up the DataDog client library ddtrace in a Gunicorn + FastAPI Python web application for function-level tracing and profiling.
How to deploy? #
First deploy the agent, then reboot the newly configured Traefik.
bundle exec kamal accessory boot monitoring # add -d destination if you have stages
bundle exec kamal traefik reboot # add -d destination if you have stages
Finally, add the client library to your application containers if you wish and redeploy them. Make sure that the environment variables are updated (kamal envify).
You should see metrics in the DataDog Metrics Explorer for the host and the containers as well as the Traefik-specific metrics, e.g. traefik.router.request.total.
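If nothing shows up, a first thing to check is whether the agent actually created its sockets in the shared directory. The sketch below is one way to do that from Python inside any container that mounts /var/run/datadog. The helper name missing_sockets is made up, and the default paths are the ones from the config in this post.

```python
import os
import stat

# Socket paths from the config above; adjust if you changed them.
DEFAULT_SOCKETS = (
    "/var/run/datadog/apm.socket",
    "/var/run/datadog/dsd.socket",
)


def missing_sockets(paths=DEFAULT_SOCKETS):
    """Return the paths that do not exist or are not Unix sockets."""
    missing = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Path does not exist or is not readable from this container.
            missing.append(path)
            continue
        if not stat.S_ISSOCK(mode):
            missing.append(path)
    return missing
```

An empty list means both sockets are present. A non-empty list usually points to a missing volume mount or an agent that has not booted yet.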
Now, Go Build!