Observability: Tracing, Logging, and Knowing What Broke

The relationship between tracing and log in Rust is like the relationship between a well-organized crime scene and someone's diary. Both record events. One is structured, contextual, and queryable. The other is a sequence of strings.

Use tracing. It's the standard for async Rust services. The log crate still exists and still works, but tracing is what you want for production services — it supports structured key-value fields, spans that capture context across async await points, and a subscriber system that can route events to multiple backends simultaneously.

Setup

[dependencies]
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = [
    "env-filter",
    "json",
    "fmt",
] }
Then initialize the subscriber once at startup, before anything logs:

use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

pub fn init_tracing() {
    tracing_subscriber::registry()
        .with(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .with(
            tracing_subscriber::fmt::layer()
                .with_target(true)
                .with_thread_ids(false),
        )
        .init();
}

EnvFilter::try_from_default_env() reads the RUST_LOG environment variable. Set it to control verbosity:

RUST_LOG=info                          # All crates at info level
RUST_LOG=myapp=debug,sqlx=warn         # Fine-grained control
RUST_LOG=myapp::handlers=trace         # Module-level control

The Difference Between Events and Spans

An event is a point-in-time occurrence. A span is a period of time with a start and end, representing a unit of work.

// Event: records something that happened
tracing::info!("Request received");
tracing::warn!(user_id = %id, "User not found");
tracing::error!(error = %e, "Database connection failed");

// Span: wraps a unit of work
async fn handle_request(user_id: Uuid) {
    let span = tracing::info_span!("handle_request", %user_id);
    let _guard = span.enter(); // fine in sync code; see the async caveat below

    // Events inside this span are associated with it
    tracing::info!("Processing request");
    do_work().await;
    tracing::info!("Request complete");
}

The key insight: in async code, a span must be attached to a Future, not entered with .enter(). If you hold an .enter() guard across an .await point, the span stays entered on the current thread while the future is suspended, so unrelated tasks polled on that thread are attributed to it, and the guard may be dropped on a different thread than the one that entered it. The recorded scope becomes meaningless.
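To make the failure mode concrete, here is a minimal sketch of the anti-pattern; do_work is a stand-in for any async call:

```rust
// ANTI-PATTERN: holding a span guard across an .await point.
async fn do_work() { /* stand-in for any async call */ }

async fn handle_request_buggy() {
    let span = tracing::info_span!("handle_request");
    let _guard = span.enter(); // guard is still alive at the .await below

    // While this future is suspended here, the executor polls other
    // tasks on this thread, but the span is still "entered" there, so
    // their events are misattributed to this request.
    do_work().await;
} // guard dropped here, possibly on a different thread
```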

Use .instrument():

use tracing::Instrument;

async fn process_order(order_id: Uuid, user_id: Uuid) -> Result<(), AppError> {
    async move {
        tracing::info!("Starting order processing");

        let items = fetch_order_items(order_id).await?;
        tracing::info!(item_count = items.len(), "Fetched order items");

        charge_payment(user_id, calculate_total(&items)).await?;
        tracing::info!("Payment charged");

        Ok(())
    }
    .instrument(tracing::info_span!("process_order", %order_id, %user_id))
    .await
}

Everything inside the instrumented future runs in the context of the span. When the future yields at .await points, the span is correctly suspended and resumed with the future. Nested calls automatically see the parent span in their context.
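For whole functions, the #[tracing::instrument] attribute macro generates this span-and-instrument wiring for you. A sketch (store_payload is a hypothetical function, not from earlier chapters):

```rust
use uuid::Uuid;

// #[tracing::instrument] creates a span named after the function, with
// its arguments recorded as fields, and attaches it to the returned
// future. `skip(payload)` keeps the raw bytes out of the span; `err`
// records the error as an event if the function returns Err.
#[tracing::instrument(skip(payload), fields(payload_len = payload.len()), err)]
async fn store_payload(order_id: Uuid, payload: Vec<u8>) -> Result<(), std::io::Error> {
    tracing::info!("Storing payload");
    Ok(())
}
```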

Structured Fields

This is the whole point. Unstructured:

tracing::info!("User {} logged in from {}", user_id, ip_address);

Structured:

tracing::info!(
    user_id = %user_id,
    ip = %ip_address,
    "User logged in"
);

The % sigil uses Display. Use ? for Debug. Fields without a sigil use the value directly (for types that implement tracing::Value).

In structured form, user_id and ip are first-class fields, not parts of a string. Your log aggregation system can filter by them, count them, and alert on them. You cannot meaningfully query unstructured log strings at scale.

tracing::error!(
    error = %e,
    error.details = ?e,     // Debug representation for the full error
    user_id = %auth.user_id,
    request_id = %req_id,
    "Payment failed"
);

Axum Integration with tower-http's TraceLayer

You added TraceLayer in the HTTP chapter. Here's what it's doing and how to customize it:

use tower_http::trace::{DefaultMakeSpan, DefaultOnResponse, TraceLayer};
use tracing::Level;

let trace_layer = TraceLayer::new_for_http()
    .make_span_with(
        DefaultMakeSpan::new()
            .level(Level::INFO)
            .include_headers(false), // Don't log all headers by default
    )
    .on_response(
        DefaultOnResponse::new()
            .level(Level::INFO)
            .latency_unit(tower_http::LatencyUnit::Millis),
    );

Or write custom span creation:

use axum::extract::Request;
use tower_http::trace::MakeSpan;

#[derive(Clone)]
struct CustomMakeSpan;

impl MakeSpan<axum::body::Body> for CustomMakeSpan {
    fn make_span(&mut self, request: &Request<axum::body::Body>) -> tracing::Span {
        let request_id = request
            .headers()
            .get("x-request-id")
            .and_then(|v| v.to_str().ok())
            .unwrap_or("unknown");

        tracing::info_span!(
            "request",
            method = %request.method(),
            uri = %request.uri(),
            request_id = %request_id,
        )
    }
}
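To use it, hand an instance to make_span_with in place of the default. The router wiring below is an illustrative sketch; handler is assumed to exist:

```rust
use axum::{routing::get, Router};
use tower_http::trace::TraceLayer;

// Any type implementing MakeSpan can replace the default span factory.
let app: Router = Router::new()
    .route("/", get(handler))
    .layer(TraceLayer::new_for_http().make_span_with(CustomMakeSpan));
```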

JSON Logging for Production

In development, human-readable output is fine. In production, you want JSON so your log aggregator (Datadog, CloudWatch, Loki, etc.) can parse it:

use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

pub fn init_tracing(json: bool) {
    let registry = tracing_subscriber::registry()
        .with(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info")),
        );

    if json {
        registry
            .with(tracing_subscriber::fmt::layer().json())
            .init();
    } else {
        registry
            .with(tracing_subscriber::fmt::layer().pretty())
            .init();
    }
}

// In main:
let json_logs = std::env::var("APP_ENV")
    .map(|e| e == "production")
    .unwrap_or(false);
init_tracing(json_logs);

Each log line becomes a JSON object:

{
  "timestamp": "2024-01-15T10:23:45.123456Z",
  "level": "INFO",
  "target": "myapp::handlers::users",
  "span": {"name": "request", "method": "POST", "uri": "/api/v1/users"},
  "fields": {
    "message": "User created",
    "user_id": "550e8400-e29b-41d4-a716-446655440000"
  }
}

Distributed Tracing with OpenTelemetry

For systems with multiple services, you want traces to span service boundaries. OpenTelemetry is the standard:

[dependencies]
opentelemetry = { version = "0.22", features = ["trace"] }
opentelemetry-otlp = { version = "0.15", features = ["tonic"] }
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio", "trace"] }
tracing-opentelemetry = "0.23"
Then build the OTLP pipeline and add it to the subscriber stack as a layer:

use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::runtime;
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

pub fn init_tracing_with_otel(service_name: &str, otlp_endpoint: &str) {
    // Initialize OTLP exporter (sends to Jaeger, Tempo, etc.)
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .with_trace_config(
            opentelemetry_sdk::trace::Config::default()
                .with_resource(opentelemetry_sdk::Resource::new(vec![
                    opentelemetry::KeyValue::new("service.name", service_name.to_string()),
                ])),
        )
        .install_batch(runtime::Tokio)
        .expect("Failed to install OpenTelemetry tracer");

    tracing_subscriber::registry()
        .with(EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")))
        .with(tracing_subscriber::fmt::layer().json())
        .with(OpenTelemetryLayer::new(tracer))
        .init();
}

// Graceful shutdown: flush pending spans
pub fn shutdown_tracing() {
    global::shutdown_tracer_provider();
}

With this setup, your tracing spans are automatically exported to your observability backend. Jaeger, Grafana Tempo, Honeycomb, Datadog — they all support the OTLP protocol.
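In main, initialize at startup and flush on the way out. A sketch: the service name and endpoint are placeholders (4317 is the default OTLP gRPC port), and run_server stands in for your application:

```rust
async fn run_server() { /* your application here */ }

#[tokio::main]
async fn main() {
    init_tracing_with_otel("myapp", "http://localhost:4317");

    run_server().await;

    // Spans are exported in batches; flush whatever is still buffered
    // before the process exits, or the last spans are lost.
    shutdown_tracing();
}
```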

Request IDs

Every request should get a unique ID that's logged with every event during that request's lifetime:

use axum::{routing::get, Router};
use tower_http::request_id::{MakeRequestUuid, PropagateRequestIdLayer, SetRequestIdLayer};
use tower_http::trace::TraceLayer;

// In your router:
let app = Router::new()
    .route("/", get(handler))
    .layer(
        tower::ServiceBuilder::new()
            // Set X-Request-Id header on incoming requests that don't have one
            .layer(SetRequestIdLayer::x_request_id(MakeRequestUuid))
            // Propagate X-Request-Id to outgoing responses
            .layer(PropagateRequestIdLayer::x_request_id())
            // TraceLayer should come after request ID is set
            .layer(TraceLayer::new_for_http().make_span_with(CustomMakeSpan)),
    );

Then in your custom span maker, read the request ID and include it in the span. Every log event within the request's span will include the request ID. When a user reports a problem, they give you their request ID and you can find all log events for that request immediately.

Practical Log Levels

Level   Use for
ERROR   Things that are broken and need immediate attention
WARN    Things that are wrong but the system handled them
INFO    Normal operational events (request received, user created, job completed)
DEBUG   Diagnostic information useful during development
TRACE   Very detailed debugging (query parameters, response bodies, every step)

In production, run at INFO. In development, DEBUG is usually enough. TRACE will give you more information than you wanted.

Do not use ERROR for "user not found" or "invalid input" — those are expected conditions that the application handles. ERROR means something is broken: a dependency is unreachable, a supposed invariant was violated, a budget was exceeded. If your error rate dashboard is full of 404s, you can't see the real errors.
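One way to keep that discipline in a single place is a small mapping function. A sketch, assuming a hypothetical AppError enum with variants like these:

```rust
// Hypothetical application error type for illustration.
#[derive(Debug)]
enum AppError {
    NotFound,
    InvalidInput(String),
    Database(String),
}

// Route expected failures to WARN and reserve ERROR for real faults,
// so the error dashboard shows only things that need attention.
fn log_app_error(err: &AppError) {
    match err {
        AppError::NotFound | AppError::InvalidInput(_) => {
            // Expected conditions: the client did something wrong.
            tracing::warn!(error = ?err, "Request failed (client error)");
        }
        AppError::Database(_) => {
            // Unexpected: a dependency or invariant is broken.
            tracing::error!(error = ?err, "Request failed (server error)");
        }
    }
}
```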

What You Actually Need in Production

At minimum:

  1. JSON structured logging with level, timestamp, service name
  2. Request ID on every request, propagated in spans
  3. Latency, status code, and method/path logged per request
  4. ERROR-level events for unexpected failures

Nice to have:

  5. OpenTelemetry export to a trace backend
  6. Correlation IDs propagated to downstream service calls
  7. Structured error context (error type, message, affected resource)

You can start with just 1-4 and add the rest as your system grows. The important thing is that when production is broken at midnight, you can look at your logs and understand what happened. Unstructured logs make that harder than it needs to be.