
Observability

Klaxon's telemetry is unified under OpenTelemetry. Every component (server, auth, worker, web, mobile) emits traces + metrics + logs through OTLP into an in-cluster OTel Collector, which forwards to OneUptime. There's exactly one OneUptime credential in the system — the Collector holds it.

Architecture

┌─ browser (@klaxon/web) ──────────────────┐
│  @opentelemetry/sdk-trace-web            │
│  @opentelemetry/sdk-logs                 │   OTLP/HTTP + traceparent header
│  instrumentation-fetch/document-load     ├───────────────┐
└──────────────────────────────────────────┘               │
┌─ mobile (klaxon-mobile / Expo) ──────────┐               │
│  @opentelemetry/sdk-trace-base           │   OTLP/HTTP   │
│  manual fetch wrapper + screen spans     ├───────────────┤
└──────────────────────────────────────────┘               ▼
                                            ┌─────────────────────────────┐
┌─ klaxon-server ──────────────────────────┐│  OTel Collector (Deployment)│
│  klaxon-telemetry crate                  ││  OTLP gRPC :4317  ◄────────┤ Rust
│  traces + logs via OTLP/gRPC             ├┤  OTLP HTTP :4318  ◄────────┤ browser/mobile
└──────────────────────────────────────────┘│  prometheus receiver ──────┤ /metrics scrape
┌─ klaxon-auth ────────────────────────────┐│  processors: batch,        │
│  same klaxon-telemetry init              ├┤  memory_limiter, resource  │
└──────────────────────────────────────────┘│                            │
┌─ klaxon-server --worker ─────────────────┐│  exporter: otlphttp        │
│  + worker-specific metrics               ├┤  → OneUptime               │
└──────────────────────────────────────────┘└──────────┬─────────────────┘

                                              OneUptime OTLP ingestor

Signal paths:

  • Traces — each binary/app pushes OTLP to the Collector. Browser + mobile inject W3C traceparent on every fetch, so a single trace spans browser → klaxon-auth → klaxon-server when a user clicks through the UI.
  • Metrics — HTTP + worker metrics stay on the existing metrics crate + Prometheus /metrics endpoint. The Collector's prometheus receiver scrapes those and re-exports them as OTLP, avoiding the need for a bridge between the Rust metrics crate and the OTel SDK.
  • Logs — Rust uses opentelemetry-appender-tracing so every tracing::info! / warn! / error! event becomes an OTLP log record stamped with the active trace_id and span_id. Browser + mobile do the same via @opentelemetry/api-logs.
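The traceparent header the browser and mobile clients inject follows the W3C Trace Context wire format. A minimal standalone sketch (the real apps let the OpenTelemetry SDK build this via propagation.inject; this version only illustrates what ends up on the wire):

```typescript
import { randomBytes } from "node:crypto";

// W3C Trace Context traceparent: version-traceid-spanid-flags.
// Illustrative only; the actual clients never build this by hand.
function makeTraceparent(): string {
  const traceId = randomBytes(16).toString("hex"); // 32 hex chars
  const spanId = randomBytes(8).toString("hex");   // 16 hex chars
  return `00-${traceId}-${spanId}-01`;             // version 00, sampled flag 01
}
```

A receiving service continues the trace by parsing the trace and span IDs back out of this header, which is how a single trace spans browser → klaxon-auth → klaxon-server.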

Configuration

Rust (klaxon-server + klaxon-auth + worker)

  • OTEL_ENABLED (default: false): when false, only JSON stdout logging runs.
  • OTEL_EXPORTER_OTLP_ENDPOINT (default: http://localhost:4317): Collector gRPC endpoint.
  • OTEL_SERVICE_NAME (default: klaxon-server): service.name resource attribute.
  • DEPLOYMENT_ENVIRONMENT (default: development): deployment.environment resource attribute.
  • K8S_POD_NAME, K8S_NAMESPACE_NAME (default: unset): populated from the k8s downward API when deployed via Helm.

The Helm chart wires all of these automatically from values.yaml::otelCollector.enabled — operators don't set the env vars individually.

Web (@klaxon/web)

Vite env vars, baked at build time:

  • VITE_OTEL_ENDPOINT — OTLP/HTTP base URL. Defaults to /otel, i.e. same origin as the web UI (ingress routes /otel/* to the Collector).
  • VITE_APP_VERSION — populated from your CI build ID, becomes service.version.

Local dev: the Vite config proxies /otel to localhost:4318 so pnpm --filter @klaxon/web dev works against docker compose up -d without CORS gymnastics.
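The proxy rule has roughly this shape (an assumed sketch, not the repo's actual vite.config.ts; check that file for the real entry):

```typescript
// Assumed shape of the dev-server proxy entry: browser POSTs to /otel/* on
// the dev origin are forwarded to the local Collector's OTLP/HTTP listener.
const otelProxy = {
  "/otel": {
    target: "http://localhost:4318", // local OTel Collector, OTLP/HTTP port
    changeOrigin: true,
    // strip the /otel prefix so the Collector sees the bare OTLP paths
    rewrite: (path: string) => path.replace(/^\/otel/, ""),
  },
};
```

Because the browser only ever talks to its own origin, no CORS preflight is involved in local dev.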

Mobile (klaxon-mobile, Expo)

Expo EXPO_PUBLIC_* env vars, baked at EAS build time:

  • EXPO_PUBLIC_OTEL_ENDPOINT — OTLP/HTTP base URL, e.g. https://klaxon.sh/otel. Defaults to http://localhost:4318 for expo start.
  • EXPO_PUBLIC_ENVIRONMENT — becomes deployment.environment.

OneUptime setup

  1. Create (or open) a OneUptime project and navigate to Settings → Telemetry.

  2. Copy the OTLP ingestion URL (typically https://oneuptime.com/otlp for OneUptime Cloud) and the project token.

  3. Populate the Helm values:

    yaml
    secrets:
      oneuptimeOtlpEndpoint: "https://oneuptime.com/otlp"
      oneuptimeOtlpToken: "<your project token>"

    Or via Pulumi:

    bash
    pulumi config set klaxon:oneuptimeOtlpEndpoint https://oneuptime.com/otlp
    pulumi config set --secret klaxon:oneuptimeOtlpToken <your token>
  4. helm upgrade --install klaxon deploy/helm/klaxon -f values.yaml.

  5. Verify in OneUptime that klaxon-server, klaxon-auth, klaxon-worker, klaxon-web, klaxon-mobile all appear under Services, and that metrics with names like http_requests_total and klaxon_worker_task_duration_seconds show up.

Auth format

OneUptime authenticates OTLP/HTTP with an HTTP Authorization: Basic <token> header. The Collector config wraps ${env:ONEUPTIME_OTLP_TOKEN} in that header — operators just supply the raw token. If your self-hosted OneUptime install uses a different scheme (e.g. a custom header), edit exporters.otlphttp/oneuptime.headers in deploy/helm/klaxon/templates/otel-collector-configmap.yaml directly.
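As an illustration of what the Collector ends up attaching to each export (sketch only; the Collector itself does this via its YAML config, not code):

```typescript
// Illustrative only: the Authorization header sent with each OTLP/HTTP export.
// OneUptime expects the raw project token after "Basic ", not a
// base64-encoded user:password pair as in standard HTTP Basic auth.
function oneuptimeAuthHeader(token: string): Record<string, string> {
  return { Authorization: `Basic ${token}` };
}
```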

Local development

bash
docker compose up -d   # starts postgres + redis + otel-collector

The local Collector is otel/opentelemetry-collector-contrib with a debug exporter (see deploy/otel-collector-local.yaml), so everything sent to it is pretty-printed to the Collector's stdout:

bash
docker compose logs -f otel-collector

Run the Rust binaries against it:

bash
OTEL_ENABLED=true \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
DEPLOYMENT_ENVIRONMENT=development \
cargo run -p klaxon-server

Run the web dev server:

bash
pnpm --filter @klaxon/web dev
# opens http://localhost:5173 — the Vite proxy forwards /otel → 4318

Run Expo on a physical device (LAN):

bash
cd apps/klaxon-mobile
EXPO_PUBLIC_OTEL_ENDPOINT=http://<your-laptop-lan-ip>:4318 \
  npx expo start

What's instrumented

Server (klaxon-server)

  • Every HTTP request gets a root span via track_metrics middleware (crates/klaxon-server/src/metrics.rs). Attributes: method, path, request_id, status, duration_ms. The middleware also populates http_requests_total + http_request_duration_seconds Prometheus metrics.
  • #[tracing::instrument] on all handler functions — every DB transaction, every MCP tool call, every WebSocket lifecycle event gets a child span.
  • Two business metrics: klaxon_mcp_tool_calls_total{tool,success}, klaxon_active_websockets.

Auth (klaxon-auth)

  • Uses the shared klaxon-telemetry crate, so distributed traces from the browser (via the traceparent header) stitch together across the OAuth consent flow and subsequent /api/* calls.
  • No custom metrics yet.

Worker (klaxon-server --worker)

  • Each sweep (snooze_sweep, auto_archive_sweep, push_batch, webhook_batch, email_notifications, session_cleanup, audit_retention) gets a #[tracing::instrument] span and updates two metrics:
    • klaxon_worker_task_duration_seconds{task} histogram
    • klaxon_worker_task_failures_total{task} counter
  • klaxon_notification_queue_depth gauge sampled every 15 s from notification_queue.

Web (@klaxon/web)

  • Auto-instrumented via @opentelemetry/instrumentation-fetch + -xml-http-request + -document-load.
  • Every apiFetch call also gets a manual wrapping span with http.method + http.url + http.status_code so queries / filters can run against business-level attributes without needing to know the auto-instrumentation's span names.
  • log.* calls in packages/common/src/logger.ts emit OTLP log records + mirror to console.
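The manual wrapping span around apiFetch has roughly this shape (the real code obtains spans from the @opentelemetry/api tracer; the minimal SketchSpan interface and function names here are stand-ins so the attribute names are concrete):

```typescript
// Stand-in for an OpenTelemetry span; real code uses the SDK tracer.
interface SketchSpan {
  setAttribute(key: string, value: string | number): void;
  end(): void;
}

// Wrap an HTTP call in a manual span carrying business-level attributes,
// independent of whatever span names the fetch auto-instrumentation uses.
async function tracedApiFetch(
  startSpan: (name: string) => SketchSpan,
  doFetch: (url: string) => Promise<{ status: number }>,
  url: string,
  method = "GET",
): Promise<{ status: number }> {
  const span = startSpan("apiFetch");
  span.setAttribute("http.method", method);
  span.setAttribute("http.url", url);
  try {
    const res = await doFetch(url);
    span.setAttribute("http.status_code", res.status);
    return res;
  } finally {
    span.end(); // always end the span, even if doFetch throws
  }
}
```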

Mobile (klaxon-mobile)

  • Manual fetch wrapper (apps/klaxon-mobile/lib/api.ts) injects traceparent via propagation.inject.
  • screen.load span per route change, driven by usePathname in _layout.tsx.
  • Same log.* shape as the web app.

Request ID

Every server response includes an x-request-id header: if the client sends one it is propagated; otherwise a UUID v4 is generated. Use it to correlate JSON log entries (the request_id field), OTel trace spans (the request_id attribute), and client-side records.
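The propagate-or-generate rule reads roughly like this (the server is Rust; this TypeScript rendering of the assumed logic is only for illustration):

```typescript
import { randomUUID } from "node:crypto";

// Reuse the client-supplied x-request-id when present; mint a UUID v4 otherwise.
function resolveRequestId(incoming: string | null | undefined): string {
  return incoming && incoming.trim().length > 0 ? incoming : randomUUID();
}
```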

Now that trace_id is present in every log record, request_id serves mostly for operators grepping kubectl logs — OneUptime's UI jumps from span to logs via trace_id directly.

Scrape config (Prometheus, if you have one locally)

yaml
scrape_configs:
  - job_name: klaxon
    static_configs:
      - targets: ["localhost:3000", "localhost:3001"]
    metrics_path: /metrics
    scrape_interval: 15s

In production the OTel Collector does this scrape automatically via the prometheus receiver in deploy/helm/klaxon/templates/otel-collector-configmap.yaml.

Debug tips

  • No spans reaching OneUptime? Check kubectl logs <pod>-otel-collector. If you see "failed to push data to exporter", the ONEUPTIME_OTLP_TOKEN is wrong or the endpoint is unreachable.
  • trace_id missing from log records? The layer ordering in klaxon-telemetry::init is load-bearing (OpenTelemetryLayer before the appender). Regressing this will let logs pass through without span context. Covered by a unit test in the crate.
  • Browser OTLP being CORS-blocked? The Collector's otlp.http.cors.allowed_origins must include your web origin — edit values.yaml::otelCollector.cors.allowedOrigins.
  • Eyeballing incoming signals in cluster? Flip values.yaml::otelCollector.debugExporter.enabled = true and redeploy; the Collector will then log pretty-printed spans/logs alongside forwarding them to OneUptime. Disable before production (it logs full payloads).
  • RN app not producing spans? Make sure EXPO_PUBLIC_OTEL_ENDPOINT is reachable from the device (not just the Metro dev server) — physical devices need your laptop's LAN IP, not localhost.