Service 03

Observability

Traces. Logs. Metrics. Session replay. Wired into every Sage scaffold by default, correlated by trace ID across the whole stack.


The problem

Most teams ship without observability. Bugs are forensic exercises.

Most agentic apps ship without observability. The first prod bug is forensic: dig through container logs, guess at timestamps, correlate by user-id-if-you-have-it. Cerebe latency spike? You can't see it. Trace from web → API → LLM? You don't have one.

Teams that do wire observability typically wire about 30% of it: traces but not logs, metrics but not traces, or, worst of all, Sentry-style stack traces with no distributed tracing. The bug-investigation tax compounds for the life of the app.


How it works

OpenTelemetry everywhere; trace IDs correlate web → API → Cerebe.

Sage wires OpenTelemetry into both backend and frontend at scaffold time. FastAPI requests auto-emit spans. Outbound HTTP calls (including Cerebe SDK calls) propagate the trace. The web app sends OTel traces to the same backend — one trace_id binds the entire user request from click to LLM and back.
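To make the propagation step concrete, here is a minimal, stdlib-only sketch of the W3C Trace Context `traceparent` header, the format OpenTelemetry's default propagator uses to carry a trace ID across HTTP hops. It is an illustration rather than Sage's code: the function names `new_traceparent`, `parse_traceparent`, and `child_traceparent` are hypothetical, and in a scaffolded app the OpenTelemetry instrumentation would normally handle this for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, as the web app would on a user click."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed input."""
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # The spec treats all-zero trace and span IDs as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span id")
    return {"version": version, "trace_id": trace_id, "span_id": span_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Continue the same trace on an outbound call: keep trace_id, mint a new span_id."""
    parent = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)
    return f"{parent['version']}-{parent['trace_id']}-{new_span_id}-{parent['flags']}"


# web -> API -> Cerebe: three hops, one trace_id.
web_header = new_traceparent()
api_header = child_traceparent(web_header)
llm_header = child_traceparent(api_header)
trace_ids = {parse_traceparent(h)["trace_id"] for h in (web_header, api_header, llm_header)}
```

Each hop gets its own span ID, while `trace_ids` ends up holding a single value. That shared value is what lets a trace backend stitch the click, the API request, and the LLM call into one timeline.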

Structured logs go to Loki tagged with trace_id. Highlight.io sessions correlate to the same trace. A failed conversation shows up as a single timeline: session replay, distributed trace, structured logs, all linked.
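As a rough illustration of what "tagged with trace_id" can mean in practice, the sketch below stamps every log record with the current trace ID and renders it as one JSON object per line, using only the Python standard library. The names `current_trace_id`, `TraceIdFilter`, `JsonFormatter`, and `build_logger` are assumptions made for this example; shipping the lines to Loki is out of scope here, and in a real scaffold the ID would come from the active OpenTelemetry span rather than a hand-set context variable.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes trace-tagged JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


# Usage: set the trace ID once per request, then log as usual.
buffer = io.StringIO()
log = build_logger("chat-api", buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("conversation failed: %s", "upstream timeout")
entry = json.loads(buffer.getvalue())
```

Because the trace ID is a field on every line, a log query filtered by that ID returns exactly the lines belonging to one request.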

  • OpenTelemetry SDK instrumented in FastAPI + Next.js — auto-spans for HTTP, DB, LLM calls
  • Trace propagation across web → API → Cerebe; one trace_id binds the whole user request
  • Tempo (traces) + Loki (logs, structured + tagged with trace_id) + Grafana dashboards
  • Highlight.io session replay with trace correlation — replay a session, see the trace
  • Cerebe LLM calls auto-tagged with model, latency, token cost — cost-per-conversation visible
  • Local dev hits the same observability stack via k3d — debug on real telemetry, not console.log
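
The cost-per-conversation bullet above can be sketched as a simple aggregation over LLM-call span attributes. Everything specific here is an assumption for illustration: the `LlmSpan` field names, the model names, and the prices in `PRICE_PER_1K` are made up, and the actual attributes Sage attaches to Cerebe spans may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per 1,000 tokens; real prices depend on provider and model.
PRICE_PER_1K = {
    "model-small": {"input": 0.0005, "output": 0.0015},
    "model-large": {"input": 0.005, "output": 0.015},
}


@dataclass(frozen=True)
class LlmSpan:
    """Attributes an LLM-call span might carry (field names are assumptions)."""
    trace_id: str
    conversation_id: str
    model: str
    latency_ms: float
    input_tokens: int
    output_tokens: int


def span_cost(span: LlmSpan) -> float:
    """Token cost of a single LLM call in USD."""
    price = PRICE_PER_1K[span.model]
    return (span.input_tokens * price["input"] + span.output_tokens * price["output"]) / 1000


def cost_per_conversation(spans) -> dict:
    """Sum span costs grouped by conversation ID."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.conversation_id] += span_cost(span)
    return dict(totals)


spans = [
    LlmSpan("t1", "conv-a", "model-small", 420.0, 1000, 200),
    LlmSpan("t2", "conv-a", "model-large", 1800.0, 2000, 400),
    LlmSpan("t3", "conv-b", "model-small", 390.0, 400, 100),
]
totals = cost_per_conversation(spans)
```

With tags like these on every span, the same grouping can be done in a dashboard query instead of application code; the sketch only shows the arithmetic.
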
backend/app/telemetry/init.py python
# backend/app/telemetry/init.py
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# configure_tracer, configure_loki_handler and configure_highlight_proxy are
# scaffold helpers defined elsewhere in the telemetry package (not shown here).

def init_telemetry(app, service_name: str) -> None:
    """Wire Tempo (traces), Loki (logs), and Highlight (sessions)."""
    tracer_provider = configure_tracer(service_name)   # assumed to return a TracerProvider
    FastAPIInstrumentor.instrument_app(app, tracer_provider=tracer_provider)  # spans for inbound requests
    HTTPXClientInstrumentor().instrument()      # outbound HTTP calls propagate the trace
    configure_loki_handler(service_name)        # structured logs
    configure_highlight_proxy(service_name)     # session replay

// web/app/lib/observability.ts
import { Highlight } from "@highlight-run/next/client";
import { initOTel } from "@/lib/otel-web";

Highlight.init("3ej73o5e", { tracingOrigins: true });
initOTel();    // trace IDs correlate web → API → Cerebe
// Every request now produces:
//   - OTel trace span (Tempo)
//   - Structured log (Loki, tagged with trace_id)
//   - Highlight session segment (replay + heatmap)

Pricing relevance

Observability is ship-from-day-one across all paid tiers. The Highlight.io account is the customer's; Sage wires it. Enterprise tier includes a hosted Grafana + Tempo + Loki bundle option.

Open-source posture

Telemetry init code is OSS (Apache-2.0). Tempo, Loki, Grafana, OpenTelemetry are all OSS. Highlight.io is the one third-party SaaS dependency; replaceable with Sentry / Datadog / etc. by editing one file.

Get Started

Stop debugging with console.log.

Every Sage scaffold ships with traces, logs, metrics, and session replay correlated by trace ID. The next bug is a trace away, not a forensic exercise.