## Introduction
Deploying a Large Language Model application to production is only half the battle. The real challenge begins when your AI system is serving thousands of users and you need to understand what is happening inside the black box. LLM observability — the practice of monitoring, tracing, and evaluating LLM behavior in real time — has become an essential discipline for any team running AI applications at scale.
Unlike traditional software where a function returns deterministic outputs, LLMs are stochastic systems. The same input can produce different outputs, quality can degrade silently, costs can spiral unexpectedly, and hallucinations can erode user trust without triggering a single error in your logs. Traditional APM tools like Datadog or New Relic were not designed to capture the nuances of LLM behavior — prompt versions, token consumption, response quality, or chain-of-thought traces.
In this guide, we will explore the LLM observability landscape, compare the leading tools, walk through practical integration examples, and establish the metrics and patterns you need to monitor AI apps confidently in production.
## Why LLM Observability Is Different
Traditional application monitoring focuses on latency, error rates, and throughput. LLM observability requires all of that plus an entirely new set of concerns:
- Non-deterministic outputs — The same prompt can yield different responses, making regression testing fundamentally harder
- Quality is subjective — A 200 OK response does not mean the answer was correct, helpful, or safe
- Cost is per-token — A single runaway prompt can burn through your budget in minutes
- Multi-step chains — RAG pipelines, agent loops, and tool calls create deeply nested execution traces
- Prompt engineering is iterative — You need A/B testing and version tracking for prompts, not just code
- Hallucinations are silent failures — The model confidently returns wrong information with no error signal
These challenges map onto the core capabilities an LLM observability stack must provide:
- Tracing → Follow a request through every step of your LLM pipeline
- Evaluation → Automatically score output quality, relevance, and safety
- Metrics → Track latency, token usage, cost, and error rates
- Logging → Capture inputs, outputs, and metadata for every LLM call
- Alerting → Get notified when quality degrades, costs spike, or errors surge
## Key Metrics to Monitor

### Latency Metrics
- Time to First Token (TTFT) — How quickly the model starts streaming a response. Critical for user-perceived performance.
- End-to-end latency — Total time from request to complete response.
- Tokens per second — Throughput measure that helps identify bottlenecks.
- P50, P95, P99 latencies — Tail latencies matter enormously for LLM apps.
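To make those tail latencies concrete, here is a minimal sketch that summarizes a window of recorded request latencies into P50/P95/P99 using only the standard library (the function name `latency_percentiles` is mine, not from any tool above):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """Summarize a window of request latencies into P50/P95/P99."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example: 100 simulated latencies from 100 ms up to 10 s
samples = [100 + i * 100 for i in range(100)]
summary = latency_percentiles(samples)
```

In production you would compute this over a sliding window (say, the last 5 minutes) and alert on the P95 value rather than the mean, since a healthy average can hide a painful tail.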
### Token Usage and Cost
- Input tokens per request — Reveals prompt bloat and context stuffing issues
- Output tokens per request — Detects verbose or runaway completions
- Cost per request — Calculated from token counts and model pricing
- Cost per user / per feature — Essential for understanding unit economics
- Cache hit rate — If using semantic caching, track how often it saves API calls
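Tracking the cache hit rate takes only a small wrapper around your LLM call. The sketch below uses exact-match caching on a prompt hash as the simplest stand-in for a real semantic cache (which would match on embedding similarity); the class and method names are illustrative, not from any library:

```python
import hashlib

class CachingLLMClient:
    """Wrap an LLM call with an exact-match response cache and hit-rate tracking."""

    def __init__(self, llm_call):
        self._llm_call = llm_call  # function: prompt -> response text
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        response = self._llm_call(prompt)
        self._cache[key] = response
        return response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` property is exactly the number you would chart on a cost dashboard: every hit is an API call (and its tokens) you did not pay for.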
### Quality and Safety Metrics
- Hallucination rate — Percentage of responses containing fabricated information
- Relevance score — How well the response addresses the user query
- Groundedness — In RAG systems, whether the answer is supported by retrieved context
- Toxicity / safety score — Automated content safety checks on outputs
- User feedback rate — Thumbs up/down signals from end users
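These per-response signals only become dashboard metrics once aggregated. A minimal sketch, assuming each evaluated response is recorded as a small dict (the record shape and function name are mine):

```python
def quality_summary(evals: list[dict]) -> dict:
    """Aggregate per-response evaluation records into dashboard metrics.

    Each record is assumed to look like:
    {"hallucinated": bool, "relevance": float, "feedback": 1 | -1 | None}
    """
    n = len(evals)
    if n == 0:
        return {}
    with_feedback = [e for e in evals if e.get("feedback") is not None]
    return {
        "hallucination_rate": sum(e["hallucinated"] for e in evals) / n,
        "avg_relevance": sum(e["relevance"] for e in evals) / n,
        "feedback_rate": len(with_feedback) / n,
        "positive_feedback_share": (
            sum(e["feedback"] == 1 for e in with_feedback) / len(with_feedback)
            if with_feedback else 0.0
        ),
    }
```

Note that `feedback_rate` is tracked separately from the positive share: a sudden drop in how often users bother to give feedback can itself be an early quality signal.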
## Comparing LLM Observability Tools
| Tool | Type | Key Strengths | Pricing Model | Best For |
|---|---|---|---|---|
| LangSmith | Managed SaaS | Deep LangChain integration, dataset management, evaluation framework | Free tier + usage-based | Teams using LangChain/LangGraph |
| Langfuse | Open Source / Cloud | Framework-agnostic, self-hostable, prompt management, cost tracking | Open source (self-host free) + cloud tiers | Teams wanting vendor independence |
| Helicone | Proxy-based SaaS | Zero-code integration via proxy, request caching, rate limiting | Free tier + usage-based | Quick setup with any LLM provider |
| Arize Phoenix | Open Source / Cloud | Embedding drift detection, LLM traces, evaluation with LLM-as-judge | Open source + Arize cloud | ML teams needing drift monitoring |
| OpenLLMetry | Open Source | OpenTelemetry-native, vendor-neutral, works with any OTel backend | Open source | Teams with existing OTel infrastructure |
## Setting Up LangSmith for LLM Tracing

### Basic Configuration

```python
import os

# Configure LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_your_api_key_here"
os.environ["LANGCHAIN_PROJECT"] = "my-production-app"

# All LangChain calls are now traced automatically
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)
response = llm.invoke([HumanMessage(content="Explain quantum computing")])
```
### Custom Tracing with the `@traceable` Decorator

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="llm", name="generate_summary")
def summarize_document(document: str, max_length: int = 200) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Summarize in under {max_length} words."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

# retrieve_documents and format_context are your own retrieval helpers
@traceable(run_type="chain", name="rag_pipeline")
def rag_pipeline(query: str) -> dict:
    docs = retrieve_documents(query)
    context = format_context(docs)
    answer = summarize_document(context)
    return {"answer": answer, "sources": docs}
```
### Running Evaluations in LangSmith

```python
from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI

client = Client()

def relevance_evaluator(run, example):
    """Check if the output is relevant to the input question."""
    prediction = run.outputs.get("answer", "")
    judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    score_response = judge.invoke([
        {"role": "system", "content": "Score 1-5 how relevant the answer is. Return only the number."},
        {"role": "user", "content": f"Question: {example.inputs['question']}\nAnswer: {prediction}"},
    ])
    score = int(score_response.content.strip()) / 5.0
    return {"key": "relevance", "score": score}

results = evaluate(
    rag_pipeline,
    data="my-qa-dataset",
    evaluators=[relevance_evaluator],
    experiment_prefix="rag-v2",
)
```
## Integrating Langfuse for Open-Source Observability

### Basic Setup with the Langfuse Decorator
```python
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

client = OpenAI()

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    # Report token usage so Langfuse can compute per-call cost
    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
            "unit": "TOKENS",
        },
        model=model,
    )
    return result

@observe()
def rag_query(question: str) -> str:
    context = retrieve_context(question)  # your own retrieval helper
    answer = call_llm(f"Context: {context}\n\nQuestion: {question}")
    langfuse_context.score_current_trace(
        name="user_feedback",
        value=1,
        comment="User found answer helpful",
    )
    return answer
```
### Prompt Management with Langfuse

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the latest production prompt by name
prompt = langfuse.get_prompt("rag-system-prompt", label="production")

# Use the compiled prompt with variables
compiled = prompt.compile(max_length="200", tone="professional")
```
## Helicone: Zero-Code Proxy-Based Monitoring

```python
from openai import OpenAI

# Route through the Helicone proxy (a one-line change)
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        "Helicone-Property-Environment": "production",
        "Helicone-Property-Feature": "document-qa",
        "Helicone-User-Id": "user-123",
        "Helicone-Cache-Enabled": "true",
    },
)

# All calls are now logged with latency, tokens, cost, and custom properties
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is observability?"}],
)
```
How do these integration styles compare?
- Proxy-based (Helicone) → Fastest setup, no code changes, great for cost tracking and caching. Limited for complex chain tracing.
- SDK-based (LangSmith, Langfuse) → Deeper tracing, evaluation frameworks, prompt management. Requires code instrumentation.
- Hybrid approach → Use a proxy for cost/caching and an SDK for tracing. Many teams combine both.
## OpenTelemetry for LLMs
OpenTelemetry (OTel) is the vendor-neutral observability standard. The OpenLLMetry project provides auto-instrumentation for LLM libraries that exports traces in standard OTel format, meaning you can send LLM traces to any OTel-compatible backend.
```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task
from openai import OpenAI

Traceloop.init(app_name="my-ai-app")
client = OpenAI()

@workflow(name="document_qa")
def answer_question(question: str, documents: list[str]) -> str:
    context = select_relevant_docs(question, documents)  # your own ranking helper
    return generate_answer(question, context)

@task(name="generate_answer")
def generate_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
OTel prevents vendor lock-in. By instrumenting with OpenTelemetry, you can switch between backends (Jaeger, Datadog, Grafana, Arize) without changing your application code. The Semantic Conventions for Generative AI standardize attribute names such as `gen_ai.request.model` and `gen_ai.usage.input_tokens` (earlier revisions used `gen_ai.usage.prompt_tokens`), ensuring consistent data across tools.
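To see what those conventions look like when instrumenting manually, here is a minimal sketch that assembles the GenAI attributes for a span (the helper name is mine, and the exact attribute names should be checked against the current conventions, which are still evolving). With the OpenTelemetry SDK installed, you would apply the dict inside a span via `span.set_attributes(...)`:

```python
def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          system: str = "openai") -> dict:
    """Build a GenAI semantic-convention attribute dict for an OTel span.

    With the OpenTelemetry SDK this would be applied as:
        with tracer.start_as_current_span("llm_call") as span:
            span.set_attributes(genai_span_attributes(...))
    """
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```

Because these are plain span attributes, any OTel backend can filter and aggregate on them without knowing anything about LLMs.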
## Building a Custom Observability Layer

```python
import functools
import json
import logging
import time
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logger = logging.getLogger("llm_observability")

@dataclass
class LLMTrace:
    trace_id: str
    span_name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    status: str
    timestamp: str
    metadata: dict

# USD per million tokens; verify against current provider pricing
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)

def observe_llm(span_name: str, model: str = "gpt-4o"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trace_id = str(uuid.uuid4())
            start_time = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                latency = (time.perf_counter() - start_time) * 1000
                usage = getattr(result, "usage", None)
                input_tokens = usage.prompt_tokens if usage else 0
                output_tokens = usage.completion_tokens if usage else 0
                trace = LLMTrace(
                    trace_id=trace_id, span_name=span_name, model=model,
                    input_tokens=input_tokens, output_tokens=output_tokens,
                    latency_ms=round(latency, 2),
                    cost_usd=calculate_cost(model, input_tokens, output_tokens),
                    status="success",
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    metadata={"args_count": len(args)},
                )
                logger.info(json.dumps(asdict(trace)))
                return result
            except Exception as e:
                latency = (time.perf_counter() - start_time) * 1000
                trace = LLMTrace(
                    trace_id=trace_id, span_name=span_name, model=model,
                    input_tokens=0, output_tokens=0, latency_ms=round(latency, 2),
                    cost_usd=0, status="error",
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    metadata={"error": str(e)},
                )
                logger.error(json.dumps(asdict(trace)))
                raise
        return wrapper
    return decorator
```
## Hallucination Detection in Production

```python
import json

from openai import OpenAI

client = OpenAI()

def check_groundedness(context: str, answer: str) -> dict:
    """Use an LLM judge to check if the answer is grounded in the context."""
    judge_prompt = f"""You are an expert fact-checker. Given the context and answer below,
determine if the answer is fully supported by the context.

Context: {context}
Answer: {answer}

Respond with JSON:
{{"grounded": true/false, "score": 0.0-1.0, "unsupported_claims": ["list of claims not in context"]}}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
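Running a judge model on every response doubles your LLM spend, so most teams evaluate only a sample of traffic. A minimal sketch of deterministic sampling (the helper name is mine): hashing the trace ID keeps the decision stable across retries and replays, unlike `random.random()`:

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to run the costly LLM judge on a trace."""
    # Map the trace ID to a stable bucket in [0, 10000)
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(sample_rate * 10_000)
```

A typical pattern is to call `check_groundedness` only when `should_evaluate(trace_id, 0.05)` is true, giving you a stable 5% evaluation sample.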
## Production Monitoring Dashboard Essentials

### Real-Time Operational Metrics
- Request volume — Requests per minute, broken down by model, feature, and user segment
- Latency distribution — P50/P95/P99 with alerting on P95 breaches
- Error rate — Rate limit errors, timeouts, malformed responses
- Token throughput — Input and output tokens per minute with anomaly detection
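The anomaly detection mentioned above need not be sophisticated to be useful. A minimal z-score sketch over recent token-throughput readings (function name and threshold are illustrative):

```python
import statistics

def is_throughput_anomaly(history: list[float], current: float,
                          z_threshold: float = 3.0) -> bool:
    """Flag the current tokens-per-minute reading if it deviates more than
    z_threshold standard deviations from the recent history."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```

A spike in output tokens per minute often means a runaway prompt or a model that has started producing much longer completions, both worth paging on before the bill arrives.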
### Cost Management
- Daily cost trend — Total spend with rolling 7-day average and budget threshold alerts
- Cost per feature — Break down spending by product feature
- Cost per user — Identify power users and potential abuse patterns
- Model cost comparison — Side-by-side cost of different models for the same tasks
### Recommended Alert Rules
- Alert on P95 latency > 10s for interactive use cases
- Alert on error rate > 5% over a 5-minute window
- Alert on daily cost exceeding 120% of the trailing 7-day average
- Alert on quality score dropping below 0.7 for any feature
- Alert on hallucination rate > 10% over the past hour
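The daily-cost rule above is simple enough to sketch directly (the function name and 1.2 multiplier mirror the 120%-of-trailing-average rule; both are tunable):

```python
import statistics

def cost_alert(daily_costs: list[float], today_cost: float,
               threshold: float = 1.2) -> bool:
    """Fire when today's spend exceeds threshold x the trailing 7-day average."""
    window = daily_costs[-7:]  # trailing week of completed days
    if not window:
        return False  # no baseline yet
    baseline = statistics.fmean(window)
    return today_cost > threshold * baseline
```

Evaluated once an hour against spend-so-far prorated to a full day, this catches runaway prompts well before the monthly invoice does.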
## Best Practices for LLM Observability
- Instrument from Day One — Integrate tracing before your first production deployment. The cost of adding instrumentation later is far higher.
- Log Everything, Sample Wisely — Capture full input/output pairs but be mindful of PII. Use sampling for high-volume endpoints.
- Version Your Prompts — Every prompt change is effectively a deployment. Track which version produced which results.
- Build Evaluation Datasets Early — Curate a golden dataset from real production traffic and run automated evaluations on every change.
- Combine Automated and Human Evaluation — LLM-as-judge evaluations are useful for scale, but regularly review traces manually.
- Use Structured Metadata — Tag every trace with `user_id`, `feature`, `environment`, `prompt_version`, and `model`.
## Conclusion
LLM observability is not a luxury — it is a production requirement. As AI applications become more complex with multi-step agents, RAG pipelines, and tool use, the need for deep visibility into LLM behavior only grows. The good news is that the tooling ecosystem has matured significantly: LangSmith offers deep integration for LangChain users, Langfuse provides a powerful open-source alternative, Helicone delivers instant proxy-based monitoring, and OpenTelemetry is emerging as the vendor-neutral standard.
The key takeaway is to start with tracing and metrics, then layer on automated evaluation and quality monitoring as your application matures. Track your costs from day one, version your prompts, build evaluation datasets from real traffic, and set up alerts before you need them.
At HexoByte Solutions, we help teams implement robust LLM observability stacks as part of our AI engineering services. Whether you are deploying your first AI feature or scaling a production system handling millions of requests, having the right observability foundation is what separates reliable AI applications from unpredictable ones.