LLM Observability: How to Monitor AI Apps in Production


Introduction

Deploying a Large Language Model application to production is only half the battle. The real challenge begins when your AI system is serving thousands of users and you need to understand what is happening inside the black box. LLM observability — the practice of monitoring, tracing, and evaluating LLM behavior in real time — has become an essential discipline for any team running AI applications at scale.

Unlike traditional software, where the same input reliably produces the same output, LLMs are stochastic systems: identical prompts can yield different responses, quality can degrade silently, costs can spiral unexpectedly, and hallucinations can erode user trust without triggering a single error in your logs. Traditional APM tools like Datadog or New Relic were not designed to capture the nuances of LLM behavior: prompt versions, token consumption, response quality, or chain-of-thought traces.

In this guide, we will explore the LLM observability landscape, compare the leading tools, walk through practical integration examples, and establish the metrics and patterns you need to monitor AI apps confidently in production.

Why LLM Observability Is Different

Traditional application monitoring focuses on latency, error rates, and throughput. LLM observability requires all of that plus an entirely new set of concerns:

The Core Pillars of LLM Observability:
  • Tracing → Follow a request through every step of your LLM pipeline
  • Evaluation → Automatically score output quality, relevance, and safety
  • Metrics → Track latency, token usage, cost, and error rates
  • Logging → Capture inputs, outputs, and metadata for every LLM call
  • Alerting → Get notified when quality degrades, costs spike, or errors surge

Key Metrics to Monitor

Latency Metrics
  • Time to first token (TTFT) → the dominant factor in perceived responsiveness for streaming interfaces
  • End-to-end latency at P50/P95/P99 → track percentiles rather than averages, since LLM latency is long-tailed
  • Per-step latency → time spent in retrieval, the LLM call, and post-processing within a pipeline

Token Usage and Cost
  • Input and output tokens per request → output tokens are typically priced several times higher than input tokens
  • Cost per request, per feature, and per user → needed to attribute spend and catch runaway usage early
  • Context window utilization → oversized prompts inflate both cost and latency

Quality and Safety Metrics
  • Relevance and groundedness scores → typically produced by automated LLM-as-judge evaluators
  • Hallucination rate → the share of responses containing claims unsupported by the provided context
  • Safety signals → refusals, policy violations, and suspected prompt-injection attempts
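To make these metrics concrete, here is a minimal, stdlib-only sketch that aggregates per-request records into the latency, token, and cost figures a dashboard would display. The function names and record shape are our own for illustration, and the pricing figures are examples only; always check your provider's current rates.

```python
# Illustrative pricing in USD per 1M tokens (example figures, not current rates)
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

def summarize(requests: list[dict]) -> dict:
    """Aggregate raw per-request records into dashboard-style metrics."""
    latencies = [r["latency_ms"] for r in requests]
    cost = sum(
        r["input_tokens"] / 1e6 * PRICING[r["model"]]["input"]
        + r["output_tokens"] / 1e6 * PRICING[r["model"]]["output"]
        for r in requests
    )
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "total_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in requests),
        "total_cost_usd": round(cost, 4),
        "error_rate": sum(r["status"] == "error" for r in requests) / len(requests),
    }
```

In practice an observability platform computes these aggregates for you; the sketch simply shows which raw fields each metric is derived from.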

Comparing LLM Observability Tools

  • LangSmith → Managed SaaS. Key strengths: deep LangChain integration, dataset management, evaluation framework. Pricing: free tier + usage-based. Best for teams using LangChain/LangGraph.
  • Langfuse → Open source / cloud. Key strengths: framework-agnostic, self-hostable, prompt management, cost tracking. Pricing: open source (self-host free) + cloud tiers. Best for teams wanting vendor independence.
  • Helicone → Proxy-based SaaS. Key strengths: zero-code integration via proxy, request caching, rate limiting. Pricing: free tier + usage-based. Best for quick setup with any LLM provider.
  • Arize Phoenix → Open source / cloud. Key strengths: embedding drift detection, LLM traces, evaluation with LLM-as-judge. Pricing: open source + Arize cloud. Best for ML teams needing drift monitoring.
  • OpenLLMetry → Open source. Key strengths: OpenTelemetry-native, vendor-neutral, works with any OTel backend. Pricing: open source. Best for teams with existing OTel infrastructure.

Setting Up LangSmith for LLM Tracing

Basic Configuration

import os

# Configure LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_your_api_key_here"
os.environ["LANGCHAIN_PROJECT"] = "my-production-app"

# All LangChain calls are now traced automatically
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)
response = llm.invoke([HumanMessage(content="Explain quantum computing")])

Custom Tracing with the @traceable Decorator

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="llm", name="generate_summary")
def summarize_document(document: str, max_length: int = 200) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Summarize in under {max_length} words."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

@traceable(run_type="chain", name="rag_pipeline")
def rag_pipeline(query: str) -> dict:
    docs = retrieve_documents(query)
    context = format_context(docs)
    answer = summarize_document(context)
    return {"answer": answer, "sources": docs}

Running Evaluations in LangSmith

from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI

client = Client()

def relevance_evaluator(run, example):
    """Check if the output is relevant to the input question."""
    prediction = run.outputs.get("answer", "")
    reference = example.outputs.get("expected_answer", "")

    judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    score_response = judge.invoke([
        {"role": "system", "content": "Score 1-5 how relevant the answer is. Return only the number."},
        {"role": "user", "content": f"Question: {example.inputs['question']}\nAnswer: {prediction}"},
    ])
    score = int(score_response.content.strip()) / 5.0
    return {"key": "relevance", "score": score}

results = evaluate(
    rag_pipeline,
    data="my-qa-dataset",
    evaluators=[relevance_evaluator],
    experiment_prefix="rag-v2",
)

Integrating Langfuse for Open-Source Observability

Basic Setup with the Langfuse Decorator

import os
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

client = OpenAI()

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content

    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
            "unit": "TOKENS",
        },
        model=model,
    )
    return result

@observe()
def rag_query(question: str) -> str:
    context = retrieve_context(question)
    answer = call_llm(f"Context: {context}\n\nQuestion: {question}")

    langfuse_context.score_current_trace(
        name="user_feedback",
        value=1,
        comment="User found answer helpful",
    )
    return answer

Prompt Management with Langfuse

from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the latest production prompt by name
prompt = langfuse.get_prompt("rag-system-prompt", label="production")

# Use the compiled prompt with variables
compiled = prompt.compile(max_length="200", tone="professional")

Helicone: Zero-Code Proxy-Based Monitoring

from openai import OpenAI

# Route through Helicone proxy - one line change
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        "Helicone-Property-Environment": "production",
        "Helicone-Property-Feature": "document-qa",
        "Helicone-User-Id": "user-123",
        "Helicone-Cache-Enabled": "true",
    },
)

# All calls are now logged with latency, tokens, cost, and custom properties
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is observability?"}],
)

When to Use Proxy-Based vs. SDK-Based Observability:
  • Proxy-based (Helicone) → Fastest setup, no code changes, great for cost tracking and caching. Limited for complex chain tracing.
  • SDK-based (LangSmith, Langfuse) → Deeper tracing, evaluation frameworks, prompt management. Requires code instrumentation.
  • Hybrid approach → Use a proxy for cost/caching and an SDK for tracing. Many teams combine both.

OpenTelemetry for LLMs

OpenTelemetry (OTel) is the vendor-neutral observability standard. The OpenLLMetry project provides auto-instrumentation for LLM libraries that exports traces in standard OTel format, meaning you can send LLM traces to any OTel-compatible backend.

from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task

Traceloop.init(app_name="my-ai-app")

from openai import OpenAI
client = OpenAI()

@workflow(name="document_qa")
def answer_question(question: str, documents: list[str]) -> str:
    context = select_relevant_docs(question, documents)
    return generate_answer(question, context)

@task(name="generate_answer")
def generate_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

Why OpenTelemetry Matters for LLM Observability:

OTel prevents vendor lock-in. By instrumenting with OpenTelemetry, you can switch between backends (Jaeger, Datadog, Grafana, Arize) without changing your application code. The Semantic Conventions for GenAI standardize attribute names like gen_ai.usage.prompt_tokens and gen_ai.request.model, ensuring consistent data across tools.
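To make the conventions concrete, here is a small sketch of the attribute dictionary an exporter would attach to a chat-completion span. The helper is our own, not part of any SDK, and the attribute names are the ones cited above; the GenAI semantic conventions are still evolving, so verify the names against the convention version your backend expects.

```python
def genai_span_attributes(model: str, prompt_tokens: int,
                          completion_tokens: int, system: str = "openai") -> dict:
    """Build span attributes following the GenAI semantic convention names
    referenced in this article."""
    return {
        "gen_ai.system": system,                              # provider name
        "gen_ai.request.model": model,                        # requested model
        "gen_ai.usage.prompt_tokens": prompt_tokens,          # input token count
        "gen_ai.usage.completion_tokens": completion_tokens,  # output token count
    }
```

Because every tool reads the same attribute names, a dashboard built against one backend keeps working when you swap in another.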

Building a Custom Observability Layer

import time
import json
import logging
import functools
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logger = logging.getLogger("llm_observability")

@dataclass
class LLMTrace:
    trace_id: str
    span_name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    status: str
    timestamp: str
    metadata: dict

MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)

def observe_llm(span_name: str, model: str = "gpt-4o"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            import uuid
            trace_id = str(uuid.uuid4())
            start_time = time.perf_counter()

            try:
                result = func(*args, **kwargs)
                latency = (time.perf_counter() - start_time) * 1000

                usage = getattr(result, "usage", None)
                input_tokens = usage.prompt_tokens if usage else 0
                output_tokens = usage.completion_tokens if usage else 0

                trace = LLMTrace(
                    trace_id=trace_id, span_name=span_name, model=model,
                    input_tokens=input_tokens, output_tokens=output_tokens,
                    latency_ms=round(latency, 2),
                    cost_usd=calculate_cost(model, input_tokens, output_tokens),
                    status="success",
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    metadata={"args_count": len(args)},
                )
                logger.info(json.dumps(asdict(trace)))
                return result

            except Exception as e:
                latency = (time.perf_counter() - start_time) * 1000
                trace = LLMTrace(
                    trace_id=trace_id, span_name=span_name, model=model,
                    input_tokens=0, output_tokens=0, latency_ms=round(latency, 2),
                    cost_usd=0, status="error",
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    metadata={"error": str(e)},
                )
                logger.error(json.dumps(asdict(trace)))
                raise
        return wrapper
    return decorator

Hallucination Detection in Production

import json

from openai import OpenAI

client = OpenAI()

def check_groundedness(context: str, answer: str) -> dict:
    """Use an LLM judge to check if the answer is grounded in the context."""
    judge_prompt = f"""You are an expert fact-checker. Given the context and answer below,
determine if the answer is fully supported by the context.

Context: {context}

Answer: {answer}

Respond with JSON:
{{"grounded": true/false, "score": 0.0-1.0, "unsupported_claims": ["list of claims not in context"]}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Production Monitoring Dashboard Essentials

Real-Time Operational Metrics

Cost Management

Alerting Best Practices:
  • Alert on P95 latency > 10s for interactive use cases
  • Alert on error rate > 5% over a 5-minute window
  • Alert on daily cost exceeding 120% of the trailing 7-day average
  • Alert on quality score dropping below 0.7 for any feature
  • Alert on hallucination rate > 10% over the past hour
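As an illustration, the daily-cost rule above can be implemented as a simple check over a chronological series of daily spend figures. This is a hedged sketch: the 120% threshold comes from the list above, while the function name and input shape are our own.

```python
def cost_spike_alert(daily_costs: list[float], threshold: float = 1.2) -> bool:
    """Return True if today's spend exceeds threshold x the trailing 7-day average.

    daily_costs: chronological daily spend in USD, with today as the last entry.
    """
    if len(daily_costs) < 8:
        return False  # not enough history to form a baseline
    today = daily_costs[-1]
    baseline = sum(daily_costs[-8:-1]) / 7  # average of the prior 7 days
    return today > threshold * baseline
```

The same trailing-average pattern works for the other rules, such as error rate over a 5-minute window, by swapping the aggregation period and threshold.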

Best Practices for LLM Observability
  • Start with tracing and metrics, then layer on automated evaluation and quality monitoring as the application matures
  • Track token costs from day one, attributed per feature and per user
  • Version your prompts so a quality regression can be traced to a specific change
  • Build evaluation datasets from real production traffic rather than synthetic examples alone
  • Set up alerts before you need them, not after the first incident

Conclusion

LLM observability is not a luxury — it is a production requirement. As AI applications become more complex with multi-step agents, RAG pipelines, and tool use, the need for deep visibility into LLM behavior only grows. The good news is that the tooling ecosystem has matured significantly: LangSmith offers deep integration for LangChain users, Langfuse provides a powerful open-source alternative, Helicone delivers instant proxy-based monitoring, and OpenTelemetry is emerging as the vendor-neutral standard.

The key takeaway is to start with tracing and metrics, then layer on automated evaluation and quality monitoring as your application matures. Track your costs from day one, version your prompts, build evaluation datasets from real traffic, and set up alerts before you need them.

At HexoByte Solutions, we help teams implement robust LLM observability stacks as part of our AI engineering services. Whether you are deploying your first AI feature or scaling a production system handling millions of requests, having the right observability foundation is what separates reliable AI applications from unpredictable ones.