## Introduction
Deploying a Large Language Model application to production is only half the battle. The real challenge begins when your AI system is serving thousands of users and you need to understand what is happening inside the black box. LLM observability — the practice of monitoring, tracing, and evaluating LLM behavior in real time — has become an essential discipline for any team running AI applications at scale.
Unlike traditional software where a function returns deterministic outputs, LLMs are stochastic systems. The same input can produce different outputs, quality can degrade silently, costs can spiral unexpectedly, and hallucinations can erode user trust without triggering a single error in your logs. Traditional APM tools like Datadog or New Relic were not designed to capture the nuances of LLM behavior — prompt versions, token consumption, response quality, or chain-of-thought traces.
In this guide, we will explore the LLM observability landscape, compare the leading tools, walk through practical integration examples, and establish the metrics and patterns you need to monitor AI apps confidently in production.
## Why LLM Observability Is Different
Traditional application monitoring focuses on latency, error rates, and throughput. LLM observability requires all of that plus an entirely new set of concerns:
- Non-deterministic outputs — The same prompt can yield different responses, making regression testing fundamentally harder
- Quality is subjective — A 200 OK response does not mean the answer was correct, helpful, or safe
- Cost is per-token — A single runaway prompt can burn through your budget in minutes
- Multi-step chains — RAG pipelines, agent loops, and tool calls create deeply nested execution traces
- Prompt engineering is iterative — You need A/B testing and version tracking for prompts, not just code
- Hallucinations are silent failures — The model confidently returns wrong information with no error signal
These challenges map onto the core capabilities an LLM observability stack must provide:
- Tracing → Follow a request through every step of your LLM pipeline
- Evaluation → Automatically score output quality, relevance, and safety
- Metrics → Track latency, token usage, cost, and error rates
- Logging → Capture inputs, outputs, and metadata for every LLM call
- Alerting → Get notified when quality degrades, costs spike, or errors surge
## Key Metrics to Monitor

### Latency Metrics
- Time to First Token (TTFT) — How quickly the model starts streaming a response. Critical for user-perceived performance.
- End-to-end latency — Total time from request to complete response.
- Tokens per second — Throughput measure that helps identify bottlenecks.
- P50, P95, P99 latencies — Tail latencies matter enormously for LLM apps.
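To make those tail latencies concrete, here is a minimal sketch that summarizes a window of recorded request latencies into P50/P95/P99 using only the standard library (the function name `latency_percentiles` is mine, not from any tool above):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """Summarize a window of request latencies into P50/P95/P99."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example: 100 simulated latencies from 100 ms up to 10 s
samples = [100 + i * 100 for i in range(100)]
summary = latency_percentiles(samples)
```

In production you would compute this over a sliding window (say, the last 5 minutes) and alert on the P95 value rather than the mean, since a healthy average can hide a painful tail.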
### Token Usage and Cost
- Input tokens per request — Reveals prompt bloat and context stuffing issues
- Output tokens per request — Detects verbose or runaway completions
- Cost per request — Calculated from token counts and model pricing
- Cost per user / per feature — Essential for understanding unit economics
- Cache hit rate — If using semantic caching, track how often it saves API calls
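Tracking the cache hit rate takes only a small wrapper around your LLM call. The sketch below uses exact-match caching on a prompt hash as the simplest stand-in for a real semantic cache (which would match on embedding similarity); the class and method names are illustrative, not from any library:

```python
import hashlib

class CachingLLMClient:
    """Wrap an LLM call with an exact-match response cache and hit-rate tracking."""

    def __init__(self, llm_call):
        self._llm_call = llm_call  # function: prompt -> response text
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        response = self._llm_call(prompt)
        self._cache[key] = response
        return response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` property is exactly the number you would chart on a cost dashboard: every hit is an API call (and its tokens) you did not pay for.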
### Quality and Safety Metrics
- Hallucination rate — Percentage of responses containing fabricated information
- Relevance score — How well the response addresses the user query
- Groundedness — In RAG systems, whether the answer is supported by retrieved context
- Toxicity / safety score — Automated content safety checks on outputs
- User feedback rate — Thumbs up/down signals from end users
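These per-response signals only become dashboard metrics once aggregated. A minimal sketch, assuming each evaluated response is recorded as a small dict (the record shape and function name are mine):

```python
def quality_summary(evals: list[dict]) -> dict:
    """Aggregate per-response evaluation records into dashboard metrics.

    Each record is assumed to look like:
    {"hallucinated": bool, "relevance": float, "feedback": 1 | -1 | None}
    """
    n = len(evals)
    if n == 0:
        return {}
    with_feedback = [e for e in evals if e.get("feedback") is not None]
    return {
        "hallucination_rate": sum(e["hallucinated"] for e in evals) / n,
        "avg_relevance": sum(e["relevance"] for e in evals) / n,
        "feedback_rate": len(with_feedback) / n,
        "positive_feedback_share": (
            sum(e["feedback"] == 1 for e in with_feedback) / len(with_feedback)
            if with_feedback else 0.0
        ),
    }
```

Note that `feedback_rate` is tracked separately from the positive share: a sudden drop in how often users bother to give feedback can itself be an early quality signal.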
## Comparing LLM Observability Tools
| Tool | Type | Key Strengths | Pricing Model | Best For |
|---|---|---|---|---|
| LangSmith | Managed SaaS | Deep LangChain integration, dataset management, evaluation framework | Free tier + usage-based | Teams using LangChain/LangGraph |
| Langfuse | Open Source / Cloud | Framework-agnostic, self-hostable, prompt management, cost tracking | Open source (self-host free) + cloud tiers | Teams wanting vendor independence |
| Helicone | Proxy-based SaaS | Zero-code integration via proxy, request caching, rate limiting | Free tier + usage-based | Quick setup with any LLM provider |
| Arize Phoenix | Open Source / Cloud | Embedding drift detection, LLM traces, evaluation with LLM-as-judge | Open source + Arize cloud | ML teams needing drift monitoring |
| OpenLLMetry | Open Source | OpenTelemetry-native, vendor-neutral, works with any OTel backend | Open source | Teams with existing OTel infrastructure |
## Setting Up LangSmith for LLM Tracing

### Basic Configuration

```python
import os

# Configure LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_your_api_key_here"
os.environ["LANGCHAIN_PROJECT"] = "my-production-app"

# All LangChain calls are now traced automatically
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)
response = llm.invoke([HumanMessage(content="Explain quantum computing")])
```
### Custom Tracing with the `@traceable` Decorator

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="llm", name="generate_summary")
def summarize_document(document: str, max_length: int = 200) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Summarize in under {max_length} words."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

# retrieve_documents and format_context are your own retrieval helpers
@traceable(run_type="chain", name="rag_pipeline")
def rag_pipeline(query: str) -> dict:
    docs = retrieve_documents(query)
    context = format_context(docs)
    answer = summarize_document(context)
    return {"answer": answer, "sources": docs}
```
### Running Evaluations in LangSmith

```python
from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI

client = Client()

def relevance_evaluator(run, example):
    """Check if the output is relevant to the input question."""
    prediction = run.outputs.get("answer", "")
    judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    score_response = judge.invoke([
        {"role": "system", "content": "Score 1-5 how relevant the answer is. Return only the number."},
        {"role": "user", "content": f"Question: {example.inputs['question']}\nAnswer: {prediction}"},
    ])
    score = int(score_response.content.strip()) / 5.0
    return {"key": "relevance", "score": score}

results = evaluate(
    rag_pipeline,
    data="my-qa-dataset",
    evaluators=[relevance_evaluator],
    experiment_prefix="rag-v2",
)
```
## Integrating Langfuse for Open-Source Observability

### Basic Setup with the Langfuse Decorator
```python
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

client = OpenAI()

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    # Report token usage so Langfuse can compute per-call cost
    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
            "unit": "TOKENS",
        },
        model=model,
    )
    return result

@observe()
def rag_query(question: str) -> str:
    context = retrieve_context(question)  # your own retrieval helper
    answer = call_llm(f"Context: {context}\n\nQuestion: {question}")
    langfuse_context.score_current_trace(
        name="user_feedback",
        value=1,
        comment="User found answer helpful",
    )
    return answer
```
### Prompt Management with Langfuse

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the latest production prompt by name
prompt = langfuse.get_prompt("rag-system-prompt", label="production")

# Use the compiled prompt with variables
compiled = prompt.compile(max_length="200", tone="professional")
```
## Helicone: Zero-Code Proxy-Based Monitoring

```python
from openai import OpenAI

# Route through the Helicone proxy (a one-line change)
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        "Helicone-Property-Environment": "production",
        "Helicone-Property-Feature": "document-qa",
        "Helicone-User-Id": "user-123",
        "Helicone-Cache-Enabled": "true",
    },
)

# All calls are now logged with latency, tokens, cost, and custom properties
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is observability?"}],
)
```
How do these integration styles compare?
- Proxy-based (Helicone) → Fastest setup, no code changes, great for cost tracking and caching. Limited for complex chain tracing.
- SDK-based (LangSmith, Langfuse) → Deeper tracing, evaluation frameworks, prompt management. Requires code instrumentation.
- Hybrid approach → Use a proxy for cost/caching and an SDK for tracing. Many teams combine both.
## OpenTelemetry for LLMs
OpenTelemetry (OTel) is the vendor-neutral observability standard. The OpenLLMetry project provides auto-instrumentation for LLM libraries that exports traces in standard OTel format, meaning you can send LLM traces to any OTel-compatible backend.
```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task
from openai import OpenAI

Traceloop.init(app_name="my-ai-app")
client = OpenAI()

@workflow(name="document_qa")
def answer_question(question: str, documents: list[str]) -> str:
    context = select_relevant_docs(question, documents)  # your own ranking helper
    return generate_answer(question, context)

@task(name="generate_answer")
def generate_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
OTel prevents vendor lock-in. By instrumenting with OpenTelemetry, you can switch between backends (Jaeger, Datadog, Grafana, Arize) without changing your application code. The Semantic Conventions for Generative AI standardize attribute names such as `gen_ai.request.model` and `gen_ai.usage.input_tokens` (earlier revisions used `gen_ai.usage.prompt_tokens`), ensuring consistent data across tools.
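To see what those conventions look like when instrumenting manually, here is a minimal sketch that assembles the GenAI attributes for a span (the helper name is mine, and the exact attribute names should be checked against the current conventions, which are still evolving). With the OpenTelemetry SDK installed, you would apply the dict inside a span via `span.set_attributes(...)`:

```python
def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          system: str = "openai") -> dict:
    """Build a GenAI semantic-convention attribute dict for an OTel span.

    With the OpenTelemetry SDK this would be applied as:
        with tracer.start_as_current_span("llm_call") as span:
            span.set_attributes(genai_span_attributes(...))
    """
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```

Because these are plain span attributes, any OTel backend can filter and aggregate on them without knowing anything about LLMs.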
## Building a Custom Observability Layer

```python
import functools
import json
import logging
import time
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logger = logging.getLogger("llm_observability")

@dataclass
class LLMTrace:
    trace_id: str
    span_name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    status: str
    timestamp: str
    metadata: dict

# USD per million tokens; verify against current provider pricing
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (output_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)

def observe_llm(span_name: str, model: str = "gpt-4o"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trace_id = str(uuid.uuid4())
            start_time = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                latency = (time.perf_counter() - start_time) * 1000
                usage = getattr(result, "usage", None)
                input_tokens = usage.prompt_tokens if usage else 0
                output_tokens = usage.completion_tokens if usage else 0
                trace = LLMTrace(
                    trace_id=trace_id, span_name=span_name, model=model,
                    input_tokens=input_tokens, output_tokens=output_tokens,
                    latency_ms=round(latency, 2),
                    cost_usd=calculate_cost(model, input_tokens, output_tokens),
                    status="success",
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    metadata={"args_count": len(args)},
                )
                logger.info(json.dumps(asdict(trace)))
                return result
            except Exception as e:
                latency = (time.perf_counter() - start_time) * 1000
                trace = LLMTrace(
                    trace_id=trace_id, span_name=span_name, model=model,
                    input_tokens=0, output_tokens=0, latency_ms=round(latency, 2),
                    cost_usd=0, status="error",
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    metadata={"error": str(e)},
                )
                logger.error(json.dumps(asdict(trace)))
                raise
        return wrapper
    return decorator
```
## Hallucination Detection in Production

```python
import json

from openai import OpenAI

client = OpenAI()

def check_groundedness(context: str, answer: str) -> dict:
    """Use an LLM judge to check if the answer is grounded in the context."""
    judge_prompt = f"""You are an expert fact-checker. Given the context and answer below,
determine if the answer is fully supported by the context.

Context: {context}
Answer: {answer}

Respond with JSON:
{{"grounded": true/false, "score": 0.0-1.0, "unsupported_claims": ["list of claims not in context"]}}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
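Running a judge model on every response doubles your LLM spend, so most teams evaluate only a sample of traffic. A minimal sketch of deterministic sampling (the helper name is mine): hashing the trace ID keeps the decision stable across retries and replays, unlike `random.random()`:

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to run the costly LLM judge on a trace."""
    # Map the trace ID to a stable bucket in [0, 10000)
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(sample_rate * 10_000)
```

A typical pattern is to call `check_groundedness` only when `should_evaluate(trace_id, 0.05)` is true, giving you a stable 5% evaluation sample.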
## Production Monitoring Dashboard Essentials

### Real-Time Operational Metrics
- Request volume — Requests per minute, broken down by model, feature, and user segment
- Latency distribution — P50/P95/P99 with alerting on P95 breaches
- Error rate — Rate limit errors, timeouts, malformed responses
- Token throughput — Input and output tokens per minute with anomaly detection
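The anomaly detection mentioned above need not be sophisticated to be useful. A minimal z-score sketch over recent token-throughput readings (function name and threshold are illustrative):

```python
import statistics

def is_throughput_anomaly(history: list[float], current: float,
                          z_threshold: float = 3.0) -> bool:
    """Flag the current tokens-per-minute reading if it deviates more than
    z_threshold standard deviations from the recent history."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```

A spike in output tokens per minute often means a runaway prompt or a model that has started producing much longer completions, both worth paging on before the bill arrives.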
### Cost Management
- Daily cost trend — Total spend with rolling 7-day average and budget threshold alerts
- Cost per feature — Break down spending by product feature
- Cost per user — Identify power users and potential abuse patterns
- Model cost comparison — Side-by-side cost of different models for the same tasks
### Recommended Alert Rules
- Alert on P95 latency > 10s for interactive use cases
- Alert on error rate > 5% over a 5-minute window
- Alert on daily cost exceeding 120% of the trailing 7-day average
- Alert on quality score dropping below 0.7 for any feature
- Alert on hallucination rate > 10% over the past hour
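The daily-cost rule above is simple enough to sketch directly (the function name and 1.2 multiplier mirror the 120%-of-trailing-average rule; both are tunable):

```python
import statistics

def cost_alert(daily_costs: list[float], today_cost: float,
               threshold: float = 1.2) -> bool:
    """Fire when today's spend exceeds threshold x the trailing 7-day average."""
    window = daily_costs[-7:]  # trailing week of completed days
    if not window:
        return False  # no baseline yet
    baseline = statistics.fmean(window)
    return today_cost > threshold * baseline
```

Evaluated once an hour against spend-so-far prorated to a full day, this catches runaway prompts well before the monthly invoice does.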
## Best Practices for LLM Observability
- Instrument from Day One — Integrate tracing before your first production deployment. The cost of adding instrumentation later is far higher.
- Log Everything, Sample Wisely — Capture full input/output pairs but be mindful of PII. Use sampling for high-volume endpoints.
- Version Your Prompts — Every prompt change is effectively a deployment. Track which version produced which results.
- Build Evaluation Datasets Early — Curate a golden dataset from real production traffic and run automated evaluations on every change.
- Combine Automated and Human Evaluation — LLM-as-judge evaluations are useful for scale, but regularly review traces manually.
- Use Structured Metadata — Tag every trace with `user_id`, `feature`, `environment`, `prompt_version`, and `model`.
## Conclusion
LLM observability is not a luxury — it is a production requirement. As AI applications become more complex with multi-step agents, RAG pipelines, and tool use, the need for deep visibility into LLM behavior only grows. The good news is that the tooling ecosystem has matured significantly: LangSmith offers deep integration for LangChain users, Langfuse provides a powerful open-source alternative, Helicone delivers instant proxy-based monitoring, and OpenTelemetry is emerging as the vendor-neutral standard.
The key takeaway is to start with tracing and metrics, then layer on automated evaluation and quality monitoring as your application matures. Track your costs from day one, version your prompts, build evaluation datasets from real traffic, and set up alerts before you need them.
At HexoByte Solutions, we help teams implement robust LLM observability stacks as part of our AI engineering services. Whether you are deploying your first AI feature or scaling a production system handling millions of requests, having the right observability foundation is what separates reliable AI applications from unpredictable ones.