Self-Host Opik: Open-Source LLM Observability with Docker

2026-05-03

What is Opik?

Opik (built by Comet) is an open-source platform for the full lifecycle of LLM applications. It gives you comprehensive tracing, evaluation, and monitoring so you can understand exactly what your AI agents are doing — in development and in production.

Why it's trending: With 19K+ GitHub stars and active daily development, Opik fills a critical gap. As teams move LLM prototypes to production, they need visibility into token usage, latency, hallucination rates, and prompt effectiveness. Opik provides this out of the box with 40+ native integrations (LangChain, OpenAI, Anthropic, CrewAI, Dify, Mastra, OpenWebUI, and more).

Key Capabilities

Traces & Spans — Track every LLM call with full context (input, output, tokens, latency, metadata)
Evaluation — LLM-as-a-judge metrics for hallucination detection, answer relevance, context precision
Datasets & Experiments — Version your prompts and run A/B evaluations
Playground — Iterate on prompts side-by-side before deploying
Production Monitoring — Dashboards for trace counts, token usage, and feedback scores
Agent Optimizer — Automatically improve prompts and tool selection

Architecture Overview

Opik's self-hosted deployment consists of seven core services orchestrated with Docker Compose:

[Figure: Opik architecture]

Frontend — Nginx + React SPA served on port 5173. The web dashboard for exploring traces, running experiments, and managing projects.
Backend (Java/Dropwizard) — The core API server. Handles trace ingestion, project management, authentication, and coordinates evaluations via the Python backend.
Python Backend — Dedicated service for running LLM evaluations (scoring) and the Agent Optimizer. Uses RQ (Redis Queue) for async job processing.
MySQL 8.4 — Stores state data: projects, datasets, experiment configurations, and user accounts.
ClickHouse — Columnar analytics database optimized for high-volume trace/spans ingestion (40M+ traces/day).
Redis — In-memory cache and job queue backbone.
MinIO — S3-compatible object storage for trace attachments and media.

Optional services include Guardrails (content safety filters) and the OpenTelemetry stack (OTel Collector + Jaeger) for infrastructure observability.

Prerequisites

Docker Engine 24+ and Docker Compose v2
8 GB RAM (16 GB recommended for evaluation workloads)
10 GB free disk space
Linux, macOS, or Windows (WSL2)

Step 1: Clone and Launch

git clone https://github.com/comet-ml/opik.git
cd opik

# Start the full Opik platform
./opik.sh

The launcher script handles everything: pulling images, starting infrastructure (MySQL, ClickHouse, Redis, MinIO, ZooKeeper), then bringing up the backend and frontend. First launch takes 3-5 minutes while images download.

Once ready, open http://localhost:5173 in your browser.
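
Because the first launch can take a few minutes, it can help to poll the dashboard URL instead of refreshing by hand. The following is a minimal sketch using only the standard library; the helper names (http_ok, wait_until_ready) and the retry settings are illustrative choices, not part of Opik.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return None
```

Calling wait_until_ready("http://localhost:5173") returns the number of attempts it took, or None if the frontend never answered within the retry budget.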

Service Profiles

You can start Opik in different modes by passing a mode flag to opik.sh:

./opik.sh --infra          # Databases and caches only
./opik.sh --backend        # Infra + backend services
./opik.sh                  # Full Opik suite (default)
./opik.sh --guardrails     # Full suite + content safety
./opik.sh --backend --guardrails  # Backend + guardrails

For production-like deployments with the OpenTelemetry stack:

./opik.sh --opik-otel

Step 2: Install the Python SDK

pip install opik

Configure the SDK to talk to your local instance:

opik configure

When prompted for the server address, enter http://localhost:5173/api (or leave blank and set use_local=True in code).

Alternatively, configure programmatically:

import opik
opik.configure(use_local=True)
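
In containers or CI jobs, where neither the interactive prompt nor a saved config file is convenient, an environment variable is another option. This sketch assumes the SDK honors OPIK_URL_OVERRIDE, which is the variable name as I understand it from the SDK documentation; verify it against your installed version. The helper name use_self_hosted_opik is made up for illustration.

```python
import os


def use_self_hosted_opik(base_url: str = "http://localhost:5173") -> str:
    """Point the Opik SDK at a self-hosted instance via the environment."""
    api_url = base_url.rstrip("/") + "/api"
    # Assumption: the SDK reads OPIK_URL_OVERRIDE when it builds its client.
    os.environ["OPIK_URL_OVERRIDE"] = api_url
    return api_url


print(use_self_hosted_opik())  # http://localhost:5173/api
```

Set the variable before the first traced call so the client picks it up.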

Step 3: Log Your First Trace

Create a Python script to verify the setup:

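import opik
from openai import OpenAI
from opik.integrations.openai import track_openai

# Requires OPENAI_API_KEY in the environment
opik.configure(use_local=True)
# Wrap the client so each completion is logged with its model and token usage
client = track_openai(OpenAI())

@opik.track
def ask_llm(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

result = ask_llm("What is LLM observability?")
print(result)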

Run the script, then refresh your Opik dashboard at http://localhost:5173. You'll see a new trace under Projects → Default Project, showing:

Input prompt and model response
Token count and latency
Nested spans (if you add more @opik.track decorators)
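
To see how nested spans arise, here is a small sketch in which one tracked function calls two others. It assumes the SDK is already configured as in Step 2. The function names and the canned retrieval result are hypothetical, and the fallback decorator exists only so the script also runs where the SDK is not installed.

```python
try:
    import opik
    track = opik.track
except ImportError:
    # Fallback for environments without the SDK: a no-op decorator
    def track(func):
        return func


@track
def retrieve(question: str) -> list[str]:
    # Stand-in for a vector search; returns fixed context
    return ["Observability means inspecting a system through its outputs."]


@track
def build_prompt(question: str, context: list[str]) -> str:
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"


@track
def answer(question: str) -> str:
    # Tracked functions called from here are expected to appear as child spans
    context = retrieve(question)
    return build_prompt(question, context)


print(answer("What is LLM observability?"))
```

With the SDK installed, the dashboard should show one trace for answer with retrieve and build_prompt nested beneath it.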

Using Framework Integrations

If you're already using LangChain, LlamaIndex, or CrewAI, Opik ships dedicated integrations for them. For example, with LangChain you attach Opik's callback handler to the chain call:

import opik
from opik.integrations.langchain import OpikTracer
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

opik.configure(use_local=True)

# The callback handler records each chain step as a span
opik_tracer = OpikTracer()

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Explain {topic} in one sentence.")
chain = prompt | llm

# Pass the tracer in the run config so this call is traced
result = chain.invoke(
    {"topic": "retrieval-augmented generation"},
    config={"callbacks": [opik_tracer]},
)

No decorators needed: once the callback is attached, Opik's LangChain integration captures every chain step.

Step 4: Run Evaluations

Opik's evaluation framework lets you measure quality with LLM-as-a-judge metrics:

from opik.evaluation.metrics import (
    Hallucination,
    AnswerRelevance,
    ContextPrecision
)

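# Check for hallucination (the judge is itself an LLM call, so a model API key must be set)
hallucination_metric = Hallucination()
score = hallucination_metric.score(
    input="What is the capital of France?",
    output="Paris",
    context=["France is a country in Europe."]
)
# score is a result object; .value holds the numeric score
print(f"Hallucination score: {score.value}")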

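# Evaluate answer relevance
relevance = AnswerRelevance()
score = relevance.score(
    input="Explain quantum computing",
    output="Quantum computing uses qubits...",
    # Supplying context keeps the call valid on SDK versions that require it
    context=["Quantum computers process information with qubits."],
)
print(f"Relevance score: {score.value}")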
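
Judge scores become more useful once they drive a decision. The sketch below turns two scores into a pass/fail gate. The thresholds are hypothetical, and it assumes a lower hallucination score and a higher relevance score are better, which matches my understanding of these metrics but is worth confirming in the metric documentation.

```python
# Hypothetical thresholds; tune them for your own application
MAX_HALLUCINATION = 0.5  # assumption: lower is better (0 = grounded)
MIN_RELEVANCE = 0.7      # assumption: higher is better (1 = fully relevant)


def passes_quality_gate(hallucination: float, relevance: float) -> bool:
    """Return True when both judge scores fall on the acceptable side."""
    return hallucination <= MAX_HALLUCINATION and relevance >= MIN_RELEVANCE


print(passes_quality_gate(hallucination=0.0, relevance=0.9))  # True
print(passes_quality_gate(hallucination=1.0, relevance=0.9))  # False
```

In practice you would pass the numeric value from each metric result into the gate, for example in a CI check that blocks a prompt change when quality drops.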

To run a systematic evaluation, create a dataset of test cases and run an experiment:

from opik import Opik
from opik.evaluation import evaluate

client = Opik()
dataset = client.get_or_create_dataset("QA benchmark")

# Insert test cases
dataset.insert([
    {"input": "What is Docker?", "expected_output": "A containerization platform"},
    {"input": "Define Kubernetes", "expected_output": "Container orchestration system"},
])

# Define your evaluation task
def evaluation_task(item):
    # Your LLM call here
    return {"output": "...", "context": ["..."]}

# Run experiment
evaluate(
    experiment_name="My first eval",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()],
)

Results appear in the Opik Experiments dashboard, where you can compare runs side by side.
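
The evaluation task above is left as a placeholder, so here is one way it might be filled in. This is a sketch only: fake_llm is a hypothetical stand-in for your real model call, and reusing the reference answer as context is a placeholder for documents your retriever would return.

```python
def fake_llm(question: str) -> str:
    """Hypothetical stand-in for a real model call."""
    canned = {
        "What is Docker?": "Docker is a containerization platform.",
        "Define Kubernetes": "Kubernetes is a container orchestration system.",
    }
    return canned.get(question, "I don't know.")


def evaluation_task(item: dict) -> dict:
    """Map one dataset item to the fields the scoring metrics read."""
    answer = fake_llm(item["input"])
    return {
        "output": answer,
        # Placeholder context; a real pipeline would return retrieved documents
        "context": [item["expected_output"]],
    }


sample = {"input": "What is Docker?", "expected_output": "A containerization platform"}
print(evaluation_task(sample))
```

The keys in the returned dictionary line up with the argument names the metrics take (output, context), while input comes from the dataset item itself.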

Verification Checklist

After deployment, verify each component:

Frontend loads at http://localhost:5173
Backend health check: curl http://localhost:5173/api/health-check
Python backend: curl http://localhost:8000/healthcheck
MySQL: docker exec opik-mysql-1 mysqladmin ping -h 127.0.0.1 --silent
ClickHouse: curl http://localhost:8123/ping
Redis: docker exec opik-redis-1 redis-cli -a opik ping
MinIO console: http://localhost:9090 (login with credentials from compose)
First trace appears in the dashboard after running the test script
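
The HTTP checks in this list can be scripted. The sketch below uses only the standard library and the URLs from the checklist; the function names are illustrative, and the MySQL and Redis checks are left out because they go through docker exec rather than HTTP.

```python
import urllib.error
import urllib.request

# HTTP endpoints taken from the checklist above
ENDPOINTS = {
    "frontend": "http://localhost:5173",
    "backend": "http://localhost:5173/api/health-check",
    "python-backend": "http://localhost:8000/healthcheck",
    "clickhouse": "http://localhost:8123/ping",
    "minio-console": "http://localhost:9090",
}


def fetch_status(url: str, timeout: float = 3.0):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with a success status
    except (urllib.error.URLError, OSError):
        return None


def run_checks(endpoints, fetch=fetch_status):
    """Return {name: bool}, where True means a 2xx response."""
    results = {}
    for name, url in endpoints.items():
        status = fetch(url)
        results[name] = status is not None and 200 <= status < 300
    return results


def format_report(results):
    """Render one OK/FAIL line per service."""
    return "\n".join(f"{'OK  ' if ok else 'FAIL'} {name}" for name, ok in results.items())
```

Running print(format_report(run_checks(ENDPOINTS))) after startup prints one line per service, which makes it easy to spot the component that failed to come up.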

Managing Opik

View logs:

docker compose -f deployment/docker-compose/docker-compose.yaml --profile opik logs -f

Stop Opik:

docker compose -f deployment/docker-compose/docker-compose.yaml --profile opik down

Update to latest:

git pull
docker compose -f deployment/docker-compose/docker-compose.yaml --profile opik pull
./opik.sh

Data persistence: All data survives container restarts via named Docker volumes (mysql, clickhouse, redis-data, minio-data, zookeeper). To fully reset, remove the volumes:

docker compose -f deployment/docker-compose/docker-compose.yaml --profile opik down -v
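
Since every management command repeats the same compose file and profile, a small wrapper can save typing in automation scripts. This is a sketch under the assumption that you run it from the repository root; compose_cmd and run_compose are hypothetical helper names.

```python
import subprocess

COMPOSE_FILE = "deployment/docker-compose/docker-compose.yaml"


def compose_cmd(*args: str, profile: str = "opik") -> list[str]:
    """Build the docker compose argument list used in this section."""
    return ["docker", "compose", "-f", COMPOSE_FILE, "--profile", profile, *args]


def run_compose(*args: str) -> int:
    """Run the command from the repository root and return its exit code."""
    return subprocess.run(compose_cmd(*args), check=False).returncode


print(" ".join(compose_cmd("logs", "-f")))
```

For example, run_compose("down") stops the stack, and run_compose("down", "-v") also removes the volumes.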

Resources

GitHub: comet-ml/opik — 19K+ stars, Apache 2.0
Documentation: comet.com/docs/opik
Integrations: 40+ frameworks including LangChain, OpenAI, Anthropic, CrewAI, Dify, Mastra
Community: Slack · Twitter/X
Alternative tools: LangSmith (commercial), LangFuse (open-source), Phoenix/Arize (open-source)