
LLM evals & observability — traces, scoring, live monitoring

Treat LLM apps like production systems: offline evals, live traces, and metrics at scale. This hub covers major eval/observability platforms and OSS options.

The gap between “I tuned this prompt once” and “I can ship a change and watch 200 regressions pass while P99 latency stays flat” is what this category fills. Key points of comparison: custom scorer support, unified **trace + eval + replay**, and whether offline and online evals share one dataset. Depth of LangChain/LlamaIndex/OpenAI SDK integration is another frequent deciding factor.
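To make the "custom scorer plus regression run" idea concrete, here is a minimal sketch in plain Python. It does not use any vendor SDK: `run_regression`, `normalized_exact_match`, `GOLDEN_CASES`, and `fake_generate` are hypothetical names invented for illustration, and `fake_generate` stands in for a real model call.

```python
from typing import Callable, Dict, List

# A scorer maps (model_output, expected) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def normalized_exact_match(output: str, expected: str) -> float:
    """Custom scorer: 1.0 if the strings match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_regression(
    cases: List[Dict[str, str]],
    generate: Callable[[str], str],
    scorer: Scorer,
    threshold: float = 1.0,
) -> List[Dict[str, object]]:
    """Run every case through `generate`; return the cases scoring below `threshold`."""
    failures = []
    for case in cases:
        output = generate(case["input"])
        score = scorer(output, case["expected"])
        if score < threshold:
            failures.append({"input": case["input"], "output": output, "score": score})
    return failures


# Hypothetical golden dataset; a real suite would hold many more cases.
GOLDEN_CASES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "largest planet", "expected": "Jupiter"},
]


def fake_generate(prompt: str) -> str:
    """Stand-in for the real LLM call (assumption for this sketch)."""
    canned = {"2+2": "4", "capital of France": " paris ", "largest planet": "Saturn"}
    return canned.get(prompt, "")


failures = run_regression(GOLDEN_CASES, fake_generate, normalized_exact_match)
print(f"{len(GOLDEN_CASES) - len(failures)}/{len(GOLDEN_CASES)} cases passed")
```

A platform in this category typically replaces the in-memory list with a managed dataset and stores each run next to its traces, but the shape of the loop is roughly the same.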

LangSmith vs Langfuse vs Braintrust

LangSmith leans into LangChain; Langfuse is self-hostable OSS; Braintrust is eval-first. Run one real pipeline through two of them for two weeks and see which tab engineers actually open.

How do I eval a RAG stack?

A typical setup combines retrieval metrics (recall, precision, nDCG) with generation scores (correctness, faithfulness, groundedness), plus human spot-checks. Look for built-in LLM-as-judge and golden dataset management.
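The retrieval half of that recipe can be sketched in a few lines of standard-library Python. This assumes binary relevance (a document is either relevant or not) and hypothetical document IDs; graded relevance and the generation-side scores are out of scope here.

```python
import math
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents (k must be > 0)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical query result: ranked IDs from the retriever and the gold set.
retrieved_ids = ["d1", "d7", "d3", "d9"]
relevant_ids = {"d1", "d3", "d5"}

print(precision_at_k(retrieved_ids, relevant_ids, k=3))
print(recall_at_k(retrieved_ids, relevant_ids, k=3))
print(ndcg_at_k(retrieved_ids, relevant_ids, k=3))
```

Averaging these per-query values over a golden dataset gives the numbers to track across retriever or chunking changes.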

Monitoring LLMs in production

Track P50/P95 latency, token cost distribution, failure rate, and PII leakage. Confirm log retention and training-use clauses on each vendor site.
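As a rough illustration of the latency, cost, and failure-rate part, the sketch below summarizes a batch of call records. Everything here is an assumption for the example: the `CallRecord` shape, the nearest-rank percentile definition, and the per-1k-token prices, which are placeholders rather than any vendor's real pricing. PII detection is not covered.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    """One logged LLM call (hypothetical schema for this sketch)."""
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    ok: bool


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(
    records: List[CallRecord],
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> Dict[str, float]:
    """Aggregate latency percentiles, failure rate, and token cost for a batch."""
    latencies = [r.latency_ms for r in records]
    costs = [
        r.prompt_tokens / 1000 * usd_per_1k_prompt
        + r.completion_tokens / 1000 * usd_per_1k_completion
        for r in records
    ]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "failure_rate": sum(1 for r in records if not r.ok) / len(records),
        "total_cost_usd": sum(costs),
        "max_cost_usd": max(costs),
    }


# Ten synthetic calls: latencies 100..1000 ms, the slowest one marked as failed.
records = [
    CallRecord(latency_ms=100.0 * (i + 1), prompt_tokens=1000,
               completion_tokens=500, ok=(i != 9))
    for i in range(10)
]
summary = summarize(records, usd_per_1k_prompt=0.001, usd_per_1k_completion=0.002)
print(summary)
```

Observability platforms usually compute these over sliding time windows and per route or model; the batch version above only shows which quantities are involved.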

Tools in this category

See each tool's detail page for an overview and official links. The related picks within this category are also worth a look.

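LangSmith: LangChain's official evaluation and tracing platform. Datasets, scorers, production monitoring, and human review, with the deepest integration into LangChain and LangGraph.

Evals / observability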

Langfuse: an open-source LLM observability and evaluation platform. Traces, datasets, scorers, and prompt management; self-hostable with Docker.

Evals / observability

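Braintrust: a popular AI product. Check the official site for features, pricing, supported regions, data handling, and the latest models.

Evals / observability

Arize Phoenix: a popular AI product. Check the official site for features, pricing, supported regions, data handling, and the latest models.

Evals / observability

Helicone: a popular AI product. Check the official site for features, pricing, supported regions, data handling, and the latest models.

Evals / observability

Galileo: a popular AI product. Check the official site for features, pricing, supported regions, data handling, and the latest models.

Evals / observability

Patronus AI: a popular AI product. Check the official site for features, pricing, supported regions, data handling, and the latest models.

Evals / observability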