
LLM evals & observability — traces, scoring, live monitoring

Treat LLM apps like production systems: offline evals, live traces, and metrics at scale. This hub covers major eval/observability platforms and OSS options.

This category fills the gap between “I tuned this prompt once” and “I can ship a change and watch 200 regressions pass while P99 latency stays flat.” Key comparison points: custom scorer support, a unified **trace + eval + replay** loop, and whether offline and online evals share one dataset. Depth of LangChain/LlamaIndex/OpenAI SDK integration is another frequent deciding factor.
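To make “custom scorer” and “one shared dataset” concrete, here is a minimal sketch of an offline eval loop. All names (`golden_set`, `score_exact_match`, `run_offline_eval`) are illustrative, not any platform's API; real platforms wrap this pattern with versioning, tracing, and async execution.

```python
def score_exact_match(expected: str, actual: str) -> float:
    """Custom scorer: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_offline_eval(golden_set, generate):
    """Score every golden example; a platform that shares datasets
    across modes can sample the same examples for online checks."""
    scores = [score_exact_match(ex["expected"], generate(ex["input"]))
              for ex in golden_set]
    return sum(scores) / len(scores)

golden_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
# Stand-in for a real model call, just to make the sketch runnable.
fake_generate = lambda q: {"2+2": "4", "capital of France": "paris"}[q]
print(run_offline_eval(golden_set, fake_generate))  # → 1.0
```

Swapping `score_exact_match` for an LLM-as-judge call is the usual next step; the loop itself stays the same.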


LangSmith vs Langfuse vs Braintrust

LangSmith leans into LangChain; Langfuse is self-hostable OSS; Braintrust is eval-first. Run one real pipeline through two of them for two weeks and see which tab engineers actually open.

How do I eval a RAG stack?

Typically retrieval metrics (recall/precision/nDCG) + generation scores (correctness, faithfulness, groundedness), topped with human spot-checks. Look for built-in LLM-as-judge and golden dataset management.

Monitoring LLMs in production

Track P50/P95 latency, token cost distribution, failure rate, and PII leakage. Confirm log retention and training-use clauses on each vendor site.
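Those latency percentiles are cheap to compute from raw trace logs if a vendor does not surface them. A minimal sketch using nearest-rank percentiles (the `traces` records are invented sample data):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile over a list of observations."""
    ordered = sorted(values)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

traces = [
    {"latency_ms": 420,  "tokens": 900,  "ok": True},
    {"latency_ms": 380,  "tokens": 850,  "ok": True},
    {"latency_ms": 1500, "tokens": 4100, "ok": False},  # slow outlier
    {"latency_ms": 450,  "tokens": 980,  "ok": True},
    {"latency_ms": 400,  "tokens": 870,  "ok": True},
]
latencies = [t["latency_ms"] for t in traces]
failure_rate = sum(not t["ok"] for t in traces) / len(traces)
print(percentile(latencies, 50), percentile(latencies, 95))  # → 420 1500
print(failure_rate)  # → 0.2
```

Token cost distribution falls out the same way from the `tokens` field; PII leakage needs a scanner pass over the logged text, which is a separate tool choice.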

Tools in this category

See each tool's detail page for a summary and official links, and browse related entries in the same category.