After deploying a RAG system, more than 80% of teams still rely on manual spot-checks, or simply wait for user complaints, to judge quality. That is not an engineering answer; it is trusting to luck. The RAGAS framework attempts to solve this problem with an "LLM evaluating LLM" approach.
What this is
RAGAS (Retrieval Augmented Generation Assessment) is a RAG quality-evaluation framework proposed in 2023. RAG, in which an LLM first retrieves relevant documents from a knowledge base and then generates an answer grounded in them, is easy to deploy but hard to judge: LLMs are inherently good at sounding plausible, and when they quietly mix in content that is not in the retrieved documents, users rarely notice in the short term.
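To make that pipeline concrete, here is a minimal retrieve-then-generate sketch. The names vector_search and call_llm are hypothetical placeholders for whatever vector store and chat-completion client a real deployment uses; this illustrates the flow, not any particular implementation.

```python
# Minimal retrieve-then-generate loop. vector_search and call_llm are
# hypothetical placeholders for a real vector store and LLM client.

def vector_search(query: str, k: int = 4) -> list[str]:
    """Return the k document chunks most relevant to the query."""
    raise NotImplementedError  # e.g. a FAISS, pgvector, or Elasticsearch lookup

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API call."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    contexts = vector_search(question)
    context_block = "\n\n".join(contexts)
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```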
Traditional NLP metrics like BLEU and ROUGE fail badly in RAG scenarios: they only measure surface-level lexical overlap and cannot answer the core question of whether the answer is faithful to the retrieved content. RAGAS's solution is to have another LLM evaluate the RAG system's output: the LLM-as-Judge pattern.
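In its simplest form the pattern looks like the sketch below: a second model grades the first model's output against the retrieved evidence. Again, call_llm is a hypothetical placeholder, and the 1-to-5 rubric is illustrative rather than anything RAGAS prescribes.

```python
# LLM-as-Judge in its simplest form: a second model grades a RAG answer
# against the retrieved evidence. call_llm is a hypothetical placeholder.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def judge_groundedness(question: str, answer: str, contexts: list[str]) -> int:
    """Return a 1-5 grade for how well the answer is supported by the context."""
    context_block = "\n".join(contexts)
    prompt = (
        "You are grading a RAG answer.\n"
        f"Question: {question}\n"
        f"Retrieved context:\n{context_block}\n"
        f"Answer: {answer}\n"
        "On a scale of 1 to 5, how well is the answer supported by the "
        "context? Reply with a single digit."
    )
    return int(call_llm(prompt).strip()[0])
```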
RAGAS defines two metrics for the generation side and two for the retrieval side, forming a 2×2 evaluation matrix:
- Faithfulness: Whether the answer comes only from the retrieved content, with nothing made up. The judge breaks the answer into independent claims, checks whether each is supported by the retrieved documents, and computes Faithfulness = supported claims / total claims (see the sketch after this list).
- Answer Relevancy: Whether the answer actually addresses the question. The evaluating LLM generates candidate questions from the answer, then measures their semantic similarity to the original question.
- Context Recall: Whether retrieval missed key information. This is the only one of the four metrics that still requires a human-provided reference answer.
- Context Precision: Whether the useful chunks are ranked near the top of the retrieved results.
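Here is a minimal sketch of how the two generation-side metrics can be computed. call_llm and embed are hypothetical placeholder clients, and the prompts are illustrative, not RAGAS's actual templates.

```python
# Sketch of the two generation-side metrics. call_llm and embed are
# hypothetical placeholders; the prompts are illustrative, not RAGAS's
# actual templates.
import math

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def embed(text: str) -> list[float]:
    raise NotImplementedError  # any embedding model

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def faithfulness(answer: str, contexts: list[str]) -> float:
    # 1. Decompose the answer into independent atomic claims.
    raw = call_llm(
        f"List the independent factual claims in this answer, one per line:\n{answer}"
    )
    claims = [c.strip() for c in raw.splitlines() if c.strip()]
    if not claims:
        return 0.0
    # 2. Ask the judge whether each claim is entailed by the retrieved context.
    context_block = "\n".join(contexts)
    supported = sum(
        call_llm(
            f"Context:\n{context_block}\n\nClaim: {claim}\n"
            "Is the claim fully supported by the context? Answer yes or no."
        ).strip().lower().startswith("yes")
        for claim in claims
    )
    # Faithfulness = supported claims / total claims
    return supported / len(claims)

def answer_relevancy(question: str, answer: str, n: int = 3) -> float:
    # Generate n questions the answer could plausibly be answering, then
    # average their embedding similarity to the original question.
    generated = [
        call_llm(f"Write one question that this answer directly responds to:\n{answer}")
        for _ in range(n)
    ]
    q_vec = embed(question)
    return sum(cosine(q_vec, embed(g)) for g in generated) / n
```

The two retrieval-side metrics follow the same judge-then-aggregate pattern, applied to the reference answer (Context Recall) and to the ranked chunks (Context Precision).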
Industry view
The automated-evaluation direction that RAGAS represents is becoming an industry consensus: as RAG moves from experimentation to production, gut-feeling quality judgments must be replaced by quantifiable, repeatable systems. The emergence of production-grade integrations, such as Spring Boot + LangChain4j stacks, shows RAGAS moving from paper to engineering practice.
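For teams using the Python package directly, a minimal run looks roughly like the sketch below. This follows the 0.1-era ragas API; module paths, metric names, and dataset columns have shifted across releases, and a configured judge model (by default, an OpenAI API key) is assumed.

```python
# Minimal end-to-end run with the ragas package (0.1-era API; details vary
# by version). Assumes OPENAI_API_KEY is set, since ragas defaults to an
# OpenAI judge model.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation sample; real runs use a few hundred question/answer pairs.
data = Dataset.from_dict({
    "question": ["What does the leave policy say about carry-over?"],
    "answer": ["Unused leave can be carried over for up to 12 months."],
    "contexts": [[
        "Employees may carry over unused annual leave for a maximum of 12 months."
    ]],
    "ground_truth": [  # needed only by context_recall
        "Unused annual leave may be carried over for at most 12 months."
    ],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric averages, e.g. {'faithfulness': 1.0, ...}
```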
It is worth cautioning, though, that LLM-as-Judge is itself unreliable: the evaluating LLM also hallucinates and carries biases, and using a flawed tool to inspect a flawed system is logically imperfect. Second, high faithfulness does not equal a correct answer: the retrieved documents themselves may be wrong, and an answer faithful to a wrong document is still wrong. Third, Context Recall still requires manually annotated reference answers, so the automation is incomplete. Some developers point out that RAGAS's four metrics are better suited to relative comparisons across system iterations than to absolute quality judgments.
Impact on regular people
- For enterprise IT: RAG systems are shifting from "deployed means done" to "deployment is where monitoring begins." Evaluation and monitoring capabilities will become as important as the RAG pipeline itself.
- For individual careers: Understanding evaluation frameworks like RAGAS is becoming a dividing line between AI engineers and "API-calling engineers."
- For the consumer market: Answer quality from enterprise knowledge bases and customer-service bots should become more dependable, and the experience of asking a question and getting nonsense should grow rarer.