RAGAS · RAG Evaluation · Enterprise AI

Stop Scoring RAG by Feel: AI Apps Enter Data-Driven Operations Era

The RAGAS framework proposes four core evaluation metrics, finally giving the "feels good" assessment of RAG systems a quantitative standard. We note that a growing number of enterprise AI projects are moving past the "can it run" phase and hitting the wall of "does it run well", where traditional NLP evaluation methods (computing similarity via string matching) fail completely.

What this is

The evaluation of RAG (Retrieval-Augmented Generation, the technique where an AI looks up reference material before answering) systems has long been a black box. You swapped the model or tuned the parameters, but did answer quality actually improve, or does it just "feel" better? When a question is answered poorly, did the retrieval stage fail to find the right material, or did the generation stage hallucinate? In the past, these questions were answered by experienced guesswork.

The open-source evaluation framework RAGAS attempts to pry this black box open with four metrics:

  • Faithfulness: Is the answer faithful to the retrieved context, or does it contain fabrications?
  • Answer Relevancy: Is the answer on-topic, or does it miss the point?
  • Context Precision: How much of the retrieved context is actually relevant, and is the junk ratio high?
  • Context Recall: How much of the relevant information that should have been found was actually retrieved?

These four metrics cover the two critical stages of RAG: retrieval (finding material) and generation (writing the answer). A low score points directly at the stage responsible: low context recall signals a retrieval-strategy problem, while low faithfulness means the model is hallucinating.
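To make this concrete, here is a minimal sketch of scoring one sample with the RAGAS Python library. It assumes the 0.1.x-era API (column names and imports have shifted in newer releases) and an LLM judge configured through the usual OpenAI environment variables; the sample data is invented for illustration.

```python
# Minimal RAGAS evaluation sketch (ragas 0.1.x-era API; newer releases
# use EvaluationDataset/SingleTurnSample instead of a raw HF Dataset).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One sample: the question, the contexts the retriever returned,
# the generated answer, and a human-written reference answer.
data = {
    "question": ["When was the transformer architecture introduced?"],
    "contexts": [[
        "The transformer architecture was introduced in the 2017 paper "
        "'Attention Is All You Need' by Vaswani et al."
    ]],
    "answer": ["The transformer was introduced in 2017."],
    "ground_truth": ["2017, in the paper 'Attention Is All You Need'."],
}

# Each metric is scored 0-1 by an LLM judge (API key read from env).
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # e.g. {'faithfulness': 1.00, 'answer_relevancy': 0.97, ...}
```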

Industry view

What matters to us is that the practical payoff of this evaluation system is clear: it shifts RAG system iteration from gut feeling to data. Run an evaluation after each tuning change; if the metrics rise, the improvement is real, and when reporting to the boss, numbers are far more convincing than "I think it's better." This is a necessary step on the road to engineering discipline and standardization for enterprise AI applications.
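As a sketch of what "looking at data" means day to day, the snippet below diffs two evaluation runs so a parameter change is judged by metric deltas; the metric values here are made-up placeholders, not measurements from any real system.

```python
# Compare two evaluation runs by metric delta. All numbers are hypothetical.
baseline = {"faithfulness": 0.82, "answer_relevancy": 0.88,
            "context_precision": 0.61, "context_recall": 0.74}
after_tuning = {"faithfulness": 0.85, "answer_relevancy": 0.87,
                "context_precision": 0.73, "context_recall": 0.79}

for metric, old in baseline.items():
    new = after_tuning[metric]
    print(f"{metric:>18}: {old:.2f} -> {new:.2f} ({new - old:+.2f})")
```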

But the risks are equally obvious. RAGAS's core mechanism is "using an LLM as the judge," and LLMs carry their own biases and uncertainty; using one to grade another LLM's output is, to some extent, using the problem to solve the problem. Moreover, the credibility of the results depends heavily on the quality of the test set (the dataset of questions and ground-truth answers): manual annotation is expensive, while LLM-generated sets need human spot-checks and corrections. Otherwise you are "measuring something with errors using a ruler with errors."
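For that spot-checking step, one lightweight option (a hypothetical helper, not part of RAGAS) is to pull a reproducible random fraction of the generated test set for human review before trusting any scores built on it:

```python
# Hypothetical helper: sample LLM-generated test items for manual review.
import random

def sample_for_review(test_set: list[dict], fraction: float = 0.2,
                      seed: int = 42) -> list[dict]:
    """Return a reproducible random subset for a human to verify."""
    rng = random.Random(seed)
    k = max(1, int(len(test_set) * fraction))
    return rng.sample(test_set, k)

# Usage: review 20% of a 50-item generated set (10 items).
generated = [{"question": f"q{i}", "ground_truth": f"a{i}"} for i in range(50)]
for item in sample_for_review(generated):
    print(item["question"], "->", item["ground_truth"])
```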

Impact on regular people

For enterprise IT: Deploying a RAG system is no longer a one-off deal; it requires a supporting evaluation workflow and a test-set maintenance process, which means the hidden costs and timelines of AI projects are both growing.

For individual careers: Understanding the logic that "AI output requires quantitative evaluation" is becoming a foundational literacy for collaborating with AI; those who can read evaluation metrics and pinpoint problem areas will be more competitive than those who only know how to call APIs.

For the consumer market: As enterprises start measuring AI quality with data, the AI products consumers encounter should gradually become more reliable and less prone to hallucination. That is good news, but the pace depends on how much enterprises are willing to invest in the "invisible" evaluation work.

Source: juejin.cn