After deploying a RAG system, more than 80% of teams still rely on manual spot-checks, or simply wait for user complaints, to judge quality. That is not an engineering answer; it is trusting to luck. The RAGAS framework attempts to solve this problem with an "LLM evaluating LLM" approach.
What this is
RAGAS (Retrieval Augmented Generation Assessment) is a RAG quality-evaluation framework proposed in 2023. RAG, in which an LLM first retrieves relevant documents from a knowledge base and then generates an answer grounded in them, is easy to deploy but hard to judge: LLMs are inherently good at sounding plausible, and when they quietly mix in content that is not in the retrieved documents, users rarely notice in the short term.
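To make that pipeline concrete, here is a minimal retrieve-then-generate sketch. The names vector_search and call_llm are hypothetical placeholders for whatever vector store and chat-completion client a real deployment uses; this illustrates the flow, not any particular implementation.

```python
# Minimal retrieve-then-generate loop. vector_search and call_llm are
# hypothetical placeholders for a real vector store and LLM client.

def vector_search(query: str, k: int = 4) -> list[str]:
    """Return the k document chunks most relevant to the query."""
    raise NotImplementedError  # e.g. a FAISS, pgvector, or Elasticsearch lookup

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API call."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    contexts = vector_search(question)
    context_block = "\n\n".join(contexts)
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```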
Traditional NLP metrics like BLEU and ROUGE fail badly in RAG scenarios: they only measure surface-level lexical overlap and cannot answer the core question of whether the answer is faithful to the retrieved content. RAGAS's solution is to have another LLM evaluate the RAG system's output: the LLM-as-Judge pattern.
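In its simplest form the pattern looks like the sketch below: a second model grades the first model's output against the retrieved evidence. Again, call_llm is a hypothetical placeholder, and the 1-to-5 rubric is illustrative rather than anything RAGAS prescribes.

```python
# LLM-as-Judge in its simplest form: a second model grades a RAG answer
# against the retrieved evidence. call_llm is a hypothetical placeholder.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def judge_groundedness(question: str, answer: str, contexts: list[str]) -> int:
    """Return a 1-5 grade for how well the answer is supported by the context."""
    context_block = "\n".join(contexts)
    prompt = (
        "You are grading a RAG answer.\n"
        f"Question: {question}\n"
        f"Retrieved context:\n{context_block}\n"
        f"Answer: {answer}\n"
        "On a scale of 1 to 5, how well is the answer supported by the "
        "context? Reply with a single digit."
    )
    return int(call_llm(prompt).strip()[0])
```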
RAGAS defines two metrics for the generation side and two for the retrieval side, forming a 2×2 evaluation matrix:
- Faithfulness: Whether the answer comes only from the retrieved content, with nothing made up. The judge breaks the answer into independent claims, checks whether each is supported by the retrieved documents, and computes Faithfulness = supported claims / total claims (see the sketch after this list).
- Answer Relevancy: Whether the answer actually addresses the question. The evaluating LLM generates candidate questions from the answer, then measures their semantic similarity to the original question.
- Context Recall: Whether retrieval missed key information. This is the only one of the four metrics that still requires a human-provided reference answer.
- Context Precision: Whether the useful chunks are ranked near the top of the retrieved results.
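Here is a minimal sketch of how the two generation-side metrics can be computed. call_llm and embed are hypothetical placeholder clients, and the prompts are illustrative, not RAGAS's actual templates.

```python
# Sketch of the two generation-side metrics. call_llm and embed are
# hypothetical placeholders; the prompts are illustrative, not RAGAS's
# actual templates.
import math

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion client

def embed(text: str) -> list[float]:
    raise NotImplementedError  # any embedding model

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def faithfulness(answer: str, contexts: list[str]) -> float:
    # 1. Decompose the answer into independent atomic claims.
    raw = call_llm(
        f"List the independent factual claims in this answer, one per line:\n{answer}"
    )
    claims = [c.strip() for c in raw.splitlines() if c.strip()]
    if not claims:
        return 0.0
    # 2. Ask the judge whether each claim is entailed by the retrieved context.
    context_block = "\n".join(contexts)
    supported = sum(
        call_llm(
            f"Context:\n{context_block}\n\nClaim: {claim}\n"
            "Is the claim fully supported by the context? Answer yes or no."
        ).strip().lower().startswith("yes")
        for claim in claims
    )
    # Faithfulness = supported claims / total claims
    return supported / len(claims)

def answer_relevancy(question: str, answer: str, n: int = 3) -> float:
    # Generate n questions the answer could plausibly be answering, then
    # average their embedding similarity to the original question.
    generated = [
        call_llm(f"Write one question that this answer directly responds to:\n{answer}")
        for _ in range(n)
    ]
    q_vec = embed(question)
    return sum(cosine(q_vec, embed(g)) for g in generated) / n
```

The two retrieval-side metrics follow the same judge-then-aggregate pattern, applied to the reference answer (Context Recall) and to the ranked chunks (Context Precision).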
Industry view
The automated-evaluation direction that RAGAS represents is becoming an industry consensus: as RAG moves from experimentation to production, gut-feeling quality judgments must be replaced by quantifiable, repeatable systems. The emergence of production-grade integrations, such as Spring Boot + LangChain4j stacks, shows RAGAS moving from paper to engineering practice.
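For teams using the Python package directly, a minimal run looks roughly like the sketch below. This follows the 0.1-era ragas API; module paths, metric names, and dataset columns have shifted across releases, and a configured judge model (by default, an OpenAI API key) is assumed.

```python
# Minimal end-to-end run with the ragas package (0.1-era API; details vary
# by version). Assumes OPENAI_API_KEY is set, since ragas defaults to an
# OpenAI judge model.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation sample; real runs use a few hundred question/answer pairs.
data = Dataset.from_dict({
    "question": ["What does the leave policy say about carry-over?"],
    "answer": ["Unused leave can be carried over for up to 12 months."],
    "contexts": [[
        "Employees may carry over unused annual leave for a maximum of 12 months."
    ]],
    "ground_truth": [  # needed only by context_recall
        "Unused annual leave may be carried over for at most 12 months."
    ],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric averages, e.g. {'faithfulness': 1.0, ...}
```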
It is worth cautioning, though, that LLM-as-Judge is itself unreliable: the evaluating LLM also hallucinates and carries biases, and using a flawed tool to inspect a flawed system is logically imperfect. Second, high faithfulness does not equal a correct answer: the retrieved documents themselves may be wrong, and an answer faithful to a wrong document is still wrong. Third, Context Recall still requires manually annotated reference answers, so the automation is incomplete. Some developers point out that RAGAS's four metrics are better suited to relative comparisons across system iterations than to absolute quality judgments.
Impact on regular people
- For enterprise IT: RAG systems are shifting from "deployed means done" to "deployment is where monitoring begins." Evaluation and monitoring capabilities will become as important as the RAG pipeline itself.
- For individual careers: Understanding evaluation frameworks like RAGAS is becoming a dividing line between AI engineers and "API-calling engineers."
- For the consumer market: Answer quality from enterprise knowledge bases and customer-service bots should become more dependable, and the experience of asking a question and getting nonsense should grow rarer.