
Why enterprise RAG systems fail: Google research introduces a 'sufficient context' solution



A new study from Google researchers introduces “sufficient context,” a novel perspective for understanding and improving retrieval augmented generation (RAG) systems in large language models (LLMs).

This approach makes it possible to determine if an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.

The persistent challenges of RAG

RAG systems have become a cornerstone for building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. They might confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to extract answers from long text snippets properly.

The researchers state in their paper, “The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model’s parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information.”

Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that “while the goal seems to be to understand how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on.”

Sufficient context

To tackle this, the researchers introduce the concept of “sufficient context.” At a high level, input instances are classified based on whether the provided context contains enough information to answer the query. This splits contexts into two cases:

Sufficient Context: The context has all the necessary information to provide a definitive answer.

Insufficient Context: The context lacks the necessary information. This could be because the query requires specialized knowledge not present in the context, or the information is incomplete, inconclusive or contradictory.


This designation is determined by looking at the question and the associated context without needing a ground-truth answer. This is vital for real-world applications where ground-truth answers are not readily available during inference.

The researchers developed an LLM-based “autorater” to automate the labeling of instances as having sufficient or insufficient context. They found that Google’s Gemini 1.5 Pro model, with a single example (1-shot), performed best in classifying context sufficiency, achieving high F1 scores and accuracy.

The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.”
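The paper does not publish the exact autorater prompt, but the idea translates into a short labeling routine. Below is a minimal Python sketch, assuming a generic `call_llm(prompt)` helper for whichever model API you use (the researchers reported their best results with Gemini 1.5 Pro and a single in-context example); the prompt text and function names are illustrative, not the paper's.

```python
# Minimal sketch of an LLM-based "sufficient context" autorater.
# Assumes call_llm(prompt) -> str wraps your model API of choice;
# the 1-shot prompt below is illustrative, not the paper's exact prompt.

ONE_SHOT_PROMPT = """You are given a question and a retrieved context.
Decide whether the context contains enough information to answer the
question definitively. Reply with one word: SUFFICIENT or INSUFFICIENT.

Example:
Question: When was the Eiffel Tower completed?
Context: The Eiffel Tower was completed in 1889 for the World's Fair.
Label: SUFFICIENT

Question: {question}
Context: {context}
Label:"""


def rate_context(question: str, context: str, call_llm) -> str:
    """Label a (question, context) pair as 'sufficient' or 'insufficient'.

    Note: no ground-truth answer is needed, which is what makes this
    signal usable at inference time.
    """
    reply = call_llm(ONE_SHOT_PROMPT.format(question=question, context=context))
    label = reply.strip().upper()
    # Check "INSUFFICIENT" first, since "SUFFICIENT" is a substring of it.
    return "insufficient" if label.startswith("INSUFFICIENT") else "sufficient"
```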

Key findings on LLM behavior with RAG

Analyzing various models and datasets through this lens of sufficient context revealed several important insights.

As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex, with models exhibiting both higher rates of abstention and, for some models, increased hallucination.

Interestingly, while RAG generally improves overall performance, additional context can also reduce a model’s ability to abstain from answering when it doesn’t have sufficient information. “This phenomenon may arise from the model’s increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention,” the researchers suggest.

A particularly curious observation was the ability of models sometimes to provide correct answers even when the provided context was deemed insufficient. While a natural assumption is that the models already “know” the answer from their pre-training (parametric knowledge), the researchers found other contributing factors. For example, the context might help disambiguate a query or bridge gaps in the model’s knowledge, even if it doesn’t contain the full answer. This ability of models to sometimes succeed even with limited external information has broader implications for RAG system design.


Cyrus Rashtchian, co-author of the study and senior research scientist at Google, elaborates on this, emphasizing that the quality of the base LLM remains critical. “For a really good enterprise RAG system, the model should be evaluated on benchmarks with and without retrieval,” he told VentureBeat. He suggested that retrieval should be viewed as “augmentation of its knowledge,” rather than the sole source of truth. The base model, he explains, “still needs to fill in gaps, or use context clues (which are informed by pre-training knowledge) to properly reason about the retrieved context. For example, the model should know enough to know if the question is under-specified or ambiguous, rather than just blindly copying from the context.”

Reducing hallucinations in RAG systems

Given the finding that models may hallucinate rather than abstain, especially in RAG settings compared with no-RAG settings, the researchers explored techniques to mitigate this.

They developed a new “selective generation” framework. This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered).

This framework can be combined with any LLM, including proprietary models like Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy for answered queries across various models and datasets. This method improved the fraction of correct answers among model responses by 2–10% for Gemini, GPT, and Gemma models.
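The paper does not prescribe a specific intervention model, but the mechanics can be sketched roughly as follows: a small classifier takes the main LLM's self-reported confidence and the autorater's sufficient-context label, predicts whether an answer is likely to be correct, and a threshold on that prediction sets the accuracy/coverage trade-off. The logistic-regression choice and feature encoding below are assumptions for illustration, not the study's exact setup.

```python
# Sketch of a small "intervention model" that decides answer vs. abstain.
# Assumes, per query: the main LLM's confidence score (float), the
# sufficiency label encoded as 0/1, and whether the answer was correct
# (for fitting on a held-out labeled set).

import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_intervention_model(confidences, sufficiency_flags, was_correct):
    # Two simple features per query: model confidence and context sufficiency.
    X = np.column_stack([confidences, sufficiency_flags])
    return LogisticRegression().fit(X, was_correct)


def should_answer(model, confidence, sufficient, threshold=0.5):
    p_correct = model.predict_proba([[confidence, int(sufficient)]])[0, 1]
    # Raising the threshold answers fewer queries (lower coverage)
    # but with higher accuracy on the queries that are answered.
    return p_correct >= threshold
```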

To put this 2-10% improvement into a business perspective, Rashtchian offers a concrete example from customer support AI. “You could imagine a customer asking about whether they can have a discount,” he said. “In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might be ‘stale,’ describing a discount from a few months ago, or maybe it has specific terms and conditions. So it would be better for the model to say, ‘I am not sure,’ or ‘You should talk to a customer support agent to get more information for your specific case.’”

The team also investigated fine-tuning models to encourage abstention. This involved training models on examples where the answer was replaced with “I don’t know” instead of the original ground-truth, particularly for instances with insufficient context. The intuition was that explicit training on such examples could steer the model to abstain rather than hallucinate.
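In data-preparation terms, that intuition looks roughly like the sketch below, which assumes each training example already carries a sufficiency label from an autorater; the dictionary keys and abstention string are illustrative, not the paper's exact recipe.

```python
# Sketch of building an abstention-oriented fine-tuning set.
# Assumes each example is a dict with "question", "context", "answer",
# and an autorater "label" of "sufficient" or "insufficient".

def build_abstention_training_set(examples):
    training = []
    for ex in examples:
        # Keep the ground-truth answer only when the context is sufficient;
        # otherwise train the model to explicitly abstain.
        target = ex["answer"] if ex["label"] == "sufficient" else "I don't know."
        training.append({
            "prompt": f"Context: {ex['context']}\nQuestion: {ex['question']}",
            "completion": target,
        })
    return training
```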

The results were mixed: fine-tuned models often had a higher rate of correct answers but still hallucinated frequently, often more than they abstained. The paper concludes that while fine-tuning might help, “more work is needed to develop a reliable strategy that can balance these objectives.”

Applying sufficient context to real-world RAG systems

For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kind of examples the model will see in production. Next, use an LLM-based autorater to label each example as having sufficient or insufficient context.

“This already will give a good estimate of the % of sufficient context,” Rashtchian said. “If it is less than 80-90%, then there is likely a lot of room to improve on the retrieval or knowledge base side of things — this is a good observable symptom.”
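As a rough sketch of that diagnostic, reusing the `rate_context` helper from the earlier autorater example (an assumption, not a prescribed tool), the sufficient-context share of a sampled dataset can be computed directly:

```python
# Sketch of estimating the share of queries whose retrieved context is
# sufficient, over a sample of production-like (question, context) pairs.

def sufficient_context_rate(pairs, call_llm):
    labels = [rate_context(q, c, call_llm) for q, c in pairs]
    rate = sum(label == "sufficient" for label in labels) / len(labels)
    # Per Rashtchian's rule of thumb: a rate below ~0.8-0.9 suggests room
    # to improve retrieval or the knowledge base itself.
    return rate, labels
```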

Rashtchian advises teams to then “stratify model responses based on examples with sufficient vs. insufficient context.” By examining metrics on these two separate datasets, teams can better understand performance nuances.

“For example, we saw that models were more likely to provide an incorrect response (with respect to the ground truth) when given insufficient context. This is another observable symptom,” he notes, adding that “aggregating statistics over a whole dataset may gloss over a small set of important but poorly handled queries.”
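A minimal way to produce that stratified view, assuming each evaluation record already carries the autorater label plus correctness and abstention flags (the field names are illustrative):

```python
# Sketch of stratifying evaluation metrics by the sufficiency label.
# Assumes records like {"label": "sufficient", "correct": True, "abstained": False}.

from collections import defaultdict


def stratified_metrics(records):
    buckets = defaultdict(list)
    for r in records:
        buckets[r["label"]].append(r)

    report = {}
    for label, rows in buckets.items():
        n = len(rows)
        report[label] = {
            "n": n,
            "accuracy": sum(r["correct"] for r in rows) / n,
            "abstain_rate": sum(r["abstained"] for r in rows) / n,
        }
    # Comparing "sufficient" vs. "insufficient" side by side surfaces the
    # poorly handled queries that whole-dataset averages can gloss over.
    return report
```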

While an LLM-based autorater demonstrates high accuracy, enterprise teams might wonder about the additional computational cost. Rashtchian clarified that the overhead can be managed for diagnostic purposes.

“I would say running an LLM-based autorater on a small test set (say 500-1000 examples) should be relatively inexpensive, and this can be done ‘offline’ so there’s no worry about the amount of time it takes,” he said. For real-time applications, he concedes, “it would be better to use a heuristic, or at least a smaller model.” The crucial takeaway, according to Rashtchian, is that “engineers should be looking at something beyond the similarity scores, etc, from their retrieval component. Having an extra signal, from an LLM or a heuristic, can lead to new insights.”
