LLM: Building Advanced RAG (Part 1)
In our previous post, we shared our approach to building a crypto research workflow with RAG (Retrieval-Augmented Generation). In practice, however, we ran into many issues when bringing the product to production.
This is a challenge that most teams building RAG solutions for news content encounter. The Financial Times faced the same problem when deploying the FT Chatbot to answer questions over a decade's worth of news data.
In this article, we share the issues we actually encountered and our initial efforts to address them.
Limitations of Basic RAG Techniques
RAG is a method for improving LLM results by leveraging additional external data sources. The pipeline involves three main components: the retrieval system, a database containing reference information, and the generative component, the LLM itself.
In Basic RAG, information from external data sources is loaded, split into small chunks, encoded into vectors with a Transformer Encoder (embedding) model, and stored as indices (indexing). When a user query arrives, the system vectorizes it and searches the indices for the most relevant chunks, which are then aggregated and fed to the LLM as context for generating the response.
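To make this flow concrete, here is a minimal sketch of the Basic RAG pipeline in plain Python. The `embed` and `ask_llm` callables are hypothetical stand-ins for an embedding model and an LLM client, and the fixed-size chunking is deliberately naive:

```python
from typing import Callable
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

def split_into_chunks(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size splitting; production systems split on sentence boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def basic_rag(query: str, documents: list[str],
              embed: Callable[[str], np.ndarray],
              ask_llm: Callable[[str], str],
              top_k: int = 3) -> str:
    # Indexing: chunk every document and encode each chunk as a vector.
    index = [(c, embed(c)) for doc in documents for c in split_into_chunks(doc)]
    # Retrieval: vectorize the query and rank chunks by cosine similarity.
    q_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    # Generation: feed the top chunks to the LLM as context.
    context = "\n---\n".join(chunk for chunk, _ in ranked[:top_k])
    return ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```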
While Basic RAG can address basic needs, it suffers from several issues:
- Query Process: Lack of accuracy in selecting information relevant to the user’s question and the potential to miss important details.
- Response Generation Process: Contextual inaccuracies can lead to the generation of incorrect or irrelevant content, affecting the quality and reliability of the system.
- Information Integration Process: Information retrieved from multiple sources without synthesis or supplementation can lead to duplication, conflict, and inconsistencies in style, affecting the coherence of the response and diminishing the user experience.
The limitations of Basic RAG have driven the development of a more sophisticated approach: Advanced RAG. It overcomes these weaknesses by refining the data retrieval mechanism, strengthening the processing and integration of retrieved content, and improving the coherence and reliability of the generated responses. Through these improvements, Advanced RAG achieves better contextual understanding and delivers more relevant and accurate responses.
Advanced RAG Techniques
Advanced RAG encompasses techniques that enhance each processing stage of the RAG pipeline, as outlined below:
- Indexing: Optimizing the vectorization phase and transforming data into semantically focused formats, which improves query efficiency.
- Query Transformation: Clarifying and refining the user’s original question to increase relevance to the retrieval task.
- Query Routing: Selecting the appropriate data sources and queries to maximize retrieval efficiency.
- Retrieval: Ensuring the retrieved content is comprehensive and contextually complete.
- Post-Retrieval: Effectively integrating the retrieved content segments to provide the LLM model with precise and concise contextual information.
- Generation: Producing the final response from the prepared context, including techniques that control when additional retrieval happens during generation to improve relevance and reliability.
- Evaluation: Verifying the quality of the generated content according to specific criteria.
1. Indexing
The Indexing process is an essential step that improves the accuracy and efficiency of systems utilizing LLMs. Indexing involves not only storing data but also organizing and optimizing it so that necessary information can be easily understood and retrieved without losing important context.
Some techniques in the Indexing process include:
- Chunk Optimization: Optimizing the size and structure of text chunks so that they are neither too large nor too small, maintaining necessary context without exceeding the context-length limits of LLMs (a sketch after this list illustrates this together with metadata attachment).
- Embedding Fine-tuning: Refining the embedding model to improve its semantic understanding of the indexed data, thereby enhancing the ability to match retrieved content with user queries.
- Multi-Representation: This method allows documents to be transformed into lightweight retrieval units, such as content summaries, improving the accuracy and speed of the retrieval process when users need specific information from a large document.
- Hierarchical Indexing: Applying hierarchical models like RAPTOR to organize data into different levels of aggregation from detailed to general, helping improve information retrieval based on broader and more precise contexts.
- Metadata Attachment: Adding metadata to each chunk or piece of data to enhance analysis and classification capabilities, allowing for more systematic data retrieval suitable for various specific situations.
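As a rough illustration of the Chunk Optimization and Metadata Attachment ideas, the sketch below packs sentences into bounded-size chunks with a small overlap and attaches metadata to each chunk. The size and overlap values are illustrative, not tuned:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. title, source, position

def split_with_overlap(text: str, max_chars: int = 800, overlap: int = 120) -> list[str]:
    # Split on sentence boundaries, packing sentences into chunks of bounded
    # size and carrying a tail of the previous chunk forward as overlap.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # overlap preserves context across chunks
        current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def index_article(title: str, body: str, source: str) -> list[Chunk]:
    return [
        Chunk(text=c, metadata={"title": title, "source": source, "position": i})
        for i, c in enumerate(split_with_overlap(body))
    ]
```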
2. Query Transformation & Query Routing
The Query Transformation process involves techniques that use LLMs as reasoning tools to refine user inputs to improve the quality of information retrieval. LLMs can transform the original question into clearer, more understandable questions, thereby enhancing the search and retrieval efficiency.
Some techniques in the Query Transformation process include:
- HyDE (Hypothetical Document Embeddings): A reverse technique in which the LLM is first asked to generate a hypothetical document answering the question; the vector of this document is then used together with the vector of the question to retrieve reference information. This increases the semantic similarity between the question and the retrieved content, improving the accuracy and efficiency of the query process (see the sketch after this list).
- Multi-Step Query: This method breaks down complex questions into multiple simpler sub-questions, retrieves answers for each sub-question in parallel, and combines the retrieved results. This allows the LLM to synthesize and generate a more accurate response. This method is particularly useful in cases where there is no direct information available about the queried content.
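A minimal HyDE sketch, reusing the hypothetical `embed` and `ask_llm` helpers from the earlier pipeline. Averaging the two vectors is one simple way to combine the question with its hypothetical answer:

```python
import numpy as np

def hyde_query_vector(question: str, embed, ask_llm) -> np.ndarray:
    # Step 1: ask the LLM to write a hypothetical passage answering the question.
    hypothetical = ask_llm(
        f"Write a short passage that plausibly answers this question:\n{question}"
    )
    # Step 2: combine the question vector with the hypothetical-document vector,
    # moving the search vector closer to real answer passages in embedding space.
    return (embed(question) + embed(hypothetical)) / 2.0
```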
The Query Routing process has the LLM determine the next action based on the user's query: for example, searching a specific dataset, or trying several search directions and then synthesizing the results into a unified response. Routing directs the query to the most appropriate data source, from traditional vector databases to graph or relational databases, or even different hierarchical indexes.
Some techniques in the Query Routing process include:
- Logical Routing: This technique uses logic to direct the query to the appropriate data source. By analyzing the structure and purpose of the question, the router selects the most suitable index or data source for the query. This optimizes the information retrieval process by ensuring that the query is handled by the data source capable of providing the most accurate response.
- Semantic Routing: This method leverages the semantics of the question to direct it. The router analyzes the meaning of the question and sends it to the appropriate index or data source, increasing the accuracy of relevant information retrieval. It is particularly effective for complex queries that require understanding the context and meaning of each word and phrase (see the sketch after this list).
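A minimal Semantic Routing sketch: each data source is described in a sentence, and the query is routed to the source whose description it is most similar to in embedding space. The route names and descriptions are illustrative, and `embed` is again a hypothetical embedding function:

```python
import numpy as np

# Illustrative route descriptions; a real system would describe its own sources.
ROUTES = {
    "news_index": "Recent crypto news articles, announcements and market events.",
    "prices_db": "Historical token prices, trading volumes and market statistics.",
    "docs_index": "Protocol whitepapers and technical documentation.",
}

def route_query(query: str, embed) -> str:
    # Send the query to the source whose description it is most similar to.
    q = embed(query)
    best_name, best_score = "", float("-inf")
    for name, description in ROUTES.items():
        d = embed(description)
        score = float(q @ d) / float(np.linalg.norm(q) * np.linalg.norm(d))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```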
3. Retrieval
The Retrieval process is a core step in the RAG system, focusing on extracting reference data from various sources to provide the necessary context and information for the LLM model to generate responses.
Some techniques in the Retrieval process include:
- Recursive Retriever: This technique allows deep retrieval into related data, performing additional queries based on the results of previous queries. It is useful in scenarios requiring detailed or in-depth information exploration.
- Router Retriever: This technique uses an LLM to dynamically decide the appropriate data source or querying tool for each specific query.
- Auto Retriever: This method automatically queries the database by using the LLM to determine metadata for filtering or to create appropriate query statements for retrieval.
- Fusion Retriever: This technique combines results from multiple queries and indexes, optimizing information retrieval and ensuring comprehensive, non-duplicative results that provide a multi-faceted view of the retrieved information (see the sketch after this list).
- Auto Merging Retriever: When multiple sub-segments of data are retrieved, this technique merges them into a parent data segment, allowing the aggregation of smaller contexts into a larger context, aiding in the synthesis of information. This technique helps improve the relevance and integrity of the context.
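One common way to implement the Fusion Retriever idea is Reciprocal Rank Fusion (RRF), which merges several ranked result lists into a single deduplicated ranking. A self-contained sketch:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Documents that rank highly in several result lists accumulate more score;
    # k=60 is the constant commonly used in the RRF literature.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Example: merging rankings produced by three query variants.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_c", "doc_b", "doc_e"],
])
# fused[0] == "doc_b": it ranks well in all three lists.
```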
4. Post-Retrieval
The Post-Retrieval process involves refining the retrieved results through filtering, re-ranking, or transforming them. The goal of this process is to prepare and enhance the context before feeding it into the LLM model to generate the final response, ensuring that the information provided to the LLM is accurate and efficient.
Some techniques in the Post-Retrieval process include:
- Rerank: Reordering the retrieved text segments so that the most relevant results appear first, improving the accuracy of the information. Reranking reduces the number of documents that must be fed to the LLM and also acts as a quality filter on the context (see the combined rerank-and-filter sketch after this list).
- Compress: Reducing excess and unnecessary context, eliminating noise to enhance the LLM’s understanding of the key information. Compression helps optimize the length of the context that the LLM can handle, thereby improving the response quality by focusing on the essential information.
- Filter: Selecting content before feeding it into the LLM, removing irrelevant or low-accuracy documents or information. This technique ensures that only relevant and high-quality information is used, thus improving the accuracy and reliability of the response.
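A combined sketch of the Rerank and Filter steps, assuming a hypothetical `score_relevance(query, passage)` function such as a cross-encoder score; the threshold and cutoff values are illustrative:

```python
from typing import Callable

def rerank_and_filter(query: str, passages: list[str],
                      score_relevance: Callable[[str, str], float],
                      keep: int = 5, min_score: float = 0.3) -> list[str]:
    scored = [(p, score_relevance(query, p)) for p in passages]
    # Filter: drop passages below the relevance threshold (illustrative value).
    relevant = [(p, s) for p, s in scored if s >= min_score]
    # Rerank: most relevant passages first, truncated before building the context.
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in relevant[:keep]]
```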
5. Generation
The Generation process is the stage where the LLM model generates a response based on the information retrieved and processed from the previous steps. The goal of this process is to create a highly accurate and relevant response to the user’s initial query.
The quality of this process heavily depends on the chosen LLM model. However, the system can apply several techniques to enhance the quality of the Generation results:
- FLARE: FLARE is a method based on Prompt Engineering that controls when the LLM should perform data retrieval. It ensures the LLM retrieves only when essential information is missing, avoiding unnecessary or irrelevant data collection. The process continuously adjusts the query and checks for low-probability tokens; when they appear, the system retrieves relevant documents to refine the response, improving its accuracy and relevance.
- ITER-RETGEN: ITER-RETGEN iteratively alternates retrieval and generation: each iteration uses the results of the previous one as context to retrieve more relevant knowledge, continuously improving the quality of the response (see the sketch after this list).
- ToC (Tree of Clarifications): ToC is a method that recursively performs queries to clarify the initial question. In this process, each question-answer step evaluates the current query to generate a more specific question. This process helps clarify ambiguities in the original question, thereby improving the accuracy and detail of the response.
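A minimal sketch of the ITER-RETGEN loop, with hypothetical `retrieve` and `ask_llm` helpers: each draft answer is folded into the next retrieval query, so evidence relevant to the draft's claims is pulled in on the following round. Two iterations is an illustrative choice:

```python
def iter_retgen(question: str, retrieve, ask_llm, iterations: int = 2) -> str:
    # `retrieve(text) -> list[str]` and `ask_llm(prompt) -> str` are hypothetical.
    answer = ""
    for _ in range(iterations):
        # Retrieve with the question plus the previous draft, so that passages
        # relevant to the draft's claims are also pulled in.
        passages = retrieve(f"{question}\n{answer}".strip())
        context = "\n---\n".join(passages)
        answer = ask_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return answer
```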
We will cover the final and very important phase, Evaluation, in the next part. Stay tuned!