Large Language Models (LLMs) are rapidly transforming customer service, enabling businesses to automate interactions, provide instant support, and personalize customer experiences. However, LLMs have limitations: they can only work with the information they were trained on, which can lead to outdated or inaccurate responses. To address this, two primary techniques have emerged: Retrieval Augmented Generation (RAG) and Cache Augmented Generation (CAG).
In this blog post, we’ll delve into both approaches, with a particular focus on their suitability for AI customer service agents.
The Challenge: Grounding LLMs in Relevant Data
LLMs like Gemini are powerful, but their knowledge is limited to their training data. For customer service agents to be effective, they need access to a wealth of up-to-date information, including:
- Company knowledge bases
- Product catalogs
- FAQs
- Customer interaction history
- Real-time data (e.g., order status, shipping information)
RAG and CAG offer different ways to provide LLMs with this necessary external data.
Recent advancements in LLMs have significantly expanded their context windows, enabling them to process much larger amounts of information in a single interaction. Techniques such as IBM’s extension of context lengths to 128,000 tokens and Microsoft’s LongRoPE method (which extends context windows beyond 2 million tokens) have made it feasible to provide LLMs with extensive data directly, and more efficient processing has reduced the computational cost of handling these large contexts.
As a result, Cache Augmented Generation (CAG) has become more practical for AI customer service agents. By pre-loading comprehensive information into the LLM’s context window, CAG allows for rapid and accurate responses without real-time data retrieval, which is particularly advantageous where low latency and high accuracy are critical. Retrieval Augmented Generation (RAG) remains relevant, however, especially for highly dynamic information that changes frequently; its ability to fetch the most current data keeps responses accurate and up to date.
In short, the choice between CAG and RAG depends on the specific requirements of the customer service application. With the latest advancements in context windows and processing efficiency, CAG offers a compelling solution for large, relatively static datasets, while RAG remains essential for applications that need access to dynamic information.
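Before relying on a large context window, it’s worth checking whether your knowledge base actually fits. Here is a rough sketch using the common 1 token ≈ 0.75 words rule of thumb (the same one used in the table notes below); the file name, model limit, and reserve figure are illustrative assumptions, and a model’s own tokenizer will give an exact count.

```python
# Rough check of whether a knowledge base fits a model's context window.
# Assumes ~0.75 words per token; use the model's tokenizer for an exact count.

def approx_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

knowledge_base = open("faqs_and_policies.txt", encoding="utf-8").read()  # hypothetical file
needed = approx_tokens(knowledge_base)

context_window = 128_000  # e.g. a 128K-token model, per the comparison table below
reserve = 4_000           # leave room for the customer's question and the answer

if needed + reserve <= context_window:
    print(f"~{needed:,} tokens: small enough to pre-load (CAG is an option)")
else:
    print(f"~{needed:,} tokens: too large to pre-load everything; consider RAG or a hybrid")
```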
Retrieval Augmented Generation (RAG)
RAG involves augmenting an LLM’s knowledge by retrieving relevant information from an external source at the time of the user’s query.
Here’s how it works (a minimal code sketch follows these steps):
- Knowledge Base Preparation: The company’s data is organized into a searchable format, often a vector database.
- Query Processing: When a customer asks a question, the query is converted into a vector representation.
- Information Retrieval: The system searches the vector database for the most relevant information or “chunks” of data.
- Augmented Prompt: The retrieved information is combined with the original query and fed into the LLM.
- Response Generation: The LLM generates a response based on both its pre-existing knowledge and the retrieved information.
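The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `embed_texts()` and `generate()` are placeholders for whichever embedding model and LLM endpoint you actually use, and the in-memory cosine search stands in for a real vector database.

```python
# Minimal RAG sketch. embed_texts() and generate() are placeholders for your
# embedding model and LLM; a vector database would replace the in-memory search.
import numpy as np

def embed_texts(texts: list[str]) -> np.ndarray:
    """Placeholder: call your embedding model and return one vector per text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your LLM."""
    raise NotImplementedError

# 1. Knowledge base preparation: embed the documents once and keep the vectors.
documents = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3-5 business days.",
]
doc_vectors = embed_texts(documents)

def answer(query: str, top_k: int = 2) -> str:
    # 2. Query processing: embed the customer's question.
    q = embed_texts([query])[0]
    # 3. Information retrieval: cosine similarity against the stored vectors.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    retrieved = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    # 4. Augmented prompt: combine the retrieved chunks with the original query.
    prompt = "Answer using only this context:\n" + "\n".join(retrieved) + f"\n\nQuestion: {query}"
    # 5. Response generation: the LLM answers from its knowledge plus the retrieved context.
    return generate(prompt)
```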
RAG in Customer Service:
In a customer service setting, RAG can be used to:
- Answer questions about products or services by retrieving information from a product catalog or knowledge base.
- Provide troubleshooting assistance by retrieving relevant documentation.
- Offer personalized support by retrieving customer-specific data, such as order history or account details.
Advantages of RAG:
- Dynamic Information: RAG can access the most up-to-date information, making it suitable for data that changes frequently.
- Scalability: RAG can handle large knowledge bases by retrieving only the relevant information.
- Flexibility: RAG can be adapted to various data sources and query types.
Disadvantages of RAG:
- Complexity: RAG systems can be complex to set up and maintain, requiring expertise in vector databases, embedding models, and retrieval algorithms.
- Latency: The retrieval process can add latency to the response time, which can be critical in customer service interactions.
- Retrieval Accuracy: The quality of the response depends heavily on the accuracy of the information retrieval process. If irrelevant or incomplete information is retrieved, the LLM may generate inaccurate or unhelpful responses.
Cache Augmented Generation (CAG)
CAG is a newer approach that leverages the increasing context window sizes of modern LLMs. Instead of retrieving information at query time, CAG pre-loads a substantial amount of relevant data into the LLM’s context window.
Here’s the process (a short sketch follows the two steps):
- Data Pre-loading: A large chunk of relevant information is loaded into the LLM’s context window before the customer interaction begins.
- Direct Response: When a customer asks a question, the LLM processes the query within the context of the pre-loaded information and generates a response.
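A minimal sketch of those two steps, assuming a chat-style LLM. The `generate()` function and the file names are placeholders for whatever endpoint and content you actually use; many providers also cache the repeated prompt prefix, which is what makes re-sending a large context economical.

```python
# Minimal CAG sketch: the knowledge base is placed in the context once, and
# every customer question is answered directly against that pre-loaded context.

def generate(system_prompt: str, user_message: str) -> str:
    """Placeholder: call your LLM's chat endpoint."""
    raise NotImplementedError

# 1. Data pre-loading: assemble FAQs, policies, and product info into one block.
knowledge_base = "\n\n".join([
    open("faqs.md", encoding="utf-8").read(),            # hypothetical files
    open("return_policy.md", encoding="utf-8").read(),
    open("product_catalog.md", encoding="utf-8").read(),
])
system_prompt = (
    "You are a customer service agent. Answer only from the reference material below.\n\n"
    + knowledge_base
)

# 2. Direct response: no retrieval step -- the query is answered in-context.
def answer(question: str) -> str:
    return generate(system_prompt, question)
```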
Here’s a comparison of 10 recent leading LLMs, detailing their input token lengths, approximate word counts, and per-token costs (a quick cost calculation follows the table and notes):
| Model Name | Provider | Input Token Length | Approximate Words | Input Cost per 1M Tokens | Output Cost per 1M Tokens |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | 128,000 | ~96,000 | $2.50 | $10.00 |
| GPT-4o-mini | OpenAI | 128,000 | ~96,000 | $0.15 | $0.60 |
| GPT-4o-realtime-preview | OpenAI | 128,000 | ~96,000 | $5.00 | $20.00 |
| Claude 3 Opus | Anthropic | 200,000 | ~150,000 | $15.00 | $75.00 |
| Claude 3 Sonnet | Anthropic | 200,000 | ~150,000 | $3.00 | $15.00 |
| Claude 3 Haiku | Anthropic | 200,000 | ~150,000 | $0.25 | $1.25 |
| Gemini 2.0 Flash | Google | 1,048,576 | ~786,432 | $0.15 | $0.60 |
| o1 | OpenAI | 200,000 | ~150,000 | $15.00 | $60.00 |
| o1-mini | OpenAI | 128,000 | ~96,000 | $1.10 | $4.40 |
| DeepSeek-R1 | DeepSeek | 64,000 | ~48,000 | $0.55 | $2.19 |
Notes:
- Approximate Words: Calculated based on the assumption that 1 token ≈ 0.75 words.
- Costs: Represented per 1 million tokens.
- Sources: Information compiled from various sources, including Ron Lancaster’s notes on LLM pricing, LLM Pricing Table, and Galaxy of AI’s blog on LLM pricing.
Please note that pricing and specifications are subject to change as providers update their offerings.
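To make the numbers concrete, here is a back-of-the-envelope calculation of what pre-loading a knowledge base on every request would cost at the input prices in the table above. The 100,000-token figure is an arbitrary example, not a recommendation.

```python
# Cost of pre-loading a knowledge base on every request, using the
# per-1M-token input prices from the table above.
def input_cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

preloaded_tokens = 100_000  # example: a large knowledge base kept in context

for model, price in [("GPT-4o", 2.50), ("GPT-4o-mini", 0.15), ("Gemini 2.0 Flash", 0.15)]:
    print(f"{model}: ${input_cost(preloaded_tokens, price):.4f} per request")
# GPT-4o: $0.2500 | GPT-4o-mini: $0.0150 | Gemini 2.0 Flash: $0.0150
```

Prompt caching, where offered, further reduces the effective price of the repeated prefix.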
CAG in Customer Service:
For customer service, CAG can be employed to:
- Provide instant answers to common questions by pre-loading FAQs or knowledge base articles.
- Offer consistent support based on a fixed set of guidelines or policies.
- Guide customers through standard procedures by pre-loading relevant documentation.
Advantages of CAG:
- Speed: CAG offers very low latency, as the LLM can access the necessary information directly from its context.
- Simplicity: CAG is simpler to implement than RAG, as it eliminates the need for a separate retrieval system.
- Accuracy: By providing the LLM with the complete context, CAG reduces the risk of retrieval errors and improves the accuracy of the responses.
- Cost-Effectiveness: As shown in the table, input-token pricing for models with large context windows, such as GPT-4o-mini and Gemini 2.0 Flash, has become low enough to make pre-loading large contexts economical.
- Reduced Hallucination: CAG can lead to more reliable responses, as the LLM relies on a complete set of pre-loaded information rather than generating content based on incomplete or outdated data.
Disadvantages of CAG:
- Context Window Limits: CAG is limited by the LLM’s context window size. While context windows are expanding, they still have limitations.
- Static Information: CAG is best suited for relatively static information. It may not be ideal for scenarios where the information changes frequently.
- Memory Constraints: Pre-loading very large amounts of data can strain the LLM’s memory and increase processing demands, potentially affecting performance.
CAG vs. RAG for Customer Service Agents: Which is Best?
The best approach depends on the specific use case and the nature of the data (a simple decision sketch follows these lists):
Choose RAG if:
- The customer service agent needs to access highly dynamic information (e.g., real-time inventory, pricing, or changing promotions).
- The knowledge base is very large and cannot fit into the LLM’s context window.
- The queries are complex and require retrieving information from diverse sources.
Choose CAG if:
- The customer service agent primarily deals with relatively static information (e.g., FAQs, product information, company policies).
- Low latency is critical for providing a seamless customer experience.
- Simplicity of implementation is a priority.
- The relevant information can fit within the LLM’s context window.
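The criteria above can be condensed into a rough heuristic. This is an illustrative sketch mirroring the bullets, not a hard rule; the inputs and thresholds are assumptions.

```python
# Illustrative heuristic condensing the criteria above; inputs are assumptions.
def choose_strategy(kb_tokens: int, context_window: int,
                    data_changes_frequently: bool) -> str:
    if kb_tokens > context_window:
        return "RAG"              # knowledge base cannot fit in the context window
    if data_changes_frequently:
        return "RAG (or hybrid)"  # fetch volatile data at query time
    return "CAG"                  # static data that fits in context, lowest latency

print(choose_strategy(kb_tokens=90_000, context_window=128_000,
                      data_changes_frequently=False))  # -> CAG
```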
Hybrid Approach
In many cases, a hybrid approach that combines the strengths of both RAG and CAG may be the most effective solution. For example, as sketched in the code below, a customer service agent could use:
- CAG to pre-load essential information, such as product details and company policies, for quick and consistent responses to common queries.
- RAG to retrieve dynamic information, such as order status or real-time support articles, for more complex or specific requests.
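Here is a minimal sketch of that hybrid routing. The helper names (`fetch_dynamic_context`, `generate`), the pre-loaded file, and the keyword-based routing rule are all illustrative assumptions; a real agent might route with a classifier or the LLM itself.

```python
# Hybrid sketch: static material is pre-loaded (CAG) and dynamic data is
# fetched per query (RAG-style). Helper names and routing rule are illustrative.

STATIC_CONTEXT = open("faqs_and_policies.md", encoding="utf-8").read()  # hypothetical file

def fetch_dynamic_context(query: str) -> str:
    """Placeholder: retrieve order status, inventory, etc. from live systems."""
    raise NotImplementedError

def generate(system_prompt: str, user_message: str) -> str:
    """Placeholder: call your LLM's chat endpoint."""
    raise NotImplementedError

DYNAMIC_TOPICS = ("order", "shipping", "delivery", "refund status")

def answer(question: str) -> str:
    system_prompt = "Answer as a customer service agent.\n\n" + STATIC_CONTEXT
    # Only reach out to live systems when the question appears to need dynamic data.
    if any(topic in question.lower() for topic in DYNAMIC_TOPICS):
        system_prompt += "\n\nLive data:\n" + fetch_dynamic_context(question)
    return generate(system_prompt, question)
```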
Conclusion
RAG and CAG offer distinct advantages for building LLM-powered customer service agents. By carefully considering the specific requirements of your use case, you can choose the most appropriate approach or combine both techniques to create a highly effective and efficient customer service solution.