Retrieval-Augmented Generation (RAG): from theory to LangChain implementation

From the theory of the original academic paper to its implementation in Python with OpenAI, Weaviate, and LangChain

Since we realized that we can augment large language models (LLMs) with our own data, there has been an active discussion about how to most effectively bridge the gap between an LLM's general knowledge and that additional data. Much of the debate has centered on which approach is better suited for this: fine-tuning or retrieval-augmented generation (RAG) (spoiler alert: it's both).

This article focuses on the concept of RAG: it first covers the theory and then demonstrates how a simple RAG pipeline can be implemented with LangChain, an OpenAI language model, and a Weaviate vector database.

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is the concept of providing LLMs with additional information from an external knowledge source. This allows them to generate more accurate and contextual responses while reducing hallucinations.

The Problem

Modern LLMs are trained on large volumes of data to acquire a broad base of general knowledge, which is stored in the weights of the neural network (parametric memory). However, prompting an LLM to generate output that requires knowledge not included in its training data, such as newer or domain-specific information, can lead to factual inaccuracies (hallucinations).

Therefore, it is important to bridge the gap between the LLM's general knowledge and any additional context, so that the LLM can generate more accurate and contextual outputs with fewer hallucinations.

The Solution

Traditionally, neural networks are adapted to domain-specific or proprietary information by fine-tuning the model. Although this technique is effective, it requires significant computational resources, costs, and technical expertise, which makes it less flexible when the underlying information changes.

In 2020, in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", Lewis et al. proposed a more flexible method called "Retrieval-Augmented Generation" (RAG). In this paper, the researchers combined a generative model with a retrieval module to provide additional information from an external knowledge source that could be more easily updated.

Simply put, RAG is to LLMs what an open-book exam is to humans. In an open-book exam, students are allowed to bring reference materials such as textbooks or notes, which they can use to look up relevant information to answer a question. The idea behind an open-book exam is that the test focuses on the students' reasoning skills rather than their ability to memorize specific information.

Similarly, in RAG the factual knowledge is separated from the LLM's reasoning capabilities and stored in an external knowledge source that can be easily accessed and updated:

  • Parametric knowledge: acquired during training, which is implicitly stored in the neural network weights.

  • Non-parametric knowledge: stored in an external knowledge source, such as a vector database.

(By the way, I did not come up with this brilliant comparison. As far as I know, it was first made by JJ during the Kaggle - LLM Science Exam competition.)

The standard RAG workflow is shown below:

  1. Retrieve: The user query is used to retrieve relevant context from an external knowledge source. For this, the user query is embedded with an embedding model into the same vector space as the additional context in the vector database. This enables a similarity search that returns the top k closest data objects from the vector database.

  2. Augment: The user query and the retrieved additional context are inserted into a prompt template.

  3. Generate: Finally, the retrieval-augmented prompt is fed to the LLM (a schematic sketch of the full workflow follows this list).
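Before looking at the concrete implementation, here is a minimal code sketch of these three steps. All names in it (embed, vector_db, llm) are placeholders for whatever embedding model, vector database, and LLM you choose; they do not refer to any specific library.

def rag_answer(query: str) -> str:
    # 1. Retrieve: embed the query and fetch the top-k most similar chunks
    query_vector = embed(query)                           # placeholder embedding model
    context_chunks = vector_db.search(query_vector, k=4)  # placeholder vector database

    # 2. Augment: insert the query and the retrieved context into a prompt template
    context = "\n".join(context_chunks)
    prompt = f"Answer the question based on this context:\n{context}\n\nQuestion: {query}"

    # 3. Generate: pass the retrieval-augmented prompt to the LLM
    return llm.generate(prompt)                           # placeholder LLM call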

Implementing retrieval-augmented generation with LangChain

This section implements a RAG pipeline in Python using an OpenAI LLM in combination with a Weaviate vector database and an OpenAI embedding model. LangChain is used for orchestration.

Prerequisites

Make sure you have installed the necessary Python packages:

  • langchain for orchestration

  • openai for the embedding model and LLM

  • weaviate-client for the vector database

#!pip install langchain openai weaviate-client

Additionally, define the relevant environment variables in a .env file in your root directory. To obtain an OpenAI API key, you need an OpenAI account, in which you can create a new secret key under API keys.

OPENAI_API_KEY=""

Then, run the following code to load the environment variables:

import dotenv
dotenv.load_dotenv()

Preparation

At this stage, you need to prepare the vector database as an external knowledge source containing all the additional information. This vector database is populated with the following steps:

  1. Collect and upload data

  2. Split documents into chunks

  3. Embed and store the chunks

The first step is to collect and load the data. In this example, you will use the 2022 State of the Union Address by President Biden as additional context. The raw text document is available in the LangChain GitHub repository. To load the data, you can use one of LangChain's many built-in DocumentLoaders. A Document is an object that holds some text together with its metadata. Here, you will use LangChain's TextLoader to load the text file.

import requests
from langchain.document_loaders import TextLoader

url = "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt"
res = requests.get(url)
with open("state_of_the_union.txt", "w") as f:
    f.write(res.text)

loader = TextLoader('./state_of_the_union.txt')
documents = loader.load()
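If you want to verify this step, you can inspect what the loader returned. TextLoader produces a list with a single Document, whose page_content holds the text and whose metadata records the source file:

# Optional check: one Document with the full text and its source path as metadata
print(len(documents))
print(documents[0].metadata)
print(documents[0].page_content[:100])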

Next, split the documents into chunks. Because the Document in its original state is too long to fit into the LLM's context window, you need to split it into smaller pieces. LangChain has many built-in text splitters for this purpose. For this simple example, you can use the CharacterTextSplitter with a chunk_size of about 500 and a chunk_overlap of 50 to preserve text continuity between the chunks.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
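A quick optional check shows how many chunks the splitter produced and what a single chunk looks like:

# Optional check: number of chunks and a preview of the first chunk
print(len(chunks))
print(chunks[0].page_content[:200])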

Finally, embed and store the chunks. To enable semantic search over the text chunks, you need to generate a vector embedding for each chunk and then store the chunks together with their embeddings. To generate the vector embeddings, you can use the OpenAI embedding model, and to store them, you can use the Weaviate vector database. Calling .from_documents() automatically populates the vector database with the chunks.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions

client = weaviate.Client(
  embedded_options = EmbeddedOptions()
)

vectorstore = Weaviate.from_documents(
    client = client,    
    documents = chunks,
    embedding = OpenAIEmbeddings(),
    by_text = False
)
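At this point you can already test the semantic search directly on the vector store using the standard similarity_search method; the query below is just an example:

# Optional check: query the vector store directly to verify that retrieval works
results = vectorstore.similarity_search("What did the president say about Justice Breyer", k=2)
for doc in results:
    print(doc.page_content[:100])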

Step 1: Retrieve

Once the vector database is populated, you can define it as the retriever component, which fetches additional context based on the semantic similarity between the user query and the embedded chunks.

retriever = vectorstore.as_retriever()
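By default, the retriever returns the top 4 most similar chunks (LangChain's default k). If you want a different number, you can pass search_kwargs when creating the retriever; retriever_top2 below is just an illustrative name and is not used in the rest of the pipeline:

# Optional variation: a retriever that returns only the top 2 chunks per query
retriever_top2 = vectorstore.as_retriever(search_kwargs={"k": 2})
print(len(retriever_top2.get_relevant_documents("What did the president say about Justice Breyer")))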

Step 2: Augment

Next, to augment the prompt with the additional context, you need to prepare a prompt template. The prompt can then be easily customized from a template, as shown below.

from langchain.prompts import ChatPromptTemplate

template = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

print(prompt)
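To see exactly what the LLM will receive, you can format the template with dummy values; the two strings below are placeholders for illustration only:

# Optional check: fill the template with dummy values to inspect the final prompt
messages = prompt.format_messages(
    context="(retrieved context would go here)",
    question="(user question would go here)",
)
print(messages[0].content)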

Step 3: Generate

Finally, you can build a chain for the RAG pipeline by chaining together the retriever, the prompt template, and the LLM. Once the RAG chain is defined, you can invoke it.

from langchain.chat_models import ChatOpenAI
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm
    | StrOutputParser() 
)

query = "What did the president say about Justice Breyer"
rag_chain.invoke(query)
"The president thanked Justice Breyer for his service and acknowledged his dedication to serving the country. 
The president also mentioned that he nominated Judge Ketanji Brown Jackson as a successor to continue Justice Breyer's legacy of excellence."
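If you also want to see which chunks were used to generate the answer, one option is to wrap the chain in a RunnableParallel so that it returns the retrieved source documents alongside the answer. This is an optional extension, not part of the original pipeline; depending on your LangChain version, RunnableParallel may need to be imported from langchain_core.runnables instead.

from langchain.schema.runnable import RunnableParallel

# Optional extension: return the answer together with the retrieved source documents
rag_chain_with_sources = RunnableParallel(
    {
        "answer": rag_chain,   # the chain defined above
        "sources": retriever,  # the raw retrieved Document objects
    }
)

result = rag_chain_with_sources.invoke(query)
print(result["answer"])
for doc in result["sources"]:
    print(doc.page_content[:100])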


Summary

We have covered the concept of RAG, which was presented in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". After first looking at the theory behind the concept, including the motivation and the problem it solves, we implemented it in Python. We built a RAG pipeline using an OpenAI LLM in combination with a Weaviate vector database and an OpenAI embedding model, with LangChain handling the orchestration.
