Improving RAG with Knowledge Graphs

Retrieval-Augmented Generation (RAG) is a technique that connects external data sources to improve the output of large language models (LLMs). It lets an LLM access private or domain-specific data and helps mitigate hallucinations. As a result, RAG is widely used to power GenAI applications such as AI chatbots and recommendation systems.

Introduction to RAG and Related Issues

Basic RAG typically combines a vector database and an LLM, where the vector database stores and retrieves contextual information for user queries, and the LLM generates responses based on the retrieved context. This approach works well in many cases; however, it struggles with complex tasks such as multi-hop reasoning or answering questions that require connecting disparate pieces of information.
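For orientation, here is a minimal sketch of that basic RAG loop. It is illustrative only: the "docs" collection, the embedding model, and the prompt are assumptions made for this example, not part of the GraphRAG setup discussed later.

# Minimal sketch of basic RAG (illustrative only).
# Assumes a Milvus collection named "docs" already holds text chunks and their embeddings.
from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI()                         # reads OPENAI_API_KEY from the environment
milvus_client = MilvusClient(uri="./milvus.db")  # Milvus Lite; use a server URI in production

def answer(question: str) -> str:
    # 1. Embed the user question.
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # 2. Retrieve the most semantically similar chunks from the vector database.
    hits = milvus_client.search(
        collection_name="docs", data=[emb], limit=3, output_fields=["text"]
    )[0]
    context = "\n".join(hit["entity"]["text"] for hit in hits)
    # 3. Let the LLM answer using only the retrieved context.
    reply = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return reply.choices[0].message.content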

For example, consider the question: "What name was given to the son of the man who defeated the usurper Allectus?"

To answer this question, basic RAG typically performs the following steps:

  1. Identifies the person: determines who defeated Allectus.

  2. Identifies the person's son: searches for information about this person's family, specifically his son.

  3. Finds the name: determines the son's name.

The problem usually arises at the first step: basic RAG retrieves text by semantic similarity rather than by answering the query directly, so when a specific detail is never stated explicitly in the dataset, the right passages may not be retrieved at all. This limitation makes it difficult to find the exact information required and often pushes teams toward costly and impractical workarounds, such as manually creating question-answer pairs for frequent queries.

To address such issues, Microsoft Research introduced GraphRAG, a method that augments RAG retrieval and generation with knowledge graphs. In the following sections, we explain how GraphRAG works and how to run it with the Milvus vector database.

What is GraphRAG and how does it work?

Unlike basic RAG, which uses a vector database to retrieve semantically similar text, GraphRAG enhances RAG with knowledge graphs (KGs). A knowledge graph is a data structure that stores entities and links them through their relationships.
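To make the idea concrete, here is a toy example (not GraphRAG's internal format) of a knowledge graph stored as subject-relation-object triples, reusing the Allectus question from earlier. Two hops over explicit relationships yield the answer that pure semantic similarity struggles to find.

# Toy illustration: a knowledge graph as (subject, relation, object) triples.
triples = [
    ("Constantius Chlorus", "defeated", "Allectus"),
    ("Constantine", "son_of", "Constantius Chlorus"),
]

# Multi-hop traversal: who defeated Allectus, and what is that man's son called?
victor = next(s for s, r, o in triples if r == "defeated" and o == "Allectus")
son = next(s for s, r, o in triples if r == "son_of" and o == victor)
print(son)  # -> Constantine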

The GraphRAG pipeline typically consists of two main processes: indexing and query processing.

Indexing

The indexing process consists of four main stages:

  1. Segmentation of text units: the entire input corpus is divided into several text units (text fragments). These fragments are the smallest analyzable units and can be paragraphs, sentences, or other logical units. By segmenting long documents into smaller fragments, we can extract and retain more detailed information about these inputs.

  2. Extraction of entities, relationships, and statements: GraphRAG uses an LLM to identify and extract from each text unit all entities (names of people, places, organizations, etc.), the relationships between them, and the key statements expressed in the text. This extracted information is used to build the initial knowledge graph (a simplified sketch of stages 1-2 follows this list).

  3. Hierarchical clustering: GraphRAG uses the Leiden algorithm to perform hierarchical clustering on the initial knowledge graph. Leiden is a community detection algorithm that can effectively detect community structure in a graph. Entities in each cluster are assigned to different communities for deeper analysis.

Note: a community is a group of nodes in a graph that are densely connected to each other but sparsely connected to other dense groups in the network.

  4. Community summary generation: GraphRAG generates summaries for each community and its members using a bottom-up approach. These summaries include the main entities in the community, their relationships, and key statements. This step provides an overview of the entire dataset and offers useful contextual information for subsequent queries.
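The sketch below illustrates the spirit of stages 1-2 under stated assumptions: it is not GraphRAG's own implementation, the chunking parameters and the prompt are invented for illustration, and a real pipeline parses a much richer structured response.

# Illustrative sketch of text-unit segmentation and LLM-based entity/relationship extraction.
# This is not GraphRAG's own code; the prompt and chunk size are arbitrary choices.
from openai import OpenAI

client = OpenAI()

def split_into_text_units(text: str, max_chars: int = 1200) -> list[str]:
    # Naive segmentation by paragraphs, packed into chunks of roughly max_chars characters.
    units, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) > max_chars and current:
            units.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        units.append(current)
    return units

def extract_graph_elements(text_unit: str) -> str:
    # Ask the LLM for entities and relationships in a simple line-based format.
    prompt = (
        "List the entities (people, places, organizations) and the relationships "
        "between them in the text below, one per line as "
        "'ENTITY <name> <type>' or 'RELATION <source> <target> <description>'.\n\n"
        + text_unit
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content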

Query Processing

GraphRAG has two distinct query processing workflows tailored for different queries.

  • Global Search for reasoning about holistic questions related to the entire data corpus by utilizing community summaries.

  • Local Search for reasoning about specific entities by fanning out to their neighbors and associated concepts.

The global search workflow includes the following phases (a simplified sketch follows the list):

  1. User query and conversation history: the system takes the user query and the conversation history as its initial input.

  2. Community report packages: as contextual data, the system uses community reports generated by the LLM from the specified level of the community hierarchy. These community reports are shuffled and divided into several packages (Package 1, Package 2… Package N).

  3. RIR (Rated Intermediate Responses): Each community report package is further divided into text fragments of a predefined size. Each text fragment is used to generate an intermediate response. The response contains a list of information fragments called items. Each item has a numerical rating indicating its importance. These generated intermediate responses are Rated Intermediate Responses (Response 1, Response 2… Response N).

  4. Ranking and filtering: the system ranks and filters these intermediate responses, selecting the most important items. The selected important items form Aggregated Intermediate Responses.

  5. Final response: Aggregated Intermediate Responses are used as context to generate the final response.
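The following sketch mirrors this map-reduce pattern under simplifying assumptions: the batching, the rating prompt, and the score parsing are invented for illustration and are far cruder than GraphRAG's actual global search implementation.

# Illustrative map-reduce sketch of global search (not GraphRAG's actual code).
from openai import OpenAI

client = OpenAI()

def rate_batch(question: str, report_batch: list[str]) -> list[tuple[int, str]]:
    # Map step: ask the LLM for answer points with 0-100 importance scores.
    prompt = (
        f"Question: {question}\n\nCommunity reports:\n" + "\n---\n".join(report_batch)
        + "\n\nList answer points as '<score 0-100>|<point>', one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    points = []
    for line in resp.choices[0].message.content.splitlines():
        score, _, text = line.partition("|")
        if text and score.strip().isdigit():
            points.append((int(score.strip()), text.strip()))
    return points

def global_search(question: str, reports: list[str], batch_size: int = 5) -> str:
    rated = []
    for i in range(0, len(reports), batch_size):
        rated.extend(rate_batch(question, reports[i:i + batch_size]))
    # Reduce step: keep the highest-rated points and synthesize the final answer.
    top_points = [p for _, p in sorted(rated, reverse=True)[:20]]
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Answer the question using these points:\n"
                              + "\n".join(top_points)
                              + f"\n\nQuestion: {question}"}],
    )
    return final.choices[0].message.content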

When users ask questions about specific entities (people, places, organizations, etc.), we recommend using the local search workflow. This process includes the following steps (a simplified sketch follows the list):

  1. User Request: First, the system receives a user request, which can be a simple question or a more complex query.

  2. Search for Similar Entities: The system identifies a set of entities from the knowledge graph that are semantically related to the user input. These entities serve as entry points into the knowledge graph. At this stage, a vector database, such as Milvus, is used to perform text similarity search.

  3. Entity and Text Unit Matching: The extracted text units are matched with the corresponding entities, removing the original text information.

  4. Entity and Relationship Extraction: At this stage, specific information about entities and their corresponding relationships is extracted.

  5. Entity and Covariate Matching: At this stage, entities are matched with their covariates, which may include statistical data or other relevant attributes.

  6. Entity and Community Report Matching: Community reports are integrated into the search results, including some global information.

  7. Using Dialogue History: The system uses dialogue history to better understand the user's intent and context.

  8. Response Generation: The system constructs and responds to the user's request based on the filtered and sorted data generated in the previous stages.
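As a rough illustration of steps 2-8, and of where Milvus fits in, the sketch below searches an entity-description collection and builds a context for the LLM. The collection name, field names, and prompt are assumptions made for this example; GraphRAG's real local search engine is used later in this article.

# Illustrative sketch of local search: entity lookup in Milvus plus LLM generation.
from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI()
milvus_client = MilvusClient(uri="./milvus.db")

def local_search(question: str) -> str:
    # Step 2: find entities semantically related to the question.
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    hits = milvus_client.search(
        collection_name="entity_description_embeddings",
        data=[emb], limit=10, output_fields=["text"],
    )[0]
    # Steps 3-7 would gather text units, relationships, covariates, and community
    # reports for these entities; here we simply use the entity descriptions.
    context = "\n".join(hit["entity"]["text"] for hit in hits)
    # Step 8: generate the answer from the assembled context.
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content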

Comparison of Basic RAG and GraphRAG in Output Quality

To demonstrate the effectiveness of GraphRAG, its creators compared the output quality of basic RAG and GraphRAG in their blog. For illustration, here is a simple example.

Dataset Used

For their experiments, the creators of GraphRAG used the Violent Incident Information from News Articles (VIINA) dataset.

Note: This dataset contains sensitive topics. It was chosen solely because of its complexity and the presence of various opinions and partial information.

Overview of the experiment

The basic RAG and GraphRAG were asked the same question, which requires aggregating information across the entire dataset to answer.

Question: What are the 5 main topics in the dataset?

The answers are shown in the image below. Basic RAG's results were not related to the military theme, because the vector search retrieved unrelated text, which led to an inaccurate answer. In contrast, GraphRAG provided a clear and relevant answer, identifying the main topics and supporting details; its results were consistent with the dataset and included references to the source material.

Further experiments presented in the paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" showed that GraphRAG significantly improves multi-hop reasoning and the summarization of complex information. The study showed that GraphRAG outperforms basic RAG in both comprehensiveness and diversity:

  • Comprehensiveness: the extent to which the answer covers all aspects of the question.

  • Diversity: the variety and richness of perspectives and ideas contained in the answer.

For more detailed information about these experiments, we recommend you read the original GraphRAG article.

How to implement GraphRAG with the Milvus vector database

GraphRAG extends RAG applications with knowledge graphs and also uses a vector database to retrieve relevant entities. This section shows how to implement GraphRAG, create a GraphRAG index, and query it using the Milvus vector database.

Prerequisites

Before running the code, make sure you have installed the following dependencies:

pip install --upgrade pymilvus
pip install git+https://github.com/zc277584121/graphrag.git

Note: We installed GraphRAG from a fork repository as the Milvus storage feature was still awaiting official merging at the time of writing.

Let's start with the indexing workflow.

Data preparation

Download a small text file from Project Gutenberg with about a thousand lines and use it for indexing with GraphRAG.

This dataset is about the life of Leonardo da Vinci. We use GraphRAG to build a graph index of all relationships involving da Vinci, and the Milvus vector database to retrieve relevant knowledge when answering questions.

import nest_asyncio
nest_asyncio.apply()
import os
import urllib.request
index_root = os.path.join(os.getcwd(), 'graphrag_index')
os.makedirs(os.path.join(index_root, 'input'), exist_ok=True)
url = "https://www.gutenberg.org/cache/epub/7785/pg7785.txt"
file_path = os.path.join(index_root, 'input', 'davinci.txt')
urllib.request.urlretrieve(url, file_path)
with open(file_path, 'r+', encoding='utf-8') as file:
    # We use the first 934 lines of the text file, because the later lines are not relevant for this example.
    # If you want to save api key cost, you can truncate the text file to a smaller size.
    lines = file.readlines()
    file.seek(0)
    file.writelines(lines[:934])  # Decrease this number if you want to save api key cost.
    file.truncate()

Workspace Initialization

Now let's use GraphRAG to index the text file. To initialize the workspace, first run the graphrag.index --init command:

python -m graphrag.index --init --root ./graphrag_index

Setting up the env file and parameters

After initialization, you will find a .env file in the index root directory. Add your OpenAI API key to this file.
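The template generated by graphrag.index --init typically reads the key from the GRAPHRAG_API_KEY variable, so the file usually needs just one line like the following (adjust the variable name if your configuration references a different one):

GRAPHRAG_API_KEY=<your-openai-api-key>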

Important notes:

  • In this example, we will use OpenAI models, so make sure you have a ready API key.

  • GraphRAG indexing is expensive as it processes the entire text corpus using LLM. To save money, try truncating the text file to a smaller size.

Running the indexing pipeline

Run the indexing pipeline:

python -m graphrag.index --root ./graphrag_index

The indexing process will take some time. After completion, a new timestamped subdirectory with an artifacts folder containing a series of Parquet files will appear under ./graphrag_index/output/.

Processing requests using the Milvus vector database

At the query execution stage, we use Milvus to store entity description embeddings needed for local search in GraphRAG. This approach combines structured data from the knowledge graph with unstructured data from input documents, allowing the LLM context to be supplemented with relevant entity information and providing more accurate answers.

import os
import pandas as pd
import tiktoken
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    # read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import (
    store_entity_semantic_embeddings,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores import MilvusVectorStore
output_dir = os.path.join(index_root, "output")
subdirs = [os.path.join(output_dir, d) for d in os.listdir(output_dir)]
latest_subdir = max(subdirs, key=os.path.getmtime)  # Get latest output directory
INPUT_DIR = os.path.join(latest_subdir, "artifacts")
# Names of the Parquet tables produced by the indexing pipeline
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
# Level in the community hierarchy from which to load reports; higher values use smaller communities
COMMUNITY_LEVEL = 2

Loading data from the indexing process

During the indexing process, several parquet files will be generated. We load them into memory and store the entity description information in the Milvus vector database.

Reading entities:

# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
description_embedding_store = MilvusVectorStore(
    collection_name="entity_description_embeddings",
)
# description_embedding_store.connect(uri="http://localhost:19530") # For Milvus docker service
description_embedding_store.connect(uri="./milvus.db") # For Milvus Lite
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)
print(f"Entity count: {len(entity_df)}")
entity_df.head()

Entity count: 651

Reading relationships

relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)
print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()

Relationship count: 290

Reading community reports

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)
print(f"Report records: {len(report_df)}")
report_df.head()

Report records: 45

Reading text units

text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)
print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()

Text unit records: 51

Creating a local search system

We have prepared the data needed for local search. Now we can create a LocalSearch instance, along with the LLM and embedding model it uses.

api_key = os.environ["OPENAI_API_KEY"]  # Your OpenAI API key
llm_model = "gpt-4o"  # Or gpt-4-turbo-preview
embedding_model = "text-embedding-3-small"
llm = ChatOpenAI(
    api_key=api_key,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,
    max_retries=20,
)
token_encoder = tiktoken.get_encoding("cl100k_base")
text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base=None,
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=None, #covariates,#todo
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)
local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}
llm_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000=1500)
    "temperature": 0.0,
}
search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

Running a query

result = await search_engine.asearch("Tell me about Leonardo Da Vinci")
print(result.response)
# Leonardo da Vinci
    Leonardo da Vinci, born in 1452 in the town of Vinci near Florence, is widely celebrated as one of the most versatile geniuses of the Italian Renaissance. His full name was Leonardo di Ser Piero d'Antonio di Ser Piero di Ser Guido da Vinci, and he was the natural and first-born son of Ser Piero, a country notary [Data: Entities (0)]. Leonardo's contributions spanned various fields, including art, science, engineering, and philosophy, earning him the title of the most Universal Genius of Christian times [Data: Entities (8)].
    ## Early Life and Training
    Leonardo's early promise was recognized by his father, who took some of his drawings to Andrea del Verrocchio, a renowned artist and sculptor. Impressed by Leonardo's talent, Verrocchio accepted him into his workshop around 1469-1470. Here, Leonardo met other notable artists, including Botticelli and Lorenzo di Credi [Data: Sources (6, 7)]. By 1472, Leonardo was admitted into the Guild of Florentine Painters, marking the beginning of his professional career [Data: Sources (7)].
    ## Artistic Masterpieces
    Leonardo is perhaps best known for his iconic paintings, such as the "Mona Lisa" and "The Last Supper." The "Mona Lisa," renowned for its subtle expression and detailed background, is housed in the Louvre and remains one of the most famous artworks in the world [Data: Relationships (0, 45)]. "The Last Supper," a fresco depicting the moment Jesus announced that one of his disciples would betray him, is located in the refectory of Santa Maria delle Grazie in Milan [Data: Sources (2)]. Other significant works include "The Virgin of the Rocks" and the "Treatise on Painting," which he began around 1489-1490 [Data: Relationships (7, 12)].
    ## Scientific and Engineering Contributions
    Leonardo's genius extended beyond art to various scientific and engineering endeavors. He made significant observations in anatomy, optics, and hydraulics, and his notebooks are filled with sketches and ideas that anticipated many modern inventions. For instance, he anticipated Copernicus' theory of the earth's movement and Lamarck's classification of animals [Data: Relationships (38, 39)]. His work on the laws of light and shade and his mastery of chiaroscuro had a profound impact on both art and science [Data: Sources (45)].
    ## Patronage and Professional Relationships
    Leonardo's career was significantly influenced by his patrons. Ludovico Sforza, the Duke of Milan, employed Leonardo as a court painter and general artificer, commissioning various works and even gifting him a vineyard in 1499 [Data: Relationships (9, 19, 84)]. In his later years, Leonardo moved to France under the patronage of King Francis I, who provided him with a princely income and held him in high regard [Data: Relationships (114, 37)]. Leonardo spent his final years at the Manor House of Cloux near Amboise, where he was frequently visited by the King and supported by his close friend and assistant, Francesco Melzi [Data: Relationships (28, 122)].
    ## Legacy and Influence
    Leonardo da Vinci's influence extended far beyond his lifetime. He founded a School of painting in Milan, and his techniques and teachings were carried forward by his students and followers, such as Giovanni Ambrogio da Predis and Francesco Melzi [Data: Relationships (6, 15, 28)]. His works continue to be celebrated and studied, cementing his legacy as one of the greatest masters of the Renaissance. Leonardo's ability to blend art and science has left an indelible mark on both fields, inspiring countless generations of artists and scientists [Data: Entities (148, 86); Relationships (27, 12)].
    In summary, Leonardo da Vinci's unparalleled contributions to art, science, and engineering, combined with his innovative thinking and profound influence on his contemporaries and future generations, make him a towering figure in the history of human achievement. His legacy continues to inspire admiration and study, underscoring the timeless relevance of his genius.

GraphRAG results are specific, with clearly identified cited data sources.

Question Generation

GraphRAG can also generate questions based on historical queries, which is useful for creating recommended questions in a chatbot dialogue. This method combines structured data from the knowledge graph with unstructured data from input documents to create potential questions related to specific entities.

question_generator = LocalQuestionGen(
   llm=llm,
   context_builder=context_builder,
   token_encoder=token_encoder,
   llm_params=llm_params,
   context_builder_params=local_context_params,
)
question_history = [
    "Tell me about Leonardo Da Vinci",
    "Leonardo's early works",
]

Generate candidate questions based on this history:

candidate_questions = await question_generator.agenerate(
        question_history=question_history, context_data=None, question_count=5
    )
candidate_questions.response
["- What were some of Leonardo da Vinci's early works and where are they housed?",
     "- How did Leonardo da Vinci's relationship with Andrea del Verrocchio influence his early works?",
     '- What notable projects did Leonardo da Vinci undertake during his time in Milan?',
     "- How did Leonardo da Vinci's engineering skills contribute to his projects?",
     "- What was the significance of Leonardo da Vinci's relationship with Francis I of France?"]

If you no longer need the index, you can delete the index root directory to free up space.

# import shutil
#
# shutil.rmtree(index_root)

Summary

In this article, we explored GraphRAG, an innovative method that enhances RAG by integrating knowledge graphs. GraphRAG is well suited to complex tasks such as multi-hop reasoning and answering questions that require linking disparate pieces of information.

In combination with the vector database Milvus, GraphRAG can effectively analyze complex semantic relationships in large datasets, providing more accurate and in-depth results. This powerful combination makes GraphRAG an indispensable tool for various practical applications in the field of generative AI, providing a reliable solution for understanding and processing complex information.
