Revanth Reddy Tondapu

Unlocking the Power of Graph Data: A Comprehensive Guide to GraphRAG, Neo4j, and Groq



In today's data-driven world, understanding complex relationships within datasets is paramount. Graph-based approaches provide unique insights by highlighting connections between data points that might otherwise go unnoticed. In this blog post, we explore how GraphRAG, Neo4j, and Groq can be integrated to enhance data visualization and improve the quality of AI responses.


Introduction to GraphRAG

GraphRAG is a tool designed to enrich AI interactions by extracting detailed relationships between the entities in a dataset. Unlike basic retrieval-augmented generation (RAG) pipelines that rely primarily on semantic search, GraphRAG builds a deeper understanding by identifying entities, analyzing the relationships between them, and summarizing the communities they form.


How GraphRAG Works

  1. Entity Extraction: GraphRAG identifies entities within text documents and categorizes them by name, type, and description. It also determines the relationships between these entities, including their strengths (a short inspection sketch follows this list).

  2. Community Reporting: After extracting entities, GraphRAG compiles a comprehensive report that outlines the overarching themes and dynamics within the dataset. This helps in understanding the broader context beyond individual entities.
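To make this concrete, here is a minimal sketch (in Python, using pandas) of how you might inspect what GraphRAG extracted. The artifact file names match the create_final_entities and create_final_relationships outputs imported into Neo4j later in this post, but the output path is an assumption; point it at wherever your indexing run stored its Parquet files.

import pandas as pd

# Assumed location of the indexing run's Parquet artifacts; adjust as needed
artifacts_dir = 'output/artifacts'

# Entities: one row per extracted entity, with its name, type, and description
entities = pd.read_parquet(f'{artifacts_dir}/create_final_entities.parquet')
print(entities[['name', 'type', 'description']].head())

# Relationships: one row per entity pair, with a weight expressing its strength
relationships = pd.read_parquet(f'{artifacts_dir}/create_final_relationships.parquet')
print(relationships[['source', 'target', 'weight', 'description']].head())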


Leveraging Groq for Language Processing

Groq acts as the language-model backend for GraphRAG, providing the fast inference needed to interpret and analyze your data. By setting up Groq with GraphRAG, you can use its hosted language models for entity extraction, summarization, and query answering.


Setting Up Groq with GraphRAG

  1. Installation and API Configuration: Begin by installing GraphRAG and configuring the necessary API keys for Groq. Adjust settings such as model names and API parameters to suit your specific needs (a minimal connectivity check is sketched after this list).

  2. Data Preparation: Organize your data files in a designated input folder. Use indexing commands to extract entities and relationships, structuring the output for further analysis.

  3. Embedding Models: Since Groq doesn't currently offer embedding models, configure an external embedding provider so the indexing pipeline can still generate vector embeddings.
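Before running the indexer, it helps to confirm that your Groq credentials work. The snippet below is a minimal sketch, assuming Groq's OpenAI-compatible endpoint (https://api.groq.com/openai/v1), a GROQ_API_KEY environment variable, and the openai Python package; the model name is only an example and may need replacing with one Groq currently serves. GraphRAG's settings can then point their api_base at the same endpoint.

import os
from openai import OpenAI

# Assumptions: GROQ_API_KEY is set in the environment, and Groq's
# OpenAI-compatible endpoint is reachable from this machine.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

# Example model name only; check Groq's model list for what is available
response = client.chat.completions.create(
    model="llama3-70b-8192",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)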


Visualizing Data with Neo4j

Neo4j is a robust graph database that allows you to visualize and explore complex data relationships. By importing data into Neo4j, you can interactively analyze how different entities are connected.


Setting Up Neo4j

  1. Installation: Download the appropriate version of Neo4j for your operating system and follow the installation instructions to set up a local server.

  2. Data Import: Convert extracted data from GraphRAG into CSV format and import it into Neo4j using Cypher queries. This process involves creating nodes and relationships that represent the data structure.

  3. Data Exploration: Use Neo4j's interface to run queries and visualize data relationships. This can help you gain insights into entity connections, document structures, and community dynamics.


Comparing Global and Local Search in GraphRAG

GraphRAG offers two modes of querying a dataset: global and local search. Each serves a distinct purpose based on the scope of the inquiry (a query sketch follows the list):

  • Global Search: Suitable for broad questions about the entire dataset, global search uses community reports to provide comprehensive insights. It employs a map-reduce approach to synthesize information.

  • Local Search: Focused on specific entities and their properties, local search uses structured data from the knowledge graph and input documents. It's more direct and simpler, ideal for detailed investigations.
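As a rough illustration, the sketch below drives both modes through GraphRAG's command-line interface from Python. The flags (--root, --method) reflect the python -m graphrag.query entry point documented around the time of writing and are an assumption; check your installed version's help output, and replace the project folder and questions with your own.

import subprocess

root = "./ragtest"  # hypothetical project folder used for indexing

# Global search: a broad, dataset-wide question answered from community reports
subprocess.run([
    "python", "-m", "graphrag.query",
    "--root", root,
    "--method", "global",
    "What are the main themes across these documents?",
], check=True)

# Local search: a targeted question about a specific entity and its neighbours
subprocess.run([
    "python", "-m", "graphrag.query",
    "--root", root,
    "--method", "local",
    "What does the dataset say about entity X?",
], check=True)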


Code Implementation: Parquet to CSV Conversion

To facilitate data import into Neo4j, you might need to convert Parquet files to CSV. Here's a Python script to assist with that process:

import os
import pandas as pd
import csv

# Define the directory containing Parquet files and the directory to save CSV files
parquet_dir = '/path/to/parquet/files'  # Update this with the path to your Parquet files
csv_dir = '/path/to/csv/files'          # Update this with the path where you want to save CSV files

# Function to clean string fields: strip whitespace and collapse doubled quotes.
# Output quoting is left to to_csv (csv.QUOTE_NONNUMERIC below), so values are
# not wrapped in extra quotes here; doing both would double-quote the output.
def clean_quotes(value):
    if isinstance(value, str):
        value = value.strip().replace('""', '"')
    return value

# Convert all Parquet files to CSV
for file_name in os.listdir(parquet_dir):
    if file_name.endswith('.parquet'):
        parquet_file = os.path.join(parquet_dir, file_name)
        csv_file = os.path.join(csv_dir, file_name.replace('.parquet', '.csv'))

        # Load the Parquet file
        df = pd.read_parquet(parquet_file)

        # Clean quotes in string fields
        for column in df.select_dtypes(include=['object']).columns:
            df[column] = df[column].apply(clean_quotes)

        # Save to CSV
        df.to_csv(csv_file, index=False, quoting=csv.QUOTE_NONNUMERIC)
        print(f"Converted {parquet_file} to {csv_file} successfully.")

print("All Parquet files have been converted to CSV.")

Neo4j Installation and Setup

These commands will help you download, extract, and start Neo4j on a Unix-based system:

# Download the Neo4j community edition
curl -O https://dist.neo4j.org/neo4j-community-5.21.2-unix.tar.gz

# Extract the downloaded archive
tar -xzf neo4j-community-5.21.2-unix.tar.gz

# Navigate into the Neo4j directory
cd neo4j-community-5.21.2

# Start the Neo4j server
./bin/neo4j start

# Check the status of the Neo4j server
./bin/neo4j status

Importing Data into Neo4j

Use the following Cypher queries to import the CSV data into Neo4j, creating nodes and relationships for the data model. Copy the CSV files into Neo4j's import directory first, since the file:/// URLs in LOAD CSV resolve relative to that directory by default.

// Import Documents
LOAD CSV WITH HEADERS FROM 'file:///create_final_documents.csv' AS row
CREATE (d:Document {
  id: row.id,
  title: row.title,
  raw_content: row.raw_content,
  text_unit_ids: row.text_unit_ids
});

// Import Text Units
LOAD CSV WITH HEADERS FROM 'file:///create_final_text_units.csv' AS row
CREATE (t:TextUnit {
  id: row.id,
  text: row.text,
  n_tokens: toFloat(row.n_tokens),
  document_ids: row.document_ids,
  entity_ids: row.entity_ids,
  relationship_ids: row.relationship_ids
});

// Import Entities
LOAD CSV WITH HEADERS FROM 'file:///create_final_entities.csv' AS row
CREATE (e:Entity {
  id: row.id,
  name: row.name,
  type: row.type,
  description: row.description,
  human_readable_id: toInteger(row.human_readable_id),
  text_unit_ids: row.text_unit_ids
});

// Import Relationships
LOAD CSV WITH HEADERS FROM 'file:///create_final_relationships.csv' AS row
CREATE (r:Relationship {
  source: row.source,
  target: row.target,
  weight: toFloat(row.weight),
  description: row.description,
  id: row.id,
  human_readable_id: row.human_readable_id,
  source_degree: toInteger(row.source_degree),
  target_degree: toInteger(row.target_degree),
  rank: toInteger(row.rank),
  text_unit_ids: row.text_unit_ids
});

// Import Nodes
LOAD CSV WITH HEADERS FROM 'file:///create_final_nodes.csv' AS row
CREATE (n:Node {
  id: row.id,
  level: toInteger(row.level),
  title: row.title,
  type: row.type,
  description: row.description,
  source_id: row.source_id,
  community: row.community,
  degree: toInteger(row.degree),
  human_readable_id: toInteger(row.human_readable_id),
  size: toInteger(row.size),
  entity_type: row.entity_type,
  top_level_node_id: row.top_level_node_id,
  x: toInteger(row.x),
  y: toInteger(row.y)
});

// Import Communities
LOAD CSV WITH HEADERS FROM 'file:///create_final_communities.csv' AS row
CREATE (c:Community {
  id: row.id,
  title: row.title,
  level: toInteger(row.level),
  raw_community: row.raw_community,
  relationship_ids: row.relationship_ids,
  text_unit_ids: row.text_unit_ids
});

// Import Community Reports
LOAD CSV WITH HEADERS FROM 'file:///create_final_community_reports.csv' AS row
CREATE (cr:CommunityReport {
  id: row.id,
  community: row.community,
  full_content: row.full_content,
  level: toInteger(row.level),
  rank: toFloat(row.rank),
  title: row.title,
  rank_explanation: row.rank_explanation,
  summary: row.summary,
  findings: row.findings,
  full_content_json: row.full_content_json
});

// Create indexes for better performance
CREATE INDEX FOR (d:Document) ON (d.id);
CREATE INDEX FOR (t:TextUnit) ON (t.id);
CREATE INDEX FOR (e:Entity) ON (e.id);
CREATE INDEX FOR (r:Relationship) ON (r.id);
CREATE INDEX FOR (n:Node) ON (n.id);
CREATE INDEX FOR (c:Community) ON (c.id);
CREATE INDEX FOR (cr:CommunityReport) ON (cr.id);

// Create relationships after all nodes are imported
MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t);

MATCH (t:TextUnit)
UNWIND split(t.document_ids, ',') AS docId
MATCH (d:Document {id: trim(docId)})
CREATE (t)-[:BELONGS_TO]->(d);

MATCH (t:TextUnit)
UNWIND split(t.entity_ids, ',') AS entityId
MATCH (e:Entity {id: trim(entityId)})
CREATE (t)-[:HAS_ENTITY]->(e);

MATCH (t:TextUnit)
UNWIND split(t.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (t)-[:HAS_RELATIONSHIP]->(r);

MATCH (e:Entity)
UNWIND split(e.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (e)-[:MENTIONED_IN]->(t);

MATCH (r:Relationship)
MATCH (source:Entity {name: r.source})
MATCH (target:Entity {name: r.target})
CREATE (source)-[:RELATES_TO]->(target);

MATCH (r:Relationship)
UNWIND split(r.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (r)-[:MENTIONED_IN]->(t);

MATCH (c:Community)
UNWIND split(c.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (c)-[:HAS_RELATIONSHIP]->(r);

MATCH (c:Community)
UNWIND split(c.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (c)-[:HAS_TEXT_UNIT]->(t);

MATCH (cr:CommunityReport)
MATCH (c:Community {id: cr.community})
CREATE (cr)-[:REPORTS_ON]->(c);

Neo4j Queries for Visualization

These queries help you visualize different relationships and nodes within the Neo4j database:

// Visualize Document to TextUnit relationships
MATCH (d:Document)-[r:HAS_TEXT_UNIT]->(t:TextUnit)
RETURN d, r, t
LIMIT 50;

// Visualize Entity to TextUnit relationships
MATCH (e:Entity)-[r:MENTIONED_IN]->(t:TextUnit)
RETURN e, r, t
LIMIT 50;

// Visualize Relationships between Entities
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
RETURN e1, r, e2
LIMIT 50;

// Visualize Community to Relationship connections
MATCH (c:Community)-[r:HAS_RELATIONSHIP]->(rel:Relationship)
RETURN c, r, rel
LIMIT 50;

// Visualize Community Reports and their Communities
MATCH (cr:CommunityReport)-[r:REPORTS_ON]->(c:Community)
RETURN cr, r, c
LIMIT 50;

// Visualize the most connected Entities
MATCH (e:Entity)
WITH e, COUNT{(e)-[:RELATES_TO]->(:Entity)} AS degree
ORDER BY degree DESC
LIMIT 10
MATCH (e)-[r:RELATES_TO]->(other:Entity)
RETURN e, r, other;

// Visualize TextUnits and their connections to Entities and Relationships
MATCH (t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
MATCH (t)-[:HAS_RELATIONSHIP]->(r:Relationship)
RETURN t, e, r
LIMIT 50;

// Visualize Documents and their associated Entities (via TextUnits)
MATCH (d:Document)-[:HAS_TEXT_UNIT]->(t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
RETURN d, t, e
LIMIT 50;

// Visualize Communities and their TextUnits
MATCH (c:Community)-[:HAS_TEXT_UNIT]->(t:TextUnit)
RETURN c, t
LIMIT 50;

// Visualize Relationships and their associated TextUnits
MATCH (r:Relationship)-[:MENTIONED_IN]->(t:TextUnit)
RETURN r, t
LIMIT 50;

// Visualize Entities of different types and their relationships
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
WHERE e1.type <> e2.type
RETURN e1, r, e2
LIMIT 50;

// Visualize the distribution of Entity types
MATCH (e:Entity)
RETURN e.type AS EntityType, COUNT(e) AS Count
ORDER BY Count DESC;

// Visualize the most frequently occurring relationships
MATCH ()-[r:RELATES_TO]->()
RETURN TYPE(r) AS RelationshipType, COUNT(r) AS Count
ORDER BY Count DESC
LIMIT 10;

// Visualize the path from Document to Entity
MATCH path = (d:Document)-[:HAS_TEXT_UNIT]->(t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
RETURN path
LIMIT 25;

Conclusion

The integration of GraphRAG, Neo4j, and Groq offers a comprehensive solution for enhancing data visualization and AI response quality. By extracting detailed entity relationships and leveraging a graph database, you can gain deeper insights into complex datasets. Whether you're exploring entity connections or understanding community dynamics, this approach provides a robust framework for data analysis and interaction, turning raw documents into meaningful insights and better AI-driven answers.
