Revanth Reddy Tondapu

Part 14: Getting Started with Neo4j Sandbox and Graph Data Science Library


Figure: Neo4j Sandbox and Graph Data Science Library

This guide covers getting started with the Neo4j Sandbox and the Graph Data Science (GDS) library. We will set up a sandbox, import sample flight data, inspect the schema, project an in-memory graph, and run PageRank, a centrality algorithm, on it. The GDS library also provides algorithms for community detection, similarity, and path finding such as shortest path.


Step 1: Setting Up Neo4j Sandbox

To begin, we need to set up Neo4j's sandbox environment, which is a free, cloud-based platform ideal for experimenting with graph data.

  1. Access the Sandbox: Open your web browser and search for "Neo4j sandbox." Click on the link that directs you to the official Neo4j website.

  2. Launch the Sandbox: On the Neo4j sandbox page, click on "Launch Free Sandbox."

  3. Create an Account: Scroll down, click "Next," and fill in the required information to create your account. If this is your first visit, you will be prompted to sign in, for example with a social account such as Google.


Step 2: Importing Sample Data

With your sandbox environment ready, the next step is to import a sample dataset. Neo4j's sandbox offers various sample datasets, and for this demo, we will use a dataset related to flights.

  1. Select the Dataset: In the sandbox interface, you will find options to import different sample datasets. Select the dataset that contains flight-related data.

  2. Start the Import Process: Click "Create" to start importing the dataset. This process may take a minute.

  3. Open Neo4j Browser: Once the import is complete, open Neo4j Browser to interact with your dataset.


Step 3: Understanding the Data Schema

Before diving into data analysis, it is essential to understand the schema of the imported data. Neo4j provides schema visualization tools to help with this.

  1. Visualize the Schema: Use Neo4j's browser to visualize the schema by running the following command:

CALL db.schema.visualization();

  2. Analyze the Schema: Rearrange the graph to make it more readable. Airport nodes connect to other Airport nodes through HAS_ROUTE relationships. Each airport is also linked to a city, region, country, and continent, and those city, region, country, and continent nodes are related to one another.


Figure: Understanding the Data Schema
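
To complement the visual schema, a quick query can confirm what the route data looks like. The sketch below samples a few routes. It assumes the iata property used by the queries later in this post, and that HAS_ROUTE points from the origin airport to the destination.

```cypher
// Sample five routes between airports.
// Assumption: Airport nodes carry an iata property, as in the later queries.
MATCH (origin:Airport)-[:HAS_ROUTE]->(destination:Airport)
RETURN origin.iata AS origin, destination.iata AS destination
LIMIT 5;
```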

Step 4: Using the Graph Data Science Library

Neo4j's Graph Data Science Library includes many algorithms for various analytical use cases. In this sandbox environment, the GDS library is pre-installed, but for local installations, you would need to install it first.
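
One quick way to confirm that the library is available, in the sandbox or on a local installation, is to ask for its version. gds.version() is a function provided by the library, so if GDS is not installed the call should fail with an unknown-function error.

```cypher
// Returns the installed GDS version as a string
RETURN gds.version() AS gdsVersion;
```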


Projecting a Graph

GDS algorithms run on a projected graph: an in-memory copy of a subset of the database, held in the GDS graph catalog. Before running any algorithm, we need to project one.

  1. Projecting a Named Graph: The following query projects a named graph called routes using the GDS library. The gds.graph.project procedure is used to create a named graph in the GDS catalog.

CALL gds.graph.project(
    'routes',
    'Airport',
    'HAS_ROUTE'
)
YIELD
    graphName, nodeProjection, nodeCount, relationshipProjection, relationshipCount;
  • Parameters:

    • routes: The name of the projected graph.

    • Airport: The label of the nodes to include in the graph.

    • HAS_ROUTE: The type of relationships to include in the graph (in this case, the HAS_ROUTE relationships between airports).

  • Yields:

    • graphName: The name of the projected graph.

    • nodeProjection: The node projection configuration (here, the Airport label).

    • nodeCount: The number of nodes included in the graph.

    • relationshipProjection: The relationship projection configuration (here, the HAS_ROUTE type).

    • relationshipCount: The number of relationships included in the graph.

Once projected, the routes graph can be passed by name to the algorithms provided by the GDS library.
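
Projected graphs stay in the in-memory graph catalog until they are dropped or the database restarts, and projecting a second graph under an existing name fails. A small sketch for checking what is currently in the catalog, assuming the routes graph has been projected as above:

```cypher
// Show the 'routes' projection if it exists in the graph catalog
CALL gds.graph.list('routes')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;
```

If the graph needs to be projected again, CALL gds.graph.drop('routes') removes it from the catalog first.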


Running the PageRank Algorithm

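PageRank measures the importance of each node from the number of incoming relationships and the importance of the nodes they come from. Here, an airport scores highly when many well-connected airports have routes to it. Let's run PageRank on our projected graph.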


  1. Stream PageRank Scores:

CALL gds.pageRank.stream('routes')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS n, score AS pageRank
RETURN n.iata AS iata, n.descr AS description, pageRank
ORDER BY pageRank DESC, iata ASC;

  • Explanation:

    • CALL gds.pageRank.stream('routes'): Executes the PageRank algorithm on the routes graph.

    • YIELD nodeId, score: Returns the node ID and the calculated PageRank score.

    • WITH gds.util.asNode(nodeId) AS n, score AS pageRank: Converts the node ID to a node reference and aliases the score.

    • RETURN n.iata AS iata, n.descr AS description, pageRank: Returns the IATA code, description, and PageRank score.

    • ORDER BY pageRank DESC, iata ASC: Sorts the results by PageRank in descending order and then by IATA in ascending order.

This query helps us identify the most influential airports in our graph based on the PageRank algorithm.
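
The call above relies on the default settings. As a hedged sketch, the same stream can be tuned through an optional configuration map and trimmed to the top results. The values shown for maxIterations and dampingFactor are simply the library defaults written out, and the limit of 10 is an arbitrary choice.

```cypher
// Stream PageRank with explicit settings and keep only the top 10 airports.
// maxIterations and dampingFactor are shown at their default values.
CALL gds.pageRank.stream('routes', {
    maxIterations: 20,
    dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).iata AS iata, score AS pageRank
ORDER BY pageRank DESC, iata ASC
LIMIT 10;
```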


Writing PageRank Scores Back to the Graph

After streaming the PageRank scores, we might want to write these scores back to our graph to use them later.


  2. Write PageRank Scores:

CALL gds.pageRank.write('routes', {
    writeProperty: 'pageRank'
})
YIELD nodePropertiesWritten, ranIterations;

  • Explanation:

    • CALL gds.pageRank.write('routes', { writeProperty: 'pageRank' }): Runs the PageRank algorithm and writes the scores to a node property called pageRank.

    • YIELD nodePropertiesWritten, ranIterations: Returns the number of node properties written and the number of iterations the algorithm ran.

By writing the PageRank scores back to the graph, we can easily reference these scores in future queries without having to recompute them.
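
A simple sanity check, sketched here, is to count how many Airport nodes now carry the property. The result should match the nodePropertiesWritten value returned by the write call, assuming every node in the projection is an Airport and no pageRank property existed before.

```cypher
// Count airports that have a stored pageRank value
MATCH (a:Airport)
WHERE a.pageRank IS NOT NULL
RETURN count(a) AS airportsWithPageRank;
```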


Querying PageRank Scores

Now that we have written the PageRank scores back to the graph, let's query these scores directly.


  3. Query PageRank Scores:

MATCH (a:Airport)
RETURN a.iata AS iata, a.descr AS description, a.pageRank AS pageRank
ORDER BY a.pageRank DESC, a.iata ASC;
  • Explanation:

    • MATCH (a:Airport): Finds all nodes with the label Airport.

    • RETURN a.iata AS iata, a.descr AS description, a.pageRank AS pageRank: Returns the IATA code, description, and PageRank score.

    • ORDER BY a.pageRank DESC, a.iata ASC: Sorts the results by PageRank in descending order and then by IATA in ascending order.

This query helps us retrieve and view the PageRank scores directly from the graph, allowing us to easily see which airports are most important according to the PageRank algorithm.
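
Because the score is now an ordinary node property, it can be combined with other patterns. The sketch below ranks the destinations reachable from one airport by their stored PageRank. The code 'LHR' is only an illustrative value and assumes that airport exists in the dataset with outgoing HAS_ROUTE relationships.

```cypher
// Top five destinations from one airport, ranked by stored PageRank.
// 'LHR' is a hypothetical example value; substitute any iata code in the data.
MATCH (:Airport {iata: 'LHR'})-[:HAS_ROUTE]->(dest:Airport)
RETURN DISTINCT dest.iata AS iata, dest.descr AS description, dest.pageRank AS pageRank
ORDER BY pageRank DESC
LIMIT 5;
```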


Conclusion

Neo4j's sandbox and Graph Data Science Library provide a robust platform for analyzing complex graph data. By following the steps above, you can set up the sandbox environment, import sample data, project an in-memory graph, and use PageRank to identify key airports in the flight data. The library's other algorithms are run against a projected graph in the same way, whether you want to detect communities, compare travel patterns, or find the shortest routes.
