Building AI-Powered Database Agents: A Hands-On Tutorial

In the era of big data, leveraging artificial intelligence (AI) to automate processes and assist in decision-making is becoming increasingly essential. One compelling application of AI is the creation of database agents. These AI-powered entities can interact with databases, perform complex queries, and provide insights, all through natural language inputs. In this tutorial, we will walk you through building a database agent using a large language model (LLM) and a popular framework for creating LLM-powered applications.

What Are AI Agents?

AI agents are autonomous entities that perform tasks to assist in decision-making and automate processes within an organization. When integrated with databases, these agents can query data, generate reports, and provide answers to complex questions, making them invaluable in various enterprise applications.

Tools We Will Use

For this tutorial, we will use the following tools:

LLM Provider: We'll use an LLM for natural language processing and text-to-SQL conversion.
LangChain: A powerful framework for building LLM-powered applications.
Pandas: A Python library for data manipulation and analysis.
SQLite: A lightweight, disk-based database.

While we will use a specific LLM provider in this tutorial, the code can be easily adapted to work with other models or API providers.

Setting Up the Environment

Installing Required Libraries

First, let's install the necessary libraries. You can do this using pip:

!pip install pyodbc==5.1.0
!pip install tabulate==0.9.0
!pip install openai==1.12.0
!pip install langchain==0.1.6
!pip install langchain-community==0.0.20
!pip install langchain-core==0.1.23
!pip install langchain-experimental==0.0.49
!pip install langchain-openai==0.0.5
!pip install pandas==2.2.2

Importing Libraries

Next, we import the libraries we will use in our Jupyter Notebook or Python script:

import os
from IPython.display import Markdown, HTML, display
from langchain_openai import OpenAI
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from sqlalchemy import create_engine
import pandas as pd

Loading Data and Creating the Database

We'll start by creating a simple SQLite database and importing a CSV file into it. For this example, we'll use a file named finance.csv that contains sales data.

# Load the CSV file into a Pandas DataFrame and fill NaN values with 0
file_url = "finance.csv"
df = pd.read_csv(file_url).fillna(value=0)

# Create a SQLite database and import the data
database_file_path = "test.db"
engine = create_engine(f'sqlite:///{database_file_path}')

df.to_sql(
    'prodinfo',
    con=engine,
    if_exists='replace',
    index=False
)

Setting Up the LLM

To use the LLM, you need to obtain an API key from your provider. Once you have the key, you can initialize the LLM:

# Initialize the LLM with your API key
llm = OpenAI(openai_api_key="your_openai_api_key")

Building the Database Agent

Defining the Agent

We will define an agent that interacts with the database, converts natural language queries to SQL, and retrieves the results.

# Define the agent instructions and format
MSSQL_AGENT_PREFIX = """
You are an agent designed to interact with a SQL database.
## Instructions:
- Given an input question, create a syntactically correct {dialect} query
to run, then look at the results of the query and return the answer.
- Unless the user specifies a specific number of examples they wish to
obtain, **ALWAYS** limit your query to at most {top_k} results.
- You can order the results by a relevant column to return the most
interesting examples in the database.
- Never query for all the columns from a specific table, only ask for
the relevant columns given the question.
- You have access to tools for interacting with the database.
- You MUST double check your query before executing it. If you get an error
while executing a query, rewrite the query and try again.
- DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.)
to the database.
- DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE, ONLY USE THE RESULTS
OF THE CALCULATIONS YOU HAVE DONE.
- Your response should be in Markdown. However, **when running a SQL Query
in "Action Input", do not include the markdown backticks**.
Those are only for formatting the response, not for executing the command.
- ALWAYS, as part of your final answer, explain how you got to the answer
on a section that starts with: "Explanation:". Include the SQL query as
part of the explanation section.
- If the question does not seem related to the database, just return
"I don\'t know" as the answer.
- Only use the below tools. Only use the information returned by the
below tools to construct your query and final answer.
- Do not make up table names, only use the tables returned by any of the
tools below.

## Tools:
"""

MSSQL_AGENT_FORMAT_INSTRUCTIONS = """
## Use the following format:

Question: the input question you must answer.
Thought: you should always think about what to do.
Action: the action to take, should be one of [{tool_names}].
Action Input: the input to the action.
Observation: the result of the action.
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer.
Final Answer: the final answer to the original input question.
"""

# Create the SQLDatabase instance
db = SQLDatabase.from_uri(f'sqlite:///{database_file_path}')
toolkit = SQLDatabaseToolkit(db=db, llm=llm)

# Define the question
QUESTION = """Which was the most popular product in Canada in 2014 in terms of Units Sold?"""

# Create the SQL agent
agent_executor_SQL = create_sql_agent(
    prefix=MSSQL_AGENT_PREFIX,
    format_instructions=MSSQL_AGENT_FORMAT_INSTRUCTIONS,
    llm=llm,
    toolkit=toolkit,
    top_k=30,
    verbose=True
)

# Invoke the agent to answer the question
agent_executor_SQL.invoke(QUESTION)

Explanation

Installing Libraries: The required libraries are installed using pip.
Importing Libraries: Relevant libraries are imported for database interaction, data manipulation, and LLM-based AI agent creation.
Loading Data: The CSV file is loaded into a Pandas DataFrame, and NaN values are replaced with 0.
Creating SQLite Database: The DataFrame is saved into a SQLite database.
Defining Agent Instructions: Instructions and format for the SQL agent are defined to guide its interactions.
Initializing LLM: The LLM is initialized using your API key.
Creating SQL Database and Toolkit: The SQL database and toolkit are created using LangChain.
Defining Question: The question to be answered by the agent is defined.
Creating SQL Agent: The SQL agent is created using the defined instructions, format, LLM, and toolkit.
Invoking Agent: The agent is invoked to answer the question, and the result is displayed.

Make sure to replace "your_openai_api_key" with your actual OpenAI API key before running the script. This code can be executed in a Jupyter Notebook or Python environment to interact with the SQLite database and obtain answers from the AI agent.

Conclusion

Building AI-powered database agents can significantly enhance your ability to interact with and leverage data stored in databases. With tools like LangChain and LLMs, creating such agents becomes straightforward and accessible. By following this tutorial, you can set up your own database agent and customize it to meet your specific needs.

Happy coding! If you have any questions or need further assistance, feel free to leave a comment below.