Building a Chatbot for Multiple PDFs: A Step-by-Step Guide

Revanth Reddy Tondapu
Jun 16, 2024

Welcome to this tutorial, where I'll show you how to build a chatbot that lets you chat with multiple PDFs from your computer at once. Let's dive in and see how it works.

How the Chatbot Works

For this example, I'll upload the Constitution and the Bill of Rights. When I click "Process," the application will read these documents and store them in a database. Once that's done, you can start asking questions. For instance, you can ask about the three branches of the United States government, and the chatbot will pull the information from the Constitution. Similarly, you can ask about the First Amendment, and it will provide the answer from the Bill of Rights.

The chatbot is designed to answer questions about the uploaded PDF documents: for each question, it retrieves the most relevant passages from those files and passes them to the language model as the basis for its answer.

What We'll Use

To build this chatbot, we'll use various tools and libraries:

Streamlit: For creating the graphical user interface (GUI).
PyPDF2: To read the PDF documents.
LangChain: To interact with language models.
OpenAI and Hugging Face: For the AI models (the code shown in this guide uses the OpenAI models).
FAISS: To create a vector store for embedding the text.

Setting Up the Environment

First, let's set up our Python environment. We'll create a virtual environment to keep everything organized. Here's how to do it:

Create and activate the virtual environment:

python -m venv myenv 
source myenv/bin/activate 
# On Windows, use `myenv\Scripts\activate`

This creates a virtual environment named myenv and activates it. A virtual environment is a self-contained directory that contains all the dependencies required for a project.
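To confirm that the environment is really active before installing anything, a quick check along these lines can help. It is a minimal sketch that relies on the standard sys.prefix and sys.base_prefix attributes, which differ when Python runs inside an environment created by venv; the function name in_virtualenv is just an illustrative choice.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```

If this prints False right after activation, the shell is probably still using the system-wide Python.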

Install the necessary libraries:

pip install streamlit pypdf2 langchain python-dotenv faiss-cpu openai tiktoken huggingface-hub

This command installs the libraries we'll use in our project: tools for creating the user interface, reading PDFs, and interacting with AI models. tiktoken is included because LangChain's OpenAIEmbeddings class uses it to count tokens.
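If you want to double-check the installation, the sketch below reports which modules Python cannot find. The names in REQUIRED_MODULES assume each package exposes its usual import name (for example, python-dotenv is imported as dotenv and faiss-cpu as faiss), and missing_modules is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages this guide relies on
REQUIRED_MODULES = [
    "streamlit",
    "PyPDF2",
    "langchain",
    "dotenv",           # installed as python-dotenv
    "faiss",            # installed as faiss-cpu
    "openai",
    "tiktoken",         # used by LangChain's OpenAIEmbeddings to count tokens
    "huggingface_hub",  # installed as huggingface-hub
]


def missing_modules(names):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it with the virtual environment active; anything it lists still needs a pip install.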

Building the Graphical User Interface

Next, we set up the main part of our application using Streamlit. This allows users to upload PDFs and ask questions.

Create a file called app.py:

import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import os

# Load environment variables
load_dotenv()

st.set_page_config(page_title="Chat with Multiple PDFs", page_icon="📚")

st.header("Chat with Multiple PDFs 📚")

query = st.text_input("Ask a question about your documents:")

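# The helper functions called below are defined in the sections that follow;
# in the finished app.py they must appear above this block.
with st.sidebar:
    st.subheader("Your Documents")
    pdf_docs = st.file_uploader("Upload your PDFs here and click on Process", type="pdf", accept_multiple_files=True)
    if st.button("Process"):
        if not pdf_docs:
            # Nothing uploaded yet, so there is nothing to read
            st.warning("Please upload at least one PDF first.")
        else:
            with st.spinner("Processing..."):
                raw_text = get_pdf_text(pdf_docs)
                text_chunks = get_text_chunks(raw_text)
                vector_store = get_vector_store(text_chunks)
                st.session_state.conversation = get_conversation_chain(vector_store)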

Explanation:

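Imports: We import the necessary libraries, including Streamlit for the GUI, PyPDF2 for reading PDFs, and various components from LangChain for interacting with AI models. These import paths match the LangChain releases that were current when this guide was written; newer releases move several of them into separate packages such as langchain-community and langchain-openai, so you may need to pin an older version or adjust the imports.
Load environment variables: The load_dotenv() function loads environment variables from a .env file. This is where the OpenAI API key goes, typically under the name OPENAI_API_KEY.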
Set up Streamlit: We configure the Streamlit page and create a header.
Query input: We add a text input field for users to ask questions.
Sidebar: We add a sidebar where users can upload PDFs and click a button to process them. When the button is clicked, the PDFs are read, split into chunks, and stored in a vector store. This data is then used to create a conversation chain.
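Because the app depends on values loaded by load_dotenv(), it can help to fail early with a clear message when one is missing. The following is a minimal sketch, assuming the OpenAI key is stored under the conventional name OPENAI_API_KEY in your .env file; missing_settings is a hypothetical helper introduced here for illustration.

```python
import os


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Default to the real process environment; a dict can be passed in for testing
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Assumption: the key is provided by a .env line such as
    # OPENAI_API_KEY=your-key-here
    missing = missing_settings(["OPENAI_API_KEY"])
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

In app.py, a check like this could run right after load_dotenv() and report the problem with st.error instead of print.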

Processing the PDFs

Next, we'll read the text from the uploaded PDFs, split it into chunks, and create embeddings for those chunks.

Extract text from PDFs:

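def get_pdf_text(pdf_docs):
    from PyPDF2 import PdfReader
    text = ""
    for pdf in pdf_docs:
        # Each uploaded file is a file-like object that PdfReader can open directly
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # Fall back to "" so a page without extractable text cannot break the concatenation
            text += page.extract_text() or ""
    return text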

Explanation:

Function get_pdf_text: This function takes a list of PDF documents, reads each page, and extracts the text. The extracted text from all PDFs is concatenated into a single string and returned.
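One detail worth seeing in isolation is how page text gets joined. The sketch below uses a small stand-in class instead of real PDF pages so it runs without any files; FakePage and join_page_text are illustrative names, not part of PyPDF2. It joins pages with a newline so the last word of one page does not run into the first word of the next, and it treats a page with no extractable text (a scanned image, for example) as an empty string.

```python
class FakePage:
    """Stand-in for a PDF page; the real app gets its pages from PdfReader."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def join_page_text(pages):
    """Join the text of all pages, one page per line break."""
    parts = []
    for page in pages:
        # Treat a missing or empty result as an empty string
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


pages = [FakePage("We the People"), FakePage(None), FakePage("Article I")]
print(join_page_text(pages))
```

The same pattern could be applied inside get_pdf_text by collecting page strings in a list and joining them once at the end.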

Split text into chunks:

def get_text_chunks(text):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, length_function=len)
    chunks = text_splitter.split_text(text)
    return chunks

Explanation:

Function get_text_chunks: This function takes a large string of text and splits it into smaller chunks. It uses the RecursiveCharacterTextSplitter to ensure that the chunks are of manageable size and overlap slightly to preserve context.
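To make the roles of chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch that cuts text into fixed-width windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as blank lines and spaces); sliding_chunks is a hypothetical function that only illustrates what "overlap" means.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


for chunk in sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Each window repeats the last two characters of the previous one. With the settings used in this guide (chunk_size=1000, chunk_overlap=200), windows in this sketch would start 800 characters apart.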

Create vector store:

def get_vector_store(text_chunks):
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    return vector_store

Explanation:

Function get_vector_store: This function takes the text chunks and converts them into embeddings using OpenAI's embedding model. These embeddings are then stored in a FAISS vector store, which allows for efficient similarity searches.
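The idea behind a similarity search can be shown with a toy example that needs no API key. In the sketch below, word counts stand in for real embeddings and a plain sorted list stands in for the FAISS index, so this is an assumption-heavy illustration of the concept rather than what OpenAIEmbeddings or FAISS actually compute; embed, cosine_similarity, and most_similar are hypothetical names, and the sample chunks are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts stand in for a learned vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, chunks, k=1):
    """Return the k chunks whose vectors are closest to the query's vector."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "congress has the power to collect taxes",
    "the president holds the executive power",
    "the first amendment protects freedom of speech",
]
print(most_similar("freedom of speech", chunks))
```

The real vector store does the same job at scale: the question is embedded, and the chunks with the nearest vectors are returned as context for the answer.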

Setting Up the Conversation Chain

We'll create a conversation chain that includes memory, so the chatbot can remember the context of the conversation.
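Conceptually, "memory" means that earlier turns are fed back to the model along with the new question. The sketch below illustrates only that idea with plain strings; it is a stand-in, not LangChain's actual prompt or the internals of ConversationalRetrievalChain, and build_prompt is a hypothetical helper.

```python
def build_prompt(chat_history, question, max_turns=3):
    """Combine the most recent (question, answer) pairs with the new question."""
    # Keep only the latest turns so the prompt stays short
    recent = chat_history[-max_turns:] if max_turns > 0 else []
    lines = []
    for past_question, past_answer in recent:
        lines.append(f"User: {past_question}")
        lines.append(f"Assistant: {past_answer}")
    lines.append(f"User: {question}")
    return "\n".join(lines)


history = [
    ("What are the three branches of government?",
     "The legislative, executive, and judicial branches."),
]
print(build_prompt(history, "Which one makes the laws?"))
```

The follow-up "Which one makes the laws?" only makes sense because the previous turn is included, which is the kind of context the memory object preserves.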

Create the conversation chain:

def get_conversation_chain(vector_store):
    from langchain.memory import ConversationBufferMemory
    from langchain.chains import ConversationalRetrievalChain
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=OpenAI(),
        retriever=vector_store.as_retriever(),
        memory=memory
    )
    return conversation_chain

Explanation:

Function get_conversation_chain: This function creates a conversation chain that uses memory to keep track of the chat history. It initializes a ConversationBufferMemory and a ConversationalRetrievalChain using the OpenAI model and the vector store.

Handle user input and generate responses:

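def handle_user_input(user_question):
    if "conversation" not in st.session_state:  # set only after "Process" has run
        st.warning("Upload and process your PDFs before asking a question.")
        return
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history.append((user_question, response['answer']))

Explanation:

Function handle_user_input: This function takes the user's question, generates a response using the conversation chain, and appends the question and answer to the chat history. If no documents have been processed yet, it shows a warning instead.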
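Streamlit reruns the whole script on every interaction, so the same question can reach handle_user_input more than once. The sketch below shows one way to guard against that, using a plain dict in place of st.session_state and a stub in place of the conversation chain so it runs on its own; record_turn, last_question, and fake_chain are hypothetical names.

```python
def record_turn(state, question, ask):
    """Ask `question` once and store the (question, answer) pair in `state`."""
    history = state.setdefault("chat_history", [])
    # Skip a question that was already answered on the previous run of the script
    if state.get("last_question") == question:
        return None
    answer = ask(question)
    history.append((question, answer))
    state["last_question"] = question
    return answer


def fake_chain(question):
    """Stub standing in for the conversation chain."""
    return f"(answer to: {question})"


state = {}
record_turn(state, "What is the First Amendment?", fake_chain)
record_turn(state, "What is the First Amendment?", fake_chain)  # simulated rerun, ignored
print(len(state["chat_history"]))
```

The trade-off of this simple guard is that a user cannot deliberately ask the identical question twice in a row.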

Displaying the Chat History

We'll display the chat history using custom HTML templates.

Create HTML templates:

user_template = '<div class="chat-message user">{}</div>'
bot_template = '<div class="chat-message bot">{}</div>'
css = """
<style>
    .chat-message {padding: 10px; margin: 10px 0; border-radius: 5px;}
    .chat-message.user {background-color: #dcf8c6;}
    .chat-message.bot {background-color: #f1f0f0;}
</style>
"""
st.write(css, unsafe_allow_html=True)

Explanation:

HTML Templates: These templates define the structure and style of the chat messages. The user_template is for messages from the user, and the bot_template is for messages from the chatbot. The css variable contains the styling for these templates.
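Because the messages are inserted into HTML and rendered with unsafe_allow_html=True, any markup contained in a question or an answer would be interpreted by the browser. Here is a minimal sketch of one way to avoid that, using the standard library's html.escape; render_message is a hypothetical helper, and the templates are copied from above.

```python
import html

user_template = '<div class="chat-message user">{}</div>'
bot_template = '<div class="chat-message bot">{}</div>'


def render_message(template, message):
    """Fill a chat template with the message, escaping HTML special characters first."""
    return template.format(html.escape(message))


print(render_message(user_template, "What does <b>Article I</b> say?"))
```

In the app, the result of render_message could be passed to st.write in place of the bare template.format call.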

Display messages:

if 'chat_history' not in st.session_state:
    st.session_state['chat_history'] = []

# Handle the new question first so its answer is part of the history rendered below
if query:
    handle_user_input(query)

for user_msg, bot_msg in st.session_state.chat_history:
    st.write(user_template.format(user_msg), unsafe_allow_html=True)
    st.write(bot_template.format(bot_msg), unsafe_allow_html=True)

Explanation:

Displaying the Chat History: This code checks if chat_history exists in the session state and initializes it if not. If a query has been submitted, it calls handle_user_input first, so the new answer is already in the history when the loop renders each user message and bot response using the HTML templates.