By Revanth Reddy Tondapu

Exploring Crawl4AI: Enhancing AI Agents with Advanced Web Crawling and Scraping Capabilities



In this post, we are going to explore Crawl4AI, an open-source, LLM-friendly web crawler and scraper. This tool enables you to extract structured data from web pages automatically and integrate it seamlessly with AI agents. By the end of this tutorial, you will learn how to:

  1. Perform basic web scraping using Crawl4AI.

  2. Convert unstructured data into structured JSON format.

  3. Integrate Crawl4AI with AI agents for automated data extraction, cleaning, and analysis.


What is Crawl4AI?

Crawl4AI is an open-source tool designed to efficiently crawl and scrape web pages. Here are some of its key features:

  • Free and Open Source: Completely free to use and modify.

  • LLM-Friendly Output: Outputs data in formats like JSON, cleaned HTML, and Markdown.

  • Versatile Data Extraction: Can extract images, audio, video, links, metadata, and take screenshots of web pages.

  • Advanced Crawling Capabilities: Supports scrolling and handling multiple URLs simultaneously.

  • Various Chunking Strategies: Offers different methods to chunk data for efficient processing.

Let's walk through the process of using Crawl4AI to scrape a webpage and integrate it with AI agents.


Step 1: Installing Crawl4AI

First, install Crawl4AI and its dependencies. Open your terminal and run the following commands:

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk

Next, export your OpenAI API key:

export OPENAI_API_KEY='your_api_key_here'
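
Before moving on, it can save some debugging time to confirm the key is actually visible to Python in the same shell; a quick sanity check:

import os

# Fails fast if OPENAI_API_KEY was not exported in the current shell session
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in this environment"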

Step 2: Basic Web Scraping with Crawl4AI

Create a file named app.py and add the following code to initialize and run the web crawler:

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")

# Print the extracted content in Markdown format
print(result.markdown)

Explanation:

  • Initialize: Create an instance of WebCrawler.

  • Warm up: Load necessary models to prepare the crawler.

  • Run: Perform the web crawling on the specified URL.

  • Print: Display the extracted content in Markdown format.

Run this script in your terminal:

python app.py

This basic example demonstrates how to scrape a webpage and print the extracted data.
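
Markdown is not the only thing you get back. Depending on the crawl4ai version installed, the result object also exposes the cleaned HTML plus the links and media found on the page; a small sketch (treat the exact attribute set as version-dependent):

# Other fields typically available on the crawl result
print(result.cleaned_html[:200])  # sanitized HTML of the page
print(result.links)               # links discovered on the page
print(result.media)               # image/audio/video references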


Step 3: Converting Unstructured Data to Structured Data

To convert unstructured data into structured JSON format, we'll use an LLM. Update your app.py file as follows:

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

# Define a model for the structured data
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

# URL to be crawled
url = 'https://openai.com/api/pricing/'

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler
crawler.warmup()

# Run the crawler with the extraction strategy
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
        Do not miss any models in the entire content. One extracted model JSON format should look like this: 
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),            
    bypass_cache=True,
)

# Print the extracted structured content
print(result.extracted_content)

Explanation:

  • Model Definition: Define a Pydantic model OpenAIModelFee for the structured data.

  • Extraction Strategy: Use LLMExtractionStrategy to define how to extract the data.

  • Run with Strategy: Perform crawling with the specified extraction strategy.

Run this updated script in your terminal:

python app.py

The output will be structured JSON data based on the provided schema and extraction instructions.
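
Since extracted_content comes back as a JSON string, you can load it and validate each record with the same Pydantic model; a minimal sketch, assuming the LLM returned a list of objects matching the OpenAIModelFee schema:

import json

# Parse the JSON string produced by the LLM extraction strategy
records = json.loads(result.extracted_content)

# Re-validate each record against the schema defined above
for record in records:
    fee = OpenAIModelFee(**record)
    print(f"{fee.model_name}: input {fee.input_fee}, output {fee.output_fee}")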


Step 4: Integrating with AI Agents

Now, let's integrate Crawl4AI with AI agents. We'll use PraisonAI to manage the agents and automate the data extraction, cleaning, and analysis process.

First, install the necessary package:

pip install praisonai

Create a configuration file named agents.yaml with the following content:

framework: crewai
topic: extract model pricing from websites
roles:
  web_scraper:
    backstory: An expert in web scraping with a deep understanding of extracting structured data from online sources. 
    goal: Gather model pricing data from various websites
    role: Web Scraper
    tasks:
      scrape_model_pricing:
        description: Scrape model pricing information from the target pricing pages (https://openai.com/api/pricing/, https://www.anthropic.com/pricing, and https://cohere.com/pricing).
        expected_output: Raw HTML or JSON containing model pricing data.
    tools:
    - 'ModelFeeTool'
  data_cleaner:
    backstory: Specialist in data cleaning, ensuring that all collected data is accurate and properly formatted.
    goal: Clean and organize the scraped pricing data
    role: Data Cleaner
    tasks:
      clean_pricing_data:
        description: Process the raw scraped data to remove any duplicates and inconsistencies, and convert it into a structured format.
        expected_output: Cleaned and organized JSON or CSV file with model pricing data.
    tools:
    - ''
  data_analyzer:
    backstory: Data analysis expert focused on deriving actionable insights from structured data.
    goal: Analyze the cleaned pricing data to extract insights
    role: Data Analyzer
    tasks:
      analyze_pricing_data:
        description: Analyze the cleaned data to extract trends, patterns, and insights on model pricing.
        expected_output: Detailed report summarizing model pricing trends and insights.
    tools:
    - ''
dependencies: []

Next, create a Python script named tools.py (in the same directory as agents.yaml) to define the ModelFeeTool used by the web scraper agent:

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool

# Define the structured data model
class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")

# Define the tool for extracting model fees
class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        # Create an instance of WebCrawler
        crawler = WebCrawler()
        
        # Warm up the crawler
        crawler.warmup()

        # Run the crawler with the extraction strategy
        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'), 
                schema=ModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),            
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    # Test the ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)

Explanation:

  • Model Definition: Define a Pydantic model ModelFee for the structured data.

  • Tool Definition: Define ModelFeeTool to use the WebCrawler and extract data based on the specified schema and instructions.

  • Run Method: Implement the _run method to perform the crawling and return the extracted data.

Finally, create a Python script named run_agents.py to run the agents:

from praisonai import PraisonAI

# Initialize PraisonAI with the agent configuration file;
# the ModelFeeTool referenced in agents.yaml is defined in tools.py alongside it
praisonai = PraisonAI(agent_file="agents.yaml")

# Run the agents defined in agents.yaml (web scraper -> data cleaner -> data analyzer)
result = praisonai.run()

# Print whatever the run returns (most of the agents' output is also streamed to the console)
print("Final Result:\n")
print(result)

Explanation:

  • Initialize: Create a PraisonAI instance pointing at the agents.yaml configuration file.

  • Run Agents: Execute the agents defined in the configuration; the web scraper agent uses ModelFeeTool on the pricing pages named in its task, and the cleaner and analyzer agents process its output.

  • Print Results: Display the final output returned by the agents.
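
If you prefer not to maintain a separate runner script, the praisonai command-line tool installed with the package can run the same configuration; assuming your version auto-loads custom tools from a tools.py in the working directory, this is roughly equivalent:

praisonai agents.yaml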


Running the Complete Setup

To run the complete setup, follow these steps:

  1. Ensure you have created the following files:

  • app.py for basic web scraping.

  • agents.yaml for agent configuration.

  • tools.py for defining the scraping tool.

  • run_agents.py for executing the agents.

  2. Install the necessary packages:

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk praisonai

  3. Export your OpenAI API key:

export OPENAI_API_KEY='your_api_key_here'


  4. Run the basic web scraping script to ensure everything is set up correctly:

python app.py

  5. Execute the agents to perform automated web scraping, data cleaning, and analysis:

python run_agents.py

Conclusion

In this post, we explored how to use Crawl4AI to enhance AI agents with advanced web crawling and scraping capabilities. We covered the following steps:

  1. Basic web scraping using Crawl4AI.

  2. Converting unstructured data into structured JSON format.

  3. Integrating Crawl4AI with AI agents for automated data extraction, cleaning, and analysis.

By leveraging the power of Crawl4AI and AI agents, you can automate the process of gathering and analyzing data from various web sources efficiently. This setup can be further customized and extended to suit specific requirements and use cases.

Stay tuned for more tutorials and updates on leveraging AI and automation in web scraping and data analysis. If you found this post helpful, don't forget to like, share, and subscribe for more content!

This concludes our tutorial on using Crawl4AI for automated web scraping, data structuring, and integrating with AI agents. If you have any questions or need further assistance, feel free to reach out. Happy coding!

Happy scraping and automating! 🚀
