Part 8: Pandas Adventures: Reading JSON, HTML, and Excel Data Like a Pro

Reading JSON, HTML, and Excel Data in Python

Hello everyone! Today, we continue our exciting journey into the world of Pandas, the powerful Python library for data manipulation and analysis. In our previous post, we explored reading CSV files and using various properties inside read_csv. Now, let's dive into reading JSON, HTML, and Excel data and learn about some advanced techniques.

Reading JSON Data with Pandas

JSON (JavaScript Object Notation) is a popular format for data exchange. It’s like a dictionary in Python, with key-value pairs. Let’s see how we can read JSON data with Pandas.

Example JSON Data

Here’s a simple JSON structure:

import pandas as pd

json_data = {
    "employee_name": "James",
    "email": "james@example.com",
    "job_profile": {
        "title1": "Team Lead",
        "title2": "Senior Developer"
    }
}

To read this JSON data and convert it to a Pandas DataFrame, we can use the read_json function.

df = pd.read_json('path_to_json_file.json')
print(df.head())

Handling Nested JSON

If your JSON data is nested, like the one above, Pandas will not automatically flatten the nested structures. Instead, you might need to do additional processing.

nested_json = '{"employee_name": "James", "email": "james@example.com", "job_profile": {"title1": "Team Lead", "title2": "Senior Developer"}}'
df_nested = pd.read_json(nested_json)
print(df_nested)

Reading JSON from URLs

Pandas can also read JSON data directly from a URL.

url = 'https://example.com/data.json'
df_url = pd.read_json(url)
print(df_url.head())

Writing DataFrames to JSON

You can convert a DataFrame back to a JSON string using the to_json function.

json_str = df.to_json(orient='records')
print(json_str)

Reading HTML Tables with Pandas

Pandas can read tables directly from HTML content. This is useful for web scraping.

Example HTML URL

Let’s consider a URL that contains an HTML table.

url = 'https://example.com/table.html'
tables = pd.read_html(url)
print(tables[0].head())

Here, read_html returns a list of all tables found on the webpage. You can then access each table using its index.

Reading Excel Files with Pandas

Excel files are commonly used for data storage. Pandas makes it easy to read them.

Reading an Excel File

You can read an Excel file using the read_excel function. If your Excel file contains multiple sheets, you can specify the sheet name.

df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel.head())

Pickling DataFrames

Pickling is a way to save the structure of a DataFrame to disk. It’s useful for saving the state of your data analysis.

Saving a DataFrame as a Pickle File

df.to_pickle('data.pkl')

Loading a Pickle File

df_pickle = pd.read_pickle('data.pkl')
print(df_pickle.head())

Conclusion

Today, we explored how to read JSON, HTML, and Excel data using Pandas. We also learned about converting DataFrames to JSON and saving/loading DataFrames using pickle files. These techniques are incredibly useful for handling different types of data sources and ensuring your data analysis process is efficient and effective.

Keep practicing these techniques and experiment with different types of data. In our next post, we will explore reading data from MongoDB and SQL databases. Stay tuned for more exciting tutorials!

Thank you for reading! If you found this post helpful, share it with your friends and family. Happy learning!