
How to create a dataframe in Python

Introduction to Dataframes

If you're learning programming, chances are that you've come across the term dataframe. But what exactly is a dataframe? In the simplest terms, a dataframe is a data structure used for storing and organizing data in a tabular form, similar to an Excel spreadsheet or a SQL table. It consists of rows and columns, where each row represents an observation or a data point, and each column represents a variable or a feature of the data.

Dataframes are incredibly useful for working with large datasets and for performing data analysis tasks. They allow you to easily manipulate, filter, and visualize data, which is essential for understanding trends and making informed decisions. In this blog post, we'll learn how to create a dataframe in Python using a popular library called pandas.

Introducing Pandas

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of the NumPy library and has become the go-to library for data manipulation and analysis in Python. Its central data structure is the DataFrame, a labeled two-dimensional table, which is what we mean by a dataframe throughout this post.

To start using pandas, you'll need to install it first. You can do this by running the following command in your terminal or command prompt:

pip install pandas

Once pandas is installed, you can import it in your Python script or notebook using the following line of code:

import pandas as pd

We use the alias pd for pandas, which is a common convention in the Python data science community.
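If you want to confirm that the installation worked, one quick check is to import pandas and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed pandas version to confirm the import works
print(pd.__version__)
```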

Creating a Dataframe from Scratch

There are several ways to create a dataframe in pandas. We'll start by creating a dataframe from scratch using a Python dictionary.

Using a Python Dictionary

A Python dictionary is a collection of key-value pairs, where each key is associated with a value. To create a dataframe from a dictionary, we can use the pd.DataFrame() function and pass the dictionary as an argument. The keys in the dictionary will become the column names in the dataframe, and the values will be the data in the columns. When the values are lists, they must all have the same length; otherwise pandas raises a ValueError.

Here's an example:

# Define a dictionary of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

# Create a dataframe from the dictionary
df = pd.DataFrame(data)

# Display the dataframe
print(df)

This will output the following dataframe:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   40        Chicago

You can see that the dataframe has rows and columns with labels. By default, pandas assigns integer labels for the rows starting from 0. You can also specify custom row labels by setting the index parameter in the pd.DataFrame() function.

# Create a dataframe with custom row labels
df = pd.DataFrame(data, index=['Person 1', 'Person 2', 'Person 3', 'Person 4'])

# Display the dataframe
print(df)

This will output:

             Name  Age           City
Person 1    Alice   25       New York
Person 2      Bob   30  San Francisco
Person 3  Charlie   35    Los Angeles
Person 4    David   40        Chicago
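Row labels are not just for display: they can also be used to look up data. The following is a minimal sketch that reuses the sample data from above, inspects the dataframe's dimensions and labels, and then selects a row and a single value by label with df.loc. The variable names row and city are only illustrative.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data, index=['Person 1', 'Person 2', 'Person 3', 'Person 4'])

# shape is a (number of rows, number of columns) tuple
print(df.shape)

# Column names and row labels
print(list(df.columns))
print(list(df.index))

# Select one row by its label; the result is a pandas Series
row = df.loc['Person 2']
print(row['Name'], row['Age'])

# Select a single value by row label and column name
city = df.loc['Person 3', 'City']
print(city)
```

Label-based lookups like these work the same way with the default integer labels, for example df.loc[0] on a dataframe created without the index parameter.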

Using Lists

Another way to create a dataframe is by using lists. You can create a dataframe by passing a list of lists to the pd.DataFrame() function, where each inner list represents a row in the dataframe. You'll also need to provide the column names using the columns parameter.

Here's an example:

# Define a list of data
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'San Francisco'],
    ['Charlie', 35, 'Los Angeles'],
    ['David', 40, 'Chicago']
]

# Define the column names
columns = ['Name', 'Age', 'City']

# Create a dataframe from the list and column names
df = pd.DataFrame(data, columns=columns)

# Display the dataframe
print(df)

This will output the same dataframe as the first dictionary example, with the default integer row labels:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   40        Chicago
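Two variations on the list approach are worth knowing. The sketch below, which uses a shortened version of the sample data, shows what happens when the columns parameter is left out, and how a list of dictionaries (one dictionary per row) carries its own column names. The names rows, records, df_unnamed, and df_records are chosen here for illustration.

```python
import pandas as pd

rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'San Francisco'],
]

# Without the columns parameter, pandas labels the columns 0, 1, 2
df_unnamed = pd.DataFrame(rows)
print(list(df_unnamed.columns))

# A list of dictionaries: each dictionary is one row,
# and its keys become the column names
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'},
]
df_records = pd.DataFrame(records)
print(df_records)
```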

Creating a Dataframe from External Data

In most cases, you'll be working with data stored in external files, such as CSV, Excel, or JSON files. Pandas provides several functions to read data from these files and create a dataframe.

Reading from a CSV File

A CSV (Comma-Separated Values) file is a plain text file where each line represents a row in the table, and the values in each row are separated by commas. To read a CSV file and create a dataframe, you can use the pd.read_csv() function. Simply pass the file path as an argument.

Let's assume we have a CSV file called data.csv with the following content:

Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Los Angeles
David,40,Chicago

To read this file and create a dataframe, use the following code:

# Read data from the CSV file
df = pd.read_csv('data.csv')

# Display the dataframe
print(df)

This will output the same dataframe as earlier:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   40        Chicago
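If you want to try pd.read_csv() without creating a file first, you can pass it a file-like object instead of a path. The sketch below wraps the same CSV content in io.StringIO, and also shows the index_col parameter, which uses one of the columns as the row labels. The variable names csv_text and df_by_name are illustrative.

```python
import io

import pandas as pd

# The same content as data.csv, held in a string for a self-contained example
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,San Francisco
Charlie,35,Los Angeles
David,40,Chicago
"""

# read_csv accepts a file-like object as well as a file path
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Use the Name column as the row labels instead of 0, 1, 2, ...
df_by_name = pd.read_csv(io.StringIO(csv_text), index_col='Name')
print(df_by_name.loc['Bob', 'City'])
```

Setting index_col is a design choice: it is convenient when one column uniquely identifies each row, but that column is then no longer a regular data column.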

Reading from an Excel File

To read data from an Excel file in the .xlsx format, you'll need to install the openpyxl library first by running:

pip install openpyxl

Once you have openpyxl installed, you can use the pd.read_excel() function to create a dataframe from an Excel file. Pass the file path and, optionally, the sheet name. If you omit sheet_name, pandas reads the first sheet in the workbook.

Let's assume we have an Excel file called data.xlsx with the same data as the CSV example. To read this file and create a dataframe, use the following code:

# Read data from the Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Display the dataframe
print(df)

This will output the same dataframe as earlier:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   40        Chicago
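If you don't have an Excel file at hand, one way to experiment is to write a dataframe out with to_excel() and read it back. The following is a rough sketch under a few assumptions: it writes to a temporary directory, it skips the round trip when openpyxl is not installed, and the names path and restored are illustrative.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
})

restored = None
if importlib.util.find_spec('openpyxl') is None:
    print('openpyxl is not installed; skipping the Excel round trip')
else:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'data.xlsx'
        # index=False keeps the row labels out of the spreadsheet
        df.to_excel(path, sheet_name='Sheet1', index=False)
        # Read the sheet back into a new dataframe
        restored = pd.read_excel(path, sheet_name='Sheet1')
    print(restored)
```

Writing with index=False matters here: without it, the row labels are saved as an extra first column and show up as an additional column when the file is read back.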

Reading from a JSON File

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. To read a JSON file and create a dataframe, you can use the pd.read_json() function. Pass the file path as an argument.

Let's assume we have a JSON file called data.json with the following content:

[
  {"Name": "Alice", "Age": 25, "City": "New York"},
  {"Name": "Bob", "Age": 30, "City": "San Francisco"},
  {"Name": "Charlie", "Age": 35, "City": "Los Angeles"},
  {"Name": "David", "Age": 40, "City": "Chicago"}
]

To read this file and create a dataframe, use the following code:

# Read data from the JSON file
df = pd.read_json('data.json')

# Display the dataframe
print(df)

This will output the same dataframe as earlier:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
3    David   40        Chicago
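As with CSV, pd.read_json() also accepts a file-like object, which is handy for trying it out without a file on disk. The sketch below wraps the same JSON content in io.StringIO; the variable name json_text is illustrative. Passing the raw string directly is deprecated in recent pandas versions, which is why the wrapper is used.

```python
import io

import pandas as pd

# The same content as data.json: a list of records, one object per row
json_text = """[
  {"Name": "Alice", "Age": 25, "City": "New York"},
  {"Name": "Bob", "Age": 30, "City": "San Francisco"},
  {"Name": "Charlie", "Age": 35, "City": "Los Angeles"},
  {"Name": "David", "Age": 40, "City": "Chicago"}
]"""

# Wrap the string in StringIO so read_json treats it as a file
df = pd.read_json(io.StringIO(json_text))
print(df)
```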

Conclusion

In this blog post, we've learned how to create a dataframe in Python using the pandas library. We've discussed how to create a dataframe from scratch using dictionaries and lists, as well as how to read data from external files such as CSV, Excel, and JSON files.

By now, you should have a good understanding of how dataframes work and how to create them using pandas. Going forward, you can use this powerful data structure to store, analyze, and visualize your data, making your programming tasks easier and more efficient.