How to read a CSV file in Python

Introduction

CSV (Comma Separated Values) files are a popular way to store and share data because of their simplicity, versatility, and ability to be read by both humans and machines. They're often used in data science, machine learning, and web development projects.

In this blog post, we'll explore how to read CSV files in Python using various approaches, such as the built-in csv module and the popular pandas library. Along the way, we'll provide code examples and practical tips to help you understand and apply these techniques in your own projects.

Prerequisites

To follow along, you'll need to have Python installed on your machine. The examples in this tutorial target Python 3; Python 2 has reached end of life and isn't covered here. You can download Python from the official website at https://www.python.org/downloads/.

Additionally, we'll be using the pandas library, which you can install using the following command if you don't already have it:

pip install pandas

Now that we have Python and pandas installed, let's dive into reading CSV files.

Reading CSV files using the built-in csv module

Python comes with a built-in csv module that provides two primary tools for reading CSV files: the csv.reader() function and the csv.DictReader() class. We'll explore both of them and their differences below.

Using csv.reader()

csv.reader() is a function that reads CSV files and returns an iterable object, which can be used to access rows in the CSV file one at a time. To use csv.reader(), we'll need to follow these steps:

  1. Open the CSV file using Python's built-in open() function.
  2. Create a csv.reader() object using the opened file.
  3. Iterate through the rows in the reader object and process each row as needed.

Here's an example of how to read a simple CSV file using csv.reader():

import csv

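# Open the CSV file (newline="" lets the csv module handle line endings itself,
# as the csv documentation recommends)
with open("example.csv", "r", newline="") as csvfile: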
    # Create a CSV reader object
    csvreader = csv.reader(csvfile)

    # Iterate through the rows in the reader object
    for row in csvreader:
        # Process each row as needed
        print(row)

In this example, we're opening a file called "example.csv" in read mode ("r") and creating a csv.reader() object using the opened file. We then iterate through the rows in the reader object using a for loop and print each row to the console. Each row is printed as a list of strings, one list per line of the CSV file. Note that every value is a string, including numbers, so convert values explicitly when you need another type.
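To see what those rows look like without needing a file on disk, here is a minimal sketch that feeds csv.reader() from an in-memory string. The sample data, the column names, and the read_rows helper are made up for illustration; io.StringIO simply stands in for the opened file.

```python
import csv
import io

# Hypothetical sample data standing in for the contents of example.csv
sample = "name,age\nAlice,30\nBob,25\n"

def read_rows(text):
    """Return (header, rows) parsed from a string of CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first row holds the column names
    rows = list(reader)    # the remaining rows are the data
    return header, rows

header, rows = read_rows(sample)
print(header)  # ['name', 'age']
print(rows)    # [['Alice', '30'], ['Bob', '25']]

# Every field is a string, so convert explicitly when numbers are needed
ages = [int(row[1]) for row in rows]
print(ages)    # [30, 25]
```

Calling next() on the reader is one common way to separate the header row from the data rows before looping over the rest.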

One thing to note is that csv.reader() assumes a comma delimiter by default; it does not detect other delimiters (e.g., tabs or pipes) on its own. If your CSV file uses a different delimiter, you can specify it using the delimiter parameter, like this:

csvreader = csv.reader(csvfile, delimiter="\t")

This would tell the csv.reader() function to use tabs as the delimiter instead of commas.
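The effect of the delimiter parameter is easy to check with a small experiment. The sketch below uses made-up tab-separated and pipe-separated strings and a hypothetical parse helper; it also shows what happens when the delimiter doesn't match the data.

```python
import csv
import io

# Hypothetical sample data: one tab-separated string, one pipe-separated
tsv_text = "name\tcity\nAlice\tParis\nBob\tNew York\n"
pipe_text = "name|city\nAlice|Paris\n"

def parse(text, delimiter):
    """Parse delimited text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

tsv_rows = parse(tsv_text, "\t")
pipe_rows = parse(pipe_text, "|")
print(tsv_rows)   # [['name', 'city'], ['Alice', 'Paris'], ['Bob', 'New York']]
print(pipe_rows)  # [['name', 'city'], ['Alice', 'Paris']]

# With the wrong delimiter, each line comes back as a single unsplit field
wrong_rows = parse(tsv_text, ",")
print(wrong_rows[0])  # ['name\tcity']
```

If every row comes back as a one-element list, a mismatched delimiter is a likely cause.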

Using csv.DictReader()

csv.DictReader() is another tool provided by the csv module that reads CSV files and returns an iterable object. However, instead of yielding each row as a list like csv.reader(), csv.DictReader() yields each row as a dictionary, with keys corresponding to the column names (taken from the first row of the file by default) and values corresponding to the data in each cell.

Here's an example of how to read a CSV file using csv.DictReader():

import csv

# Open the CSV file (newline="" lets the csv module handle line endings itself)
with open("example.csv", "r", newline="") as csvfile:
    # Create a CSV DictReader object
    csvdictreader = csv.DictReader(csvfile)

    # Iterate through the rows in the DictReader object
    for row in csvdictreader:
        # Process each row as needed
        print(row)

In this example, we're opening a file called "example.csv" in read mode ("r") and creating a csv.DictReader() object using the opened file. We then iterate through the rows in the DictReader object using a for loop and print each row to the console. Each row is printed as a dictionary that maps column names to that row's values, and the header row itself is not printed because it is used for the keys.
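The main benefit of dictionaries is that you can look up values by column name instead of by position. Here is a minimal sketch using made-up in-memory data (the name, age, and city columns are assumptions for illustration):

```python
import csv
import io

# Hypothetical sample data standing in for the contents of example.csv
sample = "name,age,city\nAlice,30,Paris\nBob,25,Berlin\n"

reader = csv.DictReader(io.StringIO(sample))

# The column names are read from the first row
print(reader.fieldnames)  # ['name', 'age', 'city']

people = []
for row in reader:
    # Access fields by column name; values are strings, so convert as needed
    people.append({"name": row["name"], "age": int(row["age"])})

print(people)  # [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]
```

Because the lookup uses the column name, this code keeps working even if the columns in the file are reordered.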

Just like with csv.reader(), if you need to specify a different delimiter, you can do so using the delimiter parameter:

csvdictreader = csv.DictReader(csvfile, delimiter="\t")

Reading CSV files using the pandas library

While the built-in csv module is useful for simple CSV files, it can be cumbersome to work with when dealing with more complex data structures, missing values, and data manipulation tasks. This is where the pandas library comes in handy.

pandas is a powerful open-source library that provides easy-to-use data structures and data analysis tools for Python. It's especially useful for working with structured data, like CSV files.

To read a CSV file using pandas, we can use the pd.read_csv() function, which returns a DataFrame object. A DataFrame is a two-dimensional table with labeled axes (rows and columns) that can be easily manipulated and analyzed.

Here's an example of how to read a CSV file using pandas:

import pandas as pd

# Read the CSV file using pandas
data = pd.read_csv("example.csv")

# Display the first few rows of the DataFrame
print(data.head())

In this example, we're reading a CSV file called "example.csv" using the pd.read_csv() function and storing the resulting DataFrame in a variable called data. We then display the first few rows of the DataFrame using the head() method.

pandas assumes a comma delimiter by default and converts empty cells and common missing-value markers to NaN, so a standard comma-separated file usually needs no extra arguments. However, if you need to customize the behavior of the pd.read_csv() function, you can do so using various optional parameters. Some of the most common ones include:

  • sep: Specify the delimiter to use. The default is a comma. Passing sep=None together with engine="python" asks pandas to detect the delimiter instead.
  • header: Specify which row to use as the column names. By default, pandas assumes the first row contains the column names.
  • index_col: Specify which column to use as the row labels. By default, pandas will create a new index for the rows.
  • skiprows: Specify the number of rows to skip at the beginning of the file. This can be useful for skipping metadata or comments in the CSV file.
  • na_values: Specify values that should be treated as missing or NaN (Not a Number). By default, pandas recognizes empty cells and some common representations of missing values, like "NA" and "NULL".

Here's an example of how to read a CSV file with custom parameters using pandas:

import pandas as pd

# Read the CSV file with custom parameters
data = pd.read_csv(
    "example.csv",
    sep="\t",
    header=None,
    index_col=0,
    skiprows=1,
    na_values=["NA", "NULL"],
)

# Display the first few rows of the DataFrame
print(data.head())

In this example, we're reading a CSV file called "example.csv" using the pd.read_csv() function with custom parameters: a tab delimiter, the first line of the file skipped, no header row (so pandas numbers the columns itself), the first column used as the row index, and "NA" and "NULL" treated as missing values.
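To try these parameters without creating a file, you can pass pd.read_csv() an in-memory buffer. The sketch below is an illustration only: the semicolon-separated data, the column names, and the "missing" marker are all made up.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with a custom missing-value marker
sample = "id;name;score\n1;Alice;9.5\n2;Bob;missing\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep=";",                 # the data uses semicolons, not commas
    index_col="id",          # use the id column as the row labels
    na_values=["missing"],   # treat the string "missing" as NaN
)

print(df)

# Count the missing values in the score column
missing_scores = int(df["score"].isna().sum())
print(missing_scores)  # 1
```

Because "missing" is converted to NaN, the score column is read as numeric rather than as text, which makes later calculations simpler.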

Conclusion

In this blog post, we've explored how to read CSV files in Python using the built-in csv module and the popular pandas library. We've covered how to use csv.reader() and csv.DictReader() functions for simple CSV files, and pd.read_csv() for more complex data structures and manipulation tasks.

With these techniques in your toolkit, you should now be well-equipped to read and process CSV files in your Python projects. As you continue learning programming, remember to experiment, practice, and explore different approaches to find the one that works best for you and your specific use case. Happy coding!