How to make a histogram in matplotlib

Understanding Histograms

Before diving into the code, let's make sure we understand what a histogram is. Imagine you have a bag of different colored marbles and you want to know how many marbles of each color you have. You could sort them by color and then count them, making a pile for each color. A histogram is similar to these piles, except that it groups numbers instead of colors: it's a type of chart that shows how many values fall into each of a set of ranges (we call these ranges 'bins').

In the context of programming, a histogram is a graphical representation of the distribution of numerical data. It's a way of summarizing information from a process or a large dataset, which can be incredibly helpful when you're trying to get insights from your data.

Setting Up Your Environment

To create a histogram in Python, we'll use a library called Matplotlib. Think of a library as a collection of tools that can help you do specific tasks without having to build those tools yourself. To use Matplotlib, you need to install it and import it into your Python script. Here's how you can do that:

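# Install matplotlib using pip (you only need to do this once).
# The leading "!" only works in a Jupyter notebook cell; in a terminal, run: pip install matplotlib
!pip install matplotlib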

# Import the pyplot module from matplotlib
import matplotlib.pyplot as plt

The pyplot module is like a toolbox for plotting graphs and charts, including histograms.
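To confirm that the installation worked, a quick check along these lines can help. It is only a sketch: it prints whichever version happens to be installed, and nothing in it is specific to histograms.

```python
import matplotlib

# Print the installed Matplotlib version; an ImportError here means the install did not succeed
print(matplotlib.__version__)
```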

Your First Histogram

Let's start by creating a very simple histogram. We'll use a list of numbers and show how many of them fall into each range.

# Sample data: heights of 10 people in centimeters
heights = [150, 160, 165, 157, 180, 160, 170, 172, 177, 190]

# Create a histogram
plt.hist(heights, bins=5, edgecolor='black')

# Add titles and labels
plt.title('Heights Distribution')
plt.xlabel('Heights (cm)')
plt.ylabel('Number of People')

# Show the plot
plt.show()

In this example, heights is our dataset. The plt.hist() function creates the histogram, and we've asked for 5 bins. You can think of bins like buckets: Matplotlib splits the range from the smallest height (150 cm) to the largest (190 cm) into 5 equal-width buckets of 8 cm each and counts how many heights land in each one. The edgecolor='black' argument draws a black border around each bar so neighboring bars are easier to tell apart.
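To see the numbers behind the bars, the binning can be reproduced without drawing anything. The sketch below uses NumPy's np.histogram, which is the function plt.hist() relies on for its counting; the variable names counts and edges are just illustrative choices.

```python
import numpy as np

heights = [150, 160, 165, 157, 180, 160, 170, 172, 177, 190]

# Same binning as plt.hist(heights, bins=5): 5 equal-width bins from min to max
counts, edges = np.histogram(heights, bins=5)

# Print each bin's range and how many heights fall inside it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} cm: {count}")
```

plt.hist() returns the same counts and edges as its first two return values, so you can also capture them from the plotting call itself.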

Customizing Bin Sizes

The number and size of bins can significantly affect how your histogram looks and the insights you can gain from it. If you have too few bins, you might miss important details. Too many bins, and the histogram might become overly complex. You can specify the bins manually to have better control over your histogram:

# Define bin edges
bin_edges = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]

# Create a histogram with custom bins
plt.hist(heights, bins=bin_edges, edgecolor='black')

# Add titles and labels
plt.title('Heights Distribution with Custom Bins')
plt.xlabel('Heights (cm)')
plt.ylabel('Number of People')

# Show the plot
plt.show()

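Here, bin_edges is a list that defines the edges of the bins. Each bin counts how many values from heights fall between two consecutive edges. A value that sits exactly on an edge goes into the bin to its right, except for the last edge, which is included in the final bin.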
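Typing out evenly spaced edges by hand gets tedious for wider ranges. As a sketch, np.arange can generate the same edges, and np.histogram shows which bin each value ends up in; the step of 5 and the stop value of 200 are simply the values that reproduce the list above.

```python
import numpy as np

heights = [150, 160, 165, 157, 180, 160, 170, 172, 177, 190]

# Edges from 150 up to 195 in steps of 5 (the stop value 200 itself is excluded)
bin_edges = np.arange(150, 200, 5)

# Count how many heights fall into each of the 9 bins
counts, _ = np.histogram(heights, bins=bin_edges)

print(bin_edges)
print(counts)
```

The two heights of exactly 160 are counted in the 160 to 165 bin, not the 155 to 160 bin. The same bin_edges array can be passed directly to plt.hist(heights, bins=bin_edges). The bins argument also accepts a string such as 'auto' if you would rather let NumPy pick the edges.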

Understanding Distribution with Histograms

Histograms are powerful because they help us understand the distribution of our data. Distribution refers to how your data is spread out or clustered together. For example, if most people in our heights list were around 160 cm tall, the tallest bar of the histogram would be the bin containing 160 cm. The location of this peak is known as the 'mode' of the distribution.

Let's look at another example using randomly generated data that follows a normal distribution, which is a common pattern where most values cluster around a central point:

import numpy as np

# Generate random data with a normal distribution
np.random.seed(0)  # Fix the seed so the same random numbers are generated on every run
data = np.random.randn(1000)  # 1000 samples from the standard normal distribution (mean 0, standard deviation 1)

# Create the histogram
plt.hist(data, bins=25, edgecolor='black')

# Add titles and labels
plt.title('Normal Distribution of Random Data')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In this histogram, you can see the bell-shaped curve that's typical of a normal distribution. The np.random.randn() function from the NumPy library is used to generate random numbers that follow this pattern.
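If you want the bar heights to be comparable with a probability density curve instead of raw counts, one option is the density=True argument, which plt.hist() and np.histogram both accept. The sketch below uses np.histogram to check numerically that the normalized bars cover a total area of 1, and prints the sample mean and standard deviation for comparison with the theoretical 0 and 1.

```python
import numpy as np

np.random.seed(0)
data = np.random.randn(1000)

# density=True rescales the bar heights so that the total bar area equals 1
density, edges = np.histogram(data, bins=25, density=True)
area = np.sum(density * np.diff(edges))

print(f"sample mean: {data.mean():.3f}")
print(f"sample standard deviation: {data.std():.3f}")
print(f"total area under the bars: {area:.3f}")
```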

Adding More Style

Matplotlib allows you to add style to your charts to make them more visually appealing. For example, you can change the color of the bars, add a grid for easier reading, or use a different style template:

# Set a style
plt.style.use('ggplot')

# Create a histogram with a different color
plt.hist(data, bins=25, color='skyblue', edgecolor='black')

# Add a grid
plt.grid(True)

# Add titles and labels
plt.title('Stylish Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the plot
plt.show()

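By using plt.style.use('ggplot'), we've applied a style template that changes the color scheme and background of the plot. The style stays active for every plot you create afterwards in the same session. The plt.grid(True) call adds a grid to the background, making it easier to read the values; the ggplot style already draws one, so the call matters mostly with styles that don't.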
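To find out which style templates your installation offers, or to apply one to a single plot only, something like the following sketch may be useful. It uses plt.style.available and plt.style.context, and draws with ax.hist, the Axes method that takes the same arguments as plt.hist.

```python
import matplotlib.pyplot as plt
import numpy as np

# Names of the style templates bundled with the installed Matplotlib
print(plt.style.available)

np.random.seed(0)
data = np.random.randn(1000)

# The style applies only to plots created inside the "with" block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.hist(data, bins=25, color='skyblue', edgecolor='black')
    ax.set_title('Stylish Histogram')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
```

Add a plt.show() call as the last line inside the with block to display the figure. Plots created after the block go back to whatever style was active before it.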

Analyzing Multiple Datasets

Sometimes you'll want to compare two sets of data. You can do this by plotting two histograms on the same chart:

# Generate another set of random data
data2 = np.random.randn(1000) + 2  # Adding 2 shifts the whole distribution 2 units to the right (mean of about 2)

# Plot two histograms on the same chart
plt.hist(data, bins=25, color='blue', edgecolor='black', alpha=0.5, label='Dataset 1')
plt.hist(data2, bins=25, color='red', edgecolor='black', alpha=0.5, label='Dataset 2')

# Add a legend
plt.legend()

# Add titles and labels
plt.title('Comparing Two Datasets')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the plot
plt.show()

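The alpha parameter controls the transparency of the bars, which is useful when you want to see how the bars overlap. The label parameter is used to create a legend, which helps distinguish between the two datasets. Note that bins=25 is computed separately for each dataset, so the two sets of bars have different edges and widths; pass the same list of bin edges to both calls if you need the bars to line up.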
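When the two histograms should use identical bins, one approach is to compute the edges once from the combined data and reuse them. This is a sketch, not the only way to do it: shared_edges is an illustrative name, np.histogram_bin_edges does the edge calculation, and ax.hist is the Axes method equivalent to plt.hist.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
data = np.random.randn(1000)
data2 = np.random.randn(1000) + 2

# Compute 25 equal-width bins that span both datasets together
shared_edges = np.histogram_bin_edges(np.concatenate([data, data2]), bins=25)

fig, ax = plt.subplots()
# Both calls reuse the same edges, so the bars of the two datasets line up
counts1, _, _ = ax.hist(data, bins=shared_edges, color='blue',
                        edgecolor='black', alpha=0.5, label='Dataset 1')
counts2, _, _ = ax.hist(data2, bins=shared_edges, color='red',
                        edgecolor='black', alpha=0.5, label='Dataset 2')

ax.legend()
ax.set_title('Comparing Two Datasets with Shared Bins')
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
```

Call plt.show() afterwards to display the figure. Because every bar now covers the same interval in both datasets, the bar heights can be compared directly.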

Conclusion

Creating histograms in Matplotlib is a straightforward process that can yield a wealth of insights into your data. Whether you're looking at a simple distribution of heights, analyzing the normal distribution of random numbers, or comparing multiple datasets, histograms are an essential tool in your data analysis toolkit.

Remember, the key to a useful histogram is in choosing the right number and size of bins to accurately represent your data. With the flexibility and customization that Matplotlib offers, you can tweak your histograms to perfection.

As you continue your journey in programming and data visualization, think of each histogram as a story about your data. With the right plot, you can uncover patterns, trends, and outliers that might otherwise remain hidden. Keep experimenting, and enjoy the process of discovery that comes with visualizing data. Happy plotting!