Updated: Sep 15
In this post, you will learn how to visualize data with the Matplotlib library. It is one of the oldest Python libraries for data visualization, and many newer libraries are built on top of it, so learning it will help you a lot.
After reading this post, you should be comfortable visualizing data with Python.
So let's get started.
Download the Data sets used in this post - Matplotlib data set.
1. Line Chart -
In this article, we will use the Matplotlib library for all the data visualization.
To create a line chart, or any other chart, we first need to import Matplotlib's pyplot module.
The pyplot module is commonly imported as plt from matplotlib. Let's import it.
We will be using the Crimes in India data set for the visualizations. So let's read it and assign it to crimes.
Now, let's look at the 2001 crimes data.
If you look at the shape of the data for 2001, you can see that it has 682 rows and 14 columns. Why so many rows for a single year? The reason is that the data is first divided by state, and each state has many districts.
So before we make any plots, we have to do some data manipulation to consolidate the data. To do that, we can use the groupby and aggregation methods in pandas.
Here, we first create a list of all the crimes in our data set. Then we group the data set by year and sum each column for each year. If you want, you can also group the data by state or district, or group by multiple columns by passing the column names in a list to the groupby method.
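The grouping step described above can be sketched as follows. This is only an illustration on a tiny made-up table: the column names (YEAR, STATE, MURDER, THEFT) and all the numbers are assumptions, not the real data set.

```python
import pandas as pd

# Toy stand-in for the crimes table. Column names and values are made up.
crimes = pd.DataFrame({
    "YEAR":   [2001, 2001, 2002, 2002],
    "STATE":  ["State A", "State B", "State A", "State B"],
    "MURDER": [10, 20, 12, 18],
    "THEFT":  [100, 150, 110, 140],
})

# List of the crime columns we want to add up.
crime_columns = ["MURDER", "THEFT"]

# One row per year, with each crime column summed over all states.
crimes_by_year = crimes.groupby("YEAR")[crime_columns].sum()
crimes_by_year["TOTAL"] = crimes_by_year.sum(axis=1)

# Passing a list of column names groups by several keys at once.
crimes_by_year_state = crimes.groupby(["YEAR", "STATE"])[crime_columns].sum()
```

With the real data, crime_columns would hold the actual crime column names from the file.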
We have done all the necessary work. Now let's create our line chart.
To create the line chart we are interested in, we need to pass a list of x-values as the first parameter and a list of y-values as the second parameter to plt.plot( ).
We use the plt.show( ) command to display the plot. We can see that the total number of crimes went down in 2003 and then kept rising with every passing year.
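Here is a minimal sketch of that call pattern. The yearly totals are placeholder numbers chosen only for illustration, not values from the crimes data.

```python
import matplotlib.pyplot as plt

# Placeholder values, used only to show the call pattern.
years = [2001, 2002, 2003, 2004, 2005]
total_crimes = [120, 125, 118, 130, 141]

plt.figure()
# x-values come first, y-values second.
lines = plt.plot(years, total_crimes)
```

Calling plt.show( ) after this displays the figure in an interactive session.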
If you want to save the plot on your computer, use the command -
plt.savefig( 'MyPic.png' )
You can also save it in another format such as PDF by changing the file extension above.
Now, let's customize this plot.
A. Adding Axis labels and Title -
It's always good practice to add axis labels and a title to a plot. Let's see how to add them.
xlabel( ) - To add a label to the x-axis. xlabel accepts a string value.
ylabel( ) - To add a label to the y-axis. ylabel accepts a string value.
title( ) - To add a title to the plot. title also accepts a string value.
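As a small sketch, the three functions can be used together like this. The data values and the label texts are placeholders.

```python
import matplotlib.pyplot as plt

# Placeholder values, for illustration only.
years = [2001, 2002, 2003, 2004, 2005]
total_crimes = [120, 125, 118, 130, 141]

plt.figure()
plt.plot(years, total_crimes)
plt.xlabel("Year")                     # label for the x-axis
plt.ylabel("Total crimes")             # label for the y-axis
plt.title("Total crimes per year")     # title of the plot

# Keep a handle on the current Axes so the labels can be inspected.
ax = plt.gca()
```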
Let's create two line plots on the same axes with different colors.
Here, we pass two additional parameters to plot( ), color and label, and we add a legend with plt.legend( ) to distinguish between the two lines.
To change the location of the legend, use the loc parameter of plt.legend( ).
plt.legend( loc= 'upper left' )
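A minimal sketch of two lines on the same axes with a legend could look like this. The crime names, colors and numbers are placeholders picked for the example.

```python
import matplotlib.pyplot as plt

# Placeholder values, for illustration only.
years = [2001, 2002, 2003, 2004, 2005]
murders = [30, 32, 31, 35, 36]
thefts = [90, 93, 87, 95, 105]

plt.figure()
# Each call to plot( ) adds one line; label is the text shown in the legend.
plt.plot(years, murders, color="red", label="Murder")
plt.plot(years, thefts, color="blue", label="Theft")

# loc controls where the legend box is placed.
legend = plt.legend(loc="upper left")
```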
2. Subplot( ) in Matplotlib-
A. Matplotlib Classes -
When we make multiple plots, we have to explicitly tell Matplotlib which plot we are changing. To do that, we need to understand the classes that pyplot uses internally to maintain state, so that we can interact with them directly.
To create multiple plots, we first create a figure, which acts as a container for all of our plots, using pyplot.figure( ).
Then we create a subplot using figure.add_subplot. This returns an Axes object, which needs to be assigned to a variable.
axes_obj = fig.add_subplot( nrows, ncols, plot_number )
Suppose we want the figure to contain 2 plots, one above the other. Then we need to write -
This will create a grid of plots with 2 rows and 1 column. Once we are done adding subplots to the figure, we display everything using plt.show( ).
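A minimal sketch of the two stacked subplots, using the add_subplot( nrows, ncols, plot_number ) form given earlier:

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# 2 rows, 1 column: plot number 1 is the top plot, 2 is the bottom plot.
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
```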
B. Grid Positioning -
If we want to create subplots in a grid of 2 rows and 2 columns, this is how it will look -
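The sketch below builds such a 2 by 2 grid and writes each plot number into the subplot title, so you can see that the numbering runs left to right and then top to bottom. The axes dictionary is just a helper for this example.

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# Plot numbers start at 1 in the top-left corner and increase row by row.
axes = {}
for plot_number in range(1, 5):
    ax = fig.add_subplot(2, 2, plot_number)
    ax.set_title(f"plot_number = {plot_number}")
    axes[plot_number] = ax
```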
C. Adding Data -
To generate a line chart within an Axes object, we need to call Axes.plot( ) and pass in the data we want plotted.
ax1.plot( x_values, y_values )
Let's create 2 line subplots in a 2 row by 1 column layout.
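A sketch of this layout with placeholder data (the two series and their values are made up for the example):

```python
import matplotlib.pyplot as plt

# Placeholder values, for illustration only.
years = [2001, 2002, 2003, 2004, 2005]
murders = [30, 32, 31, 35, 36]
thefts = [90, 93, 87, 95, 105]

fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

# Each Axes object draws only the data passed to its own plot( ) call.
ax1.plot(years, murders)
ax2.plot(years, thefts)
```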
D. Formatting and Spacing -
Sometimes when we create subplots, the x-axis ticks or the y-axis ticks become unreadable.
The reason is that Matplotlib uses the default dimensions for the total plotting area instead of resizing it.
To increase the dimensions of the plotting area, we need to use the figsize parameter when we call plt.figure( ).
fig = plt.figure( figsize = ( width, height ) )
Both width and height are in inches.
To add a title to each subplot, call set_title( ) on its Axes object.
Even after increasing the size of the figure, the x-axis of the top subplot can still overlap the title of the bottom subplot.
To fix this, increase the spacing between the subplots.
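The exact commands from the original post are not shown here, so the following is only one possible sketch. It uses Axes.set_title( ) for the titles and Figure.tight_layout( ) for the spacing; the data, the titles and the figure size are placeholders.

```python
import matplotlib.pyplot as plt

# Placeholder values, for illustration only.
years = [2001, 2002, 2003, 2004, 2005]
murders = [30, 32, 31, 35, 36]
thefts = [90, 93, 87, 95, 105]

# figsize is (width, height) in inches.
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
ax1.plot(years, murders)
ax2.plot(years, thefts)

# Titles are set per subplot, on the Axes objects.
ax1.set_title("Murder")
ax2.set_title("Theft")

# Let Matplotlib adjust the padding so titles and tick labels do not collide.
# fig.subplots_adjust(hspace=0.5) is an alternative that sets the gap by hand.
fig.tight_layout()
```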
Let's add some more subplots.
3. Bar Plots -
A bar plot uses rectangular bars whose lengths are proportional to the values they represent. An effective bar plot uses categorical values on one axis and numerical values on the other axis.
If you look at the crimes data, you can see that the states and districts are categorical values, which is what we need for the bar plot. So let's again use groupby and aggregation to transform our data into the desired shape before plotting.
Let's sort the data from highest to lowest to see which states had the most crimes.
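A sketch of the group-and-sort step on a tiny made-up table. The column names (STATE, MURDER, THEFT), the state labels and all numbers are assumptions for the example, not the real data.

```python
import pandas as pd

# Toy stand-in for the crimes table. Column names and values are made up.
crimes = pd.DataFrame({
    "YEAR":   [2001, 2001, 2001, 2002, 2002, 2002],
    "STATE":  ["State A", "State B", "State C", "State A", "State B", "State C"],
    "MURDER": [10, 20, 5, 12, 18, 6],
    "THEFT":  [100, 150, 60, 110, 140, 70],
})
crime_columns = ["MURDER", "THEFT"]

# One row per state, with each crime column summed over all years.
crimes_by_state = crimes.groupby("STATE")[crime_columns].sum()
crimes_by_state["TOTAL"] = crimes_by_state.sum(axis=1)

# Highest total first.
crimes_by_state = crimes_by_state.sort_values("TOTAL", ascending=False)
```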
Frankly speaking, I didn't expect Maharashtra at the top.
A. Vertical bar plot -
Let's visualize this using a bar plot. To create a vertical bar plot, we need to specify the positions of the bars on the x-axis, the heights of the bars, the width of the bars and the positions of the tick labels.
We will use NumPy's arange( ) function to generate the positions of the bars on the x-axis.
from numpy import arange
x_positions = arange(5)
We will use pyplot.bar( ) to create the bar plot. It needs at least two parameters: the x positions of the bars, which we already defined, and the heights of the bars, which will be the top 5 values of the total crimes column. The width is optional; by default it is 0.8.
plt.bar( x_positions, bar_height, width )
For the positions and labels of the ticks, we will use plt.xticks( ) -
plt.xticks( x_positions, labels, rotation=angle )
Here, for the locations of the x-ticks, we again use the positions generated with arange, and the labels will be the state names in our case. We use the rotation parameter, which must be passed by keyword, so that the labels don't overlap each other.
For the labels, I used crimes_by_state.index[0:5] to get the names of the states.
If you want, you can also set them manually by passing the names in a list to plt.xticks( ).
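Putting these pieces together, a vertical bar plot could be sketched like this. The state labels, totals, bar width and rotation angle are placeholders; with the real data the heights and labels would come from crimes_by_state.

```python
import matplotlib.pyplot as plt
from numpy import arange

# Placeholder labels and values, for illustration only.
states = ["State A", "State B", "State C", "State D", "State E"]
totals = [500, 420, 390, 310, 280]

x_positions = arange(5)

plt.figure()
# Positional order is: x positions, bar heights, bar width.
bars = plt.bar(x_positions, totals, 0.5)

# Put one tick under each bar and rotate the labels so they do not overlap.
plt.xticks(x_positions, states, rotation=45)
```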
B. Horizontal Bar Plot -
For a horizontal bar plot, we will use pyplot.barh( ).
There are a few changes from the vertical bar plot. Instead of x_pos, we will use y_pos. We will use width for the total crimes values instead of height, and height for the thickness of the bars. And we use plt.yticks( ) instead of plt.xticks( ).
plt.barh( y_pos, width, height )
plt.yticks( y_pos, labels)
Here, I used the figsize parameter that we learned before to make the plot bigger and also changed the default color from blue to green using the color parameter.
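A sketch of the horizontal version with the same kind of placeholder data. The figure size, bar thickness and values are assumptions for the example.

```python
import matplotlib.pyplot as plt
from numpy import arange

# Placeholder labels and values, for illustration only.
states = ["State A", "State B", "State C", "State D", "State E"]
totals = [500, 420, 390, 310, 280]

y_pos = arange(5)

fig = plt.figure(figsize=(10, 6))
# Positional order is: y positions, bar lengths (width), bar thickness (height).
bars = plt.barh(y_pos, totals, 0.5, color="green")
plt.yticks(y_pos, states)
```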
C. Grouped Bar Plot -
For the grouped bar plot, we will also use the pyplot.bar( ) function, but with some modifications. You may need some trial and error to get it right, but the process is the same.
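One common way to do this, sketched here with made-up data, is to call bar( ) once per group and shift each group's x positions by half a bar width. The crime names, states and values are placeholders.

```python
import matplotlib.pyplot as plt
from numpy import arange

# Placeholder labels and values, for illustration only.
states = ["State A", "State B", "State C"]
murders = [30, 45, 25]
thefts = [90, 60, 75]

x_positions = arange(len(states))
bar_width = 0.4

plt.figure()
# Shift one group to the left and the other to the right of each tick.
murder_bars = plt.bar(x_positions - bar_width / 2, murders, bar_width, label="Murder")
theft_bars = plt.bar(x_positions + bar_width / 2, thefts, bar_width, label="Theft")

# Keep the tick labels centered between the two bars of each group.
plt.xticks(x_positions, states)
plt.legend()
```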
If you look closely at the above figure, you can see that crimes in Maharashtra start to appear lower compared to other states, even though total crimes are highest in Maharashtra. It means there are some crimes in Maharashtra that are dragging it to the highest position. That is why segmenting the data is very important for finding deeper insights.
Let's see which crimes are highest in Maharashtra.
We can see that theft is by far the biggest problem for Maharashtra. This is what is dragging it to the top position.
D. Stacked Bar Plot -
Here, we also use the pyplot.bar( ) function, with an additional parameter, bottom, that tells Matplotlib how to stack these bars.
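A minimal sketch of stacking with bottom, using made-up values: the second series starts where the first one ends.

```python
import matplotlib.pyplot as plt
from numpy import arange

# Placeholder labels and values, for illustration only.
states = ["State A", "State B", "State C"]
murders = [30, 45, 25]
thefts = [90, 60, 75]

x_positions = arange(len(states))

plt.figure()
murder_bars = plt.bar(x_positions, murders, label="Murder")
# bottom=murders makes each theft bar start at the top of the murder bar.
theft_bars = plt.bar(x_positions, thefts, bottom=murders, label="Theft")

plt.xticks(x_positions, states)
plt.legend()
```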
4. Scatter Plots -
While Bar Plots help us visualize a few data points to quickly compare them, they aren't good at helping us visualize many data points.
A scatter plot helps us determine if 2 columns are weakly or strongly correlated.
To generate a scatter plot, we will use pyplot.scatter( ). The scatter method has 2 required parameters, x and y.
plt.scatter( x, y )
The values for these parameters need to be iterable objects of matching lengths.
Let's create a scatter plot of cruelty by husband and relatives with dowry deaths columns. This time we won't use the groupby dataframe which is in reduced form. We will be using the original dataframe because we want more values.
We can see that there is only a weak positive relationship between the two columns.
To find the correlation coefficient between two series in pandas, you can use the .corr( ) method.
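A sketch of both steps on a small made-up table. The column names CRUELTY and DOWRY_DEATHS and the numbers are assumptions for the example, so the resulting coefficient says nothing about the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for two columns of the crimes table. Names and values are made up.
crimes = pd.DataFrame({
    "CRUELTY":      [12, 30, 7, 45, 22, 16],
    "DOWRY_DEATHS": [3, 4, 2, 9, 2, 5],
})

plt.figure()
# x and y must have the same length.
points = plt.scatter(crimes["CRUELTY"], crimes["DOWRY_DEATHS"])
plt.xlabel("Cruelty by husband and relatives")
plt.ylabel("Dowry deaths")

# Pearson correlation coefficient between the two series (between -1 and 1).
r = crimes["CRUELTY"].corr(crimes["DOWRY_DEATHS"])
```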
5. Histograms -
We can create a histogram using pyplot.hist( ) . This method has only 1 required parameter, an iterable object containing the values we want a histogram for.
plt.hist( x )
While histograms use bars whose lengths are scaled to the values they represent, they differ from bar plots in a few ways.
Histograms help us visualize continuous values using bins, while bar plots help us visualize discrete values. The locations of the bars on the x-axis matter in a histogram, but they don't in a simple bar plot. Lastly, bar plots have gaps between the bars to emphasize that the values are discrete.
By default, Matplotlib uses 10 bins.
Since all the data in our current data set is discrete, let's read another data set that has continuous data.
We will use the Google Merchandise store's GA data.
Let's create the histogram for the product_revenue column.
Let's change the number of bins to 20 and see the shape of the histogram.
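The effect of the bins parameter can be sketched as follows. The product_revenue values here are randomly generated as a stand-in for the real column, so the shape of the histogram is not meaningful.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the product_revenue column (not real data).
rng = np.random.default_rng(0)
product_revenue = rng.gamma(shape=2.0, scale=20.0, size=500)

plt.figure()
# Without the bins parameter, hist( ) uses 10 bins.
counts10, edges10, _ = plt.hist(product_revenue)

plt.figure()
# bins=20 splits the same range into 20 narrower bins.
counts20, edges20, _ = plt.hist(product_revenue, bins=20)
```

hist( ) returns the bin counts and the bin edges, so there is always one more edge than there are bins.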
6. Box Plot -
A box plot consists of a box-and-whisker diagram, which represents the different quartiles in a visual way. Each quartile contains one quarter of the values.
We can create a box plot using pyplot.boxplot( ) .
plt.boxplot( x )
Matplotlib will sort the values, calculate the quartiles that divide the values into four equal regions, and generate the box and whisker diagram.
Let's visualize the riots data using a box plot.
We can also draw multiple box plots on the same axes. When we select multiple columns to pass into pyplot.boxplot( ), we need to use the values accessor to get a two-dimensional NumPy array.
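A sketch of both cases on a small made-up table. The column names RIOTS and ARSON and their values are assumptions for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for two columns of the crimes table. Names and values are made up.
crimes = pd.DataFrame({
    "RIOTS": [5, 12, 7, 30, 9, 14, 11, 8],
    "ARSON": [2, 4, 1, 9, 3, 5, 2, 6],
})

plt.figure()
# A single column gives a single box.
single = plt.boxplot(crimes["RIOTS"])

plt.figure()
# .values turns the selected columns into a 2D array: one box per column.
multi = plt.boxplot(crimes[["RIOTS", "ARSON"]].values)
# Boxes are placed at x = 1, 2, ... by default.
plt.xticks([1, 2], ["Riots", "Arson"])
```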
7. Style sheets in Matplotlib -
You might be a big fan of ggplot, Seaborn or some other plotting library. Matplotlib provides a very easy way to make your plots look like the ones from these libraries, without the need to learn a new programming language or the syntax of another library.
All you have to do is use pyplot.style.use( ).
To find out all the styles available, you can use -
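For example, the list of style names can be read from plt.style.available. The exact names depend on the installed Matplotlib version.

```python
import matplotlib.pyplot as plt

# A list of style-name strings that can be passed to plt.style.use( ).
available_styles = plt.style.available
print(available_styles)
```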
Or use this link to see it visually - Available styles.
Let's change our plots with these style sheets.
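A minimal sketch of applying a style, with placeholder data. The style applies to figures created after the call. Note that in recent Matplotlib versions the bundled Seaborn styles are listed under names starting with seaborn-v0_8, so check plt.style.available for the exact name on your installation.

```python
import matplotlib.pyplot as plt

# Placeholder values, for illustration only.
years = [2001, 2002, 2003, 2004, 2005]
total_crimes = [120, 125, 118, 130, 141]

# Switch the look of all following plots; "fivethirtyeight" works the same way.
plt.style.use("ggplot")

fig = plt.figure()
plt.plot(years, total_crimes)
plt.title("ggplot style")
ax = plt.gca()
```

Calling plt.style.use( "default" ) switches back to the standard look.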
A. fivethirtyeight -
B. Seaborn -
C. ggplot -