From the course: Machine Learning with Python: Foundations

How to visualize data in Python - Python Tutorial

- [Instructor] As the popular saying goes, "a picture is worth a thousand words." Visualizations are sometimes more useful than summary statistics in helping us understand our data. One of the most popular visualization packages in Python is the matplotlib package, which provides a host of powerful functions and methods that allow us to produce publication-quality visualizations. The plot method of a pandas DataFrame provides an abstraction of the matplotlib functions. To ensure that the plots we create in this tutorial appear right after our code, we have to run the following command. Next, let's import and preview the data we will use for our illustrations.

The first type of plot we create is a relationship visualization. These types of visualizations are used to illustrate the correlation between two or more continuous variables. Scatter plots are one of the most commonly used relationship visualizations. They show how one variable changes in response to a change in another. To create a scatter plot, we start with our DataFrame, vehicles, and call the plot method. Within the method, we specify the value for the kind argument as scatter. We specify a value for the x-axis, here city MPG, and a value for the y-axis, here CO2 emissions. The resulting plot shows that the relationship between vehicle emissions levels and city mileage is negative. In other words, vehicles with higher mileage ratings emit less carbon.

Next, we create a distribution visualization. As the name suggests, distribution visualizations illustrate the statistical distribution of the values of a feature. One of the most commonly used distribution visualizations is the histogram. With a histogram, we can figure out which values are most common for a feature. To create a histogram, we start with our vehicles data and select the column that we want, which here is CO2 emissions. Then we call the plot method.
Within the method, we specify the value for the kind argument, which this time around is hist. The plot shows that the carbon emissions values for the vehicles in the dataset range from just under 200 grams per mile to just over a thousand grams per mile. It also shows that most vehicles fall within the 300 to 700 grams per mile range.

Comparison visualizations are used to illustrate the difference between two or more items at a given point in time or over a period of time. One of the most commonly used comparison visualizations is the box plot. Using a box plot, we can compare the distribution of values for a continuous feature against the values of a categorical feature. To create a box plot in Python, we must first create a pivot table, such that the values we want on the x-axis of our plot are listed as column labels while the values we want on the y-axis of our plot are the cell values. To create the pivot table, we begin with our vehicles dataset and call the pivot method. Within the method, we specify a value for the columns argument, here drive, and a value for the cells, which is the values argument, here CO2 emissions. Note that the value NaN is short for "not a number" and is how a pandas DataFrame represents missing values. The emissions values are missing for every column except the one that corresponds to the drive type of a particular vehicle. For example, we can tell that the first vehicle in our table is a two-wheel drive vehicle while the fourth one is a rear-wheel drive vehicle. With our data in this format, we can then create a box plot. Some of the code is already written for us here, so we call the plot method and, within the method, specify kind = box because we want a box plot. We also specify a figure size of 10 by 6. The plot shows that, on average, front-wheel drive cars have lower carbon emissions than other types of cars.
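The scatter plot, histogram, and box plot described so far can be sketched roughly as follows. This is only an illustration under stated assumptions: the small `vehicles` DataFrame is a made-up stand-in for the course data, and the column names `citympg`, `co2emissions`, and `drive` are guesses at how the spoken names might be spelled, not the course's actual code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the course's vehicles data; the column names are assumptions.
vehicles = pd.DataFrame({
    "citympg": [14, 18, 22, 27, 31, 16, 24, 35],
    "co2emissions": [640, 500, 410, 330, 290, 560, 370, 255],
    "drive": [
        "Rear-Wheel Drive", "All-Wheel Drive", "Front-Wheel Drive", "Front-Wheel Drive",
        "Front-Wheel Drive", "Rear-Wheel Drive", "All-Wheel Drive", "Front-Wheel Drive",
    ],
})

# Relationship visualization: scatter plot of emissions against city mileage.
scatter_ax = vehicles.plot(kind="scatter", x="citympg", y="co2emissions")

# Distribution visualization: histogram of a single column.
# A fresh figure keeps the histogram from being drawn onto the scatter plot's axes.
plt.figure()
hist_ax = vehicles["co2emissions"].plot(kind="hist")

# Comparison visualization: pivot so each drive type becomes a column,
# leaving NaN wherever a vehicle does not have that drive type.
pivoted = vehicles.pivot(columns="drive", values="co2emissions")
box_ax = pivoted.plot(kind="box", figsize=(10, 6))
```

In a Jupyter notebook, the `matplotlib.use("Agg")` line would normally be dropped in favor of an inline-plotting setup such as the `%matplotlib inline` magic, which is presumably the kind of command the instructor runs at the start.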
Our fourth visualization is a composition visualization. These types of visualizations show the component makeup of data. Stacked bar charts are one of the most commonly used composition visualizations. Stacked bar charts show how much a subgroup contributes to the whole. To create a stacked bar chart in Python, we must first create a pivot table so that the values we want on the x-axis of our plot are listed as row labels while the composite groups are listed as column labels. To do this, we start with a group-level aggregation. We start with our vehicles data, group by year, select the column we want, which is drive, and call value_counts to give us a count of each unique value. The next thing we do is call the unstack method to pivot the innermost index, which is drive, to column labels. Now that our data is in this format, we can create a stacked bar chart. We call the plot method and, within the method, specify a value for kind, which this time around is bar, and set stacked = True because we want a stacked bar chart. We also specify a figure size of 10 by 6. The plot shows the total number of vehicles rated by the EPA each year, as well as the proportion of front-wheel, all-wheel, and rear-wheel drive vehicles that make up those numbers.
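The group, count, unstack, and plot steps for the stacked bar chart might look something like the sketch below. Again, the `vehicles` DataFrame here is invented for illustration, and the column names `year` and `drive` are assumptions about the course dataset rather than confirmed names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import pandas as pd

# Hypothetical stand-in for the course's vehicles data; the column names are assumptions.
vehicles = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020],
    "drive": [
        "Front-Wheel Drive", "Front-Wheel Drive", "Rear-Wheel Drive", "All-Wheel Drive",
        "Front-Wheel Drive", "Front-Wheel Drive", "Rear-Wheel Drive", "All-Wheel Drive",
        "All-Wheel Drive",
        "Front-Wheel Drive", "Rear-Wheel Drive", "All-Wheel Drive",
    ],
})

# Group by year, count each drive type within the year, then move the
# innermost index level (drive) into the column labels.
counts = vehicles.groupby("year")["drive"].value_counts().unstack()

# One bar per year, with the drive-type counts stacked on top of each other.
ax = counts.plot(kind="bar", stacked=True, figsize=(10, 6))
```

The sample data deliberately includes every drive type in every year. With real data, a year that lacks some drive type would produce a missing value after `unstack`, which may need to be filled before plotting.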
