From the course: Python Statistics Essential Training
Histograms and distributions - Python Tutorial
Histograms and distributions
- [Instructor] In this lesson, we're going to look at histograms and distributions for numerical columns. Again, we can start off with the describe method to get summary statistics, but oftentimes we want to visualize the data as well, so I'll show you some ways to do that.

Let's take the SalePrice column. Note that we can run describe on a single column, or series, in addition to a data frame. When we run describe on a series, we get back a series. The index of that series holds the names of the summary statistics, and the values hold the corresponding number for each statistic.

I'm just going to quickly go through these. Look at the minimum value. It looks like there's a house that sold for $12,000. That seems pretty cheap, so I might want to check out that row or the other low values. I also might want to look at the high side. We have a house that sold for $755,000, and I'd want to make sure that looks about right. I could also look at the mean and the median. Those look pretty close to each other. And remember, count is not the number of rows but the number of non-missing values. In this case it's the same as the number of rows, so we don't have any missing SalePrice values.

One of the ways to visualize a numeric column is a histogram, and again, this is very easy in pandas. We just say .hist, and here's the histogram. You can see that it's somewhat skewed to the right. It rises quickly to a peak around $150,000 and then tails off, with a long tail going out to around $700,000. We know the data goes up to at least $700,000, because otherwise the axis wouldn't extend that far. However, there aren't very many entries at that end.

Now, you need to be careful with histograms because you can tell different stories with them. By default, pandas is going to give you 10 bins. People often ask, "Is 10 bins the correct number of bins?" I would say the answer is it depends, but it might not be. This is one of those things people write dissertations about. I'm going to show you how you can change it. We can say bins is equal to 30 and get a little bit finer granularity. We could also say bins is equal to three, but that probably isn't a story we want to tell.

From inspecting this data, it looks like somewhere around $150,000 is the most common value, or the mode. When we run describe, we don't get a mode reported, but we did get a median and a mean, which were both around that value as well.

You can see that, for whatever reason, there's a little bump around $550,000 and another around $600,000. Those are kind of interesting to me. They probably indicate that these houses sold at round numbers like $600,000 or $550,000 rather than some value in between. It's also interesting that we're not seeing that at other values: there are no big spikes at $300,000, $200,000, or $100,000, and it doesn't look like we're seeing bumps at the $50,000 level either. We can bump up the number of bins and look at finer granularity. That might yield useful information, or it might just be noise.

In this lesson, we took the SalePrice column and computed summary statistics on it. We also looked at a very common visualization, the histogram, and showed how changing the number of bins can tell a different story. A histogram is a great way to get a feel for how the data is distributed.
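The steps in this lesson can be sketched roughly as follows. The exercise file isn't reproduced in this transcript, so the sketch uses a synthetic, right-skewed series as a stand-in for the SalePrice column. The variable names, the sample size, and the log-normal parameters are assumptions for illustration only, and the numbers it prints will not match the ones quoted in the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for SalePrice (NOT the real housing data):
# a log-normal sample is right-skewed, like the prices described in the lesson.
rng = np.random.default_rng(0)
sale_price = pd.Series(
    rng.lognormal(mean=12.0, sigma=0.4, size=2000).round(-2),
    name="SalePrice",
)

# describe() on a Series returns a Series whose index holds the statistic names.
summary = sale_price.describe()
print(summary)

# count reports non-missing values; compare it with the number of rows.
n_missing = len(sale_price) - int(summary["count"])
print("missing values:", n_missing)

# describe() has no mode. Series.mode() gives the most frequent exact value(s),
# which need not coincide with the tallest bar of a histogram.
print(sale_price.mode())

# Same data drawn with three bin counts: 10 (the pandas default), 30, and 3.
bin_counts = {}
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [10, 30, 3]):
    sale_price.hist(bins=bins, ax=ax)
    ax.set_title(f"bins={bins}")
    # np.histogram returns the per-bin counts behind an equivalent plot.
    counts, edges = np.histogram(sale_price.to_numpy(), bins=bins)
    bin_counts[bins] = counts
fig.tight_layout()
```

In a notebook, `sale_price.hist(bins=30)` on its own displays the chart inline. The three-panel layout here is just one way to compare bin counts side by side. Every panel contains the same observations, and only the grouping changes.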