From the course: Complete Guide to Analytics Engineering

Grouping data in pandas

- [Instructor] Up until this point, the sales data set we've worked with has had one row for each sale made by our sales team. When we start grouping or aggregating our data, we take one or more dimensions, like date and sales employee, and then sum up the quantity of sales that employee had on that date. Pretty straightforward, but a very important and common operation in analytics engineering.

Common aggregations you can perform in Python Pandas are sum, where we add numbers up; mean, where we add all the numbers up and divide by the count of values summed; min, where we take the smallest number in a group; max, where we take the largest number in a group; count, where only the non-null values in a group are counted; size, where every row in a group is counted, including null values; and more.

So why aggregate data? Because it can provide valuable insights to stakeholders. For example, a sales manager would want to know how many total sales each employee had last month. We would use group by and sum to find that number. If they want to see which sales employee had the largest deal, we could use group by and max to find that. Grouping and aggregating data can also reveal trends that might not be noticeable with just a table of sales transactions. If we group sales by month instead of employee ID, we can see how last month compared to the previous month and so forth. This might let us know if sales for a certain product are decreasing or if total sales for the whole company are up this year compared to last. Grouping data is all about shifting our focus from the granular data to higher levels for insights so we can understand the business, our employees, and our customers.

Let's jump back into Python. To start, let's create a new Python script in our file explorer in GitHub Codespaces. Remember, there's a link to our GitHub repo in this description. Navigate to the branch for this video, 04_02B, if you're just joining us.
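Before opening the notebook, here is a minimal sketch of how the aggregations listed above behave on a tiny, made-up frame. The data and the column names `employee_id` and `quantity` are assumptions for illustration only, not the course data set. One quantity is deliberately left null so the difference between count and size is visible.

```python
import pandas as pd

# Hypothetical toy data (not the course file); one quantity is missing on purpose.
sales = pd.DataFrame({
    "employee_id": ["E1", "E1", "E1", "E2", "E2"],
    "quantity": [2, 4, None, 5, 1],
})

# Group the quantity column by employee.
grouped = sales.groupby("employee_id")["quantity"]

# Collect each aggregation side by side, one row per employee.
summary = pd.DataFrame({
    "sum": grouped.sum(),      # adds the non-null values
    "mean": grouped.mean(),    # sum divided by the number of non-null values
    "min": grouped.min(),      # smallest value in the group
    "max": grouped.max(),      # largest value in the group
    "count": grouped.count(),  # non-null values only
    "size": grouped.size(),    # all rows, nulls included
})

print(summary)
```

For employee E1, count is 2 while size is 3, because the row with the missing quantity is still a row in the group.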
I'm going to call this new script data_merges.ipynb. Let's start with sum, probably the most popular of the aggregations across all of analytics. If you remember, we used this function once before in chapter three, but let's practice again. We'll need to group by the employee ID dimension and sum the quantity measure. First, let's import Pandas. Next, let's read in our red_20_tech_us_sales_cleaned.csv. Next, we'll create a new variable called sum_sales_quantity, and then we'll use the groupby function, grouping by employee ID, and we'll sum quantity. Lastly, let's print sum_sales_quantity. You'll have to reconnect to our kernel, Python 3.12.1, and now we see the output, where quantity is summed by employee ID.

Now in a new cell, let's practice with mean, minimum, and maximum. We can borrow some of the code up here to save us some time. Instead of sum, we'll use mean, and we'll call this one mean_sales_quantity. Now let's copy that whole line and paste it below. Let's call this one min_sales_quantity and use the min function. Let's call this one max_sales_quantity and use the max function. Let's run that really quick to make sure it works. Now let's print each of those new variables to see their output. We can see the mean sales quantity for each employee now. Copy the min_sales_quantity variable and print that as well. This is the smallest quantity each employee has sold in a single sale, which is one, which makes sense. Lastly, let's print max_sales_quantity. If you open up a cell in the wrong place, you can always click and drag it to the bottom, just like I did there. Great, now we can see the largest quantity each employee has sold in a single sale.

Now let's have some fun and graph a couple of these aggregations we just created. Pandas doesn't draw charts on its own; its plot method relies on another library that's really easy to use. It's called Matplotlib, and it's really popular. In a new cell, we'll import matplotlib.pyplot as plt.
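The grouping steps narrated so far could look roughly like the sketch below. The variable names come from the narration, but the column names `employee_id` and `quantity` are assumptions, and a small inline frame stands in for the cleaned sales CSV so the example runs on its own. In the notebook, the frame would instead come from `pd.read_csv` on the course file.

```python
import pandas as pd

# Stand-in for the cleaned sales file: one row per sale.
# Column names and values are hypothetical.
sales = pd.DataFrame({
    "employee_id": [101, 101, 102, 102, 102, 103],
    "quantity": [1, 3, 2, 1, 6, 4],
})

# Group by employee and aggregate the quantity column four ways.
sum_sales_quantity = sales.groupby("employee_id")["quantity"].sum()
mean_sales_quantity = sales.groupby("employee_id")["quantity"].mean()
min_sales_quantity = sales.groupby("employee_id")["quantity"].min()
max_sales_quantity = sales.groupby("employee_id")["quantity"].max()

# Each result is a Series indexed by employee_id.
print(sum_sales_quantity)
print(mean_sales_quantity)
print(min_sales_quantity)
print(max_sales_quantity)
```

Selecting the `quantity` column before aggregating keeps each result a single Series, which is convenient for printing and for plotting later.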
We'll take sum_sales_quantity and plot it. We want to see a bar chart, so we'll tell it to use bar. We'll set the size, and we can also set the color. We can add a few labels to our chart. Let's do that now. Lastly, we'll use plt.show so that it'll show us our graph. Execute that. Looks like I've spelled figsize wrong. Little things like this happen all the time in Python. You have to give it exactly the names it expects. Awesome. I'm going to close this output down here so we can see a little bit better. Here's our sum of quantity per employee, where the employee IDs are shown on the X axis and the sum of their quantity is shown on the Y axis. You can see quite a lot of variability in our sales employees. I would expect this because not everybody started on the same day at the company, so there's definitely going to be some variability here. Also, not all salespeople are equally good at selling. One of the things I love about Python is how quickly we can build visuals like this with just a few lines of code.

Now let's make one more graph. Let's create a new cell. This time we'll use mean_sales_quantity and sort our data in descending order first so we can check the distribution of our data. Let's create a new variable called sorted_mean_sales_quantity. We use the sort_values function, with ascending set to False. We use the plot function again. Nice, now our mean values are sorted in descending order, and we're seeing a right-skewed distribution of average sales quantity. There appear to be a few employees with a far greater average order size. You can see that with the leftmost bar, which is quite a bit higher than the next bars. This might suggest that we have a couple of employees who deal in larger orders.

Awesome job with grouping and aggregating our data. It's exciting to take data, clean it, model it, and visualize it. I get excited when I first see the distribution of a data set I just started working with. It's like a little surprise.
I hope you felt that excitement too when your visual generated. Up next, we're going to use the merge function to bring multiple data sets and data frames together so we can try to draw more conclusions about our sales employees with the four aggregations we just built.
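As a recap, the two charts built in this video could be sketched along these lines. This is an illustration under assumptions rather than the exact notebook code: the two Series are small hypothetical stand-ins for the grouped results, the colors and label text are made up, and the figure size is passed to `plt.subplots` so each chart gets its own figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the grouped results from earlier in the video.
employee_ids = pd.Index([101, 102, 103], name="employee_id")
sum_sales_quantity = pd.Series([4, 9, 4], index=employee_ids, name="quantity")
mean_sales_quantity = pd.Series([2.0, 3.0, 4.0], index=employee_ids, name="quantity")

# Chart 1: total quantity per employee as a bar chart.
fig, ax = plt.subplots(figsize=(10, 6))
sum_sales_quantity.plot(kind="bar", ax=ax, color="steelblue")
ax.set_title("Total quantity sold per employee")
ax.set_xlabel("Employee ID")
ax.set_ylabel("Sum of quantity")
plt.show()

# Chart 2: mean quantity per employee, sorted from largest to smallest.
sorted_mean_sales_quantity = mean_sales_quantity.sort_values(ascending=False)
fig2, ax2 = plt.subplots(figsize=(10, 6))
sorted_mean_sales_quantity.plot(kind="bar", ax=ax2, color="darkorange")
ax2.set_title("Mean quantity per sale, by employee")
ax2.set_xlabel("Employee ID")
ax2.set_ylabel("Mean quantity")
plt.show()
```

Sorting before plotting puts the tallest bar on the left, which is what makes a skew in the averages easy to spot.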
