From the course: Machine Learning with Data Reduction in Excel, R, and Power BI

Heatmaps and dendrograms

- [Instructor] One thing we may notice when we calculate distances between points is that the scale of the axes may differ. For example, we may expect it to be 70 degrees Fahrenheit today, but we wouldn't necessarily expect to get 70 inches of rain unless it was perhaps a hurricane. If we think about how these distances work, this means the temperature has a much greater impact on the calculated distances than the rainfall. When working with algorithms like clustering, it's important to scale the inputs so the dimensions aren't skewed by differences in scale like this. One way we can account for this in Excel is by multiplying each one of the rainfall values by 365.25. We're multiplying by 365.25 instead of 365 to approximate the leap year that occurs every four years in our total yearly numbers. We'll add the number in a new cell in Excel. We'll then copy it, highlight the rainfall cells, and use the Paste Special keys, where we'll select V for values and M to multiply each one of these rainfall values by 365.25. We can then delete the value because we don't need it anymore.
Another way to visualize dendrograms for clustering is in conjunction with a heat map visual. Let's see how this works in Excel by applying conditional formatting to our table. I'm just going to delete this really quickly. We'll select all the rows except the headers. Then, on the Home tab, we'll select Conditional Formatting. We'll select More Rules for the color scales. Right now, the lowest value has a very bright orange, but what we want to do is reverse this. The reason we want to reverse the color scheme is so that it matches the dendrogram results that we'll later see in R. We'll select More Colors and I'll give this a light orange hex value. Conversely, we'll follow the same process for the highest value to set it to a dark orange shade. Once we confirm our choices, we see that the averages table we've scaled now has different shades of orange on it.
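To see why the difference in scale matters for distances, here is a minimal R sketch. The city values are made up for illustration and are not the course's data set, and the sketch assumes the rainfall column holds average daily inches, which is what multiplying by 365.25 suggests.

```r
# Hypothetical values for illustration only; not the course's averages table.
averages <- data.frame(
  city        = c("Tampa", "Orlando", "Phoenix"),
  temperature = c(74, 72, 75),          # average temperature, degrees F (made up)
  rainfall    = c(0.127, 0.140, 0.022)  # assumed average daily rainfall, inches (made up)
)
rownames(averages) <- averages$city

# Euclidean distances on the raw values: temperature differences dominate.
raw_dist <- as.matrix(dist(averages[, c("temperature", "rainfall")]))

# Multiply rainfall by 365.25 to approximate a yearly total, then recompute.
scaled <- averages
scaled$rainfall <- scaled$rainfall * 365.25
scaled_dist <- as.matrix(dist(scaled[, c("temperature", "rainfall")]))

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

With these made-up numbers, Tampa is nearest to Phoenix before scaling and nearest to Orlando afterward, because the rainfall differences now carry weight in the distance.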
To scale the rainfall field in RStudio, I'm going to take the existing rainfall averages, and let's put this step after we update the column names. We're going to assign the rainfall field its original value multiplied by 365.25. Now, if we run the hierarchical clustering model again, let's see how the results differ. When we originally set up the diagram, we had Tampa and Phoenix clustered close together because they were close in temperature, but the precipitation, the rainfall field, wasn't scaled enough to differentiate the distances on the Cartesian coordinates. We've now scaled the rainfall so that it's accounted for in the Euclidean distances instead of giving most of the weight to temperature. So we'll run our results again and see what they look like. We now see Tampa is much closer to Orlando.
Another way to visualize dendrograms for clustering is in conjunction with a heat map visual in RStudio. We can create this visual directly to expand on the cluster dendrogram visual we already created that represents the hierarchical clustering outcomes. Let's first create a heat map using the built-in heatmap function in R, referencing the averages data frame within it. This doesn't return a result, and the message tells us this is because our averages data frame needs to be a matrix. The other thing we need to do is remove the city label, because that's not a numeric input in the model. So we're going to filter the columns by first using a comma and then c, and inside it we'll refer to our temperature and rainfall fields. We'll then put as.matrix around our averages data frame, and let's run the model again. Right now we see the matrix appears with the temperature and the rainfall on the X axis at the bottom and the city names on the Y axis. We want to flip the matrix by transposing it so that we see the cities in the labels below, going across the table.
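The rescaling and the second clustering run can be sketched as follows. This is an illustration under assumptions, not the instructor's script: the data frame values are made up, the column names temperature and rainfall are assumed, and hclust is used with its default settings.

```r
# Hypothetical values for illustration only; not the course's averages table.
averages <- data.frame(
  city        = c("Tampa", "Orlando", "Phoenix"),
  temperature = c(74, 72, 75),          # degrees F (made up)
  rainfall    = c(0.127, 0.140, 0.022)  # assumed average daily inches (made up)
)
rownames(averages) <- averages$city

# Assign the rainfall field its original value multiplied by 365.25.
averages$rainfall <- averages$rainfall * 365.25

# Hierarchical clustering on Euclidean distances of the numeric columns only.
model <- hclust(dist(averages[, c("temperature", "rainfall")]))
plot(model, main = "Cluster Dendrogram")

# Cut the tree into two groups to see which cities end up together.
groups <- cutree(model, k = 2)
print(groups)
```

With these made-up numbers, Tampa and Orlando land in the same group once rainfall is scaled, which mirrors the change described in the dendrogram.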
We can do that using the lowercase t function. I'm just going to run all the steps again. And let's open up our visual. We now have a heat map together with a cluster dendrogram. It gives us a perspective on both the values and the distances between them in Cartesian coordinates, as on the scatter plot, through both the colors and the tree structure that we see on top of the visual. This visual combines a traditional heat map on a table with the cluster dendrogram in the same view.
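Put together, the heat map steps might look like the sketch below. The data values are again made up and the column names are assumed. The calls themselves (as.matrix, t, and heatmap) are base R functions.

```r
# Hypothetical values for illustration only; not the course's averages table.
averages <- data.frame(
  city        = c("Tampa", "Orlando", "Phoenix"),
  temperature = c(74, 72, 75),          # degrees F (made up)
  rainfall    = c(0.127, 0.140, 0.022)  # assumed average daily inches (made up)
)
rownames(averages) <- averages$city
averages$rainfall <- averages$rainfall * 365.25

# heatmap() needs a numeric matrix, so keep only the numeric columns
# (dropping the city label) and convert the result with as.matrix().
values <- as.matrix(averages[, c("temperature", "rainfall")])

# Transpose so the cities run across the bottom of the visual.
flipped <- t(values)

# Draws the colored grid with dendrograms along the margins.
heatmap(flipped)
```

Passing the data frame directly to heatmap stops with an error because the function expects a numeric matrix, which is why the conversion comes first.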
