From the course: Machine Learning with Data Reduction in Excel, R, and Power BI

Dimensionality

- [Instructor] Data tables are made up of rows and columns of data. The rows serve as records of the individual data entries, while the columns serve as dimensions in the model. In this example table, we have eight records of five dimensions that we can select to include in our model. When we talk about dimensions, I personally find it's really helpful to put them in visuals and understand data points as points in a chart, instead of trying to understand them exclusively from analyzing tables. We can think of a one-dimensional model as residing along a single axis, with the data points showing some sort of distribution. In contrast, a two-dimensional scatter plot displays two inputs for each data point and spreads the points out over two axes, for example temperature and rainfall instead of just temperature. We can create charts for three-dimensional data, but those are often difficult to parse, especially on a two-dimensional view. So for the purposes of this course, when we walk through the algorithms and want to visualize them to understand them, we'll typically stick to two-dimensional scatter plots. Right now we see the weather data for all 25 U.S. metro areas over the entire time period they recorded weather data, which ranges from several decades to over 100 years of daily measurements. However, if we add filters to the existing data through the Data tab, we can see that by filtering only for New York City, both charts update accordingly. In Excel, we change the appearance of the line and scatter plots by applying a filter to show only the data for New York, for example. Now, in R, let's create a separate new variable specifically for New York that we'll call NY. Next, we'll assign it the result of filtering the data frame, which we reference as the df variable followed by square brackets. Inside these square brackets, we choose the field to filter the data frame by.
In our example, we're going to set the city field from the df variable equal to New York. We use the dollar sign to access the city column in the df variable. Notice that we use double equal signs to indicate an exact match in this condition, and we put New York in quotation marks. Lastly, we add a comma after our filter to indicate that the condition applies to the rows and that we're keeping all of the columns. Nothing follows this comma except the closing square bracket that ends the expression. We'll run this line. Once we've set up the NY variable, let's use the head function to make sure we're filtering properly for New York only. Now let's use the data frame we just created for New York in a scatter plot of its own. We're going to set this up in a similar way to what we did in the numerosity video, except we'll change the data to NY instead of df as the data frame we're using in the plot function. We get an error message about our plot because the figure margins are too large for this space. We're going to reduce the size of our data frame to just a few data points in the next video, so we'll see how to adjust this later. For now, let's select specific columns to include in the data frame. I'm going to call this new data frame sample, and we'll use the same df variable. First, place a comma, because we're not filtering any rows, but we do want to select columns. Next, we'll use the c function in R to combine the column names into a vector. We then type the field names we want to include into this function. So we'll include TMAX, TMIN, and let's say PRCP. We'll run this line, and again we'll use the head function to make sure that we've only included certain columns. We see that we have just a few columns in our data frame. Of course, we no longer have our city or dates, so we want to think about which fields we need to include as both labels and measures.
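The filtering and column-selection steps described above can be sketched in R roughly as follows. This is a minimal illustration, not the course's actual script: the small data frame is made-up toy data, and the lowercase column names `city` and `date` are assumptions (the transcript only names TMAX, TMIN, and PRCP explicitly).

```r
# Toy stand-in for the course's weather data frame. The values and the
# column names `city` and `date` are assumptions for illustration only.
df <- data.frame(
  city = c("New York", "New York", "Chicago", "Boston", "New York"),
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01",
                   "2020-01-01", "2020-01-03")),
  TMAX = c(41, 38, 30, 36, 45),
  TMIN = c(30, 27, 18, 25, 33),
  PRCP = c(0.00, 0.12, 0.05, 0.00, 0.30),
  stringsAsFactors = FALSE
)

# Row filter: the condition goes before the comma. The empty slot after
# the comma keeps every column.
NY <- df[df$city == "New York", ]
head(NY)

# Column selection: the empty slot before the comma keeps every row.
# c() combines the column names into a character vector.
sample <- df[, c("TMAX", "TMIN", "PRCP")]
head(sample)

# A variation (not shown in the video): keep the label fields alongside
# the measures so each row can still be identified.
labeled <- df[, c("city", "date", "TMAX", "TMIN", "PRCP")]
head(labeled)
```

The two positions inside the square brackets are what matter here: rows before the comma, columns after it. Leaving either position empty means "keep all of them."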
We'll continue to use these functions to reduce our dimensionality by reducing the number of columns, and we'll also reduce the number of rows to make the data easier to analyze and use in our clustering and PCA models. Selecting the rows and columns we want to use from a larger data set like this is an important part of working with algorithms, because it gives us a much more manageable data set size. Let's see what the New York data frame variable NY looks like in a scatter plot. We'll set up the plot in the same way that we did for the large data frame variable df, except we'll use NY as our data instead. If you're having trouble getting the plot to render, please check out how to adjust the settings for RStudio. To see it in a full view rather than the compressed view, let's select Zoom. Here we can see what the trends for New York rainfall versus maximum temperature look like over a 100-plus-year period in a scatter plot.
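As a rough sketch of the scatter plot described above, the following uses base R's plot function on a toy NY data frame. The values are made up, and the choice of axes (TMAX on x, PRCP on y) is an assumption, since the transcript does not show the exact call used in the numerosity video. The `par(mar = ...)` line is one common way to work around a "figure margins too large" error and is not necessarily the adjustment the course makes later.

```r
# Toy stand-in for the New York subset; values are illustrative only.
NY <- data.frame(
  TMAX = c(41, 38, 45, 72, 85),
  PRCP = c(0.00, 0.12, 0.30, 0.45, 0.02)
)

# Shrink the margins (bottom, left, top, right, in lines of text) and
# remember the previous settings so they can be restored afterwards.
old_par <- par(mar = c(4, 4, 2, 1))

# Two-dimensional scatter plot: one point per record, one axis per dimension.
plot(NY$TMAX, NY$PRCP,
     xlab = "TMAX", ylab = "PRCP",
     main = "New York: PRCP vs. TMAX")

# Restore the previous graphics settings.
par(old_par)
```

If the plot still does not render in RStudio, enlarging the Plots pane before running the call is another option worth trying.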
