In this first machine learning tutorial, we'll create a linear regression model that predicts the price of an automobile based on different variables such as make and technical specifications. To do this, we'll use Azure Machine Learning Studio to develop and iterate on a simple predictive analytics experiment.
You build a Machine Learning Studio experiment by dragging components to a canvas and connecting them to create a model, train the model, and then score and test it. The experiment applies predictive modeling techniques in the form of Machine Learning Studio modules that ingest data, train a model against it, and apply the model to new data. You can also add modules to preprocess data and select features, split data into training and test sets, and evaluate or cross-validate the quality of your model.
To get started, go to Machine Learning Studio at https://studio.azureml.net and click the Get started button. You can either choose Guest Access or sign in with your Microsoft account.
For more general information about Machine Learning Studio, see What is Machine Learning Studio?
In this machine learning tutorial, you'll follow five basic steps to build an experiment in Machine Learning Studio in order to create, train, and score your model:
There are a number of sample datasets included with Machine Learning Studio that you can choose from, and you can import data from many sources. For this example, we will use the included sample dataset, Automobile price data (Raw). This dataset includes entries for a number of individual automobiles, including information such as make, model, technical specifications, and price.
Start a new experiment by clicking +NEW at the bottom of the Machine Learning Studio window, select EXPERIMENT, and then select Blank Experiment. Select the default experiment name at the top of the canvas and rename it to something meaningful, for example, Automobile price prediction.
To the left of the experiment canvas is a palette of datasets and modules. Type automobile in the Search box at the top of this palette to find the dataset labeled Automobile price data (Raw).
Drag the dataset to the experiment canvas.
To see what this data looks like, click the output port at the bottom of the automobile dataset, and then select Visualize. The variables in the dataset appear as columns, and each instance of an automobile appears as a row. The far-right column (column 26, titled "price") is the target variable we're going to try to predict.
Close the visualization window by clicking the "x" in the upper-right corner.
A dataset usually requires some preprocessing before it can be analyzed. You might have noticed that some rows have missing values in certain columns. These missing values need to be cleaned so the model can analyze the data correctly. In our case, we'll remove any rows that have missing values. Also, the normalized-losses column has a large proportion of missing values, so we'll exclude that column from the model altogether.
First we'll remove the normalized-losses column, and then we'll remove any row that has missing data.
Type project columns in the Search box at the top of the module palette to find the Project Columns module, then drag it to the experiment canvas and connect it to the output port of the Automobile price data (Raw) dataset. This module allows us to select which columns of data we want to include or exclude in the model.
Select the Project Columns module and click Launch column selector in the Properties pane.
The properties pane for Project Columns indicates that it will pass through all columns from the dataset except normalized-losses.
You can add a comment to a module by double-clicking the module and entering text. This can help you see at a glance what the module is doing in your experiment. In this case, double-click the Project Columns module and type the comment "Exclude normalized-losses."
Drag the Clean Missing Data module to the experiment canvas and connect it to the Project Columns module. In the Properties pane, select Remove entire row under Cleaning mode to clean the data by removing rows that have missing values. Double-click the module and type the comment "Remove missing value rows."
Run the experiment by clicking RUN under the experiment canvas.
When the experiment is finished, all the modules have a green check mark to indicate that they finished successfully. Notice also the Finished running status in the upper-right corner.
All we have done in the experiment to this point is clean the data. If you want to view the cleaned dataset, click the left output port of the Clean Missing Data module ("Cleaned dataset") and select Visualize. Notice that the normalized-losses column is no longer included, and there are no missing values.
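For readers who think in code, the two cleaning steps above can be sketched in plain Python. This is only an illustration of the logic, not how Machine Learning Studio implements its modules. The sample rows, the use of `None` for a missing value, and the helper names `exclude_column` and `remove_rows_with_missing` are all assumptions made for this sketch.

```python
# Hypothetical sample rows; None stands in for a missing value (an assumption).
rows = [
    {"make": "audi", "normalized-losses": 164, "horsepower": 102, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "horsepower": 101, "price": 16430},
    {"make": "isuzu", "normalized-losses": None, "horsepower": 78, "price": None},
]


def exclude_column(rows, column):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def remove_rows_with_missing(rows):
    """Keep only the rows in which every remaining value is present."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Step 1: exclude the sparse column. Step 2: drop rows that still have gaps.
projected = exclude_column(rows, "normalized-losses")
cleaned = remove_rows_with_missing(projected)

for row in cleaned:
    print(row)
```

The order matters: because the sparse column is excluded first, a row whose only missing value was in normalized-losses is kept rather than discarded.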
Now that the data is clean, we're ready to specify what features we're going to use in the predictive model.
In machine learning, features are individual measurable properties of something you’re interested in. In our dataset, each row represents one automobile, and each column is a feature of that automobile. Finding a good set of features for creating a predictive model requires experimentation and knowledge about the problem you want to solve. Some features are better for predicting the target than others. Also, some features have a strong correlation with other features (for example, city-mpg versus highway-mpg), so they will not add much new information to the model, and they can be removed.
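One rough way to check whether two numeric features carry overlapping information is to compute their Pearson correlation. The following is a minimal sketch in plain Python. The `pearson` helper and the mpg values are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only.
city_mpg = [21, 19, 24, 18, 31, 27]
highway_mpg = [27, 26, 30, 22, 38, 34]

r = pearson(city_mpg, highway_mpg)
print(f"correlation between city-mpg and highway-mpg: {r:.3f}")
```

A value close to 1 or -1 suggests that one of the two features adds little new information once the other is in the model.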
Let's build a model that uses a subset of the features in our dataset. You can come back and select different features, run the experiment again, and see if you get better results. As a first guess, we'll select the following features (columns) with the Project Columns module. Note that for training the model, we need to include the price value that we're going to predict.
make, body-style, wheel-base, engine-size, horsepower, peak-rpm, highway-mpg, price
Drag another Project Columns module to the experiment canvas and connect it to the left output port of the Clean Missing Data module. Double-click the module and type "Select features for prediction."
Click Launch column selector in the Properties pane.
In the column selector, select No columns for Begin With, and then select Include and column names in the filter row. Enter our list of column names. This directs the module to pass through only columns that we specify.
Because we've run the experiment, the column definitions for our data have passed from the original dataset through the Clean Missing Data module. When you connect Project Columns to Clean Missing Data, the Project Columns module becomes aware of the column definitions in our data. When you click the column names box, a list of columns is displayed, and you can select the columns that you want to add to the list.
Click the check mark (OK) button.
This produces the dataset that will be used in the learning algorithm in the next steps. Later, you can return and try again with a different selection of features.
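Conceptually, starting from no columns and including a list of names amounts to keeping only those keys from each row. Here is a minimal Python sketch of that idea. The `select_columns` helper and the sample row are assumptions for illustration, not part of Machine Learning Studio.

```python
# The feature list from this tutorial, including the price we want to predict.
FEATURES = ["make", "body-style", "wheel-base", "engine-size",
            "horsepower", "peak-rpm", "highway-mpg", "price"]


def select_columns(rows, names):
    """Return new rows that contain only the named columns, in the given order."""
    return [{name: row[name] for name in names} for row in rows]


# Illustrative row with a couple of columns that are not in the feature list.
rows = [
    {"make": "audi", "fuel-type": "gas", "body-style": "sedan",
     "wheel-base": 99.8, "engine-size": 109, "horsepower": 102,
     "peak-rpm": 5500, "city-mpg": 24, "highway-mpg": 30, "price": 13950},
]

selected = select_columns(rows, FEATURES)
print(selected[0])
```

Trying a different feature set then only requires changing the list of names and running the selection again.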
Now that the data is ready, constructing a predictive model consists of training and testing. We'll use our data to train the model and then test the model to see how closely it's able to predict prices.
Classification and regression are two types of supervised machine learning techniques. Classification is used to make a prediction from a defined set of values, such as a color (red, blue, or green). Regression is used to make a prediction from a continuous set of values, such as a person's age.
We want to predict the price of an automobile, which can be any value, so we'll use a regression model. For this example, we'll train a simple linear regression model, and in the next step, we'll test it.
We can use our data for both training and testing by splitting it into separate training and testing sets. Select and drag the Split module to the experiment canvas and connect it to the output of the last Project Columns module. Set Fraction of rows in the first output dataset to 0.75. This way, we'll use 75 percent of the data to train the model, and hold back 25 percent for testing.
Run the experiment. This allows the Project Columns and Split modules to pass column definitions to the modules we'll be adding next.
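The idea behind a 0.75 split can be sketched in a few lines of Python. This sketch assumes the rows are shuffled before they are divided. The `split_rows` helper and its `seed` parameter are hypothetical names used only here.

```python
import random


def split_rows(rows, fraction, seed=0):
    """Shuffle a copy of the rows and split it into two parts.

    The first part receives `fraction` of the rows (rounded down),
    the second part receives the rest.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: 20 row identifiers instead of real automobile rows.
rows = list(range(20))
train, test = split_rows(rows, 0.75)
print(len(train), "training rows,", len(test), "test rows")
```

Every row ends up in exactly one of the two sets, so the model is never tested on data it was trained on.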
To select the learning algorithm, expand the Machine Learning category in the module palette to the left of the canvas, and then expand Initialize Model. This displays several categories of modules that can be used to initialize machine learning algorithms.
For this experiment, select the Linear Regression module under the Regression category (you can also find the module by typing "linear regression" in the palette Search box), and drag it to the experiment canvas.
Find and drag the Train Model module to the experiment canvas. Connect the left input port to the output of the Linear Regression module. Connect the right input port to the training data output (left port) of the Split module.
Select the Train Model module, click Launch column selector in the Properties pane, and then select the price column. This is the value that our model is going to predict.
Run the experiment.
The result is a trained regression model that can be used to score new samples to make predictions.
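To give a feel for what training a linear regression means, here is a deliberately simplified sketch that fits a straight line to a single feature using ordinary least squares. The experiment itself trains on several features and Machine Learning Studio's implementation is not shown here, so the `fit_line` helper and the made-up numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data, chosen to lie exactly on a line
# so the fitted result is easy to check by hand.
engine_size = [90, 110, 130, 150]
price = [7000, 10000, 13000, 16000]

slope, intercept = fit_line(engine_size, price)
print(f"price = {slope:.1f} * engine-size + {intercept:.1f}")
```

Real data does not lie on a perfect line, so the fitted line minimizes the squared differences between predicted and actual prices rather than matching every point.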
Now that we've trained the model using 75 percent of our data, we can use it to score the other 25 percent of the data to see how well our model functions.
Find and drag the Score Model module to the experiment canvas and connect the left input port to the output of the Train Model module. Connect the right input port to the test data output (right port) of the Split module.
Run the experiment. To view the output from the Score Model module, click the output port, and then select Visualize. The output shows the predicted values for price and the known values from the test data.
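Scoring simply means applying the trained model to rows it has not seen and placing each prediction next to the known value. The sketch below assumes a single-feature linear model. The `score` helper, the slope and intercept values, the test rows, and the `Scored Labels` key for the prediction are illustrative assumptions.

```python
def score(rows, slope, intercept, feature="engine-size"):
    """Return copies of the rows with a predicted price added to each one."""
    return [
        {**row, "Scored Labels": slope * row[feature] + intercept}
        for row in rows
    ]


# Hypothetical model parameters and held-back test rows.
slope, intercept = 150.0, -6500.0
test_rows = [
    {"engine-size": 100, "price": 8900},
    {"engine-size": 140, "price": 14200},
]

scored = score(test_rows, slope, intercept)
for row in scored:
    print(f"actual: {row['price']}, predicted: {row['Scored Labels']}")
```

Comparing the two values row by row gives a first impression of model quality, and summary statistics make that comparison systematic.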
Finally, to test the quality of the results, select and drag the Evaluate Model module to the experiment canvas, and connect the left input port to the output of the Score Model module. (There are two input ports because the Evaluate Model module can be used to compare two models.)
Run the experiment.
To view the output from the Evaluate Model module, click the output port, and then select Visualize. The following statistics are shown for our model:
For each of the error statistics, smaller is better. A smaller value indicates that the predictions more closely match the actual values. For Coefficient of Determination, the closer its value is to one (1.0), the better the predictions.
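To make these statistics concrete, the following sketch computes several common regression metrics from lists of actual and predicted values using their standard textbook definitions. It is not taken from the Evaluate Model module. The `regression_metrics` helper and the sample numbers are assumptions for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression error statistics from paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

    # Relative errors compare the model against always predicting the mean.
    relative_squared_error = sum(sq_errors) / sum(
        (a - mean_actual) ** 2 for a in actual
    )
    return {
        "mean_absolute_error": sum(abs_errors) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_errors) / n),
        "relative_absolute_error": sum(abs_errors)
        / sum(abs(a - mean_actual) for a in actual),
        "relative_squared_error": relative_squared_error,
        "coefficient_of_determination": 1 - relative_squared_error,
    }


# Hypothetical actual and predicted values.
actual = [10, 20, 30]
predicted = [12, 18, 33]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With perfect predictions every error statistic is 0 and the coefficient of determination is exactly 1, which matches the guidance that smaller errors and a value closer to 1.0 are better.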
The final experiment should look like this:
Now that you've completed your first machine learning tutorial and have your experiment set up, you can iterate to try to improve the model. For instance, you can change the features you use in your prediction. You can also modify the properties of the Linear Regression algorithm or try a different algorithm altogether. You can even add multiple machine learning algorithms to your experiment at one time and compare two of them by using the Evaluate Model module.
Use the SAVE AS button under the experiment canvas to copy any iteration of your experiment. You can see all the iterations of your experiment by clicking VIEW RUN HISTORY under the canvas. See Manage experiment iterations in Azure Machine Learning Studio for more details.
When you're satisfied with your model, you can deploy it as a web service to be used to predict automobile prices by using new data. See Deploy an Azure Machine Learning web service for more details.
For a more extensive and detailed walkthrough of predictive modeling techniques for creating, training, scoring, and deploying a model, see Develop a predictive solution by using Azure Machine Learning.