Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants as they perform barbell lifts correctly and incorrectly in 5 different ways.
The training data for this project are available at:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available at:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The goal of this project is to predict the manner in which subjects did the exercise, which is recorded in the "classe" variable in the training set; any of the other variables may be used as predictors. This report describes:
- How the model is built
- Use of cross validation
- An estimate of expected out of sample error
The first step is to download the data, load it into R, and prepare it for the modeling process.
All required packages are loaded and static variables are assigned. The seed is also set in this section so that the pseudo-random number generator behaves consistently, making the analysis reproducible.
library(rpart.plot)
library(caret)
library(rpart)
library(rattle)
library(RColorBrewer)
library(randomForest)
library(e1071)
set.seed(1)
# You should download the required files first (a sketch is shown below)
path <- file.path(getwd(), "data")
train.file <- file.path(path, "pml-training.csv")
test.file <- file.path(path, "pml-testing.csv")
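If the files are not already present in the data directory, they can be fetched from the URLs listed above. The following is a minimal sketch, assuming the working directory is writable and the folder name matches the path defined above.
# Download the source files into ./data if they are not already there
if (!dir.exists(path)) {
  dir.create(path)
}
if (!file.exists(train.file)) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                destfile=train.file)
}
if (!file.exists(test.file)) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                destfile=test.file)
}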
Read the data, treating the strings "NA", "#DIV/0!", and "" as missing values (NA).
train.data.raw <- read.csv(train.file, na.strings=c("NA","#DIV/0!",""))
test.data.raw <- read.csv(test.file, na.strings=c("NA","#DIV/0!",""))
Columns that are not needed for the model and columns that contain NAs are eliminated.
# Drop the first 7 columns as they're unnecessary for predicting.
train.data.clean1 <- train.data.raw[,8:length(colnames(train.data.raw))]
test.data.clean1 <- test.data.raw[,8:length(colnames(test.data.raw))]
# Drop columns with NAs
train.data.clean1 <- train.data.clean1[, colSums(is.na(train.data.clean1)) == 0]
test.data.clean1 <- test.data.clean1[, colSums(is.na(test.data.clean1)) == 0]
# Check for near zero variance predictors and drop them if necessary
nzv <- nearZeroVar(train.data.clean1, saveMetrics=TRUE)
zero.var.ind <- sum(nzv$nzv)
if (zero.var.ind > 0) {
  train.data.clean1 <- train.data.clean1[, nzv$nzv == FALSE]
}
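As an optional sanity check (not required for the model fit), the remaining column names in the training and test sets can be compared; apart from classe and problem_id they should match.
# Optional check: predictors kept in both data sets should line up
setdiff(names(train.data.clean1), names(test.data.clean1))
setdiff(names(test.data.clean1), names(train.data.clean1))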
The training data is divided into two sets. The first is a training set containing 70% of the data, which is used to train the model. The second is a validation set used to assess model performance.
in.training <- createDataPartition(train.data.clean1$classe, p=0.70, list=F)
train.data.final <- train.data.clean1[in.training, ]
validate.data.final <- train.data.clean1[-in.training, ]
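A quick look at the dimensions confirms the roughly 70/30 split (an optional check, not needed for the analysis).
# Roughly 70% of the rows should fall in the training partition
dim(train.data.final)
dim(validate.data.final)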
The training data set is used to fit a Random Forest model because it automatically selects important variables and is generally robust to correlated covariates and outliers. 5-fold cross-validation is used when applying the algorithm. A Random Forest averages multiple deep decision trees, trained on different parts of the same data set, with the goal of reducing the variance; this typically produces better performance at the expense of bias and interpretability. Cross-validation assesses how the results of a statistical analysis will generalize to an independent data set. In 5-fold cross-validation, the original sample is randomly partitioned into 5 equal-sized sub-samples: a single sub-sample is retained for validation and the remaining sub-samples are used as training data. The process is repeated 5 times, with each sub-sample used once for validation, and the results from the folds are averaged.
# Use 5-fold cross-validation when training the random forest
control.parms <- trainControl(method="cv", 5)
rf.model <- train(classe ~ ., data=train.data.final, method="rf",
                  trControl=control.parms, ntree=251)
rf.model
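If desired, the accuracy achieved on each of the 5 cross-validation folds can also be inspected; caret keeps the resampled results for the selected tuning parameters in the resample element of the fitted object.
# Accuracy and Kappa for each cross-validation fold
rf.model$resample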
The model fit on the training data is tested against the validation data. Predicted values for the validation data are then compared to the actual values. This allows estimating the accuracy and the overall out-of-sample error, which indicate how well the model is expected to perform on new data.
rf.predict <- predict(rf.model, validate.data.final)
# confusionMatrix expects the predictions first and the reference values second
confusionMatrix(rf.predict, validate.data.final$classe)
accuracy <- postResample(rf.predict, validate.data.final$classe)
acc.out <- accuracy[1]
overall.ose <- 1 - as.numeric(
  confusionMatrix(rf.predict, validate.data.final$classe)$overall[1])
The accuracy of this model is `r acc.out` and the overall out-of-sample error is `r overall.ose`.
The model is applied to the test data to produce the results.
# The last column of the test set (problem_id) is not a predictor, so drop it
results <- predict(rf.model,
                   test.data.clean1[, -length(names(test.data.clean1))])
results
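A helper such as the one below (a hypothetical convenience function, not part of the original workflow) can write each prediction to its own text file for submission.
# Hypothetical helper: write one prediction per file (problem_1.txt, problem_2.txt, ...)
write.results <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE,
                row.names=FALSE, col.names=FALSE)
  }
}
write.results(results)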
# Fit a single classification tree for visualization purposes
treeModel <- rpart(classe ~ ., data=train.data.final, method="class")
fancyRpartPlot(treeModel)
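Since the random forest's selection of important variables was mentioned above, the variable importance can also be examined directly (an optional step, not part of the original writeup); caret exposes it through varImp.
# Most influential predictors according to the random forest
varImp(rf.model)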