Machine learning in R: using random forest as an example
We will introduce the basic ideas of machine learning in R, using a random forest model as an example.
1 Preparation
We need to install the following packages 1:
```r
# install needed R packages
remotes::update_packages(c('magrittr', 'dplyr', 'randomForest', 'ROCR', 'GGally'), upgrade = TRUE)
```
To avoid function name conflicts, in the following code I will try my best to use `pkg::fun()` instead of `library(pkg)`. But magrittr is special: we have to `library()` it.
```r
library(magrittr)
```
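To see the kind of conflict `pkg::fun()` avoids, consider `filter()`: both base R's stats package and dplyr export a function with that name, and the `::` prefix makes it unambiguous which one runs. A minimal illustration:

```r
# both stats and dplyr export a function named `filter`;
# the :: prefix makes explicit which one we mean
stats::filter(1:10, rep(1, 3) / 3)              # moving average from base R
# dplyr::filter(iris, Species == 'versicolor')  # row selection from dplyr
```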
Before we start, let's set the random seed to make our results reproducible:
```r
set.seed(0)
```
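As a quick sketch of why seeding matters: the same seed always reproduces the same "random" draws (the exact numbers you get depend on your R version's random number generator):

```r
set.seed(0); sample(10, 3)  # three "random" numbers
set.seed(0); sample(10, 3)  # the same three numbers again
```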
2 Generate the data set
We use one of R's built-in data sets: `iris`, Edgar Anderson's Iris Data. The original data set contains observations of four features (sepal length and width, and petal length and width, all in cm) for 150 flowers of three species (50 each).
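If you have never worked with this data, a quick look at its structure before modeling may help:

```r
str(iris)   # 150 observations: 4 numeric features plus the Species factor
head(iris)  # the first few rows
```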
To make things simple, here we only choose two species, `versicolor` and `virginica`. 2
```r
df <- iris %>% tibble::as_tibble() %>%
    dplyr::filter(Species != 'setosa') %>%
    dplyr::mutate(Species = factor(Species))
df
```
Remember to drop unused levels using `factor()`, otherwise `randomForest::randomForest()` would complain. 3
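As a side note, base R's `droplevels()` achieves the same effect as re-calling `factor()`; a minimal sketch:

```r
# an equivalent way to drop empty levels from every factor column
df <- droplevels(df)
```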
Optional: before we build the model, we may explore the correlation between features 4.
3 Split the data set
Before we build the model, we need to split the data set into a training set and a testing set, so that we can train our model on the training set and evaluate it on the testing set. Here we randomly assign 80 percent of the samples to the training set, and the remaining 20 percent to the testing set 5.
```r
nrow_training <- floor(nrow(df) * 0.8)        # calculate the size of the training set
indexes <- sample(1:nrow(df), nrow_training)  # these rows will be selected for training

training <- df[indexes, ]
testing <- df[-indexes, ]
```
You may want to have a look at the result 6.
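One extra check worth making (not shown in the original output): since the sampling is random, the two species will only roughly keep their 50/50 balance within each split. A quick way to inspect it:

```r
# count each species in the two splits; the proportions vary from seed to seed
training %>% dplyr::count(Species)
testing %>% dplyr::count(Species)
```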
4 Build the model
Then we can build a random forest model 7.
```r
rf_classifier <- randomForest::randomForest(Species ~ ., training)
```
You can feed `randomForest::randomForest()` two arguments:
- a formula: here `Species` is the response variable, and `.` tells it that all other variables are features
- a data frame to train the model
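If you prefer to spell the features out, the `.` shorthand expands to the following equivalent call (assuming the column names of `iris`):

```r
# equivalent to Species ~ . : every column except Species is a feature
rf_explicit <- randomForest::randomForest(
    Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
    data = training
)
```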
5 Evaluate the model - basics
After we build the model, its fundamental use is to make predictions. That's exactly where the testing set comes in handy:
```r
predicted_class <- predict(rf_classifier, testing[, -ncol(testing)])
```
`predict()` needs two arguments: the model and a data frame of features 8.
We can compare the predicted class with the real class 9:
```r
real_class <- testing[[ncol(testing)]]

tibble::tibble(predicted_class, real_class) %>%
    dplyr::mutate(correct = predicted_class == real_class)
```
To quantify the performance of our model, we can construct a confusion matrix:
```r
# we define 'versicolor' as the positive class
TP <- sum((predicted_class == 'versicolor') & (real_class == 'versicolor'))  # true positive
FP <- sum((predicted_class == 'versicolor') & (real_class != 'versicolor'))  # false positive
FN <- sum((predicted_class != 'versicolor') & (real_class == 'versicolor'))  # false negative
TN <- sum((predicted_class != 'versicolor') & (real_class != 'versicolor'))  # true negative

tibble::tribble(
    ~'', ~'Predicted versicolor', ~'Predicted virginica',
    'True versicolor', TP, FN,
    'True virginica', FP, TN
)
```
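For comparison, base R's `table()` cross-tabulates the same four numbers in one call (note that it puts the predicted classes in rows here, transposing the layout above):

```r
# entries match the TP/FP/FN/TN counts computed above
table(Predicted = predicted_class, True = real_class)
```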
and compute common evaluation indicators:
```r
tibble::tribble(
    ~indicator, ~value,
    'sensitivity', TP/(TP + FN),
    'specificity', TN/(TN + FP),
    'accuracy', (TP + TN)/(TP + TN + FP + FN)
)
```
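A quick sanity check on the accuracy value: it should equal the fraction of rows where the prediction matches the truth.

```r
# accuracy is just the mean of the correctness indicator
mean(predicted_class == real_class)
```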
6 Evaluate the model - advanced
Besides the definite class, the model can also tell us the probability that a given sample belongs to each class:
```r
probability <- predict(rf_classifier, testing[, -ncol(testing)], type = "prob")
probability %>% tibble::as_tibble()
```
For example, 0.998 in the second row is the probability that the second sample in the testing set belongs to `versicolor`.
Actually, the predicted class we saw above is determined by judging whether the probability is greater than a given cutoff (the default is 0.5):
```r
all((probability[, 1, drop = TRUE] > 0.5) == (predicted_class == 'versicolor'))
## [1] TRUE
```
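To see the cutoff at work, here is a hypothetical stricter rule that only predicts `versicolor` when the model is at least 90 percent sure (0.9 is an arbitrary value chosen for illustration):

```r
# raise the cutoff from 0.5 to 0.9; borderline versicolor calls flip to virginica
strict_class <- ifelse(probability[, 1] > 0.9, 'versicolor', 'virginica')
table(strict_class, real_class)
```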
Using different cutoffs, we can draw a ROC curve 10.
```r
label <- testing[[ncol(testing)]] %>% {. == 'versicolor'} %>% as.integer

prediction <- ROCR::prediction(probability[, 1], label)
prediction %>% ROCR::performance("tpr", "fpr") %>% ROCR::plot(main = "ROC Curve")
```
and calculate the AUC (area under the curve):
```r
ROCR::performance(prediction, 'auc')@y.values[[1]]
```
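If you want to double-check that number, the AUC can also be computed by hand: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). A manual sketch:

```r
# manual AUC: compare every positive score against every negative score
pos <- probability[label == 1, 1]
neg <- probability[label == 0, 1]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
```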
7 Tips and more
If you want to see the importance of each feature:
```r
randomForest::randomForest(Species ~ ., training, importance = TRUE) %>%
    randomForest::varImpPlot()
```
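If you want the numbers behind the plot rather than the picture, `randomForest::importance()` returns them as a matrix:

```r
# store the model, then extract the numeric importance scores
rf_imp <- randomForest::randomForest(Species ~ ., training, importance = TRUE)
randomForest::importance(rf_imp)
```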
1. - magrittr: enables one style of writing R code. For why I use that style, please refer to https://r4ds.had.co.nz/pipes.html#piping-alternatives
   - dplyr: manipulate data frames
   - randomForest: build random forest models
   - ROCR: ROC analysis
   - GGally: plot correlations between features
2. ```r
   df <- iris %>% tibble::as_tibble() %>%
       dplyr::filter(Species != 'setosa') %>%
       dplyr::mutate(Species = factor(Species))
   df
   ```
   - The first line turns `iris` into a tibble, a modern reimagining of the `data.frame`.
   - The second line selects rows whose species is not `setosa`, so only `versicolor` and `virginica` are left.
   - The third line drops unused levels of the `Species` factor.
3. This is a little technical. The original `Species` contains three levels, `setosa`, `versicolor` and `virginica`, each with 50 values.
   ```r
   iris %>% dplyr::count(Species)
   iris %>% .$Species %>% levels
   ## [1] "setosa"     "versicolor" "virginica"
   ```
   Although we remove all `setosa` values,
   ```r
   df0 <- iris %>% tibble::as_tibble() %>% dplyr::filter(Species != 'setosa')
   df0 %>% dplyr::count(Species)
   ```
   the `setosa` level still exists, and this level now contains no values.
   ```r
   df0 %>% .$Species %>% levels
   ## [1] "setosa"     "versicolor" "virginica"
   ```
   That would cause `randomForest::randomForest()` to fail.
   ```r
   randomForest::randomForest(Species ~ ., df0)
   ## Error in randomForest.default(m, y, ...): Can't have empty classes in y.
   ```
   After we call `factor()`, `Species` only contains two levels, and both have values.
   ```r
   df0 %>% .$Species %>% factor %>% levels
   ## [1] "versicolor" "virginica"
   ```
4. ```r
   GGally::ggpairs(df, columns = 1:4, ggplot2::aes(color = Species))
   ```
5. ```r
   nrow_training <- floor(nrow(df) * 0.8)        # calculate the size of the training set
   indexes <- sample(1:nrow(df), nrow_training)  # these rows will be selected for training
   training <- df[indexes, ]
   testing <- df[-indexes, ]
   ```
   The code may seem a little complicated, and it requires you to be familiar with the R language. Anyway, I will try to use a simple example to explain the core idea:
   - Imagine your data contains only 5 rows; then 80 percent is 5 * 0.8 = 4 (i.e. `nrow_training` is `4`).
   - Imagine you decide to choose the 1st, 2nd, 3rd and 5th rows for training (i.e. `indexes` is `c(1, 2, 3, 5)`).
   - Now `training` contains the 1st, 2nd, 3rd and 5th rows of `df` (`[indexes, ]` means to choose these rows).
   - And `testing` contains the 4th row of `df` (`[-indexes, ]` means not to choose these rows, so only the 4th row is left).
6. ```r
   training
   testing
   ```
7. ```r
   rf_classifier
   ##
   ## Call:
   ##  randomForest(formula = Species ~ ., data = training)
   ##                Type of random forest: classification
   ##                      Number of trees: 500
   ## No. of variables tried at each split: 2
   ##
   ##         OOB estimate of error rate: 11.25%
   ## Confusion matrix:
   ##            versicolor virginica class.error
   ## versicolor         35         4   0.1025641
   ## virginica           5        36   0.1219512
   ```
8. `-ncol(testing)` means to drop the last column, so `testing[, -ncol(testing)]` only contains features.
9. `testing[[ncol(testing)]]` gets the last column, i.e., the real value of `Species` in the testing set (a toy example of this indexing follows footnote 10).
10. ```r
    label <- testing[[ncol(testing)]] %>% {. == 'versicolor'} %>% as.integer
    prediction <- ROCR::prediction(probability[, 1], label)
    prediction %>% ROCR::performance("tpr", "fpr") %>% ROCR::plot(main = "ROC Curve")
    ```
    `ROCR::prediction()` needs two arguments:
    - the probability of a set of samples belonging to a given class (`versicolor` here)
    - the real classes of those samples; `1` means belonging to the given class, `0` otherwise
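To make the indexing in footnotes 8 and 9 concrete, here is a toy data frame (the names are made up for illustration):

```r
d <- data.frame(x = 1:3, y = 4:6, z = 7:9)
d[, -ncol(d)]  # drop the last column: keeps x and y (the "features")
d[[ncol(d)]]   # extract the last column as a vector: 7 8 9 (the "response")
```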