# Machine learning in R: using random forest as an example

We will introduce the basic ideas of machine learning in R, using the random forest model as an example.

# 1 Preparation

We need to install the following packages 1:

``````# install needed R packages
remotes::update_packages(c('magrittr', 'dplyr', 'randomForest', 'ROCR', 'GGally'), upgrade = TRUE)``````

To avoid function-name conflicts, in the following code I will try my best to use `pkg::fun()` instead of `library(pkg)`. But magrittr is special: we have to `library()` it.

``library(magrittr)``

Before we start, let's set the random seed to make our results reproducible:

``set.seed(0) ``

# 2 Generate the data set

We use one of R’s built-in data sets, `iris`, Edgar Anderson’s Iris Data set.

The original data set contains observations for four features (sepal length and width, and petal length and width — all in cm) of 150 flowers of three species (each 50).

To make things simple, here we only choose two species, `versicolor` and `virginica`. 2

``````df <- iris %>% tibble::as_tibble() %>%
  dplyr::filter(Species != 'setosa') %>%
  dplyr::mutate(Species = factor(Species))
df``````

Remember to drop unused levels using `factor()`, otherwise `randomForest::randomForest()` would complain. 3

Optional: before we build the model, we may explore the correlation between features 4.

# 3 Split data set

Before we build the model, we need to split the data set into a training set and a testing set, so that we can train the model on the training set and evaluate it on the testing set.

Here we randomly assign 80 percent of the samples to the training set, and the remaining 20 percent to the testing set 5.

``````nrow_training <- floor(nrow(df) * 0.8)  # Calculate the size of the training set
indexes <- sample(1:nrow(df), nrow_training)  # these rows will be selected for training

training <- df[indexes, ]
testing <- df[-indexes, ]``````

You may want to have a look at the result 6.

# 4 Build the model

Then we can build a random forest model 7.

``rf_classifier <- randomForest::randomForest(Species ~ ., training)``

You can feed `randomForest::randomForest()` two arguments:

• a `formula`: here `Species` is the response variable, and `.` means that all other variables are features
• a data frame to train the model
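Beyond these two, `randomForest::randomForest()` also accepts tuning parameters; `ntree` (number of trees, default 500) and `mtry` (number of features tried at each split) are the most common. A minimal sketch, assuming the `training` data frame from section 3 (the values here are illustrative, not recommendations):

```r
# Sketch: tune the forest's hyperparameters
rf_tuned <- randomForest::randomForest(
  Species ~ ., training,
  ntree = 1000,  # grow 1000 trees instead of the default 500
  mtry = 2       # try 2 randomly chosen features at each split
)
```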

# 5 Evaluate the model - basics

After we build the model, its fundamental use is to make predictions. That’s exactly where the testing set comes in handy:

``predicted_class <- predict(rf_classifier, testing[, -ncol(testing)])``

`predict()` needs two arguments: the model and a data frame of features 8.

We can compare the predicted class with real class 9:

``````real_class <- testing[[ncol(testing)]]

tibble::tibble(predicted_class, real_class) %>%
dplyr::mutate(correct = predicted_class == real_class)``````
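A single-number summary falls out of the same comparison: the mean of a logical vector is the fraction of `TRUE` values. Assuming `predicted_class` and `real_class` from above:

```r
# Overall accuracy: fraction of testing samples classified correctly
mean(predicted_class == real_class)
```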

To quantify the performance of our model, we can construct a confusion matrix:

``````# we define 'versicolor' as positive class
TP <- sum((predicted_class == 'versicolor') & (real_class == 'versicolor'))  # true positive
FP <- sum((predicted_class == 'versicolor') & (real_class != 'versicolor'))  # false positive
FN <- sum((predicted_class != 'versicolor') & (real_class == 'versicolor'))  # false negative
TN <- sum((predicted_class != 'versicolor') & (real_class != 'versicolor'))  # true negative

tibble::tribble(
  ~'', ~'Predicted versicolor', ~'Predicted virginica',
  'True versicolor', TP, FN,
  'True virginica',  FP, TN
)``````

and compute common evaluation indicators:

``````tibble::tribble(
  ~indicator, ~value,
  'sensitivity', TP/(TP + FN),
  'specificity', TN/(TN + FP),
  'accuracy', (TP + TN)/(TP + TN + FP + FN)
)``````

# 6 Evaluate the model - advanced

Besides the definite class, the model can also tell us the probability that a given sample belongs to each class:

``````probability <- predict(rf_classifier, testing[, -ncol(testing)], type = "prob")
probability %>% tibble::as_tibble()``````

For example, 0.998 in the second row means that the second sample in the testing set belongs to `versicolor` with probability 0.998.

Actually, the predicted class we see above is determined by judging whether the probability is greater than a given cutoff (the default is 0.5):

``````all((probability[ , 1, drop = T] > 0.5) == (predicted_class == 'versicolor'))
##  TRUE``````
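To see the cutoff in action, we can reclassify with a stricter threshold (0.8 here is purely illustrative). `probability`'s columns are named after the classes, and `real_class` is assumed from section 5:

```r
# Require 80% confidence before calling a sample 'versicolor'
strict_class <- ifelse(probability[, 'versicolor'] > 0.8, 'versicolor', 'virginica')
table(strict_class, real_class)  # compare against the true classes
```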

Using different cutoffs, we can draw a ROC curve 10.

``````label <- testing[[ncol(testing)]] %>% {. ==  'versicolor'} %>% as.integer

prediction <- ROCR::prediction(probability[, 1], label)
prediction %>% ROCR::performance("tpr", "fpr") %>% ROCR::plot(main = "ROC Curve")``````

and calculate the AUC (area under the curve):

``````ROCR::performance(prediction, 'auc')@y.values[[1]]``````

# 7 Tips and more

If you want to see the importance of each feature:

``randomForest::randomForest(Species ~ ., training, importance = TRUE) %>% randomForest::varImpPlot()``

1. • magrittr enables one style of writing R code. For why I use that style, please refer to https://r4ds.had.co.nz/pipes.html#piping-alternatives
• dplyr: manipulate data frame
• randomForest: build random forest model
• ROCR: ROC analysis
• GGally: plot correlation between features
↩︎
2. ``````df <- iris %>% tibble::as_tibble() %>%
  dplyr::filter(Species != 'setosa') %>%
  dplyr::mutate(Species = factor(Species))
df``````
• The first line turns `iris` into a tibble, a modern reimagining of the `data.frame`.
• The second line selects rows whose species is not `setosa`, so only `versicolor` and `virginica` are left.
• The third line drops the unused level of the `Species` factor.
↩︎
3. This is a little technical:

The original `Species` contains three levels, `setosa`, `versicolor` and `virginica`, each with 50 values.

``iris %>% dplyr::count(Species)``
``````iris %>% .$Species %>% levels
##  "setosa"     "versicolor" "virginica"``````

Although we remove all `setosa` values,

``````df0 <- iris %>% tibble::as_tibble() %>% dplyr::filter(Species != 'setosa')
df0 %>% dplyr::count(Species)``````

the `setosa` level still exists, and now this level contains no values.

``````df0 %>% .$Species %>% levels
##  "setosa"     "versicolor" "virginica"``````

That would cause `randomForest::randomForest()` to fail.

``````randomForest::randomForest(Species ~ ., df0)
## Error in randomForest.default(m, y, ...): Can't have empty classes in y.``````

After we call `factor()`, `Species` only contains two levels, and both have values.

``````df0 %>% .$Species %>% factor %>% levels
##  "versicolor" "virginica"``````
↩︎
4. ``````GGally::ggpairs(df, columns = 1:4, ggplot2::aes(color = Species))``````
↩︎
5. ``````nrow_training <- floor(nrow(df) * 0.8)  # Calculate the size of the training set
indexes <- sample(1:nrow(df), nrow_training)  # these rows will be selected for training

training <- df[indexes, ]
testing <- df[-indexes, ]``````

The code may seem a little complicated, and it requires you to be familiar with the R language.

Anyway, I will try to use a simple example to explain the core idea:

• Imagine your data contains only 5 rows; then 80 percent is 5 * 0.8 = 4 (i.e. `nrow_training` is `4`).
• Imagine you decide to choose the 1st, 2nd, 3rd and 5th rows for training (i.e. `indexes` is `c(1, 2, 3, 5)`).
• Now `training` contains the 1st, 2nd, 3rd and 5th rows of `df` (`[indexes, ]` means to choose these rows)
• And `testing` contains the 4th row of `df` (`[-indexes, ]` means not to choose these rows, so only the 4th row is left)
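The toy example above can be written out in plain base R (the numbers match the bullets; in real use `indexes` would come from `sample()`):

```r
toy <- data.frame(x = 1:5)                     # a data frame with 5 rows
nrow_training <- floor(nrow(toy) * 0.8)        # 5 * 0.8 = 4
indexes <- c(1, 2, 3, 5)                       # pretend sample() chose these 4 rows
training_toy <- toy[indexes, , drop = FALSE]   # rows 1, 2, 3 and 5
testing_toy  <- toy[-indexes, , drop = FALSE]  # row 4 is all that remains
```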
↩︎
6. ``training``
``testing``
↩︎
7. ``````rf_classifier
##
## Call:
##  randomForest(formula = Species ~ ., data = training)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 11.25%
## Confusion matrix:
##            versicolor virginica class.error
## versicolor         35         4   0.1025641
## virginica           5        36   0.1219512``````
↩︎
8. `-ncol(testing)` means to drop the last column, so `testing[, -ncol(testing)]` only contains features↩︎

9. `testing[[ncol(testing)]]` gets the last column, i.e., the real value of `Species` in the testing set↩︎

10. ``````label <- testing[[ncol(testing)]] %>% {. ==  'versicolor'} %>% as.integer

prediction <- ROCR::prediction(probability[, 1], label)
prediction %>% ROCR::performance("tpr", "fpr") %>% ROCR::plot(main = "ROC Curve") ``````

`ROCR::prediction()` needs two arguments:

• the probability of a set of samples belonging to a given class (`versicolor` here)
• the real classes of those samples, `1` means belonging the given class, `0` otherwise
↩︎