
machine learning in R: using random forest as an example

We will introduce the basic ideas of machine learning in R, using the random forest model as an example.

1 Preparation

We need to install the following packages 1:

# install needed R packages
remotes::update_packages(c('magrittr', 'dplyr', 'randomForest', 'ROCR', 'GGally'), upgrade = TRUE)

To avoid conflicts of function names, in the following code I will try my best to use pkg::fun() instead of library(pkg). But magrittr is special; we have to library() it.

library(magrittr)
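As a small illustration of the kind of name clash pkg::fun() avoids: stats and dplyr both export a function called filter(), and they do completely different things.

# both calls are valid; pkg::fun() makes it explicit which filter() we mean
dplyr::filter(iris, Species == 'setosa')   # dplyr's filter(): keep matching rows of a data frame
stats::filter(1:10, rep(1/3, 3))           # stats' filter(): apply a linear (moving-average) filter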

Before we start, let's set the random seed to make our results reproducible:

set.seed(0) 

2 Generate data set

We use one of R's built-in data sets, iris, Edgar Anderson's Iris Data set.

The original data set contains observations of four features (sepal length and width, and petal length and width, all in cm) for 150 flowers of three species (50 of each).

To make things simple, here we only choose two species, versicolor and virginica. 2

df <- iris %>% tibble::as_tibble() %>% 
    dplyr::filter(Species != 'setosa') %>%
    dplyr::mutate(Species = factor(Species))
df

Remember to drop unused levels using factor(), otherwise randomForest::randomForest() would complain. 3

Optional: before we build the model, we may explore the correlation between features 4.

3 Split data set

Before we build the model, we need to split the data set into a training set and a testing set, so we can train our model on the training set and evaluate it on the testing set.

Here we randomly assign 80 percent of the samples to the training set, and the remaining 20 percent to the testing set 5.

nrow_training <- floor(nrow(df) * 0.8)  # calculate the size of the training set
indexes <- sample(1:nrow(df), nrow_training)  # these rows will be selected for training

training <- df[indexes, ] 
testing <- df[-indexes, ]

You may want to have a look at the result 6.

4 Build the model

Then we can build a random forest model 7.

rf_classifier <- randomForest::randomForest(Species ~ ., training)

You can feed randomForest::randomForest() two arguments:

  • a formula: here Species is the response variable, and . means that all the other variables are used as features (see the sketch below)
  • a data frame to train the model
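In case the dot shorthand feels opaque, here is what that call expands to with this particular data frame (a minimal sketch; rf_classifier_explicit is just an illustrative name):

# equivalent call with every feature written out explicitly
rf_classifier_explicit <- randomForest::randomForest(
    Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
    data = training
)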

5 Evaluate the model - basics

After we build the model, its fundamental use is to make predictions. That's exactly where the testing set comes in handy:

predicted_class <- predict(rf_classifier, testing[, -ncol(testing)])

predict() needs two arguments: the model and a data frame of features 8.

We can compare the predicted class with real class 9:

real_class <- testing[[ncol(testing)]]

tibble::tibble(predicted_class, real_class) %>% 
    dplyr::mutate(correct = predicted_class == real_class)

To quantify the performance of our model, we can construct a confusion matrix

# we define 'versicolor' as positive class
TP <- sum((predicted_class == 'versicolor') & (real_class == 'versicolor')) #true positive
FP <- sum((predicted_class == 'versicolor') & (real_class != 'versicolor')) #false positive
FN <- sum((predicted_class != 'versicolor') & (real_class == 'versicolor')) #false negative 
TN <- sum((predicted_class != 'versicolor') & (real_class != 'versicolor')) #true negative 

tibble::tribble(
    ~'', ~'Predicted versicolor', ~'Predicted virginica',
    'True versicolor', TP, FN,
    'True virginica',  FP, TN
)

and compute common evaluation indicators

tibble::tribble(
    ~indicator, ~value,
    'sensitivity', TP/(TP + FN),
    'specificity', TN/(TN + FP),
    'accuracy', (TP + TN)/(TP + TN + FP + FN)
)
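By the way, if you just want a quick sanity check on the confusion-matrix counts, base R's table() gives the same cross-tabulation in one line:

table(real_class, predicted_class)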

6 Evaluate the model - advanced

Besides the predicted class itself, the model can also tell us the probability that a given sample belongs to each class:

probability <- predict(rf_classifier, testing[, -ncol(testing)], type = "prob")
probability %>% tibble::as_tibble()

For example, 0.998 in the second row is the probability that the second sample in the testing set belongs to versicolor.
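If you want to confirm how the matrix is organized (nothing model-specific here, just a quick check): the column names are the class labels, and each row sums to 1.

colnames(probability)  # one column per class
rowSums(probability)   # each sample's class probabilities sum to 1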

Actually, the predicted class we saw above is determined by whether the probability is greater than a given cutoff (the default is 0.5):

all((probability[ , 1, drop = T] > 0.5) == (predicted_class == 'versicolor'))
## [1] TRUE
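As a sketch of how the cutoff works, here is what an arbitrary, stricter cutoff of 0.7 would look like (0.7 is purely illustrative):

# call a sample 'versicolor' only when its probability exceeds 0.7
predicted_strict <- ifelse(probability[, 'versicolor'] > 0.7, 'versicolor', 'virginica')
table(predicted_class, predicted_strict)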

Using different cutoffs, we can draw a ROC curve 10.

label <- testing[[ncol(testing)]] %>% {. ==  'versicolor'} %>% as.integer

prediction <- ROCR::prediction(probability[, 1], label)
prediction %>% ROCR::performance("tpr", "fpr") %>% ROCR::plot(main = "ROC Curve") 

and calculate the AUC (area under the curve)

ROCR::performance(prediction, 'auc')@y.values[[1]]

7 Tips and more

If you want to see the importance of each feature:

randomForest::randomForest(Species ~ ., training, importance = TRUE) %>% randomForest::varImpPlot()
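If you would rather have the numbers than the plot, randomForest::importance() returns the underlying importance matrix (rf_importance below is just an illustrative name):

rf_importance <- randomForest::randomForest(Species ~ ., training, importance = TRUE)
randomForest::importance(rf_importance)  # mean decrease in accuracy / Gini for each feature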


  1. The packages we will use:
    • magrittr enables one style of writing R code. For why I use that style, please refer to https://r4ds.had.co.nz/pipes.html#piping-alternatives
    • dplyr: manipulate data frames
    • randomForest: build the random forest model
    • ROCR: ROC analysis
    • GGally: plot correlations between features
    ↩︎
  2. df <- iris %>% tibble::as_tibble() %>% 
        dplyr::filter(Species != 'setosa') %>%
        dplyr::mutate(Species = factor(Species))
    df
    • The first line turns iris into a tibble, a modern reimagining of the data.frame.
    • The second line selects rows whose species is not setosa, so only versicolor and virginica are left.
    • The third line drops unused levels of the Species factor.
    ↩︎
  3. This is a little technical:

    The original Species contains three levels, setosa, versicolor and virginica, each with 50 values.

    iris %>% dplyr::count(Species)
    iris %>% .$Species %>% levels
    ## [1] "setosa"     "versicolor" "virginica"

    Although we remove all setosa values,

    df0 <- iris %>% tibble::as_tibble() %>% dplyr::filter(Species != 'setosa')
    df0 %>% dplyr::count(Species)

    the setosa level still exists, and now this level contains no values.

    df0 %>% .$Species %>% levels
    ## [1] "setosa"     "versicolor" "virginica"

    That would cause randomForest::randomForest() to fail.

    randomForest::randomForest(Species ~ ., df0)
    ## Error in randomForest.default(m, y, ...): Can't have empty classes in y.

    After we call factor(), Species only contains two levels, both of which have values.

    df0 %>% .$Species %>% factor %>% levels
    ## [1] "versicolor" "virginica"
    ↩︎
  4. GGally::ggpairs(df, columns = 1:4, ggplot2::aes(color = Species))
    ↩︎
  5. nrow_training <- floor(nrow(df) * 0.8)  # calculate the size of the training set
    indexes <- sample(1:nrow(df), nrow_training)  # these rows will be selected for training
    
    training <- df[indexes, ] 
    testing <- df[-indexes, ]

    The code may seem a little complicated, and it requires some familiarity with the R language.

    Anyway, I will try to use a simple example to explain the core idea (there is a runnable sketch after this list):

    • Imagine your data contains only 5 rows; then 80 percent is 5 * 0.8 = 4 (i.e. nrow_training is 4).
    • Imagine you decide to choose the 1st, 2nd, 3rd and 5th rows for training (i.e. indexes is c(1, 2, 3, 5)).
    • Now training contains the 1st, 2nd, 3rd and 5th rows of df ([indexes, ] means to choose these rows)
    • And testing contains the 4th row of df ([-indexes, ] means not to choose these rows, so only the 4th row is left)
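    A minimal runnable version of that toy example (the data frame and index values are made up purely for illustration):

    toy <- data.frame(value = c(10, 20, 30, 40, 50))  # 5 rows
    idx <- c(1, 2, 3, 5)                              # rows chosen for training
    toy[idx, , drop = FALSE]                          # training rows: 1, 2, 3 and 5
    toy[-idx, , drop = FALSE]                         # testing row: 4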
    ↩︎
  6. training
    testing
    ↩︎
  7. rf_classifier
    ## 
    ## Call:
    ##  randomForest(formula = Species ~ ., data = training) 
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 2
    ## 
    ##         OOB estimate of  error rate: 11.25%
    ## Confusion matrix:
    ##            versicolor virginica class.error
    ## versicolor         35         4   0.1025641
    ## virginica           5        36   0.1219512
    ↩︎
  8. -ncol(testing) means to drop the last column, so testing[, -ncol(testing)] only contains features↩︎

  9. testing[[ncol(testing)]] gets the last column, i.e., the real values of Species in the testing set↩︎

  10. label <- testing[[ncol(testing)]] %>% {. ==  'versicolor'} %>% as.integer
    
    prediction <- ROCR::prediction(probability[, 1], label)
    prediction %>% ROCR::performance("tpr", "fpr") %>% ROCR::plot(main = "ROC Curve") 

    ROCR::prediction() needs two arguments:

    • the probability of a set of samples belonging to a given class (versicolor here)
    • the real classes of those samples; 1 means belonging to the given class, 0 otherwise
    ↩︎