Caret vs. tidymodels

Author: fastitem123
Publish Date: 2021-01-09 07:43:01


Max Kuhn built both packages (with contributions from many other talented people). The caret package (short for Classification And REgression Training) streamlines the process of creating predictive models and has long been the top choice among R users. It has been around for a long time, and there are numerous resources, answers, and solutions to almost any question. tidymodels, on the other hand, is newer and is built on tidyverse principles. RStudio hired Max with the intention of designing a tidy version of caret.
I have been using caret for predictive modelling. While I was aware of tidymodels, I only began exploring it last week. As with everything in life, adopting a new ecosystem takes time and patience, so this post is by no means an exhaustive analysis. The complete code is available on GitHub, and an HTML version of the Markdown is published.


Overview
caret is a single package containing various functions for machine learning, for example, createDataPartition for splitting data and trainControl for setting up cross-validation.
tidymodels is a collection of packages for modelling. When I execute the library(tidymodels) command, the following packages are loaded:
rsample: for data splitting and resampling
parsnip: for trying out a range of models
recipes: for pre-processing
workflows: for putting everything together
yardstick: for evaluating models
broom: for converting the information in common statistical R objects into user-friendly, predictable formats
dials: for creating and managing tuning parameters
Some common tidyverse libraries, such as dplyr, are also loaded. As shown, tidymodels breaks the machine learning workflow down into multiple stages and provides a specialised package for each stage. This benefits users through increased flexibility, but for a beginner it can be intimidating (at least it was for me). The sketch below shows how the stages fit together.
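As a rough, minimal sketch (my own illustration, not from the original post), assume a hypothetical data frame df with a numeric outcome y:
library(tidymodels)

# recipes: declare the pre-processing steps
rec <- recipe(y ~ ., data = df) %>%
  step_normalize(all_numeric(), -all_outcomes())

# parsnip: specify the model type and computational engine
mod <- linear_reg() %>%
  set_engine("lm")

# workflows: bundle pre-processing and model into a single object
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

# fitting the workflow runs the recipe and trains the model together
wf_fit <- fit(wf, data = df)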
Import data
The data come from the Bike Sharing Dataset in the UCI Machine Learning Repository. The goal is to predict the total count of rental bikes based on environmental and seasonal settings.
library(tidymodels) 
library(caret)
library(lubridate) 
library(tidyverse) 
library(moments) 
library(corrr) 
library(randomForest)
bike <- read_csv("Bike-Sharing-Dataset/hour.csv")
bike %>% dim()
## [1] 17379    17
There are 17,379 cases and 17 features. I removed instant, changed the formatting for year, and renamed some variables.
bike %>%
  mutate(instant = NULL, yr = yr + 2011) %>%
  rename(
    date = dteday,
    year = yr,
    month = mnth,
    hour = hr,
    weather = weathersit,
    humidity = hum,
    total = cnt
  ) ->
bike
head(bike)
[Figure: data frame header preview]
Explore data
Target variable
bike %>%
  pivot_longer(
    cols = c(casual, registered, total),
    names_to = "usertype",
    values_to = "count"
  ) %>%
  ggplot(aes(count, colour = usertype)) +
  geom_density() +
  labs(
    title = "Distribution of the number of rental bikes",
    x = "Number per hour", y = "Density"
  ) +
  scale_colour_discrete(
    name = "User type",
    breaks = c("casual", "registered", "total"),
    labels = c("Non-registered", "Registered", "Total")
  )
[Figure: target variable distribution]
The distributions of rental counts are positively skewed. A roughly normal target is desirable, as many modelling techniques assume, or at least perform better with, a normally distributed dependent variable. I address the skewness later.
Correlation
I used the correlate() function from the corrr package, which is part of tidymodels but not automatically loaded by the library(tidymodels) command. I am constantly surprised by how many packages there are in the tidy ecosystem. I typically prioritise tidy packages over standalone ones because they integrate with pipes and share a consistent aesthetic, and corrr is no exception.
bike %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  rearrange(absolute = FALSE) %>%
  shave() ->
  bike_cor
rplot(bike_cor, print_cor = TRUE)
[Figure: correlation plot]
Prepare data
Since I have not split the data yet, this step does not include scaling or centring, which should be fitted on the training set and then applied to the testing set. Here, I focus on transformations that apply to all the data and have no learned parameters, such as factorising or simple mathematical calculations. For example, if I take the square root of a number, I can square the result to recover the original; nothing is learned from the data. For normalisation, however, I need to know the minimum and maximum values of a variable, both of which might differ between training and testing sets. A small illustration follows.
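For instance (my own illustration):
# A parameter-free transform is invertible without learning anything
# from the data, so it can safely be applied before splitting
x <- c(4, 9, 16)
sqrt(x)^2  # recovers 4, 9, 16 exactly

# Min-max normalisation depends on values estimated from the data, so
# the minimum and maximum must come from the training set only
normalise <- function(v, lo, hi) (v - lo) / (hi - lo)
train_min <- min(x)
train_max <- max(x)
normalise(x, train_min, train_max)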
Target variable
I focused on the total count, so the casual and registered variables are removed. As noted earlier, the target variable is positively skewed and requires transformation. I tried several common techniques for positively skewed data and applied the one with the lowest skewness: the cube root.
bike_all <- bike %>%
  select(-casual, -registered)

# Original
skewness(bike_all$total)
## [1] 1.277301

# Log
skewness(log10(bike_all$total))
## [1] -0.936101

# Log + constant
skewness(log1p(bike_all$total))
## [1] -0.8181098

# Square root
skewness(sqrt(bike_all$total))
## [1] 0.2864499

# Cube root
skewness(bike_all$total^(1 / 3))
## [1] -0.0831688

# Transform with cube root
bike_all$total <- bike_all$total^(1 / 3)
Predictors
Categorical variables are converted to factors according to the attribute information provided by UCI.
bike_all$season <- factor(
  bike_all$season,
  levels = c(1, 2, 3, 4),
  labels = c("spring", "summer", "autumn", "winter")
)
bike_all$holiday <- factor(
  bike_all$holiday,
  levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$workingday <- factor(
  bike_all$workingday,
  levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$weather <- factor(
  bike_all$weather,
  levels = c(1, 2, 3, 4),
  labels = c("clear", "cloudy", "rainy", "heavy rain"),
  ordered = TRUE
)
head(bike_all)
[Figure: data frame header preview]
Split data (Train/Test, Cross-Validation)
Both packages provide functions for common data-splitting strategies, such as k-fold, grouped k-fold, leave-one-out, and bootstrapping. But tidymodels appears to be more comprehensive, as it also includes Monte Carlo cross-validation (I don’t know what this is, but it sounds cool) and nested cross-validation; a quick sketch of both follows. I paid particular attention to the method because a research paper found that “nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size” (Vabalas et al., 2019).
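For reference, here is a minimal sketch of those two schemes with rsample (my own example; the resample counts and fold numbers are arbitrary):
# Monte Carlo CV: repeatedly draw random 80/20 train/assessment splits
mc_folds <- mc_cv(bike_all, prop = 0.8, times = 25)

# Nested CV: an inner resampling loop inside each outer fold
nested_folds <- nested_cv(
  bike_all,
  outside = vfold_cv(v = 5),
  inside = vfold_cv(v = 3)
)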
tidymodels
Data splitting in tidymodels is handled by the rsample package. The training/testing split is done as shown below, along with 10-fold cross-validation.
set.seed(25)
split <- initial_split(bike_all, prop = 0.8)
train_data <- training(split)
train_data %>% dim()
## [1] 13904    14

test_data <- testing(split)
test_data %>% dim()
## [1] 3475   14

train_cv <- vfold_cv(train_data, v = 10)
caret
There are two options available:
1. Use caret's native functions, such as createDataPartition.
set.seed(25)
train_index <- createDataPartition(
  bike_all$total, p = 0.8, times = 1, list = FALSE
)
train_data <- bike_all[ train_index, ]
test_data  <- bike_all[-train_index, ]

fold_index <- createFolds(
  train_data$total,
  k = 10, returnTrain = TRUE, list = TRUE
)
train_cv <- trainControl(method = "cv", index = fold_index)
2. Use tidymodels’ rsample2caret() function, which returns a list that mimics the index and indexOut elements of a trainControl object.
train_cv_caret <- rsample2caret(train_cv)
ctrl_caret <- trainControl(
  method = "cv",
  index = train_cv_caret$index,
  indexOut = train_cv_caret$indexOut
)
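The resulting control object can then be passed to caret::train() so that caret evaluates models on exactly the folds defined by rsample. A minimal sketch (my own example; the "lm" method is an arbitrary choice, and I drop the date column for simplicity):
fit_caret <- train(
  total ~ . - date,
  data = train_data,
  method = "lm",
  trControl = ctrl_caret
)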


