Day 11 of #50daysofkaggle

Roadmap to tidymodels

Implementing DT using R

February 20, 2023

So far I have practiced creating classification predictions on the Titanic dataset with the KNN, DT and SVM algorithms in Python. According to Kaggle, my submission scored 77%. Now I’m going to try these approaches in R.

Steps: reading the data > cleaning > replacing NA > splitting > modelling with Decision Trees > comparing results

Data reading and cleaning

Loading the necessary libraries and reading train.csv from a zipped file, then taking a glimpse of the resulting df.


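library(tidyverse)
library(tidymodels)
#reading the kaggle zip file that I downloaded into an older post's folder (unz() needs the path of the zip archive itself)
ziplocation <- "D:/Ramakant/Personal/Weekends in Mumbai/Blog/quarto_blog/posts/2022-10-12-day-6-of-50daysofkaggle/"
df <- read_csv(unz(ziplocation, "train.csv"))
glimpse(df)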
Rows: 891
Columns: 12
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Reformatting the df to create a new tibble df_n

df_n <- df %>% 
  #selecting only the numerical variables
  select_if(is.numeric) %>% 
  #converting outcome variable into factor for classification 
  mutate(Survived = as.factor(Survived)) %>% 
  #adding back the Sex & Embarked predictors
  bind_cols(Sex = df$Sex, Embarked = df$Embarked) 

head(df_n)

# A tibble: 6 × 9
  PassengerId Survived Pclass   Age SibSp Parch  Fare Sex    Embarked
        <dbl> <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  <chr>   
1           1 0             3    22     1     0  7.25 male   S       
2           2 1             1    38     1     0 71.3  female C       
3           3 1             3    26     0     0  7.92 female S       
4           4 1             1    35     1     0 53.1  female S       
5           5 0             3    35     0     0  8.05 male   S       
6           6 0             3    NA     0     0  8.46 male   Q       

Finding the null values in the new df

df_n %>%  
  summarise_all(~ sum(is.na(.)))
# A tibble: 1 × 9
  PassengerId Survived Pclass   Age SibSp Parch  Fare   Sex Embarked
        <int>    <int>  <int> <int> <int> <int> <int> <int>    <int>
1           0        0      0   177     0     0     0     0        2

We see that there are 177 missing values in the Age column (and 2 in Embarked). The missing ages will be handled in the recipe section, together with PassengerId, which should not be used as a predictor.
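The same NA count can also be written with across(), since the scoped verbs such as summarise_all() are superseded in recent dplyr versions. This is a minimal sketch on a made-up toy tibble (the name toy and its values are assumptions for illustration, not the Titanic data):

```r
library(dplyr)

# toy data standing in for df_n; values are invented for illustration only
toy <- tibble(
  Age      = c(22, NA, 26, NA),
  Fare     = c(7.25, 71.3, 7.92, 8.46),
  Embarked = c("S", "C", NA, "Q")
)

# count the missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

On df_n the equivalent call would be df_n %>% summarise(across(everything(), ~ sum(is.na(.x)))).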

Model Building

Splitting the data

Splitting the data into train & test

df_split <- initial_split(df_n, prop = 0.8)
train <- training(df_split)
test <- testing(df_split)
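
initial_split() draws rows at random, so the split (and every score computed from it) changes between runs unless a seed is set. The sketch below shows one way to make the split reproducible and stratified by the outcome. The seed value and the toy data frame are arbitrary assumptions for illustration, not what was used for the results in this post:

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, chosen only so the example is repeatable

# toy data standing in for df_n: 60 non-survivors, 40 survivors
toy <- data.frame(
  Survived = factor(rep(c(0, 1), times = c(60, 40))),
  Fare     = runif(100, min = 5, max = 100)
)

# strata = Survived keeps the class balance similar in both parts
toy_split <- initial_split(toy, prop = 0.8, strata = Survived)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

table(toy_train$Survived)
table(toy_test$Survived)
```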


Creating the recipe

dt_recipe <- recipe(Survived ~ ., data = df_n) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  #replacing NA values in Age with median Age
  step_mutate_at(Age, fn = ~ replace_na(Age, median(Age, na.rm = T))) %>% 
  #updating the role of the PassengerId to exclude from analysis
  update_role(PassengerId, new_role = "id_variable")
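
A possible alternative to the step_mutate_at() imputation is the dedicated step_impute_median() from recipes. As far as I understand it, this step learns the median when the recipe is prepped on the training data and reuses that value on new data, whereas a mutate step recomputes it on whatever data is passed in. The sketch below uses a made-up toy data frame (toy, toy_recipe and baked are hypothetical names) and places the imputation before the dummy step:

```r
library(tidymodels)

# toy data standing in for df_n; values are invented for illustration only
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0)),
  Age      = c(22, 38, NA, 35, NA, 54),
  Sex      = c("male", "female", "female", "female", "male", "male")
)

toy_recipe <- recipe(Survived ~ ., data = toy) %>%
  # fill missing Age with the median learned at prep() time
  step_impute_median(Age) %>%
  # turn Sex into a 0/1 indicator column
  step_dummy(all_nominal_predictors())

# prep() estimates the median, bake(new_data = NULL) returns the processed training data
baked <- bake(prep(toy_recipe), new_data = NULL)
baked
```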


Another way to view the recipe is with the tidy() function

tidy(dt_recipe)
# A tibble: 3 × 6
  number operation type      trained skip  id             
   <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
1      1 step      dummy     FALSE   FALSE dummy_jD1sy    
2      2 step      normalize FALSE   FALSE normalize_CvfCk
3      3 step      mutate_at FALSE   FALSE mutate_at_xAmJj

Model Creation

Declaring a model dt_model as a Decision Tree with depth as 3 and engine rpart

dt_model <- decision_tree(mode = "classification", tree_depth = 3) %>% 
  set_engine("rpart")
dt_model %>% translate()
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 3

Computational engine: rpart 

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
    maxdepth = 3)

Workflow creation

Workflow = recipe + model

dt_wf <- workflow() %>%
  add_model(dt_model) %>% 
  add_recipe(dt_recipe)

Predicting on test data

Fitting the dt_wf workflow on the train data and using the fitted workflow to predict the test data

dt_predict <- predict(fit(dt_wf, data = train), test)
head(dt_predict)
# A tibble: 6 × 1
  .pred_class
  <fct>      
1 0          
2 0          
3 0          
4 1          
5 0          
6 0          

Creating a new tibble called predicted_table by binding the predicted values .pred_class to the test data

predicted_table <- bind_cols(test, dt_predict) %>% 
  rename(dt_yhat = .pred_class) %>% 
  select(Survived, dt_yhat) 
# A tibble: 6 × 2
  Survived dt_yhat
  <fct>    <fct>  
1 0        0      
2 0        0      
3 0        0      
4 0        1      
5 0        0      
6 0        0      
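
The predict-then-bind steps can also be done in one go: keeping the fitted workflow as its own object lets it be reused, and augment() returns the new data with the prediction columns already attached. This is a sketch on made-up, perfectly separable toy data (toy, toy_wf, toy_fit and toy_pred are hypothetical names), not the Titanic workflow itself:

```r
library(tidymodels)

# toy data: the class is fully determined by x; invented for illustration only
toy <- tibble(
  y = factor(rep(c("no", "yes"), each = 20)),
  x = 1:40
)

toy_model <- decision_tree(mode = "classification", tree_depth = 3) %>%
  set_engine("rpart")

toy_wf <- workflow() %>%
  add_recipe(recipe(y ~ x, data = toy)) %>%
  add_model(toy_model)

# fit once and keep the fitted workflow for later reuse
toy_fit <- fit(toy_wf, data = toy)

# augment() adds .pred_class (and class probabilities) to the supplied data
toy_pred <- augment(toy_fit, new_data = toy)
head(toy_pred)
```

With the objects from this post the same idea would read dt_fit <- fit(dt_wf, data = train) followed by augment(dt_fit, new_data = test).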

Testing accuracy

As described in the binary classification metrics section of TMwR (Tidy Modeling with R), we will create the confusion matrix and check the accuracy

conf_mat(predicted_table, truth = Survived, estimate = dt_yhat)
          Truth
Prediction   0   1
         0 109  22
         1  11  37

Estimating the accuracy of our model

accuracy(predicted_table, truth = Survived, estimate = dt_yhat)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.816

In the tidymodels approach, we can define the required metrics separately with metric_set and then apply them together to check model performance

classification_metrics <- metric_set(accuracy, f_meas)
predicted_table %>% 
  classification_metrics(truth = Survived, estimate = dt_yhat)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.816
2 f_meas   binary         0.869
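
Both numbers can be checked by hand from the confusion matrix. The sketch below uses plain arithmetic in base R, with the counts copied from the conf_mat() output. One thing to keep in mind: yardstick treats the first factor level as the event by default, so the 0.869 is the F-measure for class 0 (did not survive); the variable names are my own.

```r
# counts from the confusion matrix (rows = prediction, columns = truth)
tp <- 109  # predicted 0, truly 0 (class "0" is the event by default)
fp <- 22   # predicted 0, truly 1
fn <- 11   # predicted 1, truly 0
tn <- 37   # predicted 1, truly 1

accuracy_manual <- (tp + tn) / (tp + fp + fn + tn)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_manual  <- 2 * precision * recall / (precision + recall)

round(accuracy_manual, 3)  # 0.816
round(f_manual, 3)         # 0.869

# the same F-measure with class "1" (survived) as the event
f_survived <- 2 * tn / (2 * tn + fn + fp)
round(f_survived, 3)       # 0.692
```

To have yardstick score the survivors instead, f_meas() accepts event_level = "second".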

Submission on Kaggle

When I ran this code on Kaggle, the Decision Tree predictions resulted in a score of 0.7799, the same as the DT code I wrote in Python earlier.

Overall, I’m glad that I was able to wrap my head around the tidymodels workflow.

Next steps

  • Figure out how to compare the accuracy of the other models (KNN, SVM) that I had coded earlier in Python
  • Figure out hyper-parameter tuning with the tune package