This page includes an analysis performed on flu symptom data from this study. The complete analysis is available on my GitHub. This portion of the assignment explores three machine learning models: a decision tree, LASSO, and a random forest.
In this file, we will refine our analysis for this exercise and incorporate some machine learning models.

- Outcome of interest: body temperature (continuous, numerical)
- Corresponding model type: regression
- Performance metric: RMSE
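Since RMSE is the performance metric used throughout, here is a minimal sketch of how it is computed, both by hand and with yardstick (the `example` data frame and its values are made up purely for illustration):

#minimal RMSE sketch; `example` is a made-up data frame for illustration only
library(yardstick)
example <- data.frame(truth = c(98.6, 99.1, 100.4), estimate = c(98.8, 98.9, 99.5))
sqrt(mean((example$truth - example$estimate)^2)) #RMSE computed by hand
rmse(example, truth = truth, estimate = estimate) #same value via yardstick::rmse()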
library(ggplot2) #for plotting
library(broom) #for cleaning up output from lm()
library(here) #for data loading/saving
## here() starts at /Users/ameliafoley/Desktop/MADA/AMELIAFOLEY-MADA-portfolio
library(tidymodels) #for modeling
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ dials 0.0.10 ✓ rsample 0.1.0
## ✓ dplyr 1.0.7 ✓ tibble 3.1.5
## ✓ infer 1.0.0 ✓ tidyr 1.1.4
## ✓ modeldata 0.1.1 ✓ tune 0.1.6.9001
## ✓ parsnip 0.1.7.9001 ✓ workflows 0.2.4.9000
## ✓ purrr 0.3.4 ✓ workflowsets 0.1.0
## ✓ recipes 0.1.17.9000 ✓ yardstick 0.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-3
library(ranger)
library(rpart.plot) # for visualizing a decision tree
library(vip) # for variable importance plots
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
#path to data
#note the use of the here() package and not absolute paths
data_location <- here::here("files","processeddata_copy.rds")
#load cleaned data.
mydata <- readRDS(data_location)
# set seed for reproducible analysis (instead of random subset each time)
set.seed(123)
#subset 70% of data as training set
data_split <- initial_split(mydata,
                            prop = 7/10,
                            strata = BodyTemp) #stratify by body temp for balanced outcome
#save sets as data frames
train_data <- training(data_split)
test_data <- testing(data_split)
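As a quick sanity check on the split (a sketch using the objects created above), the training fraction should come out to roughly 0.70:

#check the split proportion; should be roughly 0.70
nrow(train_data) / nrow(mydata)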
We want to perform 5-fold cross-validation, repeated 5 times.
#create folds (resample object)
set.seed(123)
folds <- vfold_cv(train_data,
                  v = 5,
                  repeats = 5,
                  strata = BodyTemp) #folds is set up to perform our CV
#linear model set up
lm_mod <- linear_reg() %>%
  set_engine('lm') %>%
  set_mode('regression')
#create recipe for data and fitting and make dummy variables
BT_rec <- recipe(BodyTemp ~ ., data = train_data) %>% step_dummy(all_nominal())
#workflow set up
BT_wflow <-
  workflow() %>% add_model(lm_mod) %>% add_recipe(BT_rec)
#use workflow to prepare recipe and train model with predictors
BT_fit <-
  BT_wflow %>% fit(data = train_data)
#extract model coefficients
BT_fit %>% extract_fit_parsnip() %>% tidy()
## # A tibble: 32 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 98.4 0.294 334. 0
## 2 SwollenLymphNodes_Yes -0.140 0.112 -1.25 0.211
## 3 ChestCongestion_Yes 0.193 0.116 1.66 0.0972
## 4 ChillsSweats_Yes 0.185 0.155 1.19 0.234
## 5 NasalCongestion_Yes -0.275 0.137 -2.00 0.0456
## 6 Sneeze_Yes -0.469 0.118 -3.96 0.0000855
## 7 Fatigue_Yes 0.278 0.204 1.37 0.173
## 8 SubjectiveFever_Yes 0.483 0.123 3.92 0.000103
## 9 Headache_Yes -0.0169 0.150 -0.112 0.911
## 10 Weakness_1 0.370 0.194 1.91 0.0569
## # … with 22 more rows
#recipe for null model
null_train_rec <- recipe(BodyTemp ~ 1, data = train_data) #predicts mean of outcome
#null model workflow incorporating null model recipe
null_wflow <- workflow() %>% add_model(lm_mod) %>% add_recipe(null_train_rec)
# I want to check and make sure that the null model worked as it was supposed to, so I want to view the predictions and make sure they are all the mean of the outcome
#get fit for train data using null workflow
nullfittest <- null_wflow %>% fit(data = train_data)
#get predictions based on null model
prediction <- predict(nullfittest, train_data)
test_pred <- predict(nullfittest, test_data)
#the predictions for the train and test data are all the same mean value, so this tells us the null model was set up properly
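A quick programmatic check (a sketch using the objects above) confirms that every null-model prediction equals the mean of the outcome in the training data:

#every null-model prediction should equal the training-set mean of BodyTemp
all(dplyr::near(prediction$.pred, mean(train_data$BodyTemp)))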
#Now, we'll use fit_resamples based on the tidymodels tutorial for CV/resampling (https://www.tidymodels.org/start/resampling/)
#fit model with training data
null_fit_train <- fit_resamples(null_wflow, resamples = folds)
## ! Fold1, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat5: internal: A correlation computation is required, but `estimate` is const...
#get results
metrics_null_train <- collect_metrics(null_fit_train)
#RMSE for null train fit is 1.204757
#repeat for test data
null_test_rec <- recipe(BodyTemp ~ 1, data = test_data) #predicts mean of outcome
null_test_wflow <- workflow() %>% add_model(lm_mod) %>% add_recipe(null_test_rec) #sets workflow with new test recipe
null_fit_test <- fit_resamples(null_test_wflow, resamples = folds) #performs fit
## ! Fold1, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat1: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat2: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat3: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat4: internal: A correlation computation is required, but `estimate` is const...
## ! Fold1, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold2, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold3, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold4, Repeat5: internal: A correlation computation is required, but `estimate` is const...
## ! Fold5, Repeat5: internal: A correlation computation is required, but `estimate` is const...
metrics_null_test <- collect_metrics(null_fit_test) #gets fit metrics
#RMSE for null test fit is 1.204757
The RMSE that we get for both the null test and null train models is 1.204757. We’ll use this later to compare to the performance of our real models (we want any real models to perform better than this null model).
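As a rough sanity check, a model that always predicts the mean should have an RMSE close to the standard deviation of the outcome, so we can compare the null RMSE against that (a sketch using the training data):

#the null RMSE should be close to the standard deviation of the outcome
sd(train_data$BodyTemp)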
For each model below, we include:

1. Model specification
2. Workflow definition
3. Tuning grid specification
4. Tuning with cross-validation and tune_grid()
## Decision tree
#going based off of tidymodels tutorial: tune parameters
#since we already split our data into test and train sets, we'll continue to use those here. they are `train_data` and `test_data`
#model specification
tune_spec <-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()
  ) %>%
  set_engine("rpart") %>%
  set_mode("regression")
tune_spec
## Decision Tree Model Specification (regression)
##
## Main Arguments:
## cost_complexity = tune()
## tree_depth = tune()
##
## Computational engine: rpart
#tuning grid specification
tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5)
tree_grid
## # A tibble: 25 × 2
## cost_complexity tree_depth
## <dbl> <int>
## 1 0.0000000001 1
## 2 0.0000000178 1
## 3 0.00000316 1
## 4 0.000562 1
## 5 0.1 1
## 6 0.0000000001 4
## 7 0.0000000178 4
## 8 0.00000316 4
## 9 0.000562 4
## 10 0.1 4
## # … with 15 more rows
#cross-validation folds for tuning (vfold_cv() defaults to 10-fold)
set.seed(123)
cell_folds <- vfold_cv(train_data)
#workflow
set.seed(123)
tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_recipe(BT_rec)
#model tuning with `tune_grid()`
tree_res <-
  tree_wf %>%
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )
## ! Fold01: internal: A correlation computation is required, but `estimate` is const...
## ! Fold02: internal: A correlation computation is required, but `estimate` is const...
## ! Fold03: internal: A correlation computation is required, but `estimate` is const...
## ! Fold04: internal: A correlation computation is required, but `estimate` is const...
## ! Fold05: internal: A correlation computation is required, but `estimate` is const...
## ! Fold06: internal: A correlation computation is required, but `estimate` is const...
## ! Fold07: internal: A correlation computation is required, but `estimate` is const...
## ! Fold08: internal: A correlation computation is required, but `estimate` is const...
## ! Fold09: internal: A correlation computation is required, but `estimate` is const...
## ! Fold10: internal: A correlation computation is required, but `estimate` is const...
tree_res %>% collect_metrics()
## # A tibble: 50 × 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 1 rmse standard 1.19 10 0.0531 Prepro…
## 2 0.0000000001 1 rsq standard 0.0289 10 0.00760 Prepro…
## 3 0.0000000178 1 rmse standard 1.19 10 0.0531 Prepro…
## 4 0.0000000178 1 rsq standard 0.0289 10 0.00760 Prepro…
## 5 0.00000316 1 rmse standard 1.19 10 0.0531 Prepro…
## 6 0.00000316 1 rsq standard 0.0289 10 0.00760 Prepro…
## 7 0.000562 1 rmse standard 1.19 10 0.0531 Prepro…
## 8 0.000562 1 rsq standard 0.0289 10 0.00760 Prepro…
## 9 0.1 1 rmse standard 1.20 10 0.0558 Prepro…
## 10 0.1 1 rsq standard NaN 0 NA Prepro…
## # … with 40 more rows
#Here we see 25 candidate models, and the RMSE and Rsq for each
tree_res %>% autoplot() #view plot
#select the best decision tree model
best_tree <- tree_res %>% select_best("rmse")
#view model details
best_tree
## # A tibble: 1 × 3
## cost_complexity tree_depth .config
## <dbl> <int> <chr>
## 1 0.0000000001 1 Preprocessor1_Model01
#finalize model workflow with best model
tree_final_wf <- tree_wf %>%
  finalize_workflow(best_tree)
#fit model
tree_fit <-
  tree_final_wf %>% fit(train_data)
tree_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 508
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 508 742.9363 98.93642
## 2) Sneeze_Yes>=0.5 280 259.6477 98.69107 *
## 3) Sneeze_Yes< 0.5 228 445.7356 99.23772 *
#diagnostics
autoplot(tree_res)
#calculate residuals - originally got stuck trying out lots of different methods for this. took inspiration from Zane's code to manually calculate residuals rather than using some of the built in functions that I could not get to cooperate
tree_resid <- tree_fit %>%
  augment(train_data) %>% #this will add predictions to our df
  select(.pred, BodyTemp) %>%
  mutate(.resid = BodyTemp - .pred) #manually calculate residuals
#model predictions from tuned model vs actual outcomes
tree_pred_plot <- ggplot(tree_resid, aes(x = BodyTemp, y = .pred)) + geom_point() +
  labs(title = "Predictions vs Actual Outcomes: Decision Tree", x = "Body Temperature Outcome", y = "Body Temperature Prediction")
tree_pred_plot
#plot residuals vs predictions
tree_resid_plot <- ggplot(tree_resid, aes(y = .resid, x = .pred)) + geom_point() +
  labs(title = "Predictions vs Residuals: Decision Tree", x = "Body Temperature Prediction", y = "Residuals")
#view plot
tree_resid_plot
#compare to null model
#view null RMSE for train data
metrics_null_train
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.21 25 0.0171 Preprocessor1_Model1
## 2 rsq standard NaN 0 NA Preprocessor1_Model1
tree_res %>% show_best(n=1) #view RMSE for best decision tree model
## Warning: No value of `metric` was given; metric 'rmse' will be used.
## # A tibble: 1 × 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 1 rmse standard 1.19 10 0.0531 Preprocesso…
At first I was unsure whether these plots look the way they are supposed to and what that means for our model. It seems odd that there are only two distinct predictions (two horizontal bands in the predictions vs. actual outcomes plot) even though the actual outcomes span a range of values, and, as a result, the residuals plot shows two distinct vertical bands. This makes sense once we look at the selected model, though: the best tree has a depth of 1, so it makes a single split (on Sneeze_Yes) and can only produce two predicted values, one per terminal node.
For the performance comparison, we see a null RMSE of 1.2 and a decision tree RMSE of about 1.19. These are very similar values, so the decision tree barely improves on the null model.
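Since rpart.plot was loaded at the top but not used yet, one way to see this single split directly is to plot the fitted tree itself (a sketch; extract_fit_engine() pulls the underlying rpart object out of the fitted workflow):

#visualize the final (depth-1) tree; roundint = FALSE avoids warnings about dummy variables
tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(roundint = FALSE)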
## LASSO
#based on tidymodels tutorial: case study
#model specification
lasso <- linear_reg(penalty = tune()) %>% set_engine("glmnet") %>% set_mode("regression")
#set workflow
lasso_wf <- workflow() %>% add_model(lasso) %>% add_recipe(BT_rec)
#tuning grid specification
lasso_grid <- tibble(penalty = 10^seq(-3, 0, length.out = 30))
#tuning with CV and tune_grid
lasso_res <- lasso_wf %>% tune_grid(resamples = cell_folds,
                                    grid = lasso_grid,
                                    control = control_grid(save_pred = TRUE),
                                    metrics = metric_set(rmse))
#view model metrics
lasso_res %>% collect_metrics()
## # A tibble: 30 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.001 rmse standard 1.16 10 0.0481 Preprocessor1_Model01
## 2 0.00127 rmse standard 1.16 10 0.0482 Preprocessor1_Model02
## 3 0.00161 rmse standard 1.16 10 0.0482 Preprocessor1_Model03
## 4 0.00204 rmse standard 1.16 10 0.0482 Preprocessor1_Model04
## 5 0.00259 rmse standard 1.16 10 0.0483 Preprocessor1_Model05
## 6 0.00329 rmse standard 1.16 10 0.0484 Preprocessor1_Model06
## 7 0.00418 rmse standard 1.16 10 0.0484 Preprocessor1_Model07
## 8 0.00530 rmse standard 1.16 10 0.0485 Preprocessor1_Model08
## 9 0.00672 rmse standard 1.16 10 0.0487 Preprocessor1_Model09
## 10 0.00853 rmse standard 1.16 10 0.0488 Preprocessor1_Model10
## # … with 20 more rows
#select top models
top_lasso <-
  lasso_res %>% show_best("rmse") %>% arrange(penalty)
#view
top_lasso
## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0281 rmse standard 1.15 10 0.0490 Preprocessor1_Model15
## 2 0.0356 rmse standard 1.15 10 0.0491 Preprocessor1_Model16
## 3 0.0452 rmse standard 1.15 10 0.0491 Preprocessor1_Model17
## 4 0.0574 rmse standard 1.15 10 0.0496 Preprocessor1_Model18
## 5 0.0728 rmse standard 1.15 10 0.0507 Preprocessor1_Model19
#see best lasso
best_lasso <- lasso_res %>% select_best()
#view
best_lasso
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 0.0728 Preprocessor1_Model19
#finalize workflow with top model
lasso_final_wf <- lasso_wf %>% finalize_workflow(best_lasso)
#fit model with finalized WF
lasso_fit <- lasso_final_wf %>% fit(train_data)
#diagnostics
autoplot(lasso_res)
#calculate residuals
lasso_resid <- lasso_fit %>%
  augment(train_data) %>% #this will add predictions to our df
  select(.pred, BodyTemp) %>%
  mutate(.resid = BodyTemp - .pred) #manually calculate residuals
#model predictions from tuned model vs actual outcomes
lasso_pred_plot <- ggplot(lasso_resid, aes(x = BodyTemp, y = .pred)) + geom_point() +
  labs(title = "Predictions vs Actual Outcomes: LASSO", x = "Body Temperature Outcome", y = "Body Temperature Prediction")
lasso_pred_plot
#plot residuals vs predictions
lasso_resid_plot <- ggplot(lasso_resid, aes(y = .resid, x = .pred)) + geom_point() +
  labs(title = "Predictions vs Residuals: LASSO", x = "Body Temperature Prediction", y = "Residuals")
#view plot
lasso_resid_plot
#compare to null model
#view null RMSE for train data
metrics_null_train
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.21 25 0.0171 Preprocessor1_Model1
## 2 rsq standard NaN 0 NA Preprocessor1_Model1
lasso_res %>% show_best(n=1) #view RMSE for best lasso model
## # A tibble: 1 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0728 rmse standard 1.15 10 0.0507 Preprocessor1_Model19
From these plots, we can see the difference between the LASSO model and the decision tree model we started with: there is clearly more variety in the predictions, and consequently in the residuals. Ultimately, we compare the RMSE of this LASSO model to the null model RMSE. The null RMSE was 1.2, and the best LASSO RMSE is about 1.15, so the LASSO model does a bit better, but not by a whole lot.
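It can also be informative to see which coefficients the LASSO penalty kept versus shrank to exactly zero (a sketch using the fitted workflow from above):

#inspect LASSO coefficients at the selected penalty; terms with estimate == 0 were dropped from the model
lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate)))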
## Random forest
#based on tidymodels tutorial: case study
library(parallel)
cores <- parallel::detectCores()
cores
## [1] 8
#model specification
r_forest <- rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>% set_engine("ranger", num.threads = cores) %>% set_mode("regression")
#set workflow
r_forest_wf <- workflow() %>% add_model(r_forest) %>% add_recipe(BT_rec)
#tuning grid specification
rf_grid <- expand.grid(mtry = c(3, 4, 5, 6), min_n = c(40,50,60), trees = c(500,1000))
#what we will tune:
r_forest %>% parameters()
## Collection of 3 parameters for tuning
##
## identifier type object
## mtry mtry nparam[?]
## trees trees nparam[+]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
#tuning with CV and tune_grid
r_forest_res <-
  r_forest_wf %>%
  tune_grid(resamples = cell_folds,
            grid = rf_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(rmse))
#view top models
r_forest_res %>% show_best(metric = "rmse")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 500 40 rmse standard 1.15 10 0.0532 Preprocessor1_Model02
## 2 6 1000 60 rmse standard 1.15 10 0.0527 Preprocessor1_Model24
## 3 5 1000 60 rmse standard 1.15 10 0.0533 Preprocessor1_Model23
## 4 6 500 60 rmse standard 1.15 10 0.0522 Preprocessor1_Model12
## 5 5 500 60 rmse standard 1.15 10 0.0534 Preprocessor1_Model11
#view plot of models performance
autoplot(r_forest_res)
#select best model
rf_best <- r_forest_res %>% select_best(metric = "rmse")
rf_best
## # A tibble: 1 × 4
## mtry trees min_n .config
## <dbl> <dbl> <dbl> <chr>
## 1 4 500 40 Preprocessor1_Model02
#finalize workflow with top model
rf_final_wf <- r_forest_wf %>% finalize_workflow(rf_best)
#fit model with finalized WF
rf_fit <- rf_final_wf %>% fit(train_data)
#diagnostics
autoplot(r_forest_res)
#calculate residuals
rf_resid <- rf_fit %>%
  augment(train_data) %>% #this will add predictions to our df
  select(.pred, BodyTemp) %>%
  mutate(.resid = BodyTemp - .pred) #manually calculate residuals
#model predictions from tuned model vs actual outcomes
rf_pred_plot <- ggplot(rf_resid, aes(x = BodyTemp, y = .pred)) + geom_point() +
  labs(title = "Predictions vs Actual Outcomes: Random Forest", x = "Body Temperature Outcome", y = "Body Temperature Prediction")
rf_pred_plot
#plot residuals vs predictions
rf_resid_plot <- ggplot(rf_resid, aes(y = .resid, x = .pred)) + geom_point() +
  labs(title = "Predictions vs Residuals: Random Forest", x = "Body Temperature Prediction", y = "Residuals")
#view plot
rf_resid_plot
#compare to null model
#view null RMSE for train data
metrics_null_train
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.21 25 0.0171 Preprocessor1_Model1
## 2 rsq standard NaN 0 NA Preprocessor1_Model1
r_forest_res %>% show_best(n=1) #view RMSE for best random forest model
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 500 40 rmse standard 1.15 10 0.0532 Preprocessor1_Model02
Here, we see that the random forest plots of residuals and of predictions vs. actual outcomes look fairly similar to the LASSO plots, and still quite different from the decision tree plots. Recall that the RMSE of the null model on the training data was 1.2; the RMSE for our best random forest model is 1.15. The random forest performs better than the null model and the decision tree, but not by a lot, and its performance is similar to the LASSO model.
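The vip package was loaded at the top for variable importance plots but has not been used yet. Ranger only stores importance scores if the engine is told to compute them, so a variable importance plot would require refitting the selected model; a sketch of what that could look like (assuming the objects above):

#refit the best random forest with permutation importance turned on, then plot variable importance
rf_vip_spec <- rand_forest(mtry = rf_best$mtry, min_n = rf_best$min_n, trees = rf_best$trees) %>%
  set_engine("ranger", num.threads = cores, importance = "permutation") %>%
  set_mode("regression")
rf_vip_fit <- workflow() %>% add_model(rf_vip_spec) %>% add_recipe(BT_rec) %>% fit(train_data)
rf_vip_fit %>% extract_fit_parsnip() %>% vip(num_features = 10)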
#recall model performance metrics side by side
#view null model performance
metrics_null_train
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.21 25 0.0171 Preprocessor1_Model1
## 2 rsq standard NaN 0 NA Preprocessor1_Model1
tree_res %>% show_best(n=1) #view RMSE for best decision tree model
## Warning: No value of `metric` was given; metric 'rmse' will be used.
## # A tibble: 1 × 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 1 rmse standard 1.19 10 0.0531 Preprocesso…
lasso_res %>% show_best(n=1) #view RMSE for best lasso model
## # A tibble: 1 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0728 rmse standard 1.15 10 0.0507 Preprocessor1_Model19
r_forest_res %>% show_best(n=1) #view RMSE for best random forest model
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 500 40 rmse standard 1.15 10 0.0532 Preprocessor1_Model02
Our null model has an RMSE of 1.2 and a standard error of 0.017. Any model we pick should do better than that.

- Decision tree: RMSE about 1.19, std err 0.053
- LASSO: RMSE about 1.15, std err 0.051
- Random forest: RMSE about 1.15, std err 0.053
While LASSO performs only slightly better than the random forest, it is the frontrunner, with the lowest RMSE and the lowest standard error. I will select the LASSO model as my top model for this data.
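To see these metrics side by side in a single table, here is a small sketch that binds the best cross-validation result from each tuned model together (using the objects created above):

#combine the best CV result from each model into one comparison table
bind_rows(
  `decision tree` = show_best(tree_res, metric = "rmse", n = 1),
  LASSO = show_best(lasso_res, metric = "rmse", n = 1),
  `random forest` = show_best(r_forest_res, metric = "rmse", n = 1),
  .id = "model"
) %>%
  select(model, .metric, mean, std_err)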
#fit to test data
last_lasso_fit <- lasso_final_wf %>% last_fit(data_split)
last_lasso_fit %>% collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.15 Preprocessor1_Model1
## 2 rsq standard 0.0284 Preprocessor1_Model1
Here we see that the RMSE for the LASSO model on the test data (using the last_fit() function) is about 1.15, which is very close to the RMSE we saw on the training data for the LASSO model. That consistency suggests the model generalizes to new data about as well as it fit the training data.
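For intuition, last_fit() is doing roughly the same thing as fitting the finalized workflow on the training data and then scoring it on the held-out test data; a manual sketch of that equivalence (assuming the objects above):

#roughly what last_fit() does: train on the training set, then evaluate RMSE on the test set
manual_fit <- lasso_final_wf %>% fit(data = train_data)
manual_pred <- augment(manual_fit, test_data) #adds a .pred column to the test data
rmse(manual_pred, truth = BodyTemp, estimate = .pred)

With that in mind, let's look at some diagnostics for this final model.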
#calculate residuals
final_resid <- last_lasso_fit %>%
  augment() %>% #this will add predictions to our df
  select(.pred, BodyTemp) %>%
  mutate(.resid = BodyTemp - .pred) #manually calculate residuals
#model predictions from tuned model vs actual outcomes
final_pred_plot <- ggplot(final_resid, aes(x = BodyTemp, y = .pred)) + geom_point() +
  labs(title = "Predictions vs Actual Outcomes: LASSO", x = "Body Temperature Outcome", y = "Body Temperature Prediction")
final_pred_plot
#plot residuals vs predictions
final_resid_plot <- ggplot(final_resid, aes(y = .resid, x = .pred)) + geom_point() +
  labs(title = "Predictions vs Residuals: LASSO", x = "Body Temperature Prediction", y = "Residuals")
#view plot
final_resid_plot
#compare to null model
#view null RMSE for train data
metrics_null_train
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.21 25 0.0171 Preprocessor1_Model1
## 2 rsq standard NaN 0 NA Preprocessor1_Model1
last_lasso_fit %>% collect_metrics() #view RMSE for final model
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.15 Preprocessor1_Model1
## 2 rsq standard 0.0284 Preprocessor1_Model1