Tidymodels package: general linear models (glm) and decision tree (bagged trees, boosted trees, and random forest) models in R

Problem description

Issue

I am attempting to undertake an analysis using the Tidymodels Package in R. I am following this tutorial below regarding decision tree learning in R:-

Tutorial

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

I have a data frame called FID (see below) where the dependent variable is the frequency (numeric), and the predictor variables are:- Year (numeric), Month (factor), Monsoon (factor), and Days (numeric).

I believe I have successfully followed the tutorial named "Tidymodels: Decision Tree Learning in R" by building a bagged tree, random forest, and boosted tree model.

For this analysis, I would also like to construct a general linear model (glm) in order to make model comparisons between all models (i.e., the random forest, bagged tree, boosted tree, and general linear models) to establish the best model fit. All models are subject to 10-fold cross-validation to decrease the bias of overfitting.

Problem

Subsequently, I have attempted to adapt the code (please see below) from the tutorial to fit a glm model, but I am confused whether I have tuned the model appropriately. I am unsure if this element of glm R-code is producing the error message below when I am attempting to produce the rmse values after the models have all been fit:-

Error message

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.

In addition, I am unsure if the R code expressed in the recipe() function for these models is adequate or correct, which is very important during the processing steps before fitting each model. From my perspective, I was wondering if the recipe for the models could be improved.

If this is possible, I was wondering if anyone could please help me regarding the error message when fitting the glm model, in conjunction with correcting the recipe (if this is necessary).

I am not an advanced R coder, and I have made a thorough attempt to try and find a solution by researching other Tidymodel tutorials; but, I do not understand what this error message means. Therefore, if anyone is able to help, I would like to express my deepest appreciation.

Many thanks.

R code

##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)

###########################################################
# Put 3/4 of the data into the training set
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(fid_df, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe
##Preprocessing
############################################################

rec <- recipe(Frequency ~ ., data = fid_df) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1,
                       mixture = 1) %>%
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>%
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples


##Create workflow
wflow_bag <- workflow() %>%
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(),
                              importance = "permutation",
                              verbose = TRUE) %>%
                              set_mode("regression")

##Create Workflow

wflow_rf <- workflow() %>%
               add_model(mod_rf) %>%
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>%
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>%
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>%
                  add_recipe(rec) %>%
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>%
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>%
            bind_rows(collect_metrics(fit_glm)) %>%
              dplyr::filter(.metric == "rmse") %>%
                dplyr::mutate(model = c("bag", "rf", "boost")) %>%
                 dplyr::select(model, everything()) %>%
                    knitr::kable()

####Error message

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.
Run `rlang::last_error()` to see where the error occurred.

#####################################################
##Out-of-sample performance
#####################################################

# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                     split = data_split) # the split object created by initial_split()
# random forest
final_fit_rf <- last_fit(
                     wflow_rf,
                     split = data_split)
# boosted trees
final_fit_boost <- last_fit(
                     wflow_boost,
                     split = data_split)

Data frame - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
    Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
    33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
    41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
    28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
    7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
    29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")

Recommended answer

I believe the error from fitting the linear model is coming from how Month and Monsoon are related to each other. They are perfectly correlated:

library(tidyverse)

fid_df <- structure(list(
  Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
           2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017),
  Month = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
                      1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
                      1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
                    .Label = c("January", "February", "March", "April", "May", "June",
                               "July", "August", "September", "October", "November",
                               "December"),
                    class = "factor"),
  Monsoon = structure(c(2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L,
                        2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L,
                        2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L),
                      .Label = c("First_Inter_Monssoon", "North_Monsoon",
                                 "Second_Inter_Monsoon", "South_Monsson"),
                      class = "factor"),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
), row.names = c(NA, -36L), class = "data.frame")


fid_df %>%
  count(Month, Monsoon)
#>        Month              Monsoon n
#> 1    January        North_Monsoon 3
#> 2   February        North_Monsoon 3
#> 3      March First_Inter_Monssoon 3
#> 4      April First_Inter_Monssoon 3
#> 5        May        South_Monsson 3
#> 6       June        South_Monsson 3
#> 7       July        South_Monsson 3
#> 8     August        South_Monsson 3
#> 9  September        South_Monsson 3
#> 10   October Second_Inter_Monsoon 3
#> 11  November Second_Inter_Monsoon 3
#> 12  December        North_Monsoon 3
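
The table above can also be checked programmatically. The following is a minimal base-R sketch, assuming the same Month-to-Monsoon mapping as in the posted data; the names `monsoon_by_month` and `levels_per_month` are introduced here for illustration only. It counts how many distinct Monsoon values occur within each Month. If every count is 1, Monsoon is fully determined by Month, which is the sense in which the two variables carry redundant information.

```r
# Month-to-Monsoon mapping copied from the posted data (spellings kept as-is)
monsoon_by_month <- c(
  "North_Monsoon", "North_Monsoon",
  "First_Inter_Monssoon", "First_Inter_Monssoon",
  "South_Monsson", "South_Monsson", "South_Monsson",
  "South_Monsson", "South_Monsson",
  "Second_Inter_Monsoon", "Second_Inter_Monsoon",
  "North_Monsoon"
)

# Three years of monthly observations, as in the question
month   <- factor(rep(month.name, 3), levels = month.name)
monsoon <- factor(rep(monsoon_by_month, 3))

# Number of distinct Monsoon values observed within each Month
levels_per_month <- tapply(
  as.character(monsoon), month,
  function(x) length(unique(x))
)

# TRUE means each Month maps to exactly one Monsoon
month_determines_monsoon <- all(levels_per_month == 1)
print(month_determines_monsoon)
```

The reverse does not hold: one Monsoon level spans several months, so Month is the finer-grained of the two variables.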

If you use variables like this in a linear model, the model cannot find estimates for both sets of coefficients:

lm(Frequency ~ ., data = fid_df) %>% summary()
#>
#> Call:
#> lm(formula = Frequency ~ ., data = fid_df)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -15.0008  -3.9357   0.6564   2.9769  12.7681
#>
#> Coefficients: (3 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                 -7286.9226  3443.9292  -2.116   0.0459 *
#> Year                            3.6155     1.7104   2.114   0.0461 *
#> MonthFebruary                  -3.2641     6.6172  -0.493   0.6267
#> MonthMarch                      0.1006     6.5125   0.015   0.9878
#> MonthApril                      0.4843     6.5213   0.074   0.9415
#> MonthMay                       -4.0308    11.0472  -0.365   0.7187
#> MonthJune                       1.0135    15.5046   0.065   0.9485
#> MonthJuly                      -2.6910    13.6106  -0.198   0.8451
#> MonthAugust                    -4.9307     6.6172  -0.745   0.4641
#> MonthSeptember                 -1.7105     7.1126  -0.240   0.8122
#> MonthOctober                   -7.6981     6.5685  -1.172   0.2538
#> MonthNovember                  -8.7484     6.5493  -1.336   0.1953
#> MonthDecember                  -1.6981     6.5685  -0.259   0.7984
#> MonsoonNorth_Monsoon                NA         NA      NA       NA
#> MonsoonSecond_Inter_Monsoon         NA         NA      NA       NA
#> MonsoonSouth_Monsson                NA         NA      NA       NA
#> Days                            1.1510     0.4540   2.535   0.0189 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.968 on 22 degrees of freedom
#> Multiple R-squared:  0.8135, Adjusted R-squared:  0.7033
#> F-statistic: 7.381 on 13 and 22 DF,  p-value: 2.535e-05
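
The three `NA` coefficients can also be seen from the design matrix itself, without fitting anything. The sketch below is an illustration under the assumption that the data are exactly as posted; the compact reconstruction of `fid_df` and the names `X_full`, `X_month` and `X_monsoon` are introduced here and are not part of the original answer. It compares the number of columns in the model matrix with its numerical rank, once with both factors and once with each factor dropped.

```r
# Compact reconstruction of the posted data frame (same values, same column order)
monsoon_by_month <- c(
  "North_Monsoon", "North_Monsoon",
  "First_Inter_Monssoon", "First_Inter_Monssoon",
  rep("South_Monsson", 5),
  "Second_Inter_Monsoon", "Second_Inter_Monsoon",
  "North_Monsoon"
)

fid_df <- data.frame(
  Year      = rep(c(2015, 2016, 2017), each = 12),
  Month     = factor(rep(month.name, 3), levels = month.name),
  Monsoon   = factor(rep(monsoon_by_month, 3)),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days      = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
                31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
                31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
)

# Design matrices: both factors, Month only, Monsoon only
X_full    <- model.matrix(Frequency ~ ., data = fid_df)
X_month   <- model.matrix(Frequency ~ . - Monsoon, data = fid_df)
X_monsoon <- model.matrix(Frequency ~ . - Month, data = fid_df)

# Columns versus numerical rank for each design
rank_table <- data.frame(
  design  = c("Month + Monsoon", "Month only", "Monsoon only"),
  columns = c(ncol(X_full), ncol(X_month), ncol(X_monsoon)),
  rank    = c(qr(X_full)$rank, qr(X_month)$rank, qr(X_monsoon)$rank)
)
print(rank_table)
```

With both factors the matrix has 17 columns but rank 14, which matches the "3 not defined because of singularities" line in the `lm()` summary. With either factor removed, the rank equals the number of columns, so every coefficient can be estimated.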

Since you have this info, I would recommend using some domain knowledge to decide whether to use Month or Monsoon in the model but not both.
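
Separately, the `mutate()` message quoted in the question appears to have a simpler cause: after the glm metrics are bound in and the rows are filtered to `rmse`, there are four rows, but the `model` vector still lists only three names. A hedged sketch of one way to avoid the mismatch is shown below. It uses `dplyr::bind_rows()` with its `.id` argument on a named list, so each row is labelled from the list name rather than from a hand-written vector. The helper `placeholder_metrics()` is hypothetical and only mimics part of the shape of `collect_metrics()` output; its `mean` values are left as `NA` because no model is fitted here.

```r
library(dplyr)

# Hypothetical stand-in for collect_metrics(): one row per metric,
# with the estimate left as NA because nothing is fitted in this sketch.
placeholder_metrics <- function() {
  data.frame(
    .metric    = c("rmse", "rsq"),
    .estimator = "standard",
    mean       = NA_real_,
    stringsAsFactors = FALSE
  )
}

# In the real analysis each element would be collect_metrics(fit_bag), etc.
results <- list(
  bag   = placeholder_metrics(),
  rf    = placeholder_metrics(),
  boost = placeholder_metrics(),
  glm   = placeholder_metrics()
)

# .id = "model" fills a new column with the list names,
# so the number of labels always matches the number of rows.
rmse_tbl <- bind_rows(results, .id = "model") %>%
  filter(.metric == "rmse")

print(rmse_tbl$model)
```

Keeping the original pipeline and changing the label vector to `c("bag", "rf", "boost", "glm")` should work as well, provided the order matches the order in which the metrics were bound.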
