How to apply a regression in a for loop over all variables of a dataset

This article covers how to apply a regression in a for loop over all variables of a dataset while adding rows in R.

Problem description

That is a long question, I know, but bear with me. I have a dataset in this form:

    head(TRAINSET)
             X1        X2        X3         X4    X5   X6     X7     X8     X9     X10     X11    X12     X13     X14            Y
    1 -2.973012 -2.956570 -2.386837 -0.5861751 4e-04 0.44 0.0728 0.0307 0.0354  0.0078  0.0047 0.0100 -0.0022  0.0038 -0.005200012
    2 -2.937649 -2.958624 -2.373960 -0.5636891 5e-04 0.44 0.0718 0.0323 0.0351  0.0075  0.0028 0.0095 -0.0019  0.0000  0.042085781
    3 -2.984238 -2.937649 -2.428712 -0.5555258 2e-04 0.43 0.0728 0.0329 0.0347  0.0088  0.0018 0.0092 -0.0019 -0.0076  0.004577122
    4 -2.976535 -2.970053 -2.443424 -0.5331107 9e-04 0.47 0.0588 0.0320 0.0331  0.0253  0.0011 0.0092 -0.0170 -0.0076  0.010515970
    5 -2.979631 -2.962549 -2.468805 -0.5108256 6e-04 0.46 0.0613 0.0339 0.0333 -0.0005 -0.0006 0.0090  0.0060 -0.0058  0.058487141
    6 -3.030536 -2.979631 -2.528079 -0.5024574 3e-04 0.43 0.0562 0.0333 0.0327  0.0109 -0.0006 0.0093 -0.0120  0.0000 -0.022896759

This is my train set, and it has 300 rows. The remaining 700 rows are the test set.
What I am trying to accomplish is:

1. For each column, fit a linear model of this form: Y ~ X1.
2. Use the fitted model to get the predicted value of Y by using the first X1 of the test set.
3. After that, take the first row of the test set and rbind it to the train set (now the train set has 301 rows).
4. Predict the value of Y using the 2nd row of X1 from the test set.
5. Repeat for the remaining 699 rows of the test set.
6. Apply it to all the remaining variables of the dataset (X2,...,X14).

I have managed to produce accurate results when I apply code that I wrote for each variable specifically:

    fittedvaluess<-NULL #empty set to fill
    for(i in 1:nrow(TESTSET)){ #begin iteration over the rows of Test set
      TRAINSET<-rbind(TRAINSET,TESTSET[i,]) #add the rows to the train set
      LM<-lm(Y~X1,TRAINSET) #fit the evergrowing OLS
      predictd<-predict(LM,TESTSET[i+1,],type = "response") #get the predicted value
      fittedvaluess<-cbind(fittedvaluess,predictd) #get the vector of the predicted values
      print(cbind(i,length(TRAINSET$LHS),length(TRAINSET$DP),nrow(TRAINSET))) #to make sure it works
    }

However, I want to automate this and repeat it over the columns. I have made this:

    data<-TRAINSET #cause every time i had to remake the trainset
    fittedvaluesss<-NULL
    for(i in 1:nrow(TESTSET)){ #begin iteration on rows of Testset
      data<-rbind(data,TESTSET[i,]) # rbind the rows to the Trainset called data
      for(j in 1:ncol(TESTSET)){ #iterate over the columns
        LM<-lm(data$LHS~data[,j],data) #fit OLS
        predictd<-predict(LM,TESTSET[i+1,j],type = "response") #get the predicted value
        fittedvaluesss<-cbind(fittedvaluesss,predictd) #derive the predicted value
        print(c(i,j)) #make sure it works
      }
    }

The results are unfortunately wrong: fittedvaluesss is a huge matrix:

    dim(fittedvaluesss)
    [1] 2306 3167 #Stopped around the middle of its run

This doesn't make any sense. I have even run it for i in 1:3 and j in 1:3, and still the matrix was insanely huge.
I have tried starting the iteration from the columns and then going over the rows. Exactly the same wrong results. For some reason, in each run I was getting at least 362 values from the predict function. I am really stuck on this problem.

Any help is highly welcome.

EDIT 1: This is also known as a RECURSIVE FORECASTING methodology in Finance. It is a method to forecast future values from a model fit on your current dataset.

Solution

Consider reversing your looping logic, with columns in the outer loop and rows in the inner loop. Additionally, try nested apply functions, which return structures more aligned to your needs than the for loop. Specifically, the inner vapply() returns a numeric vector of all the test set's predicted values for each iterated column. Then the outer sapply() binds each returned vector to a column of a matrix.

Ultimately, fittedvaluess is a matrix with dimensions nrow(TESTSET) x (ncol(TESTSET) - 1): the outer loop leaves out the last column, since you do not regress Y on Y.

    fittedvaluess <- sapply(1:(ncol(TESTSET)-1), function(c){
      col <- names(TESTSET)[[c]]                            # RETRIEVE COLUMN NAME FOR LM FORMULA
      predictvals <- vapply(1:nrow(TESTSET), function(r){
        TRAINSET <- rbind(TRAINSET, TESTSET[1:r,])          # BINDING ROWS ON AND PRIOR TO CURRENT ROW
        LM <- lm(paste0("Y~", col), TRAINSET)               # CONCATENATED STRING FORMULA
        predictd <- predict(LM, TESTSET[r+1,], type="response")
      }, numeric(1))
    })

Why sapply and vapply?

Both sapply() and vapply() are wrappers around lapply(). Where sapply() (simple lapply) can return either a vector or a matrix, vapply() (verified lapply) lets you specify the returned output (vector, list, matrix) as well as its type and length. So vapply requires a third argument specifying such criteria. Here, we choose a numeric vector of length one: numeric(1). Because of this pre-specification, vapply() tends to run faster than lapply() in some cases.
Had we only chosen the general lapply(), we would need to run various casts and conversions of the list output to align it to matrix output. In a way, we could have done nested vapply() loops!
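For readers who want something to run end to end, here is a minimal sketch of the same expanding-window idea on simulated data. Everything about the data is an assumption made for illustration: three predictors instead of fourteen, 30 train rows and 10 test rows, and the helper names make_data and forecast_column are hypothetical. It also differs from the answer's code in one deliberate way: the model that predicts test row r is fitted on the train set plus test rows 1..(r-1), so the first test row is predicted from the original train set (step 2 of the question) and no prediction is requested for a row past the end of the test set. The first few lines show what appears to be the cause of the oversized output in the question: when the formula refers to data$... and data[, j] instead of column names, predict() cannot match the terms against newdata and returns one value per training row.

```r
set.seed(42)

# Simulated stand-in for the question's data (assumption: 3 predictors, small n)
n_train <- 30
n_test  <- 10
make_data <- function(n) {            # hypothetical helper
  d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
  d$Y <- 0.5 * d$X1 - 0.2 * d$X2 + rnorm(n, sd = 0.1)
  d
}
TRAINSET <- make_data(n_train)
TESTSET  <- make_data(n_test)

# Formula built from `$` and `[` expressions: predict() cannot find these
# terms in newdata, falls back to the training rows, and warns.
bad_fit  <- lm(TRAINSET$Y ~ TRAINSET[, 1])
bad_pred <- suppressWarnings(predict(bad_fit, newdata = TESTSET[1, ]))
length(bad_pred)                      # one value per training row, not 1

# Expanding-window forecast for a single predictor column (hypothetical helper)
predictors <- setdiff(names(TESTSET), "Y")
forecast_column <- function(col) {
  form <- reformulate(col, response = "Y")      # builds Y ~ <col> by name
  vapply(seq_len(nrow(TESTSET)), function(r) {
    # train set plus the test rows observed before row r
    train_r <- rbind(TRAINSET, TESTSET[seq_len(r - 1), , drop = FALSE])
    fit <- lm(form, data = train_r)
    unname(predict(fit, newdata = TESTSET[r, , drop = FALSE]))
  }, numeric(1))
}

# One column of forecasts per predictor
fitted_values <- sapply(predictors, forecast_column)
dim(fitted_values)                    # rows = test rows, columns = predictors
```

Building the formula with reformulate() keeps the model terms as plain column names, which is what lets predict() pick the matching column out of the one-row newdata. Note that every forecast refits the model from scratch, which is simple but slow for long test sets.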
08-28 22:17