This article describes how to estimate a linear regression with gradient descent (steepest descent) and should be a useful reference for anyone tackling the same problem.

Problem description

Sample data:

X <- matrix(c(rep(1, 97), runif(97)), nrow = 97, ncol = 2)
y <- matrix(runif(97), nrow = 97, ncol = 1)

I have succeeded in creating the cost function:

COST <- function(theta, X, y) {
  ### Calculate half MSE
  sum((X %*% theta - y)^2) / (2 * length(y))
}

However, when I run this function, it seems to fail to converge over 100 iterations.

theta <- matrix(0, nrow = 2, ncol = 1)
num.iters <- 1500
delta <- 0

GD <- function(X, y, theta, alpha, num.iters) {
    for (i in num.iters) {
        while (max(abs(delta)) < tolerance) {
            error <- X %*% theta - y
            delta <- (t(X) %*% error) / length(y)
            theta <- theta - alpha * delta
            cost_histo[i] <- COST(theta, X, y)
            theta_histo[[i]] <- theta
        }
    }
    return(list(cost_histo, theta_histo))
}

Can someone help me?

Cheers

Recommended answer

The algorithmic part of your implementation is correct. The problems lie in:

  • The loop structure in GD is not right: the for loop is redundant, and the variables lack proper initialization.
  • A simple implementation of gradient descent with a fixed alpha is risky. The usual advice is to choose alpha small enough that every step moves down the objective function, but that is hard to get right in practice: how small is small enough? If alpha is too small, convergence is slow; if it is too large, the search path may "zig-zag" or even diverge (see the sketch after this list).
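
To make the divergence point concrete, here is a minimal sketch (not part of the original answer) that runs the naive fixed-alpha update from the question on the example data, regenerated with a seed for reproducibility. It reuses the COST function defined above; with a large fixed step such as alpha = 5, the cost blows up, whereas the step-halving version below handles the same alpha safely.

set.seed(1)
X <- matrix(c(rep(1, 97), runif(97)), nrow = 97, ncol = 2)
y <- matrix(runif(97), nrow = 97, ncol = 1)

theta <- c(0, 0)
alpha <- 5  # deliberately large fixed step
for (i in 1:10) {
  error <- X %*% theta - y                   # residuals
  delta <- crossprod(X, error) / length(y)   # gradient of the half MSE
  theta <- theta - alpha * c(delta)          # naive fixed-step update
  cat(sprintf("iter %2d  cost = %g\n", i, COST(theta, X, y)))
}
# The printed cost grows rapidly instead of decreasing.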

Here is a robust version of gradient descent for estimating a linear regression. The improvement comes from a step-halving strategy, which avoids "zig-zag" behaviour and divergence. See the comments along the code. Under this strategy it is safe to use a large alpha; convergence is guaranteed.

# theta: initial guess on regression coef
# alpha: initial step scaling factor
GD <- function(X, y, theta, alpha) {
  cost_histo <- numeric(0)
  theta_histo <- numeric(0)
  # an arbitrary initial gradient, to pass the initial while() check
  delta <- rep(1, ncol(X))
  # MSE at initial theta
  old.cost <- COST(theta, X, y)
  # main iteration loop
  while (max(abs(delta)) > 1e-7) {
    # gradient
    error <- X %*% theta - y
    delta <- crossprod(X, error) / length(y)
    # proposal step
    trial.theta <- theta - alpha * c(delta)
    trial.cost <- COST(trial.theta, X, y)
    # step halving to avoid divergence
    while (trial.cost >= old.cost) {
      trial.theta <- (theta + trial.theta) / 2
      trial.cost <- COST(trial.theta, X, y)
    }
    # accept proposal
    cost_histo <- c(cost_histo, trial.cost)
    theta_histo <- c(theta_histo, trial.theta)
    # update old.cost and theta
    old.cost <- trial.cost
    theta <- trial.theta
  }
  list(cost_histo, theta_histo = matrix(theta_histo, nrow = ncol(X)))
}

On return,

  • the length of cost_histo tells you how many iterations were taken (excluding step halving);
  • each column of theta_histo gives theta at each iteration.

Step halving in fact speeds up convergence greatly. You can gain further efficiency by using a faster computation method for COST (most useful for large datasets; see https://stackoverflow.com/a/40228894/4891738):

COST <- function(theta, X, y) {
  c(crossprod(X %*% theta - y)) / (2 * length(y))
}
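
As a quick sanity check (not part of the original answer), the two COST versions compute the same quantity, since crossprod(v) for a column vector v is just sum(v^2):

theta0 <- c(0, 0)
all.equal(sum((X %*% theta0 - y)^2) / (2 * length(y)),
          c(crossprod(X %*% theta0 - y)) / (2 * length(y)))
# should return TRUE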

Now, let's apply it to your example X and y:

oo <- GD(X, y, c(0,0), 5)

It converges after 107 iterations. We can view the trace of the MSE:

plot(oo[[1]])

Note that the MSE decreases very quickly during the first few steps and is then almost flat. This reveals a fundamental drawback of the gradient descent algorithm: convergence gets slower and slower as we get nearer to the minimum.
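
To quantify this (an illustrative check, not part of the original answer), compare how much the MSE drops over the first ten iterations with how much it drops over the last ten:

mse <- oo[[1]]
n <- length(mse)
c(first_10 = mse[1] - mse[10], last_10 = mse[n - 9] - mse[n])
# the first drop is typically orders of magnitude larger than the second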

Now, we extract the final estimated coefficients:

oo[[2]][, 107]

We can also compare this with the direct estimate from QR factorization:

.lm.fit(X, y)$coef

They are very close.
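
For a side-by-side view (illustrative, not part of the original answer), the two sets of coefficients can be printed next to each other; the last column of theta_histo is the gradient descent estimate:

cbind(gradient_descent = oo[[2]][, ncol(oo[[2]])],
      qr = .lm.fit(X, y)$coef)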

