Out of memory when using "outer" while solving my big normal equation for least squares estimation

Problem description

Consider the following example in R:

x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- t(g) %*% g
gginv <- solve(gg)
bigmatrix <- outer(x1, x2, "<=")
Gw <- t(g) %*% bigmatrix
beta <- gginv %*% Gw
w1 <- bigmatrix - g %*% beta

If I try to run such a thing on my computer, it throws a memory error, because bigmatrix is too big.

How can I achieve the same result without running into this problem?

Answer

This is a least squares problem with 100,000 responses: bigmatrix is the response (matrix), beta is the coefficient (matrix), and w1 is the residual (matrix).

bigmatrix, as well as w1 if formed explicitly, will each cost

(100,000 * 100,000 * 8) / (1024 ^ 3) = 74.5 GB

That is far too large.

As estimation for each response is independent, there is no need to form bigmatrix in one go and store it in RAM. We can form it tile by tile and use an iterative procedure: form a tile, use it, then discard it. For example, the code below uses tiles of dimension 100,000 * 2,000, each with memory size:

(100,000 * 2,000 * 8) / (1024 ^ 3) = 1.5 GB
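As a quick sanity check of these two figures (assuming 8-byte doubles; `mat_gib` is a hypothetical helper, not part of the original answer):

```r
## memory cost (in GiB) of an n-by-m matrix of 8-byte doubles
mat_gib <- function(n, m) n * m * 8 / 1024^3

mat_gib(1e5, 1e5)   ## full bigmatrix: approximately 74.5 GiB
mat_gib(1e5, 2000)  ## one 100,000 x 2,000 tile: approximately 1.5 GiB
```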

With such an iterative procedure, memory usage stays under control.

x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- crossprod(g)    ## don't use `t(g) %*% g`
## we also don't explicitly form `gg` inverse

## initialize `beta` matrix (4 coefficients for each of 100,000 responses)
beta <- matrix(0, 4, 100000)

## we split 100,000 columns into 50 tiles, each with 2000 columns
for (i in 1:50) {
   start <- 2000 * (i-1) + 1    ## chunk start
   end <- 2000 * i    ## chunk end
   bigmatrix <- outer(x1, x2[start:end], "<=")
   Gw <- crossprod(g, bigmatrix)    ## don't use `t(g) %*% bigmatrix`
   beta[, start:end] <- solve(gg, Gw)
}
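As a side note, here is a small sketch (on hypothetical small random data, not the original problem size) showing that `solve(gg, Gw)` gives the same coefficients as the original `gginv %*% Gw` route, while avoiding the explicit inverse:

```r
set.seed(42)
x1 <- rnorm(200); x2 <- rnorm(200)
g <- cbind(x1, x2, x1^2, x2^2)
y <- outer(x1, x2[1:10], "<=")                  ## a small response tile
gg <- crossprod(g)
b1 <- solve(gg) %*% crossprod(g, y)             ## explicit inverse (original code)
b2 <- solve(gg, crossprod(g, y))                ## direct solve (answer's code)
max(abs(b1 - b2))                               ## agreement to rounding error
```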

Note: don't try to compute the full residual matrix w1, as it will also cost 74.5 GB. If you need the residuals in later work, you should still break them into tiles and work on them one by one.
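The same tiling pattern works for the residuals. Below is a scaled-down sketch (n = 1,000 and tile width 100 instead of 100,000 and 2,000, so it runs quickly) that reduces each residual tile to a per-response residual sum of squares, a hypothetical follow-up statistic chosen for illustration; each tile is discarded after use, so peak memory stays at one tile:

```r
set.seed(1)
n <- 1000; tile_w <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- crossprod(g)
beta <- matrix(0, 4, n)
rss <- numeric(n)                 ## residual sum of squares per response

for (i in 1:(n / tile_w)) {
  start <- tile_w * (i - 1) + 1
  end <- tile_w * i
  tile <- outer(x1, x2[start:end], "<=")            ## response tile
  beta[, start:end] <- solve(gg, crossprod(g, tile))
  resid <- tile - g %*% beta[, start:end]           ## residual tile only
  rss[start:end] <- colSums(resid^2)                ## reduce, then discard
}
```

Scaling `n` back up to 100,000 and `tile_w` to 2,000 recovers the memory profile of the loop above.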

You don't need to worry about the loop here: the computation inside each iteration is costly enough to amortize the looping overhead.
