r - How to plot interaction effects from extremely large data sets (especially from rxGlm output)

I am currently fitting glm models on a very large data set. Both glm and even speedglm take days to compute.

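I currently have around 3M observations and 400 variables in total, only some of which are used in the regression. In the regression I use 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable as a factor (iv5), and an interaction term (x * y, where x is an integer and y is a binary dummy variable coded as a factor). Finally, I have fixed effects for years (ff1) and company IDs (ff2). I have 15 years and 3000 companies. I introduced the fixed effects by adding them as factors. I observe that the 3000 company fixed effects in particular make the computation very slow in stats glm and also in speedglm.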

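I therefore decided to try Microsoft R's rxGlm (RevoScaleR), since it can make use of more threads and processor cores. Indeed, the analysis is much faster. I also compared the results for a subsample against those from standard glm, and they match.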

I used the following call:

mod1 <- rxGlm(formula = dv ~
                      iv1 + iv2 + iv3+
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat,
                    dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)

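However, I run into a problem when trying to plot the interaction term using the effects package. Calling the following function gives this error: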

> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute

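I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is much smaller than the glm object and therefore probably contains less data (80 MB versus several GB).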

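I then tried converting the rxGlm object to glm via as.glm(). However, the effect() call still does not produce a result, and it fails with the following error messages: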

Error in dnorm(eta) :
  Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L,  :
  variable 'x for y' is absent, its contrast will be ignored

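If I now compare an original glm with the "converted glm", I find that the converted glm contains far fewer items. For example, it does not contain effects, and for contrasts it only states contr.treatment for each variable.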

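I am now primarily looking for a way to convert the rxGlm output object into a format that can be used with the effect() function. If that is not possible, how can I get an interaction plot using functions from the RevoScaleR package, e.g. rxLinePlot()? rxLinePlot() also plots reasonably quickly, but I have not yet found a way to get a typical interaction effect plot out of it. I want to avoid fitting the full glm model first and then plotting, because that takes very long.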

Best answer

If you can get the coefficients, can't you just roll your own?
This would not be a data set size issue.

# ex. data
n = 2000
dat <- data.frame( dv = sample(0:1, size = n, rep = TRUE),
                   iv1 = sample(1:10, size = n, rep = TRUE),
                   iv2 = sample(1:10, size = n, rep = TRUE),
                   iv3 = sample(1:10, size = n, rep = TRUE),
                   iv4 = sample(0:10, size = n, rep = TRUE),
                   iv5 = as.factor(sample(0:1, size = n, rep = TRUE)),
                   x = sample(1:100, size = n, rep = TRUE),
                   y = as.factor(sample(0:1, size = n, rep = TRUE)),
                   ff1  = as.factor(sample(1:15, size = n, rep = TRUE)),
                   ff2  = as.factor(sample(1:100, size = n, rep = TRUE))
                   )

mod1 <- glm(formula = dv ~
                      iv1 + iv2 + iv3+
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat)

# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']

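# linear predictor over a grid of x (intercept and other terms left out)
x <- 1:100
a <- x1*x                 # line for y = 0
b <- x1*x + y1 + xy*x     # line for y = 1

# type must be a single character ('l' for lines); lty makes the legend show line samples
plot(a~x, type = 'l', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'), lty = 1)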
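
The lines above are on the probit (linear predictor) scale and leave out the intercept. If you want something closer to what effect() draws, the same idea extends to predicted probabilities with approximate confidence bands, using only the coefficient vector and its covariance matrix. The following is a minimal sketch under stated assumptions, not a drop-in for rxGlm: it uses a small base-R glm with a reduced formula, holds iv1 at its mean, and the helper name effect_line is made up for illustration. With rxGlm the coefficient names may differ from glm's 'y1' and 'x:y1', so inspect the names on your fitted object first and adapt the column assignments.

```r
set.seed(1)
n <- 2000
dat <- data.frame(dv  = rbinom(n, 1, 0.5),
                  iv1 = sample(1:10, n, replace = TRUE),
                  x   = sample(1:100, n, replace = TRUE),
                  y   = factor(sample(0:1, n, replace = TRUE)))

# reduced stand-in model; the real model would also carry iv2..iv5, ff1, ff2
fit <- glm(dv ~ iv1 + x * y, family = binomial(link = "probit"), data = dat)

b <- coef(fit)   # coefficient vector
V <- vcov(fit)   # covariance matrix of the coefficients
x_grid <- 1:100

# hypothetical helper: predicted probability and 95% band for y = 0 or y = 1
effect_line <- function(y_val) {
  X <- matrix(0, nrow = length(x_grid), ncol = length(b),
              dimnames = list(NULL, names(b)))
  X[, "(Intercept)"] <- 1
  X[, "iv1"]  <- mean(dat$iv1)      # other covariates held at a fixed value
  X[, "x"]    <- x_grid
  X[, "y1"]   <- y_val
  X[, "x:y1"] <- x_grid * y_val
  eta <- drop(X %*% b)                   # linear predictor
  se  <- sqrt(rowSums((X %*% V) * X))    # standard error of eta, row by row
  data.frame(x = x_grid,
             fit   = pnorm(eta),         # inverse probit link
             lower = pnorm(eta - 1.96 * se),
             upper = pnorm(eta + 1.96 * se))
}

line0 <- effect_line(0)
line1 <- effect_line(1)

plot(line0$x, line0$fit, type = "l", col = "red",
     ylim = range(c(line0$lower, line0$upper, line1$lower, line1$upper)),
     xlab = "x", ylab = "P(dv = 1)")
lines(line0$x, line0$lower, col = "red", lty = 2)
lines(line0$x, line0$upper, col = "red", lty = 2)
lines(line1$x, line1$fit, col = "blue")
lines(line1$x, line1$lower, col = "blue", lty = 2)
lines(line1$x, line1$upper, col = "blue", lty = 2)
legend("topright", c("y = 0", "y = 1"), col = c("red", "blue"), lty = 1)
```

For the full model you would also have to decide where to hold the fixed effects. Leaving the ff1 and ff2 columns at zero corresponds to the dropped reference year and company under treatment contrasts.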