The proper/fastest way to reshape a data.table

Problem description

I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

Solution

The data.table package implements faster melt/dcast functions (written in C). It also adds features of its own, such as melting and casting multiple columns at once. See the vignette "Efficient reshaping using data.tables" on GitHub.
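As a minimal sketch of the one-step approach the question asks about, dcast() can aggregate while it casts. The table below re-creates the question's DT with the values typed in by hand (an assumption, so the result does not depend on sample()); the name wide is only illustrative.

```r
library(data.table)

# Same rows as the question's DT, entered by hand instead of sampled.
DT <- data.table(
  x = rep(c(1, 2, 3), each = 4),
  y = rep(c("A", "B"), times = 6),
  v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49)
)

# Aggregate and reshape in a single call: one row per x, one column per
# level of y, each cell holding sum(v) for that (x, y) group.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
print(wide)
```

For this data, column A holds 72, 84, 162 and column B holds 123, 119, 96, the same sums as the two-step DT[...] plus reshape() version in the question, except that the columns are named A and B rather than SUM.A and SUM.B.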

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

- There is no need to load the reshape2 package before casting. If you need reshape2 for other operations, load it before loading data.table.
- dcast is also an S3 generic. You no longer need to call dcast.data.table(); just use dcast().
- melt:
  - can melt columns of type 'list'.
  - gains the arguments variable.factor and value.factor, which default to TRUE and FALSE respectively for compatibility with reshape2. They control directly whether the variable and value columns are returned as factors.
  - melt.data.table's na.rm = TRUE argument is optimised internally to drop NAs during melting, which makes it much more efficient.
  - NEW: melt accepts a list for measure.vars; the columns named in each element of the list are combined into one output column. patterns() makes this easier still. See the vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.
  - rowid() can be used directly in the formula to generate an id column, which is sometimes needed to identify rows uniquely. See ?dcast.
- Old benchmarks:
  - melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
  - dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
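The multi-column features listed above can be sketched as follows. The tables and column names (scores, height_1, weight_2, and so on) are hypothetical and chosen only for illustration; this is a small example under those assumptions, not a benchmark.

```r
library(data.table)

# Hypothetical wide table: two measurements, each taken at two time points.
scores <- data.table(
  id       = 1:3,
  height_1 = c(10, 11, 12),
  height_2 = c(20, 21, 22),
  weight_1 = c(1.0, 1.1, 1.2),
  weight_2 = c(2.0, 2.1, 2.2)
)

# melt several column groups at once: each pattern selects one group,
# and each group becomes its own value column.
long <- melt(scores, id.vars = "id",
             measure.vars = patterns("^height_", "^weight_"),
             value.name = c("height", "weight"),
             variable.name = "time")

# dcast with several value.var columns restores the wide layout.
back <- dcast(long, id ~ time, value.var = c("height", "weight"))

# dcast with several aggregation functions on one value column.
DT <- data.table(
  x = rep(c(1, 2, 3), each = 4),
  y = rep(c("A", "B"), times = 6),
  v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49)
)
stats <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))

print(long)
print(back)
print(stats)
```

Here long has one row per id and time, with height and weight side by side, and back has the columns id, height_1, height_2, weight_1 and weight_2 again. stats has one row per x, with a sum column and a mean column for each level of y.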

See also slide 32 of the Cologne (Dec 2013) presentation: Why not submit a dcast pull request to reshape2?