Problem description
I have a data table in R:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
x y v
[1,] 1 A 12
[2,] 1 B 62
[3,] 1 A 60
[4,] 1 B 61
[5,] 2 A 83
[6,] 2 B 97
[7,] 2 A 1
[8,] 2 B 22
[9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49
I can easily sum the variable v by the groups in the data.table:
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
x y SUM
[1,] 1 A 72
[2,] 1 B 123
[3,] 2 A 84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B 96
However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:
out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
x SUM.A SUM.B
[1,] 1 72 123
[2,] 2 84 119
[3,] 3 162 96
Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?
The data.table package implements faster melt/dcast functions (in C). It also adds features that allow melting and casting multiple columns at once. Please see the new vignette "Efficient reshaping using data.tables" on GitHub.
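Applied to the question above, the aggregation and the reshape can probably be folded into a single dcast() call. The following is a minimal sketch, assuming a data.table version in which dcast() dispatches on data.tables; the name `wide` is only illustrative, and y is written out with rep() instead of relying on recycling.

```r
library(data.table)

set.seed(1234)
DT <- data.table(
  x = rep(c(1, 2, 3), each = 4),
  y = rep(c("A", "B"), times = 6),  # same A/B alternation as in the question
  v = sample(1:100, 12)
)

# Aggregate and reshape in one call: one row per x, one column per level of y,
# each cell holding sum(v) for that (x, y) group.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
wide
```

The result has the columns x, A and B, so it matches the reshape() output apart from the SUM. prefix in the column names.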
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
- There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.
- dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().
- melt:
  - is capable of melting on columns of type 'list'.
  - gains variable.factor and value.factor, which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of the variable and value columns (as factors or not).
  - melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.
  - NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.
  - use the rowid() function directly in the formula to generate an id column, which is sometimes required to identify the rows uniquely. See ?dcast.
- Old benchmarks:
  - melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
  - dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
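The multi-column features listed above can be sketched roughly as follows. The `kids` table and the names `long`, `back`, `stats` and `byrow` are hypothetical and used only for illustration; DT is the table from the question, with y spelled out via rep().

```r
library(data.table)

# Hypothetical wide table: two attributes (dob, gender) recorded per child.
kids <- data.table(
  family   = c(1L, 2L),
  dob_1    = c("1998-11-26", "1996-06-22"),
  dob_2    = c("2000-01-29", "1998-03-05"),
  gender_1 = c(1L, 2L),
  gender_2 = c(2L, 2L)
)

# melt(): each pattern gathers one group of columns into its own value column.
long <- melt(kids, id.vars = "family",
             measure.vars = patterns("^dob_", "^gender_"),
             value.name = c("dob", "gender"),
             variable.name = "child")

# dcast(): several value.var columns are spread back out in one call.
back <- dcast(long, family ~ child, value.var = c("dob", "gender"))

# The table from the question.
set.seed(1234)
DT <- data.table(
  x = rep(c(1, 2, 3), each = 4),
  y = rep(c("A", "B"), times = 6),
  v = sample(1:100, 12)
)

# Several aggregation functions at once: one output column per function
# and level of y.
stats <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))

# rowid() in the formula numbers the repeated (x, y) rows, so every value
# keeps its own cell and no aggregation is needed.
byrow <- dcast(DT, x + rowid(x, y) ~ y, value.var = "v")
```

Here `long` has one row per family and child, `back` restores the dob_1, dob_2, gender_1 and gender_2 layout, and `byrow` has two rows per value of x because each (x, y) pair occurs twice in DT.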
Reminder of Cologne (Dec 2013) presentation slide 32: Why not submit a dcast pull request to reshape2?