This article describes how to group by two columns of a data.frame and convert the result into a matrix in R.

Problem description

I would like to convert the following data.frame into a matrix where the number of times each bike station id occurs per hour is counted.
> dim(test)
[1] 80623     5
> head(test, n = 10)
   bikeid end.station.id start.station.id diff.time hour
1   16052            244              322      6544   14
2   16052            284              432      3406   21
3   16052            461              519     33416    3
4   16052            228              519     26876   13
5   16052             72              435       388   17
6   16052            319              127     27702   11
7   16052            282             2002     33882    4
8   16052            524             2021      2525   10
9   16052            387              351      2397   12
10  16052            388              526     32507   13

The output should look like this:

> sample2
   start.station.id  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1                72 44  1 42 22  9 33 39 47 12 30 39 52 43 45 40 62  9 35 24 43 65 59 58 34
2                79 21 11  2 42  5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35  4 50 31
3                82 19 44  7 52 14 19  3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54  7
4                83 45 60 64  5  0  3 54 16 48 67 49 20 59 21 24 38 42 62 38 24  1 35 16  4
5               116 27 62 64 44 55 65 23 13 36  0 62 54 61  6 16  7 58 41 29  1 34 58 35 67
6               119 45 30 41 26  7 39 16 55 28 53 42  9  5 31 18 16 14 37 17 14 16 17 23 50
7               120  3  2  7 53 21 33 31 48 19 50 35 47  8 17 30  9 49  4 48 28 52  9 57 55
8               127 33 44 47 42  6 46 39 30 39 28 19 57 53 41 45 55  9 27 42 19 43 24 37 55
9               137 53 11 60  1 66 37 16  5  2 58  0 46 33  0 60 54 25 66 65 40 36 47 58 40
10              143 61  1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56

I was advised to use a formula similar to:

matrix <- test %>%
  group_by(start.station.id, hour) %>%
  summarise(sum = nrow) %>%
  spread(hour, nrow)

but I do not know how to code it properly.

Solution

Using data.table:

library(data.table) # 1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
      start.station.id ~ hour, value.var = "N")

Alternatively (slower, though cleaner):

dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")

Testing on some fake data:

set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, T),
                   hour = sample(24, NN, T))
library(microbenchmark)
microbenchmark(times = 100L,
               preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                              start.station.id ~ hour, value.var = "N"),
               postagg = dcast(test, start.station.id ~ hour,
                               fun.aggregate = length, value.var = "hour"))

Unit: milliseconds
    expr      min       lq      mean   median        uq      max neval
  preagg 55.83240 59.88939  66.56289 61.37408  64.37049 166.8902   100
 postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270   100

The reason the first is faster is that the operation test[ , .N, by = vars] has been optimized in data.table.
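Note that the question asks for a matrix, while dcast returns a data.table. A minimal sketch of the extra conversion step, assuming missing station/hour combinations should count as zero (the object names wide and m are illustrative, not from the original answer):

# continues from the data.table solution above
wide <- dcast(test[ , .N, by = .(start.station.id, hour)],
              start.station.id ~ hour, value.var = "N", fill = 0L)
m <- as.matrix(wide[ , !"start.station.id"])  # keep only the 24 hourly count columns
rownames(m) <- wide$start.station.id          # station ids become row names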
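For completeness, the dplyr/tidyr route the asker was attempting can be written roughly as follows. This is only a sketch, not the accepted answer; count() and pivot_wider() are used here in place of the summarise()/spread() calls from the question, and the result name sample2 mirrors the expected output above:

library(dplyr)
library(tidyr)

sample2 <- test %>%
  count(start.station.id, hour) %>%            # one row per station/hour pair with its count n
  pivot_wider(names_from = hour,
              values_from = n,
              values_fill = 0)                 # hours with no trips become 0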