


I was unhappy with the time dplyr and data.table were taking to create a new variable on my data.frame and decided to compare methods.

To my surprise, reassigning the results of dplyr::mutate() to a new data.frame seems to be faster than not doing so.



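library(data.table)  # fread(), copy(), := and (assumed here) month()
library(dplyr)       # mutate() and %>%

dt <- fread(".... data.csv") # load 200MB data file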

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)

a <- Sys.time()
dt1[, MONTH := month(as.Date(DATE))]
b <- Sys.time(); datatabletook <- b-a

c <- Sys.time()
dt_dplyr <- dt2 %>%
  mutate(MONTH = month(as.Date(DATE)))
d <- Sys.time(); dplyr_reassign_took <- d - c 

e <- Sys.time()
# result is not assigned, so it is auto-printed when run at the console
dt3 %>%
  mutate(MONTH = month(as.Date(DATE)))
f <- Sys.time(); dplyrtook <- f - e

datatabletook        = 17sec
dplyrtook            = 47sec
dplyr_reassign_took  = 17sec



.t0 <- Sys.time()
# ... code being timed ...
.t1 <- Sys.time()
.t1 - .t0

# or

system.time({
    # ... code being timed ...
})

With the Sys.time way, you're sending each line to the console and may see some return value printed for each line, as @Axeman suggested. With {...}, there is only one return value (the last result inside the braces) and system.time will suppress it from printing.


If the printing is costly enough but is not part of what you want to measure, it can make a difference.
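A minimal sketch of that effect, using base R only. The data frame `df` is a small hypothetical stand-in for the 200MB table, and `capture.output()` is used just to keep the console quiet: the result is still formatted for printing, so that cost still lands inside the first timing.

```r
# Hypothetical stand-in for the large table in the question
df <- data.frame(x = seq_len(5000), y = rnorm(5000))

# Result is printed: the cost of formatting 5000 rows is part of the timing
with_print <- system.time(
  printed <- capture.output(print(transform(df, z = x + y)))
)

# Result is assigned: nothing is printed, only the computation is timed
no_print <- system.time(
  res <- transform(df, z = x + y)
)

with_print[["elapsed"]]
no_print[["elapsed"]]
```

On most machines the first timing should come out noticeably larger, though the exact numbers will vary from run to run.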

There are good reasons to prefer system.time over Sys.time for benchmarking; from @MattDowle's comment:

ii) it includes user and sys time as well as elapsed wall clock time.

The Sys.time() way will be affected by reading your email in Chrome or using Excel while the test runs; the system.time() way won't, so long as you use the user and sys parts of the result.
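As a rough illustration of reading those parts, the sketch below times an arbitrary CPU-bound loop (a placeholder workload, not the code from the question) and pulls the user, sys and elapsed components out of the object that system.time() returns.

```r
# Placeholder workload: any CPU-bound expression would do here
tm <- system.time({
  s <- 0
  for (i in seq_len(1e6)) s <- s + sqrt(i)
})

# CPU time charged to this R process (user + system)
cpu <- tm[["user.self"]] + tm[["sys.self"]]

# Wall-clock time, which also reflects whatever else the machine was doing
wall <- tm[["elapsed"]]

cpu
wall
```

If `wall` is much larger than `cpu` for a single-threaded computation like this one, the process probably spent time waiting on something else, such as other programs competing for the CPU.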

