This article describes how to vectorize, or otherwise speed up, loop logic in R; the question and answer below may be a useful reference for anyone facing the same problem.

Problem description
Long time lurker, first time asker. I'm trying to calculate 'items in common between 2 sets of items' for a 20M+ items dataset. Sample data looks like this.

#serially numbered items
parents <- rep(1:10000)

#generate rnorm # of children items
numchild <- round(rnorm(10000, mean=30, sd=10))

#fill the parent-child list
parent_child <- list()
for (x in 1:length(parents)){
  if (numchild[x] > 0){
    f1 <- sample(1:length(parents), size=numchild[x])
    f2 <- list(parents[f1])
    parent_child <- c(parent_child, f2)
  } else {
    parent_child <- c(parent_child, list(x+1))  #if numchild=0, make up something
  }
}

Here is what I want to do: say parent item #1 has 5 child items -- 1, 2, 3, 4, 5 -- and parent item #2 has 3 child items -- 4, 10, 22.

I want to compute the length(intersection) of every (parent_i, parent_j) combination. In the above case, it would be 1 common item -- 4.

I am doing this for 10M+ parent items that on average have 15-20 child items, with counts in the (0, 100) range. So that's a 10M x 10M item-item matrix.

I have a foreach loop that I am testing out on a smaller subset; it works but doesn't quite scale for the full dataset (64-core machine with 256GB RAM). With the loop below I am already computing only half of the user-user matrix --> (parent_i, parent_j) is the same as (parent_j, parent_i) for this purpose.

#small subset
a <- parent_child[1:1000]
outerresults <- foreach (i = 1:(length(a)), .combine=rbind, .packages=c('foreach','doParallel')) %dopar% {
  b <- a[[i]]
  rest <- a[i+1:length(a)]
  foreach (j = 1:(length(rest)), .combine=rbind) %dopar% {
    common <- length(intersect(b, rest[[j]]))
    if (common > 0) {g <- data.frame(u1=i, u2=j+1, common)}
  }
}

I've been experimenting with variations on this (using Reduce, storing parent-children in a data frame, etc.) but haven't had much luck.

Is there a way to make this scale?
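(As a concrete reference point, the quantity wanted for each pair is just the length of the set intersection of the two child vectors; a minimal sketch using the example parents from the question:)

parent1 <- c(1, 2, 3, 4, 5)            # children of parent #1
parent2 <- c(4, 10, 22)                # children of parent #2
length(intersect(parent1, parent2))    # 1 -- the single shared child, item 4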
Solution

I reversed the split, so that we have a child-parent relationship

len <- sapply(parent_child, length)
child_parent <- split(rep(seq_along(parent_child), len),
                      unlist(parent_child, use.names=FALSE))

Something like the following constructs a string for each pair of parents sharing a child, across all children

keep <- sapply(child_parent, length) > 1
int <- lapply(child_parent[keep], function(x) {
    x <- combn(sort(x), 2)
    paste(x[1,], x[2,], sep=".")
})

and tallying

table(unlist(int, use.names=FALSE))

or a little more quickly

xx <- unlist(int, use.names=FALSE)
nms <- unique(xx)
cnt <- match(xx, nms)
setNames(tabulate(cnt, length(nms)), nms)

for

f1 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len),
                          unlist(parent_child, use.names=FALSE))
    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], function(x) {
        x <- combn(sort(x), 2)
        paste(x[1,], x[2,], sep=".")
    })
    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

with (this is for all 10000 parent-child elements)

> system.time(ans1 <- f1(parent_child))
   user  system elapsed
 14.625   0.012  14.668
> head(ans1)
542.1611 542.1832 542.2135 542.2435 542.2527 542.2806
       1        1        1        1        1        1

I'm not sure that this would really scale to the size of problem you're talking about, though -- it's polynomial in the number of parents per child.

One possibility for speed-up is to 'memoize' the combinatorial calculation, using the length of the argument as a 'key' and storing the combination as 'value'. This reduces the number of times combn is called to the number of unique lengths of elements of child_parent.

combn1 <- local({
    memo <- new.env(parent=emptyenv())
    function(x) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        paste(x[memo[[key]][,1]], x[memo[[key]][,2]], sep=".")
    }
})

f2 <- function(parent_child) {
    len <- sapply(parent_child, length)
    child_parent <- split(rep(seq_along(parent_child), len),
                          unlist(parent_child, use.names=FALSE))
    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn1)
    xx <- unlist(int, use.names=FALSE)
    nms <- unique(xx)
    cnt <- match(xx, nms)
    setNames(tabulate(cnt, length(nms)), nms)
}

which helps somewhat

> system.time(ans2 <- f2(parent_child))
   user  system elapsed
  5.337   0.000   5.347
> identical(ans1, ans2)
[1] TRUE

The slow part is now paste

> Rprof(); ans2 <- f2(parent_child); Rprof(NULL); summaryRprof()
$by.self
                 self.time self.pct total.time total.pct
"paste"               3.92    73.41       3.92     73.41
"match"               0.74    13.86       0.74     13.86
"unique.default"      0.40     7.49       0.40      7.49
"as.character"        0.08     1.50       0.08      1.50
"unlist"              0.08     1.50       0.08      1.50
"combn"               0.06     1.12       0.06      1.12
"lapply"              0.02     0.37       4.00     74.91
"any"                 0.02     0.37       0.02      0.37
"setNames"            0.02     0.37       0.02      0.37

$by.total
...

We can avoid this by encoding each pair of parents sharing a child into a single integer; because of the way floating point numbers are represented in R, this will be exact until about 2^21

encode <- function(x, y, n) (x - 1) * (n + 1) + y
decode <- function(z, n) list(x=ceiling(z / (n + 1)), y = z %% (n + 1))
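(A quick sanity check of the round trip; this is a minimal sketch, where 10000 is simply the number of parents in the generated example data:)

z <- encode(3, 42, 10000)   # pair (parent 3, parent 42) packed into one number: 20044
decode(z, 10000)            # recovers list(x = 3, y = 42)

Because each pair is now a single numeric id, unique, match and tabulate work on numbers rather than on the strings produced by paste, which is where the time was going.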
and adjusting our combn1 and f2 functions as

combn2 <- local({
    memo <- new.env(parent=emptyenv())
    function(x, encode_n) {
        key <- as.character(length(x))
        if (!exists(key, memo))
            memo[[key]] <- t(combn(length(x), 2))
        encode(x[memo[[key]][,1]], x[memo[[key]][,2]], encode_n)
    }
})

f3 <- function(parent_child) {
    encode_n <- length(parent_child)
    len <- sapply(parent_child, length)
    child_parent <- unname(split(rep(seq_along(parent_child), len),
                                 unlist(parent_child, use.names=FALSE)))
    keep <- sapply(child_parent, length) > 1
    int <- lapply(child_parent[keep], combn2, encode_n)
    id <- unlist(int, use.names=FALSE)
    uid <- unique(id)
    n <- tabulate(match(id, uid), length(uid))
    do.call(data.frame, c(decode(uid, encode_n), list(n=n)))
}

leading to

> system.time(f3(parent_child))
   user  system elapsed
  2.140   0.000   2.146

This compares very favorably (note that the timing in the previous line is for 10,000 parent-child relations) with jlhoward's revised answer

> system.time(result.3 <- do.call("rbind", lapply(1:99, gg)))
   user  system elapsed
  2.465   0.000   2.468
> system.time(f3(parent_child[1:99]))
   user  system elapsed
  0.016   0.000   0.014

and scales in a much more reasonable way.

For what it's worth, the data generation routine is in the second circle of Patrick Burns's R Inferno, using the 'copy-and-append' algorithm rather than pre-allocating the space and filling it in. Avoid this by writing the for loop body as a function and using lapply. Avoid the need for the complicated conditional in the for loop by fixing the issue beforehand

numchild <- round(rnorm(10000, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))

or by sampling from a distribution (rpois, rbinom) that generates positive integer values. Data generation is then

n_parents <- 10000
numchild <- round(rnorm(n_parents, mean=30, sd=10))
numchild[numchild < 0] <- sample(numchild[numchild > 0], sum(numchild < 0))
parent_child <- lapply(numchild, sample, x=n_parents)
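(A minimal sketch of the distribution-based alternative mentioned above, assuming a Poisson mean of 30 to match the rnorm parameters; the zero-count guard is only needed if every parent must have at least one child:)

n_parents <- 10000
numchild <- rpois(n_parents, lambda=30)                 # non-negative integer counts directly
numchild[numchild == 0] <- 1                            # optional: ensure every parent has at least one child
parent_child <- lapply(numchild, sample, x=n_parents)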
This concludes the article on how to vectorize or otherwise speed up loop logic in R; we hope the answer above proves helpful.