通过ddply设置数据帧的子集，然后在子集R上应用adply的函数

本文介绍了通过ddply设置数据帧的子集，然后在子集R上应用adply的函数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

使用plyr编写逻辑代码时遇到一些麻烦.我的问题涉及两个不同长度的大数据帧，其示例如下:

I am having some trouble with formulating a logical piece of code using plyr. My problem involves two big dataframes of different lengths, with sample as below:

dfSample <-
 structure(list(Type = structure(c(8L, 100L, 86L, 86L, 86L, 86L, 
 33L, 8L, 105L, 44L, 36L, 107L, 107L, 78L, 33L, 105L, 99L, 10L, 
 16L, 75L), .Label = c("Alumni Services", "Anti-Virus and Malware", 
 "Application Integration", "Application Monitoring", "Application Testing", 
 "Audio Visual Support", "Audio Visual Support - CLS", "Audio Visual Support - Non-CLS", 
 "Backup Services", "Banner", "Bus and Law", "Business Analysis", 
 "Careers", "Common Learning Spaces", "Communication and Marketing", 
 "Computer Aided Assessment", "Conference Accounts", "Content Management", 
 "Database Services", "Datacentre", "Desktop Monitoring", "Desktop Software", 
 "Document Management", "Email", "Email Programs", "Encryption", 
 "Eng and the Enviro", "Equipment Disposal", "Estates and Facilities", 
 "Examination Papers", "Faculty Engagement", "Filestore Support Services", 
 "Finance Services", "General Admin Services", "General InfoSec Advice", 
 "Generic Accounts", "Grid Accounts (HPC)", "Health Sciences", 
 "High Performance Computing (HPC)", "Hosted webspace (LAMP/IIS)", 
 "HR and Payroll Services", "HR General", "HR Recruitment", "HR Systems", 
 "Hub Rooms", "Humanities", "ICT Facilities", "ID Card Services", 
 "Identity Management (User accounts)", "Identity Services", "Information Policy Breaches", 
 "Information Risk Analysis", "iSolutions Admin Services", "iSolutions Administration", 
 "IT Training and Development", "Large File Transfer", "Lecture Capture", 
 "Lecture Capture - CLS", "Lecture Capture - Non-CLS", "Legacy Corporate Systems", 
 "Library Services", "Licence Management", "Managed Print Service", 
 "Management Servers", "Media Asset Management", "Media Support", 
 "Medicine", "Meet and Greet", "Misuse and Security Incidents", 
 "Misuse Of Systems", "Mobile Apps", "Mobile Devices", "Natural and Enviro Sci", 
 "Network Access Services", "Network Services", "OS Builds", "Other Learning Systems", 
 "Personal Filestore", "Personal web pages", "Phys and Applied", 
 "Printing (Managed)", "Printing (Not MPS)", "Project Management and Resourcing", 
 "Repair", "Reporting Services", "Request for Software", "Research Filestore", 
 "Research Governance", "Research Management", "Research Output", 
  "Resource Filestore", "Risk Analysis and Assessment", "Security", 
 "Self Service Help", "Server Monitoring", "Service Hosting", 
 "ServiceLine", "Soc and Human Sci", "Software Configuration Management", 
 "Software Licensing and Management", "Software Services", "SportRec", 
 "Staff Accounts", "Staff Desktop Deployment", "Staff Desktop Services", 
 "Staff Desktop Services (Not UoS Build)", "Student Accounts", 
 "Student Admin Services", "Student Personal Workstations", "SUSSED", 
 "Switchboard", "Switchboard Infrastructure", "System Access Request", 
 "Telephony", "University Admin Services", "Unmanaged Printing", 
 "Videoconferencing", "Videoconferencing - CLS", "Videoconferencing - Non-CLS", 
 "Virtual Learning Environment (VLE)", "Visitor Accounts", "Web Statistics", 
 "Windows Core Environment"), class = "factor"), Tkt.Category = structure(c(19L, 
 17L, 17L, 17L, 17L, 17L, 2L, 19L, 5L, 2L, 9L, 9L, 9L, 4L, 2L, 
 5L, 20L, 2L, 19L, 20L), .Label = c("Communication and Collaboration", 
 "Corporate Services", "Data Centre", "Data Storage Services", 
 "Desktop IT", "Faculty IT", "Help Services", "HR", "Identity Management (User accounts)", 
 "Information Security", "Logistics", "Programmes and Projects", 
 "Quality and Testing", "Research Services", "Security", "SLO Corporate Services", 
 "Software", "Standard", "Teaching Services", "Underpinning Services", 
 "Web Services"), class = "factor"), `CreateDateTime` = structure(c(1370087940, 
 1370156160, 1370162340, 1370178840, 1370190000, 1370240400, 1370242920, 
 1370243040, 1370243040, 1370243280, 1370243280, 1370243520, 1370243580, 
 1370243880, 1370243880, 1370244000, 1370244120, 1370244240, 1370244300, 
 1370244360), class = c("POSIXct", "POSIXt")), `ClosingDateTime` = structure(c(1374501300, 
 1372068300, 1379062020, 1390487100, 1379062080, 1375090560, 1373984760, 
 1370856420, 1370440140, 1370508240, 1370338080, 1370243820, 1370243700, 
 1370255520, 1370341440, 1370248680, 1370353560, 1370338800, 1370257140, 
 1374222600), class = c("POSIXct", "POSIXt"))), .Names = c("Type", 
 "Tkt.Category", "CreateDateTime", "ClosingDateTime"
 ), row.names = c(NA, 20L), class = "data.frame")

还有

DF2<-
 structure(list(DateTime = structure(c(1370041200, 1370052000, 
 1370062800, 1370073600, 1370084400, 1370095200, 1370106000, 1370116800, 
 1370127600, 1370138400, 1370149200, 1370160000, 1370170800, 1370181600, 
 1370192400, 1370203200, 1370214000, 1370224800, 1370235600, 1370246400
 ), class = c("POSIXct", "POSIXt"))), .Names = "DateTime", row.names = c(NA, 
 20L), class = "data.frame")

我正在尝试根据某些条件获取dfSample子集的长度，这些条件涉及DF2中每个Tkt.Category的数据，如下所示:

I am trying to get the length of the of a subset of dfSample based on some conditions which involve data from DF2 for each Tkt.Category as below:

QCalc <- function(m) {
  adply(DF2, 1, transform, q=as.character(
                               nrow(subset(m, CreateDateTime <= DateTime & 
                                              ClosingDateTime >= DateTime))))
}

ServiceQueue <- ddply(dfSample, .(Tkt.Category), QCalc)

这似乎不起作用，所以我猜想我为ddply部分制定函数的方式肯定有问题，因为下面的代码段在我使用所有数据时起作用(而不是按):

This does not seem to work, so I guessed there must be problem with the way I formulated the function for the ddply part since the piece of code below works when I am using all my data(not grouping by Tkt.Category):

Q <- adply(DF2, 1, transform, q=as.character(
                                   nrow(subset(dfSample, CreateDateTime<= DateTime &
                                                         `ClosingDateTime>= DateTime))))

使用ddply时，出现的错误消息是找不到对象'm'.有人可以指出我正确的方向来解决此问题吗?

When using ddply, the error message I get is that the object 'm' cannot be found. Could someone point me to the right direction to fix this problem?

推荐答案

如果我们可以重述您的问题，我认为我们可以找到一种更简单的解决方法.您要针对票证类别的每种类型和列表中的每个时间戳计数，在该票证开始之前，之后结束并具有该类别的票证数量.在SQL中，我们将编写如下内容:

If we can restate your problem, I think we can see an easier way to solve it. You want to count, for every type of ticket category and every timestamp in a list, how many tickets start before, end after, and have that category. In SQL we would write something like:

SELECT Tkt.Category, DateTime, count(*)
FROM dfSample join DF2 on
CreateDateTime<= DateTime 
and ClosingDateTime>= DateTime
GROUP BY Tkt.Category, DateTime

但这不是SQL(尽管应该如此，但您是否要从关系数据库中提取此数据?)，它的R-和基R不允许我们使用不等式进行合并.因此，我们可以对合并做一些技巧，并避免一起使用plyr:

But this isn't SQL (although maybe it should be, are you pulling this data from a relational db?), its R - and base R doesn't allow us to merge using inequalities. So instead we can do a little trick with merge and avoid plyr all together:

dfSample$id <- rownames(dfSample)
DFc <- merge(dfSample,DF2)
DFlimited <- DFc[DFc$CreateDateTime <= DFc$DateTime & DFc$ClosingDateTime >= DFc$DateTime,]
DFagg <- aggregate(id ~ Tkt.Category + DateTime, data = DFlimited, length)

这可能很慢，具体取决于表的大小，因为它实际上是在进行完全外部联接，然后进行过滤.如果发现这种情况，请查看Data.Table包-您可以查看问题.

This might be pretty slow, depending on the size of your tables, since its essentially doing a full outer join and then filtering. If you find that is the case, look into the Data.Table package - you can check out this Stack Overflow question for more.

这篇关于通过ddply设置数据帧的子集，然后在子集R上应用adply的函数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！