Problem description
Let's say I have a data table called YC that looks like this:
Categories:           colsums:   tillTF:
ID: cat               NA         0
MA                    NA         0
spayed                NA         0
declawed              NA         0
black                 NA         0
3                     NA         0
no                    57         1
claws                 NA         0
calico                NA         0
4                     NA         0
no                    42         1
striped               NA         0
0.5                   NA         0
yes                   84         1
not fixed             NA         0
declawed              NA         0
black                 NA         0
0.2                   NA         0
yes                   19         1
0.2                   NA         0
yes                   104        1
NH                    NA         0
spayed                NA         0
claws                 NA         0
striped               NA         0
12                    NA         0
no                    17         1
black                 NA         0
4                     NA         0
yes                   65         1
ID: DOG               NA         0
MA                    NA         0
...
Only 1) it's not actually a pivot table, it's just inconsistently formatted to look like one, and 2) the data is much more complicated and was entered inconsistently over the course of a few decades. The only assumption that can safely be made about the data is that there are 12 variables associated with each record, and they are always entered in the same order.
My goal is to parse this data so that each attribute and its associated numeric record are in the appropriate columns of a single row, like this:
Cat  MA  spayed     declawed  black    3    no   57
Cat  MA  spayed     claws     calico   0.5  no   42
Cat  MA  not fixed  declawed  black    0.2  yes  19
Cat  MA  not fixed  declawed  black    0.2  yes  104
Cat  NH  spayed     claws     striped  12   no   17
Cat  NH  spayed     claws     black    4    yes  65
Dog  MA  ....
I've written a for loop which identifies a "record" and then rewrites the values in an array by reading backwards up the column in the data table until another "record" is reached. I'm new to R, so I wrote out my ideal loop without knowing whether it was possible.
array<-rep(0, length(7))
for (i in 1:7)
  if(YC$tillTF[i]==1){
    array[7]<-(YC$colsums[i])
    array[6]<-(YC$Categories[i])
    array[5]<-(YC$Categories[i-1])
    array[4]<-(YC$Categories[i-2])
    array[3]<-(YC$Categories[i-3])
    array[2]<-(YC$Categories[i-4])
    array[1]<-(YC$Categories[i-5])
  }
YC_NT<-rbind(array)
Once array is filled in, I want to loop through YC and create a new row in YC_NT for each unique record:
for (i in 8:length(YC$tillTF))
  if (YC$tillTF[i]==1){
    array[8]<-(YC$colsums[i])
    array[7]<-(YC$Categories[i])
    if (YC$tillTF[i-1]==0){
      array[6]<-YC$Categories[i-1]
    }else{
      rbind(array, YC_NT)}
    if (YC$tillTF[i-2]==0){
      array[5]<-YC$Categories[i-2]
    }else{
      rbind(array, YC_NT)}
    if(YC$tillTF[i-3]==0){
      array[4]<-YC$Categories[i-3]
    }else{
      rbind(array, YC_NT)}
    if(YC$tillTF[i-4]==0){
      array[3]<-YC$Categories[i-4]
    }else{
      rbind(array, YC_NT)}
    if(YC$tillTF[i-5]==0){
      array[2]<-YC$Categories[i-5]
    }else{
      rbind(array, YC_NT)}
    if(YC$tillTF[i-6]==0){
      array[1]<-YC$Categories[i-6]
    }else{
      rbind(array, YC_NT)}
  }else{
    array<-array}
When I run this loop within a function on my data, I get my YC_NT data table back containing a single row. After spending a few days searching, I still don't know whether there is an R function that can add the vector array as the last row of a data table without giving it a unique name every time. My questions:
1) Is there a function that would add a vector called array to the end of a data table without overwriting a previous row called array?
2) If no such function exists, how could I create a new name for array every time my for loop reaches a new numeric record?
Thanks for your help,
Recommended answer
So I'm going to assume that a new record begins after every row where tillTF=1, and that the n variables specified for the next subject are just the last n variables, with the previous values all remaining the same. I'm also assuming that all records are "complete" in that the last line has tillTF=1. (To make that last statement true, I removed the last two lines from your sample.)
Here's how I might read the data in:
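# Read the fixed-width file, skipping the header line
dog <- read.fwf("dog.txt", widths=c(22,11,7), skip=1, stringsAsFactors=FALSE)
# Strip the padding (runs of two or more spaces) but keep single spaces such as "not fixed"
dog$V1 <- gsub("\\s{2,}","",dog$V1)
dog$V2 <- gsub("\\s","",dog$V2)
dog$V3 <- as.numeric(gsub("\\s","",dog$V3))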
So I read in the data here and strip off the extra spaces. Now I will add an ID column giving each record a unique ID, incrementing that value every time tillTF=1. Then I'll split the data on that ID value:
dog$ID<-c(0, cumsum(dog$V3[-nrow(dog)]))
dv <- lapply(split(dog, dog$ID), function(x) {
c(x$V1, x$V2[nrow(x)])}
)
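To see why the shifted cumsum produces record IDs, here is a minimal sketch on a made-up vector. The names till, id and groups are hypothetical and stand in for dog$V3 and dog$ID; the values are not taken from the real data.

```r
# Toy stand-in for the tillTF column: a 1 marks the last row of a record.
till <- c(0, 0, 1, 0, 1, 1, 0, 0, 1)

# Drop the last element and prepend 0, so the running total is shifted
# down by one row: the closing row keeps the ID of the record it ends,
# and the row after it starts the next ID.
id <- c(0, cumsum(till[-length(till)]))
print(id)
# 0 0 0 1 1 2 3 3 3

# split() then groups the row positions by record ID.
groups <- split(seq_along(till), id)
print(groups)
```

Each element of groups holds the row positions of one record, which is the shape lapply() receives in the step above.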
Now I'll go through the list with Reduce and each time replace the last n variables of the previous record with the n variables given for the current ID:
trans <- Reduce(function(a,b) {
  a[(length(a)-length(b)+1):length(a)] <- b
  a
}, dv, accumulate=TRUE)
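The carry-forward behaviour of Reduce with accumulate=TRUE may be easier to follow on a tiny list. This is only an illustrative sketch: dv_toy, fill_tail and full are hypothetical names, and the records are shortened to five fields.

```r
# Hypothetical toy input: the first element is a complete record,
# later elements hold only the trailing fields that changed.
dv_toy <- list(
  c("cat", "MA", "spayed", "black", "3"),
  c("calico", "4"),
  c("NH", "spayed", "striped", "12")
)

# Overwrite the tail of the previous full record with the new values.
fill_tail <- function(prev, new) {
  prev[(length(prev) - length(new) + 1):length(prev)] <- new
  prev
}

# accumulate = TRUE keeps every intermediate result, one per record.
full <- Reduce(fill_tail, dv_toy, accumulate = TRUE)
print(full[[2]])  # "cat" "MA" "spayed" "calico" "4"
print(full[[3]])  # "cat" "NH" "spayed" "striped" "12"
```

Note that this relies on the first element being a complete record; a partial first record would leave the leading fields undefined.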
Now I'll paste the fields of each record together with tabs and then use read.table to parse the result, do all the proper type conversions, and create a data frame:
dd <- read.table(text=sapply(trans, paste0, collapse="\t"), sep="\t")
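As a small illustration of what the paste-then-parse step does, the sketch below builds two tab-separated lines from made-up records and lets read.table infer the column types. The names rows, lines and toy are hypothetical.

```r
# Two made-up, already complete records (shortened to six fields).
rows <- list(
  c("ID: cat", "MA", "not fixed", "0.2", "yes", "19"),
  c("ID: cat", "NH", "spayed", "12", "no", "17")
)

# One tab-separated string per record.
lines <- sapply(rows, paste0, collapse = "\t")

# Using tab as the separator keeps values with spaces ("not fixed") intact,
# and read.table converts the numeric-looking columns for us.
toy <- read.table(text = lines, sep = "\t", stringsAsFactors = FALSE)
str(toy)
```

Here V4 should come back numeric and V6 integer, while the text columns stay character.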
That gives:
# print(dd)
       V1 V2        V3       V4      V5   V6  V7  V8
1 ID: cat MA    spayed declawed   black  3.0  no  57
2 ID: cat MA    spayed    claws  calico  4.0  no  42
3 ID: cat MA    spayed    claws striped  0.5 yes  84
4 ID: cat MA not fixed declawed   black  0.2 yes  19
5 ID: cat MA not fixed declawed   black  0.2 yes 104
6 ID: cat NH    spayed    claws striped 12.0  no  17
7 ID: cat NH    spayed    claws   black  4.0 yes  65
So as you can see, I left the "ID:" prefix on, but it should be easy enough to strip that off. These commands do the basic reshaping for you. The solution uses fewer arrays, if statements and rbind calls, which is nice, but I encourage you to make sure you understand each line if you want to use it.
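One possible way to strip the prefix, shown on a made-up vector rather than on dd itself (v1 and cleaned are hypothetical names):

```r
# Made-up values shaped like the first column of the result.
v1 <- c("ID: cat", "ID: cat", "ID: DOG")

# Remove a leading "ID:" and any whitespace that follows it.
cleaned <- sub("^ID:\\s*", "", v1)
print(cleaned)
# "cat" "cat" "DOG"
```

Applied to the real result, this would presumably be dd$V1 <- sub("^ID:\\s*", "", dd$V1), assuming V1 was read as character or is converted first.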
Also note that my output is slightly different from your expected output: you are missing the "84" value, and you have the calico with "42" listed as "0.5" rather than "4.0". So let me know if I was wrong in how I interpreted the data, or perhaps correct the example output.
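On the literal first question: rbind() returns a new object rather than modifying its arguments, and the loop in the question calls rbind(array, YC_NT) without assigning the result, which would explain why only one row survives. A minimal sketch of appending inside a loop, using hypothetical names (records, out, out2) and made-up rows:

```r
# Made-up rows standing in for the filled-in array at each record.
records <- list(c("cat", "MA", "57"), c("cat", "NH", "17"))

# Start from an empty container and reassign on every iteration;
# no unique name per row is needed.
out <- NULL
for (rec in records) {
  out <- rbind(out, rec)  # the result must be assigned back
}
print(dim(out))
# 2 3

# Often preferable: collect the rows in a list and bind once at the end,
# which avoids copying the growing matrix on every iteration.
out2 <- do.call(rbind, records)
print(out2)
```

Growing an object row by row works for small data; for long inputs the list-then-bind form tends to be faster.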