This article looks at the question "What is a good strategy for randomly accessing records in a very large FASTA file?" and collects the answers it received; they should be a useful reference for anyone facing the same problem.

Problem description




Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAG GAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
GCATTAAACATTCCGCCTGGG
....

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :-) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris

Solution



keeping an index in memory might be reasonable. the following class
creates an index file by scanning the FASTA file, and uses the "marshal"
module to save it to disk. if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

import os, marshal

class FASTA:

    def __init__(self, file):
        self.file = open(file)
        self.checkindex()

    def __getitem__(self, key):
        try:
            pos = self.index[key]
        except KeyError:
            raise IndexError("no such item")
        else:
            f = self.file
            f.seek(pos)
            header = f.readline()
            # sanity check: we should now be on the matching header line
            assert header == ">" + key + "\n"
            data = []
            while 1:
                line = f.readline()
                if not line or line[0] == ">":
                    break
                data.append(line)
            return data

    def checkindex(self):
        indexfile = self.file.name + ".index"

        try:
            self.index = marshal.load(open(indexfile, "rb"))
        except IOError:
            print "building index..."

            index = {}

            # scan the file
            f = self.file
            f.seek(0)
            while 1:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line[0] == ">":
                    # save offset to header line
                    header = line[1:].strip()
                    index[header] = pos

            # save index to disk
            f = open(indexfile, "wb")
            marshal.dump(index, f)
            f.close()

            self.index = index

db = FASTA("myfastafile.dat")
print db["CW127_A02"]

['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

tweak as necessary.

</F>
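The class above is Python 2 (note the print statement). A rough Python 3 sketch of the same offset-index idea, not part of the original answer, might look like the following; "myfastafile.dat" is just a placeholder name:

import marshal

class FASTA:

    def __init__(self, path):
        # binary mode so tell()/seek() offsets are exact byte positions
        self.file = open(path, "rb")
        self.indexfile = path + ".index"
        self.checkindex()

    def __getitem__(self, key):
        pos = self.index[key]              # KeyError for unknown records
        f = self.file
        f.seek(pos)
        header = f.readline()
        assert header.strip() == b">" + key.encode()
        data = []
        for line in f:
            if line.startswith(b">"):      # start of the next record
                break
            data.append(line.decode())
        return data

    def checkindex(self):
        try:
            with open(self.indexfile, "rb") as fp:
                self.index = marshal.load(fp)
        except OSError:
            print("building index...")
            index = {}
            f = self.file
            f.seek(0)
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    # save offset to header line
                    index[line[1:].strip().decode()] = pos
            with open(self.indexfile, "wb") as fp:
                marshal.dump(index, fp)
            self.index = index

db = FASTA("myfastafile.dat")
print(db["CW127_A02"])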



You don't need a RDBMS; I'd put it in a DBM or CDB myself.
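A minimal sketch of that DBM suggestion using Python 3's standard dbm module (the file names are placeholders, not from the original thread). The DBM file persists the header-to-offset mapping on disk, so the FASTA file only has to be scanned once and the index never has to be held in memory:

import dbm

def build_index(fasta_path, index_path):
    # scan the FASTA file once, recording the byte offset of each '>' header
    with dbm.open(index_path, "n") as index, open(fasta_path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                index[line[1:].strip()] = str(pos)

def fetch(fasta_path, index_path, name):
    # look the offset up in the DBM file and seek straight to the record
    with dbm.open(index_path, "r") as index, open(fasta_path, "rb") as f:
        f.seek(int(index[name.encode()]))
        f.readline()                       # skip the header line itself
        data = []
        for line in f:
            if line.startswith(b">"):      # start of the next record
                break
            data.append(line.decode())
        return data

build_index("myfastafile.dat", "myfastafile.db")
print(fetch("myfastafile.dat", "myfastafile.db", "CW127_A02"))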







That concludes this look at strategies for randomly accessing records in a very large FASTA file. We hope the answers above prove helpful.
