[Biopython] Suggestions for working with sequence data
Linlin Zhao <Linlin.Zhao <at> hhu.de>
2015-04-30 14:34:48 GMT
I am new to Biopython but have some experience with Python. Python is
my favourite language, and I really want to learn and apply Biopython.
I need to work on a large dataset: the whole-genome sequences of all
bacteria (~2000) from GenBank, about 80 GB in total. Each folder
within the dataset is one organism and includes several files for
different types of data.
My task is easy to describe, but I need some suggestions on how to work
through the whole dataset efficiently with Biopython:
1. Load the FASTA file .ffn (protein genes) in each folder, where each header looks like
>gi|158303474|gb|CP000828.1|:3233-4009 Acaryochloris marina MBIC11017, complete genome
Using the coordinates (3233-4009) of this gene, I then go to the .fna
file within the same folder, which has the whole genome sequence, and
read the 20 base pairs upstream (3213-3232) as the approximate promoter
for the gene. Finally I construct a new gene sequence with coordinates
3213-4009, which is just the original gene sequence with those 20 bp
added in front.
2. Search for motifs, for instance "TACGTC", in all those extended gene
sequences to find out how often they appear across all bacterial protein
genes.
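Steps 1 and 2 could be sketched in plain Python roughly as follows. This is only a sketch under several assumptions: the function names (parse_coords, with_upstream, count_motif) are made up for illustration, the coordinates in the header are taken to be 1-based and inclusive, only forward-strand genes are handled, and the genome is passed in as a plain string (with Biopython it could come from something like str(SeqIO.read(fna_path, "fasta").seq), provided the .fna file holds a single record).

```python
import re

# Matches the ":start-end" part of a header such as
# "gi|158303474|gb|CP000828.1|:3233-4009 Acaryochloris marina ..."
COORD_RE = re.compile(r":(\d+)-(\d+)")


def parse_coords(header):
    """Return (start, end) from an .ffn header, or None if not found.

    Assumption: coordinates are 1-based and inclusive, forward strand only.
    Headers in any other coordinate format simply return None here.
    """
    match = COORD_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def with_upstream(genome, start, end, upstream=20):
    """Return the gene plus `upstream` bases in front of it.

    `genome` is the whole genome as a string. The slice is clamped at the
    start of the sequence; wrapping around a circular genome is not handled.
    """
    begin = max(0, start - 1 - upstream)  # convert to 0-based, step back
    return genome[begin:end]


def count_motif(sequence, motif):
    """Count occurrences of `motif` in `sequence`, including overlapping ones."""
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)
    return count
```

For the example header above, parse_coords would give the pair (3233, 4009), and with_upstream would then slice positions 3213-4009 out of the genome string. Counting overlapping matches is a design choice; str.count would skip overlaps.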
I hope my description is clear. The problem, in short, is that I need to
work on two files in each of the 2000 folders: how should I load and
write the files?
Would anyone give some hints?
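One possible way to visit the two files in every folder is sketched below with the standard-library pathlib module. It assumes, and this is only an assumption about the dataset layout, that the organism folders sit directly under one root directory and that each .ffn file has a .fna file with the same base name next to it. The name paired_files is hypothetical.

```python
from pathlib import Path


def paired_files(root):
    """Yield (ffn_path, fna_path) pairs found one level below `root`.

    Assumption: each organism folder holds files such as NAME.ffn and
    NAME.fna that share a base name. An .ffn without a matching .fna
    is skipped.
    """
    for ffn in sorted(Path(root).glob("*/*.ffn")):
        fna = ffn.with_suffix(".fna")
        if fna.exists():
            yield ffn, fna
```

Because this is a generator, only one pair of paths is handled at a time, so each genome can be loaded, processed, and written out before moving on to the next folder instead of holding all 80 GB in memory.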
Thanks in advance,
Biopython mailing list - Biopython <at> mailman.open-bio.org