Michiel de Hoon | 1 Jul 2006 23:47

Fasta parser

Hi everybody,

The Biopython documentation shows the following approach to parsing a Fasta file:

 >>> from Bio import Fasta
 >>> parser = Fasta.RecordParser()
 >>> file = open("ls_orchid.fasta")
 >>> iterator = Fasta.Iterator(file, parser)
 >>> cur_record = iterator.next()

But for large Fasta files this is very slow compared to file.read(), 
which may be due to going through Martel (I believe the same was true 
for large GenBank files).

So I'm thinking about writing a simple-minded Fasta parser for better 
performance with large files. What I'm wondering about:
1) Is there some advantage that I overlooked of using Martel for parsing 
Fasta files?
2) Why is it necessary to create a parser first and pass it to 
Fasta.Iterator? Are there any cases where Fasta.Iterator uses something 
other than a Fasta.RecordParser?
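
A minimal sketch of what such a simple-minded parser could look like follows. It is only an illustration under stated assumptions: the name `parse_fasta` is hypothetical and not part of Biopython, and the function yields plain (title, sequence) tuples rather than any Biopython record type. It reads the file line by line, so memory use stays bounded by the largest single record.

```python
def parse_fasta(handle):
    """Yield (title, sequence) tuples from an open Fasta file handle.

    Hypothetical helper for illustration; not a Biopython function.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None and line:
            # Sequence lines are collected and joined once per record.
            chunks.append(line)
    # Emit the final record, which has no following title line.
    if title is not None:
        yield title, "".join(chunks)
```

Looping over `parse_fasta(open("ls_orchid.fasta"))` would then give one tuple per record, with no parser object to construct first.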

--Michiel.
Iddo Friedberg | 2 Jul 2006 00:52

Re: Fasta parser

Michiel,

There is actually a simple-minded Fasta reader/writer that does not use Martel: Bio.SeqIO.FASTA.

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org


Michiel de Hoon | 2 Jul 2006 06:43

Re: Fasta parser

Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than 
the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules. However, that raises a bunch 
of design questions (such as Fasta.Record versus SeqRecord, and Seq 
versus string), so it's probably better to hold off on it until after 
the next Biopython release, which, by the way, will be coming up soon.

Thanks,

--Michiel.


Iddo Friedberg | 2 Jul 2006 06:48

Re: Fasta parser

By (lack of?) design, my own Biopython-using code seems to be using both the Martel and non-Martel parsers. I
imagine others may be in the same situation. Point being: any design change should make sure that we stay
backward compatible.
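
One way to keep existing callers working is a thin compatibility shim, sketched below. Everything in it is an assumption made for illustration: `CompatIterator` is a hypothetical name, `records` stands for whatever a faster reader yields (here, (title, sequence) tuples), and `parser` is assumed to be any callable that turns such a tuple into a record object. Returning None at the end of the data is likewise an assumption about what older calling code expects.

```python
class CompatIterator(object):
    """Hypothetical shim offering the older iterator.next() calling style."""

    def __init__(self, records, parser=None):
        # 'records' is any iterable of (title, sequence) tuples.
        self._records = iter(records)
        # 'parser' is assumed to be a callable(title, sequence) -> record;
        # with None, the raw tuple is handed back unchanged.
        self._parser = parser

    def next(self):
        # Assumption: callers loop until they receive None, so exhaustion
        # is reported by returning None instead of raising StopIteration.
        try:
            title, sequence = next(self._records)
        except StopIteration:
            return None
        if self._parser is None:
            return title, sequence
        return self._parser(title, sequence)
```

In a real shim the constructor would presumably take the file handle and build the faster reader itself, so that the old two-argument call keeps its meaning.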

Thanks very much for your work on the Biopython release.

Cheers,

./I



Michiel de Hoon | 2 Jul 2006 16:58

New Biopython release coming up

Hi everybody,

The next Biopython release (1.42, code-named "Brooklyn") is coming up. 
I'm planning to finish this release about two weeks from now. The tests 
of Biopython in CVS all pass, so we are doing well. However, there are 
25 bugs listed in Bugzilla, so please have a look to see if there's 
something we can do about them. If you have some code sitting around, 
now would be a good time to commit it to CVS. However, if you are not 
sure if your code is ready for prime time, please hold off until after 
this release. Also, if you have a cvs checkout of Biopython, please make 
sure to update it before doing any commits to avoid overwriting.

Thanks everybody for your contributions to Biopython.

--Michiel.
Peter | 2 Jul 2006 20:11

Re: New Biopython release coming up


Sounds like a good plan, Michiel.

Did anyone get back to you about the NCBI Blast XML format? I would say 
parsing Blast output is a fairly important feature to a lot of users (I 
may of course be biased)...

Getting down to specifics:

Bugzilla Bug 1997 VARCHAR too small in SCOP tables
http://bugzilla.open-bio.org/show_bug.cgi?id=1997
The suggested fix looked OK to me, but as I've never used SCOP, a second 
opinion would be wise.


Colosimo, Marc E. | 2 Jul 2006 20:36

Re: Fasta parser


On 7/1/06 5:47 PM, "Michiel de Hoon" <mdehoon <at> c2b2.columbia.edu> wrote:

> Hi everybody,
> 
> The Biopython shows the following approach to parsing a Fasta file:
> 
>  >>> from Bio import Fasta
>  >>> parser = Fasta.RecordParser()
>  >>> file = open("ls_orchid.fasta")
>  >>> iterator = Fasta.Iterator(file, parser)
>  >>> cur_record = iterator.next()
> 
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
> 
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?

Yes!!!! I use Fasta.SequenceParser, which gives me a SeqRecord object
(Bio.SeqRecord), not some odd Fasta.Record object that I would then have to
remap into a SeqRecord.

Also, could someone re-run epydoc! My changes in the code have not made it
to the on-line API docs.

Colosimo, Marc E. | 2 Jul 2006 21:12

Re: BioPython Design

Michiel,

When will this next release be made and what is going into it?

Since you brought up the issue of design questions, I'll have my little rant
now. But first, I would like to say that I think it is great that people
contribute code and, more importantly, their time to this project. Without
all of the core developers there would be no BioPython. So, kudos to anyone
who has contributed code. Now on to my rant....

<rant>
I'm not a big user of either BioPerl or BioJava. However, they are well
structured and more consistent than BioPython. This FastaIO issue is one of
several design issues that really need to be addressed.

For example, both BioPerl and BioJava use a SeqIO object structure. Our
SeqIO module is heavily underused. For example, we have Fasta, GenBank,
LocusLink, NBRF, SwissProt, and UniGene as main modules. Interestingly, there is a
writers.SeqRecord.embl, but I can't quickly find something to read in an EMBL
file!

Just look at what BioPerl can read in
<http://www.bioperl.org/wiki/HOWTO:SeqIO> and how easy it is to find this
out (even without the doc page, all of these are listed under
Bio::SeqIO::*).

There is a very short "Coding Convention"
<http://biopython.org/wiki/Contributing#Coding_conventions>, which doesn't
seem to be followed all that well.


Michiel de Hoon | 2 Jul 2006 22:54

Re: Fasta parser

>> 2) Why is it necessary to create a parser first and passing it to
>> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
>> other than a Fasta.RecordParser?
> 
> Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
> (Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
> remap into a SeqRecord.

I see. This is one of the design issues I ran into when comparing 
Bio.Fasta and Bio.SeqIO.FASTA: Whether parsing a Fasta file should 
result in a Fasta.Record object or a SeqRecord.
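
The choice between the two result types can be pictured as picking a record factory, as in the sketch below. The names are hypothetical stand-ins: `PlainRecord` and `AnnotatedRecord` are not Biopython's Fasta.Record or SeqRecord classes, and taking the first word of the title as the identifier is an assumption made only for this example.

```python
from collections import namedtuple

# Hypothetical stand-ins, NOT Biopython's Fasta.Record or SeqRecord.
PlainRecord = namedtuple("PlainRecord", ["title", "sequence"])
AnnotatedRecord = namedtuple("AnnotatedRecord", ["id", "description", "seq"])


def to_plain(title, sequence):
    """Keep the title line and the sequence as plain strings."""
    return PlainRecord(title, sequence)


def to_annotated(title, sequence):
    """Use the first word of the title as id, the whole title as description."""
    words = title.split(None, 1)
    identifier = words[0] if words else ""
    return AnnotatedRecord(identifier, title, sequence)


def convert(pairs, factory=to_plain):
    """Apply one record factory to every (title, sequence) pair."""
    for title, sequence in pairs:
        yield factory(title, sequence)
```

With this split, the file-reading code stays the same and only the factory passed to `convert` decides which kind of object the caller receives.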

> Also, could someone re-run epydoc! My changes in the code have not made it
> to the on-line API docs.

Done.

--Michiel.
Michiel de Hoon | 2 Jul 2006 23:19

Re: BioPython Design

Colosimo, Marc E. wrote:
> When will this next release be made ...
I'm planning for the weekend of 15/16 July.

> ... and what is going into it?
Whatever is in CVS at that time. So essentially today's CVS plus as many 
bug fixes as possible. I'd hold off on any major changes until after 
the release.

> <rant>
> </rant>

I pretty much agree with Marc here.

> My suggestion is if enough people are going to ISMB this year
> (which I am not), that time should be made to think about a
> road map for BioPython.

Unfortunately, I won't be going either. A Biopython road map seems like 
a good idea though.

> My suggestions are:
> 1) split off a branch for ver 2.0 that supports Python 2.4 only
> (this would suck for Mac people, like me, but it's time to move on)

Is there something essential in 2.4 that's missing in 2.3? Not that I 
object to supporting 2.4 only; I'm just wondering. I'd be hesitant to 
split off a separate branch, though, since Biopython is confusing 
enough as it is.


