Getting raw unparsed records with SeqIO?
Peter <biopython <at> maubp.freeserve.co.uk>
2010-02-02 18:37:25 GMT
Hi all,
Over on enhancement Bug 3000, Martin was asking about
getting raw unparsed strings for each record in a sequence file:
http://bugzilla.open-bio.org/show_bug.cgi?id=3000
This makes sense for sequential files like FASTA and GenBank,
but not for interlaced files like PHYLIP, and has less obvious
uses when there is any kind of header or footer (e.g. XML or
SFF files).
The particular example Martin gave was selecting a subset of
records in a large GenBank file (I've done this myself in the past).
While this can be done via Bio.SeqIO, the process of parsing
the data into a SeqRecord and saving it again is lossy, and
there is room for improvement there. For this particular example,
I suggested Martin use the "old" iterator class in Bio.GenBank.
In general things like white space and wrapping mean that a
SeqIO parse/write cannot guarantee a 100% unaltered round
trip, and will also be slower than using the raw record as a string.
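
To make the wrapping point concrete, here is a toy sketch that does
not use Biopython at all. The helpers parse_fasta_record and
write_fasta_record are hypothetical names invented for illustration,
and the 60 column output width is just an assumed writer setting.
Any parser that discards the original line breaks and any writer
that re-wraps at a fixed width will behave much like this:

```python
def parse_fasta_record(text):
    """Split one FASTA record into (title, sequence), discarding line breaks."""
    lines = text.splitlines()
    title = lines[0][1:]  # drop the leading ">"
    sequence = "".join(line.strip() for line in lines[1:])
    return title, sequence


def write_fasta_record(title, sequence, width=60):
    """Write one FASTA record, re-wrapping the sequence at a fixed width."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (title, "\n".join(chunks))


# A made-up record whose sequence was wrapped at 80 columns.
original = ">example hypothetical record\n" + "ACGT" * 20 + "\n" + "ACGT" * 5 + "\n"

title, sequence = parse_fasta_record(original)
round_tripped = write_fasta_record(title, sequence)

# The sequence survives, but the text is no longer identical.
print(round_tripped == original)
print(parse_fasta_record(round_tripped) == (title, sequence))
```

The data is preserved but the raw text is not, which is the kind of
difference that matters if you want the selected records unaltered.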
Martin suggested adding an optional argument to the parse
function. I'm not sure this is a good API choice, as it would
dramatically alter the return values. Perhaps we could have
a new iterator function in Bio.SeqIO for suitable sequential
files only which returns a series of strings, one for each
record, unmodified?
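
As a rough sketch of what such an iterator might look like for two
sequential formats (this is not an existing Bio.SeqIO function; the
names raw_fasta_records and raw_genbank_records are made up here, and
the record boundaries assumed are a ">" line for FASTA and a "//"
line for GenBank):

```python
import io


def raw_fasta_records(handle):
    """Yield each FASTA record as an unmodified string (hypothetical helper)."""
    record = []
    for line in handle:
        if line.startswith(">"):
            if record:
                yield "".join(record)
            record = [line]
        elif record:
            # Lines before the first ">" are ignored in this sketch.
            record.append(line)
    if record:
        yield "".join(record)


def raw_genbank_records(handle):
    """Yield each GenBank record as an unmodified string (hypothetical helper)."""
    record = []
    for line in handle:
        if not record and not line.startswith("LOCUS"):
            continue  # skip anything between records
        record.append(line)
        if line.rstrip() == "//":
            yield "".join(record)
            record = []
    # A trailing record with no "//" terminator is dropped in this sketch.


# Made-up, heavily truncated GenBank text, just to exercise the iterator.
genbank_text = (
    "LOCUS       A\nORIGIN\n        1 acgt\n//\n"
    "LOCUS       B\nORIGIN\n        1 ttga\n//\n"
)
records = list(raw_genbank_records(io.StringIO(genbank_text)))

# Select a subset by the name on the LOCUS line, keeping the raw text.
wanted = {"B"}
selected = [r for r in records if r.split()[1] in wanted]
print("".join(selected))
```

Because each record comes back as a plain string, writing the
selected ones to a new file leaves white space and wrapping exactly
as they were in the input.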
Either way I don't see how this would be used - surely