Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
Peter <biopython <at> maubp.freeserve.co.uk>
2010-06-04 09:16:19 GMT
On Fri, Jun 4, 2010 at 9:49 AM, Leighton Pritchard <lpritc <at> scri.ac.uk> wrote:
> Hi,
>
> On 03/06/2010 Thursday, June 3, 18:52, "Peter"
> <biopython <at> maubp.freeserve.co.uk> wrote:
>
>> There are two major additions that have been discussed (and some
>> code written too): gzip support and storing the index on disk.
>
> [...]
>
>> Now ideally we'd be able to offer both of these features - but if
>> you had to vote, which would be most important and why?
>
> On-disk indexing. But does this not also lend itself (perhaps
> eventually...) to storing the whole dataset in SQLite or similar to
> avoid syncing problems between the file and the index? Wasn't that also
> part of a discussion on the BIP list some time ago?

That is a much more complicated problem - serialising data from many
different possible file formats. We have BioSQL, which is pretty good
for things like GenBank, EMBL, SwissProt etc., but it is not suitable for
FASTQ. I'd rather stick to the simpler task of recording a lookup table
mapping record identifiers to file offsets.

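To illustrate what I mean by a lookup table, here is a rough sketch for
FASTA files only. It is not the Bio.SeqIO.index() code: the function names,
the SQLite table layout and the column names are all made up for this
example, and it assumes the identifier is the first word after the '>'.

```python
import sqlite3


def build_offset_index(path):
    """Map each FASTA record identifier to the byte offset of its '>' line."""
    index = {}
    # Binary mode so that tell() returns true byte offsets.
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:
                    # Assumption: the identifier is the first word of the title.
                    index[fields[0].decode("ascii")] = offset
    return index


def save_index(index, db_path):
    """Store the lookup table on disk (hypothetical one-table schema)."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(record_id TEXT PRIMARY KEY, file_offset INTEGER)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)", index.items()
        )
    con.close()


def fetch_raw(path, db_path, record_id):
    """Return the raw bytes of one record by seeking to its stored offset."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT file_offset FROM offsets WHERE record_id = ?", (record_id,)
    ).fetchone()
    con.close()
    if row is None:
        raise KeyError(record_id)
    with open(path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the '>' title line
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines)
```

The index only holds identifiers and offsets, so it stays small whatever the
file format, but it goes stale if the sequence file is edited afterwards.
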
> I've not looked at how you're already parsing from gzip files, so I hope
> it's more time-efficient than what I used to do for bzip, which was write a
> Pyrex wrapper to Flex, which was using the bzip2 library directly. This was
> not a speed improvement over uncompressing the file each time I needed to
> open it (and then using Flex). The same is true for Python's gzip module: