Peter | 1 Jun 2010 01:11

Re: [Biopython] Cross_match

On Mon, May 31, 2010 at 11:08 PM, Chris Fields wrote:
>
> Yes, but only one file:
>
> http://github.com/bioperl/bioperl-live/blob/master/t/data/testdata.crossmatch
>
> chris
>

Thanks Chris - that saved me searching ;)

Alvaro - would you just want the pairwise alignment,
or are you interested in the hit information (scores etc)?
I'm wondering if adding support in Bio.AlignIO would
be enough (similar to how we support FASTA -m 10
output already).

Peter
_______________________________________________
Biopython mailing list  -  Biopython <at> lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

Alvaro F Pena Perea | 1 Jun 2010 14:37

Re: [Biopython] Cross_match

Sorry about the mistake, I meant to say Biopython instead of BioPerl. Well, now
I know that SearchIO supports cross_match in BioPerl.
Anyway, I agree that it would be great if we could add some support for
cross_match in Biopython. In my case I'm more interested in the hit
information than the pairwise alignment. I'll take a look at Bio.AlignIO.

2010/5/31 Peter <biopython <at> maubp.freeserve.co.uk>

> On Mon, May 31, 2010 at 11:08 PM, Chris Fields wrote:
> >
> > Yes, but only one file:
> >
> >
> http://github.com/bioperl/bioperl-live/blob/master/t/data/testdata.crossmatch
> >
> > chris
> >
>
> Thanks Chris - that saved me searching ;)
>
> Alvaro - would you just want the pairwise alignment,
> or are you interested in the hit information (scores etc)?
> I'm wondering if adding support in Bio.AlignIO would
> be enough (similar to how we support FASTA -m 10
> output already).
>
> Peter
>

Peter | 1 Jun 2010 14:57

Re: [Biopython] Cross_match

On Tue, Jun 1, 2010 at 1:37 PM, Alvaro F Pena Perea
<alvin <at> pasteur.edu.uy> wrote:
> Sorry about the mistake, I meant to say Biopython instead of BioPerl. Well, now
> I know that SearchIO supports cross_match in BioPerl.
> Anyway, I agree that it would be great if we could add some support for
> cross_match in Biopython. In my case I'm more interested in the hit
> information than the pairwise alignment. I'll take a look at Bio.AlignIO.

Just to clarify - Bio.AlignIO does not currently support cross_match,
but if it did, the focus would be on the pairwise alignments.

Peter

Peter | 3 Jun 2010 19:52

[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Dear Biopythoneers,

We've had several discussions (mostly on the development list) about
extending the Bio.SeqIO.index() functionality. For a quick recap of what
you can do with it right now, see:

http://news.open-bio.org/news/2009/09/biopython-seqio-index/
http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/

There are two major additions that have been discussed (and some
code written too): gzip support and storing the index on disk.

Currently Bio.SeqIO.index() has to re-index the sequence file each
time you run your script. If you run the same script often, it would be
useful to be able to save the index information to disk.  The idea is
that you can then load the index file and get almost immediate
random access to the associated sequence file (without waiting to scan
the file to rebuild the index). The old OBDA style indexes used by
BioPerl, BioRuby etc are one possible file format we might use, but
a simple SQLite database may be preferable. This also would give
us a way to index really big files with many millions of reads without
keeping the file offsets in memory. This is going to be important for
random access to the latest massive sequencing data files.

Next, support for indexing compressed files (initially concentrating
on Unix style gzipped files, e.g. example.fasta.gz) without having
to decompress the whole file. You can already parse these files
with Bio.SeqIO in conjunction with the Python gzip module. It would
be nice to be able to index them too.


Renato Alves | 4 Jun 2010 10:10

Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Hi Peter and all,

Considering the fact that the first addition 'is' a potential problem,
while the second is more of an optimization, I would put my vote on the
first. In addition, an sqlite or similar solution would also allow one
to use the indexing feature on short run applications where
recalculating the index every time is a costly (sometimes too much)
operation.

Obviously the second would be of great use if put together with the
first, but I'm a little bit biased on that since I was part of the group
that raised the gzip question in the mailing list some time ago.

Regards,
Renato

Quoting Peter on 06/03/2010 06:52 PM:
> Dear Biopythoneers,
> 
> We've had several discussions (mostly on the development list) about
> extending the Bio.SeqIO.index() functionality. For a quick recap of what
> you can do with it right now, see:
> 
> http://news.open-bio.org/news/2009/09/biopython-seqio-index/
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
> 
> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.
> 
> Currently Bio.SeqIO.index() has to re-index the sequence file each

Peter | 4 Jun 2010 10:42

Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

On Fri, Jun 4, 2010 at 9:10 AM, Renato Alves <rjalves <at> igc.gulbenkian.pt> wrote:
> Hi Peter and all,
>
> Considering the fact that the first addition 'is' a potential problem,
> while the second is more of an optimization, I would put my vote on the
> first. In addition, an sqlite or similar solution would also allow one
> to use the indexing feature on short run applications where
> recalculating the index every time is a costly (sometimes too much)
> operation.
>
> Obviously the second would be of great use if put together with the
> first, but I'm a little bit biased on that since I was part of the group
> that raised the gzip question in the mailing list some time ago.
>
> Regards,
> Renato

Hi Renato,

Unfortunately I was inconsistent about which order I used in my email
(gzip vs on-disk indexes) so I'm not sure which you are talking about.
Are you saying supporting on-disk indexes would be your priority (even
though you did ask us to look at gzip support in the past)?

Peter

Leighton Pritchard | 4 Jun 2010 10:49

Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Hi,

On 03/06/2010 Thursday, June 3, 18:52, "Peter"
<biopython <at> maubp.freeserve.co.uk> wrote:

> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.

[...]

> Now ideally we'd be able to offer both of these features - but if
> you had to vote, which would be most important and why?

On-disk indexing.  But does this not also lend itself (perhaps
eventually...) to storing the whole dataset in SQLite or similar to
avoid syncing problems between the file and the index?  Wasn't that also
part of a discussion on the BIP list some time ago?

I've not looked at how you're already parsing from gzip files, so I hope
it's more time-efficient than what I used to do for bzip, which was write a
Pyrex wrapper to Flex, which was using the bzip2 library directly.  This was
not a speed improvement over uncompressing the file each time I needed to
open it (and then using Flex).  The same is true for Python's gzip module:

-rw-r--r--  1 lpritc  staff   110M 14 Apr 14:22 phytophthora_infestans_data.tar.gz

$ time gunzip phytophthora_infestans_data.tar.gz

real    0m18.359s

Peter | 4 Jun 2010 11:16

Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

On Fri, Jun 4, 2010 at 9:49 AM, Leighton Pritchard <lpritc <at> scri.ac.uk> wrote:
> Hi,
>
> On 03/06/2010 Thursday, June 3, 18:52, "Peter"
> <biopython <at> maubp.freeserve.co.uk> wrote:
>
>> There are two major additions that have been discussed (and some
>> code written too): gzip support and storing the index on disk.
>
> [...]
>
>> Now ideally we'd be able to offer both of these features - but if
>> you had to vote, which would be most important and why?
>
> On-disk indexing.  But does this not also lend itself (perhaps
> eventually...) also to storing the whole dataset in SQLite or similar to
> avoid syncing problems between the file and the index?  Wasn't that also
> part of a discussion on the BIP list some time ago?

That is a much more complicated problem - serialising data from many
different possible file formats. We have BioSQL which is pretty good
for things like GenBank, EMBL, SwissProt etc but not suitable for FASTQ.
I'd rather stick to the simpler task of recording a lookup table mapping
record identifiers to file offsets.
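
A minimal sketch of such a lookup table, assuming a plain FASTA file and
using only the standard library. The function names and the demonstration
data are illustrative assumptions, not Biopython API:

```python
import os
import tempfile


def build_offset_table(path):
    """Map each FASTA record identifier to the byte offset of its '>' line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumption: the identifier is the first word after '>'.
                words = line[1:].split(None, 1)
                if words:
                    offsets[words[0].decode("ascii")] = offset
    return offsets


def read_record(path, offset):
    """Seek to a stored offset and return the raw text of that one record."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Tiny demonstration file (made-up data).
fd, demo_path = tempfile.mkstemp(suffix=".fasta")
with os.fdopen(fd, "wb") as out:
    out.write(b">alpha first\nACGT\nAC\n>beta second\nGGCC\n")

table = build_offset_table(demo_path)
beta_text = read_record(demo_path, table["beta"])
os.remove(demo_path)
```

Only the identifier and one integer per record are kept, so nothing about
the record format itself needs to be serialised.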

> I've not looked at how you're already parsing from gzip files, so I hope
> it's more time-efficient than what I used to do for bzip, which was write a
> Pyrex wrapper to Flex, which was using the bzip2 library directly.  This was
> not a speed improvement over uncompressing the file each time I needed to
> open it (and then using Flex).  The same is true for Python's gzip module:

Kevin | 4 Jun 2010 12:53

Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

I vote for the sqlite index. I have been using bsddb to do the same but the
db is inflated compared to plain text. Performance is not bad using
btree.

For gzip I feel it might be possible to gunzip into a stream which  
biopython can parse on the fly?

Kev

Sent from my iPod

On 04-Jun-2010, at 1:52 AM, Peter <biopython <at> maubp.freeserve.co.uk>  
wrote:

> Dear Biopythoneers,
>
> We've had several discussions (mostly on the development list) about
> extending the Bio.SeqIO.index() functionality. For a quick recap of what
> you can do with it right now, see:
>
> http://news.open-bio.org/news/2009/09/biopython-seqio-index/
> http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/
>
> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.
>
> Currently Bio.SeqIO.index() has to re-index the sequence file each
> time you run your script. If you run the same script often, it would be

Peter | 4 Jun 2010 14:59

Re: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

On Fri, Jun 4, 2010 at 11:53 AM, Kevin <aboulia <at> gmail.com> wrote:
> I vote for sqlite index. Have been using bsddb to do the same but the db
> is inflated compared to plain text. Performance is not bad using btree

The other major point against bsddb is that future versions of Python
will not include it in the standard library - but Python 2.5+ does have
sqlite3 included.
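
As a rough sketch of what an sqlite3-backed index could look like, using
only the standard library. The table name, column names and helper
functions are assumptions made for illustration, not an agreed design:

```python
import os
import sqlite3
import tempfile


def save_index(db_path, offsets):
    """Write an {identifier: file offset} mapping to an SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS offset_index "
            "(record_id TEXT PRIMARY KEY, file_offset INTEGER)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO offset_index (record_id, file_offset) "
            "VALUES (?, ?)",
            offsets.items(),
        )
        con.commit()
    finally:
        con.close()


def lookup_offset(db_path, record_id):
    """Return the stored offset for one identifier without loading the rest."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT file_offset FROM offset_index WHERE record_id = ?",
            (record_id,),
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise KeyError(record_id)
    return row[0]


# Demonstration with made-up identifiers and offsets.
db_dir = tempfile.mkdtemp()
db_path = os.path.join(db_dir, "demo.idx")
save_index(db_path, {"alpha": 0, "beta": 21})
alpha_offset = lookup_offset(db_path, "alpha")
beta_offset = lookup_offset(db_path, "beta")
os.remove(db_path)
os.rmdir(db_dir)
```

Each lookup is a single query against the primary key, so the offsets never
need to be held in memory all at once.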

> For gzip I feel it might be possible to gunzip into a stream which
> biopython can parse on the fly?

Yes of course, like this:

import gzip
from Bio import SeqIO

# Open in text mode ("rt") so the parser receives text rather than bytes.
with gzip.open("uniprot_sprot.dat.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "swiss"):
        print(record.id)

Parsing is easy - the point of this discussion is random access to
any record within the stream (which requires jumping to an offset).
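
To illustrate the difficulty with made-up data: Python's gzip file objects
do accept seek(), but the offsets refer to the decompressed data, and as far
as I understand the module the seek is emulated by decompressing everything
before the target. A standard-library sketch (the file contents and variable
names are illustrative):

```python
import gzip
import os
import tempfile

# Write a tiny gzipped FASTA file (made-up data).
fd, gz_path = tempfile.mkstemp(suffix=".fasta.gz")
os.close(fd)
with gzip.open(gz_path, "wb") as out:
    out.write(b">alpha\nACGT\n>beta\nGGCC\n")

# Record where each header starts. tell() counts bytes of *decompressed*
# data, not bytes in the .gz file on disk.
offsets = {}
with gzip.open(gz_path, "rb") as handle:
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            offsets[line[1:].strip().decode("ascii")] = offset

# Jump straight to the second record. This works, but the gzip module has
# to decompress and discard the data before the requested offset.
with gzip.open(gz_path, "rb") as handle:
    handle.seek(offsets["beta"])
    header = handle.readline()
    sequence = handle.readline()
os.remove(gz_path)
```

So the cost of reaching a record grows with its position in the file, which
is the part an index for compressed files would need to avoid.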

Peter