Massimo Di Pierro | 1 May 2011 06:51

Re: biopython web interface

Hello Andrea

I am looking at something a little different from what you are doing, but we should definitely collaborate.
I am trying to identify tasks that are not domain specific that could benefit more than one scientific community.

It seems to me that all scientific communities have data, have programs (whether in Python or not is irrelevant to me), and
have a workflow.
They all need:
1) a tool to post the data online in a semi-automated fashion
2) a tool to share data easily (both via web interface and scripting via web service) with access control
3) a way to annotate the data as in a CMS
4) a mechanism to connect data with a workflow so that certain programs are executed automatically when new
data is uploaded to the system. The programs may require user input, so it should be possible to
register a task (a program) by describing what input data and what user input it needs, and the
system should automatically generate an interface.
5) an interface to local clusters and grid resources to submit computing jobs to
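
A minimal sketch of how point 4 might look, assuming a simple in-memory registry. The names Task, register, form_fields and tasks_for_format, and the "aligner" command template, are hypothetical and only illustrate the idea: a task declares the data and user input it needs, and both the form and the "run on upload" lookup are derived from that description.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A registered program plus a description of what it needs."""
    name: str
    command: str  # hypothetical command-line template
    data_inputs: dict = field(default_factory=dict)  # input name -> file format
    user_inputs: dict = field(default_factory=dict)  # input name -> Python type


REGISTRY = {}


def register(task):
    """Add a task to the in-memory registry and return it."""
    REGISTRY[task.name] = task
    return task


def form_fields(task):
    """Derive a list of form fields from the task description."""
    fields = [{"name": name, "widget": "file", "format": fmt}
              for name, fmt in task.data_inputs.items()]
    widgets = {int: "number", float: "number", bool: "checkbox", str: "text"}
    fields += [{"name": name, "widget": widgets.get(kind, "text")}
               for name, kind in task.user_inputs.items()]
    return fields


def tasks_for_format(fmt):
    """Names of tasks that could be triggered when a file of this format arrives."""
    return sorted(t.name for t in REGISTRY.values()
                  if fmt in t.data_inputs.values())


# Hypothetical example task: an aligner that takes a FASTA file and one number.
register(Task("align", "aligner {sequences} --gap {gap_penalty}",
              data_inputs={"sequences": "fasta"},
              user_inputs={"gap_penalty": float}))
print(form_fields(REGISTRY["align"]))
print(tasks_for_format("fasta"))
```

A real system would persist the registry and render the fields as HTML, but the description-driven part is the same.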

I do not have the resources or the expertise to build an interface specific to Biopython, but I think we
should collaborate because if what I am doing is general enough (and I am not sure it is unless we talk more
about it), it could be used to create an interface to Biopython with minimal programming.

I understand your focus is on algorithms, but I need to start with data. In my experience, it is very difficult
to automate the workflow of algorithms if there is no standard exchange format for the data.

The first things I need to understand are:
- does Biopython handle some standard file formats? What do they contain? How can they be recognized? Can
you send me a few examples?
- is there a graph of which algorithms run on which file types?
- what are the most common algorithms? Can you point me to the source?
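
On the "how can they be recognized?" question, one rough approach is to look at the first non-blank line of a file. The sketch below is only a heuristic written in plain Python; guess_format is a hypothetical helper, not a Biopython function. It relies on well-known markers: FASTA records start with ">", GFF files declare "##gff-version", GenBank records start with "LOCUS", and FASTQ records start with "@".

```python
def guess_format(path):
    """Guess a sequence file format from its first non-blank line (heuristic)."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith("##gff-version"):
                return "gff"
            if line.startswith(">"):
                return "fasta"
            if line.startswith("LOCUS"):
                return "genbank"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"  # empty file
```

Once the format is known, Biopython's Bio.SeqIO.parse(handle, format_name) can read formats such as "fasta" and "genbank" into record objects.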


Bernardo Clavijo | 2 May 2011 15:52

Re: biopython web interface

Hello Massimo... first of all... thanks for web2py, which is my tool
of choice for web apps :D

Here goes my 2 cents about all this:

1) If you're looking for a standard format, we should be talking about
sequence files (FASTA / GFF). This approach will be very
restrictive, but I guess it's a starting point.

2) You should look at Galaxy. At some point I was hoping to integrate
a web2py programming module directly there (I don't know how yet, and
I'm doing many things at once, so it's more of a dream than a project).
Galaxy has a few tutorials and videos that should point you in the
right direction.

3) Sadly, standard data representation has been an issue for some time
in the bioinformatics community. The REST / web services approach has
gained some momentum, and some apps talk to each other in some way, but
we still don't have much of a standard way to represent all the data.
Ontologies are also a strong point (check http://www.obofoundry.org/ ),
with Sequence Ontology being a great one IMHO, pointing to how the data
should be represented (it's recommended, even if not enforced, to
use SO when creating GFF3 files).

4) So far, the one tool for "standard biological data saving" that I've
found useful is the Chado DB schema, which BTW doesn't enforce or even
define how to handle a lot of situations, but is more of a framework
on which to base your own data representation. I guess that's not what
you're looking for, but it's surely an interesting approach, with a lot of
lessons learned there.

Feed My Inbox | 4 May 2011 06:37

5/4 active questions tagged biopython - Stack Overflow

// Finding/Replacing substrings with annotations in an ASCII file in Python
// May 3, 2011 at 9:14 AM

http://stackoverflow.com/questions/5870012/finding-replacing-substrings-with-annotations-in-an-ascii-file-in-python
Hello Everyone,

I'm having a little coding issue in a bioinformatics project I'm working on. Basically, my task is to
extract motif sequences from a database and use the information to annotate a sequence alignment file.
The alignment file is plain text, so the annotation will not be anything elaborate, at best simply
replacing the extracted sequences with asterisks in the alignment file itself. 

I have a script which scans the database file, extracts all sequences I need, and writes them to an output
file. What I need is, given a query, to read these sequences and match them to their corresponding
substrings in the ASCII alignment files.  Finally, for every occurrence of a motif sequence (substring of
a very large string of characters) I would replace motif sequence XXXXXXX with a sequence of asterisks *. 

The code I am using goes like this (11SGLOBULIN is the name of the protein entry in the database):

motif_file = open('/users/myfolder/final motifs_11SGLOBULIN','r')
align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+') 
finalmotifs = motif_file.readlines()
seqalign = align_file.readlines() 

for line in seqalign:
    if motif[i] in seqalign:  # I have stored all motifs in a list called "motif"
        replace(motif, '*****') 

But instead of replacing each string with a sequence of asterisks, it deletes the entire file. Can anyone
see why this is happening? 
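
A likely cause, going only by the snippet above: opening the alignment file with 'w+' truncates it immediately, so readlines() has nothing left to read. The loop also calls a bare replace(), which is not a built-in, and Python strings are immutable, so the result of str.replace() has to be kept and written back. The following is a sketch of one way to do it, assuming one motif per line in the motif file and motifs that do not span line breaks in the alignment. mask_motifs and mask_file are hypothetical names, and masking with one asterisk per residue (rather than a fixed '*****') is a choice made here to keep alignment columns in place.

```python
def mask_motifs(text, motifs):
    """Replace every occurrence of each motif with asterisks of equal length."""
    # Longest motifs first, so a short motif cannot break up a longer one.
    for motif in sorted(motifs, key=len, reverse=True):
        if motif:
            text = text.replace(motif, "*" * len(motif))
    return text


def mask_file(motif_path, align_path):
    """Mask all motifs listed in motif_path inside the file at align_path."""
    with open(motif_path) as handle:
        motifs = [line.strip() for line in handle if line.strip()]
    # Read first in "r" mode; "w+" would empty the file before it is read.
    with open(align_path) as handle:
        text = handle.read()
    # Only now reopen for writing and store the masked text.
    with open(align_path, "w") as handle:
        handle.write(mask_motifs(text, motifs))
```

Writing the result to a new file instead of overwriting the alignment would be safer while testing.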


João Rodrigues | 4 May 2011 12:21

Benchmarking PDBParser

Hello all,

Following a few discussions, I'm tempted to benchmark the current
implementation of the PDBParser and see how it fares against an old
implementation (I think I'll use 1.48 since older versions need Numerical
Python). The main objective is to see if the recent developments have a
significant impact in its speed.

I thought of downloading the entire PDB but since it would take several
days, I downloaded the CATH domain list instead. Those are just protein ATOM
records, without any header, but since all modifications were essentially
dealing with ATOM records, etc, I think it might be as valid.

I'll be running tests today and tomorrow and I'll put the results up
somewhere later on. I'm also making the scripts available so it is easy to
benchmark it later on.
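
For reference, a benchmark script of this kind could be as small as the sketch below. This is not the actual script mentioned above, only an assumed shape: benchmark and report are hypothetical helpers, and the parser is passed in as a callable so that different Biopython versions (or a dummy) can be timed by the same code.

```python
import time


def benchmark(parse, paths):
    """Time parse(path) over all paths; return (total seconds, average ms)."""
    start = time.perf_counter()
    count = 0
    for path in paths:
        parse(path)
        count += 1
    total = time.perf_counter() - start
    average_ms = 1000.0 * total / count if count else 0.0
    return total, average_ms


def report(total, average_ms):
    """Format the two figures as a short plain-text summary."""
    return ("Total time spent: %.3fs\n"
            "Average time per structure: %.3fms" % (total, average_ms))


# With Biopython installed, the callable could wrap the real parser, e.g.:
#   from Bio.PDB import PDBParser
#   parser = PDBParser()
#   parse = lambda path: parser.get_structure("x", path)
```

Running each version in its own process keeps the two installations from interfering with each other.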

Thoughts or suggestions?

Cheers,

João

_______________________________________________
Biopython-dev mailing list
Biopython-dev <at> lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
Peter Cock | 4 May 2011 12:39

Re: Benchmarking PDBParser

On Wed, May 4, 2011 at 11:21 AM, João Rodrigues <anaryin <at> gmail.com> wrote:
> Hello all,
>
> Following a few discussions, I'm tempted to benchmark the current
> implementation of the PDBParser and see how it fares against an old
> implementation (I think I'll use 1.48 since older versions need Numerical
> Python). The main objective is to see if the recent developments have a
> significant impact in its speed.
>
> I thought of downloading the entire PDB but since it would take several
> days, I downloaded the CATH domain list instead. Those are just protein ATOM
> records, without any header, but since all modifications were essentially
> dealing with ATOM records, etc, I think it might be as valid.
>
> I'll be running tests today and tomorrow and I'll put the results up
> somewhere later on. I'm also making the scripts available so it is easy to
> benchmark it later on.
>
> Thoughts or suggestions?
>
> Cheers,
>
> João

That sounds like a good idea. While you are at it, you could try
both the strict and permissive modes - I wonder what proportion
of the current PDB has problems in the data?
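
One way to get at that proportion is to run the strict parser over every file and count the ones it rejects. The sketch below assumes nothing about the parser beyond "it raises on bad input"; find_strict_failures and failure_fraction are hypothetical helpers. With Biopython, the callable would presumably wrap PDBParser(PERMISSIVE=False).get_structure.

```python
def find_strict_failures(parse_strict, paths):
    """Return (path, error message) for each file the strict parser rejects."""
    failures = []
    for path in paths:
        try:
            parse_strict(path)
        except Exception as err:  # broad on purpose: we only tally problems
            failures.append((path, str(err)))
    return failures


def failure_fraction(failures, total):
    """Fraction of files that failed strict parsing (0.0 if no files)."""
    return len(failures) / total if total else 0.0
```

Keeping the error messages makes it possible to group the failures by cause afterwards.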

Peter

João Rodrigues | 4 May 2011 12:42

Re: Benchmarking PDBParser

I was not planning on using the PDB database, but I might as well download
it then.

Adding that to the list. I'm also planning on removing all elements and
checking the impact of finding the elements.

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

João Rodrigues | 4 May 2011 15:23

Re: Benchmarking PDBParser

Just a word of advice. I tried to download the whole PDB with PDBList.py and
I ran into an error. Their server shut me down due to too many connections.
Perhaps adding an exception catcher like the one we have for NCBI servers
would be useful?
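
Such a catcher could be a small retry wrapper with an increasing delay. The sketch below is generic and does not reflect how PDBList.py or the NCBI code is actually written; retry is a hypothetical helper, and which exception the download raises when the server drops the connection is an assumption (OSError here, which covers urllib's URLError).

```python
import time


def retry(func, attempts=5, delay=1.0, backoff=2.0,
          exceptions=(OSError,), sleep=time.sleep):
    """Call func(); on a listed exception wait, then retry with a longer delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
            delay *= backoff  # wait longer before the next try


# Hypothetical usage with Biopython's PDBList:
#   from Bio.PDB import PDBList
#   pdbl = PDBList()
#   retry(lambda: pdbl.retrieve_pdb_file("1abc"))
```

Pausing briefly between successful downloads as well would reduce the chance of hitting the connection limit in the first place.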

Preliminary results show some degradation in speed:

==> benchmark_CATH-biopython_149.time <==
Total time spent: 530.686s
Average time per structure: 46.839ms

==> benchmark_CATH-biopython_current.time <==
Total time spent: 686.176s
Average time per structure: 60.563ms

I'll write a full summary when I finish downloading the PDB and testing it.
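
As a quick check on the numbers above, the totals and per-structure averages can be compared directly. This is only arithmetic on the four reported figures; the variable names are made up for the sketch.

```python
# Figures copied from the two .time files above.
old_total_s, old_avg_ms = 530.686, 46.839
new_total_s, new_avg_ms = 686.176, 60.563

# Number of structures implied by total / average (should match in both runs).
n_old = old_total_s / (old_avg_ms / 1000.0)
n_new = new_total_s / (new_avg_ms / 1000.0)

# Relative slowdown of the current code versus the older release.
slowdown = new_total_s / old_total_s
extra_ms = new_avg_ms - old_avg_ms

print("structures: %d vs %d" % (round(n_old), round(n_new)))
print("slowdown: %.1f%% (%.3f ms extra per structure)"
      % (100.0 * (slowdown - 1.0), extra_ms))
```

Both runs work out to roughly 11,330 structures, and the current code is about 29% slower on this set.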
Chad Davis | 4 May 2011 15:55

Re: Benchmarking PDBParser

I'd be very interested in this as well.
I'm working on some modifications (in the alpha stages still) to the
BioPerl PDB parser (based on the Perl Data Language, analogous to
NumPy) and would be interested to compare all of them (BioPython old
and new, BioPerl old and new).

In my experience, downloading the PDB, just the divided structures,
works best with rsync, and I believe it should only take several
hours, not several days, the first time. It should be as easy as:

rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb

Other options:
http://www.wwpdb.org/downloads.html
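
Once the mirror is in place, the files can be fed to a parser without unpacking them first. The sketch below assumes the usual layout of the divided tree, with gzip-compressed entries named like pdb/ab/pdb1abc.ent.gz; iter_pdb_files and open_pdb are hypothetical helper names.

```python
import gzip
import os


def iter_pdb_files(root):
    """Yield paths of all *.ent.gz files under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes os.walk visit subdirectories in sorted order
        for name in sorted(filenames):
            if name.endswith(".ent.gz"):
                yield os.path.join(dirpath, name)


def open_pdb(path):
    """Open a gzip-compressed PDB entry as a text handle."""
    return gzip.open(path, "rt")
```

A text handle like this can be passed to a parser in place of a filename, assuming the parser accepts open handles.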

Chad

On Wed, May 4, 2011 at 15:23, João Rodrigues <anaryin <at> gmail.com> wrote:
> Just a word of advice. I tried to download the whole PDB with PDBList.py and
> I ran into an error. Their server shut me down due to too many connections.
> Perhaps adding an exception catcher like the one we have for NCBI servers
> would be useful?
>
> Preliminary results show some degradation of speed..
>
> ==> benchmark_CATH-biopython_149.time <==
> Total time spent: 530.686s
> Average time per structure: 46.839ms
>
> ==> benchmark_CATH-biopython_current.time <==

João Rodrigues | 4 May 2011 15:57

Re: Benchmarking PDBParser

Hey Chad,

That's exactly what I ended up doing and it is done ;) Pretty quick, I was
expecting it to take a day or so!

Best,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On Wed, May 4, 2011 at 3:55 PM, Chad Davis <chad.a.davis <at> gmail.com> wrote:

> I'd be very interested in this as well.
> I'm working on some modifications (in the alpha stages still) to the
> BioPerl PDB parser (based on the Perl Data Language, analogous to
> NumPy) and would be interested to compare all of them (BioPython old
> and new, BioPerl old and new).
>
> In my experience, downloading the PDB, just the divided structures,
> works best with rsync, and I believe it should only take several
> hours, not several days, the first time. It should be as easy as:
>
> rsync -a rsync.wwpdb.org::ftp_data/structures/divided/pdb/ ./pdb
>
> Other options:
> http://www.wwpdb.org/downloads.html
>
> Chad
>
>

redmine | 4 May 2011 23:56

[Biopython - Feature #3194] (In Progress) Bio.Phylo export to 'ape' via Rpy2


Issue #3194 has been updated by Eric Talevich.

Status changed from New to In Progress
Assignee changed from Eric Talevich to Biopython Dev Mailing List
% Done changed from 0 to 20
Estimated time set to 0.50

I added a cookbook entry for this on the Biopython wiki:

http://www.biopython.org/wiki/Phylo_cookbook#Convert_to_an_.27ape.27_tree.2C_via_Rpy2

Good enough? Trying it in ipython, it works as advertised, except after calling r.plot() the R plot window
won't close until I exit ipython. Further calls to plot() update the window; it just doesn't close.
----------------------------------------
Feature #3194: Bio.Phylo export to 'ape' via Rpy2
https://redmine.open-bio.org/issues/3194

Author: Eric Talevich
Status: In Progress
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: Not Applicable
URL: 

There are many more packages for working with phylogenetic data in R, and most of these operate on the basic
tree object defined in the ape package. Let's support interoperability through Rpy2.

The trivial way to do this is serialize a tree to a Newick string, then feed that to the read.tree() function.
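
The string-passing idea can be sketched without either library installed. Below, a nested tuple stands in for a tree (a stand-in only; with Bio.Phylo the Newick text would come from Phylo.write(tree, handle, "newick")), and the helper builds the R expression that hands the string to ape's read.tree() through its text argument. to_newick and r_read_tree_call are hypothetical names, not part of Biopython or Rpy2.

```python
def to_newick(node):
    """Serialize a nested-tuple tree, e.g. (("A", "B"), "C"), to Newick."""
    def walk(n):
        if isinstance(n, tuple):
            return "(" + ",".join(walk(child) for child in n) + ")"
        return str(n)  # leaf label
    return walk(node) + ";"


def r_read_tree_call(newick):
    """Build the R expression that passes a Newick string to ape's read.tree()."""
    # Escape backslashes and double quotes so the R string literal stays valid.
    escaped = newick.replace("\\", "\\\\").replace('"', '\\"')
    return 'ape::read.tree(text="%s")' % escaped


print(r_read_tree_call(to_newick((("A", "B"), "C"))))
# With Rpy2 and the R package 'ape' installed, the expression could be run as:
#   from rpy2.robjects import r
#   rtree = r(r_read_tree_call(newick_string))
```

Passing the text directly avoids a temporary file, at the cost of building an R string by hand.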

