Matthew Strand | 1 Jul 2009 05:01
Favicon

Re: [Biopython] Dealing with Non-RefSeq IDs / InParanoid

For the benefit of future users who find this thread through a search, I
would like to share how to retreive a sequence from NCBI given a non-NCBI
protein ID (or other ID). This was question 3 in my original message.

Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you
want to retrieve the sequence from NCBI.

You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list
of NCBI GIs that refrence this identifer. In this case there is only one
(17554770).

Then you can get the sequence using Entrez.efetch(db="protein",
id='17554770', rettype="fasta").

This may be obvious to some, but it was not to me; primarially because I was
unaware of the esearch functionality.

--

-- 
Matthew Strand
_______________________________________________
Biopython mailing list  -  Biopython <at> lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

Iddo Friedberg | 1 Jul 2009 05:53
Picon
Gravatar

Re: [Biopython] Dealing with Non-RefSeq IDs / InParanoid

Thanks. There is a wiki-based cookbook in the biopython site. Would you like
to put it up there?

Iddo Friedberg
http://iddo-friedberg.net/contact.html

On Jun 30, 2009 8:02 PM, "Matthew Strand" <stran104 <at> chapman.edu> wrote:

For the benefit of future users who find this thread through a search, I
would like to share how to retreive a sequence from NCBI given a non-NCBI
protein ID (or other ID). This was question 3 in my original message.

Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you
want to retrieve the sequence from NCBI.

You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list
of NCBI GIs that refrence this identifer. In this case there is only one
(17554770).

Then you can get the sequence using Entrez.efetch(db="protein",
id='17554770', rettype="fasta").

This may be obvious to some, but it was not to me; primarially because I was
unaware of the esearch functionality.

--
Matthew Strand

_______________________________________________ Biopython mailing list -
Biopython <at> lists.open-bio....
(Continue reading)

David WInter | 1 Jul 2009 08:22
Picon

Re: [Biopython] Bio.Sequencing.Ace

Fungazid wrote:
> David hi,
>
> Many many thanks for the diagram.
> I'm not sure I understand the differences between contig.af[readn].padded_start,  and
contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format
>
> Avi
>   
Hi again Avi,

I took me a while to get to grips with the difference, the 'bs' list is 
a mapping of the contig's consensus to the particular read that was used 
to as the 'base segment' in that region. If you have a monospaced font 
in your email client this might help:

             consensus             

|===================================|              
+---read3---x
             +---read5--x
                         +--read1---x

(which would give a contig.bs list with 3 bs instances)

I'm not sure that this is particularly important information for a 454 assembly ;) 

I've updated the examples on the wiki page a little, if you find anything else that you think should 
be there feel free to add to it

(Continue reading)

Matthew Strand | 1 Jul 2009 12:18
Favicon

Re: [Biopython] Dealing with Non-RefSeq IDs / InParanoid

Sure, I can create a page tomorrow when I get into the office. Perhaps
"Retrieving Sequences Based on ID" would be appropriate. Alternative
suggestions are welcome.

On Tue, Jun 30, 2009 at 8:53 PM, Iddo Friedberg <idoerg <at> gmail.com> wrote:

> Thanks. There is a wiki-based cookbook in the biopython site. Would you
> like to put it up there?
>
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html
>
> On Jun 30, 2009 8:02 PM, "Matthew Strand" <stran104 <at> chapman.edu> wrote:
>
> For the benefit of future users who find this thread through a search, I
> would like to share how to retreive a sequence from NCBI given a non-NCBI
> protein ID (or other ID). This was question 3 in my original message.
>
> Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you
> want to retrieve the sequence from NCBI.
>
> You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list
> of NCBI GIs that refrence this identifer. In this case there is only one
> (17554770).
>
> Then you can get the sequence using Entrez.efetch(db="protein",
> id='17554770', rettype="fasta").
>
> This may be obvious to some, but it was not to me; primarially because I
> was
(Continue reading)

Chris Fields | 1 Jul 2009 14:35
Favicon
Gravatar

Re: [Biopython] [Bioperl-l] Next-gen modules

Peter,

I just committed a fix to FASTQ parsing last night to support read/ 
write for Sanger/Solexa/Illumina following the biopython convention;  
the only thing needed is more extensive testing for the quality  
scores.  There are a few other oddities with it I intend to address  
soon, but it appears to be working.

The Seq instance iterator actually calls a raw data iterator (hash  
refs of named arguments to the class constructor).  That should act as  
a decent filtering step if needed.

We have automated EMBOSS wrapping but I'm not sure how intuitive it  
is; we can probably reconfigure some of that.

chris

On Jul 1, 2009, at 2:44 AM, Peter Cock wrote:

> Hi all (BioPerl and Biopython),
>
> This is a continuation of a long thread on the BioPerl mailing
> list, which I have now CC'd to the Biopython mailing list. See:
> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html
>
> On this thread we have been discussing next gen sequencing
> tools and co-coordinating things like consistent file format
> naming between Biopython, BioPerl and EMBOSS. I've been
> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
> and he will look into setting up a cross project mailing list for
(Continue reading)

Giles Weaver | 1 Jul 2009 18:27

Re: [Biopython] [Bioperl-l] Next-gen modules

Peter, the trimming algorithm I use employs a sliding window, as follows:

   - For each sequence position calculate the mean phred quality score for a
   window around that position.
   - Record whether the mean score is above or below a threshold as an array
   of zeros and ones.
   - Use a regular expression on the joined array to find the start and end
   of the good quality sequence(s).
   - Extract the quality sequence(s) and replace any bases below the quality
   threshold with N.
   - Trim any Ns from the ends.

A refinement would be to weight the scores from positions in the window, but
this could give a performance hit, and the method seems to work well enough
as is.

Chris, thanks for committing the fix, I'll give bioperl illumina fastq
parsing a workout soon. Peter, as much as I'd love to help out with
biopython, I'm under too much time pressure right now!

Jonathan, some of the Illumina sequencing adapters are listed at
http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland
http://seqanswers.com/forums/showthread.php?t=198
Adapter sequence typically appears towards the end of the read, though the
latter part of it is often misread as the sequencing quality drops off.
I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I
then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
real alignments and trim the sequence. This is not the ideal way of doing
things, but it's fast enough, and does seem to work. The adapter sequence
shouldn't be gapped, so I'm sure there is a lot of scope for optimising the
(Continue reading)

Chris Fields | 1 Jul 2009 18:46
Favicon
Gravatar

Re: [Biopython] [Bioperl-l] Next-gen modules

On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:

...

> Peter, the trimming algorithm I use employs a sliding window, as  
> follows:
>
>   - For each sequence position calculate the mean phred quality  
> score for a
>   window around that position.
>   - Record whether the mean score is above or below a threshold as  
> an array
>   of zeros and ones.
>   - Use a regular expression on the joined array to find the start  
> and end
>   of the good quality sequence(s).
>   - Extract the quality sequence(s) and replace any bases below the  
> quality
>   threshold with N.
>   - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the  
> window, but
> this could give a performance hit, and the method seems to work well  
> enough
> as is.
>
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!
(Continue reading)

Peter Cock | 2 Jul 2009 09:20
Gravatar

Re: [Biopython] [Bioperl-l] Next-gen modules

On 7/1/09, Giles Weaver wrote:
> Peter, the trimming algorithm I use employs a sliding window, as follows:
>
>    - For each sequence position calculate the mean phred quality score for a
>    window around that position.
>    - Record whether the mean score is above or below a threshold as an array
>    of zeros and ones.
>    - Use a regular expression on the joined array to find the start and end
>    of the good quality sequence(s).
>    - Extract the quality sequence(s) and replace any bases below the quality
>    threshold with N.
>    - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the window, but
> this could give a performance hit, and the method seems to work well enough
> as is.

Thanks for the details - that is a bit more complex that what I had been
thinking. Do you have any favoured window size and quality threshold,
or does this really depend on the data itself?

Also, if you find a sequence read that goes "good - poor - good" for
example, do you extract the two good regions as two sub reads
(presumably with a minimum length)? This may be silly for Illumina
where the reads are very short, but might make sense for Roche 454.

> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!

(Continue reading)

Rouilly, Vincent | 2 Jul 2009 15:40
Picon

[Biopython] Distributed Annotation System ( DAS ) and BioPython

Hi,

I have question about Distributed Annotation System (DAS).
What is the current best practice to load a SeqRecord from a DAS description ?

-------
I found that this topic has been discussed in the past here (see below), but I couldn't find the up-to-date
method to deal with DAS in BioPython.

[2003] : Draft PyDAS parser from Andrew Dalke:  http://portal.open-bio.org/pipermail/biopython/2003-October/001670.html
Andrew hints at a DAS2 project that might produce a better python tool.

[2006]: Ann Loraine uses a SAX perser to deal with DAS: http://www.bioinformatics.org/pipermail/bbb/2006-December/003694.html

[2007]: PPT Presentation from Sanger Feb 2007: "DAS/2: Next generation Distributed Annotation System". 
Some python code used in the DAS/2 Validation Suite is mentioned.
http://sourceforge.net/projects/dasypus/
Project where Andrew Dalke is involved, but it seems inactive since 2006.
-------

Sorry if I have missed the post where this issue was last discussed,
best wishes,

Vincent.
_______________________________________________
Biopython mailing list  -  Biopython <at> lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

Giles Weaver | 3 Jul 2009 17:35

Re: [Biopython] [Bioperl-l] Next-gen modules

Regarding the trimming algorithm, I've been using a window size of 5, a
minimum score of 20 and a minimum length of 15 with the Illumina data. In
the past I have used a similar algorithm with a larger window size and much
longer minimum length with sequence from ABI 3XXX machines. I imagine that
the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to
those for Illumina and Sanger sequencing respectively.
Window size doesn't appear to affect performance much, if at all.

For sequences with multiple good regions, I do extract all good regions.
Even with the Illumina data there are sometimes two good regions, but
usually the second is adapter or junk and gets filtered out later. I haven't
seen quality data from a 454 machine recently, and would be interested to
know if multiple good regions are commonplace in 454 data. Can anyone with
access to 454 data comment on this?

Giles

2009/7/2 Peter Cock <p.j.a.cock <at> googlemail.com>

> On 7/1/09, Giles Weaver wrote:
> > Peter, the trimming algorithm I use employs a sliding window, as follows:
> >
> >    - For each sequence position calculate the mean phred quality score
> for a
> >    window around that position.
> >    - Record whether the mean score is above or below a threshold as an
> array
> >    of zeros and ones.
> >    - Use a regular expression on the joined array to find the start and
> end
(Continue reading)


Gmane