Mark A. Jensen | 1 Jul 2009 03:41

Re: Parsing a FASTA file (Was: Bioperl-l Digest, Vol 74, Issue 25)

Hi Paola, 

You want to try Bio::SearchIO, I think. It's not quite clear what you 
want to do, but here's an example of what you can do: 

Get all high-scoring pairs ( the mini-alignments ) involving
the database sequence called "2ojg:A"--

 use Bio::SearchIO;

 my $io = Bio::SearchIO->new(-format=>'fasta', -file=>'yourfile.fasta');
 my $result = $io->next_result;
 my @desired_hsps;

 while ( my $hit = $result->next_hit ) {
   push @desired_hsps, grep { $_->subject->seq_id =~ /2ojg:A/ } $hit->hsps;
 }

 # now all your desired hsps are in the array @desired_hsps;
 # you can get Bio::SimpleAlign objects from them all, for example:
 my @aligns = map { $_->get_aln } @desired_hsps;
 #...and lots of other things...

Look at http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_SearchIO
and http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_the_methods 
for a nice introduction to the Bio::SearchIO system by its authors. They 
use a blast output as an example, but everything applies to fasta output 
as well.

You didn't waste your time writing regexps, by the way. For a Perl

Chris Fields | 1 Jul 2009 05:48

FASTQ output

I am working on FASTQ output and noticed a real oddity.  Apparently,  
there are three write_* methods for this module, with the odd choice  
of write_seq for Bio::SeqIO::fastq writing FASTA, not FASTQ.   
write_qual() writes Qual format:

http://www.bioperl.org/wiki/Qual_sequence_format

and write_fastq() writes FASTQ.  Now, maybe it's just me, but I think  
an implementation of write_seq() for a specific format should probably  
output that format and not something else entirely unexpected.  Also,  
is there a reason for duplicating output code for qual and FASTA  
output within Bio::SeqIO::fastq, i.e. should we call Bio::SeqIO::fasta/ 
qual instead?

I would consider the write_seq() issue a bug; the others are really  
just maintenance issues.  Anyone have problems with me changing that  
up a bit?
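
To make the three outputs concrete, here is a minimal pure-Perl sketch of how one read looks in FASTA, Qual and Sanger-style FASTQ. It does not use or reproduce Bio::SeqIO::fastq; the read, its scores and the helper names (as_fasta, as_qual, as_fastq) are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical read: identifier, bases, and per-base Phred scores.
my %read = (
    id   => 'read1',
    seq  => 'ACGTN',
    qual => [ 40, 30, 20, 10, 2 ],
);

# FASTA: '>' header line, then the bases.
sub as_fasta {
    my ($r) = @_;
    return ">$r->{id}\n$r->{seq}\n";
}

# Qual: '>' header line, then space-separated integer scores.
sub as_qual {
    my ($r) = @_;
    return ">$r->{id}\n" . join( ' ', @{ $r->{qual} } ) . "\n";
}

# Sanger FASTQ: '@' header, bases, '+' separator, then one ASCII
# character per base, encoded as chr(Phred + 33).
sub as_fastq {
    my ($r) = @_;
    my $encoded = join '', map { chr( $_ + 33 ) } @{ $r->{qual} };
    return "\@$r->{id}\n$r->{seq}\n+\n$encoded\n";
}

print as_fasta( \%read ), as_qual( \%read ), as_fastq( \%read );
```

The same record yields three quite different files, which is why a write_seq() that emits FASTA from a FASTQ writer is surprising.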

chris

Peter Cock | 1 Jul 2009 09:44

Re: Next-gen modules

Hi all (BioPerl and Biopython),

This is a continuation of a long thread on the BioPerl mailing
list, which I have now CC'd to the Biopython mailing list. See:
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html

On this thread we have been discussing next gen sequencing
tools and coordinating things like consistent file format
naming between Biopython, BioPerl and EMBOSS. I've been
chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
and he will look into setting up a cross project mailing list for
this kind of discussion in future.

In the meantime, my replies to Giles below cover both BioPerl
and Biopython (and EMBOSS). Giles' original email is here:
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html

Peter

On 6/30/09, Giles Weaver <giles.weaver <at> googlemail.com> wrote:
>
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters
> (and poor quality sequence) from Illumina reads.
>

Chris Fields | 1 Jul 2009 14:35

Re: Next-gen modules

Peter,

I just committed a fix to FASTQ parsing last night to support read/ 
write for Sanger/Solexa/Illumina following the biopython convention;  
the only thing needed is more extensive testing for the quality  
scores.  There are a few other oddities with it I intend to address  
soon, but it appears to be working.
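
For anyone testing the quality scores, a rough pure-Perl sketch of how the three variants differ may help. It is not the committed BioPerl code; the variant labels and function names are illustrative. It assumes the usual conventions: Sanger stores Phred scores at ASCII offset 33, Illumina 1.3+ stores Phred scores at offset 64, and Solexa stores Solexa-scaled scores at offset 64, which convert to Phred as 10*log10(10^(Q/10) + 1).

```perl
use strict;
use warnings;

# ASCII offset for each variant (names here are illustrative labels).
my %offset = ( sanger => 33, solexa => 64, illumina => 64 );

# Solexa scores use a different log-odds scale; map them to Phred.
sub solexa_to_phred {
    my ($q) = @_;
    return 10 * log( 10**( $q / 10 ) + 1 ) / log(10);
}

# Decode a quality string into a list of Phred scores rounded to integers.
sub decode_quality {
    my ( $string, $variant ) = @_;
    die "unknown variant '$variant'" unless exists $offset{$variant};
    my @scores = map { ord($_) - $offset{$variant} } split //, $string;
    @scores = map { solexa_to_phred($_) } @scores if $variant eq 'solexa';
    return map { int( $_ + 0.5 ) } @scores;
}

# Re-encode Phred scores as a Sanger-style quality string.
sub encode_sanger {
    return join '', map { chr( $_ + 33 ) } @_;
}

# 'h' is ASCII 104, so Illumina 'h' means Phred 40, which Sanger writes as 'I'.
print encode_sanger( decode_quality( 'hhTJ', 'illumina' ) ), "\n";
```

Round-tripping a few known strings through a converter like this is a cheap way to check a parser's scores for each variant.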

The Seq instance iterator actually calls a raw data iterator (hash  
refs of named arguments to the class constructor).  That should act as  
a decent filtering step if needed.

We have automated EMBOSS wrapping but I'm not sure how intuitive it  
is; we can probably reconfigure some of that.

chris

On Jul 1, 2009, at 2:44 AM, Peter Cock wrote:

> Hi all (BioPerl and Biopython),
>
> This is a continuation of a long thread on the BioPerl mailing
> list, which I have now CC'd to the Biopython mailing list. See:
> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html
>
> On this thread we have been discussing next gen sequencing
> tools and coordinating things like consistent file format
> naming between Biopython, BioPerl and EMBOSS. I've been
> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
> and he will look into setting up a cross project mailing list for

Jonathan Epstein | 1 Jul 2009 15:20

Re: Next-gen modules

I too am interested in these topics.  In particular, I would like to 
learn more about "sequencing adapter removal," i.e. what these adapters 
look like, and what strategies you've employed for finding and removing 
them.

Jonathan

Giles Weaver wrote:
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters (and
> poor quality sequence) from Illumina reads.
>
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting
> the Illumina fastq to Sanger fastq, bioperl to read the quality values, pure
> perl to trim the poor quality sequence from each read, and bioperl with
> emboss to remove the adapter sequence. I'm aware that the pipeline contains
> bugs and would like to simplify it, but at least it does work...
>
> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a lack of
> features and poor performance. I'm sure the features will come with time,
> but the performance is more of a concern to me. I wonder if Bio::Moose might
> be used to alleviate some of the performance issues? Might next-gen modules
> be an ideal guinea pig for Bio::Moose?

Mark A. Jensen | 1 Jul 2009 15:42

Re: Random nucleotide string generator?

You guys earned your scrap:
http://www.bioperl.org/wiki/Random_sequence_generation

cheers and thanks! MAJ
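
For reference, a minimal pure-Perl sketch of the kind of generator asked about in the original message below (a string of a given length drawn from given base frequencies). This is not the code from the wiki page; the function name random_sequence and its arguments are made up for illustration.

```perl
use strict;
use warnings;

# Build a random nucleotide string of $length, drawing each base
# according to the frequencies in %$freq (which should sum to 1).
sub random_sequence {
    my ( $length, $freq ) = @_;
    my @bases = sort keys %$freq;
    my $total = 0;
    $total += $freq->{$_} for @bases;

    my $seq = '';
    for ( 1 .. $length ) {
        my $r      = rand($total);
        my $chosen = $bases[-1];    # fallback guards against rounding
        for my $base (@bases) {
            if ( $r < $freq->{$base} ) { $chosen = $base; last }
            $r -= $freq->{$base};
        }
        $seq .= $chosen;
    }
    return $seq;
}

print random_sequence( 20, { A => 0.3, C => 0.2, G => 0.2, T => 0.3 } ), "\n";
```

Each base is chosen by walking the cumulative frequencies, so the composition only approaches the requested frequencies for long sequences.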
----- Original Message ----- 
From: "Roger Hall" <rahall2 <at> ualr.edu>
To: <bioperl-l <at> bioperl.org>
Sent: Friday, June 26, 2009 2:28 AM
Subject: [Bioperl-l] Random nucleotide string generator?

> All,
>
> Is there a random generator for creating nucleotides (of length l with 
> composition frequencies a, c, g, and t) in there somewhere?
>
> I noticed a thread about it from 2000 and nothing since (searching for "random 
> sequence").
>
> If not - what should the namespace be for such a module should it be undone 
> and desirable?
>
> TIA!
>
> Roger
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l <at> lists.open-bio.org

Ocar Campos | 1 Jul 2009 16:30

Re: Random nucleotide string generator?

Thanks for the add to the wiki Mark.

Cheers.

O'car Campos C.
Bioinformatics Engineering Student.
University of Talca.

2009/7/1 Mark A. Jensen <maj <at> fortinbras.us>

> You guys earned your scrap:
> http://www.bioperl.org/wiki/Random_sequence_generation
>
> cheers and thanks! MAJ
> ----- Original Message ----- From: "Roger Hall" <rahall2 <at> ualr.edu>
> To: <bioperl-l <at> bioperl.org>
> Sent: Friday, June 26, 2009 2:28 AM
> Subject: [Bioperl-l] Random nucleotide string generator?
>
>
>
>  All,
>>
>> Is there a random generator for creating nucleotides (of length l with
>> composition frequencies a, c, g, and t) in there somewhere?
>>
>> I noticed a thread about it from 2000 and nothing since (searching for
>> "random sequence").
>>
>> If not - what should the namespace be for such a module should it be

Giles Weaver | 1 Jul 2009 18:27

Re: Next-gen modules

Peter, the trimming algorithm I use employs a sliding window, as follows:

   - For each sequence position calculate the mean phred quality score for a
   window around that position.
   - Record whether the mean score is above or below a threshold as an array
   of zeros and ones.
   - Use a regular expression on the joined array to find the start and end
   of the good quality sequence(s).
   - Extract the quality sequence(s) and replace any bases below the quality
   threshold with N.
   - Trim any Ns from the ends.

A refinement would be to weight the scores from positions in the window, but
this could give a performance hit, and the method seems to work well enough
as is.
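
The steps above can be sketched in pure Perl roughly as follows. This is one reading of the description, not the actual pipeline code: the function name trim_read, the centred window of 2 * $half_window + 1 positions (clipped at the read ends), and treating a mean equal to the threshold as good are all assumptions.

```perl
use strict;
use warnings;
use List::Util qw(sum max min);

# Returns the trimmed good-quality segments of a read.
# $seq: bases; $quals: array ref of Phred scores;
# $threshold and $half_window are illustrative parameters.
sub trim_read {
    my ( $seq, $quals, $threshold, $half_window ) = @_;
    my $last = $#{$quals};

    # Steps 1 and 2: mean score in a window centred on each position,
    # recorded as '1' (at or above threshold) or '0' (below).
    my $mask = '';
    for my $i ( 0 .. $last ) {
        my $from = max( 0, $i - $half_window );
        my $to   = min( $last, $i + $half_window );
        my $mean = sum( @{$quals}[ $from .. $to ] ) / ( $to - $from + 1 );
        $mask .= $mean >= $threshold ? '1' : '0';
    }

    # Step 3: a regex over the joined mask finds each run of good windows.
    my @segments;
    while ( $mask =~ /(1+)/g ) {
        my ( $start, $len ) = ( $-[1], $+[1] - $-[1] );
        my @bases = split //, substr( $seq, $start, $len );

        # Step 4: mask individual bases that are below the threshold.
        for my $j ( 0 .. $#bases ) {
            $bases[$j] = 'N' if $quals->[ $start + $j ] < $threshold;
        }

        # Step 5: trim Ns from both ends.
        ( my $segment = join '', @bases ) =~ s/^N+|N+$//g;
        push @segments, $segment if length $segment;
    }
    return @segments;
}

my @good = trim_read( 'ACGTACGTAC', [ 30, 30, 30, 30, 5, 30, 30, 2, 2, 2 ], 20, 1 );
print "@good\n";    # prints "ACGTNCG"
```

Weighting the positions within the window, as suggested above, would only change how $mean is computed.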

Chris, thanks for committing the fix, I'll give bioperl illumina fastq
parsing a workout soon. Peter, as much as I'd love to help out with
biopython, I'm under too much time pressure right now!

Jonathan, some of the Illumina sequencing adapters are listed at
http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.html
and http://seqanswers.com/forums/showthread.php?t=198
Adapter sequence typically appears towards the end of the read, though the
latter part of it is often misread as the sequencing quality drops off.
I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I
then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
real alignments and trim the sequence. This is not the ideal way of doing
things, but it's fast enough, and does seem to work. The adapter sequence
shouldn't be gapped, so I'm sure there is a lot of scope for optimising the
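
Since the adapter should not be gapped, one simpler alternative to a needle alignment is a plain ungapped scan. The sketch below is that alternative, not the needle/Bio::AlignIO/Bio::Range approach described above. It slides the adapter along the read, tolerates a fraction of mismatches for misread bases, and cuts at the first acceptable match. The function name trim_adapter, both tuning parameters and the test sequences are invented for illustration.

```perl
use strict;
use warnings;

# Returns the read with any 3' adapter removed.
# $max_mismatch_rate and $min_overlap are illustrative tuning parameters.
sub trim_adapter {
    my ( $read, $adapter, $max_mismatch_rate, $min_overlap ) = @_;

    for my $pos ( 0 .. length($read) - $min_overlap ) {

        # Compare the read from $pos onwards with the start of the adapter.
        my $overlap = length($read) - $pos;
        $overlap = length($adapter) if $overlap > length($adapter);
        last if $overlap < $min_overlap;

        my $read_part    = substr( $read,    $pos, $overlap );
        my $adapter_part = substr( $adapter, 0,    $overlap );

        # XOR of equal-length strings yields "\0" wherever the bases agree,
        # so counting the non-"\0" characters counts the mismatches.
        my $xor        = $read_part ^ $adapter_part;
        my $mismatches = ( $xor =~ tr/\0//c );

        return substr( $read, 0, $pos )
            if $mismatches <= $max_mismatch_rate * $overlap;
    }
    return $read;    # no adapter found
}

# Hypothetical read: 14 genomic bases followed by the first 10 adapter bases.
print trim_adapter( 'ACGTACGTTTGGCCAGATCGGAAG', 'AGATCGGAAGAGC', 0.1, 5 ), "\n";
```

The comparison is case-sensitive and takes the leftmost acceptable match, so a low $min_overlap or a high mismatch rate will trim more aggressively.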

Chris Fields | 1 Jul 2009 18:46

Re: Next-gen modules

On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:

...

> Peter, the trimming algorithm I use employs a sliding window, as  
> follows:
>
>   - For each sequence position calculate the mean phred quality  
> score for a
>   window around that position.
>   - Record whether the mean score is above or below a threshold as  
> an array
>   of zeros and ones.
>   - Use a regular expression on the joined array to find the start  
> and end
>   of the good quality sequence(s).
>   - Extract the quality sequence(s) and replace any bases below the  
> quality
>   threshold with N.
>   - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the  
> window, but
> this could give a performance hit, and the method seems to work well  
> enough
> as is.
>
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!

August 2009 GMOD Meeting

Hello all,

The next GMOD meeting will be held 6-7 August, at the University of
Oxford, in Oxford, United Kingdom. Registration is now open. Space is
available on a first come, first served basis and there is room for 55
attendees. The meeting cost is £50.  See
http://gmod.org/wiki/August_2009_GMOD_Meeting to register

As with previous GMOD meetings, this meeting will have a mixture of
project, component, and user talks. The agenda is driven by attendee
suggestions, and you are encouraged to add your suggestions now (see
http://gmod.org/wiki/August_2009_GMOD_Meeting#Agenda_Suggestions).

For examples of what happens at a GMOD meeting, see the writeups of
the January 2009, July 2008, or any other previous meeting (see
http://gmod.org/wiki/Meetings). GMOD meetings are an excellent way to
meet other GMOD developers and users and to learn (and affect) what's
coming in the project.

Please join us in Oxford this August,

Dave Clements
GMOD Help Desk

Note: Unless you have applied to and been admitted to the Summer
School, don't you dare register for it. The registration web site will
let you do this, but bureaucratic hellishness will ensue.

--
* Learn more about GMOD at: ISMB/ECCB: http://www.iscb.org/ismbeccb2009/

