Leighton Pritchard | 1 Aug 12:42 2006

Re: Reading sequences: FormatIO, SeqIO, etc

On Mon, 2006-07-31 at 12:08 -0400, Marc Colosimo wrote: 
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
> >>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> >>> entire file into memory in one go, and then parses it.  On the other
> >>> hand it's not perfect: I would use "\n>" as the split marker rather
> >>> than ">" which could appear in the description of a sequence.
> >>
> >> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> >> "%s>" % os.linesep as the split marker, just in case.
> >
> > Good point.  I wonder how many people even know this function exists?
> >
> 
> The only problem with this arises if someone sends you a file not
> created on your system. [...]
> This has mostly simplified down to two - Unix and Windows - unless the
> person uses a Mac GUI app, some of which use \r (CR) instead of \n
> (LF), where Windows uses \r\n (CRLF). I think the standard Python
> distro comes with crlf.py and lfcr.py, which can convert the line endings.

Also a good point.  I had a play about with regular expression
splitting/substitution and the SeqUtils.quick_FASTA_reader method to see
if I could capture this variability in line-endings:

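import re

def method_quick_FASTA_reader3(filename):
    txt = open(filename).read()  # load the entire file into memory
    entries = []
    # re.M lets '^' match at the start of every line, so only a '>' in
    # the first column counts as a record marker
    split_marker = re.compile('^>', re.M)
    # [...]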
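
For illustration, here is a minimal sketch of one way to split FASTA text on record markers whatever the line-ending style. It is not the truncated function above. The name split_fasta_text and the (title, sequence) return shape are assumptions made for this example, and it is written for current Python rather than the version in use at the time.

```python
import re

# Hypothetical helper, not part of Biopython.
# A record starts with '>' at the very beginning of the text or directly
# after a line ending (\r\n, \r or \n); a '>' inside a description is ignored.
_RECORD_START = re.compile(r"(?:\A|\r\n|\r|\n)>")

def split_fasta_text(txt):
    """Return a list of (title, sequence) pairs from FASTA text in memory."""
    entries = []
    for chunk in _RECORD_START.split(txt):
        if not chunk.strip():
            continue  # skip the empty piece before the first '>'
        lines = chunk.splitlines()  # splitlines() handles \n, \r\n and \r
        title = lines[0]
        sequence = "".join(line.strip() for line in lines[1:])
        entries.append((title, sequence))
    return entries
```

Because the marker is only recognised at the start of a line, a description such as "a desc > x" stays in one piece.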

Patrick Pfeffer | 1 Aug 14:02 2006

GAs in Biopython

Hi there,

isn't there any documentation available for using the genetic algorithm 
available in the package?

Thanks for any kind of help,
Patrick

-- 
*************************************
Dipl. Bioinf. Patrick Pfeffer
Arbeitskreis Prof. Dr. G. Klebe
Institut für Pharmazeutische Chemie
Raum A116a
Fachbereich Pharmazie
Philipps-Universität Marburg
Marbacher Weg 6
35032 Marburg  Germany
Fon.: 06421/2825908
http://www.agklebe.de
e-mail: pfefferp <at> staff.uni-marburg.de
*************************************

Peter (BioPython Dev) | 1 Aug 22:53 2006

Re: Reading sequences: FormatIO, SeqIO, etc

Peter wrote:
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand it's not perfect: I would use "\n>" as the split marker
>>> rather than ">" which could appear in the description of a sequence.

Leighton Pritchard replied:
>> I agree (not that it's bitten me, yet), but I'd be inclined to go  
>> with "%s>" % os.linesep as the split marker, just in case.

Peter then wrote:
> Good point.

I take that back - I was right the first time ;)

You are right to worry about the line separator changing from platform
to platform, but you shouldn't use "%s>" % os.linesep.

On Windows, os.linesep is "\r\n" (CR LF).  However, when reading
Windows-style files on Windows in text mode, the newlines appear in
Python as just \n (as do newlines from Unix files read on Windows).

When writing text files on Windows, \n is again turned into CR LF on
the disk.

Just using "\n>" would work on any platform reading a FASTA file with
the expected newlines.  As a bonus it would work on Windows when reading
unix style newlines.
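
As a rough sketch of the "\n>" approach (not the SeqUtils code itself), the function below reads a whole file in text mode and splits on "\n>". It assumes Python 3, where text mode with the default newline=None translates both "\r\n" and "\r" to "\n" on every platform; that behaviour postdates this thread. The name quick_fasta_split is hypothetical.

```python
def quick_fasta_split(filename):
    """Load a whole FASTA file and return its raw records as strings."""
    # Assumption: Python 3 text mode. With the default newline=None,
    # '\r\n' and '\r' are both translated to '\n' when reading.
    with open(filename, "r") as handle:
        txt = handle.read()
    if txt.startswith(">"):
        txt = txt[1:]  # drop the marker of the first record
    # Splitting on newline + '>' leaves any '>' inside a description alone.
    return [record for record in txt.split("\n>") if record.strip()]
```

Each returned string starts with the title line, followed by the sequence lines of that record.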

To get any platform to read newlines from any other platform what I
[...]

Leighton Pritchard | 2 Aug 11:25 2006

Re: Reading sequences: FormatIO, SeqIO, etc

On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> Question One
> ============
> Is reading sequence files an important function to you, and if so which 
> file formats in particular (e.g. Fasta, GenBank, ...)?

Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW

> If you have had to write your own code to read a "common" file format
> which BioPython doesn't support, please get in touch.

EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
pretty).

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
> title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

Mostly (f), a homegrown Pyrex/Flex parser.

> Question Three - index_file based dictionaries
> ==============================================
[...]

Peter (BioPython Dev) | 2 Aug 12:45 2006

Re: Reading sequences: FormatIO, SeqIO, etc

Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> 
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which 
>>file formats in particular (e.g. Fasta, GenBank, ...)?
> 
> Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
> 

PTT (Protein table files)

http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)

http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that
they don't contain any sequence data.  But thinking about it, maybe
those files could be turned into SeqRecords or SeqFeatures (with empty
sequences).

> 
>>If you have had to write your own code to read a "common" file format
>>which BioPython doesn't support, please get in touch.
> 
[...]

Michael Hoffman | 2 Aug 13:00 2006

Re: Reading sequences: FormatIO, SeqIO, etc

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)?

Yes. FASTA.

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (f) Other (Could you tell us more?)

I have written my own short iterator so that my code is portable
without requiring Biopython to be installed.
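
For illustration only, and not the actual code referred to above: a dependency-free FASTA iterator along these lines might look like the following sketch. The name iter_fasta and the (title, sequence) tuples are assumptions made for this example.

```python
def iter_fasta(handle):
    """Yield (title, sequence) tuples from an open FASTA text handle."""
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)  # emit the previous record
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line.strip())  # lines before the first '>' are ignored
    if title is not None:
        yield title, "".join(chunks)  # emit the final record
```

Since it only needs an object that yields lines, it works equally on an open file or an in-memory handle, and it holds one record in memory at a time.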

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:

No.

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.

Yes.
[...]

Leighton Pritchard | 2 Aug 13:23 2006

Re: Reading sequences: FormatIO, SeqIO, etc

On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote:
> GFF (General Feature Format)
> 
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> 
> GFF and PTT aren't exactly what I would call sequence files, in that
> they don't contain any sequence data.  

Fair point, but GFF3 (see below) can optionally carry sequence data, and
I use them for exactly what you say here:

> those files could be turned into SeqRecords or SeqFeatures (with empty
> sequences).

I was thinking that GFF3 would be more useful than GFF:

http://song.sourceforge.net/gff3.shtml

NCBI have already gone over to this on bacterial genomes, at least
(e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff),
and it's a much richer format than the original specification.  Andrew
Dalke has already written a GFF3 parser/writer, which is available at

http://www.dalkescientific.com/PyGFF3-0.5.tar.gz

I've not used this in anger, yet...

> It looks like there is enough overlap between the EMBL and GenBank to
[...]

Peter (BioPython Dev) | 2 Aug 14:56 2006

Re: Reading sequences: FormatIO, SeqIO, etc

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
> 
>> maybe those files could be turned into SeqRecords or SeqFeatures 
>> (with empty sequences).
> 
> I was thinking that GFF3 would be more useful than GFF:
> 
> http://song.sourceforge.net/gff3.shtml
> 

Thanks for the links... interesting that GFF3 allows embedding Fasta
sequences.
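
To make the embedding concrete, here is a small sketch that separates the feature rows of a GFF3 file from its optional FASTA section, which the GFF3 specification introduces with a "##FASTA" directive line. This is not the PyGFF3 API; split_gff3 is a hypothetical name, and the rows are left as plain lists of the tab-separated columns.

```python
def split_gff3(lines):
    """Return (features, fasta_lines) from an iterable of GFF3 lines."""
    features = []     # each feature row as a list of tab-separated columns
    fasta_lines = []  # raw lines of the embedded FASTA section, if any
    in_fasta = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            in_fasta = True  # everything after this directive is FASTA
        elif in_fasta:
            fasta_lines.append(line)
        elif line and not line.startswith("#"):
            features.append(line.split("\t"))  # skip comments and directives
    return features, fasta_lines
```

The FASTA lines can then be handed to whichever FASTA reader is in use, while the feature rows could be turned into SeqFeatures.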

>> Reading your other comments, it looks like you wouldn't miss 
>> FastaRecord or GenBank records if they were phased out.
> 
> Not personally, but others may have strong opinions and breakable 
> code, yet.

There is no need to remove the current modules, just mark them as
deprecated.  Of course, if there is some strong support for these
objects then we might not want to be so harsh...

> It may be a side-issue, but should a Clustal parser return an 
> Alignment object or iterate over SeqRecord objects?  And for that 
> matter, what about other MSA files in FASTA format?  I think we ought
> to allow parsers to return an Alignment where the user requests it,
> which is a functionality I'm not currently aware of in the FASTA
[...]

Leighton Pritchard | 2 Aug 11:00 2006

Re: Reading sequences: FormatIO, SeqIO, etc

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster <at> scri.sari.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).
From: Leighton Pritchard <Leighton.Pritchard <at> scri.ac.uk>
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: 2006-08-02 09:00:20 GMT
On Tue, 2006-08-01 at 21:53 +0100, Peter (BioPython Dev) wrote:
> Peter then wrote:
[...]

Leighton Pritchard | 2 Aug 11:02 2006

[Fwd: Re: Reading sequences: FormatIO, SeqIO, etc]

(this time without the signature)

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc <at> scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)
