Michiel de Hoon | 1 Aug 2010 17:14

Re: Python 3 and encoding for online resources

According to this post:

http://stackoverflow.com/questions/1179305/expat-parsing-in-python-3

we need only one parser, which always parses a byte stream. Bio.Entrez uses File.UndoHandle, but only to look
for potential errors in the first few lines when opening the Entrez URL, which in my opinion we shouldn't be
doing anyway, since it's the parser's job to decide whether the input is well-formed. So I'd suggest that we stop
using File.UndoHandle altogether, make sure our parser works with Python 3 byte streams, and ask users to open
any downloaded Entrez XML files in binary mode. Is there a Biopython version (in trunk or otherwise) that
is ready for Python 3? If so, I can have a look at the parser to see if it handles byte streams correctly.
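
A minimal sketch of the "one parser, always bytes" idea, using the standard library's expat bindings rather than Bio.Entrez itself. The helper name collect_element_names and the sample XML are made up for illustration; only xml.parsers.expat and io are real.

```python
import io
import xml.parsers.expat


def collect_element_names(handle):
    """Return the element names seen while parsing a binary XML handle."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Record every start tag in document order.
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    # ParseFile calls handle.read(), which must return bytes under Python 3.
    parser.ParseFile(handle)
    return names


# Hypothetical sample document, standing in for a downloaded Entrez XML file.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<eInfoResult><DbList><DbName>pubmed</DbName></DbList></eInfoResult>"
)
print(collect_element_names(io.BytesIO(xml_bytes)))
```

For a file on disk the equivalent would be open(filename, "rb"), which is the "binary mode" request made above. Expat then reads the encoding from the XML declaration itself.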

--Michiel.

--- On Tue, 7/27/10, Peter <biopython <at> maubp.freeserve.co.uk> wrote:

> From: Peter <biopython <at> maubp.freeserve.co.uk>
> Subject: [Biopython-dev] Python 3 and encoding for online resources
> To: "Biopython-Dev Mailing List" <biopython-dev <at> biopython.org>
> Date: Tuesday, July 27, 2010, 9:23 AM
> Hi all,
> 
> One of the remaining (pure python) problems with Biopython
> under Python 3 relates to parsing online resources like the
> NCBI Entrez API or even Bio.ExPASy.get_sprot_raw().
> See for example test_SeqIO_online.py for a failure.
> 
> In Python 2, urlopen from urllib or urllib2 would give a
> string handle. In Python 3, you get a bytes handle (not
> a unicode handle and choosing the encoding is tricky):

Peter | 1 Aug 2010 19:54

Re: Python 3 and encoding for online resources

On Sun, Aug 1, 2010 at 4:14 PM, Michiel de Hoon <mjldehoon <at> yahoo.com> wrote:
> According to this post:
>
> http://stackoverflow.com/questions/1179305/expat-parsing-in-python-3
>
> we need only one parser which always parses a byte stream.
> Bio.Entrez uses File.UndoHandle but just to look for potential
> errors in the first few lines when opening the Entrez url, which
> in my opinion we shouldn't be doing anyway since it's the
> parser's job to decide whether the input is well-formed.
> So I'd suggest to not use File.UndoHandle (at all), ...

I disagree. The NCBI return multiple different file formats, so
there are multiple different parsers that may get an error page.
Given the NCBI return HTML error pages regardless of what
format the request was (XML, plain text, etc), I think we
have to look for errors before giving the data to the parser.
But that can be done using byte strings just as easily as with
unicode strings.
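
A rough sketch of how such a check could work on byte strings. This is not the existing Bio.Entrez code: the function name check_for_html_error, the peek size, and the HTML markers tested are all assumptions. Unlike File.UndoHandle, it reads the whole response into memory to keep the example short.

```python
import io


def check_for_html_error(handle, peek_size=200):
    """Raise ValueError if a binary handle starts like an HTML page.

    Otherwise return a new binary handle that replays all of the data.
    """
    head = handle.read(peek_size)
    # Compare bytes with bytes; no decoding is needed for this test.
    if head.lstrip().lower().startswith((b"<html", b"<!doctype html")):
        raise ValueError("Server returned an HTML page, possibly an error report")
    # Put the peeked bytes back in front of the remaining data.
    return io.BytesIO(head + handle.read())


good = check_for_html_error(io.BytesIO(b">seq1\nACGT\n"))
print(good.read())
```

A request that legitimately asks for HTML output would need to skip this check.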

> make sure our parser works with Python 3 byte streams, and
> ask users to open any downloaded Entrez XML files in binary
> mode.

That sounds workable.

> Is there a Biopython version (in trunk or otherwise) that is ready
> for Python 3? If so, I can have a look at the parser to see if it
> handles byte streams correctly.


Michiel de Hoon | 2 Aug 2010 15:50

Re: Python 3 and encoding for online resources

> Or if you just want to grab some code for a quick play,
> I have a branch where I've been doing this on a
> semi-regular basis:
> 
> http://github.com/peterjc/biopython/tree/auto2to3

Thanks! I used this branch to test the Bio.Entrez and Bio.SwissProt parsers. The Bio.Entrez Parser works
as is; the Bio.SwissProt parser is really easy to fix (just convert each line into a plain string inside the
_read function in Bio.SwissProt.__init__). Perhaps we can do something similar for the other
test_SeqIO_online.py failures (the ones appearing in Bio/SeqIO/FastaIO.py)?
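
One way the "convert each line into a plain string" fix might look, as a stand-alone sketch rather than the actual change to Bio.SwissProt. The helper name text_lines is hypothetical, and the ASCII default is an assumption, since the thread notes that choosing the encoding is tricky.

```python
import io


def text_lines(handle, encoding="ascii"):
    """Yield lines as str, whether the handle is in binary or text mode."""
    for line in handle:
        if isinstance(line, bytes):
            # Binary handle (e.g. from urlopen under Python 3): decode it.
            line = line.decode(encoding)
        yield line


# Hypothetical two-line record fragment, served from a bytes handle.
binary_handle = io.BytesIO(b"ID   EXAMPLE_HUMAN\nAC   P00000;\n")
for line in text_lines(binary_handle):
    print(repr(line))
```

A line-oriented parser can wrap its input this way and keep working on str, which is why the same approach might carry over to other text formats.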

> > So I'd suggest to not use File.UndoHandle (at all),
> ...
> I disagree. The NCBI return multiple different file
> formats, so there are multiple different parsers that may get
> an error page.
>
> Given the NCBI return HTML error pages regardless of what
> format the request was (XML, plain text, etc), I think we
> have to look for errors before giving the data to the
> parser.

Part of the problem solves itself when we change to Python 3. In Python 3, urllib.request.urlopen raises a
urllib.error.HTTPError in cases where urllib.urlopen in Python 2 raises no exception:

mdehoon:~/Software/biopython2to3/peterjc-biopython-06c2ea6 $ python
Python 2.7 (r27:82500, Jul 19 2010, 00:08:00) 
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
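
The Python 3 behaviour described here can be sketched without network access. The wrapper open_url and the stand-in fake_opener are hypothetical names; urllib.request.urlopen and urllib.error.HTTPError are the real standard-library pieces. This only helps when the server actually sends an HTTP error status.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


def open_url(url, opener=urllib.request.urlopen):
    """Open url, turning an HTTP error status into a RuntimeError."""
    try:
        return opener(url)
    except urllib.error.HTTPError as err:
        raise RuntimeError(
            "HTTP error %i (%s) for %s" % (err.code, err.reason, url)
        ) from err


def fake_opener(url):
    """Hypothetical stand-in for urlopen that always reports a 404."""
    raise urllib.error.HTTPError(url, 404, "Not Found", Message(), io.BytesIO(b""))


try:
    open_url("http://example.invalid/missing", opener=fake_opener)
except RuntimeError as err:
    print(err)
```

Passing the opener as an argument is purely a testing convenience; in normal use the default urlopen would be called.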

Peter | 2 Aug 2010 16:04

Re: Python 3 and encoding for online resources

On Mon, Aug 2, 2010 at 2:50 PM, Michiel de Hoon <mjldehoon <at> yahoo.com> wrote:
>> Or if you just want to grab some code for a quick play,
>> I have a branch where I've been doing this on a
>> semi-regular basis:
>>
>> http://github.com/peterjc/biopython/tree/auto2to3
>
> Thanks! I used this branch to test the Bio.Entrez and Bio.SwissProt parsers.
> The Bio.Entrez Parser works as is; the Bio.SwissProt parser is really easy to
> fix (just convert each line into a plain string inside the _read function in
> Bio.SwissProt.__init__). Perhaps we can do something similar for the other
> test_SeqIO_online.py failures (the ones appearing in Bio/SeqIO/FastaIO.py)?

Maybe (replied in more detail below)

>> > So I'd suggest to not use File.UndoHandle (at all),
>> ...
>> I disagree. The NCBI return multiple different file
>> formats, so there are multiple different parsers that may get
>> an error page.
>>
>> Given the NCBI return HTML error pages regardless of what
>> format the request was (XML, plain text, etc), I think we
>> have to look for errors before giving the data to the
>> parser.
>
> Part of the problem solves itself when we change to Python 3. In Python
> 3, urllib.request.urlopen raises a urllib.error.HTTPError in cases where
> urllib.urlopen in Python 2 raises no exception:
>

bugzilla-daemon | 2 Aug 2010 16:21

[Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010)

http://bugzilla.open-bio.org/show_bug.cgi?id=3119

biopython-bugzilla <at> maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #8 from biopython-bugzilla <at> maubp.freeserve.co.uk  2010-08-02 10:21 EST -------
Ari has released PRANK v100802 (2 August 2010) which fixes the NEXUS output
problems identified (unquoted taxa names containing punctuation, extra comma
in translate block).

With Frank's small fix for the tree, we can now parse the latest PRANK output
http://github.com/biopython/biopython/commit/f4b0007d29fdd878e4cc326b12e63e833e246ce4

Marking as fixed.

--

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

Peter | 2 Aug 2010 19:22

EMBOSS SAM/BAM parser and reverse strand reads

Hi all,

One of my immediate questions on learning that EMBOSS 6.3.1 had
SAM/BAM support was how it handled reads mapped to the reverse
strand:

http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
> What do you do about the strand issue? SAM/BAM stored reads
> which map onto the reverse strand in reverse complement. If
> you want to get back to the original orientation for output as
> FASTQ you must apply the reverse complement (plus reverse
> the quality scores too of course).

As I suspected, currently EMBOSS ignores this and gives the sequence
and quality string as it is stored in the SAM/BAM file.

Here are three consecutive entries from the example SAM file,
http://pysam.googlecode.com/hg/tests/ex1.sam.gz

...
EAS54_65:6:115:538:276	163	chr1	209	99	35M	=	360	186	TATTTGTAATGAAAACTATATTTATGCTATTCAGT	<<<<<<<<;<<<;;<<<;<:<<<:<<<<<<;;;7;	MF:i:18	Aq:i:75	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS219_FC30151:7:51:1429:1043	83	chr1	209	99	35M	=	59	-185	TATTTGTAATGAAAACTATATTTATGCTATTCAGT	9<5<<<<<<<<<<<<<9<<<9<<<<<<<<<<<<<<	MF:i:18	Aq:i:68	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS114_30:1:176:168:513	163	chr1	210	99	35M	=	410	235	ATTTGTAATGAAAACTATATTTATGCTATTCAGTT	<<<<;<<<<<<<<<<<<<<<<<<<:&<<<<:;0;;	MF:i:18	Aq:i:71	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
...

The middle read of this triple, EAS219_FC30151:7:51:1429:1043, maps
to chr1 on the reverse strand - we know this from the flag value 83.

Note 83 = 1 + 2 + 16 + 64, or in hex, 0x53 = 0x40 + 0x10 + 0x02 + 0x01.
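
The flag arithmetic, and the reverse complement step needed to recover the original read, can be sketched as follows. The constant and function names are made up for illustration; the bit meanings follow the SAM flag definitions (0x01 paired, 0x02 proper pair, 0x10 reverse strand, 0x40 first in pair).

```python
FLAG_PAIRED = 0x01
FLAG_PROPER_PAIR = 0x02
FLAG_REVERSE = 0x10
FLAG_FIRST_IN_PAIR = 0x40

# Translation table for complementing unambiguous bases and N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(flag, seq, qual):
    """Return (seq, qual) as originally sequenced, given a SAM flag."""
    if flag & FLAG_REVERSE:
        # Stored reverse-complemented: undo it and reverse the qualities too.
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


# The middle read from the example above (flag 83).
seq, qual = original_orientation(
    83,
    "TATTTGTAATGAAAACTATATTTATGCTATTCAGT",
    "9<5<<<<<<<<<<<<<9<<<9<<<<<<<<<<<<<<",
)
print(seq)
print(qual)
```

The reads with flag 163 on either side do not have bit 0x10 set, so they would pass through unchanged.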


bugzilla-daemon | 2 Aug 2010 20:12

[Bug 3127] New: SeqIO.write appends text to fasta comments

http://bugzilla.open-bio.org/show_bug.cgi?id=3127

           Summary: SeqIO.write appends text to fasta comments
           Product: Biopython
           Version: 1.54
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev <at> biopython.org
        ReportedBy: jared.ackers <at> smithsdetection.com

When using the following SeqIO command: 

SeqIO.write(SeqIO.parse("file.txt", "tab"), "file.fas", "fasta")

SeqIO will append the text " <unknown description>" to every sequence ID in the
output file.  The input file has two tab-delimited columns, the first with a
(custom) sequence ID and the second with the corresponding sequence.

bugzilla-daemon | 2 Aug 2010 20:50

[Bug 3127] Set SeqRecord description in SeqIO "tab" parser

http://bugzilla.open-bio.org/show_bug.cgi?id=3127

biopython-bugzilla <at> maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|SeqIO.write appends text to |Set SeqRecord description in
                   |fasta comments              |SeqIO "tab" parser

------- Comment #1 from biopython-bugzilla <at> maubp.freeserve.co.uk  2010-08-02 14:50 EST -------
The problem isn't really in Bio.SeqIO.write(), it is with the SeqRecord
default and/or the "tab" parser. Retitling bug...

In the "tab" file format there is no description, so you are getting the
SeqRecord's default description. We'd recently talked about making this
just an empty string. Alternatively, and with less risk, the "tab" parser
could set the description to "" explicitly.
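
A stand-alone model of the second option, to show where the stray text comes from. None of this is Biopython code: SimpleRecord, parse_tab and fasta_title are hypothetical stand-ins, and the default string is taken from the bug report.

```python
from dataclasses import dataclass


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a SeqRecord with a placeholder default."""
    id: str
    seq: str
    description: str = "<unknown description>"


def parse_tab(lines, set_empty_description=True):
    """Yield records from two-column, tab-delimited lines (ID, sequence)."""
    for line in lines:
        if not line.strip():
            continue
        identifier, seq = line.rstrip("\n").split("\t")
        record = SimpleRecord(identifier, seq)
        if set_empty_description:
            # The format has no description, so say so explicitly.
            record.description = ""
        yield record


def fasta_title(record):
    """Build a FASTA title line from the ID and description."""
    return (">%s %s" % (record.id, record.description)).rstrip()


for fixed in (False, True):
    record = next(parse_tab(["id1\tACGT\n"], set_empty_description=fixed))
    print(fasta_title(record))
```

Setting the field in the parser leaves the record default alone, which is why it is the lower-risk change.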

Peter | 3 Aug 2010 16:07

Re: Python 3 and encoding for online resources

Peter wrote:
>Michiel wrote:
>> So I would suggest to switch from urllib to urllib2 in Bio.Entrez and catch
>> any HTTP errors (urllib2 is translated appropriately by 2to3),
>
> That sounds very sensible.
>

Hi Michiel,

I see you've switched from urllib to urllib2, but you also removed all
the NCBI specific error handling (which it turns out would need to be
updated).

I just tried a simple history example and if you deliberately use a
wrong webenv you get an HTML error page back (from memory
and the comments in our code it used to be a plain text error page):

<html>
<body>
<br/><h2>Error occurred: Unable to obtain query #1</h2><br/><ul
title="some params from request:">
<li>db=pubmed</li>
<li>query_key=1</li>
<li>report=medline</li>
<li>dispstart=0</li>
<li>dispmax=10</li>
<li>mode=text</li>
<li>WebEnv=wrong</li>
</ul>

Michiel de Hoon | 3 Aug 2010 17:44

Re: Python 3 and encoding for online resources

Have you tried looking at handle.info(), where handle is the handle returned by urllib.urlopen()?
Another candidate is handle.getcode(). Failing that, we could contact NCBI to ask whether their error
messages can be returned in a standard format, or at least in a format consistent with the request.
Alternatively, we could simply not parse the HTML error message; the SeqIO/Entrez parsers will notice
a format problem and raise an exception anyway.
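
A sketch of what inspecting handle.info() and handle.getcode() could look like under Python 3, using a locally built response so that no request is made. The helper names are hypothetical, and whether NCBI's error pages actually carry a distinguishing status code or Content-Type is not established in this thread.

```python
import io
from email.message import Message
from urllib.response import addinfourl


def response_summary(handle):
    """Return (status code, content type) for a urlopen-style handle."""
    return handle.getcode(), handle.info().get_content_type()


def is_unexpected_html(handle, expected_type):
    """True if the server sent HTML when another format was requested."""
    _, content_type = response_summary(handle)
    return content_type == "text/html" and expected_type != "text/html"


def fake_response(content_type, body=b""):
    """Hypothetical stand-in for the object urlopen would return."""
    headers = Message()
    headers["Content-Type"] = content_type
    return addinfourl(io.BytesIO(body), headers, "http://example.invalid/", 200)


handle = fake_response("text/html; charset=UTF-8", b"<html>...</html>")
print(response_summary(handle))
print(is_unexpected_html(handle, "text/xml"))
```

If the headers turn out not to distinguish error pages, the fallback of letting the format parser raise its own exception still applies.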

--Michiel.

--- On Tue, 8/3/10, Peter <biopython <at> maubp.freeserve.co.uk> wrote:

> From: Peter <biopython <at> maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] Python 3 and encoding for online resources
> To: "Michiel de Hoon" <mjldehoon <at> yahoo.com>
> Cc: "Biopython-Dev Mailing List" <biopython-dev <at> biopython.org>
> Date: Tuesday, August 3, 2010, 10:07 AM
> Peter wrote:
> >Michiel wrote:
> >> So I would suggest to switch from urllib to urllib2 in Bio.Entrez and
> >> catch any HTTP errors (urllib2 is translated appropriately by 2to3),
> >
> > That sounds very sensible.
> >
> 
> Hi Michiel,
> 
> I see you've switched from urllib to urllib2, but you also removed all
> the NCBI specific error handling (which it turns out would