Hongbo Zhu 朱宏博 | 2 Dec 10:41 2011

slicing in Bio.PDB.Chain.__getitem__() ?

Hi,

I propose to add slicing to class Bio.PDB.Chain by changing function
Bio.PDB.Chain.__getitem__().

* Why is slicing necessary for Bio.PDB.Chain?
Protein domain definitions are usually presented as the starting and ending
positions of the domain in protein primary structures, e.g. in SCOP, or
CATH. Slicing comes in handy when extracting domains from PDB files.

* Why is slicing not available at the moment?
I understand that the majority of Bio.PDB.Entity objects are not lists, and
there is no internal *sequential order* for the child entities in these
objects. For example, in Bio.PDB.Model, the child Chain entities do not
really have a sequential order within the Model, so slicing does not seem
to make sense there.
But Bio.PDB.Chain is exceptional: Residue entities in Bio.PDB.Chain have a
sequence order as presented in the primary structure and slicing becomes a
reasonable operation.

* How to slice a Chain entity?
I think it can be realized by revising the
function Bio.PDB.Chain.__getitem__(). For example:

    def __getitem__(self, id):
        """Return the residue with given id.

        The id of a residue is (hetero flag, sequence identifier, insertion code).
        If id is an int, it is translated to (" ", id, " ") by the _translate_id

João Rodrigues | 2 Dec 11:32 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

Hey Hongbo,

Interesting idea, but couldn't it be done already with child_list in a more
or less straightforward manner?

Best,

João
On 2 Dec 2011 10:43, "Hongbo Zhu 朱宏博" <macrozhu <at> gmail.com> wrote:

> Hi,
>
> I propose to add slicing to class Bio.PDB.Chain by changing function
> Bio.PDB.Chain.__getitem__().
>
> * Why is slicing necessary for Bio.PDB.Chain?
> Protein domain definitions are usually presented as the starting and ending
> positions of the domain in protein primary structures, e.g. in SCOP, or
> CATH. Slicing comes in handy when extracting domains from PDB files.
>
> * Why is slicing not available at the moment?
> I understand that the majority of Bio.PDB.Entity objects are not lists, and
> there is no internal *sequential order* for the child entities in these
> objects. For example, in Bio.PDB.Model, the child Chain entities do not
> really have a sequential order within the Model, so slicing does not seem
> to make sense there.
> But Bio.PDB.Chain is exceptional: Residue entities in Bio.PDB.Chain have a
> sequence order as presented in the primary structure and slicing becomes a
> reasonable operation.
>

Hongbo Zhu 朱宏博 | 2 Dec 13:43 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

Hi, Joao,

thanks for the response. When I spoke of slicing Bio.PDB.Chain, I meant to
slice it using residue id, not list index. And these two ways are
fundamentally different.

For instance:

not only slicing like this:

chain.child_list[2:12]  # slice using list index

but also slicing like this:

chain[2:12]   # slice using residue sequence id, not feasible at the moment
              # NOTE: this is fundamentally different from chain.child_list[2:12]

or even:

chain[(' ', 2, ' ') : (' ', 12, ' ')]  # slice using residue full id, even better

Of course one can play with child_list and obtain the same outcome. But I
think it would be very convenient to implement it in the __getitem__()
function.
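
The difference between the two kinds of slicing can be illustrated with a small sketch. It uses a plain list of residue ids as a stand-in for a chain (an assumption for illustration, not Biopython's Chain class), with a numbering that deliberately starts at 5:

```python
# Toy stand-in for a chain: residue ids in primary-structure order.
# The numbering starts at 5 here, since PDB numbering need not start at 1.
residue_ids = [(" ", n, " ") for n in range(5, 31)]

# Slicing by list index: the elements at list positions 2..11.
by_index = residue_ids[2:12]

# Selecting by residue sequence number: numbers 2..11 (stop excluded).
by_resseq = [rid for rid in residue_ids if 2 <= rid[1] < 12]

print(by_index[0], by_index[-1])    # first and last residue picked by index
print(by_resseq[0], by_resseq[-1])  # first and last residue picked by number
```

The two selections cover different residues as soon as the numbering does not start at 0, which is the usual situation in PDB files.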

cheers,hongbo

2011/12/2 João Rodrigues <anaryin <at> gmail.com>

> Hey Hongbo,

Peter Cock | 5 Dec 11:45 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

2011/12/2 Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com>:
> Hi, Joao,
>
> thanks for the response. When I spoke of slicing Bio.PDB.Chain, I meant to
> slice it using residue id, not list index. And these two ways are
> fundamentally different.
>
> For instance:
>
> not only slicing like this:
>
> chain.child_list[2:12]  # slice using list index
>
> but also slicing like this:
>
> chain[2:12]   # slice using residue sequence id, not feasible at the moment
>               # NOTE: this is fundamentally different from chain.child_list[2:12]
>
> or even:
>
> chain[(' ', 2, ' ') : (' ', 12, ' ')]  # slice using residue full id, even better
>
> Of course one can play with child_list and obtain the same outcome. But I
> think it would be very convenient to implement it in the __getitem__()
> function.
>
> cheers,hongbo

Hi Hongbo,


Hongbo Zhu 朱宏博 | 5 Dec 12:46 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

Hi, Peter,

I just realized a special issue concerning slicing Bio.PDB.Chain.
Normally, in Python a slice is given by three arguments: start, stop and
step, where the element at position *stop* is not included in the output.
For example,

mylist[2:40:1]  would return: [ mylist[2],mylist[3], ...., mylist[39] ]

But in CATH and SCOP, the sequence segments composing domains are given as
start and end positions, and the residue at the end position is also
included in the domain definition. E.g. if a domain is defined to be from
residue (' ', 1, ' ') to residue (' ', 40, ' '), a slice like
mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would not include
residue (' ', 40, ' '). And it is not guaranteed that
mychain[(' ', 2, ' '): (' ', 41, ' ')] would give the correct outcome, because
the residue after (' ', 40, ' ') does not necessarily have to be
(' ', 41, ' '). Of course we can change the code in __getitem__() such that
it includes the end position, but that goes against the general Python
convention of slicing.
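
The end-position issue can be made concrete with a small sketch. The helper name get_slice_inclusive and the toy list of residue ids are assumptions made for illustration; they are not part of Biopython:

```python
def get_slice_inclusive(ordered_ids, start, end):
    """Return the ids from start through end, end included (hypothetical helper)."""
    i = ordered_ids.index(start)  # raises ValueError if start is absent
    j = ordered_ids.index(end)    # raises ValueError if end is absent
    return ordered_ids[i:j + 1]


# Toy chain: residue 40 is followed by an insertion-code residue 40A, then 42.
ids = [(" ", n, " ") for n in range(1, 41)] + [(" ", 40, "A"), (" ", 42, " ")]

# Domain boundaries as CATH/SCOP would give them: both ends included.
domain = get_slice_inclusive(ids, (" ", 2, " "), (" ", 40, " "))
print(domain[-1])

# A half-open slice would need the id of "the residue after 40" as its stop,
# but (" ", 41, " ") does not exist in this toy chain.
print((" ", 41, " ") in ids)
```

Looking up both boundaries by id and adding one to the end position avoids guessing what the next residue id is.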

So I think an independent function is perhaps needed:

class Chain(Entity):

    def get_slice(self, start, end, step=None):
        """Return a slice of the chain from start to end (including end position)

        Arguments:
        o start - (string, int, string) or int

Peter Cock | 5 Dec 13:15 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

On Mon, Dec 5, 2011 at 11:46 AM, Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com> wrote:
> Hi, Peter,
>
> I just realized a special issue concerning slicing Bio.PDB.Chain.
> Normally, in python a slice is given by three arguments: start, stop and
> step, where the element at position *stop* is not included in the output.
> For example,
>
> mylist[2:40:1]  would return: [ mylist[2],mylist[3], ...., mylist[39] ]
>

Yes,

> But in CATH and SCOP, sequence segments composing domains
> are given as start and end position. And the residue at the end
> position is also included in the domain definition.

OK. I'd have to double check what our parsers return (and if
they convert the start/end into C/Python style).

> e.g. if a domain
> is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a slicing
> like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would not
> include residue (' ',40,' ').

Perhaps I misunderstood - I would not want to allow the syntax
mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
the user to use mychain[2:41] which requires Python counting.

Peter

Hongbo Zhu 朱宏博 | 5 Dec 14:38 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

> > But in CATH and SCOP, sequence segments composing domains
> > are given as start and end position. And the residue at the end
> > position is also included in the domain definition.
>
> OK. I'd have to double check what our parsers return (and if
> they convert the start/end into C/Python style).
>
> > e.g. if a domain
> > is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a slicing
> > like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would not
> > include residue (' ',40,' ').
>
> Perhaps I misunderstood - I would not want to allow the syntax
> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
> the user to use mychain[2:41] which requires Python counting.
>
>
But even in mychain[2:41], the 2 and 41 should be residue sequence numbers.
Then it is consistent with the currently accepted syntax mychain[2], where
2 also refers to a sequence number. At the moment, Biopython also
accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, ' ')]
would be just a natural extension of mychain[(' ', 2, ' ')].
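
A minimal sketch of what such an extension could look like is given below. ToyChain is a hypothetical stand-in and not Biopython's Chain class; the int-to-tuple translation and the choice to exclude the stop residue (Python convention) are assumptions made for illustration:

```python
class ToyChain:
    """Hypothetical stand-in for a chain; not Biopython's Chain class."""

    def __init__(self, residue_ids):
        self.child_list = list(residue_ids)  # ids in sequence order

    def _translate_id(self, id):
        # Assumed behaviour: a bare int n is shorthand for (" ", n, " ").
        if isinstance(id, int):
            return (" ", id, " ")
        return id

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Translate both bounds, then slice between their positions.
            # The stop residue is excluded, following Python convention.
            # Open-ended slices and steps are not handled in this sketch.
            start = self.child_list.index(self._translate_id(key.start))
            stop = self.child_list.index(self._translate_id(key.stop))
            return self.child_list[start:stop]
        rid = self._translate_id(key)
        if rid not in self.child_list:
            raise KeyError(rid)
        return rid


chain = ToyChain((" ", n, " ") for n in range(16, 60))
# The int form and the full-id form select the same residues.
print(chain[20:23] == chain[(" ", 20, " "):(" ", 23, " ")])
```

In this sketch both bounds must be ids that exist in the chain, so the question of whether the end position should be included is left to whichever convention is chosen.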

According to the source code, mychain[2] is considered an abbreviation of
mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
mychain[(' ', 2, ' ')] by the function Bio.PDB.Chain._translate_id(). So if
mychain[2:4] would be allowed, internally it would also

Peter Cock | 5 Dec 14:50 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com> wrote:
>
>> Perhaps I misunderstood - I would not want to allow the syntax
>> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
>> the user to use mychain[2:41] which requires Python counting.
>
> But even in mychain[2:41], the 2 and 41 should be residue sequence numbers.
> Then it is consistent with the currently accepted syntax mychain[2], where 2
> also refers to a sequence number. At the moment, Biopython also
> accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, ' ')]
> would be just a natural extension of mychain[(' ', 2, ' ')].
>
> According to the source code, mychain[2] is considered an abbreviation of
> mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
> mychain[(' ', 2, ' ')] by the function Bio.PDB.Chain._translate_id(). So if
> mychain[2:4] would be allowed, internally it would also
> be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my
> point of view, mychain[2:4] is just an abbreviation for
> mychain[(' ', 2, ' '): (' ', 40, ' ')], just like mychain[2] is a short version
> of mychain[(' ', 2, ' ')].
>
> hongbo

I've never really liked these strange tuple IDs, which are usually
but not always full of empty values. I understand some of
the corner cases they handle, but they are very complicated.

You cannot assume 2 will map to (' ', 2, ' ') in general - this
is what the _translate_id method handles. Consider the case
where you have sliced the Chain as discussed, since the

Hongbo Zhu 朱宏博 | 5 Dec 15:48 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock <p.j.a.cock <at> googlemail.com> wrote:

> On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com> wrote:
> >
> >> Perhaps I misunderstood - I would not want to allow the syntax
> >> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
> >> the user to use mychain[2:41] which requires Python counting.
> >
> > But even in mychain[2:41], the 2 and 41 should be residue sequence numbers.
> > Then it is consistent with the currently accepted syntax mychain[2], where 2
> > also refers to a sequence number. At the moment, Biopython also
> > accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, ' ')]
> > would be just a natural extension of mychain[(' ', 2, ' ')].
> >
> > According to the source code, mychain[2] is considered an abbreviation of
> > mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
> > mychain[(' ', 2, ' ')] by the function Bio.PDB.Chain._translate_id(). So if
> > mychain[2:4] would be allowed, internally it would also
> > be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my
> > point of view, mychain[2:4] is just an abbreviation for
> > mychain[(' ', 2, ' '): (' ', 40, ' ')], just like mychain[2] is a short version
> > of mychain[(' ', 2, ' ')].
> >
> > hongbo

Peter Cock | 5 Dec 16:53 2011

Re: slicing in Bio.PDB.Chain.__getitem__() ?

On Mon, Dec 5, 2011 at 2:48 PM, Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com> wrote:
>
> On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock wrote:
>>
>> I've never really liked these strange tuple IDs, which are usually
>> but not always full of empty values. I understand some of
>> the corner cases they handle, but they are very complicated.
>
>
> This seems to be the problem of PDB.

Yes.

> I don't know how other packages handle the issue.
> In addition, I once proposed to remove the HETERO-flag in the residue ID.
> http://biopython.org/pipermail/biopython-dev/2011-January/008640.html
> It is only retained for the backwards compatibility with PDB files before
> remediation in 2007. Removing only HETERO-flag does not solve
> the problem totally, but to some extent (say, around 50%).

Breaking the API without making the ID much easier to use is a bad idea.

> PDB entry 1h4w is a good example with icode and the sequence of chain A
> starts with resnum 16.

That shows the problem nicely,

>>> from Bio import PDB
>>> structure = PDB.PDBParser().get_structure("1h4w", "1h4w.pdb")
>>> chain = structure[0]['A']

