Brad Chapman | 2 Jul 2012 12:36

Re: [GSoC] GSoC python variant update 6


Lenna;
Thanks for the updates and thoughts. I like the direction you're moving
after taking everything you've learned from the SQL experiments.

My general suggestions would be:

- Leverage PyVCF for all of the backend parsing. We want to remain
  compatible with this since merging/interfacing with the work James and
  everyone is doing is a primary goal. Keeping a similar code structure
  is a great way to facilitate this.

- For HGVS the general idea is to not be too tied to the VCF format, so
  I wouldn't worry about strict compatibility but rather use it to inform
  choices where you feel that things are mirroring VCF structure rather
  than more general variant representation.

> Another question that may reveal my complete ignorance of haplotypes
> and such: could a polyploid site ever be partially phased? e.g. a
> triploid genotype of 0/1|0?

It's possible but this is kind of a fringe case right now so I wouldn't
especially worry about it.
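
As a minimal sketch of how a mixed genotype string such as 0/1|0 could be recognised, assuming the usual VCF convention that "/" separates unphased alleles and "|" separates phased ones. The function names are hypothetical and are not part of PyVCF:

```python
import re


def split_genotype(gt):
    """Split a VCF-style GT string into alleles and the separators between them."""
    alleles = re.split(r"[/|]", gt)
    separators = re.findall(r"[/|]", gt)
    return alleles, separators


def phasing(gt):
    """Classify a genotype as 'phased', 'unphased', 'partial' or 'haploid'."""
    _, separators = split_genotype(gt)
    kinds = set(separators)
    if not kinds:
        return "haploid"  # a single allele has no separator at all
    if kinds == {"|"}:
        return "phased"
    if kinds == {"/"}:
        return "unphased"
    return "partial"  # both separators occur, e.g. 0/1|0
```

Keeping the separators alongside the alleles, rather than a single phased flag, is what lets the fringe case be represented without special handling.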

Thanks again,
Brad
redmine | 3 Jul 2012 10:59

[Biopython - Bug #3368] (New) Bio.GenBank format writer creates invalid start_codon entries.


Issue #3368 has been reported by Kai Blin.

----------------------------------------
Bug #3368: Bio.GenBank format writer creates invalid start_codon entries.
https://redmine.open-bio.org/issues/3368

Author: Kai Blin
Status: New
Priority: Normal
Assignee: Kai Blin
Category: Main Distribution
Target version: 
URL: ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb185.release.notes

When writing GenBank output, the module adds quotes around the codon_start qualifier. This doesn't
conform to the GenBank file spec (see URL), and it also breaks the parser of another program I need to use.

<pre>
BioPython generates:
                     /codon_start="1"
while it should generate:
                     /codon_start=1
</pre>
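
The difference shown above could be handled along these lines. This is only a sketch, not the actual Bio.GenBank writer code: the function name and the set are hypothetical, and the set deliberately contains only codon_start, the one qualifier this report confirms must be unquoted.

```python
# Hypothetical set of qualifiers written without quotes. Only codon_start
# is confirmed by this report; a real list would need to be longer.
UNQUOTED_QUALIFIERS = {"codon_start"}


def format_qualifier(key, value):
    """Format one feature qualifier, quoting the value unless the key is exempt."""
    if key in UNQUOTED_QUALIFIERS:
        return "/%s=%s" % (key, value)
    return '/%s="%s"' % (key, value)
```

With this, format_qualifier("codon_start", 1) gives /codon_start=1, while other keys keep their quotes.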

Looking over the Bio.GenBank code, it seems like I need to introduce some special handling for un-quoted
qualifiers. I'd suggest using the same list as BioPerl:

<pre>

(Continue reading)

Adam Hughes | 3 Jul 2012 21:19

Conserved Domains Database Support

Hi everyone,

I'm new to the BioPython library and was wondering if there was any support
for the conserved domains database from NCBI?  In particular, the
superfamily batch files that their webtool releases.  Doing a Google
search, there was some interest for this back in 2008; however, they were
mainly interested in parsing the HTML output of CDD searches.  Now that CDD
offers a nice, regular downloadable datatype, has any BioPython support
been implemented to work with this?

If not, I'd like to contribute.

The data is in a simple tab-delimited format of domain alignments, e.g.:

Q#10000    0    >WHL22.364604.0    superfamily    212291    7    290    1.01528e-138    401.1    cl09099    P-loop_NTPase    superfamily    0

I had envisioned a simple class of mainly getters/setters with a few
methods such as sorting by Query batches.
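
A rough sketch of such a class, assuming the example row above is a single tab-separated record. The field names are guesses read off that one row, not a documented CDD schema, and columns without an obvious meaning are left out:

```python
from collections import defaultdict, namedtuple

# Field names are assumptions based on the sample row only.
CddHit = namedtuple(
    "CddHit",
    "query hit_type start end evalue bitscore accession short_name",
)


def parse_cdd_line(line):
    """Turn one tab-delimited row into a CddHit, converting numeric columns."""
    cols = line.rstrip("\n").split("\t")
    return CddHit(
        query=cols[0],
        hit_type=cols[3],
        start=int(cols[5]),
        end=int(cols[6]),
        evalue=float(cols[7]),
        bitscore=float(cols[8]),
        accession=cols[9],
        short_name=cols[10],
    )


def group_by_query(lines):
    """Collect hits per query identifier, skipping blank lines."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            hit = parse_cdd_line(line)
            groups[hit.query].append(hit)
    return dict(groups)


# The sample row from this message, rebuilt with explicit tabs.
SAMPLE = "\t".join([
    "Q#10000", "0", ">WHL22.364604.0", "superfamily", "212291", "7", "290",
    "1.01528e-138", "401.1", "cl09099", "P-loop_NTPase", "superfamily", "0",
])
```

A namedtuple gives read-only attribute access without hand-written getters, and group_by_query(open(filename)) would cover the batching by query.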

~Adam
Peter Cock | 4 Jul 2012 00:03

Re: Conserved Domains Database Support

On Tue, Jul 3, 2012 at 8:19 PM, Adam Hughes <hughesadam87 <at> gmail.com> wrote:
> Hi everyone,
>
> I'm new to the BioPython library and was wondering if there was any support
> for the conserved domains database from NCBI?  In particular, the
> superfamily batch files that their webtool releases.  Doing a Google
> search, there was some interest for this back in 2008; however, they were
> mainly interested in parsing the HTML output of CDD searches.

HTML scrapers were always a bit of a pain :(

> Now that CDD
> offers a nice, regular downloadable datatype, has any BioPython support
> been implemented to work with this?
>
> If not, I'd like to contribute.
>
> The data is in a simple tab-delimited format of domain alignments, e.g.:
>
> Q#10000    0    >WHL22.364604.0    superfamily    212291 7 290
> 1.01528e-138    401.1    cl09099    P-loop_NTPase    superfamily
> 0
>
> I had envisioned a simple class of mainly getters/setters with a few
> methods such as sorting by Query batches.
>
> ~Adam

That is interesting - and offers to work on Biopython are always
nice. Is this a file giving domain definitions (HMM or whatever
(Continue reading)

Wibowo Arindrarto | 4 Jul 2012 15:03

GSoC Project Update -- 9

Hello everyone,

The past week I have been working to add PSL parsing support and I've
just posted my update here:
http://bow.web.id/blog/2012/07/initial-blat-support/

Currently, we have parsing, indexing, and writing support. But this
could change (writing might not be supported) due to a possible change
in the current object model. I've explained a bit about why this is the
case in the post, but to summarize it here, it's because we haven't
got a way to properly model segmented HSP sequences. Peter and I have
discussed this a bit, but we haven't figured out an elegant way to
solve it for now.

Aside from working on PSL, I also added more tests and started
refactoring the code as it's starting to get messy.

That's my update for the past week. For this week, I'll try to
look into other formats and try to come up with possible solutions to
the segmented HSP problem.

regards,
Bow
Reece Hart | 5 Jul 2012 21:40

Re: [GSoC] GSoC python variant update 6

On Fri, Jun 29, 2012 at 11:15 PM, Lenna Peterson <arklenna <at> gmail.com> wrote:

> For a Python variant object, are there any organizational choices that
> would make it easier for future conversion of a variant to HGVS
> syntax? (this is primarily directed at Reece but I'm open to all
> suggestions)
>

Oh, no, things directed at me!

That's a broad question. I'll try to answer without being long-winded.

The essential elements of a sequence variant are a reference to a sequence,
the location, and specifics about the operation. The name, allelic depth,
etc. are all distinct from these elements and I would store them separately
in a format-specific record or as a subclass.

I don't have much experience with FeatureLocations, but that might be
appropriate. Depending on how far you plan to go with VCF, you'll have to
deal with Locations for breakpoints.

For an Occam's Razor version of a model for variation, I'd float this in the
community:

variation := <accession, start, stop, pre_seq, post_seq>
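
As a sketch, the tuple above maps directly onto a Python namedtuple. The coordinate convention (0-based, half-open) and the helper function are assumptions made for illustration, and the accession is a placeholder:

```python
from collections import namedtuple

Variation = namedtuple("Variation", "accession start stop pre_seq post_seq")


def apply_variation(reference, var):
    """Return the reference sequence with the variation applied.

    Assumes 0-based, half-open [start, stop) coordinates; the proposal
    itself does not fix a convention.
    """
    if reference[var.start:var.stop] != var.pre_seq:
        raise ValueError("pre_seq does not match the reference")
    return reference[:var.start] + var.post_seq + reference[var.stop:]


# "chr_example" is a placeholder accession, not a real identifier.
snp = Variation("chr_example", 3, 4, "G", "T")
deletion = Variation("chr_example", 1, 3, "CT", "")
```

Substitutions, insertions and deletions all reduce to replacing pre_seq with post_seq, so the one record covers them without separate types.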

And I'd test this against representing:

   - a single SNP in VCF
   - a compound het from VCF
(Continue reading)

Peter Cock | 8 Jul 2012 21:06

Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki

This could be important for Lenna's GSoC project.

Heng Li had developed the original binary VCF format,
BCF, but IIRC he wasn't keen to push it as a standard -
see also http://vcftools.sourceforge.net/specs.html and
http://vcftools.sourceforge.net/bcf.pdf

It looks like BCF2 could be more widely used...

Peter

---------- Forwarded message ----------
From: Eric Banks <ebanks <at> broadinstitute.org>
Date: Thu, Jul 5, 2012 at 5:38 PM
Subject: [VCFtools-spec] The BCF2 quick reference document is up on
the 1000G wiki
To: "1000ANALYSIS <at> LIST.NIH.GOV" <1000ANALYSIS <at> list.nih.gov>,
"vcftools-spec <at> lists.sourceforge.net"
<vcftools-spec <at> lists.sourceforge.net>

  Hi everyone,

At the last 1000G meeting we discussed BCF2, the official binary version
of VCF.  The quick reference guide for BCF2 is now linked from the main
VCF page on the 1000G wiki; you can access it directly here:
http://www.1000genomes.org/sites/1000genomes.org/files/documents/bcfv2.pdf

I take no credit for the document itself, which is really the work of
Heng and Mark.  At this point, both the GATK and samtools can produce
BCF files (and they will soon become our standard output format).  We
(Continue reading)

Lenna Peterson | 9 Jul 2012 06:33

GSoC python variant update 7

Post: http://arklenna.tumblr.com/post/26812132902/

Synopsis:
This week, I wrote a script for PyVCF that can filter a file by sample
as it's being parsed. It's currently named `vcf_sample_filter.py`.
It's designed to be functional from the command line, the Python
interpreter, or as a module.
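
To illustrate the idea of filtering by sample during parsing, here is a plain-Python sketch. It is not `vcf_sample_filter.py` and does not use the PyVCF API; it relies only on the VCF layout of nine fixed columns (CHROM through FORMAT) followed by one column per sample. The function name is hypothetical:

```python
def filter_samples(lines, keep):
    """Yield VCF lines that retain only the sample columns named in keep."""
    indexes = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            yield line
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the tenth column of the header line.
            indexes = [9 + i for i, name in enumerate(cols[9:]) if name in keep]
        elif indexes is None:
            raise ValueError("data line found before the #CHROM header")
        yield "\t".join(cols[:9] + [cols[i] for i in indexes])


VCF_LINES = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\t0/0",
]
```

Because it is a generator, records are filtered one at a time and the whole file never has to be held in memory.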

Next up: come up with a generic-via-extensibility representation of a
variant. I'm working through some examples and should have a basic
outline soon.

Lenna
Peter Cock | 9 Jul 2012 13:33

Re: Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki

On Mon, Jul 9, 2012 at 12:27 PM, Brad Chapman <chapmanb <at> 50mail.com> wrote:
>
> Peter;
> Thanks for the heads up. I'm excited about BCF2 and am hopeful it'll
> help with some of the painful parts of VCF, like subsetting large files
> by samples. There is also a page about it on the Broad wiki with more details:
>
> http://www.broadinstitute.org/gsa/wiki/index.php/BCF2
>
> In terms of the representation, this stays close to VCF so shouldn't
> change a lot of the API people see. The main changes would be on the
> backend side where we'd like to be able to swap in and out BCF2 and VCF
> (and GVF) transparently with no visible change to the programmer.
>
> Brad

Yes - that's what we should be aiming for, much like the SAM/BAM
duality which has worked really well for sequence alignments.

Note that like BAM, BCF and BCF2 are both compressed with
BGZF - support for which we included in Biopython 1.60. This
can be combined with the Python struct module to parse the
binary data (and with a little more effort will support both Python
2 and 3, see the SFF code for pointers or ask me).
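
A small sketch of the struct side of this. The header layout here (4-byte magic, little-endian 32-bit length, then text) is invented for the demonstration and is not the BCF2 layout. An in-memory handle stands in for a decompressed BGZF handle, such as one opened in binary mode with Bio.bgzf:

```python
import io
import struct


def read_header(handle):
    """Read a hypothetical header: 4-byte magic, int32 length, then ASCII text."""
    magic = handle.read(4)
    # "<i" means little-endian signed 32-bit integer.
    (length,) = struct.unpack("<i", handle.read(4))
    text = handle.read(length).decode("ascii")
    return magic, text


# Build a fake binary stream to parse; real data would come from a file.
payload = b"##example header\n"
demo = io.BytesIO(b"DEMO" + struct.pack("<i", len(payload)) + payload)
magic, text = read_header(demo)
```

Working with bytes throughout and decoding explicitly is the part that keeps the same code usable on both Python 2 and 3.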

Peter
Brad Chapman | 9 Jul 2012 13:27

Re: Fwd: [VCFtools-spec] The BCF2 quick reference document is up on the 1000G wiki


Peter;
Thanks for the heads up. I'm excited about BCF2 and am hopeful it'll
help with some of the painful parts of VCF, like subsetting large files
by samples. There is also a page about it on the Broad wiki with more details:

http://www.broadinstitute.org/gsa/wiki/index.php/BCF2

In terms of the representation, this stays close to VCF so shouldn't
change a lot of the API people see. The main changes would be on the
backend side where we'd like to be able to swap in and out BCF2 and VCF
(and GVF) transparently with no visible change to the programmer.

Brad

> This could be important for Lenna's GSoC project.
>
> Heng Li had developed the original binary VCF format,
> BCF, but IIRC he wasn't keen to push it as a standard -
> see also http://vcftools.sourceforge.net/specs.html and
> http://vcftools.sourceforge.net/bcf.pdf
>
> It looks like BCF2 could be more widely used...
>
> Peter
>
>
> ---------- Forwarded message ----------
> From: Eric Banks <ebanks <at> broadinstitute.org>
> Date: Thu, Jul 5, 2012 at 5:38 PM
(Continue reading)

