Fields, Christopher J | 27 Jun 01:46 2015

Fwd: [Genbank-bb] Change to sequence display formats : Removal of GIs by June 2016

Something to keep in mind if parsing breaks (though we should be okay).  I’m more concerned about BLAST+ XML changes...

chris

Begin forwarded message:

From: "Cavanaugh, Mark (NIH/NLM/NCBI) [E]" <cavanaug <at> ncbi.nlm.nih.gov>
Date: June 26, 2015 at 5:13:50 PM CDT
Subject: [Genbank-bb] Change to sequence display formats : Removal of GIs by June 2016

Greetings GenBank Users,

 A very significant change which impacts the GenBank, GenPept, and FASTA
display formats for sequence records at NCBI was announced in the June 2015
GenBank release notes : The removal of GI sequence identifiers.

 This change could have many impacts, so it seems prudent to announce it
independently, to ensure that as many users are aware of the change as
possible. So Section 1.4.1 of the June release notes is reproduced below.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS


1.4.1 GI sequence identifiers to be removed from GenBank/GenPept/FASTA formats

 As of 06/15/2016, the integer sequence identifiers known as "GIs" will no
longer be included in the GenBank, GenPept, and FASTA formats supported by
NCBI for the display of sequence records.

 As first described in the Release Notes for GenBank 199.0 in December 2013,
NCBI is in the process of moving to storage solutions which utilize only
Accession.Version identifiers. See Section 1.4.2 of these release notes for
additional background information about those developments.

 Although GI sequence identifiers served their purpose well for many years,
the Accession.Version system is completely equivalent (and much more
human-readable).

 And given the shift to non-GI-based systems, the importance of using
Accession.Version identifiers cannot be overstated. So as an initial step, NCBI
will cease the display of GI identifiers in the flatfile and FASTA views of
all sequence records.

 Previously-assigned GI identifiers will continue to exist 'behind the scenes',
and NCBI services (including URLs, APIs, etc) which accept GIs as inputs/arguments
will be supported, for those sequence records that have GIs, for the foreseeable
future.

 Over the next year NCBI will identify all such services that do not yet
support Accession.Version identifiers, and add that support. Users of those
services will then be encouraged to make use of Accession.Version rather than GIs.
Of course, for those services that already support Accession.Version, NCBI
encourages users to begin transitioning away from GI as soon as is practical.

 In the sample record below, nucleotide sequence AF123456 has been assigned a
GI of 6633795, and the protein translation of its coding region feature has
been assigned a GI of 6633796 :

LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
           (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2  GI:6633795
....
    CDS             <1..936
                    /gene="DMRT1"
                    /note="cDMRT1"
                    /codon_start=1
                    /product="doublesex and mab-3 related transcription factor
                    1"
                    /protein_id="AAF19666.1"
                    /db_xref="GI:6633796"
                    /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                    IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                    HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                    NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                    RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                    GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

 After June 15 2016, the GI value on the VERSION line and the GI /db_xref
qualifier for the coding region feature will no longer be displayed:

LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
           (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2
....
    CDS             <1..936
                    /gene="DMRT1"
                    /note="cDMRT1"
                    /codon_start=1
                    /product="doublesex and mab-3 related transcription factor
                    1"
                    /protein_id="AAF19666.1"
                    /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                    IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                    HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                    NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                    RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                    GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

 Similarly, the GI value will be removed from the VERSION line of the GenPept
format. Currently:

LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
           gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1  GI:6633796
DBSOURCE    accession AF123456.2
....
    CDS             1..311
                    /gene="DMRT1"
                    /coded_by="AF123456.2:<1..936"

As of 06/15/2016:

LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
           gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1
DBSOURCE    accession AF123456.2
....
    CDS             1..311
                    /gene="DMRT1"
                    /coded_by="AF123456.2:<1..936"

 Note that the coding region feature for GenPept format has never included
the display of nucleotide GI values.
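For home-grown parsers that read the VERSION line directly, a tolerant pattern is one way to cope with both layouts shown above. The following is only a minimal sketch: `parse_version_line` is a hypothetical helper, not a BioPerl function, and it assumes the line carries nothing beyond the Accession.Version and an optional GI.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): pull the Accession.Version, and
# the GI if one is still present, from a GenBank/GenPept VERSION line.
sub parse_version_line {
    my ($line) = @_;
    return unless $line =~ /^VERSION\s+(\S+)(?:\s+GI:(\d+))?\s*$/;
    return { accession_version => $1, gi => $2 };   # gi is undef when absent
}

for my $line ('VERSION     AF123456.2  GI:6633795', 'VERSION     AF123456.2') {
    my $v = parse_version_line($line);
    printf "%s\t%s\n", $v->{accession_version}, $v->{gi} // 'no GI';
}
```

Treating the GI as optional means the same code should keep working on records saved before and after the change.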

For FASTA format, GI values will be removed from the FASTA header/defline:

Currently:

>gi|6633795|gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>gi|6633796|gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE

As of 06/15/2016:

>gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE
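
Scripts that key on the `gi|NNN|` prefix of a defline are the ones most likely to break. As a rough sketch only, the hypothetical helper below (`accession_from_defline`, not a BioPerl routine) returns the Accession.Version from either header style shown above, and also from a bare identifier; it assumes the identifier is the first whitespace-delimited token.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the Accession.Version from an NCBI-style FASTA
# defline, whether or not it carries a leading gi|NNN| field.
sub accession_from_defline {
    my ($defline) = @_;
    $defline =~ s/^>//;                    # tolerate a leading '>'
    my ($id) = split ' ', $defline;        # first whitespace-delimited token
    my @fields = split /\|/, $id;          # trailing empty field is dropped
    splice(@fields, 0, 2) if @fields >= 2 && $fields[0] eq 'gi';   # drop gi|NNN
    # What remains is either db|ACC.VER or a bare ACC.VER
    return @fields >= 2 ? $fields[1] : $fields[0];
}

print accession_from_defline('>gi|6633795|gb|AF123456.2| Gallus gallus DMRT1'), "\n";
print accession_from_defline('>gb|AF123456.2| Gallus gallus DMRT1'), "\n";
```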

Please direct any inquiries about these changes to the NCBI Service Desk:

 info <at> ncbi.nlm.nih.gov



_______________________________________________
Genbankb mailing list
Genbankb <at> net.bio.net
http://www.bio.net/biomail/listinfo/genbankb

_______________________________________________
Bioperl-l mailing list
Bioperl-l <at> mailman.open-bio.org
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Peng Yu | 17 Jun 03:36 2015

Allowing a list of ids for -id of Bio::DB::EUtilities->new()?

Hi,

To allow multiple ids, I have to manually concatenate the ids into a
string such as '19008417,19008416'. Would it be better to allow -id to
take a list, e.g. ('19008417', '19008416')?

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::DB::EUtilities;
use XML::Simple;
use Data::Dumper;

my $factory = Bio::DB::EUtilities->new(
  -eutil => 'efetch',
  -db => 'pubmed',
  -email => 'mymail@foo.bar',
  -id => '19008417,19008416',
  -retmode => 'xml',
);

#print $factory->get_Response->content;

# Fetch the raw XML, print it, then parse it into a Perl data structure.
my $string = $factory->get_Response->content;
print $string, "\n";
my $xml  = XML::Simple->new;
my $data = $xml->XMLin($string);
print Dumper($data);
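
In the meantime, one workaround is to keep the ids in an ordinary list and join them where the factory is built. This is just a sketch of that idea in plain Perl; `@ids` and `$id_string` are illustrative names. As far as I recall, -id may also accept an array reference (`-id => \@ids`), but treat that as an assumption to check against the Bio::DB::EUtilities POD.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Keep the PubMed ids as a normal Perl list...
my @ids = ('19008417', '19008416');

# ...and build the comma-separated form that the script above passes to -id.
my $id_string = join ',', @ids;
print "$id_string\n";
```

The resulting `$id_string` can then be passed as `-id => $id_string` in the constructor call above.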

-- 
Regards,
Peng
Horacio Montenegro | 14 Jun 01:34 2015

Parsing "PCR_primers" tag from GenBank file

    hi,

    I am trying to parse a GenBank file to extract primers and other
info, outputting it separated with tabs. However, some records have
two "PCR_primers" tags, and they are not being separated. The
"satellite" tag also is doubled, but each one is being correctly
separated with tabs. How can I manage to output each primer pair
separated?

    thanks, Horacio

    One example of a record with two "PCR_primers" tags is Accession
GQ344853, GI 282937571. Below is the code snippet to reproduce the
behaviour:

#!/usr/bin/env perl
use Bio::DB::EUtilities;
use Bio::SeqIO;
my $seqio_object = Bio::SeqIO->new(-file => 'GQ344853.gb' );
while (my $seq = $seqio_object->next_seq) {
    print $seq->primary_id, "\t", $seq->length, "\t";
    for my $feat_object ($seq->get_SeqFeatures) {
        print $feat_object->get_tag_values("satellite"), "\t"
            if $feat_object->has_tag("satellite");
        print $feat_object->get_tag_values("PCR_primers"), "\t"
            if $feat_object->has_tag('PCR_primers');
    }
    print "\n";
}
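
A possible explanation, offered as an assumption rather than a diagnosis: if both "PCR_primers" qualifiers sit on the same feature, the tag lookup returns them as one list, and `print` writes list elements back to back because the output field separator `$,` is empty by default. The sketch below uses a made-up array as a stand-in for the returned values (the primer strings are invented) and joins them explicitly.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in for what get_tag_values("PCR_primers") might return when one
# feature carries two /PCR_primers qualifiers (values are made up).
my @primers = ('fwd_seq: aaaa, rev_seq: cccc', 'fwd_seq: gggg, rev_seq: tttt');

# print LIST writes the elements with no separator, so join them first.
my $row = join "\t", @primers;
print $row, "\n";
```

In the loop above, the equivalent change would be to wrap the tag lookup in `join("\t", ...)` before printing.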
Fields, Christopher J | 9 Jun 16:51 2015

Test

Test email.  Seems to be stuck :(
Peng Yu | 9 Jun 16:47 2015

An example to query PMID given GEO accession number?

Hi,

I am looking for an example of how to query GEO to get the PMID associated
with a GEO accession number.

For example, GSE39684 has the associated PMID 23028701. Can this be
done with bioperl? Does anybody have an example script for doing so?
Thanks.

-- 
Regards,
Peng
Peng Yu | 3 Jun 19:00 2015

Why does Bio::DB::EUtilities->new( -eutil => $eutil, -email => 'mymail@foo.bar') with no extra argument only work for egquery?

The following code works for egquery but not efetch. I want to find
out what databases are available for an eutil. Does anybody know what
the correct way is? Thanks.

$ cat main.pl
#!/usr/bin/env perl

use strict;
use warnings;

use Bio::DB::EUtilities;

my $eutil=$ARGV[0];
my $factory = Bio::DB::EUtilities->new(
#  -eutil => 'egquery',
  -eutil => $eutil,
  -email => 'mymail@foo.bar',
);

print join("\n", $factory->get_databases), "\n";
$ ./main.pl egquery
pubmed
pmc
mesh
books
pubmedhealth
omim
ncbisearch
nuccore
nucgss
nucest
protein
genome
structure
taxonomy
snp
dbvar
epigenomics
gene
sra
biosystems
unigene
cdd
clone
popset
geoprofiles
gds
homologene
pccompound
pcsubstance
pcassay
nlmcatalog
probe
gap
proteinclusters
bioproject
biosample
$ ./main.pl efetch

------------- EXCEPTION -------------
MSG: No parser defined for efetch; use get_Response() directly
STACK Bio::DB::EUtilities::get_Parser
/Users/xxx/perl5/lib/perl5/Bio/DB/EUtilities.pm:57
STACK Bio::DB::EUtilities::get_databases
/Users/xxx/perl5/lib/perl5/Bio/DB/EUtilities.pm:157
STACK toplevel ./main.pl:15
-------------------------------------

-- 
Regards,
Peng
Adam Sjøgren | 26 May 10:34 2015

Temporary file names in Bio::Tools::Run::StandAloneWUBlast

  Hi.

I just noticed a problem with the temporary file names that
Bio::Tools::Run::StandAloneWUBlast uses.

wublast automatically tries to ungzip files whose names match
certain patterns, notably files ending in "_Z".

B:T:R:StandAloneWUBlast uses B:T:R:StandAloneBlast::_setinput() to
generate a temporary file for the query sequence.

_setinput() calls File::Temp::tempfile() without a SUFFIX (or a
TEMPLATE), which means that when you are (un)lucky, the temporary file
name becomes something like /tmp/i0311ckB_Z - which wublastp then tries
to ungzip, which fails.

I have created a pull request to add TEMPLATE and SUFFIX to
B:T:R:StandAloneBlast::_setinput()'s calls to tempfile():

 * https://github.com/bioperl/bioperl-run/pull/15
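
For illustration, here is a minimal sketch of the idea using File::Temp directly. The template prefix and the '.fasta' suffix are arbitrary choices for this example and are not necessarily what the pull request uses.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The template prefix and suffix below are illustrative, not taken from the
# pull request.
my ($fh, $filename) = tempfile(
    'queryXXXXXXXX',      # random characters replace the trailing X's
    TMPDIR => 1,          # create the file in the system temp directory
    SUFFIX => '.fasta',   # fixed ending, so the random part never ends the name
    UNLINK => 1,          # remove the file when the program exits
);
print "$filename\n";
close $fh;
```

With a fixed SUFFIX, the randomly generated characters can no longer form the end of the file name, so an accidental "_Z" ending should not occur.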

  Best regards,

    Adam

-- 
                                                          Adam Sjøgren
                                                    adsj <at> novozymes.com

Peter Cock | 5 May 15:42 2015

Fwd: [blast-announce] New Version of BLAST XML output

FYI - time to update all the Bio* parsers with the new BLAST XML variant...

Peter

---------- Forwarded message ----------
From: Mcginnis, Scott (NIH/NLM/NCBI) [E] <mcginnis <at> ncbi.nlm.nih.gov>
Date: Tue, May 5, 2015 at 2:32 PM
Subject: [blast-announce] New Version of BLAST XML output
To: NLM/NCBI List blast-announce <blast-announce <at> ncbi.nlm.nih.gov>

The NCBI is now making a new version of the BLAST XML available for
testing.  Read about the changes and how to access BLAST results using
the new XML at ftp://ftp.ncbi.nlm.nih.gov/blast/documents/NEWXML/xml2.pdf
wgallin | 1 May 00:10 2015

WP_ Accession Protein records

I've been working on scripts that update a small database that I maintain from a selection of records in GenBank.

I've encountered a problem and am wondering if anyone can suggest a work-around.

I am finding that the protein records that have been replaced by reference to a single entry in the WP_ series accessions in RefSeq have been annotated as removed, rather than replaced.

The result is that when I use a gi number from one of the original (pre-consolidation) entries to recover a record, using Bio::DB::EUtilities tools to return a Document Summary, I get a record whose status is listed as dead, rather than as replaced by WP_XXXXX.

In the on-line web interface, however, the search has a redirection to the appropriate WP_XXXXX record.

An example is gi number 71065499.

So my question is: is there a feature other than status in the Document Summary record that gives forwarding information for records replaced by WP_XXXXX series records?

Warren Gallin
Warren Gallin | 27 Apr 19:35 2015

Using EUtilities to retrieve multiple coding sequences for identical (WP_) protein sequences

With the advent of the WP_ accession series in RefSeq there is no longer a direct link between a single
protein sequence and its encoding nucleotide sequence.

It is possible to find the multiple individual nucleotide records encoding the identical protein
sequences on the Web interface through the “Identical Proteins” link, which generates a list of all
of the coding sequences for the identical protein sequence.

Is there any way to work through these linkages using Bio::DB::EUtilities?

Thanks,

Warren Gallin
Warren Gallin | 20 Apr 22:33 2015

inc::TestHelper

Further to my problem installing Bio::DB::EUtilities.

Below is a chunk of the output near the top of the material that plays out during attempted installation
using CPAN:

Running make for C/CJ/CJFIELDS/Bio-EUtilities-1.73.tar.gz
---- Unsatisfied dependencies detected during ----
----    CJFIELDS/Bio-EUtilities-1.73.tar.gz   ----
    inc::TestHelper [build_requires]
Running install for module 'inc::TestHelper'

  The module inc::TestHelper isn't available on CPAN.

  Either the module has not yet been uploaded to CPAN, or it is
  temporary unavailable. Please contact the author to find out
  more about the status. Try 'i inc::TestHelper'.

inc::TestHelper is not listed as a module in CPAN

cpan[5]> i inc::TestHelper                                                                          
Module id = inc::TestHelper
    CPAN_VERSION undef
    CPAN_FILE    N/A
    INST_FILE    (not installed)

It is, however, a file that has been downloaded as part of the Bio-EUtilities 1.73 CPAN download:

/Users/wgallin1/.cpan/build/Bio-EUtilities-1.73-lE1tKP/inc/TestHelper.pm

So for some reason CPAN is not recognizing the presence of this module, and since it is listed as a dependency
it is failing the install.

Does anyone have a clue as to why this is happening?

Warren Gallin
