Re: taxonomy ID
Florent Angly <florent.angly <at> gmail.com>
2009-04-01 17:03:28 GMT
FYI, the gi_taxid_nucl.dmp.gz file is very large, so you probably won't
be able to load all of it into a hash (unless you have a lot of
memory).
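
One way around the memory limit is to stream the file and keep only the GIs that actually show up in your blast hits. Below is a rough sketch in Python (a dict standing in for the hash), not a tested pipeline. It assumes the dump has two whitespace-separated columns (gi, then tax_id) and that hit names look like "gi|12345|gb|AB000001.1|". The function names lookup_taxids and gi_from_hit_id are made up for illustration.

```python
import gzip
import re

# Assumption: hit names follow the NCBI "gi|<number>|..." convention.
GI_PATTERN = re.compile(r"gi\|(\d+)")


def gi_from_hit_id(hit_id):
    """Return the GI number embedded in a hit name, or None if there is none."""
    match = GI_PATTERN.search(hit_id)
    return int(match.group(1)) if match else None


def lookup_taxids(dmp_path, wanted_gis):
    """Stream a gzipped gi->tax_id dump and keep only the requested GIs."""
    wanted = set(wanted_gis)
    found = {}
    if not wanted:
        return found
    # Text-mode gzip.open decompresses line by line, so the whole file
    # is never held in memory; only the matching entries are stored.
    with gzip.open(dmp_path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            gi, taxid = int(fields[0]), int(fields[1])
            if gi in wanted:
                found[gi] = taxid
                if len(found) == len(wanted):
                    break  # every requested GI has been seen
    return found
```

Memory use then scales with the number of distinct hits rather than with the size of the dump, at the cost of one pass over the file per batch of hits.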
Florent
Smithies, Russell wrote:
> The taxonomy information isn't in the blast output unless you created custom fasta headers for your blast database.
> The easiest way to get the tax_id for your accessions would be to download the gi->tax_id list from ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz.
> If you load that file into a hash, parse the accessions out of the blast hits, then look up the tax_id from that hash, I think it should be fairly fast.
>
> Checking which are prokaryotes and which are eukaryotes based on tax_id is a separate problem.
> If you grab the taxdump.tar.gz file from the same site, the nodes.dmp file contained within lists what division each tax_id belongs to (Bacteria, Invertebrates, Mammals, Phages, Plants, etc.), so you can probably work it out from that.
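
The division lookup could be sketched roughly as follows, again in Python. It assumes the usual taxdump layout as I understand it: fields separated by tab-pipe-tab, nodes.dmp holding a numeric division id in its fifth column, and division.dmp (in the same archive) mapping those ids to names in its third column. Check this against the readme in the archive before relying on it. The helper names are made up for illustration.

```python
def read_dmp(lines):
    """Yield the list of fields for each taxdump-style line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the end-of-row marker
        yield line.split("\t|\t")


def load_node_divisions(node_lines):
    """Map tax_id -> division id (assumed to be column 5 of nodes.dmp)."""
    return {int(f[0]): int(f[4]) for f in read_dmp(node_lines)}


def load_divisions(division_lines):
    """Map division id -> division name (assumed column 3 of division.dmp)."""
    return {int(f[0]): f[2] for f in read_dmp(division_lines)}


def division_name(taxid, node_div, div_names):
    """Return the division name for a tax_id, or None if it is unknown."""
    div_id = node_div.get(taxid)
    return div_names.get(div_id) if div_id is not None else None
```

Typical use would be to open nodes.dmp and division.dmp, pass the file handles to the two loaders, and then call division_name for each tax_id found in the blast hits. Deciding which divisions count as prokaryotic or eukaryotic is left to the caller.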
>
> It's not a very BioPerly solution, but sometimes just looking up the answer from a file/table/hash is the simplest way.
>
> Hope this helps,
>
> Russell Smithies
>
> Bioinformatics Applications Developer
> T +64 3 489 9085
> E russell.smithies <at> agresearch.co.nz
>
> Invermay Research Centre
> Puddle Alley,
> Mosgiel,