New to BioPerl - A little presentation... and a question about GenBank and Bioperl
Olivier BUHARD <Olivier.Buhard <at> inserm.fr>
2014-03-06 11:05:56 GMT
I'm new to BioPerl and would like to ask you for some advice about
using it.
I am a molecular biologist and I frequently use Perl to write scripts to
prepare or analyse files I get from various databases, so I'm familiar
enough with Perl.
My lab works on a particular type of tumorigenic process called
MSI, for MicroSatellite Instability. I won't go through the whole story,
but a hallmark of the associated cancers is that the size of the
repeated DNA sequences spread throughout their genome is altered.
Up to now, we got a list of those sequences from a collaborator who
could produce it for us. But the list we have is now old, so we have to
get this information by our own means, and naturally I started looking at
BioPerl.
Before I start learning everything I need (which, I guess, will take
some time), I would really appreciate it if someone could tell me whether
BioPerl can help from start to end.
In summary, I plan to search for all the short repetitive sequences (I'm
only interested in the human genome at the moment) I can find in the
GenBank flat files provided on the NCBI FTP site. The idea is to create a
BioSQL database (I have already installed one using the MySQL schema)
that I could query using an appropriate algorithm.
I saw that BioPerl is made to read those multi-entry files, so
building the BioSQL database should not be a problem. My first question
is how to crawl through the genomic sequences to detect short
tandem repeats of defined size and pattern (some are
mononucleotide repeats, like (A)27, others could be dinucleotides