Raja durai | 22 May 2013 09:22

How to analyze Ray-meta Results


Hi,

I have used Ray-meta to assemble some bacterial FASTQ samples. I am very eager to learn how to analyze the results obtained from Ray-meta. Is there any tool for analyzing
Ray-meta results? I look forward to hearing from you.

Regards
Rajadurai
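
A common first step when looking at an assembly is to summarize the contigs. The sketch below is not a Ray tool: it assumes the assembly is a plain FASTA file (Ray's output directory normally contains Contigs.fasta) and computes contig lengths and the N50. The function names and the path in the usage comment are illustrative assumptions.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    hold at least half of the assembled bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


# Usage sketch (the path is an assumption; adjust it to your -o directory):
# with open("RayOutput/Contigs.fasta") as handle:
#     lengths = read_fasta_lengths(handle)
# print(len(lengths), sum(lengths), n50(lengths))
```

The number of contigs, the total length and the N50 give a quick sanity check before any alignment-based validation.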
------------------------------------------------------------------------------
Sébastien Boisvert | 21 May 2013 19:36

Re: Ray and bacterial assemblies

On 15/05/13 10:58 AM, David Hernandez wrote:
> I am using MPICH2 1.2.1
>
> The total size of the missing parts may significantly differ but there
> are some consistencies.

So the problem is reproducible?

Is the data public?

>
> The size as estimated by the script is OK:
>
> Starting k-mer coverage: 50
> Ending k-mer coverage: 2000
> Average k-mer coverage: 305
> Estimated haploid genome size: 2838719
>
> David
>
>
> On 05/15/2013 04:12 PM, Sébastien Boisvert wrote:
>> [CC'ed to the list]
>>
>> Hello,
>>
>> On 14/05/13 01:18 PM, David Hernandez wrote:
>>> Hello Sébastien,
>>>
>>> We have never been in touch directly, but you helped us quite a
>>> while ago with a bacterial assembly.
>>
>> OK. I remember your name from the Edena paper which showed that
>> overlap-layout-consensus
>> works on short reads too [1].
>>
>>>
>>> I am writing because I am testing Ray on two Staphylococcus aureus
>>> datasets (~2.8 Mb) with Illumina 100 bp short and long paired-end reads.
>>>
>>> I am running several assemblies while varying the k parameter.
>>> My problem is that Ray produces assemblies in which large parts of
>>> the genome are missing (up to 25%), depending on the value of k.
>>>
>>> I can find a nearly complete assembly among them, but I wonder
>>> whether I am using Ray correctly.
>>>
>>> Here, for example, is a script I use to launch Ray:
>>>
>>> DATALOC=/home/david/FAS_MW2/DATA/
>>> RAYEXEC=/usr/local/share/Ray-v2.2.0/Ray
>>>
>>> mpd &
>>>
>>> for K in 41 21 25 29 33 37 45 49 53 57 61 69 73 77 81 85 89 93 97
>>> do
>>> time \
>>> mpirun -np 22 $RAYEXEC -show-memory-usage -k $K -o ray2.2.0.MW2_K$K \
>>> -p ${DATALOC}130326_SN234_M_L001_FAS-443_430bp_R1.fastq \
>>> ${DATALOC}130326_SN234_M_L001_FAS-443_430bp_R2.fastq \
>>> -p ${DATALOC}130326_SN234_M_L001_FAS-445_4.2kb.R1.fastq \
>>> ${DATALOC}130326_SN234_M_L001_FAS-445_4.2kb.R2.fastq 4171 524 \
>>> -p ${DATALOC}130326_SN234_M_L001_FAS-446_5.5kb.R1.fastq \
>>> ${DATALOC}130326_SN234_M_L001_FAS-446_5.5kb.R2.fastq 5550 787 \
>>>    > logRayMW2i$K
>>> done
>>>
>>> Am I forgetting something?
>>
>> Which MPI library are you using ?
>>
>>>
>>> I also tried the -use-minimum-seed-coverage option, and
>>> reverse-complementing the long paired-end reads (to orient them
>>> like the short ones), but the results are no better.
>>>
>>> I am attaching one of the execution logs to this email. You will see
>>> that the total assembled length is 2320832, whereas it should be
>>> around 2.8 Mb.
>>>
>>> I will not hide that this is for a publication, and that it is fairly
>>> urgent...
>>>
>>
>> So is the part missing consistently in every assembly?
>>
>>
>>
>> Can you try the following script on your CoverageDistribution.txt file ?
>>
>>
>> https://github.com/sebhtml/NGS-Pipelines/blob/master/Calculate-Genome-Size.py
>>
>>
>>
>> (You can get a copy of this tool with "git clone
>> git://github.com/sebhtml/NGS-Pipelines.git").
>>
>>
>> This will estimate the genome size using kmer frequencies. It is quite
>> accurate.
>>
>>
>>
>> Otherwise, this is a typical use case for Ray Cloud Browser to help
>> understand what is going on.
>>
>> See:
>>
>> http://browser.cloud.raytrek.com/client/?map=3&section=0&region=9&location=0&depth=10&zoom=1.2255452109421872
>>
>>
>>
>> ---
>> [1] http://genome.cshlp.org/content/18/5/802.long
>>
>>
>>> Thanks in advance for your advice,
>>> David
>>>
>

------------------------------------------------------------------------------
Sébastien Boisvert | 15 May 2013 16:12

Re: Ray and bacterial assemblies

[CC'ed to the list]

Hello,

On 14/05/13 01:18 PM, David Hernandez wrote:
> Hello Sébastien,
>
> We have never been in touch directly, but you helped us quite a
> while ago with a bacterial assembly.

OK. I remember your name from the Edena paper which showed that overlap-layout-consensus
works on short reads too [1].

>
> I am writing because I am testing Ray on two Staphylococcus aureus
> datasets (~2.8 Mb) with Illumina 100 bp short and long paired-end reads.
>
> I am running several assemblies while varying the k parameter.
> My problem is that Ray produces assemblies in which large parts of
> the genome are missing (up to 25%), depending on the value of k.
>
> I can find a nearly complete assembly among them, but I wonder
> whether I am using Ray correctly.
>
> Here, for example, is a script I use to launch Ray:
>
> DATALOC=/home/david/FAS_MW2/DATA/
> RAYEXEC=/usr/local/share/Ray-v2.2.0/Ray
>
> mpd &
>
> for K in 41 21 25 29 33 37 45 49 53 57 61 69 73 77 81 85 89 93 97
> do
> time \
> mpirun -np 22 $RAYEXEC -show-memory-usage -k $K -o ray2.2.0.MW2_K$K \
> -p ${DATALOC}130326_SN234_M_L001_FAS-443_430bp_R1.fastq \
> ${DATALOC}130326_SN234_M_L001_FAS-443_430bp_R2.fastq \
> -p ${DATALOC}130326_SN234_M_L001_FAS-445_4.2kb.R1.fastq \
> ${DATALOC}130326_SN234_M_L001_FAS-445_4.2kb.R2.fastq 4171 524 \
> -p ${DATALOC}130326_SN234_M_L001_FAS-446_5.5kb.R1.fastq \
> ${DATALOC}130326_SN234_M_L001_FAS-446_5.5kb.R2.fastq 5550 787 \
>   > logRayMW2i$K
> done
>
> Am I forgetting something?

Which MPI library are you using ?

>
> I also tried the -use-minimum-seed-coverage option, and
> reverse-complementing the long paired-end reads (to orient them
> like the short ones), but the results are no better.
>
> I am attaching one of the execution logs to this email. You will see
> that the total assembled length is 2320832, whereas it should be
> around 2.8 Mb.
>
> I will not hide that this is for a publication, and that it is fairly
> urgent...
>

So is the part missing consistently in every assembly?

Can you try the following script on your CoverageDistribution.txt file ?

https://github.com/sebhtml/NGS-Pipelines/blob/master/Calculate-Genome-Size.py

(You can get a copy of this tool with "git clone git://github.com/sebhtml/NGS-Pipelines.git").

This will estimate the genome size using kmer frequencies. It is quite accurate.
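
For readers unfamiliar with the idea, here is a minimal sketch of estimating genome size from a k-mer coverage distribution. It is not the code of Calculate-Genome-Size.py; it illustrates the generic approach of dividing the total number of k-mer occurrences by the peak coverage. The input is assumed to hold one "coverage count" pair per line, with comment lines starting with "#", and the low-coverage cutoff is a parameter chosen by the user to exclude error k-mers. Strand and ploidy corrections are ignored.

```python
def read_distribution(lines):
    """Parse 'coverage count' pairs, skipping blank and '#' comment lines."""
    distribution = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        coverage, count = line.split()[:2]
        distribution[int(coverage)] = int(count)
    return distribution


def estimate_genome_size(distribution, min_coverage):
    """Return (peak_coverage, estimated_size).

    k-mers below min_coverage are treated as sequencing errors and ignored.
    The estimate is the total number of k-mer occurrences divided by the
    coverage at which the distribution peaks.
    """
    kept = {c: n for c, n in distribution.items() if c >= min_coverage}
    peak = max(kept, key=kept.get)
    total_occurrences = sum(c * n for c, n in kept.items())
    return peak, total_occurrences // peak


# Usage sketch (file name as mentioned in this thread; cutoff is an assumption):
# with open("CoverageDistribution.txt") as handle:
#     print(estimate_genome_size(read_distribution(handle), min_coverage=50))
```

Comparing such an estimate with the total assembled length shows how much of the genome an assembly is missing.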

Otherwise, this is a typical use case for Ray Cloud Browser to help understand what is going on.

See:

http://browser.cloud.raytrek.com/client/?map=3&section=0&region=9&location=0&depth=10&zoom=1.2255452109421872

---
[1] http://genome.cshlp.org/content/18/5/802.long

> Thanks in advance for your advice,
> David
>

------------------------------------------------------------------------------
Sébastien Boisvert | 10 May 2013 23:19

Re: A question on a Ray warning

On 10/05/13 03:46 PM, harrison@... wrote:
> Hello Sebastien,
>
> I have a question about running Ray, if you do not mind.
>
> I am trying to run Ray with a k value of 55 and I am getting this warning:
>
> Rank 0: Warning, k > MAXKMERLENGTH
>
> Does Ray put an upper bound on k?

Yes, this is done at compilation.

> If so does this involve recompiling or
> is there a way to run it to allow larger k values?

You need to recompile the software.

>
> Thank you,
> Thomas Harrison
>
>

------------------------------------------------------------------------------
Sébastien Boisvert | 10 May 2013 17:38

Re: about assessment of metagenome assemblies

On 10/05/13 11:26 AM, Yuqing Xiong wrote:
> Thank you very much. However, I still have a question. How did you know which species a contig belongs to in a
> metagenome assembly?

In our paper, we aligned all the assembled contigs to the set of sequences using MUMmer.
We did not classify contigs before alignment because MUMmer reports maximal unique matches.

See Section "Validation of assemblies" in our paper [1].
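
As an illustration only (this is not the procedure from the paper), contigs can be attributed to reference genomes after alignment by summing the aligned bases per reference and keeping the best one. The sketch assumes the alignments have already been reduced to hypothetical (contig, reference, aligned_length) tuples, for example from the coordinates that an aligner reports; the names are made up for the example.

```python
from collections import defaultdict


def assign_contigs(alignments):
    """Map each contig to the reference with the most aligned bases.

    `alignments` is an iterable of (contig, reference, aligned_length)
    tuples, a hypothetical simplified form of alignment coordinates.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for contig, reference, aligned_length in alignments:
        totals[contig][reference] += aligned_length
    return {
        contig: max(per_reference, key=per_reference.get)
        for contig, per_reference in totals.items()
    }


example = [
    ("contig-1", "genome-A", 900),
    ("contig-1", "genome-B", 120),
    ("contig-2", "genome-B", 400),
    ("contig-2", "genome-B", 350),
    ("contig-2", "genome-A", 500),
]
assignment = assign_contigs(example)
print(assignment)
```

A contig with substantial aligned length on two references (like contig-2 here) is a candidate chimera and deserves a closer look.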

---
[1] http://genomebiology.com/2012/13/12/R122

>
>
> On Fri, May 10, 2013 at 9:49 AM, Sébastien Boisvert <sebastien.boisvert.3 <at> ulaval.ca> wrote:
>
>     On 08/05/13 05:33 PM, Yuqing Xiong wrote:
>
>         Hi, Sebastien,
>
>         I am a researcher in computational biology, studying your papers "Ray: Simultaneous Assembly of
>         Reads from a Mix of High-Throughput Sequencing Technologies"
>         and "Ray Meta: scalable de novo metagenome assembly and profiling". The former is about non-metagenome
>         assembly and the latter is about metagenome assembly.
>         When the quality of the assemblies is assessed, the assembly metrics in the two papers differ: chimeric
>         contigs, mismatches, and indels are reported in the former, but not in the latter. I think the cause is
>         binning, that is, it is not known which species a contig belongs to in a metagenome assembly.
>
>         My question is: why did you not determine which species every contig belongs to, and then assess the
>         assembly quality of every contig in the second paper?
>
>
>     We actually verified the assembly quality in our paper [1] for the 100-genome metagenome and for the 1000-genome
>     metagenome (in the Results section).
>
>         Best regards,
>
>         --Yuqing
>
>
>
>     ---
>     [1] http://genomebiology.com/2012/13/12/R122
>
>

------------------------------------------------------------------------------
Sébastien Boisvert | 10 May 2013 16:49

Re: about assessment of metagenome assemblies

On 08/05/13 05:33 PM, Yuqing Xiong wrote:
> Hi, Sebastien,
>
> I am a researcher in computational biology, studying your papers "Ray: Simultaneous Assembly of
> Reads from a Mix of High-Throughput Sequencing Technologies"
> and "Ray Meta: scalable de novo metagenome assembly and profiling". The former is about non-metagenome
> assembly and the latter is about metagenome assembly.
> When the quality of the assemblies is assessed, the assembly metrics in the two papers differ: chimeric
> contigs, mismatches, and indels are reported in the former, but not in the latter. I think the cause is
> binning, that is, it is not known which species a contig belongs to in a metagenome assembly.
>
> My question is: why did you not determine which species every contig belongs to, and then assess the
> assembly quality of every contig in the second paper?
>

We actually verified the assembly quality in our paper [1] for the 100-genome metagenome and for the 1000-genome
metagenome (in the Results section).

> Best regards,
>
> --Yuqing

---
[1] http://genomebiology.com/2012/13/12/R122

------------------------------------------------------------------------------
Cedar Hesse | 9 May 2013 23:58

Segmentation Fault during assembly of metatranscriptome

Hello,
I recently started using Ray in an attempt to assemble a metatranscriptome.  I have tried multiple assemblies and repeatedly get a segmentation fault.  The fault happens after contigs are written and the AMOS file is generated, however I don't think Ray gets to the scaffolding stage or any downstream plugins (gene ontology, etc...).  Below is my RayCommand.txt file and the last few lines of stdout showing the segfault.

RayCommand.txt
###################################
# This was also run with "--mca btl ^sm", with the same segfault
mpiexec -n 25 Ray \
 -p \
 TAH-19.1_rejected_reads.fastq \
 TAH-19.2_rejected_reads.fastq \
 -p \
 TAH-20.1_rejected_reads.fastq \
 TAH-20.2_rejected_reads.fastq \
 -p \
 TAH-21.1_rejected_reads.fastq \
 TAH-21.2_rejected_reads.fastq \
 -p \
 TAH-22.1_rejected_reads.fastq \
 TAH-22.2_rejected_reads.fastq \
 -p \
 TAH-23.1_rejected_reads.fastq \
 TAH-23.2_rejected_reads.fastq \
 -p \
 TAH-24.1_rejected_reads.fastq \
 TAH-24.2_rejected_reads.fastq \
 -o \
 ray_assembly_new \
 -show-memory-usage \
 -write-contig-paths \
 -amos \
 -search \
 /dbase/Ray/EMBL_CDS_Sequences \
 -gene-ontology \
 /dbase/Ray/OntologyTerms.txt \
 /dbase/Ray/Annotations.txt
###################################

STDOUT
####################################
Rank 0 objectName: 20 => peakCoverage: 125 blockSize: 1305
Rank 0 objectName: 0 => peakCoverage: 192 blockSize: 1538
Rank 0 objectName: 12 => peakCoverage: 37 blockSize: 1486
Rank 0 objectName: 19 => peakCoverage: 107 blockSize: 1490
Rank 0 objectName: 16 => peakCoverage: 82 blockSize: 1880
[metatron:01941] *** Process received signal ***
[metatron:01941] Signal: Segmentation fault (11)
[metatron:01941] Signal code: Address not mapped (1)
[metatron:01941] Failing at address: 0xb0c68
[metatron:01941] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f8cd6851cb0]
[metatron:01941] [ 1] Ray(_ZNK4Read6lengthEv+0) [0x549160]
[metatron:01941] [ 2] Ray(_ZN16MessageProcessor33call_RAY_MPI_TAG_GET_READ_MARKERSEP7Message+0x88) [0x4b8208]
[metatron:01941] [ 3] Ray(_ZN11ComputeCore10runVanillaEv+0xb4) [0x5867c4]
[metatron:01941] [ 4] Ray(_ZN11ComputeCore3runEv+0x89) [0x58c019]
[metatron:01941] [ 5] Ray(_ZN7Machine5startEv+0x1876) [0x46d056]
[metatron:01941] [ 6] Ray(_ZN11RankProcessI7MachineE3runEv+0x9f) [0x46a6df]
[metatron:01941] [ 7] Ray(main+0xc7) [0x465477]
[metatron:01941] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f8cd64a476d]
[metatron:01941] [ 9] Ray() [0x466a81]
[metatron:01941] *** End of error message ***
[metatron][[32923,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],4][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],8][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],9][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],12][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],14][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],23][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],2][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],21][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],6][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],10][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],11][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],13][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],15][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],17][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],18][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],19][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],20][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[metatron][[32923,1],22][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpiexec noticed that process rank 24 with PID 1941 on node metatron exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
####################################################

Any help?
Thanks,
-Cedar
------------------------------------------------------------------------------
Sébastien Boisvert | 8 May 2013 20:04

Re: Ray Meta

On 07/05/13 03:37 PM, C. Titus Brown wrote:
>
> On May 3, 2013, at 8:31 AM, Sébastien Boisvert <sebastien.boisvert.3 <at> ulaval.ca> wrote:
>
>> On 02/05/13 07:49 PM, C. Titus Brown wrote:
>>> Sebastian,
>>>
>>> I cannot find instructions for running Ray Meta.  Any help?
>>>
>>> I've downloaded and executed Ray, and it seems to perform ok on metagenomic data.  Is that what I should be doing?
>>
>> Yes.
>>
>> The Ray executable is generalized to work on multi-cell single-genome samples
>> and on multi-cell multiple-genome samples.
>>
>>>
>>> thanks,
>>> --titus
>>>
>>
>
> Hi Sebastian et al.,
>
> I put together these instructions -- still a work in progress, so don't distribute widely yet, please!
>
> http://ged.msu.edu/angus/2013-hmp-assembly-webinar/playing-with-assemblers.html
>
> Is there anything I should change about the parameters I'm using, do you think?

I think you should use paired reads for your tutorial (you have Illumina data).

Ray is much better with paired reads.

>
> thanks,
> --titus
>

------------------------------------------------------------------------------
Sébastien Boisvert | 3 May 2013 17:31

Re: Ray Meta

On 02/05/13 07:49 PM, C. Titus Brown wrote:
> Sebastian,
>
> I cannot find instructions for running Ray Meta.  Any help?
>
> I've downloaded and executed Ray, and it seems to perform ok on metagenomic data.  Is that what I should be doing?

Yes.

The Ray executable is generalized to work on multi-cell single-genome samples
and on multi-cell multiple-genome samples.

>
> thanks,
> --titus
>

------------------------------------------------------------------------------
Marcelino Suzuki | 29 Apr 2013 19:39

Re: Error?

	Hi

	I am using
	Intel i-Dataplex
	88 IBM dx360 M3 compute nodes integrated in 42 IBM 2U Flex chassis.
	Each compute node is configured as follows:
	• 2 six-core Intel (Westmere) processors clocked at 2.66 GHz
	• 24 GB of DDR3 1066 MHz RAM
	• 1 InfiniBand 40 Gb/s card (message passing + file system)

	Marcelino
========================================================================
Marcelino Suzuki, Assoc. Professor, Intl. Chair, Platform Responsible
Univ Pierre Marie Curie (Paris 6) - Observatoire Océanologique de Banyuls
UMR 7621 - Laboratoire d'Océanographie Microbienne (LOMIC)
Marine Biodiversity and Biotechnology (bio2mar) Platform
suzuki@...    http://bit.ly/fq3nbE    bio2mar.obs-banyuls.fr
Ave du Fontaulé, Banyuls-sur-Mer 66650, France    +33(0)430192401
========================================================================

On Apr 29, 2013, at 6:23 PM, Sébastien Boisvert wrote:

> [Please send your messages to the mailing list]
>
> On 29/04/13 12:11 PM, Marcelino Suzuki wrote:
>>         Hello
>>
>>         I am using the new 2.2.0 version, and it took a while, but this time
>> I got the entire process to finish :), but it would be nice to use
>> more processors.  I tried with 36 and it failed (perhaps because I
>> already had a checkpoint started with 16 ranks, though).
>
> Checkpoints are not compatible if the number of MPI ranks is not the same.
>
>> I still would
>> like to know whether it is running in parallel or only in one core.
>>
>
> Are you using a Blue Gene/Q, a Cray XE6, or something else?
>
>>         Here is the command
>>
>> llsubmit        job2x16k31bimini.cmd:
>>
>> #!/bin/sh
>> # @ job_name = crambe_ray
>> # @ output = $(job_name).out
>> # @ error = $(job_name).err
>>
>> # @ job_type = mpich
>> # @ node = 8
>> # @ total_tasks = 16
>> # @ restart = yes
>>
>>
>> # @ wall_clock_limit = 60:00:00,59:55:00
>> # @ queue
>>
>> /opt/cluster/mpi/intel/openmpi-noll-1.4.1-qlc/bin/mpiexec -n 8 -output-filename cr \
>>   -machinefile $LOADL_HOSTFILE /work/OOBMECO/bin/ray/Ray \
>>   -mini-ranks-per-rank 1 \
>>   -read-write-checkpoints checkpoints \
>>   -k 31 \
>>   -amos \
>>   -o crambe.3 \
>>   -s /scratch/suzukim/mira/CCB1a.fastq \
>>   -search /scratch/suzukim/NCBI-taxonomy/NCBI-Eukaryote-Genomes \
>>   -search /scratch/suzukim/NCBI-taxonomy/NCBI-Finished-Virus-Genomes \
>>   -search /scratch/suzukim/NCBI-taxonomy/NCBI-Draft-Bacteria-Genomes \
>>   -search /scratch/suzukim/NCBI-taxonomy/NCBI-Finished-Bacterial-Genomes \
>>   -with-taxonomy /scratch/suzukim/ggenes-taxonomy/Genome-to-Taxon.tsv \
>>     /scratch/suzukim/ggenes-taxonomy/TreeOfLife-Edges.tsv \
>>     /scratch/suzukim/ggenes-taxonomy/Taxon-Names.tsv \
>>   -search /scratch/suzukim/Ontology/EMBL_CDS_Sequences \
>>   -gene-ontology /scratch/suzukim/Ontology/OntologyTerms.txt \
>>     /scratch/suzukim/Ontology/Annotations.txt \
>>   -show-memory-usage
>>
>>
>> On Apr 29, 2013, at 5:50 PM, Sébastien Boisvert wrote:
>>
>>> On 20/04/13 04:18 AM, Marcelino Suzuki wrote:
>>>>         Hi Sebastien
>>>>
>>>>         One last question (I hope for the moment)
>>>>
>>>>         For the runs that work for me (2 nodes, 16 cores), whenever I ask for
>>>> the -output-filename, I get quite different outputs per rank (i.e.
>>>> two of the outfiles show the progress while the rest don't change and
>>>> output the following lines:
>>>>
>>>> -s (single sequences)
>>>>   Sequences: /scratch/suzukim/mira/CCB1a.fastq
>>>> Enabling CorePlugin 'TaxonomyViewer'
>>>> Enabling memory usage reporting.
>>>> Error, crambe/ already exists, change the -o parameter to another value.
>>>>
>>>> Rank 0: assembler memory usage: 37988 KiB
>>>> Rank 0: assembler memory usage: 103712 KiB
>>>> Rank 0: Rank= 0 Size= 1 ProcessIdentifier= 8169
>>>>
>>>>         Is this behavior "normal"? Or does it mean I am only running the
>>>> calculations on two cores?  If that is the case, I could try to assign
>>>> a rank per node, but since I have been having errors whenever I use
>>>> more than two nodes, I would rather not.
>>>>
>>>
>>> Can you provide the command line for the 2-node job for which you are
>>> experiencing problems?
>>>
>>> Also, which version are you using? v2.2.0 has a lot of bug fixes
>>> under the hood.
>>>
>>>>         Thanks much!
>>>>
>>>>         Marcelino
>>>>
>>>>
>>>> On Apr 19, 2013, at 10:52 PM, Sébastien Boisvert wrote:
>>>>
>>>>> On 19/04/13 02:46 PM, Marcelino Suzuki wrote:
>>>>>>         Hi Sebastien
>>>>>>
>>>>>>         Well, I was able to get through the assembly with ray2.2.0 and hit
>>>>>> the wall with 1 node 9 ranks and 2 nodes 16 ranks during
>>>>>> BiologicalAbundances.   I am still having issues with multiple nodes/cores.
>>>>>> I tried several combinations and the process gets killed after
>>>>>> ~30 min. I think you call ranks the number of processes you send via
>>>>>> mpirun -np, correct?
>>>>>
>>>>> Yes, a MPI (message passing interface) rank is usually a process
>>>>> running the application.
>>>>>
>>>>>> Perhaps you might have some guesses as to why.  The
>>>>>> error message I got with 3 nodes and 25 cores (llsubmit # @
>>>>>> total_tasks = 25) is:
>>>>>>
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>> --------------------------------------------------------------------------
>>>>>>          node017
>>>>>>          node028
>>>>>> [node028:03832] [[28276,0],2] routed:binomial: Connection to lifeline [[28276,0],0] lost
>>>>>> [node017:03879] [[28276,0],1] routed:binomial: Connection to lifeline [[28276,0],0] lost
>>>>>>
>>>>>
>>>>> This is sometimes a byproduct of a node that is running out
>>>>> of memory.
>>>>>
>>>>>>         Any help very welcome
>>>>>>
>>>>>>         Marcelino
>>>>>>
>>>>>>
>>>>>> On Apr 17, 2013, at 1:57 PM, Sébastien Boisvert wrote:
>>>>>>
>>>>>>> See my responses below.
>>>>>>>
>>>>>>> On 16/04/13 06:16 PM, Marcelino Suzuki wrote:
>>>>>>>>         Hurrah it compiled, but could you please take a look at
>>>>>>>> "HELP!" below
>>>>>>>>
>>>>>>>>
>>>>>>>>         Well, after I picked the intel compiler and added the
>>>>>>>> -DMPICH_IGNORE_CXX_SEEK option in the Makefile of RayPlatform
>>>>>>>>
>>>>>>>>          $(Q)$(MPICXX) $(CXXFLAGS) -DMPICH_IGNORE_CXX_SEEK -DRAYPLATFORM_VERSION=\"$(RAYPLATFORM_VERSION)\" -I. -c -o $@ $<
>>>>>>>>
>>>>>>>>         and the Makefile of ray
>>>>>>>>
>>>>>>>>         find . -name *Makefile* | xargs -i sed -i "s/mpicxx/mpicxx -DMPICH_IGNORE_CXX_SEEK/g" {}
>>>>>>>>
>>>>>>>>         It did compile through.
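
Two caveats about the find/sed one-liner used in this thread: the `*Makefile*` pattern is unquoted, so the shell may expand it before find sees it, and running the command a second time appends the flag again. A more defensive variant could look like the sketch below. It assumes GNU find, grep and sed; add_seek_flag is a hypothetical name, not something shipped with Ray.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append -DMPICH_IGNORE_CXX_SEEK after every "mpicxx"
# in the Makefiles found under a directory. Assumes GNU sed (for -i).
add_seek_flag() {
    local dir=${1:-.}
    # Quote the pattern so the shell does not expand it before find runs.
    find "$dir" -type f -name '*Makefile*' -print0 |
    while IFS= read -r -d '' makefile; do
        # Files that already contain the flag anywhere are skipped entirely,
        # so a second run is a no-op.
        if ! grep -q -- '-DMPICH_IGNORE_CXX_SEEK' "$makefile"; then
            sed -i 's/mpicxx/mpicxx -DMPICH_IGNORE_CXX_SEEK/g' "$makefile"
        fi
    done
}
```

Calling `add_seek_flag .` from the top of the source tree would patch each Makefile at most once. A Makefile that was already edited by hand on one line is left untouched, which may or may not be what you want.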
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ===     HELP!
>>>>>>>>
>>>>>>>>         I tried to run jobs with the newly compiled 2.2.0, but
>>>>>>>> unfortunately I still get the "Error, crambe/ already exists, change
>>>>>>>> the -o parameter to another value." error.
>>>>>>>>
>>>>>>>>         I get two types of errors:
>>>>>>>>
>>>>>>>>         Whenever I start one node with 9 cores, it crashes after some 20
>>>>>>>> minutes during assembly, so I am guessing it is some RAM issue,
>>>>>>>> although I am "only" running 3.2 M sequences.  I do get the "Error,
>>>>>>>> crambe/ already exists, change the -o parameter to another value."
>>>>>>>> message, but it does not seem to affect the run.  Actually, when I use
>>>>>>>> individual outfiles per core, only one of the files gets written to.
>>>>>>>>
>>>>>>>>         When I spread among different nodes, i.e. 4 nodes with 4 jobs per
>>>>>>>
>>>>>>> You mean 4 MPI ranks per node, right ?
>>>>>>>
>>>>>>>> node, the job cancels within a minute or so, and the individual
>>>>>>>> outfiles for most of the cores have the following last lines:
>>>>>>>>
>>>>>>>> Enabling CorePlugin 'TaxonomyViewer'
>>>>>>>> Enabling memory usage reporting.
>>>>>>>> Error, crambe/ already exists, change the -o parameter to another value.
>>>>>>>>
>>>>>>>
>>>>>>> What is your file system ? And is it shared between your nodes ?
>>>>>>>
>>>>>>>> Rank 0: assembler memory usage: 37992 KiB
>>>>>>>> Rank 0: assembler memory usage: 103716 KiB
>>>>>>>> Rank 0: Rank= 0 Size= 1 ProcessIdentifier= 28047
>>>>>>>>
>>>>>>>>         Nodes (08,11) have no errors and finish by
>>>>>>>>
>>>>>>>> Rank 0 is loading sequence reads
>>>>>>>> Rank 0 : partition is [0;3602181], 3602182 sequence reads
>>>>>>>> Rank 0 is fetching file /scratch/suzukim/mira/CCB1a.fastq with lazy loading (please wait...)
>>>>>>>> Rank 0 has 0 sequence reads
>>>>>>>> Rank 0: assembler memory usage: 142416 KiB
>>>>>>>> Rank 0 has 100000 sequence reads
>>>>>>>> Rank 0: assembler memory usage: 142416 KiB
>>>>>>>> Rank 0 has 200000 sequence reads
>>>>>>>> Rank 0: assembler memory usage: 142416 KiB
>>>>>>>>
>>>>>>>>         Nodes 09,10, have no errors and finish by
>>>>>>>>
>>>>>>>> Rank 0 is loading sequence reads
>>>>>>>> Rank 0 : partition is [0;3602181], 3602182 sequence reads
>>>>>>>> Rank 0 is fetching file /scratch/suzukim/mira/CCB1a.fastq with lazy loading (please wait...)
>>>>>>>>
>>>>>>>>         Marcelino
>>>>>>>>
>>>>>>>> On Apr 16, 2013, at 10:14 PM, Sébastien Boisvert wrote:
>>>>>>>>
>>>>>>>>> To compile the git version:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> git clone git://github.com/sebhtml/ray.git
>>>>>>>>> git clone git://github.com/sebhtml/RayPlatform.git
>>>>>>>>>
>>>>>>>>> cd ray
>>>>>>>>> make
>>>>>>>>>
>>>>>>>>> On 16/04/13 04:09 PM, Marcelino Suzuki wrote:
>>>>>>>>>>      Hello and thanks
>>>>>>>>>>
>>>>>>>>>>      Just so you know, I am getting the same type of compilation
>>>>>>>>>> errors with other .cpp files as well (after I excluded Amos.cpp):
>>>>>>>>>> code/plugin_CoverageGatherer/CoverageGatherer.cpp (at least,
>>>>>>>>>> perhaps others).
>>>>>>>>>>
>>>>>>>>>>      I will wait for the new version, and will try on a different
>>>>>>>>>> cluster where I compiled with gcc.
>>>>>>>>>>
>>>>>>>>>>      Marcelino
>>>>>>>>>>
>>>>>>>>>> On Apr 16, 2013, at 10:00 PM, Sébastien Boisvert wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> v2.2.0 should be released soon. There is one ticket left.
>>>>>>>>>>>
>>>>>>>>>>> Otherwise, I don't think this issue is critical as it does not
>>>>>>>>>>> prevent assemblies from completing.
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/13 03:35 PM, Marcelino Suzuki wrote:
>>>>>>>>>>>>    Hello again
>>>>>>>>>>>>
>>>>>>>>>>>>    Well, I do not really know how to add the changes using git
>>>>>>>>>>>> (other than editing the .cpp files from v2.1.0, which are not quite
>>>>>>>>>>>> the same as the ones in the commit).  I tried to clone the latest ray
>>>>>>>>>>>> and RayPlatform and, after some tweaking to use the Intel compiler
>>>>>>>>>>>> (suggested by my sys admin), the compilation crashed at
>>>>>>>>>>>> plugin_Amos.cpp.  Maybe you can give me some pointers on how to fix
>>>>>>>>>>>> 2.1.0 (which I was able to compile).
>>>>>>>>>>>>
>>>>>>>>>>>>    Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> commands
>>>>>>>>>>>>                  cd
>>>>>>>>>>>>                  git clone https://bitbucket.org/sebhtml/ray.git
>>>>>>>>>>>>                  mv ray /work/OOBMECO/bin
>>>>>>>>>>>>                  git clone https://bitbucket.org/sebhtml/rayplatform.git
>>>>>>>>>>>>                  mv rayplatform/ /work/OOBMECO/bin
>>>>>>>>>>>>                  cd /work/OOBMECO/bin/ray/
>>>>>>>>>>>>                  source /opt/cluster/gcc-4.4.3/refresh.sh
>>>>>>>>>>>>                  source /opt/cluster/compilers/intel/Compiler/11.1/072/bin/iccvars.sh intel64
>>>>>>>>>>>>                  source /opt/cluster/compilers/intel/impi/4.0.0.028/bin64/mpivars.sh
>>>>>>>>>>>>                  find . -name *Makefile* | xargs -i sed -i "s/mpicxx/mpicxx -DMPICH_IGNORE_CXX_SEEK/g" {}
>>>>>>>>>>>>
>>>>>>>>>>>>            edited the RayPlatform Makefile and added -DMPICH_IGNORE_CXX_SEEK
>>>>>>>>>>>>
>>>>>>>>>>>>            make PREFIX=$PWD
>>>>>>>>>>>>
>>>>>>>>>>>> ERROR:
>>>>>>>>>>>>
>>>>>>>>>>>> CXX code/plugin_Amos/Amos.o
>>>>>>>>>>>> code/plugin_Amos/Amos.cpp: In member function â:
>>>>>>>>>>>> code/plugin_Amos/Amos.cpp:239: error: expected primary-expression before â token
>>>>>>>>>>>> code/plugin_Amos/Amos.cpp:239: error: â was not declared in this scope
>>>>>>>>>>>> code/plugin_Amos/Amos.cpp:240: error: expected primary-expression before â token
>>>>>>>>>>>> code/plugin_Amos/Amos.cpp:240: error: â was not declared in this scope
>>>>>>>>>>>> make: *** [code/plugin_Amos/Amos.o] Error 1
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ============================================================================
>>>>>>>>>>>> Marcelino Suzuki, Assoc. Professor, Intl. Chair, Platform Responsible
>>>>>>>>>>>> Univ Pierre Marie Curie (Paris 6) - Observatoire Océanologique de Banyuls
>>>>>>>>>>>> UMR 7621 - Laboratoire d'Océanographie Microbienne (LOMIC)
>>>>>>>>>>>> Marine Biodiversity and Biotechnology (bio2mar) Platform
>>>>>>>>>>>> suzuki@...    http://bit.ly/fq3nbE    bio2mar.obs-banyuls.fr
>>>>>>>>>>>> Ave du Fontaulé, Banyuls-sur-Mer 66650, France
>>>>>>>>>>>> +33(0)430192401
>>>>>>>>>>>> ============================================================================
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 16, 2013, at 3:40 PM, Sébastien Boisvert wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 15/04/13 04:40 PM, Marcelino Suzuki wrote:
>>>>>>>>>>>>>>  Hello and thanks for the answer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  I did understand that, and the directory does not exist when I
>>>>>>>>>>>>>> launch the job.  I do not get this error from the first 4 cores,
>>>>>>>>>>>>>> but after core 5, I get the error in the output from all cores.
>>>>>>>>>>>>>> The weirdest thing is that I had these messages in the outfile of
>>>>>>>>>>>>>> a run that completed, and that is why I was wondering if that is
>>>>>>>>>>>>>> normal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a race condition affecting version 2.1.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It was fixed by this commit:
>>>>>>>>>>>>>
>>>>>>>>>>>>> commit 73ada528e1d17105e17df524fdb60bdd03b40560
>>>>>>>>>>>>> Author: Sébastien Boisvert <sebastien.boisvert.3 <at> ulaval.ca>
>>>>>>>>>>>>> Date:   Fri Feb 8 23:55:34 2013 -0500
>>>>>>>>>>>>>
>>>>>>>>>>>>>    fix a race condition during directory probing
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Resolves-bug: https://github.com/sebhtml/ray/issues/125
>>>>>>>>>>>>>    Signed-off-by: Sébastien Boisvert <sebastien.boisvert.3 <at> ulaval.ca>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://bitbucket.org/sebhtml/ray/commits/73ada528e1d17105e17df524fdb60bdd03b40560
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The upcoming v2.2.0 release will include this bug fix.
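
The general shape of this kind of bug can be illustrated in shell. When several processes each test whether an output directory exists and only then create it, one process may see the directory that another has just created and report it as pre-existing. A common remedy is to let a single `mkdir` call be both the test and the creation, since it either creates the directory or fails. The sketch below only illustrates that pattern; it is not Ray's code, nor a description of what the commit above actually changed, and claim_output_dir is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to claim an output directory in one atomic step.
# mkdir fails if the directory already exists, so only one caller can win.
claim_output_dir() {
    local dir=$1
    if mkdir "$dir" 2>/dev/null; then
        echo "created"
    elif [ -d "$dir" ]; then
        # Someone else (or an earlier run) created it first.
        echo "exists"
        return 1
    else
        # mkdir failed for another reason (permissions, missing parent, ...).
        echo "error"
        return 2
    fi
}
```

With this shape, only the one process whose `mkdir` succeeds gets "created"; all others are told the directory exists and can decide whether that is an error.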
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This bug was introduced by the mini-ranks code.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, if you are using a single machine with 32 cores,
>>>>>>>>>>>>> you can use
>>>>>>>>>>>>>
>>>>>>>>>>>>>    mpiexec -n 32 Ray ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> or you can use
>>>>>>>>>>>>>
>>>>>>>>>>>>>    mpiexec -n 4 Ray -mini-ranks-per-rank 7 ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mini-ranks are experimental, but they seem to work nicely on
>>>>>>>>>>>>> SMP NUMA machines.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Marcelino
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Apr 15, 2013, at 10:14 PM, Sébastien Boisvert wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 14/04/13 04:45 PM, Marcelino Suzuki wrote:
>>>>>>>>>>>>>>>>        Hello Sebastien
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        I am having issues with assembling a 3.2 M read Ion Torrent
>>>>>>>>>>>>>>>> dataset using LoadLeveler on an Intel iDataPlex cluster, with the
>>>>>>>>>>>>>>>> MPI job crashing during execution.  It might be an issue with the
>>>>>>>>>>>>>>>> cluster (I got this message from the system administrator: "At the
>>>>>>>>>>>>>>>> moment we are having problems with the IB network and the disk
>>>>>>>>>>>>>>>> array"), but since I've seen an error in the output files coming
>>>>>>>>>>>>>>>> from some cores using the -output-filename option, I just want to
>>>>>>>>>>>>>>>> make sure I am not missing some detail:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        Error, crambe/ already exists, change the -o parameter to another value
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        crambe is my -o directory
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        Is that normal behavior?  I noticed I got the same errors in
>>>>>>>>>>>>>>>> an outfile of the same assembly that did go through to the end.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ray will not overwrite an existing directory.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You must use another directory name.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        Thanks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        Marcelino
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> Denovoassembler-users mailing list
>>>>>>>>>>>>>>>> Denovoassembler-users@...
>>>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

Sébastien Boisvert | 29 Apr 2013 19:24

Re: Ray assembly

[Please CC the mailing list]

On 24/04/13 12:33 PM, Alison Cloutier wrote:
> Hi Danny & Sebastien-
>
>       Sorry for the delay in responding- all of your emails from yesterday just appeared in my inbox
> (U of T mail seems to hang on to things sometimes...).
>
>       I'm trying to assemble 5 HiSeq runs of paired end data.

With such a huge amount of data, you should not be collapsing all the files into 3 huge files.

> In pre-processing the data, I first collapsed identical read pairs within each
>  run (i.e. PCR optical duplicates), then quality-trimmed the reads to a cutoff of Q=13 and minimum length
>  threshold of 25 bp, and lastly collapsed any overlapping read pairs into a single contiguous read
>  (this is fairly common in our dataset as it's ancient DNA so many of the inserts are of small size).

Ray will perform better if you provide your overlapping read pairs as pairs, because its heuristics
for paired reads are much better than its heuristics for single-end reads.

>
>      I then ran each dataset on its own with Ray 2.1.0 (and a more reasonable number of cores,
> i.e. 10 nodes [80 processes] on the large memory 32g nodes), and had no problems.

Do you have 5 independent, unrelated samples, or do you have 5 HiSeq runs for the same sample ?

>   I did this step so that I could map the raw reads back onto the scaffolds, and then remove reads
> that map to scaffolds that blast as bacterial (i.e. to remove contaminant sequences, which account
> for a substantial proportion of the data when working with ancient DNA).  After this final
> pre-processing step, the full dataset consisted of:

Oh I see. You did these small assemblies to weed out contaminations, right ?

> 151,284,953 paired reads (up to 101 bp each)

That's not a lot considering that you had 5 HiSeq runs (one HiSeq 2000 run is 6 000 000 000 sequences),
unless you are talking about partial HiSeq runs.

> 23,258,458 single reads (up to 101 bp, cases where partner was removed in quality trimming)
> 84,988,500 'overlap' single reads (where overlapping pairs were merged, so up to ~200 bp length)

Again, Ray will perform better with the non-merged version of these pairs.

>
> In comparison, the largest of the single run datasets consisted of: 111,497,923 paired and
> 15,938,968 single reads so running a full assembly didn't seem unreasonable. However, in the full
> assembly the program would stop partway through with an error that one of the ranks had exited
> with 'signal 9 (killed)'.  I tried another run with the 'show-memory-usage' options to try to see
> if it really was a memory issue, and I also found that as I incrementally increased the number of
> cores, Ray would progress further but then would just stall.

You should update to Ray v2.2.0 as it has an adaptive Bloom filter that will lower the memory used by
sequencing errors.

>  In reading through the denovoassembler postings, Sebastien had suggested to other users that:
>  routing could be necessary when using a large number of processes and/or that disabling read

It depends if you get a bad latency.

What are the first few lines of NetworkTest.txt ?

>recycling could avoid cases where the program gets stalled in a loop, so that's why I was trying out those options.

Where exactly is Ray stalling on your dataset ? (You can check that with the standard output file)
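
To see where each rank stopped, it can help to print the last line that every rank wrote to the standard output file. This is only a sketch: it assumes log lines start with "Rank <number>" (optionally followed by a colon), as in the outputs quoted elsewhere in this thread, and last_line_per_rank is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last log line emitted by each rank.
# $1 = path to the standard output file of the run.
last_line_per_rank() {
    awk '$1 == "Rank" {
             rank = $2
             sub(/:$/, "", rank)      # "0:" and "0" are the same rank
             last[rank] = $0          # later lines overwrite earlier ones
         }
         END { for (rank in last) print last[rank] }' "$1" |
    sort -k2,2n                       # order the output by rank number
}
```

Ranks whose last line is much older than the others, or that stop in a different step, are the first place to look.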

>It seems like it was stupid of me to leave all the reads in single gigantic files instead of
>splitting them up again after the pre-processing,

It's better for the storage to have many files.
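
Splitting a large FASTQ file can be done with standard tools, as long as chunk boundaries fall on record boundaries. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines) and GNU split; split_fastq and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a FASTQ file into chunks of N reads each.
# $1 = input FASTQ, $2 = reads per chunk, $3 = output prefix.
# Assumes four lines per read and GNU split (for -d and --additional-suffix).
split_fastq() {
    local input=$1 reads_per_chunk=$2 prefix=$3
    # Four lines per record, so chunk boundaries never cut a read in half.
    split -l $(( reads_per_chunk * 4 )) -d --additional-suffix=.fastq \
        "$input" "$prefix"
}

# Example (hypothetical file names): 10 million reads per chunk.
# split_fastq all_reads_1.fastq 10000000 chunk_1_
# split_fastq all_reads_2.fastq 10000000 chunk_2_
```

For paired files, using the same reads-per-chunk value on both files keeps mates in chunks with matching numbers, provided the two files list the pairs in the same order.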

> but I'm not sure that this is the only issue as the same was done for the single-dataset runs
> (or is it just the huge files PLUS a huge number of processes that really causes issues?).
>Could there also be issues caused by the data itself- i.e. with ancient DNA,

I don't know about ancient DNA.

> we might expect a lot of fairly short contigs and much missing data (due to the stochastic nature
> of what DNA survives intact from an ancient sample), and ancient DNA itself is often more error
> prone due to damage/lesions.  Could these features cause issues with the assembly algorithm?

No. You would just get a graph with missing parts.

>
> Any ideas would be greatly appreciated- it seems like using the new Ray release and splitting up
> the input files would be good ideas,
>and I'm really, really, really sorry to have caused issues for SciNet (really!).
>
>                Alison

