Rao,Xiayu | 24 Jul 23:59 2014

heatmap.2 x-axis label order problem


I encountered a strange problem. When I ran the same code in R on Windows and in R on Linux, the resulting
heatmaps show exactly the same figure (clustering), but on the x-axis the order of the duplicates is
reversed (the sample order is the same).

For example,
R-windows (R-3.1.1):                 sample1-R1, sample1-R2, sample2-R2, sample2-R1
R-linux (R-3.1.0 on HPC server):     sample1-R2, sample1-R1, sample2-R1, sample2-R2

Can anyone suggest how to track down the problem?
Thank you very much!
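
One thing that may be worth checking: the same clustering can be drawn with its branches flipped, so an identical dendrogram structure does not guarantee an identical leaf order. The sketch below uses only base R and a made-up 3 x 4 matrix (the values and the weights `4:1` are assumptions for illustration, not the poster's data) to produce two leaf orders for one clustering.

```r
# Hypothetical expression matrix: 3 genes x 4 samples (values are made up).
mat <- matrix(c(1.0, 1.1, 5.0, 5.2,
                2.0, 2.1, 6.0, 6.3,
                3.0, 3.2, 7.0, 7.1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("g", 1:3),
                              c("sample1-R1", "sample1-R2",
                                "sample2-R1", "sample2-R2")))

# Cluster the columns (samples) and convert to a dendrogram.
hc   <- hclust(dist(t(mat)))
dend <- as.dendrogram(hc)

# Leaf order as it would be drawn, left to right.
print(labels(dend))

# Reordering by weights rotates branches but keeps the tree itself intact.
flipped <- reorder(dend, wts = 4:1, agglo.FUN = mean)
print(labels(flipped))
```

Printing `labels()` of the column dendrogram on both machines shows whether the two runs differ only by such a rotation. Passing one fixed dendrogram through the `Colv` argument of heatmap.2 may then make the label order reproducible.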



Bioconductor mailing list
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Ty Thomson | 24 Jul 20:42 2014

dealing with confounded effects in experiment design

Hi BioC List,

I'm working with an Affymetrix data set where the batches are completely confounded with the factor of
interest for one contrast, treatment time.  From my understanding I can use something like fRMA to
partially mitigate the effect, but otherwise there is not much I can do.

However, we do have the original tissue samples and the option to re-extract/reprocess some samples in a
new batch.  Due to the study size, rerunning all samples with a proper randomized design is out of the
question.  Are there any studies describing how we might rerun a small subset of samples to recover the
contrast of interest?  Or does anyone have any advice?  For example, can I run 10 samples from each time point
in a new batch - could that be sufficient?  Clearly it could depend on the size of the batch effect, so I
understand that there is probably no definitive answer...

Here's the sample breakdown with the relevant treatments/covariates - you can see that the t1 vs. t2
contrast is completely confounded with batch, while the t2 vs. t3 contrast is not. I also included another
covariate that we want to account for, Ethnicity, and the number of samples in each group:
Batch  Timepoint   Ethnicity  Num_Samples
b1         t1         A         21
b1         t1         B         54
b2         t1         A         20
b2         t1         B         56
b3         t2         A         10
b3         t2         B         35
b3         t3         A         9
b3         t3         B         38
b4         t2         B         49
b4         t3         A         1
b4         t3         B         43
b5         t2         A         28
b5         t2         B         4
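
The confounding pattern in the table above can be checked with a cross-tabulation. This is a minimal base-R sketch that re-enters the counts from the table; the helper `shared_batches` is a hypothetical name introduced here for illustration.

```r
# Sample breakdown copied from the table above.
design <- data.frame(
  Batch       = c("b1", "b1", "b2", "b2", "b3", "b3", "b3", "b3",
                  "b4", "b4", "b4", "b5", "b5"),
  Timepoint   = c("t1", "t1", "t1", "t1", "t2", "t2", "t3", "t3",
                  "t2", "t3", "t3", "t2", "t2"),
  Ethnicity   = c("A", "B", "A", "B", "A", "B", "A", "B",
                  "B", "A", "B", "A", "B"),
  Num_Samples = c(21, 54, 20, 56, 10, 35, 9, 38, 49, 1, 43, 28, 4)
)

# Samples per Batch x Timepoint cell (summed over Ethnicity).
tab <- xtabs(Num_Samples ~ Batch + Timepoint, data = design)
print(tab)

# Hypothetical helper: batches that contain samples from both timepoints.
# A contrast can only be separated from batch within such shared batches.
shared_batches <- function(tab, a, b) {
  rownames(tab)[tab[, a] > 0 & tab[, b] > 0]
}

print(shared_batches(tab, "t1", "t2"))  # batches containing both t1 and t2
print(shared_batches(tab, "t2", "t3"))  # batches containing both t2 and t3
```

Any samples rerun in a new batch would add a row to this table, so the same check could be repeated on a proposed rerun design to see whether t1 and t2 then share a batch.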

David | 24 Jul 17:04 2014

affymetrix rat 2.0 gene annotation issue

Dear users, in case it is of help to anyone, just for the record, or in case someone can share some comments...

I have searched the list for similar issues but have not found anything related.

We have just performed a Rat Gene 2.0 experiment and have found that the current Affymetrix annotation files
(version na34.rn5) have errors in the chromosome, strand, start and end columns. We got in touch with
support, who told us that there was an error in the annotation protocol and that they will review it.

In addition to this, we have also detected that different transcript_cluster_IDs interrogate the same
genes/sequences, with apparently the same probesets, so there is quite a degree of redundancy. On this,
support has given us some explanations. Here is just the start of the email:

"The short answer is that the re-use of probes with the
same sequence in different transcript clusters is a result of how the Gene
array design handles genes that have duplicate copies in different parts of the
genome, or genes that are part of a widespread gene family with regions of
near-identity. The issues regarding the gene assignments of these probes is a
side-effect of drift in the transcript record between array design and
annotation time. Read on for the technical nitty-gritty. [...]"


Finally, we have found that the Rat Gene 2.0 transcript annotation file, which has some 36000 transcript
clusters in total, has just below 30000 transcripts defined as "main", of which 11000 have no
annotation whatsoever. To me that is far too much missing annotation.



Peter Davidsen | 24 Jul 17:01 2014

Is a subset of my arrays from degraded RNA?

Dear List,

Although I do realise that my question has more to do with actual data
interpretation than with coding using BioC packages, I'm hoping for some
input from other users with experience in microarray data analysis.

In order to support my explanation below, I have made a pdf with
diagnostic plots. I will refer to specific slides as I go along. The
presentation can be downloaded here: https://db.tt/jBqPNxIN

At the moment I'm analysing some microarray data as part of a
collaboration. Unfortunately, I have very little knowledge about the
actual generation/processing of these samples which could help address
my question.

By doing a boxplot on the raw Affymetrix chip data (from the U133plus2
platform), I noticed 2 'batches' based on differences in signal
intensities. Hierarchical clustering using all probesets on the array
supports this division (Page 1 and 2). Notably, this separation
into batches (i.e. a high and a low intensity batch) can partially be
traced back to the ScanDate of the arrays. That is, the ~100 samples
were scanned over three consecutive days; all samples scanned on the
first day belong to the high intensity batch whereas all samples
scanned on day 3 belong to the low intensity batch. The samples
scanned on day 2 split roughly in half between the high and the low
intensity batch.

When I do an RLE plot (Page 3 - top), the median value for most of the
samples from the low intensity batch is between 0.1 and 0.2 (and not
zero as expected). Further, whereas ~40% of the probesets are called

Kaj Chokeshaiusaha [guest] | 24 Jul 14:07 2014

Suitable learning sets, gene selection methods and classification methods for low replicated microarray samples

Dear grateful R helpers,

I'm a biologist who is learning gene expression profiling, and I have to deal with a low number of
replicates (2-3 biological replicates per group). Given my lack of background in bioinformatics, I find CMA
to be a very user-friendly package for supervised classification tasks.

However, I really have no clue which choices are suitable for classifying my low-replicate
samples. The choices I need to make are how to:

1. Select a method to generate learning datasets
2. Select the gene selection methods
3. Select classification methods
4. Acquire the generated learning datasets so they can be used with other gene selection methods not available in
the CMA package (for example, Rank Product and LPE)
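
For points 1 and 4, leave-one-out splits are one option that is often considered when there are only 2-3 replicates per group. The sketch below builds such learning sets in plain base R, so the index sets can be handed to any gene selection method. It does not use CMA's own functions, and the group labels and the 3-vs-3 layout are assumptions for illustration.

```r
# Hypothetical class labels: two groups with 3 biological replicates each.
y <- factor(rep(c("groupA", "groupB"), each = 3))
n <- length(y)

# Leave-one-out learning sets: row i holds the training sample indices
# used when sample i is held out as the test sample.
learn <- t(sapply(seq_len(n), function(i) setdiff(seq_len(n), i)))
print(learn)

# Example: training labels and held-out label for the first split.
print(y[learn[1, ]])
print(y[1])
```

Each row of `learn` can then be used to subset the expression matrix before running a gene selection method that lives outside CMA.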

Any suggestions would be more than appreciated.

With Respects,
Kaj Chokeshaiusaha 

 -- output of sessionInfo(): 

R version 3.1.0 (2014-04-10)                                                               
Platform: x86_64-pc-linux-gnu (64-bit)                                                     

 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 

Gordon K Smyth | 24 Jul 08:29 2014

Technical Replicates in EdgeR

> Date: Wed, 23 Jul 2014 15:41:02 -0500
> From: Neha Mehta <nsmehta <at> u.northwestern.edu>
> To: "Ryan C. Thompson" <rct@...>
> Cc: bioconductor@...
> Subject: Re: [BioC] Technical Replicates in EdgeR
> Thank you for your answer! Moving forward I removed the lane that I
> verified by plotMDS to be different from the other two. I have 2 further
> questions.
> 1) I have a few highly expressed genes - the 2 most highly expressing 
> genes make up 23 and 10 percent of all mappable reads, respectively. Do 
> I need to do something to make sure that these genes will not have a 
> negative effect on my DE assessment?

This is what TMM (or compositional normalization) is intended to ensure.

> I plan to use edgeR for DE analysis, and I know I can use TMM to 
> normalize. Will this be enough?

Probably.  There's nothing better available anyway.
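
To illustrate why a few dominant genes matter, here is a toy sketch of composition bias in base R. It uses a simplified, unweighted trimmed mean of log ratios; edgeR's actual TMM (via `calcNormFactors`) additionally trims on abundance and uses precision weights, so this is not its implementation. All counts are made up.

```r
# Toy counts: 1000 genes, identical in both samples except gene 1,
# which dominates sample B (all numbers are made up for illustration).
n_genes <- 1000
counts <- cbind(A = rep(100, n_genes), B = rep(100, n_genes))
counts[1, "B"] <- 50000

lib_size <- colSums(counts)               # A = 100000, B = 149900
prop <- sweep(counts, 2, lib_size, "/")   # naive library-size scaling

# Per-gene log2 ratio of B vs A after naive scaling. The 999 unchanged
# genes all come out negative because gene 1 inflates B's library size.
M <- log2(prop[, "B"] / prop[, "A"])

# Simplified, unweighted trimmed mean of M: drop 30% of genes at each end.
norm_factor_B <- 2^mean(M, trim = 0.3)

# Effective library size for B after correcting for composition.
eff_lib_B <- lib_size[["B"]] * norm_factor_B
print(c(naive = lib_size[["B"]], effective = eff_lib_B))
```

With the effective library size, the unchanged genes have the same scaled abundance in both samples, which is the behaviour compositional normalization aims for.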

> 2) When I ran a MAplot to compare my bio reps I saw that there are some 
> outliers, I have attached examples of 4 pairs of bioreps. Is this 
> something I should be concerned about?

I don't particularly see outliers from your plots, but I do see a lot of 
variation between your reps.  Why are they so inconsistent?

Obviously genes that are inconsistent between reps will get large 

Nhu Quynh T. Tran | 24 Jul 00:59 2014

Unable to load XPS package in R studio


I installed ROOT successfully and am able to call library(xps) in the terminal, but not in RStudio.  The
error message is:

In R Studio:
> library(xps)
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/3.1/Resources/library/xps/libs/xps.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.1/Resources/library/xps/libs/xps.so, 6): Library not loaded: @rpath/libGui.so
  Referenced from: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/xps/libs/xps.so
  Reason: image not found
Error: package or namespace load failed for ‘xps’

I'm using: 
R version 3.1.0 (2014-04-10) -- "Spring Dance"
Platform: x86_64-apple-darwin10.8.0 (64-bit)

In terminal:
> library(xps)

Welcome to xps version 1.24.0
    an R wrapper for XPS - eXpression Profiling System
    (c) Copyright 2001-2014 by Christian Stratowa

Please advise!

Thank you.
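
One possible cause, offered as a guess rather than a diagnosis: a GUI application on OS X may not inherit the environment variables that the terminal shell sets up for ROOT. Running the base-R snippet below in both the terminal R session and RStudio and comparing the output would show whether that is the case. The choice of variables to inspect is an assumption.

```r
# Environment variables that commonly matter for locating ROOT's shared
# libraries (this list is an assumption; adjust to the local setup).
vars <- c("ROOTSYS", "DYLD_LIBRARY_PATH", "LD_LIBRARY_PATH", "PATH")

# unset = NA makes variables that are not defined show up as NA
# instead of as empty strings.
env <- Sys.getenv(vars, unset = NA)
print(env)
```

If a variable is set in the terminal session but shows as NA in RStudio, starting RStudio from that same terminal session is one way to test whether the environment is the difference.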

Kuenne, Carsten | 23 Jul 14:22 2014

DESeq2 question regarding log2FC


I just compared the same dataset using DESeq 1 and DESeq 2. Strikingly, while the baseMeans are the same for
the same gene, the log2FoldChange is actually different!? Is there an error in the R script I am using,
or how else can that difference be explained?

"real" log2FC


Sin Yee Ku | 23 Jul 15:49 2014

Follow up on progress for brainarray cdfs on exon arrays.

Dear All,

I wanted to follow up on the progress regarding the use of remapped CDFs with oligo. From the thread in
January, it was established that oligo did not support the remapped CDFs from brainarray, however it was
something that we could expect for the future.

Are there any updates regarding the progress of the project?

I am currently working with the Mouse Exon 1.0 ST array, and would like to use remapped CDFs from brainarray
for my analysis.

Thanks a bunch!


Vivian (Sin Yee) Ku

Ontario Institute for Cancer Research
MaRS Centre

661 University Avenue

Suite 510
Toronto, Ontario
Canada M5G 0A3

Email: SinYee.Ku@...

Toll-free: 1-866-678-6427

Pesce, Francesco | 23 Jul 13:39 2014

DESeq | Batch effect | VST data and linear model adjustment


We have two cohorts, cases and controls, and a set of covariates for both of them (center,
library.prep.date, age, rna.rin.score, sex).
Center and library.prep.date are collinear with the status (all the cases were collected in London, while
the controls were collected in 4 different centers worldwide), so I used the first principal component of
these two covariates and ran DESeq2 using this design:
~ PC1 + age + rna.rin.score + sex + status
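
The collinearity described above can be made concrete by checking the rank of the design matrix. This is a base-R sketch on a made-up sample sheet (12 samples; the center names C1 to C4 are hypothetical) that mirrors the situation where every case comes from one center.

```r
# Hypothetical sample sheet: all cases from London, controls from 4 other
# centers (names and counts are made up for illustration).
info <- data.frame(
  status = factor(c(rep("case", 4), rep("control", 8))),
  center = factor(c(rep("London", 4),
                    rep(c("C1", "C2", "C3", "C4"), each = 2)))
)

# Design matrix with both center and status as covariates.
X <- model.matrix(~ center + status, data = info)

# If the rank is lower than the number of columns, at least one coefficient
# cannot be estimated separately: here status is fully determined by center.
print(ncol(X))
print(qr(X)$rank)
```

A rank deficit like this means the data alone cannot distinguish a status effect from a center effect, whatever summary of center is placed in the design.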

Unfortunately it looks like the batch effect is too strong, and I have ~16K genes with adjP<0.05.
One question: is the fold change still reliable (so that I can use it as the ranking for GSEA, for example)?
The differential expression results might be compromised by the study design, and I don’t know if I can
use them (what do you think?), but the main problem is the following:

I have based all the analyses for my PhD thesis and the manuscript I am preparing on DESeq.
The pipeline is based on co-expression clustering (WGCNA), diffco-ex between cases and controls and GWAS
hits enrichment in these clusters.

As pre-processing for these analyses, I first obtained the VST data and then adjusted these for the
covariates using a linear model. Then I used the residuals for the analyses:

> vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
> vstMat <- assay(vsd)
> lm=lm(vstMat ~ as.factor(info$library.prep.date) + as.numeric(info$age) + as.factor(info$sex) +
as.numeric(info$rna.rin.score) + as.factor(info$center))
> data = residuals(lm)
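
One detail worth checking in this kind of adjustment is matrix orientation: `lm()` with a matrix response treats rows as observations, while `assay()` returns genes in rows and samples in columns. The sketch below uses simulated data in base R (the matrix, the covariates and their values are all made up) to show the transposed fit and how to get residuals back in genes x samples layout.

```r
set.seed(1)

# Simulated stand-in for a VST matrix: 5 genes (rows) x 12 samples (columns).
n_genes <- 5
n_samples <- 12
mat <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("s", seq_len(n_samples))))

# Hypothetical covariates, one row per sample.
info <- data.frame(age = seq(30, 63, length.out = n_samples),
                   sex = factor(rep(c("F", "M"), 6)))

# Samples must be the rows of the response, so transpose before fitting.
# Each gene (column of t(mat)) gets its own regression on the covariates.
fit <- lm(t(mat) ~ age + sex, data = info)

# Residuals come back as samples x genes; transpose to genes x samples.
adj <- t(residuals(fit))
print(dim(adj))
```

Naming the fitted object `fit` rather than `lm` also avoids masking the `lm()` function for the rest of the session.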

Our main question is whether this pre-processing is correct: does the linear model work here

Giovanni Calice | 23 Jul 13:19 2014

dmpFinder results

Hi all,

I have a question regarding the dmpFinder method in the minfi package 1.10.2.

My dataset is composed of 49 samples.
I have split my methylation profiling matrix (always SWAN normalized,
filtered on p-value and sex chromosomes),
which contains all cg probe IDs, into several sub-matrices with the cg probe IDs that cover
some genomic regions of interest.
Then I call dmpFinder with a sub-matrix as input.

I got strange results from dmpFinder like this (only the head of dmp result

         intercept        f                pval
63803    -1.7708894377    33.0502014338    5.7429067504821e-07
35558    -2.0594919426    30.5216185561    1.26280541394315e-06
37196     1.6545682828    29.0969331518    1.99216453478782e-06
17326     2.6031068482    24.9262239789    7.98412955973361e-06
13446     2.0077186022    22.9217904185    1.60468187115676e-05
21260    -1.6861162973    22.9201474848    1.60561441086618e-05
54663    -2.2310293103    22.4803114045    1.87681049414807e-05