Nicolas Delhomme | 29 Aug 12:11 2014

Re: edgeR: mixing technical replicates from Illumina HiSeq and MiSeq

Really nice approach ryan, I’ll keep it in mind :-)

Nick, I usually don’t use cutoffs for that no. If I’m unsure (e.g. the effect is not obvious or is minimal
- i.e. on the technical variance on the PCA is much smaller than the biological one) I would conduct the
analyses for the different approaches and look how they influences the results and then would select the
more conservative approach. I know this sounds vague, but making this decision is frequently dependent
on how the other samples behave. Hence, what we always do when we make our data and analysis public is that we
also make the analysis code public; i.e. we use knit / pandoc to create an HTML document that details every
decision we’ve made.


Nicolas Delhomme

The Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 5478
Email: nicolas.delhomme@...
SLU - Umeå universitet
Umeå S-901 87 Sweden

On 29 Aug 2014, at 11:29, Nick N <feralmedic@...> wrote:

> Thanks Ryan and Nicolas!
> I was wondering whether there is some sort of decision tree that can be formalised.
(Continue reading)

Thomas Pfau | 29 Aug 10:30 2014

hugene10sttranscriptclusterACCNUM has no mappings


I just tried to get a probe to accession matching the above annotation 
database. In particular it does not yield any mappings for accessions. 
x <- hugene10sttranscriptclusterACCNUM
mapped_probes <- mappedkeys(x)
yields an empty mapped_probes list.

I'm Running R 3.1.1 on ubuntu.
The loaded packages are:

  [1] oligo_1.28.2 Biostrings_2.32.1 XVector_0.4.0
  [4] IRanges_1.22.10 oligoClasses_1.26.0 
  [7] RSQLite_0.11.4 DBI_0.2-7
[10] AnnotationDbi_1.26.0 GenomeInfoDb_1.0.2 Biobase_2.24.0
[13] BiocGenerics_0.10.0 BiocInstaller_1.14.2

and capture.output(hugene10sttranscriptcluster()) yields:
  [1] "Quality control information for hugene10sttranscriptcluster:"
  [2] ""
  [3] ""
  [4] "This package has the following mappings:"
  [5] ""
  [6] "hugene10sttranscriptclusterACCNUM has 0 mapped keys (of 33297 keys)"
  [7] "hugene10sttranscriptclusterALIAS2PROBE has 60778 mapped keys (of 
103510 keys)"
  [8] "hugene10sttranscriptclusterCHR has 19962 mapped keys (of 33297 
(Continue reading)

Nick N | 28 Aug 18:23 2014

edgeR: mixing technical replicates from Illumina HiSeq and MiSeq


I have a study where a fraction of the samples have been replicated on 2
Illumina platforms (HiSeq and Miseq). These are technical replicates - the
library preparation is the same using the same biological replicates - it's
only the sequencing which is different.

My hunch was that I shall introduce the platform as as an additional
(blocking) factor in the analysis. Than I stumbled upon this post:

It recommends pooling the replicates. The post seems to apply to a
different case ("pure" technical replicates, i.e. no differences in the
sequencing platform used) so I probably shall ignore it. But I still feel a
bit uncertain of the best way to treat the technical replicates. Can you,
please, advise me on this?

many thanks!

	[[alternative HTML version deleted]]

Bioconductor mailing list
Search the archives:

(Continue reading)

Arunkumar Srinivasan | 28 Aug 22:09 2014

IRanges findOverlaps potential bug?

Hi BioC list,

I just came across something I believe is an unexpected result (potential bug) in IRanges. I'm using the
development version, but the same happens in stable version as well. My sincerest apologies if it's a
misunderstanding on my part.

In the example below, I'd expect the result for 'ans2' to be a vector of length 6, but with all values = 4L
instead of 3L. By looking at `?findOverlaps`, it seems like `select` should work with all values of
argument 'type'.

## code
ii = IRanges(start=3L, end=c(5L, 5L, 16L, 20L, 24L, 47L))
# IRanges of length 6
#     start end width
# [1]     3   5     3
# [2]     3   5     3
# [3]     3  16    14
# [4]     3  20    18
# [5]     3  24    22
# [6]     3  47    45

xx = IRanges(start=c(1L,2L,3L,3L,6L), end=c(646L, 481L, 672L, 1037L, 153L))
# IRanges of length 5
#     start  end width
# [1]     1  646   646
# [2]     2  481   480
# [3]     3  672   670
# [4]     3 1037  1035
# [5]     6  153   148
(Continue reading)

Atul | 28 Aug 21:52 2014

Combing Effects (t-stats) from experiment with common reference design?

Hi All,

I was wondering whether there is any approach to combine 't-stat' from 
different comparisons but using same control. These are my contrasts:

Stage1 vs ControlX
Stage2 vs ControlX
Stage3 vs. ControlX
Stage 20 vs. ControlX

Here the control is same i.e. same sample for all contrasts. From 
'limma' analysis I have Fold change, t-stats and p-values for each gene.

Now, is it possible to combine 't-stats' from all different stages to 
single value? Or compute a single combined value for all the contrasts. 
So, that this single metric could be used to rank genes across all time 
points. Is there any package available to do so? I can find methods to 
combine p-values but not the 't-stat'.



	[[alternative HTML version deleted]]

Bioconductor mailing list
(Continue reading)

Tim Triche, Jr. | 28 Aug 18:42 2014

Re: Interspecies differential expression of orthologs with Edger

This is super helpful.  Just to be clear, the most robust solution is to
use edgeR and offset for putative gene length, TMM & library size while
using raw counts (not effective counts) estimated by e.g. RSEM, eXpress, or
the like?

Also re: cross-species comparisons, while my experience is that it is
indeed a can of worms, Mark Gerstein's group recently published a method
that might interest others working on non-model or incompletely annotated

Any thoughts on applicability of the method for kooky experiments such as
comparing Drosophila hemocytes, zebrafish vascular endothelial progenitors
and the same in mice?  Or for that matter, alligator differentiation.

I never realized how hard RNAseq in non-model organisms was until I tried

Statistics is the grammar of science.
Karl Pearson <>

On Thu, Aug 28, 2014 at 8:37 AM, assaf www <assafwww@...> wrote:

> I checked, it's true, the results look the same.
> As for FPKM, of course you are right.
> Thanks a lot
> Assaf
(Continue reading)

Gordon K Smyth | 28 Aug 03:01 2014

normalization and batch correction across multiple project

We have had to regularly address the same issues that you are facing. 
There is no blanket answer -- every case needs to be considered on its own 
merits -- but you seem to be considering the right options.

In our work, we generally adjust for the batch in the limma linear model 
rather than trying to remove it up-front using combat.  Also consider 

As you say, analysing multiple projects together can help estimate a batch 
effect.  However this approach will come unstuck if the samples for the 
projects are very different.  There is another reason why we generally 
avoid analysing multiple projects together.  The projects will usually 
need to be submitted eventually to a public repository such as GEO, and 
the different projects generally have to be submitted independently. 
Users will not be able to reproduce our normalization and analysis unless 
the projects are analyzed separately.

Best wishes

> On Mon, Aug 18, 2014 at 1:11 PM, Adaikalavan Ramasamy wrote:
> Dear all,
> I would like to appeal to the collective wisdom in this group on how best
> to solve this problem of normalization and batch correction.
> We are a service unit for an academic institute and we run several
> projects simultaneously. We use Illumina HT12-v4 microarrays which can
> take up to 12 different samples per chip. As we QC the data from one
(Continue reading)

ferreirafm | 28 Aug 01:06 2014

heatmap.2: Colv dendrogram doesn't match size of x

Hi List, 
I've got some problems doing a heatmap with following peace of code: 

row_distance = dist(dfdat_zscore, method =manhattan") 
row_cluster = hclust(row_distance, method = "ward.D") 
col_distance = dist(dfdat_zscore, method = "manhattan") 
col_cluster = hclust(col_distance, method = "ward.D") 

main = "Z-score", 
margins =c(12,9), 
dendrogram = "both", 
cexCol=0.5, cexRow=0.5, 
key=T, keysize=1.0, 
Rowv = as.dendrogram(row_cluster), 
Colv = as.dendrogram(col_cluster), 
distfun = dist(method = "manhattan"), 
hclustfun = hclust(col_distance, method = "ward.D"), 

I've read some posts suggesting to upgrade gplots. However, the issue seems to persist with gplots_2.14.1 
Any help is appreciated. 

(Continue reading)

Maria Kesa | 27 Aug 22:18 2014

getting normalized expression values from GEO GSE files


My name is Maria and my goal is to get normalized gene expression values
from this study

I installed GEOQuery and it's dependencies RCurl and XML library.

I have two questions:
1. How do I resolve the error that is posted below, when I try to
use gse3398<-getGEO('GSE3398',GSEMatrix=TRUE) ? (I tried installing and
reinstalling RCurl and GEOQuery)
2. How should I normalize the data, considering that there are multiple
platforms in the experiment?
3. If point 1. can not be made to work, I found that it is possible to load
the files manually using the links like (Replacing GPL2648 with the
different platforms in the series)
My question is how do I process these files and put them into an eset in R?
As I ask in question 2, how do I get the normalized gene expression values
out of the data and get the gene names?

Your help would be much appreciated! The error message that I get and the
sessionInfo is below.

> gse3398<-getGEO('GSE3398',GSEMatrix=TRUE)Found 7
file(s)GSE3398-GPL2648_series_matrix.txt.gzsh: 1: curl: not foundError in file(con, "r") : cannot
open the connectionIn addition: Warning messages:1: In
download.file(sprintf("",  :
  download had nonzero exit status2: In file(con, "r") :
  cannot open file
(Continue reading)

Kaur Alasoo [guest] | 27 Aug 16:53 2014

Rsubread featureCounts unexpected strand-specifc counts with Ensembl GTF


I am trying to use featureCounts to perform strand-specifc RNA-Seq read counting with an Ensembl GTF file.
However, it seems to count reads from each strand separately rather than for each gene from the strand
where the gene is located. I have created a small BAM that contains reads from only two genes (one from each
strand) to illustrate the problem.

First, if I specify strandSpefic = 0 I get 97.7% of the reads aligned to two genes:
a = featureCounts("merged.bam", annot.ext = "Homo_sapiens.GRCh38.76.gtf",
strandSpecific = 0, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)

> a$stat
                      Status counts.merged.bam
1                   Assigned             16179
2       Unassigned_Ambiguity                 0
3    Unassigned_MultiMapping                56
4      Unassigned_NoFeatures               335

And the reads counts for the two genes are:
> a$counts[c("ENSG00000066336","ENSG00000136869"),]
ENSG00000066336 ENSG00000136869 
          11759            4419 

However, when I try to specifiy strandSpecific = 1 or 2 I get reads from only one strand counted in either case
b = featureCounts("counts/merged.bam", annot.ext =
"../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 1,
isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)

> b$stat
                      Status counts.merged.bam
(Continue reading)

Scott Robinson [guest] | 27 Aug 14:11 2014

Using custom CDF with 'make.cdf.env'

Dear List,

I made a custom CDF by modifying the original Affymetrix miRNA v1 file. As there is a great level of
redundancy in this chip I have condensed the original 7815 probe sets into 6190 probe sets (by 'moving'
probes from one set to another), however when I try making and attaching my new CDF environment I still seem
to have 7815 probe sets so presumably I must have done something wrong.

I have read the vignette and many similar posts to mine however still cannot work out what I am doing wrong.
Perhaps the problem is with the CDF itself? I have a short script testing the functionality, the output of
which I have copied in below. I will gladly attach the script, CDFs and example CEL file if there is nothing
obviously wrong with the code - would do this now but there doesn't appear to be an option on the webform.

Many thanks,


> folder <- "C:\Work\COPD-ASTHMA\microRNA files\newCDF\test\"
> setwd(paste0(folder,"CEL"))
> options(stringsAsFactors=FALSE)
> library(affy)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
(Continue reading)