Son Pham | 19 Apr 01:01 2014

Counterintuitive results from edgeR

Hi edgeR community:
I'm testing for DE genes in two groups. Group A has 6 replicates, group B
has 4 replicates. I spot a very strange gene, that should be DE but somehow
edgeR doesn't report it.

Here is the log2CPM of group A:
(5.8804275247607; 5.95666869234362;  6.24683910018322; 4.03634160591149;
4.9390167413036;  5.22009594582392)
and log2CPM of group B:

(6.68129783945799;   6.31414609301072;       6.94357946236393;

It's clear for us that these two groups has different CPM: B > A.
But here is how edgeR reports:
MyGeneID  0.990998868329884(logFC)       6.11753790162748(LogCPM)
0.0238675732815782(Pvalue)      1(FDR).

This is very counter intuitive. Why here we have FDR = 1  ? and even I can
not infer how logFC = 0.99

Any clarification is greatly appreciated !

Thank you,

	[[alternative HTML version deleted]]

Bioconductor mailing list
(Continue reading)

Gordon K Smyth | 19 Apr 01:01 2014

Limma table results [time course analysis]

Dear Stefania,

You have a replicated time course experiment with three time points. 
This is dealt with in detail in the limma User's Guide in Section 9.6.1 
"Replicate time points".  You should be able to apply the advice in that 
section to your data without any problems.

The limma User's Guide tells you that spline curves are only for time 
course experiments with many time points.  You don't have many time points 
-- only three.  There is no advantage in using spline curves for a time 
course epxeriment with only three time points.  You cannot achieve 
anything that the simpler analysis presented in Section 9.6.1 would not 
achieve more explicitly.

Given that you have fitted spline curves, the test that you have conducted 
will correctly select genes whose change over time is different between 
the three genotypes.  However you cannot interpret the individual logFC 
columns in the toptable.  These are simply mathematically defined basis 
vectors for the spline curve -- they have no meaningful interpretation as 
individual columns.

Best wishes

------ original message ------
[BioC] Limma table results

Stefania [guest] guest at
Tue Apr 15 11:09:46 CEST 2014

(Continue reading)

Zhu, Lihua (Julie | 18 Apr 16:23 2014

Re: Question about the ChIPpeakAnno:


Thanks for the email!

AnnotatePeakInBatch has  quite a few parameter to set other than default setting. Depending on how you use
it and your purpose, you could get different kind of annotation done including stranded annotation.

ChIPseq and ChIPchip protocol are strandless, meaning that you cannot distinguish peaks in + strand or –
strand. However, for other type of experiments such as Aseq for example,  peaks have strand information.

For stranded annotation, please do the following.

PlusAnno = TSS.human.NCBI36[strand(TSS.human.NCBI36) == "+" | strand(TSS.human.NCBI36) == "1",]
MinusAnno = TSS.human.NCBI36[strand(TSS.human.NCBI36) == "-" | strand(TSS.human.NCBI36) == "-1",]

myexp =  RangedData(IRanges(start=c(1543200,1557200,1563000,1569800,167889600,100,1000),
names=c("p1","p2","p3","p4","p5","p6", "p7")), strand=c("+" ,"-" ,"+" ,"+" ,"+" ,"+" ,"-" ), space=c(6,6,6,6,5,4,4))

Plus.annotatedPeak = annotatePeakInBatch(myexp[strand(myexp)== "+" ,] , AnnotationData = PlusAnno)

Minus.annotatedPeak = annotatePeakInBatch(myexp[strand(myexp)== "-" ,] , AnnotationData = MinusAnno)

Hope this helps! Please feel free to contact me if you need further clarification or encounter any problem.
I would appreciate if you could cc Bioncondutor list to benefit others who might have similar issues.
(Continue reading)

Bas Jansen | 18 Apr 15:37 2014

Warnings after gcrma normalisation...


I am importing microarray data, and after normalization with gcrma I get
the following warnings:

Warning messages:
1: replacing previous import by ‘utils::head’ when loading ‘mouse4302cdf’
2: replacing previous import by ‘utils::tail’ when loading ‘mouse4302cdf’

Should I be concerned, and if so, can something be done to prevent this?
I am on Mavericks (Mac OSX), on a Macbook Pro from early 2011. Further info
about the session:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] mouse4302probe_2.14.0 AnnotationDbi_1.26.0  GenomeInfoDb_1.0.2
 [4] mouse4302cdf_2.14.0   inSilicoMerging_1.8.0 inSilicoDb_2.0.1
 [7] RCurl_1.95-4.1        bitops_1.0-6          rjson_0.2.13
(Continue reading)

Arnaud Mounier | 18 Apr 12:35 2014

Ontology lack in goProfiles : no BP or CC just MF

I've build a specific DataFrame with python pandas to compute ontology 
frequencies with goProfiles in bioconductor. I use the basicProfile 
function with option 'GOTermsFrame' but without the optional column 
'Evidence'. I've got one big dataframe as follow :

     In [1]:
     <class 'pandas.core.frame.DataFrame'>
     Int64Index: 119626 entries, 0 to 119625
     Data columns (total 3 columns):
     GeneID      119626 non-null object
     GOID        119626 non-null object
     Ontology    119626 non-null object
     dtypes: object(3)
So, almost 120000 entries with divided with Ontology as follow :
     In [2]: df.groupby(['Ontology'])['Ontology'].count()
     BP          58802
     CC          26867
     MF          33957

When I compute goProfile with any three Ontology at level 2, I've got
this frequencies :

     In [3]: rdf = com.convert_to_r_dataframe(df)
     In [4]: %%R -i rdf
     > library(goProfiles)
     > rdf <-
     > print(head(rdf))
                     GeneID       GOID Ontology
(Continue reading)

Shuhei Yao [guest] | 18 Apr 11:09 2014

logFC and CPM of edgeR

Hi, All.

Let me confirm my understandings about logFC and CPM of edgeR.

1."logFC" using glm functions are calculated from normalized factor. This process does not include
transformation to cpm.
2. CPM should be calculated from count with its normalized factor, Main purpose of cpm to show heatmap (is
described in User's guide).

'logFC' from cpm are different from 'logFC' using glm functions. I'm sure that this result is general
according to my understanding. Is it correct ? And which count data (e.x. cpm, pseudo count and fpkm) is
suitable for pharmacologist ?   


 -- output of sessionInfo(): 

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)


Sent via the guest posting facility at

Bioconductor mailing list
(Continue reading)

samuel collombet [guest] | 18 Apr 10:01 2014

Rsamtools - summarizeOverlaps - count read on complementary strand

I have RNAseq data which are single-end (50b), strand specific, but reads are the complement of the RNA (ie
reads are always on the complementary strand to their corresponding gene). 
I would like to count reads on genes using summarizeOverlaps, however I do not find the way to tell him to
count reads that are on the complementary strand to the exons. Is there a way? (with HTseq-count, I was
using the option --stranded=reverse).

 -- output of sessionInfo(): 


Sent via the guest posting facility at

Bioconductor mailing list
Search the archives:

Gordon K Smyth | 18 Apr 03:43 2014

limma modeling, paired samples and continuous variable

> Date: Thu, 17 Apr 2014 09:26:33 +0200
> From: Riba Michela <riba.michela@...>
> To: "James W. MacDonald" <jmacdon@...>
> Cc: bioconductor@...
> Subject: Re: [BioC] limma modeling, paired samples and continuous
> 	variable
> Hi,
> thanks a lot for your kind answer.
> I have to specify an additional observation:
> the "r"parameter is indeed a numeric variable and also in this situation the result is the same.

Actually it is not possible to get the same message as before if you have 
correctly code r as a numeric variable.

> Would be reasonable to try and model the design as:
> design<- <- model.matrix(~0+r)
> #where "r"is a numeric variable?
> for the points about the coefficients I have to reason about


To answer your question "if differential gene expression between two 
classes of disease are correlated with the r status", you probably need a 
Genotype:r iteraction term in your model.

You probably need to show us the whole targets frame for us to help you 
further.  In other words, we need to see:
(Continue reading)

Gordon K Smyth | 18 Apr 03:26 2014

edgeR and calcNormFactors manually

Dear Scott,

Normalization scaling factors are specific to each library, so you cannot 
meaningfully compute them for one set of libraries and copy the results to 
a larger set of libraries.

If you try to explain what you are actually trying to achieve, we might be 
able to give you some advice.

As it is, I don't yet understand why you aren't simply using 
calcNormFactors() in the way it is intended, which would be to apply it to 
all libraries at once, or why you need to do any prior ad hoc scaling.

Not sure what you mean by an 'input' sample, because 'input' is usually 
associated with ChIP-seq rather than RNA-seq.

Best wishes

> Date: Tue, 15 Apr 2014 16:40:53 -0700
> From: Scott Daniel <scottdaniel@...>
> To: bioconductor@...
> Subject: [BioC] edgeR and calcNormFactors manually
> Dear Colleagues,
> I have an RNA-seq library with 27 samples. 9 of these samples are what we
> call input as the other 18 are derived from them through a biochemical
> process. As such, I have already scaled the genes in those 18 samples to
> roughly add-up to the corresponding inputs.
(Continue reading)

Gordon K Smyth | 18 Apr 03:16 2014

use of voom function with attract package

Dear Emmanouela,

The limma package is designed to fit linear models, and it can compute 
t-statistics and F-statistics faster than making your own loop to lm(). If 
you want F-statistics for distinguishing the cell types, why not:

   fit <- lmFit(anal_voom, design)
   fit <- eBayes(fit[,-1])

Then the F-statistics will be in fit$F.

If you want to know whether a particular KEGG pathway tends to have larger 
F-statistics, you could also use:

   geneSetTest(index, fit$F)

where index selects genes in the pathway.  If there are only two cell 
types, a better way would be:

   camera(anal_voom, index, design)

With camera, index could be a list of index vectors for all the KEGG 
pathways at once.

Best wishes

> Date: Tue, 15 Apr 2014 09:44:42 -0700 (PDT)
> From: "Emmanouela Repapi [guest]" <guest@...>
> To: bioconductor@..., emmanouela.repapi@...
(Continue reading)

Gordon K Smyth | 18 Apr 02:36 2014

TCC::ERROR: Need the design matrix for GLM

Dear Panka,

It seems as if you are just using the TCC package to call methods from the 
edgeR package indirectly.

Why not use the edgeR package directly?  That would probably be easier and 
you would have a more direct understanding of the methods being used. 
Your experiment is almost identical to the oral carcinoma case study in 
the edgeR User's Guide.

Best wishes

> Date: Tue, 15 Apr 2014 13:51:17 +0000
> From: Pankaj Agarwal <p.agarwal@...>
> To: "bioconductor@..." <bioconductor@...>
> Cc: "kadota@..." <kadota@...>
> Subject: [BioC] TCC::ERROR: Need the design matrix for GLM.
> Hi,
> I have a rna-seq data consisting of matched tumor/normal samples from two patients.  For normalization of
the counts I am following the steps in the TCC vignette section "3.3 Normalization of two-group count data
without replicates (paired)".  The output from the commands are as follows:
>>  data=read.delim("count_bt2_iGenomes_Ensembl.tsv")
>> head(data)
>                A.sorted.bam B.sorted.bam
> ENSG00000000003                               2400                      1130
(Continue reading)