Khurram Nadeem | 23 Jul 17:33 2014

Importing random subsets of a data file

Hi R folks,

Here is my problem.

*1.* I have a large data file (say, in .csv or .txt format) containing 1
million rows and 500 variables (columns).

*2.* My statistical algorithm does not require the entire dataset but just
a small random sample from the original 1 million rows.

*3.* This algorithm needs to be applied 10000 times, each time generating a
different random sample from the 'big' file as described in (2) above.

Is there a way to 'import' only a (random) subset of rows from the .csv
file without importing the entire dataset? A quick search on various R
forums suggests that read.table() does not have this functionality.
Obviously, I want to avoid importing the whole file because of memory
issues. Looking forward to your help.

 Khurram Nadeem
 Postdoctoral Research Fellow
 Department of Mathematics & Statistics
 Acadia University, NS, Canada.
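
One possible approach to the question above, as a rough sketch rather than a definitive answer: draw the row numbers first, then stream the file through a connection in chunks and keep only the chosen lines. The helper name sample_csv_rows, its arguments, and the chunk size are hypothetical. The sketch assumes the file has one header line, that the number of data rows is already known (1 million in the post), and that no quoted field contains an embedded newline.

```r
# Hypothetical helper: sample n data rows from a CSV with a header line,
# holding only one chunk of raw lines in memory at a time.
sample_csv_rows <- function(path, n, total_rows, chunk_size = 10000L) {
  keep <- sort(sample.int(total_rows, n))   # row numbers to keep (1 = first data row)
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  picked <- character(0)
  offset <- 0L                              # data rows consumed so far
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    # sampled row numbers that fall inside this chunk, as chunk-local indices
    idx <- keep[keep > offset & keep <= offset + length(chunk)] - offset
    picked <- c(picked, chunk[idx])
    offset <- offset + length(chunk)
  }
  read.csv(text = c(header, picked))        # parse only the sampled lines
}
```

Each of the 10000 repetitions would call the helper again to get a fresh sample. Since every call rescans the file, loading the data once into a database and sampling from there may be faster in practice.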


John McKown | 23 Jul 15:47 2014

Trying to change a qplot() to a ggplot()+

I'm trying to change a qplot to a ggplot because I want two plots of
the same data: one a bar chart, the other a line graph.
What I'm trying:

#MSU_graph_m1b <-

MSU_graph_m1   <-

MSU_graph_m1b  <- MSU_graph_m1+geom_bar();

#MSU_graph_m1l <- qplot(Int_Start,LicPrLsys4HMSU,data=cpprdald2_m1,geom="line");

MSU_graph_m1l  <- MSU_graph_m1+geom_line();

The commented lines are what works. What fails is the first ggplot() call:

> MSU_graph_m1   <- qqplot(cpprdald2_m1,aes(x=Int_Start,y=LicPrLsys4HSMU,colour=System_alias));
Error in :
  dims [product 9912] do not match the length of object [9923]

cpprdald2_m1 is:
> str(cpprdald2_m1)
'data.frame':   168 obs. of  60 variables:

and Int_Start and LicPrLsys4HSMU are variables in cpprdald2_m1.
Int_Start is a POSIXlt. LicPrLsys4HSMU is a number. I have also tried
with x=as.character(Int_Start) in the aes().

Michael Friendly | 23 Jul 15:31 2014

shading cells in a latex table by value

I want to create LaTeX tables of values where the cell background is
shaded according to the table value.  For example:

set.seed(1) # reproducibility
mat <- matrix(3 * rnorm(12), 3, 4)
rownames(mat) <- letters[1:3]
colnames(mat) <- LETTERS[1:4]

 > round(mat,1)
      A    B   C    D
a -1.9  4.8 1.5 -0.9
b  0.6  1.0 2.2  4.5
c -2.5 -2.5 1.7  1.2

# colors to use:  blue(+), red(-) with two shading levels,
# depending on abs(x) > 2
cols <- c(rgb(0.85,0.85,1),
           rgb(0.7 ,0.7 ,1),
           rgb(1,0.7 ,0.7 ))
cols <- matrix(cols, 2,2)

cellcol <- apply(mat, 1:2,
                  function(x) {i<-1+(x>0); j<-1+(abs(x)>2); cols[i,j]})
 > cellcol
   A         B         C         D
a "#D9D9FF" "#FFB2B2" "#B2B2FF" "#D9D9FF"
b "#B2B2FF" "#B2B2FF" "#FFB2B2" "#FFB2B2"
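
Given a matrix of values and a matching matrix of hex colours like cellcol, one way to emit shaded LaTeX cells is sketched below. It is only an illustration: the helper names shade_cells and to_latex_rows are hypothetical, and it assumes the LaTeX side loads xcolor with the table option so that \cellcolor[HTML]{...} is available.

```r
# Hypothetical helper: prefix each formatted value with a \cellcolor command.
shade_cells <- function(values, colours, digits = 1) {
  hex <- sub("^#", "", colours)                       # "D9D9FF" rather than "#D9D9FF"
  cells <- sprintf("\\cellcolor[HTML]{%s} %s", hex,
                   formatC(values, format = "f", digits = digits))
  dim(cells) <- dim(values)                           # sprintf() drops the matrix shape
  dimnames(cells) <- dimnames(values)
  cells
}

# Hypothetical helper: one LaTeX table row per matrix row, cells joined by "&".
to_latex_rows <- function(cells) {
  paste0(rownames(cells), " & ",
         apply(cells, 1, paste, collapse = " & "), " \\\\")
}

# Small made-up example (not the full matrix from the post)
m  <- matrix(c(-1.9, 0.6, 4.8, 1.0), 2, 2, dimnames = list(c("a", "b"), c("A", "B")))
cl <- matrix(c("#D9D9FF", "#B2B2FF", "#FFB2B2", "#B2B2FF"), 2, 2)
writeLines(to_latex_rows(shade_cells(m, cl)))
```

The printed lines can be pasted between \begin{tabular} and \end{tabular}, with the column header row written by hand.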

ce | 23 Jul 15:08 2014

Windows R doesn't recognize shortcuts ?

Hi All,

In a Windows 7 R installation:

R version 3.1.1 Patched (2014-07-14 r66149) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

it doesn't recognize shortcuts in a path:

>  list.files(path = "cygwin")

cygwin is a shortcut; in its Properties window, Target shows: C:\Users\me\cygwin64\home\me
The real path works:

list.files(path = "C:/Users/me/cygwin64/home/me")
  [1] ""            ""              "a.R"                    ""

Richard Chandler-Mant | 23 Jul 13:06 2014

R_HOME setting on Linux

I have an installation of R 3.0.2 on a CentOS 6.3 distribution in /opt/R/3.0.2

In the R start-up script (/opt/R/3.0.2/bin/R) the value of R_HOME is set to /opt/R/3.0.2/lib/R and this is
the value returned from the R.home() function within an R session.

According to the installation documentation, the following
statement is made about the R home directory:

"prefix/LIBnn/R or libdir/R
all the rest (libraries, on-line help system, ...). Here LIBnn is usually 'lib', but may be 'lib64' on some
64-bit Linux systems. This is known as the R home directory."

So from this I gather that my R linux installation is correct. The issue I am having is running some of the unit
tests that are included with the core R application, specifically the reg-tests-1b.R file in the tests
directory which contains the following:

## recursive listing of directories
p <- file.path(R.home(), "share","texmf") # always exists, readable
lfri <- list.files(p, recursive=TRUE, include.dirs=TRUE)
subdirs <- c("bibtex", "tex")
lfnd <- setdiff(list.files(p, all.files=TRUE, no..=TRUE), ".svn")
stopifnot(!is.na(match(subdirs, lfri)), identical(subdirs, lfnd))
## the first failed for a few days, unnoticed, in the development version of R

This test is failing for me because R.home() = /opt/R/3.0.2/lib/R and the directory
/opt/R/3.0.2/lib/R/share/texmf does not exist but does exist at /opt/R/3.0.2/share/texmf

Is this a bug with the test or an error in my installation?

Thank you in advance for your help.
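
A quick way to see the mismatch from inside R, offered as a diagnostic sketch only: R.home() accepts a component argument, and R.home("share") reports where the running build believes its share directory is, which need not be a subdirectory of R.home(). The variable names below are made up for illustration.

```r
home_share <- file.path(R.home(), "share")  # the path the test builds by hand
share_dir  <- R.home("share")               # the share directory this build reports

texmf_paths <- file.path(c(home_share, share_dir), "texmf")
data.frame(path = texmf_paths, exists = file.exists(texmf_paths))
```

If only the second path exists, the installation probably used a separate share directory, and the test's assumption that texmf sits under R.home() does not hold for that layout.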


npan1990 . | 23 Jul 11:51 2014

Feature Selection and Regression


I would like to perform feature selection on a set of features that are
used for regression. Specifically, those features correspond to the previous
day's values (e.g. Lag24, Lag25, Lag26, ...), where Lag24 is the value 24 hours
before. The target variable y is the value at the current time (using past
day features in order to predict the next day). I am currently using SVM
from the e1071 package. However, I found that when I remove some features
the SVM performance increases. Is there a way to do feature
selection using the SVM? (1). I have also tried to use the glmnet package
for regression, but with no luck. The purpose of using glmnet was
the LASSO penalty on the model. Can I do something similar using e1071?
(2). I am not using any penalty in e1071, so maybe this is an issue.
Also, could you please list 2-3 packages used for non-linear regression?

Currently I am aware of:
RSNNS (elman, jordan neural networks)
forecast (For using ARIMA)
glmnet (No luck)

I have tried many of these but without very good results, even though my data
have a periodicity (25% Mean Relative Absolute Error).
For feature selection, until now I have used the corrgrams function, which returns
the correlation of the features.

My questions are marked with (question id).
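
For readers who want to reproduce the setup described above, here is a minimal base-R sketch of building the lagged features and ranking them by absolute correlation with the target. The helper make_lags and the simulated hourly series are hypothetical. Correlation ranking is only a simple filter, not a substitute for SVM-based or LASSO selection.

```r
# Hypothetical helper: data frame with the current value y and the chosen lags.
make_lags <- function(x, lags) {
  e <- embed(x, max(lags) + 1L)          # column k + 1 holds the series lagged by k steps
  out <- data.frame(y = e[, 1L], e[, lags + 1L, drop = FALSE])
  names(out)[-1L] <- paste0("Lag", lags)
  out
}

set.seed(1)
x <- sin(2 * pi * (1:500) / 24) + rnorm(500, sd = 0.1)   # made-up series with a 24-step period
d <- make_lags(x, 24:26)

# Rank the lag features by absolute correlation with the target
sort(abs(cor(d$y, d[-1])[1, ]), decreasing = TRUE)
```

The resulting data frame can be passed to a model-fitting function with a formula such as y ~ ., so dropping a lag only means dropping a column.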

Byron Dom | 23 Jul 06:58 2014

dx accuracy measures from raw data

Here is a partial answer (I think (?))

A common way to display results of this type is as a "receiver operating characteristic" (ROC) curve.

It's displayed as a parametric curve where the parameter is the threshold value, the x-value (abscissa) is
the false-positive rate and the y value is the true-positive rate. Then, a commonly computed
single-number characterization is to compute the area under this curve (AUC) for false-positive rate
running from 0 to 1. There are variations on this but I've just described the standard one.

There are multiple R packages that will do all of this for you. One of them is the pROC package.
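
To make the construction concrete, here is a small base-R sketch of the curve and its area as described above. It is not how pROC computes them internally. The function names and the example scores are made up, and the truth vector is assumed to be coded 1 for diseased and 0 for healthy.

```r
# One (FPR, TPR) point per distinct threshold; a case is called positive when score >= threshold.
roc_points <- function(score, truth) {
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(score[truth == 1] >= t))
  fpr <- sapply(thr, function(t) mean(score[truth == 0] >= t))
  # prepend the (0, 0) corner, reached with a threshold above every score
  data.frame(threshold = c(Inf, thr), fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the curve by the trapezoidal rule, for FPR running from 0 to 1
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)   # hypothetical test values
truth <- c(1,   1,   0,   1,   0,   0)     # hypothetical reference standard
roc <- roc_points(score, truth)
plot(roc$fpr, roc$tpr, type = "s", xlab = "False-positive rate", ylab = "True-positive rate")
auc_trapezoid(roc)
```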

Date: Sun, 20 Jul 2014 18:28:12 +0100
From: Anoop Shah <anoopsshah <at>>
To: r-help <at>
Subject: [R] dx accuracy measures from raw data
Message-ID: <0E7574AF-9890-419E-AE9D-978860054AF2 <at>>
Content-Type: text/plain

Hello R users!

I am a medic and have been working with R for about 6 months now.

I was hoping to pick someone’s brain about a diagnostic accuracy study that has now been completed.

I am trying to derive the sensitivity, specificity, NPV and PPV with the corresponding 95% CI from the raw data.
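
As a rough illustration of the quantities asked for here, the sketch below computes sensitivity, specificity, PPV and NPV with exact (Clopper-Pearson) 95% intervals from binom.test(). The function name dx_accuracy is hypothetical. It assumes both the index test and the reference standard are coded 0/1, which the truncated data listing does not confirm.

```r
# Hypothetical helper: 2x2 accuracy measures with exact binomial 95% CIs.
dx_accuracy <- function(test, gold) {
  tp <- sum(test == 1 & gold == 1); fn <- sum(test == 0 & gold == 1)
  tn <- sum(test == 0 & gold == 0); fp <- sum(test == 1 & gold == 0)
  ci <- function(x, n) {
    bt <- binom.test(x, n)                 # exact interval, conf.level = 0.95 by default
    c(estimate = x / n, lower = bt$conf.int[1], upper = bt$conf.int[2])
  }
  rbind(sensitivity = ci(tp, tp + fn),
        specificity = ci(tn, tn + fp),
        ppv         = ci(tp, tp + fp),
        npv         = ci(tn, tn + fn))
}

# Made-up data: reference standard and one index test
gold <- c(1, 1, 1, 0, 0, 0, 0)
test <- c(1, 1, 0, 0, 0, 0, 1)
dx_accuracy(test, gold)
```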

My data is in a data frame as below

g.s    t1    t2    t3    t3    t4    t5    index

Sowmya Rudregowda | 23 Jul 08:26 2014

need help for ppval() and xirr() in R

Hi,

My name is Sowmya.

I am new to the R language. I need equivalents of the MATLAB functions
"ppval()" and "xirr()" in R. Whatever ppval() does in MATLAB, I need to do
in R. I also need MATLAB's xirr() in R, along with any packages it
needs.
Please help.
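
For orientation, here are base-R sketches of what the two functions compute. These are hypothetical reimplementations written for illustration, not ports of the MATLAB code. ppval assumes MATLAB's piecewise-polynomial layout (one row of coefficients per piece, highest power first, in the local variable x minus the left break). xirr assumes an actual/365 day count, which may differ from MATLAB's default basis.

```r
# Evaluate a piecewise polynomial at x (points outside the breaks use the end pieces).
ppval <- function(breaks, coefs, x) {
  i  <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)
  dx <- x - breaks[i]                       # local coordinate within each piece
  y  <- coefs[i, 1]
  for (j in seq_len(ncol(coefs))[-1]) y <- y * dx + coefs[i, j]   # Horner's rule
  y
}

# Internal rate of return for irregularly dated cash flows.
xirr <- function(cashflow, dates, interval = c(-0.99, 10)) {
  years <- as.numeric(as.Date(dates) - as.Date(dates[1])) / 365
  npv <- function(rate) sum(cashflow / (1 + rate)^years)
  uniroot(npv, interval = interval, tol = 1e-10)$root   # rate where the NPV is zero
}

# Made-up examples
ppval(c(0, 1, 2), rbind(c(1, 0), c(2, 1)), c(0.5, 1.5))
xirr(c(-1000, 1100), c("2014-01-01", "2015-01-01"))
```

For spline interpolation specifically, splinefun() in base R returns a function that can be evaluated directly, which may remove the need for a separate ppval step.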



Jennifer Gruhn | 23 Jul 00:08 2014

Randomly sample data frame points relative to raster grid cells

In R, I have a raster entitled "raster_crude" with the following details:

> raster_crude

class       : RasterLayer
dimensions  : 320, 392, 125440  (nrow, ncol, ncell)
resolution  : 0.125, 0.125  (x, y)
extent      : -152, -103, 30, 70  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84

In R, I also have a data frame entitled "rangewide_absences" of longitude
and latitude points, within the extent of raster_crude, and with the
following details:

> class(rangewide_absences)
[1] "data.frame"
> names(rangewide_absences)
[1] "LON" "LAT"
> nrow(rangewide_absences)
[1] 217494
> summary(rangewide_absences)
      LON              LAT
 Min.   :-134.3   Min.   : 0.00
 1st Qu.:-120.3   1st Qu.:39.56
 Median :-116.0   Median :43.02
 Mean   :-115.0   Mean   :42.72
 3rd Qu.:-110.5   3rd Qu.:46.27
 Max.   :   0.0   Max.   :59.95

Oftentimes, there is more than one point of rangewide_absences falling
within the grid cells of raster_crude. How do I make a new data frame of
points, maintaining the LON and LAT columns, such that only one point is
sampled PER raster grid cell?
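
One way to think about the problem is to assign every point a cell number and then draw one row per cell. The base-R sketch below does this using the extent and resolution printed above. The helpers cell_id and one_per_cell are hypothetical, and the sketch assumes all points lie strictly inside the extent. With the raster package, cellFromXY() should give the cell numbers directly.

```r
# Hypothetical helper: row-major cell number, counting from the top-left corner.
cell_id <- function(lon, lat, xmin = -152, ymax = 70, res = 0.125, ncol = 392) {
  col <- floor((lon - xmin) / res) + 1
  row <- floor((ymax - lat) / res) + 1
  (row - 1) * ncol + col
}

# Hypothetical helper: keep one randomly chosen point per occupied cell.
one_per_cell <- function(pts, ...) {
  cell <- cell_id(pts$LON, pts$LAT, ...)
  pick <- tapply(seq_len(nrow(pts)), cell,
                 function(i) i[sample.int(length(i), 1)])   # one row index per cell
  pts[as.vector(pick), , drop = FALSE]
}

# Made-up points: the first three share a cell, the fourth is alone
pts <- data.frame(LON = c(-120.01, -120.02, -120.03, -110.5),
                  LAT = c(40.01, 40.02, 40.03, 45.5))
one_per_cell(pts)
```

Points outside the extent (the summary shows a LON of 0 and a LAT of 0) would need to be dropped before this step.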

I have been using the package "raster." With the function "extract," one
can extract grid cells to points. However, how do you extract a

Marino David | 22 Jul 19:53 2014

Partition of sums of squares (ANOVA)

Hi all R mailing listers:

Can anyone explain the theory (or the formula) behind computing the Sum Sq
(highlighted below) for the regression terms?  The Wikipedia article gives an
introduction on how to calculate the total, model, and regression sum of
squares. Is it similar to the Sum Sq computation? Is the regression sum of
squares equal to (0.000437+ 0.002545+ 0.060984+ 0.062330+ 0.060480)?

Any suggestion will be greatly appreciated.

Thank you!
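
A small numerical check may help here. For an lm fit with an intercept, anova() reports sequential (Type I) sums of squares: each term's Sum Sq is the drop in residual sum of squares when that term is added to the terms above it, and the term rows add up to the regression sum of squares. The data below are simulated purely for illustration and are not the poster's.

```r
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + 0.2 * d$x1^2 + rnorm(20)   # made-up response

fit <- lm(y ~ x1 + x2 + I(x1^2), data = d)
tab <- anova(fit)

ss_terms <- tab[["Sum Sq"]][rownames(tab) != "Residuals"]   # one value per model term
ss_reg   <- sum((fitted(fit) - mean(d$y))^2)                # regression sum of squares

all.equal(sum(ss_terms), ss_reg)

# Sequential meaning of a single row: RSS without x2 minus RSS with x2
ss_x2 <- deviance(lm(y ~ x1, data = d)) - deviance(lm(y ~ x1 + x2, data = d))
all.equal(tab["x2", "Sum Sq"], ss_x2)
```

Because the sums are sequential, reordering the terms in the formula generally changes the individual rows but not their total.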



Analysis of Variance Table

Response: y
                Df    *Sum Sq*   Mean Sq    F value    Pr(>F)
x1              1    0.000437  0.000437    0.1055    0.8001
x2              1    0.002545  0.002545    0.6141    0.5768
I(x1^2)         1    0.060984  0.060984   14.7162    0.1623

Julian Schulze | 22 Jul 10:00 2014

Multiple Imputation of longitudinal data in MICE and statistical analyses of object type mids

Dear all,

I have a problem with performing statistical analyses of longitudinal data after the imputation of
missing values using mice. After imputing the missing values in the wide data format, I convert the
extracted data to the long format. Because the data are longitudinal, participants have duplicate rows (3
timepoints), and this causes problems when converting the long-formatted data set into a mids
object. Does anyone know how to create a mids object or something else appropriate after the imputation? I
want to use lmer or lme for pooled fixed effects afterwards. I tried a lot of different things, but still can't
figure it out.
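
The wide-to-long step on its own can be sketched in base R as below. This does not address the mids conversion itself. The data frame is a made-up stand-in with the same column layout as the example (ID, GROUP, X1 to X3), and the names TIME and X are arbitrary choices.

```r
# Made-up wide data: one row per participant, three timepoints
wide <- data.frame(ID = 1:4, GROUP = c(0, 1, 0, 1),
                   X1 = c(1, 2, 3, 4), X2 = c(2, 3, 4, 5), X3 = c(3, 4, 5, 1))

# One row per participant and timepoint
long <- reshape(wide, direction = "long",
                varying = c("X1", "X2", "X3"), v.names = "X",
                timevar = "TIME", times = 1:3, idvar = "ID")
long <- long[order(long$ID, long$TIME), ]
head(long)
```

When this is applied to stacked imputed data, each participant's rows repeat once per imputation, so any participant-level identifier is no longer unique on its own.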

Thanks in advance and see the code below for a minimal reproducible example:

# minimal reproducible example

## Make up some data

# ID Variable, Group, 3 Timepoints outcome measure (X1-X3)
Data <- data.frame(
    ID = sort(sample(1:100)),
    GROUP = sample(c(0, 1), 100, replace = TRUE),
    matrix(sample(c(1:5,NA), 300, replace=TRUE), ncol=3)
)

# install.packages("mice")
library(mice)

# Impute the data in wide format
m.out <- mice(Data, maxit = 5, m = 2, seed = 9, pred=quickpred(Data, mincor = 0.0, exclude =