For those who are excited about the new 2.13.0 release of R (particularly the compiler package) but are grumbling about having to remember what packages they have and re-install them, see this great advice or this piece on upgrading in osx for how to automate the process.

# Category Archives: R

# Extra! Extra! Get Your gridExtra!

The more I use it, the deeper I fall in love with ggplot2. I know, some of you have heard me kvel about it ad nauseum (oh, yiddish and latin in one sentence – extra points!). But the graphs really look great, and once you wrap your head around a few concepts, it’s surprisingly easy to make it do most anything you want.

Except for one thing.

One thing I loved about the old R plotting functions was the ability to setup panels easily, and fill them with totally different graphs. Ye olde par(mfrow=c(2,2)) for a 2 x 2 grid, for example.

Enter gridExtra. Let the games begin.

What exactly do I mean? Let’s say I’m working with the soil chemistry data in the vegan package. First, maybe I just want to eyeball the historgrams of both the hummus depth and bare soil columns.

To do this in ggplot2, and with a single commend to put them in a single window, first you need to melt the data with reshape2 so that the column names are actually grouping variables, and then you can plot it. In the process, you create an additional data frame. And, you also have to do some extra specifying of scales, facets, etc. etc. Here’s the code and graphs.

```
library(ggplot2) #for plotting
library(reshape2) #for data reshaping
library(vegan) #for the data
data(varechem)
#First, reshape the data so that Hummus depth and Bare soil are your grouping variables
vMelt<-melt(varechem, measure.vars=c("Humdepth", "Baresoil"))
#Now plot it. Use fill to color things differently, facet_wrap to split this into two panels,
#And don't forget that the x scales are different - otherwise things look odd
qplot(value, data=vMelt, fill=variable)+facet_wrap( facets=~variable, scale="free_x")
```

This produces a nice graph. But, man, I had to think about reshaping things, and all of those scales? What if I just wanted to make two historgrams, and slam 'em together. This is where gridExtra is really nice. Through its function grid.arrange, you can make a multi-paneled graph using ggplot2 plots, lattice plots, and more (although, not regular R plots...I think).

So, let's see the same example, but with gridExtra.

```
library(gridExtra)
#make two separate ggplot2 objects
humDist<-qplot(Humdepth, data=varechem, fill=I("red"))
bareDist<-qplot(Baresoil, data=varechem, fill=I("blue"))
#Now use grid.arrange to put them all into one figure.
#Note the use of ncol to specify two columns. Things are nicely flexible here.
grid.arrange(humDist, bareDist, ncol=2)
```

"Oh, what a trivial problem," you may now be saying. But, if you want to, say, plot up 5 different correlations, or, say, the same scatterplot with 4 different model fits, this is a life-saver - if nothing else, in terms of readability of your code for later use.

This is all well and good, but, simple. Let's get into more fun multi-panel figures. Let's say we wanted a bivariate scatter-plot of Hummus Depth and Bare Soil with a linear fit. But, we also wanted to plot the histograms of each variable in adjacent panels. Oh, and flip the histogram of whatever is on the y-axis. Sexy, no? This is pretty straightforward. We can use the ggplot2 objects we already have, flip the co-ordinates on one, create a bivariate plot with a fit, and fill in one final panel with something blank.

```
#First, the correlation. I'm using size just to make bigger points. And then I'll add a smoothed fit.
corPlot<-qplot(Humdepth, Baresoil, data=varechem, size=I(3))+stat_smooth(method="lm")
#OK, we'll need a blank panel. gridExtra can make all sorts of shapes, so, let's make a white box
blankPanel<-grid.rect(gp=gpar(col="white"))
#Now put it all together, but don't forget to flip the Baresoil histogram
grid.arrange(humDist, blankPanel, corPlot, bareDist, ncol=2)
```

Nice. Note the use of the grid.rect. gridExtra is loaded with all sorts of interesting ways to place shapes and other objects into your plots - including my favorite - grid.table, for when you don't want to deal with text.

```
a<-anova(lm(Baresoil ~ Humdepth, data=varechem))
grid.table(round(a, digits=3))
```

Or, heck, if you want to make that part of the above plot, use tableGrob instead of grid.table, and then slot it in where the blank panel is. The possibilities are pretty endless!

**UPDATE**: Be sure to see Karthik's comment below about alternatively using viewports. Quite flexible, and very nice, if a hair more complex.

# More, Please!

Thanks to Jim, I’ve been using R in the shell more and more – in concert with vi. It’s been fun, and nice to integrate my workflows all on the server (although I haven’t had to do much graphing yet – I’m sure I’ll start kvetching then and return to a nice gui).

One thing that has frustrated me is that large dumps of output – say, a list composed of elements that are 100 lines each – just whip past me without an ability to scroll through more slowly. The page function helps somewhat, but, it gets wonky when looking at S4 objects. I wanted something more efficient that used – something more…well, like more! So i peered into *page*, and whipped up a *more* function that some of you may find useful. Of course, I’m sure that there is a simpler way, but, when all else fails…write it yourself!

```
more<-function(x, pager=getOption("pager")){
#put everything into a local file using sink
file <- tempfile("Rpage.")
sink(file)
show(x)
sink()
#use file.show as you can use the default R pager
file.show(file, title = "", delete.file = TRUE, pager = pager)
}
```

# RStudio – An IDE for the Masses!

I’ll admit it. There’s one thing that always makes me sad working on a mac. R. How does R make me sad on a mac? I look over at my compatriots in Windows using fun Integrated Development Environments (IDEs) like Tinn-R, and I sigh. On the other hand, I just had the sad little text editor and shell. Sure, it was enough, and I had wrung some sweet sweet code from that simple setup, but windows would get lost, I’d lose track of what file was where, what plot window was open, and would sometimes even forget which instance of R I was working in when I was working on two projects at the same time (one for simulation, one for analysis).

I mean, sure, I could go through the rigamarole of doing everything through some flavor of Emacs like a 31337 h4X0r. But my Emacs days are behind me. I would have preferred a simpler solution.

So when I saw news of a new, cross-platform, free, lovely IDE called RStudio hit R-bloggers and the Twitterverse, I rejoiced.

But would it be just a kludgy piece of bunk, or a nice, smooth, user experience? I figured I’d give it a whirl. I hopped on over to the website, and had a pleasant easy download experience. Then I fired it up, and ran an old analysis. After fiddling around a little just to get my feet on the ground (which took only a minute or two – the whole thing was quite intuitive), I was pretty pleased. The interface was clean, simple, and purty.

And some features – such as image exporting – were like a dream. So easy, in fact, that I decided to confront it with the problem of exporting an image for publication at the proper image quality (something which is always a bit of a hassle in R normally). But to my delight, no problem. Just fiddled with the size a little bit, and presto! High quality pub-ready image.

So, overall, I’m impressed. And with a Twitter feed, blog, and interactive support forum, I think it looks like this IDE is going to be a great tool for science. So go check it out!

# Merge Me Baby One More Time!

OK – has this ever happened to you? You are working with a team of collaborators all using a common dataset – maybe from an Agency, and LTER, or someone else’s data altogether. Each of you has some task – incorporating new data, running fancy models and putting the results back into the data for later analysis, etc.

Months later you come back, ready to put it all together to execute your master plan of analysis, when suddenly you realize that you’ve all been working on slightly different versions of the original dataset.

Somewhere along the way, things got forked. How do you bring it all back together?!

Now, if you’d been a good little monkey, all of your work would be scripted in R (or otherwise), you can just plug in whatever version of the data you all agree on, and still get the proper new dataset out the other end.

But maybe you did something crazy that took a week to calculate each single datapoint. Or maybe some of your collaborators just *GASP* did it all in Excel.

How do you bring things back together? How do you get a modicum of control.

Enter merge.

I recently had to use merge to bring under just these conditions – one collaborator had added a whole new column of data to a dataset of several thousand entries, while another had added a hundred new lines of data. Worse, both datasets were sorted differently, so, we couldn’t just slam them together.

So, let’s go through an example of how to bring such a thing together. Note, if you want all of the below in one convenient file, grab data_merging.R and later on you’ll want to also have this function for sorting data frames.

Let’s begin my making some old dataset with an unique identifier row called ‘identifier’. You do have one of these in your data, don’t you? If not, for shame! For shame!

```
olddata<-data.frame(expand.grid(1:10, 1:10))
names(olddata)<-c("col1", "col2")
#create an identifier
olddata$identifier<-1:nrow(olddata)
```

OK, now, let's say you've put in some hard work and have added a whole new column or new measurements for the existing data points. Let's create a modified dataset

```
moddata<-cbind(olddata, newstuff=rnorm(nrow(olddata), 10, 15))
```

But wait! Your collaborator has been working on the old dataset, and collected a few new data points - say, 10 new rows. But...they don't have the new column of measurements. To simulate this, we'll use the old data, and add on 10 new rows.

```
newdata<-rbind(olddata, data.frame(col1=round(runif(10)), col2=round(runif(10)), identifier=(nrow(olddata)+1):(nrow(olddata)+10)))
```

Uh oh! How do we combine newdata and moddata! Now, in this silly example, we could just use a cbind and add tack a few NAs on. In the real world, people do silly things when they work with datasets all the time - typically re-ordering them to suit their needs. So, let's say both datasets are all shaken up and in different row orders.

In that case, we'd want to use merge. And, indeed, unless you are 110% sure the data frames line up, you should probably use merge anyway, just to make sure. It's the safe way to bring two datasets together. Remember, practice safe data management, people!

Merge takes several arguments - two data frames, the names of common columns, and whether you're going to use all rows of one data frame or another. In this case, we want all of the rows of the new data, and we want to say that the column names of the new data frame are also common to both data sets. Basically, to master merge the arguments to really pay attention to are "by" (and/or by.x and by.y) and "all" (and/or all.x and all.y) In this instance, we have the following:

```
mergedata<-merge(newdata, moddata, by=names(newdata), all.x=T)
```

Take a look at the merged data - you'll note that it's in a wacky order, but if you browse through, you'll see that for some of the entries in the column newstuff, there are NA values - and that these are for rows that were present in newdata, but not in the older modified data set.

Things are kinda out of order. If you want to resort the dataset, I highly

recommend grabbing an awesome function called sort.data.frame which you can get here. It's one of the things I always load up in my .Rprofile.

```
source("./sort.data.frame.R")
sortedmergedata<-sort.data.frame(mergedata, ~+identifier)
head(sortedmergedata)
col1 col2 identifier newstuff
9 1 1 1 12.506103
21 2 1 2 18.998247
31 3 1 3 -7.231088
41 4 1 4 -11.318055
51 5 1 5 22.595297
61 6 1 6 -10.120211
```

Ah, there we go, nice and in order. Now that all of your data is together, safe and sound. Time to move onto the real fun of analysis!

# Do Not Log-Transform Count Data, Bitches!

OK, so, the title of this article is actually *Do not log-transform count data*, but, as @ascidacea mentioned, you just can’t resist adding the “bitches” to the end.

Onwards.

If you’re like me, when you learned experimental stats, you were taught to worship at the throne of the Normal Distribution. Always check your data and make sure it is normally distributed! Or, make sure that whatever lines you fit to it have normally distributed error around them! Normal! Normal normal normal!

And if you violate normality – say, you have count data with no negative values, and a normal linear regression would create situations where negative values are possible (e.g., what does it mean if you predict negative kelp! ah, the old dreaded nega-kelp), then no worries. Just log transform your data. Or square root. Or log(x+1). Or SOMETHING to linearize it before fitting a line and ensure the sacrament of normality is preserved.

This has led to decades of thoughtless transformation of count data without any real thought as to the consequences by in-the-field ecologists.

But statistics has had a better answer for decades – generalized linear models (glm for R nerds, gzlm for SAS goombas who use proc genmod. What? I’m biased!) whereby one specifies a nonlinear function with a corresponding non-normal error distribution. The canonical book on this was first published ’round 1983. Sure, one has to think more about the particular model and error distribution they specify, but, if you’re not thinking about these things in the first place, why are you doing science?

“But, hey!” you might say, “Glms and transformed count data should produce the same results, no?”

From first principles, Jensen’s inequality says no – consider the consequences for error of the transformation approach of log(y) = ax+b+error versus the glm approach y=e^(ax+b)+error. More importantly, the error distributions from generalized linear models may often be far far faaar more appropriate to the data you have at hand. For example, count data is discrete, and hence, a normal distribution will never be quite right. Better to use a poisson or a negative binomial.

But, “Sheesh!”, one might say, “Come on! How different can these models be? I mean, I’m going to get roughly the same answer, right?”

O’Hara and Kotze’s paper takes this question and runs with it. They simulate count data from negative binomial distributions and look at the results from generalized linear models with negative binomial or quasi-poisson error terms (see here for the difference) versus a slew of transformations.

Intriguingly, they find that glms (with either distribution) always perform well, while each transformation performs poorly at some or all values.

More intriguingly to me are the results regarding bias. Bias is the deviation between a fit parameter and its true value. Bascially, it’s a measure of how wrong your answer is. Again, here they find almost no bias in the glms, but bias all over the charts for transformed fits.

They sum it up nicely

For count data, our results suggest that transformations perform poorly. An additional problem with regression of transformed variables is that it can lead to impossible predictions, such as negative numbers of individuals. Instead statistical procedures designed to deal with counts should be used, i.e. methods for fitting Poisson or negative binomial models to data. The development of statistical and computational methods over the last 40 years has made it easier to fit these sorts of models, and the procedures for doing this are available in any serious statistics package.

Or, more succinctly, “Do not log-transform count data, bitches!”

“But how?!” I’m sure some of you are saying. Well, after checking into some of the relevant literature, it’s quite straightforward.

Given the ease of implementing glms in languages like R (one uses the glm function, checks diagnostics of residuals to ensure compliance with model assumptions, then can use Likliehood ratio testing akin to anova with, well, the Anova function) this is something easily within the grasp of the everyday ecologist. Heck, you can even do posthocs with multcomp, although if you want to correct your p-values (and there are reasons to believe you shouldn’t), you need to carefully consider the correction type.

For example, consider this data from survivorship on the Titanic (what, it’s in the multcomp documentation!) – although, granted, it’s looking at proportion survivorship, but, still, you’ll see how the code works:

```
library(multcomp)
### set up all pair-wise comparisons for count data
data(Titanic)
mod <- glm(Survived ~ Class, data = as.data.frame(Titanic), weights = Freq, family = binomial)
### specify all pair-wise comparisons among levels of variable "Class"
### Note, Tukey means the type of contrast matrix. See ?contrMat
glht.mod <- glht(mod, mcp(Class = "Tukey"))
###summaryize information
###applying the false discovery rate adjustment
###you know, if that's how you roll
summary(glht.mod, test=adjusted("fdr"))
```

There are then a variety of ways to plot or otherwise view glht output.

So, that's the nerdy details. In sum, though, the next time you see someone doing analyses with count data using simple linear regression or ANOVA with a log, sqrt, arcsine sqrt, or any other transformation, jump on them like a live grenade. Then, once the confusion has worn off, give them a copy of this paper. They'll thank you, once they're finished swearing.

O’Hara, R., & Kotze, D. (2010). Do not log-transform count data Methods in Ecology and Evolution, 1 (2), 118-122 DOI: 10.1111/j.2041-210X.2010.00021.x

# Web-Based Multilevel Modeling

This is tremendously cool. A nice intuitive web-based interface for the lme4 package in R (and you neither need to know R or understand the intricacies of the lme4 package) that gives you pdf output and plots. If you just want to play around and not worry about coding things up, it’s a great little option. Be sure to check out the demo video.

# Let’s All Go Down to the Barplot!

I’m back for another pean to ANOVA-loving experimental ecology. Barplots (or point-range plots – but which is better is a discussion for a different time) are the thing! And a friend of mine was recently asking me how to do a decent barplot with ggplot2 for data of moderate complexity (e.g., a 2-way ANOVA).

So I tried it.

And then I hit my head against a wall for a little while.

And then I think I figured it out, so, I figured I’d post it so that there is less head-smashing in the world.

To do a boxplot is simple. And may statisticians would argue that they are more informative, so, really, we should abandon barplots. Take the following example which looks at the highway milage of cars of various classes in two years.

```
library(ggplot2)
data(mpg)
#a simple qplot with a boxplot geometry - n.b., I like the bw theme
qplot(class, hwy, fill=factor(year), data=mpg, geom="boxplot", position="dodge")+theme_bw()
```

A few notes. The x-axis her is class. The fill attribute splits things by year (which is continuous, so we need to make it look like a factor). And the final position=”dodge” really is the key. It splits the bars for different years apart, and it shall come into play again in a moment.

This code produces a perfectly nice boxplot:

Lovely! Unless you want a barplot. For this, two things must happen. First, you need to get the average and standard error values that you desire. I do this using ddply (natch). Second, you’ll need to add in some error bars using geom_errorbar. In your geom_errorbar, you’ll need to invoke the position statement again.

```
#create a data frame with averages and standard deviations
hwy.avg<-ddply(mpg, c("class", "year"), function(df)
return(c(hwy.avg=mean(df$hwy), hwy.sd=sd(df$hwy))))
#create the barplot component
avg.plot<-qplot(class, hwy.avg, fill=factor(year), data=hwy.avg, geom="bar", position="dodge")
#add error bars
avg.plot+geom_errorbar(aes(ymax=hwy.avg+hwy.sd, ymin=hwy.avg-hwy.sd), position="dodge")+theme_bw()
```

This produces a perfectly serviceable and easily intelligible boxplot. Only. Only... well, it's time for a confession.

I hate the whiskers on error bars. Those weird horizontal things, they make my stomach uneasy. And the default for geom_errorbar makes them huge and looming. What purpose do they really serve? OK, don't answer that. Still, they offend the little Edward Tufte in me (that must be very little, as I'm using barplots).

So, I set about playing with width and other things to make whisker-less error bars. Fortunately for me, there is geom_linerange, that does the same thing, but with a hitch. It's "dodge" needs to be told just how far to dodge. I admit, here, I played about with values until I found ones that worked, so your milage may vary depending on how many treatment levels you have. Either way, the resulting graph was rather nice.

```
#first, define the width of the dodge
dodge <- position_dodge(width=0.9)
#now add the error bars to the plot
avg.plot+geom_linerange(aes(ymax=hwy.avg+hwy.sd, ymin=hwy.avg-hwy.sd), position=dodge)+theme_bw()
```

And voila! So, enjoy yet another nifty capability of ggplot2!

Great! I will say this, though. I have also played around with data with unequal representation of treatments (e.g., replace year with class or something in the previous example) - and, aside from making empty rows for missing treatment combinations, the plots are a little funny. Bars expand to fill up space left by vacant treatments. Try `qplot(class, hwy, data=mpg, fill= manufacturer, geom="boxplot")`

to see what I mean. If anyone knows how to change this, so all bar widths are the same, I'd love to hear it.

# Simple Data Visualization

OK, so, I know I already raved about one Hadley Wickham project and how it has changed my life last week. But what can I say, the man is a genius. And if you are using R (and let’s face it, you should be) and you want simple sexy graphs made quick, the man has you covered once again.

I first learned about ggplot2 while scanning through some slides of the LA Area RUG meetings (that I missed – I still haven’t been to any!) by the folks from Michael Driscoll.

And I liked what I saw – ggplot2 and lattice (which I admit, I had kind of avoided) seemed more intuitive than I thought. Then I ran into a series of articles on ggplot2 from the Learning R blog and I was hooked. Still am. And why I ask?

Let’s consider a task – you have some data split into four groups. For each group, you want to plot a fitted regression between two covariates. You want this split into panels, with common scales, and nicely placed axis labels. Also, you want it to be purty. While you can do this with the standard graphics package (and, I admit, I sometimes like the austerity of the standard graphics), it would take a for loop, tons of formatting instructions, and a number of steps where you could easily mess the whole thing up. Not to mention that you might have to change a good bit if you want to add another group.

Here is how easy it is with ggplot2. Note, there are only two lines of actual graphing code. The rest is just making up the data.

```
library(ggplot2)
#create data - x is a predictor, g is a group, y is the response
x<-1:100
g<-as.factor(c("A", "B", "C", "D"))
#i love expand grid, as it easily creates full
#matrices of all combinations of multiple variables
z<-data.frame(expand.grid(x=x,g=g))
#add a little error in to our results
z$y<- rnorm(length(z$x),z$x*c(3,15,29, 50)[unclass(z$g)]+5, sd=200)
#this first bit just sets up the basic plotting, with colors and panels
a<-qplot(x,y,group=g, colour=g, data=z, facets=~g)
#now, add a fitted line with the error around it shown.
# Also, change the theme. And plot the whole thing!
a+stat_smooth(method="lm", se=T, group=g, fill="black")+theme_bw()
```

All of which yields the following pretty figure:

And that stat_smooth statement can take lots of other arguments - e.g., glms (I've tried, and it looks great!)

So check it out - even for just casual data exploration, there's some real clarity to be found. And I look forward to trying out other products by Prof. Wickham!

# Sometimes, you just need to use a plyr

I haven’t posted anything about R-nerdery in quite some time. But I have to pause for a moment, and sing the praises of a relatively new package that has made my life exponentially easier. The plyr package.

R has the capability to apply a single function to a vector or list using apply or mapply, or their various derivatives. This returns another vector or list.

This is great in principal, but in practice, with indexing, odd return objects, and difficulties in using multiple arguments, it can get out of hand for complex functions. Hence, one often resorts to a for loop.

Let me give you an example. Let’s say I have some data from a simple experiment where I wanted to look at the effect of adding urchins, lobsters, or both on a diverse community of sessile invertebrates – mussels, squirts, etc. Heck, let’s say, I had **one gazillion species** whose responses I was interested in. Now let’s say I want a simple summary table of the change in the density each species – and my data has before and after values. So, my data would look like this.

Urchins | Lobsters | Before.After | Replicate | Species | Density | |

Yes | No | Before | 1 | 1 | 30 | |

Yes | No | After | 1 | 1 | 10 | |

Yes | No | Before | 1 | 2 | 45 | |

Yes | No | After | 1 | 2 | 23 | |

. . . . . |
||||||

Yes | Yes | Before | 15 | Gazillion | 10 | |

Yes | Yes | After | 15 | Gazillion | 9 |

So, each row represents the measurement for one species in one replicate either before or after the experiment. Now, previously, to get an output table with the mean change for each species for each treatment, I would have had to do something like this:

```
#first, create blank list of vectors waiting to be filled
#as well as blank vectors to remember what treatment
#combinations we're looking at
mean.list<-list()
urchin.vec<-vector()
lobster.vec<-vector()
for(i in 1:gazillion) mean.list[[paste("sp",i,sep=".")]]==vector()
#then, a for loop for each combination
for (i in levels(my.data$Urchins)){
for (j in levels(my.data$Lobsters)){
urchin.vec<-c(urchin.vec,i)
lobster.vec<-c(lobster.vec,j)
#get the right subset of the data
sub.data<-subset(my.data, my.data$Urchins==i & my.data$Lobster==j)
#now loop over all species
for (k in 1:gazillion){
sub.data<-subset(sub.data, sub.data$Species==k)
before.data<-subset(sub.data, sub.data$Before.After=="Before")
after.data<-subset(sub.data, sub.data$Before.After=="After")
mean.list[[paste("sp",i-3,sep=".")]]<-
c(mean.list[[paste("sp",i-3,sep=".")]], mean(after.data[k]-before.data[k])
}
}
}
#then turn it into a data frame to match back up to the right values
mean.data<-as.data.frame(mean.list)
mean.data$urchins<-as.factor(urchin.vec)
mean.data$lobsters<-as.factor(lobster.vec)
```

Ok, did that exhaust you as much it did me? Now, here's how to get the same result using ddply (more on why it's called ddply in a moment)

```
#a function where, if you give it a data frame with species 1 - gazillion
#will give you the mean for each species. Note, the which statement
#lets me look at before and after data separately
#Also note the Change= in the return value. That gives a new column name
#for multiple return columns, just return a vector, with each
#new column getting it's own name
tab.means<-function(df){ return(Change =
df$Density[which($Before.After=="After")]
- df$Density[which($Before.After=="Before")]) }
#now use ddply
mean.data<-ddply(df, c("Urchins", "Lobsters", "Species"), tab.means))
#ok, in fairness, each species doesn't have it's own column. But, we can
#just use reshape
mean.data<-reshape(mean.data, timevar="Species", idvar=c("Urchins", "Lobsters", direction="wide)
```

Um. Whoah. Notice a small difference there. Now, perhaps the earlier code could have been cleaned up a bit with reshape, but still, 6 lines of code versus 18 - and most of that 18 was bookkeeping code, not something doing real useful computation.

Soooo.... Check out plyr! It will make your life a happier, more relaxed place. Enjoy!

(p.s. I'm noticing wordpress is ignoring all of my code indenting - anyone have any quick fixes for that?)