#SciFund Preview….

Whew, it’s been a bit since I posted here. Rest assured, little sea squirts, there are some interesting new things in the works. Some things that are science-y (in which I try and use this blog as a sounding board/ lab notebook) and some things not so much.

In the not so much category, a ton of my time has lately been going to organizing the #SciFund challenge – a large initiative to try crowdfunding for science! If you haven’t been following it, check out our initial manifesto here.

I’m pretty stoked about the whole thing – it’s a real way of connecting science to the public via a funding mechanism. And with us launching on November 1st, it’s been an absolute pleasure to watch the creative and innovative videos that participating scientists have been putting together to solicit funds.

Videos, you say? Am I doing one?

Why yes! So to give you a hint of what’s to come, here’s a brief preview of my #SciFund video. I think you’ll all agree, it’s vintage me, attempting to sell one of the more arcane (to the public) pieces of my research in a way that might just connect. We shall see.

More to come in a week…

Ecological SEMs and Composite Variables: What, Why, and How

I’m a HUGE fan of Structural Equation Modeling. For those of you unfamiliar with the technique, it’s awesome for three main reasons.

  1. It’s a method of teasing apart direct and indirect interactions in your data.
  2. It allows you to assess the importance of underlying latent variables that you cannot measure, but for which you have measured indicators.
  3. As it’s formally presented, with path diagrams showing connections between variables, it’s SUPER easy to link conceptual models with your data. See Grace et al. 2010 for a handy guide to this.

Also, there is a quite simple and intuitive R package for fitting SEMs, lavaan (LAtent VAriable Analysis). Disclaimer: I just hopped on board as a lavaan developer (yay!). I’ve also recently started a small project to find cool examples of SEM in the ecological literature and, using the information provided in each paper, post the models coded up in lavaan so that others can see how to put some of these models together.

As Ecologists, we often use latent variables to incorporate known measurement error of a quantity – i.e., a latent variable with a single indicator and fixed variance. We’re often not interested in the full power of latent variables – latents with multiple indicators. Typically, this is because we’ve actually measured everything we want to measure. We’re not like political scientists who have to quantify fuzzy things like Democracy, or Authoritarianism, or Gastronomicism. (note, I want to live in a political system driven by gastronomy – a gastronomocracy!)
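In lavaan syntax, that single-indicator, fixed-error-variance trick looks something like the sketch below (the variable names and the 0.05 error variance are made up for illustration):

#a latent 'true' nitrogen measured by one observed variable with known error variance
errorModel<-'
	trueN =~ 1*measuredN         #single indicator, loading fixed to 1
	measuredN ~~ 0.05*measuredN  #fix the error variance to its known value
	'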

However, we’re still fascinated by the idea of bundling different variables together into a single causal effect, and maybe evaluating the relative contribution of each of those variables within a model. In SEM, this is known as the creation of a Composite Variable. This composite is still an unmeasured quantity – like a latent variable – but with no error variance, and with “indicators” actually driving the variable, rather than having the unmeasured variable causing the expression of its indicators.

Let me give you an example. Let’s say we want to measure the effect of nutrients on diatom species richness in a stream. You’re particularly concerned about nitrogen. However, you can’t bring water samples back to the lab, so, you’re relying on some moderately accurate nitrogen color strips, the biomass of algae (more algae = more nitrogen!), and your lab tech, Stu, who claims he can taste nitrogen in water (and has been proved to be stunningly accurate in the past). In this case, you have a latent variable. The true nitrogen content of the water is causing the readings by these three different indicators.
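In lavaan, such a latent variable is specified with the =~ operator. A sketch, with indicator names invented for this example:

#the unmeasured 'true' nitrogen content causes all three noisy readings
latentN<-'
	Nitrogen =~ colorStrip + algalBiomass + stuTasteScore
	'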

A composite variable is a different beast. Let’s say we have the same scenario. But, now you have really good measurements of nitrogen. In fact, you have good measurements of both ammonium (NH4) and nitrate (NO3). You want to estimate a “nitrogen effect”, but, you know that both of these different forms of N will contribute to the effect in a slightly different way. You could just construct a model with effects going from both NO3 and NH4 to species richness. If you want to represent the total “Nitrogen Effect” in your model, however, and evaluate the effect of each form of nitrogen on its total effect, you would create a composite. The differences become clear when looking at the path diagram of each case.

Here, I’m adopting the custom of observed variables in squares, latent variables in ovals, and composite variables in hexagons. Note that, as indicators of nitrogen in the latent variable model, each observed indicator has some additional variation due to factors other than nitrogen – δi. There is no such error in the composite variable model. Also, I’m showing that the error in the Nitrogen Effect in the composite variable model is indeed set to 0. There are sometimes reasons where that shouldn’t be 0, but that’s another topic for another time.

This may still seem abstract to you. So, let’s look at an example in practice. One way we often use composites is to bring together a linear and nonlinear effect of a single variable. For example, we know that nutrient supply rates often have a hump-shaped effect on species richness – i.e., the highest richness happens at intermediate supply rates. One nice example of that is in a paper by Cardinale et al. in 2009 looking at relationships between manipulated nutrient supply, species richness, and algal productivity. To capture that relationship with a composite variable, one would have a ‘nitrogen effect’ affected by N and N2. This nitrogen effect would then affect local species richness.

So, how would you code this model up in lavaan, and then evaluate it?

Well, hey, the data from this paper are freely available, so, let’s use this as an example. For a full replication of the model presented in the paper see here. However, Cardinale et al. didn’t use any composite variables, so, let’s create a model of our own capturing the Nitrogen-Richness relationship while also accounting for local species richness being influenced by regional species richness.

In the figure above, we have the relationship between resource supply rate and local species richness on an agar plate to the left. Separate lines are for separate streams. The black line is the average fit with the supplied equation. On the right, we have a path diagram representing this relationship, as well as the influence of regional species richness.

So, we have a path diagram. Now comes the tricky part. Coding. One thing about the current version of lavaan (0.4-8) is that it does not have a way to represent composite variables. This will change in the future (believe me), but it may take a while, so let me walk you through the tricks of incorporating composite variables now. Basically, there are five steps.

  1. Define the variable as a regression, where the composite is determined by its causal variables. Also, fix one of the effects to 1. This gives your composite variable a scale.
  2. Specify that the composite has an error variance of 0.
  3. Now treat the composite as a latent variable. Its indicators are its response variables. This may seem odd. However, it’s all just ways of specifying causal pathways – an indicator pathway and a regression pathway have the same meaning in terms of causality. The software just needs something specified so that it doesn’t go looking for our composite variable in our data. Hence, defining it as a latent variable whose indicators are endogenous responses. I actually find this helpful, as it also makes me think more carefully about what a composite variable is, and how too many responses may make my model not identified.
  4. Lastly, because we don’t want to fix the effect of our composite on its response to 1, we’ll have to include an argument in the fitting function so that it doesn’t force the first latent variable loading to be set to 1. We’ll also have to specify that we want the variance of the response variable freely estimated. Yeah, I know. Note: this all can play havoc when you have both latent and composite variables, so be careful. See here for an example.
  5. Everything else – other regression relationships, showing that nonlinearities are derived quantities, etc.

OK, that’s a lot. How’s it work in practice? Below is the code to fit the model in the path diagram. I’ve labeled the steps in comments, and included the local ~ regional richness relationship as well as the relationship showing that logNcen2 was derived from logN. Note, this is a centered squared variable. And, yes, all nitrogen values have been log transformed here.

#simple SA model with N and regional SR using a composite
#Variables: logN = log nutrient supply rate, logNcen2 = log supply rate squared
# SA = Species richness on a patch of Agar, SR = stream-wide richness
compositeModel<-'
	#1) define the composite, scale to logN
	Nitrogen ~ 1*logN + logNcen2 #loading on the significant path!

	#2) Specify 0 error variance
	Nitrogen ~~ 0*Nitrogen

	#3) now, because we need to represent this as a latent variable
	#show how species richness is an _indicator_ of nitrogen
	Nitrogen =~ SA

	#4) BUT, make sure the variance of SA is estimated
	SA ~~ SA

	#Regional Richness also has an effect
	SA ~ SR

	#And account for the derivation of the square term from the linear term
	logNcen2 ~ logN
	'

library(lavaan) # assuming the Cardinale et al. data are loaded as the data frame 'cards'

# we specify std.lv=T so that the Nitrogen-SA relationship isn't fixed to 1
compositeFit <- sem(compositeModel, data=cards, std.lv=T)

Great! It should fit just fine. I'm not going to focus on the regional relationship, as it is predictable and positive. Moreover, when we look at the results, two things immediately pop out at us about the effect of nutrient supply rate.

                   Estimate  Std.err  Z-value  P(>|z|)
Latent variables:
  Nitrogen =~
    SA                0.362    0.438    0.827    0.408

Regressions:
  Nitrogen ~
    logN              1.000
    logNcen2         -1.311    1.373   -0.955    0.340

Wait, what? The Nitrogen effect was not detectably different from 0? Nor was there a nonlinear effect? What's going on here?

What's going on is that the scale of the composite is affecting our results. We've set that scale by fixing the logN coefficient to 1. Whenever you are fixing scales, you should always check and see what would happen if you changed which path was set to 1. So, we can simply set the scale to the nonlinear variable, refit the model, and see if this makes a difference. If it doesn't, then that means there is no nitrogen effect at all!

So, change

Nitrogen ~ 1*logN + logNcen2

to

Nitrogen ~ logN + 1*logNcen2
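In practice, a minimal way to make that swap and refit (assuming the same cards data frame and std.lv=T as before) is:

#swap which path sets the scale of the composite, then refit
compositeModel2 <- gsub("1*logN + logNcen2", "logN + 1*logNcen2", compositeModel, fixed=TRUE)
compositeFit2 <- sem(compositeModel2, data=cards, std.lv=T)
summary(compositeFit2)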

And, now let's see the results…..

                   Estimate  Std.err  Z-value  P(>|z|)
Latent variables:
  Nitrogen =~
    SA               -0.474    0.239   -1.989    0.047

Regressions:
  Nitrogen ~
    logN             -0.763    0.799   -0.955    0.340
    logNcen2          1.000

Ah HA! Not only is the linear effect not different from 0, but now we see that scaling the composite to the nonlinear term allows the nutrient signal to come through.

But wait, you may say, that effect is negative? Well, remember that the scale of the nitrogen effect is the same as the nonlinear scale. And, a positive hump-shaped relationship will have a negative squared term. So, given how we've setup the model, yes, that should be negative.

*whew!* That was a lot. And this for a very simple model involving composites and nonlinearities. I thought I'd throw it out there as it's a common use of composites, and interpreting nonlinearities in SEMs is always a little tricky and worth bending your brain around. Other uses of composites include summing up a lot of linear quantities, a composite for the application of treatments, and more. But, this should give you a good sense of what they are, how to code them in lavaan, and how to use them in the future.

For a more in depth treatment of this topic, and latent variables versus composites, I urge you to check out this excellent piece by Jim Grace and Ken Bollen. Happy model fitting!

Extra! Extra! Get Your gridExtra!

The more I use it, the deeper I fall in love with ggplot2. I know, some of you have heard me kvell about it ad nauseam (oh, Yiddish and Latin in one sentence – extra points!). But the graphs really look great, and once you wrap your head around a few concepts, it’s surprisingly easy to make it do most anything you want.

Except for one thing.

One thing I loved about the old R plotting functions was the ability to set up panels easily and fill them with totally different graphs. Ye olde par(mfrow=c(2,2)) for a 2 x 2 grid, for example.

Enter gridExtra. Let the games begin.

What exactly do I mean? Let’s say I’m working with the soil chemistry data in the vegan package. First, maybe I just want to eyeball the histograms of both the humus depth and bare soil columns.

To do this in ggplot2, with a single command to put them in a single window, first you need to melt the data with reshape2 so that the column names are actually grouping variables, and then you can plot it. In the process, you create an additional data frame. And, you also have to do some extra specifying of scales, facets, etc. etc. Here’s the code and graphs.

library(ggplot2) #for plotting
library(reshape2) #for data reshaping

library(vegan) #for the data
data(varechem)

#First, reshape the data so that Humus depth and Bare soil are your grouping variables
vMelt<-melt(varechem, measure.vars=c("Humdepth", "Baresoil"))

#Now plot it.  Use fill to color things differently, facet_wrap to split this into two panels,
#And don't forget that the x scales are different - otherwise things look odd
qplot(value, data=vMelt, fill=variable) + facet_wrap(facets=~variable, scales="free_x")

This produces a nice graph. But, man, I had to think about reshaping things, and all of those scales? What if I just wanted to make two histograms and slam 'em together? This is where gridExtra is really nice. Through its function grid.arrange, you can make a multi-paneled graph using ggplot2 plots, lattice plots, and more (although, not regular R plots...I think).

So, let's see the same example, but with gridExtra.

library(gridExtra)

#make two separate ggplot2 objects
humDist<-qplot(Humdepth, data=varechem, fill=I("red"))
bareDist<-qplot(Baresoil, data=varechem, fill=I("blue"))

#Now use grid.arrange to put them all into one figure.
#Note the use of ncol to specify two columns.  Things are nicely flexible here.
grid.arrange(humDist, bareDist, ncol=2)

"Oh, what a trivial problem," you may now be saying. But, if you want to, say, plot up 5 different correlations, or, say, the same scatterplot with 4 different model fits, this is a life-saver - if nothing else, in terms of readability of your code for later use.

This is all well and good, but, simple. Let's get into more fun multi-panel figures. Let's say we wanted a bivariate scatter-plot of Humus Depth and Bare Soil with a linear fit. But, we also wanted to plot the histograms of each variable in adjacent panels. Oh, and flip the histogram of whatever is on the y-axis. Sexy, no? This is pretty straightforward. We can use the ggplot2 objects we already have, flip the co-ordinates on one, create a bivariate plot with a fit, and fill in one final panel with something blank.

#First, the correlation.  I'm using size just to make bigger points.  And then I'll add a smoothed fit.
corPlot<-qplot(Humdepth, Baresoil, data=varechem, size=I(3))+stat_smooth(method="lm")

#OK, we'll need a blank panel.  gridExtra can make all sorts of shapes, so, let's make a white box
blankPanel<-grid.rect(gp=gpar(col="white"))

#Now put it all together, but don't forget to flip the Baresoil histogram
grid.arrange(humDist, blankPanel, corPlot, bareDist + coord_flip(), ncol=2)

Nice. Note the use of the grid.rect. gridExtra is loaded with all sorts of interesting ways to place shapes and other objects into your plots - including my favorite - grid.table, for when you don't want to deal with text.

a<-anova(lm(Baresoil ~ Humdepth, data=varechem))
grid.table(round(a, digits=3))


Or, heck, if you want to make that part of the above plot, use tableGrob instead of grid.table, and then slot it in where the blank panel is. The possibilities are pretty endless!
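For instance, here's a quick sketch of that swap, reusing the objects from above:

#slot the ANOVA table in where the blank panel was, using tableGrob instead of grid.table
tableG<-tableGrob(round(a, digits=3))
grid.arrange(humDist, tableG, corPlot, bareDist + coord_flip(), ncol=2)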

UPDATE: Be sure to see Karthik's comment below about alternatively using viewports. Quite flexible, and very nice, if a hair more complex.

More, Please!

Thanks to Jim, I’ve been using R in the shell more and more – in concert with vi. It’s been fun, and nice to integrate my workflows all on the server (although I haven’t had to do much graphing yet – I’m sure I’ll start kvetching then and return to a nice gui).

One thing that has frustrated me is that large dumps of output – say, a list composed of elements that are 100 lines each – just whip past me without any ability to scroll through more slowly. The page function helps somewhat, but it gets wonky when looking at S4 objects. I wanted something more efficient – something that worked, well, like more! So I peered into page and whipped up a more function that some of you may find useful. Of course, I’m sure that there is a simpler way, but, when all else fails…write it yourself!

more<-function(x, pager=getOption("pager")){
	#put everything into a local file using sink
	file <- tempfile("Rpage.")
	sink(file)
	show(x)
	sink()

	#use file.show as you can use the default R pager
	file.show(file, title = "", delete.file = TRUE, pager = pager)
}
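Usage is just what you'd expect. For example, paging through some hypothetical big object rather than letting it scroll past:

#a stand-in for some huge output
bigList <- replicate(20, summary(rnorm(1000)), simplify=FALSE)
more(bigList)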

RStudio – An IDE for the Masses!

I’ll admit it. There’s one thing that always makes me sad working on a mac. R. How does R make me sad on a mac? I look over at my compatriots in Windows using fun Integrated Development Environments (IDEs) like Tinn-R, and I sigh. On the other hand, I just had the sad little text editor and shell. Sure, it was enough, and I had wrung some sweet sweet code from that simple setup, but windows would get lost, I’d lose track of what file was where, what plot window was open, and would sometimes even forget which instance of R I was working in when I was working on two projects at the same time (one for simulation, one for analysis).

I mean, sure, I could go through the rigamarole of doing everything through some flavor of Emacs like a 31337 h4X0r. But my Emacs days are behind me. I would have preferred a simpler solution.

So when I saw news of a new, cross-platform, free, lovely IDE called RStudio hit R-bloggers and the Twitterverse, I rejoiced.

But would it be just a kludgy piece of bunk, or a nice, smooth, user experience? I figured I’d give it a whirl. I hopped on over to the website, and had a pleasant easy download experience. Then I fired it up, and ran an old analysis. After fiddling around a little just to get my feet on the ground (which took only a minute or two – the whole thing was quite intuitive), I was pretty pleased. The interface was clean, simple, and purty.

RStudio Window

Oh! Quite the little interface, there! Click to get a larger view.

And some features – such as image exporting – were like a dream. So easy, in fact, that I decided to confront it with the problem of exporting an image for publication at the proper image quality (something which is always a bit of a hassle in R normally). But to my delight, no problem. Just fiddled with the size a little bit, and presto! High quality pub-ready image.

So, overall, I’m impressed. And with a Twitter feed, blog, and interactive support forum, I think it looks like this IDE is going to be a great tool for science. So go check it out!

Merge Me Baby One More Time!

OK – has this ever happened to you? You are working with a team of collaborators all using a common dataset – maybe from an agency, an LTER, or someone else’s data altogether. Each of you has some task – incorporating new data, running fancy models and putting the results back into the data for later analysis, etc.

Months later you come back, ready to put it all together to execute your master plan of analysis, when suddenly you realize that you’ve all been working on slightly different versions of the original dataset.

Somewhere along the way, things got forked. How do you bring it all back together?!

Now, if you’d been a good little monkey, all of your work would be scripted in R (or otherwise), and you could just plug in whatever version of the data you all agree on and still get the proper new dataset out the other end.

But maybe you did something crazy that took a week to calculate each single datapoint. Or maybe some of your collaborators just *GASP* did it all in Excel.

How do you bring things back together? How do you get a modicum of control?

Enter merge.

I recently had to use merge to bring two datasets back together under just these conditions – one collaborator had added a whole new column of data to a dataset of several thousand entries, while another had added a hundred new lines of data. Worse, both datasets were sorted differently, so we couldn’t just slam them together.

So, let’s go through an example of how to bring such a thing together. Note, if you want all of the below in one convenient file, grab data_merging.R and later on you’ll want to also have this function for sorting data frames.

Let’s begin by making some old dataset with a unique identifier column called ‘identifier’. You do have one of these in your data, don’t you? If not, for shame! For shame!

olddata<-data.frame(expand.grid(1:10, 1:10))
names(olddata)<-c("col1", "col2")

#create an identifier
olddata$identifier<-1:nrow(olddata)

OK, now, let's say you've put in some hard work and have added a whole new column of measurements for the existing data points. Let's create a modified dataset.

moddata<-cbind(olddata, newstuff=rnorm(nrow(olddata), 10, 15))

But wait! Your collaborator has been working on the old dataset, and collected a few new data points - say, 10 new rows. But...they don't have the new column of measurements. To simulate this, we'll use the old data, and add on 10 new rows.

newdata<-rbind(olddata, data.frame(col1=round(runif(10)), col2=round(runif(10)), identifier=(nrow(olddata)+1):(nrow(olddata)+10)))

Uh oh! How do we combine newdata and moddata? Now, in this silly example, we could just use cbind and tack a few NAs on. In the real world, people do silly things when they work with datasets all the time - typically re-ordering them to suit their needs. So, let's say both datasets are all shaken up and in different row orders.

In that case, we'd want to use merge. And, indeed, unless you are 110% sure the data frames line up, you should probably use merge anyway, just to make sure. It's the safe way to bring two datasets together. Remember, practice safe data management, people!

Merge takes several arguments - two data frames, the names of common columns, and whether you're going to use all rows of one data frame or another. In this case, we want all of the rows of the new data, and we want to say that the column names of the new data frame are also common to both data sets. Basically, to master merge, the arguments to really pay attention to are "by" (and/or by.x and by.y) and "all" (and/or all.x and all.y). In this instance, we have the following:

mergedata<-merge(newdata, moddata, by=names(newdata), all.x=T)

Take a look at the merged data - you'll note that it's in a wacky order, but if you browse through, you'll see that for some of the entries in the column newstuff, there are NA values - and that these are for rows that were present in newdata, but not in the older modified data set.
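As an aside, if your collaborator had also renamed the identifier column (say, to 'ID'), the by.x/by.y arguments mentioned above handle that. A hypothetical sketch:

#the same merge if the modified data called its identifier column 'ID'
moddata2 <- moddata
names(moddata2)[names(moddata2)=="identifier"] <- "ID"

#shared columns not listed in 'by' (here col1 and col2) get .x/.y suffixes
mergedata2 <- merge(newdata, moddata2, by.x="identifier", by.y="ID", all.x=TRUE)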

Things are kinda out of order. If you want to resort the dataset, I highly
recommend grabbing an awesome function called sort.data.frame which you can get here. It's one of the things I always load up in my .Rprofile.

source("./sort.data.frame.R")
sortedmergedata<-sort.data.frame(mergedata, ~+identifier)
head(sortedmergedata)

   col1 col2 identifier   newstuff
9     1    1          1  12.506103
21    2    1          2  18.998247
31    3    1          3  -7.231088
41    4    1          4 -11.318055
51    5    1          5  22.595297
61    6    1          6 -10.120211

Ah, there we go, nice and in order. Now all of your data is together, safe and sound. Time to move on to the real fun of analysis!

Do Not Log-Transform Count Data, Bitches!

OK, so, the title of this article is actually Do not log-transform count data, but, as @ascidacea mentioned, you just can’t resist adding the “bitches” to the end.

Onwards.

If you’re like me, when you learned experimental stats, you were taught to worship at the throne of the Normal Distribution. Always check your data and make sure it is normally distributed! Or, make sure that whatever lines you fit to it have normally distributed error around them! Normal! Normal normal normal!

And if you violate normality – say, you have count data with no negative values, and a normal linear regression would create situations where negative values are possible (e.g., what does it mean if you predict negative kelp! ah, the old dreaded nega-kelp), then no worries. Just log transform your data. Or square root. Or log(x+1). Or SOMETHING to linearize it before fitting a line and ensure the sacrament of normality is preserved.

This has led to decades of in-the-field ecologists transforming count data without any real thought as to the consequences.

But statistics has had a better answer for decades – generalized linear models (glm for R nerds, gzlm for SAS goombas who use proc genmod. What? I’m biased!) whereby one specifies a nonlinear function with a corresponding non-normal error distribution. The canonical book on this was first published ’round 1983. Sure, one has to think more about the particular model and error distribution they specify, but, if you’re not thinking about these things in the first place, why are you doing science?

“But, hey!” you might say, “Glms and transformed count data should produce the same results, no?”

From first principles, Jensen’s inequality says no – consider the consequences for error of the transformation approach of log(y) = ax+b+error versus the glm approach y=e^(ax+b)+error. More importantly, the error distributions from generalized linear models may often be far far faaar more appropriate to the data you have at hand. For example, count data is discrete, and hence, a normal distribution will never be quite right. Better to use a poisson or a negative binomial.

But, “Sheesh!”, one might say, “Come on! How different can these models be? I mean, I’m going to get roughly the same answer, right?”

O’Hara and Kotze’s paper takes this question and runs with it. They simulate count data from negative binomial distributions and look at the results from generalized linear models with negative binomial or quasi-poisson error terms (see here for the difference) versus a slew of transformations.
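To make that concrete, here's a minimal sketch of fitting those two glm flavors to some toy negative binomial data (my own quick simulation, not the paper's):

library(MASS) #for glm.nb

#simulate counts whose mean increases with a predictor x
set.seed(2)
x <- runif(100)
y <- rnbinom(100, mu=exp(1 + 2*x), size=1)

nbFit <- glm.nb(y ~ x)                    #negative binomial glm
qpFit <- glm(y ~ x, family=quasipoisson)  #quasi-poisson glm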

Intriguingly, they find that glms (with either distribution) always perform well, while each transformation performs poorly at some or all values.

Estimated root mean-squared error from six different models. Curves from the quasi-poisson model are the same as the negative binomial. Note that the glm lines (black solid) all hang out around 0 as opposed to the transformed fits.

More intriguingly to me are the results regarding bias. Bias is the deviation between a fit parameter and its true value. Basically, it’s a measure of how wrong your answer is. Again, here they find almost no bias in the glms, but bias all over the charts for transformed fits.

Estimated mean biases from six different models, applied to data simulated from a negative binomial distribution. A low bias means that the method will, on average, return the 'true' value. Note that the bias for transformed fits is all over the place. But with a glm, bias is always minimal.

They sum it up nicely:

For count data, our results suggest that transformations perform poorly. An additional problem with regression of transformed variables is that it can lead to impossible predictions, such as negative numbers of individuals. Instead statistical procedures designed to deal with counts should be used, i.e. methods for fitting Poisson or negative binomial models to data. The development of statistical and computational methods over the last 40 years has made it easier to fit these sorts of models, and the procedures for doing this are available in any serious statistics package.

Or, more succinctly, “Do not log-transform count data, bitches!”

“But how?!” I’m sure some of you are saying. Well, after checking into some of the relevant literature, it’s quite straightforward.

Given the ease of implementing glms in languages like R (one uses the glm function, checks diagnostics of residuals to ensure compliance with model assumptions, then can use likelihood ratio testing akin to anova with, well, the Anova function), this is something easily within the grasp of the everyday ecologist. Heck, you can even do posthocs with multcomp, although if you want to correct your p-values (and there are reasons to believe you shouldn’t), you need to carefully consider the correction type.
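That basic workflow looks something like the sketch below, using simulated count data (toy numbers, not from any paper; the capital-A Anova comes from the car package):

library(car) #for the capital-A Anova function

#toy count data: three treatment levels with different mean counts
set.seed(1)
dat <- data.frame(treatment=gl(3, 20), counts=rpois(60, lambda=rep(c(2, 4, 8), each=20)))

mod <- glm(counts ~ treatment, data=dat, family=poisson)

plot(mod)   #residual diagnostics
Anova(mod)  #likelihood ratio chi-square test of the treatment effect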

For example, consider this data from survivorship on the Titanic (what, it’s in the multcomp documentation!) – although, granted, it’s looking at proportion survivorship, but, still, you’ll see how the code works:

library(multcomp)
### set up all pair-wise comparisons for count data
data(Titanic)
mod <- glm(Survived ~ Class, data = as.data.frame(Titanic), weights = Freq, family = binomial)

### specify all pair-wise comparisons among levels of variable "Class"
### Note, Tukey means the type of contrast matrix.  See ?contrMat
glht.mod <- glht(mod, mcp(Class = "Tukey"))

### summarize information
###applying the false discovery rate adjustment
###you know, if that's how you roll
summary(glht.mod, test=adjusted("fdr"))

There are then a variety of ways to plot or otherwise view glht output.
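For instance, reusing the glht.mod object from above:

confint(glht.mod)  #simultaneous confidence intervals for the pairwise contrasts
plot(glht.mod)     #and a quick plot of those intervals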

So, that's the nerdy details. In sum, though, the next time you see someone doing analyses with count data using simple linear regression or ANOVA with a log, sqrt, arcsine sqrt, or any other transformation, jump on them like a live grenade. Then, once the confusion has worn off, give them a copy of this paper. They'll thank you, once they're finished swearing.

O’Hara, R., & Kotze, D. (2010). Do not log-transform count data. Methods in Ecology and Evolution, 1(2), 118-122. DOI: 10.1111/j.2041-210X.2010.00021.x

Viva la Neo-Fisherian Liberation Front!

p≤0.05

Significant p-values. For so many scientists using statistics, this is your lord. Your master. Heck, it has its own facebook group filed under religious affiliations (ok, so, maybe I created that). And it is a concept to whose slavish devotion we may have sacrificed a good bit of forward progress in science over the past half century. Time to blow up the cathedral! Or so say Stuart Hurlbert and Celia Lombardi in a recent fascinating review.

But first, for the uninitiated, what does it mean? Let’s say you’re running an experiment. You want to see whether fertilizer affects the growth rate of plants. You get a bunch of random plots, seed them, and add fertilizer to half of them. You then compare the mean growth rates of the two groups of plots. But are they really different? In essence, a p value gives you the probability that they are the same. And if it is very low, you can reject the idea that they are the same. Well, sort of.

A p value, as defined by the Patron Saint of Statistics for us experimental grunts, R. A. Fisher, is the probability of observing some result given that a hypothesis being tested is true. Or, if d=data and h=a hypothesis, p(d|h) in symbolic language – | means given. Typically, this hypothesis being tested is a null hypothesis – that there is no difference between treatments, or that the slope of a line is 0. However, note a few things about this tricky statement. 1) It is not the probability of accepting the hypothesis you’re trying to reject. 2) It makes no claims about any particular hypothesis being true. For all practical purposes, in the framework of testing a null hypothesis, however, a low p value means there is a very low probability of observing your data if the null were true.

OK. But what is this 0.05 thing all about? Well, p will range from 0 to 1. As formalized by Jerzy Neyman and Egon Pearson (no, not THAT Egon), the idea of Null Hypothesis Significance Testing (NHST) is one where the researcher establishes a critical value of p, called α. The researcher then tests the null statistical hypothesis of interest, and if p falls at or below α, the results are deemed ‘statistically significant’ – i.e. you can safely reject the null. By historical accident of old ideas, copyright, a little number rounding, a lack of computational power to routinely calculate exact p values in the 30s, and some early textbooks, 0.05 has become the standard for much of science.

Indeed, it is mother’s milk for any experimental scientist who has taken a stats course in the last 40+ years. It is enshrined in some journal publication policies. It is used for the quality control of a great deal of biomedical research. It is the result we hope and yearn for whenever we run an experiment.

It may also be a false god – an easy yes/no that can lead us into the comfortable trap of not thinking critically about a problem. After all, if your test wasn’t “significant”, why bother with the results? This is a dangerous line of thinking. It can seriously retard scientific progress and certainly has led to all sorts of gerrymandering of statistical tests and datasets, or even adjusting α up to 0.10 or down to 0.01, depending on the desired result. Or, worse, scientists misreading the stats, and claiming that a REALLY low p value meant a REALLY large effect (seriously!) or that a very high value means that one can accept a null hypothesis.

Scientists are, after all, only human. And are taught by other humans. And while they are trained in statistics, are not statisticians themselves. All too human errors creep in.

Aside from reviewing a tremendous amount of literature, Hurlbert and Lombardi perhaps best sum up the case as follows – suppose you were to look at the results of two different statistical tests. In one, p=0.051. In the other, p=0.049. If we are going with the α=0.05 paradigm, then in one test we would not reject the null, while in the other we would, and we would label the effect as ‘significant’.

Clearly, this is a little too arbitrary. H&L lay out a far more elegant solution – one that is being rapidly incorporated in many fields and has been advocated for some time in the statistical literature. It is as follows:

  1. Report a p-value for a test.
  2. Do not assign it significance, but rather refer to the level of support it gives for rejecting a null – strong, weak, moderate, practically non-existent. Make sure this statement of support is grounded in the design and power of the experiment. Suspend judgement on rejecting a null if the p value is high, as p-value testing is NOT the same as giving evidence FOR a null (something so many of us forget).
  3. Use this in accumulation with other lines of evidence to draw a conclusion about a research hypothesis.

This neoFisherian Significance Assessment (NFSA) seems so simple, so elegant. And it puts the scientist back into the science in a way that NHST does not.

There have been, of course, other proposals. Many have advocated throwing out p values and reporting confidence intervals and effect sizes. This information can be incredibly valuable, but CIs can often be p values in disguise. Effect sizes are great, but without an estimate of variability, they can be deceiving. Indeed, the authors argue that p value reporting is the way to assess the support for rejecting a null, but that the nuance with which it is done is imperative.

They also review several other alternatives and critiques – Bayesian ideas or information theoretic approaches, although I think there is some misunderstanding there that leads the authors to see conflict with their views where there actually is none. Still, it does not detract from their main message.

It should also be noted that this is one piece of a larger agenda by the authors to force scientists (and particularly ecologists) to rethink how we approach statistics. There’s another paper out there that demonstrates why one-tailed t-tests are the devil (the appendix of which is worth reading to see how conflicted even textbooks on the subject can be), and another is in review on why corrections for multiple hypothesis testing (e.g. Tukey tests and Bonferroni corrections) are in many cases quite unnecessary.

Strong stuff (although who would expect less from the man who gave us the clarion call against pseudoreplication). But intellectually, the arguments make a lot of sense. If anything, it forces a greater concentration on the weight of evidence, rather than a black-and-white situation. It puts the scientist back in the limelight, forcing them to build a case and apply their knowledge, skills, and creativity as a researcher.

Quite liberating. We shall see if it is adopted, and where it leads. I leave you with an excerpt from the concluding remarks.

We came along after the dust had settled, and have just tried to push over the last remaining structures of the old cathedral and to show the logic of the neoFisherian reformation. Most of the stone building blocks from the old cathedral were still of value. They just needed to be reassembled with fresh mortar by a new generation of scientists and statisticians to increase the guest capacity and beautify the gardens of the neoFisherian cottage.

Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46, 311-349.