Do Not Log-Transform Count Data, Bitches!

OK, so, the title of this article is actually “Do not log-transform count data,” but, as @ascidacea mentioned, you just can’t resist adding the “bitches” to the end.


If you’re like me, when you learned experimental stats, you were taught to worship at the throne of the Normal Distribution. Always check your data and make sure it is normally distributed! Or, make sure that whatever lines you fit to it have normally distributed error around them! Normal! Normal normal normal!

And if you violate normality – say, you have count data with no negative values, and a normal linear regression would create situations where negative values are possible (e.g., what does it mean if you predict negative kelp! ah, the old dreaded nega-kelp), then no worries. Just log transform your data. Or square root. Or log(x+1). Or SOMETHING to linearize it before fitting a line and ensure the sacrament of normality is preserved.

This has led to decades of reflexive transformation of count data by in-the-field ecologists, without any real thought as to the consequences.

But statistics has had a better answer for decades – generalized linear models (glm for R nerds, gzlm for SAS goombas who use proc genmod. What? I’m biased!) whereby one specifies a nonlinear function with a corresponding non-normal error distribution. The canonical book on this was first published ’round 1983. Sure, one has to think more about the particular model and error distribution they specify, but, if you’re not thinking about these things in the first place, why are you doing science?

“But, hey!” you might say, “Glms and transformed count data should produce the same results, no?”

From first principles, Jensen’s inequality says no – consider the consequences for error of the transformation approach, log(y) = ax+b+error, versus the glm approach, y=e^(ax+b)+error. More importantly, the error distributions from generalized linear models may often be far far faaar more appropriate to the data you have at hand. For example, count data are discrete, and hence a normal distribution will never be quite right. Better to use a Poisson or a negative binomial.
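If the difference isn’t obvious, here’s a minimal sketch of the two specifications in R on some made-up counts (the numbers and names below are mine, purely for illustration):

#made-up count data, generated with a log-scale relationship to x
set.seed(42)
x <- runif(100, 0, 10)
y <- rpois(100, lambda = exp(0.1 + 0.3*x))

#the transformation approach: model the mean of log(y + 1) with normal error
lm.fit <- lm(log(y + 1) ~ x)

#the glm approach: model the log of the mean, with Poisson error
glm.fit <- glm(y ~ x, family = poisson(link = "log"))

coef(lm.fit)   #slope on the log(y+1) scale - distorted by Jensen's inequality
coef(glm.fit)  #slope on the log-mean scale - should land near the 0.3 used above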

But, “Sheesh!”, one might say, “Come on! How different can these models be? I mean, I’m going to get roughly the same answer, right?”

O’Hara and Kotze’s paper takes this question and runs with it. They simulate count data from negative binomial distributions and look at the results from generalized linear models with negative binomial or quasi-Poisson error terms (see here for the difference) versus a slew of transformations.

Intriguingly, they find that glms (with either distribution) always perform well, while each transformation performs poorly at some or all values.

[Figure: Estimated root mean-squared error from six different models. Curves from the quasi-Poisson model are the same as the negative binomial. Note that the glm lines (black, solid) all hang out around 0, as opposed to the transformed fits.]

More intriguingly to me are the results regarding bias. Bias is the deviation between a fitted parameter and its true value. Basically, it’s a measure of how wrong your answer is. Again, here they find almost no bias in the glms, but bias all over the charts for transformed fits.

[Figure: Estimated mean biases from six different models, applied to data simulated from a negative binomial distribution. A low bias means that the method will, on average, return the 'true' value. Note that the bias for transformed fits is all over the place, but with a glm, bias is always minimal.]
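If you want to convince yourself, a toy version of this sort of bias check takes only a few lines of R (this is nowhere near their full simulation design, and the parameter values below are arbitrary):

#a toy bias comparison on negative binomial counts
library(MASS)   #for rnegbin and glm.nb

set.seed(1)
true.slope <- 0.3
bias <- replicate(500, {
  x <- runif(50, 0, 5)
  y <- rnegbin(50, mu = exp(1 + true.slope*x), theta = 2)

  glm.slope <- unname(coef(glm.nb(y ~ x))["x"])        #negative binomial glm
  lm.slope  <- unname(coef(lm(log(y + 1) ~ x))["x"])   #log(y+1) transform + lm

  c(glm = glm.slope - true.slope, lm = lm.slope - true.slope)
})

rowMeans(bias)  #mean bias: should sit near 0 for the glm, well off for the lm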

They sum it up nicely:

For count data, our results suggest that transformations perform poorly. An additional problem with regression of transformed variables is that it can lead to impossible predictions, such as negative numbers of individuals. Instead statistical procedures designed to deal with counts should be used, i.e. methods for fitting Poisson or negative binomial models to data. The development of statistical and computational methods over the last 40 years has made it easier to fit these sorts of models, and the procedures for doing this are available in any serious statistics package.

Or, more succinctly, “Do not log-transform count data, bitches!”

“But how?!” I’m sure some of you are saying. Well, after checking into some of the relevant literature, it’s quite straightforward.

Given the ease of implementing glms in languages like R (one uses the glm function, checks diagnostics of the residuals to ensure compliance with model assumptions, then can use likelihood ratio testing akin to anova with, well, the Anova function) this is something easily within the grasp of the everyday ecologist. Heck, you can even do posthocs with multcomp, although if you want to correct your p-values (and there are reasons to believe you shouldn’t), you need to carefully consider the correction type.
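In the abstract, that workflow looks something like this – just a sketch, where my.counts, counts, and treatment are hypothetical stand-ins rather than a real dataset:

#a sketch of the glm workflow on made-up count data
#(my.counts, counts, and treatment are hypothetical names)
library(car)   #for Anova()

my.counts <- data.frame(
  treatment = gl(3, 20, labels = c("control", "low", "high")),
  counts    = rpois(60, lambda = rep(c(4, 8, 15), each = 20))
)

#fit the glm with Poisson error and a log link
count.mod <- glm(counts ~ treatment, data = my.counts, family = poisson)

#check residual diagnostics before believing anything
plot(count.mod)

#likelihood ratio tests for each term, akin to an anova table
Anova(count.mod, test.statistic = "LR")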

For example, consider this data from survivorship on the Titanic (what, it’s in the multcomp documentation!) – although, granted, it’s looking at proportion survivorship, but, still, you’ll see how the code works:

### load multcomp for glht and friends
library(multcomp)

### Titanic ships with R as a table; convert it to a data frame for glm
titanic.df <- as.data.frame(Titanic)

### set up all pair-wise comparisons for count data
mod <- glm(Survived ~ Class, data = titanic.df, weights = Freq, family = binomial)

### specify all pair-wise comparisons among levels of variable "Class"
### Note, Tukey means the type of contrast matrix.  See ?contrMat
glht.mod <- glht(mod, mcp(Class = "Tukey"))

### summarize information
### applying the false discovery rate adjustment
### you know, if that's how you roll
summary(glht.mod, test=adjusted("fdr"))

There are then a variety of ways to plot or otherwise view glht output.
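For example, a couple of options I know of:

#two quick ways to look at the glht results
plot(confint(glht.mod))   #plot confidence intervals for each pairwise contrast
cld(glht.mod)             #compact letter display of which classes differ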

So, that's the nerdy details. In sum, though, the next time you see someone doing analyses with count data using simple linear regression or ANOVA with a log, sqrt, arcsine sqrt, or any other transformation, jump on them like a live grenade. Then, once the confusion has worn off, give them a copy of this paper. They'll thank you, once they're finished swearing.

O’Hara, R., & Kotze, D. (2010). Do not log-transform count data. Methods in Ecology and Evolution, 1(2), 118-122. DOI: 10.1111/j.2041-210X.2010.00021.x

Viva la Neo-Fisherian Liberation Front!


Significant p-values

For so many scientists using statistics, this is your lord. Your master. Heck, it has its own facebook group filed under religious affiliations (ok, so, maybe I created that). And it is a concept to whose slavish devotion we may have sacrificed a good bit of forward progress in science over the past half century. Time to blow up the cathedral! Or so say Stuart Hurlbert and Celia Lombardi in a recent fascinating review.

But first, for the uninitiated, what does it mean? Let’s say you’re running an experiment. You want to see whether fertilizer affects the growth rate of plants. You get a bunch of random plots, seed them, and add fertilizer to half of them. You then compare the mean growth rates of the two groups of plots. But are they really different? In essence, a p value gives you the probability that they are the same. And if it is very low, you can reject the idea that they are the same. Well, sort of.

A p value, as defined by the Patron Saint of Statistics for us experimental grunts, R. A. Fisher, is the probability of observing some result given that a hypothesis being tested is true. Or, if d=data and h=a hypothesis, p(d|h) in symbolic language – | means “given.” Typically, the hypothesis being tested is a null hypothesis – that there is no difference between treatments, or that the slope of a line is 0. However, note a few things about this tricky statement. 1) It is not the probability of accepting the hypothesis you’re trying to reject. 2) It makes no claims about any particular hypothesis being true. For all practical purposes, in the framework of testing a null hypothesis, however, a low p value means there is a very low probability that you would have observed your data if the null hypothesis were true.

OK. But what is this 0.05 thing all about? Well, p will range from 0 to 1. As formalized by Jerzy Neyman and Egon Pearson (no, not THAT Egon), the idea of Null Hypothesis Significance Testing (NHST) is one where the researcher establishes a critical value of p, called α. The researcher then tests the null statistical hypothesis of interest, and if p falls at or below α, the results are deemed ‘statistically significant’ – i.e., you can safely reject the null. By historical accident of old ideas, copyright, a little number rounding, a lack of computational power to routinely calculate exact p values in the 1930s, and some early textbooks, 0.05 has become the standard for much of science.

Indeed, it is mother’s milk for any experimental scientist who has taken a stats course in the last 40+ years. It is enshrined in some journal publication policies. It is used for the quality control of a great deal of biomedical research. It is the result we hope and yearn for whenever we run an experiment.

It may also be a false god – an easy yes/no that can lead us into the comfortable trap of not thinking critically about a problem. After all, if your test wasn’t “significant,” why bother with the results? This is a dangerous line of thinking. It can seriously retard scientific progress, and it has certainly led to all sorts of gerrymandering of statistical tests and datasets, or even adjusting α up to 0.10 or down to 0.01, depending on the desired result. Or, worse, scientists misreading the stats and claiming that a REALLY low p value meant a REALLY large effect (seriously!), or that a very high p value means that one can accept a null hypothesis.

Scientists are, after all, only human. And are taught by other humans. And while they are trained in statistics, are not statisticians themselves. All too human errors creep in.

Aside from reviewing a tremendous amount of literature, Hurlbert and Lombardi perhaps best sum up the case as follows – suppose you were to look at the results of two different statistical tests. In one, p=0.051. In the other, p=0.049. If we are going with the α=0.05 paradigm, then in one test we would not reject the null. In the other, we would, and we would label the effect as ‘significant’.

Clearly, this is a little too arbitrary. H&L lay out a far more elegant solution – one that is being rapidly incorporated in many fields and has been advocated for some time in the statistical literature. It is as follows:

1) Report a p-value for a test.

2) Do not assign it significance. Rather, refer to the level of support it gives for rejecting a null – strong, weak, moderate, practically non-existent. Make sure this statement of support is grounded in the design and power of the experiment. Suspend judgement on rejecting a null if the p value is high, as p-value testing is NOT the same as giving evidence FOR a null (something so many of us forget).

3) Use this, in combination with other lines of evidence, to draw a conclusion about a research hypothesis.

This neoFisherian Significance Assessment (NFSA) seems so simple, so elegant. And it puts the scientist back into the science in a way that NHST does not.

There have been, of course, other proposals. Many have advocated throwing out p values and reporting confidence intervals and effect sizes. This information can be incredibly valuable, but CIs can often be p values in disguise. Effect sizes are great, but without an estimate of variability, they can be deceiving. Indeed, the authors argue that p value reporting is the way to assess the support for rejecting a null, but that the nuance with which it is done is imperative.

They also review several other alternatives and critiques – Bayesian ideas and information theoretic approaches – although I think there is some misunderstanding there that leads the authors to see conflict with their views where there actually is none. Still, it does not detract from their main message.

It should also be noted that this is one piece of a larger agenda by the authors to force scientists (and particularly ecologists) to rethink how we approach statistics. There’s another paper out there that demonstrates why one-tailed t-tests are the devil (the appendix of which is worth reading to see how conflicted even textbooks on the subject can be), and another is in review on why corrections for multiple hypothesis testing (e.g., Tukey tests and Bonferroni corrections) are in many cases quite unnecessary.

Strong stuff (although who would expect less from the man who gave us the clarion call against pseudoreplication). But intellectually, the arguments make a lot of sense. If anything, it forces a greater concentration on the weight of evidence, rather than a black-and-white situation. It puts the scientist back in the limelight, forcing them to build a case and apply their knowledge, skills, and creativity as a researcher.

Quite liberating. We shall see if it is adopted, and where it leads. I leave you with an excerpt from the concluding remarks.

We came along after the dust had settled, and have just tried to push over the last remaining structures of the old cathedral and to show the logic of the neoFisherian reformation. Most of the stone building blocks from the old cathedral were still of value. They just needed to be reassembled with fresh mortar by a new generation of scientists and statisticians to increase the guest capacity and beautify the gardens of the neoFisherian cottage.

Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46, 311-349.

Let’s All Go Down to the Barplot!

I’m back for another paean to ANOVA-loving experimental ecology. Barplots (or point-range plots – but which is better is a discussion for a different time) are the thing! And a friend of mine was recently asking me how to do a decent barplot with ggplot2 for data of moderate complexity (e.g., a 2-way ANOVA).

So I tried it.

And then I hit my head against a wall for a little while.

And then I think I figured it out, so, I figured I’d post it so that there is less head-smashing in the world.

To do a boxplot is simple. And many statisticians would argue that boxplots are more informative, so, really, we should abandon barplots. Take the following example, which looks at the highway mileage of cars of various classes in two years.


#load ggplot2 - the mpg data used below ships with it
library(ggplot2)

#a simple qplot with a boxplot geometry - n.b., I like the bw theme
qplot(class, hwy, fill=factor(year), data=mpg, geom="boxplot", position="dodge")+theme_bw()

A few notes. The x-axis here is class. The fill attribute splits things by year (which is continuous, so we need to make it look like a factor). And the final position="dodge" really is the key. It splits the bars for different years apart, and it shall come into play again in a moment.

This code produces a perfectly nice boxplot:

Lovely! Unless you want a barplot. For this, two things must happen. First, you need to get the average and standard error values that you desire. I do this using ddply (natch). Second, you’ll need to add in some error bars using geom_errorbar. In your geom_errorbar, you’ll need to invoke the position statement again.

#load plyr for ddply (ggplot2 is already loaded from above)
library(plyr)

#create a data frame with averages and standard errors
hwy.avg<-ddply(mpg, c("class", "year"), function(df)
  return(c(hwy.avg=mean(df$hwy), hwy.se=sd(df$hwy)/sqrt(length(df$hwy)))))

#create the barplot component
#(stat="identity" plots the means as given, rather than binning)
avg.plot<-qplot(class, hwy.avg, fill=factor(year), data=hwy.avg,
  geom="bar", stat="identity", position="dodge")

#add error bars
avg.plot+geom_errorbar(aes(ymin=hwy.avg-hwy.se, ymax=hwy.avg+hwy.se), position="dodge")+theme_bw()

This produces a perfectly serviceable and easily intelligible barplot. Only. Only... well, it's time for a confession.

I hate the whiskers on error bars. Those weird horizontal things, they make my stomach uneasy. And the default for geom_errorbar makes them huge and looming. What purpose do they really serve? OK, don't answer that. Still, they offend the little Edward Tufte in me (that must be very little, as I'm using barplots).

So, I set about playing with width and other things to make whisker-less error bars. Fortunately for me, there is geom_linerange, which does the same thing, but with a hitch. Its "dodge" needs to be told just how far to dodge. I admit, here, I played about with values until I found ones that worked, so your mileage may vary depending on how many treatment levels you have. Either way, the resulting graph was rather nice.

#first, define the width of the dodge
dodge <- position_dodge(width=0.9) 

#now add the error bars to the plot
avg.plot+geom_linerange(aes(ymin=hwy.avg-hwy.se, ymax=hwy.avg+hwy.se), position=dodge)+theme_bw()

And voila! So, enjoy yet another nifty capability of ggplot2!

Great! I will say this, though. I have also played around with data with unequal representation of treatments (e.g., replace year with manufacturer or something in the previous example) - and, aside from making empty rows for missing treatment combinations, the plots are a little funny. Bars expand to fill up the space left by vacant treatments. Try qplot(class, hwy, data=mpg, fill=manufacturer, geom="boxplot") to see what I mean. If anyone knows how to change this, so all bar widths are the same, I'd love to hear it.

First you get the urchins, then you get the power, then you get the data!

Well, it’s been a few weeks since I put the revised version of my diversity-disturbance experiment out in the field. I completed sampling on Monday, and, *whew*, my power analysis paid off! Although not entirely how I expected…

First off, kicking the maximum urchin density up to 50 per cage (as opposed to 16) was indeed helpful. Even after 1 week, there was a strong urchin effect. And after three weeks, well…

What is perhaps most interesting is what one sees in looking at the relationship between the number of urchins added to a cage and the change in the amount of cover. I sampled by counting the amount of cover of benthic sessile species under points in a fixed grid. Looking at the change in the total number of points, we see the following relationship:

A far cry from the results in trial 1. What’s fascinating, however, is to look at the points in the 0-16 urchin range in trial 2. Take a gander at the above figure, and, as you can see, there appears to be no real pattern – another scattershot. It really did require high densities of the spiny little buggers to generate a strong grazing effect. The guidance from my power analysis was right on!

What is perhaps most exciting (and puzzling) are my preliminary results regarding diversity and disturbance. The following only really shows up when looking at algal cover and algal species richness (i.e., the number of species in a plot). It shows that, yes, the effect of urchin grazing appears to change with initial algal species richness – but not in the way one would expect. Basically, low-diversity plots with a low number of urchins GREW. High-diversity plots with few urchins in them LOST COVER. However, at high densities of urchins, everything got chomped, and pretty evenly.

What does this mean? I have no clue as of right now! As soon as the data was collected, I began feverishly entering and analyzing. Why such a rush? I had to decide whether to move the cages and run yet a third trial, or just cut off the fencing on the cages and use the weighted frames to mark the plots in order to follow how the community in each plot recovers. After seeing the preceding graph, and weighing many costs and benefits, I’m sticking with my single trial (the data from trial 1 is being treated as pilot data and won’t be used in the final analysis). So, yesterday was a crazy day of cutting hundreds of cable ties and herding thousands (about 1500) of urchins underwater.

I’ll be following up with recovery, but for now, the really heavy lifting is finis. And it looks quite intriguing, although I emphasize that this is all VERY preliminary.

I’ve Got the Power!

There is nothing like that pain of taking a first look at fresh, precious new data from a carefully designed experiment – one that took months of planning and construction, 8 divers working full days, and a lot of back-and-forth with colleagues – and then finding absolutely no patterns of interest.

Yup, it’s a real takes-your-breath-away-hand-me-a-beer kind of moment. Fortunately, I was on my way to a 4th of July dinner with my family up in the mountains of Colorado, so I wasn’t quite going to commit hara-kiri on the spot. Still, the moment of vertigo and nausea was somewhat impressive.

Because, let’s admit it, as much as science is about repeated failure until you get an experiment right (I mean, the word is “experiment,” not “interesting-meaningful-data-generating-exercise”), it still never feels good.

So, round one of my experiment to test for feedbacks between sessile species diversity and urchin grazing showed bupkiss. Not only was there no richness effect, but, well, there wasn’t even a clear effect that urchins mattered. High density urchin treatments lost about as much cover and diversity of algae and sessile inverts as treatments with low or no urchins in them. What had gone wrong? And can this project be salvaged?

[Figure: Yeah, I can see a pattern here. Sure...]

Rather than immediately leaping to a brilliant biological conclusion (that was frankly not in my head) I decided to jump into something called Power Analysis. Now, to most experimental scientists, power analysis is that thing you were told to do in your stats class to tell you how big your sample size should be, but, you know, sometimes you do it, but not always. Because, really, it is often taught as a series of formulae with no seeming rhyme nor reason (although you know it has something to do with statistical significance). Most folk use a few canned packages for ANOVA or regression, but the underlying mechanics seem obscure. And hence, somewhat fear-provoking.

Well, here’s the dirty little secret of power analysis. Really, at the end of the day, it’s just running simulations of a model of what you THINK is going on, adding in some error, and then seeing how often your experimental design of choice will give you easily interpretable (or even correct) results.

The basic mechanics are this. Have some underlying deterministic model. In my case, a model that described how the cover and diversity of sessile species should change given an initial diversity, cover, and number of urchins. Then, add in random error – either process error in each effect itself, or just noise from the environment (e.g., some normal error term with mean 0 and a standard deviation). Use this to get made-up data for your experiment. Run your stats and get parameter estimates, p values, etc. Then, do the same thing all over again. Over many simulations of a particular design, you can get average coefficient estimates, the number of times you get a significant result, etc. “Power” for a given effect is determined by the number of “significant” results (where you define significance as p=0.05 or whatnot) divided by the number of simulations.
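In R, the skeleton of that procedure is just a handful of lines. Here’s a bare-bones sketch with a made-up effect size, noise level, and design (i.e., not my actual urchin model):

#bare-bones power simulation - slope, noise, and design are made up
set.seed(3)
n.sims  <- 1000
slope   <- 0.5                            #the effect you THINK is going on
sigma   <- 2                              #environmental noise
urchins <- rep(c(0, 8, 16), each = 10)    #a hypothetical design

p.vals <- replicate(n.sims, {
  change <- 1 + slope*urchins + rnorm(length(urchins), 0, sigma)   #fake data
  summary(lm(change ~ urchins))$coefficients["urchins", "Pr(>|t|)"]
})

#power = proportion of simulations that came out 'significant'
sum(p.vals < 0.05)/n.sims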

So that’s just what I did. I had coefficient estimates (albeit nonsignificant ones). So, what would I need to detect them? I started by playing with the number of replicates. What if I had just put out more cages? Nope. The number of replicates I’d need just to get the power into the 0.8 range (and you want as close to 1 as possible) was mildly obscene.

So, what about the number of urchins? What if, instead of having 0 and 16 as my lowest and highest densities, I cranked the maximum up?


Shown above are two plots. On the left, you see the p value for the species richness * urchin density interaction for each simulated run at a variety of maximum urchin densities. On the right, you see the power (# of simulations where p<0.05 / total number of simulations). Note that, for the densities I worked with, power is only around 0.3. And if I kicked the maximum density up to 50, well, things get much nicer.

As soon as I saw this graph, the biology of the Southern California Bight walked up behind me and whapped me over the head. Of course. I had been using a maximum density that corresponded to your average urchin barren – the density of urchins that can hang out and graze down any new growth. But that's not the density of urchins that CREATED the barren in the first place. Sure enough, looking at our long-term data, this new value of 50 corresponded to a typical front of urchins that would move through an area and lay waste to the kelp forest around it. Which is what I had been interested in in the first place. *headdesk*

So, round two is up and running now. The cages are stocked with obscenely high densities of the little purple spikey dudes. And we shall see what happens! I am quite excited, and hopeful. Because, I've got the Power! (and so should you!)

Simple Data Visualization

OK, so, I know I already raved last week about one Hadley Wickham project and how it has changed my life. But what can I say, the man is a genius. And if you are using R (and let’s face it, you should be) and you want simple, sexy graphs made quick, the man has you covered once again.

I first learned about ggplot2 while scanning through some slides from the LA Area RUG meetings (which I missed – I still haven’t been to any!) by Michael Driscoll.

And I liked what I saw – ggplot2 and lattice (which, I admit, I had kind of avoided) seemed more intuitive than I thought. Then I ran into a series of articles on ggplot2 from the Learning R blog, and I was hooked. Still am. And why, you ask?

Let’s consider a task – you have some data split into four groups. For each group, you want to plot a fitted regression between two covariates. You want this split into panels, with common scales, and nicely placed axis labels. Also, you want it to be purty. While you can do this with the standard graphics package (and, I admit, I sometimes like the austerity of the standard graphics), it would take a for loop, tons of formatting instructions, and a number of steps where you could easily mess the whole thing up. Not to mention that you might have to change a good bit if you want to add another group.

Here is how easy it is with ggplot2. Note, there are only two lines of actual graphing code. The rest is just making up the data.


#load ggplot2
library(ggplot2)

#create data - x is a predictor, g is a group, y is the response
#(the actual x values are arbitrary)
x<-1:100
g<-as.factor(c("A", "B", "C", "D"))

#i love expand.grid, as it easily creates full
#matrices of all combinations of multiple variables
z<-expand.grid(x=x, g=g)

#add a little error in to our results
z$y<- rnorm(length(z$x), z$x*c(3,15,29,50)[unclass(z$g)]+5, sd=200)

#this first bit just sets up the basic plotting, with colors and panels
a<-qplot(x,y,group=g, colour=g, data=z, facets=~g)

#now, add a fitted line with the error around it shown.
# Also, change the theme.  And plot the whole thing!
a+stat_smooth(method="lm", se=T, fill="black")+theme_bw()

All of which yields the following pretty figure:

Nice, eh?

And that stat_smooth statement can take lots of other arguments - e.g., glms (I've tried, and it looks great!)

So check it out - even for just casual data exploration, there's some real clarity to be found. And I look forward to trying out other products by Prof. Wickham!

Sometimes, you just need to use a plyr

I haven’t posted anything about R-nerdery in quite some time. But I have to pause for a moment, and sing the praises of a relatively new package that has made my life exponentially easier. The plyr package.

R has the capability to apply a single function to a vector or list using apply or mapply, or their various derivatives. This returns another vector or list.
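For example:

#apply a function over the rows or columns of a matrix...
m <- matrix(1:6, nrow = 2)
apply(m, 2, mean)                        #returns a vector of column means

#...or over several vectors in parallel with mapply
mapply(function(a, b) a + b, 1:3, 4:6)   #also returns a vector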

This is great in principle, but in practice, with indexing, odd return objects, and difficulties in using multiple arguments, it can get out of hand for complex functions. Hence, one often resorts to a for loop.

Let me give you an example. Let’s say I have some data from a simple experiment where I wanted to look at the effect of adding urchins, lobsters, or both on a diverse community of sessile invertebrates – mussels, squirts, etc. Heck, let’s say I had one gazillion species whose responses I was interested in. Now let’s say I want a simple summary table of the change in the density of each species – and my data has before and after values. So, my data would look like this:

Urchins  Lobsters  Before.After  Replicate  Species    Density
Yes      No        Before        1          1          30
Yes      No        After         1          1          10
Yes      No        Before        1          2          45
Yes      No        After         1          2          23
...
Yes      Yes       Before        15         Gazillion  10
Yes      Yes       After         15         Gazillion  9

So, each row represents the measurement for one species in one replicate either before or after the experiment. Now, previously, to get an output table with the mean change for each species for each treatment, I would have had to do something like this:

#first, create a blank list of vectors waiting to be filled
#as well as blank vectors to remember what treatment
#combinations we're looking at
#(my.data and the other object names below are placeholders)
mean.list<-list()
urchin.vec<-vector(); lobster.vec<-vector()
for(i in 1:gazillion) mean.list[[paste("sp",i,sep=".")]]<-vector()

#then, a for loop for each combination
for (i in levels(my.data$Urchins)){
  for (j in levels(my.data$Lobsters)){
    #get the right subset of the data, and note the treatment combination
    sub.data<-subset(my.data, my.data$Urchins==i & my.data$Lobsters==j)
    urchin.vec<-c(urchin.vec, i); lobster.vec<-c(lobster.vec, j)

    #now loop over all species
    for (k in 1:gazillion){
      sp.data<-subset(sub.data, sub.data$Species==k)
      before.data<-subset(sp.data, sp.data$Before.After=="Before")
      after.data<-subset(sp.data, sp.data$Before.After=="After")
      mean.list[[paste("sp",k,sep=".")]]<-c(mean.list[[paste("sp",k,sep=".")]],
        mean(after.data$Density - before.data$Density))
    }
  }
}

#then turn it into a data frame to match back up to the right values
mean.data<-as.data.frame(mean.list)
mean.data$urchins<-as.factor(urchin.vec)
mean.data$lobsters<-as.factor(lobster.vec)

Ok, did that exhaust you as much as it did me? Now, here's how to get the same result using ddply (more on why it's called ddply in a moment):

#load plyr for ddply
library(plyr)

#a function where, if you give it a data frame with species 1 - gazillion,
#it will give you the mean change for each species.  Note, the which statement
#lets me look at before and after data separately
#Also note the Change= in the return value.  That gives a new column name -
#for multiple return columns, just return a vector, with each
#new column getting its own name

tab.means<-function(df){ return(c(Change =
  mean(df$Density[which(df$Before.After=="After")]
       - df$Density[which(df$Before.After=="Before")]))) }

#now use ddply
mean.data<-ddply(my.data, c("Urchins", "Lobsters", "Species"), tab.means)

#ok, in fairness, each species doesn't have its own column.  But, we can
#just use reshape
mean.data.wide<-reshape(mean.data, timevar="Species",
  idvar=c("Urchins", "Lobsters"), direction="wide")

Um. Whoa. Notice a small difference there? Now, perhaps the earlier code could have been cleaned up a bit with reshape, but still: 6 lines of code versus 18 – and most of those 18 were bookkeeping code, not something doing real, useful computation.

Soooo.... Check out plyr! It will make your life a happier, more relaxed place. Enjoy!

(p.s. I'm noticing wordpress is ignoring all of my code indenting - anyone have any quick fixes for that?)

when NOT to MANOVA

And now it's time for a multivariate stats geek out.

The statistics that we use determine the inferences we draw from our data. The more statistical tools you learn to use, the more likely you are to slip on a loose bit of data and stab yourself in the eyeball with your Swiss Army knife of p values. It’s happened to all of us, and it is rarely due to willful misconduct. Rather, it’s a misunderstanding, or even being taught something improperly.

I was given a sharp reminder of this recently while doing some work on how to break down some repeated-measures MANOVAs – useful for dealing with violations of sphericity, etc. However, I stumbled across this wonderful little ditty by Keselman et al. – Statistical Practices of Educational Researchers: An Analysis of their ANOVA, MANOVA, and ANCOVA Analyses (yes, as an ecologist, I love reading stats papers from other disciplines – I actually find it often more helpful than reading ones in my own discipline).

Now, everything in my own work was A-OK, but I found this note on the usage of MANOVA fascinating. (bolded phrases are mine)

In an overwhelming 84% (n = 66) of the studies, researchers never used the results of the MANOVA(s) to explain effects of the grouping variable(s). Instead, they interpreted the results of multiple univariate analyses. In other words, the substantive conclusions were drawn from the multiple univariate results rather than from the MANOVA. With the discovery of the use of such univariate methods, one may ask: Why were the MANOVAs conducted in the first place? Applied researchers should remember that MANOVA tests linear combinations of the outcome variables (determined by the variable intercorrelations) and therefore does not yield results that are in any way comparable with a collection of separate univariate tests.

Although it was not indicated in any article, it was surmised that researchers followed the MANOVA-univariate data analysis strategy for protection from excessive Type I errors in univariate statistical testing. This data analysis strategy may not be overly surprising, because it has been suggested by some book authors (e.g., Stevens, 1996, p. 152; Tabachnick & Fidell, 1996, p. 376). There is very limited empirical support for this strategy. A counter position may be stated simply as follows: Do not conduct a MANOVA unless it is the multivariate effects that are of substantive interest. If the univariate effects are those of interest, then it is suggested that the researcher go directly to the univariate analyses and bypass MANOVA. When doing the multiple univariate analyses, if control over the overall Type I error is of concern (as it often should be), then a Bonferroni (Huberty, 1994, p. 17) adjustment or a modified Bonferroni adjustment may be made (for a more extensive discussion on the MANOVA versus multiple ANOVAs issue, see Huberty & Morris, 1989). Focusing on results of multiple univariate analyses preceded by a MANOVA is no more logical than conducting an omnibus ANOVA but focusing on results of group contrast analyses (Olejnik & Huberty, 1993).

I, for one, was dumbstruck. This is EXACTLY why more than one of my stats teachers told me MANOVA was most useful. I have even advised others to do this myself – like the child of some statistically abusive parent. But really, if the point is controlling the overall Type I error, why not just do a Bonferroni or (my personal favorite) a False Discovery Rate correction on the univariate tests? To invoke the MANOVA like some arcane form of magic is disingenuous to your data. Now, if you’re interested in the canonical variables, and what they say, then by all means, do it! But if not, you really have to ask whether you’re following a blind recipe, or whether you understand what you are up to.
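And in R, that correction is a one-liner (with some made-up p values):

#adjust a pile of univariate p values instead of invoking MANOVA
p.vals <- c(0.001, 0.012, 0.030, 0.049, 0.20)

p.adjust(p.vals, method="bonferroni")   #Bonferroni correction
p.adjust(p.vals, method="fdr")          #Benjamini-Hochberg false discovery rate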

This paper is actually pretty brilliant in documenting a number of things like this that we scientists do with our ANOVA, ANCOVA, and MANOVA. It’s worth reading just for that, and to take a good sharp look in the statistical mirror.

Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical Practices of Educational Researchers: An Analysis of their ANOVA, MANOVA, and ANCOVA Analyses. Review of Educational Research, 68(3), 350-386. DOI: 10.3102/00346543068003350

Baby Got Stats!

I was completely tickled last year with the oh so amusing Statz Rappers. It kept me and my nerdy stats friends laughing for days. Rapping. Stats. The Internet.

Good times.

But little did I know that rapping about statistics was really just hitting its stride on youtube. This is Why We Plot began my trip down the rabbit hole. Quite a nice effort. I followed this with that Stats Rap – not bad, and oh so mellow. Also I love the equations.

But the pièce de résistance is Baby Got Stats from Dorry Segev in the biostats department at JHU. Oh. Dear. Lord. I laughed. I wept. I even put some of the lyrics into my .sig file. Enjoy.

(note, mildly nsfw? maybe?)

There are a ton more – just follow the related links. It’s kind of amazing. And if anyone gets a yen to start doing some multivariate SEM or Bayesian raps, I want to know!