# Do Not Log-Transform Count Data, Bitches!

Onwards.

If you’re like me, when you learned experimental stats, you were taught to worship at the throne of the Normal Distribution. Always check your data and make sure it is normally distributed! Or, make sure that whatever lines you fit to it have normally distributed error around them! Normal! Normal normal normal!

And if you violate normality – say, you have count data with no negative values, and a normal linear regression would create situations where negative values are possible (e.g., what does it mean if you predict negative kelp! ah, the old dreaded nega-kelp), then no worries. Just log transform your data. Or square root. Or log(x+1). Or SOMETHING to linearize it before fitting a line and ensure the sacrament of normality is preserved.

This has led to decades of thoughtless transformation of count data by in-the-field ecologists, with no real consideration of the consequences.

But statistics has had a better answer for decades – generalized linear models (glm for R nerds, gzlm for SAS goombas who use proc genmod. What? I’m biased!) whereby one specifies a link function, so the mean can be a nonlinear function of the predictors, along with a corresponding non-normal error distribution. The canonical book on this was first published ’round 1983. Sure, one has to think more about the particular model and error distribution they specify, but, if you’re not thinking about these things in the first place, why are you doing science?

“But, hey!” you might say, “Glms and transformed count data should produce the same results, no?”

From first principles, Jensen’s inequality says no – consider the consequences for error of the transformation approach of log(y) = ax+b+error versus the glm approach y=e^(ax+b)+error. The first models the mean of log(y), the second models the log of the mean of y, and because log is a curved function those two are not the same thing. More importantly, the error distributions from generalized linear models may often be far far faaar more appropriate to the data you have at hand. For example, count data is discrete, and hence, a normal distribution will never be quite right. Better to use a Poisson or a negative binomial.
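
Here's a minimal sketch of the Jensen's inequality point, using nothing but base R. The true mean of 5 and the sample size are made-up values for illustration, not anything from the paper. It compares the mean of the logged counts to the log of the mean count, then back-transforms.

```r
set.seed(42)

# Hypothetical counts: Poisson with a true mean of 5 (an assumed value)
y <- rpois(10000, lambda = 5)

# What a log(y + 1) analysis estimates: the mean of the logs
mean_of_logs <- mean(log(y + 1))

# What you actually care about: the log of the mean (on the same +1 scale)
log_of_mean <- log(mean(y) + 1)

# Back-transforming the mean of the logs does not recover the mean count
backtransformed <- exp(mean_of_logs) - 1

c(sample_mean = mean(y), backtransformed = backtransformed)
```

Because log is concave, the mean of the logs sits below the log of the mean for any counts that vary at all, so the back-transformed value comes out below the sample mean.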

But, “Sheesh!”, one might say, “Come on! How different can these models be? I mean, I’m going to get roughly the same answer, right?”

O’Hara and Kotze’s paper takes this question and runs with it. They simulate count data from negative binomial distributions and look at the results from generalized linear models with negative binomial or quasi-poisson error terms (see here for the difference) versus a slew of transformations.
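
To get a feel for that kind of comparison, here's a rough sketch in R. This is not O'Hara and Kotze's actual simulation code: the sample size, coefficients, and theta below are assumed values, and it only looks at a single simulated data set. It uses `rnbinom()` from base R and `glm.nb()` from the MASS package.

```r
library(MASS)  # provides glm.nb(); rnbinom() ships with base R

set.seed(1)

# Assumed "truth" for illustration: log(mean) = 0.5 + 1.0 * x, theta = 2
n  <- 1000
x  <- runif(n, 0, 2)
mu <- exp(0.5 + 1.0 * x)
y  <- rnbinom(n, mu = mu, size = 2)  # size is the NB dispersion (theta)

# Two glms and one transformation-based fit
fit_nb  <- glm.nb(y ~ x)                       # negative binomial glm
fit_qp  <- glm(y ~ x, family = quasipoisson)   # quasi-Poisson glm
fit_log <- lm(log(y + 1) ~ x)                  # the log(y + 1) approach

# Compare the estimated coefficients with the assumed truth (0.5, 1.0)
coefs <- rbind(neg_binomial  = coef(fit_nb),
               quasi_poisson = coef(fit_qp),
               log_y_plus_1  = coef(fit_log))
coefs
```

Both glms use a log link, so their coefficients are directly comparable to the values used to simulate the data. The `log(y + 1)` fit is estimating something slightly different, which is rather the point.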

Intriguingly, they find that glms (with either distribution) always perform well, while each transformation performs poorly at some or all values.

Estimated root mean-squared error from six different models. Curves from the quasi-poisson model are the same as the negative binomial. Note that the glm lines (black solid) all hang out around 0 as opposed to the transformed fits.

More intriguing to me are the results regarding bias. Bias is the average deviation between a fitted parameter and its true value. Basically, it’s a measure of how wrong your answer is on average. Again, here they find almost no bias in the glms, but bias all over the charts for transformed fits.

Estimated mean biases from six different models, applied to data simulated from a negative binomial distribution. A low bias means that the method will, on average, return the 'true' value. Note that the bias for transformed fits is all over the place. But with a glm, bias is always minimal.
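
To make "bias" concrete, here's a small simulation sketch. Again, this is not the paper's code, and the true mean, theta, sample size, and number of replicates are all values chosen here for illustration. It repeatedly simulates negative binomial counts, estimates the log of the mean two ways, and averages the deviation from the truth.

```r
set.seed(2010)

# Assumed settings for illustration only
true_mu <- 5     # true mean count
theta   <- 2     # negative binomial dispersion
n       <- 100   # observations per simulated data set
n_sims  <- 500   # number of simulated data sets

# One replicate: simulate counts, estimate the log-mean with each method
one_sim <- function() {
  y <- rnbinom(n, mu = true_mu, size = theta)
  c(glm   = unname(coef(glm(y ~ 1, family = quasipoisson))),
    log1p = unname(coef(lm(log(y + 1) ~ 1))))
}

est <- replicate(n_sims, one_sim())  # 2 x n_sims matrix of estimates

# Bias on the log scale: average estimate minus the true log-mean
bias <- rowMeans(est) - log(true_mu)
bias
```

A method with low bias gives a value near zero here. Comparing the transformed fit against `log(true_mu)` is a simplification made for this sketch, so read the paper for how they set up their own comparison.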

They sum it up nicely:

> For count data, our results suggest that transformations perform poorly. An additional problem with regression of transformed variables is that it can lead to impossible predictions, such as negative numbers of individuals. Instead statistical procedures designed to deal with counts should be used, i.e. methods for fitting Poisson or negative binomial models to data. The development of statistical and computational methods over the last 40 years has made it easier to fit these sorts of models, and the procedures for doing this are available in any serious statistics package.

Or, more succinctly, “Do not log-transform count data, bitches!”

“But how?!” I’m sure some of you are saying. Well, after checking into some of the relevant literature, it’s quite straightforward.

Given the ease of implementing glms in languages like R, this is something easily within the grasp of the everyday ecologist: one uses the glm function, checks diagnostics of residuals to ensure compliance with model assumptions, and can then use likelihood ratio testing, akin to anova, with, well, the Anova function from the car package. Heck, you can even do posthocs with multcomp, although if you want to correct your p-values (and there are reasons to believe you shouldn’t), you need to carefully consider the correction type.
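
That workflow might look something like the sketch below. The data are simulated, and the treatment names and means are hypothetical. To keep it dependency-free it uses base R's `anova()` with `test = "Chisq"` for the likelihood ratio test, as a stand-in for `Anova` from car.

```r
set.seed(7)

# Hypothetical experiment: counts from three treatments, 30 replicates each
lambda_by_trt <- c(control = 2, low = 4, high = 8)  # assumed true means
dat <- data.frame(
  treatment = factor(rep(c("control", "low", "high"), each = 30))
)
dat$count <- rpois(nrow(dat),
                   lambda = lambda_by_trt[as.character(dat$treatment)])

# 1. Fit the glm with a Poisson error distribution (log link by default)
fit <- glm(count ~ treatment, family = poisson, data = dat)

# 2. A quick diagnostic: Pearson dispersion statistic.
#    Values well above 1 hint at overdispersion (consider quasi-Poisson
#    or negative binomial instead). plot(fit) gives the residual plots.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# 3. Likelihood ratio test for the treatment effect
lrt <- anova(fit, test = "Chisq")

dispersion
lrt
```

The dispersion statistic is only a rough first check, not a full diagnosis of model fit.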

For example, consider this data from survivorship on the Titanic (what, it’s in the multcomp documentation!) – although, granted, it’s looking at proportion survivorship, but, still, you’ll see how the code works:

```r
library(multcomp)

### set up all pair-wise comparisons
### (survival is a binary response here, hence family = binomial)
data(Titanic)
mod <- glm(Survived ~ Class, data = as.data.frame(Titanic), weights = Freq, family = binomial)

### specify all pair-wise comparisons among levels of variable "Class"
### Note, Tukey means the type of contrast matrix.  See ?contrMat
glht.mod <- glht(mod, mcp(Class = "Tukey"))

### summarize information
### applying the false discovery rate adjustment
### you know, if that's how you roll
summary(glht.mod, test = adjusted("fdr"))
```

There are then a variety of ways to plot or otherwise view glht output.

So, that's the nerdy details. In sum, though, the next time you see someone doing analyses with count data using simple linear regression or ANOVA with a log, sqrt, arcsine sqrt, or any other transformation, jump on them like a live grenade. Then, once the confusion has worn off, give them a copy of this paper. They'll thank you, once they're finished swearing.

O’Hara, R., & Kotze, D. (2010). Do not log-transform count data. Methods in Ecology and Evolution, 1(2), 118-122. DOI: 10.1111/j.2041-210X.2010.00021.x

## 26 thoughts on “Do Not Log-Transform Count Data, Bitches!”

1. Nice. You’re actually getting quite good at this, uh, witty blogging about statistics.

2. i think poisson glm has no error term in the linear model. the counts themselves are modeled directly rather than ‘error’

3. Just wanted to say: I randomly stumbled upon this, and as a stats challenged biologist (one of the legion) this was a nice easy and informative read. Thanks!

4. Hi, love the post! However, what does one do if one’s GLM is still overdispersed and one has tried poiss, quasipoiss & nb and is frustrated beyond belief??

6. Well, I think you can start by asking what is the appropriate error distribution for the type of problem you’re analyzing, and then go from there. Similarly, none of these are magic bullets for a model that is not appropriate to the type of data.

7. @Wei, that was indeed a simplification. The more direct way of putting it is
y_predicted = e^(ax+b)
y_observed ~ pois(y_predicted)

But, the notation of adding error is more familiar to much of our ecological audience.

8. Just wanted to drop in and say that I have sent this post all over Scripps. And am seriously thinking about how to teach my high school intern about glm.

9. HA! Viva la glm!

10. I’ve been all over the frickin’ internet to find the appropriate method for pairwise comparisons of factor levels in a quasipoisson GLM (overdispersed count data), and I think I’ve found it! (please tell me your Titanic example works for quasipoisson as well as binomial!) I must admit I skipped over your enthusiastically-titled page the first couple times, but I’m so glad I finally read through it. You’re my hero.

11. Glad to help, lqlqlq. The Titanic example is from the R help files. While I have not tried using glht with a glm using a quasipoisson distribution, I see no reason why it should not work. If you have doubts, feel free to drop a line to the r-help listserv. I’ve gotten great help on the ins and outs of the multcomp library there in the past.

12. @Lindsey, I was recently faced with this problem. In my case, the overdispersion was due to an abundance of zero-count responses. The solution my stats consultant suggested (which I was able to later find support for in an article in Ecological Modelling) was to first analyze all the data using a logistic regression and then do a regular glm conditional on the response being greater than 0. The first part tells you if explanatory variables predict whether or not your species or event of interest occurs, and the latter tells you about the relationship with the explanatory variables if you have the response of interest.

13. But can someone please develop a negative binomial model with cross-classified random effects? Please.

14. Thanks for the post. Maybe this will help my adviser stop using ANOVA’s with count data.

17. Thanks for this nice post, that a colleague just has sent me! While agreeing with all you say, there is one thing I am wondering about. Above you state “Given the ease of implementing glms in languages like R (one uses the glm function, checks diagnostics of residuals to ensure compliance with model assumptions, then…”. Does that mean you know how to interpret the diagnostic plots of a glm? If so, you would be the first person I meet that can do that and I would be very interested in learning about it. In my experience, the diagnostic plots of the glm look very similar and just as bad as the ones of a linear model on the untransformed data. From these plots I cannot find any assurance that the glm model is in fact appropriate. Any comment by you on this topic would be very well received. Thanks Sebastian

• Oh my god, that statement is so true. It is so easy to suggest the use of GLMs, but as you point out, no one really knows how to diagnose the fit. There is so much contradictory information out there.

18. I’m doing a PhD in Biostatistics focusing on the modelling of count data with non-normal distribution. Can someone propose to me which area can I focus on under this topic to make a good PhD research thesis

19. In your first plot of RMSE vs True mean, what are all the r-values? Thank you for any reply!

20. The second plot you have simulated data from six different models, with theta given over each plot (fit parameter) and true parameter on the x-axis. How come the bias is not through the roof for all lines when the difference between true and fitted parameters are becoming greater? Or am I just not understanding the plots.

My reference is wikipedia: “In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated”

• Eros – for both of your replies, this isn’t my paper. For far more detail, I’d urge you to go and read the paper that is linked to here. I think you’ll get a much better understanding that way. Feel free to email me if you cannot access it and I’ll send it your way.

21. Hi and thank you for the blog post – very helpful. However, what about negative numbers? Like, say, a percent change in bird densities among treatments? Poisson doesn’t jive with negative numbers. How does one proceed in this instance? I fear that adding a constant yields the same problem as transforming the data.

• Well, if you have negative numbers, then you’ve got a different distribution on your hands! So, often, a rethink is needed. I’ve lately become a fan of using weighting to accommodate for different variance structure in normal distributions, but there are a variety of options – see https://biol609.github.io/lectures/04_gls.html#/section for some options and examples.

22. Great, finally someone that speaks about statistics in an actual comprehensible way!! I am an eternal beginner in stats (meaning that I can learn things and do them right but then I forget everything and have to start from zero) but this I can understand.
Thank you so much!