# The Map of Science

Why does it take so long for awesome cutting-edge statistical techniques to make their way over to ecology? There are a myriad of techniques out there that have been around for 20, 30, 40, or more years that could keep so many ecologists from banging their heads into a wall over and over and over and…well, you get the point. But it takes quite a while for them to percolate over to us. Often, this is not for lack of user-friendly tools. Rather, it has to do with the connectivity of disciplines.

For example, I was having a lovely conversation with Jim Grace the other day about using Structural Equation Modeling for predictive purposes, and we ended up chatting a little about history. SEM as it is done currently – using maximum likelihood approaches to fit a model to a covariance or correlation matrix – really dates to the late 1960s and early 1970s. Before then, scientists in a number of disciplines used a wide variety of approaches to examine path models (à la Sewall Wright’s Path Analysis), or perform Factor Analysis, or approach other multivariate models that often included latent variables. These techniques were fairly heterogeneous, even though they attempted to do roughly similar things.

It took Karl Jöreskog’s wonderful papers outlining his LISREL technique and software using maximum likelihood to really bring the whole enterprise together into modern SEM.

And yet, despite the fact that this seminal work was published in the 70s, there are ecological papers well into the 90s that use piecewise regression models to fit path analyses. Why?

The answer can be summed up by this beautiful diagram detailing the connectivity of science in 2004 from the ever-interesting eigenfactor.org (and hat-tip to Jim for pointing it out to me).

Orange circles represent fields, with larger, darker circles indicating larger field size as measured by Eigenfactor score™. Blue arrows represent citation flow between fields. An arrow from field A to field B indicates citation traffic from A to B, with larger, darker arrows indicating higher citation volume. Image from eigenfactor.org.

Basically, these methods were developed for economics, and saw their first heavy use there and in sociology, political science, education, and psychology. In terms of connectivity, Ecology & Evolution sits on the other side of a doughnut hole of communication (with the occasional exception of psychology). Historically, the fields where the newest techniques are being developed are rarely examined by ecologists, and it is to our loss. Fortunately, I think this is a historical trend. With the rise of search engines, message boards, and copious mailing lists, I do wonder if a connectivity graph from 2004-2010 would be much tighter.

Connectivity can only be a boon for science. With environmental issues beginning to impinge on every endeavor, it has become more important than ever to survey the breadth of what is out there.

So, hey, sign up for alerts for a journal that you think will have no relevance to you. Who knows what might drop into your inbox.

# Viva la Neo-Fisherian Liberation Front!

p≤0.05

Significant p-values. For so many scientists using statistics, this is your lord. Your master. Heck, it has its own Facebook group filed under religious affiliations (ok, so, maybe I created that.) And it is a concept to whose slavish devotion we may have sacrificed a good bit of forward progress in science over the past half century. Time to blow up the cathedral! Or so say Stuart Hurlbert and Celia Lombardi in a recent fascinating review.

But first, for the uninitiated, what does it mean? Let’s say you’re running an experiment. You want to see whether fertilizer affects the growth rate of plants. You get a bunch of random plots, seed them, and add fertilizer to half of them. You then compare the mean growth rates of the two groups of plots. But are they really different? In essence, a p value gives you the probability that they are the same. And if it is very low, you can reject the idea that they are the same. Well, sort of.
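To make that "sort of" concrete, here is a minimal sketch of the fertilizer comparison as a permutation test in Python. The growth rates are made up for illustration, and `permutation_p_value` is a hypothetical helper, not a function from any package. It shuffles the plot labels many times and counts how often a shuffled difference in means is at least as large as the observed one. In other words, it asks how surprising the data would be if fertilizer did nothing.

```python
import random
from statistics import mean


def permutation_p_value(control, treated, n_shuffles=10_000, seed=1):
    """Two-sided permutation p value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_shuffles):
        # Under the null, treatment labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical growth rates (cm/week) for six plots per group; not real data.
control = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0]
fertilized = [2.8, 3.1, 2.5, 3.0, 2.7, 3.3]

p = permutation_p_value(control, fertilized)
print(f"p = {p:.4f}")
```

Note what the number is: the share of label shuffles that look at least as extreme as the real data, assuming no fertilizer effect. It is not the probability that the two groups are the same.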

A p value, as defined by the Patron Saint of Statistics for us experimental grunts, R. A. Fisher, is the probability of observing some result given that a hypothesis being tested is true. Or, if d=data, and h=a hypothesis, p(d|h) in symbolic language – | means given. Typically, this hypothesis being tested is a null hypothesis – that there is no difference between treatments, or the slope of a line is 0. However, note a few things about this tricky statement. 1) It is not the probability of accepting the hypothesis you’re trying to reject. 2) It makes no claims about any particular hypothesis being true. For all practical purposes, in the framework of testing a null hypothesis, however, a low p value means there is a very low probability that

OK. But what is this 0.05 thing all about? Well, p will range from 0 to 1. As formalized by Jerzy Neyman and Egon Pearson (no, not THAT Egon), the idea of Null Hypothesis Significance Testing (NHST) is one where the researcher establishes a critical value of p, called α. The researcher then tests the null statistical hypothesis of interest, and if p falls at or below α, the results are deemed ‘statistically significant’ – i.e. you can safely reject the null. By historical accident of old ideas, copyright, a little number rounding, a lack of computational power to routinely calculate exact p values in the 30s, and some early textbooks, 0.05 has become the standard for much of science.

Indeed, it is mother’s milk for any experimental scientist who has taken a stats course in the last 40+ years. It is enshrined in some journal publication policies. It is used for the quality control of a great deal of biomedical research. It is the result we hope and yearn for whenever we run an experiment.

It may also be a false god – an easy yes/no that can lead into the comfortable trap of not thinking critically about a problem. After all, if your test wasn’t “significant”, why bother with the results? This is a dangerous line of thinking. It can seriously retard scientific progress and certainly has led to all sorts of gerrymandering of statistical tests and datasets, or even adjusting α up to 0.10 or down to 0.01, depending on the desired result. Or, worse, scientists misreading the stats, and claiming that a REALLY low p value meant a REALLY large effect (seriously!) or that a very high value means that one can accept a null hypothesis.
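The "low p means large effect" mistake is easy to see with a little arithmetic. The sketch below is only an illustration: it uses a two-sample z-test and assumes the standard deviation is known, which is a simplification, and the effect sizes and sample sizes are invented. The helper name `two_sample_z_p` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p(diff, sd, n_per_group):
    """Two-sided p value for a mean difference, assuming a known common sd."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A tiny effect (0.05 sd) measured on a huge sample...
tiny_effect_big_n = two_sample_z_p(diff=0.05, sd=1.0, n_per_group=20_000)

# ...versus a large effect (1 sd) measured on a handful of plots.
large_effect_small_n = two_sample_z_p(diff=1.0, sd=1.0, n_per_group=5)

print(f"tiny effect, n = 20000 per group: p = {tiny_effect_big_n:.2g}")
print(f"large effect, n = 5 per group:    p = {large_effect_small_n:.2g}")
```

The tiny effect gets the far smaller p value, purely because of sample size. The p value speaks to the evidence against the null, not to how big the effect is.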

Scientists are, after all, only human. And are taught by other humans. And while they are trained in statistics, they are not statisticians themselves. All-too-human errors creep in.

Aside from reviewing a tremendous amount of literature, Hurlbert and Lombardi perhaps best sum up the case as follows – suppose you were to look at the results of two different statistical tests. In one, p=0.051. In the other, p=0.049. If we are going with the α=0.05 paradigm, then in one test we would not reject the null. In the other we would, and we would label the effect as ‘significant’.

Clearly, this is a little too arbitrary. H&L lay out a far more elegant solution – one that is being rapidly incorporated in many fields and has been advocated for some time in the statistical literature. It is as follows:

1) Report a p-value for a test.

2) Do not assign it significance, but rather refer to the level of support it gives for rejecting a null – strong, weak, moderate, practically non-existent. Make sure this statement of support is grounded in the design and power of the experiment. Suspend judgement on rejecting a null if the p value is high, as p-value testing is NOT the same as giving evidence FOR a null (something so many of us forget).

3) Use this in combination with other lines of evidence to draw a conclusion about a research hypothesis.

This neoFisherian Significance Assessment (NFSA) seems so simple, so elegant. And it puts the scientist back into the science in a way that NHST does not.

There have, of course, been other proposals. Many have advocated throwing out p values and reporting confidence intervals and effect sizes. This information can be invaluable, but CIs can often be p values in disguise. Effect sizes are great, but without an estimate of variability, they can be deceiving. Indeed, the authors argue that p value reporting is the way to assess the support for rejecting a null, but that the nuance with which it is done is imperative.
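Here is a small sketch of what "p values in disguise" means. It assumes a normal (z-based) interval around an estimate with a known standard error, which is a simplification, and the estimates are invented. The helper `z_summary` is hypothetical. For this kind of interval, a 95% CI excludes zero exactly when the two-sided p value falls below 0.05, so reading "does the CI cross zero?" is the same yes/no decision wearing a different hat.

```python
from statistics import NormalDist


def z_summary(estimate, se, level=0.95):
    """Return a z-based confidence interval and two-sided p value vs. zero."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    p = 2 * (1 - NormalDist().cdf(abs(estimate / se)))
    return ci, p


results = []
# Two hypothetical estimates that differ by a hair, same standard error.
for estimate in (0.19, 0.20):
    (lower, upper), p = z_summary(estimate, se=0.1)
    excludes_zero = lower > 0 or upper < 0
    results.append((excludes_zero, p))
    print(f"estimate={estimate:.2f}  CI=({lower:.3f}, {upper:.3f})  "
          f"p={p:.3f}  excludes zero: {excludes_zero}")
```

The interval still earns its keep when you read its width and location as a statement about plausible effect sizes, rather than only checking which side of zero it lands on.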

They also review several other alternatives and critiques – Bayesian ideas or information theoretic approaches, although I think there is some misunderstanding there that leads the authors to see conflict with their views where there actually is none. Still, it does not detract from their main message.

It should also be noted that this is one piece of a larger agenda by the authors to force scientists (and particularly ecologists) to rethink how we approach statistics. There’s another paper out there that demonstrates why one-tailed t-tests are the devil (the appendix of which is worth reading to see how conflicted even textbooks on the subject can be), and another is in review on why corrections for multiple hypothesis testing (e.g. Tukey tests and Bonferroni corrections) are in many cases quite unnecessary.

Strong stuff (although who would expect less from the man who gave us the clarion call against pseudoreplication). But intellectually, the arguments make a lot of sense. If anything, it forces a greater concentration on the weight of evidence, rather than a black-and-white situation. It puts the scientist back in the limelight, forcing them to build a case and apply their knowledge, skills, and creativity as a researcher.

Quite liberating. We shall see if it is adopted, and where it leads. I leave you with an excerpt from the concluding remarks.

We came along after the dust had settled, and have just tried to push over the last remaining structures of the old cathedral and to show the logic of the neoFisherian reformation. Most of the stone building blocks from the old cathedral were still of value. They just needed to be reassembled with fresh mortar by a new generation of scientists and statisticians to increase the guest capacity and beautify the gardens of the neoFisherian cottage.

Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46, 311-349.

# When NOT to MANOVA

And now it’s time for a multivariate stats geek out.

The statistics that we use determine the inferences we draw from our data. The more statistical tools you learn to use, the more likely you are to slip on a loose bit of data, and stab yourself in the eyeball with your swiss-army-knife of p values. It’s happened to all of us, and it is rarely due to willful misconduct. Rather, it’s a misunderstanding, or even being taught something improperly.

I was given a sharp reminder of this recently while doing some work on how to break down some repeated measures MANOVAs – useful for dealing with violations of sphericity, etc. However, I came across this wonderful little ditty by Keselman et al – Statistical Practices of Educational Researchers: An Analysis of their ANOVA, MANOVA, and ANCOVA Analyses (yes, as an Ecologist, I love reading stats papers from other disciplines – I actually find it often more helpful than reading ones in my own discipline).

Now, everything in my own work was A-OK, but I found this note on the usage of MANOVA fascinating. (bolded phrases are mine)

In an overwhelming 84% (n = 66) of the studies, researchers never used the results of the MANOVA(s) to explain effects of the grouping variable(s). Instead, they interpreted the results of multiple univariate analyses. In other words, the substantive conclusions were drawn from the multiple univariate results rather than from the MANOVA. With the discovery of the use of such univariate methods, one may ask: Why were the MANOVAs conducted in the first place? Applied researchers should remember that MANOVA tests linear combinations of the outcome variables (determined by the variable intercorrelations) and therefore does not yield results that are in any way comparable with a collection of separate univariate tests.

Although it was not indicated in any article, it was surmised that researchers followed the MANOVA-univariate data analysis strategy for protection from excessive Type I errors in univariate statistical testing. This data analysis strategy may not be overly surprising, because it has been suggested by some book authors (e.g., Stevens, 1996, p. 152; Tabachnick & Fidell, 1996, p. 376). There is very limited empirical support for this strategy. A counter position may be stated simply as follows: Do not conduct a MANOVA unless it is the multivariate effects that are of substantive interest. If the univariate effects are those of interest, then it is suggested that the researcher go directly to the univariate analyses and bypass MANOVA. When doing the multiple univariate analyses, if control over the overall Type I error is of concern (as it often should be), then a Bonferroni (Huberty, 1994, p. 17) adjustment or a modified Bonferroni adjustment may be made (for a more extensive discussion on the MANOVA versus multiple ANOVAs issue, see Huberty & Morris, 1989). Focusing on results of multiple univariate analyses preceded by a MANOVA is no more logical than conducting an omnibus ANOVA but focusing on results of group contrast analyses (Olejnik & Huberty, 1993).

I, for one, was dumbstruck. This is EXACTLY why more than one of my stats teachers have told me MANOVA was most useful. I even have advised others to do this myself – like the child of some statistically abusive parent. But really, if the point is controlling for type I error, why not do a Bonferroni or (my personal favorite) a False Discovery Rate correction? To invoke the MANOVA like some arcane form of magic is disingenuous to your data. Now, if you’re interested in the canonical variables, and what they say, then by all means! Do it! But if not, you really have to ask whether you’re following a blind recipe, or if you understand what you are up to.
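For the curious, here is a rough sketch of those two corrections applied to a set of univariate p values. The p values are invented, standing in for five separate ANOVAs on five response variables, and the function names are hypothetical. The False Discovery Rate version follows the Benjamini-Hochberg step-up procedure, which is the flavor I am assuming here; other FDR methods exist.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (FDR) adjusted p values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values from five univariate tests; not real results.
p_values = [0.001, 0.012, 0.021, 0.04, 0.30]

bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

for raw, b, f in zip(p_values, bonf, bh):
    print(f"raw={raw:.3f}  Bonferroni={b:.3f}  BH={f:.3f}")
```

Bonferroni controls the chance of even one false positive across the family of tests, so it is the more conservative of the two. Benjamini-Hochberg controls the expected share of false positives among the tests you flag, which is why its adjusted values are never larger than Bonferroni's.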

This paper is actually pretty brilliant in documenting a number of things like this that we scientists do with our ANOVA, ANCOVA, and MANOVA. It’s worth reading just for that, and to take a good sharp look in the statistical mirror.

H. J. Keselman, C. J. Huberty, L. M. Lix, S. Olejnik, R. A. Cribbie, B. Donahue, R. K. Kowalchuk, L. L. Lowman, M. D. Petoskey, J. C. Keselman, J. R. Levin (1998). Statistical Practices of Educational Researchers: An Analysis of their ANOVA, MANOVA, and ANCOVA Analyses. Review of Educational Research, 68 (3), 350-386. DOI: 10.3102/00346543068003350