Here a Tau, there a Tau… Plotting Quantile Regressions

I’ve ended up digging into quantile regression a bit lately (see this excellent gentle introduction to quantile regression for ecologists [pdf] for what it is and some great reasons to use it – see also here and here). In R this is done via the quantreg package, which is pretty nice and has some great plotting diagnostics, etc. But what it doesn’t have out of the box is a way to simply plot your data and then overlay quantile regression lines at different levels of tau.

The documentation has a nice example of how to do it, but it’s long tedious code. And I had to quickly whip up a few plots for different models.

So, meh, I took the tedious code and wrapped it into a quickie function, which I drop here for your delectation. Unless you have some better, fancier way to do it (which I’d love to see – especially for ggplot…).

Here’s the function:

quantRegLines <- function(rq_obj, lincol = "red", ...){
  #get the taus
  taus <- rq_obj$tau
  
  #get the predictor (assumes a model with an intercept and
  #a single predictor, so column 2 of the model matrix is x)
  x <- rq_obj$x[,2]
  xx <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), 1)
  
  #calculate predicted y over all taus
  f <- coef(rq_obj)
  yy <- cbind(1, xx) %*% f
  
  #recycle a single color across all taus
  if(length(lincol) == 1) lincol <- rep(lincol, length(taus))
  
  #plot one line per tau
  for(i in 1:length(taus)){
    lines(xx, yy[,i], col = lincol[i], ...)
  }
}

And an example use.

library(quantreg)
library(scales) #for alpha()

data(engel)
attach(engel)

taus <- c(0.05, 0.1, 0.25, 0.75, 0.9, 0.95)
plot(income, foodexp, xlab = "Household Income",
     ylab = "Food Expenditure",
     pch = 19, col = alpha("black", 0.5))

rq_fit <- rq(foodexp ~ income, tau = taus)

quantRegLines(rq_fit)

Oh, and I set it up to make pretty colors in plots, too.

plot(income, foodexp, xlab = "Household Income", 
    ylab = "Food Expenditure", 
    pch = 19, col = alpha("black", 0.5))

quantRegLines(rq_fit, rainbow(6))
legend(4000, 1000, legend = taus, col = rainbow(6), lty = 1, title = "Tau")
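
For the ggplot2 crowd, a rough equivalent (just a sketch, not part of the function above) is ggplot2’s built-in geom_quantile()/stat_quantile(), which calls quantreg::rq under the hood:

library(ggplot2) #geom_quantile() fits quantreg::rq internally

ggplot(engel, aes(x = income, y = foodexp)) +
  geom_point(alpha = 0.5) +
  geom_quantile(quantiles = taus, colour = "red") +
  labs(x = "Household Income", y = "Food Expenditure")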

All of this is in a repo over at github (natch), so, fork and play.

More on Bacteria and Groups

Continuing with bacterial group-a-palooza

I followed Ed’s suggestions and tried both a binomial distribution and a Poisson distribution for abundance, such that the probability of the density of one species s from group g in one plot r, where there are S_g species in group g, is

A_{rgs} \sim \mathrm{Poisson}\left(\frac{A_{rg}}{S_g}\right)
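
As a rough sketch of that calculation (and emphatically not the actual analysis code – it assumes a plots × species abundance matrix abund and a vector groups assigning each species to a group), the Poisson log-likelihood and AIC of a grouping could be computed like this:

#Sketch only: Poisson log-likelihood of a grouping, following
#A_rgs ~ Poisson(A_rg / S_g) above
#abund: plots x species matrix of abundances
#groups: vector assigning each species (column) to a group
groupPoisLL <- function(abund, groups) {
  ll <- 0
  for (g in unique(groups)) {
    sp <- which(groups == g)                   #species in group g
    S_g <- length(sp)
    A_rg <- rowSums(abund[, sp, drop = FALSE]) #group abundance in each plot
    #each species in the group gets mean A_rg / S_g in its plot
    ll <- ll + sum(dpois(abund[, sp, drop = FALSE],
                         lambda = A_rg / S_g, log = TRUE))
  }
  ll
}

#AIC, assuming one abundance parameter per group per plot
groupPoisAIC <- function(abund, groups) {
  -2 * groupPoisLL(abund, groups) +
    2 * length(unique(groups)) * nrow(abund)
}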

In the analysis I’m doing, interestingly, the results do change a bit, such that the original network-only results are confirmed.

I am seeing one funny thing, though, which I can’t lock down. Namely, the no-group option always has the lowest AIC once I include abundances – and this is true for both the binomial and Poisson distributions. Not sure what that is about. I’ve put the code for all of this here and made a sample script below. This doesn’t reproduce the behavior, but, still – not quite sure what this blip is about.

For the sample script, we have five species and three possible grouping structures. It looks like this, where red nodes are species or groups and blue nodes are sites:

[Figure: the three grouping structures, drawn as networks with red species/group nodes connected to blue site nodes]

And the data look like this – the first three columns give each species’ group ID under the low, medium, and high aggregation schemes, and the last three columns give its abundance in each of the three sites:

  low med high  1   2   3
1   1   1    1 50   0   0
2   2   1    1 45   0   0
3   3   2    2  0 100   1
4   4   2    2  0 112   7
5   5   3    2  0  12 110

So, here’s the code:

And the results:

> aicdf
     k LLNet LLBinomNet  LLPoisNet   AICpois  AICbinom AICnet
low  5     0    0.00000  -20.54409  71.08818  60.00000     30
med  3     0  -18.68966  -23.54655  65.09310  73.37931     18
high 2     0 -253.52264 -170.73361 353.46723 531.04527     12

We see that the two different estimations disagree, with the binomial favoring disaggregation and the Poisson favoring moderate aggregation. Interesting. Also, the naive network-only approach favors complete aggregation. Interesting. Thoughts?

Filtering Out Exogenous Pairs of Variables from a Basis Set

Sometimes in an SEM for which you're calculating a test of D-Separation, you want all exogenous variables to covary. If you have a large model with a number of exogenous variables, coding that into your basis set can be a pain, and you can wind up spending a lot of time filtering out independence claims between exogenous variables that shouldn't be in your basis set, particularly with the ggm library. Here's a solution – a function I'm calling filterExoFromBasiSet:


#Takes a basis set list from basiSet in ggm and a vector of exogenous variable names

filterExoFromBasiSet <- function(set, exo) {
    #the first two elements of each claim are the pair of variables being tested
    pairSet <- t(sapply(set, function(alist) cbind(alist[1], alist[2])))
    colA <- which(pairSet[, 1] %in% exo)
    colB <- which(pairSet[, 2] %in% exo)

    #claims where *both* members of the pair are exogenous
    both <- c(colA, colB)
    both <- unique(both[which(duplicated(both))])

    #if no claims are between two exogenous variables, return the set unchanged
    if (length(both) == 0) return(set)

    set[-both]
}

How does it work? Let's say we have the following model:

y1 <- x1 + x2

Now, we should have no basis set. But…

library(ggm)

modA <- DAG(y1 ~ x1 + x2)
basiSet(modA)
## [[1]]
## [1] "x2" "x1"

Oops – there's a basis set! Now, instead, let's filter it

basisA <- basiSet(modA)
filterExoFromBasiSet(basisA, c("x1", "x2"))
## list()

Yup, we get back an empty list.

This function can come in handy. For example, let's say we're testing a model with an exogenous variable that does not connect to an endogenous variable, such as

y <- x1
x2 (which is exogenous)

Now –


modB <- DAG(y ~ x1, 
               x2 ~ x2)

basisB <- basiSet(modB)
filterExoFromBasiSet(basisB, c("x1", "x2"))
## [[1]]
## [1] "x2" "y"  "x1"

So, we have the correct basis set with only one element.

What about if we also have an endogenous variable that has no paths to it?


modC <- DAG(y1 ~ x1, 
               x2 ~ x2, 
               y2 ~ y2)

basisC <- basiSet(modC)

filterExoFromBasiSet(basisC, c("x1", "x2"))
## [[1]]
## [1] "y2" "x2"
## 
## [[2]]
## [1] "y2" "x1"
## 
## [[3]]
## [1] "y2" "y1" "x1"
## 
## [[4]]
## [1] "x2" "y1" "x1"

This yields the correct 4 element basis set.

Extracting p-values from different fit R objects

Let's say you want to extract a p-value from a linear or generalized linear model – mixed or not! – and save it as a variable for future use. This is something you might want to do if, say, you were calculating Fisher's C from an equation-level Structural Equation Model. Here's how to extract the effect of a variable from several different kinds of fit models. We'll start with a data set with x, y, z, and a block effect (we'll see why in a moment).
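
For reference, Fisher's C combines the p-values from the k independence claims in your basis set as

C = -2\sum_{i=1}^{k}\ln(p_i)

and is compared against a \chi^2 distribution with 2k degrees of freedom (Shipley 2009) – hence the need to pull individual p-values out of each fit model.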


x <- rep(1:10, 2)
y <- rnorm(20, x, 3)
block <- c(rep("a", 10), rep("b", 10))

mydata <- data.frame(x = x, y = y, block = block, z = rnorm(20))

Now, how would you extract the p-value for the parameter fit for z from a linear model object? Simply put, use the coefficient table from the lm object's summary:

alm <- lm(y ~ x + z, data = mydata)

summary(alm)$coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.1833     1.3496  0.8768 0.392840
## x             0.7416     0.2190  3.3869 0.003506
## z            -0.4021     0.8376 -0.4801 0.637251

# Note that this is a matrix.  
# The third row, fourth column is the p value
# you want, so...

p.lm <- summary(alm)$coefficients[3, 4]

p.lm
## [1] 0.6373

That's a linear model – what about a generalized linear model?

aglm <- glm(y ~ x + z, data = mydata)

summary(aglm)$coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.1833     1.3496  0.8768 0.392840
## x             0.7416     0.2190  3.3869 0.003506
## z            -0.4021     0.8376 -0.4801 0.637251

# Again, is a matrix.  
# The third row, fourth column is the p value you
# want, so...

p.glm <- summary(aglm)$coefficients[3, 4]

p.glm
## [1] 0.6373

That's a generalized linear model – what about a nonlinear model fit with least squares via nls?


anls <- nls(y ~ a * x + b * z, data = mydata, 
     start = list(a = 1, b = 1))

summary(anls)$coefficients
##   Estimate Std. Error t value  Pr(>|t|)
## a   0.9118     0.1007   9.050 4.055e-08
## b  -0.4651     0.8291  -0.561 5.817e-01

# Again, is a matrix.  
# The second row, fourth column is the p value you
# want, so...

p.nls <- summary(anls)$coefficients[2, 4]

p.nls
## [1] 0.5817

Great. Now, what if we were running a mixed model? First, let's look at the nlme package. Here, the relevant part of the summary object is the tTable

library(nlme)
alme <- lme(y ~ x + z, random = ~1 | block, data = mydata)

summary(alme)$tTable
##               Value Std.Error DF t-value  p-value
## (Intercept)  1.1833    1.3496 16  0.8768 0.393592
## x            0.7416    0.2190 16  3.3869 0.003763
## z           -0.4021    0.8376 16 -0.4801 0.637630

# Again, is a matrix.  
# But now the third row, fifth column is the p value
# you want, so...

p.lme <- summary(alme)$tTable[3, 5]

p.lme
## [1] 0.6376

Last, what about lme4? Now, for a linear lmer object, you cannot get a p-value. But if this is a generalized linear mixed model, you are good to go (as in Shipley 2009). Let's try that here.

library(lme4)

almer <- lmer(y ~ x + z + (1 | block), data = mydata)

# no p-value column!
summary(almer)@coefs
##             Estimate Std. Error t value
## (Intercept)   1.1833     1.3496  0.8768
## x             0.7416     0.2190  3.3869
## z            -0.4021     0.8376 -0.4801

# but, for a generalized linear mixed model
# and yes, I know this is a
# bad model but, you know, demonstration!

aglmer <- lmer(y + 5 ~ x + z + (1 | block), 
        data = mydata, family = poisson(link = "log"))

summary(aglmer)@coefs
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  1.90813    0.16542  11.535 8.812e-31
## x            0.07247    0.02471   2.933 3.362e-03
## z           -0.03193    0.09046  -0.353 7.241e-01

# matrix again!  Third row, fourth column
p.glmer <- summary(aglmer)@coefs[3, 4]

p.glmer
## [1] 0.7241

A Quick Note on Weighting with nlme

I’ve been doing a lot of meta-analytic things lately. More on that anon. But one quick thing that came up was variance weighting with mixed models in R, and after a few web searches I wanted to post this, more as a note to self (and to others) than anything. Now, in a simple linear model, weighting by variance or sample size is straightforward.

#variance
lm(y ~ x, data = dat, weights = 1/v)

#sample size
lm(y ~ x, data = dat, weights = n)

You can use the same sort of weights argument with lmer. But what if you’re using nlme? There are reasons to do so. Things change a bit, as nlme uses a wide array of weighting functions for the variance to give it some wonderful flexibility – indeed, that flexibility is a reason to use nlme in the first place! But for such a simple case, here’s the tricky little difference needed to get the equivalent of the above. I’m using gls, generalized least squares, but this should work for lme as well.

#variance
gls(y ~ x, data=dat, weights = ~v)

#sample size
gls(y ~ x, data = dat, weights = ~1/n)
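
For the record, a one-sided formula handed to weights in gls or lme is just shorthand for nlme’s varFixed() variance function, so (assuming dat has columns v and n as above) the same models can be spelled out explicitly:

#variance
gls(y ~ x, data = dat, weights = varFixed(~v))

#sample size - precompute the covariate to keep the formula simple
dat$inv_n <- 1/dat$n
gls(y ~ x, data = dat, weights = varFixed(~inv_n))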

OK, end note to self. Thanks to John Griffin for prompting this.

Why I’m Teaching Computational Data Analysis for Biology

This is a cross-post from the blog I’ve set up for my course. As my first class at UMB, I’m teaching An Introduction to Computational Data Analysis for Biology – basically mixing the teaching of statistics and basic programming. It’s something I’ve thought about teaching for a long time – although watching the rubber meet the road has been fascinating.

As part of the course, I’m adapting an exercise that I learned while taking English courses – in particular from a course on Dante’s Divine Comedy. I ask that students write 1 page weekly to demonstrate that they are having a meaningful interaction with the material. I give them a few pages from this book as a prompt, but really they can write about anything. One student will post on the blog per week (and I’m encouraging them to use the blog for posting other materials as well – we shall see, it’s an experiment). After they post, I hope that it will start a conversation, at least amongst participants in the class. I also think this post might pair well with some of Brian McGill’s comments on statistical machismo to show you a brief sketch of my own evolution as a data analyst.

I’ll be honest, I’m excited. I’m excited to be teaching Computational Data Analysis to a fresh crop of graduate students. I’m excited to try and take what I have learned over the past decade of work in science, and share that knowledge. I am excited to share lessons learned and help others benefit from the strange explorations I’ve had into the wild world of data.

I’m ready to get beyond the cookbook approach to data. When I began learning data analysis, way back in an undergraduate field course, it was all ANOVA all the time (with brief diversions to regression or ANCOVA). There was some point-and-click software that made it easy, so long as you knew the right recipe for the shape of your data. The more complex the situation, the more creative you had to be in getting an accurate sample, and then in determining the right incantation of sums of squares to get a meaningful test statistic. And woe betide you if the p-value from your research was 0.051.

I think I enjoyed this because it was fitting a puzzle together. That, and I love to cook, so, who doesn’t want to follow a good recipe?

Still, there was something that always nagged me. This approach – which I encountered again and again – seemed stale. The body of analysis was beautiful, but it seemed divorced from the data sets that I saw starting to arrive on the horizon – data sets that were so large, or so chock-full of different variables, that something seemed amiss.

The answer rippled over me in waves. First, comments from an editor – Ram Meyers – on a paper of mine began to lift the veil. I had done all of my analyses as taught (and indeed even used for a class) using ANOVA, regression, multiple comparisons, etc., in the classic vein. Ram asked why, particularly given that the biological processes that generated my data should in no way generate something with a normal – or even log-normal – distribution. While he agreed that the approximation was good enough, he made me go back and jump off the cliff into the world of generalized linear models. It was bracing. But he walked me through it – over the phone, even.

So, a new recipe, yes? But it seemed like something more was looming.

Then the expiration of a JMP site license, with one week left on a paper revision, left me bereft. The only free tool I could turn to that seemed to do what I wanted it to do was R.

Wonderful, messy, idiosyncratic R.

I jumped in and learned the bare minimum of what I needed to know to do my analysis…and lost myself.

I had taken computer science in college, and even written the backend of a number of websites in PERL (also wonderful, messy, and idiosyncratic). What I enjoyed most about programming was that you could not hide from how you manipulated information. Programming has a functional aspect at the core where an input must be translated into a meaningful output according to the rules that you craft.

Working with R, I was crafting rules to generate meaningful statistical output. But what were those rules but my assumptions about how nature worked? The fundamentals of what I was doing all along – fitting a line to data with an error distribution that should be based in biology, not arbitrary assumptions – were laid all the more bare. Some grumblingly lovely help from the statistical denizens of the R help boards brought this into sharp focus.

So, I was ready when, for whatever reason, fate thrust me into a series of workshops on Bayesian statistics, AIC analysis, hierarchical modeling, time series analysis, data visualization, meta-analysis, and last – Structural Equation Modeling.

I was delighted to learn more and more of how statistical analysis had grown beyond what I had been taught. I drank deeply of it. I know, that’s pretty nerdy, but, there you have it.

The new techniques all shared a common core – they were engines of inference about biological processes. How I, as the analyst, made assumptions about how the world worked was up to me. Once I had a model of how my system worked in mind – sketched out, filled with notes on error distributions, interactions, and more – I could sit back and think about which inferential tools would give me the clearest answers I needed.

Instead of finding the one right recipe in a giant cookbook, I had moved to choosing the right tools out of a toolbox – and then using the tools of computer science (optimizing algorithms, thinking about efficient data storage, and so on) to let those tools bring data and biological models together.

It’s exciting. And that’s the core philosophy I’m trying to convey in this semester. (N.B. the spellchecker tried to change convey to convert – there’s something there).

Think about biology. Think about a line. Think about a probability distribution. Put them together, and find out what stories your data can tell you about the world.

Missing my Statsy Goodness? Check out #SciFund!

I know, I know, I have been kinda lame about posting here lately. But that’s because my posting muscle has been focused on the new analyses for what makes a successful #SciFund proposal. I’ve been posting them at the #SciFund blog under the Analysis tag – so check it out. There’s some fun stats, and you get to watch me be a social scientist for a minute. Viva la interdisciplinarity!

Running R2WinBUGS on a Mac Running OSX

I have long used JAGS to do all of my Bayesian work on my mac. Early on, I tried to figure out how to install WinBUGS and OpenBUGS and their accompanying R libraries on my mac, but, to no avail. I just had too hard of a time getting them running and gave up.

But, it would seem that some things have changed with Wine lately, and it is now possible to not only get WinBUGS itself running nicely on a mac, but to also get R2WinBUGS to run as well. Or at least, so I have discovered after an absolutely heroic (if I do say so myself) effort to get it all running (this was to help out some students I’m teaching who wanted to be able to do the same exercises as their windows colleagues). So, I present the steps that I’ve worked out. I do not promise this will work for everyone – and in fact, if it fails at some point, I want to know about it so that perhaps we can fix it so that more people can get WinBUGS up and running.

Or just run JAGS (step 1: install the latest version, step 2: install rjags in R, modify your code slightly, run it, be happy).

So, this tutorial works to get the whole WinBUGS shebang running. Note that it hinges on installing the latest development version of Wine, not the stable version (at least as of 1/17/12). If you have previously installed wine using macports, good on you. Now uninstall it with “sudo port uninstall wine”. Otherwise, you will not be able to do this.

Away we go!

1) Have the free version of Xcode installed from http://developer.apple.com/xcode/. You may have to sign up for an Apple developer account. Whee! You’re a developer now!

2) Have X11 installed from your system install disc.

3) Install MacPorts from http://www.macports.org/install.php using the package installer. See also here for more information. Afterwards, open the terminal and type

echo export PATH=/opt/local/bin:/opt/local/sbin:\$PATH$'\n'export MANPATH=/opt/local/man:\$MANPATH | sudo tee -a /etc/profile

You will be asked for your password. Don’t worry that it doesn’t display anything as you type. Press enter when you’ve finished typing your password.

4) Open your terminal and type

sudo port install wine-devel

5) Go have a cup of coffee, check Facebook, or whatever you do while the install chugs away.

6) Download WinBUGS 1.4.x from here. Also download the immortality key and the patch.

7) Open your terminal, and type

cd Downloads
wine WinBUGS14.exe

Note, if you have changed your download directory, you will need to type in the path to the directory where you download files now (e.g., Desktop).

8) Follow the instructions to install WinBUGS into C:\Program Files.

9) Run WinBUGS via the terminal as follows:

wine ~/.wine/drive_c/Program\ Files/WinBUGS14/WinBUGS14.exe

10) After first running WinBUGS, install the immortality key. Close WinBUGS. Open it again as above and install the patch. Close it. Open it again and WinBUGS away!

11) To now use R2WinBUGS, fire up R and install the R2WinBUGS library.

12) R2WinBUGS should now work normally, with one exception. When you use the bugs function, you will need to supply the following additional argument:

bugs.directory='/Users/YOURUSERNAME/.wine/drive_c/Program Files/WinBUGS14'

filling in your username where indicated. If you don’t know it, in the terminal type

ls /Users

No, ~ will not work for those of you used to it. Don’t ask me why.
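
Putting it together, a hypothetical call (with made-up data, inits, and model file names, purely to show where the argument goes) might look something like this:

library(R2WinBUGS)

sim <- bugs(data = my_data_list, inits = my_inits,
            parameters.to.save = c("beta", "sigma"),
            model.file = "my_model.txt",
            n.chains = 3, n.iter = 5000,
            bugs.directory = "/Users/YOURUSERNAME/.wine/drive_c/Program Files/WinBUGS14")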

Seeing Through the Measurement Error

I am part of an incredibly exciting project – the #SciFund Challenge. #SciFund is an attempt to have scientists link their work to the general public through crowdfunding. As I’m one of the organizers, I thought I should have some skin in the game. But what skin?

Well, people are pitching some incredibly sexy projects – tracking puffin migrations, coral reef conservation, snake-squirrel interactions (WITH ROBOSQUIRRELS!), mathematical modeling of movements like Occupy Wall Street, and many many more. It’s some super sexy science stuff.

So what is my project going to address? Measurement error.

WOOOOOOOOO MEASUREMENT ERROR!

But wait, before you roll your eyes at me, this is REALLY IMPORTANT. Seriously!

It can change everything we know about a system!

I’m working with a 30-year data set from the Channel Islands National Park – 30 years of divers going out and counting everything in those forests to see what’s there. They’ve witnessed some amazing change – El Niños, changes in fishing pressure, changes in urbanization on the coast, and more. It’s perhaps the best long-term, large-scale, full-community subtidal data set in existence (and if there are better ones out there, um, send ‘em my way, because I want to work with them!)

But 30 years – that’s a lot of different divers working on this data set under a ton of different environmental conditions. Doing basic sampling on SCUBA is arduous, and given the sometimes crazy environmental conditions, there is probably some small amount of variation in the data due to processes other than biology. To demonstrate this to a general non-statistical audience, I created the following video. Enjoy watching me in my science film debut…oh dear.

OK, my little scientific audience. You might look at this and think, meh, 30 years of data, won’t any measurement error due to those kinds of conditions or differences in the crew going out to do these counts just average out? With so much data, it shouldn’t be important! Jarrett just wanted an excuse to make a silly science video!

And that’s where you may well be wrong (well, about the data part, anyway). I’ve been working with this data for a long time, and one of my foci has been to try and tease out the signals of community processes, like the relative importance of predation and grazing versus nutrients and habitat provision. Your basic top-down bottom-up kind of thing. While early models showed, yep, they’re both important, and here’s how and why, some rather strident reviewer comments came back and forced me to rethink the models, adding in a great deal more complexity even to the simplest one.

And this is where measurement error became important. Measurement error can obscure the signal of important processes in complex models. A process may be there, may be important in your data, but if you’re not properly controlling for measurement error it can hide real biological patterns.

For example, below is a slice of one model done with two different analyses. I’m looking at whether there are any relationships between predators, grazers, and kelp. On the left hand side, we have the results from the fit model without using calibration data to quantify measurement error. While it appears that there is a negative relationship between grazers and kelp, there is no detectable relationship between predators and grazers (hence the dashed line – it ain’t different from 0).

This is because there is so much extra variation in records of grazer abundances due to measurement error that we cannot see the predator -> grazer relationship.

Now let’s consider the model on the right. Here, I’ve assumed that 10% of the variation in the data is due to measurement error (i.e., an R2 of 0.9 between observed and actual grazer abundances). So, I have “calibration” data. This error rate is made up, just to show the consequences of folding the error into our analysis.
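
In SEM terms, one standard way to encode that kind of calibration information (though not necessarily exactly how it’s implemented here) is to fix the error variance of the observed grazer variable rather than estimate it:

Var_{error} = (1 - R^2) \times Var_{observed} = 0.1 \times Var_{observed}

so the model “knows” up front how much of the observed variance is noise.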

Just folding in this very small amount of measurement error, we get a change in the results of the model. We can now see a negative relationship between predators and grazers.

I need this calibration data to ensure that the results I’m seeing in my analyses of this incredible 30 year kelp forest data set are real, and not due to spurious measurement error. So I’m hoping wonderful folk like you (or people you know – seriously, forward http://scifund.rockethub.com around to everyone you know! RIGHT NOW!) will see the video, read the project description, and pitch in to help support kelp forest research.

If we’re going to use a 30 year data set to understand kelp forests and environmental change, we want to do it right, so, let’s figure out how much variation in the data is real, and how much is mere measurement error. It’s not hard, and the benefits to marine research are huge.