Living a Dream: Back to SML

So, today, I’m going to catch a ferry out to the Shoals Marine Lab. I’m just going to be out for a day to meet with the undergrad intern I’m mentoring. I’ll be back later this summer to work with her and set up some permanent monitoring transects.

I have to be honest, this is one of those moments in my life where I am watching a dream come true. I went to SML in the summer of 1999 to take some classes. It changed everything for me. I cannot count the number of paths that opened up because of that summer, paths I have since run down, higgledy-piggledy. To return now as a mentor and researcher? To have the chance to really learn the secrets of the sea around the Isles of Shoals? To come back with new eyes after a decade of developing as a scientist? I’m having a hard time expressing my excitement and joy.

Rather than kvell any more about the place and my excitement, here’s a video from the participants of the Underwater Research Course at SML (which was also totally formative to who I am as a scientist – thanks, Jim!). I think it conveys a lot of what I could say, but in images and video that are much more telling.

Best Peer Review Experience Ever

So, I recently submitted a piece on the future of scholarly publishing in Ecology & Evolutionary Biology. Simultaneously with submission, I put up a preprint at PeerJ Preprints and also put the piece on Google Docs for line-by-line commentary (which you are welcome to give!). I asked in both places that commenters identify themselves, unless they felt deeply uncomfortable doing so.

OMG the experience has been amazing!

At PeerJ you can comment on the main page of the article, and others can rate it – which is fantastic – and I’ve gotten some wonderful feedback there (thanks, Lars!).

The Google Doc experience has been even more fascinating, given the ability to put in line-by-line comments.

One of our reviewers is using the Google Doc for their comments. It has made it easy to see what they are saying, respond to things that I think are relevant (or just change some of the text in the next draft for bigger changes), and have an interactive experience with the reviewer. It’s absolutely fabulous.

I’ve been really fascinated by the idea of how collaboration can improve peer review ever since reading Leek et al.’s 2011 piece, Cooperation between Referees and Authors Increases Peer Review Accuracy. I’m delighted that one of our reviewers has embraced that ethos; in so doing, I can see how much this would help with future publications if everyone, not just Ross Mounce, embraced this model. Very cool!

A Preprint Experiment: Four Pillars and a Foundation for the Future of Scholarly Publishing

x-post from the OpenPub Project blog

So, we got together and had two working group meetings to discuss the future of scholarly publishing in Ecology, Evolutionary Biology, and the Earth and Ocean Sciences. What were we thinking that entire time?

We’ve just submitted a piece that brings together our broad ideas (some of which have been seen before), but, simultaneously with submission, we’ve also decided to put up a preprint. Why? Simply put, immediate access is one of our four pillars of the future of scholarly publication. Once you feel something is ready for public consumption, put it out there! We’ve been delighted to watch the evolution of PeerJ Preprints, so we’ve placed our piece there.

Byrnes et al. (2013) The four pillars of scholarly publishing: The future and a foundation. PeerJ PrePrints 1:e11 http://dx.doi.org/10.7287/peerj.preprints.11

This immediate access to the piece goes hand in hand with another of our four pillars: Open Review. We want to know what you think. And now. We hope you give us feedback over at the preprint. Or, if you want to give us more detailed, annotated comments, we’ve put the piece in a comment-open Google Doc. Highlight something you disagree with. Argue with us. We welcome it! We’d ask that you put your name with your comments. We want a discussion rather than just one-way commenting, as discussion will improve this manuscript and help us shape our argument. This will also allow *you* to get full recognition for your comments, and we will include this in future acknowledgements.

So, enjoy the piece – our commentary is not a straight experiment-analysis-discussion piece, but rather part of a broader ecosystem of scholarly products that we feel are important to get out there. We look forward to hearing what you think of the piece!

Favorite Wave Sensor?


So, Internet, I’m setting up a number of monitoring sites this summer. I’m hoping to get good wave height measurements from them to look at disturbance. I have yet to find something like the CDIP swell height model for the Gulf of Maine (although I’d love to hear that this is due to a failure of my google-fu). So, I’m casting about for some good wave height sensors.

The problem I face is that I’d like to deploy a lot of them, in some highly variable conditions, secured to subtidal transects. Possibly for 6 months. Or maybe even up to a year. So, I’m trying to see if a lower-cost, smaller-profile solution is available. Something like a tidbit, but for wave heights.

I’ve found a few that kinda sorta fit the bill, although not quite. There’s the venerable SeaBird, products from RBR, the SEAGUARD tide and wave recorder, and the MIDAS from Valeport.

I’m a little worried that these might be too large or expensive given the conditions I’m deploying in. Or maybe I’m just asking for a pipe dream.

Internet – any thoughts, recommendations, or experiences you’d care to share? Am I being ridiculous?

More on Bacteria and Groups

Continuing with bacterial group-a-palooza

I followed Ed’s suggestions and tried both a binomial distribution and a Poisson distribution for abundance, such that the probability of the density of one species s in one group g in one plot r, where there are S_g species in group g, is

A_{rgs} \sim Poisson\left(\frac{A_{rg}}{S_g}\right)

(and, analogously, A_{rgs} \sim Binomial(A_{rg}, 1/S_g) for the binomial version).

In the analysis I’m doing, interestingly, the results do change a bit, such that the original network-only results are confirmed.

I am seeing one funny thing, though, which I can’t pin down. Namely, the no-group option always has the lowest AIC once I include abundances – and this is true for both the binomial and Poisson distributions. I’ve put the code for all of this here and made a sample script below. The sample doesn’t reproduce the behavior, but, still, I’m not quite sure what this blip is about.

For the sample script, we have five species and three possible grouping structures. It looks like this, where red nodes are species or groups and blue nodes are sites:

[Figure: bipartite networks for the three grouping structures; red nodes are species or groups, blue nodes are sites.]

And the data looks like this:

  low med high  1   2   3
1   1   1    1 50   0   0
2   2   1    1 45   0   0
3   3   2    2  0 100   1
4   4   2    2  0 112   7
5   5   3    2  0  12 110

So, here’s the code:
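Well, a minimal sketch of it, anyway: the full script lives in the repository linked above, and the object and function names below are mine, just for illustration.

groupings <- data.frame(low  = c(1, 2, 3, 4, 5),
                        med  = c(1, 1, 2, 2, 3),
                        high = c(1, 1, 2, 2, 2))
abund <- cbind(c(50, 45, 0, 0, 0),
               c(0, 0, 100, 112, 12),
               c(0, 0, 1, 7, 110))

# log-likelihood pieces for one grouping of the sequences-by-plots abundance matrix
llPieces <- function(groups, abund) {
  llNet <- llBinom <- llPois <- 0
  for (g in unique(groups)) {
    idx <- which(groups == g)        # which sequences fall in group g
    S_g <- length(idx)
    for (q in 1:ncol(abund)) {
      a <- abund[idx, q]             # their abundances in plot q
      A <- sum(a)                    # total group abundance in plot q
      L <- sum(a > 0)                # binary group-to-plot links
      p <- L / S_g
      # binary network piece: L*log(p) + (S_g - L)*log(1 - p), dropping 0*log(0) terms
      if (p > 0 && p < 1)
        llNet <- llNet + L * log(p) + (S_g - L) * log(1 - p)
      # abundance pieces: an equal chance of drawing any sequence in the group
      if (A > 0) {
        llBinom <- llBinom + sum(dbinom(a, size = A, prob = 1 / S_g, log = TRUE))
        llPois  <- llPois  + sum(dpois(a, lambda = A / S_g, log = TRUE))
      }
    }
  }
  c(LLNet = llNet, LLBinom = llBinom, LLPois = llPois)
}

sapply(groupings, llPieces, abund = abund)

That should give back the three log-likelihood columns in the results that follow; the AIC columns then add the complexity penalties on top.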

And the results:

> aicdf
     k LLNet LLBinomNet  LLPoisNet   AICpois  AICbinom AICnet
low  5     0    0.00000  -20.54409  71.08818  60.00000     30
med  3     0  -18.68966  -23.54655  65.09310  73.37931     18
high 2     0 -253.52264 -170.73361 353.46723 531.04527     12

We see that the two estimation approaches disagree, with the binomial favoring disaggregation and the Poisson favoring moderate aggregation. Interesting. Also, the naive network-only approach favors complete aggregation. Thoughts?

Groupapalooza: Adapting Food Web Trophic Group Methods for Defining Bacterial “Species”

The following are some notes on a technique I’m developing for a cool collaboration between me, Jen Bowen, and David Weisman. I think it has some generality to it, and I’d love any feedback from the more mathematical crowd… I also wrote it to make sure I knew what I was doing (translating scribbled equations to code to results), so it does free-flow a bit. It may change based on feedback – consider this a working document.

So. Away we go.

What do food webs and determining the identity of bacterial species based on sequences and co-occurrence data have in common? How can bacterial ‘species’ advance basic food web research?

Networks. And AIC scores.

Let me explain.

I’ve long been a huge fan of Allesina and Pascual’s 2009 paper on deriving trophic groups de novo from food web networks. In short, they say that if you have a simple binary network (a eats b, or a doesn’t eat b), you can use information theory to determine trophic groups within that network. I’ve applied their methods in the past to kelp forests and seen some interesting things, and Ed Baskerville has a great paper on using the technique for the Serengeti food web.

So how does this connect to bacteria?

I’m working on an analysis where my collaborators have surveyed bacterial communities at a number of different sites. We want to know the abundance of different species at different sites. However, how to define a bacterial ‘species’ is a tricky question. OK – let me poorly explain my understanding of bacterial taxonomic definitions (don’t kill me, Jen!) Let’s say you amplify and sequence a sample. You may get a number of different representative sequences from that sample. And you can get a measure of the abundance of each sequence type.

Now, on to species – looking at any pair of sequences (looong sequences of many base pairs), you may find two that are, say, one base pair different from each other. Are these two ‘sequences’ independent species or not? What if they differed by 2 base pairs? What about 3? 4? Now, a researcher can define an ‘operational taxonomic unit’, or OTU, as all sequences that are no more than X% different from each other – and X is up to them. Thus, once you define your percent similarity, you can sum up all of the sequences in each OTU and get the abundance of each “species” in each plot.

This is somewhat unsatisfying. I mean, what if you had two sequences that were 98% similar, but all of sequence A was in one plot, and all of sequence B was in another plot. Now you tell me – is this one species or two?

Let’s take that one step further. Let’s suppose A and B are both in a plot. But sequence A has 10x the abundance of sequence B. Furthermore, in a second plot, both are present, but sequence B is 10x more abundant. Again, one species or two?

The approach I want to lay out here answers this using a slight modification of Allesina and Pascual’s framework. Namely, we’re going to look at patterns of association, sequence similarity, and abundances to define OTUs.

The Association Part
At the core of Allesina and Pascual’s framework is the following observation. Let’s say you are dealing with a food web. You’ve got all sorts of directed connections of species A eating species B. Now, let’s say you want to define two trophic groups. Definitions of predator, prey, etc., are not important here. Just that in each group, you’ll have one set of species that eats species in the other group, and vice versa. Like in this diagram:

[Figure: a food web partitioned into two groups, with feeding links running between the groups.]

So far, so good, yes? Now, the question is, which of these is the better descriptor of the structure of the network, after penalizing for complexity? I.e., we want a general schema. Is the amount of information lost by grouping things a-OK, given that we’ve reduced the complexity of our model of how the world works?

A&P derive a wonderful formula for this. It involves two pieces. First, for each A -> B connection between groups we’ve made, we can derive the probability of producing that particular set of links with those species assigned to exactly those groups. Let L(ab) be the number of links going from species in A to species in B, and S(i) the number of species in group i, and define p(ab) as L(ab) / [S(a)S(b)]. The probability of the observed A -> B links, given p(ab), can then be written as

p(network | p(ab)) = p(ab)^L(ab) * (1 - p(ab))^(S(a)S(b) - L(ab))

Which implies that the likelihood of p(ab) given the network is the same.

Likelihood(p(ab) | network) = p(ab)^L(ab) * (1 - p(ab))^(S(a)S(b) - L(ab))

or

Log-Likelihood = L(ab)*log(p(ab)) + (S(a)S(b) - L(ab))*log(1 - p(ab))

Cool, right?
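To put some made-up numbers on it: say group A has 3 species, group B has 4, and there are L(ab) = 6 links from species in A to species in B, so p(ab) = 6/12 = 0.5. In R:

L_ab <- 6; S_a <- 3; S_b <- 4
p_ab <- L_ab / (S_a * S_b)                               # 0.5
L_ab * log(p_ab) + (S_a * S_b - L_ab) * log(1 - p_ab)    # about -8.3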

Let’s call one of those log-likelihoods LL(a -> b). Now, the log-likelihood of a given network configuration with groups is just

LL(all p(ij) | whole network) = LL(a->b) + LL(b -> a) + LL(a -> a) + LL(b -> b)

where LL(a->b) is one of those log-likelihood calculations above. We’ll call this LL(network) for future use.

Now, what about this comparison and penalty for complexity? Here’s where things get even better. We know that there are S total species and k^2 probabilities, where k is the number of groups. So, voila, we have an AIC for a group-structured network:

AIC = -2 * LL(network) + 2S + 2k^2

and as the AIC for each configuration captures the information lost by a particular grouping, we can directly compare different grouping schemas. Note that the AIC for the baseline network, whose log-likelihood is zero, is just 2S + 2k^2.
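To make that concrete, here’s a rough sketch in R for a directed binary web stored as an adjacency matrix. The function name and the toy web are mine, purely for illustration, and it only evaluates groupings handed to it rather than searching over them:

# AIC for one grouping of a directed, binary food web
# adj: S x S matrix with adj[i, j] = 1 if species i eats species j (pick your convention)
# groups: a vector assigning each of the S species to a group
aicGroups <- function(adj, groups) {
  S  <- nrow(adj)
  gs <- unique(groups)
  k  <- length(gs)
  ll <- 0
  for (a in gs) for (b in gs) {
    Sa  <- sum(groups == a)
    Sb  <- sum(groups == b)
    Lab <- sum(adj[groups == a, groups == b, drop = FALSE])   # links from group a to group b
    pab <- Lab / (Sa * Sb)
    # L(ab)*log(p(ab)) + (S(a)S(b) - L(ab))*log(1 - p(ab)), dropping 0*log(0) terms
    if (pab > 0 && pab < 1)
      ll <- ll + Lab * log(pab) + (Sa * Sb - Lab) * log(1 - pab)
  }
  -2 * ll + 2 * S + 2 * k^2
}

# a tiny made-up web: species 4 and 5 eat species 1, 2, and 3
adj <- matrix(0, 5, 5)
adj[4:5, 1:3] <- 1
aicGroups(adj, groups = c(1, 1, 1, 2, 2))   # two groups: "prey" and "predators"
aicGroups(adj, groups = 1:5)                # every species in its own group

Whichever grouping has the lowest AIC wins.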

So what does this have to do with bacteria?!?!

OK, ok, hold your horses. Let’s treat a sequence’s association with a site as a link, and consider both sequences and sites as nodes in a network. So, if one sequence associates with one site, that’s a directed link from sequence to site. It’s a bipartite graph. Now, instead of searching through all possible group structures, our groups are defined by the OTUs created at different levels of sequence similarity. We can calculate the LL for each group -> site association the same way we calculated the LL for A -> B before. The difference, however, is that there are fewer probabilities over the whole network. Instead of there being k^2 probabilities, there are k*r, where r is the number of replicate plots we’ve sampled. So

AIC = -2 * LL(OTU network) + 2S + 2k*r

The beauty of this approach is that instead of having to search through group structures, we have one grouping per level of sequence similarity. Granted, we can have tens of thousands of groups, so it’s still a moderately heinous calculation (go-go mclapply!), but it’s not so bad.

But, what about that abundance problem?

So, until now, I’ve been talking about binary networks, where links are either 1 or 0. As far as I know, no one has derived a weighted-network analog of the A&P approach. On the other hand, here our network weights are real abundances. Because of this, we can calculate a likelihood of species with some set of abundances in a plot being part of the same group. Then,

LL(OTU group A -> 1 Plot) = LL(network) + LL(sequences having the observed pattern of abundances in that plot if they are in the same group)

I’m making this jump because the probability of the species in one group connecting to one plot and having their observed densities is just the probability of those species connecting to the plot times the probability of them having that pattern of densities:

p(network & abundance) = p(network) * p(abundances)

OK, so, how do we get that p(abundances), a.k.a. Likelihood(parameters | observed abundances)?

I’m going to throw out a proposal. I’m totally game to hear others, but I think this is reasonable.

If two sequences are indeed the same OTU, they should respond in similar ways to environmental variation. Thus, you should have an equal probability, if you were to sample random individuals from a group in one plot, of drawing either species. So, in the figure below, on the left, the two sequences (in red), even though they both associate with this one site, are different OTUs. Or, rather, it is highly unlikely they are from the same OTU. On the right, they are likely from the same OTU.

[Figure: two sequences (red) linked to one site (blue); on the left the pair is likely two different OTUs, on the right likely the same OTU.]

This is great, as we now have a parameter for each group-plot combination: the probability of drawing an individual with one of the sequences within a group. And we’re defining that probability as 1/(number of sequences in the group). It’s rolling a die. And we’re rolling it as many times as we have total ‘individuals’ observed. So, for each sequence, we have a probability of drawing it and a number of die rolls… and we should be able to calculate a p(sequence | p(i in j in plot q)), which is the same as Likelihood(p(i in j in plot q) | sequence). I’ll call this Likelihood(abundance ijq). Using a(iq) as the abundance of species i in plot q, A(jq) as the abundance of all species in group j in plot q, and S(jq) as the number of sequence types in group j in plot q,

Likelihood(abundance ijq) = dbinom(a(iq) | size=A(jq), p=1/S(jq))

Log that, sum over all species in all plots, and we get LL(abundance).

We’ve added k*r more parameters (one per group-plot combination), so, now,

AIC = -2 * LL(OTU network) -2 * LL(OTU abundances) + 2S + 4k*r

Aaand…. that’s it. I think. We should be able to use this to scan across all OTU structures based on sequence similarity, calculate an AIC for each, and then use the OTU structure with the smallest AIC as our ‘species’.
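For what it’s worth, here’s roughly how I picture that scan in R. Everything below is a toy stand-in (made-up distances and abundances, average-linkage clustering to define OTUs at each cutoff), not the actual analysis, but it shows the shape of the calculation:

library(parallel)

# toy stand-ins for the real inputs: pairwise distances among 20 sequences and
# their abundances across 6 plots
set.seed(42)
seqDist <- dist(matrix(runif(200), nrow = 20))
abund   <- matrix(rpois(120, lambda = 20), nrow = 20)
tree    <- hclust(seqDist, method = "average")

# OTU assignments at a given dissimilarity cutoff
otuGroups <- function(cutoff) cutree(tree, h = cutoff)

# network and abundance log-likelihoods for one grouping, as derived above
llOTU <- function(groups, abund) {
  llNet <- llAbund <- 0
  for (g in unique(groups)) {
    idx <- which(groups == g)
    S_g <- length(idx)
    for (q in 1:ncol(abund)) {
      a <- abund[idx, q]
      A <- sum(a)
      L <- sum(a > 0)
      p <- L / S_g
      if (p > 0 && p < 1)
        llNet <- llNet + L * log(p) + (S_g - L) * log(1 - p)
      llAbund <- llAbund + sum(dbinom(a, size = A, prob = 1 / S_g, log = TRUE))
    }
  }
  c(net = llNet, abund = llAbund)
}

# scan across cutoffs, compute an AIC for each grouping, keep the best
cutoffs <- seq(0, 2, by = 0.1)
aics <- unlist(mclapply(cutoffs, function(h) {
  groups <- otuGroups(h)
  ll     <- llOTU(groups, abund)
  k      <- length(unique(groups))
  -2 * ll["net"] - 2 * ll["abund"] + 2 * nrow(abund) + 4 * k * ncol(abund)
}, mc.cores = 2))   # mclapply needs mc.cores = 1 on Windows

cutoffs[which.min(aics)]   # the cutoff whose OTUs we'd treat as "species"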

Now, we could of course add additional information. For example, what if we knew some environmental information about the plots? We could probably use that to create groups of plots, rather than just using individual plots.

I also wonder if this can be related to a more general solution for weighted networks, getting back to A&P’s original formulation for food webs. Perhaps by assuming that all interaction strengths are drawn from the same distribution with the same mean and variance. That should do it, and it would be relatively simple to implement. Heck, one could even try different distributional assumptions.

References
Allesina, S. & Pascual, M. (2009). Food web models: a plea for groups. Ecol. Lett., 12, 652–662. 10.1111/j.1461-0248.2009.01321.x

Baskerville, E.B., Dobson, A.P., Bedford, T., Allesina, S., Anderson, T.M. & Pascual, M. (2011). Spatial Guilds in the Serengeti Food Web Revealed by a Bayesian Group Model. PLoS Comp Biol, 7, e1002321. 10.1371/journal.pcbi.1002321