*The following is some notes on a technique I’m developing for a cool collaboration between me, Jen Bowen, and David Weisman. I think it has some generality to it, and I’d love any feedback from the more mathematical crowd…I also wrote it to make sure I knew what I was doing – translating scribbled equations to code to results – so it does freeflow a bit. It may change based on feedback – consider this a working document.*

*So. Away we go.*

What do food webs and determining the identity of bacterial species based on sequences and co-occurrence data have in common? How can bacterial ‘species’ advance basic food web research?

Networks. And AIC scores.

Let me explain.

I’ve long been a huge fan of Allesina and Pascual’s 2009 paper on deriving trophic groups de novo from food web networks. In short, they say that if you have a simple binary network (a eats b, or a doesn’t eat b), you can use information theory to determine trophic groups within a network. I’ve applied their methods in the past to kelp forests, and seen some interesting things, andEd Baskerville has a great paper on using the technique for Seringetti food webs.

So how does this connect to bacteria?

I’m working on an analysis where my collaborators have surveyed bacterial communities at a number of different sites. We want to know the abundance of different species at different sites. However, how to define a bacterial ‘species’ is a tricky question. OK – let me poorly explain my understanding of bacterial taxonomic definitions (don’t kill me, Jen!) Let’s say you amplify and sequence a sample. You may get a number of different representative sequences from that sample. And you can get a measure of the abundance of each sequence type.

Now, on to species – looking at any pair of sequences (looong sequences of many base pairs), you may find two that are, say, one base pair different from each other. Are these two ‘sequences’ independent species or not? What if they differed by 2 base pairs? What about 3? 4? Now, a researcher can define an ‘operational taxonomic unit’ or OTU by all sequences that are X% different from each other – and X is up to them. Thus, once you define your percent similarity, you can sum up all of the species in each OTU, and get the abundance of each “species” in each plot.

This is somewhat unsatisfying. I mean, what if you had two sequences that were 98% similar, but all of sequence A was in one plot, and all of sequence B was in another plot. Now you tell me – is this one species or two?

Let’s take that one step further. Let’s suppose A and B are both in a plot. But sequence A has 10x the abundance of sequence B. Furthermore, in a second plot, both are present, but sequence B is 10x more abundant. Again, one species or two?

The approach I want to lay out here answers this using a slight modification of Allesina and Pasqual’s framework. Namely, we’re going to look at patterns of association, sequence similarity, and abundances to define OTUs.

**The Association Part**

At the core of Allesina and Pasqual’s framework is the following observation. Let’s say you are dealing with a food web. You’ve got all sorts of directed connections of species A eating species B. Now, let’s say you want to define two trophic groups. Definitions of predator, prey, etc., are not important here. Just that in each group, you’ll have one set of species that eats species in the other group, and vice-versa. Like in this diagram:

So far, so good, yes? Now, the question is, which of these is a better is a better descriptor of the structure of the network, after penalizing for complexity. I.e., we want a general schema. Is the amount of information lost by grouping things a-ok, given that we’ve reduced the complexity of out model of how the world works?

A&P derive a wonderful formula for this. It involved two pieces. First, for each A -> B connection between groups we’ve made, we can derive a probability of producing that particular graph with those species assigned into exactly those groups. L(ab) is the number of links going from species in A to species in B, and S(i) is, say, the number of species in group i. If we define p(ab) as L(ab) / [S(a)S(b)]. The probability of a given link in the network – say, A -> B – given p(ab) can be defined as

p(network | p(ab) = p(ab)^{L}(1-p(ab))^^{S(a)S(b) – L}

Which implies that the likelihood of p(ab) given the network is the same.

Likelihood (p(ab) | network)) = p(ab)^{L}(1-p(ab))^^{S(a)S(b) – L}

or

Log-Likelihood = L*log(p(ab)) + (S(a)S(b) – L)*log(1-p(ab))

Cool, right?

Let’s call one of those LLs, L(a->b). Now, the Log-Likelihood of a given network configuration with groups is just

LL(all p(ij) | whole network) = LL(a->b) + LL(b -> a) + LL(a -> a) + LL(b -> b)

where LL(a->b) is one of those log-likelihood calculations above. We’ll call this LL(network) for future use.

Now, what about this comparison and penalty for complexity? Here’s where things get even better. We know that there are S total species, and k^2 probabilities, where k is the number of groups. So, voila, we have an AIC for a group structure’d network

AIC = -2 * LL(network) + 2S + 2k^2

and as each AIC for each configuration captures information about information lost by a particular network, we can directly compare different grouping schemas. Note that the AIC for the baseline network is just 2S + 2K^2.

*So what does this have to do with bacteria?!?!*

OK, ok, hold your horses. Let’s think about sequences and their associations with a site as a link. Let’s consider both sequences and sites as nodes in a network. So, if one sequence associates with one site, that’s a directed link from sequence to site. It’s a bipartite graph. Now, instead of searching through all possible group structures, our groups are defined by OTUs that are created from different levels of sequence similarity. We can calculate the LL for each group -> site association the same as we calculated the LL for A -> B before. The difference is, however, that there are fewer probabilities over the whole network. Instead of there being k^2 probabilities, there are k*r where r is the number of replicate plots we’ve sampled. So

AIC = -2 * LL(OTU network) + 2S + 2k*r

The beauty of this approach is that instead of having to search through group structures, we have 1 grouping per degree of sequence similarity. Granted, we can have tens of thousands of groups, so, it’s still a moderately heinous calculation (go-go mclapply!), but it’s not so bad.

**But, what about that abundance problem?**

So, until now, I’ve been talking about binary networks, where links are either 1 or 0. As far as I know, no one has derived a weighted-network analog of the A&P approach. On the other hand, here, our network weights are real abundances. Because of this, we can calculate a Likelhood of species with some set of abundances in a plot being part of the same group. Then,

LL(OTU group A -> 1 Plot) = LL(network) + LL(sequences having the observed pattern of abundances in that plot if they are in the same group)

I’m making this jump from the

probability of species in one group being in that group and connecting to one plot = probability of species connecting to plot * probability of species having that pattern of densities.

p(network & abundance) = p(network) * p(abundances)

OK, so, how to we get that p(abundances) aka L(parameters | observed abundances)?

I’m going to throw out a proposal. I’m totally game to hear others, but I think this is reasonable.

If two sequences are indeed the same OTU, they should respond in similar ways to environmental variation. Thus, you should have an equal probability, if you were to sample random individuals from a group in one plot, of drawing either species. So, in the figure below, on the left, the two sequences (in red), even though they both associate with this one site, are different OTUs. Or, rather, it is highly unlikely they are from the same OTU. On the right, they are likely from the same OTU.

This is great, as we now have a parameter for each group-plot combination: the probability of drawing and individual with one of the sequences within a group. And we’re defining that probability as 1/number of sequences in a group. It’s rolling a dice. And we’re rolling it the number of times as we have total ‘individuals’ observed. So, for each sequences, we have a probability of drawing it, and a number of dice rolls…and we should be able to calculate a p(sequence | p(i in j in plot q)) which is the same as Likelihood(p(i in j in plot q) | sequence). I’ll call like Likelihood(abundance ijq). Using a(iq) as the abundance of species i in plot q and A(jq) is the abundance of all species in group j in plot q and S(jq) is the number of sequence types in group j in plot q

Likelihood(abundance ijq) = dbinom(a(iq) | size=A(jq), p=1/S(jq))

Log that, sum over all species in all plots, and we get LL(abundance).

We’ve added 2*k*r more parameters, so, now,

AIC = -2 * LL(OTU network) -2 * LL(OTU abundances) + 2S + 4k*r

Aaand…. that’s it. I think. We should be able to use this to scan across all OTU structures based on sequence similarity, calculate an AIC for each, and then use the OTU structure with the smallest AIC as our ‘species’.

Now, we could of course add additional information. For example, what if we knew some environmental information about plots, etc. We could probably use that to create groups of plots, rather than just use individual plots.

I also wonder if this can be related to a more general solution for weighted networks, and get back to A&P’s original formulation for food webs. Perhaps assuming that all interaction strengths are drawn from the same distribution with the same mean and variance. That should do it, and be relatively simple to implement. Heck, one could even try different distributional assumptions.

**References**

Allesina, S. & Pascual, M. (2009). Food web models: a plea for groups. Ecol. Lett., 12, 652–662. 10.1111/j.1461-0248.2009.01321.x

Baskerville, E.B., Dobson, A.P., Bedford, T., Allesina, S., Anderson, T.M. & Pascual, M. (2011). Spatial Guilds in the Serengeti Food Web Revealed by a Bayesian Group Model. PLoS Comp Biol, 7, e1002321. 10.1371/journal.pcbi.1002321

This general approach seems reasonable: choose a level of OTU similarity for groups that makes it so that sequences in the same group have statistically indistinguishable abundance patterns.

A few notes to start:

(1) The AIC calculation in Allesina & Pascual has a problem: the 2*S term shouldn’t be there, since you can’t count the assignment of species into groups as normal parameters. The derivation of AIC depends on differentiating with respect to parameters—I think the factor K drops out when you take the determinant of the Jacobian of the Fisher information matrix…or something like that—and you can’t do that to categorical variables. Basically, the model violates the assumptions of AIC, which is why I switched to a full-Bayesian formulation for the Serengeti work. (Not that Bayesian model selection is a walk in the park…)

(2) Note that (1) doesn’t matter for your purposes, because the group structure is totally determined by the level of OTU similarity—so you can safely drop the 2*S term and not worry about it.

(3) I don’t think independent binomials is quite right, but it probably wouldn’t change things much. I think the “process” + “observation” model should be: the abundance of a group in a plot is Poisson or zero-inflated Poisson; and then the abundance of sequences within that group would be multinomial with probabilities 1/(number of sequences in the group). And this is mathematically identical to having each sequence within a group be independently Poisson-distributed.

(4) Should the abundance probabilities actually be different in different plots, or should they be the same? If different, it might make sense to infer them as being clustered around similar values using a Bayesian hierarchical model—or infer a grouping of the plots, so plots in the same group have the same parameters.

OK, enough words; I think the model should look like this:

Abundance of group g in plot r: A_rg ~ Poisson(lambda_rg)

Abundance of vector of sequences in group g in plot r: a_rgs ~ Multinomial(A_rg, 1/N_g)

(where a_rgs is a vector of sequence abundances, and 1/N_g is the probability of every component in the multinomial probability vector)

which should be equivalent to just

Abundance of seq. s in plot r: A_rgs ~ Poisson(lambda_rg / N_gs)

(where now s is just a single sequence in the group in the plot).

Likelihood = product over all rg of [Poisson-Multinomial likelihood for group g in plot r]

Parameter count = [Number of Plots] X [Number of Groups]

If you made this thing Bayesian, with a conjugate gamma prior to the Poisson rates, the marginal likelihood should be analytically tractable, and thus just as easy to calculate as the AIC. I’d be happy to help with a comparison between AIC vs. marginal likelihood. I’m becoming a fan of “maximum marginal likelihood”/”empirical Bayes”, where you fix the top-level priors (in this case, the shape/rate parameters controlling the gamma prior on the Poisson rates) at a value that gives the highest marginal likelihood (probability of observing the data given the model, integrated over parameter uncertainty).

…and I’ll write a separate comment regarding how this would apply to food webs.

Thanks for the response, Ed! So, if I get this right, and sift down to simplify a bit, I like your point about A_rg being Poisson distributed amongst plots. However, that assumes some mean abundance across the landscape (lambda_rg). Based on the way this dataset is setup, there are different environmental treatments for different plots, so, I don’t think drawing from a grand mean abundance would be accurate. Furthermore, if we’re talking about creating groups, here, shouldn’t abundances within each plot be viewed as a separate piece of information independent of what’s going on in other plots?

That said, I like the use of a Poisson versus the binomial. Thus

A_rgs ~ Poisson(A_rg/S_g)

Or, a zero inflated version thereof (haven’t dug into the 0 inflated version yet…trying to keep it simple for the moment). This could be used for likelihoods as above.

Also, I agree on dropping the 2S. It’s an additive constant anyway, and should make no difference with respect to delta-AIC values.

I like the Bayesian take. It may be overkill for this leg of the project. But…I think, particularly given what you’ve put below, this might be something worth investigating for food webs with continuous interaction strengths – or for other problems like this (I’ve already had one other person contact me with a similar problem who thinks this can be used for their work!)

Ah, if the different plots are different treatments, then certainly you’d want different parameters for each of them. If they were expected to be the same, I guess it would be a model-selection question as to whether identical or different parameters would be better.

Either way, though, I didn’t mean to imply that you wouldn’t treat the data in different plots as additional data points—A_rg was meant to be the abundance of a single group within a single plot.

And yes, good point about the 2S—as long as you’re within the world of the group model, the 2S doesn’t make a difference. But if you want to compare it to other models with AIC, you’re probably unfairly penalizing the group model.

Lastly: with this model, the Bayesian approach is no more computationally intensive than AIC except for trying different hyperparameter values, so it might be interesting to do the comparison just to see if the basic results are robust to the technical details.

Actually, you’ve just raised a fascinating point. The different plots are 4 different treatments. 8 plots. So, we could have 1 model where lambda_rg is the same across replicate blocks of the same treatment, and another where it is not. That would be an interesting comparison to make, and would actually test some hypotheses my collaborators are interested in.

With respect to the Bayesian version, indeed, should be tested. I’m not as familiar with putting those kinds of models together, though. I admit, I flirt with Bayes awesome, but haven’t taken him to the prom yet.

And as for food webs! Say you had counts of how many times different predation events happened (to ignore observation for the moment), giving you integer values for your adjacency matrix instead of binary values. A simple approach is to model the event counts as Poisson.

In the group model (aka “stochastic block model”), this would mean changing the probabilities to Poisson rates.

However, there’s a problem here, already present in the binary version of the model: there’s a tendency in the model to simply group high-degree nodes with high-degree nodes, and low-degree nodes with low-degree nodes. A fix for this is to have species-level parameters controlling how much they eat/are eaten (which could be interpreted as a combination of abundance and traits such as size), and adjust the group-level rates by those species-level tendencies.

Actually, that adjustment is already published and nicely analyzed in this paper by Brian Karrer:

Stochastic blockmodels and community structure in networks, Karrer & Newman 2011 Phys. Rev. E. http://arxiv.org/abs/1008.3926

That paper is on undirected networks—which is why I stupidly ignored it for a while—but the adaptation to directed networks is pretty straightforward.

You could also imagine making those species-level adjustments be regressions on abundance, traits, etc.

I hadn’t heard the term blockmodel before to describe what you and Stefano are doing. I like it. Is there a big literature on this?

I think you’re exactly right if we have integer values for edges. What about webs that are based on, say, biomass differences, etc., or where we have continuous measurements of per capita interaction strengths? Would the method work? Seems like if there was a straightforward way to do this, it might be nice to get out there.

Yeah, turns out some statisticians have been doing it for a while; I believe I cited this one as the original:

Wang & Wong, Stochastic blockmodels for directed graphs, 1987, J. Am. Stat. Assoc.

http://scholar.google.com/scholar?cluster=18401648986334539021

do you support LaTeX equations? $\lambda_{rg} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$

There you go! See the documentation for wp-latex! http://wordpress.org/extend/plugins/wp-latex/ Basically, you use a wp [] tag with the word latex in it.

Pingback: More on Bacteria and Groups | i'm a chordata! urochordata!

Pingback: Beautiful Bacterial Networks in the Marsh | i'm a chordata! urochordata!