I’m a HUGE fan of Structural Equation Modeling. For those of you unfamiliar with the technique, it’s awesome for three main reasons.

- It’s a method of teasing apart direct and indirect interactions in your data.
- It allows you to assess the importance of underlying latent variables that you cannot measure directly, but for which you have measured indicators.
- As it’s formally presented, with path diagrams showing connections between variables, it’s SUPER easy to link conceptual models with your data. See Grace et al. 2010 for a handy guide to this.

Also, there is a quite simple and intuitive R package for fitting SEMs, lavaan (LAtent VAriable Analysis). Disclaimer: I just hopped on board as a lavaan developer (yay!). I’ve also recently started a small project to find cool examples of SEM in the Ecological literature and then, using the provided information, post the models coded up in lavaan so that others can see how to put some of these models together.

As Ecologists, we often use latent variables to incorporate known measurement error of a quantity – i.e., a latent variable with a single indicator and fixed variance. We’re often not interested in the full power of latent variables – latents with multiple indicators. Typically, this is because we’ve actually measured everything we want to measure. We’re not like political scientists, who have to quantify fuzzy things like Democracy, or Authoritarianism, or Gastronomicism. (Note: I want to live in a political system driven by gastronomy – a gastronomocracy!)
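For the record, that single-indicator trick takes only a couple of lines of lavaan code. Here’s a minimal sketch – the variable names (trueN, obsN, richness) and the 0.2 error variance are entirely made up for illustration:

```r
# Hypothetical sketch: fold known measurement error into a model via a
# latent variable with a single indicator and a fixed error variance.
# All variable names and the 0.2 value are invented for illustration.
measurementErrorModel <- '
  # the latent "true" value causes the single observed reading
  trueN =~ 1*obsN
  # fix the indicator error variance to the known measurement error
  obsN ~~ 0.2*obsN
  # downstream relationships then use the latent, not the observed value
  richness ~ trueN
'
# fit with, e.g., sem(measurementErrorModel, data = yourData)
```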

However, we’re still fascinated by the idea of bundling different variables together into a single causal effect, and maybe evaluating the relative contribution of each of those variables within a model. In SEM, this is known as the creation of a Composite Variable. This composite is still an unmeasured quantity – like a latent variable – but with no error variance, and with “indicators” actually driving the variable, rather than having the unmeasured variable causing the expression of its indicators.

Let me give you an example. Let’s say we want to measure the effect of nutrients on diatom species richness in a stream. You’re particularly concerned about nitrogen. However, you can’t bring water samples back to the lab, so, you’re relying on some moderately accurate nitrogen color strips, the biomass of algae (more algae = more nitrogen!), and your lab tech, Stu, who claims he can taste nitrogen in water (and has been proved to be stunningly accurate in the past). In this case, you have a **latent variable**. The true nitrogen content of the water is causing the readings by these three different indicators.
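In lavaan syntax, that latent-variable story might look something like this sketch (the variable names – stripN, algae, stuTaste, richness – are invented for the example):

```r
# Hypothetical sketch of the latent variable version: the unmeasured true
# nitrogen content causes the readings of all three indicators.
latentNitrogenModel <- '
  # latent nitrogen, scaled by fixing the color-strip loading to 1
  nitrogen =~ 1*stripN + algae + stuTaste
  # the latent variable then drives diatom species richness
  richness ~ nitrogen
'
# fit with, e.g., sem(latentNitrogenModel, data = streamData)
```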

A **composite variable** is a different beast. Let’s say we have the same scenario. But, now you have really good measurements of nitrogen. In fact, you have good measurements of both ammonium (NH4) and nitrate (NO3). You want to estimate a “nitrogen effect”, but, you know that both of these different forms of N will contribute to the effect in a slightly different way. You could just construct a model with effects going from both NO3 and NH4 to species richness. If you want to represent the total “Nitrogen Effect” in your model, however, and evaluate the effect of each form of nitrogen on its total effect, you would create a composite. The differences become clear when looking at the path diagram of each case.

Here, I’m adopting the custom of observed variables in squares, latent variables in ovals, and composite variables in hexagons. Note that, as indicators of nitrogen in the latent variable model, each observed indicator has some additional variation due to factors other than nitrogen – δ_{i}. There is no such error in the composite variable model. Also, I’m showing that the error in the Nitrogen Effect in the composite variable model is indeed set to 0. There are sometimes reasons why that shouldn’t be 0, but that’s another topic for another time.

This may still seem abstract to you. So, let’s look at an example in practice. One way we often use composites is to bring together a linear and a nonlinear effect of a single variable. For example, we know that nutrient supply rates often have a hump-shaped effect on species richness – i.e., the highest richness happens at intermediate supply rates. One nice example of that is in a paper by Cardinale et al. in 2009 looking at relationships between manipulated nutrient supply, species richness, and algal productivity. To capture that relationship with a composite variable, one would have a ‘nitrogen effect’ affected by N and N^{2}. This nitrogen effect would then affect local species richness.

So, how would you code this model up in lavaan, and then evaluate it?

Well, hey, the data from this paper are freely available, so, let’s use this as an example. For a full replication of the model presented in the paper see here. However, Cardinale et al. didn’t use any composite variables, so, let’s create a model of our own capturing the Nitrogen-Richness relationship while also accounting for local species richness being influenced by regional species richness.

In the figure above, we have the relationship between resource supply rate and local species richness on an agar plate to the left. Separate lines are for separate streams. The black line is the average fit with the supplied equation. On the right, we have a path diagram representing this relationship, as well as the influence of regional species richness.

So, we have a path diagram. Now comes the tricky part. Coding. One thing about the current version of lavaan (0.4-8) is that it does not have a way to represent composite variables. This will change in the future (believe me), but it may take a while, so let me walk you through the tricks of incorporating composite variables now. Basically, there are four steps.

- Define the composite as a regression, where the composite is determined by its causal variables. Also, fix one of the effects to 1. This gives your composite variable a scale.
- Specify that the composite has an error variance of 0.
- Now treat the composite as a latent variable. Its indicators are its response variables. This may seem odd. However, it’s all just ways of specifying causal pathways – an indicator pathway and a regression pathway have the same meaning in terms of causality. The software just needs something specified so that it doesn’t go looking for our composite variable in our data. Hence, defining it as a latent variable whose indicators are endogenous responses. I actually find this helpful, as it also makes me think more carefully about what a composite variable is, and how too many responses may make my model not identified.
- Lastly, because we don’t want to fix the effect of our composite on its response to 1, we’ll have to include an argument in the fitting function so that it doesn’t force the first latent variable loading to be set to 1. We’ll also have to specify that we want the variance of the response variable freely estimated. Yeah, I know. Note: this all can play havoc when you have both latent and composite variables, so be careful. See here for an example.
- Everything else – other regression relationships, specifying that nonlinearities are derived quantities, etc. – goes in as usual.

OK, that’s a lot. How’s it work in practice? Below is the code to fit the model in the path diagram. I’ve labeled the steps in comments and included the local ~ regional richness relationship, as well as the relationship showing that logN^{2} was derived from logN. Note, this is a centered squared variable. And, yes, all nitrogen values have been log transformed here.

```
#simple SA model with N and regional SR using a composite
#Variables: logN = log nutrient supply rate, logNcen2 = log supply rate squared
# SA = Species richness on a patch of Agar, SR = stream-wide richness
compositeModel <- '
  #1) define the composite, scale to logN
  Nitrogen ~ 1*logN + logNcen2 #loading on the significant path!
  #2) specify 0 error variance
  Nitrogen ~~ 0*Nitrogen
  #3) now, because we need to represent this as a latent variable,
  #show how species richness is an _indicator_ of nitrogen
  Nitrogen =~ SA
  #4) BUT, make sure the variance of SA is freely estimated
  SA ~~ SA
  #Regional Richness also has an effect
  SA ~ SR
  #And account for the derivation of the square term from the linear term
  logNcen2 ~ logN
'
# we specify std.lv=TRUE so that the Nitrogen-SA loading isn't fixed to 1
compositeFit <- sem(compositeModel, data = cards, std.lv = TRUE)
```

Great! It should fit just fine. I'm not going to focus on the regional relationship, as it is predictable and positive. Moreover, when we look at the results, two things immediately pop out at us about the effect of nutrient supply rate.

```
                   Estimate  Std.err  Z-value  P(>|z|)
Latent variables:
  Nitrogen =~
    SA                0.362    0.438    0.827    0.408

Regressions:
  Nitrogen ~
    logN              1.000
    logNcen2         -1.311    1.373   -0.955    0.340
```

Wait, what? The Nitrogen effect was not detectably different from 0? Nor was there a nonlinear effect? What's going on here?

What's going on is that the scale of the composite is affecting our results. We've scaled it to logN by fixing that path to 1. Whenever you fix a scale, you should always check what would happen if you changed which path was set to 1. So, we can simply set the scale to the nonlinear variable, refit the model, and see if this makes a difference. If it doesn't, then that means there is no nitrogen effect at all!

So, change

`Nitrogen ~ 1*logN + logNcen2`

to

`Nitrogen ~ logN + 1*logNcen2`

And, now let's see the results…

```
                   Estimate  Std.err  Z-value  P(>|z|)
Latent variables:
  Nitrogen =~
    SA               -0.474    0.239   -1.989    0.047

Regressions:
  Nitrogen ~
    logN             -0.763    0.799   -0.955    0.340
    logNcen2          1.000
```

Ah HA! Not only is the linear effect not different from 0, but now we see that scaling the composite to the nonlinear term allows the nutrient signal to come through.

But wait, you may say, that effect is negative? Well, remember that the scale of the nitrogen effect is now the scale of the nonlinear term. And a positive hump-shaped relationship will have a negative squared term. So, given how we've set up the model, yes, that should be negative.
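If that sign logic seems slippery, you can convince yourself with a quick simulation. Nothing below comes from the paper – it's an invented example showing that fitting a quadratic to a hump-shaped relationship yields a negative coefficient on the (centered) squared term:

```r
# Simulate a hump-shaped relationship and fit a quadratic regression.
# All values are invented purely to illustrate the sign of the squared
# term; this is not the Cardinale et al. data.
set.seed(42)
x <- runif(100, 0, 10)
y <- -(x - 5)^2 + 25 + rnorm(100, sd = 2)  # richness peaks at x = 5
xcen <- x - mean(x)                        # center, as with logNcen2
fit <- lm(y ~ xcen + I(xcen^2))
coef(fit)["I(xcen^2)"]  # negative, reflecting the hump shape
```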

*whew!* That was a lot. And this was for a very simple model involving composites and nonlinearities. I thought I'd throw that out there, as it's a common use of composites, and interpreting nonlinearities in SEMs is always a little tricky and worth bending your brain around. Other uses of composites include summing up a lot of *linear* quantities, a composite for the application of treatments, and more. But this should give you a good sense of what they are, how to code them in lavaan, and how to use them in the future.

For a more in-depth treatment of this topic, and of latent variables versus composites, I urge you to check out this excellent piece by Jim Grace and Ken Bollen. Happy model fitting!

Curious if you know of Juan Santos at NESCent, and if you know if he or others have implemented phylogenetic SEMs. Thanks, S

Indeed, Scott. After coding up that example and using it as a stellar example of CFA for my class, I’ve emailed back and forth with him a bit. I rather like his work.

Sorry I don’t entirely share your enthusiasm for SEMs, Jarrett. What’s often taken to be a feature – the ability to summarize putatively ‘causal’ relations among any sorts of variables, including latent variables – I think is often a bug. I think about causality in process-based, dynamical systems terms (state variables, and the rate processes that make those state variables increase or decline). I find statistically-defined marginal effects of one variable on another to be a poor substitute for this way of thinking. I see SEMs not as descriptions of causality, but as linear summaries of the effects of (typically unknown, complex, and nonlinear) causal processes, and often not especially helpful summaries. You’re fitting a lot of parameters, which gives the model a lot of freedom to give you the right answer (i.e., a good fit) for the wrong reasons. SEMs also are often inspired by, or even presented as tests of, purely verbal models of how the world works, and so give us a false sense of rigor. Verbal models often don’t hold up when you attempt to convert them to math, but an SEM will never reveal logical gaps or flaws in a verbal model. So SEMs end up covering up rather than exposing lack of rigor in our hypotheses. Especially in systems with reciprocal causality – which are ubiquitous in ecology and evolution – SEMs aren’t a great way of describing causal processes and summarizing their effects. If you want to model, say, predator-prey dynamics in a stochastic environment, an SEM is absolutely not the way to go.

So having said all that, I suppose I should in fairness note that I do think SEMs have a role to play, especially as sources of novel hypotheses that can be tested experimentally (rather than as hypothesis tests themselves). Tim Wootton’s 1993 work on indirect interactions in the rocky intertidal is an excellent example of experimental testing of SEMs.

Jeremy, I understand your skepticism, and agree with a lot of what you’re saying. Indeed, your arguments broadly apply to pretty much any regression-like technique. It’s fundamentally correlative. Just because you can take a verbal model, draw a box-and-arrow diagram – with or without indirect effects – and fit a model to it doesn’t make it true. There are some True Believers ™ in the SEM crowd who would take umbrage at that, and the SEM literature is littered with discussion of whether or not SEM is a true test of causality. A lot of that comes from a misunderstanding of some of the interesting work of Les Hayduk and others on assessing model fit of an SEM. Basically, if your model structure is inconsistent with the pathways by which effects propagate from one variable to another – even allowing for pathways that may be missing intermediary variables along the way – your model will not reproduce a covariance matrix sufficiently similar to that from the data. So, as with so many statistical techniques, it’s excellent for falsifying causal models. Or for showing that the model you can construct with the data at hand is, at least, consistent with your hypothesis. And then we get into the larger debate of modes of inference, as always.

As a dyed-in-the-wool experimentalist, I agree – you can construct the most elegant model in the world and either a) if there are no experiments or previous manipulative work in the field, its utility is quite circumscribed, and one must be very careful of making any causal claims, OR b) your elegant model may be entirely inappropriate for the data at hand (e.g., your dynamics-in-a-stochastic-environment example). It is easy for someone who first gets into SEM to see it as a hammer and the world as full of nails; for many who come from the AG-Stats background common in so many Ecology graduate programs, the explosion of complexity is seductive.

That said, in situations where you have a strong a priori hypothesis or hypotheses involving a series of direct and indirect interactions, some simple nonlinearities, and even latent variables for which you have no direct measurement – or even may not ever be able to have one – then SEM is perhaps the tool for you.

I think for me it all comes down to the mathematical model underlying how you view your system of interest. SEM is one of many ways of defining a system of equations and then fitting coefficients using the data at hand. But that system of equations, and all of the assumptions behind it, are what’s at the core. As it is with pretty much any parametric statistical technique. It’s about finding the right tool for the right job, and, if you are going to make causal statements, making sure you can back up your words with solid mechanistic insights gained from experiments, natural history, and those who have gone on before you. But, again, this is true of any statistical modeling technique applied to observational data.

Hope this more nuanced view is helpful!

p.s. For more on the causality side, I recommend checking out some work by Judea Pearl such as this. Also, on the reciprocal front, so-called non-recursive models can be fit in SEM, but, the bar for model identification (i.e., can unique parameters be determined?) is raised substantially, and one must adopt a few additional methods in order to properly assess properties such as variance explained, etc. But, it is possible, and I’ve had some luck with it in examining top-down and bottom-up forces acting simultaneously, for example.
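For the curious, a non-recursive specification looks something like the sketch below. The names are all hypothetical; the key point is that each reciprocally-related variable needs its own instrument – a predictor that affects it but not the other – for the model to be identified.

```r
# Hypothetical sketch of a non-recursive (reciprocal) model, e.g.,
# top-down and bottom-up forces acting simultaneously. All variable
# names are invented for illustration.
nonRecursiveModel <- '
  # reciprocal paths, each identified by its own instrument
  predators ~ prey + predatorInstrument
  prey ~ predators + preyInstrument
  # allow the disturbances of the two endogenous variables to covary
  predators ~~ prey
'
# fit with, e.g., sem(nonRecursiveModel, data = foodWebData)
```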

Excellent post Jarrett and interesting exchange All.

Regarding Jeremy’s comments, I believe Jarrett has clarified that SEM is a scientific framework that uses statistical tools. The framework can handle virtually any statistical specification you like, though that does not mean all software packages have broad capabilities. As an example, our paper last year in Ecology included nonlinear paths of several types and various non-Gaussian response types, as well as hierarchical structure (http://www.esajournals.org/doi/abs/10.1890/09-2137.1). For this we used a Bayesian estimation approach, but the same model could be fit using a combination of packages in R (and then requiring a graphical modeling approach to model evaluation, as Shipley has demonstrated). I certainly agree that causal processes correspond more closely to a longitudinal model. While reciprocal interactions can be included in static models, longitudinal models permit a more informative approach. Chapter 2 in my 2006 book features exactly such an example and we have pursued that investigation of causal loops through subsequent papers including those that model the temporal dynamics more completely. Of course, often we only have static data and our analyses depend on stationarity assumptions. Laura Gough and I have evaluated such assumptions in her experimental test of our model. (http://www.esajournals.org/doi/abs/10.1890/0012-9658(1999)080%5B0882:EOECOP%5D2.0.CO%3B2). As a final point, I would ask Jeremy that you keep an open mind about the capabilities of SEM and not close the door to what its 90-year history has to contribute to causal science. Even the great Judea Pearl, who just added the Turing Award (and $250,000) to his trophy chest, ultimately declared that ‘structural equations are the language of studying causal relations’. – Best to all, Jim Grace

Hi Jarrett, thanks for this fantastic post! It has been really helpful to me. Now, I wonder whether there are any advances in working with composite variables, like modelling endogenous composites or a composite feeding into another one. Perhaps you could expand on this in a new post? Coding it in R is especially hard.

Thanks!

Yes indeed! First – endogenous composites. Welp, you can have the drivers of a composite influenced by exogenous (or other endogenous) variables, but not the composite itself. Remember, it’s a weighted sum of different elements. So, build your composite as normal, but have the drivers of the composite all influenced by other variables. Composites feeding into one another is definitely possible – and a nice way to build up a set of concepts in a logical way!
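A sketch of the first case – drivers of a composite that are themselves endogenous – using the same trick as in the post (all variable names invented for illustration):

```r
# Hypothetical sketch: the composite itself can't receive a path, but its
# drivers can. Here both drivers (x1, x2) are influenced by an exogenous
# variable z before being bundled into the composite.
endogenousDriversModel <- '
  # the drivers are endogenous: each is influenced by z
  x1 ~ z
  x2 ~ z
  # build the composite as usual, scaled to x1, with 0 error variance
  comp ~ 1*x1 + x2
  comp ~~ 0*comp
  # treat the composite as a latent whose "indicator" is the response
  comp =~ y
  y ~~ y
'
# fit with, e.g., sem(endogenousDriversModel, data = yourData, std.lv = TRUE)
```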

Hi Jarrett,

Thanks for this interesting post. I’ve been playing with the Cardinale composite example above and seem to be obtaining the same answer with this simpler model specification:

```
compositeModel2 <- '
Nitrogen <~ logN + 1*logNcen2 # change ~ to <~ for the formative construct
Nitrogen ~~ 0*Nitrogen # specify 0 error variance
# Nitrogen =~ SA
# SA ~~ SA
SA ~ SR + Nitrogen # specify regression on SR and Nitrogen composite
logNcen2 ~ logN
'
compositeFit2 <- sem(compositeModel2, data=cards) # no need for std.lv=T
```

Do you agree they are the same, and would this mean the trick in the blog post for specifying the composite is no longer required?

Thanks!

Sorry, comments need to be removed for the code to run.

Yes – this post was written before lavaan added a specification for composite variables. It actually did so because of conversations I had with Yves, and I think it internally uses the specification I put above to resolve them. They’re still tricky, but much easier to use now than they used to be!

Also, sorry for the gap – was in the field!