Spicier Analyses with Interactive R Leaflet Maps



Who wants to make a kickass, public-friendly, dynamic, online appendix with a map for their papers? ME! (and you, right?) Let’s talk about a cool way to make your data sing to a much broader audience than a static image.

Last time, I mentioned that I had also been playing with RStudio’s Leaflet package. For those who don’t know, Leaflet is a JavaScript library for generating interactive maps. I don’t know JavaScript (really), but I do know R, so this library is incredibly powerful for someone like me – particularly when paired with RStudio’s htmlwidgets or Shiny for publishing things.

So, what is leaflet? How does it work? Let’s start with a simple example. First, install the thing!

if (!require('devtools')) install.packages('devtools')
devtools::install_github('rstudio/leaflet')

Now, at its core, leaflet makes maps. And then puts data on them. It uses magrittr’s pipe syntax, which may be a bit daunting at first, but I think you’ll see that it’s all pretty straightforward.

Let’s start with a super-basic simple map.

library(leaflet)
leaflet() %>% addTiles()

Oh! Map! Scrollable! Easy to use! And you can zoom down to streets. What’s going on here?
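One quick note on the pipe before we unpack the pieces: %>% just hands the result on its left to the function on its right, so the chained and nested forms below are exactly equivalent.

#these two lines do the same thing
leaflet() %>% addTiles()

addTiles(leaflet())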

First, leaflet() starts the whole process. You can feed it arguments of data frames and such for later use. Second, addTiles says “OK, go to a map server that feeds up image tiles, and grab ’em!” It defaults to OpenStreetMap, but there are others. For example, ESRI has a tile server with satellite imagery.

leaflet() %>% 
  addTiles(urlTemplate="http://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}")
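You can also control where the map starts out. For example, setView() centers the widget on a given longitude and latitude at a chosen zoom level. A quick sketch (the coordinates here are arbitrary):

#start the map zoomed in on a particular spot
leaflet() %>% 
  addTiles() %>%
  setView(lng = -70.67, lat = 41.52, zoom = 8)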

There are a number of arguments and functions you could string after addTiles to zoom around, start in a certain area, manipulate the basic map widget in different ways, but I’d like to focus on two common applications for folk like us. First, site maps! Let’s say you had some awesome data from over 100 sites around the world, and you wanted to show it off. Maybe you wanted people to really be able to zoom in and see what the environment was like around your data (i.e., why I went and found the ESRI tile server). How would you do this?

Well, leaflet provides a few different ways to put this stuff on a map. You can add circles, markers, circlemarkers, and much more. Let’s give an example on a basic map. Note the formula interface for supplying values to latitude and longitude.

#fake some data
set.seed(100)
myData <- data.frame(Latitude = runif(100, -80,80), 
                     Longitude = runif(100, -180,180),
                     measurement1 = rnorm(100, -3, 2), 
                     Year = round(runif(100, 1975,2015)))
#map it!
leaflet() %>% 
  addTiles() %>%
  addCircles(data=myData, lat = ~Latitude, lng = ~Longitude)
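The other layer types use the same formula interface. For instance, pin-style markers are just a swap of the layer function – a quick sketch with the same fake data:

#same data, but with standard map markers instead of circles
leaflet() %>% 
  addTiles() %>%
  addMarkers(data=myData, lat = ~Latitude, lng = ~Longitude)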

Awesome! But we know there’s information in that thar data. There’s a Year and a Measurement. How can we communicate this information via our map? Well, there are a few ways. Note those circles had a color and a size? We can use that. Circles and markers are also clickable. So, we can add some additional information to our data and feed that into the map. Let’s start with a popup message.

#round for readability
myData$info <- paste0("Year: ", myData$Year, 
                      "<br> Value: ", round(myData$measurement1, 3))

Note the use of some HTML in there. That’s because leaflet displays things in a web browser, so, we need to use HTML tags for text formatting and line breaks.

OK, great. Now what about a color? This can get a bit trickier with continuous values, as you’ll need to roll your own gradients. Kind of a pain in the ass. If you just wanted one or two colors (say, for positive versus negative values), you could do something simple, or even feed a single value to the color argument – say, “red” – but let’s get a little fancy.
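First, though, here’s roughly what that quick-and-dirty route might look like (a sketch, just flagging positive versus negative measurements; simpleColor is an illustrative column name):

#one color for positive values, another for negative
#this could then go straight into addCircles via color = ~simpleColor
myData$simpleColor <- ifelse(myData$measurement1 > 0, "red", "blue")

Now, for the fancier gradient, the classInt library is perfect for gradient construction. Here’s an example. We’ll start with grabbing a palette and making it a color ramp.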

#grab a palette
library(RColorBrewer)

pal <- brewer.pal(11, "Spectral")

#now make it more continuous 
#as a colorRamp
pal <- colorRampPalette(pal)

Great! Now we can use classInt to map the palette to our measurement values.

#now, map it to your values
library(classInt)
palData <- classIntervals(myData$measurement1, style="quantile")

#note, we use pal(100) for a smooth palette here
#but you can play with this
myData$colors <- findColours(palData, pal(100))

Now that we have info, colors, and, well, let’s say we want to make the circles bigger – let’s put it all together!

#Put it on a map!
newMap <- leaflet() %>% 
  addTiles() %>%
  addCircleMarkers(data=myData, lat = ~Latitude, lng = ~Longitude,
             color = ~colors, popup = ~info, radius = 8)

newMap

Very cool! With different properties for zooming, filling, sizing points, different styles of markers, and more, I think the possibilities for online appendices for papers – or outreach objects – are pretty endless here. Go forth, ye, and play!

Also, check out the code in this gist!


MEOW! It’s Marine Ecoregions in R!

So, I’m on paternity leave (yay! more on that another day – but HOMG, you guys, my daughter is amazing!), and while my daughter is amazing, there are many hours in the day and wee hours of the morning where I find myself rocking slowly back and forth, knowing that if I stop even for a second, this little bundle of cute sleeping away in the wrap will wake and howl.

So, what’s a guy on leave to do? Well, why not learn some new R tricks that have been on my list for, in some cases, years, but that I have not had time for otherwise. In particular, it’s time to ramp up some of my geospatial skills. As, really, I can type while rocking. And I need to do something to stay awake. (One can only watch Clone Wars for so long – and why is it so much better than the prequels?)

In particular, I’ve really wanted to become a more proficient map-maker. I’ve been working on a few global projects lately, and wanted to visualize the results in some more powerful intuitive ways. Many of these projects have used Spalding et al.’s 2007 Marine Ecoregions of the World classification (or MEOW) as a basis. So, wouldn’t it be cool if we could make some nice plots of those regions, and maybe fill them in with colors according to some result?

Where to start? Well, to begin, how does one get the geographic information in to R? Fortunately, there’s a shapefile over at Marineregions.org.

Actually, heck, there are a LOT of marine region-like shapefiles over there that we might all want to use for different maps. And everything I’m about to say can generalize to any of those other shapefiles!

Oh, for those of you as ignorant as me, a shapefile is a geospatial file that has information about polygons with some additional data attached to them. To futz with them, we need to first load a few R libraries

#for geospatial tools
library(rgdal)
library(maptools)
library(rgeos)

#for joins and some data reduction
#note: load plyr before dplyr so plyr doesn't mask dplyr's grouped summarise
library(plyr)
library(dplyr)

These are wonderful tools that will help you load and manipulate your shapefiles. Note that I’ve also loaded up plyr and dplyr, which I’ve been playing with and finally learning. I’m a huge fan of ye olde plyr library (we’ll need its join() in a bit), but dplyr has really upped my game, as it’s weirdly more intuitive – particularly with pipes. For a tutorial on that, see here – and perhaps I’ll write more about it later.

OK, so, libraries aside, how do we deal with the file we get at Marineregions.org? Well, once we download and unzip it into a folder, we can take a look at what’s inside

#Let's see what's in this shapefile!
#Note - the paths here are relative to where I
#am working in this file - you may have to change them
ogrInfo("../../MEOW-TNC", "meow_ecos")
## Source: "../../MEOW-TNC", layer: "meow_ecos"
## Driver: ESRI Shapefile number of rows 232 
## Feature type: wkbPolygon with 2 dimensions
## Extent: (-180 -89.9) - (180 86.9194)
## CRS: +proj=longlat +datum=WGS84 +no_defs  
## LDID: 87 
## Number of fields: 9 
##         name type length typeName
## 1   ECO_CODE    2     19     Real
## 2  ECOREGION    4     50   String
## 3  PROV_CODE    2     19     Real
## 4   PROVINCE    4     40   String
## 5   RLM_CODE    2     19     Real
## 6      REALM    4     40   String
## 7   ALT_CODE    2     19     Real
## 8 ECO_CODE_X    2     19     Real
## 9   Lat_Zone    4     10   String

OK, cool! We can see that it’s an ESRI shapefile with 232 rows – that’s 232 ecoregions. Each row of data has a number of properties – Province, Realm (which are both higher-order geospatial classifications), some numeric IDs, and information about latitudinal zone. We can also see that it’s in the WGS84 projection – more on projections another time – and that it’s chock full of polygons.

OK, that’s all well and good, but let’s load and plot the sucker!

#get an ecoregions shapefile, and from that make a province and realm shapefile
#from http://www.marineregions.org/downloads.php
#http://www.marineregions.org/sources.php#meow
regions <- readOGR("../../MEOW-TNC", "meow_ecos")
## OGR data source with driver: ESRI Shapefile 
## Source: "../../MEOW-TNC", layer: "meow_ecos"
## with 232 features and 9 fields
## Feature type: wkbPolygon with 2 dimensions
plot(regions)

[Figure: base plot of the 232 ecoregion polygons]

WHOAH! Cool. Regions! In the ocean! Nice! What a beautiful, simple little plot. Again, well and good. But… what can we do with this? Well, a quick call to class shows us that regions is a SpatialPolygonsDataFrame, which of course has its own plotting methods and such. So, you could fill some things, make some borders, overlay – the sky’s the limit. But there are two things I want to show you how to do to make your life more flexible.
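If you want to poke at it yourself, a trivial sketch:

#what are we dealing with, and what data ride along with the polygons?
class(regions)     #a SpatialPolygonsDataFrame
head(regions@data) #the attribute table: ECOREGION, PROVINCE, REALM, and friends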

Higher Order Geographic Regions

One might be to look at Provinces and Realms. In general, when you have shapefiles, if you want to make aggregated polygons, you have to go through a few steps. Let’s say we want to look at Provinces. A province is composed of many ecoregions. Fortunately, there’s a function, unionSpatialPolygons(), to unite the SpatialPolygons (that’s the spatial class we’re dealing with here, the geometry part of the SpatialPolygonsDataFrame) given some identifier.

#Unite the spatial polygons for each region into one
provinces <- unionSpatialPolygons(regions, regions$PROVINCE)

OK, great. But we still need to add some data to that, as this new provinces object is just a bare SpatialPolygons object. To do that, let’s make a new, reduced data frame using dplyr.

#Make a data frame that will have Province level info and above
prov_data <- regions@data %>%
  group_by(PROVINCE) %>%
  summarise(PROV_CODE = PROV_CODE[1], REALM = REALM[1], RLM_CODE=RLM_CODE[1], Lat_Zone=Lat_Zone[1])

Bueno. We now have a much smaller data frame that is for Provinces only. The last step is to make a new SpatialPolygonsDataFrame by joining the data and the polygons. There are two tricks here. First, make sure the right rows in the data are joined to the right polygons; for that, we’ll use a join statement. Second, the new data frame has to have row names matching the names of the polygons. I don’t often use this, but in making a data frame, you can supply row names. So, here we go:

#merge the polygons with the new data file
#note the row.names argument to make sure they map to each other
provinces <- SpatialPolygonsDataFrame(provinces, 
                                      data=data.frame(
                                        join(data.frame(PROVINCE=names(provinces)),
                                             prov_data),
                                        row.names=row.names(provinces)))
## Joining by: PROVINCE

Not gorgeous, but it gets the job done. We can of course do this for realms as well.

#######
#make realms shapefile
########
#make spatial polygons for realms
realms <- unionSpatialPolygons(regions, regions$REALM)

#make new data
realm_data <- regions@data %>%
  group_by(REALM) %>%
  summarise(RLM_CODE = RLM_CODE[1],  Lat_Zone=Lat_Zone[1])

#merge the two!
realms <- SpatialPolygonsDataFrame(realms, 
                                   data=data.frame(
                                     join(data.frame(REALM=names(realms)),
                                          realm_data),
                                     row.names=row.names(realms)))
## Joining by: REALM

Excellent. So – did it work? And how different are these three different spatial things anyway? Well, let’s plot them!

#########Plot them all
par(mfrow=c(2,2), mar=c(0,0,0,0))
plot(regions, main="Ecoregion", cex.main=5)
plot(provinces, lwd=2, border="red", main="Province")
plot(realms, lwd=2, border="blue", main="Realm")
par(mfrow=c(1,1))

[Figure: side-by-side plots of ecoregions, provinces, and realms]

Lovely.

ggplot ’em!

I admit, I’m a ggplot2 junkie. I just find it the fastest way to make publication quality graphs with little fuss or muss, or to make something quick and dirty to send to colleagues. But you can’t just go and plot a SpatialPolygonsDataFrame in ggplot2 with ease and then use it as you will. So what’s a guy to do?

I will admit, I’m shamelessly gacking the following from https://github.com/hadley/ggplot2/wiki/plotting-polygon-shapefiles. It provides a three-step process where, essentially, you turn the whole mess into a data frame, with the polygons providing the points for plotting geom_path or geom_polygon pieces.

Step 1, you need an ID column in your data. Let’s do this for both ecoregions and provinces

regions@data$id = rownames(regions@data)
provinces@data$id = rownames(provinces@data)

OK – step 2 is the fortify function. Fortify converts an R object into a data frame for ggplot2. In this case –

library(ggplot2)
regions.points = fortify(regions, ECOREGION="id")
## Regions defined for each Polygons
provinces.points = fortify(provinces, PROVINCES="id")
## Regions defined for each Polygons

Great! Now that we have these two new fortified data frames that describe the points we’ll be plotting, the last thing to do is to join the points with the actual, you know, data! For that, I like to use plyr’s join:

regions.df = join(regions.points, regions@data, by="id")
provinces.df = join(provinces.points, provinces@data, by="id")

What’s great about this is that, from now on, if I have another data frame with Ecoregion or Province as one of its columns – for example, say I ran a linear model where Ecoregion was a fixed effect, I have a coefficient for each Ecoregion, and I’ve turned the coefficient table into a data frame with Ecoregion as one of the columns – then as long as the identifying column has the same name in my new data frame and in my plotting data frame, I can use join to add a new column to regions.df or provinces.df for plotting.
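For example, a quick sketch (the coefficient data frame here is made up purely for illustration):

#say we had a coefficient for each ecoregion from some model
eco_coefs <- data.frame(ECOREGION = regions$ECOREGION,
                        coef = rnorm(nrow(regions)))

#join on the shared ECOREGION column to get a plottable data frame
regions_coef.df <- join(regions.df, eco_coefs, by = "ECOREGION")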

But, for now, I can show you how these would plot out in ggplot2. To do this, we use geom_polygon to define an area that we can fill as we want, and geom_path to stroke the outside of the areas and do with them what you will.

#####Make some ggplots for later visualization
base_ecoregion_ggplot <- ggplot(regions.df) + theme_bw() +
  aes(long,lat,group=group) + 
  geom_polygon(fill=NA) +
  geom_path(color="black") +
  coord_equal() 


base_province_ggplot <- ggplot(provinces.df) + theme_bw() +
  aes(long,lat,group=group) + 
  geom_polygon(fill=NA) +
  geom_path(color="black") +
  coord_equal() 

Note that there’s a fill=NA argument? That’s where I could put something like coefficient from that joined data, or temperature, or whatever I’ve tacked on to the whole shebang. Let’s see what they look like in ggplot2.

base_ecoregion_ggplot + ggtitle("Ecoregions")

[Figure: ggplot2 map of ecoregions]

base_province_ggplot + ggtitle("Provinces")

[Figure: ggplot2 map of provinces]

So what’s the advantage of putting them into ggplot? Well, besides using all of the graphical aesthetics for your polygon fills and paths, you can add points (say, sites), lines, or whatnot. One example: let’s say you wanted to visualize the borders of land (and countries!) on the map with ecoregions. Cool! Let’s get the world map, turn it into a data frame, and then add a geom_path with the world map on it.

library(maps)
worldmap <- map('world', plot=F)
worldmap.df <- data.frame(longitude =worldmap$x,latitude=worldmap$y)

base_province_ggplot+
  geom_path(data=worldmap.df, aes(x=longitude, y=latitude, group=NULL), color="darkgreen")

[Figure: province map with world coastlines overlaid in dark green]

The possibilities really are endless at this point for cool visualizations.

EDIT – OK, here’s a cool example with filled polygons using a random ‘score’ to determine fill and RColorBrewer for pretty colors!

#let's make some fancy colors
library(RColorBrewer)

#Make a data frame with Ecoregion as an identifier
thing <- data.frame(ECOREGION = regions$ECOREGION,
                    score = runif(nrow(regions), -100, 100))

#merge the score data with the regions data frame
regions.df2 <- merge(regions.df, thing)

#plot!
ggplot(regions.df2) + theme_bw() +
       aes(long,lat,group=group) + 
       geom_polygon(mapping=aes(fill=score)) +
       geom_path(color="black") +
       coord_equal() +
       scale_fill_gradientn(colours=brewer.pal(11, "Spectral"))

[Figure: ecoregions filled by a random score using the Spectral palette]

Next time, Leaflet! If I can figure out how to post its output. And for those of you who don't know leaflet, prepare to be wowed.

Also, all code for this example is in a gist over here!

Space and SEMs

One question that comes up time and time again when I teach my SEM class is, “What do I do if I have spatially structured data?” Maybe you have data that was sampled on a grid, and you know there are spatial gradients. Maybe your samples are clustered across a landscape. Or at separate sites. A lot of it boils down to worrying about the hidden spatial wee beasties that lurk in the background.

I’m going to stop for a moment and suggest that before we go any further you read Brad Hawkins’s excellent Eight (and a half) deadly sins of spatial analysis where he warns of the danger of throwing out the baby with the bathwater. Remember, in any modeling technique, you want to ensure that you’re capturing as much biological signal as is there, and then adjust for remaining spatial correlation. Maybe your drivers vary in a spatial pattern. That’s OK! They’re still your drivers.

That said, ignoring residual spatial autocorrelation essentially causes you to think you have a larger sample size than you actually do (remember the assumption of independent data points), and as such your standard errors are too tight and you may well produce overconfident results.

To deal with this in a multivariate Structural Equation Modeling context, we have a few options. First, use something like Jon Lefcheck’s excellent piecewiseSEM package and fit your models with mixed model or generalized least squares tools that can accommodate spatial correlation matrices as part of the model. If you have non-spatial information about structure, I’ve started digging into the lavaan.survey package, which has been fun (and is teaching me a lot about survey statistics).

But, what if you just want to go with a model you’ve fit using covariance matrices and maximum likelihood, like you do, using lavaan in R? It should be simple, right?

Well, I’ve kind of tossed this out as a suggestion in the ‘advanced topics’ portion of my class for years, but never implemented it. This year, I got off of my duff, and have been working this up, and have both a solid example, and a function that should make your lives easier – all wrapped up over at github. And I’d love any comments or thoughts on this, as, to be honest, spatial statistics is not where I spend a lot of time. Although I seem to be spending more and more time there these days… silly spatially structured observational datasets…that I seem to keep creating.

Anyway, let’s use as an example the Boreal Vegetation dataset from Zuur et al.’s Mixed Effects Models and Extensions in Ecology with R. The data shows vegetation NDVI from satellite data, as well as a number of other covariates – information on climate (days where the temperature passed some threshold, I believe), wetness, and species richness. And space. Here’s what the data look like, for example:

# Boreality data from http://www.highstat.com/book2.htm
# Mixed Effects Models and Extensions in Ecology with R (2009). 
# Zuur, Ieno, Walker, Saveliev and Smith. Springer
boreal <- read.table("./Boreality.txt", header=T)

#For later
source("./lavSpatialCorrect.R")

#Let's look at the spatial structure
library(ggplot2)

qplot(x, y, data=boreal, size=Wet, color=NDVI) +
  theme_bw(base_size=18) + 
  scale_size_continuous("Index of Wetness", range=c(0,10)) + 
  scale_color_gradient("NDVI", low="lightgreen", high="darkgreen")

[Figure: boreal sample locations, sized by wetness index and colored by NDVI]

So, there are clear associations among variables, but also a good bit of spatial structure. Ruh roh! Well, maybe it’s all in the drivers. Let’s build a model where NDVI is affected by species richness (nTot), wetness (Wet), and climate (T61), and richness is itself also affected by climate.

library(lavaan)

## This is lavaan 0.5-17
## lavaan is BETA software! Please report any bugs.

# A simple model where NDVI is determined
# by nTot, temperature, and Wetness
# and nTot is related to temperature
borModel <- '
  NDVI ~ nTot + T61 + Wet 
  nTot ~ T61
'

#note meanstructure=T to obtain intercepts
borFit <- sem(borModel, data=boreal, meanstructure=T)

OK, great, we have a fit model – but we fear that the SEs may be too small! Is there any spatial structure in the residuals? Let’s look.

# residuals are key for the analysis
borRes <- as.data.frame(residuals(borFit, "casewise"))

#raw visualization of NDVI residuals
qplot(x, y, data=boreal, color=borRes$NDVI, size=I(5)) +
  theme_bw(base_size=17) + 
  scale_color_gradient("NDVI Residual", low="blue", high="yellow")

[Figure: map of raw NDVI residuals]

Well… sort of. A clearer way to see this, one that I like, is to just look at the signs of the residuals.

#raw visualization of sign of residuals
qplot(x, y, data=boreal, color=borRes$NDVI>0, size=I(5)) +
  theme_bw(base_size=17) + 
  scale_color_manual("NDVI Residual >0", values=c("blue", "red"))

[Figure: map of the sign of the NDVI residuals]

OK, we can clearly see the positive residuals clustering in the corners, and negative ones more prevalent in the middle. Sort of. Are they really? Well, we can correct for them once we know the degree of spatial autocorrelation, Moran’s I. To do this, there are a few steps. First, calculate the spatial weight matrix – essentially, the inverse of the distance between any pair of points – so that points far apart carry less weight in the resulting analysis than points close together.

#Evaluate Spatial Residuals
#First create a distance matrix
library(ape)
distMat <- as.matrix(dist(cbind(boreal$x, boreal$y)))

#invert this matrix for weights
distsInv <- 1/distMat
diag(distsInv) <- 0

OK, that done, we can determine whether there was any spatial autocorrelation in the residuals. Let’s just focus on NDVI.

#calculate Moran's I just for NDVI
mi.ndvi <- Moran.I(borRes$NDVI, distsInv)
mi.ndvi

## $observed
## [1] 0.08265236
## 
## $expected
## [1] -0.001879699
## 
## $sd
## [1] 0.003985846
## 
## $p.value
## [1] 0

Yup, it’s there. We can then use this correlation to calculate a spatially corrected sample size, which will be smaller than our initial sample size.
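In equation form, the correction implemented in the next line is the effective sample size

$n_{eff} = n \frac{1 - I}{1 + I}$

where I is the observed Moran’s I.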

#What is our corrected sample size?
n.ndvi <- nrow(boreal)*(1-mi.ndvi$observed)/(1+mi.ndvi$observed)

And given that we can get parameter variances and covariances from the vcov matrix, it’s a snap to calculate new SEs, remembering that the variance of a parameter has the sample size in the denominator.

#Where did we get the SE from?
sqrt(diag(vcov(borFit)))

##    NDVI~nTot     NDVI~T61     NDVI~Wet     nTot~T61   NDVI~~NDVI 
## 1.701878e-04 2.254616e-03 1.322207e-01 5.459496e-01 1.059631e-04 
##   nTot~~nTot       NDVI~1       nTot~1 
## 6.863893e+00 6.690902e-01 1.617903e+02

#New SE
ndvi.var <- diag(vcov(borFit))[1:3]

ndvi.se <- sqrt(ndvi.var*nrow(boreal)/n.ndvi)

ndvi.se

##    NDVI~nTot     NDVI~T61     NDVI~Wet 
## 0.0001848868 0.0024493462 0.1436405689

#compare to old SE
sqrt(diag(vcov(borFit)))[1:3]

##    NDVI~nTot     NDVI~T61     NDVI~Wet 
## 0.0001701878 0.0022546163 0.1322207383

Excellent. From there, it’s a hop, skip, and a jump to calculating a z-score and ensuring that this parameter is still different from zero (or not!)

#new z values
z <- coef(borFit)[1:3]/ndvi.se

2*pnorm(abs(z), lower.tail=F)

##     NDVI~nTot      NDVI~T61      NDVI~Wet 
##  5.366259e-02  1.517587e-47 3.404230e-194

summary(borFit, standardized=T)

## lavaan (0.5-17) converged normally after  62 iterations
## 
##   Number of observations                           533
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                1.091
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.296
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
## Regressions:
##   NDVI ~
##     nTot             -0.000    0.000   -2.096    0.036   -0.000   -0.044
##     T61              -0.035    0.002  -15.736    0.000   -0.035   -0.345
##     Wet              -4.270    0.132  -32.295    0.000   -4.270   -0.706
##   nTot ~
##     T61               1.171    0.546    2.144    0.032    1.171    0.092
## 
## Intercepts:
##     NDVI             10.870    0.669   16.245    0.000   10.870  125.928
##     nTot           -322.937  161.790   -1.996    0.046 -322.937  -30.377
## 
## Variances:
##     NDVI              0.002    0.000                      0.002    0.232
##     nTot            112.052    6.864                    112.052    0.991

See! Just a few simple steps! Easy-peasy! And there are a few changes – the effect of species richness is no longer so clear, for example.

OK, I lied. That’s a lot of steps. But they’re repetitive. So, I whipped up a function that should automate this and produce useful output for each endogenous variable. I need to work on it a bit, and I’m sure issues will come up with latents, composites, etc. But just keep your eyes peeled on the github for the latest update.

lavSpatialCorrect(borFit, boreal$x, boreal$y)

## $Morans_I
## $Morans_I$NDVI
##     observed     expected          sd p.value    n.eff
## 1 0.08265236 -0.001879699 0.003985846       0 451.6189
## 
## $Morans_I$nTot
##     observed     expected          sd p.value    n.eff
## 1 0.03853411 -0.001879699 0.003998414       0 493.4468
## 
## 
## $parameters
## $parameters$NDVI
##             Parameter      Estimate    n.eff      Std.err   Z-value
## NDVI~nTot   NDVI~nTot -0.0003567484 451.6189 0.0001848868  -1.92955
## NDVI~T61     NDVI~T61 -0.0354776273 451.6189 0.0024493462 -14.48453
## NDVI~Wet     NDVI~Wet -4.2700526589 451.6189 0.1436405689 -29.72734
## NDVI~~NDVI NDVI~~NDVI  0.0017298286 451.6189 0.0001151150  15.02696
## NDVI~1         NDVI~1 10.8696158663 451.6189 0.7268790958  14.95382
##                  P(>|z|)
## NDVI~nTot   5.366259e-02
## NDVI~T61    1.517587e-47
## NDVI~Wet   3.404230e-194
## NDVI~~NDVI  4.889505e-51
## NDVI~1      1.470754e-50
## 
## $parameters$nTot
##             Parameter    Estimate    n.eff     Std.err   Z-value
## nTot~T61     nTot~T61    1.170661 493.4468   0.5674087  2.063171
## nTot~~nTot nTot~~nTot  112.051871 493.4468   7.1336853 15.707431
## nTot~1         nTot~1 -322.936937 493.4468 168.1495917 -1.920534
##                 P(>|z|)
## nTot~T61   3.909634e-02
## nTot~~nTot 1.345204e-55
## nTot~1     5.479054e-02

Happy coding, and I hope this helps some of you out. If you’re more of a spatial guru than I, and have any suggestions, feel free to float them in the comments below!

Here a Tau, there a Tau… Plotting Quantile Regressions

I’ve ended up digging into quantile regression a bit lately (see this excellent gentle introduction to quantile regression for ecologists [pdf] for what it is and some great reasons why to use it – see also here and here). In R this is done via the quantreg package, which is pretty nice, and has some great plotting diagnostics, etc. But what it doesn’t have out of the box is a way to simply plot your data, and then overlay quantile regression lines at different levels of tau.

The documentation has a nice example of how to do it, but it’s long tedious code. And I had to quickly whip up a few plots for different models.

So, meh, I took the tedious code and wrapped it into a quickie function, which I drop here for your delectation. Unless you have some better, fancier way to do it (which I’d love to see – especially for ggplot…).

Here’s the function:

quantRegLines <- function(rq_obj, lincol="red", ...){  
  #get the taus
  taus <- rq_obj$tau
  
  #get x
  x <- rq_obj$x[,2] #the lone predictor (assumes an intercept plus a single predictor)
  xx <- seq(min(x, na.rm=T),max(x, na.rm=T),1)
  
  #calculate y over all taus
  f <- coef(rq_obj)  
  yy <- cbind(1,xx)%*%f
  
  if(length(lincol)==1) lincol=rep(lincol, length(taus))
  #plot all lines
  for(i in 1:length(taus)){
    lines(xx,yy[,i], col=lincol[i], ...)
  }
  
}

And an example use.

library(quantreg) #for rq() and the engel data
library(scales)   #for alpha() transparency in the plots

data(engel)
attach(engel)
 
taus <- c(.05,.1,.25,.75,.9,.95)
plot(income,foodexp,xlab="Household Income",
     ylab="Food Expenditure",
     pch=19, col=alpha("black", 0.5))
 
rq_fit <- rq((foodexp)~(income),tau=taus)
 
quantRegLines(rq_fit)

Oh, and I set it up to make pretty colors in plots, too.

plot(income, foodexp, xlab = "Household Income", 
    ylab = "Food Expenditure", 
    pch = 19, col = alpha("black", 0.5))

quantRegLines(rq_fit, rainbow(6))
legend(4000, 1000, taus, rainbow(6), title = "Tau")
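And for the ggplot-curious: ggplot2 does have geom_quantile(), which calls quantreg under the hood. A minimal sketch, reusing the engel data and taus from above:

library(ggplot2)

ggplot(engel, aes(x = income, y = foodexp)) +
  geom_point(alpha = 0.5) +
  geom_quantile(quantiles = taus) +
  labs(x = "Household Income", y = "Food Expenditure")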

All of this is in a repo over at github (natch), so, fork and play.

More on Bacteria and Groups

Continuing with bacterial group-a-palooza

I followed Ed’s suggestions and tried both a binomial distribution and a Poisson distribution for abundance, such that the probability of the density of one species s in one group g in one plot r, where there are S_g species in group g, is

$A_{rgs} \sim \mathrm{Poisson}\left(\frac{A_{rg}}{S_g}\right)$
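In code form, here’s a minimal sketch of that likelihood (not the actual analysis script; abund is a hypothetical plots-by-species abundance matrix and groups is a vector assigning each species to a group):

#log-likelihood of one grouping under the Poisson model:
#each species' abundance is Poisson with mean = (group total in that plot) / (species in group)
poissonGroupLL <- function(abund, groups){
  ll <- 0
  for(g in unique(groups)){
    sp <- which(groups == g)                    #species in group g
    A_rg <- rowSums(abund[, sp, drop = FALSE])  #group total in each plot
    lambda <- A_rg / length(sp)                 #expected abundance per species
    ll <- ll + sum(dpois(abund[, sp, drop = FALSE], lambda, log = TRUE))
  }
  ll
}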

In the analysis I’m doing, interestingly, the results do change a bit, but in a way that confirms the original network-only results.

I am having one funny thing, though, which I can’t lock down. Namely, the no-group option always has the lowest AIC once I include abundances – and this is true both for binomial and Poisson distributions. Not sure what that is about. I’ve put the code for all of this here and made a sample script below. This doesn’t reproduce the behavior, but, still. Not quite sure what this blip is about.

For the sample script, we have five species and three possible grouping structures. It looks like this, where red nodes are species or groups and blue nodes are sites:

[Figure: network diagrams of the three grouping structures – red nodes are species or groups, blue nodes are sites]

And the data looks like this

  low med high  1   2   3
1   1   1    1 50   0   0
2   2   1    1 45   0   0
3   3   2    2  0 100   1
4   4   2    2  0 112   7
5   5   3    2  0  12 110

So, here’s the code:

And the results:

> aicdf
     k LLNet LLBinomNet  LLPoisNet   AICpois  AICbinom AICnet
low  5     0    0.00000  -20.54409  71.08818  60.00000     30
med  3     0  -18.68966  -23.54655  65.09310  73.37931     18
high 2     0 -253.52264 -170.73361 353.46723 531.04527     12

We see that the two different estimations disagree, with the binomial favoring disaggregation and the Poisson favoring moderate aggregation. Interesting. Also, the naive network-only approach favors complete aggregation. Interesting. Thoughts?

Filtering Out Exogenous Pairs of Variables from a Basis Set

Sometimes in an SEM for which you're calculating a test of D-Separation, you want all exogenous variables to covary. If you have a large model with a number of exogenous variables, coding that into your basis set can be a pain, and hence, you can spend a lot of time filtering out elements that aren't part of your basis set, particularly with the ggm library. Here's a solution – a function I'm calling filterExoFromBasiSet


#Takes a basis set list from basiSet in ggm and a vector of variable names

filterExoFromBasiSet <- function(set, exo) {
    #first two elements of each basis set entry = the pair being tested
    pairSet <- t(sapply(set, function(alist) cbind(alist[1], alist[2])))

    #entries with an exogenous variable in the first or second slot
    colA <- which(pairSet[, 1] %in% exo)
    colB <- which(pairSet[, 2] %in% exo)

    #entries where *both* members of the pair are exogenous
    both <- c(colA, colB)
    both <- unique(both[which(duplicated(both))])

    #if no pair is fully exogenous, return the basis set untouched
    if (length(both) == 0) return(set)

    set[-both]
}

How does it work? Let's say we have the following model:

y1 <- x1 + x2

Now, we should have no basis set. But…

library(ggm)

modA <- DAG(y1 ~ x1 + x2)
basiSet(modA)
## [[1]]
## [1] "x2" "x1"

Oops – there's a basis set! Now, instead, let's filter it

basisA <- basiSet(modA)
filterExoFromBasiSet(basisA, c("x1", "x2"))
## list()

Yup, we get back an empty list.

This function can come in handy. For example, let's say we're testing a model with an exogenous variable that does not connect to an endogenous variable, such as

y <- x1
x2 (which is exogenous)

Now –


modB <- DAG(y ~ x1, 
               x2 ~ x2)

basisB <- basiSet(modB)
filterExoFromBasiSet(basisB, c("x1", "x2"))
## [[1]]
## [1] "x2" "y"  "x1"

So, we have the correct basis set with only one element.

What about if we also have an endogenous variable that has no paths to it?


modC <- DAG(y1 ~ x1, 
               x2 ~ x2, 
               y2 ~ y2)

basisC <- basiSet(modC)

filterExoFromBasiSet(basisC, c("x1", "x2"))
## [[1]]
## [1] "y2" "x2"
## 
## [[2]]
## [1] "y2" "x1"
## 
## [[3]]
## [1] "y2" "y1" "x1"
## 
## [[4]]
## [1] "x2" "y1" "x1"

This yields the correct 4 element basis set.

Extracting p-values from different fit R objects

Let's say you want to extract a p-value and save it as a variable for future use from a linear or generalized linear model – mixed or non! This is something you might want to do if, say, you were calculating Fisher's C from an equation-level Structural Equation Model. Here's how to extract the effect of a variable from multiple different fit models. We'll start with a data set with x, y, z, and a block effect (we'll see why in a moment).


x <- rep(1:10, 2)
y <- rnorm(20, x, 3)
block <- c(rep("a", 10), rep("b", 10))

mydata <- data.frame(x = x, y = y, block = block, z = rnorm(20))

Now, how would you extract the p-value for the parameter fit for z from a linear model object? Simply put, use the t-table from the lm object's summary

alm <- lm(y ~ x + z, data = mydata)

summary(alm)$coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.1833     1.3496  0.8768 0.392840
## x             0.7416     0.2190  3.3869 0.003506
## z            -0.4021     0.8376 -0.4801 0.637251

# Note that this is a matrix.  
# The third row, fourth column is the p value
# you want, so...

p.lm <- summary(alm)$coefficients[3, 4]

p.lm
## [1] 0.6373

That's a linear model, what about a generalized linear model?

aglm <- glm(y ~ x + z, data = mydata)

summary(aglm)$coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.1833     1.3496  0.8768 0.392840
## x             0.7416     0.2190  3.3869 0.003506
## z            -0.4021     0.8376 -0.4801 0.637251

# Again, is a matrix.  
# The third row, fourth column is the p value you
# want, so...

p.glm <- summary(aglm)$coefficients[3, 4]

p.glm
## [1] 0.6373

That covers linear and generalized linear models. What about a nonlinear least squares fit with nls?


anls <- nls(y ~ a * x + b * z, data = mydata, 
     start = list(a = 1, b = 1))

summary(anls)$coefficients
##   Estimate Std. Error t value  Pr(>|t|)
## a   0.9118     0.1007   9.050 4.055e-08
## b  -0.4651     0.8291  -0.561 5.817e-01

# Again, is a matrix.  
# The second row, fourth column is the p value you
# want, so...

p.nls <- summary(anls)$coefficients[2, 4]

p.nls
## [1] 0.5817

Great. Now, what if we were running a mixed model? First, let's look at the nlme package. Here, the relevant part of the summary object is the tTable

library(nlme)
alme <- lme(y ~ x + z, random = ~1 | block, data = mydata)

summary(alme)$tTable
##               Value Std.Error DF t-value  p-value
## (Intercept)  1.1833    1.3496 16  0.8768 0.393592
## x            0.7416    0.2190 16  3.3869 0.003763
## z           -0.4021    0.8376 16 -0.4801 0.637630

# Again, is a matrix.  
# But now the third row, fifth column is the p value
# you want, so...

p.lme <- summary(alme)$tTable[3, 5]

p.lme
## [1] 0.6376

Last, what about lme4? Now, for a linear lmer object, you cannot get a p value. But, if this is a generalized linear mixed model, you are good to go (as in Shipley 2009). Let's try that here.

library(lme4)

almer <- lmer(y ~ x + z + 1 | block, data = mydata)

# no p-value!
summary(almer)@coefs
##             Estimate Std. Error t value
## (Intercept)    4.792     0.5823   8.231

# but, for a generalized linear mixed model
# and yes, I know this is a
# bad model but, you know, demonstration!

aglmer <- lmer(y + 5 ~ x + z + (1 | block), 
        data = mydata, family = poisson(link = "log"))

summary(aglmer)@coefs
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept)  1.90813    0.16542  11.535 8.812e-31
## x            0.07247    0.02471   2.933 3.362e-03
## z           -0.03193    0.09046  -0.353 7.241e-01

# matrix again!  Third row, fourth column
p.glmer <- summary(aglmer)@coefs[3, 4]

p.glmer
## [1] 0.7241
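And, since the point of extracting these p-values was Fisher's C: once you have the p-values from the k independence claims in your basis set, the test statistic is C = -2 * sum(ln(p_i)), compared against a chi-squared distribution with 2k degrees of freedom (sensu Shipley 2009). A quick sketch, using the values extracted above purely for illustration:

# combine p-values from k independence claims into Fisher's C
pvals <- c(p.lm, p.nls, p.lme)  # illustration only - use your own basis set's p-values
C <- -2 * sum(log(pvals))

# compare to a chi-squared distribution with 2k degrees of freedom
pchisq(C, df = 2 * length(pvals), lower.tail = FALSE)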

A Quick Note on Weighting with nlme

I’ve been doing a lot of meta-analytic things lately. More on that anon. But one quick thing that came up was variance weighting with mixed models in R, and after a few web searches, I wanted to post this, more as a note-to-self and others than anything. Now, in a simple linear model, weighting by variance or sample size is straightforward.

#variance
lm(y ~ x, data = dat, weights = 1/v)

#sample size
lm(y ~ x, data = dat, weights = n)

You can use the same sort of weights argument with lmer. But, what about if you’re using nlme? There are reasons to do so. Things change a bit, as nlme uses a wide array of weighting functions for the variance to give it some wonderful flexibility – indeed, it’s a reason to use nlme in the first place! But, for such a simple case, to get the equivalent of the above, here’s the tricky little difference. I’m using gls, generalized least squares, but this should work for lme as well.

#variance
gls(y ~ x, data=dat, weights = ~v)

#sample size
gls(y ~ x, data = dat, weights = ~1/n)
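For the record, a one-sided formula given to weights is just shorthand for a varFixed() variance function, so the same thing could be written out explicitly (a sketch with the same hypothetical dat, v, and n as above):

library(nlme)

#variance: residual variance proportional to v, i.e. weight 1/v
gls(y ~ x, data = dat, weights = varFixed(~v))

#sample size: residual variance proportional to 1/n, i.e. weight n
gls(y ~ x, data = dat, weights = varFixed(~1/n))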

OK, end note to self. Thanks to John Griffin for prompting this.

Why I’m Teaching Computational Data Analysis for Biology

This is a x-post from the blog I’ve setup for my course blog. As my first class at UMB, I’m teaching An Introduction to Computational Data Analysis for Biology – basically mixing teaching statistics and basic programming. It’s something I’ve thought a long time about teaching – although the rubber meeting the road has been fascinating.

As part of the course, I’m adapting an exercise that I learned while taking English courses – in particular from a course on Dante’s Divine Comedy. I ask that students write 1 page weekly to demonstrate that they are having a meaningful interaction with the material. I give them a few pages from this book as a prompt, but really they can write about anything. One student will post on the blog per week (and I’m encouraging them to use the blog for posting other materials as well – we shall see, it’s an experiment). After they post, I hope that it will start a conversation, at least amongst participants in the class. I also think this post might pair well with some of Brian McGill’s comments on statistical machismo to show you a brief sketch of my own evolution as a data analyst.

I’ll be honest, I’m excited. I’m excited to be teaching Computational Data Analysis to a fresh crop of graduate students. I’m excited to try and take what I have learned over the past decade of work in science, and share that knowledge. I am excited to share lessons learned and help others benefit from the strange explorations I’ve had into the wild world of data.

I’m ready to get beyond the cookbook approach to data. When I began learning data analysis, way back in an undergraduate field course, it was all ANOVA all the time (with brief diversions to regression or ANCOVA). There was some point and click software that made it easy, so long as you knew the right recipe for the shape of your data. The more complex the situation, the more creative you had to be in getting an accurate sample, and then in determining what was the right incantation of sums of squares to get a meaningful test statistic. And woe be it if your p value from your research was 0.051.

I think I enjoyed this because it was fitting a puzzle together. That, and I love to cook, so, who doesn’t want to follow a good recipe?

Still, there was something that always nagged me. This approach – which I encountered again and again – seemed stale. The body of analysis was beautiful, but it seemed divorced from the data sets that I saw starting to arrive on the horizon – data sets that were so large, or chock full of so many different variables, that something seemed amiss.

The answer rippled over me in waves. First, comments from an editor – Ram Meyers – for a paper of mine began to lift the veil. I had done all of my analyses as taught (and indeed even used for a class) using ANOVA and regression, multiple comparison, etc. etc. in the classic vein. Ram asked why, particularly given that the Biological processes that generated my data should in no way generate something with a normal – or even log-normal – distribution. While he agreed that the approximation was good enough, he made me go back, and jump off the cliff into the world of generalized linear models. It was bracing. But he walked me through it – over the phone even.

So, a new recipe, yes? But it seemed like something more was looming.

Then, an expiration of a JMP site license with one week left on a paper revision left me bereft. The only free tool I could turn to that seemed to do what I wanted it to do was R.

Wonderful, messy, idiosyncratic R.

I jumped in and learned the bare minimum of what I needed to know to do my analysis…and lost myself.

I had taken computer science in college, and even written the backend of a number of websites in PERL (also wonderful, messy, and idiosyncratic). What I enjoyed most about programming was that you could not hide from how you manipulated information. Programming has a functional aspect at the core where an input must be translated into a meaningful output according to the rules that you craft.

Working with R, I was crafting rules to generate meaningful statistical output. But what were those rules but my assumptions about how nature worked? The fundamentals of what I was doing all along – fitting a line to data with an error distribution that should be based in biology, not arbitrary assumptions – were laid all the more bare. Some grumblingly lovely help from statistical denizens on the R help boards helped to bring this into sharp focus.

So, I was ready when, for whatever reason, fate thrust me into a series of workshops on Bayesian statistics, AIC analysis, hierarchical modeling, time series analysis, data visualization, meta-analysis, and last – Structural Equation Modeling.

I was delighted to learn more and more of how statistical analysis had grown beyond what I had been taught. I drank deeply of it. I know, that’s pretty nerdy, but, there you have it.

The new techniques all shared a common core – that they were engines of inference about biological processes. How I, as the analyst, made assumptions about how the world worked was up to me. Once I had a model of how my system worked in mind – sketched out, filled with notes on error distributions, interactions, and more, I could sit back and think about what inferential tools would give me the clearest answers I needed.

Instead of finding the one right recipe in a giant cookbook, I had moved to choosing the right tools out of a toolbox – and then using the tools of computer science (optimizing algorithms, thinking about efficient data storage, etc.) to let those tools bring data and biological models together.

It’s exciting. And that’s the core philosophy I’m trying to convey in this semester. (N.B. the spellchecker tried to change convey to convert – there’s something there).

Think about biology. Think about a line. Think about a probability distribution. Put them together, and find out what stories your data can tell you about the world.