Animated Species Invasions with rgbif and gganimate

Introduction

Since I discovered GBIF, I’ve been hooked. What is GBIF? From their website: “GBIF—the Global Biodiversity Information Facility—is an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.” In 2018, GBIF passed the mark of one billion occurence records, which is just incredible. As far as open science projects go, GBIF should be a model for everyone. They’ve not only made their data available, but they’ve made it accessible. Their massive database is lightning fast thanks to hadoop and related tools running the backend, their interface and user experience is top notch (and multilingual!), they’re pushing the biodiversity field forward by creating new standards and tools, and they’re supporting early career scientists and open source projects that tie in with their mission.

I’ve spent hours playing with the GBIF website, where you can look at species occurences on a map and filter by time, taxonomy, etc. But what if we want to analyze this data ourselves in R? Luckily Scott Chamberlain at rOpenSci has written the rgbif package, which lets us interact with the GBIF API to download data into R. I’ve also been playing with Thomas Lin Pedersen’s excellent redesign of the gganimate package, and I thought that visualizing species occurence through time would be an excellent way to utilize both of these tools!

Which species should we animate? Well I live in Pennsylvania, where we’ve heard a lot about the Spotted Lanterfly recently. The Spotted Lanternfly is an invasive insect pest that is native to China but recently spread to Pennsylvania. This pest represents a huge threat to agriculture as it can feed on fruit and lumber trees. My plan is to fetch some data on the Spotted Lanternfly (and other invasive species) from GBIF using the rgbif package, and then animate their spread through time in the style outlined by Daniela Vázquez in this excellent blog post, but I’ll use the updated gganimate.

Getting data with rgbif

First we’ll load the rgbif package to get data, and the tidyverse for some data cleaning

#load packages
library(rgbif)
library(tidyverse)
library(gganimate)

Getting data with rgbif is dead simple, just use the occ_search() function with the species name.

#get data for spotted lanternfly
fly <- occ_search(scientificName = "Lycorma delicatula")

This gives us a dataframe where each row is one species occurence record. As we can see, there’s tons of information here. I’m going to clean up some names and select only a few columns that we’ll focus on today. I’m also going to remove some occurences that have missing data.

Data cleaning

#select the columns we need and filter missing data out
fly_data <- 
  fly$data %>%
  select(lon = decimalLongitude, lat = decimalLatitude, date = eventDate) %>%
  filter(!is.na(date)) %>%
  filter(!(is.na(lat) | is.na(lon))) %>%
  mutate(date = parse_datetime(date))

Next I’m going to round the longitude and latitude values to the first decimal place. I’m doing this because there are many records that are in clusters and I want to see each cluster as one point that will grow over time. If you were doing this for a smaller area like a single state or country you might want to leave things more spread out, but I’m working globally so I’m ok with sacrificing some resolution in the data. I’m also going to round the dates down to the nearest month since I want my animation to proceed month-by-month.

fly_data$lon <- round(fly_data$lon, 1)
fly_data$lat <- round(fly_data$lat, 1)
fly_data$date <- lubridate::floor_date(fly_data$date, unit = "month")

Next I’m going to add some useful fields for later and clean up the data a little bit. I’m adding a location ID so that we can see how many unique locations we have and group by that variable later. Then I’m going to add a tally to see how many occurences there are for each location and date. Finally I’ll make sure there’s no duplicates.

Data wrangling

fly_data_edited <-
  fly_data %>%
  group_by(lon, lat) %>%
  { mutate(ungroup(.), location = group_indices(.)) } %>%
  ungroup() %>%
  group_by(location, date, lon, lat) %>%
  add_tally() %>%
  arrange(location, date) %>%
  summarize_all(sum) %>%
  ungroup()

This is looking good, but for this animation, I want each location to start out as a small circle and grow over time as we add more members. To get this “growing” and make a smooth animation, I’m going to use tidyr::expand along a dummy field that has all dates to get the number of occurences for each location at every month between the start and end of our animation. Then I’ll get the cumulative sum so that we can make each location “grow”. This data wrangling step is a little unorthodox, and there’s probably a better way to do it, but this will work.

#get all dates in dataset
all_dates <- seq(min(fly_data_edited$date), max(fly_data_edited$date), by = "month")
#make into a dummy tibble
dates_df <- tibble(lon = runif(length(all_dates)), lat = runif(length(all_dates)), date = all_dates, location = as.integer(rep(10000, length(all_dates))), n = as.integer(rep(0, length(all_dates))))

#expand+clean the data
fly_data_anim <-
  as.tibble(rbind(fly_data_edited, dates_df)) %>%
  tidyr::expand(date, nesting(location, lon, lat)) %>%
  left_join(., fly_data_edited, by=c("location", "date")) %>%
  select(date, location, lon.x, lat.x, n) %>%
  replace_na(list(n=0)) %>% 
  group_by(location) %>%
  mutate(n_sum = cumsum(n)) %>%
  arrange(location, date) %>%
  select(-5) %>%
  rename(lon = lon.x, lat = lat.x) %>%
  filter(location != 10000) %>%
  filter(date >= "2000-01-01") %>%
  arrange(n_sum)

You’ll notice that at the end I filtered out any entries that were before the year 2000. This is because after looking at the data, I noticed there’s almost no change during this period, but it adds a ton of dead frames to the beginning of the animation, so I just removed it to make our visualization more dynamic.

Make the animation

Now here comes the animation! First we’ll build a map with ggplot, add our locations with geom_point() and then use gganimate::transition_time() to flow through the dates and make the points grow.

#define colors
colors <- c("#3e0074", "#8900ff", "#a360fb", "#a360fb", "#d0a3ff", "#d0a3ff", "#d0a3ff", "#d0a3ff")
pal <- colorRampPalette(colors)

#make our ggplot
world <- ggplot() +
  borders("world", colour = "black", fill = "white") +
  geom_point(data = fly_data_anim, aes(x = lon, y = lat, size = ifelse(n_sum==0, NA, n_sum), colour = n_sum),
             alpha = .5) +
  scale_size_continuous(range = c(1, 6)) +
  scale_colour_gradientn(colours = pal(256)) +
  labs(title = "Documented Locations of Spotted Lanternfly", 
       subtitle = "Date: {frame_time}") +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        axis.line = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_rect(fill = "#b3cde0"),
        legend.position = "none") +
  coord_map(xlim = c(-180,180), ylim = c(-60, 90)) +
  transition_time(date) +
  ease_aes("linear")

#make the animation
animate(world, nframes = 100, fps = 10, height = 604, width = 1000, renderer = gifski_renderer(loop = FALSE))

If animation isn’t playing, click it or reload page, it’s set to stop at the last frame

Nice! I think there’s room for improvement in the aesthetics of this animation, and maybe we could cut off a few more of the early years, but I’ll leave that for next time. I think this shows super clearly how there was a long period where the Spotted Lanternfly was isolated to China, and then in the 2010s it spread quickly to korea and other areas in East Asia, before spreading to Pennsylvania in 2016 (official dates from the USDA say the first occurence in PA was 2014). And our animation shows very clearly that the invasion is in full swing, exploding to record numbers in 2018… yikes!

Another example with Kudzu

This is too fun, so let’s do one more. Up next, Kudzu! Kudzu is one of the most costly and problematic invasive species in the United States. Kudzu is a vine that grows fast and chokes out whatever it covers. The vine was spread to the US from Japan and has since covered much of the southern US. I actually don’t know when Kudzu was spread to the US, so this will be a learning experience for me!

######kudzu
#get the data
kudzu <- occ_search(scientificName = "Pueraria montana var. lobata")

kudzu_data <- 
  kudzu$data %>%
  select(lon = decimalLongitude, lat = decimalLatitude, date = eventDate) %>%
  filter(!is.na(date)) %>%
  filter(!(is.na(lat) | is.na(lon))) %>%
  mutate(date = parse_datetime(date))

kudzu_data$lon <- round(kudzu_data$lon, 1)
kudzu_data$lat <- round(kudzu_data$lat, 1)
kudzu_data$date <- lubridate::floor_date(kudzu_data$date, unit = "month")

kudzu_data_edited <-
  kudzu_data %>%
  group_by(lon, lat) %>%
  { mutate(ungroup(.), location = group_indices(.)) } %>%
  ungroup() %>%
  group_by(location, date, lon, lat) %>%
  add_tally() %>%
  arrange(location, date) %>%
  summarize_all(sum) %>%
  ungroup()

kudzu_all_dates <- seq(min(kudzu_data_edited$date), max(kudzu_data_edited$date), by = "month")

kudzu_dates_df <- tibble(lon = runif(length(kudzu_all_dates)), lat = runif(length(kudzu_all_dates)), date = kudzu_all_dates, location = as.integer(rep(10000, length(kudzu_all_dates))), n = as.integer(rep(0, length(kudzu_all_dates))))

kudzu_data_anim <-
  as.tibble(rbind(kudzu_data_edited, kudzu_dates_df)) %>%
  tidyr::expand(date, nesting(location, lon, lat)) %>%
  left_join(., kudzu_data_edited, by=c("location", "date")) %>%
  select(date, location, lon.x, lat.x, n) %>%
  replace_na(list(n=0)) %>% 
  group_by(location) %>%
  mutate(n_sum = cumsum(n)) %>%
  arrange(location, date) %>%
  select(-5) %>%
  rename(lon = lon.x, lat = lat.x) %>%
  filter(location != 10000) %>%
  arrange(n_sum)

colors <- c("#3e0074", "#8900ff", "#a360fb", "#a360fb", "#d0a3ff", "#d0a3ff", "#d0a3ff", "#d0a3ff")
pal <- colorRampPalette(colors)

#make our ggplot
world <- ggplot() +
  borders("world", colour = "black", fill = "white") +
  geom_point(data = kudzu_data_anim, aes(x = lon, y = lat, size = ifelse(n_sum==0, NA, n_sum), colour = n_sum),
             alpha = .5) +
  scale_size_continuous(range = c(1, 8)) +
  scale_colour_gradientn(colours = pal(256)) +
  labs(title = "Documented Locations of Kudzu", 
       subtitle = "Date: {frame_time}") +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        axis.line = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.background = element_rect(fill = "#b3cde0"),
        legend.position = "none") +
  coord_map(xlim = c(-180,180), ylim = c(-60, 90)) +
  transition_time(date) +
  ease_aes("linear")

animate(world, nframes = 300, fps = 20, height = 604, width = 1000)

Wow, looks like Kudzu is a problem in Australia as well! According to our map, Kudzu spread to the US in the 60s, and has just gotten worse since then. The infestation in Australia happened around the same time, could there be a related cause?

I’m super addicted to these maps. Of course they’re not representative of actual species ranges, as they’re affected by sampling bias, but I do think it’s a great way to get an initial look at occurence data, and we can even see lots of patterns like invasive species that match well with published in-depth studies. To end, I’ll leave you with a similar gif that I made for Thanksgiving showing the GBIF occurences of Wild Turkey (Meleagris gallopavo) through time in the United States. I worked harder on the aesthetics for this one, and I think it’s a good example of how cool these visualizations can be!

Related

comments powered by Disqus