## Overview

In this vignette we will see how to retrieve and prepare Reynolds optimally interpolated sea surface temperature (OISST) data for calculating marine heatwaves (MHWs). The OISST product is a global 1/4 degree gridded dataset of Advanced Very High Resolution Radiometer (AVHRR) derived SSTs at a daily resolution, starting on 1 September 1981. The source of the data is currently the NOAA NCDC.

Each daily global file is around 8.3 MB, so they add up to a large amount of data when a time series of the recommended 30 year minimum duration for the detection of MHWs is downloaded. If one were to download all of the data currently available it would exceed 100 GB of total disk space. It is therefore best practice to download only a subset of the data as would match one’s study area. Thanks to the rerddap package this is easy to do in R.

For this vignette we will be accessing the NOAA OISST data hosted on this ERDDAP web interface. One may download the data there manually with the user friendly web interface, or use the rerddap package to do the same through R. Due to some recent developments the rerddap package on CRAN may provide an unstable interface to the NOAA ERDDAP servers. This issue has already been addressed in the development version of the package and so we will begin by installing the package from GitHub. Please note that this package has quite a few dependencies and so may take a few minutes to install. If any errors are encountered during this process please consult the GitHub page here. We will also need functionality in the tidyverse package that is only available in the development version so we will be installing that from GitHub as well.

# The two packages we will need
# NB: The packages only need to be installed from GitHub once
# devtools::install_github("tidyverse/tidyverse")
# devtools::install_github("ropensci/rerddap")

library(tidyverse)
library(rerddap)

# The information for the NOAA OISST data
rerddap::info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")

With our packages loaded and our target dataset identified we may now begin the download with the griddap() function. While putting this vignette together however I noticed one little hiccup in the work flow. It seems that the ERDDAP server does not like it when one tries to access more than nine consecutive years of data in one request, regardless of the spatial extent being requested. So before we download our data we are going to make a wrapper function that helps us control the range of times we want to download. This will reduce the amount of redundant coding we would otherwise need to do.

# This function expects the user to provide it with a start and end date
OISST_sub <- function(time_df){
oisst_res <- griddap(x = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon",
url = "https://www.ncei.noaa.gov/erddap/",
time = c(time_df$start, time_df$end),
depth = c(0, 0),
latitude = c(-40, -35),
longitude = c(15, 21),
fields = "sst")\$data %>%
mutate(time = as.Date(str_remove(time, "T00:00:00Z"))) %>%
dplyr::rename(t = time, temp = sst) %>%
select(lon, lat, t, temp) %>%
na.omit()
}

In the wrapper function above we see that we have chosen to download only the ‘sst’ data out of the several variables (‘fields’) available to us. We also see that we have chosen the spatial extent of latitude -40 to -35 and longitude 15 to 21. This a small window over some of the Agulhas Retroflection to the south west of the coastline of South Africa. A larger area is not being chosen here simply due to the speed constraints of downloading the data and detecting the events therein. One may simply change the lon/lat values above as necessary to match the desired study area. The function will also be re-labelling the ‘time’ column as ‘t’, and the ‘sst’ column as ‘temp’. We do this so that they match the default column names that are expected for calculating MHWs and we won’t have to do any extra work later on.

One must note here that depending on the RAM available on one’s machine, it may not be possible to handle all of the data downloaded at once if they are very large (e.g. > 5 GB). The discussion on the limitations of the R language due to its dependence on virtual memory is beyond the scope of this vignette, but if one limits one’s downloads to no more than several square pixels at a time that should be fine. Were one to try to download the whole Indian Ocean, for example, that may cause issues if being run on a laptop or computer of a similar power.

### The date range

With our wrapper function written we would now need to run it several times in order to grab all of the OISST data from 1982-01-01 to 2018-12-31. Even though each year of data for the extent used in this vignette is only ~360 KB, the server does not like it when more than 9 years of consecutive data are requested. The server will also end a users connection after ~17 individual files have been requested. Because we can’t download all of the data in one request, and we can’t download the data one year at a time, we will need to make requests for multiple batches of data. To accomplish this we will create a dataframe of start and end dates that will allow us to automate the entire download while meeting the aforementioned criteria.

# Date download range by start and end dates per year
dl_years <- data.frame(date_index = 1:5,
start = as.Date(c("1982-01-01", "1990-01-01",
"1998-01-01", "2006-01-01", "2014-01-01")),
end = as.Date(c("1989-12-31", "1997-12-31",
"2005-12-31", "2013-12-31", "2018-12-31")))

### The data

One could also use the plyr suite of functions to automate the process of downloading and processing multiple files, but I’ve chosen here to stick with the tidyverse native approach. If the below chunk of code fails or times out, simply re-run it until all of the data have been downloaded.

It is worth pointing out here that these data are downloaded as cached files on the users computer by using the hoardr package. This means that if one runs the same command again, it will not re-download the data because it first looks in the folder where it has automatically cached the data for you and sees that it may simply draw the data from there. No need to change anything or write a second script for loading data.

# Download all of the data with one nested request
# The time this takes will vary greatly based on connection speed
system.time(
OISST_data <- dl_years %>%
group_by(date_index) %>%
group_modify(~OISST_sub(.x)) %>%
ungroup() %>%
select(lon, lat, t, temp)
) # 921 seconds, ~184 seconds per batch

If the above code chunk is giving errors it is likely due to either not having the development version of the tidyverse installed (see beginning of vignette), or one’s Internet connection is timing out. There are also rare instances where the NOAA server is not responding due to an issue on their end. Any connection based issues may be resolved by simply waiting for a few minutes, or by ensuring a stable connection.

### The saved file

With the data downloaded and prepared for further use, all that’s left to do is save them.

# Save the data as an .Rda file as it has a much better compression rate than .RData
saveRDS(OISST_data, file = "~/Desktop/OISST_vignette.Rda")

Note above that I have chosen to save the file to my desktop. This is not normally where one (hopefully!) would save such a file. Rather one would be saving these data into the project folder out of which one is working. In the next vignette we will see how to detect MHWs in gridded data.