NCDC Storm Events - Importing Data

The NCDC Storm Events database is a collection of three datasets of weather events for areas monitored by the National Weather Service. The data begin in January 1950 and, at the time of writing, run through June 2018.

There are three tables included in the database (the terms “database” and “tables” are used loosely here; they’re csv.gz files on an FTP server).

  • details - A 51 x n dataset (columns x rows)

  • fatalities - An 11 x n dataset

  • locations - An 11 x n dataset

A mostly complete codebook is available on the FTP server; I have reworked this document into a format more readable for me.

Each table is split by year into gzipped CSV files available for download. The naming format is as follows,

StormEvents_{TABLE}-ftp_v1.0_d{YEAR}_c{YMD}.csv.gz

where TABLE is one of “details”, “fatalities”, or “locations”, YEAR is the four-digit year of the data, and YMD is the four-digit year, two-digit month, and two-digit day the file was last modified. There are no delimiters between any of the fields.
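As a quick, hedged illustration of the convention, one of these file names can be reassembled from its parts with glue (the values below are illustrative and match a file that appears in the FTP listing later in this post):

#' Sketch: rebuild one file name from its parts
glue::glue(
  "StormEvents_{TABLE}-ftp_v1.0_d{YEAR}_c{YMD}.csv.gz",
  TABLE = "details", YEAR = "1950", YMD = "20170120"
)
## StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz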

For this phase of the project, I will be using four R libraries:

  • curl - Used to retrieve FTP listings

  • glue - Interpolate variables into strings

  • purrr - Map functions over lists and data frames

  • readr - Read and write CSVs

library(curl)
library(glue)
library(purrr)
library(readr)

I also want to go ahead and declare some variables I will use later. The first is ftp, the URL of the FTP server. The second, tables, stores the names of the three tables.

And, last, when reading in the CSVs I found I needed to take more control rather than allowing readr::read_csv to guess the column types; that guessing causes problems I’ll get to in a moment.

So, rather than typing “c” repeatedly in the col_types parameter, I created a list to hold the values, using glue_collapse to paste the repeated “c”s together.

(I thought you could supply a single character and read_csv would recycle it across all columns; I’m sure this was once a feature. But, unless I messed up, simply using col_types = "c" did not work.)
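For what it’s worth, readr does have a column specification that defaults every column to a given type, which would be another way to force all-character columns. A minimal sketch of that alternative (not the route I take below), where some_file stands in for any one of the csv.gz files:

#' Hedged alternative: default every column to character
read_csv(some_file, col_types = cols(.default = col_character()))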

#' FTP URL
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

#' Three datasets we'll be obtaining
tables <- c("details", "fatalities", "locations")

#' Expected col_types per dataset.
table_col_types <- list(
  "details" = glue_collapse(rep("c", 51L)),
  "fatalities" = glue_collapse(rep("c", 11L)),
  "locations" = glue_collapse(rep("c", 11L))
)
#' Establish connection, get list of gz datasets
con <- curl(ftp, "r")
tbl <- read.table(con, stringsAsFactors = TRUE, fill = TRUE)
close(con)

Take a look at the tbl object.

head(tbl)
##           V1 V2  V3   V4    V5  V6 V7   V8
## 1 drwxr-xr-x  2 ftp 1005 32768 May 14 2014
## 2 -rw-r--r--  1 ftp 1005  2020 May 14 2014
## 3 -rw-r--r--  1 ftp 1005 23834 May  6 2014
## 4 -rw-r--r--  1 ftp 1005 10597 Jan 20 2017
## 5 -rw-r--r--  1 ftp 1005 12020 Feb 24 2016
## 6 -rw-r--r--  1 ftp 1005 12634 Jun 19 2017
##                                                    V9
## 1                                              legacy
## 2                                              README
## 3                       Storm-Data-Export-Format.docx
## 4 StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz
## 5 StormEvents_details-ftp_v1.0_d1951_c20160223.csv.gz
## 6 StormEvents_details-ftp_v1.0_d1952_c20170619.csv.gz

This is basically the FTP directory listing in data frame form. All I am concerned with is tbl$V9, which holds the file names. I can glue these to ftp after using pattern matching to extract only the files I want.

The pattern matching portion is easy; every file name starts with “StormEvents_” followed by one of “details”, “fatalities”, or “locations”.

I’ll use purrr::map to build a list of length three, one element per table, then assign each element the name of its respective table.

#' Split out the datasets into their own lists. Will end up with a list of
#' length 3 for each table containing all related dataset URLs
by_table <-
  map(
    .x = glue("^StormEvents_{tables}"),
    .f = grep,
    x = tbl$V9,
    value = TRUE
  ) %>%
  set_names(nm = tables)
str(by_table)
## List of 3
##  $ details   : chr [1:69] "StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz" "StormEvents_details-ftp_v1.0_d1951_c20160223.csv.gz" "StormEvents_details-ftp_v1.0_d1952_c20170619.csv.gz" "StormEvents_details-ftp_v1.0_d1953_c20160223.csv.gz" ...
##  $ fatalities: chr [1:69] "StormEvents_fatalities-ftp_v1.0_d1950_c20170120.csv.gz" "StormEvents_fatalities-ftp_v1.0_d1951_c20160223.csv.gz" "StormEvents_fatalities-ftp_v1.0_d1952_c20170619.csv.gz" "StormEvents_fatalities-ftp_v1.0_d1953_c20160223.csv.gz" ...
##  $ locations : chr [1:69] "StormEvents_locations-ftp_v1.0_d1950_c20170120.csv.gz" "StormEvents_locations-ftp_v1.0_d1951_c20160223.csv.gz" "StormEvents_locations-ftp_v1.0_d1952_c20170619.csv.gz" "StormEvents_locations-ftp_v1.0_d1953_c20160223.csv.gz" ...

I now have a nice, clean listing of all my files for each table.

It’s time to read in the CSVs. Again I fall back to purrr, but this time with map_df; I want one data frame per table. I’ll describe my parameters.

  • .x - I use glue again, this time to prepend ftp to every file name in by_table$details, producing full URLs.

  • .f - As I’ll be reading in CSV files, I use read_csv from the readr package.

  • col_types - Originally I left this parameter at its default. However, that proved to be an issue. read_csv guesses a type for each column from the rows it inspects; how many rows it inspects is controlled by the guess_max parameter, which defaults to the smaller of 1000 and n_max. Some columns were guessed to be numeric, but further into a file read_csv would hit a character value and throw a parsing warning. Rather than fight the guesser, I decided to read every column as character and sort the types out during cleanup (a sketch of the guess_max alternative I skipped follows this list).
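For reference, the other way around the guessing problem would have been to widen the guessing window rather than override it. A minimal sketch of that alternative, assuming an arbitrary guess_max of 10000 and using one file from the listing above:

#' Hypothetical alternative (not used here): let read_csv inspect more rows
#' before guessing column types
read_csv(
  glue("{ftp}StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz"),
  guess_max = 10000
)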

I then save the datasets to a Git LFS repo.

#' Get the details dataset (this can take a while)
details <-
  map_df(
    .x = glue("{ftp}{by_table$details}"),
    .f = read_csv,
    #' Make character to avoid reading errors or warnings
    col_types = table_col_types$details
  )

write_csv(details, path = "./ncdc_storm_events/details.csv")
#' ...fatalities (can take a while, too)...
fatalities <-
  map_df(
    .x = glue("{ftp}{by_table$fatalities}"),
    .f = read_csv,
    #' Make character to avoid reading errors or warnings
    col_types = table_col_types$fatalities
  )

write_csv(fatalities, path = "./ncdc_storm_events/fatalities.csv")
#' ...and locations (a bit faster).
locations <-
  map_df(
    .x = glue("{ftp}{by_table$locations}"),
    .f = read_csv,
    #' Make character to avoid reading errors or warnings
    col_types = table_col_types$locations
  )

write_csv(locations, path = "./ncdc_storm_events/locations.csv")
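In later posts these saved CSVs can be read straight back out of the repo; a minimal sketch, reusing the column specs declared earlier so everything stays character:

#' Read a saved table back in from the repo
details <- read_csv(
  "./ncdc_storm_events/details.csv",
  col_types = table_col_types$details
)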

Next, I’ll start performing some cleanup. And, whoo boy, is there some cleaning to do.

Script

Session Info

pander::pander(sessionInfo())

R version 3.4.4 (2018-03-15)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=C.UTF-8, LC_NUMERIC=C, LC_TIME=C.UTF-8, LC_COLLATE=C.UTF-8, LC_MONETARY=C.UTF-8, LC_MESSAGES=C.UTF-8, LC_PAPER=C.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=C.UTF-8 and LC_IDENTIFICATION=C

attached base packages: methods, stats, graphics, grDevices, utils, datasets and base

other attached packages: readr(v.1.1.1), purrr(v.0.2.5), glue(v.1.3.0) and curl(v.3.2)

loaded via a namespace (and not attached): Rcpp(v.0.12.17), knitr(v.1.20), magrittr(v.1.5), hms(v.0.4.2), R6(v.2.2.2), rlang(v.0.2.1), stringr(v.1.3.1), tools(v.3.4.4), xfun(v.0.1), htmltools(v.0.3.6), yaml(v.2.1.19), rprojroot(v.1.3-2), digest(v.0.6.15), tibble(v.1.4.2), bookdown(v.0.7), evaluate(v.0.10.1), rmarkdown(v.1.9), blogdown(v.0.6), stringi(v.1.2.2), compiler(v.3.4.4), pander(v.0.6.2), pillar(v.1.2.3), backports(v.1.1.2) and pkgconfig(v.2.0.1)
