# ncdc_storm_events

`ncdc_storm_events` is a project that downloads the NCDC Storm Events database from the National Oceanic and Atmospheric Administration (NOAA).
The NCDC Storm Events Database is a collection of observations of significant weather events. The database contains information on property damage, loss of life, intensity of storm systems, and more.
The dataset is updated somewhat regularly but is not real-time. As of this writing, it covers January 1950 through August 2018.
There are three tables within the database:

- `details`
- `fatalities`
- `locations`
The `details` dataset is the heaviest of the raw datasets, weighing in at over 1.1G when combined and saved as CSV. `locations` is much smaller at only 75M (CSV), and `fatalities` is a featherweight at 1.2K.
`details` also happens to be the dirtiest. Of its 51 columns, at least a dozen are unnecessary, mostly date or time observations (there are 13 date/time variables in total, all of them redundant).
`fatalities` has the same issue with dates and times as `details`, but not nearly on the same scale.
`locations`, though treated as its own dataset, is very comparable to the location data within `details`, and much of this information is redundant.
The entire database, in my opinion, makes for a great project in exploring, tidying, and cleaning somewhat large datasets. The challenge lies in verifying that data which appears redundant really is safe to remove.
The datasets are broken down by table, then further by year of observation. They are stored in csv.gz format on the NCDC NOAA FTP server.
Each file is named like `StormEvents_{TABLE}-ftp_v1.0_d{YEAR}_c{LAST_MODIFIED}.csv.gz`, where `TABLE` is one of `details`, `locations`, or `fatalities`; `YEAR` is the year of the observations; and `LAST_MODIFIED` is the datetime the archive file was last modified.
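As an illustration, a file name can be parsed into its pieces with stringr (the example file name below is made up to match the pattern):

```r
library(stringr)

# A hypothetical file name following the convention above
fname <- "StormEvents_details-ftp_v1.0_d1999_c20180718.csv.gz"

# Capture TABLE, YEAR, and LAST_MODIFIED from the file name
parts <- str_match(
  fname,
  "StormEvents_([a-z]+)-ftp_v1\\.0_d(\\d{4})_c(\\d+)\\.csv\\.gz"
)

parts[, 2]  # table: "details"
parts[, 3]  # year: "1999"
parts[, 4]  # last modified: "20180718"
```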
Files are downloaded with `./code/01_get_data.R`. All csv.gz datasets are bound together into one data frame per table, and these raw data files are saved in the `data` directory.
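For reference, a minimal sketch of that download-and-bind step (this is not the contents of `01_get_data.R`; the base URL and file names below are assumptions):

```r
library(tidyverse)

# Assumed base URL for the Storm Events csv.gz archives; adjust if the
# NCDC/NOAA server layout has changed
base_url <- "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

# Hypothetical file names for two years of the details table
files <- c(
  "StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz",
  "StormEvents_details-ftp_v1.0_d1951_c20160223.csv.gz"
)

# Download each archive into data/ (skipping files already present), then
# bind all years into one data frame; read_csv() decompresses .gz directly
dir.create("data", showWarnings = FALSE)
details_raw <- files %>%
  map(function(f) {
    dest <- file.path("data", f)
    if (!file.exists(dest)) download.file(paste0(base_url, f), dest, mode = "wb")
    read_csv(dest, col_types = cols(.default = col_character()))
  }) %>%
  bind_rows()
```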
All three datasets can be tidied to some extent using `./code/02_tidy_data.R`. The `details` dataset is the worst, with over a dozen date and/or time variables. These variables are dropped, and `BEGIN_DATE_TIME` and `END_DATE_TIME` are reformatted to `YYYY-MM-DD HH:MM:SS` character strings. Timezone information is not saved; though it is included in the dataset, it is almost entirely incorrect.
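A minimal sketch of that reformatting, assuming a lubridate-style parse (the raw input format shown here is an assumption):

```r
library(dplyr)
library(lubridate)

# A hypothetical slice of the raw details table
details_raw <- tibble(
  BEGIN_DATE_TIME = c("28-APR-1950 14:45:00", "01-AUG-2018 09:30:00"),
  END_DATE_TIME   = c("28-APR-1950 15:00:00", "01-AUG-2018 10:00:00")
)

# Parse the raw strings, then store them as plain "YYYY-MM-DD HH:MM:SS"
# character strings; no timezone information is kept
details_tidy <- details_raw %>%
  mutate(
    BEGIN_DATE_TIME = format(dmy_hms(BEGIN_DATE_TIME), "%Y-%m-%d %H:%M:%S"),
    END_DATE_TIME   = format(dmy_hms(END_DATE_TIME),   "%Y-%m-%d %H:%M:%S")
  )
```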
Additionally, the date and time variables in `fatalities` and `locations` are modified to remove redundancy or, in the case of `locations`, whose values match the date/time values in `details`, removed completely.
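A sketch of the kind of redundancy removal involved, using assumed column names for `fatalities`:

```r
library(dplyr)

# A hypothetical slice of the raw fatalities table; the column names here
# are assumptions about the raw layout
fatalities_raw <- tibble(
  FATALITY_ID   = 1L,
  FAT_YEARMONTH = "195004",
  FAT_DAY       = "28",
  FAT_TIME      = "1445",
  FATALITY_DATE = "1950-04-28 14:45:00"
)

# Keep the single full date/time value and drop the redundant pieces
fatalities_tidy <- fatalities_raw %>%
  select(-FAT_YEARMONTH, -FAT_DAY, -FAT_TIME)
```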
The damage variables in `details` have also been modified. The raw data uses alphanumeric values; for example, "2k" or "2K" for $2,000 and "10B" or "10b" for $10,000,000,000. These have been cleaned to integer values.
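A small sketch of how such values could be converted (not necessarily the exact logic in `02_tidy_data.R`):

```r
library(stringr)

# Hypothetical raw damage values as they appear in the details table
damage_raw <- c("2K", "2k", "10B", "0.5M", NA)

# Multipliers for the alphabetic suffixes
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Split each value into its numeric part and its suffix, then combine;
# plain numeric is used here since values like 1e10 overflow 32-bit integers
parse_damage <- function(x) {
  suffix <- str_to_upper(str_extract(x, "[KkMmBb]$"))
  number <- as.numeric(str_remove(x, "[KkMmBb]$"))
  mult   <- ifelse(is.na(suffix), 1, multipliers[suffix])
  number * mult
}

parse_damage(damage_raw)
#> [1] 2e+03 2e+03 1e+10 5e+05    NA
```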
Lastly, the narrative variables (`EPISODE_NARRATIVE` and `EVENT_NARRATIVE`) are split out into their own dataset to avoid redundancy and reduce the size of the other datasets.
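A sketch of that split, assuming `EVENT_ID` is the key that links the narratives back to `details`:

```r
library(dplyr)

# A hypothetical slice of the tidied details table
details_tidy <- tibble(
  EVENT_ID          = c(10001L, 10002L),
  EVENT_TYPE        = c("Tornado", "Hail"),
  EPISODE_NARRATIVE = c("A strong system moved through...", "Scattered storms..."),
  EVENT_NARRATIVE   = c("A tornado touched down...", "Quarter-sized hail fell...")
)

# Keep the free-text narratives in their own table, keyed by EVENT_ID,
# then drop them from details to shrink the main dataset
narratives <- details_tidy %>%
  select(EVENT_ID, EPISODE_NARRATIVE, EVENT_NARRATIVE)

details_tidy <- details_tidy %>%
  select(-EPISODE_NARRATIVE, -EVENT_NARRATIVE)
```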
Where the raw data is well over 1.2G, the entire dataset after tidying sits at 539M. All tidied datasets are located in the `output` directory.
When updating data, this repo will use BFG Repo-Cleaner to remove the history of data files from the repository. When this is done, the repository will need to be removed from production environments in favor of a fresh clone.
Packages and software used in this project:

- DT 0.5
- kableExtra 1.0.1
- maps 3.3.0
- mapproj 1.2.6
- rnaturalearth 0.1.0
- tidyverse 1.2.1
- viridis 0.5.1
- workflowr 0.2.0
- R 3.5.2 - The R Project for Statistical Computing
Please read Contributing for details on the code of conduct.