The goal of retroharmonize
is to facilitate retrospective (ex-post)
harmonization of survey data in a reproducible manner. The package
provides tools for organizing the metadata, standardizing the coding of
variables, variable names and value labels, including missing values,
and for documenting all transformations, with the help of comprehensive
S3 classes.
The package is currently being generalized from problems solved in the not yet released eurobarometer package.
The package is available on CRAN:
install.packages("retroharmonize")
The development version can be installed from GitHub with:
# install.packages("devtools")
devtools::install_github("rOpenGov/retroharmonize")
You can download the PDF manual for the 0.2.4 release, but note that version 0.2.5 differs significantly from it.
Surveys, i.e., systematic primary observations and data collections, are important data sources in both the social and natural sciences. They are in most cases the primary data sources of scientific research. Drawing information from several surveys, conducted in different locations or at different times, can greatly enhance the inferential capacity of the surveys, but it requires significant data processing and statistical work. Our R software package offers a practical and comprehensive solution for harmonizing the datasets and their codebooks.
Statistical matching is a related concept that can take a harmonized dataset further, for example, by creating new, statistically better, unified weights. For these problems, StatMatch is a mature solution in R.
Survey data harmonization refers to procedures that improve the data comparability or the inferential capacity of multiple surveys. The retroharmonize package supports the data processing, documentation, and file/type conversion aspects of various retrospective survey harmonization workflows (i.e., harmonization tasks related to surveys that have already been conducted and recorded in a coded file).
From a technical perspective, the aim of survey harmonization is to create a single, tidy, joined, harmonized dataset in the form of a data frame that contains a row identifier, which is truly unique across all observations, and which also contains the concatenated and harmonized variables. We do this in a way that provides an unambiguous mapping of numerically coded and labelled data, including special and missing data. This way we avoid coercion that may lead to logical errors due to syntactically correct, but logically inconsistent variable labelling across differently coded source files. Taking the harmonization to the level of type harmonization to numeric and factor classes allows the use of R’s powerful statistical packages that require numeric or factor type input, and a wide range of survey output harmonization (harmonized statistics and indicators).
For an extended overview of these problems with illustrations please refer to the vignette Survey Harmonization.
Survey data, i.e., data derived from questionnaires or systematic data collection, such as inspecting objects in nature or recording prices at shops, are usually stored in databases and converted to complex files that retain at least the coding and labelling metadata together with the data. These files must be imported into R so that the appropriate harmonization tasks can be carried out with the appropriate R types.
After importing data with some descriptive metadata such as numerical coding and labelling, we need to create a map of the information that is in our R session to prepare a harmonization plan. We must find information related to sufficiently similar concepts that can be harmonized to be successfully joined into a single variable, and eventually a table of similar variables must be joined.
We create a map of the measured concepts that need to be harmonized, for example, a binary sex variable with missing cases and a four-level categorical variable on gender identification that has other and declined options. See the vignette Working With Survey Metadata to learn how mapping the metadata of the surveys can help you get started with this first step.
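The mapping step above can be sketched in base R. This is only an illustration of the idea, not the package's own metadata functions: the toy surveys, the use of a plain "labels" attribute, and the helper describe_survey() are all assumptions made for this example.

```r
# Two toy surveys; the data and the plain "labels" attribute are assumptions.
survey_1 <- data.frame(id = 1:3, sex = c(0, 1, 9))
attr(survey_1$sex, "labels") <- c(Female = 0, Male = 1, Missing = 9)

survey_2 <- data.frame(id = 1:4, gender = c(1, 2, 3, 8))
attr(survey_2$gender, "labels") <- c(male = 1, female = 2, other = 3, declined = 8)

# Hypothetical helper: returns one metadata row per variable of a survey.
describe_survey <- function(df, survey_id) {
  data.frame(
    survey   = survey_id,
    var_name = names(df),
    class    = vapply(df, function(x) class(x)[1], character(1)),
    n_labels = vapply(df, function(x) length(attr(x, "labels")), integer(1)),
    labels   = vapply(
      df,
      function(x) paste(names(attr(x, "labels")), collapse = "|"),
      character(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# The metadata map: all variables of all surveys in one table.
metadata_map <- rbind(
  describe_survey(survey_1, "survey_1"),
  describe_survey(survey_2, "survey_2")
)
print(metadata_map)
```

Reading such a table side by side makes it easier to spot that sex and gender measure sufficiently similar concepts but use different names, codes and labels.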
We use a crosswalk table or a crosswalk scheme for all the variable name, value label and type conversion tasks that we plan to do.
Make sure that survey_1$sex and survey_2$gender can be concatenated
to a gender vector or to survey_joined$gender. See more in the
Working With A Crosswalk Table vignette.
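A minimal sketch of a variable-name crosswalk in base R follows. The column names of the crosswalk table (var_name_orig, var_name_target) and the helper rename_by_crosswalk() are hypothetical and chosen for this illustration; the vignette describes the package's own tools for this task.

```r
# Toy surveys (assumed data).
survey_1 <- data.frame(id = 1:3, sex = c(0, 1, 9))
survey_2 <- data.frame(id = 1:4, gender = c(1, 2, 3, 8))

# A crosswalk table: one row per source variable, with its target name.
crosswalk <- data.frame(
  survey          = c("survey_1", "survey_1", "survey_2", "survey_2"),
  var_name_orig   = c("id", "sex", "id", "gender"),
  var_name_target = c("id", "gender", "id", "gender"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename the columns of one survey using the crosswalk.
rename_by_crosswalk <- function(df, survey_id, crosswalk) {
  rules <- crosswalk[crosswalk$survey == survey_id, ]
  idx   <- match(names(df), rules$var_name_orig)
  found <- !is.na(idx)
  names(df)[found] <- rules$var_name_target[idx[found]]
  df
}

s1 <- rename_by_crosswalk(survey_1, "survey_1", crosswalk)
s2 <- rename_by_crosswalk(survey_2, "survey_2", crosswalk)

# After renaming, the two surveys share the same column names.
identical(names(s1), names(s2))
```

Keeping the renaming rules in a table rather than in code means the plan can be reviewed, versioned and reused across survey waves.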
For example, Female=0 in survey_1$sex and female=2 in
survey_2$gender both become female=0. Missing and declined
values are handled consistently.
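The recoding described above can be sketched with a value crosswalk in base R. The target codes for missing and declined answers (99999 and 99998), the column names and the helper recode_values() are assumptions made for this example only.

```r
# Source vectors with inconsistent coding (assumed data).
sex_1    <- c(0, 1, 9)     # Female = 0, Male = 1, Missing = 9
gender_2 <- c(1, 2, 3, 8)  # male = 1, female = 2, other = 3, declined = 8

# Value crosswalk: original code and label, harmonized code and label.
value_crosswalk <- data.frame(
  survey       = c(rep("survey_1", 3), rep("survey_2", 4)),
  val_orig     = c(0, 1, 9, 1, 2, 3, 8),
  label_orig   = c("Female", "Male", "Missing",
                   "male", "female", "other", "declined"),
  val_target   = c(0, 1, 99999, 1, 0, 2, 99998),
  label_target = c("female", "male", "missing",
                   "male", "female", "other", "declined"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the harmonized code of each observation.
recode_values <- function(x, survey_id, cw) {
  rules <- cw[cw$survey == survey_id, ]
  rules$val_target[match(x, rules$val_orig)]
}

gender <- c(
  recode_values(sex_1, "survey_1", value_crosswalk),
  recode_values(gender_2, "survey_2", value_crosswalk)
)
gender
```

In the concatenated vector, female is coded 0 regardless of the source survey, and the missing and declined answers keep distinct codes instead of being collapsed into NA.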
To use R’s statistical functions with the concatenated version of
survey_1$sex and survey_2$gender, they must have the same R type. In
the vast majority of cases this is either numeric or factor, and in
data visualization applications sometimes character. See more in the
Harmonize Value Labels vignette.
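As a rough base R illustration of type harmonization, the sketch below converts one harmonized, numerically coded vector to factor, numeric and character representations. The codes, labels and missing-value codes are assumptions for this example; the package's own conversion methods are described in the vignette.

```r
# Harmonized codes and their labels (assumed for illustration).
codes         <- c(0, 1, 99999, 1, 0, 2, 99998)
value_labels  <- c(female = 0, male = 1, other = 2,
                   declined = 99998, missing = 99999)
missing_codes <- c(99998, 99999)

# Factor: every code, including the special ones, becomes a labelled level.
gender_factor <- factor(codes, levels = value_labels,
                        labels = names(value_labels))

# Numeric: special codes become NA so they cannot distort statistics.
gender_numeric <- ifelse(codes %in% missing_codes, NA_real_, codes)

# Character: convenient for plot labels.
gender_character <- as.character(gender_factor)

table(gender_factor)
mean(gender_numeric, na.rm = TRUE)
```

The factor keeps the distinction between declined and missing answers, while the numeric version drops both from computations; which one is appropriate depends on the analysis.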
To review statistical results and model results derived from the
concatenated variable (or the joined data frame), they must remain
comparable with survey_1$sex and survey_2$gender. It is also
necessary to have a new, unique row ID for each observation. If you want
to make your work available outside R, in different software, the
joined, longitudinal data frame must be exported in a consistent
manner.
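A minimal sketch of the last step, assuming two already harmonized toy surveys: the surveys are stacked, a row identifier that is unique across all observations is built from the survey name and the original ID, and the result is exported to CSV. The column names and the ID scheme are illustrative assumptions.

```r
# Already harmonized toy surveys (assumed data).
s1 <- data.frame(id = 1:3, gender = c(0, 1, 99999))
s2 <- data.frame(id = 1:4, gender = c(1, 0, 2, 99998))

# Record the source of each observation before joining.
s1$survey <- "survey_1"
s2$survey <- "survey_2"
survey_joined <- rbind(s1, s2)

# The original id repeats across surveys, so build a new identifier.
survey_joined$unique_id <- paste(survey_joined$survey,
                                 survey_joined$id, sep = "_")
stopifnot(!anyDuplicated(survey_joined$unique_id))

# Export for use in other software; a temporary file is used here.
out_file <- tempfile(fileext = ".csv")
write.csv(survey_joined, out_file, row.names = FALSE)
```

Because the survey column is kept, any statistic computed on the joined data can still be compared with the same statistic computed on each source survey.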
We also provide three extensive case studies illustrating how the
retroharmonize
package can be used for ex-post harmonization of data
from cross-national surveys:
The creators of retroharmonize are not affiliated with
Afrobarometer, Arab Barometer, Eurobarometer, or the organizations that
design, produce or archive their surveys.
We create a large, harmonized dataset for extensive testing of our package's capabilities. The replication data of this special use case and the harmonized dataset can be found on Zenodo in the Digital Music Observatory and the Cultural Creative Sectors Industries Data Observatory repositories.
We are building experimental data APIs in the form of automated observatories, which run retroharmonize regularly and improve known statistical data sources. See also the Green Deal Data Observatory and the Economy Data Observatory.
Survey data is often available in SPSS’s custom labelled format.
Unfortunately, joining data with different labelling is not possible.
When you do not need to preserve the history of complex harmonization
problems, the codebook, etc., you do not necessarily need to look under
the hood of our S3 classes. The new labelled_spss_survey() class is
an inherited extension of haven’s labelled_spss class. It not
only preserves variable and value labels and the user-defined missing
range, but also gives an identifier, for example, the filename or the
wave number, to the vector. Additionally, it preserves the original
variable names, labels, and value codes and labels from the source data
as metadata attributes. This way, the harmonized data
also contain the pre-harmonization record. The vignette Working With
The labelled_spss_survey Class provides more information about the
labelled_spss_survey() class.
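To illustrate the idea behind the class, the sketch below uses only haven's labelled_spss() constructor, which labelled_spss_survey() extends, and attaches a survey identifier and the original variable name by hand. The attribute names survey_id and name_orig are assumptions for this illustration and are not necessarily the attributes that retroharmonize sets; see the vignette for the actual constructor.

```r
library(haven)

# A labelled vector with a user-defined missing value (assumed data).
sex <- labelled_spss(
  x = c(1, 0, 1, 9),
  labels = c(Female = 0, Male = 1, Declined = 9),
  na_values = 9,
  label = "Sex of respondent"
)

# Hand-made provenance metadata; the attribute names are hypothetical.
attr(sex, "survey_id") <- "survey_1"
attr(sex, "name_orig") <- "sex"

# The user-defined missing value is reported as missing,
# but the underlying code 9 is not lost.
is.na(sex)
unclass(sex)[4]

# The labels survive the conversion to a factor.
as_factor(sex)
```

The point of carrying an identifier on the vector is that, after concatenation, each value can still be traced back to its source survey and its original coding.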
In Harmonize Value Labels we discuss the characteristics of the
labelled_spss_survey() class and demonstrate the problems that using
this class solves.
Our package has been tested on the microdata of three harmonized surveys. Because retroharmonize is not affiliated with any of these data sources, to replicate our tutorials or work with the data, you have to download the data files from these sources, and you have to cite those sources in your work.
Afrobarometer data: cite Afrobarometer. Arab Barometer data: cite Arab Barometer. Eurobarometer data: the Eurobarometer raw data and related documentation (questionnaires, codebooks, etc.) are made available by GESIS, ICPSR and through the Social Science Data Archive networks. You should cite your source; in our examples, we rely on the GESIS data files.
For the main developer and contributors, see the package homepage.
This work can be freely used, modified and distributed under the GPL-3 license:
citation("retroharmonize")
#>
#> To cite package 'retroharmonize' in publications use:
#>
#> Antal D (2022). _retroharmonize: Ex Post Survey Data Harmonization_.
#> R package version 0.2.5.002,
#> <https://retroharmonize.dataobservatory.eu/>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {retroharmonize: Ex Post Survey Data Harmonization},
#> author = {Daniel Antal},
#> year = {2022},
#> note = {R package version 0.2.5.002},
#> url = {https://retroharmonize.dataobservatory.eu/},
#> }
For contact information, see the package homepage.
Please note that the retroharmonize
project is released with a
Contributor Code of
Conduct.
By contributing to this project, you agree to abide by its terms.