Modifying Meta Data

Marlen Holtmann, Johanna Busse, Benjamin Becker

2021-10-06

eatGADS allows importing data from SPSS files basically without any loss of meta data. Meta data are stored in a GADSdat object with the following structure:

#>    varName varLabel format display_width labeled value valLabel missings
#> 17  gender   Gender   F8.0            NA     yes     1   Female    valid
#> 18  gender   Gender   F8.0            NA     yes     2     Male    valid

On value level, additional meta data can be stored, namely:

These meta data provide an overview of the actual data and are essential to their understanding, comparable to a codebook. Most importantly, it also allows other researchers to understand the data without much effort and ensures transparency. In this vignette we will illustrate how in eatGADS meta data can be accessed and how it can be modified. We also discuss some recode and data modification functions that lead to automatic meta data changes.

For illustrative purposes we use a small example data set from the campus files of the German PISA Plus assessment (called pisa). The complete campus files and the original data set can be accessed here and here.

library(eatGADS)
gads <- pisa

Meta data structure

With the function extractMeta() you can inspect the existing metadata of one, several or all variables in a GADSdat object. We will use this function frequently throughout the vignette to check whether changes to meta data have been successful.

extractMeta(gads, vars = c("hisei", "schtype"))
#>    varName                             varLabel format display_width labeled value
#> 5  schtype                         School track   F8.0            NA     yes     1
#> 6  schtype                         School track   F8.0            NA     yes     2
#> 7  schtype                         School track   F8.0            NA     yes     3
#> 39   hisei Highest parental occupational status   F8.2            NA      no    NA
#>                                     valLabel missings
#> 5                 Gymnasium (academic track)    valid
#> 6                                 Realschule    valid
#> 7  schools with several courses of education    valid
#> 39                                      <NA>     <NA>

Modifying meta data

Variable names, labels, and SPSS format

Changes to meta data on variable level are straightforward. Variable names can be changed with the changeVarNames() function. The old variable names are overwritten.

# Changing variable names
extractMeta(gads, vars = "hisei")
#>    varName                             varLabel format display_width labeled value valLabel
#> 39   hisei Highest parental occupational status   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>
gads_labeled <- changeVarNames(GADSdat = gads, oldNames = "hisei", newNames = "hisei_new")
extractMeta(gads_labeled, vars = "hisei_new")
#>      varName                             varLabel format display_width labeled value valLabel
#> 39 hisei_new Highest parental occupational status   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

Variable labels can be adjusted analogously via the changeVarLabels() function.

# Changing variable labels 
extractMeta(gads_labeled, vars = "hisei_new")
#>      varName                             varLabel format display_width labeled value valLabel
#> 39 hisei_new Highest parental occupational status   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>
gads_labeled <- changeVarLabels(GADSdat = gads_labeled, varName = "hisei_new", 
                                varLabel = "Parental occupational status (highest)")
extractMeta(gads_labeled, vars = "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

The same applies for the SPSS format of a variable using the changeSPSSformat() function.

# Changing SPSS format
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)   F8.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>
gads_labeled <- changeSPSSformat(GADSdat = gads_labeled, varName = "hisei_new", 
                                 format = "F10.2")
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>

Value labels

Changes to meta data on value level follow the same principle. With the changeValLabels() function, value labels can be added or modified. Note that value labels and missing codes should be given to numeric values, even if a variable is a character variable.

# Adding value labels
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA      no    NA     <NA>
#>    missings
#> 39     <NA>
gads_labeled <- changeValLabels(GADSdat = gads_labeled, varName = "hisei_new", 
                                value = c(-94, -99), valLabel = c("miss1", "miss2"))
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value valLabel
#> 38 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99    miss2
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94    miss1
#>    missings
#> 38    valid
#> 39    valid

# Changing value labels
gads_labeled <- changeValLabels(GADSdat = gads_labeled, varName = "hisei_new", 
                                value = c(-94, -99), 
                                valLabel = c("missing: Question omitted",
                                             "missing: Not administered"))
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value
#> 38 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94
#>                     valLabel missings
#> 38 missing: Not administered    valid
#> 39 missing: Question omitted    valid

Value labels can also be deleted using the removeValLabels() function.

# Removing value labels
extractMeta(gads_labeled, "schtype")
#>   varName     varLabel format display_width labeled value                                  valLabel
#> 4 schtype School track   F8.0            NA     yes     1                Gymnasium (academic track)
#> 5 schtype School track   F8.0            NA     yes     2                                Realschule
#> 6 schtype School track   F8.0            NA     yes     3 schools with several courses of education
#>   missings
#> 4    valid
#> 5    valid
#> 6    valid
gads_labeled <- removeValLabels(GADSdat = gads_labeled, varName = "schtype", 
                                value = 1:3)
extractMeta(gads_labeled, "schtype")
#>   varName     varLabel format display_width labeled value valLabel missings
#> 4 schtype School track   F8.0            NA      no    NA     <NA>     <NA>

Missings

Missing codes can be modified using the changeMissings() function. Valid entries for missings are "miss" and "valid".

# Defining missings
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value
#> 38 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99
#> 39 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94
#>                     valLabel missings
#> 38 missing: Not administered    valid
#> 39 missing: Question omitted    valid
gads_labeled <- changeMissings(GADSdat = gads_labeled, varName = "hisei_new", 
                               value = c(-94, -99), missings = c("miss", "miss"))
extractMeta(gads_labeled, "hisei_new")
#>      varName                               varLabel format display_width labeled value
#> 36 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -99
#> 37 hisei_new Parental occupational status (highest)  F10.2            NA     yes   -94
#>                     valLabel missings
#> 36 missing: Not administered     miss
#> 37 missing: Question omitted     miss

Usually an alignment of value labels and missing codes is desirable. For example, in our variable hisei_new the value -94 has received a missing code and the value label "missing: Questions omitted". To make these alignments easier, the checkMissings() function exists. It allows searching for regular expressions in the value labels and comparing missing codes and vice versa. Per default, missing codes are automatically adjusted (addMissingCode = TRUE) and value label mismatches just reported (addMissingLabel = FALSE).

# Creating a new value label for a missing value but leaving the missing code as valid
gads_labeled <- changeValLabels(GADSdat = gads_labeled, varName = "gender", 
                                value = -94, valLabel = "missing: Question omitted")
# Creating a new missing code but leaving the value label empty
gads_labeled <- changeMissings(GADSdat = gads_labeled, varName = "gender", 
                                value = -99, missings = "miss")

# Checking value label and missing code allignment
gads_labeled <- checkMissings(gads_labeled, missingLabel = "missing") 
#> The following variables have value labels including the term 'missing' which are not coded as missing:
#> gender
#> 'miss' is inserted into column missings for 1 rows.
#> The following variables have values coded as missings but value labels do not include the term 'missing':
#> gender

Reusing meta data

Sometimes one variable already contains the meta data which should be added to another variable. reuseMeta() allows us to copy meta data from one variable other another variable, even across different data sets. The function allows us to transfer the complete meta data, only value labels or a specific selection of value labels (only valid values or missing codes). In the example below we transfer only the missing codes from one variable to another.

extractMeta(gads_labeled, "age")
#>    varName             varLabel format display_width labeled value valLabel missings
#> 18     age Age of student at T1   F8.2            NA      no    NA     <NA>     <NA>
gads_labeled <- reuseMeta(GADSdat = gads_labeled, varName = "age",
                          other_GADSdat = gads_labeled, other_varName = "hisei_new",
                          missingLabels = "only", addValueLabels = TRUE)
extractMeta(gads_labeled, "age")
#>    varName             varLabel format display_width labeled value                  valLabel
#> 18     age Age of student at T1   F8.2            NA     yes   -99 missing: Not administered
#> 19     age Age of student at T1   F8.2            NA     yes   -94 missing: Question omitted
#>    missings
#> 18     miss
#> 19     miss

Modifying data and meta data

Note that in eatGADS meta data is stored together with the actual data set. Therefore, changes to the actual data set imply changes to the meta data. If a variable is removed from the data set, its meta data is no longer needed. If a new variable is created, new meta data needs to be created. If a variable is recoded, the meta data entries need to be recoded accordingly.

Selecting variables

If a certain subset of variables in the GADSdat is needed, individual variables can either be extracted (extractVars()) or removed (removeVars()) from the data set.

# Selecting variables
gads_motint <- extractVars(gads_labeled, 
                           vars = c("int_a", "int_b", "int_c", "int_d", "instmot_a"))
#> Removing the following rows from meta data: idstud, idschool, idclass, schtype, sameteach, g8g9, ganztag, classsize, repeated, gender, age, language, migration, hisced, hisei_new, homepos, books, pared, computer_age, internet_age, int_use_a, int_use_b, truancy_a, truancy_b, truancy_c, instmot_b, instmot_c, instmot_d, norms_a, norms_b, norms_c, norms_d, norms_e, norms_f, anxiety_a, anxiety_b, anxiety_c, anxiety_d, anxiety_e, selfcon_a, selfcon_b, selfcon_c, selfcon_d, selfcon_e, worketh_a, worketh_b, worketh_c, worketh_d, worketh_e, worketh_f, worketh_g, worketh_h, worketh_i, intent_a, intent_b, intent_c, intent_d, intent_e, behav_a, behav_b, behav_c, behav_d, behav_e, behav_f, behav_g, behav_h, teach_a, teach_b, teach_c, teach_d, teach_e, cognact_a, cognact_b, cognact_c, cognact_d, cognact_e, cognact_f, cognact_g, cognact_h, cognact_i, discpline_a, discpline_b, discpline_c, discpline_d, discpline_e, relation_a, relation_b, relation_c, relation_d, relation_e, belong_a, belong_b, belong_c, belong_d, belong_e, belong_f, belong_g, belong_h, belong_i, attitud_a, attitud_b, attitud_c, attitud_d, attitud_e, attitud_f, attitud_g, attitud_h, grade_de, grade_ma, grade_bio, grade_che, grade_phy, grade_sci, ma_pv1, ma_pv2, ma_pv3, ma_pv4, ma_pv5, rea_pv1, rea_pv2, rea_pv3, rea_pv4, rea_pv5, sci_pv1, sci_pv2, sci_pv3, sci_pv4, sci_pv5
#> No rows added to meta data.
namesGADS(gads_motint)
#> [1] "int_a"     "int_b"     "int_c"     "int_d"     "instmot_a"

gads_int <- removeVars(gads_motint, vars = "instmot_a") 
#> Removing the following rows from meta data: instmot_a
#> No rows added to meta data.
namesGADS(gads_int)
#> [1] "int_a" "int_b" "int_c" "int_d"

Recoding values

The function recodeGADS() allows the recoding of a variable.

# Original data and meta data
gads_labeled$dat$gender[1:10]
#>  [1] 1 1 2 2 1 1 2 2 1 1
extractMeta(gads_labeled, "gender")
#>    varName varLabel format display_width labeled value                  valLabel missings
#> 14  gender   Gender   F8.0            NA     yes   -99                      <NA>     miss
#> 15  gender   Gender   F8.0            NA     yes   -94 missing: Question omitted     miss
#> 16  gender   Gender   F8.0            NA     yes     1                    Female    valid
#> 17  gender   Gender   F8.0            NA     yes     2                      Male    valid
# Recoding 
gads_labeled <- recodeGADS(GADSdat = gads_labeled, varName = "gender", 
                           oldValues = c(1, 2), newValues = c(10, 20))
# New data and meta data
gads_labeled$dat$gender[1:10]
#>  [1] 10 10 20 20 10 10 20 20 10 10
extractMeta(gads_labeled, "gender")
#>    varName varLabel format display_width labeled value                  valLabel missings
#> 14  gender   Gender   F8.0            NA     yes   -99                      <NA>     miss
#> 15  gender   Gender   F8.0            NA     yes   -94 missing: Question omitted     miss
#> 16  gender   Gender   F8.0            NA     yes    10                    Female    valid
#> 17  gender   Gender   F8.0            NA     yes    20                      Male    valid

Moreover, the variable allows recoding of values that were not assigned value labels or even NA values.

# Recoding of NA values 
gads_labeled$dat$int_a[1:10]
#>  [1]  2  2  3  2  1  2 NA NA NA NA
gads_labeled <- recodeGADS(GADSdat = gads_labeled, varName = "int_a", 
                           oldValues = NA, newValues = -94)
gads_labeled$dat$int_a[1:10]
#>  [1]   2   2   3   2   1   2 -94 -94 -94 -94

For recoding specific values into NA values, the function recode2NA() exists. It allows recoding a specific value across multiple variables (while recodeGADS() allows recoding multiple values for a single variable). Existing value labels for the specified values are deleted. For each variable it is reported how many cases have been recoded.

# Recoding of values as Missing/NA
gads_labeled$dat$schtype[1:10]
#>  [1] 2 3 1 3 2 3 1 3 2 1
gads_labeled <- recode2NA(gads_labeled, recodeVars = c("hisei_new", "schtype"), 
                          value = "3")
#> Recodes in variable hisei_new: 0
#> Recodes in variable schtype: 111
gads_labeled$dat$schtype[1:10]
#>  [1]  2 NA  1 NA  2 NA  1 NA  2  1

Adding variables

Adding variables to a GADSdat object is unfortunately not straight forward and requires some understanding of the underlying object structure. A GADSdat object is basically a list with two entries, the data (dat ) and the meta data (labels). For adding variables we need to extract the dat object and add the new variables to it. Afterwards, we need to update the meta data using the updateMeta() function.

# Extract the data
newDat <- gads_labeled$dat
# Adding a variable
newDat$classsize_kat <- ifelse(newDat$classsize > 15, 
                                         yes = "big", no = "small") 
# Updating meta data
gads_labeled2 <- updateMeta(gads_labeled, newDat = newDat)
#> No rows removed from meta data.
#> Adding meta data for the following variables: classsize_kat
extractMeta(gads_labeled2, "classsize_kat")
#>                     varName varLabel format display_width labeled value valLabel missings
#> classsize_kat classsize_kat     <NA>   <NA>            NA      no    NA     <NA>     <NA>

Changing meta data (and data) via an Excel sheet

So far, the introduced functions work well for modifying the meta data of small data sets or for individual variables. However, frequently we are interested in modifying the meta data of a larger number of variables simultaneously. For this purpose eatGADS provides a workflow that works well with Excel spreadsheets. Thereby changes to meta data are divided into two levels: the variable and the value level.

Variable level

We start by extracting this change table via the getChangeMeta() function

# variable level
meta_var <- getChangeMeta(GADSdat = pisa, level = "variable")

While in principle one could modify the change table directly in R, it more convenient to do this in Excel. The change table can be written to .xlsx via the eatAnalysis::write_xlsx() function. To perform changes, entries are made into the corresponding "_new"-columns.

# write to excel
eatAnalysis::write_xlsx(meta_var, row.names = FALSE, "variable_changes.xlsx")

 

 

The excel file can be read back into R via readxl::read_xlsx().

# write to excel
meta_var_changed <- readxl::read_excel("variable_changes.xlsx", col_types = rep("text", 8))

The applyChangeMeta() function applies the meta data changes to the GADSdat object.

gads2 <- applyChangeMeta(meta_var_changed, GADSdat = pisa)
extractMeta(gads2, vars = c("idstud", "idschool", "schoolType"))
#>      varName                    varLabel format display_width labeled value
#> 2     idstud Student Identifier Variable   F8.0            NA      no    NA
#> 3   idschool                   School-ID  F10.0            NA      no    NA
#> 5 schoolType                School track   F8.0            NA     yes     1
#> 6 schoolType                School track   F8.0            NA     yes     2
#> 7 schoolType                School track   F8.0            NA     yes     3
#>                                    valLabel missings
#> 2                                      <NA>     <NA>
#> 3                                      <NA>     <NA>
#> 5                Gymnasium (academic track)    valid
#> 6                                Realschule    valid
#> 7 schools with several courses of education    valid

Value level

At value level, information on value, value labels or missings can be changed. The general workflow is identical

# value level
meta_val <- getChangeMeta(GADSdat = pisa, level = "value")
# write to excel
eatAnalysis::write_xlsx(meta_val, row.names = FALSE, "value_changes.xlsx")

 

 

# write to excel
meta_val_changed <- readxl::read_excel("value_changes.xlsx", 
                                       col_types = c("text", rep(c("numeric", "text", "text"), 2)))
gads3 <- applyChangeMeta(meta_val_changed, GADSdat = pisa)
extractMeta(gads3, vars = c("schtype", "sameteach"))
#>     varName                               varLabel format display_width labeled value
#> 4   schtype                           School track   F8.0            NA     yes     1
#> 5   schtype                           School track   F8.0            NA     yes     2
#> 6   schtype                           School track   F8.0            NA     yes     3
#> 7 sameteach Same math teacher in both school years   F8.0            NA     yes    10
#> 8 sameteach Same math teacher in both school years   F8.0            NA     yes    20
#>                                    valLabel missings
#> 4                            Acamedic Track    valid
#> 5                                Realschule    valid
#> 6 schools with several courses of education    valid
#> 7                                        No    valid
#> 8                                       Yes    valid