GitHub - reconhub/matchmaker: Dictionary-based cleaning for categorical variables

matchmaker R package

Lifecycle: experimental

The goal of {matchmaker} is to provide simple, intuitive dictionary-based cleaning for R users, built on the {forcats} package. Some of the features of this package include:

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

Each of these functions has four mandatory options:

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!
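Before turning to the packaged example files, the three steps can be sketched on a toy scale. This is a minimal sketch, not the package's documented example: `toy_dat`, `toy_dict`, and `toy_clean` are hypothetical names, the dictionary is built directly in R instead of a spreadsheet, and its rows are copied from the readmission group of the example dictionary shown later in this README. The `match_df()` call uses the same arguments as the README example.

```r
library("matchmaker")

# Step 2 stand-in: hypothetical raw data with one coded column and a missing value
toy_dat <- data.frame(
  readmission = c("y", "n", "u", NA),
  stringsAsFactors = FALSE
)

# Step 1 stand-in: dictionary rows for the readmission group
# (column names follow the example dictionary: options, values, grp)
toy_dict <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values  = c("Yes", "No", "Unknown", "Missing"),
  grp     = "readmission",
  stringsAsFactors = FALSE
)

# Step 3: match!
toy_clean <- match_df(toy_dat,
  dictionary = toy_dict,
  from = "options",
  to = "values",
  by = "grp"
)
print(toy_clean)
```

Keeping the dictionary as a plain data frame means the same table can be saved as a CSV and edited in a spreadsheet without touching the R code.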

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

Data

This is the top of our data set, which was generated for example purposes:

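| id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
|---|---|---|---|---|---|---|---|---|---|---|
| ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u |
| e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui |
| b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | | oui |
| c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui |
| 40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n |
| 46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA |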

Dictionary

The dictionary looks like this:

| options | values | grp | orders |
|---|---|---|---|
| y | Yes | readmission | 1 |
| n | No | readmission | 2 |
| u | Unknown | readmission | 3 |
| .missing | Missing | readmission | 4 |
| 0 | Yes | treated | 1 |
| 1 | No | treated | 2 |
| .missing | Missing | treated | 3 |
| 1 | Facility 1 | facility | 1 |
| 2 | Facility 2 | facility | 2 |
| 3 | Facility 3 | facility | 3 |
| 4 | Facility 4 | facility | 4 |
| 5 | Facility 5 | facility | 5 |
| 6 | Facility 6 | facility | 6 |
| 7 | Facility 7 | facility | 7 |
| 8 | Facility 8 | facility | 8 |
| 9 | Facility 9 | facility | 9 |
| 10 | Facility 10 | facility | 10 |
| .default | Unknown | facility | 11 |
| 0 | 0-9 | age_group | 1 |
| 10 | 10-19 | age_group | 2 |
| 20 | 20-29 | age_group | 3 |
| 30 | 30-39 | age_group | 4 |
| 40 | 40-49 | age_group | 5 |
| 50 | 50+ | age_group | 6 |
| high | High | .regex ^lab_result_ | 1 |
| norm | Normal | .regex ^lab_result_ | 2 |
| inc | Inconclusive | .regex ^lab_result_ | 3 |
| y | yes | .global | Inf |
| n | no | .global | Inf |
| u | unknown | .global | Inf |
| unk | unknown | .global | Inf |
| oui | yes | .global | Inf |
| .missing | missing | .global | Inf |
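Rows whose grp value starts with `.regex` carry a pattern rather than a single column name. Judging by the example output in the Matching section, the pattern applies to every column whose name matches it. The base R sketch below only illustrates that reading; it is not matchmaker's internal code, and `cols`, `grp_value`, `pattern`, and `matched` are illustrative names.

```r
# Column names taken from the example data set
cols <- c(
  "id", "date", "readmission", "treated", "facility", "age_group",
  "lab_result_01", "lab_result_02", "lab_result_03",
  "has_symptoms", "followup"
)

# A grp entry from the dictionary: keyword, whitespace, then a regular expression
grp_value <- ".regex ^lab_result_"

# Strip the keyword to recover the pattern itself
pattern <- sub("^\\.regex[[:space:]]+", "", grp_value)

# Columns that this dictionary group would cover under the assumed semantics
matched <- grep(pattern, cols, value = TRUE)
print(matched)
```

One pattern row per value therefore covers all three lab result columns, which keeps the dictionary short when many columns share a coding scheme.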

Matching

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated   facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes    Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility 3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility 8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility 4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility 6       0-9
#> 6 46566e 2019-07-14         Yes Missing    Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
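In the output above, facilities `C` and `B` both come back as `Unknown`, which is consistent with the `.default` row acting as a fallback for the facility group. When constructing a dictionary, it can help to list such uncovered codes before matching. The sketch below is plain base R rather than a matchmaker feature; the vectors are typed in from the six data rows and the facility dictionary rows shown earlier, and all names are illustrative.

```r
# Raw facility codes from the six rows shown in the Data section
facility_raw <- c("C", "3", "8", "4", "6", "B")

# Explicit facility options in the dictionary (keyword rows such as .default excluded)
facility_options <- as.character(1:10)

# Codes present in the data but absent from the dictionary;
# these are the ones that end up with the fallback label
uncovered <- setdiff(facility_raw, facility_options)
print(uncovered)
```

Reviewing this list before matching makes it easier to decide whether a code deserves its own dictionary row or should stay under the fallback label.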