For each pair of categorical columns, compares the joint distribution (the normalized contingency table) of the real data with that of the synthetic data using total variation distance (TVD), and scores the pair as 1 - TVD (the SDMetrics ContingencySimilarity score). Each pair score lies between 0 and 1, where 1 means the two joint distributions are identical. This is the categorical analogue of correlation similarity and captures categorical-vs-categorical dependence.
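
As a rough illustration of the per-pair computation described above, the sketch below computes 1 - TVD for a single pair of columns in base R. The helper name pair_contingency_score and the toy data frames are hypothetical and are not part of the package; the package's own implementation may differ, for example in how it treats missing values (this sketch simply drops them).

```r
# Hypothetical helper (not part of the package): 1 - TVD for one column pair.
pair_contingency_score <- function(real, synthetic, col_1, col_2) {
  # Share one set of levels so both tables contain exactly the same cells.
  levels_1 <- union(as.character(real[[col_1]]), as.character(synthetic[[col_1]]))
  levels_2 <- union(as.character(real[[col_2]]), as.character(synthetic[[col_2]]))

  # Normalized contingency table: cell counts divided by the total count.
  joint <- function(df) {
    counts <- table(
      factor(as.character(df[[col_1]]), levels = levels_1),
      factor(as.character(df[[col_2]]), levels = levels_2)
    )
    counts / sum(counts)
  }

  # Total variation distance is half the sum of absolute cell differences.
  tvd <- 0.5 * sum(abs(joint(real) - joint(synthetic)))
  1 - tvd
}

# Toy data, invented for illustration only.
real_toy  <- data.frame(a = c("x", "x", "y", "y"), b = c("u", "v", "u", "v"))
synth_toy <- data.frame(a = c("x", "x", "x", "y"), b = c("u", "v", "u", "v"))

toy_score <- pair_contingency_score(real_toy, synth_toy, "a", "b")
toy_score
#> [1] 0.75
```

In the toy data, two cells differ by 0.25 each, so TVD is 0.25 and the score is 0.75.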

Usage

contingency_similarity(real, synthetic, meta)

Arguments

real

A data frame of real data.

synthetic

A data frame of synthetic data.

meta

An rsdv_metadata object.

Value

A list with two elements: pairs, a tibble with columns column_1, column_2, and score (one row per pair of categorical columns), and score, the mean of the pair scores. score is NA_real_ when there are fewer than two categorical columns: with no pair there is no dependence to measure, so propagating NA (rather than 1) avoids overstating fidelity in the aggregated quality report.
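
The shape of the return value can be sketched as follows, assuming pairs are enumerated as unordered combinations of the categorical columns and the overall score is a plain mean. The helpers categorical_pairs and overall_score are hypothetical names used only for illustration; the column names are the nine that appear in the example output on this page.

```r
# Hypothetical helpers (not part of the package).

# Enumerate each unordered pair of categorical columns once.
categorical_pairs <- function(columns) {
  if (length(columns) < 2) {
    return(data.frame(column_1 = character(0), column_2 = character(0)))
  }
  pairs <- t(combn(columns, 2))
  data.frame(column_1 = pairs[, 1], column_2 = pairs[, 2])
}

# Mean over pairs; NA_real_ when there is no pair to score.
overall_score <- function(pair_scores) {
  if (length(pair_scores) == 0) NA_real_ else mean(pair_scores)
}

cols <- c("workclass", "education", "marital_status", "occupation",
          "relationship", "race", "sex", "native_country", "income")

pair_table <- categorical_pairs(cols)
nrow(pair_table)                     # 36, i.e. choose(9, 2)
overall_score(numeric(0))            # NA: fewer than two categorical columns
nrow(categorical_pairs("workclass")) # 0 pairs for a single column
```

Nine categorical columns give choose(9, 2) = 36 pairs, which matches the 36-row tibble in the example output.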

Examples

# \donttest{
meta  <- metadata(adult_income)
syn   <- gaussian_copula_synthesizer(meta) |> fit(adult_income)
synth <- sample(syn, n = 500)
contingency_similarity(adult_income, synth, meta)
#> $pairs
#> # A tibble: 36 × 3
#>    column_1  column_2       score
#>    <chr>     <chr>          <dbl>
#>  1 workclass education      0.833
#>  2 workclass marital_status 0.883
#>  3 workclass occupation     0.79 
#>  4 workclass relationship   0.887
#>  5 workclass race           0.945
#>  6 workclass sex            0.949
#>  7 workclass native_country 0.924
#>  8 workclass income         0.929
#>  9 education marital_status 0.834
#> 10 education occupation     0.673
#> # ℹ 26 more rows
#> 
#> $score
#> [1] 0.8600278
#> 
# }