Estimates the fraction of synthetic rows whose sensitive column value can be correctly inferred from the known columns via a k-NN lookup in the real training data.
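The idea can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helper name knn_inference_rate, the default k = 1, the majority vote and the choice to compare the guess against the synthetic row's own sensitive value are all assumptions made for illustration.

```r
# Hypothetical helper illustrating a k-NN attribute-inference rate.
# Assumption: a synthetic row counts as "disclosed" when the majority sensitive
# value among its k nearest real rows equals the synthetic row's own value.
knn_inference_rate <- function(real, syn, sensitive_col, known_cols, k = 1) {
  real_x <- as.matrix(real[, known_cols, drop = FALSE])
  syn_x  <- as.matrix(syn[, known_cols, drop = FALSE])

  hits <- vapply(seq_len(nrow(syn_x)), function(i) {
    # Squared Euclidean distance from synthetic row i to every real row
    d  <- rowSums(sweep(real_x, 2, syn_x[i, ], "-")^2)
    nn <- order(d)[seq_len(k)]
    # Majority vote among the k nearest real rows
    votes <- table(real[[sensitive_col]][nn])
    guess <- names(votes)[which.max(votes)]
    guess == as.character(syn[[sensitive_col]][i])
  }, logical(1))

  mean(hits)
}

# Toy data (made up for this sketch)
real_toy <- data.frame(age = c(20, 30, 40, 50),
                       income = c("low", "low", "high", "high"))
syn_toy  <- data.frame(age = c(21, 49, 31),
                       income = c("low", "high", "high"))
rate <- knn_inference_rate(real_toy, syn_toy, "income", "age")
```

In the toy data the first two synthetic rows sit next to a real row with the same income value and the third does not, so rate is 2/3.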
Details
known_cols must be numeric because the nearest-neighbour lookup uses
Euclidean distance over those columns. To use a categorical column as a
known attribute, one-hot encode it first (e.g. with
model.matrix(~ col - 1, data)).
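As a minimal sketch of the encoding step above, the following replaces a categorical column with 0/1 indicator columns. The data frame and the region column are made up for illustration; only model.matrix() comes from the text.

```r
# Hypothetical data: `region` is a categorical column we want to use as a known attribute.
real <- data.frame(
  age    = c(25, 40, 31, 58),
  region = c("north", "south", "south", "west"),
  income = c("low", "high", "low", "high")
)

# "- 1" drops the intercept, so every level of `region` gets its own 0/1 column.
region_dummies <- model.matrix(~ region - 1, data = real)

# Swap the categorical column for its numeric indicator columns.
real_encoded <- cbind(real[, c("age", "income")], region_dummies)

# Column names to pass as known_cols: "age" plus one name per region level.
known_cols <- c("age", colnames(region_dummies))
```

The synthetic data needs the same encoding with the same set of levels, so that both data frames contain identical known columns.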
Examples
real <- data.frame(age = sample(20:60, 50, replace = TRUE),
                   income = sample(c("low", "high"), 50, replace = TRUE),
                   stringsAsFactors = FALSE)
syn <- real[sample(50), ]
attribute_disclosure_risk(real, syn, sensitive_col = "income", known_cols = "age")
#> [1] 0.72