There are many repositories for Personally Identifiable Information (PII), including e-mail client Address Books, Application User Data, etc.. In “Taking Work Home”, it was noted that to protect PII from disclosure, one must redact data in documents. In “RUST The SQL”, some techniques were given to write SQL queries to accomplish this redaction in spreadsheets.
Free form marked up documents (HTML, XML, etc.) are a challenge to redact because there is one “master copy” and two output streams. Spreadsheets however can be modeled in a data base in a systematic way, provided the SQL engine is up to the task (JET/ACCESS is not), so that the different outputs can use a cut and paste. PERL or PYTHON could easily generate the SQL scripts. However making this into a “web application” rather defeats the purpose – making sure raw data and PII stay within a firewall and within the perimeter of a physical location.
Simple deletion invites re-identification (filling in the holes). Redaction, when done properly, can prevent re-identification “mistakes”.
In particular, we would like to model a Comma Separated Variable (CSV) data set, where Column Headings are in Row #1. This is a common import/export format for all sorts of applications and Address Books.
| A | B | C | D | [...] |
1 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
2 | dataA1 | dataB1 | dataC1 | dataD1 | ... |
3 | dataA2 | [null] | [null] | dataD2 | ... |
... | ... | ... | ... | ... | ... |
A table specifying Column Properties has been added.
... | ... | ... | ... | ... | ... | ... |
PropName2 | -2 | [redacted] | [null] | [redacted] | [null] | ... |
PropName1 | -1 | A | B | C | D | ... |
PropName0 | 0 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
Property | Sequence | ColNameA | ColNameB | ColNameC | ColNameD | ... |
|
| A | B | C | D | [...] |
| 1 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
| 2 | dataA1 | dataB1 | dataC1 | dataD1 | ... |
| 3 | dataA2 | [null] | [null] | dataD2 | ... |
| ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... |
PropName2 | -2 | [redacted] | [null] | [redacted] | [null] | ... |
PropName1 | -1 | A | B | C | D | ... |
PropName0 | 0 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
Property | Sequence | ColNameA | ColNameB | ColNameC | ColNameD | ... |
|
| A | B | C | D | [...] |
| 1 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
| 2 | [redacted] | dataB1 | [redacted] | dataD1 | ... |
| 3 | [redacted] | [fallback] | [redacted] | dataD2 | ... |
| ... | ... | ... | ... | ... | ... |
The CSV File is represented by three Tables in the Data Base.
a)The Original CSV
b)Additional Columns
c)Column Properties (Model)
Each new data source adds two Tables, analogs of a) and b). The Column Properties should have entries for each column in a) and b) and for every data source, [(a union b) is identical to c]. In practice, both a and b will have an index (ID) and the b index will be a Foreign Key which points to some row of a. The Row# of the Column Properties should be Zero for the Column Names and -1 for the Spreadsheet Column Range Label [A ... ZZZ]. Additional Column Properties may be added. In this example, -2 is the index for a redaction flag with a value of either [redacted] or null.
| A | B | C | D | [...] |
1 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
2 |
| dataB1 |
| dataD1 | ... |
3 |
|
|
| dataD2 | ... |
... |
| ... |
| ... | ... |
| A | B | C | D | [...] |
1 | ColNameA | ColNameB | ColNameC | ColNameD | ... |
2 | [redacted] | dataB1 | [redacted] | dataD1 | ... |
3 | [redacted] | [fallback] | [redacted] | dataD2 | ... |
... | [redacted] | ... | [redacted] | ... | ... |
1)All columns of data are fully populated in the “privacy” output. Re-identification by joining this result with another data source column is much harder because it is no longer clear which data source cell value should be taken as the “authority”. The job of Re-identification is often split into two tasks 1) Extraction of one data source 2) Joining with another data source. This is not particularly sinister itself because step 1 calls for special expertise. Nonetheless not every “J. Booth” is “John Booth” or “John W. Booth” or “John Wilkes Booth”.
2)The logic, as a case statement is given below. All cases were considered even though only the case of Redaction needs all the logic. That is, in fact, the point of the exercise, since re-identification depends upon column-wise comparison of two data sources with the assumption that one source is an authority and the other “empty”.
Spreadsheet View
CASE
WHEN (Cell Value is not null) THEN (Cell Value) // result (Cell Value)
WHEN (Cell Value is null) THEN (Cell Value) // result NULL
ELSE (Cell Value) // (never executed)
END AS Column Name
Redaction
CASE
WHEN (Cell Value is not null)
AND
(Property Column is not null) THEN (Value of Property Column) // result '[redacted]'
WHEN (Cell Value is null) THEN '[fallback]' // result '[fallback]'
ELSE (Cell Value) // result (Cell Value)
END AS Column Name
Deletion
CASE
WHEN (Cell Value is not null) THEN (Cell Value) // result NULL
WHEN (Cell Value is null) THEN (Cell Value) // result NULL
ELSE (Cell Value) // (never executed)
END AS Column Name