An OpenOffice BASE SQL Model of Spreadsheets

Privacy in Redaction

Copyright © 2009 RustPrivacy.Org

Files

 

Forward

There are many repositories for Personally Identifiable Information (PII), including e-mail client Address Books, Application User Data, etc..  In “Taking Work Home”, it was noted that to protect PII from disclosure, one must redact data in documents.  In “RUST The SQL”, some techniques were given to write SQL queries to accomplish this redaction in spreadsheets.  

Free form marked up documents (HTML, XML, etc.) are a challenge to redact because there is one “master copy” and two output streams.  Spreadsheets however can be modeled in a data base in a systematic way, provided the SQL engine is up to the task (JET/ACCESS is not), so that the different outputs can use a cut and paste.  PERL or PYTHON could easily generate the SQL scripts.  However making this into a “web application” rather defeats the purpose – making sure raw data and PII stay within a firewall and within the perimeter of a physical location.

Simple deletion invites re-identification (filling in the holes).  Redaction, when done properly, can prevent re-identification “mistakes”.

 

 

Introduction

 

In particular, we would like to model a Comma Separated Variable (CSV) data set, where Column Headings are in Row #1.  This is a common import/export format for all sorts of applications and Address Books.

Spreadsheet (CSV with Headings)

 

 

A

B

C

D

[...]

1

ColNameA

ColNameB

ColNameC

ColNameD

...

2

dataA1

dataB1

dataC1

dataD1

...

3

dataA2

[null]

[null]

dataD2

...

...

...

...

...

...

...

 

 

 

Methods

A table specifying Column Properties has been added.

Data View (Spreadsheet View)

 

...

...

...

...

...

...

...

PropName2

-2

[redacted]

[null]

[redacted]

[null]

...

PropName1

-1

A

B

C

D

...

PropName0

0

ColNameA

ColNameB

ColNameC

ColNameD

...

Property

Sequence

ColNameA

ColNameB

ColNameC

ColNameD

...

 

 

A

B

C

D

[...]

 

1

ColNameA

ColNameB

ColNameC

ColNameD

...

 

2

dataA1

dataB1

dataC1

dataD1

...

 

3

dataA2

[null]

[null]

dataD2

...

 

...

...

...

...

...

...

 

 

Data View (Privacy View)

 

...

...

...

...

...

...

...

PropName2

-2

[redacted]

[null]

[redacted]

[null]

...

PropName1

-1

A

B

C

D

...

PropName0

0

ColNameA

ColNameB

ColNameC

ColNameD

...

Property

Sequence

ColNameA

ColNameB

ColNameC

ColNameD

...

 

 

A

B

C

D

[...]

 

1

ColNameA

ColNameB

ColNameC

ColNameD

...

 

2

[redacted]

dataB1

[redacted]

dataD1

...

 

3

[redacted]

[fallback]

[redacted]

dataD2

...

 

...

...

...

...

...

...

 

 

 

Results

The CSV File is represented by three Tables in the Data Base.

  1. a)The Original CSV 

  2. b)Additional Columns 

  3. c)Column Properties (Model) 

Each new data source adds two Tables, analogs of a) and b).  The Column Properties should have entries for each column in a) and b) and for every data source, [(a union b) is identical to  c].  In practice, both a and b will have an index (ID) and the b index will be a Foreign Key which points to some row of a.  The Row# of the Column Properties should be Zero for the Column Names and -1 for the Spreadsheet Column Range Label [A ... ZZZ].  Additional Column Properties may be added.  In this example, -2 is the index for a redaction flag with a value of either [redacted] or null.

 

Output (original with deletion)

 

 

A

B

C

D

[...]

1

ColNameA

ColNameB

ColNameC

ColNameD

...

2

 

dataB1

 

dataD1

...

3

 

 

 

dataD2

...

...

 

...

 

...

...

 

Output (privacy)

 

 

A

B

C

D

[...]

1

ColNameA

ColNameB

ColNameC

ColNameD

...

2

[redacted]

dataB1

[redacted]

dataD1

...

3

[redacted]

[fallback]

[redacted]

dataD2

...

...

[redacted]

...

[redacted]

...

...

 

 

 

 

 

Discussion

  1. 1)All columns of data are fully populated in the “privacy” output.  Re-identification by joining this result with another data source column is much harder because it is no longer clear which data source cell value should be taken as the “authority”.  The job of Re-identification is often split into two  tasks 1) Extraction of one data source  2) Joining with another data source.   This is not particularly sinister itself because step 1 calls for special expertise.  Nonetheless not every “J. Booth” is “John Booth” or “John W. Booth” or “John Wilkes Booth”. 

  2. 2)The logic, as a case statement is given below.  All cases were considered even though only the case of Redaction needs all the logic.  That is, in fact, the point of the exercise, since re-identification depends upon column-wise comparison of two data sources with the assumption that one source is an authority and the other “empty”. 

     

Spreadsheet View

 

  CASE

     WHEN (Cell Value is not null) THEN (Cell Value) // result (Cell Value)

     WHEN (Cell Value is null) THEN (Cell Value)     // result NULL

     ELSE (Cell Value)                               // (never executed)

  END AS Column Name

 

Redaction

 

  CASE

     WHEN (Cell Value is not null)

                   AND

          (Property Column is not null) THEN (Value of Property Column)  // result '[redacted]'

     WHEN (Cell Value is null) THEN '[fallback]'                         // result '[fallback]'

     ELSE (Cell Value)                                                   // result (Cell Value)

  END AS Column Name

 

Deletion

 

  CASE

     WHEN (Cell Value is not null) THEN (Cell Value) //  result NULL

     WHEN (Cell Value is null) THEN (Cell Value)     //  result NULL

     ELSE (Cell Value)                               //  (never executed)

  END AS Column Name