R4 is a risk assessment tool designed to help evaluate the potential for re-identification of records in datasets (including 'de-identified' datasets), so as to support decision making on what data can be shared in what context.

The tool examines the risk of re-identification for single attributes and for combinations of multiple attributes in the dataset, and presents a dashboard for viewing overall risk and examining problematic attributes or records in finer detail. Its graphical user interface and risk ranking provide a one-look view of the re-identification risk of a dataset, and allow easy drill-down to the data that most affects that risk.
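To illustrate what scoring single attributes and combinations of attributes can look like, here is a minimal sketch. It is not R4's algorithm, which is not described here. It assumes a simple uniqueness measure (the fraction of records that are the only one with a given combination of values), and the function name `combination_risk` and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations

def combination_risk(records, attributes):
    """Map each attribute combination to the fraction of records that are
    unique on it. This uniqueness measure is an assumption for illustration,
    not R4's own risk score."""
    risks = {}
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Build the value tuple each record takes on this combination.
            keys = [tuple(r[a] for a in combo) for r in records]
            counts = Counter(keys)
            unique = sum(1 for k in keys if counts[k] == 1)
            risks[combo] = unique / len(records)
    return risks

# Hypothetical sample data.
records = [
    {"age": 34, "postcode": "2000", "sex": "F"},
    {"age": 34, "postcode": "2000", "sex": "M"},
    {"age": 51, "postcode": "2010", "sex": "F"},
    {"age": 51, "postcode": "2000", "sex": "F"},
]
risks = combination_risk(records, ["age", "postcode", "sex"])

# Rank combinations from riskiest to least risky.
for combo, risk in sorted(risks.items(), key=lambda kv: -kv[1]):
    print(combo, risk)
```

In this sample no record is unique on age alone, but every record is unique once all three attributes are combined, which is the kind of effect a combination-level analysis is meant to surface.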

R4 also simplifies the process of preparing a dataset for sharing or release by highlighting problematic records and offering mitigation methods, such as aggregation and perturbation, that can be applied to chosen attributes. Once a mitigation is applied, R4 re-analyses the modified data, so it can be used in a cycle of risk mitigation and assessment until the residual risk is considered acceptable.

R4 helps data custodians and managers better understand the risk of re-identification so as to make informed decisions about their data, and to reduce that risk through treatment of problematic attributes or records.

Features and benefits

Dashboard and drill-down

R4 provides a readily absorbed overview of the risk associated with a dataset. For each combination size (one attribute, two attributes, and so on) it displays the riskiest combination of attributes, along with its risk score and infographics such as pie charts and a cumulative distribution function (CDF) risk chart.

 

R4 provides a quick and easy means of drilling down to the most problematic records: selecting a section of the CDF chart brings up a sample of the records in that section, with an option to download the full set of records as a CSV file for further analysis.
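The drill-down described above can be sketched as follows. This is an illustration only, assuming each record already has a numeric risk score between 0 and 1. The helper names (`empirical_cdf`, `records_in_section`, `to_csv`) and the sample data are hypothetical and are not part of R4.

```python
import csv
import io

def empirical_cdf(scores):
    """Return (score, cumulative fraction of records) points, sorted by score."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(s, (i + 1) / n) for i, s in enumerate(ordered)]

def records_in_section(records, scores, low, high):
    """Select the records whose risk score falls in the range [low, high]."""
    return [r for r, s in zip(records, scores) if low <= s <= high]

def to_csv(records):
    """Serialise a list of dict records to CSV text."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# Hypothetical records and per-record risk scores.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 34},
    {"id": 3, "age": 51},
    {"id": 4, "age": 87},
]
scores = [0.5, 0.5, 0.25, 1.0]

cdf_points = empirical_cdf(scores)
selected = records_in_section(records, scores, low=0.75, high=1.0)
print(to_csv(selected))
```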

In-place risk mitigation

R4 allows standard techniques such as binning and perturbation to be applied to one or more attributes directly within the tool. Modified versions of the attributes are added to the dataset, and the risk analysis is run on them to show what effect the transformations would have.
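As a rough illustration of the two techniques named above, the sketch below bins a numeric attribute into fixed-width ranges and perturbs it with bounded random noise, adding the results as new columns alongside the original. The function names, the `_binned` and `_perturbed` column suffixes, and the choice of uniform noise are assumptions for this example, not a description of how R4 implements its mitigations.

```python
import random

def bin_value(value, width):
    """Generalise a number to the lower bound of its bin (37 -> 30 for width 10)."""
    return (value // width) * width

def perturb(value, max_noise, rng):
    """Add uniform random noise drawn from [-max_noise, +max_noise]."""
    return value + rng.uniform(-max_noise, max_noise)

def add_mitigated_columns(records, attribute, width, max_noise, seed=0):
    """Return copies of the records with binned and perturbed versions of
    `attribute` added as new columns; the input records are left unchanged."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    out = []
    for r in records:
        new = dict(r)
        new[f"{attribute}_binned"] = bin_value(r[attribute], width)
        new[f"{attribute}_perturbed"] = perturb(r[attribute], max_noise, rng)
        out.append(new)
    return out

# Hypothetical sample data.
records = [{"age": 37}, {"age": 42}, {"age": 45}]
mitigated = add_mitigated_columns(records, "age", width=10, max_noise=2.0)
for row in mitigated:
    print(row)
```

Keeping the original column next to the modified ones makes it straightforward to re-run a risk analysis on the modified columns and compare the result with the untreated data.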

Consistent risk metrics

R4 provides a reliable and consistent way to quantify the re-identification risk associated with individual records in a dataset and with the dataset as a whole. Its re-identification risk (RIR) notation has been developed so that a given value carries the same meaning, in terms of risk, across datasets of different sizes. For example, 10 records being re-identifiable in a cohort of 100 represents less information leakage than 10 being re-identifiable out of 10 million, and RIR reflects this.
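One way to see why dataset scale matters in the example above is to count the bits of information needed to narrow a population down to a small group. The sketch below is an information-theoretic illustration only. It is not the definition of RIR, which is not given here, and the function name `bits_to_narrow` is hypothetical.

```python
import math

def bits_to_narrow(population, candidates):
    """Bits of information needed to narrow `population` records down to
    a group of `candidates` records: log2(population / candidates)."""
    return math.log2(population / candidates)

small = bits_to_narrow(100, 10)          # 10 records out of 100
large = bits_to_narrow(10_000_000, 10)   # 10 records out of 10 million

print(f"10 of 100:        {small:.2f} bits")
print(f"10 of 10 million: {large:.2f} bits")
```

Under this view, isolating 10 records in 10 million takes roughly 20 bits against roughly 3 bits for 10 in 100, so the same count of exposed records corresponds to far more information having leaked in the larger dataset.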

Results export and scoring API

R4 supports the export of results, or of a subset of results, to CSV, including columns resulting from applying any mitigations to the data.

R4 is also designed as a REST API (Application Programming Interface), that is, a well-defined interface for executing queries and returning their results, typically over a network. This means that R4 can be called directly from other components, so it is possible to create new interfaces to the tool, integrate it into other tools, or skip the front end entirely.
