The challenge

Linking population-sized datasets

Linkage at population scale was not achievable with previous approaches. The project also needed the flexibility to link the disparate data types found in real-world datasets: previously, only text-string similarity was considered, whereas real data also contains numeric and categorical features.

Anonlink served as the base platform. A side effect of its linking approach is that linkage results are sensitive to the choice of parameters. A real-world linkage has dozens of parameters to tune, and the analyst usually does not have access to the full data. As a result, using the Anonlink ecosystem requires expert knowledge to achieve good results.

Additionally, in real-world linkages, linkage errors are not evenly spread across all subpopulations, which introduces an unfair bias. This bias can be hard to detect in downstream tasks and should therefore be tackled within the linkage process.
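
One way to make uneven linkage error visible is to measure linkage quality separately for each subpopulation. The following is a minimal sketch in Python, assuming ground-truth matches are available for an evaluation sample. The function name `recall_by_group`, the group labels and the toy data are hypothetical, and the source does not state which fairness measure the project used.

```python
from collections import defaultdict


def recall_by_group(true_matches, predicted_matches, group_of):
    """Compute recall per subpopulation.

    true_matches / predicted_matches: iterables of (left_id, right_id) pairs.
    group_of: maps a left_id to its (hypothetical) subpopulation label.
    """
    predicted = set(predicted_matches)
    found = defaultdict(int)
    total = defaultdict(int)
    for pair in true_matches:
        group = group_of[pair[0]]
        total[group] += 1
        if pair in predicted:
            found[group] += 1
    return {group: found[group] / total[group] for group in total}


# Toy data: the linkage misses one true match, and it belongs to group "B".
true_matches = [(0, 10), (1, 11), (2, 12), (3, 13)]
predicted_matches = [(0, 10), (1, 11), (2, 12)]
group_of = {0: "A", 1: "A", 2: "B", 3: "B"}

recalls = recall_by_group(true_matches, predicted_matches, group_of)
# Gap between the best and worst served group; 0 would mean equal recall.
recall_gap = max(recalls.values()) - min(recalls.values())
print(recalls, recall_gap)
```

In this toy case the overall recall of 0.75 hides the fact that every missed match falls in one group, which is the kind of bias a single aggregate score would not reveal.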

Our response

Evaluate, implement and integrate

To enable large-scale record linkage, we evaluated different blocking methods, implemented the two most suitable ones and integrated them into the Anonlink ecosystem. We developed an automatic hyperparameter optimisation framework to replace manual hyperparameter tuning. Because the optimisation operates unsupervised, we also analysed various heuristic losses. To our knowledge, we are the first to consider the fairness of the record linkage process, and we prototyped a framework to enforce fairness in record linkage.
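
To illustrate why blocking makes large-scale linkage tractable, here is a minimal sketch of plain key-based blocking in Python. It is not one of the two methods integrated into Anonlink, which the text above does not name, and it works on plaintext fields rather than privacy-preserving encodings. The key function, field names and records are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    """Hypothetical blocking key: first letter of surname plus birth year."""
    return (record["surname"][:1].lower(), record["birth_year"])


def candidate_pairs(left, right, key=blocking_key):
    """Return index pairs (i, j) whose records share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for i, record in enumerate(left):
        blocks[key(record)][0].append(i)
    for j, record in enumerate(right):
        blocks[key(record)][1].append(j)
    pairs = []
    for left_ids, right_ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(product(left_ids, right_ids))
    return sorted(pairs)


left = [
    {"surname": "Smith", "birth_year": 1980},
    {"surname": "Jones", "birth_year": 1975},
    {"surname": "Nguyen", "birth_year": 1990},
]
right = [
    {"surname": "Smyth", "birth_year": 1980},
    {"surname": "Nguyen", "birth_year": 1990},
    {"surname": "Brown", "birth_year": 1975},
]

pairs = candidate_pairs(left, right)
print(pairs, "of", len(left) * len(right), "possible comparisons")
```

Without blocking, two datasets of a million records each would need on the order of 10^12 comparisons. Blocking trades a small risk of missing true matches that land in different blocks for a much smaller candidate set, which is why the choice of blocking method matters.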

The results

Extending the software

We took a big step towards product readiness by extending the software to handle linkages of millions by millions of records and to support more data types. Furthermore, we used unsupervised machine learning to perform parameter tuning, which enables non-experts to achieve excellent linkage results. Finally, we prototyped and published a novel fairness-aware privacy-preserving record linkage system, which enables the protection of vulnerable minority groups.

To further the product readiness of the Anonlink system, we intend to address issues around its deployment, such as ease of deployment and the security risks of uploading large-scale data to the server.

Balancing the trade-off between information leakage and optimisation performance also remains challenging. We seek to explore ways to improve absolute fairness in record linkage and to gain better control over the trade-off between accuracy and fairness.
