The challenge

Increasing fraudulent attempts to obtain sensitive information

Phishing is the fraudulent attempt to obtain sensitive information such as usernames, passwords, and credit card details by posing as a trustworthy entity in an electronic communication such as email, instant messaging or SMS. Users are often lured by communications purporting to be from trusted parties, such as social networking websites, auction sites, banks, online payment processors, and IT administrators.

This malicious activity often directs users to enter personal information via a fake website that matches the look and feel of the legitimate website.

To mitigate the negative impacts of phishing, security software providers, financial institutions, and academic researchers have studied various approaches to building an automated phishing website detection system. These methods have included the use of blacklists and the investigation of website content, URLs and other website-related features. Typically, these algorithms consider the HTML code of a webpage, the hyperlink of the site, and formatting such as colour and bold or italic text.

Our response

Compression-based algorithms

Our goal is to detect and predict phishing websites before they can do any harm to users. Previous phishing detection methods employed machine learning algorithms, using traditional classification techniques such as naive Bayes, logistic regression, k-nearest neighbours, support vector machines, decision trees and artificial neural networks. These algorithms cannot cope with the dynamic nature of phishing, because fraudsters change the webpage design and hyperlink every couple of hours.

By combining different algorithmic techniques, researchers at CSIRO's Data61 and UNSW have designed a novel and more effective phishing detection solution, PhishZip [1]. The PhishZip algorithm uses file compression to distinguish phishing websites from legitimate ones. File compression is the process of encoding information using fewer bits than the original representation, to reduce the file size (e.g. from 10MB to 4MB). We use the DEFLATE file compression algorithm to compress both legitimate and phishing websites, and separate them by examining how much they get compressed. Legitimate and phishing websites have different compression ratios.
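As a rough illustration of what "compression ratio" means here, the sketch below measures it with Python's standard zlib module, which wraps DEFLATE output in a small zlib header and checksum. The function name compression_ratio and both sample inputs are invented for this example and are not taken from PhishZip; the ratio is assumed to be original size divided by compressed size.

```python
import random
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return original size divided by zlib (DEFLATE) compressed size."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(data) / len(zlib.compress(data, level))


# Hypothetical inputs: a highly repetitive page body and random bytes of
# the same length, chosen only to show the two extremes of the ratio.
repetitive_page = b"<p>Please verify your account password.</p>\n" * 50
rng = random.Random(0)
noisy_page = bytes(rng.getrandbits(8) for _ in range(len(repetitive_page)))

print(f"repetitive: {compression_ratio(repetitive_page):.1f}")
print(f"noisy:      {compression_ratio(noisy_page):.2f}")
```

Repetitive content compresses far better than content with little repeated structure, so the first ratio is much larger than the second. Real webpages fall between these extremes.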

We introduce a systematic process for selecting meaningful words associated with phishing and non-phishing websites, by analysing the likelihood of word occurrences and the optimal likelihood threshold. These words are used as the pre-defined dictionary for our compression models, so that a proliferation of these key words in a page indicates a malicious website. Compression ratio measures the distance, or cross-entropy, between the website being classified and the content distribution of phishing and non-phishing websites. A high compression ratio is associated with low entropy, which indicates that the website's content distribution is similar to the common word distribution in phishing or non-phishing websites.
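One way to picture the word-selection step is a likelihood-ratio filter over word occurrences. The sketch below is an assumption made for illustration, not the selection rule from the PhishZip paper: the functions document_frequency and select_words, the add-one smoothing, the threshold value and the tiny sample pages are all hypothetical.

```python
import re
from collections import Counter


def document_frequency(pages: list[str]) -> Counter:
    """Count in how many pages each lowercase word appears at least once."""
    counts: Counter = Counter()
    for page in pages:
        counts.update(set(re.findall(r"[a-z]+", page.lower())))
    return counts


def select_words(target_pages: list[str], other_pages: list[str],
                 threshold: float = 2.5) -> list[str]:
    """Keep words whose smoothed occurrence likelihood in target_pages is
    at least `threshold` times their likelihood in other_pages."""
    target_df = document_frequency(target_pages)
    other_df = document_frequency(other_pages)
    selected = []
    for word, count in target_df.items():
        # Add-one smoothing so unseen words do not cause division by zero.
        p_target = (count + 1) / (len(target_pages) + 2)
        p_other = (other_df[word] + 1) / (len(other_pages) + 2)
        if p_target / p_other >= threshold:
            selected.append(word)
    return sorted(selected)


# Hypothetical miniature corpora, invented for this example.
phishing_pages = ["verify your account password",
                  "confirm password to verify account",
                  "urgent verify login"]
legitimate_pages = ["welcome to your account",
                    "read our latest news",
                    "contact our support team"]

print(select_words(phishing_pages, legitimate_pages))  # ['password', 'verify']
```

Calling select_words with the arguments swapped would give a candidate word list for the non-phishing dictionary. In practice the threshold would be tuned on labelled data rather than fixed by hand.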

Unlike machine learning-based models, PhishZip's approach does not require model training or HTML parsing. Instead, we compress the HTML file to determine whether it belongs to a phishing website. Thus, classification with compression algorithms is faster and simpler.
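The idea of classifying by compression alone can be sketched with zlib's preset-dictionary support (the zdict argument of zlib.compressobj). This is a minimal sketch under stated assumptions, not the PhishZip implementation: the two word lists, the helper names compressed_size and classify, and the decision rule "the dictionary that yields the smaller output wins" are all hypothetical.

```python
import zlib

# Hypothetical word dictionaries; a real system would derive them from data.
PHISHING_DICT = b"verify your account password confirm login urgent suspended"
LEGITIMATE_DICT = b"welcome news support contact products about careers team"


def compressed_size(data: bytes, dictionary: bytes) -> int:
    """Size of data after DEFLATE compression primed with a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return len(compressor.compress(data) + compressor.flush())


def classify(html: bytes) -> str:
    """Label a page by whichever dictionary compresses it better.

    Assumed rule for illustration; ties are labelled legitimate.
    """
    phishing_size = compressed_size(html, PHISHING_DICT)
    legitimate_size = compressed_size(html, LEGITIMATE_DICT)
    return "phishing" if phishing_size < legitimate_size else "legitimate"


suspicious = b"<p>urgent: verify your account password to confirm login</p>"
ordinary = b"<p>welcome: read our news or contact the support team</p>"
print(classify(suspicious), classify(ordinary))
```

No parsing or model fitting happens at classification time: the raw HTML bytes are compressed twice and the two sizes are compared. A page that shares many words with one dictionary compresses to fewer bytes under that dictionary.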

The results

Preventing financial loss and protecting privacy

The project has a significant impact on the fight against phishing and spam emails and websites. We have used this algorithm on several phishing websites which are clones of PayPal, Facebook, Microsoft, ING Direct and other popular websites. We found that PhishZip correctly identified phishing websites with more than 83% accuracy. This can help prevent large financial losses and protect the privacy of users' personal data, including passwords and credit card numbers. The approach is targeted towards organisations with email and website servers.

We have tested our algorithms on large datasets. One of the phishing dataset repositories that we use is PhishTank [2], a very comprehensive database of phishing websites. An example of the legitimate and phishing PayPal website is shown below.

The project is a collaboration with the University of NSW. The co-authored paper, 'PhishZip: A new Compression-based Algorithm for Detecting Phishing Websites', was published in the IEEE Conference on Communications and Network Security (CNS 2020). PhishZip can be used as a web service to detect and block phishing websites.

The next step is to build a complete suite of software tools and services that detect, predict and prevent phishing and spam websites for mobile, laptop and desktop users.

PhishZip is an evolving research project. If you are interested in early access, please click here.


  1. Rizka Purwanto, Arindam Pal, Alan Blair and Sanjay Jha, 'PhishZip: A new Compression-based Algorithm for Detecting Phishing Websites', IEEE Conference on Communications and Network Security (CNS 2020), Avignon, France.
  2. PhishTank: A comprehensive repository of phishing websites.
