The numbers are mind-boggling: 11.5 million documents in total, including 4.8 million emails, 2.1 million PDFs, 1.1 million images and 320,000 text files. The size of the Panama Papers leak, at 2.6 terabytes of data, eclipses any previous leak by a whistleblower. To put it in context, the amount of data in the Panama Papers leak was 2,000 times the amount in the WikiLeaks State Department cables in 2010.
Trying to sift through data like this manually would be a Sisyphean task, so technology was required. Enter the little-known Australian company Nuix.
The software company has worked with the D.C.-based International Consortium of Investigative Journalists (ICIJ) for over four years, giving it free access to software that can take huge troves of unstructured data and turn them into an indexed and searchable database. Nuix took center stage this week with the unprecedented leak, which exposed decades' worth of documents from Panama City-based law firm Mossack Fonseca and showed how the world's mega-rich shelter their wealth and conceal its origins through shell companies based in tax havens like the British Virgin Islands.
Without the Nuix software, the company says, journalists would simply not have been able to tell the Panama Papers story. “It would not be possible to conduct this investigation with a manual workflow,” Carl Barron, a senior consultant from Nuix, told International Business Times.
In this case Nuix worked with the German newspaper Süddeutsche Zeitung (SZ), whose reporter Bastian Obermayer was contacted via encrypted chat about the leak. The source, who remains anonymous, said their life was in danger but that they wanted the information released to “make these crimes public.”
Having established contact, the source delivered the data piecemeal to the reporter over a period of months. Despite the volume, Barron says Nuix’s platform is powerful enough that, given the right hardware, it could have churned through all the documents in just one and a half days; because the data arrived in pieces, however, the indexing process took two months to complete.
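To give a sense of what a day-and-a-half run would imply, the figures quoted in this article can be turned into a rough sustained throughput. The short Python sketch below is back-of-the-envelope arithmetic only: it assumes decimal terabytes (1 TB = 10^12 bytes) and a perfectly even processing rate, neither of which is stated by Nuix, and the variable names are purely illustrative.

```python
# Figures quoted in the article.
total_documents = 11_500_000
total_bytes = 2.6e12          # 2.6 TB, assuming decimal terabytes (an assumption)
processing_days = 1.5         # Barron's best-case estimate

# Convert the processing window to seconds.
processing_seconds = processing_days * 24 * 60 * 60

# Sustained rates implied by finishing in that window (assumes an even pace).
docs_per_second = total_documents / processing_seconds
mb_per_second = total_bytes / processing_seconds / 1e6

# Average size of a single document in kilobytes.
average_kb_per_document = total_bytes / total_documents / 1e3

print(f"{docs_per_second:.1f} documents per second")
print(f"{mb_per_second:.1f} MB per second")
print(f"{average_kb_per_document:.0f} KB per document on average")
```

Under those assumptions the claim works out to roughly 89 documents and about 20 megabytes processed every second, sustained for 36 hours.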
The result was a Google-style interface that allowed the almost 400 journalists who ended up collaborating on the leak to access the indexed data very easily. “Once the data is indexed, it is then in the Nuix platform, which is completely searchable,” Barron said. “It is all structured in our index allowing you to filter based on file types; you can filter based on email attachments and even items that are not searchable.”
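The general idea behind an indexed, filterable search of this kind can be illustrated with a minimal Python sketch of an inverted index that supports a file-type filter. This is not Nuix's engine, whose internals are not described here; the class name `TinyIndex`, its methods and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each word to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> {doc_id, ...}
        self.file_types = {}               # doc_id -> file type, e.g. "email"

    def add(self, doc_id, file_type, text):
        """Record the document's type and index every word in its text."""
        self.file_types[doc_id] = file_type
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(doc_id)

    def search(self, query, file_type=None):
        """Return sorted ids of documents that contain every query word."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        # Intersect the posting sets so only documents with all words remain.
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        # Optionally narrow the results to one file type.
        if file_type is not None:
            hits = {d for d in hits if self.file_types[d] == file_type}
        return sorted(hits)


# Hypothetical sample documents, invented for illustration.
index = TinyIndex()
index.add("doc1", "email", "Please register the shell company in the British Virgin Islands")
index.add("doc2", "pdf", "Certificate of incorporation for the shell company")
index.add("doc3", "email", "Lunch on Friday?")

print(index.search("shell company"))                   # both matching documents
print(index.search("shell company", file_type="pdf"))  # only the PDF
```

Because the word-to-document mapping is built once, up front, each later search is a quick set lookup rather than a scan of every file, which is why the indexing step matters so much at this scale.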
On top of this, the ICIJ developers built a search engine protected by two-factor authentication and shared the URL for the system via encrypted email with the journalists working on the project. The system included a real-time chat feature to help journalists collaborate and also provided real-time translation services for documents in foreign languages.
While text files would have been relatively easy to search, the real key to the success of this leak was being able to index and search documents such as images and PDFs, including contracts, passports and other scanned files.
Using optical character recognition (OCR), Nuix’s software is able to pick out text from the images and link names and locations to those found elsewhere in the data. “Once all the indexing was complete, once all the OCR work was complete, [the journalists] could simply search across all of the information and begin connecting the dots,” Barron said.
Barron stressed that Nuix did not at any point have access to the data in question. It was brought on board very early in the process, after SZ had reached out to the ICIJ for help in analyzing the leak. “We were told a brief overview, very typical in this scenario,” Barron said, revealing that the company wasn’t even aware the story was going to break on Sunday night.
Nuix provided SZ with a number of licenses for its software, which was installed on high-performance computers behind a firewall at the newspaper’s headquarters. The data was stored on systems that were never connected to the internet, according to Barron, in order to protect it from those who may have been seeking to destroy it.
The Nuix software was originally developed in 2000 by a group of scientists who wanted to create a processing engine to bring structure to unstructured data. The result of 16 years of research and development is Nuix’s patented parallel processing engine, which the company claims can search virtually unlimited volumes of data with “unmatched speed and precision.” Barron says that no other software on the market could have processed the Panama Papers with the speed and accuracy that Nuix did.
Beyond the company's long-standing relationship with the ICIJ, the Nuix software was likely chosen because it handles this scale of data on a daily basis. “This is only a medium-sized document set in the worlds of e-discovery or regulatory investigations; some of our customers handle similar volumes of data every day,” Eddie Sheehy, the company’s CEO, said in an emailed statement.
Nuix sells its software in 65 countries around the world and works with organizations such as the United Nations and the U.S. Secret Service as well as many other law enforcement agencies and governments.
As well as donating free licenses of its software to efforts like the Panama Papers, the company sells a product called Proof Finder, which it describes as a fully featured version of its software that can “thoroughly investigate data sets of up to 15GB” for only $100 a year. All money earned from Proof Finder sales goes to the not-for-profit organization Room to Read, which works to build schools and increase literacy among children in Asia and Africa, with dedicated support to help girls finish high school.