NSA and Big Data: How Technology Innovation Fuels PRISM

NSA Internet Phone — The U.S. National Security Agency logo appears on an Apple iPhone display. Reuters

The National Security Agency has been involved in intelligence-gathering schemes since its inception in 1952, but 21st century technology has advanced far beyond the wiretap and the codebook. Modern intelligence gathering, like the recently unveiled PRISM program, is the product of the “big data” era.

“There’s nothing surprising technically” about programs like PRISM, data-mining expert and KDNuggets editor Gregory Piatetsky-Shapiro said in a phone interview.

Most technology experts can’t speak with too much certainty about a program they haven’t seen, like PRISM, but based on publicly available information, something like PRISM has been possible for years. The same innovations in software and hardware that aid your Google query or help advertisers track your habits online -- like when you examine a book on Amazon and then see an ad for that book pop up later on Facebook -- also allow the NSA to sort through reportedly tens of billions of pieces of information a month.

One of the major components of PRISM is believed to be an open-source database called Apache Accumulo, which the NSA began working on in late 2007. Originally called CloudBase, Accumulo is built on top of a software framework called Apache Hadoop and is similar to Google’s BigTable storage system. (If you would like to buy Accumulo for yourself, some of the developers who worked on the project with the NSA sell a commercial version through their company, Sqrrl.)

“Accumulo’s ability to handle data in a variety of formats -- a characteristic called ‘schemaless’ in database jargon -- means the NSA can store data from numerous sources all within the database and add new analytic capabilities in days or even hours,” Derrick Harris wrote for GigaOM.
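To make “schemaless” concrete, here is a minimal sketch in Python of a sorted key-value table in the general style of BigTable-like stores. It does not use Accumulo's actual API, and it leaves out real features such as timestamps and cell-level visibility labels. The `ToyTable` class, its methods and the sample records are all hypothetical, invented for illustration only.

```python
class ToyTable:
    """Hypothetical in-memory stand-in for a sorted, schemaless key-value table."""

    def __init__(self):
        # Each cell is addressed by (row, column family, column qualifier).
        # No column has to be declared before it is written.
        self._cells = {}

    def put(self, row, family, qualifier, value):
        """Write one cell; new kinds of data need no schema change."""
        self._cells[(row, family, qualifier)] = value

    def scan(self, row):
        """Return every cell stored under a row, in sorted key order."""
        return [
            (family, qualifier, value)
            for (r, family, qualifier), value in sorted(self._cells.items())
            if r == row
        ]


table = ToyTable()
# Records from different sources, with different shapes, share one table.
table.put("user42", "email", "subject", "Lunch?")
table.put("user42", "call", "duration_sec", "95")
table.put("user77", "chat", "handle", "alice")

print(table.scan("user42"))
```

Because each cell carries its own column names, a new data source only means writing new keys, which is roughly the flexibility Harris describes.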

Some of the other advances aren’t necessarily of the hardware or software variety. The science of studying networks has been growing by leaps and bounds, allowing analysts to tease relationships from seemingly unrelated data points.

“If the NSA just has the metadata -- who calls whom -- that’s sufficient to determine the status of people,” Piatetsky-Shapiro said. “You don’t necessarily need the conversation if you have the network.”

Piatetsky-Shapiro pointed to a humorous Slate article published Monday that imagined British agents flagging Paul Revere as a person of interest based on his relationships with other colonial independence agitators.

“Rest assured that we only collected metadata on these people, and no actual conversations were recorded or meetings transcribed,” Duke University sociologist Kieran Healy wrote for Slate. “All I know is whether someone was a member of an organization or not. Surely this is but a small encroachment on the freedom of the Crown’s subjects.”
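The idea that membership lists alone can single out a well-connected person can be sketched in a few lines of Python. The names, clubs and numbers below are placeholders invented for this example, not historical data, and the scoring (counting shared memberships) is just one simple way to measure connectedness, not necessarily the method any analyst or agency uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical metadata: who belongs to which organization.
memberships = {
    "person_a": {"club_1", "club_2", "club_3"},
    "person_b": {"club_1"},
    "person_c": {"club_2", "club_3"},
    "person_d": {"club_3"},
}


def shared_groups(memberships):
    """Count how many organizations each pair of people has in common."""
    links = Counter()
    for p, q in combinations(sorted(memberships), 2):
        overlap = len(memberships[p] & memberships[q])
        if overlap:
            links[(p, q)] = overlap
    return links


def connectedness(links):
    """Sum each person's shared memberships across all of their links."""
    score = Counter()
    for (p, q), overlap in links.items():
        score[p] += overlap
        score[q] += overlap
    return score


links = shared_groups(memberships)
scores = connectedness(links)
print(scores.most_common(1))  # the most connected person in this toy data
```

No conversation is read at any point; the ranking falls out of the membership lists alone.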

The leaks have also highlighted the degree to which almost all Internet communication is part of a giant interconnected and tangled web. Online, the difference between a foreign communication, which the NSA might flag, and a domestic one, which it shouldn't, can sometimes get a little hazy.

“One interesting tidbit from the Guardian leaks is how much the U.S. is the center of global communications,” Piatetsky-Shapiro said. Internet communications often “take the cheapest route.” So, because the U.S. has so much available capacity, an email from, say, Pakistan to Canada could be routed through America.
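A rough way to picture “the cheapest route” is a shortest-path search over a map of links with costs attached. The Python sketch below uses Dijkstra's algorithm on a toy network. The link costs are made-up numbers chosen only to illustrate the point, and real Internet routing also depends on commercial agreements and operator policy, which this example ignores.

```python
import heapq


def cheapest_route(links, source, target):
    """Dijkstra's algorithm: return (total cost, path) or None if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in links.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


# Hypothetical link costs; lower means cheaper or more available capacity.
links = {
    "Pakistan": {"US": 2, "Europe": 3},
    "Europe": {"Canada": 4, "US": 1},
    "US": {"Canada": 1},
}

print(cheapest_route(links, "Pakistan", "Canada"))
```

With these invented costs, the cheapest path from Pakistan to Canada passes through the U.S., even though neither endpoint is there.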

When one imagines how the NSA might go about analyzing the content from all the different types of communications it has stored, a line of spooks in a darkened room poring over emails is not what comes to mind these days. The customer service industry is already using language-processing algorithms to break down spoken and written sentences. By breaking down sentence structures and weighing the words contained in any of the various modes of communication, it’s possible for a program to organize messages by “intent.” It’s not too big a leap to assume that whatever actual conversations the NSA does capture would be sifted by a similar algorithm.

“If that stuff’s being used in the commercial world, it’s logical to assume they’d be using it as well,” Forrester Research analyst Glenn O’Donnell said in a phone interview.
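As a very simplified illustration of sorting messages by “intent,” the Python sketch below scores a message against hand-picked word weights. The intents, words and weights are hypothetical, and commercial language-processing systems are far more sophisticated: they also analyze sentence structure, which this example does not attempt.

```python
import re

# Hypothetical word weights for two customer-service intents.
INTENT_WEIGHTS = {
    "billing": {"refund": 2.0, "charge": 1.5, "invoice": 1.5},
    "cancel": {"cancel": 2.0, "close": 1.0, "account": 0.5},
}


def score_intents(message):
    """Add up the weight of every known word, separately for each intent."""
    words = re.findall(r"[a-z']+", message.lower())
    return {
        intent: sum(weights.get(word, 0.0) for word in words)
        for intent, weights in INTENT_WEIGHTS.items()
    }


def classify(message):
    """Pick the highest-scoring intent, or 'unknown' if nothing matched."""
    scores = score_intents(message)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


for text in ["Please cancel my account", "I want a refund for this charge", "Hello there"]:
    print(text, "->", classify(text))
```

The same weighing step works on any text, whether it started as an email, a chat message or a transcribed call.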
