JPL, meet PDF.
While NASA's Jet Propulsion Laboratory (JPL) is renowned for piloting rovers on Mars and deploying spacecraft to study planets in the solar system, JPL's latest project is more down-to-earth: assembling the world's largest publicly available archive of PDFs for security research.
PDF files are the most popular form of digital document in the world. And while they might look like scanned copies of paper documents, they are actually collections of text, images, movies and active scripts that aren't as secure as they should be given their ubiquity. To address this concern, JPL has partnered with the nonprofit PDF Association to develop the new archive of files that will help researchers analyze potential threats across a wide library of real PDFs.
The project involves assembling roughly 8 million PDFs totaling more than 8 terabytes of data from various online sources. The effort is part of a Defense Advanced Research Projects Agency (DARPA) initiative called Safe Documents (SafeDocs), which aims to make digital documents safe from malicious code and other security concerns.
"PDFs are used everywhere and are important for contracts, legal documents, 3D engineering designs, and many other purposes," Tim Allison, a JPL data scientist, said in a statement. "Unfortunately, they are complex and can be compromised to hide malicious code or render different information for different users in a malicious way. To confront these and other challenges from PDFs, a large sample of real-world PDFs needs to be collected from the internet to create a shared, freely available resource for software experts."
Using the freely available Common Crawl public repository of web crawl information as a starting point, JPL researchers identified PDFs to add to the collection, including those that were incomplete due to Common Crawl's download limit of 1 megabyte per downloaded file. JPL then accessed those PDF URLs directly to download the full documents, ensuring a fully representative archive of the types of PDFs accessible on the web.
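For readers curious how that selection step might look in practice, here is a minimal Python sketch. It is an illustration only, not JPL's code: the `CrawlRecord` fields, the `is_pdf` and `plan_downloads` helpers, the example URLs, and the choice to treat 1 megabyte as 1,048,576 bytes are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumption: "1 megabyte" is taken to mean 1,048,576 bytes.
TRUNCATION_LIMIT_BYTES = 1024 * 1024


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical, simplified view of one crawled response."""
    url: str
    mime: str
    payload_bytes: int  # bytes the crawl actually stored for this URL


def is_pdf(record):
    """Treat a record as a PDF if its MIME type or URL suffix says so."""
    return (
        record.mime.lower() == "application/pdf"
        or record.url.lower().endswith(".pdf")
    )


def plan_downloads(records, limit=TRUNCATION_LIMIT_BYTES):
    """Split PDF records into complete copies and URLs to re-fetch in full."""
    complete, refetch = [], []
    seen = set()
    for record in records:
        if not is_pdf(record) or record.url in seen:
            continue  # skip non-PDFs and duplicate URLs
        seen.add(record.url)
        # A stored payload that reaches the limit was probably cut off.
        if record.payload_bytes >= limit:
            refetch.append(record.url)
        else:
            complete.append(record.url)
    return complete, refetch


records = [
    CrawlRecord("https://example.org/a.pdf", "application/pdf", 120_000),
    CrawlRecord("https://example.org/big.pdf", "application/pdf", 1_048_576),
    CrawlRecord("https://example.org/index.html", "text/html", 5_000),
]
complete, refetch = plan_downloads(records)
print("usable as crawled:", complete)
print("re-download in full:", refetch)
```

In this sketch, any PDF whose stored size reaches the limit is queued for a direct download from its original URL, which mirrors the two-step approach described above.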
By making the collection available to the public, JPL hopes researchers will be able to use and analyze the PDFs to identify better ways of securing the information these documents contain.