2017 - Robert L. Bolin - The "Lost" Federal Government Technical Reports

PyBay
August 11, 2017

Robert L. Bolin has discovered roughly 100,000 Federal technical reports that have not been digitized and are not indexed online. He proposes using Python to parse the PDF version of the printed index and build a modern library index that gives people intellectual access to those reports.

Video: https://youtu.be/4n1WPSYDCyE

Transcript

  1. Background
     • I am a graduate student working on a Certificate in Digital Humanities at the University of Nebraska-Lincoln.
     • I am also an emeritus faculty member. I was previously the business librarian at UNL.
  2. The Situation
     During my career as a librarian, I discovered a collection of roughly 100,000 Federal government technical reports that are effectively “lost” because there is no online intellectual access to them.
     • They have not been digitized.
     • Their printed indexes are difficult to search, even in PDF format.
  3. Scope of the Collection
     • The collection covers a vast range of technical and scientific subjects representing millions of dollars of Federal R&D spending from 1946 to 1961.
     In addition, it includes:
     • Documents related to the American development of radar and other weapons and technology during World War II, and
     • Material obtained from the archives of German industry and government and of the German armed forces following World War II.
  4. My First Effort
     I attempted to create a database listing the “lost” technical reports in XML, entering the data by hand. The effort worked; however, it was too slow. I would not live long enough to complete the project. Here is a screenshot from my first effort:
  5. My Current Plan
     I am learning Python so that I can use its powerful text-processing capabilities to parse the PDF version of the printed index and create a modern online library database giving access to the lost Federal technical reports.
  6. I Want to Fail
     I hope that someone will digitize the lost technical reports, making an index unnecessary. The reports are held by the Library of Congress, and they are not copyrighted. They would be easy to digitize with modern digitization technology.
  7. Further Resources
     I have written an article on the lost technical reports. It is a bit dated but essentially correct. It is at: http://digitalcommons.unl.edu/libraryscience/158/
     My first attempt at an index to the lost technical reports is available at: http://unllib.unl.edu/Bolin_resources/bsir-xml/
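
The plan in slide 5 (parse the text of the printed index into database records) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaker's code: it assumes the text has already been extracted from the PDF, and the one-line entry layout (report number, title, year), the names `ENTRY_RE`, `parse_index`, and `to_csv`, and the sample entries are all invented here. The real index layout is not described in the slides and is likely messier.

```python
import csv
import io
import re

# ASSUMPTION: each index entry sits on one line as
#   "<letters> <digits>. <title>. <year>."
# This layout is hypothetical; the real printed index may differ.
ENTRY_RE = re.compile(
    r"^(?P<number>[A-Z]+ ?\d+)\.\s+(?P<title>.+?)\.\s+(?P<year>19\d{2})\.\s*$"
)


def parse_index(text):
    """Split extracted index text into parsed records and unparsed lines."""
    records, unparsed = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = ENTRY_RE.match(line)
        if match:
            records.append(match.groupdict())
        else:
            # Keep lines that do not fit the assumed layout for manual review.
            unparsed.append(line)
    return records, unparsed


def to_csv(records):
    """Serialize parsed records as CSV text, ready to load into a database."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["number", "title", "year"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample text standing in for text extracted from the PDF.
SAMPLE = """\
XX 101. An invented report title. 1946.
XX 102. Another invented title, with a comma. 1947.
This line does not match the assumed layout
"""

records, unparsed = parse_index(SAMPLE)
print(to_csv(records))
print("Lines needing manual review:", unparsed)
```

Collecting the lines that do not match, rather than discarding them, keeps the hand-checking effort limited to the entries the pattern cannot handle.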