Digitizing the Past
Given the following:
- Necessity is the mother of invention.
- There is a need to “move” artifacts from the physical domain to the digital domain to increase access and transportability while protecting the original.
- To “do it right” could easily exceed $100,000 for a simple text digitization project.
- A digital image of the original is just the beginning.
- Preservation also means resolving authenticity issues.
- Knowledge management dictates adequate access to the information contained “IN” the text, not just a representation “OF” the text.
I am experimenting with a low-cost tool suite to enable historians to digitally capture and then manage original texts. Admittedly, the digital capture piece is a tiny fraction of the project, though in a low-cost environment it may require the most labor. The trickier pieces are the conversion of the information “IN” an artifact and the meta-data associated with it; both require detailed analysis and consistent, uniform, logical rules to be made useful.
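To give those “consistent, uniform, logical rules” a concrete starting point, here is a minimal sketch of a uniform metadata record, built with Python’s standard xml.etree.ElementTree. The field names are my own placeholders, loosely inspired by Dublin Core, not a settled standard:

```python
# One metadata record per artifact, always the same fields in the same
# order. Field names here are illustrative placeholders.
import xml.etree.ElementTree as ET

record = ET.Element("artifact")
for tag, value in [
    ("title", "The Militiaman's Pocket Companion"),
    ("date", "1822"),
    ("holder", "National Guard Education Foundation"),
    ("capture", "5-megapixel JPEG, one image per page"),
]:
    ET.SubElement(record, tag).text = value

print(ET.tostring(record, encoding="unicode"))
```

Whatever the final field set turns out to be, the point is that every artifact gets the same record, filled in the same way.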
I would like to develop a methodology to capture and preserve an original artifact, efficiently convert the information contained in the artifact to machine-readable language, establish a standard set of meta-data associated with a text-based artifact, and render the results as an authentic reproduction of the original in a fully Section 508-compliant, machine-searchable, generally accepted format.
I think I can do the first half of that. I have no idea how to do the last half, but between MySQL, XML, and a few years of database development experience I think I can pull off the rest.
Here’s what I have so far. Last spring, while volunteering at the National Guard Education Foundation with Zanya, I was given permission to digitally capture, from the original, The Militiaman’s Pocket Companion, published in 1822. All I walked away with were 5-megapixel .jpegs of each page, taken with my cheap HP digital camera. This is the Preface, page 1:
Using Adobe Photoshop, I was able to create this easily readable reproduction:
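For what it’s worth, that cleanup step can also be scripted for free. Here is a minimal sketch using the Pillow imaging library; the filename and threshold value are illustrative, not my actual settings:

```python
# A rough, scriptable stand-in for the Photoshop cleanup: grayscale,
# stretch the contrast, then threshold to crisp black-on-white.
from PIL import Image, ImageEnhance, ImageOps

page = Image.open("preface_p1.jpg").convert("L")    # grayscale
page = ImageOps.autocontrast(page)                  # stretch the tonal range
page = ImageEnhance.Contrast(page).enhance(2.0)     # push the ink darker
page = page.point(lambda p: 255 if p > 140 else 0)  # simple threshold
page.save("preface_p1_clean.png")                   # lossless output for OCR
```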
Adobe Acrobat’s OCR capability rendered about a 90% solution. Unsatisfied, I tested some other software and found that SimpleOCR’s freeware achieved about a 98% solution:
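SimpleOCR is point-and-click, which is fine for one book; if this ever has to scale, a scriptable open-source engine like Tesseract (via the pytesseract wrapper) is another free option. A minimal sketch, assuming Tesseract is installed and the cleaned page images follow the naming below:

```python
# Batch-OCR the cleaned page images, writing one raw transcription per
# page. Accuracy will differ from SimpleOCR's; the point is repeatability.
import glob

import pytesseract
from PIL import Image

for path in sorted(glob.glob("pages/*_clean.png")):
    text = pytesseract.image_to_string(Image.open(path))
    with open(path.replace(".png", ".txt"), "w", encoding="utf-8") as out:
        out.write(text)
```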
After manually reviewing, comparing, and editing the copy against the original, I am left with this 99.x% machine-readable, fully searchable, copiable, and transformable data object.
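That manual review is the slow part. A diff at least focuses the eye on the roughly 2% the OCR got wrong; here is a sketch using Python’s standard difflib, with hypothetical filenames:

```python
# Print only the lines where the raw OCR and the corrected copy disagree,
# so the proofreader can jump straight to the remaining errors.
import difflib

with open("preface_ocr.txt", encoding="utf-8") as f:
    ocr_lines = f.readlines()
with open("preface_corrected.txt", encoding="utf-8") as f:
    good_lines = f.readlines()

for line in difflib.unified_diff(ocr_lines, good_lines,
                                 fromfile="ocr", tofile="corrected", n=0):
    print(line, end="")
```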
That’s the easy part. I think. From there, one has to develop the appropriate data normalization for massive strings; I’m not sure how well that will work. Also, rules for generating appropriate meta-data must be developed and applied to the database structure. Finally, the queries have to be written to extract portions of the data from the text as required by the user… something I have zero experience with in the thin-client world of the internet.
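I don’t have the answer yet, but to make the problem concrete, here is one way the normalization and a user query might look in MySQL, driven from Python via mysql-connector. Every table, column, and credential here is my own invention, a sketch rather than a design:

```python
# One row per artifact, one row per page; raw OCR kept separate from the
# corrected text; MySQL's FULLTEXT index does the searching.
import mysql.connector  # pip install mysql-connector-python

DDL = [
    """CREATE TABLE artifact (
        artifact_id INT AUTO_INCREMENT PRIMARY KEY,
        title       VARCHAR(255) NOT NULL,  -- e.g. The Militiaman's Pocket Companion
        pub_year    SMALLINT,               -- e.g. 1822
        holder      VARCHAR(255)            -- e.g. National Guard Education Foundation
    )""",
    """CREATE TABLE page (
        page_id     INT AUTO_INCREMENT PRIMARY KEY,
        artifact_id INT NOT NULL,
        page_label  VARCHAR(32),            -- 'Preface p. 1', etc.
        image_file  VARCHAR(255),           -- path to the archival image
        ocr_text    MEDIUMTEXT,             -- raw OCR output, never edited
        final_text  MEDIUMTEXT,             -- hand-corrected transcription
        FULLTEXT KEY ft_final (final_text), -- enables MATCH ... AGAINST
        FOREIGN KEY (artifact_id) REFERENCES artifact (artifact_id)
    )""",
]

conn = mysql.connector.connect(user="archivist", database="archive")  # hypothetical
cur = conn.cursor()
for stmt in DDL:
    cur.execute(stmt)

def find_pages(conn, phrase):
    """Return (page_label, first 120 chars) for pages matching the phrase."""
    cur = conn.cursor()
    cur.execute(
        "SELECT page_label, LEFT(final_text, 120) FROM page "
        "WHERE MATCH(final_text) AGAINST (%s)",
        (phrase,),
    )
    return cur.fetchall()
```

Keeping the raw OCR column untouched alongside the corrected text is deliberate: it preserves an audit trail back toward the original.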
Simultaneously, I would like to investigate XML’s properties to see whether that may render a more useful method. Further, using digital signatures and electronic certificates, I think I can vouch for the authenticity of the changes made and at least be able to claim the mistakes as mine alone.
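At its simplest, the signature idea looks like the following minimal sketch in Python, using the third-party cryptography package: sign the corrected transcription so that any later change, mine or otherwise, breaks the signature. The key handling is deliberately naive; real certificates would come from a certificate authority:

```python
# Sign the UTF-8 bytes of the transcription; verification fails if even
# one character has changed since signing.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

def sign_text(text: str) -> bytes:
    """Produce a detached signature over the transcription."""
    return private_key.sign(
        text.encode("utf-8"), padding.PKCS1v15(), hashes.SHA256()
    )

def verify_text(text: str, signature: bytes) -> bool:
    """True if the text is byte-for-byte what was signed."""
    try:
        private_key.public_key().verify(
            signature, text.encode("utf-8"),
            padding.PKCS1v15(), hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```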
The intent is to develop a very low-cost, efficient methodology that takes into account preserving the original, converting it within acceptable guidelines, ensuring its authenticity, and enabling transmission over the web, thus allowing an unlimited number of reviewers to examine the text without harming the physical object. In the end, at least this text would be made available back to the National Guard Education Foundation for display as part of their emerging web presence.
Having said that, I worry that this methodology and development may be too narrowly focused and not what this class intends. For the purposes of the project, I am focusing primarily on the methodology of the archiving and less on the actual document preserved.
Thoughts? Comments? Derision?