Given the following:

  1. Necessity is the mother of invention.
  2. There is a need to “move” artifacts from the physical domain to the digital domain to increase access and transportability while protecting the original.
  3. To “do it right” could easily exceed $100,000 for a simple text digitization project.
  4. A digital image of the original is just the beginning.
  5. Preservation also means resolving authenticity issues.
  6. Knowledge management dictates adequate access to the information contained “IN” the text, not just a representation “OF” the text.

I am experimenting with a low-cost tool suite to enable historians to digitally capture and then manage original texts. Admittedly, the digital capture piece is a tiny fraction of the project, but in a low-cost environment it may require the most labor. Converting the information “IN” an artifact and defining the meta-data associated with it are the trickier pieces; both require detailed analysis and consistent, uniform, logical rules to be useful.

I would like to develop a methodology to capture and preserve an original artifact, efficiently convert the information contained in the artifact to machine-readable language, establish a standard set of meta-data associated with a text-based artifact, and render the results as an authentic reproduction of the original in a generally accepted format that is fully Section 508-compliant and machine-searchable.

I think I can do the first half of that. I do not yet know how to do the last half, but between MySQL, XML, and a few years of database development experience, I think I can pull off the rest.

Here’s what I have so far. Last spring, while volunteering at the National Guard Education Foundation with Zanya, I was given permission to digitally capture, from the original, The Militiaman’s Pocket Companion, published in 1822. All I walked away with were 5-megapixel JPEGs of each page, taken with my cheap HP digital camera. This is page 1 of the Preface:

Using Adobe Photoshop, I was able to create the following, easily readable reproduction:

Adobe Acrobat’s OCR capability rendered about a 90% solution. Unsatisfied, I tested some other software and found that SimpleOCR’s freeware achieved about a 98% solution:

After manually reviewing, comparing, and editing the copy against the original, I am left with this 99.x% machine-readable, fully searchable, copiable, and transformable data object.
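
The percentages above are rough estimates. As a minimal sketch of how such a figure could be measured (this is not how the numbers above were obtained), one could compare the raw OCR output against the hand-corrected copy, character by character. The function name and the sample strings below are hypothetical, and Python's standard `difflib` module is just one convenient way to do the comparison:

```python
import difflib


def character_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Return the share of corrected-text characters that also appear,
    in order, in the OCR output (0.0 to 1.0)."""
    if not corrected_text:
        # Nothing to recognize: perfect only if the OCR also produced nothing.
        return 1.0 if not ocr_text else 0.0
    matcher = difflib.SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(corrected_text)


if __name__ == "__main__":
    # Hypothetical strings, not text from the actual book.
    ocr_output = "Tbe duties of the militia"
    hand_corrected = "The duties of the militia"
    print(f"{character_accuracy(ocr_output, hand_corrected):.1%}")
```

Measured this way, a page would need to be scored against its corrected copy before and after each OCR tool is tried, which keeps the comparison between tools consistent.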

That’s the easy part. I think. From there, one has to develop the appropriate data normalization for massive strings, and I am not sure how well that will work. Rules for generating appropriate meta-data must also be developed and applied to the database structure. Finally, queries have to be written to extract portions of the text as the user requires… and I have zero experience with that in the thin-client world of the internet.
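
To make the database idea a little more concrete, here is a rough sketch of one possible normalized layout: one table for the artifact, one row per page of transcribed text, and a separate name/value table for meta-data. Everything in it is an assumption for illustration: the table and column names are made up, the sample row is placeholder text rather than anything from the book, and it uses Python's built-in `sqlite3` module as a stand-in for MySQL so it can run anywhere.

```python
import sqlite3

# Hypothetical schema: artifact -> pages, plus free-form name/value meta-data.
SCHEMA = """
CREATE TABLE artifact (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    published INTEGER
);
CREATE TABLE page (
    id          INTEGER PRIMARY KEY,
    artifact_id INTEGER NOT NULL REFERENCES artifact(id),
    label       TEXT NOT NULL,   -- e.g. "Preface, p. 1"
    image_file  TEXT,            -- the photograph of the original page
    body        TEXT NOT NULL    -- the corrected transcription
);
CREATE TABLE metadata (
    artifact_id INTEGER NOT NULL REFERENCES artifact(id),
    name        TEXT NOT NULL,
    value       TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one placeholder artifact."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO artifact (title, published) VALUES (?, ?)",
        ("Sample Pocket Companion", 1822),
    )
    artifact_id = cur.lastrowid
    conn.execute(
        "INSERT INTO page (artifact_id, label, image_file, body) VALUES (?, ?, ?, ?)",
        (artifact_id, "Preface, p. 1", "preface_01.jpg",
         "Placeholder text about the duties of the militia."),
    )
    conn.execute(
        "INSERT INTO metadata (artifact_id, name, value) VALUES (?, ?, ?)",
        (artifact_id, "capture_method", "digital camera"),
    )
    conn.commit()
    return conn


def search_pages(conn: sqlite3.Connection, term: str) -> list:
    """Return (title, page label) for every page whose text contains the term."""
    rows = conn.execute(
        "SELECT a.title, p.label FROM page AS p "
        "JOIN artifact AS a ON a.id = p.artifact_id "
        "WHERE p.body LIKE ? ORDER BY p.id",
        (f"%{term}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(search_pages(build_demo_db(), "militia"))
```

Storing the text one page per row keeps each string a manageable size and lets a search result point back to the matching page image. The name/value meta-data table is only one option; a fixed set of columns would be stricter but less flexible.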

Simultaneously, I would like to investigate XML’s properties to see whether it offers a more useful method. Further, using digital signatures and electronic certificates, I think I can vouch for the authenticity of the changes made and at least be able to claim the mistakes as mine alone.
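
As a hedged sketch of what the XML route might look like, the snippet below wraps one page's transcription in a small XML record and stores a SHA-256 digest of the text alongside it. The element names are invented for illustration, not taken from any standard. Note also that a digest is not a digital signature: it only shows whether the text has changed since the record was built. Vouching for who made the changes would require signing that digest with a private key and certificate, which is beyond Python's standard library and is not shown here.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_page_xml(label: str, image_file: str, text: str) -> str:
    """Serialize one page as XML, with a SHA-256 digest of its transcription."""
    page = ET.Element("page", {"label": label, "image": image_file})
    body = ET.SubElement(page, "transcription")
    body.text = text
    digest = ET.SubElement(page, "sha256")
    digest.text = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ET.tostring(page, encoding="unicode")


def transcription_is_unchanged(xml_string: str) -> bool:
    """Recompute the digest and compare it with the one stored in the record."""
    page = ET.fromstring(xml_string)
    text = page.findtext("transcription") or ""
    recorded = page.findtext("sha256")
    return hashlib.sha256(text.encode("utf-8")).hexdigest() == recorded


if __name__ == "__main__":
    # Placeholder values, not text from the actual book.
    record = build_page_xml("Preface, p. 1", "preface_01.jpg", "Placeholder text.")
    print(record)
    print(transcription_is_unchanged(record))
```

Because the digest sits inside the same record as the text, anyone who can edit the text can also recompute the digest. That gap is what a real signature backed by a certificate would close.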

The intent is to develop a very low-cost, efficient methodology that preserves the original, converts it within acceptable guidelines, ensures its authenticity, and enables transmission over the web, thus allowing an unlimited number of reviewers to examine a text without harming the physical object. In the end, at least this text would be made available back to the National Guard Education Foundation for display as part of their emerging web presence.

Having said that, I worry that this methodology and development may be too narrowly focused and not what this class intends. For the purposes of the project, I am focusing primarily on the methodology of the archiving and less on the actual document preserved.

Thoughts? Comments? Derision?


September 20, 2009


  1. Interesting concept. Have you checked around to see if anything else like it is out there? Would your project be a one-stop shopping application to include scanning, transforming, storing, and disseminating? Or are you thinking about patching together several tools that the user would navigate through in sequence? And I’m guessing your target audience would be non-technical folks looking for an easy tool to use?

    Comment by colamaria | September 20, 2009

  2. Interesting. The development of the “correct” meta data enforces an organization on the data which other researchers might not agree with. Part of your project, if you are going to make the development of meta data part of that project, is a method for disseminating candidate tags and getting feedback from other people who might use your data. Or perhaps it might be a method that lets users apply their own tags to data that has already been processed, allowing any number of ways of categorizing the data. But that also means problems in retrieving data. I don’t think this is too narrow a project, but it is a project that might take the whole semester just to define its bounds.

    Comment by theoldscholar | September 20, 2009

    • Good point, oldscholar. Maybe it is sufficient and useful to develop a normalized metadata architecture that individual historians, archivists, and others could apply in any way they see fit, as long as it falls within a logical framework.

      Colamaria, I am not thinking about a one-stop shop/service so much as a methodological “how-to” approach, complete with a sample architecture and framework for what would have to be a very simple but flexible database. Ideally, the “user” would see how I created a digital document from the original, and how I could analyze/access/review the data using data analysis tools while making the entire document “available” to things like Google crawlers. But YES, the focus is on the non-techies who are willing to learn a bit…

      Admittedly, I am not sure how all that will work out.

      Thanks for the feedback!
      – DeadGuyQuotes

      Comment by DeadGuyQuotes | September 20, 2009

  3. This is way beyond my tech knowledge; but I love the idea. You initially scanned a book that was already in type that could be “understood” by the computer as type. Would this work for written documents – all the different scrawl early Americanists are subjected to? Or things such as ledgers? That may be asking way too much…but I can dream.

    Comment by lprice3 | September 21, 2009

    • Lprice3 –

      The shareware I am using claims it can perform optical character recognition (OCR) on handwriting, and it claims it “learns”. Some of the more advanced systems are supposedly pretty good at it, but early scripts would present significant, though not insurmountable, challenges.

      Comment by DeadGuyQuotes | September 21, 2009

  4. Wow – now I really love your project.

    Comment by lprice3 | September 21, 2009

  5. As I understand it, you are looking for a cost-effective methodology for preserving and digitizing documents. If that is your purpose, then you have my blessing. (-:

    Comment by alex_lesanu | September 21, 2009

    • Now that I have Alex’s blessing, I will move forward with ALL haste!!

      Comment by DeadGuyQuotes | September 21, 2009

  6. I was thinking of your project when I was perusing Alex’s link and found this page. I got there by following Alex’s link and then going to Britain the General Strike. There are all these documents and artifacts, and if you go into the advanced search you can enter things like “Churchill” or “Chamberlin” and it will find the documents in which those people are mentioned. I originally thought it was doing keyword searches, but it returned some meeting attendee lists with a J. Churchill in them, written in cursive. Did they translate each document? Did they do what you are suggesting? I don’t know, but this is a very useful site, especially since I am doing a paper in my other course this semester on British culture and labor laws between the wars.

    So thank you both.

    Comment by TheOldScholar | September 22, 2009

    • It appears that they are using the same basic idea behind the database that serves the National Security Archives at GW. (accessible by GMU students at )

      They essentially use metadata tags to identify the artifacts and facilitate searching. The problem, however, lies in the quality and accuracy of the metadata tags. In my research on the Cuban Missile Crisis, I found numerous errors in the tags and significant limitations in the search capability. I wound up taking a needle-in-a-haystack approach to sorting through artifacts. That way I stumbled on a lot of real winners, but it was a huge pain and a labor-intensive process.


      Comment by DeadGuyQuotes | September 22, 2009

  7. I’ll let you know if I find the same limitations in this site. Being able to do string searches and keyword searches would be very beneficial. Or maybe a wiki-type approach, where each researcher can “fix” the quality and accuracy of the keyword metadata, thus allowing those searches to “learn” over time. Sort of the old Artificial Intelligence learning capability. Then the search engine could become like the old librarian who has been around for 50 years, knows what people are looking for, and is genuinely helpful.

    Comment by theoldscholar | September 22, 2009

  8. Old Friend, I see this becoming one of the most significant achievements in scholarship in recent times. The potential for expansion seems limitless. Fully-developed, such a site would rival the ancient Library of Alexandria in its value to scholars, making you somewhat akin to Ptolemy I Soter. You need collaborators and funding; this is too important for it to be starved of resources.

    Comment by Cap'n Dub | October 19, 2009

