DeadGuyQuotes's Blog

American History in the Making

Developing Standards and Techniques for Digitizing History: Laying the Foundation for Future Collaboration & Development of Digital Artifacts


Abstract

In a multi-phase project, I will develop a low-cost methodology for digitally archiving documents, store the results in a standards-based data storage platform, and set the conditions to scale up from this foundation with future phases and funding, creating a collaborative, accessible, online digital archive with fully reproducible, searchable, capturable, translatable, and malleable datasets and online sources.

Phase I – Prototyping Perform data modeling and prototype a non-production database for testing and exploration. In essence, answer the question, “What data is in the artifact?” and develop the proper place for that data, through data normalization, for maximum efficiency and future use.
Phase II – Capture Perform, and document for repetition, a low-budget document capture and artifact preservation process in which a historic text is extracted from the original document, stored efficiently in a database model, and presented to the user in both the original capture (picture) format and a searchable .pdf or data-string format. See Digitizing the Past for a reasonably full explanation of this process. I will be leveraging access to artifacts from the National Guard Education Foundation’s archives.
Phase III – Web Access Develop the online access portal for this data store while archiving all available artifacts in the immediate database. This element will be little different from other online resources save the unique material available. The University of Michigan Digital Library offers what appears to be a common standard of .pdf representation. I want to go further and make the text itself a part of the data. This phase will also present a web-access portal template that other institutions can leverage, freely available in the spirit of open-source development.
Phase IV – Initial Expansion Develop partnerships and data shares across multiple institutions with similar projects in development or production. The level of participation directly influences the scale of this phase.
Phase V – Infinite Expansion Expand collaborative efforts by potentially making this capability available to amateur as well as resource-constrained archivists and historians, providing a standards-based methodology, a data capture technique, and a collaborative platform to share the data once stored. This aspect of the final phase will be limited only by technology maintenance and scalability costs.

Requirement

The requirement for this project is simple. Museums, archives, and libraries have a mission to preserve and make available their holdings. The costs of displays and of complex online archives are often prohibitive, limiting smaller institutions’ ability to succeed in their missions. By establishing a phased approach, institutions and individuals will be able to choose when and how they implement this methodology. Ultimately, this “how-to” can include a “where” capability, as collaboration and external input can be presented to the host institution or institutions for inclusion in their dataset. The requirement is to develop low-cost methods and technologies that enable resource-constrained archivists, curators, and historians to develop a worldwide audience for their unique data.

Features & Functions

The primary capability of this project will be a “how-to” methodology for a resource-constrained environment, detailing how to capture artifacts and translate them into datasets for future and other uses. To exemplify the methodology, a secondary feature will be the full presentation of The Militiaman’s Pocket Companion, published in 1822 and held by the National Guard Education Foundation in Washington, DC. As fully developed, the phases themselves offer staggered capability at each level of development.

Phase I – Prototyping Offers a functional assessment and the “how to capture and store the data” portion of this project. The result will be data snapshots and budgetary, capability, and technological assessments of what is involved in digitally capturing an artifact. It will also offer a detailed step-by-step guide to accomplishing this task in a very low-budget environment. This information will be presented in detail on my blog and on a static website at http://www.plague-rat.com.
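
As a first sketch of what the normalized, non-production model might look like, the statements below define three tables in MySQL (one of the candidate engines listed under Technologies). The table and column names are my own placeholders, not a finalized design:

    CREATE TABLE artifact (
        artifact_id  INT AUTO_INCREMENT PRIMARY KEY,
        title        VARCHAR(255) NOT NULL,  -- e.g., The Militiaman's Pocket Companion
        pub_year     SMALLINT,               -- e.g., 1822
        holding_org  VARCHAR(255)            -- e.g., National Guard Education Foundation
    );

    CREATE TABLE page (
        page_id      INT AUTO_INCREMENT PRIMARY KEY,
        artifact_id  INT NOT NULL,
        page_number  SMALLINT NOT NULL,
        image_file   VARCHAR(255),           -- path to the original photograph
        FOREIGN KEY (artifact_id) REFERENCES artifact (artifact_id)
    );

    -- One row of text per captured page keeps the transcription
    -- separate from, but linked to, the image of the page itself.
    CREATE TABLE page_text (
        page_id         INT PRIMARY KEY,
        ocr_text        TEXT,                -- raw OCR output
        corrected_text  TEXT,                -- hand-corrected transcription
        FOREIGN KEY (page_id) REFERENCES page (page_id)
    );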

It is my intent to complete Phase I within the scope of this class.

Phase II – Capture Fully capture and digitally preserve the target text. This will take the form of an e-book presented in three formats:

  1. .pdf from the original photographs
  2. .pdf from the original text (pre-OCR)
  3. .pdf from the OCR’d result.

In addition to the three formats, there will be an associated database with the texts, original photographs, and metadata.
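
To make the tie between the three formats and the database concrete, a hypothetical population of the Phase I sketch tables might look like the following (the file names and text strings are placeholders of mine, not actual project data):

    INSERT INTO artifact (title, pub_year, holding_org)
    VALUES ('The Militiaman''s Pocket Companion', 1822,
            'National Guard Education Foundation');

    -- Each photographed page is registered once
    -- (assuming this is the first artifact row, so artifact_id = 1)...
    INSERT INTO page (artifact_id, page_number, image_file)
    VALUES (1, 1, 'images/companion_p001.jpg');

    -- ...and carries both its raw OCR output and the corrected text.
    INSERT INTO page_text (page_id, ocr_text, corrected_text)
    VALUES (1, 'raw OCR output for page 1', 'corrected transcription of page 1');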

It is my intent to complete Phase II within the scope of this class.

Phase III – Web Access Outline a grant proposal to develop the web access portal that will professionally and efficiently exploit the data gathered in Phase II and allow for an expanding pool of artifacts to be included. Conceptually, this will fall somewhere between Google Books and Footnote.com, with a significant difference in metadata access and digital cross-linking.

The proposal will outline how the data will be presented from a data-centric point of view, with direct linkage to the artifact representations (the original photographs) while allowing for tagging and linking to and between other artifacts in the collection. Further, this data will be fully Section 508 compliant. Tagging and linking may be accomplished at a keyword level, at a subject level, or through other available metadata.
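
One possible way to realize the tagging, cross-linking, and keyword-level access described above on top of the Phase I sketch tables is shown below. The FULLTEXT index is a standard MySQL feature; the table names are again assumptions of mine:

    -- Keywords or subject headings, shared across the collection.
    CREATE TABLE tag (
        tag_id  INT AUTO_INCREMENT PRIMARY KEY,
        label   VARCHAR(100) NOT NULL UNIQUE
    );

    -- Many-to-many link so artifacts can be tagged and cross-referenced.
    CREATE TABLE artifact_tag (
        artifact_id  INT NOT NULL,
        tag_id       INT NOT NULL,
        PRIMARY KEY (artifact_id, tag_id),
        FOREIGN KEY (artifact_id) REFERENCES artifact (artifact_id),
        FOREIGN KEY (tag_id) REFERENCES tag (tag_id)
    );

    -- Keyword search over the corrected transcriptions, returning each
    -- matching page alongside its original photograph.
    ALTER TABLE page_text ADD FULLTEXT INDEX ft_corrected (corrected_text);

    SELECT a.title, p.page_number, p.image_file
    FROM page_text t
    JOIN page p ON p.page_id = t.page_id
    JOIN artifact a ON a.artifact_id = p.artifact_id
    WHERE MATCH (t.corrected_text) AGAINST ('militia drill');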

It is my intent to scope and present a grant proposal to accomplish Phase III within the scope of this class.

Phase IV – Initial Expansion Outlines the methodology and the architectural and collaborative framework for expansion to other organizations leveraging the same resource-constrained methodology. Ideally, this will be done in a nominal cost-sharing environment whereby the web access portal gains access to the archives and artifacts of other institutions while those institutions develop the datasets.

It is my intent to present a well-developed scope and vision for this phase to set the stage for future grant and development work on implementation as a part of the Phase III grant proposal for this class.

Phase V – Infinite Expansion Outlines an expandable methodology and an architectural and collaborative framework for expansion to a logically infinite number of organizations and contributors leveraging the resource-constrained artifact capture and data development techniques. Costs and limitations will be driven by scale and available technology.

It is my intent to present a well-developed concept for this phase identifying some of the risks and benefits of project pursuit to set the stage for future grant and development work on implementation as a part of the Phase III grant proposal for this class.

Audience

The audiences for this project will evolve as scale and participation evolve. As such, the anticipated audience is best defined by phase.

Phase I – Prototyping Targets small organizations and institutions as well as amateur and professional archivists, curators, and historians working in resource-constrained environments.
Phase II – Capture Narrowly targets the National Guard Education Foundation, the organization responsible for archiving the test artifacts I am using for this project. The larger target audience is the same as in Phase I, as Phase II intends to provide a practical demonstration of the results of the techniques outlined in Phase I. Since the capture is a process and the test will be one text, the audience is confined to a very practical level.
Phase III – Web Access Targets the same group identified in Phase I and incorporates the larger NGEF audience identified in Phase II. The first audience will benefit from the methodology presented as well as the freely available web-access portal template, while the second will benefit from the test artifact and the expanded holdings of the NGEF. Any actual web development will be presented on a very narrow scale; the grant proposal will highlight the larger target audience.
Phase IV – Initial Expansion Audiences will expand to include partner institutions and will involve a deeper connection to professional and student research archivists, curators, and historians.
Phase V – Infinite Expansion Audiences will expand again to encompass amateur and professional archivists, curators, and historians as well as institutions, for research, connection, sharing, and comment.

Technologies

The technologies for this project will evolve with the phases. As the initial intent is to get data available as soon as possible, the technology will be completely off-the-shelf and easily available for less than $3,000. The Infinite Expansion phase will involve detailed custom programming and expansive data storage techniques; Phase IV and V costs could exceed several million dollars for development and maintenance.

Phase I – Prototyping Requires a consumer-quality digital camera and memory, a consumer-quality computer with moderate storage and processing power, and graphics manipulation, optical character recognition (OCR), and simple relational database software. For development, I will use Adobe CS3 (CS4 is the current version and is extremely expensive) with Adobe Acrobat, Adobe Photoshop, and, if needed, Adobe Dreamweaver. For OCR I will use a freeware version of SimpleOCR, and for a database engine I will use MS Access or MySQL. I may also use MS Visio Pro for data modeling and MS Project for planning and tracking, with MS Office for general documentation.
Phase II – Capture Requires the tools cited in Phase I, with a possible move to SQL Server; with these, I will conduct the full capture of the text.
Phase III – Web Access This will be a relatively simple .xml and .css website with probable .net data ties to the database engine for web presentation. The site will most likely be developed using Adobe Dreamweaver or potentially MS Visual Studio. Adding server and development software significantly increases the costs but keeps them below $10,000. Hosting becomes an additional, recurring cost.
Phase IV – Initial Expansion The technology for this phase will largely be determined by the scale of implementation. I assume a medium-to-large-scale implementation requiring substantial computing and storage resources, including a full SQL Server installation, Storage Area Networks (SANs), and MS ISA servers for web generation. The presentation may require additional Flash programming but should continue to rely on relatively simple and efficient coding in .xml, .css, and .net.
Phase V – Infinite Expansion This phase could exponentially increase the technology requirements in terms of storage, speed, bandwidth, and scale. The base languages and databases should require little change and only some expansion. Flash will definitely be involved.

Web 2.0 – User Input

User input will vary with the audience. The initial phases of development present the user with information they can leverage and subsequently apply to their own projects, but not directly within the scope of this project. The later phases are almost completely user driven.

Phase I & II No user input. The information available can enable users to replicate the methods toward their own goals.
Phase III – Web Access Potential user input via blog, as a form of commentary on the methodologies presented. The information available can enable users to replicate the methods toward their own goals.
Phase IV – Initial Expansion Collaborative organizational input, largely behind the scenes, as access to artifacts expands and other users are able to capture the datasets and share the data. This is not intended to be a “user-friendly” consumer type of experience, but rather shared server resources where research personnel can access the “back end” of the system for direct input of data.
Phase V – Infinite Expansion Fully capable user input. Expanded access depends on user conformance to the capture and dataset standards, but the system will be easy to reach via a simple web front end. Envisioned is a peer-review/moderation process that verifies data conformance and propriety.
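
As a rough sketch of how that peer-review/moderation step could be tracked in the same database, each outside contribution might carry a review status before it is merged into the collection. This is one possible design of mine, not a commitment:

    CREATE TABLE submission (
        submission_id   INT AUTO_INCREMENT PRIMARY KEY,
        page_id         INT NOT NULL,
        contributor     VARCHAR(255) NOT NULL,
        submitted_text  TEXT,                -- contributor's proposed transcription
        status          ENUM('pending', 'approved', 'rejected')
                            NOT NULL DEFAULT 'pending',
        reviewed_by     VARCHAR(255),        -- moderator who verified conformance
        FOREIGN KEY (page_id) REFERENCES page (page_id)
    );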

October 18, 2009 | Clio I - History and New Media

5 Comments

  1. There always has to be a show-off in every class, doesn’t there? This is one very well thought out program. Very ambitious, too. There is a lot of stuff you have signed up for this semester. You state that you are going to have an interface to a database with server-side .NET access. Do you already have a production server that will allow you to do that?
    Also, you may want to check out a grant application that was accepted last year that seems to be right down your alley. Drexel University applied to develop a tool to automatically generate metadata from scanned newspapers: http://www.neh.gov/ODH/Default.aspx?tabid=111&id=7

    If that tool is free and readily available it might help you reduce some of your manual effort on the creation of metadata. I don’t know how well it works, but the concept sounds interesting.

    I have not looked into it, but the Omeka software from CHNM allows museums to put up their collections in an easy, open form. It would probably be worth a quick look to see if they are expecting input in any certain way. If they are, your digitization could offer the capability to create “Omeka-ready” output. I don’t know how these grants are awarded, but interfacing with other known successes would seem to increase your odds of winning a grant.

    Comment by theoldscholar | October 18, 2009 | Reply

  2. Oh yeah, something else I came across. There was an NEH grant given to the University of Maryland which created AXE, a web-based annotation tool capable of identifying “regions of interest” in video, audio, and image files located anywhere on the internet, and encoding these regions in an XML file. If it could annotate the original scanned-in document, there might be some functionality in it for your tool. It sounds interesting, and they say it is used in Zotero. However, when you go to their site (http://mith.info/AXE/), they must think it is so simple they don’t need any directions or hints. I could not figure out what it does. Maybe Dr. Cohen knows what it does.

    Comment by theoldscholar | October 18, 2009 | Reply

  3. I’m not sure I understand this project. Following up on the previous conversation about language barriers in discussing DH, I think this is a good example of how you need to be able to talk about DH on multiple levels. The language of this proposal has the technical specificity required for a grant (I assume). But, having worked in a small historical society and tried to encourage them to delve into DH, I think it’s unlikely that amateurs would understand this language. I don’t see that as a problem. But I do think it means you have to be able to translate it into different ways of understanding if you want small, amateur organizations to understand the project and participate.

    As I understand your project, you want to scan a lot of things and make them available across institutions. Sharing would be accomplished by using common digitizing processes and file formats. You would develop this project in stages, starting with your own museum material and then expanding to other organizations, getting more funding along the way. Is that close?

    Comment by rachel | October 20, 2009 | Reply

    • You are pretty close…

      And you are right.

      For the purposes of the project, and given my immediate audience of “digital historians,” I am talking about the processes and methods to share the information contained in the artifacts, not really the artifacts themselves, though that DOES happen... it is secondary. Most digital archives seem to focus on sharing the artifact, where I am interested in the data/information/knowledge contained IN the artifact.

      When talking to any particular organization about inclusion in this project, the discussion takes on a highly tailored form to address the issues of that particular institution... usually the technical level of that audience is a key factor.

      The whole point is to be able to take a highly technical process of capturing a text artifact (not necessarily scanning, but capture, which could be scanning, photos, etc.), translate/transcribe the text with minimal touch labor, and then inject that into a standards-based database or system to enable sharing, analysis, examination, and/or presentation.

      Thanks for the comments and please let me know if I have still left anyone confused!

      DGQ

      Comment by DeadGuyQuotes | October 20, 2009 | Reply

  4. Carl,
    I found some scanning/processing tools on the Project Gutenberg site you might find interesting. http://www.gutenberg.org/wiki/Gutenberg:Tools_FAQ

    Comment by theoldscholar | October 29, 2009 | Reply

