DeadGuyQuotes's Blog

American History in the Making

Developing Standards and Techniques for Digitizing History: Laying the Foundation for Future Collaboration & Development of Digital Artifacts

Abstract

In a multi-phase project, I will develop a low-cost methodology for digitally archiving documents, store the resulting digital artifacts in a standards-based data storage platform, and set the conditions to scale up from this foundation. Future phases and funding will create a collaborative, accessible, online digital archive with fully reproducible, searchable, capturable, translatable, and malleable datasets and online sources.

Phase I – Prototyping Perform data modeling and prototype a non-production database for testing and exploration purposes. In essence, answer the question, “What data is in the artifact?” and, through data normalization, develop the proper place for that data for maximum efficiency and future use.
Phase II – Capture Perform, and document for repetition, a low-budget document capture and artifact preservation process in which a historic text is extracted from the original document, stored efficiently in a database model, and presented to the user in both the original capture (picture) format and a searchable .pdf or data-string format. See Digitizing the Past for a reasonably full explanation of this process. I will be leveraging access to artifacts from the National Guard Education Foundation’s archives.
Phase III – Web Access Develop the online access portal for this data store while archiving all available artifacts in the immediate database. This element will be little different from other online resources save for the unique material available. The University of Michigan Digital Library offers what appears to be a common standard of .pdf representation. I want to go further and make the text itself a part of the data. This phase will also present a web-access portal template that other institutions can leverage – freely available in the spirit of open-source development.
Phase IV – Initial Expansion Develop partnerships and data shares across multiple institutions with similar projects in development or production. The level of participation directly influences the scale of this phase.
Phase V – Infinite Expansion Expand collaborative efforts by making this capability available to amateur as well as resource-constrained archivists and historians, providing a standards-based methodology, a data capture technique, and a collaborative platform to share the data once stored. This aspect of the final phase will be limited only by technology maintenance and scalability costs.

Requirement

The requirement for this project is simple. Museums, archives, and libraries have a mission to preserve and make available their holdings. The costs of displays and of complex online archives are often prohibitive, limiting smaller institutions’ ability to succeed in their missions. By establishing a phased approach, institutions and individuals will be able to choose when and how they implement this methodology. Ultimately, this “how-to” can include a “where” capability, as collaboration and external input can be presented to the host institution or institutions for inclusion in their dataset. The requirement is to develop low-cost methods and technologies that enable resource-constrained archivists, curators, and historians to develop a worldwide audience for their unique data.

Features & Functions

The primary capability of this project will be a “how-to” methodology for a resource-constrained environment detailing how to capture artifacts and translate them into datasets for future or other uses. To exemplify the methodology, a secondary feature will be the full presentation of The Militiaman’s Pocket Companion, published in 1822 and held by the National Guard Education Foundation in Washington, DC. As fully developed, the phases themselves offer staggered capabilities at each level of development.

Phase I – Prototyping Offers a functional assessment and the “how to capture and store the data” portion of this project. The result will be data snapshots and budgetary, capability, and technological assessments of what is involved in digitally capturing an artifact. It will also offer a detailed step-by-step guide to accomplishing this task in a very low-budget environment. This information will be presented in detail on my blog and a static website at http://www.plague-rat.com.

It is my intent to complete Phase I within the scope of this class.
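To make the “how to capture and store the data” piece concrete, here is a minimal sketch of what the prototype data model might look like. It assumes SQLite purely as a stand-in for the MS Access or MySQL engine discussed under Technologies, and every table and column name is an illustrative assumption rather than a finished design.

```python
# A minimal, hypothetical schema sketch for the Phase I prototype database.
# SQLite stands in for the MS Access or MySQL engine named under Technologies;
# all table and column names are illustrative assumptions, not a final design.
import sqlite3

schema = """
CREATE TABLE artifact (
    artifact_id         INTEGER PRIMARY KEY,
    title               TEXT NOT NULL,      -- e.g. The Militiaman's Pocket Companion
    publication_year    INTEGER,
    holding_institution TEXT                -- e.g. National Guard Education Foundation
);

CREATE TABLE page (
    page_id     INTEGER PRIMARY KEY,
    artifact_id INTEGER NOT NULL REFERENCES artifact(artifact_id),
    page_number INTEGER NOT NULL,
    image_file  TEXT NOT NULL,              -- path to the original photograph
    ocr_text    TEXT                        -- searchable text added later (Phase II)
);

-- Normalization: descriptive metadata lives in its own key/value table rather
-- than as ever-widening columns on artifact or page.
CREATE TABLE metadata (
    metadata_id INTEGER PRIMARY KEY,
    artifact_id INTEGER NOT NULL REFERENCES artifact(artifact_id),
    field_name  TEXT NOT NULL,              -- e.g. 'subject', 'printer', 'provenance'
    field_value TEXT NOT NULL
);
"""

if __name__ == "__main__":
    conn = sqlite3.connect("prototype.db")  # throwaway, non-production test database
    conn.executescript(schema)
    conn.commit()
    conn.close()
```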

Phase II – Capture Fully capture and digitally preserve the target text. This will take the form of an e-book based in three formats:

  1. .pdf from the original photographs
  2. .pdf from the original text (pre-OCR)
  3. .pdf from the OCR’d result.

In addition to the three formats, there will be an associated database with the texts, original photographs, and metadata.
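As a rough illustration of the capture workflow (photograph, then OCR, then a database row holding both), the sketch below uses the open-source Tesseract engine via the pytesseract library as a stand-in for SimpleOCR and writes into the prototype schema sketched under Phase I. The file paths and identifiers are hypothetical.

```python
# Hypothetical capture step: OCR one photographed page and store both the image
# reference and the extracted text. Tesseract (via pytesseract) stands in for
# SimpleOCR; the table layout matches the Phase I prototype sketch.
import sqlite3

import pytesseract
from PIL import Image

def capture_page(db_path: str, artifact_id: int, page_number: int, image_file: str) -> None:
    """OCR a single page photograph and record it in the prototype database."""
    text = pytesseract.image_to_string(Image.open(image_file))
    conn = sqlite3.connect(db_path)
    conn.execute(
        "INSERT INTO page (artifact_id, page_number, image_file, ocr_text) "
        "VALUES (?, ?, ?, ?)",
        (artifact_id, page_number, image_file, text),
    )
    conn.commit()
    conn.close()

# Example with hypothetical paths: capture page 1 of artifact 1.
# capture_page("prototype.db", 1, 1, "photos/militiaman_p001.jpg")
```

The three .pdf versions would then be assembled in Acrobat from these stored photographs and text; the database simply keeps the raw ingredients and their metadata together.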

It is my intent to complete Phase II within the scope of this class.

Phase III – Web Access Outline a grant proposal to develop the web access portal that will professionally and efficiently exploit the data gathered in Phase II and allow for an expanding pool of artifacts to be included. Conceptually this will fall somewhere between Google Books and Footnote.com, with a significant difference in metadata access and digital cross-linking.

The proposal will outline how the data will be presented from a data-centric point of view, with direct linkage to the artifact representations (original photographs) while allowing for tagging and linking to and between other artifacts in the collection; this tagging may be accomplished at the keyword level, at the subject level, or through other available metadata. Further, this data will be fully Section 508 compliant.
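To suggest how that tagging and cross-linking might be stored, the sketch below extends the hypothetical prototype schema with tag and link tables and a simple lookup. As before, every name here is an assumption made for illustration rather than the proposal’s final design.

```python
# Hypothetical Phase III extension of the prototype schema: keyword/subject tags
# on artifacts, plus explicit links between artifacts in the collection.
import sqlite3

linking_schema = """
CREATE TABLE tag (
    tag_id INTEGER PRIMARY KEY,
    label  TEXT NOT NULL UNIQUE             -- e.g. 'militia drill', 'uniforms'
);

CREATE TABLE artifact_tag (
    artifact_id INTEGER NOT NULL REFERENCES artifact(artifact_id),
    tag_id      INTEGER NOT NULL REFERENCES tag(tag_id),
    PRIMARY KEY (artifact_id, tag_id)
);

CREATE TABLE artifact_link (
    from_artifact INTEGER NOT NULL REFERENCES artifact(artifact_id),
    to_artifact   INTEGER NOT NULL REFERENCES artifact(artifact_id),
    relationship  TEXT,                     -- e.g. 'cites', 'same publisher'
    PRIMARY KEY (from_artifact, to_artifact)
);
"""

def artifacts_with_tag(db_path: str, tag_label: str) -> list:
    """Return (artifact_id, title) pairs for artifacts sharing a keyword or subject tag."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT a.artifact_id, a.title FROM artifact a "
        "JOIN artifact_tag atg ON atg.artifact_id = a.artifact_id "
        "JOIN tag t ON t.tag_id = atg.tag_id "
        "WHERE t.label = ?",
        (tag_label,),
    ).fetchall()
    conn.close()
    return rows
```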

It is my intent to scope and present a grant proposal to accomplish Phase III within the scope of this class.

Phase IV – Initial Expansion Outlines the methodology and the architectural and collaborative framework for expansion to other organizations leveraging the same resource-constrained methodology. Ideally, this will be done in a nominal cost-sharing environment whereby the web access portal gains access to the archives and artifacts of other institutions and those institutions develop the datasets.

It is my intent to present a well-developed scope and vision for this phase to set the stage for future grant and development work on implementation as a part of the Phase III grant proposal for this class.

Phase V – Infinite Expansion Outlines an expandable methodology and an architectural and collaborative framework for expansion to a logically infinite number of organizations and contributors leveraging the resource-constrained artifact capture and data development techniques. Costs and limitations will be driven by scale and available technology.

It is my intent to present a well-developed concept for this phase identifying some of the risks and benefits of project pursuit to set the stage for future grant and development work on implementation as a part of the Phase III grant proposal for this class.

Audience

The audiences for this project will evolve as scale and participation evolve. As such, the anticipated audience is best defined by phase.

Phase I – Prototyping Targeted at small organizations and institutions as well as amateur and professional archivists, curators, and historians working in a resource-constrained environment.
Phase II – Capture Narrowly targeted at the National Guard Education Foundation (NGEF), the organization responsible for archiving the test artifacts I am using to develop this project. The larger target audience will be the same as Phase I, as Phase II intends to provide a practical demonstration of the results of the techniques outlined in Phase I. Since the capture is a process and the test will be one text, the audience is confined to a very practical level.
Phase III – Web Access Targets the same group identified in Phase I and incorporates the larger audience of the NGEF identified in Phase II. The first audience will benefit from the methodology presented as well as the web-access portal template made available, while the second audience will benefit from the test artifact and expanded holdings of the NGEF. Any actual web development will be presented on a very narrow scale. The grant proposal will highlight the larger target audience.
Phase IV – Initial Expansion Audiences will expand to include partner institutions and will involve a deeper connection to professional or student research archivists, curators, and historians.
Phase V – Infinite Expansion Audiences will expand again to encompass amateur and professional archivists, curators, and historians as well as institutions for research, connection, sharing, and comment.

Technologies

The technologies for this project will evolve with the phases. As the initial intent is to get data available as soon as possible, the technology will be completely off-the-shelf and easily available for less than $3,000. The Infinite Expansion phase will involve detailed custom programming and expansive data storage techniques. Phase IV and V costs could exceed several million dollars for development and maintenance.

Phase I – Prototyping Requires a consumer-quality digital camera and memory, a consumer-quality computer with moderate storage and processing power, and software for graphics manipulation, optical character recognition (OCR), and a simple relational database engine. For development, I will use Adobe CS3 (CS4 is the current version and is extremely expensive) with Adobe Acrobat, Adobe Photoshop, and, if needed, Adobe Dreamweaver. For OCR I will use a freeware version of SimpleOCR, and for a database engine I will use MS Access or MySQL. I may potentially use MS Visio Pro for data modeling and MS Project for planning and tracking, with MS Office for general documentation.
Phase II – Capture Requires the tools cited in Phase I, with a possible move to SQL Server. Using these tools, I will conduct the full capture of the text.
Phase III – Web Access This will be a relatively simple .xml and .css website with probable .net data ties to the database engine for web presentation; a sketch of that data-to-XML export step appears after this list. The site will most likely be developed using Adobe Dreamweaver or, potentially, MS Visual Studio. Adding server and development software significantly increases the costs but remains below $10,000. Hosting becomes an additional, recurring cost.
Phase IV – Initial Expansion The technology for this phase will largely be determined by the scale of implementation. I assume a medium-to-large-scale implementation requiring substantial computing and storage resources, to include a full SQL Server installation, Storage Area Networks (SANs), and MS ISA servers for web generation. The presentation may require additional Flash programming but should continue to rely on relatively simple and efficient coding in .xml, .css, and .net.
Phase V – Infinite Expansion This phase could exponentially increase the technology requirements in terms of storage, speed, bandwidth, and scale. The base languages and databases should require few changes and only some expansion. Flash will definitely be involved.
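Whatever the eventual web stack (.net, Dreamweaver, or Visual Studio), the portal’s data layer amounts to turning database rows into a web-friendly representation. Below is the minimal, hypothetical export sketch referenced under Phase III, using only the Python standard library; the element names and the reliance on the Phase I prototype schema are my own assumptions, not part of the proposal itself.

```python
# Hypothetical export step for the web portal: serialize one artifact and its
# pages from the prototype database into XML that a stylesheet or .net front-end
# could render. Element and column names are illustrative assumptions.
import sqlite3
import xml.etree.ElementTree as ET

def artifact_to_xml(db_path: str, artifact_id: int) -> str:
    """Return an XML string describing one artifact and its captured pages."""
    conn = sqlite3.connect(db_path)
    title, year = conn.execute(
        "SELECT title, publication_year FROM artifact WHERE artifact_id = ?",
        (artifact_id,),
    ).fetchone()
    root = ET.Element("artifact", id=str(artifact_id))
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "publicationYear").text = "" if year is None else str(year)
    pages = ET.SubElement(root, "pages")
    for page_number, image_file, ocr_text in conn.execute(
        "SELECT page_number, image_file, ocr_text FROM page "
        "WHERE artifact_id = ? ORDER BY page_number",
        (artifact_id,),
    ):
        page = ET.SubElement(pages, "page", number=str(page_number), image=image_file)
        page.text = ocr_text or ""
    conn.close()
    return ET.tostring(root, encoding="unicode")
```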

Web 2.0 – User Input

User input will vary with the audience. The initial phases of development present users with information they can leverage and apply to their own projects, but offer no direct input within the scope of this project. The later phases are almost completely user driven.

Phase I & II No user input. The information available can enable users to replicate the methods within their goals.
Phase III – Web Access Potential user input via blog as a form of commentary on the methodologies presented. The information available can enable users to replicate the methods within their goals.
Phase IV – Initial Expansion Collaborative organization input, largely behind the scenes, as access to artifacts expands and other users are able to capture the datasets and share the data. This is not intended to be a “user-friendly” consumer type of experience, but shared server resources where research personnel can access the “back-end” of the system for direct input of data.
Phase V – Infinite Expansion Fully capable user input. Expanding access depends on user conformance to the capture and dataset standards, but access to the system will be easy via a simple web front-end. Envisioned is a peer-review/moderation process that verifies data conformance and propriety.

October 18, 2009 | Clio I - History and New Media

Web 2.0! We don’t need no stinking Web 2.0!

Partly as a result of the conversation ZaYna (yes, I am her friend and I have a chronic spelling problem) and I have been having off-and-on this semester, and partly as a coalescing of my own ramblings, I offer my own definition of Web 2.0 for our consideration.

Web 2.0 is:

  1. Less a technological construct and more a social construct.
  2. An environment of collaboration and openness.
  3. Dependent on, but not limited by, open, logical, essential technical standards – the antithesis of proprietary models. (Linux, for example, is a computer operating system like Windows or Mac OS which does not belong to any particular company and is based on an open language and an essential core called a kernel. Anyone can learn how to build applications for Linux and can publish them… Wikipedia exists in the same open framework, where anyone can publish.)
  4. An environment where there are no rules, only what can be considered fundamental scientific laws… essentially the very basic programming schtuff acting like irrefutable gravity, and where you are free to express, collaborate, or share as you see fit.
  5. Much like Douglas Adams’s Babel fish, Web 2.0 can serve as a ubiquitous translation and sharing point for information.
  6. A playground with no walls where everyone is invited and there is enough room on the merry-go-round for all.

The question remains, and Bell’s essay examines, how do we structure an essentially unstructured playground and make it suitable for academic discourse?

Bell offers a great example in his discussion about the Gutenberg-e Prize and what it can mean for a significant shift in hyper-textual scholarship and rigorous peer review. In addition to that, we have to examine our responsibilities as historians. We have to exercise discipline in our writing and our peer review. We have to write clearly and research rigorously. Through hyper-textualization, we can provide direct access to our primary resources. This requires careful consideration of our conclusions as all of our source material can be open to scrutiny. This can provide for far superior writing.

The danger lies in blogs, emails, and tweets. Web 2.0’s lack of structure opens a wide door for lazy, rapid-fire, ill-considered writing. There are advantages in rapid response and worldwide broadcasting, but there are significant risks, namely to our reputations.

Web 2.0 is a utopian dream without artificial superstructures imposing hierarchy and arbitrary information channels and filters. If it is to be taken seriously, the policing of such a “wild-west” atmosphere must be taken up by each denizen of the new utopia.

October 6, 2009 | Clio I - History and New Media