In addition to the abstract below, I have attached the
presentation file for those who may be interested in the methodology animation.
Also, the digital version of the Militiaman’s Pocket Companion is attached! Due to the large size of the file (70 MB), it is best to right-click the link and “Save As” on your PC.
Final Grant Proposal: Developing Methods for Knowledge Management & Digital Preservation
In a multi-phase project, this effort presents a low-cost methodology for the digital capture, preservation, and archiving of original documents; develops and stores processed and distributable versions in a standards-based data storage platform; and sets the conditions to scale from this foundation into a collaborative, accessible, online digital archive with fully reproducible, searchable, capturable, translatable, and malleable datasets and online sources.
Phase I – Prototyping
Completed in November 2009, this phase established a coherent methodology for project development by prototyping a non-production database for testing and exploration.
Phase II- Capture
Completed in November 2009, this phase performed and documented a low-budget method for digital capture from an original artifact, electronic artifact preservation, and presentation of the result in both the original capture format and a searchable, distributable format.
Phase III- Web Access
This phase is the focus of this grant funding request. A team of professional developers will construct a suitable multi-media database for storage of and access to original artifact captures, distributable .pdf versions, and searchable text datasets.
Phase IV- Initial Expansion
Beyond the scope of this grant request, this phase seeks to develop partnerships and data shares across multiple institutions with similar projects in development or production. The level of participation directly influences the scale of this phase.
Phase V- Infinite Expansion
Optionally, and depending on the success of the earlier phases, this phase will greatly expand collaborative efforts by potentially making this capability available to amateur and resource-constrained archivists and historians, providing a standards-based methodology, a data capture technique, and a collaborative platform to share the data once stored.
When reading The Access Principle by John Willinsky, I was particularly intrigued by the way Willinsky approached the topic of open access and politics. Essentially, the message is that more information and access to scholarly research and evidence can and should inform global, national, and local political policy debates. Ideally, members of the government, bureaucrats or politicians, should have access to the latest and best academic research. More importantly, members of a democratic society should have access to the same. Realized, this brave new world would be filled with informed, reasoned debate. Journalism would live up to its ideals, and mysticism, emotion, and rhetoric would give way to evidence and logic.
Sounds like the Reformation.
In fact, Willinsky references the impact of the printing press on the same event.
He bravely faces the critical issues surrounding this most noble ideal: context, and an informed public capable of reading the material. This is not to say that people aren’t intelligent enough, but there is a problem in American society today, at least. Willinsky points to it when he quotes Christopher Forrest: “The public reads the bottom line.” I will tell you from personal experience that bureaucrats, politicians, soldiers, and government support personnel also read “the bottom line.” Massive and complex issues are dealt with in one-page summaries. Detailed and sensitive issues are handled in boiled-down bullets. Willinsky espouses a fantastic ideal, but reality still presents a problem.
I have previously expressed concern about the information age: we have too much information and very few efficient and effective tools to cull through the mountains of data and conclusions. Opening all the doors to the ivory tower’s basement will only deepen the overwhelming sense of information overload. As a collection of academics, citizens, and servants, we must work harder on good knowledge management tools and principles to better see the future that Willinsky calls for.
Until then, I may just play the role of ostrich…
∞ tsp of data
Countless hours of digitization
½ Supercomputer (or 4 cups Cloud Computing)
1 petabyte of storage
dash of creativity
75 gallons of coffee
1.75L Wild Turkey (101)
budget… lots of budget
Preheat the coffee pot.
Cull for hours identifying targets to digitally preserve.
Scan, photograph, capture, and torture original sources for digitally preserved replicas.
Switch from coffee to whiskey.
Realize you are in WAY over your head… run screaming to the hills and embrace your typewriter. Shimmy and shake, drink heavily, calm down and try again.
Pay someone to do something to get the project off the ground while wondering about the relevance of this to historical study.
Bake, survive a crash, learn about disaster recovery, recover, and present your treasure for the world.
Receive 25 hits on your site (4 from family, 10 from friends, 11 random accidents).
Set on fire.
Join a monastery, make beer, drink beer, and dream of life before electricity.
This seems to be the way to concoct a fine dish of informatics flambeau.
Our fine friends at Wikipedia offer the following somewhat verbose definition of Informatics: “Informatics is the science of information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access and communicate information.”
Put differently, informatics is “a broad academic field encompassing artificial intelligence, cognitive science, computer science, information science, and social science.”
Informatics, knowledge management, Peter Norvig, Patrick Leary, and The American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences all seem to be chasing the same notion: Improve accessibility by connecting data and information with the right people. Web 2.0 is all about the data and connecting people and communities to that data. This is truly a daunting task that is wrenching social scientists from their comfortable piles of moldy books and manuscripts and throwing them in front of bleeding edge technologists. This is not a pleasant occurrence, as our class would most heartily attest.
Leary, Rosenzweig, Cohen, and others have screamed about the perils on either side of the straight and narrow path. From data inundation, sloppy results from easy publishing, veracity issues, copyrights and wrongs, and cherry-picking from what is easily available, to missing the opportunities of chance (i.e., browsing the stacks and finding that needle in the haystack), there are very real and legitimate concerns.
The same authors and a host of other evangelicals will proclaim the gospel of access and the troves of newly available data. This will only improve with time, they say. I tend to side with the evangelicals… BUT I lean heavily on the requirement to make science/technology work for us.
Present and future studies in history (and, I would argue, every field), much like modern production, will be driven by efficiencies, accuracy, and continuous improvement to the processes of research and publication. Here is where Peter Norvig comes in. Complex computer engines will provide what he called lexical co-occurrences, enlighten the offline penumbra, and connect researchers with a larger community and its data. But beware, researcher: today this is as risky as Columbus setting out across the Atlantic looking for the Orient. Keep in mind, his mission, as ordered, was a complete disaster. The algorithms, programs, methods, and technology are all improving, but they aren’t there yet.
We are all cooking an informatics flambeau. The ingredients are volatile and the results are most definitely on fire. Historians cannot escape the drive to efficiency in research methods and output, but we cannot become experts either. Developing the technology required takes a lifetime of expertise and extremely detailed knowledge in quantum computing. The question is, how can we bridge the gap and become historians who affect the future of the tools we need and who influence the technology for our field?
The Clio I class is a great forum for the exploration of technology and makes a great proving ground for the tech-neophyte (a newb, not a n00b), but I am concerned that we are leaving some of the larger philosophical questions aside in our relative fear of technology. We have to understand the technology not to become developers, but to wield some of the tools and, more importantly, to allow us to communicate at some level with the expert technologists.
Just some thoughts… mostly barking at the wind.
Ok… the acronym was an absolute accident, but hey, I’m with the Government, I am a card-carrying official acronym producer. I guess it is natural… or a gift…
This week’s reading really obviates the need for my project in some ways and really opens the curtain on the real issues surrounding digital tool sets. At root, I am working on a Text Encoding Initiative-style project: a basic text capture, presentation, preservation, and encoding, and then some investigation into the power of metadata and the presentation of the text as data. But the problem is… and I suppose this is a legitimate concern across academia… why is my idea any better or different from anyone else’s?
Amidst the concerns of Rosenzweig’s excellent synopsis of the digital challenges and opportunities, how are professional historians supposed to move forward? I think the answer to both questions may be captured by Rosenzweig’s conclusion: “What is often said of military strategy seems to apply to digital preservation: ‘the greatest enemy of a good plan is the dream of a perfect plan.’ We have never preserved everything; we need to start preserving something.” As my efforts are targeted at low-budget, standards-based efforts this seems to fall into line with both the NINCH and Rosenzweig articles.
We must train ourselves in basic standards of historical method using the new tools so we can have any hope of effectively digging through the mountains of data that are emerging for historical analysis. Simultaneously, as the mountain of data grows, efforts must continue to ensure archivists and historians preserve the right documents and data. For historians studying governments, this can be a little easier, but still very challenging. NARA is one example of how little is actually being saved. Costs, legislation, and technology all impact how and what we save. But the historian wants to have the opportunity to look at it all.
The digital realm is covered in opportunities for success and dangerous mines ready to blow up the unsuspecting historian. These issues include technology, ownership, distribution, accuracy, preservation, and cost, as well as myriad other dangers. Now is the time that these issues have to be solved. Rosenzweig points out that schools have to train their graduate students to grapple with the issues and even master them. George Mason University’s attempts at digital history are a great start, but leave many specific and highly particular issues unaddressed.
To paraphrase Rosenzweig, we have to start something digital.
Developing Standards and Techniques for Digitizing History: Laying the Foundation for Future Collaboration & Development of Digital Artifacts
In a multi-phase project, I will develop a low-cost methodology for digitally archiving documents, develop and store them in a standards-based data storage platform, and set the conditions to scale up from this foundation, with future phases and funding creating a collaborative, accessible, online digital archive with fully reproducible, searchable, capturable, translatable, and malleable datasets and online sources.
|Phase I – Prototyping||Perform data modeling and prototype a non-production database for testing and exploration purposes. In essence, answer the question, “What data is in the artifact?” and develop the proper place for that data, through data normalization, for maximum efficiency and future use.|
|Phase II- Capture||Perform and document, for repetition, a low-budget document capture and artifact preservation in which a historic text is extracted from the original document, stored efficiently in a database model, and presented to the user in both the original capture (picture) format and a searchable .pdf or data string format. See Digitizing the Past for a reasonably full explanation of this process. I will be leveraging access to artifacts from the National Guard Education Foundation’s archives.|
|Phase III- Web Access||Develop the online access portal for this data store while archiving all available artifacts in the immediate database. This element will be little different than other online resources save the unique material available. The University of Michigan Digital Library offers what appears to be a common standard of .pdf representation. I want to go further and make the text itself a part of the data. This phase will also present a web-access portal template that other institutions can leverage – freely available in the spirit of open-source development.|
|Phase IV- Initial Expansion||Develop partnerships and data shares across multiple institutions with similar projects in development or production. The level of participation directly influences the scale of this phase.|
|Phase V- Infinite Expansion||Expand collaborative efforts by potentially making this capability available to amateur as well as resource-constrained archivists and historians, providing a standards-based methodology and data capture technique and a collaborative platform to share the data once stored. This aspect of the final phase will be limited only by technology maintenance and scalability costs.|
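To make the Phase I data-modeling step concrete, here is a minimal sketch of the kind of normalized schema the prototyping could produce: one table per artifact and one per page, so the captured image and the extracted text can each evolve independently. The table and column names are my own illustration (using SQLite for portability), not the project’s actual model.

```python
import sqlite3

# Hypothetical normalization of "what data is in the artifact":
# artifact-level facts live once; page-level captures reference them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifact (
    artifact_id INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    pub_year    INTEGER,
    holder      TEXT            -- holding institution
);
CREATE TABLE page (
    page_id     INTEGER PRIMARY KEY,
    artifact_id INTEGER NOT NULL REFERENCES artifact(artifact_id),
    page_number INTEGER NOT NULL,
    ocr_text    TEXT,           -- machine-readable text "IN" the artifact
    image_file  TEXT            -- path to the original capture "OF" the page
);
""")
conn.execute("INSERT INTO artifact VALUES (1, 'The Militiaman''s Pocket Companion', "
             "1822, 'National Guard Education Foundation')")
conn.execute("INSERT INTO page VALUES (1, 1, 1, 'Preface text', 'img/preface_p1.jpg')")

# A join reunites the page with its parent artifact for presentation.
row = conn.execute("""
    SELECT a.title, p.page_number FROM page p
    JOIN artifact a ON a.artifact_id = p.artifact_id
""").fetchone()
print(row)
```

Because the text and the image are separate columns keyed to the same page row, a later re-OCR or re-photograph replaces one field without disturbing the other.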
The requirement for this project is simple. Museums, archives, and libraries have a mission to preserve and make available their holdings. The costs of displays and of complex online archives are often prohibitive, limiting smaller institutions’ ability to succeed in their missions. By establishing a phased approach, institutions and individuals will be able to choose when and how they implement this methodology. Ultimately, this “how-to” can include a “where” capability, as collaboration and external input can be presented to the host institution or institutions for inclusion in their dataset. The requirement is to develop low-cost methods and technologies that enable resource-constrained archivists, curators, and historians to develop a worldwide audience for their unique data.
Features & Functions
The primary capability of this project will be a “how-to” methodology for a resource-constrained environment, detailing how to capture artifacts and translate them into datasets for future and other uses. To exemplify the methodology, a secondary feature will be the full presentation of The Militiaman’s Pocket Companion, published in 1822 and held by the National Guard Education Foundation in Washington, DC. As fully developed, the phases themselves offer staggered capability at each level of development.
|Phase I – Prototyping||Offers a functional assessment and the “how-to capture and store the data” portion of this project. The result will be some data snapshots and budgetary/capability/technological assessments of what is involved in digitally capturing an artifact. It will also offer a detailed step-by-step guide of how to accomplish this task in a very low-budget environment. This information will be presented in detail on my blog and a static website at http://www.plague-rat.com.
|Phase II- Capture||Fully capture and digitally preserve the target text. This will take the form of an e-book based in three formats:
In addition to the three formats, there will be an associated database with the texts, original photographs, and metadata.
It is my intent to complete Phase II within the scope of this class.
|Phase III- Web Access||Outline a grant proposal to develop the web access portal that will professionally and efficiently exploit the data gathered in Phase II and allow for an expanding pool of artifacts to be included. Conceptually, this will fall somewhere between Google Books and Footnote.com, with a significant difference in metadata access and digital cross-linking.
The proposal will outline how the data will be presented in a data-centric point of view with direct linkage to the artifact representations (original photographs) while allowing for tagging and linking to and between other artifacts in the collection. Further, this data will be fully Section 508 compliant. This may be accomplished at a keyword level or a subject level or other available metadata.
|Phase IV- Initial Expansion||Outlines the methodology, and architectural and collaborative framework for expansion to other organizations leveraging the same resource-constrained methodology. Ideally, this will be done in a nominal cost-sharing environment whereby the web access portal gains access to the archives and artifacts of other institutions and the other institutions develop the datasets.
It is my intent to present a well-developed scope and vision for this phase to set the stage for future grant and development work on implementation as a part of the Phase III grant proposal for this class.
|Phase V- Infinite Expansion||Outlines an expandable methodology, and architectural and collaborative framework for expansion to a logically infinite number of organizations and contributors leveraging the resource-constrained artifact capture and data development techniques. Costs and limitations will be driven by scale and available technology.
It is my intent to present a well-developed concept for this phase identifying some of the risks and benefits of project pursuit to set the stage for future grant and development work on implementation as a part of the Phase III grant proposal for this class.
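As a sketch of what the “standard set of metadata” associated with a captured text in Phases II and III might look like, the snippet below builds a small record using the Dublin Core element set. The choice of Dublin Core, the field names, and the values are illustrative assumptions on my part, not the project’s settled standard.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, registered so serialized elements use the "dc:" prefix.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
# Each (element, value) pair below is a hypothetical sample entry.
for name, value in [
    ("title", "The Militiaman's Pocket Companion"),
    ("date", "1822"),
    ("type", "Text"),
    ("format", "image/jpeg"),
    ("rights", "Held by the National Guard Education Foundation"),
]:
    el = ET.SubElement(record, f"{{{DC}}}{name}")
    el.text = value

xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

Keeping the record in a widely used element set like this is what makes the Phase IV/V data shares plausible: partner institutions can ingest the fields without negotiating a custom schema first.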
The audiences for this project will evolve as scale and participation evolve. As such, the anticipated audience is best defined by phase.
|Phase I – Prototyping||Targeted at small organizations and institutions as well as amateur and professional archivists, curators, and historians working in a resource-constrained environment.|
|Phase II- Capture||Narrowly targets the National Guard Education Foundation, the organization responsible for archiving the test artifacts I am using for this project development. The larger target audience will be the same as Phase I, as Phase II intends to provide a practical demonstration of the results of the techniques outlined in Phase I. Since the capture is a process and the test involves one text, the audience is confined to a very practical level.|
|Phase III- Web Access||Targets the same group identified in Phase I and incorporates the larger audience of the NGEF identified in Phase II. The first audience will benefit from the methodology presented as well as the web-access portal template made available, while the second audience will benefit from the test artifact and expanded holdings of the NGEF. Any actual web development will be presented on a very narrow scale. The grant proposal will highlight the larger target audience.|
|Phase IV- Initial Expansion||Audiences will expand to include partner institutions and will involve a deeper connection to professional or student research archivists, curators, and historians.|
|Phase V- Infinite Expansion||Audiences will expand again to encompass amateur and professional archivists, curators, and historians as well as institutions for research, connection, sharing, and comment.|
The technologies for this project will evolve with the phases. As the initial intent is to get data available as soon as possible, the technology will be completely off-the-shelf and easily available for less than $3,000. The Infinite Expansion phase will involve detailed custom programming and expansive data storage techniques. Phase IV and V costs could exceed several million dollars for development and maintenance.
|Phase I – Prototyping||Requires a consumer-quality digital camera and memory working from a consumer-quality computer with moderate storage and processing power, plus graphics manipulation, optical character recognition, and a simple relational database engine. For development, I will use Adobe CS3 (CS4 is the current version and is extremely expensive) with Adobe Acrobat, Adobe Photoshop, and, if needed, Adobe Dreamweaver. For OCR I will use a freeware version of SimpleOCR, and for a database engine I will use MS Access or MySQL. I may also use MS Visio Pro for data modeling and MS Project for planning and tracking, with MS Office for general documentation.|
|Phase II- Capture||Requires the tools cited in Phase I, with a possible move to SQL Server. With these, I will conduct the full capture of the text.|
|Phase III- Web Access||This will be a relatively simple .xml and .css website with probable .net data ties to the database engine for web presentation. The site will most likely be developed using Adobe Dreamweaver or, potentially, MS Visual Studio. Adding server and development software significantly increases the costs, but the total remains below $10,000. Hosting becomes an additional, recurring cost.
|Phase IV- Initial Expansion||The technology for this phase will largely be determined by the scale of implementation. I assume a medium-to-large-scale implementation requiring substantial computing and storage resources, to include a full SQL Server deployment, Storage Area Networks (SANs), and MS ISA servers for web generation. The presentation may require additional Flash programming but should continue to rely on relatively simple and efficient coding in .xml, .css, and .net.|
|Phase V- Infinite Expansion||This phase could exponentially increase the technology requirements in terms of storage, speed, bandwidth, and scale. The base languages and databases should require few changes and only some expansion. Flash will definitely be involved.|
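Since the capture phases lean on freeware OCR against an 1822 text, some rule-based cleanup of the raw OCR output will almost certainly be needed before it is loaded into the database; the long s (“ſ”) of period typography is routinely misread as “f”. The sketch below is my own illustration of such a post-processing pass, and the substitution list is a tiny, hypothetical sample rather than a complete treatment.

```python
import re

# Rule-based corrections for common misreads of early-19th-century type.
# Each pair is (regex pattern, replacement); the list here is illustrative.
COMMON_FIXES = [
    (r"\bfhall\b", "shall"),    # long-s misread: "ſhall" scanned as "fhall"
    (r"\bmuft\b", "must"),
    (r"\bfoldier", "soldier"),
]

def clean_ocr(text: str) -> str:
    """Apply rule-based corrections to raw OCR text before database load."""
    for pattern, repl in COMMON_FIXES:
        text = re.sub(pattern, repl, text)
    return text

print(clean_ocr("Every foldier fhall obey."))  # Every soldier shall obey.
```

A substitution table like this is cheap to extend as proofreading against the original photographs turns up new recurring errors, which keeps the correction step inside the low-budget constraint.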
Web 2.0 – User Input
User input will vary with the audience. The initial phases of development present the user with information they can leverage and subsequently input on their own projects, but not directly within the scope of this project. The later phases are almost completely user driven.
|Phase I & II||No user input. The information available can enable users to replicate the methods within their goals.|
|Phase III- Web Access||Potential user input via blog as a form of commentary on the methodologies presented. The information available can enable users to replicate the methods within their goals.|
|Phase IV- Initial Expansion||Collaborative organization input, largely behind-the-scenes as access to artifacts expands and other users are able to capture the datasets and share the data. This is not intended to be a “user-friendly” consumer type of experience, but shared server resources where research personnel can access the “back-end” of the system for direct input of data.|
|Phase V- Infinite Expansion||Fully capable user input. Expanding access depends on user conformance to the capture and dataset standards, but easy access to the system via a simple web front-end. Envisioned is a peer-review/moderation process that verifies data conformance and propriety.|
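The peer-review/moderation process envisioned for Phase V could start with something as simple as the sketch below, which checks a contributed record against the capture standard before it enters the moderation queue. The required field names here are hypothetical placeholders, not a settled standard.

```python
# Conformance check for contributed records (field names are hypothetical).
REQUIRED_FIELDS = {"title", "pub_year", "holder", "image_file", "ocr_text"}

def conformance_errors(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "pub_year" in record and not str(record["pub_year"]).isdigit():
        errors.append("pub_year must be numeric")
    return errors

# A partial submission fails the check and would be bounced back to the
# contributor before any human moderator spends time on it.
submission = {"title": "Drill Manual", "pub_year": 1861, "holder": "NGEF"}
print(conformance_errors(submission))
```

Automating the conformance half of the review leaves the human moderators free to judge propriety and quality, which is the part that actually needs expert eyes.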
Partly as a result of the conversation ZaYna (yes, I am her friend and I have a chronic spelling problem) and I have been having off-and-on this semester, and partly as a coalescing of my own ramblings, I offer my own definition of Web 2.0 for our consideration.
Web 2.0 is:
- Less a technological construct and more a social construct.
- An environment of collaboration and openness.
- Dependent on, but not limited by, open, logical, essential technical standards – the antithesis of proprietary models. (Linux, for example, is a computer operating system, like Windows or Mac OS, that does not belong to any particular company and is based on an open language and an essential core called a kernel. Anyone can learn how to build applications for Linux and can publish them… Wikipedia exists in the same open framework, where anyone can publish.)
- An environment where there are no rules, only what can be considered fundamental scientific laws… essentially the very basic programming schtuff acting like irrefutable gravity, and where you are free to express, collaborate, or share as you see fit.
- Much like Douglas Adams’s Babel fish, Web 2.0 can serve as a ubiquitous translation and sharing point for information.
- A playground with no walls where everyone is invited and there is enough room on the merry-go-round for all.
The question remains, and Bell’s essay examines it: how do we structure an essentially unstructured playground and make it suitable for academic discourse?
Bell offers a great example in his discussion about the Gutenberg-e Prize and what it can mean for a significant shift in hyper-textual scholarship and rigorous peer review. In addition to that, we have to examine our responsibilities as historians. We have to exercise discipline in our writing and our peer review. We have to write clearly and research rigorously. Through hyper-textualization, we can provide direct access to our primary resources. This requires careful consideration of our conclusions as all of our source material can be open to scrutiny. This can provide for far superior writing.
The danger lies in blogs, emails, and twitters. Web 2.0’s lack of structure opens a wide door for lazy, rapid-fire, ill-considered writings. There are advantages in rapid response, and world-wide broadcasting, but there are significant risks, namely to our reputations.
Web 2.0 is a utopian dream without artificial superstructures imposing hierarchy and arbitrary information channels and filters. To be taken seriously, the policing of such a “wild-west” atmosphere must be taken up by each denizen of the new utopia.
- Roy Rosenzweig, “Can History be Open Source? Wikipedia and the Future of the Past“
- Roger Bruce, “Capturing Expertise for the Evaluation of Photographs“
- Mark Lawrence Kornbluh , “From Digital Repositories to Information Habitats: H-Net, the Quilt Index, Cyber Infrastructure, and Digital Humanities“
- Cathy N. Norton, “The Encyclopedia of Life, Biodiversity Heritage Library, Biodiversity Informatics and Beyond Web 2.0“
- Jeffrey Schnapp, “Animating the Archive“
Reading Rosenzweig, Kornbluh, Norton, and Schnapp I am struck by the overt idealism of Web 2.0. One could argue that a revolution of thought and feeling is well underway, that a true democratization of information is arriving, and a new era of collaboration and true meritocracy is on the horizon. Rosenzweig discusses the challenges of overcoming what he calls “possessive individualism” (italics in original) and presents a well-reasoned case study of Wikipedia with an analysis of its achievements and failures. Throughout his article I was impressed by the enthusiastic embrace of the notions behind this “new” collaborative world. Rosenzweig appears to claim that new media is about ideals, not technology. He does this by challenging the notions of the collegiate business model, the need for professional historians to make online history better and more available/accessible to all, the fee-for-service model of the exclusive online archives, and notes the ideals of Wikipedia where one direct challenge to professional historians is clear: There is no privileged position.
Rosenzweig suggests, and I agree, that collaboration is good, ego is bad, professionals owe it to the amateurs to help them, and the amateurs, in turn, are in a position to work with the professionals on some of the data crunching. Sounds very utopian. In fact, it seems to mirror Google’s unofficial corporate motto: Don’t be evil.
Google is a pretty good example of the prevalence of ideals in this brave new world. Their corporate mission is: to organize the world’s information and make it universally accessible and useful. Wow. This goal is so lofty that it may be considered hubris to think they could actually pull it off. BUT, the Google phenomenon is real and they are moving towards their mission. They are buoyed by belief and apparently, their ten commandments support the claim that they are a belief-based organization:
- Focus on the user and all else will follow.
- It’s best to do one thing really, really well.
- Fast is better than slow.
- Democracy on the web works.
- You don’t need to be at your desk to need an answer.
- You can make money without doing evil.
- There’s always more information out there.
- The need for information crosses all borders.
- You can be serious without a suit.
- Great just isn’t good enough.
These ten things, as Google calls them, are not technology-based or economy-based objectives… they are all-out philosophy. This seems to be exactly what Rosenzweig was commenting on. Kornbluh agrees, as he attacks the stove-piped, selfish mentality of previous and current works in favor of collaborative development, sharing, and exploration. This is an essential concept behind cloud computing, another Google-supported initiative. He describes the Quilt Index as a great success in this collaborative environment, and I have no doubt that it is. The fact that it has grown to such a degree is testimony to the value of standards-based development and collaboration.
Rosenzweig and Kornbluh idealistically point to the one thing your mother may have taught you: It is nice to share and play well with others. Ironically, this seems to fly in the face of current academic practices. While professional academic historians exude the collegial nature of Senators, they can be a rowdy and vindictive bunch. Attend a controversial conference and watch the panel discussions for proof. After all, as Rosenzweig pointed out, a scholar’s measure is his or her reputation as gained through research, publication, and significant labor and as preserved in the form of authorship of the results of that research. It is possessive individualism. If you take that away, what, then, will a scholar use for his CV?
Further challenging the ideals of the Web 2.0 utopia is the Wikipedian declaration that rank has no privilege. After years of servitude to academia, there are no laurels, no seats of honor. That’s a hard pill to swallow, and it will be fought. If my academic opinion is weighted equally with that of a Pulitzer Prize-winning academician, or a weekend warrior, what are the capitalistic goals? Why work so hard?
Norton and Schnapp examine the possibilities of this new world and point to some of the obvious benefits. Norton discusses some of the cloud-computing-esque notions of digital cross-walking of standards-based data indices. She gives the example of the changes in naming conventions over time for species. That information alone can save countless hours of cross-referencing data. This efficiency can allow for greater allocation of resources to research, not data mining. But the key is that there have to be multiple inputs to standards-based data. We have to share. Schnapp seems to agree when he examines the changes coming to libraries and archives, away from the product-based and toward the process-based. In other words, they become enablers of data transfer, not necessarily the agents.
Despite traditional capitalist objections to this model of irrational belief and non-attributable sharing, it appears to work.
Wikipedia provides the evidence. Examining the discussion tab of the Wiki article on the Cuban Missile Crisis, one discovers a vibrant discussion of the material and a rather useful grading scale within broader subcategories as well as an importance scale. This is the most effective and efficient peer review I have encountered.
The history provides a fair picture of how alive this topic still is, with over 500 edits this year. History may help with the attribution “problem” for academics and for the editors… but it fails to answer why people spend time editing the articles. Web 2.0 collaboration is a belief system that has significant advantages, and it seems to turn much of the capitalist model on its head. The economic/resource advantages of standards and collaboration are obvious, but attribution is a significant emotional component. I refer to attribution in this case as “ownership” of an idea, a conclusion, a process, etc.
Web 2.0 is not about technology or tools, it is about a balance of beliefs and a utopian vision for the world’s data.
Given the following:
- Necessity is the mother of invention.
- There is a need to “move” artifacts from the physical domain to the digital domain to increase access and transportability while protecting the original.
- To “do it right” could easily exceed $100,000 for a simple text digitization project.
- A digital image of the original is just the beginning.
- Preservation also means resolving authenticity issues.
- Knowledge management dictates adequate access to the information contained “IN” the text, not just a representation “OF” the text.
I am experimenting with a low-cost tool suite to enable historians to digitally capture and then manage original texts. Admittedly, the digital capture piece is a tiny fraction of the project, but in a low-cost environment it may require the most labor. The conversion of the information “IN” an artifact, and the meta-data associated with it, are the trickier pieces that require detailed analysis and consistent, uniform, logical rules to be made useful.
I would like to develop a methodology to capture and preserve an original artifact, efficiently convert the information contained in the artifact to machine-readable language, establish a standard set of meta-data associated with a text-based artifact, and render the results in an authentic reproduction of the original in a fully Section 508-compliant and machine searchable, generally accepted format.
I think I can do the first half of that. I have no idea how to do the last half, but between MySQL and XML and a few years of database development experience, I think I can pull off the rest.
Here’s what I have so far. Last spring, while volunteering at the National Guard Education Foundation with Zanya, I was given permission to digitally capture, from the original, The Militiaman’s Pocket Companion, published in 1822. All I walked away with were 5 megapixel .jpegs of each page taken with my cheap HP digital camera. This is the Preface page 1:
Using Adobe Photoshop, I was able to create the following, easily readable reproduction:
Using Adobe Acrobat’s OCR capability rendered about a 90% solution. Unsatisfied, I tested some other software and found SimpleOCR’s freeware to achieve about a 98% solution:
After manually reviewing, comparing, and editing the copy against the original, I am left with this 99.x% machine-readable, fully searchable, copyable, and transformable data object.
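For anyone repeating this, the 90%/98%/99.x% figures can be measured rather than eyeballed. Below is a minimal sketch — not the tooling I used, since SimpleOCR and Acrobat are GUI applications — that uses Python’s standard-library difflib to score raw OCR output against a hand-corrected transcription. The sample strings are hypothetical stand-ins for a page of text.

```python
import difflib

def ocr_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Character-level similarity between raw OCR output and the
    hand-corrected transcription, expressed as a percentage."""
    matcher = difflib.SequenceMatcher(None, ocr_text, corrected_text)
    return round(matcher.ratio() * 100, 1)

# Hypothetical snippets standing in for a page of the Pocket Companion;
# the raw version has the classic OCR confusion of "1" for "l".
raw = "The Militiaman shal1 keep his musket c1ean and dry."
fixed = "The Militiaman shall keep his musket clean and dry."

print(ocr_accuracy(raw, fixed))   # a bit over 96% for this toy sample
```

Run per page against the final edited copy, this would give a defensible accuracy number for each OCR tool tested instead of a gut estimate.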
That’s the easy part. I think. From there, one has to develop the appropriate data normalization for massive strings. I am not sure how well that will work. Also, rules for generating appropriate meta-data must be developed and applied to the database structure. Finally, the queries have to be written to extract portions of the data from the text as required by the user… something I have zero experience with in the thin-client world of the internet.
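To make the database piece concrete, here is a rough sketch of one way the storage and extraction could look. I have used Python’s built-in sqlite3 purely as a stand-in for MySQL, and every table and column name here is my own invention, not a settled design: one row per artifact, one row per captured page, and a key/value table for whatever meta-data rules get worked out later.

```python
import sqlite3

# Hypothetical schema standing in for the eventual MySQL design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifact (
    artifact_id INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    pub_year    INTEGER
);
CREATE TABLE page (
    page_id     INTEGER PRIMARY KEY,
    artifact_id INTEGER REFERENCES artifact(artifact_id),
    page_no     INTEGER,
    body        TEXT         -- corrected OCR text for this page
);
CREATE TABLE metadata (
    artifact_id INTEGER REFERENCES artifact(artifact_id),
    key         TEXT,
    value       TEXT
);
""")
conn.execute("INSERT INTO artifact VALUES (1, 'The Militiaman''s Pocket Companion', 1822)")
conn.execute("INSERT INTO page VALUES (1, 1, 1, 'Preface text mentioning discipline and drill.')")

# The user-facing query: pull back only the pages containing a search term.
rows = conn.execute(
    "SELECT page_no, body FROM page WHERE artifact_id = ? AND body LIKE ?",
    (1, "%drill%"),
).fetchall()
print(rows)
```

Splitting the text by page (rather than one massive string per book) keeps rows small enough to index and lets a query return a citable location — page number plus text — which is closer to what a researcher actually needs.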
Simultaneously, I would like to investigate XML’s properties to see if that may render a more useful method. Further, using digital signatures and electronic certificates, I think I can vouch for the authenticity of the changes made and at least be able to claim the mistakes as mine alone.
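On the authenticity point: a full digital-signature scheme needs a key pair and a certificate, but the foundation underneath it is just a cryptographic fingerprint of the capture file. A minimal sketch with Python’s standard hashlib, using made-up byte strings in place of the real JPEG files — publish the digest of the pristine capture once, and anyone who downloads a copy can verify it has not been touched.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a capture file; published alongside the image,
    it lets any downloader verify their copy against the original."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-ins for the original capture and a silently altered copy.
original_capture = b"...5 megapixel JPEG bytes of Preface page 1..."
altered_copy = b"...5 megapixel JPEG bytes of Preface page 1, retouched..."

print(fingerprint(original_capture) == fingerprint(original_capture))  # True: same bytes, same digest
print(fingerprint(original_capture) == fingerprint(altered_copy))      # False: any change shows
```

The hash alone proves integrity, not origin; the signature and certificate layer is what would bind the digest to me, so the mistakes really are claimable as mine alone.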
The intent is to develop a very low-cost, efficient methodology that takes into account preserving the original, converting it within acceptable guidelines, ensuring its authenticity, and enabling transmission over the web, thus allowing an unlimited number of reviewers of the material to examine a text without harming the physical object. In the end, at least this text would be made available back to the National Guard Education Foundation for display as a part of their emerging web-presence.
Having said that, I worry if this methodology and development may be too narrowly focused and not what this class intends. I am focusing primarily on the methodology of the archiving and less on the actual document preserved for the purposes of the project.
Thoughts? Comments? Derision?
I really want to meet Errol Morris. Anyone who will go to those lengths to definitively un-definitively determine the order of a couple of 159-year-old photographs with such humor and undaunted enthusiasm is someone I want to know.
Much like the heroes in Alice’s Restaurant, digital historians are sidelined by various misdemeanors and conventions. Instead of a thrilling debate about garbage or cannonballs, I think the issues of veracity are core to what historians have always faced. Any primary source could be a ruse. A witness is guaranteed to miss something, and diaries, while usually interesting reading, have to be weighed very gingerly. So, do the challenges of the digital domain really make that much of a difference?
In matters of scale, probably; there is simply more digital data to consider. In matters of integrity, probably not; historians have to weigh each piece of evidence carefully and independently. So the key lesson here is not to blindly trust the picture, the email, the recording, or any element of data; but to carefully correlate it with items that can be verified or at least corroborated.
Valid, correlated, corroborated data is one aspect of a larger problem that Cohen and Rosenzweig bring up in their descriptions of the seven qualities of “digital media and networks that potentially allow us to do things better (p 3, Digital History).” In discussing capacity, accessibility, flexibility, diversity, manipulability, interactivity, and hypertextuality (as well as their corollaries: quality, durability, readability, passivity, and inaccessibility) they bring up the larger topic of Knowledge Management.
Industry, government, and the military like the idea of knowledge management (KM) and have widely varying definitions and implications for KM. In theory, data management manages data at the molecular level. Information management manages access and transport of groups of data allowing for the development and dissemination of information built on the data. KM is IM with some nebulous measurement of artificial intelligence (AI), experience, analysis, wisdom and timing. In other words, (at least in the military’s attempted implementation) KM is the right data at the right place to the right person at the right time to make the right decision.
While not peeling back the cover of that black box of hocus pocus, I think that digital historians face a similar task.
Digital history is the art of acquiring, assessing, making available, analyzing, and effectively using digital means for better historical analysis, writing, and conclusions. In other words, digital history makes the right evidence available to the right researcher at the right time to inform the best conclusion.
In all of that hocus pocus, veracity of data, availability, readability, durability, and passivity are all concerns to the digital historian. Nevertheless, while utopia is not around the corner, major advances in the tools and methods are impacting research. My own research into the Cuban missile crisis (www.october1962.com) was largely a digital affair from start to finish. If you examine my bibliography, you will see the National Security Archive at George Washington University was critical to my research. While working from a computer, I accumulated hundreds of memos, messages, transcripts, orders, etc. that I could never have obtained in person. Once written in a traditional format, I was able to transform the data into a website and make some of my primary sources available for download.
All in all, I tend to side with Michael Frisch’s “tools-based” view of digital history. It is not quite a new field, but a new and decidedly powerful suite of tools emerging for historians.
In Lev Manovich’s The Language of New Media (MIT Press, 2001), the author posits a couple of observations about new media. Beyond his “aim to describe and understand the logic driving the development of the language of new media” (p. 7), he raises several key, troubling issues. Among them are:
- What are some of the implications of “databases as a cultural form” (p219)?
- How do we protect history from the “new media [ability] to create versions of the same object…” (p. 39)?
In the foreword, Mark Tribe describes the “net art” community as one which “possessed an anarchic quality of entrepreneurial meritocracy strikingly different from the rest of the art world…” (p. xii). I think this description fairly describes the impact of the database culture on the post-industrial cultures of the West. In his statement, there is a comparison between an unstructured meritocracy (an oxymoron) and the implicit “rest of the world” which is neither exactly anarchic nor meritocratic. I think that one could describe the “rest of the world” in such loose terms effectively enough, but I believe this leaves an opportunity for the “rest of the world” to strive for structure and the database culture to require at least structure in its framework if not value definitions. In other words, as Manovich points out, there are two ways to order data, flat and hierarchical. That is the key to the anarchy that describes the database culture.
The structure is found in the meritocracy… or in this case, the rational order of the database. The more rational, the higher the order; the more flexible, the more useful. The anarchy stems from two opposing methods of implementing that order: flat and equally distributed (in its own way a form of meritocracy), or hierarchical.
Eschewing the esoteric, what are other implications of the database culture?
In the flat organization of data, one relies on hyperlinking extensively. Manovich argues that this is the demise of rhetoric (p. 77). It removes building the case for an argument from a linear progression and presents data in a random-access scenario, raising the question: Does this change the definition of an intelligent, capable, gifted being? From Plato to the Renaissance to the Enlightenment, the mark of an educated and truly intelligent person was his or her ability to accumulate knowledge and translate it into well-reasoned thought and logic. (Admittedly, this is a boorish over-simplification.) What, then, is the definition of an intelligent, capable, gifted being now?
In the Information Age, there is more relevant material available than can be consumed, let alone mastered. The cultural impact of this is that gifts of reason and rhetoric have indeed been replaced with capability in tools. It is no longer particularly valuable to know the last Aztec ruler, the strategic import of the Second Peloponnesian War, or the role of the church in the development of the printing press. It is valuable to know the events occurred and there may be some import associated with them, but particularly, it is critical to know where to find the data. Analysis of data occurs at near-real time from a vast library increasingly available at the fingertips. As a result, successful people in the current age are not ones with vast knowledge of things, but vast access and experience in finding out.
A second point Manovich raises concerns authenticity. In a digital realm, how can we trust the data?
Below is a photograph of DeadGuyQuotes flying an airplane in 2008. On the left you see the author in the right-hand seat of the aircraft in what is typically the co-pilot’s seat, implying he is not the “pilot-in-command” (PIC) (a relevant term in the view of the Federal Aviation Administration). On the right, you see the pilot in the left-hand seat in the pilot’s seat implying he is the PIC.
Which is it? Does it matter?
It does. I was actually flying from the co-pilot’s seat while a licensed pilot was PIC, flying from the other seat. My medical certificate has expired, and I am not legally able to assume control of an aircraft. But with a very simple edit in Photoshop, I became the pilot. The historian would need the dates and my logbook to attempt to validate the veracity of this photo.
Since my logbook does not reflect the day’s travel, the historian is left wondering whether I simply failed to enter the trip, chose to fly illegally, or a forgery exists somewhere. In my logbook the historian would have noted an expired medical certificate, but that does not prove that a valid one does not exist somewhere else.
How do we protect the immeasurable amounts of data being collected? How do we determine integrity and authenticity?
In the near future, the historian and archivist will routinely sort through petabytes of email trying to establish a chain of events and discussions where today we are quite happy to swim in lakes of scanned memoranda. Conducting digital forensics on every source is impractical and cost-prohibitive.
I don’t have any particular solutions, either technically or philosophically, but I am greatly concerned about this challenge. As a historian interested in executive American history, and a member of the executive branch of the government, I see a disaster in the making. We are not protecting our archives in ways that will make them available to future historians, and tools such as the National Archives’ Electronic Records Archives are doomed to failure, trying to do too much for too many with no standards.
Professor Cohen was right, the problem with blogs is not writing enough, but writing too much. There is much more to discuss on the notion of database culture and much more to develop on the preservation of history in the digital domain.