ProQuest: A Case Study in Digitization

I’ve used databases for research in previous classes but am new to ProQuest. I really enjoyed perusing its newspaper collections; I think it’s a great resource for primary documents, and I thought it also concretely demonstrated some of the concepts covered in Cohen and Rosenzweig’s chapter “Becoming Digital”.

Pros and Cons

The basic setup of the ProQuest database is not unlike that of other scholarly databases such as JSTOR, and its layout makes it easy to use. I liked that there were multiple search options: a basic search, an advanced search, and an option to search through individual newspaper collections. Another feature that I found very helpful allowed me to narrow my search results by publication date. When I did a basic search for “Titanic”, I could limit the results to the years 1912-1915, in the immediate aftermath of the disaster, or include articles from the 21st century that covered research and recovery efforts.

When I searched for articles about the Titanic, I did not limit the newspaper search field, so results came from all of the newspaper databases, from the New York papers (which yielded the greatest number of hits) to the Los Angeles paper. I did a similar search for “stock market crash”, again limiting the publication date but including all newspapers, this time to the years around 1929. After clicking on an article, I explored the different page layouts. The default, which looked almost identical to a PDF, was the easiest to read. The “page view” showed the selected article in the context of the whole newspaper; the entirety of that issue could be viewed, but the text was extremely small. The “Page View–PDF” was another option, but it was sometimes more difficult to read than the PDF that appeared when the article was chosen. This was occasionally frustrating given that the search terms were not highlighted in the article as they often are in a database like JSTOR.

Another limitation came to light when I did a search for “Neil Armstrong” and “moon” (for this search, I used only the New York Times and the Wall Street Journal because I wanted to compare two papers from the same area during the same period; the New York Times yielded many more relevant results). In looking through multiple articles, I came across only one image and one advertisement, both black and white and of low resolution, which suggests that this database may not be very helpful if one is looking for photographs to supplement his text.

I know I’d need to do many more searches to find out whether that is generally the case, or whether it just happened to be true of the articles I was reading. Overall, I was pleased with the database and think it’s an excellent collection of primary resources.

ProQuest: A Case Study

Searching ProQuest’s large database helped me to understand the concepts described in Cohen and Rosenzweig’s third chapter. It seemed to me as though several main ideas were showcased here. First, ProQuest is a massive database: it contains the full text of newspapers from major cities, with issues ranging from the mid-1800s to the present day. This is digitization on a massive scale. Who decided which cities would be represented? Why have the New York Times issues been digitized through 2008, while the Chicago Tribune issues are only complete through 1988? What accounts for that 20-year difference? Cost and audience seem likely factors, but who weighs them?

Before reading Chapter 3 of Digital History, I had never considered the ways that digitized text can be displayed. Cohen and Rosenzweig note that JSTOR takes a “hybrid” approach to presentation, making use of both “page images” and OCR, or optical character recognition. The ProQuest database appears to use a similar format, because it seemed to operate the same way:

Because JSTOR believes that the “appearance of typographical and other errors could undermine the perception of quality that publishers have worked long and hard to establish,” they display the scanned page image and then use the uncorrected OCR only as an invisible search file. This means that if you search for “Mary Beard,” you will be shown all the pages where her name appears, but you will have to scan the page images to find the specific spot on the page.

[As a side note, I have read some JSTOR articles that do include highlighted search terms, but ProQuest seemed not to highlight them.]

The use of page images and OCR might also explain why it was hard to read some of the newspaper articles in the full-text PDF version:

Even the best OCR software programs have limitations. They don’t, for example, do well with non-Latin characters, small print, certain fonts, complex page layouts or tables, mathematical or chemical symbols, or most texts from before the nineteenth century. Forget handwritten manuscripts. And even without these problems, the best OCR programs will still make mistakes.

The small fonts and complex page layouts of newspaper articles seem likely causes of the occasionally distorted text.
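
To help myself picture how the page-image-plus-OCR method works, I sketched it in Python. This is only my own rough illustration, assuming the Pillow and pytesseract libraries and a hypothetical scanned page called page.png, not ProQuest’s or JSTOR’s actual software. The uncorrected OCR text serves only as an invisible search file, just as in the “Mary Beard” example above:

    # The "hybrid" approach: the scanned page image is what the reader sees,
    # while an uncorrected OCR text layer is used invisibly for searching.
    from PIL import Image
    import pytesseract

    page_image = Image.open("page.png")                  # shown to the reader
    ocr_text = pytesseract.image_to_string(page_image)   # invisible search file

    # Search the uncorrected OCR layer; the reader must still scan the
    # page image itself to find the exact spot where the term appears.
    if "Mary Beard" in ocr_text:
        print("Match on this page -- display page.png to the reader")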

Finally, I found it interesting that the articles I’d read contained very few images, and those of relatively low quality; but then I realized that the hybrid method of scanning the page to create a page image would not enhance the resolution or bit depth of the picture in the newspaper, which is itself not the original and may have been old or faded. As Cohen and Rosenzweig state:

In digitizing images, as with all digitizing, the quality of the digital image rests on the quality of the original, the digitizing method employed, the skill of the person doing the digitizing, and the degree to which the digital copy has adequately “sampled” the analog original.
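
To see what “sampling” means in practice, here is a small sketch (again my own illustration, using the Pillow library and a hypothetical scan called photo.png) of how resolution and bit depth limit a digital copy:

    from PIL import Image

    original = Image.open("photo.png")

    # Lower sampling resolution: shrink to a quarter of the original size.
    low_res = original.resize((original.width // 4, original.height // 4))

    # Lower bit depth: reduce the picture to 8 levels of gray.
    low_depth = original.convert("L").quantize(colors=8)

    # Detail discarded at capture time cannot be restored by enlarging later.
    low_res.resize(original.size).save("degraded.png")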

If ProQuest does use a combined approach of page images and OCR, the variations in text and image quality make sense, and despite its few shortcomings in those areas, it seems to me an excellent resource.


Nelson’s “Complex Information Processing”

Background

In his article “Complex Information Processing”, Nelson expands upon Bush’s idea of the “memex” to explain the developments necessary for creating a file structure that would allow for the storage of personal files and act “as an adjunct for creativity”. I was glad to read this article shortly after having read Bush’s famous piece (“As We May Think”), because it seemed that in the twenty years between the two articles’ publication, a great deal of technical knowledge had been gained. Nelson, however, expressed some regret that his vision for a computer-based file structure that could be used on a personal, rather than strictly professional, level had not yet come to fruition. He attributed the lack of progress to high cost, little sense of need, and uncertainty about system design.

Nelson did not dwell on the first two explanations, but rather focused almost exclusively on the third. He proposed that a computer system with the capabilities he desired would require three parts:

  1. information structure (zippered lists)
  2. file structure (Evolutionary List File, ELF)
  3. file language (Personalized Retrieval, Indexing, and Documentation Evolutionary System, PRIDE)

Nelson explained that this kind of system would have features “specifically adapted to useful change”:

  • it would be able “to sustain changes in the bulk and block arrangements of its contents”
  • it would permit dynamic outlining–the process by which the change in one text sequence guides an automatic change in the next sequence
  • it would allow multiple drafts to remain on file for comparison for an indefinite time period
  • it would be simple to use

These features make the system evolutionary.

Evolutionary List File (ELF)

To keep the evolutionary system simple, the ELF file structure is built from zippered lists, which allow two or more related lists to be linked. An ELF consists of three elements:

  • entries—discrete data ranging from text to symbols; can be created at any time
  • lists—an ordered set of entries; can be combined, rearranged, or divided
  • links—a connector in one entry that links it to another entry; any number of links can be created

Its basic structure is rendered here: [image: diagram of the ELF structure]
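
To make sure I understood the structure, I modeled it in Python (a rough sketch under my own names, not Nelson’s notation): entries hold discrete data, lists order them, and links zipper two lists together:

    from dataclasses import dataclass, field

    @dataclass
    class Entry:
        data: str                       # discrete data: text, symbols, etc.
        links: list["Entry"] = field(default_factory=list)  # any number of links

        def link_to(self, other: "Entry") -> None:
            self.links.append(other)

    @dataclass
    class ElfList:
        entries: list[Entry] = field(default_factory=list)  # an ordered set

    # Two related lists, "zippered" together by links between their entries,
    # e.g. an outline on one side and draft text on the other.
    outline = ElfList([Entry("I. Introduction"), Entry("II. The memex")])
    draft = ElfList([Entry("Bush wrote in 1945..."), Entry("The memex would...")])
    for o, d in zip(outline.entries, draft.entries):
        o.link_to(d)    # a change on one side can guide a change on the other

Even this toy version suggests why entries, lists, and links are enough to be “evolutionary”: rearranging or dividing a list never breaks the links, because the links belong to the entries themselves.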

Nelson proposed that this structure is evolutionary in large part because of its “psychological virtue”: its simplicity, its ability to “be easily taught to people who do not understand computers”.

Speaking as someone with little technical experience, I think that’s a goal extremely relevant to computer users of Nelson’s time and our time alike.


A Reflection: Bush’s “As We May Think”

Many of Bush’s speculations about the technology of the future in his article “As We May Think” have come to fruition; his insight is particularly impressive given that the article was written 67 years ago. Though some of his examples (like the section on photography) are written in very technical language that might not be accessible to many modern readers, his imagined “memex” should resonate almost immediately with all of us: Bush’s idea is extremely similar to our modern computer.

Though Bush spends a great deal of time reflecting on possible technological advances of the future, he also explores the consequences of these new inventions, which I found to be the most interesting part of his article. Before writing about the “memex”, Bush observed that “There is a growing mountain of research…The investigator is staggered by the findings and conclusions of thousands of other workers—conclusions which he cannot find time to grasp, much less to remember, as they appear.” Later, he also remarked that “Truly significant attainments [can] become lost in the mass of the inconsequential…The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record”. When Bush wrote this article, a researcher might have spent weeks perusing the shelves of various libraries to find the sources he needed and still might have missed a book of great importance, simply because of the sheer volume of works available or because he was unaware that a particular work even existed. Now (though he still hopefully uses the library!), a researcher can conduct an online search that helps him narrow his criteria and become familiar with the newest research and publications on his topic.

And yet, the modern researcher is also bound to miss information relevant to his search—he simply has the ability to be made aware of more significant material than ever before. The problem that Bush’s researcher faced has not disappeared with the advent of the internet and computer—so much information is now available in one place that the modern researcher also needs to discover how to efficiently find and utilize information that is of importance to him.

Bush suggested that the most effective way of organizing and manipulating information would be through a process of indexing by association, as the human brain does.  The “memex” would allow a man to link items which he saw as related by saving them together and by making marginal notes that could connect one article or photograph with another.  Grouping by association has become commonplace today, particularly through social media, where links to videos, blogs, and articles can be shared with others (often in response to something they have said or posted).  Association is also customary in search engines like Google and Bing, which suggest related searches while providing information regarding their users’ initial queries. It’s prevalent in internet advertising, too; some websites tailor the advertisements on their pages based on a person’s previous searches or on the content he’s viewing at that moment.  Perhaps the most obvious examples of association are links found in online articles to relevant articles on other websites.

In addition to association, another way that modern man streamlines his viewing of the internet’s vast content is through RSS readers like Google Reader, which allow him to subscribe to websites he deems important so that all new information from the various sites comes to him in one location. Whether subscribing to a variety of sites for enjoyment’s sake or following a group of related academic sites for the latest breakthroughs, I believe Bush’s 1945 prediction holds true today:

There is a new profession of trailblazers, those who find delight in the task of establishing useful trails through the enormous mass of the common record. The inheritance from the master becomes, not only his additions to the world’s record, but for his disciples the entire scaffolding by which they were erected.

Bush’s optimistic view of the benefits of technological advancement for society suggests that the intellectual scaffolding created by today’s scientists, inventors, and researchers will allow man to “grow in the wisdom of race experience”. Only time will tell whether this prediction of Bush’s was accurate.
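
[A postscript for the technically curious: the aggregation idea behind the RSS readers mentioned above can be sketched in a few lines of Python. This is my own illustration, assuming the third-party feedparser library and some hypothetical feed URLs, not any particular reader’s actual code.]

    import feedparser

    # Hypothetical subscriptions; a real reader would store the user's own.
    subscriptions = [
        "http://example.com/history-blog/feed",
        "http://example.org/digital-humanities/rss",
    ]

    # New items from all of the various sites arrive in one location.
    for url in subscriptions:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            print(entry.get("title", "untitled"), "-", entry.get("link", ""))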

Hello world!

Hello, world (of fellow HIST 390 students)! I’m Jess, and I’m a junior history major at Mason. I’m looking forward to using this blog (my first) as we learn more about a branch of history that’s really piqued my interest.