I’ve used databases for research in previous classes but am new to ProQuest. I really enjoyed perusing its newspaper collections; I think it’s a great resource for primary documents, and it concretely demonstrated some of the concepts covered in Cohen and Rosenzweig’s chapter “Becoming Digital.”
Pros and Cons
The basic setup of the ProQuest database is not unlike that of other scholarly databases such as JSTOR, and its layout makes it easy to use. I liked that there were multiple search options: a basic search, an advanced search, and an option to search through individual newspaper collections. Another feature that I found very helpful allowed me to narrow my search results by publication date. When I did a basic search for “Titanic,” I could limit the results to the years 1912 to 1915, in the immediate aftermath of the disaster, or include articles from the 21st century that covered research and recovery efforts.
When I searched for articles about the Titanic, I did not limit the newspaper search field, so results came from all of the newspaper databases, from the New York papers (which yielded the greatest number of hits) to the Los Angeles paper. I did a similar search for “stock market crash,” again limiting the publication date search field but including all newspapers, this time for the years around 1929. After clicking on an article, I explored the different page layouts. The default, which looked almost identical to a PDF, was the easiest to read. The “page view” showed the selected article in the context of the whole newspaper: the entirety of that issue could be viewed, but the text was extremely small. The “Page View–PDF” was another option, but sometimes it was more difficult to read than the PDF that appeared when the article was chosen. The hard-to-read text was occasionally frustrating because the search terms were not highlighted in the article as they often are in a database like JSTOR, which made them harder to locate on the page.
Another caveat came to light when I did a search for “Neil Armstrong” and “moon.” For this search, I used only the New York Times and the Wall Street Journal because I wanted to compare two papers from the same area during the same period; the New York Times yielded many more relevant results. In looking through multiple articles, I came across only one image and one advertisement, both black and white and not of high resolution, which suggests that this database may not be very helpful if one is looking for photographs to supplement one’s text.
I know I’d need to do many more searches to find out whether that is generally the case or whether it just happened to occur in the articles I was viewing. Overall, I was pleased with the database and think it’s an excellent collection of primary resources.
ProQuest: A Case Study
Searching ProQuest’s large database helped me to understand the concepts described in Cohen and Rosenzweig’s third chapter, and several of its main ideas were showcased here. First, ProQuest is a massive database: it contains the full text of newspapers from major cities, with issues ranging from the mid-1800s to the present day. This is digitization on a grand scale. Who decided which cities would be represented? Why have the New York Times issues been digitized through 2008, while the Chicago Tribune issues are only completed through 1988? What accounts for that 20-year difference? Cost and audience seem likely factors, but who determines them?
Before reading Chapter 3 of Digital History, I had never considered the different ways that text can be displayed. Cohen and Rosenzweig note that JSTOR uses a “hybrid” approach to its presentation, making use of both “page images” and OCR, or optical character recognition. ProQuest appears to use a similar approach, since its search behaved in the way they describe:
Because JSTOR believes that the “appearance of typographical and other errors could undermine the perception of quality that publishers have worked long and hard to establish,” they display the scanned page image and then use the uncorrected OCR only as an invisible search file. This means that if you search for “Mary Beard,” you will be shown all the pages where her name appears, but you will have to scan the page images to find the specific spot on the page.
[As a side note, I have read some JSTOR articles that do include highlighted search terms, but ProQuest seemed not to highlight search terms.]
The use of page images and OCR might also explain why it was hard to read some of the newspaper articles in the full text PDF version:
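To make the hybrid approach concrete for myself, I put together a minimal Python sketch of how such a search might work. It is only an illustration under my own assumptions, not how JSTOR or ProQuest is actually built: the `Page` class, the `search` function, the file names, and the sample text are all made up. Each page stores a scanned image (what the reader sees) and uncorrected OCR text (used only for searching).

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    image_file: str  # the scanned page image shown to the reader
    ocr_text: str    # uncorrected OCR text, never displayed, only searched


def search(pages, query):
    """Return the pages whose hidden OCR text contains the query (case-insensitive)."""
    needle = query.lower()
    return [p for p in pages if needle in p.ocr_text.lower()]


# Hypothetical sample data: file names and text are invented for illustration.
pages = [
    Page(1, "page-001.png", "Historians such as Mary Beard argued that"),
    Page(2, "page-002.png", "the market fell sharply in October"),
    Page(3, "page-003.png", "a lecture by Mary Beard on women in history"),
    # Simulated OCR error: "Mary" was misread as "Marv" on this page.
    Page(4, "page-004.png", "a letter from Marv Beard to her publisher"),
]

hits = search(pages, "Mary Beard")
for page in hits:
    # The reader gets the whole page image, not the position of the match.
    print(f"Show {page.image_file} (page {page.number})")
```

In this sketch the search returns whole pages rather than positions on the page, which matches the experience of having to scan the image for the search term. It also shows a cost of leaving the OCR uncorrected: page 4 is silently missed because of the misread name.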
Even the best OCR software programs have limitations. They don’t, for example, do well with non-Latin characters, small print, certain fonts, complex page layouts or tables, mathematical or chemical symbols, or most texts from before the nineteenth century. Forget handwritten manuscripts. And even without these problems, the best OCR programs will still make mistakes.
The small print and complex page layouts of newspapers seem likely causes of the sometimes distorted text.
Finally, I found it interesting that the articles I’d read contained very few and relatively low-quality images, but then I realized that the hybrid method of scanning the page and creating a page image would not enhance the resolution or bit depth of the picture in the newspaper, which is itself not the original and may have been old or faded. As Cohen and Rosenzweig state:
In digitizing images, as with all digitizing, the quality of the digital image rests on the quality of the original, the digitizing method employed, the skill of the person doing the digitizing, and the degree to which the digital copy has adequately “sampled” the analog original.
If ProQuest does use a combination of page images and OCR, the variations in text and image quality make sense, and despite its few shortcomings in those areas, it seems to me to be an excellent resource.