My Computer’s Security: A Self-Evaluation

Uh oh.

Maybe that short phrase best sums up my feelings about the security of my computer and its accounts after reading this week’s articles. I know excellent hackers exist and have heard of massive security breaches that have compromised large numbers of passwords (I’m not sure I’d realized that those cases often involved millions of people). I had no idea, though, that hackers’ methods were sophisticated enough to crack any password that contains a word that appears in the dictionary–English or otherwise. I read Matt Honan’s horror story and was left wondering what I would do if I found myself in that situation. (What could I do?) As Honan says, he got his information back a lot more quickly than I ever could have, given his tech contacts at Google and Twitter.

As I read through Mason’s IT Security advice page, I was struck by two very different emotions: gratitude and complacency. I am grateful that the basic steps I should take to better ensure my computer’s security are laid out in an organized, user-friendly way, with a paragraph explaining why I should take the time to complete these tasks. The complacency exists because I am comfortable with my system as it is and don’t exactly look forward to making the necessary changes (even though they will be for my benefit!). I don’t like that feeling of complacency; it will hinder my security updates if I let it. It’s also, frankly, rather dumb, because I was once the victim of a hack, and I should be excited to prevent that from happening again to the best of my ability.

A few weeks into my freshman year, I used one of the JC Info Desk computers to check my e-mail between classes because I hadn’t been carrying my laptop. Bad idea. When I tried to check it that afternoon on my own computer, I couldn’t log in. IT Support had my computer for three days; when I got it back, I saw that someone had sent 5000 (yes, 5000!) spam e-mails to individuals and organizations all around the world. Though I received several interesting e-mails from people in China, Italy, and Spain over the next few days, I came away largely unscathed. The IT guys were great, and though going without my computer then was a major inconvenience, the problem was resolved relatively easily. I can’t imagine having that happen now; going without my computer would be devastating for both academic and extracurricular activities.

So, in order to prevent being hacked again, I am taking Mason’s security tips to heart. I have a good deal to work on:

  1. Activate a password-protected screen saver: check! Mine is protected. I’m one for seven!
  2. Use strong passwords for all of your accounts: Hmm. I think I get half a point for this. I have separate passwords for almost every account, and many of my passwords are made up of a pattern of letters and numbers that I remember (they contain no dictionary words). Many, though, are simply words with numbers attached to them. That’s something I should probably change soon. Honan suggests using a site to randomly generate passwords and keeping them safe in one location, like Dropbox. This might be hard to get used to, but it makes a lot of sense. (A minimal sketch of random generation appears just after this list.)
  3. Automatically receive critical updates: I’ll have to look into this. As far as I know, my Windows updates are configured automatically; I get pop-ups relatively frequently saying that my updates have been installed. I’ve never used Internet Explorer to manually update my system.
  4. Verify that anti-virus software is configured correctly: Again, I’ll have to look into this; Mason’s site says that I should have Symantec’s Norton software configured, but I use McAfee Anti-Virus software and receive daily updates. I’m not sure that there’s a need to use both.
  5. Use anti-spy software: Again, one more thing I need to check on. I believe McAfee has an Ad-Aware feature, but I’m not using SpyBot.
  6. Unique passwords for all user accounts: I’m the only user, and my “guest” feature is turned off. Two for seven (points that need no further research)! (Not a great score, I know.)
  7. Back up files weekly: This is a major problem on my part. I’ve had an electronic post-it note on my desktop for months that says “Back up pictures and files!” But I haven’t. I need to make this a priority so that the things that are really important to me–pictures and classwork–don’t end up lost forever, much as Matt Honan feared his daughter’s baby pictures were. (A simple backup sketch also follows this list.)
  8. Step A: Use Windows XP Professional: check!
  9. Step B: Limit use of Internet Explorer: check! I prefer Firefox, thank you.
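Honan’s suggestion about randomly generated passwords is easy to try. Here is a minimal sketch in Python using the standard library’s secrets module; the 20-character length and the full printable alphabet are my own choices, not a rule from Honan or Mason:

```python
import secrets
import string

def generate_password(length: int = 20) -> str:
    """Build a random password from letters, digits, and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(generate_password())  # different on every run
```

A password like this contains no dictionary words, which is exactly the property the articles say matters most.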
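And for item 7, even a tiny script beats a sticky note. A minimal sketch of a dated backup, assuming hypothetical source and destination paths that you would adjust to your own folders and drive:

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical paths -- point these at your own folders and backup drive.
source = Path.home() / "Pictures"
destination = Path("E:/Backups") / f"Pictures-{date.today().isoformat()}"

# Copy the whole tree into a dated folder so earlier backups are preserved.
shutil.copytree(source, destination)
print(f"Backed up {source} to {destination}")
```

Run it weekly (or schedule it), and the post-it note can finally come down.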

This week’s articles were sobering; who knows when something like the hack Honan experienced might happen to any of us? Armed with a checklist of ways I can start to improve my internet safety, though, I’m confident that I’ll at least be moving in the right direction.


Copyright Law: Withstanding the test of TIME?

As we discussed in class on Monday, copyright is a really complicated issue, especially in our digital age, where the lines between what information, music, and ideas belong to whom are increasingly blurred.

Academia is one place where the effects of stringent copyright laws are acutely felt. Scholars want to protect their ideas and gain credit for their work, but often, they must also draw upon the works of others in order to pursue their own research or provide supplemental material in their classrooms. The “Fair Use” policy provides some help by allowing scholars and teachers to make use of copyrighted material if they meet certain criteria regarding its:

  1. purpose
  2. nature
  3. extent
  4. effect on the market

“Fair use” is a broad concept whose implementation often occurs on a case-by-case basis; reproducing a copyrighted work for an educational or non-profit purpose does not automatically qualify it for an exemption from normal copyright policy. One project that illustrates this tension is Mark Davies’ Time Magazine Corpus.

Time Magazine Corpus: A ‘Subscriber’ to Fair Use?

The Time Magazine Corpus is one of seven corpora (large collections of text) created by Brigham Young University linguistics professor Mark Davies. The corpora exist to “[find] out how native speakers actually speak and write; [to look] at language variation and change; [to find] the frequency of words, phrases, and collocates; and [to design] authentic language teaching materials and resources”, according to the corpora website. These goals are achieved by digitizing vast amounts of historical text and analyzing its contents to find patterns in word usage. According to the corpus.byu.edu website, the corpora draw almost a quarter of a million visitors each month, making them the most widely used corpora available.
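To make “patterns in word usage” concrete, here is a minimal sketch of a frequency count over a toy two-sentence “archive” (my own illustration; the real corpora do this over hundreds of millions of words):

```python
from collections import Counter
import re

# A tiny stand-in for a digitized archive; real corpora hold millions of words.
archive = [
    "Solar energy may one day power entire cities.",
    "Cities across the country debate solar power.",
]

# Normalize case, split into words, and count how often each appears.
words = re.findall(r"[a-z']+", " ".join(archive).lower())
print(Counter(words).most_common(3))
# [('solar', 2), ('power', 2), ('cities', 2)]
```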

The Time Magazine Corpus has a digitized copy of every issue of Time Magazine since 1923 in its stores for analysis; collectively, they contain over a hundred million words. Surely one of the most popular corpora on the internet operates within the bounds of fair use. Or does it?

  1. Purpose–Based upon the website’s stated goals (see above), the Time Magazine Corpus exists to further learning and research about the development of American English. That goal is educational and non-commercial.
  2. Nature–the Time Magazine issues that are presented in the corpus are published and generally factual (with some more subjective pieces).
  3. Extent–Clearly, more than 10% of all issues of Time Magazine have been utilized; the entirety of the Time Magazine archive since 1923 is housed in the BYU corpora.
  4. Effect on market/value–to access the full text of any of these issues, one must be a subscriber to Time Magazine.

The purpose and nature of the Time Magazine Corpus seem to be reasonable under Fair Use, but the extent and effect on value call the project into question.

The TEACH Act of 2002 adds another interesting layer to the corpus copyright debate. GMU’s Copyright Office says that the TEACH Act “allows digitizing of analog materials, [but] only if not already available in that form”. While Time Magazine had already digitized its stores, the corpus has made them available in a new digital “form”–one in which the text can be searched extensively.

Additionally, Cohen and Rosenzweig show in this table that works published before 1923 have become part of the public domain, while all works published after that year are subject to copyright policy. The Time Magazine Corpus incorporates issues from 1923 to the present: they’re all subject to copyright laws.

A visit to the Time Magazine website’s archive quickly confirms that material is subject to copyright–and protected. While content is “available exclusively for TIME subscribers”, the “Reprints and Permissions” page does detail the process of obtaining permission to reprint or copy material. Interestingly, when I clicked on the “search here” link under the third point, “Licensing/Republishing Content in Print”, I was able to read entire articles published in a 2002 issue (even though I am not a subscriber), but I did not have access to articles published in a 1932 issue.

How, then, could it possibly be legal for Davies to utilize the whole Time Magazine archive? He tells us himself in the “Questions?” page of his site under numbers 8 and 9:

Our corpora contain hundreds of millions of words of copyrighted material. The only way that their use is legal (under US Fair Use Law) is because of the limited “Keyword in Context” (KWIC) displays. It’s kind of like the “snippet defense” used by Google. They retrieve and index billions of words of copyright material, but they only allow end users to access “snippets” of this data from their servers…We would love to allow end users to have access to full-text, but we simply cannot…We have to be 100% compliant with US Fair Use Law, and that means no full text for anyone under any circumstances — ever. Sorry about that.

Thinking back to one of our earlier class discussions, I searched for the term “solar energy” in the corpus, and when I clicked on one of the 98 results, I was directed to a few sentences that provided the context in which the words were used. There was also a link to the original article in Time Magazine, but when I clicked on it, I was directed to the Time Magazine site and received the same message as I had before: “Time Magazine content is available exclusively for TIME subscribers”.
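That few-sentence display is the KWIC idea in action. As a rough illustration (my own sketch, not Davies’ actual code), a KWIC routine returns only a small window of words around each hit and never the full text:

```python
def kwic(text: str, keyword: str, window: int = 3) -> list[str]:
    """Return keyword-in-context snippets: `window` words on each side."""
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        if word.lower().strip('.,;:"') == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            snippets.append(f"...{left} [{word}] {right}...")
    return snippets

article = "Experts say solar energy may one day power entire cities."
print(kwic(article, "solar"))
# ['...Experts say [solar] energy may one...']
```

Because only these snippets ever leave the server, the full articles stay behind TIME’s subscriber wall.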

By ensuring that those who use the corpus are not able to view the original full text of an article, Davies does not violate copyright law. Instead, his corpus allows for detailed research about the patterns of American speech, and it is a resource that TIME should be excited to be part of (and likely is, given that the company hasn’t sued Davies–at least not yet).

Copyright Confusion: Creativity, Collaboration, and…Criminals?

For the most part, I think that I’ve generally associated the word “copyright” with mountains of paperwork, complicated rules that hardly anyone understands, and a subject that is important but, frankly, boring. This week’s readings changed my conception of the word and the broader issue–it’s certainly not boring, but rather extremely charged and the source of fierce debate. It directly affects most people on the planet.

The Basics

I read through the PowerPoint presentation that Mason’s Copyright Office had created (“Copyright Tutorial: The Basics”) and found it to conform to my first notions of copyright discussions–important, but a little boring. It provided good background information (especially about the types of documents that fall into the public domain and fair use), though, and it brought to light some of the issues faced by teachers and scholars that are particularly relevant to all of us at Mason.

Dr. Cohen and Dr. Rosenzweig’s chapter, “Owning the Past?“, further defined the concepts briefly described in the PowerPoint and put the copyright debate and its related issues in context in this digital age, where knowing what constitutes copyright infringement has become clouded with uncertainty.  The authors note that we can all contribute to and borrow from the intellectual property that circulates the web and have to deal with the consequences of doing both:

Those who create historical materials on the web are, indeed, likely to find themselves on both sides of the legal and ethical fence—creating intellectual property that they want to “protect” and “using” the intellectual property of others…Few people do digital history without both making a creative contribution of their own and benefiting from the creativity of others.

We prefer to view the web as a “commons,” or a shared storehouse of human creations, rather than a “marketplace,” and we align ourselves with the broad movement of lawyers and scholars, like Stanford University law professor Lawrence Lessig, who have promoted the notion of a “Creative Commons.”  In this, we advocate a balance between the rights and needs of the “owners” and “users” of intellectual property, but a balance that favors the enlargement of the “public domain”—taken here to mean not just the formal realm of works with no legal copyright protection, but also more broadly the arena defined by fair use and the sharing and dissemination of ideas and creativity. To see intellectual work entirely as “property” undercuts the norms of sharing and collaboration that are integral to a field like history.

Expanding the “Creative Commons” seems a great way to encourage the “sharing and collaboration that are integral” to history and other social sciences, but after viewing the chart provided by Cohen and Rosenzweig showing when texts or photographs will enter the public domain, one wonders whether a major overhaul of the copyright system might be needed to make that expansion possible. With recent laws extending copyright protection ever more years beyond an author’s death, works that are sought after may not become available until long after the person seeking them first inquires. Mark Twain’s copyrights have outlived his grandchildren! As Cohen and Rosenzweig point out, historians are generally able to practice “fair use” and acquire some information from books not in the public domain, but they could do so much more if only the copyright system did not protect works for multiple generations: “Copyright radicalism in the early 21st Century has come to mean embracing an 18th Century law”.

Those seeking greater access to texts are not the only ones up in arms about copyright policy. Perhaps the most visible conflicts appear in the studio, where recording artists are fighting for their “rights” to use old songs in what they say is a battle for creative freedom.

Copyright and the Music Industry

We live in a remix culture.

~Jeff Chang, Solesides Records

In Copyright Criminals, DJs from the remix “underground” describe the struggles they face in producing their music, which they deem completely original. They “sample” a variety of tunes–old and new–splicing and mixing them together with new beats to create a new sound. Chang says that DJs give us snatches of our history, reinterpreting it and presenting it to us in the present day–it’s “audio archaeology”. Many of the artists interviewed claim that their music reintroduced the world to lost classics, actually doing the original artist or recording label a favor.

Many of the DJs argued that their art form, that of mixing a variety of sounds, is not unique to the music industry and should not be punished as though it is. They discussed the “sampling” that goes on in the art world and film industries: the painting of photographs, the digitization and editing of works of art, and even the fairy tales borrowed and made into major motion pictures by Walt Disney. They insist that sampling is not a new phenomenon–the video quoted Igor Stravinsky, the prolific composer, to stress that point: “A good composer does not imitate–he steals”.

Though most DJs don’t view their art as “stealing” and don’t want others to do so, either, there are some producers who dislike sampling immensely: the sound mixer for Nirvana called it “cheap and lazy”. Artists whose work has been used extensively also have mixed feelings. For example, Clyde Stubblefield, the drummer for the James Brown band, said he is honored that his drum loops are used so frequently. He doesn’t ask for royalties, but says he would like to be credited when his work is sampled. Other artists aren’t so generous, and sampling clearances have become a big business. Now DJs are trying to figure out ways to circumvent the very unclear digital audio copyright rules that can cost them their savings or even their careers. They find it outrageous that it’s “easier, cheaper, and faster to cover an artist’s song (unless you try to change the words) than to sample it” under the copyright act last updated in 1976. The DJ for Complete Outlaw noted, “You’re either rich enough to afford the law, or [you’re] a complete outlaw.”

It doesn’t seem as though it needs to be that way. Just as digital historians both “sample” others’ ideas and create their own, music consumers have now become producers, able to create mixes with only a computer and the proper software. In order to take full advantage of the creative power of our society’s artists, thinkers, scientists, and writers, it seems as though a copyright overhaul is necessary to redefine intellectual and creative “property”. This will not be an easy task. But as one DJ pointed out,

That’s how society moves forward. It doesn’t just create new things–it evolves by taking old things and changing them.

Wikipedia and Photographs in the Age of Editing: Who Decides the Truth?

I think I’ve used Wikipedia only a handful of times in my life. I would guess that to most people that sounds like a blatant lie coming from a college student, but it’s been drilled into me by my mom (an English teacher) and my own teachers for so long: Wikipedia is simply not a credible source. (With that in mind, I just didn’t bother using it for much of anything.) I think in many cases that’s still true; most professors are just not interested in seeing a citation from “The Free Encyclopedia” given its mixed reviews and questionable sources. Wikipedia has been working hard to clean up its image, though, and I saw proof of that as I was working on today’s assignment.

Umlauts and the Spanish Civil War

I thought Jon Udell’s video about the Wikipedia page describing heavy metal umlauts was really neat–I continue to be surprised by the cool things tech-savvy people can do (I particularly enjoyed watching the updated posts stream in real time). One of the things I found most interesting was that even though the topic doesn’t seem particularly scholarly, it was monitored in what I assume is the same way a page with a historical subject might be: when an act of cyber vandalism was committed, it was corrected within two minutes. That’s a very impressive response time, given the volume of articles on Wikipedia (over 4,000,000 in English, according to its home page). It might still be relatively easy for a vandal to change a page, but Wikipedia seems to be making great strides in ensuring that those changes don’t last.

An act of obvious vandalism like the one in Jon Udell’s video seems easier to catch than a missing link, a faulty citation, or incorrect information. How do Wikipedia’s editors check for those kinds of misrepresentations? I imagine they, and those who read a particular post, are actively making corrections and adjustments as they read.

I think the Wikipedia entry for “Spanish Civil War” demonstrates the more pro-active role that editors and readers have been taking to make Wikipedia a trustworthy site. The page was clearly well-tended; it not only contained a good deal of information, but also pictures, images of documents, flags, and lists of related people and groups, as well as a “See Also…” section (which included a link to a page about Guernica, the painting that initially made me choose to perform a search for “Spanish Civil War”). I was impressed by what I found; the information provided good background reading for anyone unfamiliar with the topic, and as the University of Maryland’s Library site suggests, the page was organized and easy to use, as a good website should be.

One of the things I found most interesting was that two of the countless links within the article were red, rather than blue like the others (both were names of people associated with the war). I decided to click on those links, and both led to a page which said that the term had not been found, and that the subject should be searched under a different name. I thought the fact that someone had noticed and reported the missing links demonstrated that the site was being monitored and corrected. I was most impressed by the full citations, which included mostly scholarly journal articles or books and even had links to the books’ ISBN numbers! Wikipedia is definitely trying to ensure accuracy and transparency in this way. After watching Udell’s video, I also decided to check the “Recent Changes” section, and I found that a number of modifications had been made only today, September 16.
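Out of curiosity, that same revision history can be pulled programmatically. Here is a minimal sketch against the MediaWiki API (the endpoint and parameters are Wikipedia’s documented ones; the requests library and the five-revision limit are my own choices):

```python
import requests  # pip install requests

# Ask the MediaWiki API for the five most recent revisions of the page.
response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "Spanish Civil War",
        "rvlimit": 5,
        "rvprop": "timestamp|user|comment",
        "format": "json",
    },
)
page = next(iter(response.json()["query"]["pages"].values()))
for rev in page["revisions"]:
    print(rev["timestamp"], rev["user"], rev.get("comment", ""))
```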

Fenton and Gardner: Wikipedia’s Predecessors?

I really enjoyed Morris’ three-part series, “Which Came First?” The speculation about the order of the famous Crimean War photographs involved tactics, positioning, scientific tests, and questions of Fenton’s character, among other considerations. The fact that the key to cracking the mystery involved something as simple as gravity and rock movement (even though that discovery was a complex process) was surprising. Morris’ closing remarks, though, highlight the possibility that even a seemingly irrefutable conclusion can become the subject of further debate in the future:

I spoke with Dennis Purcell recently and asked, “Do you think these essays will put this issue – the issue of which came first – finally to rest?” Dennis replied, “No. I don’t think so. There could be some guy who reads your essays, writes in, and says: ‘You know, there aren’t just two photographs. I found another. There are actually three.’”

Morris’ extensive research and travel to find the solution to this question remind us that we may not always be able to find the answers to our historical questions. As Roy Flukinger noted, “It’s one of the fascinating things about photo history. It always gives us more questions than answers. Historical photographs may give you the possibility of new facts, and may give you the chance to ask new questions.” In the case of the Valley of the Shadow of Death photographs, the question arose largely because historians and curators debated whether either or both of the photographs had been staged, and if so, in what order. Staging is the pre-digital equivalent of photoshopping or editing–the photographer modified a particular scene to present a different meaning, to raise a question, or to make a point.

This discussion reminded me of a similar debate I’d come across as a junior in high school regarding Alexander Gardner’s Photographic Sketchbook of the Civil War, which Morris mentions in his article.

Unfortunately, I don’t have my copy with me at school, but I remember the discussion we had about Gardner’s staging weapons, moving articles of clothing, and taking some shots similar to Fenton’s that beg an explanation. For example, the image on the front cover of the book (also shown below) is called “The Harvest of Death” and shows fallen Confederate soldiers. The second picture, “Field Where General Reynolds Fell”, is purportedly the same scene taken from a different angle, but Gardner says it shows “our own [Union] men”.

“The Harvest of Death”:

“Field Where General Reynolds Fell”:

Fenton’s photographs were taken in the 1850s and Gardner’s in the 1860s. Even without digital editing, both men modified their landscapes to project their own interpretations or ideas–not unlike our modern photoshopping or changing of Wikipedia pages. Unlike the editors on Wikipedia who respond in two minutes, however, we’re only just now recognizing their alterations.

Scholarly Scavenger Hunt

I approached this assignment happily–it’s not often that you get to play a game (of sorts) for homework. I was also slightly apprehensive; I love to search using Google, and I knew going completely without it might prove to be challenging. I found that to be the case.

I decided to focus primarily on databases like ProQuest and JSTOR and other journal collections held by the Mason Library, as we discussed in class. Using some of the new search operators I’d learned (the tilde, for example) was helpful, but I still struggled to find exactly what I was looking for.

I decided to search for the three topics in order, so that’s how I’ll report my findings:

1. Op-Ed/Labor Dispute/Public School Teachers/Pre-1970

I’m a little abashed to say I didn’t know what an “op-ed” was; I had thought it meant opinion/editorial. I did a quick Google search (please briefly ignore what I said above about not using Google–I figured this didn’t hurt, since I was only looking up background information). I compared the Google definition with a well-cited Wikipedia article; both said that an “op-ed” is actually an editorial piece written by a published and/or well-known writer (as opposed to someone from the general public) who usually does not work for the newspaper in which the article is being published; it happens to appear on the page opposite the regular editorials, hence the name. The Wikipedia article also noted that this concept was officially adopted around 1970 by the New York Times.

Given that information, I decided that ProQuest would be a good place from which to start. I performed several searches:

  • “public school teachers” AND “labor ~disputes”
  • “public school teachers” AND strikes
  • I limited the search field to articles before 1970
  • I limited the results to include only articles, commentary, correspondence, or editorials
  • I eventually limited the search to articles from the New York Times (because of what I’d read earlier) but then expanded it to include all available newspapers again

I believe the most promising article I found is from the Chicago Tribune. While I can’t say whether it entirely fits the bill as an “op-ed” because no author is attributed to the piece, it’s an editorial that fits all other criteria: “Strikes that Should Be Prohibited“.

2. Solar Power/U.S.

I found this search to be the easiest of the three (relatively speaking). I decided to use ProQuest again and had a surprising number of results. It was difficult to find the first documented case of solar power used in the U.S., so I tried using different search terms to see if I could determine when the phrase was first used.  Some of my search criteria included:

  • “solar power” AND “United States”
  • “solar power” AND emergence AND “United States”
  • “solar collection” AND “United States”
  • “solar heating” AND “United States”

When I found articles that used those terms, I’d then limit the search to years before that article to see if solar energy had been discussed even earlier. These two articles seemed to document the increasing interest in solar power in the U.S. well: one in the Los Angeles Times in 1938, “Solar Energy Study Planned” , and one in the New York Times that described the use of solar power in Florida homes by the 1940s: “Solar Power Use Rises Slightly, But Cost Still Poses Obstacle“.

3. California Ballot Initiatives/Voting Records

I found this to be the most difficult search; it was hard to find any sources that gave a comprehensive list of California’s ballot initiatives, let alone a record of the voting results for those initiatives. I searched ProQuest, JSTOR, databases that fell under the “Government Documents” or “Political Science and Law” categories (like ABI/Inform), and the Mason Library holdings, to no avail. I finally gave in to my desire to return to Google, using the Scholar feature to search for California AND “~Ballot Initiatives”. I was excited to discover this article: “Judicial Review of Ballot Initiatives: The Changing Role of State and Federal Courts“. Though I did not read the lengthy article in its entirety, I gathered that the authors focus primarily on California when discussing the impact of ballot initiatives and discuss in depth the results of some of the best known. They may not provide a record of all statistics based on these initiatives, but they do discuss the impact of initiatives on voting extensively, and I think it might be a good place for someone researching this topic to begin.

I enjoyed this scavenger hunt, despite the momentary frustration it caused. I may still prefer Google for a quick search, but I really appreciate the number of sources available when on a quest for more thorough information!


ProQuest: A Case Study in Digitization

I’ve used databases for research in previous classes but am new to ProQuest. I really enjoyed perusing its newspaper collections; I think it’s a great resource for primary documents, and I thought it also concretely demonstrated some of the concepts covered in Cohen and Rosenzweig’s chapter “Becoming Digital“.

Pros and Cons

The basic setup of the ProQuest database is not unlike that of other scholarly databases such as JSTOR, and its layout makes it easy to use. I liked that there were multiple search options: a basic search, an advanced search, and an option to search through individual newspaper collections. Another feature that I found very helpful allowed me to narrow my search results by publication date. When I did a basic search for “Titanic“, I could limit the results to the years between 1912 and 1915, in the immediate aftermath of the disaster, or include articles from the 21st Century that covered research and recovery efforts.

When I searched for articles about the Titanic, I did not limit the newspaper search field, so results came from all of the newspaper databases, from the New York papers (which yielded the greatest number of hits) to the Los Angeles paper. I did a similar search for “stock market crash”, again limiting the publication date search field but including all newspapers, this time for the years around 1929. After clicking on an article, I explored the different page layouts–the default, which looked almost identical to a PDF, was the easiest to read. The “page view” showed the selected article in the context of the whole newspaper–the entirety of that issue could be viewed, but the text was extremely small. The “Page View–PDF” was another option, but it was sometimes more difficult to read than the PDF that appeared when the article was chosen. This was occasionally frustrating, given that the search terms were not highlighted in the article as they often are in a database like JSTOR.

Another limitation came to light when I did a search for “Neil Armstrong” and “moon” (for this search, I used only the New York Times and the Wall Street Journal because I wanted to compare two papers from the same area during the same period–the New York Times yielded many more relevant results). In looking through multiple articles, I came across only one image and one advertisement, both black and white and of low resolution, which suggests that this database may not be extremely helpful if one is looking for photographs to supplement his text.

I know I’d need to do many more searches to find out if that were the overarching case, or if it just happened to occur in the articles I was searching. Overall, I was pleased by the database and think it’s an excellent collection of primary resources.

ProQuest: A Case Study

Searching ProQuest‘s large database helped me to understand the concepts described in Cohen and Rosenzweig’s third chapter. Several main ideas seemed to be showcased here. First, ProQuest is a massive database: it contains the full text of newspapers from major cities, with issues ranging from the mid-1800s to the present day. This is digitization on a massive scale. Who decided which cities would be represented? Why have the New York Times issues been digitized through 2008, while the Chicago Tribune issues are only complete through 1988? What accounts for that 20-year difference? Cost and audience seem likely factors, but who determines them?

Before reading Chapter 3 of Digital History, I had never considered or been aware of the ways that text is displayed. Cohen and Rosenzweig note that JSTOR uses a “hybrid” approach to its presentation, making use of both “page images” and OCR, or optical character recognition. ProQuest appears to use a similar format, because it operated the same way:

Because JSTOR believes that the “appearance of typographical and other errors could undermine the perception of quality that publishers have worked long and hard to establish,” they display the scanned page image and then use the uncorrected OCR only as an invisible search file. This means that if you search for “Mary Beard,” you will be shown all the pages where her name appears, but you will have to scan the page images to find the specific spot on the page.

[As a side note, I have read some JSTOR articles that do include highlighted search terms, but ProQuest seemed not to highlight them.]

The use of page images and OCR might also explain why it was hard to read some of the newspaper articles in the full-text PDF version:

 Even the best OCR software programs have limitations. They don’t, for example, do well with non-Latin characters, small print, certain fonts, complex page layouts or tables, mathematical or chemical symbols, or most texts from before the nineteenth century. Forget handwritten manuscripts. And even without these problems, the best OCR programs will still make mistakes.

Small font and complex page layouts of newspaper articles seem likely causes of the sometimes distorted text.
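To get a feel for how OCR behaves on a scan, here is a minimal sketch using the open-source Tesseract engine through the pytesseract wrapper (my own example, not ProQuest’s actual pipeline; “scanned_page.png” is a placeholder file name):

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (needs the Tesseract engine)

# Run OCR on a scanned newspaper page. Small fonts and complex column
# layouts are exactly where the recognized text starts to degrade.
image = Image.open("scanned_page.png")  # placeholder scan
text = pytesseract.image_to_string(image)
print(text)
```

Running something like this on a multi-column 1929 front page makes the “uncorrected OCR as an invisible search file” strategy easy to understand.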

Finally, I found it interesting that the articles I’d read contained very few, and relatively low-quality, images, but then I realized that the hybrid method of scanning the page and creating a page image would not enhance the resolution or bit depth of the picture in the newspaper, which itself is not the original and may have been old or faded. As Cohen and Rosenzweig state:

In digitizing images, as with all digitizing, the quality of the digital image rests on the quality of the original, the digitizing method employed, the skill of the person doing the digitizing, and the degree to which the digital copy has adequately “sampled” the analog original.

If a combination approach of page images and OCR is used by ProQuest, the variations in text and image quality seem to make sense, and despite its few shortcomings in those areas, it seems to me to be an excellent resource.


Nelson’s “Complex Information Processing”

Background

In his article “Complex Information Processing”, Nelson expands upon Bush’s idea of the “memex” to explain the developments necessary for creating a file structure that would allow for the storage of personal files and act “as an adjunct for creativity”. I was glad to read this article shortly after having read Bush’s famous piece (“As We May Think”), because it seemed that in the twenty years between the two articles’ publication, a great deal of technical knowledge had been gained. Nelson, however, expressed some regret that his vision for a computer-based file structure that could be used on a personal–rather than strictly professional–level had not yet come to fruition. He attributed the lack of progress to high cost, little sense of need, and uncertainty about system design.

Nelson did not dwell on the first two arguments, but rather focused almost exclusively on the third. He proposed that a computer with the capabilities he desired required three parts:

  1. information structure (zippered lists)
  2. file structure (Evolutionary List File, ELF)
  3. file language (Personalized Retrieval, Indexing, and Documentation Evolutionary System, PRIDE)

Nelson explained that this kind of system would have features “specifically adapted to useful change”:

  • it would be able “to sustain changes in the bulk and block arrangements of its contents”
  • it would permit dynamic outlining–the process by which the change in one text sequence guides an automatic change in the next sequence
  • it would allow multiple drafts to remain on file for comparison for an indefinite time period
  • it would be simple to use

These features make the system evolutionary.

Evolutionary  List File (ELF)

In order to ensure the simplicity of the evolutionary system, the ELF file structure is built from zippered lists (which allow two or more related lists to be linked). An ELF consists of:

  • entries—discrete data ranging from text to symbols; can be created at any time
  • lists—an ordered set of entries; can be combined, rearranged, or divided
  • links—a connector in one entry that links it to another entry; any number of links can be created

Its basic structure is rendered in the figure “Complex Information Technology Picture” (image not shown).
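To make Nelson’s three parts a bit more concrete, here is a minimal sketch of how an ELF’s entries, lists, and links might be modeled in code (the class names and fields are my own reading of the article, not Nelson’s notation):

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """A discrete piece of data (text, symbols); can be created at any time."""
    content: str
    links: list["Entry"] = field(default_factory=list)  # any number of links

@dataclass
class ElfList:
    """An ordered set of entries; can be combined, rearranged, or divided."""
    entries: list[Entry] = field(default_factory=list)

# Two related lists, "zippered" together by a link between corresponding entries.
draft_one = ElfList([Entry("Chapter 1, first draft")])
draft_two = ElfList([Entry("Chapter 1, revised")])
draft_one.entries[0].links.append(draft_two.entries[0])
```

Keeping both drafts on file with a link between them is exactly the kind of comparison over “an indefinite time period” that Nelson describes.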

Nelson proposes that this structure is so evolutionary because of its “psychological virtue”—its simplicity, its ability to “be easily taught to people who do not understand computers”.

Speaking as someone with little technical experience, I think that’s a goal extremely relevant to computer users of Nelson’s time and our time alike.


A Reflection: Bush’s “As We May Think”

Many of Bush’s speculations about the nature of the technology of the future in his article “As We May Think” have come to fruition; his insight is particularly impressive given that the article was written 67 years ago.  Though some of his examples (like the section on photography) are written in very technical language that might not be accessible to many modern readers, his imagined “memex” should resonate almost immediately with all of us—Bush’s idea is extremely similar to our modern computer.

Though Bush spends a great deal of time reflecting on possible technological advances of the future, he also explores the consequences of these new inventions, which I found to be the most interesting part of his article. Before writing about the “memex”, Bush observed that “There is a growing mountain of research…The investigator is staggered by the findings and conclusions of thousands of other workers—conclusions which he cannot find time to grasp, much less to remember, as they appear.” Later, he also remarked that “Truly significant attainments [can] become lost in the mass of the inconsequential…The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record”. When Bush wrote this article, a researcher might have spent weeks perusing the shelves of various libraries to find the sources he needed and still might have missed a book that could have been of great importance, simply because of the sheer volume of works available or because he was unaware that a particular work even existed. Now (though he still hopefully uses the library!), a researcher can conduct an online search that can help him narrow his search criteria and become familiar with the newest research and publications on his topic.

And yet, the modern researcher is also bound to miss information relevant to his search—he simply has the ability to be made aware of more significant material than ever before. The problem that Bush’s researcher faced has not disappeared with the advent of the internet and computer—so much information is now available in one place that the modern researcher also needs to discover how to efficiently find and utilize information that is of importance to him.

Bush suggested that the most effective way of organizing and manipulating information would be through a process of indexing by association, as the human brain does.  The “memex” would allow a man to link items which he saw as related by saving them together and by making marginal notes that could connect one article or photograph with another.  Grouping by association has become commonplace today, particularly through social media, where links to videos, blogs, and articles can be shared with others (often in response to something they have said or posted).  Association is also customary in search engines like Google and Bing, which suggest related searches while providing information regarding their users’ initial queries. It’s prevalent in internet advertising, too; some websites tailor the advertisements on their pages based on a person’s previous searches or on the content he’s viewing at that moment.  Perhaps the most obvious examples of association are links found in online articles to relevant articles on other websites.

In addition to association, another way that modern man streamlines his viewing of the vast content of the internet is through RSS readers like Google Reader that allow him to subscribe to websites he deems important, so that all new information from the various sites comes to him in one location.  Whether subscribing to a variety of sites for enjoyment’s sake or following a group of related academic sites for the latest breakthroughs, I believe Bush’s 1945 prediction is true today:

There is a new profession of trailblazers, those who find delight in the task of establishing useful trails through the enormous mass of the common record. The inheritance from the master becomes, not only his additions to the world’s record, but for his disciples the entire scaffolding by which they were erected.

Bush’s optimistic view of the benefits of technological advancements for society suggests that the intellectual scaffolding created by today’s scientists, inventors, and researchers will allow man to “grow in the wisdom of race experience”. Only time will tell whether this particular prediction of Bush’s proves accurate.