Web Archiving

Trevor Alvord (Brigham Young University, USA)
DOI: 10.4018/978-1-4666-5888-2.ch757


Web archiving is an emerging and growing field in information science and technology. Although not yet out of its second decade, it plays a critical role in documenting and preserving the human experience as portrayed on the Internet. With an estimated 72 hours of video uploaded to YouTube every minute and over 300 million images added to Facebook daily, the Internet has become by far the preferred medium for documenting one's life; traditional formats such as journals and scrapbooks have given way to blogs and Flickr accounts. This article explores the history, uses, and developing lexicon of web archiving, along with issues such as appraisal, metadata, and copyright, while examining current tools and services available in the field.
Chapter Preview


Jinfang Niu (2012) defines web archiving as “the process of gathering up data that has been recorded on the World Wide Web, storing it, ensuring the data is preserved in an archive, and making the collected data available for future research” (para. 1). Although Niu’s definition is expansive, current archival practice is more focused. Typically, web archiving operates at the surface of the web: crawlers harvest the HTML output that the user sees while avoiding the deeper, often larger files used to generate that output. Such large data sets and databases are typically managed through data curation rather than web archiving.

Internet preservation began in 1996 with the formation of the Internet Archive, a non-profit organization founded by Brewster Kahle. In partnership with Alexa Internet, another Kahle-owned company that specialized in tracking web usage, the Internet Archive began crawling and capturing the World Wide Web. In 1999, Amazon.com purchased Alexa Internet, and Kahle began devoting more time to the Internet Archive. In 2001, Kahle launched the Wayback Machine (named for the WABAC Machine built by Mr. Peabody in the Rocky & Bullwinkle cartoon), which opened to the general public the roughly 10 billion URLs the Internet Archive had captured by that time (Green, 2002). The Internet Archive is no longer alone in preserving the web. Through its web crawling service Archive-It, the Internet Archive adds around one billion pages per week and has archived over 360 billion URLs, all of which are available at waybackmachine.org (Rossi, 2013).

Key Terms in this Chapter

Document: An element of a web resource that has a distinct web address. Images, PDF documents, embedded video, and cascading style sheets (CSS) are examples of documents.

Crawler: A software program hosted online or locally that manages the capture of web content.

Capture: Copying digital information from the Internet and transferring that information into a repository.

Robots.txt: A simple text file that indicates to a crawler whether, and to what depth, information may be harvested from a particular site or webpage. The robots.txt file is located in the root directory of a website.
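A well-behaved crawler consults robots.txt before harvesting each URL. A minimal sketch using Python's standard-library robots.txt parser is shown below; the rules and URLs are hypothetical, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler checks each candidate URL before capturing it.
print(parser.can_fetch("*", "http://example.org/index.html"))      # True
print(parser.can_fetch("*", "http://example.org/private/a.html"))  # False
```

In practice a crawler would load the live file from the site's root (e.g., via RobotFileParser's set_url and read methods) rather than from a hardcoded string.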

Seed: A unique and specific URL loaded into a crawler program. The seed acts as the pointer telling the crawler what to capture.

Uniform Resource Locator (URL): A web address consisting of the access protocol (e.g., http), the root name, the domain designation, and any optional subdomains. For example: http:// (access protocol) www.whitehouse (root name) .gov (domain).
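The components named in this definition can be pulled apart programmatically. A short sketch using Python's standard-library URL parser, applied to the example address above:

```python
from urllib.parse import urlparse

# Decompose a sample URL into the components named in the definition.
parts = urlparse("http://www.whitehouse.gov/briefing-room")
print(parts.scheme)  # access protocol: 'http'
print(parts.netloc)  # host (root name plus domain): 'www.whitehouse.gov'
print(parts.path)    # '/briefing-room'
```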

Crawl: The event in which a crawler conducts a capture of web content.

Harvest: The act of capturing web content through a crawler program.
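A core step in harvesting is extracting the links from a captured page so the crawler can follow them outward from the seed. A minimal sketch using Python's standard-library HTML parser; the seed URL and page content are hypothetical, hardcoded here instead of fetched over the network:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute URLs of <a href> links found in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# A captured page (hypothetical content, not fetched live).
seed = "http://example.org/index.html"
html = '<a href="/about.html">About</a> <a href="news/today.html">News</a>'

extractor = LinkExtractor(seed)
extractor.feed(html)
print(extractor.links)
```

A full crawler repeats this step for each discovered link, typically checking robots.txt and recording each capture in the repository as it goes.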
