Uncovering the unarchived web
Related papers
In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 2014.
Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. Either way, crawling harvests more information than just the websites intended for preservation, and this surplus can be used to reconstruct impressions of pages that existed on the live web at crawl time but would otherwise have been lost forever. We present a method to create representations of what we refer to as a web collection’s aura: the web documents that were not included in the archived collection but are known to have existed, because they are mentioned on pages that were included in the archive. To create representations of these unarchived pages, we exploit the information about unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text, and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
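The paper's own pipeline is not reproduced here, but the idea of deriving surrogates for unarchived pages can be sketched compactly. A minimal sketch under assumed inputs (`archived`, `pages`, and the example URLs are all invented): aggregate the anchor text, crawl dates, and in-link counts of every link target that falls outside the archive.

```python
from collections import defaultdict

# Hypothetical inputs (invented for illustration): `archived` is the set
# of URLs captured in the collection; `pages` maps each archived URL to
# the (target_url, anchor_text, crawl_date) triples of its outgoing links.
archived = {"http://example.nl/a", "http://example.nl/b"}
pages = {
    "http://example.nl/a": [
        ("http://example.nl/gone", "annual report 2012", "2012-03-01"),
    ],
    "http://example.nl/b": [
        ("http://example.nl/gone", "report archive", "2012-04-15"),
    ],
}

# Build a surrogate for each link target that lies outside the archive,
# combining anchor text, crawl dates, and link structure (in-degree).
aura = defaultdict(lambda: {"anchors": [], "dates": [], "indegree": 0})
for source, links in pages.items():
    for target, anchor, date in links:
        if target not in archived:           # unarchived, but evidenced
            rep = aura[target]
            rep["anchors"].append(anchor)    # searchable surrogate text
            rep["dates"].append(date)        # evidence of existence
            rep["indegree"] += 1             # simple link-structure signal

for url, rep in aura.items():
    print(url, rep)
```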
Finding pages on the unarchived Web
IEEE/ACM Joint Conference on Digital Libraries, 2014
Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies: most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and we experiment with this approach on the Dutch Web archive.
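Retrieval over such reconstructed descriptions can be illustrated with a toy scorer. This is not the authors' ranking model, just a hedged term-overlap example over the invented surrogate from the sketch above:

```python
# Toy term-overlap scoring over the surrogate anchor text built above;
# the surrogate data is invented and the scorer is deliberately naive.
surrogates = {
    "http://example.nl/gone": "annual report 2012 report archive",
}

def score(query: str, text: str) -> int:
    terms = set(query.lower().split())
    return sum(1 for token in text.lower().split() if token in terms)

query = "2012 annual report"
ranked = sorted(surrogates.items(),
                key=lambda item: score(query, item[1]), reverse=True)
print(ranked[0])  # best-matching unarchived URL and its surrogate
```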
Uncovering information hidden in Web archives
D-Lib magazine, 2002
The Internet has turned into an important aspect of our information infrastructure and society, with the Web forming a part of our cultural heritage. Several initiatives thus set out to preserve it for the future. The resulting Web archives are by no means only a collection of historic Web pages. They hold a wealth of information that waits to be exploited, information that may be substantial to a variety of disciplines. With the time-line and metadata available in such a Web archive, additional analyses that go beyond mere information exploration become possible. In the context of the Austrian On-Line Archive (AOLA), we established a Data Warehouse as a key to this information. The Data Warehouse makes it possible to analyze a variety of characteristics of the Web in a flexible and interactive manner using on-line analytical processing (OLAP) techniques. Specifically, technological aspects such as operating systems and Web servers used, the variety of file types, forms or scripting languages encountered, as well as the link structure within domains, may be used to infer characteristics of technology maturation and impact or community structures.
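AOLA's Data Warehouse is not publicly available, so the following is only a rough, hypothetical illustration of the OLAP-style roll-ups described, using pandas over invented crawl metadata (the field names `domain`, `mime`, `server`, and `bytes` are assumptions, not AOLA's schema):

```python
import pandas as pd

# Invented crawl metadata standing in for a warehouse fact table:
# one row per harvested resource (not AOLA's actual schema).
records = pd.DataFrame([
    {"domain": "example.at", "mime": "text/html", "server": "Apache", "bytes": 3200},
    {"domain": "example.at", "mime": "image/png", "server": "Apache", "bytes": 54000},
    {"domain": "other.at",   "mime": "text/html", "server": "nginx",  "bytes": 2100},
])

# OLAP-style roll-up: slice the cube along the (domain, mime) dimensions
# to get per-domain file-type counts and total bytes.
cube = records.pivot_table(index="domain", columns="mime",
                           values="bytes", aggfunc=["count", "sum"],
                           fill_value=0)
print(cube)
```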
Historical Infrastructures for Web Archiving
Historical infrastructures for Web archiving: Annotation of ephemeral collections for research
Charles van den Heuvel and Meghan Dougherty

The World Wide Web is becoming a source of information for researchers, who are increasingly aware of the possibilities of Internet content collections as research resources. Some have begun creating archives of web content for social science and humanities research. However, there is a growing gulf between the policies shared among the global and national institutions creating web archives and the practices of researchers making use of those archives. Each set of stakeholders finds the others' contributions less applicable to their own field: institutions find the contributions of researchers too narrow to meet the needs of the institution's audience, and researchers find the contributions of institutions too broad to meet the needs of their research methods. Resources are extended to advance both institutional and researcher tools, but the gulf between the two persists.

Institutions generally produce web archives that are broad in scope but with limited access and enrichment tools. The design of common access interfaces, such as the Internet Archive's Wayback Machine, limits access points to URL and date only. This narrow access limits the ways in which web archives can be valuable for exploring research questions in the humanities and social sciences. Individual scholars, catering to their own disciplinary and methodological needs, produce web archives that are narrow in scope and whose access and enrichment tools are personalized to work within the boundaries of the project for which the archive was built. There is no way to explore a subset of an archive by topic, event, or idea: the current search paradigm in web archiving access tools is built primarily on retrieval, not discovery.

We suggest that there is a need for extensible tools to enhance access to and enrichment of web archives, making them more readily reusable and thus more valuable for both institutions and researchers, and that annotation activities can serve as one potential guide for developing such tools to bridge the divide. The contextual knowledge production evolving from annotation not only adds value to web archives by offering one solution to the problem of limited resources for generating metadata in web archives; it also forms part of our collective memory and needs to be preserved together with the original content.

In the 19th and 20th centuries, documentalists such as Paul Otlet (1868-1944) began exploring methods to order, access, and annotate ephemeral, dynamic material for research. Otlet developed a documentation system in which bibliographical material describing content transmitted by all sorts of media (radio, film, gramophone, and television) was stored together with various forms of annotation, ranging from updates to expressions of opinion. His system imagined researchers working together on a global level to create and enrich collective memory. We claim that these pre-web annotation initiatives are also of interest for future strategies to access and preserve more dynamic and ephemeral forms of digital cultural heritage, such as web archives.
2014 not found: a cross-platform approach to retrospective web archiving
Internet Histories, 2019
While web archiving techniques capture snapshots of websites in real time, this article introduces an approach for building special collections for web archiving in retrospect. Retrospective web archiving (RWA) aims to fill in gaps in existing archives, as well as to expand the boundaries of web archiving from the open web to cross-platform curation, and from national to international perspectives. The proposed approach is tested on a case study of the 2014 War in Gaza. The retrospective collection contains 118,508 unique URIs and relevant metadata, carbon-dated to the period of the military operation, in 46 languages and 5,692 domain suffixes, collected from Wikipedia, Google, Twitter and YouTube. Findings suggest that four years after the war, 50% of the URIs were still available on the live web, but only 38% of them had already been archived elsewhere. Although URL sharing on social media was found to be the most relevant source for retrospective curation, the platformisation of the web, along with the popularity of URL shortening services, severely impacts its archival coverage. The article suggests taking platform dynamics and cultural differences in link sharing practices into account when thinking about future curation policies for both real-time and retrospective web archiving.
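The article's own curation pipeline is not public, but the "archived elsewhere" check it describes can be approximated with the Internet Archive's public Availability API. A minimal sketch (the endpoint and its `url`/`timestamp` parameters are the real API; the example Wikipedia URL and the July 2014 timestamp are just illustrations):

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str = "20140708"):
    """Return the capture closest to `timestamp` (YYYYMMDD) according to
    the Internet Archive's Availability API, or None if none exists."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen(
            f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

# Illustrative only: was this page captured near the summer of 2014?
snap = closest_snapshot("https://en.wikipedia.org/wiki/2014_Gaza_War")
print(snap["url"] if snap else "not archived")
```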
The evolution of web archiving
International Journal on Digital Libraries, 2016
Web archives preserve information published on the web or digitized from printed publications. Much of this information is unique and historically valuable. However, the lack of knowledge about the global status of web archiving initiatives hampers their improvement and collaboration. To overcome this problem, we conducted two surveys, in 2010 and 2014, which provide a comprehensive characterization of web archiving initiatives and their evolution. We identified several patterns and trends that highlight challenges and opportunities, and we discuss how they can be used to define strategies, estimate resources, and provide guidelines for research and development of better technology. Our results show that over the last years there was significant growth in the number of initiatives, the countries hosting them, the volume of data, and the number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort to preserving digital information, other results presented throughout the paper raise concerns, such as the small amount of archived data in comparison with the amount of data being published online.
Web Archiving and Digital Libraries (WADL) 2016: Highlights and Introduction to this Special Issue
Bull. IEEE Tech. Comm. Digit. Libr., 2017
This workshop, reported in the following 12 papers, explored the integration of Web archiving and digital libraries, covering the complete life cycle involved: creation/authoring, uploading/publishing on the Web (2.0), (focused) crawling, indexing, exploration (searching, browsing), archiving (of events), etc. It included particular coverage of current topics of interest, e.g., big data, mobile web archiving, and systems (e.g., Memento, SiteStory, Hadoop processing).
2011
Executive summary: This report has been written by researchers at the Oxford Internet Institute for the International Internet Preservation Consortium (IIPC). The aim is to stimulate further discussion among web archivists and researchers about the future ways in which web archives can be used by researchers.
Library Trends, 2009
The ECHO DEPository (also known as ECHO DEP, an abbreviation for Exploring Collaborations to Harvest Objects in a Digital Environment for Preservation) is an NDIIPP-partner project led by the University of Illinois at Urbana-Champaign in collaboration with OCLC and a consortium of partners, including five state libraries and archives. A core deliverable of the project's first phase was OCLC's development of the Web Archives Workbench (WAW), an open-source suite of Web archiving tools for identifying, describing, and harvesting Web-based content for ingestion into an external digital repository. Released in October 2007, the suite is designed to bridge the gap between manual selection and automated capture based on the "Arizona Model," which applies a traditional aggregate-based archival approach to Web archiving. Aggregate-based archiving refers to archiving items by group or in series, rather than individually. Core functionality of the suite includes the ability to identify Web content of potential interest through crawls of "seed" URLs and the domains they link to; tools for creating and managing metadata for association with harvested objects; website structural analysis and visualization to aid human content selection decisions; and packaging using a PREMIS-based METS profile developed by the ECHO DEPository to support easier ingestion into multiple repositories. This article provides background on the Arizona Model; an overview of how the tools work and their technical implementation; and a brief summary of user feedback from testing and implementing the tools.
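The Web Archives Workbench itself is a packaged suite, not reproduced here; purely as a hedged, language-neutral illustration of its first step (crawl a seed URL and tally the domains it links to), here is a small Python sketch using only the standard library, with `https://example.org/` as a stand-in seed:

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def linked_domains(seed: str) -> Counter:
    """Fetch one seed page and tally the domains it links to."""
    html = urlopen(seed).read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    return Counter(urlparse(urljoin(seed, href)).netloc
                   for href in parser.links)

print(linked_domains("https://example.org/").most_common(10))
```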