2014 not found: a cross-platform approach to retrospective web archiving

Historical Infrastructures for Web Archiving

Historical infrastructures for Web archiving: Annotation of ephemeral collections for research

Charles van den Heuvel and Meghan Dougherty

The World Wide Web is becoming a source of information for researchers, who are more aware of the possibilities for collections of Internet content as resources. Some have begun creating archives of web content for social science and humanities research. However, there is a growing gulf between policies shared between global and national institutions creating web archives and the practices of researchers making use of the archives. Each set of stakeholders finds the others’ web archiving contributions less applicable to their own field. Institutions find the contributions of researchers to be too narrow to meet the needs of the institution’s audience, and researchers find the contributions of institutions to be too broad to meet the needs of their research methods. Resources are extended to advance both institutional and researcher tools, but the gulf between the two is persistent. Institutions generally produce web archives that are broad in scope but with limited access and enrichment tools. The design of common access interfaces, such as the Internet Archive’s Wayback Machine, limits access points to archives to only URL and date. This narrow access limits the ways in which web archives can be valuable for exploring research questions in the humanities and social sciences. Individual scholars, in catering to their own disciplinary and methodological needs, produce web archives that are narrow in scope, and whose access and enrichment tools are personalized to work within the boundaries of the project for which the web archive was built. There is no way to explore a subset of an archive by topic, event, or idea. The current search paradigm in web archiving access tools is built primarily on retrieval, not discovery.
We suggest that there is a need for extensible tools to enhance access to and enrichment of web archives to make them more readily reusable and so more valuable for both institutions and researchers, and that annotation activities can serve as one potential guide for the development of such tools to bridge the divide. The contextual knowledge production evolving from annotation not only adds value to web archives by providing one solution to the problem of limited resources for generating metadata in web archives; it also forms part of our collective memory and needs to be preserved together with the original content. In the 19th and 20th centuries, documentalists such as Paul Otlet (1868-1944) began exploring methods to order, access, and annotate ephemeral, dynamic material for research. Otlet developed a documentation system in which bibliographical material describing content transmitted by all sorts of media (radio, film, gramophone and television) was stored together with various forms of annotations, ranging from updates to expressions of opinion. The system envisaged researchers working together on a global level to create and to enrich collective memory. We claim that these pre-web annotation initiatives are also of interest for future strategies to access and preserve more dynamic and ephemeral forms of digital cultural heritage, such as web archiving.

The evolution of web archiving

International Journal on Digital Libraries, 2016

Web archives preserve information published on the web or digitized from printed publications. Much of this information is unique and historically valuable. However, the lack of knowledge about the global status of web archiving initiatives hampers their improvement and collaboration. To overcome this problem, we conducted two surveys, in 2010 and 2014, which provide a comprehensive characterization of web archiving initiatives and their evolution. We identified several patterns and trends that highlight challenges and opportunities. We discuss these patterns and trends, which make it possible to define strategies, estimate resources and provide guidelines for research and development of better technology. Our results show that in recent years there was significant growth in the number of initiatives and of countries hosting them, in the volume of data, and in the number of contents preserved. While this indicates that the web archiving community is dedicating a growing effort to preserving digital information, other results presented throughout the paper raise concerns, such as the small amount of archived data in comparison with the amount of data that is being published online.

Web-archiving and social media: an exploratory analysis

International Journal of Digital Humanities, 2021

The archived web provides an important footprint of the past, documenting online social behaviour through social media, and news through media outlets' websites and government sites. Consequently, web archiving is increasingly gaining the attention of heritage institutions, academics and policy makers. The importance of web archives as data resources for (digital) scholars has been acknowledged for investigating the past. Still, heritage institutions and academics struggle to 'keep up to pace' with the fast-evolving changes of the World Wide Web and with the changing habits and practices of internet users. While a number of national institutions have set up a national framework to archive 'regular' web pages, social media archiving (SMA) is still in its infancy, with various countries starting up pilot archiving projects. SMA is not without challenges; the sheer volume of social media content, the lack of technical standards for capturing or storing social media data, and social media's ephemeral character can be impeding factors. The goal of this article is threefold. First, we aim to extend the most recent descriptive state-of-the-art of national web archiving, published in the first issue of International Journal of Digital Humanities (March 2019), with information on SMA. Second, we outline the current legal, technical and operational (such as the selection and preservation policy) aspects of archiving social media content. This is complemented with results from an online survey to which 15 institutions responded. Finally, we discuss and reflect on important challenges in SMA that should be considered in future archiving projects.

The Importance of Web Archives for Humanities

International Journal of Humanities and Arts Computing, 2014

The web is the primary means of communication in developed societies. It contains descriptions of recent events generated through distinct perspectives. Thus, the web is a valuable resource for contemporary historical research. However, its information is extremely ephemeral. Several research studies have shown that only a small amount of information remains available on the web for longer than one year. Web archiving aims to acquire, preserve and provide access to historical information published online. In April 2013, there were at least sixty-four web archiving initiatives worldwide. Altogether, these archived collections of web documents form a comprehensive picture of our cultural, commercial, scientific and social history. Web archiving also has an important sociological impact because ordinary citizens are publishing personal information online without preservation concerns. In the future, web archives will probably be the only source of personal memories for many people. We p...

Historical Infrastructures for Web Archiving: Annotation of Ephemeral Collections for Researchers and Cultural Heritage Institutions

2009

The World Wide Web is becoming a source of information for researchers, who are more aware of the possibilities for collections of Internet content as resources. Some have begun creating archives of web content for social science and humanities research. However, there is a growing gulf between policies shared between global and national institutions creating web archives and the practices of researchers making use of the archives. Each set of stakeholders finds the others’ web archiving contributions less applicable to their own field. Institutions find the contributions of researchers to be too narrow to meet the needs of the institution’s audience, and researchers find the contributions of institutions to be too broad to meet the needs of their research methods. Resources are extended to advance both institutional and researcher tools, but the gulf between the two is persistent. Institutions generally produce web archives that are broad in scope but with limited access and enrich...

Web Archives: The Future(s)

2011

EXECUTIVE SUMMARY This report has been written by researchers at the Oxford Internet Institute for the International Internet Preservation Consortium (IIPC). The aim is to stimulate further discussion among web archivists and researchers about the future ways in which web archives can be used by researchers.

Web Archiving Methods and Approaches: A Comparative Study

Library Trends, 2005

The Web is a virtually infinite information space, and archiving its entirety, all its aspects, is a utopia. The volume of information presents a challenge, but it is neither the only nor the most limiting factor given the continuous drop in storage device costs. Significant challenges lie in the management and technical issues of the location and collection of Web sites. As a consequence of this, archiving the Web is a task that no single institution can carry out alone. This article will present various approaches undertaken today by different institutions; it will discuss their focuses, strengths, and limits, as well as a model for appraisal and identifying potential complementary aspects amongst them. A comparison for discovery accuracy is presented between the snapshot approach done by the Internet Archive (IA) and the event-based collection done by the Bibliothèque Nationale de France (BNF) in 2002 for the presidential and parliamentary elections. The balanced conclusion of this comparison allows for identification of future direction for improvement of the former approach.

A survey on web archiving initiatives

2011

Web archiving has been gaining interest and recognized importance for modern societies around the world. However, for web archivists it is frequently difficult to demonstrate this fact, for instance, to funders. This study provides an updated and global overview of web archiving. The obtained results showed that the number of web archiving initiatives grew significantly after 2003 and that they are concentrated in developed countries. We statistically analyzed metrics such as the volume of archived data, archive file formats, and the number of people engaged. Web archives all together must process more data than any web search engine. Considering the complexity and large amounts of data involved in web archiving, the results showed that the assigned resources are scarce. A Wikipedia page was created to complement the presented work and be collaboratively kept up-to-date by the community.

It Takes A Village To Save The Web: The End Of Term Web Archive

Documents to the People, 2012

The goal of the project team was to execute a comprehensive harvest of the federal government domains (.gov, .mil, .org, etc.) in the final months of the Bush administration, and to document changes in the federal government websites as agencies transitioned to the Obama administration. This collaborative effort was prompted by the announcement that the National Archives and Records Administration (NARA), which had conducted harvests of prior administration transitions, would not be archiving agency websites during the 2008 transition. This announcement prompted considerable debate about the role of NARA in web archiving and the value of archiving websites in their totality. It also came just as the International Internet Preservation Consortium (IIPC) held its 2008 General Assembly. All five project partners are IIPC members and were able to convene an immediate meeting to discuss what actions should be taken. With little time and no funding, the five End of Term (EOT) Project organizations responded together with the range of skills and resources needed to build the archive. The End of Term Web Archive (eotarchive.cdlib.org) includes federal government websites in the legislative, executive, and judicial branches of government. It holds over 160 million documents harvested from 3,300 websites, and represents sixteen terabytes of data. This article