Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure
Related papers
An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages
2011
Abstract: Lexical signatures, which consist of a small number of words chosen to represent the "aboutness" of a page, have previously been proposed for discovering the new URI of a missing web page. However, prior methods relied on computing the lexical signature before the page was lost, or on using cached or archived versions of the page to calculate one.
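As an illustration of how such a signature might be computed, here is a minimal sketch that selects the top TF-IDF terms of a page against a small background collection; the weighting scheme, the 5-term signature size, and the toy corpus are assumptions for illustration, not the method evaluated in the paper.

```python
# Minimal sketch: pick the k highest-weighted TF-IDF terms as a lexical signature.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(page_text, background_docs, k=5):
    """Return the k terms with the highest TF-IDF weight (illustrative weighting)."""
    tf = Counter(tokenize(page_text))
    n_docs = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(tokenize(doc)))
    def tfidf(term):
        return tf[term] * math.log((n_docs + 1) / (df[term] + 1))
    return sorted(tf, key=tfidf, reverse=True)[:k]

if __name__ == "__main__":
    corpus = ["web archives preserve pages", "search engines index the web"]
    page = "digital preservation of missing web pages using web archives"
    print(lexical_signature(page, corpus))
```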
Using the Web Infrastructure for Just-In-Time Recovery of Missing Web Pages
The Internet provides access to a great number of web sites, but the structure of the web is constantly changing. Missing web pages remain a pervasive problem that users experience every day. This dissertation develops a method to overcome this problem by automatically mapping between Uniform Resource Identifiers (URIs) and the textual content of web pages using lexical signatures (LSs) and tags. We introduce a "just-in-time" approach to supporting the preservation of web content that relies on the "living" web. We propose a method to harness the collective behavior of the Web Infrastructure and investigate the suitability of lexical signatures and tags to give a "good enough" description of the "aboutness" of missing pages. Querying Internet search engines with these LSs returns the replacement page or a very similar page, which can then be provided to the user. We investigate the evolution of lexical signatures over time and propose a framework to aid in the creation of LSs. Analyzing snapshots of the web from recent years enables us to investigate the decay of such lightweight descriptions as well as the characteristics of missing pages (HTTP error code 404). We propose to evaluate and measure the quality of the framework with information retrieval measures such as precision and recall.
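The following sketch illustrates the lookup-and-evaluate loop described above under stated assumptions: the search() function is a placeholder for whatever search engine API is actually used, and success@k is an illustrative metric rather than the dissertation's exact precision/recall protocol.

```python
# Minimal sketch of the "just-in-time" lookup: query a search engine with a
# page's lexical signature and check whether the target URI comes back near
# the top of the results.
def search(query, max_results=10):
    """Placeholder: return a ranked list of result URIs for the query."""
    raise NotImplementedError("plug in a real search engine API here")

def rediscover(lexical_signature, target_uri, k=10):
    """Return the 1-based rank of target_uri in the top-k results, or None."""
    query = " ".join(lexical_signature)
    results = search(query, max_results=k)
    for rank, uri in enumerate(results, start=1):
        if uri == target_uri:
            return rank
    return None

def success_at_k(ranks, k=10):
    """Fraction of missing pages rediscovered within the top k results."""
    found = [r for r in ranks if r is not None and r <= k]
    return len(found) / len(ranks) if ranks else 0.0
```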
Just-in-time recovery of missing web pages
2006
We present Opal, a lightweight framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and are able to share their knowledge with other Opal servers by mutual harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures which are then used to search for similar versions of the web page. We present the architecture of the Opal framework, discuss a reference implementation of the framework, and present a quantitative analysis indicating that Opal could be effectively deployed.
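As a rough sketch of the inter-server harvesting step, the snippet below issues an OAI-PMH ListRecords request. The endpoint URL and the oai_dc metadata prefix are assumptions for illustration, since the abstract does not specify Opal's metadata format, and resumption tokens are not handled.

```python
# Minimal sketch: harvest records from another Opal server over OAI-PMH.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    """Issue a ListRecords request and yield the returned <record> elements."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    with urlopen(f"{base_url}?{urlencode(params)}") as response:
        tree = ET.parse(response)
    for record in tree.iter(f"{OAI_NS}record"):
        yield record

# Usage (hypothetical endpoint):
# for rec in harvest("http://opal.example.org/oai"):
#     print(rec.find(f"{OAI_NS}header/{OAI_NS}identifier").text)
```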
Revisiting Lexical Signatures to (Re-)Discover Web Pages
A lexical signature (LS) is a small set of terms derived from a document that captures the "aboutness" of that document. An LS generated from a web page can be used to discover that page at a different URL as well as to find relevant pages on the Internet. From a set of randomly selected URLs we took all their copies from the Internet Archive between 1996 and 2007 and generated their LSs. We conducted an overlap analysis of terms in all LSs and found only small overlaps in the early years (1996–2000) but increasing overlap in the more recent past (from 2003 on). We measured the performance of all LSs as a function of the number of terms they contain. We found that LSs created more recently perform better than early LSs created between 1996 and 2000. All LSs created from the year 2000 on show a similar pattern in their performance curve. Our results show that 5-, 6-, and 7-term LSs perform best at returning the URLs of interest in the top ten results. In about 50% of all cases these URLs are returned as the number one result, and in about 30% of cases the URLs were not discovered at all.
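A minimal sketch of this kind of overlap analysis is given below; Jaccard overlap between the signatures of consecutive yearly snapshots is an illustrative choice and may differ from the measure used in the study.

```python
# Minimal sketch: term overlap between lexical signatures of yearly snapshots.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def yearly_overlap(signatures_by_year):
    """signatures_by_year: dict mapping year -> list of LS terms."""
    years = sorted(signatures_by_year)
    return {
        (y1, y2): jaccard(signatures_by_year[y1], signatures_by_year[y2])
        for y1, y2 in zip(years, years[1:])
    }

if __name__ == "__main__":
    ls = {
        1998: ["research", "physics", "cern", "documents", "library"],
        2004: ["digital", "library", "preservation", "web", "archive"],
        2005: ["digital", "library", "preservation", "archive", "access"],
    }
    print(yearly_overlap(ls))
```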
Retrieving web pages using content, links, URLs and anchors
2002
For this year's web track, we concentrated on the entry page finding task. For the content-only runs, in both the ad hoc task and the entry page finding task, we used an information retrieval system based on a simple unigram language model. In the ad hoc task we experimented with alternative approaches to smoothing. For the entry page task, we incorporated additional information into the model. The sources of information we used in addition to the document's content are links, URLs, and anchors.
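The sketch below scores a document for a query under a smoothed unigram language model; Jelinek-Mercer (linear) smoothing with lambda = 0.8 is one possible choice among smoothing approaches, not necessarily the one used in these runs.

```python
# Minimal sketch: query-likelihood scoring with a linearly smoothed unigram model.
import math
from collections import Counter

def score(query_terms, doc_terms, collection_terms, lam=0.8):
    """log P(query | doc) under Jelinek-Mercer smoothing (illustrative)."""
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    doc_len = len(doc_terms)
    coll_len = len(collection_terms)
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = coll_tf[t] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:          # term unseen everywhere: skip rather than -inf
            continue
        log_p += math.log(p)
    return log_p
```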
Finding pages on the unarchived Web
IEEE/ACM Joint Conference on Digital Libraries, 2014
Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies; most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive.
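The sketch below illustrates the core idea under simple assumptions: link tuples extracted from the crawled part of the archive are aggregated into textual representations of uncrawled target URLs. The tuple format and the plain concatenation of anchor texts are illustrative choices, not the paper's exact method.

```python
# Minimal sketch: describe unarchived pages by the anchor text of links
# pointing at them from archived (crawled) pages.
from collections import defaultdict

def anchor_representations(links, archived_urls):
    """links: iterable of (source_url, target_url, anchor_text) tuples.
    Returns {unarchived_target: concatenated anchor text}."""
    archived = set(archived_urls)
    texts = defaultdict(list)
    for source, target, anchor in links:
        if source in archived and target not in archived and anchor.strip():
            texts[target].append(anchor.strip())
    return {url: " ".join(parts) for url, parts in texts.items()}

if __name__ == "__main__":
    links = [
        ("http://a.example/", "http://gone.example/report", "annual report 2009"),
        ("http://b.example/", "http://gone.example/report", "jaarverslag"),
    ]
    print(anchor_representations(links, ["http://a.example/", "http://b.example/"]))
```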
Information Retrieval on the World Wide Web
IEEE Internet Computing, 1997
The World Wide Web is a very large distributed digital information space. From its origins in 1991 as an organization-wide collaborative environment at CERN for sharing research documents in nuclear physics, the Web has grown to encompass diverse information resources: personal home pages; online digital libraries; virtual museums; product and service catalogs; government information for public dissemination; research publications; and Gopher, FTP, Usenet news, and mail servers. Some estimates suggest that the Web currently includes about 150 million pages and that this number doubles every four months.
Investigating the Change of Web Pages Titles Over Time
Inaccessible web pages are part of the browsing experience. The content of these pages, however, is often not completely lost but merely missing. Lexical signatures (LSs) generated from a web page's textual content have been shown to be suitable as search engine queries when trying to discover a (missing) web page. Since LSs are expensive to generate, we investigate the potential of web page titles, which are available at a lower cost. We present the results of studying the change of titles over time. We take titles from Internet Archive copies of randomly sampled web pages and show the frequency of change as well as the degree of change in terms of the Levenshtein score. We found very low frequencies of change and high Levenshtein scores, indicating that titles, on average, change little from their original, first observed values (rooted comparison) and even less from the values of their previous observation (sliding comparison).
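A minimal sketch of the rooted and sliding comparisons is shown below; normalizing the Levenshtein distance by the longer title's length is an assumption about how the "Levenshtein score" is computed, not necessarily the paper's exact definition.

```python
# Minimal sketch: rooted and sliding title comparisons via normalized Levenshtein.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def title_scores(titles):
    """titles: chronologically ordered list of observed titles."""
    rooted = [similarity(titles[0], t) for t in titles[1:]]
    sliding = [similarity(a, b) for a, b in zip(titles, titles[1:])]
    return rooted, sliding
```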
Curlcrawler Optimization: A Framework For Crawling The Web With URL Tracking And Canonicalization
International journal of engineering research and technology, 2013
Information plays a vital and versatile role, and its availability has evolved from printed books to the web. The WWW is now a huge, open, up-to-date, interoperable and dynamic repository of information available to everyone, everywhere, at any time. In addition to the size of the information available on the web, its scheme, authority, dynamism, appearance and interoperability are attributes that are growing exponentially [7,10]; these attributes motivated the term Web 2.0 as an evolution of the early web. Search engines are the primary means of navigating the web, since most information on the web is reached through search engines such as AltaVista, WebCrawler, HotBot, etc. Given the web's ever-growing size and its endless pool of information, optimizing the design of search engines is a central engineering concern. This paper is an experimental effort to develop and implement an extended framework and architecture that makes search engines more efficient by exploiting local resource utilization features of the programming environment. The work reports implementation experience with a focused, path-oriented approach that provides a cross-featured, human-powered framework for search engines. In addition to curl programming, personalization of information, caching and graphical perception, the main features of this framework are cross-platform and cross-architecture support, URL tracking, focused and path-oriented crawling, human-powered input, and URL canonicalization [7,21]. The first part of the paper covers related work, mostly in the field of general search engines, from our ongoing research project on crawling the web. The second part defines the architecture and functioning of the developed framework and compares it to search engine optimization for web pages. The third part provides an overview and critical analysis of the developed framework, including experimental results, pseudo code, data structures, etc.
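As an illustration of the URL canonicalization step mentioned above, here is a minimal sketch; the specific normalization rules (lowercased scheme and host, dropped default ports and fragments, sorted query parameters) are common conventions and are assumptions here, not the paper's exact rules.

```python
# Minimal sketch: normalize a URL into a canonical form for deduplication.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    if (scheme == "http" and netloc.endswith(":80")) or \
       (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]      # drop default port
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse((scheme, netloc, path, parts.params, query, ""))

if __name__ == "__main__":
    print(canonicalize("HTTP://Example.COM:80/index.html?b=2&a=1#section"))
    # -> http://example.com/index.html?a=1&b=2
```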
Effectively finding the relevant web pages from the World Wide Web
Effective and efficient retrieval of high-quality web pages from the web is becoming an ever greater challenge. Early work on search engines concentrated on the textual content of web pages to find relevant pages, but in recent years the analysis of information encoded in hyperlinks has been used to improve search engine performance. For these reasons, this paper presents three hyperlink-analysis-based algorithms to find relevant pages for a given web page (URL). The first algorithm comes from extended co-citation analysis of web pages. The second takes advantage of linear algebra theory to reveal deeper relationships among web pages and to identify relevant pages more precisely and effectively. The third presents a variation on the use of linkage analysis for automatically categorizing web pages by defining a similarity measure; this measure is used to categorize the hyperlinks themselves, rather than web pages. The paper also presents a Moved Page Algorithm to detect and eliminate dead pages.
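A minimal sketch of the co-citation idea behind the first algorithm is given below; representing the link graph as in-link sets and using the raw co-citation count as the relatedness score are illustrative simplifications, not the paper's extended algorithm.

```python
# Minimal sketch: rank candidate pages by how many parents they share with a target.
def co_citation(target, candidates, in_links):
    """in_links: dict mapping a page URL to the set of pages linking to it."""
    parents = in_links.get(target, set())
    scores = {
        page: len(parents & in_links.get(page, set()))
        for page in candidates if page != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    in_links = {
        "http://x.example/": {"http://p1.example/", "http://p2.example/"},
        "http://y.example/": {"http://p1.example/", "http://p3.example/"},
        "http://z.example/": {"http://p4.example/"},
    }
    print(co_citation("http://x.example/", list(in_links), in_links))
```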