The Invisible Web: Navigating the Web Outside Traditional Search Engines
Related papers
Uncovering the unarchived web
In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 2014.
Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling harvests more information than just the websites intended for preservation; this additional information could be used to reconstruct impressions of pages that existed on the live web at the crawl date, but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's aura: the web documents that were not included in the archived collection, but are known to have existed because they are mentioned on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text, and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
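The abstract describes deriving representations of unarchived pages by combining anchor text, link structure, and crawl dates found in the archived crawl. A minimal sketch of that idea (not the authors' actual pipeline; the field names `url`, `crawl_date`, `outlinks`, `target_url`, and `anchor_text` are hypothetical) could look like this:

```python
from collections import defaultdict

def build_aura_representations(archived_pages):
    """Aggregate link evidence for URLs referenced by archived pages
    but not themselves part of the archive.

    `archived_pages` is a list of hypothetical page objects with `url`,
    `crawl_date`, and `outlinks`; each outlink has `target_url` and
    `anchor_text`. These names are illustrative, not the paper's schema.
    """
    archived_urls = {page.url for page in archived_pages}
    aura = defaultdict(lambda: {"anchors": [], "sources": set(), "crawl_dates": []})

    for page in archived_pages:
        for link in page.outlinks:
            if link.target_url in archived_urls:
                continue                                   # target was archived
            rep = aura[link.target_url]
            rep["anchors"].append(link.anchor_text)        # textual evidence
            rep["sources"].add(page.url)                   # link-structure evidence
            rep["crawl_dates"].append(page.crawl_date)     # temporal evidence
    return dict(aura)
```

Each resulting per-URL record can then be treated as a document surrogate and indexed or searched like an ordinary archived page.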
1997
With the explosive growth of the Web, one of the biggest challenges in exploiting the wealth of available information is locating the relevant documents. Search engines play a crucial role in addressing this problem by precompiling a large index of available information in order to quickly produce a set of possibly relevant documents in response to a query. While most Web users make extensive use of Internet search engines, few people have more than a vague idea of how these systems work.
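Since this abstract turns on the idea of a precompiled index answering queries, a toy illustration may help; this is a generic inverted-index sketch, not any particular engine's implementation:

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {1: "deep web databases", 2: "search engines index the web"}
index = build_index(docs)
print(search(index, "web"))   # {1, 2}
```

Real engines add ranking, link analysis, and massive-scale storage, but the precompute-then-lookup pattern is the same.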
Specialist Tools for Tackling the Hidden Web
2008
Abstract: When we fail to find information through our favourite search tool we often blame the so-called "hidden web". Some industry commentators imbue the hidden web with an aura of mystery, as though esoteric, magical incantations are needed to reveal its secrets. The reality is more prosaic. The size of the search engine databases is part of the problem: Yahoo claims to have over 20 billion pages, and the rest range from 5 to 12 billion pages. Valuable information can so easily be "lost" because of the huge volume of data. To make matters worse, information is now presented in an increasing variety of formats: ebooks, photographs, videos, podcasts of interviews and conference presentations, blogs, RSS feeds, TV and radio programmes. The possibilities seem endless. Tackling this level of information overload requires lateral thinking on the part of the searcher when building a search strategy. Google is a fantastic search tool but not foolproof. It is not comprehe...
White Paper: The Deep Web: Surfacing Hidden Value
The Journal of Electronic Publishing, 2001
Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it.
PERSONALIZED WEB CRAWLER FOR INVISIBLE WEB
International Journal of Mathematical Archive, 2012
This paper discusses the hidden Web. Vast expanses of the Web are completely invisible to search engines. Even worse, this "Invisible Web" is in all likelihood growing significantly faster than the visible Web you are familiar with. The Invisible Web is made up of information stored in databases; unlike pages on the visible Web, information in databases is generally inaccessible to the software spiders and crawlers that compile search engine indexes. In this paper I discuss the existence of a hidden or "deep" Web with approximately 500 billion individual documents, most of which are available to the public but not accessible through conventional search engines. That is because many of these documents use frames or live in database-driven Web sites such as eBay, Amazon.com, and the Library of Congress, which the spiders cannot crawl. I discuss the different issues related to the invisible Web and the existing strategies for crawling the deep Web, and then offer some novel ideas for crawling it.
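The abstract argues that database-backed pages are invisible because they exist only in response to form queries. One common surfacing strategy, sometimes called query probing, submits candidate terms to a site's search form and harvests the generated result pages. A hedged sketch (the endpoint, parameter name, and seed terms below are hypothetical, and any real crawl must respect robots.txt and terms of service):

```python
import requests

def probe_search_form(form_url, query_param, seed_terms):
    """Surface deep-web content by submitting seed queries to a site's
    search form and collecting the dynamically generated result pages."""
    harvested = {}
    for term in seed_terms:
        resp = requests.get(form_url, params={query_param: term}, timeout=10)
        if resp.ok:
            harvested[term] = resp.text   # result HTML, to be parsed for records/links
    return harvested

# Hypothetical usage:
# pages = probe_search_form("https://example.org/search", "q", ["archive", "heritage"])
```

In practice the harvested result pages are parsed to extract record links, which are then fetched and indexed like ordinary documents.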
Uncovering information hidden in Web archives
D-Lib Magazine, 2002
The Internet has turned into an important aspect of our information infrastructure and society, with the Web forming a part of our cultural heritage. Several initiatives thus set out to preserve it for the future. The resulting Web archives are by no means only a collection of historic Web pages. They hold a wealth of information that waits to be exploited, information that may be substantial to a variety of disciplines. With the time-line and metadata available in such a Web archive, additional analyses that go beyond mere information exploration become possible. In the context of the Austrian On-Line Archive (AOLA), we established a Data Warehouse as a key to this information. The Data Warehouse makes it possible to analyze a variety of characteristics of the Web in a flexible and interactive manner using on-line analytical processing (OLAP) techniques. Specifically, technological aspects such as operating systems and Web servers used, the variety of file types, forms or scripting languages encountered, as well as the link structure within domains, may be used to infer characteristics of technology maturation and impact or community structures.
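To make the OLAP idea concrete, the kind of roll-up the abstract mentions (counting documents by crawl year, server, and file type) can be sketched as a simple group-by over per-document archive metadata. This is an illustrative aggregation with made-up field names and data, not the AOLA warehouse itself:

```python
import pandas as pd

# Hypothetical per-document metadata extracted from a web archive crawl.
records = pd.DataFrame([
    {"domain": "example.at", "file_type": "text/html", "server": "Apache", "crawl_year": 2001},
    {"domain": "example.at", "file_type": "image/gif", "server": "Apache", "crawl_year": 2001},
    {"domain": "uni.ac.at",  "file_type": "text/html", "server": "IIS",    "crawl_year": 2002},
])

# An OLAP-style roll-up: document counts per (crawl_year, server, file_type).
cube = (records
        .groupby(["crawl_year", "server", "file_type"])
        .size()
        .rename("doc_count")
        .reset_index())
print(cube)
```

A full warehouse adds dimensions (domain, scripting language, link structure) and lets analysts slice and drill down interactively, but the aggregation pattern is the same.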