TITLE
Web Crawling
AUTHOR(S)
Olston, Christopher; Najork, Marc
PUB. DATE
April 2010
SOURCE
Foundations & Trends in Information Retrieval;2010, Vol. 4 Issue 3, p175
SOURCE TYPE
Academic Journal
DOC. TYPE
Article
ABSTRACT
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
ACCESSION #
48298423

Related Articles

  • Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework. Han, Lim Wern; Alhashmi, Saadat M. // Communications of the IBIMA;Dec2010, p1 

    With the increasing amount of web pages over the internet, it has been a major concern to obtain information on the internet accurately at a reasonable cost with decent performance. A potential solution is through the classification of web pages into meaningful categories. An effective...

  • A SERVER-SIDE SUPPORT LAYER FOR CLIENT PERSPECTIVE TRANSPARENT WEB CONTENT MIGRATION. BUFNEA, DARIUS; HALIŢĂ, DIANA // Studia Universitatis Babes-Bolyai, Informatica;Sep2013, Vol. 58 Issue 3, p78 

    The migration process of a website's content within a Content Management System almost always implies changes in the site structure as seen by search engines and web clients. This variation leads to some disadvantages, such as misdirecting search engine visitors to old, unavailable URLs. Even...

  • Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition. Ito, Akinori; Kajiura, Yasutomo; Suzuki, Motoyuki; Makino, Shozo // EURASIP Journal on Audio Speech & Music Processing;2009, Vol. 2009, Special section p1 

    We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to...

  • Server Side Includes for Site Management. Notess, Greg R. // Online;Jul/Aug2000, Vol. 24 Issue 4, p78 

    Provides information on Server Side Includes (SSI), a tool which can simplify the task of maintaining a Web site. Comparison with hypertext markup language; Functions of SSI; Capabilities; Problems associated with SSI application; Significance of SSI to Web searchers.

  • Novel approaches to crawling important pages early. Alam, Md.; Ha, JongWoo; Lee, SangKeun // Knowledge & Information Systems;Dec2012, Vol. 33 Issue 3, p707 

    Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of...

  • A software infrastructure for supporting spontaneous and personalized interaction in home computing environments. Nakajima, Tatsuo; Satoh, Ichiro // Personal & Ubiquitous Computing;Nov2006, Vol. 10 Issue 6, p379 

    Our daily lives are expected to change dramatically due to the popularity of ubiquitous computing technologies. These will make it possible to integrate various aspects of our lives. However, a new approach is required to seamlessly deal with devices embedded in our environments. Future embedded...

  • Compression of Semistructured Documents. Leo Galambos; Jan Lansky; Michal Zemlicka; Katsiaryna Chernik // International Journal of Information Technology;2008, Vol. 4 Issue 1, p11 

    EGOTHOR is a search engine that indexes the Web and lets users search Web documents. Its hit list contains the URL and title of each hit, along with a snippet that briefly shows a match. The snippet can almost always be assembled by an algorithm that has full knowledge of the...

  • Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server.  // International Journal of Computer Applications;Dec2010, Vol. 11, p34 

    The article presents a study describing a method for reducing web crawling traffic when downloading information from the web. The approach uses a HyperText Markup Language (HTML)-based UPDATE file. The UPDATE file maintains a list of updated universal resource...

  • Enrich the E-publishing Community Website with Search Engine Optimization Technique. Vadivel, R.; Baskaran, K. // International Journal of Computer Science Issues (IJCSI);Sep2011, Vol. 8 Issue 5, p404 

    The Internet has played a vital role in online business. Every business needs to show its information to clients or end users. Search engines have millions of indexed pages. A search engine optimization technique has to be implemented for both static and dynamic web applications. There is no...
