Novel approaches to crawling important pages early

Alam, Md.; Ha, JongWoo; Lee, SangKeun
December 2012
Knowledge & Information Systems;Dec2012, Vol. 33 Issue 3, p707
Academic Journal
Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering by prioritizing important pages with the well-known PageRank as the importance metric. To score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by a factor of up to three while improving effectiveness by 5% in cumulative PageRank.
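
The general idea of biasing crawl order toward important pages can be illustrated with a priority-queue frontier. The sketch below is not the FPR-title-host or RankMass algorithm from the paper. It is a minimal stand-in that assumes a toy in-memory link graph and scores each uncrawled URL by the number of already-crawled pages that link to it, which is one simple use of partial link structure. The function name `crawl_order`, the `graph` dictionary, and the page names are hypothetical.

```python
import heapq
from collections import defaultdict


def crawl_order(graph, seeds, budget):
    """Return the order in which pages would be fetched.

    graph  -- dict mapping a URL to the list of URLs it links to
              (a hypothetical toy web, not real crawl data)
    seeds  -- URLs to start from
    budget -- maximum number of pages to fetch

    Assumption: an uncrawled URL's score is the number of already-crawled
    pages linking to it. This is a simple proxy, not PageRank itself.
    """
    score = defaultdict(int)   # current in-link count seen so far
    crawled = []               # fetch order
    seen = set()               # URLs already fetched
    heap = []                  # entries are (-score, insertion_id, url)
    counter = 0                # insertion id keeps ties in discovery order

    for url in seeds:
        heapq.heappush(heap, (0, counter, url))
        counter += 1

    while heap and len(crawled) < budget:
        neg_score, _, url = heapq.heappop(heap)
        # Skip URLs already fetched and entries whose score is outdated.
        if url in seen or -neg_score != score[url]:
            continue
        seen.add(url)
        crawled.append(url)
        for target in graph.get(url, []):
            if target in seen:
                continue
            score[target] += 1
            heapq.heappush(heap, (-score[target], counter, target))
            counter += 1

    return crawled


# Toy graph: "hub" is linked from two pages, "x" and "leaf" from one each.
toy_graph = {
    "seed": ["a", "b"],
    "a": ["x", "hub"],
    "b": ["leaf", "hub"],
}
print(crawl_order(toy_graph, ["seed"], budget=4))
```

With this toy graph, a plain breadth-first crawl would fetch `x` before `hub`, whereas the scored frontier fetches `hub` first because two crawled pages point to it. A real scheduler would replace the in-link count with a richer importance estimate.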


Related Articles

  • Design and Implementation of A Focused Crawler - TargetCrawler. Feng Jian; Chen Jing-zhou; Cao Lei // International Journal of Grid & Distributed Computing;2014, Vol. 7 Issue 4, p149 

    Adopting focused crawler to search web sites is the trend of next generation search engines. Design and implementation of a focused crawler - TargetCrawler is introduced in detail, including its overall architecture, main modules, working processes and two key algorithms, duplicate removing...

  • Web Crawling. Olston, Christopher; Najork, Marc // Foundations & Trends in Information Retrieval;2010, Vol. 4 Issue 3, p175 

    This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical...

  • Making sure people will come to your website. Mocomber, Laurie // Northern Colorado Business Report;2/24/2012, Vol. 17 Issue 12, p18 

    The article lists ways to attract the maximum number of visitors to a self-created website. It says that the direct way involves typing the URL into the browser window with an effort to get it placed at the front, creation of the...

  • Welcome to Web 3.0. Reh, Robert // Credit Union Magazine;Sep2011, Vol. 77 Issue 9, p70

    The article focuses on the development of Web 3.0, the new stage in the evolution of the Internet and Web applications. It notes that the web allows credit unions (CUs) to respond to members' needs and promotes more productive member interactions. Other capabilities of Web 3.0 include improving...

  • Usability Inspection of Web Portals: Results and Insights from Empirical Study. Granić, Andrina; Marangunić, Nikola; Mitrović, Ivica // International Journal of Computer Science Issues (IJCSI);May2013, Vol. 10 Issue 3, p234 

    Web portals are a special breed of web sites, providing a large and diverse user population with a blend of information, services and facilities. Whether they reach their aim of facilitating users' access to diverse resources, and to what extent, remains an open question. In the paper this issue...

  • The Management of a Website's Historical Links and Documents. David Chao // Issues in Information Systems;2015, Vol. 16 Issue 4, p64

    An organization's websites change constantly to reflect the dynamic nature of its activities and its environment. Consequently, historical links and documents are generated that include outdated URLs, old versions of web pages, and deleted web pages. These old versions are snapshots of a...

  • Website marketing turnoffs. Kawasaki, Guy // Entrepreneur;Jun2009, Vol. 37 Issue 6, p30 

    The article discusses the Internet marketing turnoffs which could hinder the adoption of products and services. It says that users should not be forced to register and the web site's uniform resource locator (URL) should be short. It notes that a search box, bookmarks and electronic mail...

  • Chapter 5: Sources and Resources. Trainor, Cindi; Price, Jason // Library Technology Reports;Oct2010, Vol. 46 Issue 7, p34 

    This chapter provides citations for articles and websites that the authors used to compose this report as well as sources that provide background or further information relevant to the topics addressed.

  • Chapter 1: Introduction. Trainor, Cindi; Price, Jason // Library Technology Reports;Oct2010, Vol. 46 Issue 7, p5 

    This chapter of “Rethinking Library Linking” introduces the concepts and purposes of link resolver software and the OpenURL standard and how current user behavior and new tools worked in tandem to create change in what is required for an effective link resolver.

