Site Crawling

Site Crawler: Download all web pages from a given domain or base URL.

Site Crawl Start
Start URL (must start with http://, https://, ftp://, smb:// or file://)

Link-List of URL
Sitemap URL

load all files in domain
load only files in a sub-path of given url
not more than documents
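
The start URL rule and the two scope options above can be illustrated with a small sketch. This is not the crawler's own code: the function names `is_valid_start_url` and `in_scope` are hypothetical, and the sub-path rule (everything under the directory of the start URL) is an assumption about how such a filter could work.

```python
from urllib.parse import urlsplit

# Prefixes listed in the form; a start URL must begin with one of them.
ALLOWED_PREFIXES = ("http://", "https://", "ftp://", "smb://", "file://")


def is_valid_start_url(url):
    """Return True if the URL starts with one of the accepted prefixes."""
    return url.lower().startswith(ALLOWED_PREFIXES)


def in_scope(start_url, candidate_url, subpath_only):
    """Decide whether candidate_url belongs to the crawl (hypothetical rule).

    subpath_only=False models "load all files in domain".
    subpath_only=True models "load only files in a sub-path of given url",
    assumed here to mean the directory that contains the start URL.
    """
    start = urlsplit(start_url)
    cand = urlsplit(candidate_url)
    if cand.hostname != start.hostname:
        return False
    if not subpath_only:
        return True
    # Directory part of the start URL path, always ending in "/".
    base = start.path.rsplit("/", 1)[0] + "/"
    return (cand.path or "/").startswith(base)
```

For example, with the start URL `http://example.org/docs/index.html`, a page at `http://example.org/blog/post.html` is in scope for the domain option but out of scope for the sub-path option.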

Hints

  • Crawl Speed Limitation

    No more than two pages are loaded from the same host in one second (no more than 120 documents per minute) to limit the load on the target server.
  • Target Balancer

    A second crawl for a different host increases the throughput to a maximum of 240 documents per minute since the crawler balances the load over all hosts.
  • High Speed Crawling

    A 'shallow crawl' that is not limited to a single host (or site) is not bound by a fixed pages per minute (ppm) rate: the rate grows with the number of target hosts and is effectively unlimited when that number is high. This can be done using the Expert Crawl Start servlet.
  • Scheduler Steering

    The schedule of a crawl can be changed or removed using the API Steering.
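
The numbers in the hints follow from the per-host limit: 2 pages per second times 60 seconds gives 120 documents per minute for one host, and 240 for two hosts. The sketch below simulates this with a simple earliest-next-fetch queue. It is only an illustration under stated assumptions, not the crawler's actual balancer; `plan_fetches` and `MIN_DELAY_SECONDS` are hypothetical names.

```python
import heapq
from collections import deque

# Two pages per second per host means at least 0.5 s between fetches.
MIN_DELAY_SECONDS = 0.5


def plan_fetches(queues, duration_seconds):
    """Simulate a per-host rate-limited fetch plan.

    queues maps a host name to its list of pending URLs.
    Returns a list of (time, url) pairs scheduled before duration_seconds.
    """
    pending = {host: deque(urls) for host, urls in queues.items()}
    # Each heap entry is (earliest allowed fetch time, host).
    heap = [(0.0, host) for host in sorted(pending)]
    heapq.heapify(heap)
    plan = []
    while heap:
        when, host = heapq.heappop(heap)
        if when >= duration_seconds or not pending[host]:
            continue  # host is out of time or out of URLs
        plan.append((when, pending[host].popleft()))
        heapq.heappush(heap, (when + MIN_DELAY_SECONDS, host))
    return plan


one_host = {"a.example": [f"http://a.example/{i}" for i in range(200)]}
two_hosts = dict(one_host, **{"b.example": [f"http://b.example/{i}" for i in range(200)]})
pages_one = len(plan_fetches(one_host, 60))   # 120
pages_two = len(plan_fetches(two_hosts, 60))  # 240
```

Because each host is throttled independently in this model, total throughput scales with the number of hosts, which is why a crawl spread over many hosts has no fixed upper rate.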