This is a general library for implementing a web crawler.

The crawler works by retrieving an HTML page and transforming the HTML
(content + presentation) into pure content using XSLT stylesheets. By following
a convention for links in the converted content, a generic interface can be
built on the retrieved pages for navigating through the content.
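
To illustrate the link convention, here is a minimal Python sketch of how a
generic navigator might collect links from already converted content. The XSLT
step itself is left out. The <content> and <link> element names, the "name" and
"href" attributes, and the extract_links function are assumptions made for this
example, not the convention or API the library actually uses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical converted content: the element and attribute names are
# assumptions for illustration, not the library's actual convention.
CONVERTED = """\
<content>
  <title>Programme guide</title>
  <link name="today" href="guide/today.html"/>
  <link name="tomorrow" href="/guide/tomorrow.html"/>
</content>
"""


def extract_links(converted_xml, base_url):
    """Return {name: absolute URL} for every <link> in the converted content."""
    root = ET.fromstring(converted_xml)
    return {
        link.get("name"): urljoin(base_url, link.get("href"))
        for link in root.iter("link")
    }


links = extract_links(CONVERTED, "http://example.org/tv/index.html")
for name, url in links.items():
    print(name, "->", url)
```

Because every converted page would expose its links the same way, the
navigation code does not need to know anything about the original HTML layout.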

A configuration file determines how each page is retrieved and transformed.
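
As a rough sketch of what such a configuration could express, the Python
snippet below maps URL patterns to a retrieval method and a stylesheet. The
PageRule fields, the rule list, the stylesheet file names and the
first-match-wins lookup are all assumptions for illustration. The actual
configuration file format is not described here.

```python
import re
from dataclasses import dataclass


@dataclass
class PageRule:
    url_pattern: str  # regular expression matched against the whole URL
    method: str       # how to retrieve the page, e.g. "GET" or "POST"
    stylesheet: str   # XSLT stylesheet used to transform the page


# Hypothetical rules; the first matching rule wins.
RULES = [
    PageRule(r"http://example\.org/tv/.*\.html", "GET", "tv-page.xsl"),
    PageRule(r"http://example\.org/search", "POST", "search-results.xsl"),
]


def find_rule(url, rules=RULES):
    """Return the first rule whose pattern matches the full URL, or None."""
    for rule in rules:
        if re.fullmatch(rule.url_pattern, url):
            return rule
    return None


rule = find_rule("http://example.org/tv/index.html")
print(rule.method, rule.stylesheet)
```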