crawler/kiss/docs/content/xdocs/index.xml

Automatic Recording for KiSS Hard Disk Recorders

KiSS makes regular updates to their site that sometimes require adaptations to the crawler. If it stops working, check out the most recent version here.

Changelog

31 August 2006

  • Added a Windows bat file for running the crawler under Windows. Very ad hoc; this will be generalized.

24 August 2006

  • The crawler now uses desktop login for crawling. It is also much more efficient, since it no longer needs to crawl the individual programs: the channel page includes descriptions of programs in javascript popups which the crawler can use. The result is a significant reduction of the load on the KiSS EPG site. In addition, the delay between requests has been increased to reduce the load on the KiSS EPG site even further.
  • The crawler now crawls programs for tomorrow instead of for today.
  • The web based crawler is configured to run only between 7pm and 12pm. It used to run at 5am.

13-20 August 2006

There were several changes to the login procedure, requiring modifications to the crawler.

  • The crawler now uses the 'Referer' header field correctly at login.
  • KiSS now uses hidden form fields in their login process, which are now also handled correctly by the crawler.

Overview

Recording programs on a KiSS hard disk recorder usually follows certain patterns. Often you are looking for the same programs and for certain types of programs. So, wouldn't it be nice to have a program do this work for you and automatically record programs and notify you of possibly interesting ones?

This is where the KiSS crawler comes in. This is a simple crawler which visits the KiSS electronic programme guide and retrieves programme information from there. Then, based on that, it automatically records programs for you or sends notifications about interesting ones.

In its current version, the crawler can be used in two ways:

  • as a standalone program, run manually or as a scheduled task
  • as a web application deployed in an application server

Downloading

At this moment, no formal releases have been made and only the latest version can be downloaded.

The easiest way to start is to use the standalone program binary version or the web application.

The latest source can be obtained from subversion at the URL https://wamblee.org/svn/public/utils. The subversion repository allows read-only access to anyone.

The application was developed and tested on SuSE Linux 9.1 with the JBoss 4.0.2 application server (only required for the web application). It requires a Java Virtual Machine 1.5 or greater to run.

Configuring the crawler

The crawler comes with three configuration files:

  • crawler.xml: the crawler configuration
  • programs.xml: the program configuration (which programs to record or to be notified about)
  • org.wamblee.crawler.properties: the notification configuration

For the standalone program, all configuration files are in the conf directory. For the web application, the properties file is located in the WEB-INF/classes directory of the web application, and crawler.xml and programs.xml are located outside of the web application at a location configured in the properties file.

Crawler configuration (crawler.xml)

First of all, copy the config.xml.example file to config.xml. After that, edit the first entry of that file and replace user and passwd with your personal user id and password for the KiSS Electronic Programme Guide.

Program configuration

Interesting TV shows are described using program elements. Each program element contains one or more match elements that describe a condition that the interesting program must match.

Matching can be done on the following properties of a program:

  Field name    Description
  name          Program name
  description   Program description
  channel       Channel name
  keywords      Keywords/classification of the program

The field to match is specified using the field attribute of the match element. If no field name is specified, the program name is matched. Matching is done by converting the field value to lowercase and then performing a perl-like regular expression match against the provided value. As a result, the content of the match element should be specified in lower case, otherwise the pattern will never match. If multiple match elements are specified for a given program element, then all matches must apply for a program to be interesting.

Example patterns:

  Pattern          Example of matching field values
  the.*x.*files    "The X files", "The X-Files: the making of"
  star trek        "Star Trek Voyager", "Star Trek: The next generation"
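
The sketch below illustrates how these elements could fit together in programs.xml. Only the program and match elements and the field attribute are taken from the description above; the surrounding root element and the example channel value are assumptions, so consult the programs.xml shipped with the distribution for the authoritative layout.

  <!-- Illustrative sketch only; see the programs.xml in the distribution
       for the authoritative structure. -->
  <programs>  <!-- assumed root element -->
    <program>
      <!-- no field attribute: the pattern is matched against the program name -->
      <match>star trek</match>
      <!-- match on the channel field; patterns must be written in lower case.
           Both match elements must apply for the program to be interesting. -->
      <match field="channel">nederland 1</match>
    </program>
  </programs>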

It is possible that different programs cannot both be recorded because they overlap. To deal with such conflicts, it is possible to specify a priority using the priority element. Higher values mean a higher priority. If two programs have the same priority, then it is (more or less) unspecified which of the two will be recorded, but at least one program will be recorded. If no priority is specified, the priority defaults to 1 (one).

Since it is not always desirable to try to record every program that matches the criteria, it is also possible to generate notifications for interesting programs without recording them. This is done by specifying the action element with the content notify. By default, the action is record. To make the mail reports more readable, it is also possible to assign a category to a program for grouping interesting programs. This can be done using the category element. Note that if the action is notify, then the priority element is not used.
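
Continuing the hypothetical sketch from above: the priority, action and category element names come from the description, while placing them as children of program and the example values are assumptions.

  <!-- Illustrative sketch only, inside the assumed programs root element. -->
  <program>
    <match>the.*x.*files</match>
    <!-- preferred over overlapping programs with a lower priority -->
    <priority>2</priority>
  </program>
  <program>
    <match field="keywords">documentary</match>
    <!-- notify by mail only, do not record; priority is ignored for notify -->
    <action>notify</action>
    <category>Documentaries</category>
  </program>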

Notification configuration

Edit the configuration file org.wamblee.crawler.properties. The properties file is self-explanatory.

Installing and running the crawler

Standalone application

In the binary distribution, execute the run script for your operating system (run.bat for Windows, and run.sh for Unix).

Web application

After deploying the web application, navigate to the application in your browser (e.g. http://localhost:8080/wamblee-crawler-kissweb). The screen should show an overview of the last time it ran (if it ran before), as well as a button to run the crawler immediately. Also, the result of the last run can be viewed. The crawler will run automatically every morning at 5 AM local time, and will retry at 1 hour intervals in case of failure to retrieve programme information.

Source distribution

With the source code, build everything with ant dist-lite, then locate the binary distribution in lib/wamblee/crawler/kiss/kiss-crawler-bin.zip. Then proceed as for the binary distribution.

General usage

When the crawler runs, it retrieves the programs for tomorrow. As a result, it is advisable to run the program at an early point of the day as a scheduled task (e.g. using cron on unix). For the web application this is preconfigured at 5 AM.

If you deploy the web application today, it will run automatically on the next (!) day. This even holds if you deploy the application before the normal scheduled time.

Modifying the program to allow it to investigate tomorrow's programs instead is easy as well but not yet implemented.

Examples

The best example is in the distribution itself: it is my personal programs.xml file.

Contributing

You are always welcome to contribute. If you find a problem, just tell me about it, and if you have ideas, I am always interested to hear about them.

If you are a programmer and have a fix for a bug, just send me a patch, and if you are fanatic enough and have ideas, I can also give you write access to the repository.