Automatic Recording for KiSS Hard Disk Recorders

KiSS makes regular updates to their site that sometimes require adaptations to the crawler. If it stops working, check out the most recent version here.
Changelog

21 November 2006
• Corrected the config.xml again.
• Corrected errors in the documentation for the web application. It starts running at 19:00 and not at 5:00.

19 November 2006
• Corrected the config.xml file to deal with changes in the login procedure.

17 November 2006
• Corrected the packed distributions. The standalone distribution had an error in the scripts and was missing libraries.

7 September 2006
• KiSS modified the login procedure. The crawler is now working again.
• Generalized the startup scripts. They should now be insensitive to the specific libraries used.

31 August 2006
• Added a Windows bat file for running the crawler under Windows. Very ad hoc; this will be generalized.

24 August 2006
• The crawler now uses desktop login for crawling. It is also much more efficient, since it no longer needs to crawl the individual programs: the channel page includes descriptions of programs in javascript popups which the crawler can use. The result is a significant reduction of the load on the KiSS EPG site. In addition, the delay between requests has been increased to reduce that load even further.
• The crawler now crawls programs for tomorrow instead of for today.
• The web-based crawler is configured to run only between 7pm and midnight. It used to run at 5am.

13-20 August 2006
There were several changes to the login procedure, requiring modifications to the crawler.
• The crawler now uses the 'Referer' header field correctly at login.
• KiSS now uses hidden form fields in their login process; these are also handled correctly by the crawler.
Overview

In 2005, KiSS introduced the ability to schedule recordings on KiSS hard disk recorders (such as the DP-558) through a web site on the internet. When a new recording is scheduled through the web site, the KiSS recorder finds out about it by polling a server on the internet. This is a really cool feature since it basically allows programming the recorder when away from home.

After using this feature for some time, I started noticing regular patterns. Often you are looking for the same programs and for certain types of programs. So, wouldn't it be nice to have a program do this work for you and automatically record programs and notify you of possibly interesting ones?

This is where the KiSS crawler comes in. It is a simple crawler which logs on to the KiSS electronic programme guide web site and gets programme information from there. Based on that information, it automatically records programs for you or sends notifications about interesting ones.

In its current version, the crawler can be used in two ways:
• standalone program: a program run from the command line or as a scheduled task.
• web application: a web application running on a Java application server. With this type of use, the crawler also features an automatic retry mechanism in case of failures, as well as a simple web interface.
Downloading

At this moment, no formal releases have been made and only the latest version can be downloaded.

The easy way to start is to use the standalone program binary version or the web application.

The latest source can be obtained from subversion at the URL https://wamblee.org/svn/public/utils. The subversion repository allows read-only access to anyone.

The application was developed and tested on SuSE Linux 10.1 with the JBoss 4.0.4 application server. An application server or servlet container is only required for the web application. The crawler requires a Java Virtual Machine 1.5 or greater to run.
Configuring the crawler

The crawler comes with three configuration files:
• crawler.xml: basic crawler configuration tailored to the KiSS electronic programme guide.
• programs.xml: a description of which programs must be recorded and which programs are interesting.
• org.wamblee.crawler.properties: general configuration, including notification settings and, for the web application, the locations of the other configuration files.

For the standalone program, all configuration files are in the conf directory. For the web application, the properties file is located in the WEB-INF/classes directory of the web application, and crawler.xml and programs.xml are located outside of the web application at a location configured in the properties file.
Crawler configuration crawler.xml

First of all, copy the config.xml.example file to config.xml. After that, edit the first entry of that file and replace user and passwd with your personal user id and password for the KiSS Electronic Programme Guide.
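For illustration, the edited entry might look roughly like the sketch below. This is a hypothetical sketch: the element names and surrounding structure are defined by config.xml.example in the distribution, so follow that file wherever it differs; the credential values here are obviously placeholders.

  <!-- Hypothetical sketch of the first entry of config.xml.
       Replace the placeholder values with your own KiSS EPG credentials. -->
  <user>jdoe@example.com</user>
  <passwd>secret</passwd>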
Program configuration

Interesting TV shows are described using program elements. Each program element contains one or more match elements that describe a condition that the interesting program must match.
Matching can be done on the following properties of a program:

Field name     Description
name           Program name
description    Program description
channel        Channel name
keywords       Keywords/classification of the program.

The field to match is specified using the field attribute of the match element. If no field name is specified, the program name is matched. Matching is done by converting the field value to lowercase and then doing a perl-like regular expression match of the provided value. As a result, the content of the match element should be specified in lower case; otherwise the pattern will never match. If multiple match elements are specified for a given program element, then all matches must apply for a program to be interesting.
Example patterns:

Pattern          Examples of matching field values
the.*x.*files    "The X files", "The X-Files: the making of"
star trek        "Star Trek Voyager", "Star Trek: The next generation"
It is possible that different programs cannot be recorded because they overlap. To deal with such conflicts, it is possible to specify a priority using the priority element. Higher priority values mean a higher priority. If two programs have the same priority, then it is (more or less) unspecified which of the two will be recorded, but at least one of them will be recorded. If no priority is specified, then the priority is 1 (one).

Since it is not always desirable to try to record every program that matches the criteria, it is also possible to generate notifications for interesting programs without recording them. This is done by specifying the action element with the content notify. By default, the action is record. To make the mail reports more readable, it is possible to also assign a category to a program for grouping interesting programs. This can be done using the category element. Note that if the action is notify, then the priority element is not used.
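Putting these elements together, a sketch of what entries in programs.xml might look like is shown below. The program, match, priority, action, and category elements are taken from the description above; the example values and exact layout are assumptions, so treat the programs.xml shipped with the distribution as the authoritative example.

  <!-- Hypothetical sketch; see the programs.xml in the distribution for the real format. -->
  <program>
    <!-- Record anything whose name matches 'star trek' on the channel 'net 5'. -->
    <match>star trek</match>
    <match field="channel">net 5</match>
    <!-- Preferred over programs with the default priority of 1. -->
    <priority>2</priority>
  </program>
  <program>
    <!-- Only notify about documentaries; group them under 'nature' in the mail report. -->
    <match field="keywords">documentary</match>
    <action>notify</action>
    <category>nature</category>
  </program>

Note that the match content is written entirely in lower case, as the matching rules above require.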
Notification configuration

Edit the configuration file org.wamblee.crawler.properties. The properties file is self-explanatory.
Installing and running the crawler

Standalone application

In the binary distribution, execute the run script for your operating system (run.bat for Windows, and run.sh for Unix).
Web application

After deploying the web application, navigate to the application in your browser (e.g. http://localhost:8080/wamblee-crawler-kissweb). The screen should show an overview of the last time it ran (if it ran before), as well as a button to run the crawler immediately. Also, the result of the last run can be viewed. The crawler will run automatically starting after 19:00, and will retry at 1 hour intervals in case of failure to retrieve programme information.

Since the crawler checks the status at 1 hour intervals, it can run for the first time anytime between 19:00 and 20:00. This is done on purpose: it means that crawlers run by different people will not all start simultaneously, which is friendlier to the KiSS servers.
Source distribution

With the source code, build everything with maven2 as follows:

  mvn -Dmaven.test.skip=true install
  cd crawler
  mvn package assembly:assembly

After this, locate the binary distribution in the target subdirectory of the crawler directory, then proceed as for the binary distribution.
- +
General usage

When the crawler runs, it retrieves the programs for tomorrow.

If you deploy the web application today, it will run automatically on the next (!) day. This even holds if you deploy the application before the normal scheduled time.
Examples

The best example is in the distribution itself. It is my personal programs.xml file.
Contributing

You are always welcome to contribute. If you find a problem, just tell me about it, and if you have ideas, I am always interested in hearing about them.

If you are a programmer and have a fix for a bug, just send me a patch, and if you are fanatic enough and have ideas, I can also give you write access to the repository.