<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright 2002-2004 The Apache Software Foundation or its licensors,
  as applicable.

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
  <header>
    <title>Automatic Recording for KiSS Hard Disk Recorders</title>
  </header>
  <body>
    <warning>
      KiSS regularly updates its site, and these updates sometimes require adaptations
      to the crawler. If the crawler stops working, check out the most recent version here.
    </warning>
    <section id="changelog">
      <title>Changelog</title>
      <section>
        <title>13-20 August 2006</title>
        <p>
          There were several changes to the KiSS login procedure, which required modifications to the crawler.
        </p>
        <ul>
          <li>The crawler now uses the 'Referer' header field correctly at login.</li>
          <li>KiSS now uses hidden form fields in its login process; the crawler now handles
              these correctly as well.</li>
        </ul>
      </section>
    </section>
    <section id="overview">
      <title>Overview</title>

      <p>
        In 2005, <a href="site:links/kiss">KiSS</a> introduced the ability
        to schedule recordings on KiSS hard disk recorders (such as the
        DP-558) through a web site on the internet. When a new recording is
        scheduled through the web site, the KiSS recorder finds out about
        it by polling a server on the internet.
        This is a really cool feature, since it lets you program
        the recorder while away from home.
      </p>
      <p>
        After using this feature for some time, I started noticing regular
        patterns: you often look for the same programs and for certain
        types of programs. So, wouldn't it be nice to have a program
        do this work for you and automatically record programs and notify you
        of possibly interesting ones?
      </p>
      <p>
        This is where the KiSS crawler comes in. It is a simple crawler that
        logs on to the KiSS electronic programme guide web site and retrieves
        programme information from there. Based on that information, it automatically
        records programs for you or sends notifications about interesting ones.
      </p>
      <p>
        In its current version, the crawler can be used in two ways:
      </p>
      <ul>
        <li><strong>standalone program</strong>: A standalone program run as a scheduled task.</li>
        <li><strong>web application</strong>: A web application running on a Java
          application server. With this type of use, the crawler also features an automatic retry
          mechanism in case of failures, as well as a simple web interface.</li>
      </ul>
    </section>

    <section>
      <title>Downloading</title>

      <p>
        At this moment, no formal releases have been made and only the latest
        version can be downloaded.
      </p>
      <p>
        The easiest way to start is with the
        <a href="installs/crawler/kiss/kiss-crawler-bin.zip">standalone program binary version</a>
        or the <a href="installs/crawler/kissweb/wamblee-crawler-kissweb.war">web
          application</a>.
      </p>
      <p>
        The latest source can be obtained from Subversion at the
        URL <code>https://wamblee.org/svn/public/utils</code>. The Subversion
        repository allows read-only access to anyone.
      </p>
      <p>
        The application was developed and tested on SuSE Linux 9.1 with the JBoss 4.0.2 application
        server (only required for the web application). It requires a Java Virtual Machine
        version 1.5 or later to run.
      </p>
    </section>

    <section>
      <title>Configuring the crawler</title>

      <p>
        The crawler comes with three configuration files:
      </p>
      <ul>
        <li><code>crawler.xml</code>: basic crawler configuration
          tailored to the KiSS electronic programme guide.</li>
        <li><code>programs.xml</code>: a description of which
          programs must be recorded and which programs are interesting.</li>
        <li><code>org.wamblee.crawler.properties</code>: the notification configuration and,
          for the web application, the location of the other two configuration files.</li>
      </ul>
      <p>
        For the standalone program, all configuration files are in the <code>conf</code> directory.
        For the web application, the properties file is located in the <code>WEB-INF/classes</code>
        directory of the web application, and <code>crawler.xml</code> and <code>programs.xml</code>
        are located outside of the web application at a location configured in the properties file.
      </p>

      <section>
        <title>Crawler configuration <code>crawler.xml</code></title>

        <p>
          First of all, copy the <code>config.xml.example</code> file
          to <code>config.xml</code>. After that, edit the first entry of
          that file and replace <code>user</code> and <code>passwd</code>
          with your personal user id and password for the KiSS Electronic
          Programme Guide.
        </p>
      </section>

        <section>
          <title>Program configuration <code>programs.xml</code></title>
          <p>
            Interesting TV shows are described using <code>program</code>
            elements. Each <code>program</code> element contains
            one or more <code>match</code> elements, each of which describes
            a condition that an interesting program must satisfy.
          </p>
          <p>
            Matching can be done on the following properties of a program:
          </p>
          <table>
            <tr><th>Field name</th>
            <th>Description</th></tr>
            <tr>
              <td>name</td>
              <td>Program name</td>
            </tr>
            <tr>
              <td>description</td>
              <td>Program description</td>
            </tr>
            <tr>
              <td>channel</td>
              <td>Channel name</td>
            </tr>
            <tr>
              <td>keywords</td>
              <td>Keywords/classification of the program.</td>
            </tr>
          </table>
          <p>
            The field to match is specified using the <code>field</code>
            attribute of the <code>match</code> element. If no field name
            is specified, the program name is matched. Matching is done
            by converting the field value to lowercase and then doing a
            Perl-like regular expression match against the content of the
            <code>match</code> element. As a result, that content must be written in
            lowercase; otherwise the pattern will never match.
            If multiple <code>match</code> elements are specified for a
            given <code>program</code> element, then all of them must
            match for a program to be considered interesting.
          </p>
          <p>
            Example patterns:
          </p>
          <table>
            <tr>
              <th>Pattern</th>
              <th>Example of matching field values</th>
            </tr>
            <tr>
              <td>the.*x.*files</td>
              <td>"The X files", "The X-Files: the making of"</td>
            </tr>
            <tr>
              <td>star trek</td>
              <td>"Star Trek Voyager", "Star Trek: The next generation"</td>
            </tr>
          </table>
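          <p>
            As an illustration, the sketch below shows how the matching rules above could look in
            <code>programs.xml</code>. It uses only the <code>program</code> and
            <code>match</code> elements and the <code>field</code> attribute described
            in this section. The enclosing root element is left out because it is not described
            here, and the channel pattern <code>bbc</code> is a made-up example. Check the
            <code>programs.xml</code> in the distribution for the exact layout.
          </p>
          <source><![CDATA[
<!-- No field attribute: the pattern is matched against the program name. -->
<program>
  <match>star trek</match>
</program>

<!-- Two conditions: both must match for the program to be interesting. -->
<!-- The channel pattern is hypothetical; patterns are written in lowercase. -->
<program>
  <match field="name">the.*x.*files</match>
  <match field="channel">bbc</match>
</program>
]]></source>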

          <p>
            Different programs sometimes cannot all be recorded
            because they overlap. To deal with such conflicts, you can
            specify a priority using the <code>priority</code> element.
            Higher values mean a higher priority.
            If two overlapping programs have the same priority, it is (more or less)
            unspecified which of the two will be recorded, but at least
            one of them will be. If no priority is specified, the
            priority is 1 (one).
          </p>
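          <p>
            A minimal sketch of a priority setting, assuming that the <code>priority</code>
            element is a direct child of <code>program</code> (its exact position is not spelled
            out here). The patterns and the value 10 are illustrative only.
          </p>
          <source><![CDATA[
<!-- Preferred when it overlaps with a lower-priority program. -->
<program>
  <match>star trek</match>
  <priority>10</priority>
</program>

<!-- No priority element: the default priority of 1 applies. -->
<program>
  <match>the.*x.*files</match>
</program>
]]></source>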

          <p>
            Since it is not always desirable to try to record every
            program that matches the criteria, it is also possible to
            generate notifications for interesting programs without
            recording them. This is done by specifying the
            <code>action</code> element with the content <code>notify</code>.
            By default, the <code>action</code> is <code>record</code>.
            To make the mail reports more readable, it is also possible to
            assign a category to a program for grouping interesting
            programs. This can be done using the <code>category</code>
            element. Note that if the <code>action</code> is
            <code>notify</code>, then the <code>priority</code> element
            is not used.
          </p>
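          <p>
            The following sketch combines these options. It assumes that <code>action</code>
            and <code>category</code> are child elements of <code>program</code>; the category
            names and patterns are invented for illustration.
          </p>
          <source><![CDATA[
<!-- Only report matching programs in the mail; do not record them. -->
<program>
  <match field="keywords">documentary</match>
  <action>notify</action>
  <category>documentaries</category>
</program>

<!-- Explicit form of the default action. -->
<program>
  <match>star trek</match>
  <action>record</action>
  <category>sci-fi</category>
</program>
]]></source>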

        </section>

      <section>
        <title>Notification configuration</title>
        <p>
          Edit the configuration file <code>org.wamblee.crawler.properties</code>.
          The properties file is self-explanatory.
        </p>
      </section>
    </section>

    <section>
      <title>Installing and running the crawler</title>

      <section>
        <title>Standalone application</title>
        <p>
          In the binary distribution, execute the
          <code>run</code> script for your operating system
          (<code>run.bat</code> for Windows, and
          <code>run.sh</code> for Unix).
        </p>
      </section>

      <section>
        <title>Web application</title>
        <p>
          After deploying the web application, navigate to the
          application in your browser (e.g.
          <code>http://localhost:8080/wamblee-crawler-kissweb</code>).
          The screen should show an overview of the last time the crawler ran (if
          it ran before) as well as a button to run the crawler immediately.
          The result of the last run can also be viewed.
          The crawler runs automatically every morning at 5 AM local time
          and retries at one-hour intervals if it fails to retrieve
          programme information.
        </p>
      </section>

      <section>
        <title>Source distribution</title>
        <p>
          With the source code, build everything with
          <code>ant dist-lite</code>, then locate the binary
          distribution in <code>lib/wamblee/crawler/kiss/kiss-crawler-bin.zip</code>.
          Then proceed as for the binary distribution.
        </p>
      </section>

      <section>
        <title>General usage</title>
        <p>
          When the crawler runs, it
          retrieves the programs for today. As a result, it is advisable
          to run the program early in the day as a scheduled
          task (e.g. cron on Unix). For the web application, this is
          preconfigured at 5 AM.
        </p>
        <note>
          If you deploy the web application today, it will run automatically
          on the next (!) day. This holds even if you deploy the application
          before 5 AM.
        </note>

        <p>
          Modifying the program so that it looks at tomorrow's
          programs instead would be easy, but this is not yet implemented.
        </p>
      </section>

    </section>

    <section id="examples">
      <title>Examples</title>

      <p>
        The best example is in the distribution itself: my personal
        <code>programs.xml</code> file.
      </p>
    </section>

    <section>
      <title>Contributing</title>

      <p>
        You are always welcome to contribute. If you find a problem, just
        tell me about it, and if you have ideas, I am always interested to
        hear about them.
      </p>
      <p>
        If you are a programmer and have a fix for a bug, just send me a
        patch. If you are fanatical enough and have ideas, I can also
        give you write access to the repository.
      </p>
    </section>

  </body>
</document>