   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!--
   3   Copyright 2002-2004 The Apache Software Foundation or its licensors,
   4   as applicable.
   5
   6   Licensed under the Apache License, Version 2.0 (the "License");
   7   you may not use this file except in compliance with the License.
   8   You may obtain a copy of the License at
   9
  10       http://www.apache.org/licenses/LICENSE-2.0
  11
  12   Unless required by applicable law or agreed to in writing, software
  13   distributed under the License is distributed on an "AS IS" BASIS,
  14   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  15   See the License for the specific language governing permissions and
  16   limitations under the License.
  17 -->
  18 <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
  19 <document>
  20   <header>
  21     <title>Automatic Recording for KiSS Hard Disk Recorders</title>
  22   </header>
  23   <body>
  24     <warning>
  25       KiSS makes regular updates to their site that sometimes require adaptations
  26       to the crawler. If it stops working, check out the most recent version here.
  27     </warning>
  28     <section id="changelog">
  29       <title>Changelog</title>
  30       <section>
  31        <title>17 November 2006</title>
  32         <ul>
  33           <li>Corrected the packaged distributions. The standalone distribution
  34               had an error in the scripts and was missing libraries.</li>
  35
  36         </ul>
  37        </section>
  38           <section>
  39         <title>7 September 2006</title>
  40         <ul>
  41           <li>KiSS modified the login procedure. It is now working again.</li>
  42           <li>Generalized the startup scripts. They should now be insensitive to the specific libraries used. </li>
  43         </ul>
  44       </section>
  45       <section>
  46         <title>31 August 2006</title>
  47         <ul>
  48           <li>Added a Windows bat file for running the crawler under Windows.
  49               Very ad hoc; it will be generalized.</li>
  50         </ul>
  51       </section>
  52       <section>
  53         <title>24 August 2006</title>
  54         <ul>
  55           <li>The crawler now uses desktop login for crawling. Also, it is much more efficient since
  56           it no longer needs to crawl the individual programs. This is because the channel page
  57             includes descriptions of programs in javascript popups which can be used by the crawler.
  58           The result is a significant reduction of the load on the KiSS EPG site. Also, the delay
  59             between requests has been increased to further reduce load on the KiSS EPG site. </li>
  60           <li>
  61             The crawler now crawls programs for tomorrow instead of for today.
  62           </li>
  63           <li>
  64             The web-based crawler is configured to run only between 7pm and midnight. It used to run at
  65             5am.
  66           </li>
  67         </ul>
  68       </section>
  69
  70       <section>
  71         <title>13-20 August 2006</title>
  72         <p>
  73           There were several changes to the login procedure, requiring modifications to the crawler.
  74         </p>
  75         <ul>
  76           <li>The crawler now uses the 'Referer' header field correctly at login.</li>
  77           <li>KiSS now uses hidden form fields in their login process which are now also handled correctly by the
  78               crawler.</li>
  79         </ul>
  80       </section>
  81     </section>
  82     <section id="overview">
  83       <title>Overview</title>
  84
  85       <p>
  86         In 2005, <a href="site:links/kiss">KiSS</a> introduced the ability
  87         to schedule recordings on a KiSS hard disk recorder (such as the
  88         DP-558) through a web site on the internet. When a new recording is
  89         scheduled through the web site, the KiSS recorder finds out about
  90         this new recording by polling a server on the internet.
  91         This is a really cool feature since it basically allows programming
  92         the recorder when away from home.
  93       </p>
  94       <p>
  95         After using this feature for some time now, I started noticing regular
  96         patterns. Often you are looking for the same programs and for certain
  97         types of programs. So, wouldn't it be nice to have a program
  98         do this work for you and automatically record programs and notify you
  99         of possibly interesting ones?
 100       </p>
 101       <p>
 102         This is where the KiSS crawler comes in. This is a simple crawler which
 103         logs on to the KiSS electronic programme guide web site and gets
 104         programme information from there. Then based on that it automatically
 105         records programs for you or sends notifications about interesting ones.
 106       </p>
 107       <p>
 108         In its current version, the crawler can be used in two ways:
 109       </p>
 110       <ul>
 111         <li><strong>standalone program</strong>: A standalone program run as a scheduled task.</li>
 112         <li><strong>web application</strong>: A web application running on a java
 113           application server. With this type of use, the crawler also features an automatic retry
 114           mechanism in case of failures, as well as a simple web interface. </li>
 115       </ul>
 116     </section>
 117
 118     <section>
 119       <title>Downloading</title>
 120
 121       <p>
 122         At this moment, no formal releases have been made and only the latest
 123         version can be downloaded.
 124       </p>
 125       <p>
 126         The easy way to start is the
 127         <a href="installs/crawler/target/wamblee-crawler-0.2-SNAPSHOT-kissbin.zip">standalone program binary version</a>
 128         or using the <a href="installs/crawler/kissweb/target/wamblee-crawler-kissweb.war">web
 129           application</a>.
 130       </p>
 131       <p>
 132         The latest source can be obtained from subversion with the
 133         URL <code>https://wamblee.org/svn/public/utils</code>. The subversion
 134         repository allows read-only access to anyone.
 135       </p>
 136       <p>
 137         The application was developed and tested on SuSE Linux 9.1 with the JBoss 4.0.2 application
 138         server (only required for the web application). It requires a Java Virtual Machine
 139         version 1.5 or greater to run.
 140       </p>
 141     </section>
 142
 143     <section>
 144       <title>Configuring the crawler</title>
 145
 146       <p>
 147         The crawler comes with three configuration files:
 148       </p>
 149       <ul>
 150         <li><code>crawler.xml</code>: basic crawler configuration
 151           tailored to the KiSS electronic programme guide.</li>
 152         <li><code>programs.xml</code>: containing a description of which
 153           programs must be recorded and which programs are interesting.</li>
 154         <li><code>org.wamblee.crawler.properties</code>: containing the notification settings and, for the web application, the location of the other two configuration files.</li>
 155       </ul>
 156       <p>
 157         For the standalone program, all configuration files are in the <code>conf</code> directory.
 158         For the web application, the properties file is located in the <code>WEB-INF/classes</code>
 159         directory of the web application, and <code>crawler.xml</code> and <code>programs.xml</code>
 160         are located outside of the web application at a location configured in the properties file.
 161       </p>
 162
 163
 164       <section>
 165         <title>Crawler configuration <code>crawler.xml</code></title>
 166
 167         <p>
 168           First of all, copy the <code>config.xml.example</code> file
 169           to <code>config.xml</code>. After that, edit the first entry of
 170           that file and replace <code>user</code> and <code>passwd</code>
 171           with your personal user id and password for the KiSS Electronic
 172           Programme Guide.
 173         </p>
 174       </section>
 175
 176         <section>
 177           <title>Program configuration</title>
 178           <p>
 179             Interesting TV shows are described using <code>program</code>
 180             elements. Each <code>program</code> element contains
 181             one or more <code>match</code> elements that describe
 182             a condition that the interesting program must match.
 183           </p>
 184           <p>
 185             Matching can be done on the following properties of a program:
 186           </p>
 187           <table>
 188             <tr><th>Field name</th>
 189             <th>Description</th></tr>
 190             <tr>
 191               <td>name</td>
 192               <td>Program name</td>
 193             </tr>
 194             <tr>
 195               <td>description</td>
 196               <td>Program description</td>
 197             </tr>
 198             <tr>
 199               <td>channel</td>
 200               <td>Channel name</td>
 201             </tr>
 202             <tr>
 203               <td>keywords</td>
 204               <td>Keywords/classification of the program.</td>
 205             </tr>
 206           </table>
 207           <p>
 208             The field to match is specified using the <code>field</code>
 209             attribute of the <code>match</code> element. If no field name
 210             is specified then the program name is matched. Matching is done
 211             by converting the field value to lowercase and then doing a
 212             Perl-like regular expression match of the provided value. As a
 213             result, the content of the match element should be specified in
 214             lower case; a pattern with upper-case literal characters will not match.
 215             If multiple <code>match</code> elements are specified for a
 216             given <code>program</code> element, then all matches must
 217             apply for a program to be interesting.
 218           </p>
 219           <p>
 220             Example patterns:
 221           </p>
 222           <table>
 223             <tr>
 224               <th>Pattern</th>
 225               <th>Example of matching field values</th>
 226             </tr>
 227             <tr>
 228               <td>the.*x.*files</td>
 229               <td>"The X files", "The X-Files: the making of"</td>
 230             </tr>
 231             <tr>
 232               <td>star trek</td>
 233               <td>"Star Trek Voyager", "Star Trek: The next generation"</td>
 234             </tr>
 235           </table>
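          <p>
            As an illustration of the matching rules above, the fragment below is a
            minimal sketch of two <code>program</code> elements. It is not taken from
            the distribution: the enclosing root element is omitted because it is not
            described here (copy it from the <code>programs.xml</code> in the
            distribution), and the <code>description</code> pattern is a made-up
            value. The example table suggests that a pattern may match anywhere in
            the field value, which is assumed here.
          </p>
          <source><![CDATA[
<!-- Sketch only: the enclosing root element is omitted. -->

<!-- No field attribute, so the pattern is matched against the program name. -->
<program>
  <match>star trek</match>
</program>

<!-- Two match elements: both must apply for the program to be interesting.
     The description pattern is a hypothetical example value. -->
<program>
  <match field="name">the.*x.*files</match>
  <match field="description">making of</match>
</program>
]]></source>
          <p>
            Both patterns are written in lower case because the field value is
            converted to lower case before matching.
          </p>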
 236
 237           <p>
 238             Programs that overlap in time cannot all be recorded.
 239             To deal with such conflicts, it is possible
 240             to specify a priority using the <code>priority</code> element.
 241             Higher values mean a higher priority.
 242             If two programs have the same priority, then it is (more or less)
 243             unspecified which of the two will be recorded, but at least
 244             one of them will be recorded. If no priority is specified, then the
 245             priority is 1 (one).
 246           </p>
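          <p>
            A possible way to express priorities is sketched below. It assumes that
            <code>priority</code> is a child element of <code>program</code> holding
            an integer, which is how the description above reads but is not
            confirmed here; the <code>keywords</code> pattern is a made-up value and
            the enclosing root element is again omitted.
          </p>
          <source><![CDATA[
<!-- Sketch only: element placement is an assumption. -->

<!-- Wins over the program below if the two overlap. -->
<program>
  <match>star trek</match>
  <priority>10</priority>
</program>

<!-- No priority element, so the default priority of 1 applies. -->
<program>
  <match field="keywords">documentary</match>
</program>
]]></source>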
 247
 248           <p>
 249             Since it is not always desirable to try to record every
 250             program that matches the criteria, it is also possible to
 251             generate notifications for interesting programs only without
 252             recording them. This is done by specifying the
 253             <code>action</code> element with the content <code>notify</code>.
 254             By default, the <code>action</code> is <code>record</code>.
 255             To make the mail reports more readable it is possible to
 256             also assign a category to a program for grouping interesting
 257             programs. This can be done using the <code>category</code>
 258             element. Note that if the <code>action</code> is
 259             <code>notify</code>, then the <code>priority</code> element
 260             is not used.
 261           </p>
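          <p>
            The following sketch shows a program that is only reported, not
            recorded. As before, it assumes that <code>action</code> and
            <code>category</code> are child elements of <code>program</code>; the
            pattern and the category name are made-up values, and the enclosing
            root element is omitted.
          </p>
          <source><![CDATA[
<!-- Sketch only: element placement and values are assumptions. -->
<program>
  <match field="keywords">film</match>
  <!-- Notify instead of the default action, record. -->
  <action>notify</action>
  <!-- Used to group interesting programs in the mail report. -->
  <category>films</category>
</program>
]]></source>
          <p>
            A <code>priority</code> element is left out here since it is not used
            when the action is <code>notify</code>.
          </p>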
 262
 263         </section>
 264
 265       <section>
 266         <title>Notification configuration</title>
 267         <p>
 268            Edit the configuration file <code>org.wamblee.crawler.properties</code>.
 269           The properties file is self-explanatory.
 270         </p>
 271       </section>
 272     </section>
 273
 274
 275
 276
 277     <section>
 278       <title>Installing and running the crawler</title>
 279
 280       <section>
 281         <title>Standalone application</title>
 282         <p>
 283           In the binary distribution, execute the
 284           <code>run</code> script for your operating system
 285           (<code>run.bat</code> for Windows, and
 286           <code>run.sh</code> for Unix).
 287         </p>
 288       </section>
 289
 290       <section>
 291         <title>Web application</title>
 292         <p>
 293           After deploying the web application, navigate to the
 294           application in your browser (e.g.
 295           <code>http://localhost:8080/wamblee-crawler-kissweb</code>).
 296           The screen should show an overview of the last time it ran (if
 297           it ran before) as well as a button to run the crawler immediately.
 298           Also, the result of the last run can be viewed.
 299           The crawler runs automatically every evening (see the changelog for the
 300           exact time window; earlier versions ran at 5 AM local time), and will retry
 301           at 1 hour intervals in case of failure to retrieve programme information.
 302         </p>
 303       </section>
 304
 305       <section>
 306         <title>Source distribution</title>
 307         <p>
 308           With the source code, build everything with
 309           <code>ant dist-lite</code>, then locate the binary
 310           distribution in <code>lib/wamblee/crawler/kiss/kiss-crawler-bin.zip</code>.
 311           Then proceed as for the binary distribution.
 312         </p>
 313       </section>
 314
 315       <section>
 316         <title>General usage</title>
 317         <p>
 318           When the crawler runs, it
 319           retrieves the programs for tomorrow. As a result, it is advisable
 320           to run the program in the evening as a scheduled
 321           task (e.g. cron on Unix). For the web application this is
 322           preconfigured (see the changelog for the current time window).
 323         </p>
 324         <note>
 325           If you deploy the web application today, it will run automatically
 326           on the next (!) day. This even holds if you deploy the application
 327           before the normal scheduled time.
 328         </note>
 329
 330         <p>
 331           Earlier versions retrieved the programs for the current day; since the
 332           24 August 2006 change the crawler retrieves tomorrow's programs instead.
 333         </p>
 334       </section>
 335
 336
 337     </section>
 338
 339     <section id="examples">
 340       <title>Examples</title>
 341
 342       <p>
 343         The best example is in the distribution itself. It is my personal
 344         <code>programs.xml</code> file.
 345       </p>
 346     </section>
 347
 348     <section>
 349       <title>Contributing</title>
 350
 351       <p>
 352         You are always welcome to contribute. If you find a problem, just
 353         tell me about it, and if you have ideas, I am always interested to
 354         hear about them.
 355       </p>
 356       <p>
 357         If you are a programmer and have a fix for a bug, just send me a
 358         patch and if you are fanatic enough and have ideas, I can also
 359         give you write access to the repository.
 360       </p>
 361     </section>
 362
 363
 364   </body>
 365 </document>