   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!--
   3   Copyright 2002-2004 The Apache Software Foundation or its licensors,
   4   as applicable.
   5
   6   Licensed under the Apache License, Version 2.0 (the "License");
   7   you may not use this file except in compliance with the License.
   8   You may obtain a copy of the License at
   9
  10       http://www.apache.org/licenses/LICENSE-2.0
  11
  12   Unless required by applicable law or agreed to in writing, software
  13   distributed under the License is distributed on an "AS IS" BASIS,
  14   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  15   See the License for the specific language governing permissions and
  16   limitations under the License.
  17 -->
  18 <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
  19 <document>
  20   <header>
  21     <title>Automatic Recording for KiSS Hard Disk Recorders</title>
  22   </header>
  23   <body>
  24     <warning> KiSS makes regular updates to their site that sometimes require adaptations to the
  25       crawler. If it stops working, check out the most recent version here. </warning>
  26     <section id="changelog">
  27       <title>Changelog</title>
  28       <section>
  29         <title>21 November 2006</title>
  30         <ul>
  31           <li>Corrected the <code>config.xml</code> again.</li>
  32           <li>Corrected errors in the documentation for the web application. It starts running at 19:00
  33               and not at 5:00.</li>
  34         </ul>
  35       </section>
  36       <section>
  37         <title>19 November 2006</title>
  38         <ul>
  39           <li>Corrected the <code>config.xml</code> file to deal with changes in the login procedure.</li>
  40         </ul>
  41       </section>
  42       <section>
  43         <title>17 November 2006</title>
  44         <ul>
  45           <li>Corrected the packaged distributions. The standalone distribution had an error in the
  46             scripts and was missing libraries.</li>
  47
  48         </ul>
  49       </section>
  50       <section>
  51         <title>7 September 2006</title>
  52         <ul>
  53           <li>KiSS modified the login procedure. It is now working again.</li>
  54           <li>Generalized the startup scripts. They should now be insensitive to the specific
  55             libraries used. </li>
  56         </ul>
  57       </section>
  58       <section>
  59         <title>31 August 2006</title>
  60         <ul>
  61           <li>Added a Windows bat file for running the crawler under Windows. Very ad hoc; it will be
  62             generalized. </li>
  63         </ul>
  64       </section>
  65       <section>
  66         <title>24 August 2006</title>
  67         <ul>
  68           <li>The crawler now uses desktop login for crawling. Also, it is much more efficient since
  69             it no longer needs to crawl the individual programs. This is because the channel page
  70             includes descriptions of programs in javascript popups which can be used by the crawler.
  71             The result is a significant reduction of the load on the KiSS EPG site. Also, the delay
  72             between requests has been increased to further reduce load on the KiSS EPG site. </li>
  73           <li> The crawler now crawls programs for tomorrow instead of for today. </li>
  74           <li> The web based crawler is configured to run only between 7pm and midnight. It used to run
  75             at 5am. </li>
  76         </ul>
  77       </section>
  78
  79       <section>
  80         <title>13-20 August 2006</title>
  81         <p> There were several changes to the login procedure, requiring modifications to the
  82           crawler. </p>
  83         <ul>
  84           <li>The crawler now uses the 'Referer' header field correctly at login.</li>
  85           <li>KiSS now uses hidden form fields in their login process which are now also handled
  86             correctly by the crawler.</li>
  87         </ul>
  88       </section>
  89     </section>
  90     <section id="overview">
  91       <title>Overview</title>
  92
  93       <p> In 2005, <a href="site:links/kiss">KiSS</a> introduced the ability to schedule recordings
  94         on KiSS hard disk recorders (such as the DP-558) through a web site on the internet. When a
  95         new recording is scheduled through the web site, the KiSS recorder finds out about this new
  96         recording by polling a server on the internet. This is a really cool feature since it
  97         basically allows programming the recorder when away from home. </p>
  98       <p> After using this feature for some time, I started noticing regular patterns. Often you
  99         are looking for the same programs and for certain types of programs. So, wouldn't it be nice
 100         to have a program do this work for you and automatically record programs and notify you of
 101         possibly interesting ones? </p>
 102       <p> This is where the KiSS crawler comes in. This is a simple crawler which logs on to the
 103         KiSS electronic programme guide web site and gets programme information from there. Then
 104         based on that it automatically records programs for you or sends notifications about
 105         interesting ones. </p>
 106       <p> In its current version, the crawler can be used in two ways: </p>
 107       <ul>
 108         <li><strong>standalone program</strong>:
 109         A standalone program run from the command-line or as a scheduled task.</li>
 110         <li><strong>web application</strong>: A web application running on a Java application
 111           server. With this type of use, the crawler also features an automatic retry mechanism in
 112           case of failures, as well as a simple web interface. </li>
 113       </ul>
 114     </section>
 115
 116     <section>
 117       <title>Downloading</title>
 118
 119       <p> At this moment, no formal releases have been made and only the latest version can be
 120         downloaded. </p>
 121       <p> The easy way to start is the <a
 122           href="installs/crawler/target/wamblee-crawler-0.2-SNAPSHOT-kissbin.zip">standalone program
 123           binary version</a> or using the <a
 124           href="installs/crawler/kissweb/target/wamblee-crawler-kissweb.war">web application</a>. </p>
 125       <p> The latest source can be obtained from subversion with the URL
 126           <code>https://wamblee.org/svn/public/utils</code>. The subversion repository allows
 127         read-only access to anyone. </p>
 128       <p> The application was developed and tested on SuSE Linux 10.1 with
 129         the JBoss 4.0.4 application
 130         server. An application server or servlet container is only required for the
 131         web application. The crawler requires a Java Virtual Machine
 132         1.5 or greater to run. </p>
 133     </section>
 134
 135     <section>
 136       <title>Configuring the crawler</title>
 137
 138       <p> The crawler comes with three configuration files: </p>
 139       <ul>
 140         <li><code>crawler.xml</code>: basic crawler configuration tailored to the KiSS electronic
 141           programme guide.</li>
 142         <li><code>programs.xml</code>: containing a description of which programs must be recorded
 143           and which programs are interesting.</li>
 144         <li><code>org.wamblee.crawler.properties</code>: containing the notification configuration and, for the web application, the location of the other two files.</li>
 145       </ul>
 146       <p> For the standalone program, all configuration files are in the <code>conf</code>
 147         directory. For the web application, the properties file is located in the
 148           <code>WEB-INF/classes</code> directory of the web application, and
 149         <code>crawler.xml</code> and <code>programs.xml</code> are located outside of the web
 150         application at a location configured in the properties file. </p>
 151
 152
 153       <section>
 154         <title>Crawler configuration <code>crawler.xml</code></title>
 155
 156         <p> First of all, copy the <code>config.xml.example</code> file to <code>config.xml</code>.
 157           After that, edit the first entry of that file and replace <code>user</code> and
 158             <code>passwd</code> with your personal user id and password for the KiSS Electronic
 159           Programme Guide. </p>
 160       </section>
 161
 162       <section>
 163         <title>Program configuration</title>
 164         <p> Interesting TV shows are described using <code>program</code> elements. Each
 165             <code>program</code> element contains one or more <code>match</code> elements that
 166           describe a condition that the interesting program must match. </p>
 167         <p> Matching can be done on the following properties of a program: </p>
 168         <table>
 169           <tr>
 170             <th>Field name</th>
 171             <th>Description</th>
 172           </tr>
 173           <tr>
 174             <td>name</td>
 175             <td>Program name</td>
 176           </tr>
 177           <tr>
 178             <td>description</td>
 179             <td>Program description</td>
 180           </tr>
 181           <tr>
 182             <td>channel</td>
 183             <td>Channel name</td>
 184           </tr>
 185           <tr>
 186             <td>keywords</td>
 187             <td>Keywords/classification of the program.</td>
 188           </tr>
 189         </table>
 190         <p> The field to match is specified using the <code>field</code> attribute of the
 191             <code>match</code> element. If no field name is specified then the program name is
 192           matched. Matching is done by converting the field value to lowercase and then doing a
 193           Perl-like regular expression match of the provided pattern. As a result, the content of the
 194           match element should be specified in lowercase; otherwise the pattern will not match. If
 195           multiple <code>match</code> elements are specified for a given <code>program</code>
 196           element, then all matches must apply for a program to be interesting. </p>
 197         <p> Example patterns: </p>
 198         <table>
 199           <tr>
 200             <th>Pattern</th>
 201             <th>Examples of matching field values</th>
 202           </tr>
 203           <tr>
 204             <td>the.*x.*files</td>
 205             <td>"The X files", "The X-Files: the making of"</td>
 206           </tr>
 207           <tr>
 208             <td>star trek</td>
 209             <td>"Star Trek Voyager", "Star Trek: The next generation"</td>
 210           </tr>
 211         </table>
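             <p> As a minimal sketch of the matching rules and patterns above, a <code>program</code>
               element that must match on both the program name and the channel could look as follows.
               Only the <code>program</code> and <code>match</code> elements and the <code>field</code>
               attribute are taken from this documentation. The channel value is a made-up placeholder,
               and the enclosing root element of <code>programs.xml</code> is left out because it is not
               described here. </p>
             <source><![CDATA[
               <program>
                 <!-- No field attribute: the pattern is matched against the program name. -->
                 <match>the.*x.*files</match>
                 <!-- Hypothetical channel name, in lowercase. Both match elements must apply. -->
                 <match field="channel">example channel</match>
               </program>
             ]]></source>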
 212
 213         <p> It is possible that different programs cannot be recorded since they overlap. To deal
 214           with such conflicts, it is possible to specify a priority using the <code>priority</code>
 215           element. Higher values mean a higher priority. If two programs have
 216           the same priority, then it is (more or less) unspecified which of the two will be
 217           recorded, but at least one of them will be recorded. If no priority is specified, then the
 218           priority is 1 (one). </p>
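             <p> The following sketch illustrates how priorities might be used to resolve an overlap. It
               assumes that <code>priority</code> is written as a child element of <code>program</code>,
               which this documentation does not state explicitly. The values are illustrative only. </p>
             <source><![CDATA[
               <program>
                 <match>the.*x.*files</match>
                 <!-- Assumed placement. The higher value is preferred when recordings overlap. -->
                 <priority>2</priority>
               </program>
               <program>
                 <!-- No priority element, so the default priority of 1 applies. -->
                 <match>star trek</match>
               </program>
             ]]></source>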
 219
 220         <p> Since it is not always desirable to try to record every program that matches the
 221           criteria, it is also possible to generate notifications for interesting programs only
 222           without recording them. This is done by specifying the <code>action</code> element with
 223           the content <code>notify</code>. By default, the <code>action</code> is
 224           <code>record</code>. To make the mail reports more readable it is possible to also assign
 225           a category to a program for grouping interesting programs. This can be done using the
 226             <code>category</code> element. Note that if the <code>action</code> is
 227           <code>notify</code>, then the <code>priority</code> element is not used. </p>
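             <p> A notification-only entry could then be sketched roughly as below. It assumes that
               <code>action</code> and <code>category</code> are child elements of <code>program</code>.
               The keyword pattern and the category name are hypothetical placeholders. </p>
             <source><![CDATA[
               <program>
                 <!-- Hypothetical keyword pattern, written in lowercase. -->
                 <match field="keywords">documentary</match>
                 <!-- Report the program in the mail instead of recording it. -->
                 <action>notify</action>
                 <!-- Hypothetical category name used to group programs in the report. -->
                 <category>documentaries</category>
               </program>
             ]]></source>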
 228
 229       </section>
 230
 231       <section>
 232         <title>Notification configuration</title>
 233         <p> Edit the configuration file <code>org.wamblee.crawler.properties</code>. The properties
 234           file is self-explanatory. </p>
 235       </section>
 236     </section>
 237
 238
 239
 240
 241     <section>
 242       <title>Installing and running the crawler</title>
 243
 244       <section>
 245         <title>Standalone application</title>
 246         <p> In the binary distribution, execute the <code>run</code> script for your operating
 247           system (<code>run.bat</code> for Windows, and <code>run.sh</code> for Unix). </p>
 248       </section>
 249
 250       <section>
 251         <title>Web application</title>
 252         <p> After deploying the web application, navigate to the application in your browser (e.g.
 253             <code>http://localhost:8080/wamblee-crawler-kissweb</code>). The screen should show an
 254           overview of the last time it ran (if it ran before) as well as a button to run the crawler
 255           immediately. Also, the result of the last run can be viewed. The crawler will run
 256           automatically starting after 19:00,
 257           and will retry at 1 hour intervals in case
 258           of failure to retrieve programme information.
 259           </p>
 260
 261           <p>
 262           Since the crawler checks the status at
 263           1 hour intervals it can run for the first time anytime between 19:00 and 20:00. This is done
 264           on purpose since it means that crawlers run by different people will not all start running
 265           simultaneously and is thus more friendly to the KiSS servers.  </p>
 266       </section>
 267
 268       <section>
 269         <title>Source distribution</title>
 270         <p> With the source code, build everything with Maven 2 as follows:</p>
 271         <source>
 272            mvn -Dmaven.test.skip=true install
 273            cd crawler
 274            mvn package assembly:assembly
 275         </source>
 276         <p>
 277           After this, locate the
 278           binary distribution in the <code>target</code> subdirectory of the <code>crawler</code>
 279           directory. Then
 280           proceed as for the binary distribution.</p>
 281
 282       </section>
 283
 284       <section>
 285         <title>General usage</title>
 286         <p> When the crawler runs, it retrieves the programs for tomorrow.
 287         </p>
 288         <note> If you deploy the web application today, it will run automatically on the next (!)
 289           day. This even holds if you deploy the application before the normal scheduled time. </note>
 290       </section>
 291
 292
 293     </section>
 294
 295     <section id="examples">
 296       <title>Examples</title>
 297
 298       <p> The best example is in the distribution itself. It is my personal
 299         <code>programs.xml</code> file. </p>
 300     </section>
 301
 302     <section>
 303       <title>Contributing</title>
 304
 305       <p> You are always welcome to contribute. If you find a problem, just tell me about it, and if
 306         you have ideas, I am always interested to hear about them. </p>
 307       <p> If you are a programmer and have a fix for a bug, just send me a patch and if you are
 308         fanatic enough and have ideas, I can also give you write access to the repository. </p>
 309     </section>
 310
 311
 312   </body>
 313 </document>