records programs for you or sends notifications about interesting ones.
</p>
<p>
- In its current version, the crawler can be used a standalone program
- only and the preferred way to run it is as a scheduled task.
+ In its current version, the crawler can be used in two ways:
</p>
+ <ul>
+ <li><strong>standalone program</strong>: A standalone program run as a scheduled task.</li>
+ <li><strong>web application</strong>: A web application running on a Java
+ application server. In this mode, the crawler also features an automatic retry
+ mechanism in case of failures, as well as a simple web interface.</li>
+ </ul>
</section>
<section>
</p>
<p>
The easy way to start is the
- <a href="installs/crawler/kiss/kiss-crawler-bin.zip">binary version</a>.
+ <a href="installs/crawler/kiss/kiss-crawler-bin.zip">standalone program binary version</a>
+ or the <a href="installs/crawler/kissweb/wamblee-crawler-kissweb.war">web
+ application</a>.
</p>
<p>
The latest source can be obtained from subversion with the
URL <code>https://wamblee.org/svn/public/utils</code>. The subversion
repository allows read-only access to anyone.
</p>
+ <p>
+ The application was developed and tested on SuSE Linux 9.1 with the JBoss 4.0.2 application
+ server (only required for the web application). It requires a Java Virtual Machine
+ version 1.5 or later to run.
+ </p>
</section>
<section>
<title>Configuring the crawler</title>
<p>
- The crawler comes with two configuration files, namely
- <code>crawler.xml</code> and <code>programs.xml</code>.
+ The crawler comes with three configuration files:
+ </p>
+ <ul>
+ <li><code>crawler.xml</code>: basic crawler configuration
+ tailored to the KiSS electronic programme guide.</li>
+ <li><code>programs.xml</code>: containing a description of which
+ programs must be recorded and which programs are interesting.</li>
+ <li><code>org.wamblee.crawler.properties</code>: containing the notification configuration and, for the web application, the location of the other two configuration files.</li>
+ </ul>
+ <p>
+ For the standalone program, all configuration files are in the <code>conf</code> directory.
+ For the web application, the properties file is located in the <code>WEB-INF/classes</code>
+ directory of the web application, and <code>crawler.xml</code> and <code>programs.xml</code>
+ are located outside of the web application at a location configured in the properties file.
</p>
+
<section>
<title>Crawler configuration <code>crawler.xml</code></title>
Programme Guide.
</p>
</section>
-
- <section>
- <title>Program configuration: <code>programs.xml</code></title>
-
- <p>
- The <code>programs.xml</code> file contains the following
- configuration items:
- </p>
- <ul>
- <li>Notification configuration: Describing how to
- do notification of the results of crawling the site. </li>
- <li>Zero or more configurations of interesting programs. </li>
- </ul>
- <section>
- <title>Notification configuration</title>
- <p>
- Notification is configured in the (surprise, surprise!)
- <code>notification</code> element. This notification element
- is used to configure respectively sender mail address (= reply
- address), recipient address, subject of the email, smtp server
- host and port and optional username and password.
- In addition it contains the names of the stylesheets to
- generate the HTML and Text reports. These stylesheets
- should not be changed.
- </p>
- </section>
-
+
<section>
<title>Program configuration</title>
<p>
</table>
<p>
- It is possible that different programs cannot be recorded at
+ It is possible that different programs cannot be recorded
since they overlap. To deal with such conflicts, it is possible
to specify a priority using the <code>priority</code> element.
Higher values of the priority value mean a higher priority.
</p>
</section>
-
-
+
+ <section>
+ <title>Notification configuration</title>
+ <p>
+ Edit the configuration file <code>org.wamblee.crawler.properties</code>.
+ The properties file is self-explanatory.
+ </p>
</section>
</section>
+
+
+
<section>
<title>Installing and running the crawler</title>
<section>
- <title>Binary distribution</title>
+ <title>Standalone application</title>
<p>
In the binary distribution, execute the
<code>run</code> script for your operating system
</p>
</section>
+ <section>
+ <title>Web application</title>
+ <p>
+ After deploying the web application, navigate to the
+ application in your browser (e.g.
+ <code>http://localhost:8080/wamblee-crawler-kissweb</code>).
+ The screen should show when the crawler last ran (if
+ it has run before) as well as a button to run the crawler immediately.
+ The result of the last run can also be viewed.
+ The crawler will run automatically every morning at 5 AM local time.
+ </p>
+ </section>
+
<section>
<title>Source distribution</title>
<p>
<section>
<title>General usage</title>
<p>
- The crawler, as it is now, is s standalone program which is
- intended to be run from a command-line. When it runs, it
+ When the crawler runs, it
retrieves the programs for today. As a result, it is advisable
to run the program at an early point of the day as a scheduled
- task (e.g. cron on unix).
+ task (e.g. cron on Unix). For the web application this is
+ preconfigured at 5 AM.
</p>
<p>
Modifying the program to allow it to investigate tomorrow's
<p>
The best example is in the distribution itself. It is my personal
- <code>programs.xml</code> file.
+ <code>programs.xml</code> file.
</p>
</section>