5.2. Cache mode storage

5.2.1. Introduction

The cache words storage mode is able to index and quickly search through several million documents.

5.2.2. Cache mode word indexes structure

The main idea of the cache storage mode is that the word index and the URL sorting data are stored on disk rather than in an SQL database. Full URL information, however, is kept in the SQL database (tables url and urlinfo). The word index is divided into the number of files specified by the WrdFiles command (default value is 0x300). The URL sorting information is divided into the number of files specified by the URLDataFiles command (default value is 0x300).

Note: you must use identical values for the WrdFiles and URLDataFiles commands in all your configs.
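
For example, to keep these values explicit and identical, every config could contain the same two lines (the values shown are the defaults; adjust them to your installation):

WrdFiles 0x300
URLDataFiles 0x300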

The word index is located in files under the /var/tree directory of the DataparkSearch installation. The URL sorting information is located in files under the /var/url directory of the DataparkSearch installation.

indexer and cached use memory buffers to cache a portion of the cache mode data before flushing it to disk. The size of these buffers can be adjusted with the CacheLogWords and CacheLogDels commands in the indexer.conf and cached.conf config files respectively. The default values are 1024 for CacheLogWords and 10240 for CacheLogDels. An estimate of the total memory used for these buffers can be calculated as follows:

Volume = WrdFiles * (16 + 16 * CacheLogWords + 8 * CacheLogDels), for 32-bit systems
Volume = WrdFiles * (32 + 20 * CacheLogWords + 12 * CacheLogDels), for 64-bit systems
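
For example, with the default values (WrdFiles = 0x300 = 768, CacheLogWords = 1024, CacheLogDels = 10240) these formulas give roughly:

Volume = 768 * (16 + 16*1024 + 8*10240) = 768 * 98320 bytes, about 72 MB, for 32-bit systems
Volume = 768 * (32 + 20*1024 + 12*10240) = 768 * 143392 bytes, about 105 MB, for 64-bit systems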

5.2.3. Cache mode tools

There are two additional programs, cached and splitter, used in cache mode indexing.

cached is a TCP daemon which collects word information from indexers and stores it on your hard disk. It can operate in two modes: the old cachelogd mode, in which it only logs data, and a new mode, in which the cachelogd and splitter functionality are combined.

splitter is a program that creates fast word indexes from the data collected by cached. Those indexes are used later in the search process.

5.2.4. Starting cache mode

To start "cache mode" follow these steps:

  1. Start cached server:

    cd /usr/local/dpsearch/sbin

    ./cached 2>cached.out &

    It will write some debug information into the cached.out file. cached also creates a cached.pid file in the /var directory of the DataparkSearch installation.

    cached listens for TCP connections and can accept several indexers from different machines. The theoretical maximum number of indexer connections is 128. In the old mode cached stores the information sent by indexers in the /var/splitter/ directory of the DataparkSearch installation; in the new mode it stores it in the /var/tree/ directory.

    By default, cached starts in the new mode. To run it in the old, logs-only mode, use the -l switch:

    cached -l

    Alternatively, specify the LogsOnly yes command in your cached.conf.

    You can specify the port for cached to use without recompiling. To do that, run

    ./cached -p8000

    where 8000 is the port number you choose.

    You can also specify the directory where data is stored (the /var directory by default) with this command:

    ./cached -w /path/to/var/dir

  2. Configure your indexer.conf as usual, and for the DBAddr command add cache as the value of the dbmode parameter and localhost:7000 as the value of the cached parameter (see Section 3.10.2). An example DBAddr line is shown after this list.

  3. Run indexers. Several indexers can be executed simultaneously. Note that you may install indexers on different machines and execute them against the same cached server. Such a distributed setup makes indexing faster.

  4. Flushing cached buffers and URL data, and creating cache mode limits. To flush the cached buffers and URL data and to create the cache mode limits after indexing is done, send the -HUP signal to cached. You can use the cached.pid file to do this:

    kill -HUP `cat /usr/local/dpsearch/var/cached.pid`

    N.B.: you need to wait until all buffers have been flushed before going to the next step.

  5. Creating the word index. This stage is not needed if cached runs in the new, i.e. combined, mode. When some information has been gathered by indexers and collected in the /var/splitter/ directory by cached, it is possible to create the fast word indexes. The splitter program is responsible for this. It is installed in the /sbin directory. Note that indexes can be created at any time without interrupting the current indexing process.

    Run splitter without any arguments:

    /usr/local/dpsearch/sbin/splitter

    It will sequentially process all prepared files in the /var/splitter/ directory and use them to build the fast word index. Processed logs in the /var/splitter/ directory are truncated after this operation.
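
Here is a minimal sketch of the DBAddr line referred to in step 2, assuming a PostgreSQL database named search with hypothetical credentials (adjust the scheme, credentials, database name and cached address to your setup):

DBAddr pgsql://user:password@localhost/search/?dbmode=cache&cached=localhost:7000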

5.2.5. Optional usage of several splitters

splitter has two command line arguments, -f [first file] and -t [last file], which limit the range of files used. If no parameters are specified, splitter processes all prepared files. You can limit the file range using the -f and -t keys, specifying parameters in HEX notation. For example, splitter -f 000 -t A00 will create word indexes using files in the range from 000 to A00. These keys allow running several splitters at the same time, which usually makes index building faster. For example, this shell script starts four splitters in the background:

#!/bin/sh
splitter -f 000 -t 3f0 &
splitter -f 400 -t 7f0 &
splitter -f 800 -t bf0 &
splitter -f c00 -t ff0 &

5.2.6. Using run-splitter script

There is a run-splitter script in the /sbin directory of the DataparkSearch installation. It helps to execute all the index building steps in the proper sequence.

"run-splitter" has these two command line parameters:

run-splitter --hup --split

or a short version:

run-splitter -k -s

Each parameter activates the corresponding index building step. run-splitter executes both steps of index building in the proper order:

  1. Sending the -HUP signal to cached. The --hup (or -k) run-splitter argument is responsible for this.

  2. Running splitter. The --split (or -s) key is responsible for this.

In most cases just run the run-splitter script with both -k -s arguments. Separate usage of these flags, which correspond to the individual steps of index building, is rarely required.

run-splitter has optional parameters -p=n and -v=m to specify the pause in seconds after each log buffer update and the verbosity level respectively. n is the number of seconds (default value: 0), m is the verbosity level (default value: 4).
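
For example, the following invocation (values are illustrative) sends the -HUP signal to cached, runs splitter, pauses 5 seconds after each log buffer update and lowers the verbosity:

run-splitter -k -s -p=5 -v=2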

5.2.7. Doing search

To start using search.cgi in cache mode, edit your search.htm template as usual and add cache as the value of the dbmode parameter of the DBAddr command.
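
For example, the DBAddr line in search.htm might look like this sketch (hypothetical credentials and database name, mirroring the DBAddr used for indexing):

DBAddr pgsql://user:password@localhost/search/?dbmode=cache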

5.2.8. Using search limits

To use search limits in cache mode, you should add the appropriate Limit command(s) to your indexer.conf (or cached.conf, if cached is used) and to search.htm or searchd.conf (if searchd is used):

Limit prm:type [SQL-Request [DBAddr]]

For example, to use search limits by tag, by category and by site, add the following lines to search.htm or to indexer.conf (searchd.conf, if searchd is used):

Limit t:tag
Limit c:category
Limit site:siteid

where t is the name of the CGI parameter (&t=) for this constraint and tag is the type of the constraint.
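
With such a limit defined, a search request can be restricted by passing the corresponding CGI parameter. A hypothetical example (the query parameter q and the tag value news are only for illustration):

/cgi-bin/search.cgi?q=apache&t=news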

Instead of tag/category/siteid in the example above you can use any of the values from the table below:

Table 5-1. Cache mode predefined limit types

category              Category limit.
tag                   Tag limit.
time                  Time limit (hour precision).
language              Language limit.
content               Content-Type limit.
siteid                url.site_id limit.
link                  Limit by pages that link to the specified url.rec_id.
hostname (obsolete)   Hostname (URL) limit. This limit is obsolete and should be replaced by the siteid limit.

If the second, optional parameter SQL-Request is specified for the Limit command, this SQL query is executed to construct the limit. The SQL query should return all possible pairs of limit value and url.rec_id. E.g.:

Limit prm:strcrc32 "SELECT label, rec_id FROM labels" pgsql://u:p@localhost/sitedb/

where prm is the name of the limit and also the name of the CGI parameter used for this limit; strcrc32 is the type of the limit, which in this particular case is a string. Instead of strcrc32 it is possible to use any of the following limit types:

Table 5-2. SQL-based cache mode limit types

hex8str     A hex or hexavigesimal (base-26) string similar to those used in categories. A nested limit will be created.
strcrc32    A string; its hash32 value is used as the key for this limit.
int         An integer (4 bytes wide).
hour        An integer (4 bytes wide) number of seconds since epoch. The value in the index has hour precision.
minute      An integer (4 bytes wide) number of seconds since epoch. The value in the index has minute precision.

With the third, optional parameter DBAddr of the Limit command it is possible to specify a connection to an alternative SQL database from which to get the data for this limit.

It is possible to omit the optional parameters SQL-Request and DBAddr of the Limit command in the search template search.htm or in the searchd.conf file (when searchd is used), since they are used only for limit construction. In that case only the limit name and type need to be given:

Limit prm:strcrc32