DataparkSearch Engine 4.54: Reference manual

3.9. External parsers

The DataparkSearch indexer can use external parsers to index various file types (MIME types).

A parser is an executable program that converts one MIME type to text/plain or text/html. For example, if you have PostScript files, you can use the ps2ascii parser (filter), which reads a PostScript file from stdin and writes ASCII text to stdout.

3.9.1. Supported parser types

The indexer supports four types of parsers, which can:

read data from stdin and send result to stdout
read data from file and send result to stdout
read data from file and send result to file
read data from stdin and send result to file
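
As a rough illustration of the four modes, the sketch below classifies a parser command line by the `$1` (input file) and `$2` (output file) macros that section 3.9.2 describes. The function name `parser_mode` is hypothetical, and checking for the macros by plain substring search is an assumption, not necessarily how indexer decides.

```python
def parser_mode(command: str) -> str:
    """Classify a parser command line by the macros it contains.

    Assumption: "$1" means the parser reads a file, "$2" means it writes one;
    without them the parser uses stdin and stdout respectively.
    """
    source = "file" if "$1" in command else "stdin"
    target = "file" if "$2" in command else "stdout"
    return f"{source}->{target}"


print(parser_mode("deroff"))                     # stdin->stdout
print(parser_mode("/usr/bin/catdoc -a $1"))      # file->stdout
print(parser_mode("/usr/bin/catdoc -a $1 >$2"))  # file->file
```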

3.9.2. Setting up parsers

Configure mime types
Configure your web server to send the appropriate "Content-Type" header. For Apache, have a look at the mime.types file; most MIME types are already defined there.
If you want to index local files or files fetched via FTP, use the "AddType" command in indexer.conf to associate file name extensions with their MIME types. For example:
```
AddType text/html *.html
```
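To make the effect of "AddType" concrete, here is a minimal sketch of looking up a MIME type from a file name. It assumes shell-style, case-sensitive wildcard matching, which may differ from what indexer actually does; `ADD_TYPES` and `mime_for` are hypothetical names used only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of AddType lines: (MIME type, file name patterns).
ADD_TYPES = [
    ("text/html", ["*.html"]),
    ("application/x-gzipped-man", ["*.1.gz", "*.2.gz", "*.3.gz", "*.4.gz"]),
]


def mime_for(file_name, add_types=ADD_TYPES):
    """Return the MIME type of the first matching pattern, or None."""
    for mime, patterns in add_types:
        if any(fnmatchcase(file_name, pattern) for pattern in patterns):
            return mime
    return None


print(mime_for("index.html"))  # text/html
print(mime_for("notes.txt"))   # None
```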
Add parsers
Add lines with parser definitions. Each line has the following format, with up to three arguments (the command line is optional):
```
Mime <from_mime> <to_mime> [<command line>]
```
For example, the following line defines a parser for man pages:
```
# Use deroff for parsing man pages ( *.man )
Mime  application/x-troff-man   text/plain   deroff
```
This parser takes data from stdin and writes the result to stdout.
Many parsers cannot operate on stdin and require a file to read from. In this case the indexer creates a temporary file in /tmp and removes it when the parser exits. Use the $1 macro in the parser command line to substitute the file name. For example, a Mime command for the "catdoc" MS Word to ASCII converter may look like this:
```
Mime application/msword text/plain "/usr/bin/catdoc -a $1"
```
If your parser writes its result to an output file, use the $2 macro. The indexer replaces $2 with a temporary file name, starts the parser, reads the result from this temporary file, and then removes it. For example:
```
Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"
```
The parser above reads data from the first temporary file and writes the result to the second one. Both temporary files are removed when the parser exits. Note that this parser produces exactly the same result as the previous one, but the two use different execution modes: file->stdout and file->file, respectively.
If the <command line> parameter is omitted, the two MIME types are treated as synonyms. For example, some sites incorrectly report MP3 files as application/mp3. You can map this type to the correct one, audio/mpeg, so that such files are processed properly:
```
Mime application/mp3 audio/mpeg
```
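
The handling of the $1 and $2 macros described above can be sketched roughly as follows. This is an illustration under stated assumptions, not indexer's actual code: the function name `run_parser` is hypothetical, the command is assumed to run through a POSIX shell, and temporary files are created with Python's `tempfile` module rather than at any particular path.

```python
import os
import shlex
import subprocess
import tempfile


def run_parser(command: str, document: bytes) -> bytes:
    """Run a parser command line and return the converted document."""
    temp_paths = []
    try:
        io_args = {"input": document}  # default: feed the document on stdin
        if "$1" in command:
            # The parser reads from a file: store the document in a temp file.
            fd, in_path = tempfile.mkstemp()
            temp_paths.append(in_path)
            with os.fdopen(fd, "wb") as handle:
                handle.write(document)
            command = command.replace("$1", shlex.quote(in_path))
            io_args = {"stdin": subprocess.DEVNULL}
        out_path = None
        if "$2" in command:
            # The parser writes to a file: reserve a second temp file for it.
            fd, out_path = tempfile.mkstemp()
            os.close(fd)
            temp_paths.append(out_path)
            command = command.replace("$2", shlex.quote(out_path))
        result = subprocess.run(command, shell=True, check=True,
                                stdout=subprocess.PIPE, **io_args)
        if out_path is not None:
            with open(out_path, "rb") as handle:
                return handle.read()
        return result.stdout
    finally:
        # Both temporary files are removed once the parser has exited.
        for path in temp_paths:
            os.remove(path)
```

With this sketch, `run_parser("cat $1", data)` and `run_parser("cat $1 >$2", data)` return the same bytes, which mirrors the file->stdout and file->file equivalence noted above.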

3.9.3. Avoid indexer hang on parser execution

To avoid an indexer hang during parser execution, you can limit the parser's execution time, in seconds, with the ParserTimeOut command in your indexer.conf. For example:

ParserTimeOut 600

Default value is 300 seconds, i.e. 5 minutes.
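
The idea behind the timeout can be sketched as below: the parser runs as a child process and is abandoned if it exceeds the limit. This is only an illustration of the technique using Python's `subprocess` module; `run_with_timeout` and `PARSER_TIMEOUT` are hypothetical names, and what indexer does with a document whose parser timed out is not specified here.

```python
import subprocess
import sys

PARSER_TIMEOUT = 300  # seconds; mirrors the documented default


def run_with_timeout(argv, document, timeout=PARSER_TIMEOUT):
    """Return the parser output, or None if the parser ran too long."""
    try:
        result = subprocess.run(argv, input=document,
                                stdout=subprocess.PIPE, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return result.stdout


# A stand-in "parser" that sleeps longer than the allowed time.
slow_parser = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow_parser, b"", timeout=1))  # None
```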

3.9.4. Pipes in parser's command line

You can use pipes in a parser's command line. For example, the following lines are useful for indexing gzipped man pages from a local disk:

AddType  application/x-gzipped-man  *.1.gz *.2.gz *.3.gz *.4.gz
Mime     application/x-gzipped-man  text/plain  "zcat | deroff"

3.9.5. Charsets and parsers

Some parsers produce output in a charset other than the one given in the LocalCharset command. Specify the parser's output charset so that the indexer converts the output to the proper one. For example, if your catdoc is configured to produce output in the windows-1251 charset but LocalCharset is koi8-r, use this command for parsing MS Word documents:

Mime  application/msword  "text/plain; charset=windows-1251" "catdoc -a $1"
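
The conversion step can be pictured with the following sketch, which extracts the charset from the target MIME type and recodes the parser output into the local charset. The helper names are hypothetical and the code relies on Python's built-in codecs; it shows the general technique rather than indexer's own conversion routines.

```python
def split_content_type(value: str):
    """Split 'text/plain; charset=windows-1251' into (mime, charset or None)."""
    mime, _, params = value.partition(";")
    charset = None
    for param in params.split(";"):
        name, _, val = param.strip().partition("=")
        if name.lower() == "charset" and val:
            charset = val.strip('"')
    return mime.strip(), charset


def to_local_charset(output: bytes, to_mime: str, local_charset: str) -> bytes:
    """Recode parser output from the declared charset to the local one."""
    _, charset = split_content_type(to_mime)
    if charset is None or charset.lower() == local_charset.lower():
        return output  # nothing declared, or already in the local charset
    text = output.decode(charset)
    # Characters missing from the local charset are replaced, not fatal.
    return text.encode(local_charset, errors="replace")
```

For the example above, `to_local_charset(data, "text/plain; charset=windows-1251", "koi8-r")` would turn windows-1251 bytes produced by catdoc into koi8-r bytes.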

3.9.6. DPS_URL environment variable

When executing a parser, the indexer sets the DPS_URL environment variable to the URL being processed. You can use this variable in parser scripts.
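
As an example of using the variable, the sketch below shows a helper that a parser script might call to wrap converted text in HTML, using the URL from DPS_URL as the title. The function name `wrap_as_html` and the choice of putting the URL into the title are illustrative assumptions, not something the indexer requires.

```python
import html
import os


def wrap_as_html(text: str) -> str:
    """Wrap plain text in minimal HTML, titled with the URL being indexed."""
    # DPS_URL is set by indexer; fall back to an empty title when run by hand.
    url = os.environ.get("DPS_URL", "")
    return (
        "<html><head><title>{}</title></head>"
        "<body><pre>{}</pre></body></html>"
    ).format(html.escape(url), html.escape(text))
```

A parser built around this helper would read its input, convert it to text, and print `wrap_as_html(text)` to stdout, with text/html as the target MIME type in the corresponding Mime command.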

3.9.7. Some third-party parsers

RPM parser by Mario Lang <lang@zid.tu-graz.ac.at>

/usr/local/bin/rpminfo:

#!/bin/bash
/usr/bin/rpm -q --queryformat="<html><head><title>RPM: %{NAME} %{VERSION}-%{RELEASE}
(%{GROUP})</title><meta name=\"description\" content=\"%{SUMMARY}\"></head><body>
%{DESCRIPTION}\n</body></html>" -p "$1"

indexer.conf:

Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"

The search results then show RPM information like this:

3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4]
       Mysql is a SQL (Structured Query Language) database server.
       Mysql was written by Michael (monty) Widenius. See the CREDITS
       file in the distribution for more credits for mysql and related
       things....
       (application/x-rpm) 2088855 bytes

catdoc MS Word to text converter
Home page, also listed on Freshmeat.
indexer.conf:
```
Mime application/msword         text/plain      "catdoc $1"
```
xls2csv MS Excel to text converter
It is supplied with catdoc.
indexer.conf:
```
Mime application/vnd.ms-excel   text/plain      "xls2csv $1"
```
pdftotext Adobe PDF converter
Supplied with xpdf project.
Homepage, also listed on Freshmeat.
indexer.conf:
```
Mime application/pdf            text/plain      "pdftotext $1 -"
```

unrtf RTF to html converter

Homepage

indexer.conf:

Mime text/rtf*        text/html  "/usr/local/dpsearch/sbin/unrtf --html $1"
Mime application/rtf  text/html  "/usr/local/dpsearch/sbin/unrtf --html $1"

xlhtml XLS to html converter

Homepage

indexer.conf:

Mime	application/vnd.ms-excel  text/html  "/usr/local/dpsearch/sbin/xlhtml $1"

ppthtml PowerPoint (PPT) to html converter. Part of xlhtml 0.5.

Homepage

indexer.conf:

Mime	application/vnd.ms-powerpoint  text/html  "/usr/local/dpsearch/sbin/ppthtml $1"

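Using wvHtml (DOC to html).

/usr/local/dpsearch/sbin/0wvHtml.pl:

#!/usr/bin/perl -w
# Wrapper for wvHtml: $ARGV[0] is the input file ($1), $ARGV[1] is the output file ($2).

$p = $ARGV[1];
$f = $ARGV[1];

# Split the output path into its directory ($p) and its file name ($f).
$p =~ s/(.*)\/([^\/]*)/$1\//;
$f =~ s/(.*)\/([^\/]*)/$2/;

# wvHtml takes the target directory and the output file name separately.
system("/usr/local/bin/wvHtml --targetdir=$p $ARGV[0] $f");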

indexer.conf:

Mime  application/msword       text/html  "/usr/local/dpsearch/sbin/0wvHtml.pl $1 $2"
Mime  application/vnd.ms-word  text/html  "/usr/local/dpsearch/sbin/0wvHtml.pl $1 $2"

swf2html from Flash Search Engine SDK

indexer.conf:

Mime  application/x-shockwave-flash  text/html  "/usr/local/dpsearch/sbin/swf2html $1"

djvutxt from djvuLibre

indexer.conf:

Mime  image/djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
Mime  image/x.djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
Mime  image/x-djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"
Mime  image/vnd.djvu  text/plain  "/usr/local/bin/djvutxt $1 $2"

3.9.8. libextractor library

DataparkSearch can be built with the libextractor library. Using this library, DataparkSearch can index keywords from files of the following formats: PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF.

To build DataparkSearch with libextractor library, install the library, and then configure and compile DataparkSearch.

The table below shows the relationship between keyword types of libextractor versions prior to 0.6 and DataparkSearch section names:

Table 3-1. Relationship between libextractor's keyword types and DataparkSearch section names

Keyword Type	Section name
EXTRACTOR_FILENAME	Filename
EXTRACTOR_MIMETYPE	Mimetype
EXTRACTOR_TITLE	Title
EXTRACTOR_AUTHOR	Author
EXTRACTOR_ARTIST	Artist
EXTRACTOR_DESCRIPTION	Description
EXTRACTOR_COMMENT	Comment
EXTRACTOR_DATE	Date
EXTRACTOR_PUBLISHER	Publisher
EXTRACTOR_LANGUAGE	Content-Language
EXTRACTOR_ALBUM	Album
EXTRACTOR_GENRE	Genre
EXTRACTOR_LOCATION	Location
EXTRACTOR_VERSIONNUMBER	VersionNumber
EXTRACTOR_ORGANIZATION	Organization
EXTRACTOR_COPYRIGHT	Copyright
EXTRACTOR_SUBJECT	Subject
EXTRACTOR_KEYWORDS	Meta.Keywords
EXTRACTOR_CONTRIBUTOR	Contributor
EXTRACTOR_RESOURCE_TYPE	Resource-Type
EXTRACTOR_FORMAT	Format
EXTRACTOR_RESOURCE_IDENTIFIER	Resource-Idendifier
EXTRACTOR_SOURCE	Source
EXTRACTOR_RELATION	Relation
EXTRACTOR_COVERAGE	Coverage
EXTRACTOR_SOFTWARE	Software
EXTRACTOR_DISCLAIMER	Disclaimer
EXTRACTOR_WARNING	Warning
EXTRACTOR_TRANSLATED	Translated
EXTRACTOR_CREATION_DATE	Creation-Date
EXTRACTOR_MODIFICATION_DATE	Modification-Date
EXTRACTOR_CREATOR	Creator
EXTRACTOR_PRODUCER	Producer
EXTRACTOR_PAGE_COUNT	Page-Count
EXTRACTOR_PAGE_ORIENTATION	Page-Orientation
EXTRACTOR_PAPER_SIZE	Paper-Size
EXTRACTOR_USED_FONTS	Used-Fonts
EXTRACTOR_PAGE_ORDER	Page-Order
EXTRACTOR_CREATED_FOR	Created-For
EXTRACTOR_MAGNIFICATION	Magnification
EXTRACTOR_RELEASE	Release
EXTRACTOR_GROUP	Group
EXTRACTOR_SIZE	Size
EXTRACTOR_SUMMARY	Summary
EXTRACTOR_PACKAGER	Packager
EXTRACTOR_VENDOR	Vendor
EXTRACTOR_LICENSE	License
EXTRACTOR_DISTRIBUTION	Distribution
EXTRACTOR_BUILDHOST	BuildHost
EXTRACTOR_OS	OS
EXTRACTOR_DEPENDENCY	Dependency
EXTRACTOR_HASH_MD4	Hash-MD4
EXTRACTOR_HASH_MD5	Hash-MD5
EXTRACTOR_HASH_SHA0	Hash-SHA0
EXTRACTOR_HASH_SHA1	Hash-SHA1
EXTRACTOR_HASH_RMD160	Hash-RMD160
EXTRACTOR_RESOLUTION	Resolution
EXTRACTOR_CATEGORY	Ext.Category
EXTRACTOR_BOOKTITLE	BookTitle
EXTRACTOR_PRIORITY	Priority
EXTRACTOR_CONFLICTS	Conflicts
EXTRACTOR_REPLACES	Replaces
EXTRACTOR_PROVIDES	Provides
EXTRACTOR_CONDUCTOR	Conductor
EXTRACTOR_INTERPRET	Interpret
EXTRACTOR_OWNER	Owner
EXTRACTOR_LYRICS	Lyrics
EXTRACTOR_MEDIA_TYPE	Media-Type
EXTRACTOR_CONTACT	Contact
EXTRACTOR_THUMBNAIL_DATA	Thumbnail-Data
EXTRACTOR_PUBLICATION_DATE	Publication-Date
EXTRACTOR_CAMERA_MAKE	Camera-Make
EXTRACTOR_CAMERA_MODEL	Camera-Model
EXTRACTOR_EXPOSURE	Exposure
EXTRACTOR_APERTURE	Aperture
EXTRACTOR_EXPOSURE_BIAS	Exposure-Bias
EXTRACTOR_FLASH	Flash
EXTRACTOR_FLASH_BIAS	Flash-Bias
EXTRACTOR_FOCAL_LENGTH	Focal-Length
EXTRACTOR_FOCAL_LENGTH_35MM	Focal-Length-35MM
EXTRACTOR_ISO_SPEED	ISO-Speed
EXTRACTOR_EXPOSURE_MODE	Exposure-Mode
EXTRACTOR_METERING_MODE	Metering-Mode
EXTRACTOR_MACRO_MODE	Macro-Mode
EXTRACTOR_IMAGE_QUALITY	Image-Quality
EXTRACTOR_WHITE_BALANCE	White-Balance
EXTRACTOR_ORIENTATION	Orientation
EXTRACTOR_TEMPLATE	Template
EXTRACTOR_SPLIT	Split
EXTRACTOR_PRODUCTVERSION	ProductVersion
EXTRACTOR_LAST_SAVED_BY	Last-Saved-By
EXTRACTOR_LAST_PRINTED	Last-Printed
EXTRACTOR_WORD_COUNT	Word-Count
EXTRACTOR_CHARACTER_COUNT	Character-Count
EXTRACTOR_TOTAL_EDITING_TIME	Total-Editing-Time
EXTRACTOR_THUMBNAILS	Thumbnails
EXTRACTOR_SECURITY	Security
EXTRACTOR_CREATED_BY_SOFTWARE	Created-By-Software
EXTRACTOR_MODIFIED_BY_SOFTWARE	Modified-By-Software
EXTRACTOR_REVISION_HISTORY	Revision-History
EXTRACTOR_LOWERCASE	Lowercase
EXTRACTOR_COMPANY	Company
EXTRACTOR_GENERATOR	Generator
EXTRACTOR_CHARACTER_SET	Meta-Charset
EXTRACTOR_LINE_COUNT	Line-Count
EXTRACTOR_PARAGRAPH_COUNT	Paragraph-Count
EXTRACTOR_EDITING_CYCLES	Editing-Cycles
EXTRACTOR_SCALE	Scale
EXTRACTOR_MANAGER	Manager
EXTRACTOR_MOVIE_DIRECTOR	Movie-Director
EXTRACTOR_DURATION	Duration
EXTRACTOR_INFORMATION	Information
EXTRACTOR_FULL_NAME	Full-Name
EXTRACTOR_CHAPTER	Chapter
EXTRACTOR_YEAR	Year
EXTRACTOR_LINK	Link
EXTRACTOR_MUSIC_CD_IDENTIFIER	Music-CD-Identifier
EXTRACTOR_PLAY_COUNTER	Play-Counter
EXTRACTOR_POPULARITY_METER	Popularity-Meter
EXTRACTOR_CONTENT_TYPE	Ext.Content-Type
EXTRACTOR_ENCODED_BY	Encoded-By
EXTRACTOR_TIME	Time
EXTRACTOR_MUSICIAN_CREDITS_LIST	Musician-Credits-List
EXTRACTOR_MOOD	Mood
EXTRACTOR_FORMAT_VERSION	Format-Version
EXTRACTOR_TELEVISION_SYSTEM	Television-System
EXTRACTOR_SONG_COUNT	Song-Count
EXTRACTOR_STARTING_SONG	Strting-Song
EXTRACTOR_HARDWARE_DEPENDENCY	Hardware-Dependency
EXTRACTOR_RIPPER	Ripper
EXTRACTOR_FILE_SIZE	File-Size
EXTRACTOR_TRACK_NUMBER	Track-Number
EXTRACTOR_ISRC	ISRC
EXTRACTOR_DISC_NUMBER	Disc-Number

If a section name from the list above is not specified in sections.conf, the value of the corresponding keyword is written to the body section. Keywords of unknown type are written to the body section as well.
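
The fallback rule can be expressed as a short sketch. Only a few rows of Table 3-1 are reproduced; `KEYWORD_SECTIONS`, `section_for`, and the use of the literal string "body" for the body section are hypothetical choices made for illustration, and `configured_sections` stands in for the section names defined in sections.conf.

```python
# A few rows of Table 3-1: libextractor keyword type -> section name.
KEYWORD_SECTIONS = {
    "EXTRACTOR_TITLE": "Title",
    "EXTRACTOR_AUTHOR": "Author",
    "EXTRACTOR_KEYWORDS": "Meta.Keywords",
    "EXTRACTOR_LANGUAGE": "Content-Language",
}


def section_for(keyword_type, configured_sections):
    """Pick the section that receives a keyword value."""
    section = KEYWORD_SECTIONS.get(keyword_type)
    if section is None or section not in configured_sections:
        # Unknown keyword type, or section not configured: goes to the body.
        return "body"
    return section


configured = {"Title", "Meta.Keywords"}
print(section_for("EXTRACTOR_TITLE", configured))   # Title
print(section_for("EXTRACTOR_AUTHOR", configured))  # body
```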

For libextractor 0.6.x, the values returned by EXTRACTOR_metatype_to_string function are used as section names.
