3.9. External parsers

The DataparkSearch indexer can use external parsers to index various file types (MIME types).

A parser is an executable program that converts a document of one MIME type to text/plain or text/html. For example, if you have PostScript files, you can use the ps2ascii parser (filter), which reads a PostScript file from stdin and writes ASCII text to stdout.
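As an illustration of the filter idea, here is a minimal sketch of a parser written in Python. It is not shipped with DataparkSearch; the function names strip_overstrike and run_filter are hypothetical. It removes backspace overstrikes (the "X, backspace, X" sequences found in formatted man page output) and emits plain text. In a real parser script, run_filter would be called with sys.stdin and sys.stdout.

```python
import io


def strip_overstrike(text):
    """Drop backspace overstrikes, keeping the last character written."""
    out = []
    for ch in text:
        if ch == "\b":
            # A backspace erases the previous character on the same line.
            if out and out[-1] != "\n":
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


def run_filter(src, dst):
    """Read the whole document from src and write plain text to dst."""
    dst.write(strip_overstrike(src.read()))


# In-memory demonstration; a real parser would pass sys.stdin and sys.stdout.
demo_in = io.StringIO("N\bNA\bAM\bME\bE\n_\bl_\bs\n")
demo_out = io.StringIO()
run_filter(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```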

3.9.1. Supported parser types

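Indexer supports four types of parsers, which differ in where they read data from and where they send the result: stdin to stdout, file to stdout, file to file, and stdin to file.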

3.9.2. Setting up parsers

  1. Configure MIME types

    Configure your web server to send the appropriate "Content-Type" header. For Apache, have a look at the mime.types file; most MIME types are already defined there.

    If you want to index local files or files served via FTP, use the "AddType" command in indexer.conf to associate file name extensions with their MIME types. For example:

    AddType text/html *.html

  2. Add parsers

    Add lines with parser definitions. Each line has the following format, with up to three arguments (the command line is optional):

    Mime <from_mime> <to_mime> [<command line>]

    For example, the following line defines a parser for man pages:

    # Use deroff for parsing man pages ( *.man )
    Mime  application/x-troff-man   text/plain   deroff

    This parser takes data from stdin and writes the result to stdout.

    Many parsers cannot operate on stdin and require a file to read from. In this case indexer creates a temporary file in /tmp and removes it when the parser exits. Use the $1 macro in the parser command line to substitute the file name. For example, the Mime command for the "catdoc" MS Word to ASCII converter may look like this:

    Mime application/msword text/plain "/usr/bin/catdoc -a $1"

    If your parser writes its result to an output file, use the $2 macro. indexer replaces $2 with a temporary file name, starts the parser, reads the result from this temporary file, and then removes it. For example:

    Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"

    The parser above reads data from the first temporary file and writes the result to the second one. Both temporary files are removed when the parser exits. Note that this parser produces exactly the same result as the previous one, but the two use different execution modes: the previous one is file->stdout and this one is file->file.

    If the <command line> parameter is omitted, the two MIME types are treated as synonyms. For example, some sites incorrectly report MP3 files as application/mp3. You can map this type to the correct one, audio/mpeg, so that such files are processed properly:

    Mime application/mp3 audio/mpeg
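The relationship between the $1 and $2 macros and the execution modes can be sketched in Python as follows. This only illustrates the rules described above and is not indexer's actual implementation; the function names and the temporary file paths are hypothetical.

```python
def execution_mode(command):
    """Classify a parser command line by the macros it contains."""
    source = "file" if "$1" in command else "stdin"
    target = "file" if "$2" in command else "stdout"
    return source + "->" + target


def expand_macros(command, in_path, out_path):
    """Substitute temporary file names for the $1 and $2 macros."""
    return command.replace("$1", in_path).replace("$2", out_path)


# The three example command lines from this section.
for cmd in ("deroff",
            "/usr/bin/catdoc -a $1",
            "/usr/bin/catdoc -a $1 >$2"):
    # /tmp/in.tmp and /tmp/out.tmp are made-up names for illustration only.
    print(execution_mode(cmd), "|", expand_macros(cmd, "/tmp/in.tmp", "/tmp/out.tmp"))
```

A command line that contains neither macro is left unchanged by the substitution and runs as a plain stdin->stdout filter.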

3.9.3. Avoid indexer hang on parser execution

To avoid an indexer hang during parser execution, you may limit the time, in seconds, allowed for parser execution with the ParserTimeOut command in your indexer.conf. For example:

ParserTimeOut 600

The default value is 300 seconds, i.e. 5 minutes.
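The effect of such a limit can be illustrated with a short Python sketch that runs a child process and gives up after a fixed number of seconds. This is an analogy only, not how indexer implements ParserTimeOut; run_parser is a hypothetical helper, and its default of 300 seconds simply mirrors the default mentioned above.

```python
import subprocess
import sys


def run_parser(argv, data, timeout_seconds=300):
    """Feed data to a parser on stdin and return its stdout.

    Returns None if the parser does not finish within the time limit;
    subprocess.run kills the child process in that case.
    """
    try:
        done = subprocess.run(argv, input=data, stdout=subprocess.PIPE,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# A stand-in "parser" that upper-cases its input, run via the Python interpreter.
fast = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(run_parser(fast, b"hello"))
```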

3.9.4. Pipes in parser's command line

You can use pipes in a parser's command line. For example, these lines are useful for indexing gzipped man pages from a local disk:

AddType  application/x-gzipped-man  *.1.gz *.2.gz *.3.gz *.4.gz
Mime     application/x-gzipped-man  text/plain  "zcat | deroff"

3.9.5. Charsets and parsers

Some parsers produce output in a charset other than the one given in the LocalCharset command. Specify the charset of the parser's output so that indexer converts it to the proper one. For example, if your catdoc is configured to produce output in the windows-1251 charset but LocalCharset is koi8-r, use this command for parsing MS Word documents:

Mime  application/msword  "text/plain; charset=windows-1251" "catdoc -a $1"
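The conversion this implies can be sketched in Python: take the charset parameter from the target MIME type, decode the parser's output with it, and re-encode in the local charset. This is a rough illustration, not indexer's code; charset_of and to_local_charset are hypothetical names, and falling back to the local charset when no charset parameter is given is an assumption.

```python
from email.message import Message


def charset_of(mime_value, default):
    """Return the charset parameter of a MIME type string, or default."""
    msg = Message()
    msg["Content-Type"] = mime_value
    return msg.get_content_charset(default)


def to_local_charset(output, mime_value, local_charset):
    """Re-encode parser output bytes from the declared charset to the local one."""
    declared = charset_of(mime_value, local_charset)
    text = output.decode(declared, errors="replace")
    return text.encode(local_charset, errors="replace")


# Parser output in windows-1251, converted to koi8-r as in the example above.
parser_output = "Привет".encode("windows-1251")
converted = to_local_charset(parser_output,
                             "text/plain; charset=windows-1251", "koi8-r")
print(converted.decode("koi8-r"))
```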

3.9.6. DPS_URL environment variable

When executing a parser, indexer sets the DPS_URL environment variable to the URL being processed. You can use this variable in parser scripts.
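One possible use, sketched below in Python, is to include the URL in a parser script's diagnostic messages. The helper name log_line and the message format are hypothetical; only the DPS_URL variable name comes from the text above.

```python
import os


def log_line(message, environ=os.environ):
    """Prefix a diagnostic message with the URL being processed, if known."""
    # DPS_URL is set by indexer; it is absent when the script is run by hand.
    url = environ.get("DPS_URL", "unknown URL")
    return "[" + url + "] " + message


# A parser script might write such a line to stderr when conversion fails.
print(log_line("conversion failed", {"DPS_URL": "http://example.org/a.doc"}))
```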

3.9.7. Some third-party parsers

3.9.8. libextractor library

DataparkSearch can be built with the libextractor library. Using this library, DataparkSearch can index keywords from files of the following formats: PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF.

To build DataparkSearch with the libextractor library, install the library, then configure and compile DataparkSearch.

The table below shows the relationship between the keyword types of libextractor versions prior to 0.6 and DataparkSearch section names:

Table 3-1. Relationship between libextractor's keyword types and DataparkSearch section names

Keyword Type    Section name
EXTRACTOR_FILENAME Filename
EXTRACTOR_MIMETYPE Mimetype
EXTRACTOR_TITLE Title
EXTRACTOR_AUTHOR Author
EXTRACTOR_ARTIST Artist
EXTRACTOR_DESCRIPTION Description
EXTRACTOR_COMMENT Comment
EXTRACTOR_DATE Date
EXTRACTOR_PUBLISHER Publisher
EXTRACTOR_LANGUAGE Content-Language
EXTRACTOR_ALBUM Album
EXTRACTOR_GENRE Genre
EXTRACTOR_LOCATION Location
EXTRACTOR_VERSIONNUMBER VersionNumber
EXTRACTOR_ORGANIZATION Organization
EXTRACTOR_COPYRIGHT Copyright
EXTRACTOR_SUBJECT Subject
EXTRACTOR_KEYWORDS Meta.Keywords
EXTRACTOR_CONTRIBUTOR Contributor
EXTRACTOR_RESOURCE_TYPE Resource-Type
EXTRACTOR_FORMAT Format
EXTRACTOR_RESOURCE_IDENTIFIER Resource-Idendifier
EXTRACTOR_SOURCE Source
EXTRACTOR_RELATION Relation
EXTRACTOR_COVERAGE Coverage
EXTRACTOR_SOFTWARE Software
EXTRACTOR_DISCLAIMER Disclaimer
EXTRACTOR_WARNING Warning
EXTRACTOR_TRANSLATED Translated
EXTRACTOR_CREATION_DATE Creation-Date
EXTRACTOR_MODIFICATION_DATE Modification-Date
EXTRACTOR_CREATOR Creator
EXTRACTOR_PRODUCER Producer
EXTRACTOR_PAGE_COUNT Page-Count
EXTRACTOR_PAGE_ORIENTATION Page-Orientation
EXTRACTOR_PAPER_SIZE Paper-Size
EXTRACTOR_USED_FONTS Used-Fonts
EXTRACTOR_PAGE_ORDER Page-Order
EXTRACTOR_CREATED_FOR Created-For
EXTRACTOR_MAGNIFICATION Magnification
EXTRACTOR_RELEASE Release
EXTRACTOR_GROUP Group
EXTRACTOR_SIZE Size
EXTRACTOR_SUMMARY Summary
EXTRACTOR_PACKAGER Packager
EXTRACTOR_VENDOR Vendor
EXTRACTOR_LICENSE License
EXTRACTOR_DISTRIBUTION Distribution
EXTRACTOR_BUILDHOST BuildHost
EXTRACTOR_OS OS
EXTRACTOR_DEPENDENCY Dependency
EXTRACTOR_HASH_MD4 Hash-MD4
EXTRACTOR_HASH_MD5 Hash-MD5
EXTRACTOR_HASH_SHA0 Hash-SHA0
EXTRACTOR_HASH_SHA1 Hash-SHA1
EXTRACTOR_HASH_RMD160 Hash-RMD160
EXTRACTOR_RESOLUTION Resolution
EXTRACTOR_CATEGORY Ext.Category
EXTRACTOR_BOOKTITLE BookTitle
EXTRACTOR_PRIORITY Priority
EXTRACTOR_CONFLICTS Conflicts
EXTRACTOR_REPLACES Replaces
EXTRACTOR_PROVIDES Provides
EXTRACTOR_CONDUCTOR Conductor
EXTRACTOR_INTERPRET Interpret
EXTRACTOR_OWNER Owner
EXTRACTOR_LYRICS Lyrics
EXTRACTOR_MEDIA_TYPE Media-Type
EXTRACTOR_CONTACT Contact
EXTRACTOR_THUMBNAIL_DATA Thumbnail-Data
EXTRACTOR_PUBLICATION_DATE Publication-Date
EXTRACTOR_CAMERA_MAKE Camera-Make
EXTRACTOR_CAMERA_MODEL Camera-Model
EXTRACTOR_EXPOSURE Exposure
EXTRACTOR_APERTURE Aperture
EXTRACTOR_EXPOSURE_BIAS Exposure-Bias
EXTRACTOR_FLASH Flash
EXTRACTOR_FLASH_BIAS Flash-Bias
EXTRACTOR_FOCAL_LENGTH Focal-Length
EXTRACTOR_FOCAL_LENGTH_35MM Focal-Length-35MM
EXTRACTOR_ISO_SPEED ISO-Speed
EXTRACTOR_EXPOSURE_MODE Exposure-Mode
EXTRACTOR_METERING_MODE Metering-Mode
EXTRACTOR_MACRO_MODE Macro-Mode
EXTRACTOR_IMAGE_QUALITY Image-Quality
EXTRACTOR_WHITE_BALANCE White-Balance
EXTRACTOR_ORIENTATION Orientation
EXTRACTOR_TEMPLATE Template
EXTRACTOR_SPLIT Split
EXTRACTOR_PRODUCTVERSION ProductVersion
EXTRACTOR_LAST_SAVED_BY Last-Saved-By
EXTRACTOR_LAST_PRINTED Last-Printed
EXTRACTOR_WORD_COUNT Word-Count
EXTRACTOR_CHARACTER_COUNT Character-Count
EXTRACTOR_TOTAL_EDITING_TIME Total-Editing-Time
EXTRACTOR_THUMBNAILS Thumbnails
EXTRACTOR_SECURITY Security
EXTRACTOR_CREATED_BY_SOFTWARE Created-By-Software
EXTRACTOR_MODIFIED_BY_SOFTWARE Modified-By-Software
EXTRACTOR_REVISION_HISTORY Revision-History
EXTRACTOR_LOWERCASE Lowercase
EXTRACTOR_COMPANY Company
EXTRACTOR_GENERATOR Generator
EXTRACTOR_CHARACTER_SET Meta-Charset
EXTRACTOR_LINE_COUNT Line-Count
EXTRACTOR_PARAGRAPH_COUNT Paragraph-Count
EXTRACTOR_EDITING_CYCLES Editing-Cycles
EXTRACTOR_SCALE Scale
EXTRACTOR_MANAGER Manager
EXTRACTOR_MOVIE_DIRECTOR Movie-Director
EXTRACTOR_DURATION Duration
EXTRACTOR_INFORMATION Information
EXTRACTOR_FULL_NAME Full-Name
EXTRACTOR_CHAPTER Chapter
EXTRACTOR_YEAR Year
EXTRACTOR_LINK Link
EXTRACTOR_MUSIC_CD_IDENTIFIER Music-CD-Identifier
EXTRACTOR_PLAY_COUNTER Play-Counter
EXTRACTOR_POPULARITY_METER Popularity-Meter
EXTRACTOR_CONTENT_TYPE Ext.Content-Type
EXTRACTOR_ENCODED_BY Encoded-By
EXTRACTOR_TIME Time
EXTRACTOR_MUSICIAN_CREDITS_LIST Musician-Credits-List
EXTRACTOR_MOOD Mood
EXTRACTOR_FORMAT_VERSION Format-Version
EXTRACTOR_TELEVISION_SYSTEM Television-System
EXTRACTOR_SONG_COUNT Song-Count
EXTRACTOR_STARTING_SONG Strting-Song
EXTRACTOR_HARDWARE_DEPENDENCY Hardware-Dependency
EXTRACTOR_RIPPER Ripper
EXTRACTOR_FILE_SIZE File-Size
EXTRACTOR_TRACK_NUMBER Track-Number
EXTRACTOR_ISRC ISRC
EXTRACTOR_DISC_NUMBER Disc-Number

If a section name from the list above is not specified in sections.conf, the value of the corresponding keyword is written to the body section. Keywords of unknown type are written to the body section as well.
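The fallback rule for keyword types prior to libextractor 0.6 can be sketched in Python as a simple lookup. Only a few rows of Table 3-1 are reproduced here; target_section is a hypothetical helper, and treating section names as case-sensitive is an assumption.

```python
# A small excerpt of Table 3-1 (keyword type -> section name).
SECTION_NAMES = {
    "EXTRACTOR_TITLE": "Title",
    "EXTRACTOR_AUTHOR": "Author",
    "EXTRACTOR_LANGUAGE": "Content-Language",
    "EXTRACTOR_KEYWORDS": "Meta.Keywords",
}


def target_section(keyword_type, configured_sections):
    """Pick the section a keyword value goes to, falling back to body."""
    name = SECTION_NAMES.get(keyword_type)
    if name is not None and name in configured_sections:
        return name
    # Unknown keyword type, or section not listed in sections.conf.
    return "body"


configured = {"Title", "Meta.Keywords"}  # stands in for sections.conf
for kw in ("EXTRACTOR_TITLE", "EXTRACTOR_AUTHOR", "EXTRACTOR_SOMETHING_ELSE"):
    print(kw, "->", target_section(kw, configured))
```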

For libextractor 0.6.x, the values returned by the EXTRACTOR_metatype_to_string function are used as section names.