Filtering documents with SWISH::Filter -------------------------------------- Swish-e knows only how to parse HTML, XML, and text files. Other file types may be indexed with the help of filters. SWISH::Filter is a Perl module designed to make converting documents from one type of content to another type of content easy. It's uses a plug-in type of system where new filters can be added with little effort. SWISH::Filter (and associated plug-in filter modules) do not normally do the actual filtering. This system provides only an interface to the programs that do the filtering. For example, the Swish-e distribution includes a filter plug-in called SWISH::Filters::Pdf2HTML. For this filter to work you must install the xpdf package that includes the pdftotext and pdfinfo programs. SWISH::Filters::Pdf2HTML only provides a unified interface to this programs. The included program F will use SWISH::Filter by default. This means that installing the programs that do the filter is all that is needed to start filtering documents. For example, installing the xpdf package will enable indexing of PDF file when spidering. The filter modules are in the $libexecdir/perl directory. Running swish-e -h will list the setting for $libexecdir, but is typically /usr/local/lib/swish-e if swish-e was built from source, or /usr/lib/swish-e if installed as a package. On Window $libexecdir will be set at installation time. Note that $libexecdir/perl is not normally part of Perl's @INC array. So to read documenation on a specific filter you will need to either specify the full path to the filter or set PERL5LIB. For example: export PERL5LIB=/usr/local/lib/swish-e/perl perldoc SWISH::Filter Documentation for SWISH::Filter can also be found in the html directory and at http://swish-e.org. Swish-e has another filter system. The FileFilter directive that can be used to filter documents through an external program while indexing. That system requires a separate filter setup for each type of document. See the SWISH-CONFIG page for information on that type of filtering. Testing SWISH::Filter --------------------- The program swish-filter-test in installed by default (in the same location as the swish-e binary). This program can be used to test SWISH::Filter. For example, run the command: $ swish-filter-test foo.pdf foo.txt Document foo.pdf was filtered. Document: foo.pdf Content-Type: text/html (initial was application/pdf) Parser type: HTML* Document foo.txt was not filtered. Document: foo.txt Content-Type: text/plain (initial was text/plain) Parser type: TXT* Run the command $ swish-filter-test -man for documentation. Current filters distributed with Swish-e: ----------------------------------------- All of these filters require installation of helper programs and/or Perl modules. See the individual module's documentation for dependencies. SWISH::Filters::Doc2txt - converts MS Word documents to text SWISH::Filters::Pdf2HTML - converts PDF files to HTML with info tags as metanames SWISH::Filters::ID3toHTML - extracts out ID3 (v1 and v2) tags from MP3 files SWISH::Filters::XLtoHTML - converts MS Excel to HTML Filters that depend on Perl modules that are not installed will not load. Setting the environment variable FILTER_DEBUG may report helpful errors when using filters. See perldoc SWISH::Filter for instructions on creating filters.