=pod =head1 NAME libswish3 - Swish3 C library =head1 SYNOPSIS <> =head1 DESCRIPTION B is the core C library of B. B uses the GNOME L library to parse words and metadata from XML, HTML and plain text files. B supports full UTF-8 encoding. B is a parsing tool for use with information retrieval (IR) libraries. Dynamic language bindings are available in the source distribution in the C directory. =head1 APIs The following APIs are defined: =head2 Parsing API B provides three basic input functions: =over =item swish_parse_file() =item swish_parse_fh() =item swish_parse_buffer() =back Each of these functions takes a C struct pointer and optional I. In addition: =over =item The swish_parse_file() function takes a file path, which must be a valid file. Directories and links are not supported. The assumption is that you will use your calling code to recurse through directories and handle links. =item swish_parse_buffer() takes a string representing the document headers and the full text of the document. =item swish_parse_fh() takes a filehandle pointer, which if set to NULL, defaults to stdin. =back See the L section for more information on using swish_parse_fh() and swish_parse_buffer(). See the L Function> section for more information on how to deal with the data extracted by each of the swish_parse_* functions. =head2 Headers API The Headers API supports and extends the Swish-e B<-S prog> feature, which allows you to feed the indexer with output from another Iram. The API has been extended from Swish-e's to allow for MIME types and more congruence with the HTTP 1.1 specification. See SWISH-RUN documentation in the Swish-e distribution for the Swish-e version 2 headers API. This is the B implementation. See B for a simple Perl-based way of generating the proper headers. =over =item Content-Location B Path-Name The name of the document. May be any string: an ID of a record in a database, a URL or a simple file name. The string is stored in the swish_DocInfo B struct member, which is often used as the primary identifier of a document in an index. This header is required. =item Content-Length The length in bytes of the document, starting after the blank line separating the headers from the document itself. The value must be exactly the length of the document, including any extra line feeds or carriage returns at the end of the document. Example: Content-Location: foo.html Content-Length: 9 The doc.\n ^^^^^^^^ ^ 12345678 9 The value is stored in the swish_DocInfo B struct member. This header is required. =item Last-Modified B Last-Mtime The last modification time of the document. The value must be an integer: the seconds since the Epoch on your system. If not present, will default to the current time. The value is stored in the swish_DocInfo B struct member. This header is not required. =item Parser-Type B Document-Type Explicitly name the parser used for the document, rather than defaulting to the MIME type mapping based on B and/or B. The three parser types are: =over =item XML =item HTML =item TXT =back The Swish-e values B, B, B, B, B, B are also supported for compatibility, but they map to the three B types. The value is stored in the swish_DocInfo B struct member. If not present, the document parser will be automatically chosen based on the following logic: =over =item If a B is given, the parser mapped to that MIME type will be used. You may override the default mappings in your configuration. See B. =item If no B is given, a MIME type will be guessed at based on the file extension of the document's B, and the parser mapped to that MIME type will be used. =item Finally, if a MIME type is not identified, the parser defined in B in B will be used. =back See also B and B. This header is not required. =item Content-Type The MIME type of the document. The B MIME type list is based on the Apache 2.0 version. See L for the official registry. If not defined with B, the MIME type will be guessed based on the file extension in the B header. If the B string does not contain a file extension (as might be the case with non-URL value), or the file extension has no MIME mapping, then the MIME type will default to B as defined in B. You may override the default extension-to-MIME mappings in your configuration. See B. The value is stored in the swish_DocInfo B struct member. See also B and B. This header is not required. =item Update-Mode B This header exists only for backwards compatibility with Swish-e's incremental index feature. B =item Encoding Alias: Charset Declare the document encoding. See the L section. =back =head2 Structures API Writing an effective I function requires an understanding of some of the key B data structures. For more details on any of these structures, see the SYNOPSIS. =head3 swish_3 The main data structure. A swish_3 object has a swish_Config, swish_Analyzer and swish_Parser object and delegates to eash as appropriate. This is typically the only object you need to create and use. =head3 swish_Config A configuration object. This object is required for initializing both a C object and a C object. =head3 swish_Parser A parser object. Required for executing any of the three C functions. =head3 swish_ParserData A parser data object. This object is passed around internally by the libxml2 SAX2 handlers, and is eventually the object passed to the I function pointer. See L Function>. =head3 swish_TokenList A list of words or tokens. The object is typically accessed via a swish_TokenIterator, like this: // example of swish_TokenIterator swish_Token *t; while ((t = swish_next_token(token_iterator)) != NULL) { SWISH_DEBUG_MSG("\n\ t->ref_cnt = %d\n\ t->pos = %d\n\ t->context = %s\n\ t->meta = %d [%s]\n\ t->len = %d\n\ t->value = %s\n\ ", t->ref_cnt, t->pos, t->context, t->meta->id, t->meta->name, t->len, t->value); } See the swish_debug_token_list() function from which the code above is taken. =head3 swish_Token An object representing one word or token. The word's position relative to other words, length, tag context and MetaName are all available in the object. =head3 swish_DocInfo An object describing metadata about the document itself: URI, MIME type, size, etc. =head3 swish_Analyzer The Analyzer object controls how the character content of a document is parsed: whether or not a WordList is created with a tokenizer, if the words (tokens) are lowercased or stemmed, etc. =head2 The I Function The I function pointer is the final link in the parsing chain. The function pointer is set in the swish_Parser object constructor, and is called by each of the swish_parse_* functions after the entire document has been parsed and (optionally) tokenized. The I receives one argument: a swish_ParserData object containing all the metadata and words in the document. If all you wanted to do was print out a report about each document as it was parsed, your I function might be as simple as: void my_handler( swish_ParserData * parse_data ) { swish_docinfo_debug( parse_data->docinfo ); swish_token_list_debug( parse_data->token_iterator ); swish_nb_debug( parse_data->properties, "Property" ); swish_nb_debug( parse_data->metanames, "MetaName" ); } IMPORTANT: After the I function is called, all the structures referenced by the swish_ParserData object are automatically freed, so if you intend to keep any of the data for storing in an index, you will need to strdup() words, properties, docinfo, etc. as part of your indexing code. See the example C file for how to create and pass in a I function pointer to the swish_3_init() constructor. =head2 Configuration API Configuration is different with B than with Swish-e. The biggest change is that B configuration files are written in XML. This is done for several reasons: =over =item 1 Since B already has a powerful XML parser built-in, it's much easier to parse a configuration file written in XML than to port the Swish-e config parser to B. =item 2 B stores index header information in a XML format nearly identical to the configuration file format. So the parser needs to understand only one XML schema. =item 3 You can store UTF-8 text in your configuration file and it will be parsed correctly. =item 4 The configuration directive list is extensible. Simple key/value configuration directives can be added without any modification to the B config parser. They are simply stored in the C struct hash for your own use and amusement. =back B The configuration directive names documented in the L section below are reserved for use by B. Some of them have special handling considerations (like MetaNames and PropertyNames). So the important idea to grasp with the extensible configuration feature is "simple key/value pairs." This section describes how to build a B configuration file. =head2 Configuration Example Here's an example B configuration file: yes <other>color size weight</other> </MetaNames> <PropertyNames> <foo type="text" ignore_case="1" /> <bar type="int" /> <lastmod type="date" /> <bing ignore_case="0" /> <description verbatim="1" max="10000" alias="body" sort_length="20" /> <notsorted sort="0" /> </PropertyNames> <Tokenize>1</Tokenize> </swish> And here's that same example, dissected: <swish> The top level tag. <FollowSymLinks>yes</FollowSymLinks> Equivalent to the Swish-e style: FollowSymLinks yes which simply informs whatever aggregator you are using that when confronted with a symlink on the filesystem, it should be followed. C<FollowSymLinks> is an example of a simple key/value pair (see the B<CAUTION> above). =head3 MetaNames Here's the first big difference from Swish-e. MetaNames, MetaNameAlias, and MetaNamesRank have been combined into a single XML tag with appropriate attributes. <foo bias="10" /> is the same thing as (in Swish-e style): MetaNames foo MetaNamesRank 10 foo while: <swishtitle bias="50" /> <title alias_for="swishtitle" /> is equivalent to: MetaNames swishtitle MetaNameAlias swishtitle title MetaNamesRank 50 swishtitle You can see that the XML style allows for a terser, more compact expression. You can still assign multiple aliases to a single MetaName: <other>color size weight</other> is equivalent to: MetaNames other MetaNameAlias other color size weight In addition, there are some special features intended for use with HTML documents. <links html="1" alias_for="href" /> # same as HTMLLinksMetaName <images html="1" alias_for="src" /> # same as ImageLinksMetaName <alttext html="1" alias_for="alt" /> # same as IndexAltTagMetaName =head3 PropertyNames PropertyNames, PropertyNamesCompareCase, PropertyNamesIgnoreCase, PropertyNamesNoStripChars, PropertyNamesNumeric, PropertyNamesDate, PropertyNameAlias, PropertyNamesMaxLength, PropertyNamesSortKeyLength, StoreDescription and PreSortedIndex have all been combined into a single XML directive. Here's the example from above with equivalent Swish-e directives annotated: <foo ignore_case="1" /> # PropertyNamesIgnoreCase foo <bar type="int" /> # PropertyNamesNumeric bar <lastmod type="date" /> # PropertyNamesDate lastmod <bing comparecase="1" /> # PropertyNamesCompareCase bing <description verbatim="1" max="10000" alias="body" sort_length="20" /> # PropertyNamesNoStripChars description # PropertyNamesMaxLength 10000 description # PropertyNameAlias description body # PropertyNamesSortKeyLength 20 description <notsorted sort="0" /> # PreSortedIndex foo bar lastmod bind description Again, the XML format greatly simplifies the syntax. You can assign attributes as you need, though be aware that some attributes are inherently mismatched and might generate an error or unexpected behaviour: <foo ignore_case="1" type="int" /> # wrong <foo ignore_case="0" type="date" /> # wrong <foo verbatim="1" type="int" /> # wrong <foo sort="0" max="20" /> # wrong =head2 Directives The following configuration directives are currently reserved by libswish3. These are top-level tags within the C<<swish>> root tag. If you use other top-level tags than these, they will be treated as key/value pairs and added to the C<misc> hash in the C<swish_Config> struct. =over =item swish_version Contains the value of the SWISH_VERSION constant. =item swish_lib_version Contains the value of the SWISH_LIB_VERSION constant. =item MetaNames Contains MetaName definitions. =item PropertyNames Contains PropertyName definitions. =item Parsers Contains a mapping of Parser name to MIME types. Example: <Parsers> <XML>application/xml</XML> <HTML>text/html</HTML> <TXT>text/plain</TXT> <XML>text/foo</XML> <HTML>default</HTML> <HTML>foo/bar</HTML> </Parsers> =item MIME Contains a mapping of file extensions to MIME types. Example: <MIME> <au>foo/bar</au> </MIME> =item Index Contains attributes of the inverted index. Example (with defaults): <Index> <Format>Native</Format> <Name>index.swish</Name> <Locale>en_US.UTF-8</Locale> </Index> =item TagAlias Contains mapping of alias names to tag names. <TagAlias> <swishdescription>body</swishdescription> <swishtitle>title</swishtitle> <swishtitle>foo</swishtitle> <swishtitle>bar</swishtitle> </TagAlias> =item Tokenize Toggle the tokenizer on or off. Default is C<yes> (on). <Tokenize>yes</Tokenize> =item CascadeMetaContext Toggle the cascading effect of MetaName context. The default is C<no> (off). <CascadeMetaContext>no</CascadeMetaContext> The effect is to consider a Token assigned to every MetaName in its DOM stack. An example: <doc> <metaone> foo <metatwo>bar</metatwo> </metaone> </doc> If CascadeMetaContext is true (on), then the token C<bar> will be tracked as both MetaName C<metaone> and C<metatwo>. If CascadeMetaContext is false (off), then the token C<bar> will be only C<metatwo>. =back =head1 ENCODINGS B<libswish3> relies on libxml2 to convert all incoming documents to UTF-8 encoding. If your documents are not UTF-8 encoded, you should either (a) convert them to UTF-8 using a tool like iconv or (b) explicitly tell B<libswish3> the encoding. You can tell B<libswish3> the encoding in the following ways, in order of precedence: =over =item Let libxml2 guess the encoding, based on the first 4 bytes of the document. This is not always reliable and is not recommended. =item When using the Headers API, specify the Encoding via a header. =item Put the encoding in the document. For XML this is via the <?xml encoding=""> attribute. For HTML this is via a <meta> tag with content indicating the charset. For TXT this is done via environment variables (see below). =item Use environment variables. B<libswish3> will detect your environment's locale using the standard Unix setlocale() library function. The environment variables checked are, in order, C<LC_ALL>, C<LC_CTYPE> and C<LANG>. Finally, you may set the encoding explicity with the C<SWISH_ENCODING> environment variable. =back The libxml2 parsers default to UTF-8 for XML per the XML standard, and to ISO-8859-1 (Latin1) for HTML. =head1 EXAMPLES See the C<swish_lint.c> file included in the B<libswish3> distribution. =head1 FAQ =head2 What is IR? Information Retrieval. =head2 How is libswish3 related to Swish-e? libswish3 is the core parsing library for Swish-e version 3 (Swish3). =head2 Is libswish3 a search engine? No. libswish3 is a document parser. It might work well in or with any number of search engines, but it is not in itself any kind of search tool. =head2 So what does libswish3 DO exactly? libswish3 reads text, HTML and XML files and extracts just the words and document properties from each document. It then hands off the token list and properties to a I<handler> function. Finally, it frees all the memory associated with the token list and properties. The I<handler> function can do whatever you wish, though typically a I<handler> would iterate over the words in the token list and add each one to an index using an IR library API. =head1 BACKGROUND libswish3 is part of the Swish-e project. It was born out of the need for UTF-8 and incremental indexing support and a desire to experiment with alternate indexing libraries like Lucene, KinoSearch, Xapian and Hyperestraier. libswish3 was developed with the idea that many quality IR libraries already exist, but few if any provide an easy and fast way of preparing documents for indexing. The following assumptions informed the development of libswish3. =head2 The IR Toolchain A decent IR toolchain requires 5 parts: =over 4 =item aggregator Collects documents from a filesystem, database, website or other sources. =item filter Normalizes documents to a standard format (plain text or a delimited/markup like YAML, HTML or XML) for indexing. =item parser Breaks a document into a list of words, including their context and position. =item indexer Writes the list of words in a storage system for quick, efficient retrieval. =item searcher Parses queries and fetches data from the indexer's storage system. =back Of course, the division between these parts is not always clean or apparent. Parsing search queries, for example, will necessarily involve elements of the parser and searcher components, while the indexer and searcher are of necessity intrinsically bound. But any complete IR system will have these five parts in some combination. =head2 Swish-e aggregators and filters are already good The existing Swish-e document aggregators (B<DirTree.pl> and B<spider.pl>) and filtering system (B<SWISH::Filter>) are good. They are all written in Perl and are easily modified, and they have ample configuration options and documentation. =head2 Why reinvent the wheel? Several good IR libraries exist that provide an indexer and searcher. These libraries do UTF-8, incremental indexing, and have search syntax on par with (or better than) Swish-e 2.x. Examples include Xapian, KinoSearch and Lucene. While they might be a little slower than Swish-e (at least in terms of indexing speed) they make up that for with: =over =item well-documented APIs =item bindings in a variety of programming languages =item active development communities =item the flexibility that comes with being a library instead of a fixed program =back =head2 The missing link The piece that Swish-e provides that other IR libraries lack is a fast, stable, integrated document parser. Xapian has Omega, but it does not parse XML, nor does it recognize ad hoc word context (metanames). However, the Swish-e 2.x parser does not work independently of the Swish-e indexer and searcher, nor does it support UTF-8. One piece is missing: a parser that works with the Swish-e aggregator/filter system, supports UTF-8, and offers flexible options for connecting with other IR libraries. Ergo, libswish3: a document parser compatible with the existing Swish-e -S prog API and capable of generating UTF-8 token lists for indexing with a variety of IR libraries. =head2 Where does libswish3 fit? libswish3 is the core C library in Swish3. However, libswish3 may be used without the rest of the Swish3. The assumption is that libswish3 could fit into an IR toolchain like this: aggregator -> filter -> libswish3 -> some IR library You could then use the native search API of the IR library. For example, you might use the Swish-e B<spider.pl> script to spider a website, filtering documents with B<SWISH::Filter> and then handing the output to a B<libswish3>-based program that will parse the documents into words and store the data in a Xapian or KinoSearch index (or both!). That model is, in fact, what Swish3 does. Or you might use the B<SWISH::Prog> Perl module (from the CPAN) to build your own aggregator/filter system, then hand the output to libswish3. =head1 AUTHOR Peter Karman (peter@peknet.com). =head1 CREDITS B<libswish3> is inspired by code from Swish-e (http://www.swish-e.org), Libxml2 (http://www.xmlsoft.org), Apache (http://www.apache.org), Rahul Dhesi (http://www.tug.org/tex-archive/tools/zoo/), Angel Ortega (http://www.triptico.com/software/unicode.html), James Henstridge (http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html), YoLinux (http://www.yolinux.com/TUTORIALS/GnomeLibXml2.html) and no doubt many unnamed others. All mistakes, errors and poor programming choices are, however, those of the author. =head1 LICENSE B<libswish3> is licensed under the GPL. libswish3 is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. libswish3 is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details. You should have received a copy of the GNU Library General Public License along with libswish3; see the file COPYING. If not, write to the Free Software Foundation, Inc. 59 Temple Place - Suite 330 Boston, MA 02111-1307, USA =head1 SEE ALSO The project homepage: http://dev.swish-e.org/wiki/swish3 swish_lint(1), swish_isw(1), swish_words(1) =cut