PostgreSQL Fulltext Search
From phpBB Development Wiki
phpBB 3.1 "Ascraeus" introduces PostgreSQL Fulltext search backend for boards running on pgsql (postgres) databases. Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
- Parsing documents into tokens. It is useful to identify various classes of tokens, e.g. numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a parser to perform this step. PostgreSQL fulltext search docs
- Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching.
- Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for proximity ranking, so that a document that contains a more "dense" region of query words is assigned a higher rank than one with scattered query words.
Postgres fulltext search requires a phpBB 3.1 board running on postgres version >= 8.3
Postgres fulltext search can be selected via Search Settings in ACP. ACP configuration settings allow to modify maximum/minimum word length and text search configuration profile.
Minimum word length
Search ignores keywords with length shorter than the minimum word length while searching. Default Value: 4
Maximum word length
Search ignores keywords with length longer than the maximum word length while searching. Default Value:254
Text search configuration profile
Enables the user to select the configuration to be used for search. PostgreSQL's text search facility provides four types of configuration-related database objects:
- Text search parsers break documents into tokens and classify each token (for example, as words or numbers)
- Text search dictionaries convert tokens to normalized form and reject stop words.
- Text search templates provide the functions underlying dictionaries. (A dictionary simply specifies a template and a set of parameters for the template)
- Text search configurations select a parser and a set of dictionaries to use to normalize the tokens produced by the parser.
The default configuration profile is simple which is the appropriate configuration set during postgres installation. Postgres also has pre-defined configurations for many languages which can be selected as per the language of the board.
The board administrator can create/delete indexes for postgres full text search by choosing search index option in the maintenance tab. Index stats for postgres indexes show just the total no of posts which were indexed. Indexes need to be rebuilt everytime the search backend configuration settings are changed in the ACP.
The search keywords can contain + in front of a word which must be found and - in front of a word which must not be found. A list of words can be entered, separated by | into brackets if only one of the words must be found. Use * as a wildcard for partial matches.
Search by exact phrase is currently not supported by postgres full text search. In case a user tries to use " for phrase search, a warning message is displayed that the search backend does not support phrase search along with the general results.