phpBB

Development Wiki

Sphinx Fulltext Search

From phpBB Development Wiki

Latest revision as of 19:29, 2 June 2020

Sphinx Fulltext Search is a search backend that lets phpBB 3.1 use the Sphinx Open Source Search Server. Using Sphinx improves the performance of both searching and indexing, particularly on boards with large databases. Being both flexible and fast, the Sphinx server is a better alternative as a search backend.

Minimum Requirements

  • Sphinx Search server 2.0.1+ and a phpBB 3.1 board running on either a MySQL or a PostgreSQL database.
  • Sphinx Search server 3.x has introduced major changes and does not work.
  • Sphinx Search server 2.2.11 is confirmed to be working on phpBB 3.3.0 running MariaDB 10.3.

Installation Instructions

Sphinx Installation

Follow the Instructions to install Sphinx. Only the installation itself is required; there is no need to follow the "Sphinx Quick Usage Tour" for phpBB search.

Sphinx Configuration

The Sphinx configuration can either be generated through the ACP and pasted into sphinx.conf, or phpBB/docs/sphinx.sample.conf can be edited manually and used instead. The following directories need to be created and defined in sphinx.conf:

  • Config directory, which holds sphinx.conf and stopwords.txt (if defined).
  • Data directory, which holds the binary and index files.
  • Log directory, a subdirectory of the data directory, which stores all logs related to the Sphinx search server.

Creating Required Directories

  • Data Directory
mkdir -p {DATA_PATH}
  • Log Directory
mkdir -p {DATA_PATH}/log
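
The two commands above can be combined into one step. The following is a minimal sketch, assuming the paths are held in the shell variables DATA_PATH and CONFIG_PATH; the default locations used here are placeholders under a scratch directory, not paths that phpBB or Sphinx prescribe.

```shell
# Placeholder locations (assumptions); export your real paths before running.
DATA_PATH="${DATA_PATH:-$(mktemp -d)/sphinx/data}"
CONFIG_PATH="${CONFIG_PATH:-$(dirname "$DATA_PATH")/config}"

# -p creates missing parent directories and does not fail
# if a directory already exists, so the script is safe to re-run.
mkdir -p "$CONFIG_PATH" "$DATA_PATH/log"

# Show the resulting layout before pointing sphinx.conf at it.
ls -ld "$CONFIG_PATH" "$DATA_PATH" "$DATA_PATH/log"
```

The user that runs indexer and searchd needs write access to the data and log directories.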

Indexing

The board administrator needs to select Sphinx Fulltext Search as the search backend and create the search index through the ACP UI. This creates a SPHINX_TABLE in the database. The Sphinx indexer then has to be run manually from the shell.

  • Index Main
indexer --config {CONFIG_PATH}/sphinx.conf index_phpbb_{SPHINX_ID}_main >> {DATA_PATH}/log/indexer.log 2>&1 &
  • Index Delta
indexer --config {CONFIG_PATH}/sphinx.conf index_phpbb_{SPHINX_ID}_delta >> {DATA_PATH}/log/indexer.log 2>&1 &
  • Re-Index
indexer --rotate --config {CONFIG_PATH}/sphinx.conf index_phpbb_{SPHINX_ID}_delta >> {DATA_PATH}/log/indexer.log 2>&1 &
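
For the first build, the main and delta commands can be wrapped in a small function that stops at the first failure. This is only a sketch: the values of CONFIG_PATH and SPHINX_ID are placeholders, the INDEXER variable is a hypothetical hook for substituting another binary, and it assumes that indexer exits with a non-zero status when a build fails.

```shell
# Placeholder values (assumptions); take the real ones from your sphinx.conf.
CONFIG_PATH="/path/to/config"
SPHINX_ID="abc123"
# Hypothetical override hook; defaults to the indexer found on PATH.
INDEXER="${INDEXER:-indexer}"

# Build the main index, then the delta index, in the foreground.
build_indexes() {
    for part in main delta; do
        "$INDEXER" --config "$CONFIG_PATH/sphinx.conf" \
            "index_phpbb_${SPHINX_ID}_${part}" || {
            echo "indexing ${part} failed" >&2
            return 1
        }
    done
}
```

Calling build_indexes runs both builds one after the other. Unlike the commands above, it does not redirect output to indexer.log or run in the background, so errors appear directly in the terminal.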

Test Sphinx

Test whether Sphinx is working. The following command returns the search results for the given search string.

search --config {CONFIG_PATH}/sphinx.conf search string

Incremental Updates

On most Unix systems, the crontab file can be edited with

crontab -e

Add this line to update the delta index every five minutes:

*/5 * * * * indexer --rotate --config {CONFIG_PATH}/sphinx.conf index_phpbb_{SPHINX_ID}_delta >> {DATA_PATH}/log/indexer.log 2>&1 &

Add this line to run a full index once every night (at 03:00 in this example):

0 3 * * * indexer --rotate --config {CONFIG_PATH}/sphinx.conf index_phpbb_{SPHINX_ID}_main >> {DATA_PATH}/log/indexer.log 2>&1 &
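
To avoid typing the placeholders by hand, the two crontab lines can be generated from shell variables. This is a sketch under stated assumptions: cron_line is a hypothetical helper, and the three values below are placeholders to be replaced with those from your own sphinx.conf.

```shell
# Placeholder values (assumptions); replace with those from your setup.
CONFIG_PATH="/path/to/config"
DATA_PATH="/path/to/data"
SPHINX_ID="abc123"

# Print one crontab line.
# $1 = cron schedule, $2 = index suffix (main or delta)
cron_line() {
    printf '%s indexer --rotate --config %s/sphinx.conf index_phpbb_%s_%s >> %s/log/indexer.log 2>&1 &\n' \
        "$1" "$CONFIG_PATH" "$SPHINX_ID" "$2" "$DATA_PATH"
}

cron_line '*/5 * * * *' delta
cron_line '0 3 * * *' main
```

The printed lines can be pasted into the editor opened by crontab -e.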

Start Searchd

Start the Sphinx daemon.

searchd --config {CONFIG_PATH}/sphinx.conf >> {DATA_PATH}/log/searchd-startup.log 2>&1 &

Troubleshooting

Log files present in the {DATA_PATH}/log/ directory can be checked for errors. See Sphinx Documentation for details.

Manual Configuration

A sample Sphinx config file for the phpBB Sphinx search backend is available [# here]. It has many options, including the database details and the locations of the Sphinx data and config directories.

Database Details

Details of the database used by both the Sphinx daemon and the board.

  • type - database type, default mysql
  • sql_host - hostname, default localhost
  • sql_user - database user name
  • sql_pass - database password
  • sql_port - database port, default 3306 for MySQL
  • db_name - database name

Searchd Details

  • listen - IP address and port the Sphinx daemon listens on, default 127.0.0.1:3312
  • read_timeout - network client request read timeout in seconds, default 5
  • max_children - maximum number of children to fork (concurrent searches to run in parallel), default 30
  • max_matches - maximum number of matches the daemon keeps for a query and can return to the client, default 20000

Wildcard searching

By default, wildcard searching is disabled and the * operator does not work. To enable wildcard searching, consider configuring the following parameters:

  • ignore_chars - characters (in Unicode notation) that are ignored, i.e. dropped from words, in the search index; default none. For example, ignore_chars = U+00AD, U+002D joins hyphenated words into a single word, so "re-establish" is indexed as "reestablish". Ignored characters must not be listed in charset_table.
  • min_prefix_len - minimum prefix length to index. A value greater than 0 enables partial word matching with the wordstart* wildcard; default 0 (wildcards disabled). Suggested value: 3 (tes* finds test, tested, testing etc.).
  • min_infix_len - minimum infix length to index. A value greater than 0 enables partial word matching with the 'start*', '*end' and '*middle*' wildcards; default 0 (wildcards disabled). Suggested value: 3 (*est* finds test, tested, testing, estimated, shortest etc.).

NOTE: use only one of min_prefix_len and min_infix_len, not both. The unused parameter should be set to 0. Enabling wildcard indexing increases the size of the search index.
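
The rule in the note can be checked mechanically before re-indexing. The sketch below is not part of phpBB or Sphinx: conf_values, is_enabled and check_wildcards are hypothetical helpers, and the check looks at the whole file rather than at each index section, so it may report a conflict when the two settings live in different indexes.

```shell
# Print the numeric value of every uncommented "<key> = <n>" line.
# $1 = key name, $2 = config file
conf_values() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2"
}

# Succeed if the key is set to a value greater than 0 anywhere in the file.
is_enabled() {
    conf_values "$1" "$2" | grep -q '[1-9]'
}

# Report whether prefix and infix indexing are both switched on.
# $1 = config file
check_wildcards() {
    if is_enabled min_prefix_len "$1" && is_enabled min_infix_len "$1"; then
        echo "conflict: min_prefix_len and min_infix_len are both non-zero"
        return 1
    fi
    echo "ok"
}
```

A call such as check_wildcards {CONFIG_PATH}/sphinx.conf prints "ok" or the conflict message and sets the exit status accordingly.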

Stopwords

The Sphinx config file provides an option for specifying a file that contains search stop words. Stop words are common words such as 'a' and 'the' that appear frequently in text and should be ignored when searching. A fairly complete list of English stop words can be found [# here]. These words can be copied into a text file, which is then referenced in the index_phpbb section of sphinx.conf as

stopwords = path/to/stopwords.txt
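
As an illustration, a stop word file can be created from the shell. The word list below is a tiny sample chosen for this sketch, not the linked list, and the default CONFIG_PATH is a placeholder scratch directory.

```shell
# Placeholder location (assumption); export your real config directory first.
CONFIG_PATH="${CONFIG_PATH:-$(mktemp -d)}"

# Write a short sample list, one word per line.
printf '%s\n' a an and the of to in is > "$CONFIG_PATH/stopwords.txt"

# Print the directive to place in the index section of sphinx.conf.
echo "stopwords = $CONFIG_PATH/stopwords.txt"
```

Changes to the stop word file take effect only after the index has been rebuilt with indexer.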