Pivotal® Greenplum® Text v2.0.0

Working With Indexes

Indexing is key to preparing documents for text analysis and to achieving the best possible query performance. How you set up and configure indexes can affect the success of your project.

Creating GPText Indexes and Indexing Data

The general steps for creating and using a GPText index are:

  1. Add the GPText schema to a database
  2. Create an empty Solr index
  3. Map Greenplum Database data types to Solr data types
  4. Populate the index
  5. Commit the index

After you complete these steps, you can create and execute a search query or implement machine learning algorithms.

The examples in this section use a table called articles in a database called wikipedia that was created with a default public schema.

The articles table has five columns: id, date_time, title, content, and refs. The id column must be of type bigint (int8).

The content column is the default search column—the column that will be searched if we do not name a different one in the search query.

See the GPText Function Reference for details about GPText functions.
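
Condensed into a single psql session, the steps look like this (a sketch using the articles table and index names from this section; the data-type mapping in step 3 happens automatically when the index is created):

=# SELECT * FROM gptext.create_index('public', 'articles', 'id', 'content', true);
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM articles), 'wikipedia.public.articles');
=# SELECT * FROM gptext.commit_index('wikipedia.public.articles');

Each step is described in detail below.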

Add the GPText schema to a database

Before you can use GPText with a Greenplum Database, the gptext schema must be installed. Use the gptext-installsql utility to perform this one-time task:

$ gptext-installsql wikipedia

To open an interactive shell for executing queries on the wikipedia database, execute:

$ psql wikipedia

Create an empty Solr index

To create a new index, use the function gptext.create_index(schema_name, table_name, id_col_name, default_search_col_name, if_check_id_uniqueness).

There can be only one index per database table.

Example:

=# SELECT * FROM gptext.create_index('public', 'articles', 'id', 'content', true);
INFO:  Created index wikipedia.public.articles
 create_index
--------------
 t
(1 row)

This creates an index named wikipedia.public.articles.

To verify that the index was created, enter the following query:

=# SELECT * FROM gptext.index_status('wikipedia.public.articles');
 content_id |        index_name         | shard_name | shard_state | replica_name | replica_state |                   core                    |       node_name        |           base_url            | is_leader
------------+---------------------------+------------+-------------+--------------+---------------+-------------------------------------------+------------------------+-------------------------------+-----------
          0 | wikipedia.public.articles | shard0     | active      | core_node2   | active        | wikipedia.public.articles_shard0_replica2 | 192.0.2.217:18984_solr | http://192.0.2.217:18984/solr | f
          0 | wikipedia.public.articles | shard0     | active      | core_node4   | active        | wikipedia.public.articles_shard0_replica3 | 192.0.2.217:18983_solr | http://192.0.2.217:18983/solr | f
          0 | wikipedia.public.articles | shard0     | active      | core_node6   | active        | wikipedia.public.articles_shard0_replica1 | 192.0.2.217:18983_solr | http://192.0.2.217:18983/solr | t
          1 | wikipedia.public.articles | shard1     | active      | core_node1   | active        | wikipedia.public.articles_shard1_replica1 | 192.0.2.217:18984_solr | http://192.0.2.217:18984/solr | f
          1 | wikipedia.public.articles | shard1     | active      | core_node3   | active        | wikipedia.public.articles_shard1_replica3 | 192.0.2.217:18984_solr | http://192.0.2.217:18984/solr | f
          1 | wikipedia.public.articles | shard1     | active      | core_node5   | active        | wikipedia.public.articles_shard1_replica2 | 192.0.2.217:18983_solr | http://192.0.2.217:18983/solr | t
(6 rows)

This example was executed on a Greenplum Database cluster with two primary segments. Two shards were created, one for each segment, and each shard has three replicas, named core_node1 through core_node6.

You can also run the gptext-state utility at the command line to verify that the index was created.

Map Greenplum Database data types to Solr data types

If a Greenplum Database data type is an array, it is mapped to a multi-valued type in Solr. For example, INT[] maps to a multi-valued int field.

GPDB Type     Solr Type
bool          boolean
bytea         binary
char          string
name          string
int8          long
int4          int
int2          int
int           int
text          text
point         point
float4        float
float8        double
money         string
bpchar        string
varchar       text
interval      string
date          tdate
time          string
timestamp     tdate
timetz        string
timestamptz   tdate
bit           string
varbit        string
numeric       double

If a Greenplum Database data type is not listed, it is mapped to the Solr text type.
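
For example, consider a hypothetical table (the table and column names are illustrative, not from the GPText documentation):

-- tags (int[]) maps to a multi-valued Solr int field,
-- body (text) maps to the Solr text type, and id (bigint) to long.
CREATE TABLE docs (
    id   bigint,
    body text,
    tags int[]
) DISTRIBUTED BY (id);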

Populate the index

To populate the index, use the table function gptext.index(), which has the following syntax:

SELECT * FROM gptext.index(TABLE(SELECT * FROM table_name), index_name);

For example:

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM articles), 'wikipedia.public.articles');
 dbid | num_docs
------+----------
    3 |        3
    2 |        3
(2 rows)

The results of this command show that three documents from each segment were added to the index.

Considerations and best practices when indexing documents

The arguments to the gptext.index() function must be expressions.
TABLE(SELECT * FROM articles) creates a “table-valued expression” from the articles table, using the table function TABLE.

Choose the data to index or update by changing the inner select list in the query to select the columns and rows you want to index. When adding new documents to an existing index, for example, specify a WHERE clause in the gptext.index() call to choose only the new documents to index.
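
A minimal sketch of such an incremental update, assuming new rows can be identified by a cutoff on the date_time column (the cutoff value is illustrative):

SELECT * FROM gptext.index(
    TABLE(SELECT * FROM articles WHERE date_time > '2016-06-01'),
    'wikipedia.public.articles');
SELECT * FROM gptext.commit_index('wikipedia.public.articles');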

Be careful about distribution policies.
The first parameter of gptext.index() is TABLE(SELECT * FROM articles). The query in this parameter should have the same distribution policy as the table you are indexing. However, there are two cases where the query will not have the same distribution policy:

  1. Your query is a join of two tables
  2. You are indexing an intermediate table (staging table) that is distributed differently than the final table.

When the distribution policies differ, you must specify “SCATTER BY” for the query:

TABLE(SELECT * FROM articles SCATTER BY distrib_id)

where distrib_id is the distribution id used when you created your primary/final table.
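
For example, a sketch of case 1, assuming a hypothetical authors table joined to articles and an index whose final table is distributed on id (the joined column list is illustrative):

SELECT * FROM gptext.index(
    TABLE(SELECT a.id, a.content, au.name AS author  -- authors and author are hypothetical
          FROM articles a
          JOIN authors au ON a.author_id = au.author_id
          SCATTER BY a.id),
    'wikipedia.public.articles');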

Consider how GPText search results will be used.
You could, for example, index just the ID column and text columns that are targets of search queries, and then use the ID column to join with other tables in Greenplum Database. An alternative for some applications is to allow GPText to index and store all of the columns, even those that will not be searched, so that results from search queries are complete and do not have to be joined in Greenplum Database.
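
For the first approach, a sketch of joining search hits back to the source table by ID, following the gptext.search() call pattern used in the GPText query documentation (the query term 'solar' is illustrative):

SELECT a.id, a.title, s.score
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
                   'wikipedia.public.articles', 'solar', null) s
JOIN articles a ON a.id = s.id;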

Commit the index

After you create and populate an index, you must commit the index using gptext.commit_index(index_name).

For example:

=# SELECT * FROM gptext.commit_index('wikipedia.public.articles');
 commit_index
--------------
 t
(1 row)

Note: When you call this function, the index picks up any new data added since your last index commit.

Working With Indexes

This topic describes tasks for working with existing indexes: configuring, optimizing, deleting from, and dropping an index; adding and dropping index fields; and listing all indexes.

Configuring an index

You can modify your indexing behavior globally by using the gptext-config utility to edit a set of index configuration files. The files you can edit with gptext-config are:

  • solrconfig.xml – Contains most of the parameters for configuring Solr itself (see http://wiki.apache.org/solr/SolrConfigXml).
  • managed-schema – Defines the analysis chains that Solr uses for various different types of search fields (see Text Analysis Chains).
  • stopwords.txt – Lists words you want to eliminate from the final index.
  • protwords.txt – Lists protected words that you do not want to be modified by the analysis chain. For example, iPhone.
  • synonyms.txt – Lists words that you want replaced by synonyms in the analysis chain.
  • elevate.xml – Moves specific words to the top of your final index.
  • emoticons.txt – Defines emoticons for the text_sm social media analysis chain. (see The emoticons.txt file).

You can also use gptext-config to move files.

Optimizing an index

The function gptext.optimize_index(index_name, max_segments) merges the segments of the underlying Solr index into at most max_segments segments, which improves search efficiency.

Example:

=# SELECT * FROM gptext.optimize_index('wikipedia.public.articles', 10);
 optimize_index
----------------
 t
(1 row)

Deleting from an index

You can delete from an index using a query with the function gptext.delete(index_name, query). This deletes all documents that match the search query. To delete all documents, use the query '*:*'.

After a successful deletion, you must call gptext.commit_index(index_name).

Example that deletes all documents containing "sports" in the default search field:

=# SELECT * FROM gptext.delete('wikipedia.public.articles', 'sports');
 delete
--------
 t
(1 row)
=# SELECT * FROM gptext.commit_index('wikipedia.public.articles');

Example that deletes all documents from the index:

SELECT * FROM gptext.delete('wikipedia.public.articles', '*:*');
SELECT * FROM gptext.commit_index('wikipedia.public.articles');

Dropping an index

You can completely remove an index with the gptext.drop_index(index_name) function.

Example:

SELECT * FROM gptext.drop_index('wikipedia.public.articles');

Adding a field to an index

You can add a field to an existing index using the gptext.add_field() function. For example, you can add a field to the index after a column is added to the underlying database table or you can add a field to index a column that was not specified when the index was created.

Example:

CREATE TABLE articles(id bigint, date_time timestamp with time zone,
    title text, content text, refs text);
SELECT * FROM gptext.create_index('public', 'articles', 'id', 'content', true);
ALTER TABLE articles ADD summary text;
SELECT * FROM gptext.add_field('wikipedia.public.articles', 'summary', false, false);
SELECT * FROM gptext.reload_index('wikipedia.public.articles');
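
Note that reloading the index does not populate the new field; after adding data to the summary column, re-index and commit (a sketch using the same table and index names):

SELECT * FROM gptext.index(TABLE(SELECT * FROM articles), 'wikipedia.public.articles');
SELECT * FROM gptext.commit_index('wikipedia.public.articles');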

Dropping a field from an index

You can drop a field from an existing index with the gptext.drop_field() function. After you have dropped fields, call gptext.reload_index() to reload the index.

Example:

SELECT * FROM gptext.drop_field('wikipedia.public.articles', 'summary');
SELECT * FROM gptext.reload_index('wikipedia.public.articles');

Listing all indexes

You can list all indexes in the GPText cluster using the gptext-state command-line utility. For example:

$ gptext-state
20160601:15:32:13:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText cluster status...
20160601:15:32:13:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Current GPText Version: 2.0.0
20160601:15:32:15:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-All nodes are up and running.
20160601:15:32:15:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:------------------------------------------------
20160601:15:32:15:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Index state.
20160601:15:32:15:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:------------------------------------------------
20160601:15:32:15:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-   database    index_name                  state
20160601:15:32:15:011986 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-   wikipedia   wikipedia.public.articles   Green

About GPText Field Types and Analyzer Chains

When you create a GPText index, Solr creates a set of configuration files for the index. The managed-schema file contains definitions for fields, field types, and analyzers. This section is an overview of the contents of the managed-schema file and the relationships between the XML elements in the file. For more detailed documentation, refer to the comments in the file or the Apache SolrCloud documentation.

You can edit configuration files using the gptext-config utility. For example, the following command loads the managed-schema file into an editor:

$ gptext-config -i <index-name> -f managed-schema

GPText adds field elements to the managed-schema file for columns included when the index was created with the gptext.create_index() function. For example, this is the definition for a text field named description:

<field name="description" stored="false" type="text_intl" indexed="true"/>
  • The name attribute is the name of the database column. If the column name is not a valid Solr field name, it is altered to conform.
  • The stored attribute determines whether the content of the field is stored in the index. If the field is stored, GPText search results can return the field's content; if it is not stored, retrieving the content requires a SQL join with the database table.
  • The type attribute maps the Greenplum Database type to a Solr type, defined in the same file with a <fieldType> element.
  • The indexed attribute determines whether the field content will be indexed.

The <field> element can have additional attributes used with some types. See the comment after the <fields> element for a complete list of attributes.

Field Types

The type attribute of the <field> element determines how Solr stores the field in indexes. The class attribute maps the field type to a Solr Java class that recognizes and processes the data type. Solr includes many base field types. See Map Greenplum Database data types to Solr data types for a mapping of Solr types to Greenplum Database types. If you have a custom type, you can add new field types by implementing Solr Java type interfaces, or you can specify an existing base type and customize it with an analyzer chain, as described in the next section.

Analyzer Chains

Text analysis chains determine how Solr indexes text fields in a document. An analyzer chain is a set of Java classes that tokenize and filter the content before it is added to the index. Different analysis chains can be defined for indexing and querying operations.

Field analysis begins with a tokenizer that divides the contents of a field into tokens. In Latin-based text documents, the tokens are words (also called terms). In Chinese, Japanese, and Korean (CJK) documents, the tokens are characters.

The tokenizer can be followed by one or more filters executed in succession. Filters restrict the query results, for example, by removing unnecessary terms (“a”, “an”, “the”), converting term formats, or by performing other actions to ensure that only important, relevant terms appear in the result set. Each filter operates on the output of the tokenizer or filter that precedes it. Solr includes many tokenizers and filters that allow analyzer chains to process different character sets, languages, and transformations. See Analyzers, Tokenizers and Filters - The full list for a comprehensive list.

Field types are assigned analyzers in an index’s managed-schema file. The following example shows the Solr text field type specification:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

An analyzer has only one tokenizer, solr.WhitespaceTokenizerFactory in this example, which can be followed by one or more filters executed in succession.

Each filter operates on the output of the tokenizer or filter that precedes it. The solr.StopFilterFactory filter, for example, removes unnecessary terms (“a”, “an”, “the”) from the stream of tokens; the words to remove are listed in the stopwords.txt configuration file. You can edit the stopwords.txt file with the gptext-config utility to change the list of words excluded from the index.

There are separate analyzer types for index and query operations. The query analyzer chain in this example includes a solr.SynonymFilterFactory that looks up each token in a file synonyms.txt and, if found, returns the synonym in place of the token.

The analysis chain can include a “stemmer”, solr.PorterStemFilterFactory in this example. The stemmer employs an algorithm to change words to their “stems”. For example, “confidential”, “confidentiality”, and “confidentially” are all stemmed to “confidenti”. Using a stemmer can dramatically reduce the size of the index, but users executing searches should be aware that some search expressions will not work as expected because of stemming. For example, searching with a wildcard such as "confidential*" will return no matches because the words were stemmed to “confidenti” during indexing. Without a wildcard, the word in the search expression is also stemmed and therefore the search succeeds.
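
A sketch of this pitfall, using the gptext.search() call pattern from the GPText query documentation (the query terms are illustrative):

-- Returns no matches: the wildcard term "confidential*" bypasses stemming,
-- but the index holds only the stem "confidenti".
SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
    'wikipedia.public.articles', 'confidential*', null);

-- Returns matches: the plain term is stemmed to "confidenti" at query time.
SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
    'wikipedia.public.articles', 'confidential', null);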

GPText Text Analysis Chains

In addition to the text analysis chains Solr provides, GPText provides the following text analysis chains:

text_intl, the International Text Analyzer

text_intl is the default GPText analyzer. It is a multiple-language text analyzer for text fields that handles Latin-based words and Chinese, Japanese, and Korean (CJK) characters.

text_intl processes documents as follows.

  1. Separates CJK characters from other language text.
  2. Identifies currency tokens or symbols that were ignored in the first pass.
  3. For any CJK characters, generates a bigram for the CJK character and, for Korean characters only, preserves the original word.

Note that CJK and non-CJK text are treated as separate tokens. Preserving the original Korean word increases the number of tokens in a document.

Following is the definition from the Solr managed-schema template.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
            name="text_intl" positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory" han="true"
            hiragana="true" katakana="true" hangul="true" />
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/> </analyzer>
  <analyzer type="query">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory" han="true"
            hiragana="true" katakana="true" hangul="true" />
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Following are the analysis steps for text_intl.

  1. The analyzer chain for indexing begins with a tokenizer called WorldLexerTokenizerFactory. This tokenizer handles most modern languages. It separates CJK characters from other language text and identifies any currency tokens or symbols.
  2. The solr.CJKWidthFilterFactory filter normalizes the CJK characters based on character width.
  3. The solr.LowerCaseFilterFactory filter changes all letters to lower case.
  4. The WorldLexerBigramFilterFactory filter generates a bigram for any CJK characters, leaves any non-CJK characters intact, and preserves original Korean-language words. Set the han, hiragana, katakana, and hangul attributes to "true" to generate bigrams for all supported CJK languages.
  5. The solr.StopFilterFactory removes common words, such as “a”, “an”, and “the”, which are listed in the stopwords.txt configuration file (see Configuring an index). If there are no words in the stopwords.txt file, no words are removed.
  6. The solr.KeywordMarkerFilterFactory marks the English words to protect from stemming, using the words listed in the protwords.txt configuration file (see Configuring an index). If protwords.txt does not contain a list of words, all words in the document are stemmed.
  7. The final filter is the stemmer, in this case solr.PorterStemFilterFactory, a fast stemmer for the English language.

Note: The text_intl analyzer chain for querying is the same as the analyzer chain for indexing.

An analysis chain, text, is included in GPText’s Solr managed-schema and is based on Solr’s default analysis chain. Because its tokenizer splits on white space, text cannot process CJK languages, which do not use white space. Best practice is to use the text_intl analyzer.

For information about using an analyzer chain other than the default, see Using the text_sm Social Media Analyzer.

GPText Language Processing

The root-level tokenizer, WorldLexerTokenizerFactory, tokenizes international languages, including CJK languages. WorldLexerTokenizerFactory tokenizes languages based on their Unicode points and, for Latin-based languages, white space.

Note: Unicode is the encoding for all text in the Greenplum Database.

The following are sample input to, and output from, GPText. Each line in the output corresponds to a term.

English and CJK input:

  • ₩10 대부분 english자선 단체는.

English and CJK output:

  • ₩10
  • 대부분
  • 대부
  • 부분
  • english
  • 자선
  • 단체는
  • 단체
  • 체는

Bulgarian input:

  • Състав на парламента: вж. протоколи

Bulgarian output:

  • състав
  • на
  • парламента
  • вж
  • протоколи

Danish input:

  • Genoptagelse af sessionen

Danish output:

  • genoptagelse
  • af
  • sessionen

text_intl Filters

The text_intl analyzer uses the following filters:

  • The CJKWidthFilterFactory normalizes width differences in CJK characters. This filter normalizes all character widths to fullwidth.
  • The WorldLexerBigramFilterFactory filter forms bigrams (pairs) of CJK terms that are generated from WorldLexerTokenizerFactory. This filter does not modify non-CJK text.

    WorldLexerBigramFilterFactory accepts attributes that guide the creation of bigrams for CJK scripts. For example, if the input contains HANGUL script but the hangul attribute is set to false, this filter will not create bigrams for that script. To ensure that WorldLexerBigramFilterFactory creates bigrams as required, set the CJK attributes han, hiragana, katakana, and hangul to true.

text_sm, the Social Media Text Analyzer

The GPText text_sm text analyzer analyzes text from sources such as social media feeds. text_sm consists of a tokenizer and two filters. To configure the text_sm text analyzer, use the gptext-config utility to edit the managed-schema file. See Using the text_sm Social Media Analyzer for details.

text_sm normalizes emoticons: it replaces emoticons with text using the emoticons.txt configuration file. For example, it replaces a happy face emoticon, :-), with the text “happy”.

The following is the definition from the Solr managed-schema template.

<fieldType autoGeneratePhraseQueries="true"
           class="solr.TextField" name="text_sm"
           positionIncrementGap="100" termVectors="true"
           termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory"
               delimiter="\t" emoticons="emoticons.txt"/>
    <!-- Case insensitive stop word removal.
         Add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory"
            enablePositionIncrements="true" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
            delimiter="\t" emoticons="emoticons.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory"
               delimiter="\t" emoticons="emoticons.txt"/>
    <filter class="solr.StopFilterFactory"
            enablePositionIncrements="true" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
            delimiter="\t" emoticons="emoticons.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
  </analyzer>
</fieldType>

The TwitterTokenizer

The Twitter tokenizer extends the English language tokenizer, solr.WhitespaceTokenizerFactory, to recognize the following elements as terms.

  • Emoticons
  • Hyperlinks
  • Hashtag keywords (for example, #keyword)
  • User references (for example, @username)
  • Numbers
  • Floating point numbers
  • Numbers including commas (for example, 10,000)
  • Time expressions (for example, 9:30)

The text_sm filters

com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory classifies emoticons as happy, sad, or wink. It is based on the emoticons.txt file (one of the files you can edit with gptext-config) and is intended for future use, such as sentiment analysis.

The TwitterStemFilterFactory

com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory extends the solr.PorterStemFilterFactory class to bypass stemming of the social media patterns recognized by TwitterTokenizerFactory.

The emoticons.txt file

This file contains lists of emoticons for “happy,” “sad,” and “wink.” They are separated by a tab by default. You can change the separator to any character or string by changing the value of the delimiter attribute in the social media analysis chain. The following is a sample line from the text_sm analyzer chain:

<filter class =
        "com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
        delimiter="\t" emoticons="emoticons.txt"/>

Using the text_sm Social Media Analyzer

The Solr managed-schema file specifies the analyzer to use to index a field. The default analyzer is text_intl. To specify the text_sm social media analyzer, you use the gptext-config utility to modify the Solr managed-schema for your index.

The steps are:

  1. Create an index using gptext.create_index().

    The index's managed-schema file contains a line similar to the following near the end of the file:

    <field name="text_search_col" indexed="true" stored="false"
    type="text_intl"/>
    

    The type attribute specifies the analyzer to use; text_intl is the default.

  2. Use gptext-config to edit the managed-schema file:

    gptext-config -f managed-schema -i <index_name>
    
  3. Modify the line as follows:

    <field name="text_search_col" indexed="true" stored="false" type="text_sm"/>
    

gptext-config fetches the managed-schema file from the configuration files directory for your index and opens it in the vi editor. After you edit the file, save it, and quit vi, gptext-config returns the file to its original configuration files directory.

Storing Additional Fields in an Index

Solr can store additional fields in the index when it indexes a document and then return the fields in search results. You can avoid an expensive join in Greenplum Database if you store all of the required columns as fields in the Solr index.

GPText packs the additional fields in a buffer added to the search results. Individual fields can be retrieved from this buffer using the gptext.gptext_retrieve_field(), gptext.gptext_retrieve_field_int(), and gptext.gptext_retrieve_field_float() functions.

Follow these steps to configure the index to store the additional fields.

  1. Use gptext-config to edit the managed-schema file for the index:

     gptext-config -i <index_name> -f managed-schema

  2. For each additional field you want stored, change the field's stored="false" attribute to stored="true".

  3. Reindex and commit the index.

See Retrieving Additional Stored Fields for information about adding additional stored fields to the results of a GPText query.
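
A sketch of retrieving a stored field from search results; the rf results column, the fl option, and the stored title field follow the pattern of the GPText search examples and should be treated as assumptions here:

-- Assumes title is stored in the index and that search results expose the
-- additional-fields buffer in a column named rf (an assumption).
SELECT s.id, s.score,
       gptext.gptext_retrieve_field(s.rf, 'title') AS title
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
                   'wikipedia.public.articles', 'sports', null,
                   'rows=10&fl=title') s;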

Using Multiple Analyzer Chains

To index a field using two different analyzer chains simultaneously, create a new empty index, then use the gptext-config utility to add a new field to the index that is a copy of the field you are interested in, but with a different name and analyzer chain.

Let us assume that your index, as initially created, includes a field to index named mytext. Also assume that this field will be indexed using the default international analyzer (text_intl).

You want to add a new field to the index’s managed-schema that is a copy of mytext and that will be indexed with a different analyzer (say the text_sm analyzer). To do so, follow these steps:

  1. Create an empty index with gptext.create_index().
  2. Open the index’s managed-schema file for editing with gptext-config.
  3. Add a <field> in the managed-schema for a new field that will use a different analyzer chain. For example:

    <field indexed="true" name="mytext2" stored="false" type="text_sm"/>

    By defining the type of this new field to be text_sm, it will be indexed using the social media analyzer rather than the default text_intl.

  4. Add a <copyField> in managed-schema to copy the original field to the new field. For example:

    <copyField dest="mytext2" source="mytext"/>

  5. Index and commit as you normally would.

The database column mytext is now in the index twice with two different analyzer chains. One column is mytext, which uses the default international analyzer chain, and the other is the newly created mytext2, which uses the social media analyzer chain.
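
To search one copy or the other, qualify the query term with the field name; a sketch, using an illustrative index name:

-- Searches the social-media-analyzed copy; 'happy' is an illustrative term.
SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
    'wikipedia.public.articles', 'mytext2:happy', null);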

Using Different Analyzer Chains for Individual Fields

You can use different analyzers for individual fields by editing the managed-schema configuration file. For example, if one field contains English text and another contains Chinese language text, you can specify different analyzers for the two fields.

Example

You have a table named email_tbl with the following definition:

create table email_tbl (
   id bigint,
   english_content text,
   chinese_content text,
   timestamp date,
   username text,
   age int,
   ... )  -- additional columns that are not indexed
  • You want to index the six columns shown—id, english_content, chinese_content, timestamp, username, and age.
  • For the column english_content, you want to use the English language analyzer called “text_en” for the text segmentation.
  • For the column chinese_content, you want to use the international language analyzer named “text_intl”.

Here are steps to implement this example:

  1. Create the GPText index for the table.

    SELECT * FROM gptext.create_index('public', 'email_tbl', 'id', 'english_content');
    
  2. Modify the analyzer for each column in managed-schema.

    $ gptext-config -i db.public.email_tbl -f managed-schema
    
  3. Find the element for the english_content field.

    <field name="english_content" type="*" indexed="true" stored="true" />
    

    Change the type attribute to text_en.

    <field name="english_content" type="text_en" indexed="true" stored="true" />
    
  4. Find the element for the chinese_content field.

    <field name="chinese_content" type="*" indexed="true" stored="true" />
    

    Change the type attribute to text_intl.

    <field name="chinese_content" type="text_intl" indexed="true" stored="true" />
    
  5. Index the table.

    SELECT * FROM gptext.index(TABLE(SELECT id, english_content, chinese_content, timestamp, username, age FROM email_tbl),
    'db.public.email_tbl');
    
  6. Commit the index.

    SELECT * FROM gptext.commit_index('db.public.email_tbl');
    

The field types text_en and text_intl are defined in <fieldType> entries in the managed-schema file and then referenced in the type attribute of the <field> element.

You can define a custom field type by adding a <fieldType> entry with custom analyzers and then setting the field’s type attribute to the name of the custom field type. For example, the following “text_customize” field type is a copy of the “text_en” field type entry with the synonym filter commented out in the index analyzer. This custom field type will apply the synonym filter to queries, but not to the index.

<fieldType name="text_customize" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

A field type can also be customized by adding analyzers as child elements of the <field> element:

<field name="english_content" type="text" indexed="true" stored="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</field>