Pivotal® Greenplum® Text v3.1.0

Customizing GPText Indexes

GPText saves configuration files for an index in the ZooKeeper /gptext/configs/<index_name> znode, for example /gptext/configs/demo.twitter.message. The configuration files are copied from the $GPTXTHOME/share/gp_index_template/conf directory and modified with information passed in the gptext.create_index() function arguments and the Greenplum Database table definition.

After an index has been created, you can modify the index’s configuration files using the gptext-config command-line utility. You can also edit the template files in the $GPTXTHOME/share/gp_index_template/conf directory so that any new index you create has your customizations.

If you choose to customize the template files in the $GPTXTHOME/share/gp_index_template/conf directory, you should first back up the files so that you can restore the default versions if necessary.
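A backup of the template directory could be scripted as in the following minimal sketch. The function name and the timestamped backup location are illustrative assumptions, not part of GPText; only the share/gp_index_template/conf path comes from the text above.

```python
import shutil
import time
from pathlib import Path


def backup_template_conf(gptext_home, backup_root):
    """Copy the template conf directory to a new timestamped directory.

    gptext_home is the value of $GPTXTHOME; backup_root is any writable
    directory (hypothetical, chosen by the administrator).
    """
    source = Path(gptext_home) / "share" / "gp_index_template" / "conf"
    stamp = time.strftime("%Y%m%d%H%M%S")
    target = Path(backup_root) / f"gp_index_template_conf.{stamp}"
    # copytree creates the target directory and fails if it already exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(source, target)
    return target
```

Calling backup_template_conf(os.environ["GPTXTHOME"], "/home/gpadmin/backups") before editing the templates would leave a copy that can be restored by copying the files back.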

Editing GPText Index Configuration Files

You can edit the index configuration files saved in ZooKeeper using the gptext-config command-line utility with the edit option. You provide the name of the index and the name of the configuration file you want to modify. To edit the managed-schema file for the demo.twitter.message index, for example:

$ gptext-config edit -i demo.twitter.message -f managed-schema

The utility loads the file into an editor, vi by default. You can specify a different editor with the -e option. This command uses the nano editor to edit the stopwords.txt file.

$ gptext-config edit -i demo.twitter.message -f stopwords.txt -e nano

When editing XML files such as managed-schema, be sure that you save a valid XML document. Invalid XML syntax will cause Solr errors and prevent access to your index.

You can use the gptext-config upload command to upload a local configuration file to ZooKeeper. This example uploads a local configuration file named protwords.custom to ZooKeeper, overwriting the existing protwords.txt file.

$ gptext-config upload -i demo.twitter.message -l protwords.custom -f protwords.txt
20171011:11:24:59:030178 gptext-config:gpdb:gpadmin-[INFO]:-Execute GPText config.
20171011:11:25:00:030178 gptext-config:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171011:11:25:00:030178 gptext-config:gpdb:gpadmin-[INFO]:-Upload file protwords.custom to zookeeper...
20171011:11:25:01:030178 gptext-config:gpdb:gpadmin-[INFO]:-Reloading configuration...
20171011:11:25:02:030178 gptext-config:gpdb:gpadmin-[INFO]:-Modifications to protwords.txt require that all data be reindexed.
20171011:11:25:02:030178 gptext-config:gpdb:gpadmin-[INFO]:-Done.
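Because invalid XML prevents access to an index, it can help to check a local XML file before uploading it. The following sketch, with hypothetical helper names, tests that a document is well-formed and builds the upload command line using only the -i, -l, and -f options shown above. It does not check that the file is a valid Solr schema, only that it parses as XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as an XML document, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def upload_command(index_name, local_file, config_file):
    """Build the argument list for the gptext-config upload command."""
    return ["gptext-config", "upload", "-i", index_name,
            "-l", local_file, "-f", config_file]
```

The returned list could be passed to subprocess.run() on a host where gptext-config is installed, but only after is_well_formed() returns True for the file contents.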

Use the gptext-config append command to append a local text file to an existing configuration file. For example, you could create an additional list of stop words in a local file stopwords.add and append them to the stopwords.txt file.

$ gptext-config append -i demo.twitter.message -l stopwords.add -f stopwords.txt
20171010:09:52:59:019764 gptext-config:gpdb:gpadmin-[INFO]:-Execute GPText config.
20171010:09:53:00:019764 gptext-config:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171010:09:53:00:019764 gptext-config:gpdb:gpadmin-[INFO]:-Creating temporary copy of stopwords.txt...
20171010:09:53:01:019764 gptext-config:gpdb:gpadmin-[INFO]:-Appending contents of stopwords.add to stopwords.txt
20171010:09:53:01:019764 gptext-config:gpdb:gpadmin-[INFO]:-Backing up stopwords.txt for index demo.twitter.message...
20171010:09:53:03:019764 gptext-config:gpdb:gpadmin-[INFO]:-Reloading configuration...
20171010:09:53:22:019764 gptext-config:gpdb:gpadmin-[INFO]:-Modifications to stopwords.txt require that all data be reindexed.
20171010:09:53:22:019764 gptext-config:gpdb:gpadmin-[INFO]:-Done.
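Since the append command adds the local file as-is, you may want to remove words that are already listed before appending. The sketch below assumes one word per line and treats lines beginning with # as comments, which is the usual layout of Solr word list files; the function name is hypothetical.

```python
def new_stop_words(existing_text, candidate_text):
    """Return candidate words that are not already in the existing list.

    Comparison ignores case; blank lines, comment lines, and repeated
    candidates are skipped. Order of first appearance is preserved.
    """
    def words(text):
        for line in text.splitlines():
            word = line.strip()
            if word and not word.startswith("#"):
                yield word

    seen = {w.lower() for w in words(existing_text)}
    fresh = []
    for word in words(candidate_text):
        if word.lower() not in seen:
            seen.add(word.lower())
            fresh.append(word)
    return fresh
```

Writing the returned words to stopwords.add, one per line, gives a file that can be appended without introducing duplicates.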

See the gptext-config command reference for gptext-config command-line options and for descriptions of the files you can edit with gptext-config.

The managed-schema File

The main configuration file for an index is the managed-schema file. The managed-schema file is an XML file containing definitions for the fields, field types, and analyzer chains that define the contents and behavior of a GPText index.

  • A field (<field> XML element) maps a Greenplum Database table column to a field in the GPText index.
  • A field type (<fieldType> XML element) assigns Solr Java classes and analyzer chains that handle a data type to a field.
  • An analyzer chain (<analyzer> XML element) is a container element that specifies the Java classes that tokenize and filter the content of a field that is to be indexed. An <analyzer> element is a child of a <fieldType> element.

In addition to the managed-schema file, the Solr configuration files for an index include text files that contain lists of words to treat specially when indexing data, localization files, character set collation maps used for sorting, and a Solr server configuration file.

The following sections provide an overview of the contents of the managed-schema file and the relationships between the XML elements that define fields, field types, and analyzers. By editing the managed-schema file, you can specify at the field level how Solr indexes and stores Greenplum Database data.

For detailed documentation of the contents of the managed-schema file, refer to the comments in the file or to the Apache SolrCloud documentation.

Field Elements

GPText adds field elements to the managed-schema file for columns included when the index was created with the gptext.create_index() function. This example is the definition for a text field named description:

<field name="description" stored="false" type="text_intl" indexed="true"/>
  • The name attribute is the name of the database column. If the column name is not a valid Solr field name, it is altered to conform.
  • The stored attribute determines if the content of the field will be stored in the index. If the field is stored in the index, GPText search results can return the content of the field. If the field is not stored, retrieving the field content requires a SQL join.
  • The type attribute maps the Greenplum Database type to a Solr type, defined in the same file with a <fieldType> element.
  • The indexed attribute determines whether the field content will be indexed.

The <field> element can have additional attributes used with some types. See the comment after the <fields> element for a complete list of attributes.
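To see which columns would require a SQL join to retrieve, you could scan a local copy of the managed-schema file for fields that are not stored. This is a minimal sketch: the schema fragment is abbreviated, and the id field in it is a made-up example rather than something GPText is known to generate in this form.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative schema; only "description" comes from the text.
SCHEMA_FRAGMENT = """
<schema>
  <field name="id" stored="true" type="long" indexed="true"/>
  <field name="description" stored="false" type="text_intl" indexed="true"/>
</schema>
"""


def unstored_fields(schema_xml):
    """Return the names of <field> elements whose stored attribute is not "true"."""
    root = ET.fromstring(schema_xml)
    return [field.get("name") for field in root.iter("field")
            if field.get("stored") != "true"]
```

For the fragment above, unstored_fields(SCHEMA_FRAGMENT) lists only the description field.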

Field Types

The type attribute of the <field> element is mapped to the name attribute of a <fieldType> element in the managed-schema file. The <fieldType> element determines how Solr parses and stores a field in the index.

The class attribute maps the field type to a Solr Java class that recognizes and processes the data type. Solr includes many base field types. See GPText and Solr Data Type Mappings for a mapping of Solr types to Greenplum Database types.

You can map a field to a different type by changing the field’s type attribute. For example, to use the GPText social media text analyzer chain, you can change the type of a text field from text_intl to text_sm. Both text_intl and text_sm use the solr.TextField class, but specify different filters in their analyzer chains.

The GPText gptext.list_field_types() function is a convenience function that lets you see the text field types defined in the managed-schema file for an index without having to edit the file. All of the types listed have the class solr.TextField.

SELECT * FROM gptext.list_field_types('demo.wikipedia.articles');
     list_field_types
---------------------------
 ancestor_path
 delimited_payloads_float
 delimited_payloads_int
 delimited_payloads_string
 descendent_path
 lowercase
 phonetic_en
 text
 text_ar
 text_bg
 text_ca
 text_cjk
 text_cz
 text_da
 text_de
 text_el
 text_en
 text_en_splitting
 text_en_splitting_tight
 text_es
 text_eu
 text_fa
 text_fi
 text_fr
 text_ga
 text_general
 text_general_rev
 text_gl
 text_hi
 text_hu
 text_hy
 text_icu
 text_id
 text_intl
 text_intl_prev
 text_it
 text_ja
 text_lv
 text_nl
 text_no
 text_pt
 text_ro
 text_ru
 text_sm
 text_sv
 text_th
 text_tr
 text_ws
 text_zhsmart
(49 rows)

To add a custom type, you can add a new field type by implementing Solr Java type interfaces, or you can specify an existing base type and customize it with an analyzer chain, as described in the next section.

Analyzer Chains

An analyzer examines the contents of a field or a search query phrase and returns a stream of tokens used to index the field or search the index. The <analyzer> element is a child of a <fieldType> element that specifies how text will be tokenized and processed before it is indexed or applied to a search. An <analyzer> can be of type index or query.

You can define different analyzer chains for indexing and querying by adding a type attribute to the <analyzer> element. If no type attribute appears, the chain is applied both to field text that is to be indexed and to query text that searches the index.

Field analysis begins with a <tokenizer> that divides the contents of a field into tokens. In Latin-based text documents, the tokens are words or terms. In Chinese, Japanese, and Korean (CJK) documents, the tokens are characters.

The tokenizer can be followed by one or more <filter> elements which are applied in succession. Filters restrict the query results, for example, by removing unnecessary terms (“a”, “an”, “the”), converting term formats, or by performing other actions to ensure that only important, relevant terms appear in the result set. Each filter operates on the output of the tokenizer or filter that precedes it. Solr includes many tokenizers and filters that allow analyzer chains to process different character sets, languages, and transformations. See Analyzers, Tokenizers and Filters for more information.

Field types are assigned analyzers in an index’s managed-schema file. The following example shows the Solr text field type specification:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

An analyzer has only one tokenizer, solr.WhitespaceTokenizerFactory in this example. The tokenizer can be followed by one or more filters executed in succession.

Filters restrict the query results. Each filter operates on the output of the tokenizer or filter that precedes it. For example, the solr.StopFilterFactory filter removes unnecessary terms (“a”, “an”, “the”) from the stream of tokens. The words to filter out of the stream are listed in the stopwords.txt configuration file. You can edit the stopwords.txt file with the gptext-config utility to change the list of words excluded from the index.
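The way each stage consumes the output of the previous one can be imitated in a few lines. This is only a toy model of three stages from the text field type, not Solr code, and the stop word list passed in is a small assumed sample.

```python
def whitespace_tokenize(text):
    """Split text on runs of white space."""
    return text.split()


def stop_filter(tokens, stop_words, ignore_case=True):
    """Drop tokens that appear in stop_words."""
    if ignore_case:
        blocked = {w.lower() for w in stop_words}
        return [t for t in tokens if t.lower() not in blocked]
    blocked = set(stop_words)
    return [t for t in tokens if t not in blocked]


def lower_case_filter(tokens):
    """Convert every token to lower case."""
    return [t.lower() for t in tokens]


def analyze(text, stop_words):
    """Run the tokenizer, then each filter on the previous stage's output."""
    tokens = whitespace_tokenize(text)
    tokens = stop_filter(tokens, stop_words)
    return lower_case_filter(tokens)
```

Because the stop filter runs before the lower case filter in this chain, it has to compare case-insensitively to remove a capitalized word such as "If", which is the role the ignoreCase="true" attribute plays in the schema.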

There are separate analyzer types for index and query operations. The query analyzer chain in this example includes a solr.SynonymFilterFactory that looks up each token in a file synonyms.txt and, if found, returns the synonym in place of the token.

The analyzer chain can include a “stemmer”, solr.PorterStemFilterFactory in this example. The stemmer employs an algorithm to change words to their “stems”. For example, “confidential”, “confidentiality”, and “confidentis” are all stemmed to “confidenti”. Using a stemmer can dramatically reduce the size of the index, but users executing searches should be aware that some search expressions will not work as expected because of stemming. For example, searching with a wildcard such as "confidential*" will return no matches because the words were stemmed to “confidenti” during indexing. Without a wildcard, the word in the search expression is also stemmed and therefore the search succeeds.
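The wildcard behavior follows from the fact that indexed terms are stems while a wildcard prefix is compared without being stemmed. The sketch below models this with a lookup table holding the stems from the example above; a real Porter stemmer computes stems algorithmically, and the function names here are hypothetical.

```python
# Stems copied from the example in the text, standing in for a real stemmer.
EXAMPLE_STEMS = {
    "confidential": "confidenti",
    "confidentiality": "confidenti",
}


def stem(word):
    """Return the example stem for a word, or the word itself if unknown."""
    return EXAMPLE_STEMS.get(word, word)


def index_terms(words):
    """Terms stored in the index: lowercased, then stemmed."""
    return {stem(w.lower()) for w in words}


def term_query(indexed, word):
    """Plain term query: the query word is stemmed before lookup."""
    return stem(word.lower()) in indexed


def prefix_query(indexed, prefix):
    """Wildcard query such as confidential*: the prefix is not stemmed."""
    return any(term.startswith(prefix.lower()) for term in indexed)
```

In this model the prefix "confidential" is longer than the stored term "confidenti", so it cannot match, while a shorter prefix such as "confid" still does.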

The gptext.get_field_type() convenience function retrieves the field type definition for a field type from the managed-schema file for an index, as a JSON string. This example shows the field type definition for the Solr text field type.

=# SELECT * FROM gptext.get_field_type('demo.wikipedia.articles', 'text');
                    field_type                    
-------------------------------------------------
 {                                               
  "name": "text",                                
  "class": "solr.TextField",                     
  "indexAnalyzer": {                             
   "tokenizer": {                                
    "class": "solr.WhitespaceTokenizerFactory"   
   },                                            
   "filters": [                                  
    {                                            
     "class": "solr.StopFilterFactory",          
     "attributes": [                             
      {                                          
       "name": "words",                          
       "value": "stopwords.txt"                  
      },                                         
      {                                          
       "name": "ignoreCase",                     
       "value": "true"                           
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.WordDelimiterFilterFactory", 
     "attributes": [                             
      {                                          
       "name": "catenateNumbers",                
       "value": "1"                              
      },                                         
      {                                          
       "name": "generateNumberParts",            
       "value": "1"                              
      },                                         
      {                                          
       "name": "splitOnCaseChange",              
       "value": "1"                              
      },                                         
      {                                          
       "name": "generateWordParts",              
       "value": "1"                              
      },                                         
      {                                          
       "name": "catenateAll",                    
       "value": "0"                              
      },                                         
      {                                          
       "name": "catenateWords",                  
       "value": "1"                              
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.LowerCaseFilterFactory"      
    },                                           
    {                                            
     "class": "solr.KeywordMarkerFilterFactory", 
     "attributes": [                             
      {                                          
       "name": "protected",                      
       "value": "protwords.txt"                  
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.PorterStemFilterFactory"     
    }                                            
   ]                                             
  },                                             
  "queryAnalyzer": {                             
   "tokenizer": {                                
    "class": "solr.WhitespaceTokenizerFactory"   
   },                                            
   "filters": [                                  
    {                                            
     "class": "solr.SynonymFilterFactory",       
     "attributes": [                             
      {                                          
       "name": "expand",                         
       "value": "true"                           
      },                                         
      {                                          
       "name": "ignoreCase",                     
       "value": "true"                           
      },                                         
      {                                          
       "name": "synonyms",                       
       "value": "synonyms.txt"                   
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.StopFilterFactory",          
     "attributes": [                             
      {                                          
       "name": "words",                          
       "value": "stopwords.txt"                  
      },                                         
      {                                          
       "name": "ignoreCase",                     
       "value": "true"                           
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.WordDelimiterFilterFactory", 
     "attributes": [                             
      {                                          
       "name": "catenateNumbers",                
       "value": "0"                              
      },                                         
      {                                          
       "name": "generateNumberParts",            
       "value": "1"                              
      },                                         
      {                                          
       "name": "splitOnCaseChange",              
       "value": "1"                              
      },                                         
      {                                          
       "name": "generateWordParts",              
       "value": "1"                              
      },                                         
      {                                          
       "name": "catenateAll",                    
       "value": "0"                              
      },                                         
      {                                          
       "name": "catenateWords",                  
       "value": "0"                              
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.LowerCaseFilterFactory"      
    },                                           
    {                                            
     "class": "solr.KeywordMarkerFilterFactory", 
     "attributes": [                             
      {                                          
       "name": "protected",                      
       "value": "protwords.txt"                  
      }                                          
     ]                                           
    },                                           
    {                                            
     "class": "solr.PorterStemFilterFactory"     
    }                                            
   ]                                             
  },                                             
  "attributes": [                                
   {                                             
    "name": "autoGeneratePhraseQueries",         
    "value": "true"                              
   },                                            
   {                                             
    "name": "positionIncrementGap",              
    "value": "100"                               
   }                                             
  ]                                              
 }                                               

(1 row)
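A client program that receives this JSON string could summarize the two chains as in the following sketch. The embedded JSON is an abbreviated stand-in for the output above (attributes and several filters are omitted), and the helper names are hypothetical.

```python
import json

# Abbreviated version of the field type definition shown above.
FIELD_TYPE_JSON = """
{
 "name": "text",
 "class": "solr.TextField",
 "indexAnalyzer": {
  "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
  "filters": [
   {"class": "solr.StopFilterFactory"},
   {"class": "solr.LowerCaseFilterFactory"}
  ]
 },
 "queryAnalyzer": {
  "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
  "filters": [
   {"class": "solr.SynonymFilterFactory"},
   {"class": "solr.StopFilterFactory"}
  ]
 }
}
"""


def chain(field_type, analyzer_key):
    """Return the tokenizer class followed by the filter classes, in order."""
    analyzer = field_type[analyzer_key]
    filters = [f["class"] for f in analyzer.get("filters", [])]
    return [analyzer["tokenizer"]["class"]] + filters


def query_only_filters(field_type):
    """Return filter classes used by the query chain but not the index chain."""
    index_classes = set(chain(field_type, "indexAnalyzer"))
    return [c for c in chain(field_type, "queryAnalyzer")
            if c not in index_classes]
```

Applied to json.loads(FIELD_TYPE_JSON), query_only_filters() reports solr.SynonymFilterFactory, which the text field type applies at query time only.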

The gptext.analyzer() function lets you test an analyzer chain for a field without altering the index. It shows the output of the tokenizer and each filter in the chain. You supply the text to analyze and specify whether to test the index or the query analyzer chain. It is useful for testing tokenizers and filters and for troubleshooting search queries that do not return the expected results.

=# SELECT * FROM gptext.analyzer('demo.wikipedia.articles', 'index',
    'If You Optimize Everything, You will Always be Unhappy.');
         class          |                                            tokens
------------------------+-----------------------------------------------------------------------------------------------
 WhitespaceTokenizer    | {{"If"},{"You"},{"Optimize"},{"Everything,"},{"You"},{"will"},{"Always"},{"be"},{"Unhappy."}}
 StopFilter             | {{},{"You"},{"Optimize"},{"Everything,"},{"You"},{},{"Always"},{},{"Unhappy."}}
 WordDelimiterFilter    | {{},{"You"},{"Optimize"},{"Everything"},{"You"},{},{"Always"},{},{"Unhappy"}}
 LowerCaseFilter        | {{},{"you"},{"optimize"},{"everything"},{"you"},{},{"always"},{},{"unhappy"}}
 SetKeywordMarkerFilter | {{},{"you"},{"optimize"},{"everything"},{"you"},{},{"always"},{},{"unhappy"}}
 PorterStemFilter       | {{},{"you"},{"optim"},{"everyth"},{"you"},{},{"alwai"},{},{"unhappi"}}
(6 rows)

GPText Text Analyzer Chains

In addition to the text analyzer chains Solr provides, GPText provides the following text analyzer chains:

text_intl, the International Text Analyzer

text_intl is the default GPText analyzer. It is a multiple language text analyzer for text fields. It handles Latin-based words and Chinese, Japanese, and Korean (CJK) characters.

text_intl processes documents as follows.

  1. Separates CJK characters from other language text.
  2. Identifies currency tokens or symbols that were ignored in the first pass.
  3. For any CJK characters, generates a bigram for the CJK character and, for Korean characters only, preserves the original word.

Note that CJK and non-CJK text are treated as separate tokens. Preserving the original Korean word increases the number of tokens in a document.

Following is the definition from the Solr managed-schema template.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
            name="text_intl" positionIncrementGap="100">

  <analyzer type="index">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory" han="true"
            hiragana="true" katakana="true" hangul="true" />
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory" han="true"
            hiragana="true" katakana="true" hangul="true" />
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Following are the analysis steps for text_intl.

  1. The analyzer chain for indexing begins with a tokenizer called WorldLexerTokenizerFactory. This tokenizer handles most modern languages. It separates CJK characters from other language text and identifies any currency tokens or symbols.
  2. The solr.CJKWidthFilterFactory filter normalizes the CJK characters based on character width.
  3. The solr.LowerCaseFilterFactory filter changes all letters to lower case.
  4. The WorldLexerBigramFilterFactory filter generates a bigram for any CJK characters, leaves any non-CJK characters intact, and preserves original Korean-language words. Set the han, hiragana, katakana, and hangul attributes to "true" to generate bigrams for all supported CJK languages.
  5. The solr.StopFilterFactory removes common words, such as “a”, “an”, and “the”, which are listed in the stopwords.txt configuration file (see To configure an index). If there are no words in the stopwords.txt file, no words are removed.
  6. The solr.KeywordMarkerFilterFactory marks the English words to protect from stemming, using the words listed in the protwords.txt configuration file (see To configure an index). If protwords.txt does not contain a list of words, all words in the document are stemmed.
  7. The final filter is the stemmer, in this case solr.PorterStemFilterFactory, a fast stemmer for the English language.

Note: The text_intl analyzer chain for querying is the same as the text analyzer chain for indexing.

An analyzer chain, text, is included in GPText’s Solr managed-schema and is based on Solr’s default analyzer chain. Because its tokenizer splits on white space, text cannot process CJK languages: white space is meaningless for CJK languages. Best practice is to use the text_intl analyzer.

For information about using an analyzer chain other than the default, see Using the text_sm Social Media Analyzer.

GPText Language Processing

The root-level tokenizer, WorldLexerTokenizerFactory, tokenizes international languages, including CJK languages. WorldLexerTokenizerFactory tokenizes languages based on their Unicode code points and, for Latin-based languages, white space.

Note: Unicode is the encoding for all text in the Greenplum Database.

The following are sample input to, and output from, GPText. Each line in the output corresponds to a term.

English and CJK input:

  • ₩10 대부분 english자선 단체는.

English and CJK output:

  • ₩10
  • 대부분
  • 대부
  • 부분
  • english
  • 자선
  • 단체는
  • 단체
  • 체는

Bulgarian input:

  • Състав на парламента: вж. протоколи

Bulgarian output:

  • състав
  • на
  • парламента
  • вж
  • протоколи

Danish input:

  • Genoptagelse af sessionen

Danish output:

  • genoptagelse
  • af
  • sessionen
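The Korean terms in the English and CJK sample can be reproduced with a simple overlapping-bigram function. This is a rough sketch of the idea only, not the WorldLexerBigramFilterFactory implementation: it handles one already-tokenized word at a time and recognizes Korean solely by the precomposed Hangul syllable range.

```python
def is_hangul(ch):
    """True for precomposed Hangul syllables (U+AC00 through U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def bigram_terms(word):
    """Return the terms produced for one CJK word.

    A Korean word is kept in its original form, followed by its
    overlapping two-character bigrams. Duplicate terms are dropped.
    """
    pairs = [word[i:i + 2] for i in range(len(word) - 1)]
    terms = []
    if word and all(is_hangul(c) for c in word):
        terms.append(word)  # preserve the original Korean word
    for pair in pairs:
        if pair not in terms:
            terms.append(pair)
    return terms
```

For the three-character word 대부분 this yields the original word plus the bigrams 대부 and 부분, and a two-character word such as 자선 yields a single term, matching the sample output.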

text_intl Filters

The text_intl analyzer uses the following filters:

  • The CJKWidthFilterFactory normalizes width differences in CJK characters. It folds fullwidth ASCII variants into the equivalent basic Latin characters and halfwidth Katakana variants into the equivalent standard kana.
  • The WorldLexerBigramFilterFactory filter forms bigrams (pairs) of CJK terms that are generated from WorldLexerTokenizerFactory. This filter does not modify non-CJK text.

    WorldLexerBigramFilterFactory accepts attributes that guide the creation of bigrams for CJK scripts. For example, if the input contains Hangul script but the hangul attribute is set to false, this filter will not create bigrams for that script. To ensure that WorldLexerBigramFilterFactory creates bigrams as required, set the CJK attributes han, hiragana, katakana, and hangul to true.
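Width normalization of the kind a CJK width filter performs can be approximated with Unicode NFKC normalization from the Python standard library. This is an approximation chosen for illustration: NFKC also rewrites other compatibility characters, so it is broader than the Solr filter.

```python
import unicodedata


def normalize_width(text):
    """Fold fullwidth Latin and halfwidth Katakana variants to standard forms.

    Uses NFKC normalization, which covers these width variants among
    other compatibility mappings.
    """
    return unicodedata.normalize("NFKC", text)
```

After normalization, a term typed with fullwidth letters and the same term typed with ordinary letters produce identical tokens, so a search for one finds the other.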

text_sm, the Social Media Text Analyzer

The GPText text_sm text analyzer analyzes text from sources such as social media feeds. text_sm consists of a tokenizer and two filters. To configure the text_sm text analyzer, use the gptext-config utility to edit the managed-schema file. See Using the text_sm Social Media Analyzer for details.

text_sm normalizes emoticons: it replaces emoticons with text using the emoticons.txt configuration file. For example, it replaces a happy face emoticon, :-), with the text “happy”.

The following is the definition from the Solr managed-schema template.

<fieldType autoGeneratePhraseQueries="true"
           class="solr.TextField" name="text_sm"
           positionIncrementGap="100" termVectors="true"
           termPositions="true" termOffsets="true">
  <analyzer type="index">
  <tokenizer class =
          "com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory"
          delimiter="\t"
          emoticons="emoticons.txt"/>
<!-- Case insensitive stop word removal.
   Add enablePositionIncrements=true in both the index and query
 analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory"
         enablePositionIncrements="true" ignoreCase="true"
         words="stopwords.txt"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.KeywordMarkerFilterFactory"
         protected="protwords.txt"/>
   <filter class =
        "com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
        delimiter="\t" emoticons="emoticons.txt"/>
   <filter class =
        "com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
  <tokenizer class =
          "com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory"
          delimiter="\t"
          emoticons="emoticons.txt"
          />
   <filter class="solr.StopFilterFactory"
         enablePositionIncrements="true" ignoreCase="true"
         words="stopwords.txt"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.KeywordMarkerFilterFactory"
         protected="protwords.txt"/>
   <filter class =
        "com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
        delimiter="\t"
        emoticons="emoticons.txt"/>
   <filter class =
        "com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
 </analyzer>
</fieldType>

The TwitterTokenizer

The Twitter tokenizer extends the English language tokenizer, solr.WhitespaceTokenizerFactory, to recognize the following elements as terms.

  • Emoticons
  • Hyperlinks
  • Hashtag keywords (for example, #keyword)
  • User references (for example, @username)
  • Numbers
  • Floating point numbers
  • Numbers including commas (for example, 10,000)
  • Time expressions (for example, 9:30)
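The element types listed above can be illustrated with a regular expression that tries the special patterns before falling back to ordinary words. The patterns are simplified assumptions for demonstration and are not taken from the TwitterTokenizerFactory source; in particular, only a handful of emoticons are recognized.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+            # hyperlinks
  | [#@]\w+                 # hashtag keywords and user references
  | \d{1,2}:\d{2}           # time expressions such as 9:30
  | \d{1,3}(?:,\d{3})+      # numbers including commas such as 10,000
  | \d+\.\d+                # floating point numbers
  | \d+                     # plain numbers
  | [:;]-?[()]              # a few simple emoticons
  | \w+                     # ordinary words
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in text, keeping social media patterns whole."""
    return TOKEN_PATTERN.findall(text)
```

The order of the alternatives matters: the more specific patterns come first so that, for example, 10,000 is kept as one term instead of being split at the comma.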

The text_sm filters

com.emc.solr.analysis.socialmedia.twitter.EmoticonsClassifierFilterFactory classifies emoticons as happy, sad, or wink. It is based on the emoticons.txt file (one of the files you can edit with gptext-config) and is intended for future use, such as in sentiment analysis.

The TwitterStemFilterFactory

com.emc.solr.analysis.socialmedia.twitter.TwitterStemFilterFactory extends the solr.PorterStemFilterFactory class to bypass stemming of the social media patterns recognized by the twitter.TwitterTokenizerFactory.

The emoticons.txt file

This file contains lists of emoticons for “happy,” “sad,” and “wink.” The entries are separated by a tab by default. You can change the separator to any character or string by changing the value of the delimiter attribute in the social media analyzer chain. The following is a sample line from the text_sm analyzer chain:

<filter class =
        "com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
        delimiter="\t" emoticons="emoticons.txt"/>

Using the text_sm Social Media Analyzer

The Solr managed-schema file created for an index specifies an analyzer to use to index each field. The default analyzer for text fields is text_intl. To specify the text_sm social media analyzer, you use the gptext-config utility to modify the Solr managed-schema for your index.

The steps are:

  1. Create an index using gptext.create_index().

  2. Use the gptext-config utility to edit the managed-schema file created for the index:

    gptext-config edit -f managed-schema -i <index_name>
    

    The managed-schema file contains a <field> element for each text field. For example:

    <field name="message_text" stored="false" type="text_intl" indexed="true"/>
    

    The type attribute specifies the analyzer to use. text_intl is the default analyzer.

  3. For each text field that should use the GPText social media analyzer, change the type attribute of its <field> element to text_sm. For example:

    <field name="text_search_col" indexed="true" stored="false" type="text_sm"/>
    
  4. Save the managed-schema file.
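The edit in steps 2 through 4 is a one-attribute change, and it can be useful to rehearse it on a local copy of the schema before touching the live file. The following Python sketch shows the change and confirms that the result is still well-formed XML. The function name set_field_type and the SAMPLE schema are hypothetical and are not part of GPText.

```python
import xml.etree.ElementTree as ET


def set_field_type(schema_xml, field_name, new_type):
    """Return schema_xml with the type attribute of the named <field> changed.

    ET.fromstring raises ParseError if the document is not well-formed XML.
    Note: ElementTree drops XML comments when parsing, so treat this as a
    rehearsal aid rather than a tool for rewriting a commented schema.
    """
    root = ET.fromstring(schema_xml)
    matches = [f for f in root.iter("field") if f.get("name") == field_name]
    if not matches:
        raise KeyError("no <field> named %r" % field_name)
    for field in matches:
        field.set("type", new_type)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for a managed-schema file.
SAMPLE = """<schema name="example" version="1.6">
  <field name="message_text" stored="false" type="text_intl" indexed="true"/>
</schema>"""

if __name__ == "__main__":
    print(set_field_type(SAMPLE, "message_text", "text_sm"))
```

Parsing the edited text once more with ET.fromstring is a cheap way to catch an unbalanced tag before it reaches Solr.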

Using Multiple Analyzer Chains

To index a field with two different analyzer chains simultaneously, create a new empty index, then use the gptext-config utility to add a field to the index that copies the field you are interested in but has a different name and analyzer chain.

Assume that your index, as initially created, includes a field named mytext, and that this field is indexed using the default international analyzer (text_intl).

You want to add a new field to the index’s managed-schema that is a copy of mytext and that will be indexed with a different analyzer (say the text_sm analyzer). To do so, follow these steps:

  1. Create an empty index with gptext.create_index().
  2. Open the index’s managed-schema file for editing with gptext-config.
  3. Add a <field> in the managed-schema for a new field that will use a different analyzer chain. For example:

    <field indexed="true" name="mytext2" stored="false" type="text_sm"/>

    Because the type of this new field is text_sm, it is indexed using the social media analyzer rather than the default text_intl.

  4. Add a <copyField> in managed-schema to copy the original field to the new field. For example:

    <copyField dest="mytext2" source="mytext"/>

  5. Index and commit as you normally would.

The database column mytext is now in the index twice, with two different analyzer chains. One field is mytext, which uses the default international analyzer chain, and the other is the newly created mytext2, which uses the social media analyzer chain.
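Steps 3 and 4 add two elements that must agree with each other: the dest of the <copyField> has to name the new <field>, and the source has to name a field that already exists. The Python sketch below applies both additions to a schema string and checks those two conditions. The helper add_copy_of_field and the SAMPLE_SCHEMA string are hypothetical, used only to illustrate the relationship.

```python
import xml.etree.ElementTree as ET


def add_copy_of_field(schema_xml, source, dest, dest_type):
    """Add a <field> named dest and a <copyField> from source to dest."""
    root = ET.fromstring(schema_xml)
    names = {f.get("name") for f in root.iter("field")}
    if source not in names:
        raise ValueError("source field %r is not defined" % source)
    if dest in names:
        raise ValueError("field %r already exists" % dest)
    # The new field is analyzed with dest_type instead of the source's type.
    ET.SubElement(root, "field", {
        "name": dest, "type": dest_type, "indexed": "true", "stored": "false",
    })
    # copyField feeds the raw text of source into dest at index time.
    ET.SubElement(root, "copyField", {"source": source, "dest": dest})
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for a managed-schema file.
SAMPLE_SCHEMA = """<schema name="example" version="1.6">
  <field name="mytext" type="text_intl" indexed="true" stored="false"/>
</schema>"""

if __name__ == "__main__":
    print(add_copy_of_field(SAMPLE_SCHEMA, "mytext", "mytext2", "text_sm"))
```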

Using Different Analyzer Chains for Individual Fields

You can use different analyzers for individual fields by editing the managed-schema configuration file. For example, if one field contains English text and another contains Chinese language text, you can specify different analyzers for the two fields.

Example

You have a table named email_tbl with the following definition:

create table email_tbl (
   id bigint,
   english_content text,
   chinese_content text,
   timestamp date,
   username text,
   age int,
   ... ) -- additional columns that are not indexed

  • You want to index the six columns shown—id, english_content, chinese_content, timestamp, username, and age.
  • For the column english_content, you want to use the English language analyzer called “text_en” for the text segmentation.
  • For the column chinese_content, you want to use the international language analyzer named “text_intl”.

Here are steps to implement this example:

  1. Create the GPText index for the table.

    SELECT * FROM gptext.create_index('public', 'email_tbl', 'id', 'english_content');
    
  2. Modify the analyzer for each column in managed-schema.

    $ gptext-config edit -i db.public.email_tbl -f managed-schema
    
  3. Find the element for the english_content field.

    <field name="english_content" type="*" indexed="true" stored="true" />
    

    Change the type attribute to text_en.

    <field name="english_content" type="text_en" indexed="true" stored="true" />
    
  4. Find the element for the chinese_content field.

    <field name="chinese_content" type="*" indexed="true" stored="true" />
    

    Change the type attribute to text_intl.

    <field name="chinese_content" type="text_intl" indexed="true" stored="true" />
    
  5. Index the table.

    SELECT * FROM gptext.index(TABLE(SELECT id, english_content, chinese_content, timestamp, username, age FROM email_tbl),
    'db.public.email_tbl');
    
  6. Commit the index.

    SELECT * FROM gptext.commit_index('db.public.email_tbl');
    

The field types text_en and text_intl are defined in <fieldType> entries in the managed-schema file and then referenced in the type attribute of the <field> element.
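Because each <field> refers to a <fieldType> by name, a mistyped type value leaves the field pointing at a definition that does not exist. As a minimal sketch, the following Python function lists such fields in a schema string. The function name undefined_field_types and the SCHEMA sample are hypothetical, and the check looks only at <field> elements.

```python
import xml.etree.ElementTree as ET


def undefined_field_types(schema_xml):
    """Map field name -> type for every <field> whose type has no <fieldType>."""
    root = ET.fromstring(schema_xml)
    defined = {ft.get("name") for ft in root.iter("fieldType")}
    return {
        f.get("name"): f.get("type")
        for f in root.iter("field")
        if f.get("type") not in defined
    }


# Hypothetical sample: text_en is referenced but never defined.
SCHEMA = """<schema name="example" version="1.6">
  <fieldType name="text_intl" class="solr.TextField"/>
  <field name="chinese_content" type="text_intl" indexed="true" stored="true"/>
  <field name="english_content" type="text_en" indexed="true" stored="true"/>
</schema>"""

if __name__ == "__main__":
    print(undefined_field_types(SCHEMA))
```

An empty result means that every field's type attribute matches a defined field type.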

You can define a custom field type by adding a <fieldType> entry with custom analyzers and then setting the field’s type attribute to the name of the custom field type. For example, the following “text_customize” field type is a copy of the “text_en” field type entry with the synonym filter commented out in the index analyzer. This custom field type will apply the synonym filter to queries, but not to the index.

<fieldType name="text_customize" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
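The effect of applying the synonym filter only at query time can be seen in a toy simulation. The Python sketch below mimics the filter order of the two analyzers above using plain functions. It is not Solr code: whitespace splitting stands in for the standard tokenizer, and the STOPWORDS and SYNONYMS contents are invented for illustration.

```python
# Hypothetical contents standing in for stopwords.txt and synonyms.txt.
STOPWORDS = {"the", "a"}
SYNONYMS = {"tv": ["tv", "television"]}


def stop_filter(tokens):
    # ignoreCase="true": compare in lower case, keep the original token.
    return [t for t in tokens if t.lower() not in STOPWORDS]


def synonym_filter(tokens):
    # expand="true": replace a token with all of its synonyms.
    out = []
    for t in tokens:
        out.extend(SYNONYMS.get(t.lower(), [t]))
    return out


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def analyze_index(text):
    # Index analyzer: stop filter, then lower case (synonyms commented out).
    return lowercase_filter(stop_filter(text.split()))


def analyze_query(text):
    # Query analyzer: stop filter, synonym filter, then lower case.
    return lowercase_filter(synonym_filter(stop_filter(text.split())))


if __name__ == "__main__":
    print(analyze_index("The TV"))
    print(analyze_query("The TV"))
```

In this simulation the index holds only the terms that appear in the document, while a query for one term is widened to its synonyms, so the synonym list can be changed without reindexing.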

A field type can also be customized by adding analyzers as child elements of the <field> element:

<field name="english_content" type="text" indexed="true" stored="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</field>