LATEST VERSION: 3.1.0 - RELEASE NOTES
Pivotal® Greenplum® Text v2.3.1

Working With GPText Indexes

Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create GPText indexes and add documents from Greenplum Database tables to them, and how to maintain and customize indexes for your own applications.

For help indexing and searching documents stored outside of Greenplum Database see Working With GPText External Indexes.

Setting Up the Sample Database

The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.

  1. Log in to the Greenplum Database master as the gpadmin user and create the demo database.

    $ createdb demo
    
  2. Open an interactive shell for executing queries in the demo database.

    $ psql demo
    
  3. Create the articles table in the wikipedia schema with the following statements.

    CREATE SCHEMA wikipedia;
    CREATE TABLE wikipedia.articles (
        id int8 primary key,
        date_time timestamptz,
        title text,
        content text,
        refs text
    ) DISTRIBUTED BY (id);
    
  4. Create the message table in the twitter schema with the following statements.

    CREATE SCHEMA twitter;
    CREATE TABLE twitter.message (
        id bigint,
        message_id bigint,
        spam boolean,
        created_at timestamp without time zone,
        source text,
        retweeted boolean,
        favorited boolean,
        truncated boolean,
        in_reply_to_screen_name text,
        in_reply_to_user_id bigint,
        author_id bigint,
        author_name text,
        author_screen_name text,
        author_lang text,
        author_url text,
        author_description text,
        author_listed_count integer,
        author_statuses_count integer,
        author_followers_count integer,
        author_friends_count integer,
        author_created_at timestamp without time zone,
        author_location text,
        author_verified boolean,
        message_url text,
        message_text text ) 
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (created_at)
    ( START (DATE '2011-08-01') INCLUSIVE
      END (DATE '2011-12-01') EXCLUSIVE
      EVERY (INTERVAL '1 month'));
    CREATE INDEX id_idx ON twitter.message USING btree (id);
    
  5. CREATE the store.products table with these statements.

    CREATE SCHEMA store;
    CREATE TABLE store.products (
        id bigint, 
        title text, 
        category varchar(32), 
        brand varchar(32), 
        price float )
    DISTRIBUTED BY (id);
    
  6. Download test data for the three tables here. Right-click the link, save the file, and then copy it to the gpadmin user’s home directory.

  7. Extract the data files with this tar command.

    $ tar xvfz gptext-demo-data.tgz
    
  8. Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.

    \COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;
    

    The articles table now contains text from 23 Wikipedia articles.

  9. Load the twitter data into the twitter.message table using the following psql \COPY metacommand.

    \COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;
    

    The message table now contains 1730 tweets from August to October, 2011.

  10. Load the products table into the store.products table with the following psql \COPY metacommand.

    \COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;
    

    The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.

Setting up the GPText Command-line Environment

To work with GPText indexes, you must first set up your environment and add the GPText schema to the database containing the documents (Greenplum Database data) you want to index.

To set the environment, log in as the gpadmin user and source the Greenplum Database and GPText environment scripts. The Greenplum Database environment must be set before you source the GPText environment script. For example, if both Greenplum Database and GPText are installed in the /usr/local/ directory, enter these commands:

$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh

With the environment now set, you can access the GPText command-line utilities.

Adding the GPText Schema to a Database

Use the gptext-installsql utility to add the GPText schema to databases containing data you want to index with GPText. You perform this task one time for each database. In this example, the gptext schema is installed into the demo database.

$ gptext-installsql demo

The gptext schema provides user-defined types, tables, views, and functions for GPText. This schema is reserved for GPText. If you create any new objects in the gptext schema, they will be lost when you reinstall the schema or upgrade GPText.

Creating GPText Indexes and Indexing Data

The general steps for creating a GPText index and indexing documents are:

  1. Create an empty Solr index
  2. Customize the index (optional)
  3. Populate the index
  4. Commit the index

After you complete these steps, you can create and execute a search query or implement machine learning algorithms. Searching GPText indexes is described in the Querying GPText Indexes topic.

The following steps are completed by executing SQL commands and GPText functions in the database. Refer to the GPText Function Reference for details about the GPText functions described in the following examples.

Create an empty GPText index

A GPText index is an Apache Solr collection containing documents added from a Greenplum Database table. There can be one GPText index per Greenplum Database table. Each row in the database table is a document that can be added to the GPText index.

If the database table is partitioned, there is one GPText index for all partitions. You must specify the root table name when creating the index and adding documents to it. GPText provides search semantics that enable searching partitions efficiently.

A GPText external index is a Solr index for documents that are located outside of Greenplum Database. GPText provides user-defined functions to create external indexes and insert documents into them. See Working with GPText External Indexes.

The gptext.create_index() function creates a new GPText index. This function has two signatures:

gptext.create_index(<schema_name>, <table_name>, <id_col_name>, 
    <def_search_col_name> [, <if_check_id_uniqueness>])

or

gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>,
    <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

The <schema_name> and <table_name> arguments specify the database table that contains the source documents.

The <id_col_name> argument is the name of the table column that contains a unique identifier for each row. The <id_col_name> column can be of type int4, int8, varchar, text, or uuid.

The <def_search_col_name> argument is the name of the table column that contains the content you want to search by default. For example, if you want to index and search just the <content> column, you can use the first signature and specify the content column name in the <def_search_col_name> argument.

The final, optional argument, <if_check_id_uniqueness>, is a Boolean argument. When true, the default, attempting to add a document with an id that already exists in the index generates an error. If you set the argument to false, you can add documents with the same id, but when you search the index all documents with the same ID are returned.

The following command creates an index for the twitter.message table, with the id column as the unique ID field and the message_text column for the default search column:

=# SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text');

To verify that the demo.twitter.message index was created, call gptext.index_status():

=# SELECT * FROM gptext.index_status('demo.twitter.message');
 content_id |      index_name      | shard_name | shard_state | replica_name | replica_state |
core                 |     node_name     |         base_url         | is_leader | partitioned | external_index
------------+----------------------+------------+-------------+--------------+---------------+-----------------
---------------------+-------------------+--------------------------+-----------+-------------+----------------
          0 | demo.twitter.message | shard0     | active      | core_node2   | active        | demo.twitter.mes
sage_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | t           | f
          0 | demo.twitter.message | shard0     | active      | core_node3   | active        | demo.twitter.mes
sage_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | t           | f
          1 | demo.twitter.message | shard1     | active      | core_node1   | active        | demo.twitter.mes
sage_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | t         | t           | f
          1 | demo.twitter.message | shard1     | active      | core_node4   | active        | demo.twitter.mes
sage_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | f         | t           | f
(4 rows)

This example executed on a Greenplum Database cluster with two primary segments. Two shards were created, one for each segment, and each shard has two replicas. The replicas are named core_node1 through core_node4.

You can also run the gptext-state -D command-line utility to verify the index was created. See the gptext-state reference for details.

The GPText index for the demo.twitter.message table is configured, by default, to index all columns in the twitter.message database table. You can write search queries that contain criteria using any column in the table.

If you want to index and search a subset of the table columns, you can use the second gptext.create_index() signature, specifying the columns to index in the <p_columns> argument and the data types of those columns in the <p_types> argument. The <p_columns> and <p_types> arguments are text arrays. The id column name and default search column name must be included in the arrays.

Use the second gptext.create_index() signature to create an index for the wikipedia.articles table. This index will allow you to search on the title, content, and refs columns. Note that the id column and default search column are still specified in separate arguments following the <p_columns> and <p_types> arrays.

=# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{id, title, content, refs}',
     '{long, text_intl, text_intl, text_intl}', 'id', 'content', true);
INFO:  Created index demo.wikipedia.articles
 create_index
--------------
 t
(1 row)

Because the date_time column was omitted from the <p_columns> and <p_types> arrays, it will not be possible to search the wikipedia.articles index on date with the GPText search functions.

Customize the index (optional)

Creating a GPText index generates a set of configuration files for the index. Before you add documents to the index, you can customize the configuration files to change the way data is indexed and stored. You can customize an index later, after you have added documents to it, but you must then reindex the data to take advantage of your customizations.

One common customization is to remap data types for some database columns. In the managed-schema configuration file for an index, GPText maps the data types for each field from the Greenplum Database type to an equivalent Solr data type. GPText applies default mappings (see GPText and Solr Data Type Mappings), but your index may be more effective if you use a different mapping for some fields.

The demo.twitter.message table, for example, has a message_text text column that contains tweets. By default, GPText maps text columns to the Solr text_intl (international text) type. The GPText text_sm (social media text) type is a better mapping for a text column that contains social media idioms such as emoticons.

Follow these steps to remap the message_text field to the gtext_sm type.

  1. Use the gptext-config utility to edit the managed-schema file for the demo.twitter.message index.

    $ gptext-config edit -i demo.twitter.message -f managed-schema
    

    The managed-schema file loads in a text editor (normally vi).

  2. Find the <field> element for the message_text field.

    <field name="message_text" stored="false" type="text_intl" indexed="true"/>
    
  3. Change the type attribute from text_intl to text_sm.

    <field name="message_text" stored="false" type="text_sm" indexed="true"/>
    
  4. Save the file and exit the editor.

There are many other ways to customize a GPText index. For example, you can omit fields from the index by changing the indexed attribute of the <field> element to false, store the contents of the field in the index by changing the stored attribute to true, or use gptext-config to edit the stopwords.txt file to specify additional words to ignore when indexing.

See Customizing GPText Indexes to learn how data type mapping determines how Solr analyzes and indexes field contents and for more ways to customize GPText indexes.

Populate the index

To populate the index, use the table function gptext.index(), which has the following syntax:

SELECT * FROM gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>);

To index all rows in the twitter.message table, execute this command:

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
 dbid | num_docs
------+----------
    2 |      892
    3 |      838
(2 rows)

This command indexes the rows in the wikipedia.articles table.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
dbid | num_docs 
------+----------
   3 |       11
   2 |       12
(2 rows)

The results of this command show that 23 documents from two segments were added to the index.

The first argument of the gptext.index() function is a “table-valued expression.” TABLE(SELECT * FROM wikipedia.articles) creates a table-valued expression from the articles table, using the table function TABLE.

You can choose the data to index or update by changing the inner select list in the query to select the rows you want to index. When adding new documents to an existing index, for example, specify a WHERE clause in the gptext.index() call to choose only the new rows to index.

The inner SELECT statement could also be a query on a different table with the same structure, or a result set constructed with an arbitrarily complex join, provided the columns specified in the gptext.create_index() function are present in the results. If you index data from a source other than the table used to create the index, be sure the distribution key for the result set matches the distribution key of the base table. The Greenplum Database SELECT statement has a SCATTER BY clause that you can use to specify the distribution key for the results from a query. See Specifying a distribution key with SCATTER BY for more about the distribution policy and GPText indexes.

Commit the index

After you create and populate an index, you commit the index using gptext.commit_index(<index_name>).

This example commits the documents added to the indexes in the previous example.

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)

The gptext.commit_index() function commits any new data added to or deleted from the index since the last commit.

Managing GPText Indexes

GPText provides command-line utilities and functions you can use to perform these GPText management tasks:

Configuring an index

You can modify your indexing behavior globally by using the gptext-config utility to edit a set of index configuration files. The files you can edit with gptext-config are:

  • solrconfig.xml – Contains most of the parameters for configuring Solr itself (see http://wiki.apache.org/solr/SolrConfigXml).
  • managed-schema – Defines the analyzer chains that Solr uses for various different types of search fields (see Text Analyzer Chains).
  • stopwords.txt – Lists words you want to eliminate from the final index.
  • protwords.txt – Lists protected words that you do not want to be modified by the analyzer chain. For example, iPhone.
  • synonyms.txt – Lists words that you want replaced by synonyms in the analyzer chain.
  • elevate.xml – Moves specific words to the top of your final index.
  • emoticons.txt – Defines emoticons for the text_sm social media analyzer chain. (see The emoticons.txt file).

You can also use gptext-config to move files.

Optimizing an index

The function gptext.optimize_index(<index_name>, <max_segments>) merges all segments into a small number of segments (<max_segments>) for increased efficiency.

Example:

=# SELECT * FROM gptext.optimize_index('demo.wikipedia.articles', 10);
 optimize_index
----------------
 t
(1 row)

Specifying a distribution policy with SCATTER BY

The first parameter of gptext.index() is a table-valued expression, such as TABLE(SELECT * FROM wikipedia.articles). The query in this parameter must have the same distribution policy as the table you are indexing so that documents added to the index are associated with the correct Greenplum Database segments. Some queries, however, have no distribution policy or they have a different distribution policy. This could happen if the query is a join of two or more tables or a query on an intermediate (staging) table that is distributed differently than the base table for the index.

To specify a distribution policy for a query result set, the Greenplum Database SELECT statement has a “SCATTER BY” clause.

TABLE(SELECT * FROM wikipedia.articles SCATTER BY <distrib_id>)

where distrib_id is the same distribution key used to distribute the base table for the index.

Deleting from an index

You can delete from an index using a query with the function gptext.delete(<index_name>, <query>). This deletes from the index all documents that match the search query. To delete all documents, use the query '*'.

After a successful deletion, execute gptext.commit_index(<index_name>) to commit the change.

This example deletes all documents containing "toxin" in the default search field.

=# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'toxin');
 delete
--------
 t
(1 row)

SELECT * FROM gptext.commit_index('demo.wikipedia.articles');

Example that deletes all documents from the index:

SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');

Be sure to commit changes to the index after deleting documents.

SELECT * FROM gptext.commit_index('demo.wikipedia.articles');

Dropping an index

You can completely remove an index with the gptext.drop_index(<index_name>) function.

Example:

SELECT * FROM gptext.drop_index('demo.wikipedia.articles');

Adding a field to an index

You can add a field to an existing index using the gptext.add_field() function. For example, you can add a field to the index after a column is added to the underlying database table or you can add a field to index a column that was not specified when the index was created.

GPText maps the Greenplum Database field type to an equivalent Solr data type automatically. See GPText and Solr Data Type Mappings for a table of data type mappings.

CREATE TABLE myarticles (
    id int8 primary key,
    date_time timestamptz,
    title text,
    content text,
    refs text
) DISTRIBUTED BY (id);

SELECT * FROM gptext.create_index('wikipedia', 'myarticles', 'id', 'content', true);
   ... populate the index ...
SELECT * FROM gptext.commit_index('demo.wikipedia.myarticles');

ALTER TABLE myarticles ADD notes text;
SELECT * FROM gptext.add_field('demo.wikipedia.myarticles', 'notes', false, false);
SELECT * FROM gptext.reload_index('demo.wikipedia.myarticles');

Adding a field to a GPText index requires the base table to be available. If you drop the table after creating the index, you cannot add fields to the index.

Dropping a field from an index

You can drop a field from an existing index with the gptext.drop_field() function. After you have dropped fields, call gptext.reload_index() to reload the index.

Example:

SELECT * FROM gptext.drop_field('demo.wikipedia.myarticles', 'notes');
SELECT * FROM gptext.reload_index('demo.wikipedia.myarticles');

Listing all indexes

You can list all indexes in the GPText cluster using the gptext-state command-line utility. For example:

$ gptext-state -D
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Execute GPText state ...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Check zookeeper cluster state ...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Check GPText cluster status...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Current GPText Version: 2.1.2
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-All nodes are up and running.
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-Index state details.
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-   database    index name                 state
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-   wikipedia   demo.wikipedia.articles  Green
20170822:10:11:28:029752 gptext-state:gpsne:gpadmin-[INFO]:-Done.

Storing Field Content in an Index

Solr can store the contents of columns in the index so that results of a search on the index can include the column contents. This makes it unnecessary to join the search query results with the original table. You can even store the contents of database columns that are not indexed and return that content with search results. GPText returns the additional field content in a buffer added to the search results. Individual fields can be retrieved from this buffer using the gptext.gptext_retrieve_field(), gptext.gptext_retrieve_field_int(), and gptext.gptext_retrieve_field_float() functions.

One design pattern is to store content for all of a table’s columns in the GPText index so the database table can then be truncated or dropped. Additional documents can be added to the GPText index later by inserting them into the truncated table, or into a temporary table with the same structure, and then adding them to the index with the gptext.index() function.

To enable storing content in a GPText index, you must edit the managed-schema file for the index. The <field> element for each field has a stored attribute, which defaults to false, except for the unique id field.

Follow these steps to configure the demo.wikipedia.articles index to store content for the title, content, and refs columns.

  1. Log into the master as gpadmin and use gptext-config to edit the managed-schema file.

    $ gptext-config edit -i demo.wikipedia.articles -f managed-schema
    
  2. Find the <field> elements for the columns you want to store in the index. Note that <field> elements with names beginning with an underscore are internal fields and should not be modified. The “title”, “content”, and “refs” fields in this example are indexed, but not stored.

    <field name="__temp_field" type="intl_text" indexed="true" stored="false" multiValued="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="id" stored="true" type="long" indexed="true"/>
    <field name="__pk" stored="true" indexed="true" type="long"/>
    <field name="title" stored="false" type="text" indexed="true"/>
    <field name="content" stored="false" type="text" indexed="true"/>
    <field name="refs" stored="false" type="text" indexed="true"/>
    
  3. For each field you want to store in the index, change the stored attribute from "false" to "true".

    <field name="title" stored="true" type="text" indexed="true"/>
    <field name="content" stored="true" type="text" indexed="true"/>
    <field name="refs" stored="true" type="text" indexed="true"/>
    
  4. Save the file and, if any documents were already added to the index, reindex the table.

    See Retrieving Stored Field Content for information about retrieving the stored content with GPText query results.

For more about the contents of the managed-schema file and additional ways to customize GPText indexes see Customizing GPText Indexes.

Creating a GPText index for a Greenplum Database partitioned table

Creating a GPText index for a partitioned Greenplum Database table using gptext.create_index() is the same as creating an index for a non-partitioned table. You must supply the name of the root partition, however; if you attempt to create a GPText index for a child partition, the gptext.create_index() function issues an error message.

GPText recognizes a partitioned table and adds a __partition field to the index. Then when you add documents to the index, GPText saves the child partition table name in the __partition field. You can use the __partition field to create GPText queries that search and filter by partition.

Unlike Greenplum Database, which manages child partitions as separate database tables, GPText does not create a separate Solr collection for each database partition because the larger number of Solr cores could adversely affect the capacity and performance of the Solr cluster.

The demo.twitter.message table created in the Setting Up the Sample Database section is a partitioned table. See Searching Partitioned Tables for examples of searching partitions.

Adding and dropping partitions from GPText indexes

You can add new partitions to, and drop partitions from, Greenplum Database partitioned tables. If you have created a GPText index on a partitioned table, when you add or drop partitions in the base database table, you must perform a parallel GPText index operation.

When a new partition is added, the partition can be indexed once the data is in place. You can select rows directly from the newly added child partition table to index the data. First, use the gptext.partition() status function to find the names of child partition tables.

=# SELECT * FROM gptext.partition_status('demo.twitter.message');
           partition_name           |    inherits_name     | level |
                                     cons

------------------------------------+----------------------+-------+-------------------------------
---------------------------------------------------------------------------------------------------
----------
 demo.twitter.message_1_prt_1       | demo.twitter.message |     1 | ((created_at >= '2011-08-01 00
:00:00'::timestamp without time zone) AND (created_at < '2011-09-01 00:00:00'::timestamp without ti
me zone))
 demo.twitter.message_1_prt_2       | demo.twitter.message |     1 | ((created_at >= '2011-09-01 00
:00:00'::timestamp without time zone) AND (created_at < '2011-10-01 00:00:00'::timestamp without ti
me zone))
 demo.twitter.message_1_prt_3       | demo.twitter.message |     1 | ((created_at >= '2011-10-01 00
:00:00'::timestamp without time zone) AND (created_at < '2011-11-01 00:00:00'::timestamp without ti
me zone))
 demo.twitter.message_1_prt_4       | demo.twitter.message |     1 | ((created_at >= '2011-11-01 00
:00:00'::timestamp without time zone) AND (created_at < '2011-12-01 00:00:00'::timestamp without ti
me zone))
 demo.twitter.message_1_prt_dec2011 | demo.twitter.message |     1 | ((created_at >= '2011-12-01 00
:00:00'::timestamp without time zone) AND (created_at < '2112-01-01 00:00:00'::timestamp without ti
me zone))
(5 rows)

In the example above, a new partition with the name twitter.message_1_prt_dec2011 was added to the demo.twitter.message table. The following statements add the data from the new partition to the GPText index and commit the changes.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message_1_prt_dec2011), 'demo.twitter.message');
 dbid | num_docs
------+----------
    3 |      109
    2 |      128
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)

The name of the new child partition file (excluding the database and schema names) is saved in the __partition field in the index.

When a partition is deleted from a partitioned table, the data from the partition can be deleted from the GPText index by specifying the partition name in the <search> argument of the gptext.delete() function. Be sure to commit the index after deleting the partition.

=# SELECT * FROM gptext.delete('demo.twitter.message', '__partition:message_1_prt_dec2011');
delete
--------
t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)