LATEST VERSION: 3.1.0 - RELEASE NOTES
Pivotal® Greenplum® Text v2.2.1

GPText Function Reference

The following functions are available in Pivotal GPText.

Indexing

gptext.create_index() – creates an empty index.

gptext.create_index_external() - creates an index for external documents.

gptext.index() – populates an index.

gptext.index_external() - adds documents to an external index.

gptext.commit_index() – finalizes index operations.

gptext.enable_terms() – enables term vectors and positions to allow extracting terms and their positions from text fields.

gptext.recreate_error_table() – recreates the error table that records errors that occur while adding documents to an external index.

Modifying or Deleting an Index

gptext.add_field() – adds a field to an index.

gptext.delete() – deletes documents matching a search query.

gptext.drop_field() – deletes a field from an index.

gptext.drop_index()– deletes an index.

gptext.search() – searches an index.

gptext.search_count() – returns number of documents that match search.

gptext.search_external() - searches a GPText external index.

gptext.gptext_retrieve_field – extracts a single field from the rf search result column as text.

gptext.gptext_retrieve_field_int – extracts a single field from the rf search result column and converts to an integer.

gptext.gptext_retrieve_field_float – extracts a single field from the rf search result column and converts to a float.

gptext.highlight() – returns search result with search term highlighted.

gptext.highlight_external() – applys highlighing to search results from external indexes.

gptext.faceted_field_search() – search, faceted by fields.

gptext.faceted_query_search() – search, faceted by queries.

gptext.faceted_range_search() – search, faceted by defined ranges.

Working With Terms

gptext.enable_terms() – enables term vectors and positions.

gptext.terms() – gets the term vectors for the indexed documents in a Solr index for the specified field.

Configuration and Monitoring

gptext.cluster_status() - shows status of indexes managed by the GPText cluster.

gptext.index_size() - shows the number of documents indexed and total disk space used for GPText indexes.

gptext.index_status() – shows status of replicas for an index or for all indexes.

gptext.live_nodes() – lists active Solr nodes.

gptext.partition_status() - lists partitioned indexes and child partitions.

gptext.reload_index() – reloads Solr configuration files.

gptext.version() – returns version of GPText installation.

gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.

High Availability

gptext.add_replica() – Adds a replica of an index shard.

gptext.delete_replica() – Deletes a replica of an index shard.

General Purpose Functions

gptext.count_t() – counts number of rows in a table.

Privileges

Your privileges to execute the GPText functions depend on your Greenplum Database privileges for the table from which the index is generated. For example, if you have SELECT privileges for a table in the Greenplum database, you have SELECT privileges for an index generated from that table.

Executing index functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Security section of the GPText User’s Guide for information about setting privileges.

The Privileges required section for each of the GPText functions specifies the privileges required to execute that function.

Usage

The gptext functions in this section must be executed as SQL queries in the form:

SELECT * FROM gptext.function();

The examples in this document use a Greenplum database named demo set up as follows:

  • A table named articles in the wikipedia schema.
  • A table named message in the twitter schema.

See Setting Up the Sample Database for details about these tables.

Indexing

Indexing functions create, set up, populate, and finalize (commit) Solr indexes.

gptext.create_index()

Creates an empty Solr index.

Syntax

gptext.create_index(<schema_name>, <table_name>, <id_col_name>, 
  <def_search_col_name> [, <if_check_id_uniqueness>])

or

gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>, 
  <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

Parameters

<schema_name>
The name of the schema in the Greenplum database.
<table_name>
The name of the table in the Greenplum database. If the table is partitioned this must be the name of the root table.
<p_columns>
A text array containing the names of the table columns to index. If <p_columns> and <p_types> are omitted, all table columns are indexed.

The columns must be valid columns in the table. The columns identified by the <id_col_name> and <def_search_col_name> must be included in the array.

If the <p_columns> parameter is supplied, the <p_types> parameter must also be supplied. The sizes of the <p_columns> and <p_types> arrays must be the same.

<p_types>
A text array containing the Solr data types of the columns in the <p_columns> array.

Text types can be mapped to the name of an analyzer chain, for example <text_intl>, <text_sm>, or any type defined in the <managed_schema>. See Map Greenplum Database Data Types to Solr Data Types for equivalent Solr data types for other Greenplum types.

The <p_types> parameter must be supplied if the <p_columns> parameter is supplied.

<id_col_name>
The name of a column in <table_name> that is unique for each row. The column must be of type int4, int8, varchar, text, or uuid.
<def_search_col_name>
The name of the default column to search in <table_name>, if no other column is named in a query.
<if_check_id_uniqueness>
Optional. A Boolean value. The default is true. Set to false to index a table with a non-unique ID field.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

A GPText index is a Solr collection.

The contents of the <id_col_name> column should, in most cases, be a unique ID for each row. It must be of type int4, int8, varchar, or text.

If the <if_check_id_uniqueness> argument is true, the default, a document with an ID matching an existing ID cannot be added to the index.

If the <if_check_id_uniquess> argument is false, documents with duplicate IDs are allowed to be added to the index. The content of other fields may or may not be the same as existing documents with the same ID. When a query returns multiple documents with the same ID, it is the user’s responsibility to anticipate and handle the multiple documents. For example, a table could have a revision column that is incremented when a new version of a document is added to the index, allowing queries that omit all but the most recent version from search results.

The name of the index created has the format:

<database_name>.<schema_name>.<table_name>

When the table is partitioned, the GPText index created for the table will contain records for all partitions. If you specify the name of a subpartition table in this function an error is returned. The index records for documents added to the index have a __partition field containing the name of the child partition table. See Searching Partitioned Tables for syntax to search by partitions.

Populate the new index with gptext.index().

The number of replicas for each shard is determined when the index is created. It is the value of the gptext.replication_factor server configuration parameter, 2 by default.

If the gptext.failover_factor server configuration parameter is set, gptext.create_index() fails if the ratio of the number of GPText nodes that are up to the total number of GPText nodes is less than the gptext.failover_factor value (from 0.0 to 1.0). Index shards can only be created on active GPText nodes, so the gptext.failover_factor parameter prevents overloading the active GPText nodes when too many nodes are down.

To index a partitioned table, specify the name of the root table. The gptext.index() function returns an error if you specify the name of a child partition table.

Examples

  • Create an index, demo.wikipedia.articles, with content as the default search field.

    =# SELECT * FROM gptext.create_index('wikipedia', 'articles', 'id', 'content'); 
    
  • Create an index, demo.wikipedia.articles, with content as the default search field. Index the id, content, and title fields.

    =# SELECT * FROM gptext.create_index('wikipedia', 'articles', 
          '{"id", "content", "title"}', '{"long", "text", "text"}', 
          'id', 'content');
    

gptext.create_index_external()

Creates an empty Solr index for external documents.

Syntax

gptext.create_index_external(<index_name>)

Parameters

<index_name>
The name of the index to create. The name cannot contain periods (.).

Notes

A GPText external index is a Solr index for documents external to Greenplum Database, for example, PDF, Microsoft Word, XML, and HTML files. Unlike regular GPText indexes, external indexes are not associated with a Greenplum Database table, but they can be searched with GPText search functions.

Example

=# SELECT * FROM gptext.create_index_external('gptext-docs');

gptext.enable_terms()

Enables term vectors and positions to allow extracting terms and their positions from fields of data type text.

Syntax

gptext.enable_terms(<index_name>, <field_name>) 

Parameters

<index_name>
The name of the index for which you want to enable terms.
<field_name>
The name of the field for which you want to enable terms.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Solr can mark terms and their positions in documents when indexing. This capability is disabled by default. Use gptext.enable_terms() to enable the capability.

Call gptext.enable_terms() for each field where you want to enable terms.

After calling this function, you must index or re-index with gptext.index().

Examples

=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
WARNING:  Enable terms for field: message_text of index: demo.twitter.message successfully. Reindex data needed.
 enable_terms 
--------------
 t
(1 row)

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
 dbid | num_docs 
------+----------
    3 |      947
    2 |     1020
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index 
--------------
 t
(1 row)

gptext.index()

Populates an index by indexing data in a table.

Syntax

gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>) 

Parameters

TABLE(SELECT * FROM <table_name>)
The table to be indexed, with data type anytable.
<index_name>
Name of the index that was created with gptext.create_index() and is to be populated.

Return type

SETOF dbid INT, num_docs BIGINT

where dbid is the dbid of the segment that the documents were sent to, and num_docs is the number of documents that were indexed.

Privileges required

You must have the INSERT or UPDATE privilege to execute this function.

Remarks

<index_name> must have been created with gptext.create_index().

The arguments to the gptext.index() function must be expressions. For example, TABLE(SELECT * FROM articles) creates a “table-valued expression” from the articles table, using the table function TABLE.

You can selectively index/update by changing the inner select list in the query.

After successfully indexing, you must commit the index with gptext.commit_index(<index_name>).

The output includes a two-column table with dbid (the Greenplum segment ID) and num_docs (the number of documents added to the index for that segment) as the columns.

Note:
Be careful about distribution policies:

The first parameter of gptext.index() is TABLE(SELECT * FROM <table_name>). The query in this parameter should have the same distribution policy as the table you are indexing. There are two cases where the query will not have the same distribution policy:

  1. Your query is a join of two tables

  2. You are indexing an intermediate table (staging table) that is distributed differently than the final table.

When the distribution policies differ, you must specify "SCATTER BY" for the query like this:

TABLE(SELECT * FROM messages SCATTER BY <distrib_id>),

where <distrib_id> is the distribution id used when you created your primary/final table.

Example

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
 dbid | num_docs 
------+----------
    3 |        6
    2 |        5
(2 rows)

gptext.index_external()

Adds documents stored outside of Greenplum Database to a GPText external index.

Syntax

gptext.index_external(<url-list>, <index-name>)

Parameters

<url-list>
A list of URLs for documents to add to the GPText external index. The URLs may be expressed as an array or as a table-valued expression.
<index-name>
The name of the index to which the documents are to be added.

Remarks

If the document cannot be retrieved at the given URL, or an error occurs while indexing the document, GPText inserts a row in the gptext.error_table table. You can use gptext.recreate_error_table() to create an empty error table before you call gptext.index().

The value of the GPText custom server parameter gptext.idx_segment_error_limit (default 10) is the number of errors that can occur on any one segment before the indexing operation is canceled.

When adding a document to an external index, GPText calculates a 256-bit hash on the contents of the document. The hash is stored as a 64-byte hexadecimal value in the sha256 field. If you later add a document with a URL matching an existing document in the index, the new document is only added to the index if the newly calculated hash differs from the current value in the sha256 field.

Examples

This example adds a single PDF document to the index gptext-docs.

=# SELECT * FROM gptext.index_external(
      '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}',
      'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

This example adds multiple HTML documents to the gptext-docs external index by selecting URLs from a database table. Errors will be logged in the gptext.gptext_errrors table.

=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
      id bigint,
      url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES 
    (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'), 
    (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'), 
    (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
    (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
    (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
    (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
    (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
    (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'), 
    (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
    (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
    (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
    (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
    (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
    (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
    (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
      TABLE(SELECT url FROM gptext_html_docs), 
      'gptext-docs', 'gptext_docs_errors');
dbid | num_docs
------+----------
   3 |        6
   2 |        8
(2 rows)

=# SELECT * FROM gptext.gptext_docs_errors;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------
error_time | 2000-01-01 00:25:11.282769
index_name | gptext-docs
sqlcmd     |
errmsg     | Code: RUNTIME_ERROR, Message: 'http://gptext.docs.pivotal.io/210/topics/ext-indexes.html. '
rawdata    | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
rawbytes   |

=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)

gptext.commit_index()

Finishes an index operation. The results of an indexing operation are not available until this function is called for the index.

Syntax

gptext.commit_index(<index_name>)

Parameters

<index_name>
The name of the index to commit. If the table is partitioned this must be the name of the root table.

Return type

boolean

Privileges required

You must have the INSERT, UPDATE, or DELETE privilege to execute this function.

Remarks

Must be called after gptext.index() and gptext.delete().

Example

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index 
--------------
 t
(1 row)

gptext.recreate_error_table()

Drops and recreates the gptext.error_table database table.

Syntax

gptext.recreate_error_table()

Return type

Boolean

Remarks

If an error occurs while accessing a document to add to a GPText external index, GPText adds a record to the gptext.error_table table. See gptext.error_table for a description of this table. Users should not drop or modify the table.

Rows added to the gptext.error_table table remain until you use the gptext.recreate_error_table() function to create a new empty table.

If you attempt to execute gptext.recreate_error_table() while it is in use for an indexing operation, a warning is raised and the function returns false without recreating the table.

Examples

=# SELECT gptext.recreate_error_table();
recreate_error_table
----------------------
t
(1 row)

Modifying or Deleting an Index

You can change an index by adding or dropping fields, reverting an index to its previous state, or deleting the index.

gptext.add_field()

Adds a field to your schema if the field was added to the database after the index was created.

Syntax

gptext.add_field (<index_name>, <field_name>[, <is_default_search_col> [, <if_enable_terms>]])

Parameters

<index_name>
The name of the index to which you want to add the field. If the table is partitioned this must be the name of the root table.
<field_name>
The name of the field to be indexed.
<is_default_search_col>
Optional. Boolean value. Is this to become the default search column (field)?
<if_enable_terms>
Optional. Boolean value. Enable terms support on this field when added to the GPText index.

Return type

SETOF boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Call this function for each field you add.

Before and after you add one or more fields, reload the Solr configuration files with gptext.reload_index(). The initial reload_index() call is required because of Solr 4.0 behavior and may not be required in subsequent versions.

After you add one or more fields, you should reindex the table and commit the index with gptext.commit_index().

Example

Adds the field external_links to the index, then recreates, repopulates, and commits the index.

=# ALTER TABLE wikipedia.articles ADD external_links text;
 ALTER TABLE
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index
--------------
 t
(1 row)
=# SELECT * FROM gptext.add_field('demo.wikipedia.articles', 'external_links', false, false);
INFO:  Add field: external_links for index: demo.wikipedia.articles
 add_field 
-----------
 t
(1 row)

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index 
--------------
 t
(1 row)

gptext.delete()

Deletes all documents that match the search query.

Syntax

gptext.delete(<index_name>, <query>)

Parameters

<index_name>
The name of the index.
<query>
Documents matching this query will be deleted. To delete all documents use the query'*'or'*:*'.

Return type

boolean

Privileges required

You must have the DELETE privilege to execute this function.

Remarks

After a successful delete, commit the index using gptext.commit_index(<index_name>).

Examples

  • Delete all documents containing the word “unverified” in the default search field:

    =# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'unverified');
     delete 
    --------
     t
    
    (1 row)
    =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
     commit_index 
    --------------
     t
    (1 row)
    
  • Delete all documents from the index:

    =# SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');
     delete 
    --------
     t
    (1 row)
    
    =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
     commit_index 
    --------------
     t
    (1 row)
    

gptext.drop_field()

Removes a field from your schema.

Syntax

gptext.drop_field(<index_name>, <field_name>)

Parameters

<index_name>
The name of the index from which to drop the field. If the table is partitioned this must be the name of the root table.
<field_name>
The name of the field to drop.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Call this function for each field you drop.

Before and after dropping one or more fields, you must reload the Solr configuration files with gptext.reload_index(), then commit the index with gptext.commit_index().

The column __partition in indexes for partitioned database tables cannot be dropped.

The initial reload_index() is required by Solr 4.0 behavior and may not be necessary in subsequent versions.

Example

Drops the field external_links from the index.

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

=# SELECT * FROM gptext.drop_field('demo.wikipedia.articles', 'external_links');
INFO:  Drop field: external_links for index: demo.wikipedia.articles
 drop_field 
------------
 t
(1 row)

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index 
--------------
 t
(1 row)

gptext.drop_index()

Removes an index.

Syntax

gptext.drop_index(<index_name>)

Parameters

<index_name>
The name of the index to drop. If the database table is partitioned, this must be the name of the root table.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

A dropped index cannot be recovered.

Example

=# SELECT * FROM gptext.drop_index('demo.wikipedia.articles');
 drop_index 
------------
 t
(1 row)

Search functions enable querying an index.

Changing the query parser at query time

When using the search functions, you can change the query parser used by Solr at query time. A different query parser may be required, depending on the nature of the query. See Using Advanced Querying Options for a list of the query parsers GPText supports.

To change the query parser at query time, use the defType Solr option with the gptext.search() function.

To change the query parser for any search function at query time, use the Solr localParams syntax, replacing the <query> term with '{!type=edismax}<query>'.

With the GPText Universal Query Parser, you can use features from any of the other supported query parsers in a single query. To use the Universal Query Parser, replace the <query> term with '{!gptextqp}<query>'. See Using the Universal Query Parser for information and examples.

gptext.search()

Searches an index.

Syntax

gptext.search(<src_tbl>, <index_name>, <search_query>, <filter_queries>[, <options>])

Parameters

<src_table>
Specifies a SELECT statement on an existing, indexed table on which to perform the search. The <src_table> parameter is an anytable data type, specified in this format:

TABLE(SELECT * FROM <src_table>)

<index_name>
The name of the index to search. If the database table is partitioned you can specify the name of a sub-partition table to search.
<search_query>
Text value containing a Solr text search query.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.search_scored_result

This is a composite type with the following columns:

Column Type
id text
score double precision
hs (conditional) gptext hstore
rf (conditional) text

The id column is returned as text, even if the <id_col> specified in the gptext.create_index() function is an integer type. If you order results by id or join search results with the original table on id, you must cast the returned id column to the correct integer type in your query. For example, the following search query casts the id returned by the search query to an INT8 type to join with the numeric id column in the wikipedia.articles table and to sort the results numerically. However, the id column in the results is a text value and is therefore displayed left-justified.

SELECT s.id, s.score, a.title 
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 
    'demo.wikipedia.articles', '*:*', null) s
WHERE s.id::INT8 = a.id
ORDER BY s.id::INT8;
    id    | score |           title
----------+-------+---------------------------
 25784    |     1 | Renewable energy
 27743    |     1 | Solar energy
 54838    |     1 | Biogas
 55017    |     1 | Fusion power
 65499    |     1 | Soil salinity
 113728   |     1 | Geothermal energy
 213555   |     1 | Solar updraft tower
 533423   |     1 | Solar water heating
 608623   |     1 | Ethanol fuel
 855056   |     1 | Forward osmosis
 2008322  |     1 | Vehicle-to-grid
 2120798  |     1 | Lithium economy
 2988035  |     1 | Vortex engine
 4711003  |     1 | Osmotic power
 7906908  |     1 | Biomass
 13021878 |     1 | Geothermal power
 13690575 |     1 | Solar power
 14090587 |     1 | Low-carbon power
 14205946 |     1 | Algae fuel
 18965585 |     1 | Pressure-retarded osmosis
 22391303 |     1 | Liquid nitrogen engine
 26787212 |     1 | Reverse electrodialysis
 53716476 |     1 | Seaweed fuel
(23 rows)

If the <options> parameter is included in gptext.search(), the result includes the offsets column. This column contains key-value pairs, where the key is the column name and the value is a comma-separated list of offsets to locations where the search term occurs. This data is used by the gptext.highlight() function to add highlighting tags to the column data. If highlighting is not enabled with the 'hl=true' option, the offsets column is NULL.

If the fl option is included in the <options> parameter to specify additional fields to add to the result, the rf column contains the additional fields in a formatted text value. The gptext.gptext_retrieve_field() function can be used to extract a single field value from the rf column. There are variants of the gptext.gptext_retrieve_field() function to retrieve integer and float values from the rf column value.

Privileges required

You must have the SELECT privilege to execute this function.

Solr options

Solr queries allow the following optional refinements, specified as an ampersand-delimited list in the options parameter.

defType
The name of the query parser to use for this query.

Example: defType=edismax

rows
The maximum number of rows to return per segment. If omitted, all rows are returned.

Example: rows=100 returns 100 rows per segment or all rows if there are fewer than 100.

sort
Sorts on a field or score in ascending or descending order.

Examples:

  • sort=score desc (default if no sort defined)
  • sort=date_time asc
  • sort=date_time asc score desc sorts on date_time ascending, then on score descending
start
The number of the first record to return.

Examples:

  • start=0 default: returned records start with record 0
  • start=25 returned records start with record 25
hl
Enable highlighting.

Example: hl=true

hl.fl
Comma-separated list of field names to consider when highlighting.

Examples:

  • hl.fl=message_text
  • hl.fl=title,content
fl
Comma-separated list of fields to include in search results. The fields must have been set to stored=true in the managed-schema for the index.

Example: fl=title,refs

Remarks

  • The output includes a table with columns id (the ID named in gptext.create_index()) and score (the tf-idf score). A column named offsets is included if highlighting is specified in the options parameter. A column named rf is included when a list of additional fields to include is specified in the <options> parameter.
  • To change the query parser at query time, specify the defType option in the options parameter list. For example, setting the options parameter to 'rows=100&defType=edismax' limits the output to 100 rows per segment and will change the query parser to edismax.
  • The TABLE query is planned and affects the estimate for gptext.search(), but does not execute. For example, if your query includes

    gptext.search(TABLE(SELECT * FROM t), ...)
    

    the query planner estimates the number of results as the number of rows in t. This can cause the query planner to ignore the use of an index scan. Use a query like TABLE(SELECT 1 SCATTER BY 1) to avoid this issue.

  • If you do not specify options, gptext.search() returns all rows.

  • The options separator has changed from comma to ampersand (&) in order to support highlighting. If you do not use highlighting, you can revert to using the comma separator by setting the gptext.search_param_separator to ','.

  • The Solr option rows specifies the maximum number of rows to return per segment. For example, if you have four segments,'rows=100' returns at most a total of 400 rows. To limit the number of rows returned for an entire query, set a LIMIT in the SQL query. For example, the following query returns at most 20 rows for the query:

    =# SELECT t.id, q.score, t.message_text FROM twitter.message t, 
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 
            '*:*', null, null) q 
    WHERE t.id=q.id::int8 LIMIT '20';
    
  • The gptext hstore type is a limited form of the Postgres hstore type. It only has the hstore input and output functions implemented, as gptext_hstore_in and gptext_hstore_out.

Examples

  • Runs a GPText query that looks for Wikipedia articles that contain the term “optimization”, and joins the results to the original Greenplum Database articles table:

    =# SELECT a.id, a.title, q.score 
    FROM wikipedia.articles a, 
         gptext.search(TABLE(SELECT * FROM wikipedia.articles), 
            'demo.wikipedia.articles', 'optimization', null) q 
    WHERE a.id = q.id::int8 ORDER BY score DESC;
        id    |           title           |   score
    ----------+---------------------------+------------
       213555 | Solar updraft tower       |  1.5528862
     18965585 | Pressure-retarded osmosis | 0.89540845
      7906908 | Biomass                   |  0.8692636
        25784 | Renewable energy          |  0.7473389
       533423 | Solar water heating       |  0.7186527
       608623 | Ethanol fuel              |  0.6943706
        27743 | Solar energy              |  0.6943706
      2008322 | Vehicle-to-grid           |  0.6352812
        55017 | Fusion power              |  0.6347449
     14205946 | Algae fuel                |  0.6347449
     13690575 | Solar power               | 0.58286035
    (11 rows)
    
  • Returns 5 rows from each segment with the text “iphone” highlighted in the message_text column. This example requires that you have enabled terms on the message_text field in the demo.twitter.message table. See the example in the gptext.enable_terms() reference.

    =# SELECT t.id,
    gptext.highlight(t.message_text, 'message_text', s.hs) message
    FROM twitter.message t,
         gptext.search(TABLE(SELECT 1 SCATTER BY 1),
            'demo.twitter.message','{!gptextqp}iphone', null,
             'rows=5&hl=true&hl.fl=message_text' ) s
    WHERE t.id = s.id::int8;
    id    |                                                       message
    ----------+---------------------------------------------------------------------------
    ------------------------------------------
    19714120 | @ayee_Eddy2011 I love pancakes too!  #<em>iPhone</em> #app
    19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
    19416451 | I'm in love with my new <em>iPhone</em>(:
    20257190 | Love my #<em>iphone</em> - only problem now?  I want an #Ipad!
    20759274 | Dropped frutopia on                                                       
    
          : My phone... #ciaowaterdamage I hate <em>iPhones</em>.
    20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #ga
    dgets #<em>iphone</em> #apple #technology
    19424811 | I hate                                                                    
    
          : <em>iPhones</em>:
    20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
    20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
    20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
    (10 rows)
    

gptext.search_count()

Returns the number of documents that match the search query.

Syntax

gptext.search_count(<index_name>, <search_query>, <filter_queries>, <options>)

Parameters

<index_name>
The name of the index.
<search_query>
The search query.
<filter_queries>
A comma-delimited array of filter queries, if any. If none, set this parameter to null.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

bigint

Privileges required

You must have the SELECT privilege to execute this function.

Example

=# SELECT * FROM gptext.search_count('demo.wikipedia.articles', 'bubble', null);
 count 
-------
     3
(1 row)

gptext.search_external()

Searches a GPText external index.

Syntax

gptext.search_external(<table-exp>, <index_name>, <search_query>, 
      <filter_queries>[, <options>])

Parameters

<table>
A table-valued expression. Because external indexes are not associated with a database table, this parameter is ignored. An expression like the following is sufficient:

TABLE(SELECT 1 SCATTER BY 1)

<index_name>
The name of the GPText external index to search.
<search_query>
Text value containing a Solr text search query.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.search_external_result

This type has the following columns:

Column Type
id text
title text
subject text
description text
comments text
author text
keywords text
category text
resourcename text
url text
content_type text
last_modified text
links text
sha256 text
content text
score double precision
meta text

The last column, meta is present only if the optional <options> argument is included in the search.

Remarks

When you add an external document to the index, Apache Tika extracts a core set of metadata from the document, the columns listed in the Return type section. If any of these core metadata values are not present or do not exist in the document type, the value of the column in the result row is null.

If the <options> argument is supplied, the results contain an additional text column named meta. The meta column contains additional document-type-specific metadata. You can use the gptext.gptext_retrieve_field() function and its variants to extract individual metadata values by name from the meta column.

If the <options> argument contains the fl=<list> Solr option, Solr returns values only for the columns included in <list> and the id, score, and meta columns. Other columns in the result set will have null values. It is more efficient to filter out columns in Solr than to retrieve all columns from Solr and then choose a subset of columns in the SQL SELECT statement.

Examples

  • Finds HTML documents containing the term “facet”.

    =# \x on
    =# SELECT id, title 
    FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 
    'facet', '{content_type:*html*}');
    -[ RECORD 1 ]--------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
    title | GPText Function Reference |
          |     Pivotal GPText Docs
    -[ RECORD 2 ]--------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/queries.html
    title | Querying GPText Indexes |
          |     Pivotal GPText Docs
    -[ RECORD 3 ]--------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/guc_ref.html
    title | GPText Configuration Parameters |
          |     Pivotal GPText Docs
    

    Although just two columns are in this result set, the data Greenplum Database receives from Solr includes all core metadata fields, including the content field, which contains the full text of the document. The next example shows how to limit the data transferred from Solr to Greenplum Database with the Solr options argument.

  • This example lists the id, title, sha256, and score columns from the core metadata and extracts meta_creation_date from the additional metadata supplied for PDF documents in the gptext-docs external index. The fl=title,sha256 Solr option prevents Solr from transferring unneeded fields from the index to Greenplum Database. (The id and score columns are always transferred.)

    =# \x on
    =# SELECT id, title, sha256, score,
          gptext.gptext_retrieve_field(meta, 'meta_creation_date') created
    FROM gptext.search_external(TABLE(SELECT 1 SCATTER by 1), 'gptext-docs', '*:*',
          '{content_type:*pdf*}', 'fl=title,sha256');
    -[ RECORD 1 ]-------------------------------------------------------------
    id      | http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf
    title   | Pivotal GPText 2.1.3 Documentation | Pivotal GPText Docs
    sha256  | 2E063DF5037B9ACC6E180681AE6838077BC5F7A362B4A1E67D9D8FF3E4DD7F3D
    score   | 1
    created | 2017-09-22T17:22:54Z,2017-09-22T17:22:54Z
    -[ RECORD 2 ]-------------------------------------------------------------
    id      | http://gpdb.docs.pivotal.io/latest/pdf/GPDB510Docs.pdf
    title   | Version 5.1.0
    sha256  | AF0B71D032C99A6BE817E1FA2FB774EB7B4D47D75A755ABF54F4F60FEBB92FF7
    score   | 1
    created | 2017-10-31T19:12:41Z,2017-10-31T19:12:41Z
    

gptext.gptext_retrieve_field()

Retrieves a single field from the rf or meta search result column as a text value.

Syntax

gptext.gptext_retrieve_field(rf | meta, <field_name>)

Parameters

rf | meta
The name of the column in which GPText returns fields. This is rf for search results from regular GPText indexes and meta for search results from GPText external indexes.
<field_name>
The name of the field to retrieve.

Remarks

The fl=<field_list> Solr search option is added to the <options> parameter of the gptext.search() function to request additional stored fields. The additional fields are returned in the results in a column named rf (meta for external indexes). This column value has a format like the following:

column_value { name: "_version" value: "1544714234398507008" } column_value { name: "revision" value: "9.70" } column_value { name: "author" value: "jdough" }

The gptext.gptext_retrieve_field() function extracts the value for a single specified field and returns it as a text value. If there is no field with the specified name in the rf column, it returns NULL.

Storing additional fields in an index requires editing managed-schema to specify the fields that should be stored. See Storing Additional Fields in an Index for instructions.

gptext.gptext_retrieve_field_int()

Retrieves a single field from the rf or meta search result column as an integer value.

Syntax

gptext.gptext_retrieve_field_int(rf | meta, <field_name>)

Parameters

rf | meta
The name of the column containing fields to be retrieved. For regular GPText indexes, it is rf. For GPText external indexes it is meta.
<field_name>
The name of the integer field to retrieve.

Remarks

The gptext.gptext_retrieve_field_int() function is the same as the gptext.gptext_retrieve_field() function, except that the extracted field value is converted to an integer value.

gptext.gptext_retrieve_field_float()

Retrieves a single field from the search result column as a float value.

Syntax

gptext.gptext_retrieve_field_float(rf | meta, <field_name>)

Parameters

rf | meta
The name of the column containing fields to be retrieved. For regular GPText indexes, it is rf. For GPText external indexes it is meta.
<field_name>
The name of the float field to retrieve.

Remarks

The gptext.gptext_retrieve_field_float() function is the same as the gptext.gptext_retrieve_field() function, except that the extracted field value is converted to a float value.

gptext.highlight()

Highlights terms by inserting markup tags into data.

Syntax

gptext.highlight(<column_data>, <column_name>, <offsets>)

Parameters

<column_data>
The text data from the table which is to be tagged with highlighting tags.
<column_name>
The name of the corresponding column from the table.
<offsets>
A gptext hstore value that contains key-value pairs that indicate the locations of the text to highlight within the column data. See Remarks for information about the gptext hstore data type.

Prequisite

  • To use highlighting, term vectors must be enabled before creating the index. To enable term vectors, call gptext.enable_terms() for each field where you want to enable terms, then index or re-index with gptext.index().

Remarks

  • The offsets parameter is a gptext hstore, where each key is a column name and the value is a comma-separated list of offsets into the column data. This hstore is constructed by gptext.search() with highlighting enabled in the offsets parameter.

    Following is an example of the offsets hstore content:

    "field1"=>"0:5,9:14","field2"=>"13:20"
    

    gptext.highlight() will insert two sets of tags into the field1 data and one set into the field2 data at the indicated offsets.

  • The gptext hstore type is a limited form of the Postgres hstore type. It has only the hstore input and output functions implemented, as gptext_hstore_in and gptext_hstore_out.

  • The highlight tags are defined by the gptext.hl_pre_tag and gptext.hl_post_tag server configuration parameters. Their default values are <em> and </em>, respectively.

Example

gptext.highlight_external()

Highlights terms in search results from external indexes by inserting markup tags.

Syntax

gptext.highlight_external(<table_exp>, <index>, <search_query>, <filter_queries>[, <options>])

Parameters

<table_exp>
A table expression, ignored for external indexes. An expression such as TABLE(SELECT 1 SCATTER BY 1) is sufficient.
<index>
Name of the index containing data to highlight.
<search_query>
Text value containing a Solr text search query.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null.
<options
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Remarks

The gptext.highlight_external() function searches a GPText external index and encloses the search terms in markup tags in the returned results.

Example

Search for and highlight the terms “zookeeper” and “solr” in HTML documents.

=# SELECT id, content FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1),
      'gptext-docs','{!gptextqp}zookeeper AND solr', '{content_type:*html*}',
      'rows=2');
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/administering.html
content |  includes security considerations, monitoring <em>Solr</em> index statistics, managing and monitoring <em>ZooKeeper</em>
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/performance.html
content |  problems can result from resource contention between the Greenplum Database, <em>Solr</em>, and <em>ZooKeeper</em> clusters
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ha.html
content | /topics/utility_ref.html GPText Management Utilities
        |           rect /210/topics/type_ref.html GPText and <em>Solr</em>
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/indexes.html
content | /topics/utility_ref.html GPText Management Utilities
        |           rect /210/topics/type_ref.html GPText and <em>Solr</em>

Faceting breaks up a search result into multiple categories, showing counts for each.

gptext.faceted_field_search()

The faceted_field_search() function breaks search results into field name categories.

Syntax

gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, <facet_fields>, 
    <facet_limit>, <minimum>, <options>)

Parameters

<index_name>
The name of the index.
<query>
Query statement. Use *:* to query for all results.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null.
<facet_fields>
An array of field names to facet. Use Greenplum Database array notation.
<facet_limit>
Maximum number of results to be returned for each aggregation (facet).
<minimum>
Minimum number of results required before an aggregation (facet) will be returned. Enter 0 to return all facets.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_field_result

This is a composite type with the following columns:

Column Type
field_name text
field_value text
value_count bigint

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

None.

Examples

  • Facet all tweets on spam and truncated fields.

    =# SELECT * 
       FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, 
          '{spam, truncated}', 2, 0);
     field_name | field_value | value_count 
    ------------+-------------+-------------
     spam       | true        |        1730
     truncated  | false       |        1705
     truncated  | true        |          25
    (3 rows)
    
  • Facet on author_id, no limit, with a minimum of five tweets, all tweets. Selects five authors with at least two tweets.

    =# SELECT * 
       FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, 
          '{author_id}', 5, 2);
     field_name | field_value | value_count 
    ------------+-------------+-------------
     author_id  | 102185050   |           9
     author_id  | 202305785   |           2
     author_id  | 64111799    |           2
     author_id  | 45326213    |           2
     author_id  | 195035308   |           2
    (5 rows)
    

gptext.faceted_query_search()

The faceted_query_search() function breaks search results into categories defined by queries that you provide.

Syntax

gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, 
      <facet_queries>, <options>)

Parameters

<index_name>
The name of the index.
<query>
Query statement. Use *:* to query for all results.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null.
<facet_queries>
Type: text[]. Required. An array of facet queries.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_query_result

This is a composite type with the following columns:

Column Type
query_name text
value_count bigint

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

None.

Example

This example uses Solr queries to divide twitter authors into three classes based on number of followers.

=# SELECT * 
FROM gptext.faceted_query_search('demo.twitter.message', '*:*', null,
    '{author_followers_count:[0 TO  5], author_followers_count:[6 TO 10], 
          author_followers_count:[11 TO *]}');
            query_name            | value_count 
----------------------------------+-------------
 author_followers_count:[0 TO  5] |          39
 author_followers_count:[11 TO *] |        1632
 author_followers_count:[6 TO 10] |          36
(3 rows)

gptext.faceted_range_search()

The faceted_range_search() function breaks search results into range categories over a numeric or date field, with ranges defined by the <range_start>, <range_end>, and <range_gap> arguments.

Syntax

gptext.faceted_range_search(<index_name>, <query>, <filter_queries>, 
  <field_name>, <range_start>, <range_end>, <range_gap>, <options>)

Parameters

<index_name>
The name of the index.
<query>
Query statement. Use *:* to query for all results.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null.
<field_name>
The name of the field on which to facet.
<range_start>
Beginning of the range.
<range_end>
End of the range.
<range_gap>
Size of range increment, a text value.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_range_result

This is a composite type with the following columns:

Column Type
field_name text
range_value text
value_count bigint

Privileges required

You must have the SELECT privilege to execute this function.

Example

Facet on date range from midnight August 1, 2011 to midnight November 1, 2011, with a 7-day gap.

=# SELECT * 
FROM gptext.faceted_range_search('demo.twitter.message', '*:*', null, 
       'created_at', '2011-08-01T00:00:00Z', 
       '2011-11-01T00:00:00Z', '+7DAY');
 field_name |     range_value      | value_count 
------------+----------------------+-------------
 created_at | 2011-08-01T00:00:00Z |           0
 created_at | 2011-08-08T00:00:00Z |           0
 created_at | 2011-08-15T00:00:00Z |           0
 created_at | 2011-08-22T00:00:00Z |          52
 created_at | 2011-08-29T00:00:00Z |         189
 created_at | 2011-09-05T00:00:00Z |         545
 created_at | 2011-09-12T00:00:00Z |           0
 created_at | 2011-09-19T00:00:00Z |         109
 created_at | 2011-09-26T00:00:00Z |          69
 created_at | 2011-10-03T00:00:00Z |          59
 created_at | 2011-10-10T00:00:00Z |         206
 created_at | 2011-10-17T00:00:00Z |         147
 created_at | 2011-10-24T00:00:00Z |         112
 created_at | 2011-10-31T00:00:00Z |          94
(14 rows)

Working with Terms

gptext.terms()

Gets the term vectors for the indexed documents in a Solr index for the specified field. You can use gptext.terms() to create tables.

Syntax

gptext.terms(<src_table>, <index_name>, <field_name>, 
            <search_query>, <filter_queries>[, <options>])

Parameters

<src_table>
An anytable value that specifies a SELECT statement on an existing, indexed table on which to perform the search. Specify in the format:

TABLE(SELECT * FROM <src_table>;)

<index_name>
The name of the index to query for terms.
<field_name>
The name of the field to query for terms.
<search_query>
A query that the field must match.
<filter_queries>
A comma-delimited array of filter queries, if any. If none, set this parameter to null.
<options>
An optional, comma-delimited list of Solr query parameters.

Return type

SETOF gptext.term_info

This is a composite type with the following columns:

Column Type
id text
term text
positions integer[]

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

  • To enable using gptext.terms(), execute the GPText function gptext.enable_terms(), then reindex with gptext.index().
  • The TABLE query is planned and affects the estimate of of gptext.terms(), but does not execute. For example, if your query includes:

    gptext.terms(TABLE(SELECT * FROM t), ...)
    

    The query planner estimates the number of results as the number of rows in t. This can cause the query planner to ignore the use of an index scan. Use a query like TABLE(SELECT 1 SCATTER BY 1) to avoid this issue.

Examples

This example creates a terms table.

=# CREATE TABLE twitter.terms AS 
    SELECT * FROM gptext.terms(TABLE(SELECT * FROM twitter.message SCATTER BY 1),
     'demo.twitter.message', 'message_text', 'iphone', null)
   DISTRIBUTED BY (id);
SELECT 5385

Configuration and Monitoring

Index configuration and monitoring functions enable managing indexes, tracking index statistics, checking status of index segments, and ensuring that index contents are current.

gptext.cluster_status()

Shows Solr cluster status.

Syntax

gptext.cluster_status()

Return Type

SETOF gptext.cluster_status_result

This is a composite type with the following columns:

Column Type
index_name text
max_shards_per_node integer
router text
replication_factor integer
auto_add_replicas boolean
znode_version integer
config_name text
partitioned boolean

Example

=# SELECT * FROM gptext.cluster_status();
index_name        | max_shards_per_node |  router  | replication_factor | auto_add_
replicas | znode_version |       config_name       | partitioned
-------------------------+---------------------+----------+--------------------+----------
---------+---------------+-------------------------+-------------
demo.twitter.message    |                   4 | implicit |                  2 | f
  |             8 | demo.twitter.message    | t
demo.wikipedia.articles |                   4 | implicit |                  2 | f
  |             8 | demo.wikipedia.articles | f
(2 rows)

gptext.index_size()

Shows the number of documents indexed and total disk space used for GPText indexes.

Syntax

gptext.index_size([<index_name>])

Parameters

<index_name>
The name of the index. Optional. Returns sizes for all indexes if no index is specified.

Return Types

SETOF gptext.index_size_result

This is a composite type with the following columns:

Column Type
index_name text
num_docs integer
size_in_bytes bigint

Examples

=# SELECT * FROM gptext.index_size();
       index_name        | num_docs | size_in_bytes
-------------------------+----------+---------------
 demo.wikipedia.articles |       23 |        500515
 demo.twitter.message    |     1730 |        767118
 gptext-docs             |       16 |        618231
(3 rows)
=# SELECT * FROM gptext.index_size('demo.wikipedia.articles');
       index_name        | num_docs | size_in_bytes
-------------------------+----------+---------------
 demo.wikipedia.articles |       23 |        500515
(1 row)

gptext.index_status()

Shows status of replicas for a specified index or for all indexes.

Syntax

gptext.index_status([<index_name>])

Parameters

<index_name>
The name of the index. Optional. Returns status for all indexes if no index is specified.

Return Type

SETOF gptext.index_status_result

This is a composite type with the following columns:

Column Type
content_id smallint
index_name text
shard_name text
shard_state text
replica_name text
replica_state text
core text
node_name text
base_url text
is_leader boolean
partitioned boolean
external_index boolean

Examples

  1. Show status for a single index.

    =# SELECT * FROM gptext.index_status('demo.wikipedia.articles');
        content_id |       index_name        | shard_name | shard_state | replica_name | replica_state |
       core                   |     node_name     |         base_url         | is_leader | partitioned | external_
    index
    ------------+-------------------------+------------+-------------+--------------+---------------+--------------
    ---------------------------+-------------------+--------------------------+-----------+-------------+----------
    ------
             0 | demo.wikipedia.articles | shard0     | active      | core_node1   | active        | demo.wikipedi
    a.articles_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | f           | f
             0 | demo.wikipedia.articles | shard0     | active      | core_node4   | active        | demo.wikipedi
    a.articles_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | f           | f
             1 | demo.wikipedia.articles | shard1     | active      | core_node2   | active        | demo.wikipedi
    a.articles_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | f           | f
             1 | demo.wikipedia.articles | shard1     | active      | core_node3   | active        | demo.wikipedi
    a.articles_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | f           | f
    (4 rows)
    
  2. Show status for all GPText indexes.

    =# SELECT * FROM gptext.index_status();
     content_id |       index_name        | shard_name | shard_state | replica_name | replica_state |
        core                   |     node_name     |         base_url         | is_leader | partitioned | external_
    index
    ------------+-------------------------+------------+-------------+--------------+---------------+--------------
    ---------------------------+-------------------+--------------------------+-----------+-------------+----------
    ------
              0 | demo.store.products     | shard0     | active      | core_node1   | active        | demo.store.pr
    oducts_shard0_replica1     | gpdb51:18984_solr | http://gpdb51:18984/solr | t         | f           | f
              0 | demo.store.products     | shard0     | active      | core_node2   | active        | demo.store.pr
    oducts_shard0_replica2     | gpdb51:18983_solr | http://gpdb51:18983/solr | f         | f           | f
              1 | demo.store.products     | shard1     | active      | core_node3   | active        | demo.store.pr
    oducts_shard1_replica2     | gpdb51:18983_solr | http://gpdb51:18983/solr | f         | f           | f
              1 | demo.store.products     | shard1     | active      | core_node4   | active        | demo.store.pr
    oducts_shard1_replica1     | gpdb51:18984_solr | http://gpdb51:18984/solr | t         | f           | f
              0 | demo.twitter.message    | shard0     | active      | core_node2   | active        | demo.twitter.
    message_shard0_replica1    | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | t           | f
              0 | demo.twitter.message    | shard0     | active      | core_node3   | active        | demo.twitter.
    message_shard0_replica2    | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | t           | f
              1 | demo.twitter.message    | shard1     | active      | core_node1   | active        | demo.twitter.
    message_shard1_replica1    | gpdb51:18984_solr | http://gpdb51:18984/solr | t         | t           | f
              1 | demo.twitter.message    | shard1     | active      | core_node4   | active        | demo.twitter.
    message_shard1_replica2    | gpdb51:18983_solr | http://gpdb51:18983/solr | f         | t           | f
              0 | demo.wikipedia.articles | shard0     | active      | core_node1   | active        | demo.wikipedi
    a.articles_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | f           | f
              0 | demo.wikipedia.articles | shard0     | active      | core_node4   | active        | demo.wikipedi
    a.articles_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | f           | f
              1 | demo.wikipedia.articles | shard1     | active      | core_node2   | active        | demo.wikipedi
    a.articles_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | f           | f
              1 | demo.wikipedia.articles | shard1     | active      | core_node3   | active        | demo.wikipedi
    a.articles_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | f           | f
              1 | gptext-docs             | shard1     | active      | core_node2   | active        | gptext-docs_s
    hard1_replica1             | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | f           | t
              1 | gptext-docs             | shard1     | active      | core_node3   | active        | gptext-docs_s
    hard1_replica2             | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | f           | t
              2 | gptext-docs             | shard2     | active      | core_node1   | active        | gptext-docs_s
    hard2_replica1             | gpdb51:18983_solr | http://gpdb51:18983/solr | t         | f           | t
              2 | gptext-docs             | shard2     | active      | core_node4   | active        | gptext-docs_s
    hard2_replica2             | gpdb51:18984_solr | http://gpdb51:18984/solr | f         | f           | t
    (16 rows)
    

gptext.live_nodes()

Lists active Solr nodes and their up or down state.

Syntax

gptext.live_nodes()

Return Type

SETOF gptext.live_nodes_result

This is a composite type with the following columns:

Column Type
host text
port bigint
data_dir text
status text

Example

=# SELECT * FROM gptext.live_nodes();
host  | port  |      data_dir       | status
--------+-------+---------------------+--------
gpdb51 | 18983 | /data/gpdata1/solr0 | u
gpdb51 | 18984 | /data/gpdata2/solr0 | u
(2 rows)

Remarks

The status column can be u (up) or d (down).

gptext.partition_status()

Lists indexes on partitioned tables or child partition names in the current Greenplum database.

Syntax

gptext.partition_status([<index_name>])

Parameters

<index_name>
Optional. Returns partition status for all indexes if no index is specified.

Return Type

SETOF gptext.partition_status_result

This is a composite type with the following columns:

Column Type
partition_name text
inherits_name text
level integer
cons text

Example

List partition status for an index.

=# SELECT partition_name, inherits_name, level 
   FROM gptext.partition_status('demo.twitter.message');
        partition_name        |    inherits_name     | level
------------------------------+----------------------+-------
 demo.twitter.message_1_prt_1 | demo.twitter.message |     1
 demo.twitter.message_1_prt_2 | demo.twitter.message |     1
 demo.twitter.message_1_prt_3 | demo.twitter.message |     1
 demo.twitter.message_1_prt_4 | demo.twitter.message |     1
(4 rows)

Remarks

The gptext.partition_status() function can only list the index partitions for tables in the current Greenplum database.

gptext.reload_index()

Reloads Solr configuration files if they have been modified.

Syntax

gptext.reload_index(<index_name>)

Parameters

<index_name>
Optional. The name of the index for which to reload the configuration files.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

None.

Example

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

gptext.version()

Returns the version of your GPText installation.

Syntax

SELECT * FROM gptext.version()

Parameters

None.

Return type

text

Privileges required

You do not need any privileges to execute this function.

Example

=# SELECT * FROM gptext.version();
version
--------------------------------
Greenplum Text Analytics 2.1.3
(1 row)

gptext.zookeeper_hosts()

Returns a list of ZooKeeper hosts and ports.

Syntax

gptext.zookeeper_hosts()

Return type

text

Remarks

This function returns a comma-separated list of ZooKeeper nodes in the the format <host-name>:<port>.

Example

=# SELECT * FROM gptext.zookeeper_hosts()
host  | port
--------+------
gpdb51 | 2188
gpdb51 | 2189
gpdb51 | 2190
(3 rows)

High Availability

gptext.add_replica()

Adds a replica of an index shard.

Syntax

gptext.add_replica(<index_name>, <shard_name>[, <node_name>])

Parameters

<index_name>
Name of the index. If the index is for a partitioned database table, this must be the name of the root table.
<shard_name>
Name of the shard to replicate.
<node_name>
Name of the node where the replica is to be added. Optional. If omitted, SolrCloud chooses the node.

Return type

boolean

Remarks

This function is used by the GPText management utility gptext-replica add.

The value of the gptext.replication_factor configuration parameter when an index is created determines how many replicas are created for each shard. In a Greenplum system, there are the same number of shards as there are Greenplum segments. The number of replicas created for a new index is the number of segments times the value of the gptext.replication_factor configuration parameter, 2 by default. The replicas are distributed evenly among the live GPText nodes.

Replicas consume space on the host where they are created, so they are usually only created to replace a replica that has failed or become unavailable or to relocate a replica to another GPText instance. When adding replicas, you should maintain equal distribution of replicas among the GPText nodes and avoid placing multiple replicas for the same shard on the same host.

The total number of replicas for an index that can be placed on each GPText node is set when the index is created. (In Solr, this is the MaxShardsPerNode parameter.) GPText sets this limit by calculating the number of replicas to create per node and adding an additional factor, specified in the gptext.extension_factor server configuration parameter. This parameter can be set between 0 and 10; the default value is 2. Since the limit is set when the index is created, it is recommended to set the gptext.extension_factor parameter to a higher number to allow new replicas to be created when necessary.

Example

=# SELECT * FROM gptext.add_replica('demo.wikipedia.articles', 'shard1');
 success |                core_name
---------+-----------------------------------------
 t       | demo.wikipedia.articles_shard1_replica3
(1 row)

gptext.delete_replica()

Deletes a named replica from the specified index and shard.

Syntax

gptext.delete_replica(<index_name>, <shard_name>, <replica_name>[, <only_if_down>])

Parameters

<index_name>
Name of the index.
<shard_name>
Name of the shard that contains the replica to delete.
<replica_name>
Name of the replica to remove.
<only_if_down>
Optional. When true, no action is taken if the replica is active. Default is false.

Return type

boolean

Remarks

Use the gptext.index_status() function to find the name of the replica to drop. Names are in the format core_nodeX, where X is a number.

This function is called from the gptext-replica drop management utility.

Examples

  1. Delete the core_node5 replica if it is down.

    =# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5', true);
    ERROR:  Delete replica failed: Attempted to remove replica : demo.wikipedia.articles/shard1/core_node5 with onlyIfDown='true', but state is 'active'.
    
  2. Delete the core_node5 replica even if it is active.

    =# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5');
     success 
    ---------
     t
    (1 row)
    

General Purpose Functions

gptext.count_t()

Counts the number of records in a table.

Syntax

gptext.count_t(<table_name>)

Parameters

<table_name>
Name of the table for which to count records.

Return type

integer

Privileges required

You need SELECT privileges on <table_name> to execute this function.

Example

=# SELECT * FROM gptext.count_t('demo.wikipedia.articles');
count_t
---------
     23
(1 row)