GPText Function Reference
The following functions are available in Pivotal GPText.
Indexing
gptext.create_index() – creates an empty index.
gptext.create_index_external() - creates an index for external documents.
gptext.index() – populates an index.
gptext.index_external() - adds documents to an external index.
gptext.index_external_dir - adds all documents in a directory in an external document source to a GPText external index.
gptext.commit_index() – finalizes index operations.
gptext.enable_terms() – enables term vectors and positions to allow extracting terms and their positions from text
fields.
gptext.recreate_error_table() – recreates the error table that records errors that occur while adding documents to an external index.
Authenticating with External Document Sources
gptext.external_login() – log in to an external document store that requires authentication.
gptext.external_logout() – log out of an external document store.
Modifying or Deleting an Index
gptext.add_field() – adds a field to an index.
gptext.delete() – deletes documents matching a search query.
gptext.drop_field() – deletes a field from an index.
gptext.drop_index() - deletes an index.
Search
gptext.search() – searches an index.
gptext.search_count() – returns number of documents that match search.
gptext.search_external() - searches a GPText external index.
gptext.gptext_retrieve_field – extracts a single field from the rf
search result column as text.
gptext.gptext_retrieve_field_int – extracts a single field from the rf
search result column and converts to an integer.
gptext.gptext_retrieve_field_float – extracts a single field from the rf
search result column and converts to a float.
gptext.highlight() – returns search result with search term highlighted.
gptext.highlight_external() – applys highlighing to search results from external indexes.
Faceted Search
gptext.faceted_field_search() – search, faceted by fields.
gptext.faceted_query_search() – search, faceted by queries.
gptext.faceted_range_search() – search, faceted by defined ranges.
Working With Terms
gptext.enable_terms() – enables term vectors and positions.
gptext.terms() – gets the term vectors for the indexed documents in a Solr index for the specified field.
Configuration and Monitoring
gptext.cluster_status() - shows status of indexes managed by the GPText cluster.
gptext.config_append() - appends the contents of a local file to a ZooKeeper configuration file for an index.
gptext.config_delete() - deletes an index configuration file from ZooKeeper.
gptext.config_get() - displays the contents of a ZooKeeper index configuration file.
gptext.config_list() - lists ZooKeeper configuration files and directories for an index.
gptext.config_upload() - uploads an index configuration file to ZooKeeper.
gptext.index_size() - shows the number of documents indexed and total disk space used for GPText indexes.
gptext.index_status() – shows status of replicas for an index or for all indexes.
gptext.live_nodes() – lists active Solr nodes.
gptext.partition_status() - lists partitioned indexes and child partitions.
gptext.reload_index() – reloads Solr configuration files.
gptext.version() – returns version of GPText installation.
gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.
High Availability
gptext.add_replica() – Adds a replica of an index shard.
gptext.delete_replica() – Deletes a replica of an index shard.
General Purpose Functions
gptext.count_t() – counts number of rows in a table.
Privileges
Your privileges to execute the GPText functions depend on your Greenplum Database privileges for the table from which the index is generated. For example, if you have SELECT privileges for a table in the Greenplum database, you have SELECT privileges for an index generated from that table.
Executing index functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Security section of the GPText User’s Guide for information about setting privileges.
The Privileges required section for each of the GPText functions specifies the privileges required to execute that function.
Usage
The gptext
functions in this section must be executed as SQL queries in the form:
SELECT * FROM gptext.function();
The examples in this document use a Greenplum database named demo
set up as follows:
- A table named
articles
in thewikipedia
schema. - A table named
message
in thetwitter
schema.
See Setting Up the Sample Database for details about these tables.
Indexing
Indexing functions create, set up, populate, and finalize (commit) Solr indexes.
gptext.create_index()
Creates an empty Solr index.
Syntax
gptext.create_index(<schema_name>, <table_name>, <id_col_name>,
<def_search_col_name> [, <if_check_id_uniqueness>])
or
gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>,
<id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])
Parameters
<schema_name>
<table_name>
<p_columns>
<p_columns>
and <p_types>
are omitted, all table columns are indexed.
The columns must be valid columns in the table. The columns identified by the <id_col_name>
and <def_search_col_name>
must be included in the array.
If the <p_columns>
parameter is supplied, the <p_types>
parameter must also be supplied. The sizes of the <p_columns>
and <p_types>
arrays must be the same.
<p_types>
<p_columns>
array.
Text types can be mapped to the name of an analyzer chain, for example <text_intl>
, <text_sm>
, or any type defined in the <managed_schema>
. See Map Greenplum Database Data Types to Solr Data Types for equivalent Solr data types for other Greenplum types.
The <p_types>
parameter must be supplied if the <p_columns>
parameter is supplied.
<id_col_name>
<table_name>
that is unique for each row. The column must be of type int4
, int8
, varchar
, text
, or uuid
.
<def_search_col_name>
<table_name>
, if no other column is named in a query.
<if_check_id_uniqueness>
Return type
boolean
Privileges required
Only the OWNER can execute this function.
Remarks
A GPText index is a Solr collection.
The contents of the <id_col_name>
column should, in most cases, be a unique ID for each row. It must be of type int4
, int8
, varchar
, or text
.
If the <if_check_id_uniqueness>
argument is true, the default, a document with an ID matching an existing ID cannot be added to the index.
If the <if_check_id_uniquess>
argument is false, documents with duplicate IDs are allowed to be added to the index. The content of other fields may or may not be the same as existing documents with the same ID. When a query returns multiple documents with the same ID, it is the user’s responsibility to anticipate and handle the multiple documents. For example, a table could have a revision
column that is incremented when a new version of a document is added to the index, allowing queries that omit all but the most recent version from search results.
The name of the index created has the format:
<database_name>.<schema_name>.<table_name>
When the table is partitioned, the GPText index created for the table will contain records for all partitions. If you specify the name of a subpartition table in this function an error is returned. The index records for documents added to the index have a __partition
field containing the name of the child partition table. See Searching Partitioned Tables for syntax to search by partitions.
Populate the new index with gptext.index()
.
The number of replicas for each shard is determined when the index is created. It is the value of the gptext.replication_factor
server configuration parameter, 2 by default.
If the gptext.failover_factor
server configuration parameter is set, gptext.create_index()
fails if the ratio of the number of GPText nodes that are up to the total number of GPText nodes is less than the gptext.failover_factor
value (from 0.0 to 1.0). Index shards can only be created on active GPText nodes, so the gptext.failover_factor
parameter prevents overloading the active GPText nodes when too many nodes are down.
To index a partitioned table, specify the name of the root table. The gptext.index()
function returns an error if you specify the name of a child partition table.
Examples
Create an index,
demo.wikipedia.articles
, withcontent
as the default search field.=# SELECT * FROM gptext.create_index('wikipedia', 'articles', 'id', 'content');
Create an index,
demo.wikipedia.articles
, withcontent
as the default search field. Index theid
,content
, andtitle
fields.=# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{"id", "content", "title"}', '{"long", "text", "text"}', 'id', 'content');
gptext.create_index_external()
Creates an empty Solr index for external documents.
Syntax
gptext.create_index_external(<index_name>)
Parameters
<index_name>
.
).
Notes
A GPText external index is a Solr index for documents external to Greenplum Database, for example, PDF, Microsoft Word, XML, and HTML files. Unlike regular GPText indexes, external indexes are not associated with a Greenplum Database table, but they can be searched with GPText search functions.
Example
=# SELECT * FROM gptext.create_index_external('gptext-docs');
gptext.enable_terms()
Enables term vectors and positions to allow extracting terms and their positions from fields of data type text
.
Syntax
gptext.enable_terms(<index_name>, <field_name>)
Parameters
<index_name>
<field_name>
Return type
boolean
Privileges required
Only the OWNER can execute this function.
Remarks
Solr can mark terms and their positions in documents when indexing. This capability is disabled by default. Use gptext.enable_terms()
to enable the capability.
Call gptext.enable_terms()
for each field where you want to enable terms.
After calling this function, you must index or re-index with gptext.index().
Examples
=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
WARNING: Enable terms for field: message_text of index: demo.twitter.message successfully. Reindex data needed.
enable_terms
--------------
t
(1 row)
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
dbid | num_docs
------+----------
3 | 947
2 | 1020
(2 rows)
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
commit_index
--------------
t
(1 row)
gptext.index()
Populates an index by indexing data in a table.
Syntax
gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>)
Parameters
TABLE(SELECT * FROM <table_name>)
anytable
. <index_name>
gptext.create_index()
and is to be populated.
Return type
SETOF dbid INT, num_docs BIGINT
where dbid
is the dbid
of the segment that the documents were sent to, and num_docs
is the number of documents that were indexed.
Privileges required
You must have the INSERT or UPDATE privilege to execute this function.
Remarks
<index_name>
must have been created with gptext.create_index()
.
The arguments to the gptext.index()
function must be expressions. For example, TABLE(SELECT * FROM articles)
creates a “table-valued expression” from the articles table, using the table function TABLE
.
You can selectively index/update by changing the inner select list in the query.
After successfully indexing, you must commit the index with gptext.commit_index(<index_name>)
.
The output includes a two-column table with dbid
(the Greenplum segment ID) and num_docs
(the number of documents added to the index for that segment) as the columns.
The first parameter of gptext.index()
is TABLE(SELECT * FROM <table_name>)
. The query in this parameter should have the same distribution policy as the table you are indexing. There are two cases where the query will not have the same distribution policy:
Your query is a join of two tables
You are indexing an intermediate table (staging table) that is distributed differently than the final table.
When the distribution policies differ, you must specify "SCATTER BY"
for the query like this:
TABLE(SELECT * FROM messages SCATTER BY <distrib_id>)
,
where <distrib_id>
is the distribution id used when you created your primary/final table.
Example
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
dbid | num_docs
------+----------
3 | 6
2 | 5
(2 rows)
gptext.index_external()
Adds documents stored outside of Greenplum Database to a GPText external index.
Syntax
gptext.index_external(<url-list>, <index-name>)
Parameters
<url-list>
<index-name>
Remarks
If the document cannot be retrieved at the given URL, or an error occurs while indexing the document, GPText inserts a row in the gptext.error_table
table. You can use gptext.recreate_error_table()
to create an empty error table before you call gptext.index()
.
The value of the GPText custom server parameter gptext.idx_segment_error_limit
(default 10) is the number of errors that can occur on any one segment before the indexing operation is canceled.
When adding a document to an external index, GPText calculates a 256-bit hash on the contents of the document. The hash is stored as a 64-byte hexadecimal value in the sha256
field. If you later add a document with a URL matching an existing document in the index, the new document is only added to the index if the newly calculated hash differs from the current value in the sha256
field.
Examples
This example adds a single PDF document to the index gptext-docs
.
=# SELECT * FROM gptext.index_external(
'{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}',
'gptext-docs');
dbid | num_docs
------+----------
3 | 0
2 | 1
(2 rows)
This example adds multiple HTML documents to the gptext-docs
external index by selecting URLs from a database table. Errors will be logged in the gptext.gptext_errrors
table.
=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
id bigint,
url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES
(1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'),
(2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'),
(3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
(4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
(5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
(6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
(7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
(8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'),
(9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
(10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
(11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
(12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
(13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
(14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
(15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
TABLE(SELECT url FROM gptext_html_docs),
'gptext-docs', 'gptext_docs_errors');
dbid | num_docs
------+----------
3 | 6
2 | 8
(2 rows)
=# SELECT * FROM gptext.gptext_docs_errors;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------
error_time | 2000-01-01 00:25:11.282769
index_name | gptext-docs
sqlcmd |
errmsg | Code: RUNTIME_ERROR, Message: 'http://gptext.docs.pivotal.io/210/topics/ext-indexes.html. '
rawdata | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
rawbytes |
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)
gptext.index_external_dir()
Adds all documents in a directory in an external document source to a GPText external index.
Syntax
gptext.index_external_dir(<directory_url>, <index_name>)
Parameters
<directory_url>
<index_name>
Remarks
The gptext.index_external_dir()
function adds the documents in a directory and its subdirectories to a GPText external index.
Log in to the document source using the gptext.external_login()
function before you call the gptext.index_external_dir()
function.
If you specify a file instead of a directory, an error is added to the gptext.error_table
table.
The ID for each file added to the index is the URL for the file in the external document source.
The Apache Tika library discovers the content_type
for each file.
The user who logs in to the external document source must have read permissions on the directory. The gptext.index_external_dir()
function adds to the index only those documents and subdirectories that the user has permission to read.
Example
This example adds documents from an hdfs store to the GPText webdocs
external index.
#= SELECT * FROM gptext.external_login('hdfs', 'hdfs://myhadoop:9000', 'myhadoop');
external_login
----------------
t
(1 row)
=# SELECT * FROM gptext.index_external_dir('hdfs://myhadoop:9000/gptext_web/public/230/', 'webdocs');
num_docs
----------
37
(1 row)
=# SELECT * FROM gptext.commit_index('webdocs');
commit_index
--------------
t
(1 row)
=# SELECT * FROM gptext.external_logout('hdfs');
external_logout
-----------------
t
(1 row)
gptext.commit_index()
Finishes an index operation. The results of an indexing operation are not available until this function is called for the index.
Syntax
gptext.commit_index(<index_name>)
Parameters
<index_name>
Return type
boolean
Privileges required
You must have the INSERT, UPDATE, or DELETE privilege to execute this function.
Remarks
Must be called after gptext.index() and gptext.delete().
Example
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
gptext.recreate_error_table()
Drops and recreates the gptext.error_table
database table.
Syntax
gptext.recreate_error_table()
Return type
Boolean
Remarks
If an error occurs while accessing a document to add to a GPText external index, GPText adds a record to the gptext.error_table
table. See gptext.error_table
for a description of this table. Users should not drop or modify the table.
Rows added to the gptext.error_table
table remain until you use the gptext.recreate_error_table()
function to create a new empty table.
If you attempt to execute gptext.recreate_error_table()
while it is in use for an indexing operation, a warning is raised and the function returns false without recreating the table.
Examples
=# SELECT gptext.recreate_error_table();
recreate_error_table
----------------------
t
(1 row)
Authenticating with External Document Sources
To add documents from a document source that requires authentication, such as Hadoop hdfs, log in before adding documents to a GPText index and log out when done.
gptext.external_login()
Logs in to an external document source before adding documents from the source to a GPText external index.
Syntax
gptext.external_login(<type>, <url>, <config-name>)
Parameters
<type>
'hdfs'
. The type is not case-sensitive.
<url>
<config-name>
gptext-external upload
utility command.Remarks
You can log in to only one external documentation source at a time.
Use the gptext-external list
command to list the configurations that have been uploaded.
Example
Log in to an hdfs file system using the myhdfs
configuration.
=# SELECT * FROM gptext.external_login('HDFS', 'hdfs://198.51.100.23:19000', 'myhdfs');
gptext.external_logout()
Logs out of an external document source after adding documents from the source to a GPText external index.
Syntax
gptext.external_logout(<type>)
Parameters
<type>
'hdfs'
. The type is not case-sensitive.
Remarks
You can log in to only one external document source at a time. To index documents from another source, you must first call gptext.external_logout()
and then log in to the new source with gptext.external_login()
.
Example
Log out of an hdfs file system.
=# SELECT * FROM gptext.external_logout('HDFS');
Modifying or Deleting an Index
You can change an index by adding or dropping fields, reverting an index to its previous state, or deleting the index.
gptext.add_field()
Adds a field to your schema if the field was added to the database after the index was created.
Syntax
gptext.add_field(<index_name>, <field_name>[, <is_default_search_col> [, <if_enable_terms>]])
Parameters
<index_name>
<field_name>
<is_default_search_col>
<if_enable_terms>
Return type
SETOF boolean
Privileges required
Only the OWNER can execute this function.
Remarks
Call this function for each field you add.
Before and after you add one or more fields, reload the Solr configuration files with gptext.reload_index()
. The initial reload_index()
call is required because of Solr 4.0 behavior and may not be required in subsequent versions.
After you add one or more fields, you should reindex the table and commit the index with gptext.commit_index()
.
Example
Adds the field external_links
to the index, then recreates, repopulates, and commits the index.
=# ALTER TABLE wikipedia.articles ADD external_links text;
ALTER TABLE
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.add_field('demo.wikipedia.articles', 'external_links', false, false);
INFO: Add field: external_links for index: demo.wikipedia.articles
add_field
-----------
t
(1 row)
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
gptext.delete()
Deletes all documents that match the search query.
Syntax
gptext.delete(<index_name>, <query>)
Parameters
<index_name>
<query>
'*'
or'*:*'
.
Return type
boolean
Privileges required
You must have the DELETE privilege to execute this function.
Remarks
After a successful delete, commit the index using gptext.commit_index(<index_name>)
.
Examples
Delete all documents containing the word “unverified” in the default search field:
=# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'unverified'); delete -------- t (1 row) =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles'); commit_index -------------- t (1 row)
Delete all documents from the index:
=# SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*'); delete -------- t (1 row) =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles'); commit_index -------------- t (1 row)
gptext.drop_field()
Removes a field from your schema.
Syntax
gptext.drop_field(<index_name>, <field_name>)
Parameters
<index_name>
<field_name>
Return type
boolean
Privileges required
Only the OWNER can execute this function.
Remarks
Call this function for each field you drop.
Before and after dropping one or more fields, you must reload the Solr configuration files with gptext.reload_index()
, then commit the index with gptext.commit_index()
.
The column __partition
in indexes for partitioned database tables cannot be dropped.
The initial reload_index()
is required by Solr 4.0 behavior and may not be necessary in subsequent versions.
Example
Drops the field external_links
from the index.
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.drop_field('demo.wikipedia.articles', 'external_links');
INFO: Drop field: external_links for index: demo.wikipedia.articles
drop_field
------------
t
(1 row)
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
gptext.drop_index()
Removes an index.
Syntax
gptext.drop_index(<index_name>)
Parameters
<index_name>
Return type
boolean
Privileges required
Only the OWNER can execute this function.
Remarks
A dropped index cannot be recovered.
Example
=# SELECT * FROM gptext.drop_index('demo.wikipedia.articles');
drop_index
------------
t
(1 row)
Search
Search functions enable querying an index.
Changing the query parser at query time
When using the search functions, you can change the query parser used by Solr at query time. A different query parser may be required, depending on the nature of the query. See Using Advanced Querying Options for a list of the query parsers GPText supports.
To change the query parser at query time, use the defType
Solr option with the gptext.search()
function.
To change the query parser for any search function at query time, use the Solr localParams syntax, replacing the <query>
term with '{!type=edismax}<query>'
.
With the GPText Universal Query Parser, you can use features from any of the other supported query parsers in a single query. To use the Universal Query Parser, replace the <query>
term with '{!gptextqp}<query>'
. See Using the Universal Query Parser for information and examples.
gptext.search()
Searches an index.
Syntax
gptext.search(<src_tbl>, <index_name>, <search_query>, <filter_queries>[, <options>])
Parameters
<src_table>
SELECT
statement on an existing, indexed table on which to perform the search. The <src_table>
parameter is an anytable
data type, specified in this format:
TABLE(SELECT * FROM <src_table>)
<index_name>
<search_query>
<filter_queries>
null
.<options>
Return type
SETOF gptext.search_scored_result
This is a composite type with the following columns:
Column | Type |
---|---|
id | text |
score | double precision |
hs (conditional) | gptext hstore |
rf (conditional) | text |
The id
column is returned as text, even if the <id_col>
specified in the gptext.create_index()
function is an integer type. If you order results by id
or join search results with the original table on id
, you must cast the returned id
column to the correct integer type in your query. For example, the following search query casts the id
returned by the search query to an INT8 type to join with the numeric id
column in the wikipedia.articles
table and to sort the results numerically. However, the id
column in the results is a text value and is therefore displayed left-justified.
SELECT s.id, s.score, a.title
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.wikipedia.articles', '*:*', null) s
WHERE s.id::INT8 = a.id
ORDER BY s.id::INT8;
id | score | title
----------+-------+---------------------------
25784 | 1 | Renewable energy
27743 | 1 | Solar energy
54838 | 1 | Biogas
55017 | 1 | Fusion power
65499 | 1 | Soil salinity
113728 | 1 | Geothermal energy
213555 | 1 | Solar updraft tower
533423 | 1 | Solar water heating
608623 | 1 | Ethanol fuel
855056 | 1 | Forward osmosis
2008322 | 1 | Vehicle-to-grid
2120798 | 1 | Lithium economy
2988035 | 1 | Vortex engine
4711003 | 1 | Osmotic power
7906908 | 1 | Biomass
13021878 | 1 | Geothermal power
13690575 | 1 | Solar power
14090587 | 1 | Low-carbon power
14205946 | 1 | Algae fuel
18965585 | 1 | Pressure-retarded osmosis
22391303 | 1 | Liquid nitrogen engine
26787212 | 1 | Reverse electrodialysis
53716476 | 1 | Seaweed fuel
(23 rows)
If the <options>
parameter is included in gptext.search()
, the result includes the offsets
column. This column contains key-value pairs, where the key is the column name and the value is a comma-separated list of offsets to locations where the search term occurs. This data is used by the gptext.highlight()
function to add highlighting tags to the column data. If highlighting is not enabled with the 'hl=true'
option, the offsets
column is NULL
.
If the fl
option is included in the <options>
parameter to specify additional fields to add to the result, the rf
column contains the additional fields in a formatted text value. The gptext.gptext_retrieve_field()
function can be used to extract a single field value from the rf
column. There are variants of the gptext.gptext_retrieve_field()
function to retrieve integer and float values from the rf
column value.
Privileges required
You must have the SELECT privilege to execute this function.
Solr options
Solr queries allow the following optional refinements, specified as an ampersand-delimited list in the options
parameter.
defType
Example: defType=edismax
rows
Example: rows=100
returns 100 rows per segment or all rows if there are fewer than 100.
sort
Examples:
sort=score desc
(default if no sort defined)sort=date_time asc
sort=date_time asc score desc
sorts ondate_time
ascending, then onscore
descending
start
Examples:
start=0
default: returned records start with record 0start=25
returned records start with record 25
hl
Example: hl=true
hl.fl
Examples:
hl.fl=message_text
hl.fl=title,content
fl
stored=true
in the managed-schema for the index.
Example: fl=title,refs
Remarks
- The output includes a table with columns
id
(the ID named in gptext.create_index()) andscore
(thetf-idf
score). A column namedoffsets
is included if highlighting is specified in theoptions
parameter. A column namedrf
is included when a list of additional fields to include is specified in the<options>
parameter. - To change the query parser at query time, specify the
defType
option in the options parameter list. For example, setting the options parameter to'rows=100&defType=edismax'
limits the output to 100 rows per segment and will change the query parser toedismax
. The
TABLE
query is planned and affects the estimate forgptext.search()
, but does not execute. For example, if your query includesgptext.search(TABLE(SELECT * FROM t), ...)
the query planner estimates the number of results as the number of rows in
t
. This can cause the query planner to ignore the use of an index scan. Use a query likeTABLE(SELECT 1 SCATTER BY 1)
to avoid this issue.If you do not specify options,
gptext.search()
returns all rows.The options separator has changed from comma to ampersand (&) in order to support highlighting. If you do not use highlighting, you can revert to using the comma separator by setting the
gptext.search_param_separator
to','
.The Solr option
rows
specifies the maximum number of rows to return per segment. For example, if you have four segments,'rows=100'
returns at most a total of 400 rows. To limit the number of rows returned for an entire query, set aLIMIT
in the SQL query. For example, the following query returns at most 20 rows for the query:=# SELECT t.id, q.score, t.message_text FROM twitter.message t, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', '*:*', null, null) q WHERE t.id=q.id::int8 LIMIT '20';
The
gptext hstore
type is a limited form of the Postgreshstore
type. It only has thehstore
input and output functions implemented, asgptext_hstore_in
andgptext_hstore_out
.
Examples
Runs a GPText query that looks for Wikipedia articles that contain the term “optimization”, and joins the results to the original Greenplum Database
articles
table:=# SELECT a.id, a.title, q.score FROM wikipedia.articles a, gptext.search(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles', 'optimization', null) q WHERE a.id = q.id::int8 ORDER BY score DESC; id | title | score ----------+---------------------------+------------ 213555 | Solar updraft tower | 1.5528862 18965585 | Pressure-retarded osmosis | 0.89540845 7906908 | Biomass | 0.8692636 25784 | Renewable energy | 0.7473389 533423 | Solar water heating | 0.7186527 608623 | Ethanol fuel | 0.6943706 27743 | Solar energy | 0.6943706 2008322 | Vehicle-to-grid | 0.6352812 55017 | Fusion power | 0.6347449 14205946 | Algae fuel | 0.6347449 13690575 | Solar power | 0.58286035 (11 rows)
Returns 5 rows from each segment with the text “iphone” highlighted in the
message_text
column. This example requires that you have enabled terms on themessage_text
field in thedemo.twitter.message
table. See the example in thegptext.enable_terms()
reference.=# SELECT t.id, gptext.highlight(t.message_text, 'message_text', s.hs) message FROM twitter.message t, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message','{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text' ) s WHERE t.id = s.id::int8; id | message ----------+--------------------------------------------------------------------------- ------------------------------------------ 19714120 | @ayee_Eddy2011 I love pancakes too! #<em>iPhone</em> #app 19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love" 19416451 | I'm in love with my new <em>iPhone</em>(: 20257190 | Love my #<em>iphone</em> - only problem now? I want an #Ipad! 20759274 | Dropped frutopia on : My phone... #ciaowaterdamage I hate <em>iPhones</em>. 20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #ga dgets #<em>iphone</em> #apple #technology 19424811 | I hate : <em>iPhones</em>: 20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!! 20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>. 20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk! (10 rows)
gptext.search_count()
Returns the number of documents that match the search query.
Syntax
gptext.search_count(<index_name>, <search_query>, <filter_queries>, <options>)
Parameters
<index_name>
<search_query>
<filter_queries>
null
.
<options>
Return type
bigint
Privileges required
You must have the SELECT privilege to execute this function.
Example
=# SELECT * FROM gptext.search_count('demo.wikipedia.articles', 'bubble', null);
count
-------
3
(1 row)
gptext.search_external()
Searches a GPText external index.
Syntax
gptext.search_external(<table-exp>, <index_name>, <search_query>,
<filter_queries>[, <options>])
Parameters
<table>
TABLE(SELECT 1 SCATTER BY 1)
<index_name>
<search_query>
<filter_queries>
null
.
<options>
Return type
SETOF gptext.search_external_result
This type has the following columns:
Column | Type |
---|---|
id | text |
title | text |
subject | text |
description | text |
comments | text |
author | text |
keywords | text |
category | text |
resourcename | text |
url | text |
content_type | text |
last_modified | text |
links | text |
sha256 | text |
content | text |
score | double precision |
meta | text |
The last column, meta
is present only if the optional <options>
argument is included in the search.
Remarks
When you add an external document to the index, Apache Tika extracts a core set of metadata from the document, the columns listed in the Return type section. If any of these core metadata values are not present or do not exist in the document type, the value of the column in the result row is null.
If the <options>
argument is supplied, the results contain an additional text column named meta
. The meta
column contains additional document-type-specific metadata. You can use the gptext.gptext_retrieve_field()
function and its variants to extract individual metadata values by name from the meta
column.
If the <options>
argument contains the fl=<list>
Solr option, Solr returns values only for the columns included in <list>
and the id
, score
, and meta
columns. Other columns in the result set will have null values. It is more efficient to filter out columns in Solr than to retrieve all columns from Solr and then choose a subset of columns in the SQL SELECT
statement.
Examples
Finds HTML documents containing the term “facet”.
=# \x on =# SELECT id, title FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'facet', '{content_type:*html*}'); -[ RECORD 1 ]-------------------------------------------------------- id | http://gptext.docs.pivotal.io/latest/topics/function_ref.html title | GPText Function Reference | | Pivotal GPText Docs -[ RECORD 2 ]-------------------------------------------------------- id | http://gptext.docs.pivotal.io/latest/topics/queries.html title | Querying GPText Indexes | | Pivotal GPText Docs -[ RECORD 3 ]-------------------------------------------------------- id | http://gptext.docs.pivotal.io/latest/topics/guc_ref.html title | GPText Configuration Parameters | | Pivotal GPText Docs
Although just two columns are in this result set, the data Greenplum Database receives from Solr includes all core metadata fields, including the
content
field, which contains the full text of the document. The next example shows how to limit the data transferred from Solr to Greenplum Database with the Solr options argument.This example lists the
id
,title
,sha256
, andscore
columns from the core metadata and extractsmeta_creation_date
from the additional metadata supplied for PDF documents in thegptext-docs
external index. Thefl=title,sha256
Solr option prevents Solr from transferring unneeded fields from the index to Greenplum Database. (Theid
andscore
columns are always transferred.)=# \x on =# SELECT id, title, sha256, score, gptext.gptext_retrieve_field(meta, 'meta_creation_date') created FROM gptext.search_external(TABLE(SELECT 1 SCATTER by 1), 'gptext-docs', '*:*', '{content_type:*pdf*}', 'fl=title,sha256'); -[ RECORD 1 ]------------------------------------------------------------- id | http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf title | Pivotal GPText 2.1.3 Documentation | Pivotal GPText Docs sha256 | 2E063DF5037B9ACC6E180681AE6838077BC5F7A362B4A1E67D9D8FF3E4DD7F3D score | 1 created | 2017-09-22T17:22:54Z,2017-09-22T17:22:54Z -[ RECORD 2 ]------------------------------------------------------------- id | http://gpdb.docs.pivotal.io/latest/pdf/GPDB510Docs.pdf title | Version 5.1.0 sha256 | AF0B71D032C99A6BE817E1FA2FB774EB7B4D47D75A755ABF54F4F60FEBB92FF7 score | 1 created | 2017-10-31T19:12:41Z,2017-10-31T19:12:41Z
gptext.gptext_retrieve_field()
Retrieves a single field from the rf
or meta
search result column as a text value.
Syntax
gptext.gptext_retrieve_field(rf | meta, <field_name>)
Parameters
rf | meta
rf
for search results from regular GPText indexes and meta
for search results from GPText external indexes.
<field_name>
Remarks
The fl=<field_list>
Solr search option is added to the <options>
parameter of the gptext.search()
function to request additional stored fields. The additional fields are returned in the results in a column named rf
(meta
for external indexes). This column value has a format like the following:
column_value { name: "_version" value: "1544714234398507008" } column_value { name: "revision" value: "9.70" } column_value { name: "author" value: "jdough" }
The gptext.gptext_retrieve_field()
function extracts the value for a single specified field and returns it as a text value. If there is no field with the specified name in the rf
column, it returns NULL.
Storing additional fields in an index requires editing managed-schema
to specify the fields that should be stored. See Storing Additional Fields in an Index for instructions.
gptext.gptext_retrieve_field_int()
Retrieves a single field from the rf
or meta
search result column as an integer value.
Syntax
gptext.gptext_retrieve_field_int(rf | meta, <field_name>)
Parameters
rf | meta
rf
. For GPText external indexes it is meta
.
<field_name>
Remarks
The gptext.gptext_retrieve_field_int()
function is the same as the gptext.gptext_retrieve_field()
function, except that the extracted field value is converted to an integer value.
gptext.gptext_retrieve_field_float()
Retrieves a single field from the search result column as a float value.
Syntax
gptext.gptext_retrieve_field_float(rf | meta, <field_name>)
Parameters
rf | meta
rf
. For GPText external indexes it is meta
.
<field_name>
Remarks
The gptext.gptext_retrieve_field_float()
function is the same as the gptext.gptext_retrieve_field()
function, except that the extracted field value is converted to a float value.
gptext.highlight()
Highlights terms by inserting markup tags into data.
Syntax
gptext.highlight(<column_data>, <column_name>, <offsets>)
Parameters
<column_data>
<column_name>
<offsets>
gptext hstore
value that contains key-value pairs that indicate the locations of the text to highlight within the column data. See Remarks for information about the gptext hstore
data type.
Prequisite
- To use highlighting, term vectors must be enabled before creating the index. To enable term vectors, call
gptext.enable_terms()
for each field where you want to enable terms, then index or re-index withgptext.index()
.
Remarks
The
offsets
parameter is agptext hstore
, where each key is a column name and the value is a comma-separated list of offsets into the column data. Thishstore
is constructed by gptext.search() with highlighting enabled in theoffsets
parameter.Following is an example of the
offsets
hstore
content:"field1"=>"0:5,9:14","field2"=>"13:20"
gptext.highlight()
will insert two sets of tags into thefield1
data and one set into thefield2
data at the indicated offsets.The
gptext hstore
type is a limited form of the Postgreshstore
type. It has only thehstore
input and output functions implemented, asgptext_hstore_in
andgptext_hstore_out
.The highlight tags are defined by the
gptext.hl_pre_tag
andgptext.hl_post_tag
server configuration parameters. Their default values are<em>
and</em>
, respectively.
Example
gptext.highlight_external()
Highlights terms in search results from external indexes by inserting markup tags.
Syntax
gptext.highlight_external(<table_exp>, <index>, <search_query>, <filter_queries>[, <options>])
Parameters
<table_exp>
TABLE(SELECT 1 SCATTER BY 1)
is sufficient.
<index>
<search_query>
<filter_queries>
null
.
<options
Solr options
.
Remarks
The gptext.highlight_external()
function searches a GPText external index and encloses the search terms in markup tags in the returned results.
Example
Search for and highlight the terms “zookeeper” and “solr” in HTML documents.
=# SELECT id, content FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1),
'gptext-docs','{!gptextqp}zookeeper AND solr', '{content_type:*html*}',
'rows=2');
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/administering.html
content | includes security considerations, monitoring <em>Solr</em> index statistics, managing and monitoring <em>ZooKeeper</em>
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/performance.html
content | problems can result from resource contention between the Greenplum Database, <em>Solr</em>, and <em>ZooKeeper</em> clusters
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/ha.html
content | /topics/utility_ref.html GPText Management Utilities
| rect /210/topics/type_ref.html GPText and <em>Solr</em>
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/indexes.html
content | /topics/utility_ref.html GPText Management Utilities
| rect /210/topics/type_ref.html GPText and <em>Solr</em>
Faceted Search
Faceting breaks up a search result into multiple categories, showing counts for each.
gptext.faceted_field_search()
The faceted_field_search()
function breaks search results into field name categories.
Syntax
gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, <facet_fields>,
<facet_limit>, <minimum>, <options>)
Parameters
<index_name>
<query>
*:*
to query for all results. <filter_queries>
null
. <facet_fields>
<facet_limit>
<minimum>
<options>
Return type
SETOF gptext.facet_field_result
This is a composite type with the following columns:
Column | Type |
---|---|
field_name | text |
field_value | text |
value_count | bigint |
Privileges required
You must have the SELECT privilege to execute this function.
Remarks
None.
Examples
Facet all tweets on
spam
andtruncated
fields.=# SELECT * FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, '{spam, truncated}', 2, 0); field_name | field_value | value_count ------------+-------------+------------- spam | true | 1730 truncated | false | 1705 truncated | true | 25 (3 rows)
Facet on
author_id
, no limit, with a minimum of five tweets, all tweets. Selects five authors with at least two tweets.=# SELECT * FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, '{author_id}', 5, 2); field_name | field_value | value_count ------------+-------------+------------- author_id | 102185050 | 9 author_id | 202305785 | 2 author_id | 64111799 | 2 author_id | 45326213 | 2 author_id | 195035308 | 2 (5 rows)
gptext.faceted_query_search()
The faceted_query_search()
function breaks search results into categories defined by queries that you provide.
Syntax
gptext.faceted_query_search(<index_name>, <query>, <filter_queries>,
<facet_queries>, <options>)
Parameters
<index_name>
<query>
*:*
to query for all results.
<filter_queries>
null
.
<facet_queries>
<options>
Return type
SETOF gptext.facet_query_result
This is a composite type with the following columns:
Column | Type |
---|---|
query_name | text |
value_count | bigint |
Privileges required
You must have the SELECT privilege to execute this function.
Remarks
None.
Example
This example uses Solr queries to divide twitter authors into three classes based on number of followers.
=# SELECT *
FROM gptext.faceted_query_search('demo.twitter.message', '*:*', null,
'{author_followers_count:[0 TO 5], author_followers_count:[6 TO 10],
author_followers_count:[11 TO *]}');
query_name | value_count
----------------------------------+-------------
author_followers_count:[0 TO 5] | 39
author_followers_count:[11 TO *] | 1632
author_followers_count:[6 TO 10] | 36
(3 rows)
gptext.faceted_range_search()
The faceted_range_search()
function breaks search results into range categories over a numeric or date field, with ranges defined by the <range_start>
, <range_end>
, and <range_gap>
arguments.
Syntax
gptext.faceted_range_search(<index_name>, <query>, <filter_queries>,
<field_name>, <range_start>, <range_end>, <range_gap>, <options>)
Parameters
<index_name>
<query>
*:*
to query for all results.
<filter_queries>
null
.
<field_name>
<range_start>
<range_end>
<range_gap>
<options>
Return type
SETOF gptext.facet_range_result
This is a composite type with the following columns:
Column | Type |
---|---|
field_name | text |
range_value | text |
value_count | bigint |
Privileges required
You must have the SELECT privilege to execute this function.
Example
Facet on date range from midnight August 1, 2011 to midnight November 1, 2011, with a 7-day gap.
=# SELECT *
FROM gptext.faceted_range_search('demo.twitter.message', '*:*', null,
'created_at', '2011-08-01T00:00:00Z',
'2011-11-01T00:00:00Z', '+7DAY');
field_name | range_value | value_count
------------+----------------------+-------------
created_at | 2011-08-01T00:00:00Z | 0
created_at | 2011-08-08T00:00:00Z | 0
created_at | 2011-08-15T00:00:00Z | 0
created_at | 2011-08-22T00:00:00Z | 52
created_at | 2011-08-29T00:00:00Z | 189
created_at | 2011-09-05T00:00:00Z | 545
created_at | 2011-09-12T00:00:00Z | 0
created_at | 2011-09-19T00:00:00Z | 109
created_at | 2011-09-26T00:00:00Z | 69
created_at | 2011-10-03T00:00:00Z | 59
created_at | 2011-10-10T00:00:00Z | 206
created_at | 2011-10-17T00:00:00Z | 147
created_at | 2011-10-24T00:00:00Z | 112
created_at | 2011-10-31T00:00:00Z | 94
(14 rows)
Working with Terms
gptext.terms()
Gets the term vectors for the indexed documents in a Solr index for the specified field. You can use gptext.terms()
to create tables.
Syntax
gptext.terms(<src_table>, <index_name>, <field_name>,
<search_query>, <filter_queries>[, <options>])
Parameters
<src_table>
anytable
value that specifies a SELECT
statement on an existing, indexed table on which to perform the search. Specify in the format:
TABLE(SELECT * FROM <src_table>;)
<index_name>
<field_name>
<search_query>
<filter_queries>
null
.<options>
Return type
SETOF gptext.term_info
This is a composite type with the following columns:
Column | Type |
---|---|
id | text |
term | text |
positions | integer[] |
Privileges required
You must have the SELECT privilege to execute this function.
Remarks
- To enable using
gptext.terms()
, execute the GPText functiongptext.enable_terms()
, then reindex withgptext.index()
. The
TABLE
query is planned and affects the estimate of ofgptext.terms()
, but does not execute. For example, if your query includes:gptext.terms(TABLE(SELECT * FROM t), ...)
The query planner estimates the number of results as the number of rows in
t
. This can cause the query planner to ignore the use of an index scan. Use a query likeTABLE(SELECT 1 SCATTER BY 1)
to avoid this issue.
Examples
This example creates a terms table.
=# CREATE TABLE twitter.terms AS
SELECT * FROM gptext.terms(TABLE(SELECT * FROM twitter.message SCATTER BY 1),
'demo.twitter.message', 'message_text', 'iphone', null)
DISTRIBUTED BY (id);
SELECT 5385
Configuration and Monitoring
Index configuration and monitoring functions enable managing indexes, tracking index statistics, checking status of index segments, and ensuring that index contents are current.
gptext.cluster_status()
Shows Solr cluster status.
Syntax
gptext.cluster_status()
Return Type
SETOF gptext.cluster_status_result
This is a composite type with the following columns:
Column | Type |
---|---|
index_name | text |
max_shards_per_node | integer |
router | text |
replication_factor | integer |
auto_add_replicas | boolean |
znode_version | integer |
config_name | text |
partitioned | boolean |
Example
=# SELECT * FROM gptext.cluster_status();
index_name | max_shards_per_node | router | replication_factor | auto_add_
replicas | znode_version | config_name | partitioned
-------------------------+---------------------+----------+--------------------+----------
---------+---------------+-------------------------+-------------
demo.twitter.message | 4 | implicit | 2 | f
| 8 | demo.twitter.message | t
demo.wikipedia.articles | 4 | implicit | 2 | f
| 8 | demo.wikipedia.articles | f
(2 rows)
gptext.config_append()
Appends the contents of a local file to a ZooKeeper index configuration file.
Syntax
gptext.config_append(<index_name>, <local_config_file> [, <index_config_file>])
Parameters
<index_name>
<local_config_file>
<index_config_file>
Return Type
boolean
Example
Append the local file /home/gpadmin/stopwords.add
to the top-level ZooKeeper file stopwords.txt
for index demo.wikipedia.articles
:
=# SELECT * FROM gptext.config_append('demo.wikipedia.articles', '/home/gpadmin/stopwords.add', 'stopwords.txt');
config_append
---------------
t
(1 row)
gptext.config_delete()
Deletes an index configuration file from ZooKeeper.
Syntax
gptext.config_delete(<index_name>, <index_config_file>)
Parameters
<index_name>
<index_config_file>
Return Type
boolean
Example
Delete the file named stopwords.add
from the top-level configuration directory for the index demo.wikipedia.articles
:
=# select * from gptext.config_delete('demo.wikipedia.articles', 'stopwords.add');
config_delete
---------------
t
(1 row)
gptext.config_get()
Displays the contents of a ZooKeeper index configuration file.
Syntax
gptext.config_get(<index_name>, <index_config_file>)
Parameters
<index_name>
<index_config_file>
Return Type
text
Example
Display the contents of synonyms.txt
for the index demo.wikipedia.article
:
=# select * from gptext.config_get('demo.wikipedia.articles', 'synonyms.txt');
config_get
----------------------------------------------------------------------------
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaa => aaaa
bbb => bbbb1 bbbb2
ccc => cccc1,cccc2
a\=>a => b\=>b
a\,a => b\,b
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
(1 row)
gptext.config_list()
Lists the ZooKeeper configuration files and directories for an index.
Syntax
gptext.config_list(<index_name>, [<index_config_path>,] <is_recursive>)
Parameters
<index_name>
<index_config_path>
<is_recursive>
Return Type
SETOF text
Examples
List ZooKeeper configuration files and directories only in the top-level directory for the index demo.wikipedia.articles
:
=# select * from gptext.config_list('demo.wikipedia.articles', false);
config_list
-----------------------------
currency.xml
mapping-FoldToASCII.txt
managed-schema
protwords.txt
scripts.conf
synonyms.txt
managed_schema
stopwords.txt
velocity
admin-extra.html
aggconfig.xml
emoticons.txt
solrconfig.xml
elevate.xml
xslt
mapping-ISOLatin1Accent.txt
spellings.txt
lang
(18 rows)
List ZooKeeper configuration files in the ZooKeeper lang
subdirectory for demo.wikipedia.articles
:
=# select * from gptext.config_list('demo.wikipedia.articles', 'lang', false);
config_list
--------------------------
lang/contractions_it.txt
lang/contractions_ca.txt
lang/stemdict_nl.txt
lang/stopwords_hy.txt
lang/stopwords_no.txt
lang/stopwords_id.txt
[...]
(39 rows)
List all configuration files and directories for demo.wikipedia.articles
:
=# select * from gptext.config_list('demo.wikipedia.articles', true);
config_list
----------------------------------
currency.xml
mapping-FoldToASCII.txt
managed-schema
protwords.txt
scripts.conf
synonyms.txt
managed_schema
stopwords.txt
velocity
velocity/doc.vm
velocity/suggest.vm
velocity/hit.vm
[...]
(86 rows)
gptext.config_upload()
Uploads an index configuration file to ZooKeeper, replacing any existing file of the same name.
Syntax
gptext.config_upload(<index_name>, <local_config_file> [, <index_config_file>])
Parameters
<index_name>
<local_config_file>
<index_config_file>
option to change the path or filename.
<index_config_file>
Return Types
boolean
Examples
Upload the local file /home/gpadmin/stopwords.txt
to ZooKeeper, overwriting the existing stopwords.txt
file for the index demo.wikipedia.articles
:
=# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords.txt');
config_upload
---------------
t
(1 row)
Upload the local file /home/gpadmin/stopwords_japanese.txt
to ZooKeeper, overwriting the file lang/stopwords_ja.txt
for the index demo.wikipedia.articles
:
# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords_japanese.txt', 'lang/stopwords_ja.txt');
config_upload
---------------
t
(1 row)
gptext.index_size()
Shows the number of documents indexed and total disk space used for GPText indexes.
Syntax
gptext.index_size([<index_name>])
Parameters
<index_name>
Return Types
SETOF gptext.index_size_result
This is a composite type with the following columns:
Column | Type |
---|---|
index_name | text |
num_docs | integer |
size_in_bytes | bigint |
Examples
=# SELECT * FROM gptext.index_size();
index_name | num_docs | size_in_bytes
-------------------------+----------+---------------
demo.wikipedia.articles | 23 | 500515
demo.twitter.message | 1730 | 767118
gptext-docs | 16 | 618231
(3 rows)
=# SELECT * FROM gptext.index_size('demo.wikipedia.articles');
index_name | num_docs | size_in_bytes
-------------------------+----------+---------------
demo.wikipedia.articles | 23 | 500515
(1 row)
gptext.index_status()
Shows status of replicas for a specified index or for all indexes.
Syntax
gptext.index_status([<index_name>])
Parameters
<index_name>
Return Type
SETOF gptext.index_status_result
This is a composite type with the following columns:
Column | Type |
---|---|
content_id | smallint |
index_name | text |
shard_name | text |
shard_state | text |
replica_name | text |
replica_state | text |
core | text |
node_name | text |
base_url | text |
is_leader | boolean |
partitioned | boolean |
external_index | boolean |
Examples
Show status for a single index.
=# SELECT * FROM gptext.index_status('demo.wikipedia.articles'); content_id | index_name | shard_name | shard_state | replica_name | replica_state | core | node_name | base_url | is_leader | partitioned | external_ index ------------+-------------------------+------------+-------------+--------------+---------------+-------------- ---------------------------+-------------------+--------------------------+-----------+-------------+---------- ------ 0 | demo.wikipedia.articles | shard0 | active | core_node1 | active | demo.wikipedi a.articles_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | f | f 0 | demo.wikipedia.articles | shard0 | active | core_node4 | active | demo.wikipedi a.articles_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | f | f 1 | demo.wikipedia.articles | shard1 | active | core_node2 | active | demo.wikipedi a.articles_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | f | f 1 | demo.wikipedia.articles | shard1 | active | core_node3 | active | demo.wikipedi a.articles_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | f | f (4 rows)
Show status for all GPText indexes.
=# SELECT * FROM gptext.index_status(); content_id | index_name | shard_name | shard_state | replica_name | replica_state | core | node_name | base_url | is_leader | partitioned | external_ index ------------+-------------------------+------------+-------------+--------------+---------------+-------------- ---------------------------+-------------------+--------------------------+-----------+-------------+---------- ------ 0 | demo.store.products | shard0 | active | core_node1 | active | demo.store.pr oducts_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | t | f | f 0 | demo.store.products | shard0 | active | core_node2 | active | demo.store.pr oducts_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | f | f | f 1 | demo.store.products | shard1 | active | core_node3 | active | demo.store.pr oducts_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | f | f | f 1 | demo.store.products | shard1 | active | core_node4 | active | demo.store.pr oducts_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | t | f | f 0 | demo.twitter.message | shard0 | active | core_node2 | active | demo.twitter. message_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | t | f 0 | demo.twitter.message | shard0 | active | core_node3 | active | demo.twitter. message_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | t | f 1 | demo.twitter.message | shard1 | active | core_node1 | active | demo.twitter. message_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | t | t | f 1 | demo.twitter.message | shard1 | active | core_node4 | active | demo.twitter. message_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | f | t | f 0 | demo.wikipedia.articles | shard0 | active | core_node1 | active | demo.wikipedi a.articles_shard0_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | f | f 0 | demo.wikipedia.articles | shard0 | active | core_node4 | active | demo.wikipedi a.articles_shard0_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | f | f 1 | demo.wikipedia.articles | shard1 | active | core_node2 | active | demo.wikipedi a.articles_shard1_replica1 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | f | f 1 | demo.wikipedia.articles | shard1 | active | core_node3 | active | demo.wikipedi a.articles_shard1_replica2 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | f | f 1 | gptext-docs | shard1 | active | core_node2 | active | gptext-docs_s hard1_replica1 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | f | t 1 | gptext-docs | shard1 | active | core_node3 | active | gptext-docs_s hard1_replica2 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | f | t 2 | gptext-docs | shard2 | active | core_node1 | active | gptext-docs_s hard2_replica1 | gpdb51:18983_solr | http://gpdb51:18983/solr | t | f | t 2 | gptext-docs | shard2 | active | core_node4 | active | gptext-docs_s hard2_replica2 | gpdb51:18984_solr | http://gpdb51:18984/solr | f | f | t (16 rows)
gptext.live_nodes()
Lists active Solr nodes and their up or down state.
Syntax
gptext.live_nodes()
Return Type
SETOF gptext.live_nodes_result
This is a composite type with the following columns:
Column | Type |
---|---|
host | text |
port | bigint |
data_dir | text |
status | text |
Example
=# SELECT * FROM gptext.live_nodes();
host | port | data_dir | status
--------+-------+---------------------+--------
gpdb51 | 18983 | /data/gpdata1/solr0 | u
gpdb51 | 18984 | /data/gpdata2/solr0 | u
(2 rows)
Remarks
The status column can be u
(up) or d
(down).
gptext.partition_status()
Lists indexes on partitioned tables or child partition names in the current Greenplum database.
Syntax
gptext.partition_status([<index_name>])
Parameters
<index_name>
Return Type
SETOF gptext.partition_status_result
This is a composite type with the following columns:
Column | Type |
---|---|
partition_name | text |
inherits_name | text |
level | integer |
cons | text |
Example
List partition status for an index.
=# SELECT partition_name, inherits_name, level
FROM gptext.partition_status('demo.twitter.message');
partition_name | inherits_name | level
------------------------------+----------------------+-------
demo.twitter.message_1_prt_1 | demo.twitter.message | 1
demo.twitter.message_1_prt_2 | demo.twitter.message | 1
demo.twitter.message_1_prt_3 | demo.twitter.message | 1
demo.twitter.message_1_prt_4 | demo.twitter.message | 1
(4 rows)
Remarks
The gptext.partition_status()
function can only list the index partitions for tables in the current Greenplum database.
gptext.reload_index()
Reloads Solr configuration files if they have been modified.
Syntax
gptext.reload_index(<index_name>)
Parameters
<index_name>
Return type
boolean
Privileges required
Only the OWNER can execute this function.
Remarks
None.
Example
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
gptext.version()
Returns the version of your GPText installation.
Syntax
SELECT * FROM gptext.version()
Parameters
None.
Return type
text
Privileges required
You do not need any privileges to execute this function.
Example
=# SELECT * FROM gptext.version();
version
--------------------------------
Greenplum Text Analytics 2.1.3
(1 row)
gptext.zookeeper_hosts()
Returns a list of ZooKeeper hosts and ports.
Syntax
gptext.zookeeper_hosts()
Return type
text
Remarks
This function returns a comma-separated list of ZooKeeper nodes in the the format <host-name>:<port>
.
Example
=# SELECT * FROM gptext.zookeeper_hosts()
host | port
--------+------
gpdb51 | 2188
gpdb51 | 2189
gpdb51 | 2190
(3 rows)
High Availability
gptext.add_replica()
Adds a replica of an index shard.
Syntax
gptext.add_replica(<index_name>, <shard_name>[, <node_name>])
Parameters
<index_name>
<shard_name>
<node_name>
Return type
boolean
Remarks
This function is used by the GPText management utility gptext-replica add
.
The value of the gptext.replication_factor
configuration parameter when an index is created determines how many replicas are created for each shard. In a Greenplum system, there are the same number of shards as there are Greenplum segments. The number of replicas created for a new index is the number of segments times the value of the gptext.replication_factor
configuration parameter, 2 by default. The replicas are distributed evenly among the live GPText nodes.
Replicas consume space on the host where they are created, so they are usually only created to replace a replica that has failed or become unavailable or to relocate a replica to another GPText instance. When adding replicas, you should maintain equal distribution of replicas among the GPText nodes and avoid placing multiple replicas for the same shard on the same host.
The total number of replicas for an index that can be placed on each GPText node is set when the index is created. (In Solr, this is the MaxShardsPerNode
parameter.) GPText sets this limit by calculating the number of replicas to create per node and adding an additional factor, specified in the gptext.extension_factor
server configuration parameter. This parameter can be set between 0 and 10; the default value is 2. Since the limit is set when the index is created, it is recommended to set the gptext.extension_factor
parameter to a higher number to allow new replicas to be created when necessary.
Example
=# SELECT * FROM gptext.add_replica('demo.wikipedia.articles', 'shard1');
success | core_name
---------+-----------------------------------------
t | demo.wikipedia.articles_shard1_replica3
(1 row)
gptext.delete_replica()
Deletes a named replica from the specified index and shard.
Syntax
gptext.delete_replica(<index_name>, <shard_name>, <replica_name>[, <only_if_down>])
Parameters
<index_name>
<shard_name>
<replica_name>
<only_if_down>
Return type
boolean
Remarks
Use the gptext.index_status()
function to find the name of the replica to drop. Names are in the format core_nodeX
, where X
is a number.
This function is called from the gptext-replica drop
management utility.
Examples
Delete the
core_node5
replica if it is down.=# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5', true); ERROR: Delete replica failed: Attempted to remove replica : demo.wikipedia.articles/shard1/core_node5 with onlyIfDown='true', but state is 'active'.
Delete the
core_node5
replica even if it is active.=# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5'); success --------- t (1 row)
General Purpose Functions
gptext.count_t()
Counts the number of records in a table.
Syntax
gptext.count_t(<table_name>)
Parameters
<table_name>
Return type
integer
Privileges required
You need SELECT privileges on <table_name>
to execute this function.
Example
=# SELECT * FROM gptext.count_t('demo.wikipedia.articles');
count_t
---------
23
(1 row)