Administering GPText

GPText administration includes security considerations, monitoring Solr index statistics, managing and monitoring ZooKeeper, and troubleshooting.

Viewing the Cluster Configuration

GPText deploys Apache ZooKeeper and Apache Solr nodes on hosts in your Greenplum Database network. Each node is a JVM server process listening for requests from other nodes. Use the gptext-state config command to list the host and port for each ZooKeeper and Solr node and the memory configuration for Solr nodes.

$ gptext-state configs
20181112:12:38:26:018080 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Cluster Configurations.
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-JVM Min  |  Max            Xms1024M  |  Xmx2048M
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Node information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   Host   Node Name         Port    Solr Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   sdw1   sdw1_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   sdw1   sdw1_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   sdw2   sdw2_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   sdw2   sdw2_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Zookeeper information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   Host   Port   Zookeeper Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   mdw    2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   sdw2   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-   sdw1   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Done.

You don’t need these details to use the GPText functions and utilities, but the information can be useful for monitoring and troubleshooting the cluster. For example, you can access the Solr Admin UI by browsing to the URL http://<hostname>:<port> on any Solr node. See Using the Solr Administration Interface for information about the Solr Admin UI.
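As a rough illustration of putting this output to use, the following shell sketch turns the node table printed by gptext-state configs into Solr Admin UI URLs. The function name solr_admin_urls is hypothetical, and the parsing assumes the column layout shown in the sample output above (host, node name, port, and Solr directory as the last four fields). Adjust it if your GPText version prints a different layout.

```shell
# solr_admin_urls (hypothetical helper): reads `gptext-state configs` output
# on stdin and prints http://<hostname>:<port> for each Solr node.
# Assumes the last four fields of a node line are:
#   host, node name (<host>_solr:<port>), port, Solr directory.
solr_admin_urls() {
  awk 'NF >= 4 && $(NF-2) ~ /_solr:[0-9]+$/ {
         print "http://" $(NF-3) ":" $(NF-1)
       }'
}

# Usage (on a host where GPText is installed):
#   gptext-state configs | solr_admin_urls
```

ZooKeeper rows and header rows do not contain a <host>_solr:<port> node name, so they are skipped.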

Changing GPText Server Configuration Parameters

Configuration parameters used with GPText are built into GPText with default values. You set new values for the parameters in a Greenplum Database session using the SET command, the same way you set Greenplum Database session parameters. When you enter the SET command, GPText updates the value in ZooKeeper so that the change persists between database sessions.

The custom_variable_classes configuration parameter is removed in Greenplum Database 6. You can set custom variables in a database session without error, so the one-time configuration step described below is not needed for Greenplum Database 6.

With Greenplum Database 4.x and 5.x, a one-time Greenplum Database configuration change is needed so that Greenplum Database allows you to set and display GPText configuration parameters. Until you have performed this step, any attempt to set a GPText parameter results in an “Unrecognized configuration parameter” error. You must declare a custom variable class for GPText.

As the gpadmin user, enter the following commands in a shell:

$ gpconfig -c custom_variable_classes -v 'gptext'
$ gpstop -u

Once this step is completed, you can view and set GPText configuration parameters in psql.

To view GPText configuration parameters, you first need to fetch them from ZooKeeper into your Greenplum Database session by executing the gptext.version() UDF.

=# SELECT gptext.version();
 Greenplum Text Analytics 3.2.0
(1 row)

Then you can use the SHOW command to display values of the parameters, for example:

=# SHOW gptext.idx_num_shards;
(1 row)

See GPText Configuration Parameters for a complete list of configuration parameters.

GPText uses the current values of the configuration parameters when you create a new index, so changing a configuration parameter affects new indexes, but does not affect existing indexes.

Change the values of GPText configuration variables using the SET command in a session with a database that contains the GPText schema. The following example sets values for three configuration parameters in a psql session:

=# set gptext.idx_buffer_size=10485760;
=# set gptext.idx_delim='|';
=# set gptext.extension_factor=5;

You can view the new value of a configuration parameter that you have set using the SHOW command:

=# show gptext.idx_delim;
(1 row)

Security and GPText Indexes

GPText security is based on Greenplum Database security. Your privileges to execute GPText functions depend on your privileges for the database table that is the source for the index. For example, if you have SELECT privileges for a table in a Greenplum database, then you have SELECT privileges for an index generated from that table.

Executing GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.

ZooKeeper Administration

Apache ZooKeeper enables coordination between the Apache Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. In ZooKeeper, a node (called a znode) can contain data, like a file, and can have child znodes, like a directory. ZooKeeper replicates data between multiple instances deployed as a cluster to provide a highly available, fault-tolerant service. Both Solr and GPText store configuration files and share status by writing data to ZooKeeper znodes. GPText stores information in the /gptext znode. The configuration files for a GPText index are in the /gptext/configs/<index-name> znode.

The number of ZooKeeper instances in the cluster determines how many ZooKeeper node failures the cluster can tolerate and still remain active. The service remains available as long as a majority of the nodes in the cluster are running and able to communicate with each other. To tolerate the failure of n nodes, the cluster must have 2n+1 nodes. A cluster of five nodes, for example, can tolerate two failed nodes.
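The 2n+1 rule can be turned around to answer the sizing question for a given cluster. The following is a minimal sketch of that arithmetic in shell. The function names are hypothetical and are not part of GPText or ZooKeeper.

```shell
# Largest number of failed ZooKeeper nodes an N-node cluster can tolerate.
# From N = 2n + 1 it follows that n = (N - 1) / 2, rounded down.
zk_tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Smallest number of nodes that forms a majority of an N-node cluster.
zk_majority() {
  echo $(( $1 / 2 + 1 ))
}

zk_tolerated_failures 5   # prints 2
zk_majority 5             # prints 3
```

Note that a four-node cluster tolerates only one failure, the same as a three-node cluster, which is why odd cluster sizes are the usual choice.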

ZooKeeper is very fast for read requests because it stores data in memory. If ZooKeeper begins to swap memory to disk, Solr and GPText performance will suffer and operations could fail, so it is critical to allocate sufficient memory to the ZooKeeper Java processes. To avoid ZooKeeper instances competing with Greenplum Database segments for memory, you should deploy the ZooKeeper instances and Greenplum Database segments on different hosts. The ZooKeeper and Greenplum Database hosts must be on the same network and accessible with passwordless SSH by the gpadmin user. You can use the Greenplum Database gpssh-exkeys utility to share SSH keys between ZooKeeper and Greenplum Database hosts.

You must start the ZooKeeper cluster before you start GPText. When you start GPText, the Solr nodes each load the replicas for indexes they manage. With large numbers of indexes, shards, and replicas, starting up the cluster can generate a very high, atypical load on ZooKeeper. It can take a long time to get all indexes loaded and some ZooKeeper requests may time out waiting for responses. Using the gptext-start --slow_start option starts Solr nodes one at a time, providing a more ordered start-up and limiting the number of concurrent ZooKeeper requests.
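The start-up order described above could be captured in a small wrapper such as the following sketch. The function name start_gptext_stack and the RUN variable are hypothetical. The sketch assumes the ZooKeeper cluster was installed by the GPText installer, so that zkManager can start it, and it prints the commands by default rather than running them.

```shell
# start_gptext_stack (hypothetical wrapper): ZooKeeper first, then GPText.
# RUN defaults to `echo`, so the commands are only printed; set RUN to an
# empty string (RUN= start_gptext_stack) to execute them for real.
start_gptext_stack() {
  ${RUN-echo} zkManager start &&
  ${RUN-echo} gptext-start --slow_start
}

start_gptext_stack
```

Because the two commands are joined with &&, gptext-start is attempted only if the ZooKeeper start succeeds.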

The GPText command-line utility zkManager can be used to monitor the ZooKeeper cluster. If the ZooKeeper cluster is bound to GPText, you can also start and stop the cluster using zkManager.

Checking ZooKeeper Status

Use the zkManager utility from the command line to check the ZooKeeper cluster status. The utility lists the hosts, ports, latency, and follower/leader mode for each ZooKeeper instance. If a node is down, its mode is listed as Down.

To check the ZooKeeper cluster status, run the zkManager state command.

$ zkManager state
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper state process.
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-   Host   port   Latency min/avg/max   Mode
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   2189   0/0/22                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   2190   0/0/29                leader
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   2188   0/0/27                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Done.
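For scripted monitoring, the Mode column can be filtered with a short helper like the sketch below. The function name zk_down_nodes is hypothetical. The sample output above shows only healthy instances, so the exact layout of a Down row is an assumption: the sketch expects the host and port to remain the third and fourth whitespace-separated fields and Down to be the last field.

```shell
# zk_down_nodes (hypothetical helper): reads `zkManager state` output on stdin
# and prints host:port for every instance whose Mode column is "Down".
# Assumes host and port are the 3rd and 4th whitespace-separated fields, as in
# the sample output; the layout of a Down row is an assumption.
zk_down_nodes() {
  awk '$NF == "Down" { print $3 ":" $4 }'
}

# Usage (on a host where GPText is installed):
#   zkManager state | zk_down_nodes
```

Empty output means no instance reported the Down mode.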

In a database session, you can use the gptext.zookeeper_hosts() function to list the ZooKeeper hosts.

=# SELECT * FROM gptext.zookeeper_hosts();
  host  | port
 gpdb51 | 2188
 gpdb51 | 2189
 gpdb51 | 2190
(3 rows)

Starting and Stopping the ZooKeeper Cluster

If the ZooKeeper cluster was installed by the GPText installer, the zkManager utility can start or stop the ZooKeeper cluster. To start the cluster, run the zkManager start command.

$ zkManager start
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper start process
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Starting Zookeeper:
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-   Host   Zookeeper Dir
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   /data/master/zoo0
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   /data/master/zoo1
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   /data/master/zoo2
20171016:16:14:48:017845 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:16:14:53:017845 zkManager:gpdb:gpadmin-[INFO]:-Done.

To stop ZooKeeper, run the zkManager stop command.

$ zkManager stop
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper stop process.
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Stop Zookeeper:
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-   Host   Zookeeper Dir
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   /data/master/zoo0
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   /data/master/zoo1
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-   gpdb   /data/master/zoo2
20171016:16:14:09:016499 zkManager:gpdb:gpadmin-[INFO]:-Done.

See the zkManager reference for more information.

Checking SolrCloud Status

You can check the status of the SolrCloud cluster and indexes by running the gptext-state utility from the command line.

To check the state of the GPText nodes and each index, run the gptext-state utility with the -D (--details) option. Example:

$ gptext-state -D
20180615:16:09:24:031986 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster status...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Current GPText Version: 3.0.0
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-All nodes are up and running.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Index state details.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-   database   index name                state
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-   demo       demo.twitter.message      Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-   demo       demo.wikipedia.articles   Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Done.

This command reports the status of the GPText nodes and status of each GPText index.
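When there are many indexes, it can help to list only those that are not Green. The following sketch filters the index table from gptext-state -D. The function name non_green_indexes is hypothetical, and the parsing assumes the layout in the sample output above, where each index row ends with a three-part index name followed by its state.

```shell
# non_green_indexes (hypothetical helper): reads `gptext-state -D` output on
# stdin and prints "<index name> <state>" for each index that is not Green.
# Assumes an index row ends with: database, <db>.<schema>.<table>, state.
non_green_indexes() {
  awk 'NF >= 5 && $(NF-1) ~ /^[^.]+\.[^.]+\.[^.]+$/ && $NF != "Green" {
         print $(NF-1) " " $NF
       }'
}

# Usage (on a host where GPText is installed):
#   gptext-state -D | non_green_indexes
```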

Run gptext-state list to view just the indexes.

The gptext-state healthcheck command checks the GPText configuration files, the index status, required disk space, user privileges, and index and database consistency. By default, the required disk space check passes if there is at least 20% disk free. You can set a different disk free threshold using the --disk_free option. For example:

[gpadmin@gpdb-sandbox ~]$ gptext-state healthcheck --disk_free=25
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText config files ...
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText index status ...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required disk space...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required user privileges...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for indexes and database consistency...
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Done.
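To get a feel for the disk free threshold before running the healthcheck, you could approximate the same comparison with df. This is only a sketch: the function names are hypothetical, and gptext-state may measure free space differently from the Capacity column that df -P reports.

```shell
# free_pct_from_df (hypothetical helper): reads `df -P <path>` output on stdin
# and prints the free percentage, computed as 100 minus the Capacity column.
free_pct_from_df() {
  awk 'NR == 2 { sub(/%$/, "", $5); print 100 - $5 }'
}

# disk_free_ok <path> [threshold]: succeeds when free space is at least
# threshold percent (default 20, the healthcheck default described above).
disk_free_ok() {
  [ "$(df -P "$1" | free_pct_from_df)" -ge "${2:-20}" ]
}

# Usage; the path is a placeholder for a GPText data directory:
#   disk_free_ok /data/gptext 25 && echo "at least 25% free"
```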

See the gptext-state utility reference for additional options.

Recovering GPText Nodes

Use the gptext-recover utility to recover down GPText nodes, for example after a failed Greenplum Database segment host is recovered.

With no arguments, the gptext-recover utility discovers down GPText nodes and restarts them.

With the -f (or --force) option, if a GPText node cannot be restarted and no shards are down, the node is deleted and created again on the same host. Missing replicas are added and the failed node and failed replicas are removed. If the index is in a red state, gptext-recover -f prints a message and exits.

The -H (--new_hosts) option allows recreating down GPText nodes on new hosts that replace failed hosts. The down GPText nodes are deleted and recreated on the new hosts. The argument to the -H option is a comma-separated list of the new hosts that are to replace the failed hosts. The number of new hosts must match the number of failed hosts. If shards are down, the utility advises reindexing. If only some replicas are down, the utility recreates the replicas on the new hosts and updates gptext.conf.

The -r option recovers replicas, but does not attempt to recover any down nodes.

Note: Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:

  • Java 1.8
  • Python 2.6 or greater
  • The Linux lsof utility

Viewing Solr Index Statistics

You can view Solr index statistics by running the gptext-state utility from the command line.

To list all GPText indexes, enter the following command at the command line:

gptext-state list

A command line that retrieves all statistics for an index:

gptext-state --index demo.wikipedia.articles

A command line that retrieves the number of documents in an index:

gptext-state --index demo.wikipedia.articles --stats_columns=num_docs

A command line that retrieves num_docs, index size, and the date and time last_modified:

gptext-state --index demo.wikipedia.articles --stats_columns num_docs,size,last_modified

Backing Up and Restoring GPText Indexes

With the gptext-backup management utility, you can back up a GPText index so that, if needed, you can quickly recover from a failure. The backup can be restored to the same GPText system or to another system with the same number of Greenplum Database segments.

The gptext-backup management utility backs up an index and its configuration files to either a shared file system, which must be mounted on and writable by each host in the Greenplum Database cluster, or to local storage on the Greenplum Database master and segment hosts.

Backing Up to a Shared File System

To back up on a shared file system, use the -p (--path) command-line option to specify the location of a directory on the mounted file system and the -n (--name) option to provide a name for the backup. Specify the index to back up with the -i (--index) option.

$ gptext-backup -i <index-name> -p <path> -n <backup-name>

The gptext-backup utility then checks that:

  • the GPText cluster is up
  • the shared file system is valid
  • the backup name specified with the -n option does not already exist in the directory specified with the -p option

The utility creates the new directory and then saves one copy of each index shard to that directory, along with the index’s configuration files from ZooKeeper.
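Because gptext-backup refuses a backup name that already exists under the target path, a script can check for that condition up front. The sketch below mirrors only that one check. The function name backup_name_available is hypothetical, and the path and backup name in the usage comment are placeholders.

```shell
# backup_name_available <path> <backup-name>: succeeds when <path> is an
# existing directory and <path>/<backup-name> does not exist yet.
backup_name_available() {
  [ -d "$1" ] && [ ! -e "$1/$2" ]
}

# Usage sketch; /mnt/gptext_backups and articles_2018-05-08 are placeholders:
#   backup_name_available /mnt/gptext_backups articles_2018-05-08 &&
#     gptext-backup -i demo.wikipedia.articles -p /mnt/gptext_backups -n articles_2018-05-08
```

The utility still performs its own checks; this only lets a wrapper script fail early with its own message.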

To save the configuration files only, with no data, add the -c (--backup_conf) command-line option.

To restore an index from a shared file system, use the gptext-restore management utility. The GPText system you restore to must be on a Greenplum Database cluster with the same number of segments. The database and schema for the index must be present.

The -i (--index) option specifies the name of the GPText index that will be restored. If the index exists, you must first drop it with the gptext.drop_index() user-defined function.

The -p (--path) option specifies the location of the directory containing the backup files—the directory that gptext-backup created on the shared file system.

$ gptext-restore -i <index-name> -p <path>

You can add the -c option to restore only the configuration files to ZooKeeper and create an empty GPText index, without restoring any saved index data.

Backing Up to Local Storage

To back up to local storage on the Greenplum Database cluster, add the local keyword to the gptext-backup command line.

A local GPText backup has a unique name constructed by appending a timestamp to the index name. You do not use the -n option with local backups.

$ gptext-backup local -i <index-name>

On the master host, in the master data directory by default, the backup utility saves a JSON file with backup metadata and a directory containing the index’s configuration files from ZooKeeper.

The utility backs up each index shard on the Greenplum Database segment host with the GPText node that manages the shard’s lead replica. By default, the shard backup files are saved in a segment data directory.

The gptext-backup command output reports the locations of all backup files.

You can add the -p (--path) option to the gptext-backup command to specify a local directory where the backup will be saved. The directory must be present on every Greenplum Database host and must be writable by the gpadmin user.

$ gptext-backup local -i <index-name> -p <path>

The backup files will be saved in the specified directory on each host instead of in the Greenplum Database master and segment data directories.

To restore a backup saved to local storage, add the local keyword to the gptext-restore command line and specify the path to the backup directory on the master host.

$ gptext-restore local -p <path>

The <path> is the full path to the directory the gptext-backup command created on the master host, including the timestamp, for example $MASTER_DATA_DIRECTORY/demo.twitter.message_2018-05-08T15:32:21.397779.
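Since local backup directories are named with the index name followed by a timestamp, the most recent backup for an index can be located by name alone. The following is a sketch under that assumption. The function name latest_local_backup is hypothetical, and it relies on the shell expanding the glob in sorted order, so that the last matching directory carries the latest timestamp.

```shell
# latest_local_backup <dir> <index-name> (hypothetical helper): prints the
# newest backup directory named <index-name>_<timestamp> under <dir>.
# The body runs in a subshell so its variables do not leak to the caller.
latest_local_backup() (
  latest=""
  for d in "$1/$2"_*; do
    # Glob matches arrive in sorted order; keep the last directory seen.
    [ -d "$d" ] && latest=$d
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
)

# Usage sketch (run on the master host):
#   gptext-restore local -p "$(latest_local_backup "$MASTER_DATA_DIRECTORY" demo.twitter.message)"
```

The function returns a non-zero status and prints nothing when no matching directory exists.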

See the gptext-backup and gptext-restore references for syntax and examples.

Expanding the GPText Cluster

The gptext-expand management utility adds GPText nodes to the cluster. There are two ways to add nodes:

  • Add GPText nodes to existing hosts in the cluster. This option increases the number of GPText nodes on each host.
  • Add GPText nodes to new hosts added by using the Greenplum Database gpexpand management utility to expand the Greenplum Database system.

Adding GPText Nodes to Existing Segment Hosts

To add nodes to existing segment hosts, run the gptext-expand utility with a command like the following:

gptext-expand -e -p /data1/nodes,/data2/nodes

This example adds two GPText nodes to each host.

The -e (--existing) option specifies that nodes are to be added to existing hosts.

The -p (--expand_paths) option provides a list of directories where the new nodes’ data directories are to be created. These should be the same directories that contain the Greenplum Database segment data directories and existing GPText data directories. The number of directories in the list is the number of new nodes that are added.

A directory can be repeated in the directory list multiple times to increase the number of new GPText nodes to create. For example, if there is currently one GPText node per host in the /data1/nodes directory, you could add three nodes with a command like the following:

gptext-expand -e -p /data1/nodes,/data2/nodes,/data2/nodes

This adds one node to the /data1/nodes directory and two nodes to the /data2/nodes directory so there are two GPText nodes in each directory.
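To double-check a directory list before passing it to the -p option, you can count how many new nodes each directory would receive. The function name nodes_per_directory in this sketch is hypothetical. It only inspects the comma-separated list and does not call gptext-expand.

```shell
# nodes_per_directory (hypothetical helper): given a comma-separated -p list,
# prints each directory with the number of new nodes it would receive.
nodes_per_directory() {
  printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -c | awk '{ print $2 " " $1 }'
}

nodes_per_directory /data1/nodes,/data2/nodes,/data2/nodes
# prints:
#   /data1/nodes 1
#   /data2/nodes 2
```

The sketch assumes directory names contain no spaces or commas.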

Adding GPText nodes affects new indexes, but not existing indexes. Replicas for new indexes will be distributed across all of the nodes, including both old nodes and the newly created nodes. Replicas for indexes that existed before running gptext-expand are not automatically moved. You can use the gptext-rebalance command to relocate replicas to new nodes.

Adding GPText Nodes to New Hosts

Check that the following GPText prerequisites are installed on each new host added to the Greenplum Database cluster:

  • Java 1.8
  • Python 2.6 or greater
  • Linux lsof utility
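A quick way to check these prerequisites on a host is to test whether the commands are on the PATH. The helper below is a sketch with a hypothetical name. It assumes the interpreters are installed under the command names java and python, and it checks presence only, not the Java or Python version.

```shell
# missing_commands (hypothetical helper): prints each named command that is
# not found on the PATH of the current host.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Run on each new host; empty output means all three commands were found.
missing_commands java python lsof
```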

New hosts must be reachable by all hosts in the GPText cluster, including existing hosts and the new hosts you are adding.

After expanding the Greenplum Database cluster with the gpexpand management utility, call gptext-expand with the -H (--new_hosts) option and a list of the new hosts on which to install GPText:

gptext-expand -H newhost1,newhost2

The gptext-expand utility installs GPText binaries on the new hosts and then creates new GPText nodes on the new hosts.

Newly created indexes will automatically be distributed among the new nodes. You can use the gptext-rebalance command to relocate replicas to new nodes.


Troubleshooting

GPText errors are of the following types:

  • Solr errors
  • gptext errors

Most of the Solr errors are self-explanatory.

gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.

Monitoring Logs

You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:


Solr logs reside in:

<GPDB path>/solr/logs

Determining Segment Status with gptext-state

Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Management Utilities Reference.