Pivotal® GPText 3.4.5 Release Notes
This document contains release information for Pivotal GPText 3.4.5.
Published: September 18, 2020
Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.
GPText includes the following features:
- The GPText database schema provides in-database access to Apache Solr indexing and searching
- Build indexes with database data or external documents and search with the GPText API
- Custom tokenizers for international text and social media text
- A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
- Faceted search results
- Term highlighting in results
- Natural language processing, including part-of-speech tagging and named entity extraction
- Greater emphasis on high availability
The GPText management utility suite includes command-line utilities to perform the following tasks:
- Start, stop, and monitor ZooKeeper and GPText nodes
- Configure GPText nodes and indexes
- Add and delete replicas for index shards
- Back up and restore GPText indexes
- Recover a GPText node
- Expand the GPText cluster by adding GPText nodes
Installing GPText also installs Apache Solr Cloud and, optionally, Apache ZooKeeper.
Following are GPText installation prerequisites.
- GPText 3.4.5 runs on Red Hat Enterprise Linux 6 and 7.
- GPText 3.4.5 runs with Greenplum Database versions 4 and 5, and with version 6.5.0 and later. GPText is not compatible with Greenplum Database 6 releases earlier than 6.5.0 due to an ABI incompatibility.
- GPText requires Java 8 or OpenJDK 8 to be installed on each host in the Greenplum Database cluster. Add the JRE bin directory to the PATH on all hosts in the cluster.
- Install and configure your Greenplum Database system before you install GPText. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
- Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc). Installing lsof on all cluster hosts is also recommended (sudo yum install lsof).
- GPText cannot be installed onto a shared NFS mount.
- GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
- If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator web site.
- Apache Solr requires a ZooKeeper cluster with at minimum three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
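The memory reservation described in the prerequisites above can be worked through with a short calculation. The following is a minimal sketch only: the function name and the host figures (256 GB of RAM, two GPText nodes, a 4 GB JVM maximum size) are hypothetical. The result is the RAM figure to carry into the gp_vmem_protect_limit formulas in the Greenplum Database Reference Guide, not the parameter value itself.

```python
def ram_available_to_greenplum_gb(physical_ram_gb, gptext_nodes_per_host, jvm_max_size_gb):
    """Return the RAM left for Greenplum Database on one segment host
    after setting aside the JVM maximum size for every GPText node."""
    reserved_gb = gptext_nodes_per_host * jvm_max_size_gb
    if reserved_gb >= physical_ram_gb:
        raise ValueError("GPText reservation exceeds the physical RAM of the host")
    return physical_ram_gb - reserved_gb


# Hypothetical segment host: 256 GB RAM, 2 GPText nodes, 4 GB JVM maximum size each.
print(ram_available_to_greenplum_gb(256, 2, 4))  # prints 248
```

Use the reduced figure in place of the physical RAM when you apply the documented memory formulas.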
GPText 3.4.5 is a maintenance release. It contains the following fixes and changes:
- Fixed an issue where the GPText installer trimmed the domain name part of a hostname; for example, mdw.prod.vmware.com would be trimmed to mdw. This caused the GPText installation to fail in environments where the DNS server expected hostnames with the Fully Qualified Domain Name (FQDN).
GPText 3.4.4 is a maintenance release. It contains the following fixes and changes:
- Performance problems were observed after the Solr default router changed to compositeId in GPText 3.1.0. To address this problem, GPText 3.4.4 restores the Solr default router to implicit. To use the compositeId router, you must specify a non-zero value for
- Fixed a type mapping issue that caused GPText to use the same conversion function for the Greenplum column types timestamp without timezone and timestamp with timezone.
GPText 3.4.3 is a maintenance release. It contains the following fixes:
- The gptext-rebalance node command, a Beta feature, is now deprecated. Use the gptext-rebalance index command for similar results.
- GPText now by default keeps up to 100 Solr logs on each Solr node.
- Fixed an issue with upgrading GPText.
GPText 3.4.2 is a maintenance release. It contains the following fixes.
- Support for Greenplum Database 6.5.0 and later. GPText 3.4.2 is not compatible with Greenplum Database versions earlier than 6.5.0 due to an ABI compatibility issue.
- Fixes that allow GPText to support symbolic links.
GPText 3.4.1 is a maintenance release. It contains fixes for these issues.
- gptext-migrator failed to install the GPText shared library after upgrading Greenplum Database.
- The gptext-state utility could take a long time to return results if a replica was in recovery mode. This issue is resolved.
Solr now uses log4j 2.11. The log4j configuration file name is now
The GPText installer will display an error message and exit if the operating system version and Greenplum Database version are not compatible with the installer version.
Setting the value of the GPTXTHOME environment variable to a symlink to the GPText installation directory caused GPText to fail in some cases. Upgrading GPText failed because the GPTXTHOME environment variable was inconsistent with the value in the gptxtenvs.conf file. This issue has been fixed. GPText now supports creating a symbolic link to the installation directory.
Upgraded Apache Solr to version 7.4.0.
(Beta) The new gptext-rebalance node command rebalances the GPText cluster by relocating replicas to new nodes. Use this command after you add hosts or nodes to the cluster with
gptext-rebalance in the Utility Reference for help using this utility. Note: gptext-rebalance node may fail because of a Solr bug. See Known Issues. You can use the gptext-rebalance index command to work around this issue.
The gptext-expand command has a new -b (--binary-only) option that is used to copy only the GPText installation directories to the new hosts without starting Solr nodes on the new hosts. Using this option allows you to verify GPText operation with the expanded cluster. The -b option can only be used with the -H option, and the new hosts must be Greenplum Database hosts.
gptext-stop utilities now check for zombie Solr processes and report them if found. The gptext-stop utility verifies that all Solr processes are closed.
Improved ZooKeeper stability.
- ZooKeeper and Solr configurations have been modified to increase timeouts to better tolerate small fluctuations in ZooKeeper response.
- ZooKeeper JVM memory is configured to a small heap size to reduce the frequency of long GC pauses.
- Added a ZooKeeper GC log to track ZooKeeper garbage collections.
- When a ZooKeeper timeout occurs, GPText retries the query ten times in the following five minutes before the query fails.
- Added ZooKeeper Best Practices with steps users can take to optimize ZooKeeper performance.
When indexing external documents in a directory using gptext.index_external_dir(), if one document failed to be added to the index, other documents in the same directory could fail. This is fixed. GPText now sends a request to Solr to get the files to be indexed and then indexes them individually with the
A GPText query could time out when ZooKeeper was under heavy load, leaving the ZooKeeper connection handle in an invalid state in the Greenplum Database session. The query would fail with an error message invalid zhandle state, and it was necessary to start a new session to continue using GPText. Now, after a ZooKeeper timeout, GPText retries the query ten times in the following five minutes before the query fails. The retry attempts are not visible to users, but they are logged. See also “Improved ZooKeeper stability” in the New Features and Enhancements section.
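The retry behavior described above (up to ten retries within five minutes of a timeout) can be illustrated with a generic retry loop. This is a sketch of the general pattern, not GPText's implementation: the function name, the even spacing between retries, and the use of TimeoutError are assumptions made for illustration.

```python
import time


def run_with_retries(query, max_retries=10, window_seconds=300.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Run query(); after a timeout, retry up to max_retries times
    within window_seconds, then re-raise the last timeout."""
    try:
        return query()
    except TimeoutError as err:
        last_error = err  # the first timeout opens the retry window

    deadline = clock() + window_seconds
    pause = window_seconds / max_retries  # assumption: evenly spaced retries

    for _ in range(max_retries):
        if clock() >= deadline:
            break
        sleep(pause)
        try:
            return query()
        except TimeoutError as err:
            last_error = err
    raise last_error


# Usage: a stand-in query that times out three times, then succeeds.
attempts = {"count": 0}


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise TimeoutError("simulated ZooKeeper timeout")
    return "ok"


# A no-op sleep keeps the demonstration instant.
result = run_with_retries(flaky_query, sleep=lambda _seconds: None)
print(result, attempts["count"])  # prints: ok 4
```

The caller sees only the final result or the final error, which matches the note that retry attempts are not visible to users.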
With a very large index, the number of documents could exceed the maximum value of the integer data type, causing the gptext.index_size() function to return an “integer out of range” error. This has been fixed. The function now returns a bigint type.
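To make the range problem concrete, the sketch below compares a document count with the upper bounds of a 4-byte integer and an 8-byte bigint. The document count is a hypothetical figure chosen to exceed the 4-byte limit.

```python
INT32_MAX = 2**31 - 1  # upper bound of the 4-byte integer type (2147483647)
INT64_MAX = 2**63 - 1  # upper bound of the 8-byte bigint type


def fits(value, upper_bound):
    """Return True if a non-negative count fits within the given upper bound."""
    return 0 <= value <= upper_bound


doc_count = 3_000_000_000  # hypothetical number of documents in a very large index

print(fits(doc_count, INT32_MAX))  # prints False
print(fits(doc_count, INT64_MAX))  # prints True
```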
Using GPText 3.3 with Greenplum Database 6.0
GPText 3.3.1 can be installed on a Greenplum Database 6 system with Java 8.
A GPText binary distribution has been added to Pivotal Network for Red Hat 7/CentOS 7 with Greenplum Database 6.
Note: The “Greenplum Text 3.3.1 for RHEL 7” distribution is for Greenplum Database 6.x only. Download the RHEL 6 distribution if you are installing GPText into a Greenplum Database 5.x system.
Using GPText with Greenplum Database 6 differs from using it with earlier Greenplum Database releases in the following ways:
- The custom_variable_classes server configuration parameter has been removed in Greenplum Database 6. With earlier Greenplum Database versions, it was necessary to add 'gptext' to this parameter in order to set GPText configuration parameters. Greenplum Database 6 allows you to set configuration parameters in a database session without declaring a variable class.
- In Greenplum Database 4 and 5, the default output format for the binary data type bytea is the PostgreSQL escape format, a sequence of ASCII characters with escape sequences where bytes cannot be represented with ASCII. In Greenplum Database 6, the default output format is the hex format, which represents each byte with hexadecimal digits. In Greenplum Database 5, the hex output format can be specified by setting the bytea_output configuration parameter to hex. To produce the same output in Greenplum Database 4, 5, and 6, you can set the bytea_output configuration parameter to
Custom Configuration Directory
A new optional installation parameter, GPTEXT_CUSTOM_CONFIG_DIR, can be set in the gptext_install_config file to specify a directory to store custom configuration files.
By default, GPText saves custom configuration files under the $GPTXTHOME/share/ directory on each Solr host, for example
To specify a different directory to store external configuration files, before you run the GPText installer, uncomment the GPTEXT_CUSTOM_CONFIG_DIR parameter in the gptext_install_config file and specify the full path to the directory. For example:
The gpadmin user must have the OS permissions required to create the directory.
If the parameter is set, the GPText installer will create the custom configuration directory on every Solr host. Configuration files you upload using the gptext-external upload command will be stored under this directory on every Solr host to allow Solr to access the external document source from every host. For example, if the GPTEXT_CUSTOM_CONFIG_DIR parameter is set to /home/gpadmin/config_dir when you install GPText, an s3 configuration with the name s3_conf will be saved in the directory /home/gpadmin/config_dir/external_source/s3/s3_conf on each host.
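The directory layout in the example above can be expressed as a small path helper. This is a sketch that generalizes from the single s3 example in these notes: the function name is hypothetical, and the assumption that other source types follow the same external_source/<type>/<name> pattern is not confirmed here.

```python
from pathlib import PurePosixPath


def external_config_path(custom_config_dir, source_type, config_name):
    """Build the path where an uploaded external-source configuration is stored,
    following the layout shown in the s3 example (assumed to hold for other types)."""
    return PurePosixPath(custom_config_dir) / "external_source" / source_type / config_name


print(external_config_path("/home/gpadmin/config_dir", "s3", "s3_conf"))
# prints /home/gpadmin/config_dir/external_source/s3/s3_conf
```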
New Features and Enhancements in GPText 3.2.0
The GPText 3.2.0 release provides the following features and enhancements.
GPText 3.2.0 enables lemmatizing terms in GPText indexes. You can define Solr analysis chains that include the Apache OpenNLP parts-of-speech filter and the new GPText WordNetLemmatizer filter, which replaces terms with the root form of the term. The WordNetLemmatizer filter uses a lexical database from the Princeton University WordNet® project to determine the root form.
GPText Configuration Files Location
GPText now saves configuration files
zookeeper.conf only in the Greenplum Database master and standby master directories. The gptext.conf file is no longer saved in each segment data directory.
By default, GPText creates one Solr index shard for each Greenplum Database primary segment. You can now specify a smaller number of shards by setting the gptext.idx_num_shards parameter to the number of shards you want before you create the index. This works for both regular GPText indexes and external indexes.
In GPText 3.2.0, when gptext.idx_num_shards is set to the default (0), GPText configures the index to use the Solr implicit router, with one shard per Greenplum Database segment. When the gptext.idx_num_shards parameter is changed to the number of shards desired, GPText creates the index using the Solr compositeId router to route documents to shards. The compositeId router does not support duplicate IDs, so if you set the if_check_id_uniqueness argument to false when you call the gptext.create_index() function, the implicit router is used, and the index will have one shard per Greenplum Database segment.
Note: The Solr default router is restored to implicit in GPText versions 3.4.4 and higher to address performance issues.
The content_id column is removed from the output of the
gptext.index_summary() functions, since Greenplum Database segments are not always associated with a single index shard.
See Specifying the Number of Shards for more information about this feature.
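The router selection rules described in this section can be summarized as a small decision function. This is a sketch of the documented behavior only, not GPText code: the function name and return shape are hypothetical, while gptext.idx_num_shards and if_check_id_uniqueness are the parameter and argument named above.

```python
def plan_index_layout(idx_num_shards, num_primary_segments, if_check_id_uniqueness=True):
    """Return (router, shard_count) following the rules described above."""
    if idx_num_shards < 0:
        raise ValueError("gptext.idx_num_shards must not be negative")
    # Default (0), or uniqueness checking disabled: implicit router,
    # one shard per Greenplum Database segment.
    if idx_num_shards == 0 or not if_check_id_uniqueness:
        return ("implicit", num_primary_segments)
    # A non-zero shard count with uniqueness checking: compositeId router.
    return ("compositeId", idx_num_shards)


# Hypothetical cluster with 8 primary segments.
print(plan_index_layout(0, 8))         # prints ('implicit', 8)
print(plan_index_layout(4, 8))         # prints ('compositeId', 4)
print(plan_index_layout(4, 8, False))  # prints ('implicit', 8)
```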
When using the --force option, the gptext-recover utility now verifies that there are no indexes in a red state before proceeding. If any index is down, the utility exits.
Apache ZooKeeper included with GPText 3.2.0 has been upgraded to version 3.4.11. This ZooKeeper release includes bug fixes that resolve an inconsistent cluster issue with GPText (MPP-29742).
New Features and Enhancements in GPText 3.1.0
The GPText 3.1.0 release provides the following features and enhancements.
Improvements to aid in developing and testing analyzer chains
- The gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.
- The gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.
- The gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.
Part-of-speech tagging and named entity recognition
GPText includes OpenNLP libraries and analyzer classes to classify indexed terms’ parts-of-speech (POS), and to recognize named entities, such as the names of persons, locations, and organizations (NER). GPText saves NER terms in the field’s terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.
The gptext.ner_terms() function lists NER-tagged terms for documents that match a query.
GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP web site and use them with GPText.
Other enhancements and fixes
- The first argument of the gptext.terms() function, an anytable data type, has been made optional.
- Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.
Apache Solr updated to Solr version 7.3
GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.
Following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.
GPText server-side components are rebuilt and tested with the new Solr JAR files.
solrconfig.xml and other collection configuration files are updated.
solrconfig.xml is now officially deprecated in favor of the equivalent <searchComponent> syntax. This element has been out of use in default Solr installations for several releases already.
The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json, that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. It is possible to revert to the old behavior by setting the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:
$ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true
With earlier Solr releases, if you drop an index while a Solr node with a replica of the index is down, when the down node comes back on-line, the index comes back and cannot be deleted. Solr 7 fixes this bug. The GPText workaround for this bug is removed.
PointFields are default numeric types. Solr has implemented *PointField types across the board, to replace Trie* based numeric fields. All Trie* fields are now considered deprecated, and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.
The following spatial-related fields have been deprecated:
- LatLonType
- GeoHashField
- FieldType
- SpatialTermQueryPrefixTreeFieldType
Use one of these field types instead:
- LatLonPointSpatialField
- SpatialRecursivePrefixTreeField
- RptWithGeometrySpatialField
To improve parameter consistency in the Collections API, the parameter names fromNode for the MOVEREPLICA command and source, and target for the REPLACENODE command have been deprecated and replaced with targetNode instead. The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.
The replica core name has changed from
<collection_name>_shard#_replica_<node_type>#. For example,
New Features and Enhancements in GPText 3.0.0
GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type for the
The gptext-state utility with the -i option now includes the date and time the GPText index was last modified.
See the Apache Jira for known issues in Apache Solr.
Following are known issues in GPText. Workarounds are provided when available.
Solr Log4j Log is Missing in GPText 3.4.0
 Solr 7.4 uses Log4j version 2.11. In Log4j 2.11 the configuration file name changed from
log4j.xml, but the file name was not changed in GPText 3.4.0. Due to this issue, no new lines are added to
This issue is fixed in GPText 3.4.1.
Upgrading to GPText 3.4.1 fixes the issue. If you are unable to upgrade to 3.4.1, you can follow these steps to manually fix the issue in your GPText 3.4.0 system:
Download the new log4j2.xml configuration file from https://raw.githubusercontent.com/apache/lucene-solr/master/solr/server/resources/log4j2.xml.
Copy the log4j2.xml file to every Solr data directory, for example
Update the startup parameters file solr.in.sh in every Solr data directory.
Change this line:
$ gptext-start -r
You will find the following files in the Solr log directory, for example
- solr-18983-console.log
- solr_gc.log.0.current
- solr.log
- solr_slow_requests.log
- Verify that messages are now written to the solr.log file by executing a GPText operation such as
gptext-migrator Fails to Install the GPText Shared Library After Greenplum Database Upgrade
(Fixed in GPText 3.4.1) When you upgrade Greenplum Database and then migrate your existing GPText installation to the new Greenplum Database installation, the gptext-migrator utility in some cases fails to install the GPText UDF library to the new Greenplum Database
gptext-migrator outputs the message [INFO]:-UDF libraries are installed in $GPTXTHOME/lib, don't need to migrate.
The message is correct only if you installed GPText binaries to a shared drive following the Optional Two-Part GPText Installation method.
Create a host file containing a list of all Greenplum Database hosts.
Make sure the gpadmin user has write permission in the $GPHOME/lib/postgresql directory of the new Greenplum Database installation directory on every Greenplum Database host.
Use the gpscp utility to copy the GPText UDF library from the old Greenplum Database installation to the new Greenplum Database installation.
$ gpscp -f hostfile /usr/local/greenplum-db-<old-version>/lib/postgresql/gptext*.so \
    =:/usr/local/greenplum-db-<new-version>/lib/postgresql/
See Upgrading GPText for more information.
The gptext-rebalance node Command May be Unable to Execute
The gptext-rebalance node command (beta) may fail with a message ERROR: Utilize Node failed due to a Solr bug. (See SOLR 13240.) You can use the gptext-rebalance index command to work around the issue. NOTE: The gptext-rebalance node command is deprecated.
Wildcards in GPText Search Options
Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names.
For example, given a table with columns
fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying
fl=cont*,sum(1,1) correctly returns
contentb, but omits the pseudo-field
Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
Index Load Failure After Configuration File Error
If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit solrconfig.xml and introduce an XML syntax error or a typo in configuration values.
- When an index fails to load, check the Solr log to find the cause.
- If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.
- If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.
Startup Failure with Large Numbers of Indexes
When there is a large number of Solr cores, Solr Cloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, it is recommended to avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.
Setting GPText Configuration Parameters Without First Setting custom_variable_classes
In Greenplum Database versions before Greenplum Database 6, if the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:
mydb-# set gptext.replication_factor = 4;
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"
In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.
mydb-# show gptext.replication_factor;
 gptext.replication_factor
----------------------------
 0
Beginning with GPText 2.1, the error message is still generated; however, the value saved in ZooKeeper is the value specified in the set command (4 in the preceding example).
To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:
$ gpconfig -c custom_variable_classes -v 'gptext'
In Greenplum Database 6.0, the custom_variable_classes configuration parameter is removed and custom parameters can be set without errors.