Pivotal® GPText 2.2.1 Release Notes
This document contains release information for Pivotal GPText 2.2.1.
Released: February 2018
Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.
GPText includes the following features:
- The GPText database schema provides in-database access to Apache Solr indexing and searching
- Build indexes with database data or external documents and search with the GPText API
- Custom tokenizers for international text and social media text
- A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
- Faceted search results
- Term highlighting in results
- Greater emphasis on high availability
The GPText management utility suite includes command-line utilities to perform the following tasks:
- Start, stop, and monitor ZooKeeper and GPText nodes
- Configure GPText nodes and indexes
- Add and delete replicas for index shards
- Back up and restore GPText indexes
- Recover a GPText node
- Expand the GPText cluster by adding GPText nodes
Installing GPText also installs Apache Solr Cloud and, optionally, Apache ZooKeeper.
Following are GPText installation prerequisites.
- GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.
- Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
- Install Java JRE 1.8.x and add the bin directory to the PATH on all hosts in the cluster. GPText is tested with Oracle Java 1.8 and OpenJDK 1.8.
- Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc). Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
- GPText cannot be installed onto a shared NFS mount.
- GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
- If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator web site.
- Apache Solr requires a ZooKeeper cluster with at minimum three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
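The memory reservation described in the prerequisites is simple arithmetic, sketched below in shell. The host size (256 GB), node count (2), and JVM maximum size (4 GB) are illustrative assumptions, not recommendations, and the function names are hypothetical. The second figure is what you would use in place of physical RAM in the gp_vmem_protect_limit formulas from the Greenplum Database Reference Guide.

```shell
#!/bin/sh
# All sizes are in MB. Function names and sample values are illustrative.

# Memory to set aside for GPText on one segment host:
# (GPText nodes per host) x (JVM maximum size).
gptext_reserved_mb() {
  echo $(( $1 * $2 ))
}

# RAM left for the Greenplum memory formulas:
# (physical RAM) - (GPText reservation).
ram_for_greenplum_mb() {
  echo $(( $1 - $(gptext_reserved_mb "$2" "$3") ))
}

physical_ram_mb=262144   # assumed 256 GB host
nodes_per_host=2         # assumed GPText nodes on each segment host
jvm_max_mb=4096          # assumed JVM maximum size of 4 GB

echo "Reserved for GPText: $(gptext_reserved_mb "$nodes_per_host" "$jvm_max_mb") MB"
echo "RAM for Greenplum calculations: $(ram_for_greenplum_mb "$physical_ram_mb" "$nodes_per_host" "$jvm_max_mb") MB"
```

With these sample values, 8192 MB is reserved for GPText and 253952 MB remains as the starting point for the Greenplum calculation.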
GPText release 2.2.1 includes the following new features and improvements.
Certified GPText 2.2.1 with Red Hat Enterprise Linux 7.x and CentOS 7.x.
Certified GPText 2.2.1 with OpenJDK 1.8.
The gptext-uninstall management utility now requires confirmation before uninstalling GPText.
A new GPText function, gptext.index_size(), shows the number of documents indexed and total disk space used by GPText indexes.
Added a stats command to the gptext-state command-line utility to show the number of documents indexed and total disk space used by GPText indexes.
Added a boolean external_index column to the gptext.index_status() UDF to indicate if an index is a GPText external index.
Fixed an error that occurred with the gptext-state healthcheck command when external indexes were present.
Fixed a bug that sometimes produced the message, WARNING: relcache reference leak: relation “error_table” not closed, when indexing external documents.
When specifying an error table name with the gptext.index_external() UDF, an error that causes the transaction to roll back would remove the error table. For this reason, the optional <error_table> argument is no longer supported. Errors are logged to the default table, gptext.error_table. You can call a new UDF, gptext.recreate_error_table(), to ensure that the gptext.error_table table is empty before you index external documents.
Fixed an error that occurred when running gptext-upgrade because there was no retry when checking the collection status during a rolling start.
For regular GPText indexes, the enableRemoteStreaming attribute of the <requestParsers> element is now set to false in the solrconfig.xml configuration file. Remote streaming is only used to index external documents.
GPText release 2.2.0 includes the following new features and improvements.
GPText External Indexes
You can use GPText to index and search documents that reside outside of Greenplum Database. External documents can be of any type supported by the Apache Tika library, which is included with Apache Solr. Apache Tika supports many document types, including PDF, Microsoft Word, XML, HTML, email, and other rich document types.
New functions are added to create, search, and manage GPText external indexes:
- gptext.create_index_external() creates an external index.
- gptext.index_external() adds external documents to an index.
- gptext.search_external() searches external indexes and extracts metadata fields to columns. You can also search with the standard
- gptext.highlight_external() searches an external index and highlights search terms in the results with markup tags.
A new table,
gptext.error_table, has been added. A row is added to this table when Solr fails to add an external document to an index, for example if the document cannot be retrieved from the supplied URL.
Existing GPText functions and management utilities are updated with support for external indexes.
See Working With GPText External Indexes to learn more about the differences between standard GPText indexes and GPText external indexes.
By default, GPText 2.2 uses the Solr Unified Highlighter.
If a field is stored in the GPText index, highlighted text can be obtained from Solr, without having to use the gptext.highlight() function. This is useful if you store fields in GPText and then drop the original database table. Before you can use this feature, you must update the managed-schema configuration file for the index. See Highlighting Terms in Stored Fields for steps to enable and use this feature.
Apache Solr Version Upgrade
Apache Solr version 6.6.2 is included with GPText version 2.2.0. Solr 6.6.2 includes an important security fix and other fixes. See a list of changes in the Apache Solr Release Notes.
New Monitoring Functions
The gptext.live_nodes() function lists the host, port, data directory, and up or down status of each GPText/Solr node.
The gptext.zookeeper_hosts() function lists the host names and ports of the ZooKeeper instances.
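As a rough illustration, the two monitoring functions could be queried from the command line with psql. This sketch assumes that both functions take no arguments and can be called with SELECT * FROM, which their empty parentheses suggest but these notes do not spell out. The helper monitor_cmd and the database name mydb are placeholders. The script only prints the commands so that you can review them before running anything.

```shell
#!/bin/sh
# Build the psql command for one GPText monitoring function.
# Assumption: the function takes no arguments and returns rows,
# so it can be queried with SELECT * FROM.
monitor_cmd() {
  printf 'psql -d %s -c "SELECT * FROM gptext.%s();"\n' "$1" "$2"
}

# "mydb" is a placeholder for a database where GPText is installed.
monitor_cmd mydb live_nodes
monitor_cmd mydb zookeeper_hosts
```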
Optional Two-part Installation Procedure
A new option is provided to install GPText in two parts. The first part prepares the GPText installation directories and installs the binary distribution package. The second part deploys the GPText cluster using your customized installation configuration file.
See Optional Two-part GPText Installation for instructions.
Upgraded Apache Solr to release 6.6.2 due to security vulnerabilities identified in Apache Solr CVE-2017-12629.
After dropping and recreating an indexed partitioned table, the gptext.partition_status_recursive() functions output a message, “Failed to look up a relation( : ) in the system catalog. Function: GetOid …”. This issue has been fixed.
Editing a configuration file using the gptext-config utility failed unless the -e <editor> option was added to the command. This is fixed.
Following are known issues in GPText. Workarounds are provided when available.
Wildcards in GPText Search Options
Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names.
For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) returns contenta and contentb, but omits the pseudo-field sum(1,1).
Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
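These notes do not list a workaround for this issue. Because the explicit field list in the example returns every requested field, one possible approach is to expand the wildcard on the client before building the fl value. The sketch below does that in POSIX shell. The helper expand_fl is hypothetical, and the column names id and title are invented to show names that the pattern does not match.

```shell
#!/bin/sh
# expand_fl PATTERN FIELD...
# Print the FIELD names that match the glob PATTERN as a comma-separated
# list. The parenthesized body runs in a subshell, so variables stay local.
expand_fl() (
  pattern=$1
  shift
  out=""
  for name in "$@"; do
    case $name in
      # Unquoted $pattern is treated as a glob by case.
      $pattern) out="${out:+$out,}$name" ;;
    esac
  done
  printf '%s\n' "$out"
)

# Column names are assumptions for illustration.
fl="$(expand_fl 'cont*' id contenta contentb title),sum(1,1)"
echo "fl=$fl"
```

The resulting value names each field explicitly, which is the form the example above reports as returning all three fields.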
Index Load Failure After Configuration File Error
If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit solrconfig.xml and introduce an XML syntax error or a typo in configuration values.
- When an index fails to load, check the Solr log to find the cause.
- If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.
- If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.
Startup Failure with Large Numbers of Indexes
When there is a large number of Solr cores, Solr Cloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.
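For a rough sense of scale: in Solr, each replica of each shard is hosted as one core, so the core count grows as the product of indexes, shards per index, and replicas per shard. The sketch below assumes every index has the same shard and replica counts. The numbers are invented for illustration and say nothing about where a given system's limit lies, and estimate_cores is a hypothetical helper.

```shell
#!/bin/sh
# Estimate the number of Solr cores: indexes x shards x replicas.
# Assumes uniform shard and replica counts across all indexes.
estimate_cores() {
  echo $(( $1 * $2 * $3 ))
}

# Illustrative values only: 50 indexes, 16 shards each, 2 replicas per shard.
echo "Estimated cores: $(estimate_cores 50 16 2)"
```

With these sample values the estimate is 1600 cores, which shows how quickly a modest number of indexes multiplies.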
Setting GPText Configuration Parameters Without First Setting custom_variable_classes
If the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:
mydb-# set gptext.replication_factor = 4;
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"
In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.
mydb-# show gptext.replication_factor;
 gptext.replication_factor
----------------------------
 0
Beginning with GPText 2.1, the error message is still generated, but the value saved in ZooKeeper is the value specified in the set command (4 in the preceding example).
To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:
$ gpconfig -c custom_variable_classes -v 'gptext'
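The command above sets custom_variable_classes to the single value gptext. If the parameter already lists other classes on your system, you may want to keep them. The sketch below assumes the value is a comma-separated list of class names and appends gptext only when it is missing. The helper add_class is hypothetical, plr is just an example of a pre-existing class, and looking up the current value is left to you.

```shell
#!/bin/sh
# add_class CURRENT CLASS
# Print CURRENT with CLASS appended, unless CLASS is already listed.
# The parenthesized body runs in a subshell, so variables stay local.
add_class() (
  current=$(printf '%s' "$1" | tr -d ' ')   # drop spaces around commas
  class=$2
  case ",$current," in
    *",$class,"*) printf '%s\n' "$current" ;;          # already listed
    ",,")         printf '%s\n' "$class" ;;            # list was empty
    *)            printf '%s\n' "$current,$class" ;;   # append
  esac
)

add_class "" gptext              # prints: gptext
add_class "plr" gptext           # prints: plr,gptext
add_class "plr, gptext" gptext   # prints: plr,gptext

# The result would then be passed to gpconfig, for example:
#   gpconfig -c custom_variable_classes -v "$(add_class "$existing" gptext)"
```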