Pivotal® GPText 3.1.0 Release Notes
This document contains release information for Pivotal GPText 3.1.0.
Released: September 2018
Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.
GPText includes the following features:
- The GPText database schema provides in-database access to Apache Solr indexing and searching
- Build indexes with database data or external documents and search with the GPText API
- Custom tokenizers for international text and social media text
- A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
- Faceted search results
- Term highlighting in results
- Natural language processing, including part-of-speech tagging and named entity extraction
- Greater emphasis on high availability
The GPText management utility suite includes command-line utilities to perform the following tasks:
- Start, stop, and monitor ZooKeeper and GPText nodes
- Configure GPText nodes and indexes
- Add and delete replicas for index shards
- Back up and restore GPText indexes
- Recover a GPText node
- Expand the GPText cluster by adding GPText nodes
Installing GPText also installs Apache Solr Cloud and, optionally, Apache ZooKeeper.
Following are GPText installation prerequisites.
- GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.
- Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
- Install Java JRE 1.8.x and add the bin directory to the PATH on all hosts in the cluster. GPText is tested with Oracle Java 1.8 and OpenJDK 1.8.
- Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).
- Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
- GPText cannot be installed onto a shared NFS mount.
- GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
- If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator web site.
- Apache Solr requires a ZooKeeper cluster with at minimum three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
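The memory reservation described in the prerequisites above can be worked through with simple shell arithmetic. The following is a minimal sketch and is not part of GPText: the function name and the sample figures (a host with 64 GB of RAM, two GPText nodes per host, a 4 GB JVM maximum size) are assumptions chosen only for illustration. The result is the reduced RAM figure to use as input to the gp_vmem_protect_limit formulas in the Greenplum Database Reference Guide, not the parameter value itself.

```shell
#!/bin/sh
# Hypothetical helper (not a GPText utility): RAM remaining for Greenplum
# after reserving JVM memory for the GPText nodes on one segment host.
# Arguments: physical RAM (MB), GPText nodes per host, JVM maximum size (MB).
ram_after_gptext_mb() {
    ram_mb=$1
    nodes_per_host=$2
    jvm_max_mb=$3
    # Memory to set aside = nodes per host x JVM maximum size.
    reserved_mb=$((nodes_per_host * jvm_max_mb))
    echo $((ram_mb - reserved_mb))
}

# Assumed example: 65536 MB host, 2 GPText nodes, 4096 MB JVM maximum.
ram_after_gptext_mb 65536 2 4096   # prints 57344
```

With these assumed figures, 8192 MB is set aside for GPText and 57344 MB remains as the physical RAM value for the Greenplum memory calculation.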
The GPText 3.1.0 release provides the following features and enhancements.
Improvements to aid in developing and testing analyzer chains
The gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.
The gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.
The gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.
Part-of-speech tagging and named entity recognition
GPText includes OpenNLP libraries and analyzer classes to classify indexed terms’ parts-of-speech (POS), and to recognize named entities, such as the names of persons, locations, and organizations (NER). GPText saves NER terms in the field’s terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.
The gptext.ner_terms() function lists NER-tagged terms for documents that match a query.
GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP web site and use them with GPText.
Other enhancements and fixes
The first argument of the gptext.terms() function, an anytable data type, has been made optional.
Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.
Apache Solr updated to Solr version 7.3
GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.
Following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.
GPText server-side components are rebuilt and tested with the new Solr JAR files.
The solrconfig.xml and other collection configuration files are updated.
An element in solrconfig.xml is now officially deprecated in favor of the equivalent <searchComponent> syntax. This element has been out of use in default Solr installations for several releases already.
The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json, that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. To revert to the old behavior, set the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:
$ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true
With earlier Solr releases, if you drop an index while a Solr node with a replica of the index is down, when the down node comes back on-line, the index comes back and cannot be deleted. Solr 7 fixes this bug. The GPText workaround for this bug is removed.
PointFields are default numeric types. Solr has implemented *PointField types across the board, to replace Trie* based numeric fields. All Trie* fields are now considered deprecated, and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.
The following spatial-related fields have been deprecated:
Use one of these field types instead:
To improve parameter consistency in the Collections API, the parameter names fromNode for the MOVEREPLICA command and source, and target for the REPLACENODE command have been deprecated and replaced with targetNode instead. The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.
The replica core name format has changed. Core names now take the form <collection_name>_shard#_replica_<node_type>#.
- GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type.
- The gptext-state utility with the -i option now includes the date and time the GPText index was last modified.
- GPText 2.4.0 allows adding documents stored in an authenticated FTP server to a GPText external index. This enhancement includes changes to add support for the ftp type to the gptext-external upload command-line utility.
- The gptext-backup command-line utility can now back up GPText indexes to local GPText cluster storage as well as to a directory on a shared drive. For local backups, backup metadata and the index configuration files are backed up to the Greenplum Database master data directory and index shards are backed up in the segment data directories on each host.
- The gptext-backup utility has a new option to back up just the index configuration files from ZooKeeper, with no index data.
- The gptext-restore utility is updated to restore backups created on local cluster storage.
- The gptext-restore utility has a new option to restore only the configuration files from a backup. This option loads the configuration files into ZooKeeper and creates an empty GPText index.
Revised gptext-config Utility Syntax
The gptext-config command-line utility was revised to have a more user-friendly syntax.
A list subcommand was added to gptext-config, which you can use to list all of the configuration files for a specified GPText index.
$ gptext-config list -i <index-name>
Index Documents in a Hadoop File System (hdfs) Document Source
GPText 2.3.0 enables you to add documents stored in an hdfs system to a GPText external index.
- The new gptext-external command-line utility uploads Hadoop configuration and authentication files to a named configuration in ZooKeeper. The utility has subcommands, including delete, to manage the configurations you have uploaded.
- The new gptext.external_login() function logs in to the hdfs system using the named configuration you have uploaded. You can log in to only one external document source at a time.
- Use hdfs URLs with the gptext.index_external() functions to add documents to a GPText external index.
- Use the new gptext.index_external_dir() function to add all documents in an hdfs directory to a GPText external index.
- Log out of the hdfs external document source with the new
See Authenticating with an External Document Source for steps to enable access to an hdfs document source.
See the Apache Jira for known issues in Apache Solr.
Following are known issues in GPText. Workarounds are provided when available.
Wildcards in GPText Search Options
Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names.
For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) correctly returns contenta and contentb, but omits the pseudo-field sum(1,1). Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
Index Load Failure After Configuration File Error
If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit solrconfig.xml and introduce an XML syntax error or a typo in configuration values.
- When an index fails to load, check the Solr log to find the cause.
- If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.
- If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.
Startup Failure with Large Numbers of Indexes
When there is a large number of Solr cores, Solr Cloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.
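As a rough way to gauge how quickly cores accumulate, note that in Solr Cloud every replica of every shard is a separate core. The sketch below multiplies the three quantities named above. The function name and the sample figures (20 indexes, 16 shards per index, replication factor 2) are assumptions for illustration only, and the calculation assumes every index has the same shard count. It does not predict the restart limit, which still has to be found by testing.

```shell
#!/bin/sh
# Hypothetical estimate (not a GPText utility): total Solr cores in the
# cluster, counting one core per replica of each shard of each index.
# Arguments: number of indexes, shards per index, replication factor.
estimate_cores() {
    indexes=$1
    shards_per_index=$2
    replication_factor=$3
    echo $((indexes * shards_per_index * replication_factor))
}

# Assumed example: 20 indexes, 16 shards each, replication factor 2.
estimate_cores 20 16 2   # prints 640
```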
Setting GPText Configuration Parameters Without First Setting custom_variable_classes
If the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:
mydb-# set gptext.replication_factor = 4;
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"
In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.
mydb-# show gptext.replication_factor;
 gptext.replication_factor
----------------------------
 0
Beginning with GPText 2.1, the error message is still generated; however, the value saved in ZooKeeper is the value specified in the set command (4 in the preceding example).
To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:
$ gpconfig -c custom_variable_classes -v 'gptext'
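The custom_variable_classes parameter holds a comma-separated list of class names, so on a system that already defines other custom classes it may be preferable to append gptext to the existing value instead of replacing it. A minimal sketch of that merge follows. The add_class function is a hypothetical helper, not a Greenplum or GPText utility, and the existing value plr is an invented example. The sketch does not handle spaces around the commas.

```shell
#!/bin/sh
# Hypothetical helper: append a class name to a comma-separated
# custom_variable_classes value unless it is already listed.
# Arguments: current value (may be empty), class name to add.
add_class() {
    current=$1
    new=$2
    case ",$current," in
        *",$new,"*) echo "$current" ;;        # already present, unchanged
        ",,")       echo "$new" ;;            # current value was empty
        *)          echo "$current,$new" ;;   # append to existing list
    esac
}

# Assumed example: a system that already lists the class "plr".
add_class "plr" "gptext"   # prints plr,gptext
```

The merged value would then be supplied as the -v argument of the gpconfig command shown above.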