Pivotal® GPText 2.1.2 Release Notes

This document contains release information for Pivotal GPText 2.1.2.

Published: August, 2017

About Pivotal GPText

Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.

GPText includes the following features:

  • The GPText database schema provides in-database access to Apache Solr indexing and searching
  • Custom tokenizers for international text and social media text
  • A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
  • Faceted search results
  • Term highlighting in results
  • Greater emphasis on high availability

The GPText management utility suite includes command-line utilities to perform the following tasks:

  • Start, stop, and monitor ZooKeeper and GPText nodes
  • Configure GPText nodes and indexes
  • Add and delete replicas for index shards
  • Back up and restore GPText indexes
  • Recover a GPText node
  • Expand the GPText cluster by adding GPText nodes

Prerequisites

Installing GPText also installs Apache Solr Cloud 6.1 and, optionally, Apache ZooKeeper.

Following are GPText installation prerequisites.

  • Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
  • GPText runs on Red Hat Enterprise Linux 5.x or 6.x.
  • GPText cannot be installed onto a shared NFS mount.
  • Install Oracle JRE 1.8.x and add its bin directory to the PATH on all hosts in the cluster.

    GPText requires Oracle JDK 1.8.x. You cannot use an OpenJDK JRE with GPText.

  • Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).
  • Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
  • GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
  • If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator web site.
  • Apache Solr requires a ZooKeeper cluster with at least three nodes (five nodes are recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When ZooKeeper is deployed alongside Greenplum Database segments, its performance can suffer under heavy database load. For best performance, install the ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
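
The memory reservation described in the prerequisites above can be sketched as a short shell calculation. The function names and the figures used here (64 GB of physical RAM, two GPText nodes per host, a 2 GB JVM maximum size) are assumptions for illustration only, not recommended values. The result is only the input to the gp_vmem_protect_limit formulas in the Greenplum Database Reference Guide, not the parameter value itself.

```shell
# Memory (in MB) to set aside for GPText on one segment host:
# number of GPText nodes on the host times the JVM maximum size.
gptext_reserved_mb() {
    nodes_per_host=$1
    jvm_max_mb=$2
    echo $(( nodes_per_host * jvm_max_mb ))
}

# Physical RAM (in MB) left over for the gp_vmem_protect_limit
# calculation after subtracting the GPText reservation.
ram_for_greenplum_mb() {
    physical_ram_mb=$1
    reserved_mb=$(gptext_reserved_mb "$2" "$3")
    echo $(( physical_ram_mb - reserved_mb ))
}

# Hypothetical host: 65536 MB RAM, 2 GPText nodes, 2048 MB JVM maximum.
ram_for_greenplum_mb 65536 2 2048   # prints 61440
```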

New Features, Enhancements, and Fixes in GPText 2.1.2

Features and Enhancements

  • An optional Solr options parameter is added to the following GPText functions:

    • gptext.search_count()
    • gptext.faceted_range_search()
    • gptext.faceted_query_search()
    • gptext.faceted_field_search()

    The options parameter is an ampersand-delimited list of Solr options to include in the search. See Solr options for more about using Solr options.

  • Added support for the Greenplum Database 5.0 uuid data type. Greenplum Database uuid columns are mapped to the solr.UUIDField type in GPText indexes. Columns of uuid type may also be specified for the unique id column (id_col) in the gptext.create_index() function.

  • Changed the default value of the term_batch_size GPText configuration parameter from 50000 to 1000. This reduces the possibility of an out of memory (OOM) error when executing the gptext.terms() function. If you experience OOM errors with this new default value, you may need to further reduce the value of term_batch_size. For more information see Terms Queries and Out of Memory Errors.

  • In previous releases, a user was permitted access to a GPText index if their role had permission to access the Greenplum Database table from which the index was created. Permissions are no longer checked on the base table. This makes it possible to drop the base table and continue to search the GPText index. GPText functions that depend upon the existence of the base table, such as gptext.add_field(), are not allowed after the database table has been dropped. If the table was partitioned, GPText queries must specify the root table name.
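
The ampersand-delimited Solr options parameter described above is an ordinary string, so it can be assembled from individual options before it is passed to a GPText function. The following is a minimal sketch. The helper name join_solr_options is hypothetical, and rows=10 is only an example option. Option names are passed through unchanged because Solr option names such as defType are case-sensitive.

```shell
# Join Solr options into a single ampersand-delimited string.
# Each argument is one name=value option; case is preserved.
join_solr_options() {
    result=""
    for opt in "$@"; do
        if [ -z "$result" ]; then
            result=$opt
        else
            result="$result&$opt"
        fi
    done
    printf '%s\n' "$result"
}

join_solr_options "defType=dismax" "rows=10"   # prints defType=dismax&rows=10
```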

Fixes

  • When the defType=dismax Solr option was added to the Solr options parameter of the gptext.search() function, GPText lowercased the option name to deftype before submitting the query to Solr, causing the query to return unexpected results. This has been fixed.

Known Issues

Following are known issues in GPText. Workarounds are provided when available.

Wildcards in GPText Search Options

Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names. For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) returns contenta and contentb, but omits the pseudo-field sum(1,1).

Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
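
Because listing field names explicitly returns the pseudo-field correctly, one way to avoid the problem is to check an fl value before using it. The sketch below is a simple heuristic and is not part of GPText: the function name is hypothetical, and it treats any opening parenthesis as a function pseudo-field.

```shell
# Return 0 (true) when an fl value contains both a wildcard and a
# function call such as sum(1,1), the combination that causes the
# pseudo-field to be omitted from the results.
fl_drops_pseudo_field() {
    case $1 in
        *"*"*) ;;            # has a wildcard, keep checking
        *) return 1 ;;
    esac
    case $1 in
        *"("*) return 0 ;;   # also has a function call
        *) return 1 ;;
    esac
}

for fl in 'contenta,contentb,sum(1,1)' 'cont*,sum(1,1)' '*,sum(1,1)'; do
    if fl_drops_pseudo_field "$fl"; then
        echo "fl=$fl: pseudo-field may be omitted; list fields explicitly"
    else
        echo "fl=$fl: ok"
    fi
done
```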

Index Load Failure After Configuration File Error

If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit managed-schema or solrconfig.xml and introduce an XML syntax error or a typo in configuration values.

Workaround:

  1. When an index fails to load, check the Solr log to find the cause.
  2. If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.
  3. If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.
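
To reduce the chance of introducing an XML syntax error in the first place, a copy of managed-schema or solrconfig.xml can be checked for well-formedness before it is put into use. The sketch below assumes that the xmllint tool from libxml2 is installed and that you have a copy of the file on disk. The function name is hypothetical. The check covers XML syntax only and does not detect typos in configuration values.

```shell
# Return 0 if the file is well-formed XML, 1 if it is not, and 2 if
# xmllint is not installed. Checks syntax only, not whether the
# configuration values are valid for Solr.
check_config_xml() {
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "xmllint not found; cannot check $1" >&2
        return 2
    fi
    if xmllint --noout "$1" 2>/dev/null; then
        echo "$1: well-formed"
    else
        echo "$1: XML syntax error" >&2
        return 1
    fi
}
```

For example, check_config_xml ./solrconfig.xml reports whether a local copy named solrconfig.xml parses cleanly.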

Startup Failure with Large Numbers of Indexes

When there is a large number of Solr cores, Solr Cloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.

Setting GPText Configuration Parameters Without First Setting custom_variable_classes

If the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:

mydb-# set gptext.replication_factor = 4;
WARNING:  Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING:  Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"

In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.

 mydb-# show gptext.replication_factor;
  gptext.replication_factor
 ----------------------------
  0

Beginning with GPText 2.1, the error message is still generated, but the value saved in ZooKeeper is the value specified in the set command (4 in the preceding example).

To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:

$ gpconfig -c custom_variable_classes -v 'gptext'
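
The gpconfig command above sets custom_variable_classes to gptext alone. If the parameter already lists other classes, you would likely want to keep them. The sketch below builds a combined value. The helper name is hypothetical, plr is only an example of an existing class, and the helper assumes the current value is a comma-separated list without spaces.

```shell
# Given the current comma-separated custom_variable_classes value,
# print a value that includes gptext and keeps any existing classes.
with_gptext_class() {
    current=$1
    case ",$current," in
        *,gptext,*) printf '%s\n' "$current" ;;          # already present
        ,,)         printf '%s\n' "gptext" ;;            # parameter was empty
        *)          printf '%s\n' "$current,gptext" ;;   # append gptext
    esac
}

with_gptext_class "plr"   # prints plr,gptext
```

The printed value can then be passed to gpconfig in place of 'gptext', for example gpconfig -c custom_variable_classes -v 'plr,gptext'.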

Cannot Source greenplum_path.sh after greenplum-text_path.sh

The GPText greenplum-text_path.sh script requires that the Greenplum Database greenplum_path.sh script be run first. For example:

$ source /usr/local/greenplum-db/greenplum_path.sh
$ source /usr/local/greenplum-text/greenplum-text_path.sh

The GPText script modifies the PATH and PYTHONPATH environment variables set previously by the Greenplum Database script.

If you source greenplum_path.sh again after you have run greenplum-text_path.sh, GPText’s PYTHONPATH is overwritten and GPText fails.

A workaround is to source the greenplum-text_path.sh script from the greenplum_path.sh script. Edit the file $GPHOME/greenplum_path.sh and add the following line to the end of the file:

source /usr/local/greenplum-text/greenplum-text_path.sh
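
If you prefer to script this edit, the following sketch appends the line only when it is not already present, so running it more than once does not add duplicates. The function name is hypothetical, and the path assumes the default /usr/local/greenplum-text installation location shown above.

```shell
# Append the GPText source line to a file only if the exact line is
# not already there.
append_source_line() {
    target_file=$1
    line="source /usr/local/greenplum-text/greenplum-text_path.sh"
    if ! grep -qxF "$line" "$target_file"; then
        printf '%s\n' "$line" >> "$target_file"
    fi
}
```

For example, append_source_line "$GPHOME/greenplum_path.sh" applies the workaround to the Greenplum Database script.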