Pivotal Greenplum GPText v2.2.1

GPText Best Practices

Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.

Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.

This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.

Indexing Large Numbers of Documents

Indexing documents consumes Solr JVM memory. When the index is committed, part of that memory is released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents have been indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, which frees JVM memory and makes the documents visible in searches. A soft commit does not, however, make the index updates durable; you must still commit the index with the gptext.commit() user-defined function.
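As a minimal sketch of the explicit commit step, the helper below builds the SQL statement that calls gptext.commit(). It assumes the function takes the index name in <db>.<schema>.<index-name> form as its only argument. The helper name commit_sql, the database name demo, and the index name demo.public.articles are hypothetical and used only for illustration.

```shell
# Hypothetical helper: print the SQL that commits one GPText index.
# $1 is the index name, assumed to be in <db>.<schema>.<index-name> form.
commit_sql() {
  printf "SELECT * FROM gptext.commit('%s');\n" "$1"
}

# To run it against a database (names are placeholders, not from this topic):
#   psql -d demo -c "$(commit_sql demo.public.articles)"
commit_sql demo.public.articles
```

Running the commit after a batch of documents has been indexed, rather than after each document, keeps the number of commits low.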

You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:

$ gptext-config -f solrconfig.xml -i <db>.<schema>.<index-name> 

The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the interval between automatic soft commits. The <maxTime> value is specified in milliseconds. For example, the following settings perform an automatic soft commit every 100,000 documents or 10 minutes (600,000 milliseconds).

<autoSoftCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>600000</maxTime>
</autoSoftCommit>
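Because <maxTime> is given in milliseconds, it can help to compute the value rather than type it by hand. The following is a small sketch; the function name minutes_to_ms is hypothetical.

```shell
# Hypothetical helper: convert a commit interval in minutes to the
# millisecond value used in <maxTime>.
minutes_to_ms() {
  echo $(( $1 * 60 * 1000 ))
}

minutes_to_ms 10   # prints 600000, the <maxTime> value in the example settings
minutes_to_ms 20   # prints 1200000, the default interval
```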

Indexing Very Large Documents

Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.

See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.

Determining the Number of GPText Nodes to Deploy

A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.

The maximum number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes allowed. For example, if there are eight primary segments per host in the Greenplum Database cluster, the maximum number of GPText nodes per host is eight, but you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.
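The trade-off can be made concrete with a short calculation. The sketch below divides a fixed amount of memory reserved for GPText on one host evenly among the nodes on that host and prints the matching -Xmx option. The function name heap_option and the 16384 MB figure are assumptions chosen for illustration; the sketch also ignores the non-heap memory each JVM needs, so treat the results as upper bounds.

```shell
# Hypothetical helper: print the -Xmx option for each GPText node on a host.
# $1: memory reserved for GPText on the host, in megabytes (illustrative figure)
# $2: number of GPText nodes on the host
heap_option() {
  reserved_mb="$1"
  nodes="$2"
  echo "-Xmx$(( reserved_mb / nodes ))M"
}

heap_option 16384 8   # prints -Xmx2048M (one node per primary segment)
heap_option 16384 4   # prints -Xmx4096M
heap_option 16384 2   # prints -Xmx8192M
```

With the same reserved memory, halving the node count doubles the heap available to each node, which is why testing with fewer, larger nodes is recommended.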

Configure Maximum JVM Heap Size

Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.

Monitor performance at startup and during index creation, and increase the JVM heap size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit (JDK), to monitor Java heap usage. If garbage collections occur too frequently and free too little memory, increase the JVM heap.
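For hosts without a graphical display, the JDK's jstat command-line tool is an alternative to jconsole. Its -gcutil option prints heap utilization percentages, with the old generation in the column labeled O. The sketch below filters that output for samples where old-generation occupancy exceeds a threshold. The function name old_gen_above and the 90 percent threshold are hypothetical; choose a threshold that suits your system.

```shell
# Hypothetical helper: read `jstat -gcutil` output on stdin and print the
# old-generation occupancy (column "O") of every sample above a threshold.
# $1: threshold as a percentage, for example 90
old_gen_above() {
  awk -v t="$1" '
    # The first line is the header; locate the column labeled "O".
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") col = i; next }
    # Print the old-generation value when it exceeds the threshold.
    col && ($col + 0) > (t + 0) { print $col }
  '
}

# To sample a Solr JVM every 5 seconds (replace <pid> with the process ID):
#   jstat -gcutil <pid> 5000 | old_gen_above 90
```

If the old generation stays near full even after full collections, the JVM is freeing too little memory and the heap is a candidate for an increase.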

The JVM heap size is initially configured during GPText installation by setting the JAVA_OPTS parameter in the installation configuration file. After installation, use the -o option of the gptext-config command to increase the JVM heap size. For example, this gptext-config command sets the JVM maximum heap to 4 GB:

$ gptext-config -o "-Xmx4096M"

Manage Indexing and Search Loads

With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.

Terms Queries and Out of Memory Errors

The gptext.terms() function retrieves term vectors from documents that match a query. An out of memory error may occur if the documents are large or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (the -Xmx value in JAVA_OPTS) and the number of concurrent queries.

If you experience out of memory errors with gptext.terms(), you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out of memory errors, but it can also slow terms queries.

See GPText Configuration Parameters for help setting GPText configuration parameters.

Configure File System Caching for ZooKeeper

Good Solr performance depends on fast responses to ZooKeeper requests. ZooKeeper performs best when its database is cached in memory, so that it does not have to go to disk for lookups. If you find that the ZooKeeper JVMs access the disk frequently, look for ways to improve file caching or move the ZooKeeper disks to faster storage.

The ZooKeeper zkClientTimeout parameter is the length of time a client can go without communicating with ZooKeeper before its session expires.