LATEST VERSION: 3.0.0 - RELEASE NOTES
Pivotal® Greenplum® Text v2.2.1

GPText Best Practices

Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.

Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.

This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.

Indexing Large Numbers of Documents

Indexing documents consumes data in Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit() user-defined function.

You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:

$ gptext-config -f solrconfig.xml -i <db>.<schema>.<index-name> 

The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the time between automatic commits. For example, the following settings perform an autocommit every 100,000 documents or 10 minutes.

<autoSoftCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>600000</maxTime>
</autoSoftCommit>

Indexing Very Large Documents

Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.

See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.

Determining the Number of GPText Nodes to Deploy

A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.

The maximum recommended number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes. Use the JAVA_OPTS installation parameter to set memory size for GPText nodes.

A single GPText node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most GPText installations, a single GPText node per host is sufficient.

If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a GPText node loses capacity and performance. Therefore, no GPText node should have more than 32GB of memory.

For example, if you have 48GB memory available for GPText per host, you should deploy two GPText nodes with 24GB memory. If you have 128GB available, you should deploy at least four JVMs, and more if garbage collection becomes a problem.

Configure Maximum JVM Heap Size

Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.

Monitor performance at startup and during index creation and increase the JVM size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Developer Kit, to monitor Java heap usage. If garbage collections are occurring too frequently and freeing too little memory, JVM heap should be increased.

The JVM size is initially configured during GPText installation by setting the JAVA_OPTIONS parameter in the installation configuration file. After installation, use the gptext-config command -o option to increase the JVM heap size. For example, this gptext-config command sets the JVM maximum heap to 4GB:

$ gptext-config -o "-Xmx=4096M"

Manage Indexing and Search Loads

With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.

Terms Queries and Out of Memory Errors

The gptext.terms() function retrieves terms vectors from documents that match a query. An out of memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS) and concurrent queries.

If you experience out of memory errors with gptext.terms() you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out of memory errors, but performance of terms queries can be affected.

See GPText Configuration Parameters for help setting GPText configuration parameters.

Configure File System Caching for ZooKeeper

Good Solr performance is dependent on fast response for ZooKeeper requests. ZooKeeper performs best when its database is cached so it does not have to go to disk for lookups. If you find that ZooKeeper JVMs have frequent disk accesses, look for ways to improve file caching or move ZooKeeper disks to faster storage.

The ZooKeeper zkClientTimeout parameter is the time a client is allowed to not talk to ZooKeeper before having its session expired.