Pivotal® Greenplum® Text v2.0.0

GPText Best Practices

Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.

This topic discusses three GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems.

  • Indexing large numbers of documents in an index
  • Indexing large documents
  • Creating a large number of GPText indexes

Indexing Large Numbers of Documents

Indexing documents consumes Solr JVM memory. When the index is committed, part of that memory is released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents have been indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory, and makes the documents visible in searches. A soft commit does not, however, make the index updates durable; you must still commit the index with the gptext.commit_index() user-defined function.
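The default trigger described above can be modeled in a few lines of Python. This is a simplified sketch, not Solr's implementation: the function name and its parameters are hypothetical, and the defaults are the values stated in this topic.

```python
# Simplified model of the automatic soft commit trigger (not Solr code).
# Defaults mirror the values given in this topic: 1,000,000 documents
# or 20 minutes, whichever limit is reached first.

DEFAULT_MAX_DOCS = 1_000_000
DEFAULT_MAX_TIME_MS = 20 * 60 * 1000  # 20 minutes expressed in milliseconds


def soft_commit_due(pending_docs, elapsed_ms,
                    max_docs=DEFAULT_MAX_DOCS,
                    max_time_ms=DEFAULT_MAX_TIME_MS):
    """Return True when either the document limit or the time limit is reached."""
    return pending_docs >= max_docs or elapsed_ms >= max_time_ms


if __name__ == "__main__":
    five_minutes_ms = 5 * 60 * 1000
    print(soft_commit_due(250_000, five_minutes_ms))      # neither limit reached
    print(soft_commit_due(1_000_000, five_minutes_ms))    # document limit reached
    print(soft_commit_due(250_000, DEFAULT_MAX_TIME_MS))  # time limit reached
```

Lowering either limit makes the condition true sooner, so less uncommitted data accumulates in JVM memory between soft commits.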

You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:

$ gptext-config -f solrconfig.xml -i <db>.<schema>.<index-name> 

The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to shorten the interval between automatic soft commits. The <maxTime> value is specified in milliseconds. For example, the following settings perform an automatic soft commit every 100,000 documents or every 10 minutes (600,000 milliseconds).

<autoSoftCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>600000</maxTime>
</autoSoftCommit>
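Because <maxTime> is expressed in milliseconds, it is easy to be off by a factor of 1,000 when editing the value by hand. The following Python sketch converts an interval in minutes and prints the element in the form shown above. The helper name auto_soft_commit_xml is hypothetical and is not part of GPText or Solr; the output still has to be placed in solrconfig.xml with gptext-config.

```python
# Hypothetical helper: render an <autoSoftCommit> element from a document
# count and an interval given in minutes. Not part of GPText or Solr.

def auto_soft_commit_xml(max_docs, max_minutes):
    """Return the <autoSoftCommit> element as a string."""
    # Solr expects maxTime in milliseconds.
    max_time_ms = int(round(max_minutes * 60 * 1000))
    return (
        "<autoSoftCommit>\n"
        f"  <maxDocs>{int(max_docs)}</maxDocs>\n"
        f"  <maxTime>{max_time_ms}</maxTime>\n"
        "</autoSoftCommit>"
    )


if __name__ == "__main__":
    # 100,000 documents or 10 minutes, matching the example settings.
    print(auto_soft_commit_xml(100_000, 10))
```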

Indexing Very Large Documents

Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.

See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.

Configure Maximum JVM Heap Size

Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.

Monitor performance at startup and during index creation, and increase the JVM heap size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit, to monitor Java heap usage. If garbage collections occur too frequently and free too little memory, increase the JVM heap.
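The rule of thumb above (collections that are both frequent and unproductive) can be made concrete with a small check over garbage collection observations, for example values noted from jconsole. The following Python sketch is illustrative only: the GcEvent type, the function name, and both thresholds (5 seconds between collections, 10 percent of the heap freed) are assumptions chosen for the example, not GPText or Solr guidance.

```python
# Hypothetical heuristic for spotting heap pressure from GC observations.
# The thresholds are illustrative assumptions, not GPText or Solr guidance.

from dataclasses import dataclass


@dataclass
class GcEvent:
    timestamp_s: float     # when the collection finished, in seconds
    heap_before_mb: float  # heap in use before the collection
    heap_after_mb: float   # heap in use after the collection


def heap_under_pressure(events, max_heap_mb,
                        min_interval_s=5.0, min_freed_fraction=0.10):
    """Return True when collections are both frequent and unproductive."""
    if len(events) < 2:
        return False  # not enough observations to judge
    span_s = events[-1].timestamp_s - events[0].timestamp_s
    avg_interval_s = span_s / (len(events) - 1)
    total_freed_mb = sum(e.heap_before_mb - e.heap_after_mb for e in events)
    avg_freed_fraction = total_freed_mb / len(events) / max_heap_mb
    return (avg_interval_s < min_interval_s
            and avg_freed_fraction < min_freed_fraction)


if __name__ == "__main__":
    # Collections every 2 seconds that each free about 100 MB of a 4096 MB heap.
    busy = [GcEvent(t, 3900, 3800) for t in (0, 2, 4)]
    # Collections every 60 seconds that each free about 2000 MB.
    healthy = [GcEvent(t, 3000, 1000) for t in (0, 60, 120)]
    print(heap_under_pressure(busy, 4096))
    print(heap_under_pressure(healthy, 4096))
```

Suitable thresholds depend on the workload and the garbage collector in use, so treat a positive result as a prompt to look more closely rather than as a definitive diagnosis.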

Use the -Xmx JVM command line option to increase the JVM heap size. For example, this gptext-config command sets the JVM maximum heap to 4GB:

$ gptext-config -o "-Xmx4096M"
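The JVM expects the heap size to follow -Xmx directly, with no space or separator, and a unit suffix such as M for megabytes. If you script heap changes, a small helper can build the option string consistently. The following Python sketch assumes a hypothetical helper name, xmx_option; it only formats the option and does not run gptext-config.

```python
# Hypothetical helper: format the JVM maximum heap option from a size in MB.

def xmx_option(heap_mb):
    """Return a maximum-heap JVM option such as -Xmx4096M."""
    if isinstance(heap_mb, bool) or not isinstance(heap_mb, int) or heap_mb <= 0:
        raise ValueError("heap size must be a positive whole number of megabytes")
    # The size follows -Xmx directly, without a space or separator.
    return f"-Xmx{heap_mb}M"


if __name__ == "__main__":
    # 4 GB expressed in megabytes.
    print(xmx_option(4 * 1024))
```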

Manage Indexing and Search Loads

With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.

Configure File System Caching for ZooKeeper

Good Solr performance depends on fast responses to ZooKeeper requests. ZooKeeper performs best when its database is cached so that it does not have to go to disk for lookups. If you find that the ZooKeeper JVMs access the disk frequently, look for ways to improve file caching or move the ZooKeeper disks to faster storage.

The ZooKeeper zkClientTimeout is the length of time a client can go without communicating with ZooKeeper before its session expires.