Introduction to Pivotal GPText
Pivotal GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis. GPText supports business decision making by offering:
- Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which exponentially increases the kinds of information you can find.
- Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.
- Text analytics: GPText supports analysis of text data with machine learning algorithms. The MADlib analytics library is integrated with Greenplum Database and is available for use with GPText.
This chapter contains the following topics:
GPText combines a Greenplum Database cluster with an Apache SolrCloud cluster. Greenplum Database segments and GPText nodes can be deployed on the same hosts or on different hosts with network connectivity.
The following figure shows the process architecture of the combined Greenplum Database and ApacheSolr clusters. The figure shows four cluster nodes with four Greenplum segments and four Solr instances deployed on each. An Apache ZooKeeper service manages the SolrCloud cluster. Because ZooKeeper is most efficient with an odd number of servers, ZooKeeper nodes are deployed on three of the four hosts. Greenplum Database users access SolrCloud services via GPText user-defined functions installed in Greenplum databases and command-line utilities.
The figure omits the Greenplum master host, secondary master, and mirror segments for the Greenplum primary segments.
The Greenplum segments, Solr instances, and ZooKeeper nodes may all be deployed on separate hosts on the same network, depending on application and performance requirements.
The following sections describe how GPText integrates SolrCloud with Greenplum Database and how the two clusters work together to provide parallel text search capabilities in Greenplum Database and maintain high availability.
Greenplum Database Cluster
A Greenplum Database cluster is comprised of the following components:
- A master database instance, executing on a dedicated host, conventionally named
mdw. (Not illustrated)
- A secondary master instance, on a host conventionally named
smdw, acting as a warm standby for the master instance. (Not illustrated)
- An array of database primary segment instances and mirrors deployed on segment hosts, by convention
sdwn. A segment instance is an independent Postgres database process managing a portion of the distributed data. Each segment has a mirror (not illustrated) on another host in the cluster to provide uninterrupted service in case of a segment or segment host failure. The number of primary segments per host is determined by the hardware configuration—the number and type of processor cores, the amount of physical RAM, local storage capacity, and network capacity—as well as availability and performance requirements.
The Greenplum master instance coordinates the work of the segment instances. Optimal performance of a Greenplum Database cluster requires that all segment hosts be configured identically with the same number of primary and mirror segments on each, and with the database data distributed evenly among the segment instances. The full capacity of the database cluster is utilized when every segment host performs an equal amount of work.
Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache SolrCloud is a highly available, fault tolerant cluster of Apache Solr servers. The term GPText cluster is another way to refer to a SolrCloud cluster deployed by GPText for use with a Greenplum Database system.
A SolrCloud cluster is comprised of the following components:
- An Apache ZooKeeper cluster to manage the SolrCloud cluster. SolrCloud uses ZooKeeper to manage server configuration and to coordinate the cluster’s activities. GPText can install ZooKeeper cluster that is bound to the GPText cluster, or it can share an existing ZooKeeper cluster. If GPText installs the ZooKeeper cluster, it can be managed using GPText functions and utilities. The ZooKeeper cluster can be deployed on Greenplum Database cluster hosts or, for best performance, on seperate hosts accessible to the Greenplum Database cluster.
- Multiple SolrCloud server instances deployed on the Greenplum segment hosts or on other hosts on the same network. Each instance is a JVM process running Solr server. SolrCloud instances use local storage, which may be the same local storage volumes that store Greenplum Database data. The number of SolrCloud instances per host can be the same as the number of Greenplum primary segments per host, but this is not a requirement. The number of instances to execute per host is specified during GPText installation.
GPText provides document indexing and search capabilities for Greenplum Database by adding user-defined functions (UDFs) that access Solr APIs from within database queries.
GPText UDFs perform the following tasks:
- create and manage GPText indexes
- insert documents into indexes from database tables or, for GPText external indexes, from documents stored outside of Greenplum Database
- search indexes
There are also GPText UDFs and command-line utilities to configure, monitor, and manage the SolrCloud cluster and to manage replicas, SolrCloud’s high-availability mechanism. (More on replicas in the next section.)
Parallelism in GPText Indexing and Searching
SolrCloud distributes document indexes in slices called shards. With GPText, the number of shards for an index is the same as the number of Greenplum segments, so each Greenplum segment operates on an equal portion of the index. Each shard is managed by a SolrCloud instance and the shards are distributed evenly among the SolrCloud instances. The SolrCloud instance and Greenplum segment are not required to be on the same host.
High Availability for GPText Indexes
SolrCloud provides high availability by maintaining replicas of shards and providing automatic failover if a shard fails or becomes unavailable. One replica of each shard is the lead replica and any changes to it are applied to the other replicas. The replication factor, which determines the number of replicas to maintain for each shard, is set when the index is created. Replicas may also be added or dropped later using GPText UDFs or command-line utilities.
ZooKeeper determines the locations of shard replicas among the Solr nodes and hosts. When adding a replica using a GPText UDF or command-line utility, a new shard can be explicitly placed on a SolrCloud instance.
Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:
- Load the email records into a Greenplum database.
- Create a Solr index of the email records.
- Run queries that look for text strings and their authors.
- Refine the queries until they pair a dummy company name with top three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.
GPText works with Greenplum Database and Apache SolrCloud to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.
This topic describes the following information:
The following diagram shows the GPText workflow for loading and indexing data.
All client interaction with the system is through the Greenplum master instance.
Load data into your Greenplum Database system.
Create a database table to hold data and then add the data to the table. Greenplum provides parallel data loading utilities and protocols that help to transform and load external data in various formats and from various sources. For details, see the Greenplum Database Administrator Guide, at http://gpdb.docs.pivotal.io.
Create an empty GPText index.
gptext.create_index()user-defined function (UDF) to create an empty GPText index for the table. Each Greenplum segment will manage a slice of the index, called a shard. SolrCloud creates multiple replicas for each shard, distributed among the Solr instances, and chooses a lead replica for the Greenplum segment to operate upon. Solr manages replication between the replicas.
Populate the index with data from the database table.
gptext.index()UDF to add data to the index. This UDF works by dispatching a SQL query to execute on each Greenplum segment. The segments execute the query and add the results to their shards using Solr APIs.
Commit changes to the index.
Commit changes to the GPText index by calling the
gptext.commit_index()UDF. Until the changes are committed, queries executed on the index cannot access any data added to the index with
gptext.index(). If needed, uncommitted changes can be rolled back. SolrCloud replicates changes committed to the lead replica to the shards’ non-lead replicas.
The following diagram shows the high-level GPText query process workflow:
A user submits a SQL query designed to search the indexed data.
A GPText search query is a SQL
SELECTstatement on a GPText search UDF that contains full-text search expressions.
The Greenplum master dispatches the query to the Greenplum segments.
Each segment executes the query, using the Solr API to search its index shard.
SolrCloud executes the search query on the lead replica for the shard.
The Greenplum segments return the results of the search query to the Greenplum master.
The Greenplum master aggregates the results from all segments and returns them to the client.
GPText enables analysis of Solr indexes with Apache MADlib, an open source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.