Installing Tanzu Greenplum Text
Prerequisites
The GPText installation includes the installation of Apache Solr Cloud and, optionally, Apache ZooKeeper.
If you are installing a new GPText release into an existing GPText system, follow the instructions in Upgrading GPText instead.
Following are GPText installation prerequisites.
- Install and configure your Greenplum Database system, version 4.3.6 or higher. See Installing and Upgrading Greenplum.
- GPText runs on Red Hat Enterprise Linux or CentOS 5.x, 6.x, or 7.x.
- GPText cannot be installed onto a shared NFS mount.
- Install JRE 1.8 on all hosts in the cluster.
- Ensure that nc (netcat) is installed on all Greenplum cluster hosts (yum install nc).
- Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
- GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Tanzu Greenplum Installation Guide for instructions to configure hosts.
- If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator web site.
- Apache Solr requires a ZooKeeper cluster with at minimum three nodes. You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster with at least three nodes (five nodes recommended) on separate hosts with network connectivity to the Greenplum network. See ZooKeeper Best Practices for more information about optimizing ZooKeeper performance.
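The memory reservation described in the prerequisites can be sketched as shell arithmetic. The function names and the sample values (two GPText nodes per host, a 2048 MB JVM maximum, 65536 MB of physical RAM) are illustrative assumptions, not recommendations. The result is only the reduced RAM figure to use in the gp_vmem_protect_limit formulas from the Greenplum Database Reference Guide.

```shell
# Memory to set aside for GPText on one segment host:
# (GPText nodes per host) x (JVM maximum size in MB).
gptext_reserved_mb() {
    nodes_per_host=$1
    jvm_max_mb=$2
    echo $(( nodes_per_host * jvm_max_mb ))
}

# Physical RAM left for the Greenplum memory calculation after
# subtracting the GPText reservation.
ram_for_greenplum_mb() {
    physical_ram_mb=$1
    reserved_mb=$(gptext_reserved_mb "$2" "$3")
    echo $(( physical_ram_mb - reserved_mb ))
}

# Hypothetical host: 65536 MB of RAM, two GPText nodes, 2048 MB JVM maximum each.
gptext_reserved_mb 2 2048          # prints 4096
ram_for_greenplum_mb 65536 2 2048  # prints 61440
```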
Install the GPText Binary Distribution
1. On the Greenplum master host, extract the GPText distribution file. For example:
   $ cd /home/gpadmin
   $ tar xvfz greenplum-text-<version>-<platform>.tar.gz
   This creates the directory greenplum-text-<version>-<platform>, which contains the gptext_install_config file and the GPText installation binary, which has a name in the format greenplum-text-<version>-<platform>.bin.
2. If necessary, grant execute permission to the GPText binary. For example:
   $ chmod +x /home/gpadmin/greenplum-text-<version>-<platform>.bin
3. If you are installing GPText in a parent directory that is not writable by the gpadmin user, you must create the installation directories on each GPText host machine and set ownership and permissions to allow the gpadmin user write access to the directories.
   For example, if you are installing GPText in the default directory, /usr/local/greenplum-text-<version>, execute these commands on each host as root (or as gpadmin using sudo):
   mkdir /usr/local/greenplum-text-<version>
   mkdir /usr/local/greenplum-solr
   chown gpadmin:gpadmin /usr/local/greenplum-text-<version>
   chmod 775 /usr/local/greenplum-text-<version>
   chown gpadmin:gpadmin /usr/local/greenplum-solr
   chmod 775 /usr/local/greenplum-solr
   Note: You can use the Greenplum Database gpssh command-line utility to execute these commands in parallel on all hosts if the gpadmin user has sudo privilege or if the root user has passwordless SSH access to all hosts. See the gpssh command reference in the Greenplum Database Utility Guide for details.
   Complete the remaining steps as the gpadmin user.
4. Edit the gptext_install_config file to set parameters for the installation. See Set Installation Parameters for details. Review the user authentication setup for the SolrCloud web user interface, using GPTEXT_ENABLE_USER_AUTH. Enabling user authentication after GPText installation, and when the cluster is running, is a disruptive process.
5. Run the GPText installation binary as gpadmin on the master server:
   $ ./greenplum-text-<version>-<platform>.bin -c <gptext_install_config>
6. Accept the license agreement and respond to the installer’s prompts.
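The directory-preparation commands above are identical on every host, so it can help to generate them from one place. The sketch below only prints the commands for a given version string and executes nothing. The function name is hypothetical, and it assumes the default /usr/local locations shown above.

```shell
# Print the directory-preparation commands for one GPText version.
# Nothing is executed here; review the output, then run it as root
# (or with sudo) on each GPText host.
gptext_dir_commands() {
    version=$1
    for dir in "/usr/local/greenplum-text-$version" /usr/local/greenplum-solr; do
        echo "mkdir $dir"
        echo "chown gpadmin:gpadmin $dir"
        echo "chmod 775 $dir"
    done
}

# Replace the placeholder with the version you are installing.
gptext_dir_commands "<version>"
```

The printed commands could then be run on each host, for example through gpssh as the note above describes.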
Optional Two-Part GPText Installation
The GPText two-part installation installs and deploys the GPText software in separate steps. This gives you the option to install the software files to a read-only, shared directory mounted on all GPText hosts in the cluster, rather than installing the software on every GPText host.
If you install the GPText software onto a shared drive, you must set the GPTEXT_CUSTOM_CONFIG_DIR parameter in the installation configuration file. This parameter specifies a writable directory that exists on every GPText host where GPText can store configuration files for external data sources. See GPText installation parameters for more information about this parameter.
Run the GPText installation in two parts by following the steps in this section.
1. Prepare GPText installation directories as described in steps 1 through 3 in Install the GPText Binary Distribution.
2. Run the GPText installation binary as gpadmin on the master server:
   $ ./greenplum-text-<version>.bin -b
   Note that the -c <gptext_install_config> option is omitted.
3. Source the GPText environment script in the GPText installation directory:
   $ source <gptext-install-dir>/greenplum-text_path.sh
4. Edit the gptext_install_config file to set parameters for the GPText deployment. See Set Installation Parameters for details. Be sure to uncomment and set the GPTEXT_CUSTOM_CONFIG_DIR parameter if you installed the software on a read-only drive. Also review the user authentication setup for the SolrCloud web user interface, using GPTEXT_ENABLE_USER_AUTH. Enabling user authentication after GPText installation, and when the cluster is running, is a disruptive process.
5. Deploy the GPText cluster with the gptext-deploy command. The command requires the -c option to specify the installation configuration file. Also include the -m option because you installed the GPText software to a shared drive mounted on all GPText hosts. If you do not include -m, gptext-deploy copies the GPText software to all GPText hosts.
   $ gptext-deploy -m -c <gptext_install_config>
Set Installation Parameters
A GPText configuration file named gptext_install_config contains parameters to configure the GPText installation. Edit the file and set the parameters as described in the following section.
The GPTEXT_HOSTS and DATA_DIRECTORY installation parameters determine the number of GPText nodes that are deployed.
The maximum number of GPText nodes supported is 960. The best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among a larger number of GPText nodes. For example, if there are eight primary segments per host in the Greenplum Database cluster, you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.
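To make the sizing rule concrete, the sketch below estimates the total node count as the number of hosts times the number of data directories per host, and compares it with the 960-node maximum. It assumes that each DATA_DIRECTORY entry yields one GPText node on every host, which is an inference from the statement above rather than something spelled out here. The function names are hypothetical, and the space-separated lists stand in for the arrays in gptext_install_config.

```shell
# Count the whitespace-separated items in a string.
count_items() {
    set -- $1
    echo $#
}

# Estimated GPText nodes: hosts x data directories per host (assumption).
gptext_node_count() {
    hosts=$(count_items "$1")
    dirs=$(count_items "$2")
    echo $(( hosts * dirs ))
}

# Succeeds when the estimate is within the documented maximum of 960 nodes.
within_node_limit() {
    [ "$(gptext_node_count "$1" "$2")" -le 960 ]
}

# Three hosts with two data directories each: prints 6.
gptext_node_count "gptext_h1 gptext_h2 gptext_h3" "/data/primary /data/primary"
```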
GPText installation parameters
GPTEXT_HOSTS
Set to "ALLSEGHOSTS" to install GPText on all Greenplum Database segment hosts. GPText hosts must be passwordless ssh-accessible by the gpadmin user from all other hosts in the Greenplum Cluster.
Examples:
declare -a GPTEXT_HOSTS=(gptext_h1 gptext_h2 gptext_h3)
GPTEXT_HOSTS="ALLSEGHOSTS"
If you use the constant "ALLSEGHOSTS", the number of GPText node hosts is the same as the number of Greenplum segment hosts. If GPTEXT_HOSTS is set to an array of host names, the length of the array is the number of GPText node hosts.

DATA_DIRECTORY
If GPTEXT_HOSTS lists multiple interfaces per host, the GPText nodes are spread evenly across the interface addresses.
Example:
declare -a DATA_DIRECTORY=(/data/primary /data/primary)

GPTEXT_CUSTOM_CONFIG_DIR
By default, external configuration files and custom libraries are stored in the share subdirectory of the GPText installation directory. If you do specify a directory with this parameter, the directory is created on every Solr host in the cluster, and external configuration files and custom libraries will be stored there, leaving the GPText installation directory free from application data.

JAVA_OPTS
Example:
JAVA_OPTS="-Xms1024M -Xmx2048M"

GPTEXT_ENABLE_USER_AUTH
Set to true to enable user authentication for the SolrCloud web user interface. The default user account is solr.
Example:
GPTEXT_ENABLE_USER_AUTH=True

GPTEXT_ADMIN_PWD
The password for the SolrCloud web user account (solr by default) when GPTEXT_ENABLE_USER_AUTH=True.
Example:
GPTEXT_ADMIN_PWD=mypassword

GPTEXT_ADMIN_USER
The SolrCloud web user account. The default is solr. Change this value to a user account of your preference. You may only specify a single SolrCloud web user account.
Example:
GPTEXT_ADMIN_USER=solr

GPTEXT_PORT_BASE
GP_MAX_PORT_LIMIT
Example:
GPTEXT_PORT_BASE=18983
GP_MAX_PORT_LIMIT=28983

SOLR_TIMEZONE
Accepted formats:
1. GMT+offset, like SOLR_TIMEZONE="GMT+8".
2. GMT+/-long offset, like SOLR_TIMEZONE="GMT+0800".
3. TZ name, like SOLR_TIMEZONE="Asia/Shanghai". See List of TZ database time zones for a full list of the possible TZ name values.
Example:
SOLR_TIMEZONE="Asia/Tokyo"
If the timezone is not set, GPText defaults to the timezone of the master host.

ZOO_CLUSTER
When set to "BINDING", the installation deploys a ZooKeeper cluster. To use an existing ZooKeeper cluster, set this parameter to a list of ZooKeeper nodes in the format "host1:port,host2:port,host3:port".
Example:
ZOO_CLUSTER="BINDING"

ZOO_HOSTS
If ZOO_CLUSTER is set to "BINDING", this parameter is an array of the hosts where the ZooKeeper nodes are to be installed. The array must contain 3, 5, or 7 host names, for example ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5). If you are using a single host for ZooKeeper, specify it multiple times, for example, ZOO_HOSTS=(sdw1 sdw1 sdw1).
Example:
declare -a ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5)

ZOO_DATA_DIR
Used when ZOO_CLUSTER is set to "BINDING".
Example:
ZOO_DATA_DIR="/data/master/"

ZOO_GPTXTNODE
Used when ZOO_CLUSTER is set to "BINDING" or a list of hosts.
Example:
ZOO_GPTXTNODE="gptext"

ZOO_PORT_BASE
ZOO_MAX_PORT_LIMIT
Example:
ZOO_PORT_BASE=2188
ZOO_MAX_PORT_LIMIT=12188

GPTEXT_JAVA_HOME
If this parameter is not set, the PATH and JAVA_HOME environment variables will be used.
Example:
GPTEXT_JAVA_HOME=/usr/java/jdk1.8.0_131
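The ZOO_HOSTS rule above (the array must contain 3, 5, or 7 host names) is easy to get wrong when editing the file by hand, so a quick pre-check can be useful. The sketch below is not part of GPText: the function name is hypothetical, and it takes the host list as a space-separated string rather than reading gptext_install_config.

```shell
# Succeeds when the host list has 3, 5, or 7 entries, as ZOO_HOSTS requires
# for a binding ZooKeeper cluster. Repeated names count as separate entries,
# which matches the single-host example ZOO_HOSTS=(sdw1 sdw1 sdw1).
valid_zoo_hosts() {
    set -- $1
    case $# in
        3|5|7) return 0 ;;
        *) return 1 ;;
    esac
}

if valid_zoo_hosts "sdw1 sdw2 sdw3 sdw4 sdw5"; then
    echo "ZOO_HOSTS entry count is acceptable"
fi
```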
Starting GPText
First, make sure the GPText command-line utilities are in your path by sourcing the Greenplum Database and GPText environment scripts. It is important to source the GPText environment script each time you source the Greenplum Database script. For example:
$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh
To use GPText in a database, you must first use the gptext-installsql management utility to install the GPText user-defined functions and other objects in the database:
$ gptext-installsql database [database2 ... ]
The GPText objects are created in the gptext schema.
The ZooKeeper cluster must be running before you start GPText. If you installed a bound ZooKeeper cluster, start it with the zkManager command-line utility.
$ zkManager start
Start GPText with the gptext-start utility.
$ gptext-start
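The start-up order above (ZooKeeper first, then GPText) can be collected into one small helper. This is a sketch only: the function names and the DRY_RUN switch are hypothetical conveniences, the first argument states whether a bound ZooKeeper cluster was installed, and it assumes both environment scripts have already been sourced in the current shell.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 is "binding" when GPText installed its own ZooKeeper cluster; any other
# value means an external ZooKeeper cluster is assumed to be running already.
start_gptext() {
    if [ "$1" = "binding" ]; then
        run zkManager start || return 1
    fi
    run gptext-start
}

# Preview the commands without executing them.
DRY_RUN=1 start_gptext binding
```

Stopping at the first failure keeps gptext-start from running against a ZooKeeper cluster that did not come up.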
Configure Greenplum Database
GPText configuration parameters are saved in ZooKeeper. You can, however, view and set GPText configuration parameters in a Greenplum Database session using the SHOW and SET commands.
If you are using Greenplum Database 4.3.x or 5.x, you must first declare the GPText custom variable class by adding it to the Greenplum Database custom_variable_classes configuration parameter. The custom_variable_classes parameter is removed in Greenplum Database 6, so this step is unnecessary if you have Greenplum Database 6.
The custom_variable_classes configuration parameter is a comma-separated list of class names. It is unset by default. To see if any custom variable classes have already been configured, run this gpconfig command at the command line.
$ gpconfig -s custom_variable_classes
If no custom variable classes have been set, set the parameter with the following command.
$ gpconfig -c custom_variable_classes -v 'gptext'
[gpadmin@gpsne ~]$ gpconfig -c custom_variable_classes -v 'gptext'
20171029:12:29:11:028199 gpconfig:gpsne:gpadmin-[INFO]:-completed successfully
If other classes have been configured, add gptext to the existing list, separated by a comma.
Run gpstop -u to have Greenplum Database reload the configuration file.
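The rule for adding gptext to an existing comma-separated list can be sketched as a small string helper. The function name is hypothetical, plr is only a stand-in for whatever class might already be configured, and the sketch assumes the list contains no spaces around the commas.

```shell
# Return the custom_variable_classes value with gptext included exactly once.
# $1 is the current value (empty when the parameter is unset).
add_gptext_class() {
    existing=$1
    case ",$existing," in
        *,gptext,*) echo "$existing" ;;         # already present
        ,,)         echo "gptext" ;;            # nothing configured yet
        *)          echo "$existing,gptext" ;;  # append to the existing list
    esac
}

add_gptext_class ""            # prints gptext
add_gptext_class "plr"         # prints plr,gptext
add_gptext_class "plr,gptext"  # prints plr,gptext
```

The result could then be passed as the value for gpconfig -c custom_variable_classes -v, followed by gpstop -u as described above.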
View or set GPText Configuration Parameters
When you want to view or set GPText configuration parameters in a psql session, first execute the gptext.version() function to load the GPText configuration parameters into the session.
=# SELECT gptext.version();
version
--------------------------------
Greenplum Text Analytics 3.2.0
(1 row)
=# SHOW gptext.idx_delim;
gptext.idx_delim
------------------
,
(1 row)
See Setting GPText Configuration Parameters for more about GPText configuration parameters.
Uninstalling GPText
To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.
gptext-uninstall runs only if there is at least one database with a GPText schema.
Execute:
$ gptext-uninstall