LATEST VERSION: 2.3.1 - CHANGELOG
Pivotal® Greenplum® Text v2.3.1

Working With GPText External Indexes

A GPText external index is an Apache Solr index you create in Greenplum Database to index and search documents that reside outside of Greenplum Database. External documents can be of many types, for example, PDF, Microsoft Word, XML, or HTML. Solr recognizes document types automatically, using code included from the Apache Tika project.

Read about how Solr and Tika (the “Solr Cell” framework) extract and index document text and metadata at Uploading Data with Solr Cell using Apache Tika. See a list of supported document types at Supported Document Formats.

GPText supports indexing external documents that are accessible by URL with an HTTP GET request. GPText also supports indexing external documents stored in Hadoop, provides functions and utilities to specify required Hadoop configuration and authentication information.

To add external documents to an index, you supply GPText with a list of URLs in an array or as a SQL SELECT statement. The URL becomes the unique id field in the Solr index.

How GPText External Indexes Differ From Regular GPText Indexes

External indexes exist entirely in Solr—there is no associated database table in Greenplum Database. Because of this, the index name does not follow the database.schema.table pattern required for regular GPText indexes. You can choose any name for an external index, but it must not contain periods. You can access a GPText external index from any database in the Greenplum Database system that has the GPText schema installed.

GPText provides the following alternate functions for working with external indexes:

The distribution policy for a regular GPText index is the same as the underlying Greenplum Database table, so that segments manage the same GPText table data as the Solr index shard. A GPText external index also has one shard per segment, but the documents are distributed among the segments using Solr compositeId routing, which allows Solr to choose the shard for a document. See Shards and Indexing Data in SolrCloud .

A regular GPText index only indexes and stores the database table columns you specify. A GPText external index stores and indexes the textual content of the file, as well as metadata fields that are members of the document type.

When an external document is added to the index, the content of the document is saved in the content field. The content field is stored in the index but it is not indexed.

GPText copies the following fields to the text field, the default search field, which is indexed but not stored.

  • title
  • author
  • description
  • keywords
  • content
  • content_type
  • resourcename
  • url

To search the document content, therefore, search the text field, but to retrieve or highlight document contents, use the content field.

The following common metadata fields are indexed and stored:

  • title
  • subject
  • description
  • comments
  • author
  • keywords
  • category
  • resourcename
  • url
  • content_type
  • last_modified
  • links

A dynamic field named meta_* is also indexed and stored. This is a multi-valued field where Solr stores document-type-specific metadata. In search results, this field is returned as a JSON-formatted columnValue string. You can extract individual metadata by name using the gptext.gptext_retrieve_field() function.

Search results for external indexes include all fields saved with the documents, including all metadata fields. You can use the Solr field list option (fl=<field-list>) to limit the fields returned. You can also use SELECT <field-list> FROM gptext.search_external() to limit the fields returned, but it is more efficient to filter out the fields in Solr with the fl option than in the database session.

Authenticating with an External Document Source

The information in this section is applicable only to external document sources that require authentication.

If the external document source that you want to index requires authentication, you must provide the authentication configuration to GPText. You must also use GPText functions to explicitly log in to the external document source before indexing, and log out of the source after indexing completes.

Note: Authenticating is not required for searching an external document.

GPText currently supports authenticating only to native and Kerberized Hadoop clusters.

Uploading a Configuration to ZooKeeper

Before you use GPText to index an external document source that requires authentication, you must upload configuration information to ZooKeeper. Use the gptext-external upload command to upload this information:

gptext-external upload -t <type> -p <config_dir> -c <config_name>

This table describes the options to the gptext-external upload command:

Option Description
<type> The type of the external document source. The only supported <type> is hdfs.
<config_dir> The path to a directory that contains the configuration files. The configuration information that you provide in this directory will depend on the external document source <type>.
<config_name> The name that you assign to the configuration information. You will provide this name when you log in to the external document source.

Note: Retain a local copy of <config_dir>. Should you need to update the configuration, you must edit a local copy of the file(s) and re-upload.

Configuring and Uploading Hadoop Authentication

When you access a Hadoop external document source, <config_dir> must include the following configuration files for <type> hdfs:

  • The core-site.xml and hdfs-site.xml configuration files from the Hadoop server.
  • A file named user.txt. This file contains a single line identifying the Hadoop username to use for authentication. If Kerberos is enabled in the Hadoop cluster, the username in user.txt must identify the Kerberos principal for the user.
  • If the Hadoop cluster is secured with Kerberos, also include the user’s keytab file and the krb5.conf file for the Kerberos realm.

For example, to upload configuration information for a Hadoop external document store:

  1. Create a directory for the authentication configuration files. For example:

    $ mkdir hdfs_conf
    
  2. Copy the core-site.xml and hdfs-site.xml configuration files from the Hadoop server to the configuration directory. The location of these files will differ for different Hadoop distributions. For example:

    $ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/core-site.xml hdfs_conf/
    $ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/hdfs-site.xml hdfs_conf/
    
  3. Construct the user.txt file. For example, if the Hadoop username is bill:

    $ touch hdfs_conf/user.txt
    $ echo "bill" > hdfs_conf/user.txt
    
  4. Upload the Hadoop authentication configuration files for user bill to ZooKeeper. For example:

    $ gptext-external upload -t hdfs -p ./hdfs_conf -c hdfs_bill_auth
    

    This command maps the configuration information you provided in the hdfs_conf/ directory to the name hdfs_bill_auth.

Logging In/Out of the External Document Source

Prior to indexing, you must explictly log in to an external document source that requires authentication. Use the gptext.external_login() function for this purpose:

gptext.external_login('<type>', '<type>://<url>', '<config_name>')

The table below describes the arguments to the gptext.external_login() function:

Option Description
<type> The type of the external document source. The only supported <type> is hdfs.
<type>://<url> The URL of the external document source <type>.
<config_name> The <config_name> you provided when you uploaded the authentication configuration with gptext-external upload.

When you invoke the gptext.external_login() function, GPText logs you in to the external document source as the user identified in the user.txt file you provided in the <config_name> configuration.

For example, to log in to a Hadoop document source using the authentication configuration you uploaded in the prior section:

SELECT * FROM gptext.external_login('hdfs', 'hdfs://<namenode_host_or_ip>:<hdfs_port>', 'hdfs_bill_auth');

Note: You can log in to only one GPText external document source at a time. You must explicitly log out before you can log in to another external document source.

To log out of an external document source, use the gptext.external_logout('<type>') function. For example, to log out of the Hadoop cluster that you are currently logged in to:

SELECT * FROM gptext.external_logout('hdfs');

Troubleshooting Authenticated Document Stores

If you run into problems logging in to or accessing documents in an authenticated Hadoop external document store, refer to Troubleshooting Hadoop Connection Problems.

Creating External Indexes

Use the gptext.create_index_external() function to create an external index.

This example creates an external index named gptext-docs.

=# SELECT * FROM gptext.create_index_external('gptext-docs');

An external index does not have a corresponding Greenplum Database table, so the index name does not follow the database.schema.table pattern required for regular GPText indexes. The only restriction is that the name for an external table must not contain periods.

Adding Documents to an External Index

To add external documents to an external index, supply a list of HTTP or HTTPS URLs where Solr can retrieve the document to the gptext.index_external() function. URLs may be specified either in an array or as a SQL result set.

A hash of the URL is the document’s ID in the index. If a URL has already been added to the index, the file is not reindexed. If you add two identical files retrieved from different URLs, both files are added to the index.

This example adds a single PDF document, specified in an array, to the gptext-docs index.

=# SELECT * FROM gptext.index_external(
      '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}', 
      'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

=# SELECT * FROM gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)

This example adds several HTML documents by selecting URLs from a database table.

=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
      id bigint,
      url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES 
    (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'), 
    (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'), 
    (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
    (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
    (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
    (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
    (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
    (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'), 
    (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
    (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
    (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
    (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
    (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
    (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
    (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
      TABLE(SELECT url FROM gptext_html_docs), 
      'gptext-docs');
dbid | num_docs
------+----------
   3 |        6
   2 |        8
(2 rows)
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)

To add documents from an external document source that requires authentication, such as hdfs, log in to the external system with the gptext.external_login() function before you add the documents. With an authenticated document source, you can add all documents in a directory, using the gptext.external_index_dir() function. See the gptext.external_index_dir() function reference for an example.

Searching GPText External Indexes

You can search GPText external indexes with the standard gptext.search() function or with the gptext.search_external() function. The difference is that the gptext.search() function returns just the id, score, hs, and rf columns and the gptext.search_external() function by default also includes all of the content and metadata stored in the external index. You can use the Solr fl (field list) option with either function to set the actual fields that are included in the results.

Searching with gptext.search()

This simple gptext.search() example searches for “Solr” in the title field of the gptext-docs external index.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
                            id                             |   score   | hs | rf
-----------------------------------------------------------+-----------+----+----
 http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 |    |
(1 row)

To see the title of the document that matched the search, you must request the field with a fl option.

=# SELECT * FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 
'title:Solr', null, 'fl=title');
id                             |   score   | hs |
         rf
-----------------------------------------------------------+-----------+----+----------------
--------------------------------------------------------------------------------------------
http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 |    | {"columnValue":
[{"name":"title","value":"GPText and Solr Data Type Mappings |\n    Pivotal GPText Docs"}]}
(1 row)

The title field specified in the field list of the Solr options argument is returned in the rf column in a JSON document. If you want to return the title in its own result column, you can use the gptext.gptext_retrieve_field() function to extract the text from the JSON document. The expanded display (\x on) psql option in the following examples makes the results easier to read.

=# \x on
Expanded display is on.
demo=# SELECT id, score, gptext.gptext_retrieve_field(rf, 'title') title
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs','title:Solr', 
     null, 'fl=title');
-[ RECORD 1 ]----------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/type_ref.html
score | 0.9745732
title | GPText and Solr Data Type Mappings |
      |     Pivotal GPText Docs

Searching with gptext.search_external()

The gptext.search_external() function, by default, returns a standard set of metadata fields and the content of the document. Depending on the content type of the document, gptext.search_external() returns additional metadata as a JSON document in the meta column.

The following example search returns all fields stored in the gptext-docs index for the document with the word “Installing” in the title field. The content and meta column values in the example results are truncated.

=# SELECT * FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 
        'title:Installing', null, null);
-[ RECORD 1 ]------------------------------------------------------------------------------------
id            | http://gptext.docs.pivotal.io/latest/topics/installing.html
title         | Installing GPText |
              |     Pivotal GPText Docs
subject       | 
description   | 
comments      | 
author        | 
keywords      | 
category      | 
resourcename  | 
url           | 
content_type  | text/html; charset=UTF-8
last_modified | 
links         | 
sha256        | F1182EE7D993CB494CAB8480DA47EA2F82DE8F7DCCC4E76745B6FA5FD7E73FC8
content       | ... 
score         | 1.4449482G
meta          | {"columnValue":[{"name":"meta_a","value":"..."},{"name":"meta_content_encoding","value":"UTF-8"},
{"name":"meta_dc_title","value":"Installing GPText |\n    Pivotal GPText Docs"},
{"name":"meta_div","value":"..."},{"name":"meta_form","value":"application/x-www-form-urlencoded,get,/search"},
... 

You usually only want a subset of the fields in the index. You can specify the fields you want in the SELECT clause or by adding the fl Solr option in the options argument of the gptext.search_external() function. Even if you list the desired fields in the SELECT clause, specifying a field list in the options argument is more efficient because it reduces the amount of data Solr transfers to Greenplum Database.

This example searches for HTML documents that have the word “Indexes” in the title field. A filter query chooses documents with “html” in the content_type field. The field list in the options argument contains just the title field. The id, score, and meta fields are always included in search results.

=# SELECT id, title, score FROM gptext.search_external(
    TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:indexes',
    '{content_type:*html*}', 'fl=title');
                               id                                |                 title                  |   score
-----------------------------------------------------------------+----------------------------------------+-----------
 http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html    | Working With GPText External Indexes | | 1.1593812
                                                                 :     Pivotal GPText Docs
 http://gptext.docs.pivotal.io/latest/topics/managed-schema.html | Customizing GPText Indexes |           | 1.1191859
                                                                 :     Pivotal GPText Docs
 http://gptext.docs.pivotal.io/latest/topics/indexes.html        | Working With GPText Indexes |          | 0.8013617
                                                                 :     Pivotal GPText Docs
 http://gptext.docs.pivotal.io/latest/topics/queries.html        | Querying GPText Indexes |              | 0.8013617
                                                                 :     Pivotal GPText Docs
(4 rows)

=# SELECT id, title, score FROM gptext.search_external( TABLE(SELECT 1 SCATTER BY 1), ‘gptext-docs’, 'title:indexes’, ’{content_type:html}’, 'fl=title’); id | title | score —————————————————————–+—————————————-+———– http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html | Working With GPText External Indexes | | 1.1593812 : Pivotal GPText Docs http://gptext.docs.pivotal.io/latest/topics/managed-schema.html | Customizing GPText Indexes | | 1.1191859 : Pivotal GPText Docs http://gptext.docs.pivotal.io/latest/topics/indexes.html | Working With GPText Indexes | | 0.8013617 : Pivotal GPText Docs http://gptext.docs.pivotal.io/latest/topics/queries.html | Querying GPText Indexes | | 0.8013617 : Pivotal GPText Docs (4 rows) “`

Highlighting External Index Search Results

Solr highlighting includes fragments of documents that match a search query in the search results, with the query terms highlighted with markup tags. Fragments are also called snippets or passages.

Highlighting with GPText external indexes is a different process than highlighting with regular GPText indexes. Because the text and all metadata of external documents are stored in an external index, the markup tags can be applied in Solr before returning search results to Greenplum Database. With regular indexes, highlighting can be performed only for fields with terms enabled, and then search results must be joined with the database table so that the gptext.highlight() function can insert the markup tags into the text. You can, however, configure a regular GPText index so that you store the fields in the index and perform highlighting in Solr. This requires editing the index’s solrconfig.xml and managed-schema configuration files. See Highlighting Terms in Stored Fields for steps to enable this configuration.

Solr highlighting is performed by a search handler called a HighlightComponent, configured in the managed-schema configuration file. Solr provides highlighters that work somewhat differently and have different configurable options. GPText uses the Unified Highlighter by default. See Highlighting at the Apache Solr website to learn more about Solr highlighting and the Unified Highlighter.

You can enable highlighting for GPText external indexes in the Solr options argument of a gptext.search() query. Using this method, the highlighted text is returned in a result column named hs, which contains a JSON-formatted array of highlighted fragments. You can access the fragments using the gptext.gptext_retrieve_field() function.

In addition, GPText provides the gptext.highlight_external() function, which unpacks highlighted fragments in the search results into separate columns in the Greenplum Database search result set.

First, let’s look at the results of a search query with highlighting enabled using the Solr options argument in the gptext.search() function. This statement searches the gptext-docs external index for documents containing the term "apache”. The Solr options are:

  • hl=true – enables highlighting.
  • hl.fl=content title – the content field will be highlighted.
  • rows=1 – return just one document per segment.
=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
     'apache', '{content_type:*html*}', 'hl=true&hl.fl=content&rows=1')
-[ RECORD 1 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/ha.html
score | 0.4548784
hs    | {"columnValue":[{"name":"content","value":"Refer to the \u003cem\u003eApache\u003c/em\u003e 
SolrCloud documentation for help using the SolrCloud Dashboard. \n\n\n   "}]}
rf    | 
-[ RECORD 2 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
score | 0.05978464
hs    | {"columnValue":[{"name":"content","value":"Remarks \n\n\n When you add an external document 
to the index, \u003cem\u003eApache\u003c/em\u003e Tika extracts a core set of metadata from the 
document, the columns listed in the   Return type   section. "}]}
rf    | 
-[ RECORD 3 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
score | 1.2426406
hs    | {"columnValue":[{"name":"content","value":"Solr recognizes document types automatically, 
using code included from the  \u003cem\u003eApache\u003c/em\u003e Tika  project. \n\n\n "}]}
rf    | 
-[ RECORD 4 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/administering.html
score | 0.8155949
hs    | {"columnValue":[{"name":"content","value":"ZooKeeper Administration \n\n\n 
\u003cem\u003eApache\u003c/em\u003e ZooKeeper enables coordination between the 
\u003cem\u003eApache\u003c/em\u003e Solr and Pivotal GPText distributed processes through a shared 
namespace that resembles a file system. "}]}
rf    | 

In this example the hs column has a single fragment from each of the returned documents. You can use the hl.snippets and hl.fragsize Solr options to set, respectively, the maximum number of fragments to return and the approximate number of characters in each fragment. Other options you can use to control how the Unified Highlighter chooses fragments are hl.bs.type and hl.maxAnalyzedChars. The hl.bs.type option specifies how the highlighter breaks the text into fragments. The default is SENTENCE. Other valid choices are SEPARATOR, SENTENCE, WORD, CHARACTER, LINE, or WHOLE. The hl.maxAnalyzedChars option, default 51200, is the maximum number of characters to analyze for highlighting.

See Highlighting in the Solr documentation for tables of options you can set and their default values.

For an external index, Solr returns the highlighted fragments in a columnValue array in the hs result column. You can use the gptext.gptext_retrieve_field() function in the SELECT list to extract the fragments from the array.

=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'content') AS content 
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
      'apache', '{content_type:*html*}', 
      'hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=75&rows=1');
-[ RECORD 1 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
score   | 0.05978464
content | Remarks 
        | 
        | 
        |  When you add an external document to the index, <em>Apache</em> Tika extracts a core 
set of metadata from the document, the columns listed in the   Return type   section. 
-[ RECORD 2 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
score   | 1.2426406
content | Solr recognizes document types automatically, using code included from the  <em>Apache</em> 
Tika  project. 
        | 
        | 
        |  ,See  Highlighting  at the <em>Apache</em> Solr website to learn more about Solr highlighting 
and the Unified Highlighter. 
        | 
        | 
        |  ,This statement searches the  gptext-docs  external index for documents containing the term 
<em>apache</em>. 
-[ RECORD 3 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/administering.html
score   | 0.8155949
content | ZooKeeper Administration 
        | 
        | 
        |  <em>Apache</em> ZooKeeper enables coordination between the <em>Apache</em> Solr and Pivotal 
GPText distributed processes through a shared namespace that resembles a file system. 
-[ RECORD 4 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ha.html
score   | 0.4548784
content | Refer to the <em>Apache</em> SolrCloud documentation for help using the SolrCloud Dashboard. 
        | 

GPText found four documents (1 for each segment) containing the string “apache” and extracted from each document a fragment of the content field containing the term.