LATEST VERSION: 3.0.0 - RELEASE NOTES
Pivotal® Greenplum® Text v2.2.1

Working With GPText External Indexes

A GPText external index is an Apache Solr index you create in Greenplum Database to index and search documents that reside outside of Greenplum Database. External documents can be of many types, for example, PDF, Microsoft Word, XML, or HTML. Solr recognizes document types automatically, using code included from the Apache Tika project.

Read about how Solr and Tika (the “Solr Cell” framework) extract and index document text and metadata at Uploading Data with Solr Cell using Apache Tika. See a list of supported document types at Supported Document Formats.

External documents must be accessible by URL with an HTTP GET request. To add external documents to an index, you supply GPText with a list of URLs in an array or as a SQL SELECT statement. The URL will be the unique id field in the Solr index.

How GPText External Indexes Differ From Regular GPText Indexes

External indexes exist entirely in Solr—there is no associated database table in Greenplum Database. Because of this, the index name does not follow the database.schema.table pattern required for regular GPText indexes. You can choose any name for an external index, but it must not contain periods. You can access a GPText external index from any database in the Greenplum Database system that has the GPText schema installed.

GPText provides the following alternate functions for working with external indexes:

The distribution policy for a regular GPText index is the same as the underlying Greenplum Database table, so that segments manage the same GPText table data as the Solr index shard. A GPText external index also has one shard per segment, but the documents are distributed among the segments using Solr compositeId routing, which allows Solr to choose the shard for a document. See Shards and Indexing Data in SolrCloud .

A regular GPText index only indexes and stores the database table columns you specify. A GPText external index stores and indexes the textual content of the file, as well as metadata fields that are members of the document type.

When an external document is added to the index, the content of the document is saved in the content field. The content field is stored in the index but it is not indexed.

GPText copies the following fields to the text field, the default search field, which is indexed but not stored.

  • title
  • author
  • description
  • keywords
  • content
  • content_type
  • resourcename
  • url

To search the document content, therefore, search the text field, but to retrieve or highlight document contents, use the content field.

The following common metadata fields are indexed and stored:

  • title
  • subject
  • description
  • comments
  • author
  • keywords
  • category
  • resourcename
  • url
  • content_type
  • last_modified
  • links

A dynamic field named meta_* is also indexed and stored. This is a multi-valued field where Solr stores document-type-specific metadata. In search results, this field is returned as a JSON-formatted columnValue string. You can extract individual metadata by name using the gptext.gptext_retrieve_field() function.

Search results for external indexes include all fields saved with the documents, including all metadata fields. You can use the Solr field list option (fl=<field-list>) to limit the fields returned. You can also use SELECT <field-list> FROM gptext.search_external() to limit the fields returned, but it is more efficient to filter out the fields in Solr with the fl option than in the database session.

Creating External Indexes

Use the gptext.create_index_external() function to create an external index.

This example creates an external index named gptext-docs.

=# SELECT * FROM gptext.create_index_external('gptext-docs');

An external index does not have a corresponding Greenplum Database table, so the index name does not follow the database.schema.table pattern required for regular GPText indexes. The only restriction is that the name for an external table must not contain periods.

Adding Documents to an External Index

To add external documents to an external index, supply a list of HTTP or HTTPS URLs where Solr can retrieve the document to the gptext.index_external() function. URLs may be specified either in an array or as a SQL result set.

A hash of the URL is the document’s ID in the index. If a URL has already been added to the index, the file is not reindexed. If you add two identical files retrieved from different URLs, both files are added to the index.

This example adds a single PDF document, specified in an array, to the gptext-docs index.

=# SELECT * FROM gptext.index_external(
      '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}', 
      'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

=# SELECT * FROM gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)

This example adds several HTML documents by selecting URLs from a database table.

=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
      id bigint,
      url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES 
    (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'), 
    (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'), 
    (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
    (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
    (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
    (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
    (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
    (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'), 
    (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
    (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
    (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
    (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
    (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
    (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
    (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
      TABLE(SELECT url FROM gptext_html_docs), 
      'gptext-docs');
dbid | num_docs
------+----------
   3 |        6
   2 |        8
(2 rows)
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)

Searching External Indexes

The gptext.search_external() function searches an external index.

The following example search (results omitted) returns all fields stored in the gptext-docs index for all documents.

=# SELECT * FROM gptext.search_external(TABLE(SELECT 1 SCATTER by 1), 
      'gptext-docs', '*:*', null);

Searching an external index, by default, returns all of the stored fields. Usually, however, you only want a subset of the fields. You can specify the fields you want in the SELECT clause or by adding the fl Solr option in the options argument of the gptext.search_external.

Even if you list the desired fields in the SELECT clause, specifying a field list in the options argument is more efficient because it restricts the amount of data Solr transfers to Greenplum Database.

This example searches for HTML documents that have the word “Indexes” in the title field. A filter query chooses documents with “html” in the content_type field. The field list in the options argument contains just the title field. The id, score, and meta fields are always included in search results.

=# SELECT id, title, score FROM gptext.search_external(
      TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:indexes', 
      '{content_type:*html*}', 'fl=title');
      id                                |             title             |   score
-----------------------------------------------------------------+-------------------------------+------------
http://gptext.docs.pivotal.io/latest/topics/indexes.html        | Working With GPText Indexes | | 0.96766204
                                        :     Pivotal GPText Docs
http://gptext.docs.pivotal.io/latest/topics/managed-schema.html | Customizing GPText Indexes |  | 0.96766204
                                        :     Pivotal GPText Docs
http://gptext.docs.pivotal.io/latest/topics/queries.html        | Querying GPText Indexes |     | 0.96766204
                                        :     Pivotal GPText Docs
(3 rows)

Highlighting External Index Search Results

Solr highlighting includes fragments of documents that match a search query in the search results, with the query terms highlighted with markup tags. Fragments are also called snippets or passages.

Highlighting with GPText external indexes is a different process than highlighting with regular GPText indexes. Because the text and all metadata of external documents are stored in an external index, the markup tags can be applied in Solr before returning search results to Greenplum Database. With regular indexes, highlighting can be performed only for fields with terms enabled, and then search results must be joined with the database table so that the gptext.highlight() function can insert the markup tags into the text. You can, however, configure a regular GPText index so that you store the fields in the index and perform highlighting in Solr. This requires editing the index’s solrconfig.xml and managed-schema configuration files. See Highlighting Terms in Stored Fields for steps to enable this configuration.

Solr highlighting is performed by a search handler called a HighlightComponent, configured in the managed-schema configuration file. Solr provides highlighters that work somewhat differently and have different configurable options. GPText uses the Unified Highlighter by default. See Highlighting at the Apache Solr website to learn more about Solr highlighting and the Unified Highlighter.

You can enable highlighting for GPText external indexes in the Solr options argument of a gptext.search() query. Using this method, the highlighted text is returned in a result column named hs, which contains a JSON-formatted array of highlighted fragments. You can access the fragments using the gptext.gptext_retrieve_field() function.

In addition, GPText provides the gptext.highlight_external() function, which unpacks highlighted fragments in the search results into separate columns in the Greenplum Database search result set.

First, let’s look at the results of a search query with highlighting enabled using the Solr options argument in the gptext.search() function. This statement searches the gptext-docs external index for documents containing the term “apache”. The Solr options are:

  • hl=true – enables highlighting.
  • hl.fl=content title – the content field will be highlighted.
  • rows=1 – return just one document per segment.
=# \x on
Expanded display is on.
=# \pset border off
Border style is 0.
=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
     'apache', '{content_type:*html*}', 'hl=true&hl.fl=content&rows=1');
* Record 1
id    http://gptext.docs.pivotal.io/latest/topics/ha.html
score 0.55252916
hs    {"columnValue":[{"name":"content","value":"Refer to the \u003cem\u003eApache\u003c/em\u003e SolrCloud documentation for help using the SolrCloud Dashboard. \n\n\n   "}]}
rf
* Record 2
id    http://gptext.docs.pivotal.io/latest/topics/administering.html
score 0.9950347
hs    {"columnValue":[{"name":"content","value":"ZooKeeper Administration \n\n\n \u003cem\u003eApache\u003c/em\u003e ZooKeeper enables coordination between the \u003cem\u003eApache\u003c/em\u003e Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. "}]}
rf

Notice that in this example the hs column has a single fragment from each of the returned documents. You can use the hl.snippets and hl.fragsize options to set, respectively, the maximum number of fragments to return and the approximate number of characters in each fragment. Other options you can use to control how the Unified Highlighter chooses fragments are hl.bs.type and hl.maxAnalyzedChars. The hl.bs.type option specifies how the highlighter breaks the text into fragments. The default is SENTENCE. Other valid choices are SEPARATOR, SENTENCE, WORD, CHARACTER, LINE, or WHOLE. The hl.maxAnalyzedChars option, default 51200, is the maximum number of characters to analyze for highlighting.

See Highlighting in the Solr documentation for tables of options you can set and their default values.

For an external index, Solr returns the highlighted fragments in a columnValue array in the hs result column. You can use the gptext.gptext_retrieve_field() function in the SELECT list to extract the fragments from the array.

=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'content') content 
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
      'apache', '{content_type:*html*}', 
      'hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=75&rows=1');
* Record 1
id      http://gptext.docs.pivotal.io/latest/topics/ha.html
score   0.55252916
content Refer to the <em>Apache</em> SolrCloud documentation for help using the SolrCloud Dashboard.


* Record 2
id      http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf
score   1.2220514
content <em>Apache</em> Solr    requires        a       ZooKeeper       cluster with    at      minimum three   nodes.

The gptext.highlight_external() function performs the search, but extracts the highlighted fragments to columns. The hl and hl.fl Solr options must not be specified in this function. This example, with an AND query, demonstrates that the terms on both sides of the AND are highlighted.

=# SELECT * FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1), 
      'gptext-docs', 'apache AND GPText', '{content_type:*text/html*}', 'rows=1');
* Record 1
id      http://gptext.docs.pivotal.io/latest/topics/administering.html
title   Administering <em>GPText</em> |
            Pivotal <em>GPText</em> Docs
content ZooKeeper Administration


         <em>Apache</em> ZooKeeper enables coordination between the <em>Apache</em> Solr and Pivotal <em>GPText</em> distributed processes through a shared namespace that resembles a file system.
score   1.107425
* Record 2
id      http://gptext.docs.pivotal.io/latest/topics/ha.html
title   <em>GPText</em> High Availability |
            Pivotal <em>GPText</em> Docs
content Refer to the <em>Apache</em> SolrCloud documentation for help using the SolrCloud Dashboard.



score   0.63803643

Here is an example that searches for and highlights the term “python” in the content field of the gptext-docs index.

=# SELECT * FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1),
      'gptext-docs', 'python', '{content_type:*html*}');
* Record 1 
id      http://gptext.docs.pivotal.io/latest/topics/utility_ref.html
title   GPText Management Utilities |
       |     Pivotal GPText Docs
content When new hosts are added to the Greenplum Database cluster, ensure that the following GPText prerequisites are installed before running  gpexpand :


                 Java 1.8
                 <em>Python</em> 2.6 or greater
                 Linux  lsof  utility
                 All hosts in the cluster must be able to reach the new and existing hosts.



score  1.0837649

* Record 2 
id      http://gptext.docs.pivotal.io/latest/topics/administering.html
title   Administering GPText |
       |     Pivotal GPText Docs
content Note:  Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:


               Java 1.8
               Python 2.6
               The Linux  lsof  utility



        Viewing Solr Index Statistics


        You can view Solr index statistics by running the  gptext-state  utility from the command line.



score   0.71699136

* Record 3 
id      http://gptext.docs.pivotal.io/latest/topics/upgrading.html
title   Upgrading GPText |
       |     Pivotal GPText Docs
content The installer verifies the environment to ensure that prerequisites are present, such as <em>Python</em> and Java.
score   0.6838306

GPText found three documents containing the string ‘Python’ and extracted from each a fragment of the content field containing the term.