Querying GPText Indexes
- Creating a Simple Search Query
- Creating Faceted Search Queries
- Highlighting Search Terms in Query Results
- Searching Partitioned Tables
- Retrieving Stored Field Content
- Selecting a Query Parser
- Proximity Search Queries
- Using the Universal Query Parser
- Using the DisMax and Extended DisMax Query Parsers
To retrieve data, you submit a query that performs a search based on criteria that you specify. Simple queries return straight-forward results. You can use the default query parser, or specify a different query parser at query time.
Creating a Simple Search Query
After a Solr index is committed, you can run queries with the gptext.search()
function, which has this syntax:
gptext.search(<src_table>, <index_name>, <search_query>, <filter_queries>[, <options>])
The <search_query>
argument is a text value that contains a Solr query. The <filter_queries>
argument is an array of queries that restrict the set of documents to search.
The default Solr Standard Query Parser has a rich query syntax that includes wildcard characters, Boolean operators, proximity and range searches, and fuzzy searches. See The Standard Query Parser at the Solr website for examples.
Solr has additional query processors that you can specify in the <search_query>
argument to access additional features. The GPText Universal Query Parser, gptextqp
, allows queries that mix features from all of the supported query parsers.
See Selecting a Query Parser for a list of the supported query parsers and how to request them in your queries. See Using the Universal Query Parser for examples using the GPText Universal Query Parser.
The following sections show how to use the gptext.search()
function, including example queries that demonstrate Solr search features.
An AND search example with top 5 results
This search finds documents in the wikipedia.articles
index that contain both search terms “solar” and “battery”. The 'rows=5'
argument is a Solr option that specifies the top 5 results are to be returned from each segment. In a Greenplum Database cluster with two segments, this query returns up to 10 rows.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'solar AND battery', null, 'rows=5') q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+---------------------+-----------
13690575 | 2017-08-24 02:34:00-05 | Solar power | 2.7128658
2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid | 2.5810153
4711003 | 2017-08-10 18:56:00-05 | Osmotic power | 2.2073007
25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.1295567
213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
27743 | 2017-08-20 15:56:00-05 | Solar energy | 1.6916461
608623 | 2017-08-27 03:56:00-05 | Ethanol fuel | 1.4619896
(7 rows)
See Solr options for more about Solr options.
An OR search example with top 5 results
By using the OR keyword, this search matches more documents than the AND example. The total number of rows returned is limited by the rows=5
Solr option.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'solar OR battery', null, 'rows=5') q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
---------+------------------------+---------------------+-----------
2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid | 2.5810153
25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.1295567
2120798 | 2017-01-28 00:59:00-06 | Lithium economy | 2.0416002
213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
27743 | 2017-08-20 15:56:00-05 | Solar energy | 1.6916461
608623 | 2017-08-27 03:56:00-05 | Ethanol fuel | 1.4619896
533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.0239072
2988035 | 2017-03-12 06:39:00-05 | Vortex engine | 0.9519546
113728 | 2017-08-15 09:59:00-05 | Geothermal energy | 0.6801035
55017 | 2017-08-28 19:24:00-05 | Fusion power | 0.6432224
(10 rows)
Search non-default fields
A GPText index has a default search column, specified when the index is created with the gptext.create_index()
function. If you have included additional columns to index, you can reference them in your queries. This query searches for documents with the word “solar” in the title
column.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'title:solar', null, null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+---------------------+-----------
13690575 | 2017-08-24 02:34:00-05 | Solar power | 1.6547729
27743 | 2017-08-20 15:56:00-05 | Solar energy | 1.6547729
533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.1132113
213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 1.1132113
(4 rows)
This example finds documents where the title
column matches “Solar power” or “Solar energy”.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'title:(solar AND (power OR energy))', null, null) q
WHERE q.id::int8 = a.id;
id | date_time | title | score
----------+------------------------+--------------+-----------
27743 | 2017-08-20 15:56:00-05 | Solar energy | 3.3095458
13690575 | 2017-08-24 02:34:00-05 | Solar power | 2.9718256
(2 rows)
This example searches for articles that have “photosynthesis” in the content
column but that do not have “solar” in the title
column.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'photosynthesis and -title:solar', null, null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+------------------+-----------
25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.9720955
53716476 | 2017-08-28 20:40:00-05 | Seaweed fuel | 1.4240221
14205946 | 2017-08-28 08:46:00-05 | Algae fuel | 1.3022419
608623 | 2017-08-27 03:56:00-05 | Ethanol fuel | 0.7614042
(4 rows)
Filtering search results
A filter query applies filters to the results returned by the query. The <filter_queries>
argument of the gptext.search()
function is an array, so you can apply multiple filters to the search results.
The following example finds articles that have the word “nuclear” in the content
column and then applies two filter queries to remove articles that have “solar” in the title
column and articles that do not have “power” in the title
column.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'nuclear', '{-title:solar,title:power}', null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+------------------+------------
14090587 | 2017-08-14 14:00:00-05 | Low-carbon power | 1.1897897
55017 | 2017-08-28 19:24:00-05 | Fusion power | 1.1753609
13021878 | 2017-08-09 05:03:00-05 | Geothermal power | 0.99499804
(3 rows)
The following example searches the demo.twitter.message
table for messages that contain the text “iphone” and either “hate” or “love” and filters for authors who specified English language in their twitter profile.
=# SELECT t.id, q.score, t.author_screen_name, t.message_text
FROM twitter.message t,
gptext.search(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message',
'(iphone AND (hate OR love))', '{author_lang:en}', 'rows=5') q
WHERE t.id = q.id::int4
ORDER BY score DESC;
id | score | author_screen_name |
message_text
----------+-----------+--------------------+-----------------------------------------------
-------------------------------------------------------------
19424811 | 3.446217 | kennediiscool | I hate
: iPhones:
20663075 | 2.9209785 | Hi_imMac | RT @indigoFKNvanity: I hate the auto correct o
n iPhones !!!!!!!!!
20042822 | 2.9209785 | renadrian | @KDMC23 ohhhh!!! I hate Iphone Talk!
20759274 | 2.5128412 | SteLala | Dropped frutopia on
: My phone... #ciaowaterdamage I hate iPhones.
19416451 | 2.1448703 | ShayFknShay | I'm in love with my new iPhone(:
20350436 | 2.102924 | mahhnamestj | I absolutely love how fast this phone works. L
ove the iPhone.
19284329 | 1.9478481 | popolvuhplaya | #nowplaying on my iPhone: Daft Punk - "Digital
Love"
19714120 | 1.9478481 | BipolarBearApp | @ayee_Eddy2011 I love pancakes too! #iPhone #
app
20257190 | 1.6903389 | alasco | Love my #iphone - only problem now? I want an
#Ipad!
20473459 | 1.379696 | ArniBella | i love my iphone4 but I'm excited to see what
the iphone5 has to offer #gadgets #iphone #apple #technology
(10 rows)
Creating Faceted Search Queries
Faceting breaks query results into multiple categories with a count of the number of documents in the index for each category. There are three GPText faceted search functions:
gptext.faceted_field_search()
– the categories are the values of one or more fields in GPText index.gptext.faceted_query_search()
– the categories are a list of search queries.gptext.faceted_range_search()
– the categories are a list of ranges calculated from minimum value, maximum value, and the size of the range (gap).
The examples in this section use the store.products
table. See Setting Up the Demo Database for commands to create and load data into this table.
After the table is created and the data loaded, create the GPText index, index the data, and then commit the index as shown in this example.
=# SELECT * FROM gptext.create_index('store', 'products', '{id, title, category, brand, price}', '{int, text_intl, string, string, float}', 'id', 'title'); =# SELECT * FROM gptext.index(TABLE(SELECT * FROM store.products), 'demo.store.products'); dbid | num_docs ------+---------- 2 | 25 3 | 25 (2 rows) =# SELECT * FROM gptext.commit_index('demo.store.products'); commit_index -------------- t (1 row)
Faceting on Fields
With the gptext.faceted_field_search()
function, the categories are values of one or more fields in the index. Here is the syntax for the gptext.faceted_field_search()
function:
gptext.faceted_field_search(<index_name>, <query>, <filter_queries>,
<facet_fields>, <facet_limit>, <minimum>[, <options>])
<index_name>
is the name of the GPText index with fields to facet.
<query>
is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'
.
<filter_queries>
is an array of queries that filter documents from the set returned by the <query>
, or null
if none. Only documents that match all queries in the list are included in the counts.
<facet_fields>
is an array of index fields to facet.
<facet_limit>
is the maximum number of results to report for any one category. Use -1
to report all results.
<minimum>
is the minimum number of results a category must have in order to be included in the results.
This example facets all documents in the demo.store.products
index on the category field.
=# SELECT * FROM gptext.faceted_field_search(
'demo.store.products', '*:*', null, '{category}', -1, 1);
field_name | field_value | value_count
------------+--------------+-------------
category | Pot | 11
category | Desktops | 10
category | Tablets | 8
category | Monitors | 7
category | Tent | 6
category | Luggage | 5
category | Sleeping Bag | 3
(7 rows)
This example facets all documents on two fields, category
and brand
. Only facets with a count of 2 or more are included in the results.
=# SELECT * FROM gptext.faceted_field_search(
'demo.store.products', '*:*', null, '{category, brand}', -1, 2);
field_name | field_value | value_count
------------+----------------+-------------
brand | ASUS | 7
brand | Dell | 5
brand | HP | 4
brand | Samsung | 4
brand | Apple | 2
brand | Utopia Kitchen | 2
brand | Big Agnes | 2
brand | Yaheetech | 2
brand | Kelty | 2
brand | Huawei | 2
category | Pot | 11
category | Desktops | 10
category | Tablets | 8
category | Monitors | 7
category | Tent | 6
category | Luggage | 5
category | Sleeping Bag | 3
(17 rows)
The next example uses a filter query to facet the brand
field for just the 10 documents with category “Desktops”.
=# SELECT * FROM gptext.faceted_field_search(
'demo.store.products', '*:*', '{category:Desktops}', '{brand}', -1, 1);
field_name | field_value | value_count
------------+-------------+-------------
brand | Dell | 5
brand | ASUS | 3
brand | HP | 2
(3 rows)
Faceting on search queries
With the faceted_query_search()
function, the categories are GPText search queries. The counts are a report of the numbers of documents that match each search query. Here is the syntax for the faceted_field_search()
function:
gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, <facet_queries>);
<index_name>
is the name of the GPText index with fields to facet.
<query>
is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'
.
<filter_queries>
is an array of queries that filter documents from the set returned by the <query>
, or null
if none. Only documents that match all queries in the list are included in the counts.
<facet_queries>
is an array of search queries. Each query in the array is a category in the results.
This example reports the number of documents that contain “windows”, “intel”, and both “windows” and “intel” in the default search column (title
).
=# SELECT * FROM gptext.faceted_query_search(
'demo.store.products', '*:*', null,
'{windows, intel, windows AND intel}');
query_name | value_count
-------------------+-------------
intel | 7
windows | 4
windows AND intel | 2
(3 rows)
The facet queries in this example are Solr range queries that define four custom ranges over the price
field.
=# SELECT * FROM gptext.faceted_query_search(
'demo.store.products', '*:*', null,
'{price:[* TO 200],price:[201 TO 250],price:[251 TO 300],price:[301 TO *]}');
query_name | value_count
--------------------+-------------
price:[201 TO 250] | 2
price:[251 TO 300] | 2
price:[301 TO *] | 11
price:[* TO 200] | 35
(4 rows)
Faceting on Ranges
The gptext.faceted_range_search()
function facets a single field in the GPText index into ranges specified with start, end, and gap values. The faceted field must be a numeric type.
gptext.faceted_range_search(<index_name>, <query>, <filter_queries>,
<field_name>, <range_start>, <range_end>, <range_gap>, <options>)
<index_name>
is the name of the GPText index with fields to facet.
<query>
is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'
.
<filter_queries>
is an array of queries that filter documents from the set returned by the <query>
, or null
if none. Only documents that match all queries in the list are included in the results.
<field_name>
is the name of the field to facet. The field must have numeric content. The calculated ranges will have the same data type as the field.
<range_start>
is the smallest value of the first range category.
<range_limit>
is the highest value of the top range.
<range_gap>
is the size of each range category.
<options>
is an optional string containing Solr query options.
This range search example facets the price field into ranges between 0 and 1200 with a gap of 100. The range_value
column in the results is a text value, so the ORDER BY
clause casts the value to a float type.
=# SELECT * from gptext.faceted_range_search(
'demo.store.products', '*:*', null, 'price', '0', '1200', '100')
ORDER BY range_value::float;
field_name | range_value | value_count
------------+-------------+-------------
price | 0.0 | 23
price | 100.0 | 12
price | 200.0 | 4
price | 300.0 | 6
price | 400.0 | 0
price | 500.0 | 1
price | 600.0 | 1
price | 700.0 | 1
price | 800.0 | 0
price | 900.0 | 1
price | 1000.0 | 0
price | 1100.0 | 1
(12 rows)
Highlighting Search Terms in Query Results
Highlighting inserts markup tags before and after each occurrence of the search terms in a query. For example, if the search term is “iphone”, each occurrence of “iphone” in the field is marked up:
<em>iphone</em>
You can change the default markup strings from <em>
and </em>
by setting the gptext.hl_pre_tag
and gptext.hl_post_tag
server configuration options.
There are two ways to highlight search terms, depending on whether the field to be marked up is stored in the GPText index.
If the field is indexed, but not stored, you must join the search results with the database table and use the
gptext.highlight()
function to apply markup tags to the column data.If the field is indexed and stored, Solr can apply the markup tags and return the marked-up field in the results of the search query. This is the same way highlighting works for GPText external indexes. (See Highlighting External Index Search Results.) Using this method with regular GPText indexes requires modifying the
solrconfig.xml
configuration file for the index.
Highlighting Terms with gptext.highlight()
To use gptext.highlight()
the index must have been created with terms enabled for the columns that are to be highlighted. Use gptext.enable_terms()
to enable term vectors and then reindex the data if it was already indexed. See gptext.enable_terms()
in the GPText Function Reference.
This example enables terms for the message_text
field in the demo.twitter.message
index, reindexes the data, and commits the changes to the index:
=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
=# SELECT * FROM gptext.index(TABLE (SELECT * FROM twitter.message), 'demo.twitter.message');
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
Warning: Highlighting adds overhead to the query, including index space, indexing time, and search time.
The gptext.highlight()
syntax is:
gptext.highlight(<column_data>, <column_name>, <offsets>)
The <column_data>
argument contains the text data that will be marked up with highlighting tags.
The <column_name>
argument is the name of the corresponding table column.
The <offsets>
argument is a GPText hstore
type that contains key-value pairs that specify the locations of the search term in the text data. This value is constructed by the gptext.search()
function when highlighting is enabled. The key contains the column name and the value is a comma-separated list of offsets where the data appears.
To enable highlighting in a gptext.search()
query, add the hl
and hl.fl
options in the <options>
argument:
hl=true&hl.fl=<field1>,<field2>
Setting the hl=true
option enables highlighting for the search. The hl.fl
option specifies a list of the field names to highlight.
This example returns up to five rows from each segment with the text “iphone” highlighted in the message_text
field.
=# SELECT t.id, gptext.highlight(t.message_text, 'message_text', s.hs)
FROM twitter.message t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.twitter.message', '{!gptextqp}iphone', null,
'rows=5&hl=true&hl.fl=message_text' ) s
WHERE t.id = s.id::int8;
id | highlight
----------+----------------------------------------------------------------------------------
-----------------------------------
20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #
<em>iphone</em> #apple #technology
19424811 | I hate
: <em>iPhones</em>:
20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
19714120 | @ayee_Eddy2011 I love pancakes too! #<em>iPhone</em> #app
19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
19416451 | I'm in love with my new <em>iPhone</em>(:
20257190 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
20759274 | Dropped frutopia on
: My phone... #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)
Highlighting Terms in Stored Fields
If the field to be highlighted is stored in the index, Solr can return the field in the search results with markup tags applied. The gptext.highlight()
function is not used and it is not necessary to enable terms for the field. This is the default behavior for GPText external indexes, but for regular GPText indexes you must enable it by editing the solrconfig.xml
configuration file for the index.
Use the
gptext-config
utility to open thesolrconfig.xml
configuration file for the index in the editor. .$ gptext-config edit -i demo.twitter.message -f solrconfig.xml
Search for
<!-- Search Components -->
and add the following element.<searchComponent class="solr.HighlightComponent" name="highlight" />
Search for
<requestHandler name="/select" class="solr.SearchHandler">
. In the<arr name="components">
child element, change<str>termoffsets</str>
to<str>highlight</str>
. The complete<requestHandler>
entry should be:<requestHandler name="/select" class="solr.SearchHandler"> <!-- default values for query parameters can be specified, these will be overridden by parameters in the request --> <lst name="defaults"> <str name="echoParams">explicit</str> <int name="rows">10</int> <str name="df">message_text</str> </lst> <arr name="components"> <str>query</str> <str>facet</str> <str>mlt</str> <str>highlight</str> <str>stats</str> <str>debug</str> </arr> </requestHandler>
Save your changes.
Update the field definitions in the
managed-schema
configuration file to store the fields that will be highlighted. See Storing Additional Fields in an Index for instructions. Be sure to reindex the data after changing storage options.
The following query searches the message_text
field for messages containing the text “iphone” and highlights “iphone” in the text returned in the hs
column.
=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
id | score | hs
| rf
----------+-----------+---------------------------------------------------------------------------------------
--------------------------------------------------+----
19284329 | 0.8176138 | {"columnValue":[{"name":"message_text","value":"#nowplaying on my \u003cem\u003eiPhone
\u003c/em\u003e: Daft Punk - \"Digital Love\""}]} |
19416451 | 0.9003142 | {"columnValue":[{"name":"message_text","value":"I'm in love with my new \u003cem\u003e
iPhone\u003c/em\u003e(:"}]} |
19424811 | 1.0051261 | {"columnValue":[{"name":"message_text","value":"I hate\n\u003cem\u003eiPhones\u003c/em
\u003e:"}]} |
20042822 | 0.8519347 | {"columnValue":[{"name":"message_text","value":"I hate \u003cem\u003eIphone\u003c/em\u
003e Talk!"}]} |
(4 rows)
You can use the gptext.gptext_retrieve_field()
function to extract the highlighted text from the columnValue
array in the hs
column. Compare the previous results to the results from this query.
=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'message_text') message_text
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
id | score | message_text
----------+------------+--------------------------------------------------------------------------------------
-------------------------------
19424811 | 1.0051261 | I hate
: <em>iPhones</em>:
20042822 | 0.8519347 | I hate <em>Iphone</em> Talk!
20350436 | 0.7387052 | Love the <em>iPhone</em>.
20473459 | 0.59349346 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #<em>
iphone</em> #apple #technology
20663075 | 0.8519347 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
19284329 | 0.8176138 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
19416451 | 0.9003142 | I'm in love with my new <em>iPhone</em>(:
19714120 | 0.8176138 | #<em>iPhone</em> #app
20257190 | 0.7095236 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
20759274 | 0.7095236 | #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)
Searching Partitioned Tables
A GPText index for a partitioned Greenplum Database table has a __partition
field that contains the name of the child partition. When you query the index, you can use the __partition
field to restrict the partitions to search.
Search all partitions in an index by calling gptext.search()
with the root partition name:
=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'{!gptextqp} blackberry', null, null);
id | score | hs | rf
-----------+-----------+----+----
71559892 | 5.670539 | |
127444971 | 5.1496587 | |
127024083 | 5.1496587 | |
65596365 | 4.4688635 | |
79177658 | 4.4688635 | |
78934938 | 4.4688635 | |
111566417 | 4.4688635 | |
65058966 | 3.5941496 | |
92240815 | 5.212467 | |
38424415 | 4.730712 | |
96811329 | 4.730712 | |
146782767 | 4.730712 | |
41409575 | 4.1019597 | |
104198393 | 4.1019597 | |
86943734 | 3.2956126 | |
89120464 | 3.2956126 | |
153181836 | 3.2956126 | |
139227011 | 3.2956126 | |
20664699 | 2.8236253 | |
(19 rows)
You can search a single partition by calling gptext.search()
with the child partition name. Use the gptext.partition_status(<index_name>)
function to see the partition names. For example:
=# SELECT partition_name, level
FROM gptext.partition_status('demo.twitter.message');
partition_name | level
------------------------------+-------
demo.twitter.message_1_prt_1 | 1
demo.twitter.message_1_prt_2 | 1
demo.twitter.message_1_prt_3 | 1
demo.twitter.message_1_prt_4 | 1
(4 rows)
This example searches only the demo.twitter.message_1_prt_3
partition:
=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.twitter.message_1_prt_3', '{!gptextqp} blackberry', null);
id | score
-----------+-----------
71559892 | 5.670539
79177658 | 4.4688635
78934938 | 4.4688635
111566417 | 4.4688635
92240815 | 5.212467
96811329 | 4.730712
104198393 | 4.1019597
86943734 | 3.2956126
89120464 | 3.2956126
(9 rows)
You can also specify a partition name or a range of partitions in the query filter argument of the gptext.search()
function. This example searches the partitions between message_1_prt_2
and message_1_prt_4
.
=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 'android',
'{__partition:[message_1_prt_2 TO message_1_prt_4]}');
id | score
-----------+-----------
42474603 | 5.770868
95666225 | 5.670539
68701747 | 4.4688635
56900818 | 4.4688635
111566417 | 4.4688635
120764432 | 4.4688635
115326522 | 4.4688635
67269000 | 3.5941496
99959486 | 6.413594
104293903 | 3.1360807
(10 rows)
Retrieving Stored Field Content
A GPText index does not, by default, store the contents of database columns in the index, with the exception of the unique id column. When you search the index, you must join the search results with the original database table on the id column in order to access other table columns.
You can configure a GPText index to store content of fields when documents are indexed. The additional stored fields can be returned with the search results so that it is unnecessary to join with the database original table. For some applications, you can even delete data from the database table or drop the table after the data has been added to the index.
Retrieve the additional field values in a GPText search by specifying a list of fields in the gptext.search()
options argument. In this example, the demo.wikipedia.articles
index has been configured to store the content
, title
, and refs
fields, in addition to the id
field. See Storing Field Content in an Index for instructions to edit the managed-schema
file to store these additional fields. In the option
argument, the Solr fl
parameter requests that contents of the id
and title
fields be included in the results.
=# SELECT *
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'+grid +economy', null, 'fl=id,title&rows=2');
id | score | hs | rf
---------+-----------+----+---------------------------------------------------------------
-------------------------------------------
533423 | 2.4593863 | | column_value { name: "id" value: "533423" } column_value { nam
e: "title" value: "Solar water heating" }
7906908 | 2.0646634 | | column_value { name: "id" value: "7906908" } column_value { na
me: "title" value: "Biomass" }
27743 | 1.823319 | | column_value { name: "id" value: "27743" } column_value { name
: "title" value: "Solar energy" }
113728 | 1.2235354 | | column_value { name: "id" value: "113728" } column_value { nam
e: "title" value: "Geothermal energy" }
(4 rows)
To retrieve all fields stored in the index, use the *
wildcard for the field list: 'fl=*'
.
In the results, the requested fields are packed into an field named rf
added to the results. The rf
field is a text value containing a structure with the following format:
column_value { name: "<field1_name>" value: "<field1_value>"}
[column_value { name: "<field2_name>" value: "<field2_value>"}] ...
The GPText function gptext.gptext_retrieve_field(rf,<column_name>)
retrieves a single field value by name from this structure as a text value. GPText provides variations to retrieve the field values as int
or float
values. If the specified field name does not exist in the rf
structure, the function returns NULL
.
This example shows how you can use the gptext.gptext_retrieve*()
functions to unpack search results into separate result columns.
=# SELECT score,
gptext.gptext_retrieve_field_int(rf, 'id') id,
gptext.gptext_retrieve_field(rf, 'title') title,
substring(gptext.gptext_retrieve_field(rf, 'content'),1,15) content
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'+grid +economy', null, 'fl=*');
score | id | title | content
-----------+----------+---------------------+-----------------
2.4593863 | 533423 | Solar water heating | '''Solar water
2.0646634 | 7906908 | Biomass | '''Biomass''' i
2.0444229 | 13690575 | Solar power | '''Solar power'
1.823319 | 27743 | Solar energy | '''Solar energy
1.2235354 | 113728 | Geothermal energy | '''Geothermal e
1.0890164 | 14205946 | Algae fuel | '''Algae fuel''
(6 rows)
Selecting a Query Parser
When you submit a query, Solr processes the query using a query parser. There are several Solr query parsers with different capabilities. For example, the ComplexPhraseQueryParser
can parse wildcards, and the SurroundQueryParser
supports span queries—finding words in the vicinity of a search term in a document.
GPText supports these query parsers:
QParserPlugin
, the default GPText query parser.QParserPlugin
is a superset of theLuceneQParserPlugin
, Solr’s native Lucene query parser.QParserPlugin
is a general purpose query parser with broad capabilities.QParserPlugin
does not support span queries and handles operator precedence in an unintuitive manner. The support for field selection is also rather weak. See http://wiki.apache.org/solr/SolrQuerySyntax.ComplexPhraseQueryParser
supports wildcards, ORs, ranges, and fuzzies inside phrase queries. See https://issues.apache.org/jira/browse/SOLR-1604.-
DisMax
(oreDisMax
) handles operator precedence in an intuitive manner and is well-suited for user queries since it is similar to popular search engines on the web. See Using the DisMax and Extended DisMax Query Processors. SurroundQueryParser
, supports the family of span queries. See Proximity Search Queries and Surround Query Parser in the Apache Solr Reference Guide.gptextqp
, the GPText Unified Query Parser, can use all of the above query parsers in combination. See Using the Universal Query Parser for more information.
The default query parser is specified in the requestHandler
definitions in solrconfig.xml
. You can edit solrconfig.xml
with the management utility command gptext-config edit
.
You can specify the query parser to use at query time by setting the Solr defType
option in the options
argument of the search function or by setting the type
as a Solr LocalParam embedded in the query.
This query specifies the dismax
query parser in the options
argument of the gptext.search()
function:
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.wikipedia.articles', '+hydroelectric -solar', null,
'defType=dismax') q
WHERE a.id = q.id::int8;
title | score
------------------------+-----------
Forward osmosis | 0.9552469
Liquid nitrogen engine | 1.0126935
(2 rows)
The following query uses the ComplexPhraseQueryParser, setting the type
parameter in a Solr LocalParam.
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!type=complexphrase} sequester AND carbon', null, null) q
WHERE a.id = q.id::int8;
title | score
---------+---------
Biomass | 3.83572
(1 row)
In the LocalParam, the type=
specifier can be omitted because type
is the default parameter:
'{!complexphrase} sequester AND carbon'
Proximity Search Queries
Proximity search queries find documents that have search terms within a specified distance. The distance is measured as the number of term moves that would be needed to make the terms adjacent.
With the standard query parser, the terms to match are placed in quotes and the distance between them is specified by adding a tilde ~
and an integer after the closing quote. The following search query finds documents with the terms “solar” and “fossil” within five terms of each other.
=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'"solar fossil"~5', null, null) s
WHERE s.id::int8 = t.id;
id | score | title
----------+------------+------------------
25784 | 0.4855828 | Renewable energy
14090587 | 0.30585092 | Low-carbon power
13690575 | 0.62667537 | Solar power
(3 rows)
The search terms inside the quotes can appear in either order. However, if the terms occur in the opposite order in the document, the distance between them is one greater than if the terms occur in the specified order.
The Surround query parser allows ordered and unordered proximity searches. The W
operator specifies an ordered search and the N
operator specifies an unordered search. The maximum distance between the terms is specified by prefixing the W
or N
operator with an integer, for example 3W
.
The proximity query can be written with prefix or infix notation.
Prefix notation: '{!surround} 3W(solar, fossil)'
Infix notation: '{!surround} solar 3W fossil'
Here are some proximity query examples using the Surround query parser.
'{!surround} title:2w(solar, heat)'
title
field for the terms “solar” and “heat” within two terms, and in the specified order. This query uses prefix notation. The N
and W
operators are not case-sensitive. '{!surround} title:heat 2N solar'
title
field for the terms “heat” and “solar” within two terms, in any order. This query uses infix notation.'{!surround} title: W(solar, heat)'
title
field for adjacent terms “solar” and “heat”. The default distance is 1, so 1W
can be abbreviated to W
.The Surround query parser does not analyze query text like the other query parsers. GPText indexes are by default built with lowercase and stemming filters, for example, so surround queries containing capital letters or unstemmed terms will return no results.
The wikipedia.articles
index contains a document with the title “Solar water heating”. The following example search, however, cannot find it.
=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!surround} title: 2w(Solar, heating)', null, null) s
WHERE s.id::int8 = t.id;
id | score | title
----+-------+-------
(0 rows)
When you rewrite the query to use only lowercase characters and remove the suffix from “heating”, the document is found.
=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!surround} title: 2w(solar, heat)', null, null)
WHERE s.id::int8 = t.id;
id | score | title
--------+-----------+---------------------
533423 | 1.5434089 | Solar water heating
(1 row)
An easy way to avoid this limitation is to use the GPText Universal Query Parser, which does analyze the query text and also supports the Surround query parser’s proximity syntax.
Using the Universal Query Parser
With the GPText Universal Query Parser, you can perform searches using features from any of the other supported query parsers, combined into one search string. Invoke the Universal Query Parser by setting the Solr type
parameter in a Solr LocalParam with this format:
'{!gptextqp} <search_query>'
The search query in the following example includes syntax from three query parsers:
-
sea*
– Complex query with wildcard -
2W
– Proximity query requesting a maximum of two words distance between the terms “sea*” and “oil” or “fuel” -
oil OR fuel
– Solr Standard Query Processor
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),'demo.wikipedia.articles',
'{!gptextqp} sea* 2W (oil OR fuel)', null, null) q
WHERE a.id = q.id::int8;
title | score
--------------+-----------
Seaweed fuel | 55.250305
(1 row)
In the following example, title:n(power, geothermal)
specifies that the terms “power” and “geothermal” in the title
field must be adjacent, but they can occur in either order.
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!gptextqp} title:n(power, geothermal)', null, null) q
WHERE a.id = q.id::int8;
title | score
------------------+---------
Geothermal power | 2.05577
(1 row)
This query uses the fuzzy search operator ~
to find articles with titles containing a term similar to “lethiam” and a complex query that finds articles with “ocean” and “wind” in the content.
=# SELECT t.id, score, title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER by 1), 'demo.wikipedia.articles',
'{!gptextqp} title:lethiam~ OR content:(ocean AND wind)', null, null) s
WHERE t.id=s.id::int8;
id | score | title
----------+------------+-------------------
2120798 | 1.3326647 | Lithium economy
4711003 | 2.6328268 | Osmotic power
25784 | 3.3899183 | Renewable energy
55017 | 0.95579207 | Fusion power
113728 | 1.3909805 | Geothermal energy
27743 | 2.114852 | Solar energy
13690575 | 1.4488393 | Solar power
(7 rows)
Using the DisMax and Extended DisMax Query Parsers
The DisMax query parser supports a subset of the Solr Standard Query Parser syntax. It is useful for queries from end users who are familiar with common search systems, such as Google search. It supports quoted phrases, AND and OR operators, and + and - operators. The Extended DisMax query parser improves upon the DisMax query parser, supporting the full Standard query parser syntax.
The DisMax and Extended DisMax query parser behaviors can be customized at query time by setting parameters in the Solr options argument of the gptext.search()
function or as local parameters in the query text. See DisMax Parameters and Extended DisMax Parameters for details. One useful parameter is the qf
(query fields) parameter, which specifies a list of fields to search. Using this parameter avoids having to write a query that searches each field individually. For example, instead of writing this query:
'content:nuclear OR title:nuclear OR links:nuclear'
you can write:
{!edismax qf="content title links"} nuclear
The following example queries illustrate features of the DisMax and Extended DisMax query parsers.
'{!dismax} +nuclear reactor'
'{!dismax} +"nuclear reactor"'
'{!dismax} +solar -reactor'
'{!edismax qf="title refs"} solar'
title
or refs
fields. '{!edismax qf="title"} (solar or renewable) and energy'