Pivotal® Greenplum® Text v3.1.0

Using Named Entity Recognition with GPText

Pivotal GPText includes Apache OpenNLP components to allow you to use named entity recognition (NER). Named entities include the names of people, organizations, and locations. OpenNLP also recognizes parts of speech (POS). The OpenNLP libraries and models required for English language recognition are included with GPText. For non-English language documents, you can upload to ZooKeeper any of the other models available from the OpenNLP project.

A GPText index that includes NER and POS tagging must have terms enabled, using the gptext.enable_terms() function. You add a text field definition to the index’s configuration, adding POS and NER filters to the analysis chain after the tokenizer. The filters use the OpenNLP models you specify to recognize entities in documents and classify parts of speech. Tokens recognized are tagged and saved as terms in the field’s term vector.

NER-tagged terms have the format _ner_<entity-type>_<token>, where <entity-type> is the type of entity, for example person or location, and <token> is the text of the token, produced by the tokenizer. Terms are not case-sensitive. Examples of NER-tagged terms are _ner_person, _ner_person_Alan, and _ner_location_boston. A term like _ner_person matches any person, including a more specific term like _ner_person_alan.
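The term format described above can be sketched as a small helper. This is only an illustration in Python: the `ner_term` function is hypothetical and not part of GPText, and lowercasing is just one way to reflect that terms are not case-sensitive.

```python
from typing import Optional


def ner_term(entity_type: str, token: Optional[str] = None) -> str:
    """Build an NER-tagged term of the form _ner_<entity-type>[_<token>]."""
    term = f"_ner_{entity_type.lower()}"
    if token:
        # Terms are not case-sensitive, so normalize the token to lowercase.
        term += f"_{token.lower()}"
    return term


print(ner_term("person"))              # generic term that matches any person
print(ner_term("person", "Alan"))      # term for one specific person token
print(ner_term("location", "Boston"))  # term for one specific location token
```

Omitting the token yields the generic term, which is the form used in several of the example queries later in this topic.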

The POS English language model uses part-of-speech tags from the University of Pennsylvania's Penn Treebank project. POS-tagged terms have the format _pos_<tag>, where <tag> is a Penn Treebank part-of-speech tag. Examples of POS-tagged terms are _pos_nn, _pos_vb, and _pos_rb, for nouns, verbs, and adverbs, respectively.
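As a rough illustration of the POS term format, the following Python sketch maps a few Penn Treebank tags to their terms. The `pos_term` helper and the `PENN_TAGS` table are hypothetical names used only for this example, and the table lists just three tags from a much larger tag set.

```python
# A few Penn Treebank tags and their meanings (not the full tag set).
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "VB": "verb, base form",
    "RB": "adverb",
}


def pos_term(tag: str) -> str:
    """Build a POS-tagged term of the form _pos_<tag>, lowercased."""
    return f"_pos_{tag.lower()}"


for tag, meaning in PENN_TAGS.items():
    print(pos_term(tag), "->", meaning)
```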

Enabling NER for GPText Indexes

The example in this section shows how to add NER support to a GPText index. The example works with a table named news_demo in the Greenplum Database demo database.

  1. Download the CSV data file for the table to the gpadmin home directory from this link: news_demo.csv.tgz. Extract the news_demo.csv file from the downloaded file with the following command.

    $ tar xvf news_demo.csv.tgz
    
  2. Log in to the demo database with psql and create and load the news_demo table.

    =# CREATE TABLE news_demo(
        id bigint, 
        articleid varchar(50),
        news_date date, 
        headline text, 
        content text) 
    DISTRIBUTED BY (id);
    

    Load data into the table from the news_demo.csv data file.

    =# COPY news_demo from '/home/gpadmin/news_demo.csv' with csv header;
    
  3. Create the GPText index and enable terms for the content field.

    =# SELECT * FROM gptext.create_index('public', 'news_demo', 'id', 'content');
    =# SELECT * FROM gptext.enable_terms('demo.public.news_demo', 'content');
    
  4. Edit the managed-schema for the demo.public.news_demo index using the gptext-config utility.

    $ gptext-config edit -i demo.public.news_demo -f managed-schema
    

    Add the following text_opennlp field type definition to the list of <fieldType> elements.

    <fieldType name="text_opennlp" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.OpenNLPTokenizerFactory"
           sentenceModel="en-sent.bin"
           tokenizerModel="en-token.bin"/>
        <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
        <filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
          nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
        <filter class="solr.StopFilterFactory" words="stopwords-ner.txt" ignoreCase="true"/>
        <filter class="com.emc.solr.analysis.opennlp.NERAndTypeAttributeAsSynonymFilterFactory" extractType="true" typePrefix="_pos_"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" pattern="^(_ner_|_pos_).+$" />
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    

    Find the content field and change the type attribute to "text_opennlp".

    <field name="content" type="text_opennlp" indexed="true" termOffsets="true" 
           stored="false" termPositions="true" termPayloads="true" termVectors="true"/>
    
  5. Index the documents and commit the index.

    =# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
    =# SELECT * FROM gptext.commit_index('demo.public.news_demo');
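
Before reindexing, it can help to confirm that the POS and NER filters follow the tokenizer in the index analyzer chain. The following Python sketch parses a trimmed copy of the text_opennlp field type from step 4 and lists the chain in order. It is a local sanity check only, not a GPText feature, and the `index_chain` function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the index analyzer from the text_opennlp field type in step 4.
FIELD_TYPE = """
<fieldType name="text_opennlp" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory"
       sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
    <filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
      nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""


def index_chain(field_type_xml):
    """Return the class names of the index analyzer components, in order."""
    root = ET.fromstring(field_type_xml)
    analyzer = root.find("analyzer[@type='index']")
    return [child.get("class") for child in analyzer]


chain = index_chain(FIELD_TYPE)
print(chain)
```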
    

Example Search Queries for NER-Enabled Indexes

Following are example queries that search for NER-tagged terms in the demo.public.news_demo index.

Retrieve NER person offsets

This query retrieves, for each document that contains NER person terms, an array of the offsets of those terms in the content field. The offsets are returned in the hs column.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_person', NULL, 'hl=true&hl.fl=content&rows=10&sort=score desc');

Following are results from this search (with some rows omitted for space).

    id     |   score    | hs | rf
-----------+------------+----+----
 842613544 |  0.7074248 | {"fieldOffsets":[{"field":"content","offsets":[{"start":14,"end":28},{"start":40,"end":44},{"start":251,"end":261},{"start":726,"end":736},{"start":896,"end":909},{"start":1093,"end":1103},{"start":1118,"end":1133},{"start":1184,"end":1187},{"start":1188,"end":1194},{"start":1253,"end":1258}]}]} |
 842613572 |  0.7059102 | {"fieldOffsets":[{"field":"content","offsets":[{"start":61,"end":65},{"start":500,"end":512},{"start":547,"end":563},{"start":711,"end":715},{"start":883,"end":896},{"start":965,"end":969},{"start":1065,"end":1078}]}]} |

(ROWS OMITTED)

 842613594 |  0.5854553 | {"fieldOffsets":[{"field":"content","offsets":[{"start":520,"end":533},{"start":559,"end":564},{"start":968,"end":982},{"start":987,"end":1000},{"start":1509,"end":1512}]}]} |
 842614457 |  0.5810676 | {"fieldOffsets":[{"field":"content","offsets":[{"start":400,"end":423},{"start":723,"end":733},{"start":812,"end":827},{"start":963,"end":970},{"start":1181,"end":1188}]}]} |
(40 rows)
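
The offsets in the hs column can be applied to the document text on the client side. The Python sketch below uses the first two offsets returned for document 842613544 and the opening of that document's content, as it appears in the highlighted output of the next example. The `spans` function is a hypothetical helper, and the reading of start as inclusive and end as exclusive character positions is an assumption inferred from this sample rather than documented behavior.

```python
import json

# First two offsets of the hs value for document 842613544 (truncated for brevity).
hs = ('{"fieldOffsets":[{"field":"content","offsets":'
      '[{"start":14,"end":28},{"start":40,"end":44}]}]}')

# Opening of the same document's content column.
content = ("WASHINGTON -- William Taylor, President Bush's nominee "
           "to run the nation's deposit insurance system")


def spans(text, hs_json, field="content"):
    """Return the substrings of text covered by the offsets for one field."""
    found = []
    for entry in json.loads(hs_json)["fieldOffsets"]:
        if entry["field"] == field:
            # Assumption: start is inclusive and end is exclusive.
            found.extend(text[o["start"]:o["end"]] for o in entry["offsets"])
    return found


print(spans(content, hs))
```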

Retrieve documents containing an NER person term

This query retrieves the content of documents in the news_demo table with terms tagged _ner_person highlighted.

=# SELECT 
     news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
     s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '{!gptextqp} _ner_person', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Following are three rows of the output from this command with the headings and empty lines omitted.

842613544 | WASHINGTON -- <em>William Taylor</em>, President <em>Bush</em>'s nominee to run the nation's deposit insurance 
system, said bank regulators could have done a better job policing the Bank of Credit & Commerce International. "I think we 
have learned a series of things," <em>Mr. Taylor</em>, currently the head of bank supervision at the Federal Reserve, told t
he Senate Banking Committee at his confirmation hearing. "You shouldn't allow someone in the country that doesn't have super
vision from a strong home-country supervisor," he said. BCCI, with operations in the Middle East, Africa, Europe and the U.S
., fraudulently hid huge losses for months from regulators around the world. No U.S. depositors lost money. Questions about 
BCCI's actions during <em>Mr. Taylor</em>'s tenure at the Federal Reserve were expected to be the only serious hurdle to his
confirmation. But committee members seemed satisfied with his remarks. Sen. <em>Donald Riegle</em> (D., Mich.), chairman of
the committee, said that he expects the committee will recommend the confirmation and that the Senate will vote within a fe
w weeks. If confirmed as expected, <em>Mr. Taylor</em> would succeed <em>William Seidman</em>, whose term expires next month
. In his testimony, <em>Mr.</em> <em>Taylor</em> said he remains troubled by lingering questions involving <em>BCCI.</em> In
the U.S., BCCI was able to evade government prohibitions from purchasing stock in First American Bankshares Inc. by using f
rontmen. "I really have difficulty in knowing how we're going to uncover (such) arrangements anywhere," he said. "I really t
hink it's difficult to determine when two people conspire to change the control of an organization."                        
|  0.7074248
842613572 | WASHINGTON -- The House, in a stunning victory for President <em>Bush</em>, agreed to cut the tax on capital ga
ins, soundly rejecting an alternative proposed by Democratic leaders. After weeks of intense lobbying by both sides, the lea
dership's plan was defeated by a larger-than-expected 239-190 vote. The convincing margin increases the likelihood that a ca
pital gains cut of some sort could become law this year. The vote was a blow to the House's newly elected Democratic leaders
hip, particularly Speaker <em>Thomas Foley</em> of Washington and Majority Leader <em>Richard Gephardt</em> of Missouri. Bot
h had put their personal prestige on the line to defeat the tax-cut measure, which represented their first major showdown wi
th the <em>Bush</em> administration. Still, fully one-quarter of their membership -- 64 Democrats -- deserted them and sided
with a near-solid phalanx of Republicans. Only one Republican, <em>Doug Bereuter</em> of Nebraska, broke ranks and voted, a
gainst the wishes of President <em>Bush</em>, for the Democratic alternative. "This was a watershed for us," glowed House Re
publican Leader <em>Robert Michel</em> of Illinois.                                                                         
|  0.7059102
842613885 | <em>John Kerry</em>, seizing the chance to define his candidacy before a national television audience with his 
presidential nomination acceptance speech, took the fight straight to the two areas where President <em>Bush</em> has enjoye
d his greatest political strengths: national security and social values. Rather than shying away from ground that has someti
mes been shaky for Democrats, Mr. <em>Kerry</em> planted his own flag in a forceful and at times combative speech. "Let ther
e be no mistake: I will never hesitate to use force when it is required," the Massachusetts senator told 4,000 cheering dele
gates on the final night of the Democratic convention in Boston. "Any attack will be met with a swift and certain response,"
he continued, attempting to meet widespread and persistent voter questions about whether a Democrat, even a war veteran, is
tough enough to lead the country in fighting terrorism. At one point, Mr. <em>Kerry</em> appeared to belittle Mr. <em>Bush<
/em>'s record as commander in chief, especially his justification for the war in Iraq. "Now I know there are those who criti
cize me for seeing complexities -- and I do -- because some issues just aren't all that simple," he said. "Saying there are 
weapons of mass destruction in Iraq doesn't make it so." It was one of several oblique shots Mr. <em>Kerry</em> took at the 
president and his advisers, even as he also called directly on President <em>Bush</em> to run a positive campaign. Confronti
ng another of his party's vulnerabilities -- a perception that Democrats are out of the cultural mainstream -- Mr. <em>Kerry
</em>'s 45- minute speech tackled President <em>Bush</em> on social issues. "It's time for those who talk about family value
s to start valuing families," he said.                                                                                      
|  0.7014734
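
If you need the recognized names themselves rather than the marked-up text, the highlighted spans can be pulled out on the client side. The Python sketch below assumes the highlight markers are the <em> tags shown in the output above, and `highlighted_terms` is a hypothetical helper name.

```python
import re

# Excerpt from the highlighted output for document 842613572.
highlighted = (
    "particularly Speaker <em>Thomas Foley</em> of Washington and "
    "Majority Leader <em>Richard Gephardt</em> of Missouri."
)


def highlighted_terms(text):
    """Return the text of every <em>...</em> span, in document order."""
    # Non-greedy match so that each span is captured separately.
    return re.findall(r"<em>(.*?)</em>", text)


print(highlighted_terms(highlighted))
```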

Retrieve documents containing specified NER person terms

This search returns the content of documents that contain the persons “Alan” and “Bush”, with the names highlighted.

=# SELECT 
    news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_person_Alan AND _ner_person_Bush', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Retrieve documents containing both NER organization and time terms

This search finds documents that contain both NER organization and time terms, with the terms highlighted.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_organization AND _ner_time', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Following is an example row returned by this query.

842613848 | NEW YORK--U.S. oil futures declined Tuesday as traders were reluctant to place big bets while <em>Federal Reser
ve</em> officials debated the future of the central bank's key economic stimulus program. Light, sweet crude for January del
ivery settled 26 cents, or 0.3%, lower at $97.22 a barrel on the <em>New York Mercantile Exchange</em>. Nymex prices traded 
in a narrow range for most of the session as market participants chose to wait until Wednesday <em>afternoon</em> for potent
ial clarity on the <em>Fed</em>'s easy-money policies. "It's a directionless trade," said John Kilduff, founding partner of 
<em>Again Capital LLC</em>, a New York hedge fund that focuses on energy, referring to the lack of significant price movemen
t. He added, "You can make a strong argument on both sides, and there's a lot of room for the <em>Fed</em> to surprise us ei
ther way." Many traders expect the <em>Fed</em> to begin scaling back its so-called quantitative-easing program, in which it
buys $85 billion each month in mortgage-backed securities and longer-term <em>Treasury</em> bonds, in the near future. The 
program has boosted oil prices by weakening the dollar, making crude cheaper to buy with other currencies.                  

Retrieve documents containing NER person or organization terms

This search returns documents containing an NER person term or an NER organization term, or both, with the terms highlighted.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_person _ner_organization', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
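
The query strings in the last three examples differ only in how the terms are combined. As a minimal sketch, the hypothetical Python helpers below build both forms: terms joined with AND must all be present, while terms separated by spaces match documents that contain any of them, as described above.

```python
def all_of(*terms):
    """Require every term, as in '_ner_organization AND _ner_time'."""
    return " AND ".join(terms)


def any_of(*terms):
    """Match any of the terms, as in '_ner_person _ner_organization'."""
    return " ".join(terms)


print(all_of("_ner_person_Alan", "_ner_person_Bush"))
print(all_of("_ner_organization", "_ner_time"))
print(any_of("_ner_person", "_ner_organization"))
```

The resulting string is passed as the third argument of gptext.search(), as in the queries above.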

Retrieve documents containing NER person and time terms (forward proximity search)

This query performs a proximity search to find documents with a person term followed by a time term within the next seven terms.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '{!gptextqp} (_ner_person 7W _ner_time)', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Retrieve documents with a specified NER person and any NER person (unordered proximity search)

Like the previous example, this query performs a proximity search, but the terms can appear in the document in either order and must be within ten terms of each other.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '{!gptextqp} (_ner_person_Taylor 10N _ner_person)', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;
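
The two proximity forms can be generated from the same inputs. The Python sketch below is illustrative only: `proximity` is a hypothetical helper, and it covers just the two operator forms used in the examples above, W for an ordered search and N for an unordered one.

```python
def proximity(left, right, distance, ordered=True):
    """Build a gptextqp proximity query string.

    ordered=True uses the W operator (left term followed by right term);
    ordered=False uses the N operator (terms in either order).
    """
    op = "W" if ordered else "N"
    # Doubled braces produce the literal {!gptextqp} prefix in an f-string.
    return f"{{!gptextqp}} ({left} {distance}{op} {right})"


print(proximity("_ner_person", "_ner_time", 7))
print(proximity("_ner_person_Taylor", "_ner_person", 10, ordered=False))
```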

Customizing NER Field Types

GPText includes the following English language models.

  • en-ner-date.bin
  • en-ner-location.bin
  • en-ner-money.bin
  • en-ner-organization.bin
  • en-ner-percentage.bin
  • en-ner-person.bin
  • en-ner-time.bin

To specify the models you want to use for an index, edit the managed-schema file for the index and set the nerTaggerModels attribute of the OpenNLPNERFilterFactory filter element in the field type definition.

<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
  nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>

You can download models for other languages at Models for 1.5 Series. Upload the model to ZooKeeper using the gptext-config upload command and then update the nerTaggerModels attribute as shown. For example, to add the Spanish person model:

  1. Download the es-ner-person.bin file from Models for 1.5 Series.

  2. Upload the es-ner-person.bin file to ZooKeeper.

    $ gptext-config upload -i demo.public.news_demo -l es-ner-person.bin -f es-ner-person.bin
    
  3. Edit the managed-schema file for the index.

    $ gptext-config edit -i demo.public.news_demo -f managed-schema
    
  4. Add the es-ner-person.bin model to the OpenNLPNERFilterFactory filter for the field. Because it is listed first, Spanish names will be recognized first, and then English names.

    <filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
      nerTaggerModels="es-ner-person.bin,en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
    
  5. Save the managed-schema file changes and reindex the documents.
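
The change made in step 4 amounts to putting a model at the front of a comma-separated list. The Python sketch below shows that edit on a standalone copy of the filter element. It only illustrates the attribute change and does not replace the gptext-config workflow; `prepend_model` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The filter element as it appears before the Spanish model is added.
FILTER = (
    '<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory" '
    'nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>'
)


def prepend_model(filter_xml, model):
    """Return the filter element with model placed first in nerTaggerModels."""
    elem = ET.fromstring(filter_xml)
    models = elem.get("nerTaggerModels").split(",")
    if model not in models:
        # Models listed first are applied first.
        models.insert(0, model)
    elem.set("nerTaggerModels", ",".join(models))
    return elem


updated = prepend_model(FILTER, "es-ner-person.bin")
print(updated.get("nerTaggerModels"))
```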

Adding OpenNLP Libraries to Existing GPText Indexes

To use NER with a GPText index created with a version of GPText earlier than GPText 3.1, you must add the OpenNLP libraries to the index’s solrconfig.xml configuration file. These libraries are already present in the solrconfig.xml file for indexes created with GPText 3.1 or later. The installed GPText version must be release 3.1 or later.

Use the gptext-config utility to edit the solrconfig.xml file.

$ gptext-config edit -i <index-name> -f solrconfig.xml

Find the existing <lib> elements and add these elements.

  <!-- Add below existing lib settings -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" 
       regex="opennlp.*"/>
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" 
       regex="lucene-analyzers-opennlp-.*\.jar"/>
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" 
       regex="gptext-analysis-extras-.*\.jar"/>

Save the solrconfig.xml file with these changes.
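
To see which files the regex attributes above would select, the patterns can be tried against a list of file names. The Python sketch below is a rough check under two assumptions: that each pattern must match the whole file name, and that the jar names, which are hypothetical, resemble those in your installation.

```python
import re

# The regex attribute values from the <lib> elements above.
LIB_PATTERNS = [
    r"opennlp.*",
    r"lucene-analyzers-opennlp-.*\.jar",
    r"gptext-analysis-extras-.*\.jar",
]

# Hypothetical jar names, for illustration only.
jars = [
    "opennlp-tools-1.8.3.jar",
    "lucene-analyzers-opennlp-7.4.0.jar",
    "gptext-analysis-extras-3.1.0.jar",
    "icu4j-59.1.jar",
]


def loaded(jar_names, patterns):
    """Return the jar names selected by at least one pattern."""
    # Assumption: a pattern must match the whole file name.
    return [j for j in jar_names if any(re.fullmatch(p, j) for p in patterns)]


print(loaded(jars, LIB_PATTERNS))
```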