Defines the set of terms for a Solr index field. An analyzer consists of a tokenizer and a set of optional filters to be applied to the input text. For example, an analyzer can consist of a WhitespaceTokenizerFactory followed by a LowerCaseFilterFactory as a filter. See also: tokenizer, token filter.
A sequence of two adjacent elements in a token string. A sequence of three consecutive tokens is a trigram and a sequence of n consecutive tokens is an n-gram.
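As a minimal sketch in Python, n-grams can be produced by sliding a window of size n over a list of tokens. The function name `ngrams` is hypothetical and is not part of GPText or Solr.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The last window starts at index len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A token list shorter than n yields an empty list.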
The process of sorting data into one of two categories, for example, classifying a given text according to whether the text is associated with a positive or negative sentiment.
Classification problems with more than two classes into which the data is to be categorized are called multiclass classification problems.
In clustering problems, a centroid represents the approximate center of a cluster.
A centroid does not have to map directly to a data point in the cluster. For example, in k-means clustering the coordinates of a centroid are the mean of the coordinates of data points (documents) pertaining to that cluster and are constantly updated as new data point assignments are made.
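The mean-of-coordinates calculation can be sketched in Python as follows. The helper name `centroid` and the sample points are illustrative assumptions. A full k-means implementation would also reassign points and recompute the centroid repeatedly.

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("a cluster needs at least one point")
    count = len(points)
    # zip(*points) groups the first coordinates together, then the second, and so on.
    return [sum(coords) / count for coords in zip(*points)]

cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
center = centroid(cluster)
```

Here the computed center does not coincide with any of the three data points.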
A group of similar data points. For example, if some documents are to be grouped into three clusters, the result of a machine learning algorithm is three clusters: all the documents within a particular cluster are similar to each other, but different from the documents in other clusters. The results and the quality of the clusters depend on various factors, including the algorithm used, the parameters that were configured, and the set of features used.
A complete, logical index in a GPText or SolrCloud system. A logical index is composed of multiple shards. A GPText collection has one shard per Greenplum segment.
The Solr configuration files that define the structure and configuration of a GPText index, such as solrconfig.xml. These files are stored in ZooKeeper.
A replica of a shard, implemented in GPText and SolrCloud as a Lucene index. Also called Solr Core.
A collection of documents. Plural: corpora.
A list of unique words or terms from the documents that comprise the vocabulary of the document collection.
A generalized term for a feature of data, such as word counts in a document. Dimensions are typically large in number.
A vector that expresses a document in an n-dimensional feature space. For example, if your dictionary contains n unique terms, a document could be expressed as an n-dimensional vector in which each position contains the number of times a particular term from the dictionary appears in that document. A feature space could be the entire dictionary or could be another dictionary (or set of features) extracted by using feature selection.
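A minimal Python sketch of the term-count example, assuming a small hypothetical dictionary and an already tokenized document. The name `count_vector` is illustrative only.

```python
def count_vector(dictionary, tokens):
    """Map a tokenized document to a list of term counts, one position per dictionary term."""
    position = {term: i for i, term in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in tokens:
        if token in position:  # tokens outside the dictionary are ignored
            vector[position[token]] += 1
    return vector

dictionary = ["cat", "dog", "fish", "sat"]
document = ["the", "cat", "sat", "with", "the", "other", "cat"]
vector = count_vector(dictionary, document)
```

Most positions in such a vector are zero when the dictionary is large.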
The process of reducing the dimensions (or features) according to which the data (or documents) is expressed in a feature space. For example, selecting terms (or features) from the dictionary that appear in more than k documents in the entire corpus gives one set of reduced dimensions.
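The document-frequency example can be sketched in Python as follows. The function name `select_terms`, the sample corpus, and the threshold are assumptions made for illustration. Other selection criteria are equally possible.

```python
from collections import Counter

def select_terms(corpus, k):
    """Return, in sorted order, the terms that appear in more than k documents."""
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return sorted(term for term, df in doc_freq.items() if df > k)

corpus = [
    ["solr", "index", "shard"],
    ["solr", "query", "index"],
    ["solr", "replica"],
]
reduced = select_terms(corpus, 1)
```

With k set to 1, the five-term dictionary is reduced to the two terms that occur in at least two documents.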
A replica that has been elected leader for a shard. When documents are indexed, SolrCloud sends them to the lead replica for the shard and the lead distributes them to all of the shard’s remaining replicas. Also called leader.
A branch of artificial intelligence that focuses on the construction and study of systems that can learn from data.
natural language processing
A field of study that combines computer science, artificial intelligence, and linguistics to study interactions between computers and human languages.
The formal representation of knowledge as a set of concepts (ideas, entities, events) and their properties and relations according to a system of categories within a domain. Ontologies provide the structural frameworks for organizing information for fields such as artificial intelligence. Ontology is not a synonym for taxonomy.
proximity, term proximity
A search that looks for documents in which two or more separately matching term occurrences are within a specified distance (a number of intermediate words or characters).
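The idea can be sketched in Python by counting the words that lie between two matching terms. This is a simplified illustration with a hypothetical function name, `within_proximity`. It is not how Solr scores or evaluates proximity queries.

```python
def within_proximity(tokens, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur with at most max_distance words between them."""
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(
        abs(a - b) - 1 <= max_distance  # number of intermediate words
        for a in positions_a
        for b in positions_b
        if a != b
    )

tokens = "the lead replica distributes documents to the remaining replicas".split()
near = within_proximity(tokens, "lead", "documents", 2)
far = within_proximity(tokens, "lead", "documents", 1)
```

Two words separate "lead" and "documents" in this sentence, so the match succeeds at a distance of 2 and fails at a distance of 1.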
A component that parses the input queries provided for search.
A single copy of a shard, managed as a Solr core. One replica for each shard is elected as the lead replica, or leader. All updates are directed to the lead replica; changes are replicated from the lead to the remaining replicas.
Classifies opinions expressed in text documents into categories such as “positive” and “negative”.
A logical piece or slice of a collection. In GPText, there is one shard for each Greenplum Database segment. A shard is made up of one or more replicas. One replica is elected the lead replica, or leader, and updates to the leader are replicated to the other replicas.
silhouette coefficient (SC)
A quantitative measure of data clustering performance. SC measures how tightly grouped all the data in the cluster are. Its values range between -1 and 1. Values near 1 indicate that clustering was good, and values near -1 indicate that clustering was not good and the data point should have been assigned to another cluster. Values near 0 indicate that the cluster assignment was ambiguous, and the data point is somewhere on the boundary of the cluster.
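For a single data point, the commonly used silhouette formula is (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The Python sketch below assumes Euclidean distance and uses hypothetical names and sample points. It requires Python 3.8 or later for `math.dist`.

```python
import math

def silhouette(point, same_cluster_others, other_clusters):
    """Silhouette value of one point: (b - a) / max(a, b)."""
    if not same_cluster_others:
        return 0.0  # common convention for a cluster with a single point
    # a: mean distance to the other members of the point's own cluster
    a = sum(math.dist(point, p) for p in same_cluster_others) / len(same_cluster_others)
    # b: smallest mean distance to the members of any other cluster
    b = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)

well_placed = silhouette((0, 0), [(0, 1), (1, 0)], [[(10, 0), (10, 0)]])
misplaced = silhouette((9, 0), [(0, 0), (0, 0)], [[(10, 0)]])
```

The first point is close to its own cluster and far from the other, so its value is near 1. The second point is closer to the other cluster, so its value is negative.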
A vector whose elements are mostly zeros or are unpopulated. See the Getting Started with GPText Guide for more information.
The part of a word that is common to all its inflected variants, that is, the forms a word takes to express different grammatical categories, such as the conjugations of a verb. For example, receives, receiving, and received all derive from the stem “receiv”.
The process for reducing an inflected or derived word to its stem, base, or root form. The stem is not necessarily the same as the root form. For example, receives, receiving, and received all derive from the stem “receiv”; the root form is receive.
support vector machine (SVM)
A supervised learning model that classifies data by analyzing the data, recognizing patterns in the data, and placing the data in specific classes. Applications include sentiment analysis, separating spam email from legitimate email, and, if the Sorting Hat were an SVM, determining the House to which new Hogwarts students are assigned.
A hierarchical system of classification; a method for dividing terms, concepts, or other entities into ordered groups or categories. Taxonomies differ from ontologies in that taxonomies are generally focused, simple tree relationships, whereas ontologies have broader scopes.
A distinct word or value within a field.
Term frequency-inverse document frequency. A numeric statistic that reflects how important a word is to a document in a collection or corpus. The tf-idf score increases in proportion to the number of times a word appears in the document, offset by how frequently the word appears across the corpus.
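Several tf-idf weighting variants exist. The Python sketch below assumes one simple variant, the raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the sample corpus are illustrative. Solr's own scoring may differ.

```python
import math

def tf_idf(term, document, corpus):
    """Raw count of term in document, times log(N / document frequency)."""
    tf = document.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["solr", "index", "index"],
    ["solr", "query"],
    ["solr", "shard"],
]
common_score = tf_idf("solr", corpus[0], corpus)
distinctive_score = tf_idf("index", corpus[0], corpus)
```

A term that appears in every document scores zero under this variant, while a term concentrated in one document scores higher.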
A vector containing tf-idf scores.
Units into which an input string is broken. For example, a token can be an individual term, a bigram (two adjacent terms), or a trigram that appears in the input text.
Breaks a stream of text into tokens based on delimiters, the separators that specify the characters to consider as the token boundaries, or on regular expressions. For example, a delimiter could be a white space.
Tokenizers are not aware of fields in a document.
Takes a stream of tokens produced by a tokenizer, examines each token, and either passes the token along, modifies it, or discards it. For example, a token filter may remove unnecessary words such as “a”, “an”, or “the”, or remove dots from acronyms. Token filters produce another stream of tokens that can be input to other token filters.
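The way a tokenizer and a chain of token filters fit together can be sketched in Python. The function names below are hypothetical and only mimic the roles of the components. They are not Solr's Java factory classes.

```python
def whitespace_tokenizer(text):
    """Break text into tokens at runs of white space."""
    return text.split()

def lowercase_filter(tokens):
    """Pass every token along in lowercase."""
    for token in tokens:
        yield token.lower()

def stop_word_filter(tokens, stop_words=frozenset({"a", "an", "the"})):
    """Discard stop words and pass all other tokens along."""
    for token in tokens:
        if token not in stop_words:
            yield token

def analyze(text):
    """Chain the tokenizer and the filters, in the role of an analyzer."""
    return list(stop_word_filter(lowercase_filter(whitespace_tokenizer(text))))

terms = analyze("The Quick brown fox jumped over a lazy dog")
```

Each filter consumes the token stream produced by the previous stage. Because the lowercase filter runs before the stop word filter, "The" is lowercased and then discarded.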
Universal Query Parser
The GPText Universal Query Parser parses queries containing expressions supported by any of the supported query parsers.
Apache ZooKeeper is a server product that provides cluster management services such as centralized configuration management, load balancing, and failover. It is a SolrCloud requirement. ZooKeeper handles leader elections for SolrCloud replicas. For high availability, ZooKeeper is deployed as a cluster of 3, 5, or 7 nodes. The ZooKeeper cluster can be deployed with GPText on the same cluster hosts or an existing ZooKeeper cluster can be used.