Class SimplePatternSplitTokenizerFactory

java.lang.Object
org.apache.lucene.analysis.AbstractAnalysisFactory
org.apache.lucene.analysis.TokenizerFactory
org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory

public class SimplePatternSplitTokenizerFactory extends TokenizerFactory
Factory for SimplePatternSplitTokenizer, which produces tokens by splitting the input according to the provided regexp.

This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:

  • "pattern" (required) is the regular expression, following the syntax described at RegExp
  • "determinizeWorkLimit" (optional, default Operations.DEFAULT_DETERMINIZE_WORK_LIMIT) is the limit on the total effort spent determinizing the automaton computed from the regexp

The pattern matches the characters that should split tokens, like String.split, and the matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.
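The separator semantics above can be sketched in plain Java. Note this is only an illustration: Lucene's RegExp syntax is more limited than java.util.regex, and the real tokenizer runs a determinized automaton over the character stream rather than a Matcher. The helper below shows the contract the paragraph describes: the pattern matches separators, the longest separator match wins (greedy), and empty tokens are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSemantics {

  // Illustrative stand-in for the tokenizer's splitting contract:
  // "separatorPattern" matches the characters BETWEEN tokens, matching is
  // greedy, and empty tokens are never emitted.
  static List<String> split(String separatorPattern, String text) {
    List<String> tokens = new ArrayList<>();
    Matcher m = Pattern.compile(separatorPattern).matcher(text);
    int last = 0;
    while (m.find()) {
      if (m.start() > last) { // skip the empty token between adjacent separators
        tokens.add(text.substring(last, m.start()));
      }
      last = m.end();
    }
    if (last < text.length()) { // trailing token after the last separator
      tokens.add(text.substring(last));
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Leading, doubled, and trailing whitespace produce no empty tokens.
    System.out.println(split("[ \t\r\n]+", "  one  two\tthree "));
  }
}
```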

For example, to match tokens delimited by simple whitespace characters:

 <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
   </analyzer>
 </fieldType>
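The same tokenizer can be driven directly from Java. SimplePatternSplitTokenizer and CharTermAttribute are real classes from Lucene's analysis-common module; the helper method and sample input below are illustrative, and the single-argument constructor is assumed to determinize the regexp with the default work limit, matching the factory's defaults.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimplePatternSplitDemo {

  // Collects the tokens the tokenizer emits for the given separator regexp.
  static List<String> tokenize(String regexp, String text) throws IOException {
    List<String> tokens = new ArrayList<>();
    try (SimplePatternSplitTokenizer tok = new SimplePatternSplitTokenizer(regexp)) {
      tok.setReader(new StringReader(text));
      CharTermAttribute term = tok.getAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        tokens.add(term.toString());
      }
      tok.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws IOException {
    // Same whitespace separator pattern as the fieldType example above.
    System.out.println(tokenize("[ \t\r\n]+", "one  two\tthree"));
  }
}
```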
Since:
6.5.0

  • Constructor Details

    • SimplePatternSplitTokenizerFactory

      public SimplePatternSplitTokenizerFactory(Map<String,String> args)
      Creates a new SimplePatternSplitTokenizerFactory
    • SimplePatternSplitTokenizerFactory

      public SimplePatternSplitTokenizerFactory()
      Default constructor for compatibility with SPI