Class RegexTokenizer
java.lang.Object
org.apache.commons.text.similarity.RegexTokenizer
- All Implemented Interfaces:
Tokenizer<CharSequence>
A simple word tokenizer that uses a regular expression to find words. It applies the regex
(\w)+
over the input text to extract words from a given character sequence.
- Since:
- 1.0
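To illustrate what the regex (\w)+ extracts, here is a minimal standalone sketch that uses only java.util.regex. It is not the library's implementation: the class name RegexWordSketch and the method words are hypothetical, and the sketch assumes the tokenizer collects each full match in order of appearance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not the library class: it only mirrors the documented regex.
final class RegexWordSketch {

    // The regex named in the class description.
    private static final Pattern WORD = Pattern.compile("(\\w)+");

    // Collects every maximal run of word characters, in order of appearance.
    static String[] words(CharSequence text) {
        Matcher matcher = WORD.matcher(text);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            // group() is the whole match; group(1) would be only the last character captured.
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Punctuation and spaces separate tokens; digits and '_' count as word characters.
        System.out.println(String.join("|", words("Hello, world! foo_bar 42")));
        // prints: Hello|world|foo_bar|42
    }
}
```

With Java's default \w (ASCII letters, digits, and underscore), an apostrophe splits a word, so "don't" would yield the two tokens "don" and "t" in this sketch.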
Field Details
-
PATTERN
The word pattern, (\w)+, used to find tokens.
-
-
Constructor Details
-
RegexTokenizer
RegexTokenizer()
-
-
Method Details
-
tokenize
Returns an array of tokens.
- Specified by:
tokenize in interface Tokenizer<CharSequence>
- Parameters:
text - input text
- Returns:
- array of tokens
- Throws:
IllegalArgumentException - if the input text is blank
-
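The tokenize contract described above (tokens returned as an array, IllegalArgumentException for blank input) can be sketched as follows. This is an illustration under stated assumptions, not the library source: the class BlankCheckSketch, the helper isBlank, and the exception message are hypothetical, and "blank" is assumed to mean null, empty, or whitespace only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the documented contract, not the library class.
final class BlankCheckSketch {

    private static final Pattern WORD = Pattern.compile("(\\w)+");

    // Assumption: "blank" means null, empty, or consisting only of whitespace.
    static boolean isBlank(CharSequence text) {
        if (text == null) {
            return true;
        }
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Rejects blank input, then returns every regex match as a token.
    static CharSequence[] tokenize(CharSequence text) {
        if (isBlank(text)) {
            // Hypothetical message; the source only documents the exception type.
            throw new IllegalArgumentException("text must not be blank");
        }
        Matcher matcher = WORD.matcher(text);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("one two").length); // prints: 2
        try {
            tokenize("   ");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected blank input");
        }
    }
}
```

In this sketch, input that is not blank but contains no word characters, such as "!!!", passes the check and produces an empty array rather than an exception.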