Class JapaneseIterationMarkCharFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Readable
Sequences of iteration marks are supported. If an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as-is without considering its script. For example, the input "?ゝ" yields "??" even though the question mark is not hiragana.
Note that the full stop punctuation character "。" (U+3002) cannot be iterated (see below). Iteration marks themselves can be emitted if they are illegal, i.e. if they refer back past the beginning of the character stream.
The implementation buffers input until a full stop punctuation character (U+3002) or EOF is reached, so that it does not need to keep a copy of the entire character stream in memory. Vertical iteration marks, which are even rarer than horizontal iteration marks in contemporary Japanese, are unsupported.
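The buffering idea described above can be illustrated with a small stand-alone sketch. It is not the filter's actual code: the class name `SegmentReaderSketch` and the method `nextSegment` are hypothetical, and the sketch only shows how reading up to a full stop (U+3002) or EOF keeps the amount of retained text bounded by one segment.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper, not part of Lucene: reads one full-stop-delimited segment at a time. */
final class SegmentReaderSketch {
    static final char FULL_STOP = '\u3002'; // ideographic full stop

    /**
     * Reads characters up to and including the next full stop, or up to EOF.
     * Returns null once the stream is exhausted.
     */
    static String nextSegment(Reader in) throws IOException {
        StringBuilder segment = new StringBuilder();
        int ic;
        while ((ic = in.read()) != -1) {
            segment.append((char) ic);
            if (ic == FULL_STOP) {
                break; // text before this point is no longer needed by later marks
            }
        }
        return segment.length() == 0 ? null : segment.toString();
    }
}
```

Calling `nextSegment` repeatedly on a `StringReader` yields one segment per call, so a caller only ever holds the current segment rather than the whole stream.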
-
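Field Summary
Fields (Modifier and Type, Field, Description):
private final RollingCharBuffer buffer
private int bufferPosition
private static final char FULL_STOP_PUNCTUATION
private static char[] h2d
private static final char HIRAGANA_ITERATION_MARK
private static final char HIRAGANA_VOICED_ITERATION_MARK
private int iterationMarksSpanSize
private int iterationMarkSpanEndPosition
private static char[] k2d
private static final char KANJI_ITERATION_MARK
private static final char KATAKANA_ITERATION_MARK
private static final char KATAKANA_VOICED_ITERATION_MARK
static final boolean NORMALIZE_KANA_DEFAULT - Normalize kana iteration marks by default
static final boolean NORMALIZE_KANJI_DEFAULT - Normalize kanji iteration marks by default
private boolean normalizeKana
private boolean normalizeKanji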
Fields inherited from class org.apache.lucene.analysis.CharFilter
input
-
Constructor Summary
Constructors (Constructor, Description):
JapaneseIterationMarkCharFilter(Reader input) - Constructor.
JapaneseIterationMarkCharFilter(Reader input, boolean normalizeKanji, boolean normalizeKana) - Constructor
-
Method Summary
Methods (Modifier and Type, Method, Description):
protected int correct(int currentOff) - Subclasses override to correct the current offset.
private boolean inside(char c, char[] map, char offset) - Predicate indicating if the lookup character is within dakuten map range
private boolean isHiraganaDakuten(char c) - Hiragana dakuten predicate
private boolean isHiraganaIterationMark(char c) - Hiragana iteration mark character predicate
private boolean isIterationMark(char c) - Iteration mark character predicate
private boolean isKanjiIterationMark(char c) - Kanji iteration mark character predicate
private boolean isKatakanaDakuten(char c) - Katakana dakuten predicate
private boolean isKatakanaIterationMark(char c) - Katakana iteration mark character predicate
private char lookup(char c, char[] map, char offset) - Looks up a character in dakuten map and returns the dakuten variant if it exists.
private char lookupHiraganaDakuten(char c) - Look up hiragana dakuten
private char lookupKatakanaDakuten(char c) - Look up katakana dakuten.
private int nextIterationMarkSpanSize() - Finds the number of subsequent iteration marks
private char normalize(char c, char m) - Normalize a character
private char normalizedHiragana(char c, char m) - Normalize hiragana character
private char normalizedKatakana(char c, char m) - Normalize katakana character
private char normalizeIterationMark(char c) - Normalizes the iteration mark character c
int read()
int read(char[] buffer, int offset, int length)
private char sourceCharacter(int position, int spanSize) - Returns the source character for a given position and iteration mark span size
Methods inherited from class org.apache.lucene.analysis.CharFilter
close, correctOffset
Methods inherited from class java.io.Reader
mark, markSupported, nullReader, read, read, ready, reset, skip, transferTo
-
Field Details
-
NORMALIZE_KANJI_DEFAULT
public static final boolean NORMALIZE_KANJI_DEFAULT
Normalize kanji iteration marks by default
-
NORMALIZE_KANA_DEFAULT
public static final boolean NORMALIZE_KANA_DEFAULT
Normalize kana iteration marks by default
-
KANJI_ITERATION_MARK
private static final char KANJI_ITERATION_MARK
-
HIRAGANA_ITERATION_MARK
private static final char HIRAGANA_ITERATION_MARK
-
HIRAGANA_VOICED_ITERATION_MARK
private static final char HIRAGANA_VOICED_ITERATION_MARK
-
KATAKANA_ITERATION_MARK
private static final char KATAKANA_ITERATION_MARK
-
KATAKANA_VOICED_ITERATION_MARK
private static final char KATAKANA_VOICED_ITERATION_MARK
-
FULL_STOP_PUNCTUATION
private static final char FULL_STOP_PUNCTUATION
-
h2d
private static char[] h2d
-
k2d
private static char[] k2d
-
buffer
private final RollingCharBuffer buffer
-
bufferPosition
private int bufferPosition
-
iterationMarksSpanSize
private int iterationMarksSpanSize
-
iterationMarkSpanEndPosition
private int iterationMarkSpanEndPosition
-
normalizeKanji
private boolean normalizeKanji
-
normalizeKana
private boolean normalizeKana
-
-
Constructor Details
-
JapaneseIterationMarkCharFilter
public JapaneseIterationMarkCharFilter(Reader input)
Constructor. Normalizes both kanji and kana iteration marks by default.
- Parameters:
input
- char stream
-
JapaneseIterationMarkCharFilter
public JapaneseIterationMarkCharFilter(Reader input, boolean normalizeKanji, boolean normalizeKana)
Constructor
- Parameters:
input
- char stream
normalizeKanji
- indicates whether kanji iteration marks should be normalized
normalizeKana
- indicates whether kana iteration marks should be normalized
-
-
Method Details
-
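read
public int read(char[] buffer, int offset, int length) throws IOException
- Specified by:
read
in class Reader
- Throws:
IOException
-
read
public int read() throws IOException
- Overrides:
read
in class Reader
- Throws:
IOException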
-
normalizeIterationMark
private char normalizeIterationMark(char c) throws IOException
Normalizes the iteration mark character c
- Parameters:
c
- iteration mark character to normalize
- Returns:
- normalized iteration mark
- Throws:
IOException
- If there is a low-level I/O error.
-
nextIterationMarkSpanSize
private int nextIterationMarkSpanSize() throws IOException
Finds the number of subsequent iteration marks
- Returns:
- number of iteration marks starting at the current buffer position
- Throws:
IOException
- If there is a low-level I/O error.
-
sourceCharacter
private char sourceCharacter(int position, int spanSize) throws IOException
Returns the source character for a given position and iteration mark span size
- Parameters:
position
- buffer position (should not exceed bufferPosition)
spanSize
- iteration mark span size
- Returns:
- source character
- Throws:
IOException
- If there is a low-level I/O error.
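The relationship between a span of iteration marks and its source characters can be sketched over a plain string. This is a simplified illustration under stated assumptions, not the filter's code: the class `IterationMarkSpanSketch` and its methods are hypothetical, each mark is assumed to refer to the character `spanSize` positions before it, and voicing, the full stop boundary, and any span clamping are deliberately left out.

```java
/** Simplified illustration; not the Lucene implementation. Voicing is ignored. */
final class IterationMarkSpanSketch {
    private static boolean isIterationMark(char c) {
        return c == '\u3005'                     // kanji iteration mark
            || c == '\u309D' || c == '\u309E'    // hiragana, plain and voiced
            || c == '\u30FD' || c == '\u30FE';   // katakana, plain and voiced
    }

    /** Counts the consecutive iteration marks starting at position. */
    static int spanSize(CharSequence text, int position) {
        int size = 0;
        while (position + size < text.length()
                && isIterationMark(text.charAt(position + size))) {
            size++;
        }
        return size;
    }

    /** Replaces each mark with the character spanSize positions before it. */
    static String expand(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int size = spanSize(text, i);
            if (size == 0) {
                out.append(text.charAt(i++));
                continue;
            }
            for (int end = i + size; i < end; i++) {
                int source = i - size;
                // A mark pointing before the start of the text is emitted unchanged.
                out.append(source >= 0 ? text.charAt(source) : text.charAt(i));
            }
        }
        return out.toString();
    }
}
```

With this sketch, a two-mark span after two kanji copies both kanji in order, and a mark following "?" simply repeats the question mark, which mirrors the script-agnostic behaviour described in the class overview.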
-
normalize
private char normalize(char c, char m)
Normalize a character
- Parameters:
c
- character to normalize
m
- repetition mark referring to c
- Returns:
- normalized character - return c on illegal iteration marks
-
normalizedHiragana
private char normalizedHiragana(char c, char m)
Normalize hiragana character
- Parameters:
c
- hiragana character
m
- repetition mark referring to c
- Returns:
- normalized character - return c on illegal iteration marks
-
normalizedKatakana
private char normalizedKatakana(char c, char m)
Normalize katakana character
- Parameters:
c
- katakana character
m
- repetition mark referring to c
- Returns:
- normalized character - return c on illegal iteration marks
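A minimal sketch of how a source character and a mark might combine for hiragana follows. It rests on assumptions that this page does not spell out: that the plain mark (U+309D) repeats the character without dakuten, that the voiced mark (U+309E) repeats it with dakuten, and that any other mark leaves `c` unchanged. The class `HiraganaVoicingSketch` and its range arithmetic are hypothetical; handakuten forms and the う/ゔ pair are not handled.

```java
/** Illustrative only; names and range arithmetic are assumptions, not Lucene's code. */
final class HiraganaVoicingSketch {
    static final char MARK = '\u309D';        // hiragana iteration mark
    static final char VOICED_MARK = '\u309E'; // hiragana voiced iteration mark

    /** True if c is an unvoiced hiragana whose voiced form is the next code point. */
    static boolean hasVoicedForm(char c) {
        if (c >= 0x304B && c <= 0x3061) return (c - 0x304B) % 2 == 0; // ka .. chi
        if (c >= 0x3064 && c <= 0x3068) return (c - 0x3064) % 2 == 0; // tsu .. to
        if (c >= 0x306F && c <= 0x307B) return (c - 0x306F) % 3 == 0; // ha .. ho
        return false;
    }

    /** True if c is a voiced hiragana, i.e. the code point after a voiceable one. */
    static boolean isVoiced(char c) {
        return hasVoicedForm((char) (c - 1));
    }

    /** Returns the character that mark m produces when it refers to c. */
    static char normalize(char c, char m) {
        switch (m) {
            case VOICED_MARK:
                return hasVoicedForm(c) ? (char) (c + 1) : c;
            case MARK:
                return isVoiced(c) ? (char) (c - 1) : c;
            default:
                return c; // not a hiragana mark: leave c unchanged
        }
    }
}
```

A katakana variant would follow the same shape with the katakana marks (U+30FD, U+30FE) and the corresponding code point ranges.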
-
isIterationMark
private boolean isIterationMark(char c)
Iteration mark character predicate
- Parameters:
c
- character to test
- Returns:
- true if c is an iteration mark character. Otherwise false.
-
isHiraganaIterationMark
private boolean isHiraganaIterationMark(char c)
Hiragana iteration mark character predicate
- Parameters:
c
- character to test
- Returns:
- true if c is a hiragana iteration mark character. Otherwise false.
-
isKatakanaIterationMark
private boolean isKatakanaIterationMark(char c)
Katakana iteration mark character predicate
- Parameters:
c
- character to test
- Returns:
- true if c is a katakana iteration mark character. Otherwise false.
-
isKanjiIterationMark
private boolean isKanjiIterationMark(char c)
Kanji iteration mark character predicate
- Parameters:
c
- character to test
- Returns:
- true if c is a kanji iteration mark character. Otherwise false.
-
lookupHiraganaDakuten
private char lookupHiraganaDakuten(char c)
Look up hiragana dakuten
- Parameters:
c
- character to look up
- Returns:
- hiragana dakuten variant of c or c itself if no dakuten variant exists
-
lookupKatakanaDakuten
private char lookupKatakanaDakuten(char c)
Look up katakana dakuten. Only full-width katakana are supported.
- Parameters:
c
- character to look up
- Returns:
- katakana dakuten variant of c or c itself if no dakuten variant exists
-
isHiraganaDakuten
private boolean isHiraganaDakuten(char c)
Hiragana dakuten predicate
- Parameters:
c
- character to check
- Returns:
- true if c is a hiragana dakuten and otherwise false
-
isKatakanaDakuten
private boolean isKatakanaDakuten(char c)
Katakana dakuten predicate
- Parameters:
c
- character to check
- Returns:
- true if c is a katakana dakuten and otherwise false
-
lookup
private char lookup(char c, char[] map, char offset)
Looks up a character in the dakuten map and returns the dakuten variant if it exists. Otherwise returns the character being looked up.
- Parameters:
c
- character to look up
map
- dakuten map
offset
- code point offset from c
- Returns:
- mapped character or c if no mapping exists
-
inside
private boolean inside(char c, char[] map, char offset)
Predicate indicating if the lookup character is within dakuten map range
- Parameters:
c
- character to look up
map
- dakuten map
offset
- code point offset from c
- Returns:
- true if c is mapped by map and otherwise false
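One way to picture the `lookup` and `inside` pair is an array indexed by `c - offset`. The sketch below is an assumption about the general technique rather than a copy of the class: `DakutenMapSketch`, its tiny `MAP` (covering only U+304B か through U+3053 こ), and the way the map is filled are all hypothetical.

```java
/** Hypothetical offset-indexed map; the real h2d and k2d contents are not shown on this page. */
final class DakutenMapSketch {
    static final char OFFSET = '\u304B'; // first character covered by the map (hiragana ka)

    // Index i holds the voiced form of (char) (OFFSET + i).
    static final char[] MAP = new char[9];

    static {
        for (int i = 0; i < MAP.length; i++) {
            char c = (char) (OFFSET + i);
            // Even indexes are unvoiced kana and map to the next code point;
            // odd indexes are already voiced and map to themselves.
            MAP[i] = (i % 2 == 0) ? (char) (c + 1) : c;
        }
    }

    /** True if c falls inside the range covered by map. */
    static boolean inside(char c, char[] map, char offset) {
        return c >= offset && c < offset + map.length;
    }

    /** Returns the mapped character, or c itself when c is outside the map. */
    static char lookup(char c, char[] map, char offset) {
        return inside(c, map, offset) ? map[c - offset] : c;
    }
}
```

Keeping the range check in a separate predicate means callers can either test membership alone or rely on `lookup` falling back to the input character.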
-
correct
protected int correct(int currentOff)
Description copied from class: CharFilter
Subclasses override to correct the current offset.
- Specified by:
correct
in class CharFilter
- Parameters:
currentOff
- current offset
- Returns:
- corrected offset
-