Class WordDictionary
java.lang.Object
org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
org.apache.lucene.analysis.cn.smart.hhmm.WordDictionary
SmartChineseAnalyzer Word Dictionary
-
Field Summary
Fields
Modifier and Type | Field | Description
private char[] | charIndexTable |
static final int | PRIME_INDEX_LENGTH | Large prime number for hash function
private static WordDictionary | singleInstance |
private short[] | wordIndexTable | wordIndexTable guarantees to hash all Chinese characters in Unicode into PRIME_INDEX_LENGTH array.
private char[][][] | wordItem_charArrayTable | To avoid taking too much space, the data structure needed to store the lexicon requires two multidimensional arrays to store word and frequency.
private int[][] | wordItem_frequencyTable |
Fields inherited from class org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
CHAR_NUM_IN_FILE, GB2312_CHAR_NUM, GB2312_FIRST_CHAR
-
Constructor Summary
Constructors
Modifier | Constructor | Description
private | WordDictionary() |
-
Method Summary
Modifier and Type | Method | Description
private void | expandDelimiterData() | The original lexicon puts all information with punctuation into a chart (from 1 to 3755).
private int | findInTable(short knownHashIndex, char[] charArray) | Look up the text string corresponding with the word char array, and return the position of the word list.
private short | getAvaliableTableIndex(char c) |
int | getFrequency(char[] charArray) | Get the frequency of a word from the dictionary
static WordDictionary | getInstance() | Get the singleton dictionary instance.
int | getPrefixMatch(char[] charArray) | Find the first word in the dictionary that starts with the supplied prefix
int | getPrefixMatch(char[] charArray, int knownStart) | Find the nth word in the dictionary that starts with the supplied prefix
private short | getWordItemTableIndex(char c) |
boolean | isEqual(char[] charArray, int itemIndex) | Return true if the dictionary entry at itemIndex for table charArray[0] is charArray
void | load() | Load coredict.mem internally from the jar file.
void | load(String dctFileRoot) | Attempt to load dictionary from provided directory, first trying coredict.mem, falling back on coredict.dct
private boolean | loadFromObj(Path serialObj) |
private void | loadFromObjectInputStream(InputStream serialObjectInputStream) |
private int | loadMainDataFromFile(String dctFilePath) | Load the datafile into this WordDictionary
private void | mergeSameWords() |
private void | saveToObj |
private boolean | setTableIndex(char c, int j) |
private void | sortEachItems() |
Methods inherited from class org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
getCCByGB2312Id, getGB2312Id, hash1, hash1, hash2, hash2
-
Field Details
-
singleInstance
-
PRIME_INDEX_LENGTH
public static final int PRIME_INDEX_LENGTH
Large prime number for hash function
-
wordIndexTable
private short[] wordIndexTable
wordIndexTable hashes every Chinese character in Unicode into an array of length PRIME_INDEX_LENGTH. Collisions are possible, but in practice this program only handles the 6768 characters found in GB2312 plus some ASCII characters. To resolve collisions precisely, the original character is retained in charIndexTable.
-
charIndexTable
private char[] charIndexTable
-
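The interplay between wordIndexTable and charIndexTable can be sketched as an open-addressing hash table that keeps the original character next to each slot, so that a collision can be told apart from a hit. This is only an illustration: the class name, the table size of 13, and the hash functions h1 and h2 are assumptions, not the hash1/hash2 methods inherited from AbstractDictionary.

```java
import java.util.Arrays;

/** Hypothetical sketch of a char-to-row hash table; not the Lucene implementation. */
final class CharIndexSketch {
    // Assumed table size: a small prime, chosen so collisions are easy to provoke.
    static final int TABLE_SIZE = 13;

    private final short[] indexTable = new short[TABLE_SIZE]; // slot -> row number, -1 if empty
    private final char[] charTable = new char[TABLE_SIZE];    // slot -> character stored there

    CharIndexSketch() {
        Arrays.fill(indexTable, (short) -1);
    }

    // Illustrative hash functions (assumptions made for this sketch).
    private static int h1(char c) { return (c * 31) % TABLE_SIZE; }
    private static int h2(char c) { return 1 + (c % (TABLE_SIZE - 1)); }

    /** Stores c -> row. Returns false if no free slot could be found. */
    boolean put(char c, short row) {
        int slot = h1(c);
        int step = h2(c);
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (indexTable[slot] == -1 || charTable[slot] == c) {
                charTable[slot] = c;   // remember which character owns the slot
                indexTable[slot] = row;
                return true;
            }
            slot = (slot + step) % TABLE_SIZE; // collision: probe the next slot
        }
        return false;
    }

    /** Returns the row stored for c, or -1 if c was never stored. */
    short get(char c) {
        int slot = h1(c);
        int step = h2(c);
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (indexTable[slot] == -1) return -1;             // empty slot: c is absent
            if (charTable[slot] == c) return indexTable[slot]; // stored char confirms the hit
            slot = (slot + step) % TABLE_SIZE;
        }
        return -1;
    }
}
```

In this sketch 'a' and 'n' hash to the same first slot, and the comparison against the stored character is what keeps their rows apart.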
wordItem_charArrayTable
private char[][][] wordItem_charArrayTable
To avoid taking too much space, the data structure needed to store the lexicon requires two multidimensional arrays to store word and frequency. Each word is placed in a char[]. Each char represents a Chinese char or other symbol. Each frequency is put into an int. These two arrays correspond to each other one-to-one. Therefore, one can use wordItem_charArrayTable[i][j] to look up a word from the lexicon, and wordItem_frequencyTable[i][j] to look up the corresponding frequency.
-
wordItem_frequencyTable
private int[][] wordItem_frequencyTable
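The one-to-one correspondence between the two tables can be illustrated with a pair of small parallel arrays. The class name, helper methods, and sample data below are made up for this sketch; only the indexing scheme table[i][j] comes from the description above.

```java
/** Hypothetical sketch of two parallel lexicon tables; the data is invented. */
final class ParallelTablesSketch {
    // wordTable[i][j] is the j-th word of row i.
    static final char[][][] wordTable = {
        { "\u4e2d".toCharArray(), "\u4e2d\u56fd".toCharArray() }, // row 0: two sample words
        { "\u4eba".toCharArray() }                                // row 1: one sample word
    };

    // freqTable[i][j] is the frequency of wordTable[i][j] (invented numbers).
    static final int[][] freqTable = {
        { 120, 45 },
        { 300 }
    };

    /** Returns the word stored at row i, position j. */
    static String wordAt(int i, int j) {
        return new String(wordTable[i][j]);
    }

    /** Returns the frequency stored at the same coordinates as the word. */
    static int frequencyAt(int i, int j) {
        return freqTable[i][j];
    }
}
```

Because the same pair (i, j) addresses both arrays, no per-word object is needed to tie a word to its frequency.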
-
-
Constructor Details
-
WordDictionary
private WordDictionary()
-
-
Method Details
-
getInstance
public static WordDictionary getInstance()
Get the singleton dictionary instance.
Returns:
singleton
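A minimal sketch of the lazily initialized singleton pattern that an accessor of this kind commonly follows. The class DictionaryHolderSketch and its simulated load() are hypothetical, and loading on first access is an assumption made for illustration, not something this page states.

```java
/** Hypothetical lazily initialized singleton; not the Lucene implementation. */
final class DictionaryHolderSketch {
    private static DictionaryHolderSketch singleInstance;

    private int loadCount; // how many times the simulated load ran

    private DictionaryHolderSketch() {
        // private constructor: instances are only created through getInstance()
    }

    /** Stands in for reading the dictionary data (assumption). */
    private void load() {
        loadCount++;
    }

    /** Creates and loads the instance on first call, then always returns the same object. */
    static synchronized DictionaryHolderSketch getInstance() {
        if (singleInstance == null) {
            DictionaryHolderSketch d = new DictionaryHolderSketch();
            d.load();
            singleInstance = d;
        }
        return singleInstance;
    }

    int loadCount() {
        return loadCount;
    }
}
```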
-
load
public void load(String dctFileRoot)
Attempt to load the dictionary from the provided directory, first trying coredict.mem, then falling back on coredict.dct.
Parameters:
dctFileRoot - path to dictionary directory
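The lookup order described above can be sketched as a small helper that picks which file to read. This is a simplification under stated assumptions: the class and method names are hypothetical, and the sketch only checks whether each file is readable, whereas a real loader might also fall back when coredict.mem exists but fails to deserialize.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper illustrating the coredict.mem-then-coredict.dct order. */
final class LoadFallbackSketch {
    /**
     * Returns the dictionary file to read from dctFileRoot:
     * coredict.mem if readable, otherwise coredict.dct if readable, otherwise null.
     */
    static Path chooseDictionaryFile(String dctFileRoot) {
        Path mem = Paths.get(dctFileRoot, "coredict.mem");
        if (Files.isReadable(mem)) {
            return mem; // preferred: the serialized form
        }
        Path dct = Paths.get(dctFileRoot, "coredict.dct");
        return Files.isReadable(dct) ? dct : null; // fallback: the raw dictionary file
    }
}
```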
-
load
public void load() throws IOException, ClassNotFoundException
Load coredict.mem internally from the jar file.
Throws:
IOException - If there is a low-level I/O error.
ClassNotFoundException
-
loadFromObj
private boolean loadFromObj(Path serialObj)
-
loadFromObjectInputStream
private void loadFromObjectInputStream(InputStream serialObjectInputStream) throws IOException, ClassNotFoundException
Throws:
IOException
ClassNotFoundException
-
saveToObj
-
loadMainDataFromFile
private int loadMainDataFromFile(String dctFilePath) throws IOException
Load the datafile into this WordDictionary
Parameters:
dctFilePath - path to word dictionary (coredict.dct)
Returns:
number of words read
Throws:
IOException - If there is a low-level I/O error.
-
expandDelimiterData
private void expandDelimiterData()
The original lexicon stores all entries that involve punctuation in a single chart (from 1 to 3755). This method expands them, placing each entry into the chart of its corresponding symbol.
-
mergeSameWords
private void mergeSameWords()
-
sortEachItems
private void sortEachItems()
-
setTableIndex
private boolean setTableIndex(char c, int j)
-
getAvaliableTableIndex
private short getAvaliableTableIndex(char c)
-
getWordItemTableIndex
private short getWordItemTableIndex(char c)
-
findInTable
private int findInTable(short knownHashIndex, char[] charArray)
Look up the word given as a char array and return its position in the word list.
Parameters:
knownHashIndex - the already-computed position of the word's first character, charArray[0], in the hash table. If it has not been computed yet, use int findInTable(char[] charArray) instead.
charArray - the char array of the word to look up.
Returns:
the word's position in the word array, or -1 if it is not found.
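A lookup of this shape can be sketched as a binary search over one sorted row of words. That the row is sorted and searched by binary search is an assumption suggested by the presence of sortEachItems(), not something this page confirms; the class and method names are hypothetical.

```java
/** Hypothetical sketch: binary search for a word inside one sorted row. */
final class FindInRowSketch {
    /** Lexicographic comparison of two char arrays (shorter array sorts first on a tie). */
    static int compare(char[] a, char[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] - b[i];
            }
        }
        return a.length - b.length;
    }

    /** Returns the position of word in sortedRow, or -1 if it is not present. */
    static int find(char[][] sortedRow, char[] word) {
        int lo = 0;
        int hi = sortedRow.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(sortedRow[mid], word);
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                lo = mid + 1; // word sorts after the middle entry
            } else {
                hi = mid - 1; // word sorts before the middle entry
            }
        }
        return -1;
    }
}
```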
-
getPrefixMatch
public int getPrefixMatch(char[] charArray)
Find the first word in the dictionary that starts with the supplied prefix
Parameters:
charArray - input prefix
Returns:
index of word, or -1 if not found
-
getPrefixMatch
public int getPrefixMatch(char[] charArray, int knownStart)
Find the nth word in the dictionary that starts with the supplied prefix
Parameters:
charArray - input prefix
knownStart - relative position in the dictionary to start
Returns:
index of word, or -1 if not found
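The two prefix lookups can be illustrated with a simple scan over one row of words, where the second overload resumes from a known position. This is a sketch under assumptions: the class and method names are hypothetical, and a linear scan is used here for clarity rather than whatever search the real class performs.

```java
/** Hypothetical sketch of prefix matching over one row of words. */
final class PrefixMatchSketch {
    /** Returns true if word begins with every character of prefix. */
    static boolean startsWith(char[] word, char[] prefix) {
        if (prefix.length > word.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (word[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the index of the first match at or after knownStart, or -1 if there is none. */
    static int prefixMatch(char[][] row, char[] prefix, int knownStart) {
        for (int i = Math.max(knownStart, 0); i < row.length; i++) {
            if (startsWith(row[i], prefix)) {
                return i;
            }
        }
        return -1;
    }

    /** Convenience overload: search from the beginning of the row. */
    static int prefixMatch(char[][] row, char[] prefix) {
        return prefixMatch(row, prefix, 0);
    }
}
```

Calling the three-argument form again with the previous result plus one walks through successive matches.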
-
getFrequency
public int getFrequency(char[] charArray)
Get the frequency of a word from the dictionary
Parameters:
charArray - input word
Returns:
word frequency, or zero if the word is not found
-
isEqual
public boolean isEqual(char[] charArray, int itemIndex)
Return true if the dictionary entry at itemIndex for table charArray[0] is charArray
Parameters:
charArray - input word
itemIndex - item index for table charArray[0]
Returns:
true if the entry exists
-