Class WikipediaTokenizerImpl

java.lang.Object
org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl

class WikipediaTokenizerImpl extends Object
JFlex-generated tokenizer that is aware of Wikipedia syntax.
  • Field Details

    • YYEOF

      public static final int YYEOF
      This character denotes the end of file.
    • ZZ_BUFFERSIZE

      private static final int ZZ_BUFFERSIZE
      Initial size of the lookahead buffer.
    • YYINITIAL

      public static final int YYINITIAL
      Lexical states.
    • CATEGORY_STATE

      public static final int CATEGORY_STATE
    • TWO_SINGLE_QUOTES_STATE

      public static final int TWO_SINGLE_QUOTES_STATE
    • THREE_SINGLE_QUOTES_STATE

      public static final int THREE_SINGLE_QUOTES_STATE
    • FIVE_SINGLE_QUOTES_STATE

      public static final int FIVE_SINGLE_QUOTES_STATE
    • DOUBLE_EQUALS_STATE

      public static final int DOUBLE_EQUALS_STATE
    • DOUBLE_BRACE_STATE

      public static final int DOUBLE_BRACE_STATE
    • STRING

      public static final int STRING
    • ZZ_LEXSTATE

      private static final int[] ZZ_LEXSTATE
      ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
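The paired layout described for ZZ_LEXSTATE can be illustrated with a minimal sketch. The table contents, the state constants, and the `startState` helper below are hypothetical and chosen only for illustration; the real values are generated by JFlex.

```java
final class LexStateLookup {
    // Hypothetical table: one pair of entries per lexical state.
    // Even index: DFA start state; odd index: DFA start state at beginning of line.
    static final int[] LEXSTATE = {0, 0, 1, 1, 2, 3};

    // Hypothetical lexical state constants, each of the form l = 2*k.
    static final int STATE_A = 0;
    static final int STATE_B = 2;
    static final int STATE_C = 4;

    /** Returns the DFA start state for lexical state l (which must be even). */
    static int startState(int l, boolean atBeginningOfLine) {
        if (l < 0 || (l & 1) != 0 || l + 1 >= LEXSTATE.length) {
            throw new IllegalArgumentException("not a lexical state: " + l);
        }
        return atBeginningOfLine ? LEXSTATE[l + 1] : LEXSTATE[l];
    }
}
```

Because every lexical state occupies two slots, the constants are always even, which is why l is of the form 2*k.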
    • ZZ_CMAP_TOP

      private static final int[] ZZ_CMAP_TOP
      Top-level table for translating characters to character classes
    • ZZ_CMAP_TOP_PACKED_0

      private static final String ZZ_CMAP_TOP_PACKED_0
    • ZZ_CMAP_BLOCKS

      private static final int[] ZZ_CMAP_BLOCKS
      Second-level tables for translating characters to character classes
    • ZZ_CMAP_BLOCKS_PACKED_0

      private static final String ZZ_CMAP_BLOCKS_PACKED_0
    • ZZ_ACTION

      private static final int[] ZZ_ACTION
      Translates DFA states to action switch labels.
    • ZZ_ACTION_PACKED_0

      private static final String ZZ_ACTION_PACKED_0
    • ZZ_ROWMAP

      private static final int[] ZZ_ROWMAP
      Translates a state to a row index in the transition table
    • ZZ_ROWMAP_PACKED_0

      private static final String ZZ_ROWMAP_PACKED_0
    • ZZ_TRANS

      private static final int[] ZZ_TRANS
      The transition table of the DFA
    • ZZ_TRANS_PACKED_0

      private static final String ZZ_TRANS_PACKED_0
    • ZZ_UNKNOWN_ERROR

      private static final int ZZ_UNKNOWN_ERROR
      Error code for "Unknown internal scanner error".
    • ZZ_NO_MATCH

      private static final int ZZ_NO_MATCH
      Error code for "could not match input".
    • ZZ_PUSHBACK_2BIG

      private static final int ZZ_PUSHBACK_2BIG
      Error code for "pushback value was too large".
    • ZZ_ERROR_MSG

      private static final String[] ZZ_ERROR_MSG
      Error messages for ZZ_UNKNOWN_ERROR, ZZ_NO_MATCH, and ZZ_PUSHBACK_2BIG respectively.
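A rough sketch of how three error codes might index a message array such as ZZ_ERROR_MSG. The numeric values 0, 1 and 2 and the `message` helper are assumptions made for illustration; this page does not state the actual constant values.

```java
final class ScanErrorSketch {
    // Assumed values: the codes are used directly as array indices.
    static final int UNKNOWN_ERROR = 0;
    static final int NO_MATCH = 1;
    static final int PUSHBACK_2BIG = 2;

    // Messages in the same order as the codes above.
    static final String[] ERROR_MSG = {
        "Unknown internal scanner error",
        "could not match input",
        "pushback value was too large"
    };

    /** Looks up the message for a code, falling back to the unknown-error text. */
    static String message(int errorCode) {
        if (errorCode < 0 || errorCode >= ERROR_MSG.length) {
            return ERROR_MSG[UNKNOWN_ERROR];
        }
        return ERROR_MSG[errorCode];
    }
}
```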
    • ZZ_ATTRIBUTE

      private static final int[] ZZ_ATTRIBUTE
      ZZ_ATTRIBUTE[aState] contains the attributes of state aState
    • ZZ_ATTRIBUTE_PACKED_0

      private static final String ZZ_ATTRIBUTE_PACKED_0
    • zzReader

      private Reader zzReader
      Input device.
    • zzState

      private int zzState
      Current state of the DFA.
    • zzLexicalState

      private int zzLexicalState
      Current lexical state.
    • zzBuffer

      private char[] zzBuffer
      This buffer contains the current text to be matched and is the source of the yytext() string.
    • zzMarkedPos

      private int zzMarkedPos
      Text position at the last accepting state.
    • zzCurrentPos

      private int zzCurrentPos
      Current text position in the buffer.
    • zzStartRead

      private int zzStartRead
      Marks the beginning of the yytext() string in the buffer.
    • zzEndRead

      private int zzEndRead
      Marks the last character in the buffer that has been read from input.
    • zzAtEOF

      private boolean zzAtEOF
      Whether the scanner is at the end of file.
    • zzFinalHighSurrogate

      private int zzFinalHighSurrogate
      The number of occupied positions in zzBuffer beyond zzEndRead.

      When a lead/high surrogate has been read from the input stream into the final zzBuffer position, this will have a value of 1; otherwise, it will have a value of 0.
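The hold-back rule for a trailing high surrogate can be sketched as follows. The class and method names are hypothetical, and the sketch only models the 0-or-1 bookkeeping described above, not the scanner's actual refill logic.

```java
final class SurrogateHoldBack {
    /**
     * Returns 1 if the last filled position holds a lead/high surrogate whose
     * trail surrogate has not been read yet, otherwise 0.
     */
    static int finalHighSurrogate(char[] buffer, int filled) {
        return (filled > 0 && Character.isHighSurrogate(buffer[filled - 1])) ? 1 : 0;
    }

    /** Number of chars that are safe to scan: the filled count minus any held-back surrogate. */
    static int endRead(char[] buffer, int filled) {
        return filled - finalHighSurrogate(buffer, filled);
    }
}
```

Holding the lone high surrogate back keeps the scanner from treating half of a supplementary code point as a complete character.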

    • yyline

      private int yyline
      Number of newlines encountered up to the start of the matched text.
    • yycolumn

      private int yycolumn
      Number of characters from the last newline up to the start of the matched text.
    • yychar

      private long yychar
      Number of characters up to the start of the matched text.
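As a hedged illustration of what yyline, yycolumn and yychar count, the sketch below derives all three for a given match start. The `positionAt` helper is hypothetical, and it treats only '\n' as a newline, which is a simplification; the generated scanner may recognize other line terminators.

```java
final class PositionCounters {
    /**
     * Returns {line, column, chars} for the position matchStart in text:
     * newlines seen so far, chars since the last newline, and total chars.
     */
    static long[] positionAt(String text, int matchStart) {
        int line = 0;
        int column = 0;
        for (int i = 0; i < matchStart; i++) {
            if (text.charAt(i) == '\n') {
                line++;      // a new line starts after this char
                column = 0;  // column restarts at the line start
            } else {
                column++;
            }
        }
        return new long[] {line, column, matchStart};
    }
}
```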
    • zzAtBOL

      private boolean zzAtBOL
      Whether the scanner is currently at the beginning of a line.
    • zzEOFDone

      private boolean zzEOFDone
      Whether the user-EOF-code has already been executed.
    • ALPHANUM

      public static final int ALPHANUM
    • APOSTROPHE

      public static final int APOSTROPHE
    • ACRONYM

      public static final int ACRONYM
    • COMPANY

      public static final int COMPANY
    • EMAIL

      public static final int EMAIL
    • HOST

      public static final int HOST
    • NUM

      public static final int NUM
    • CJ

      public static final int CJ
    • CITATION

      public static final int CITATION
    • CATEGORY

      public static final int CATEGORY
    • BOLD

      public static final int BOLD
    • ITALICS

      public static final int ITALICS
    • BOLD_ITALICS

      public static final int BOLD_ITALICS
    • HEADING

      public static final int HEADING
    • SUB_HEADING

      public static final int SUB_HEADING
    • currentTokType

      private int currentTokType
    • numBalanced

      private int numBalanced
    • positionInc

      private int positionInc
    • numLinkToks

      private int numLinkToks
    • numWikiTokensSeen

      private int numWikiTokensSeen
    • TOKEN_TYPES

      public static final String[] TOKEN_TYPES
  • Constructor Details

    • WikipediaTokenizerImpl

      WikipediaTokenizerImpl(Reader in)
      Creates a new scanner
      Parameters:
      in - the java.io.Reader to read input from.
  • Method Details

    • zzUnpackcmap_top

      private static int[] zzUnpackcmap_top()
    • zzUnpackcmap_top

      private static int zzUnpackcmap_top(String packed, int offset, int[] result)
    • zzUnpackcmap_blocks

      private static int[] zzUnpackcmap_blocks()
    • zzUnpackcmap_blocks

      private static int zzUnpackcmap_blocks(String packed, int offset, int[] result)
    • zzUnpackAction

      private static int[] zzUnpackAction()
    • zzUnpackAction

      private static int zzUnpackAction(String packed, int offset, int[] result)
    • zzUnpackRowMap

      private static int[] zzUnpackRowMap()
    • zzUnpackRowMap

      private static int zzUnpackRowMap(String packed, int offset, int[] result)
    • zzUnpackTrans

      private static int[] zzUnpackTrans()
    • zzUnpackTrans

      private static int zzUnpackTrans(String packed, int offset, int[] result)
    • zzUnpackAttribute

      private static int[] zzUnpackAttribute()
    • zzUnpackAttribute

      private static int zzUnpackAttribute(String packed, int offset, int[] result)
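The zzUnpack* helpers above share the signature (String packed, int offset, int[] result). One plausible reading, offered here only as an assumption, is that the packed string is run-length encoded as (count, value) char pairs and the method returns the next free index in the result array. The exact encoding is not stated on this page and may differ from table to table.

```java
final class RunLengthUnpack {
    /**
     * Expands (count, value) char pairs from packed into result, starting at
     * offset. Returns the index just past the last written element.
     * Assumes packed has even length and every count is at least 1.
     */
    static int unpack(String packed, int offset, int[] result) {
        int i = 0;       // index into the packed string
        int j = offset;  // index into the unpacked array
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            int value = packed.charAt(i++);
            do {
                result[j++] = value;
            } while (--count > 0);
        }
        return j;
    }
}
```

Returning the next free index lets a caller chain several packed strings into one array by passing the previous return value as the next offset.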
    • getNumWikiTokensSeen

      public final int getNumWikiTokensSeen()
      Returns the number of tokens seen inside a category or link, etc.
      Returns:
      the number of tokens seen inside the context of wiki syntax.
    • yychar

      public final int yychar()
    • getPositionIncrement

      public final int getPositionIncrement()
    • getText

      final void getText(CharTermAttribute t)
      Fills Lucene token with the current token text.
    • setText

      final int setText(StringBuilder buffer)
    • reset

      final void reset()
    • zzCMap

      private static int zzCMap(int input)
      Translates raw input code points to DFA table row
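A two-level lookup like the one ZZ_CMAP_TOP and ZZ_CMAP_BLOCKS suggest can be sketched as below. The 256-entry block size, the table contents, and the `charClass` name are assumptions for illustration; the generated tables are far larger and their layout is not documented here.

```java
import java.util.Arrays;

final class TwoLevelCharMap {
    static final int BLOCK_SIZE = 256;

    // Top level: one entry per 256-code-point page, covering U+0000..U+10FFFF.
    // Each entry is the offset of that page's block inside BLOCKS.
    static final int[] TOP = new int[0x1100];

    // Second level: two hypothetical blocks. Block 0 serves page 0,
    // block 1 is shared by every other page.
    static final int[] BLOCKS = new int[2 * BLOCK_SIZE];

    static {
        for (int c = '0'; c <= '9'; c++) {
            BLOCKS[c] = 1;                                   // digits: class 1
        }
        Arrays.fill(BLOCKS, BLOCK_SIZE, 2 * BLOCK_SIZE, 2);  // block 1: class 2
        Arrays.fill(TOP, 1, TOP.length, BLOCK_SIZE);         // pages 1.. use block 1
    }

    /** Maps a code point to its character class via page lookup, then in-block offset. */
    static int charClass(int codePoint) {
        return BLOCKS[TOP[codePoint >> 8] | (codePoint & 0xFF)];
    }
}
```

Sharing one block among many pages is what keeps such a table compact: pages with identical class assignments are stored once.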
    • zzRefill

      private boolean zzRefill() throws IOException
      Refills the input buffer.
      Returns:
      false if and only if new input was read.
      Throws:
      IOException - if any I/O error occurs.
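The return convention of zzRefill is easy to misread, since false means success. A minimal sketch of a refill loop with the same convention follows; the class, its fields, and the tiny initial buffer are hypothetical, and surrogate handling is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

final class RefillSketch {
    private final Reader reader;
    private char[] buffer = new char[4]; // deliberately tiny so it has to grow
    private int endRead = 0;             // number of valid chars in buffer

    RefillSketch(Reader reader) {
        this.reader = reader;
    }

    /** Returns false if new input was appended, true if the reader is exhausted. */
    boolean refill() throws IOException {
        if (endRead == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        int numRead = reader.read(buffer, endRead, buffer.length - endRead);
        if (numRead > 0) {
            endRead += numRead;
            return false;
        }
        return true;
    }

    /** Reads everything by refilling until the end of input is signalled. */
    String drain() throws IOException {
        while (!refill()) {
            // keep refilling while new input arrives
        }
        return new String(buffer, 0, endRead);
    }
}
```

With this convention the calling loop reads as "while not at end of input", which is why the success case returns false.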
    • yyclose

      public final void yyclose() throws IOException
      Closes the input reader.
      Throws:
      IOException - if the reader could not be closed.
    • yyreset

      public final void yyreset(Reader reader)
      Resets the scanner to read from a new input stream.

      Does not close the old reader.

      All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.

      The internal scan buffer is resized down to its initial length if it has grown.

      Parameters:
      reader - The new input stream.
    • yyResetPosition

      private final void yyResetPosition()
      Resets the input position.
    • yyatEOF

      public final boolean yyatEOF()
      Returns whether the scanner has reached the end of the reader it reads from.
      Returns:
      whether the scanner has reached EOF.
    • yystate

      public final int yystate()
      Returns the current lexical state.
      Returns:
      the current lexical state.
    • yybegin

      public final void yybegin(int newState)
      Enters a new lexical state.
      Parameters:
      newState - the new lexical state
    • yytext

      public final String yytext()
      Returns the text matched by the current regular expression.
      Returns:
      the matched text.
    • yycharat

      public final char yycharat(int position)
      Returns the character at the given position from the matched text.

      It is equivalent to yytext().charAt(position), but faster.

      Parameters:
      position - the position of the character to fetch. A value from 0 to yylength()-1.
      Returns:
      the character at position.
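Why yycharat can be faster than yytext().charAt(position) may be clearer from a sketch: the former indexes the buffer directly, while the latter first allocates a String. The class below is a hypothetical model built from the zzBuffer, zzStartRead and zzMarkedPos descriptions on this page, not the generated code.

```java
final class MatchWindow {
    private final char[] buffer;
    private final int startRead; // start of the matched text in buffer
    private final int markedPos; // end (exclusive) of the matched text

    MatchWindow(String text, int startRead, int markedPos) {
        this.buffer = text.toCharArray();
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    /** Copies the matched region into a new String. */
    String yytext() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Reads one char straight from the buffer, with no allocation. */
    char yycharat(int position) {
        return buffer[startRead + position];
    }

    /** Length of the matched region. */
    int yylength() {
        return markedPos - startRead;
    }
}
```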
    • yylength

      public final int yylength()
      How many characters were matched.
      Returns:
      the length of the matched text region.
    • zzScanError

      private static void zzScanError(int errorCode)
      Reports an error that occurred while scanning.

      In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen".

      If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner).

      Usual syntax/scanner level error handling should be done in error fallback rules.

      Parameters:
      errorCode - the code of the error message to display.
    • yypushback

      public void yypushback(int number)
      Pushes the specified number of characters back into the input stream.

      They will be read again by the next call of the scanning method.

      Parameters:
      number - the number of characters to be read again. This number must not be greater than yylength().
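The effect of a pushback can be sketched as moving the end of the match backwards, so the trailing characters become input again. The class below is a hypothetical model; it throws IllegalArgumentException for an oversized value, whereas the real scanner presumably reports ZZ_PUSHBACK_2BIG through zzScanError.

```java
final class PushbackSketch {
    private final char[] buffer;
    private final int startRead; // start of the matched text
    private int markedPos;       // end (exclusive) of the matched text

    PushbackSketch(String text, int startRead, int markedPos) {
        this.buffer = text.toCharArray();
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    int yylength() {
        return markedPos - startRead;
    }

    String yytext() {
        return new String(buffer, startRead, yylength());
    }

    /** Shrinks the match by number chars; those chars will be scanned again. */
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("invalid pushback: " + number);
        }
        markedPos -= number;
    }
}
```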
    • getNextToken

      public int getNextToken() throws IOException
      Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
      Returns:
      the next token.
      Throws:
      IOException - if any I/O error occurs.
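Callers of a scanner with this shape typically loop on getNextToken() until the end-of-file marker comes back, reading yytext() for each token. The sketch below uses a toy whitespace scanner as a stand-in, since WikipediaTokenizerImpl is package-private; ToyScanner, the WORD type, and the EOF value of -1 are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    static final int EOF = -1;  // assumed end-of-file marker, standing in for YYEOF
    static final int WORD = 0;  // hypothetical token type

    /** Toy stand-in for a generated scanner: splits input on whitespace. */
    static final class ToyScanner {
        private final String input;
        private int start = 0; // start of the current match
        private int pos = 0;   // end (exclusive) of the current match

        ToyScanner(String input) {
            this.input = input;
        }

        int getNextToken() {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++; // skip separators
            }
            if (pos == input.length()) {
                return EOF;
            }
            start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++; // consume the word
            }
            return WORD;
        }

        String yytext() {
            return input.substring(start, pos);
        }
    }

    /** Drives the scanner until EOF and collects the matched text of each token. */
    static List<String> tokens(String input) {
        ToyScanner scanner = new ToyScanner(input);
        List<String> out = new ArrayList<>();
        while (scanner.getNextToken() != EOF) {
            out.add(scanner.yytext());
        }
        return out;
    }
}
```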