Kouhei Sutou
null+****@clear*****
Mon Mar 16 16:09:18 JST 2015
Kouhei Sutou	2015-03-16 16:09:18 +0900 (Mon, 16 Mar 2015)

  New Revision: c39fc0daa761b8085529c2381ac279e4f1823eff
  https://github.com/groonga/groonga/commit/c39fc0daa761b8085529c2381ac279e4f1823eff

  Message:
    doc: document more tokenizers

  Added files:
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log
    doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces.log
    doc/source/example/reference/tokenizers/token-trigram.log
    doc/source/example/reference/tokenizers/token-unigram.log
  Modified files:
    doc/source/reference/tokenizers.rst

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log (+68 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log    2015-03-16 16:09:18 +0900 (5e127a2)
@@ -0,0 +1,68 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "he"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "el"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "ll"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "lo"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "o日"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "日本"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "本語"
+  #     },
+  #     {
+  #       "position": 7,
+  #       "value": "語!"
+  #     },
+  #     {
+  #       "position": 8,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 9,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 10,
+  #       "value": "!7"
+  #     },
+  #     {
+  #       "position": 11,
+  #       "value": "77"
+  #     },
+  #     {
+  #       "position": 12,
+  #       "value": "77"
+  #     },
+  #     {
+  #       "position": 13,
+  #       "value": "7"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log    2015-03-16 16:09:18 +0900 (13a38ac)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "he"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "el"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "ll"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "lo"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "o日"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "日本"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "本語"
+  #     },
+  #     {
+  #       "position": 7,
+  #       "value": "語!"
+  #     },
+  #     {
+  #       "position": 8,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 9,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 10,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log    2015-03-16 16:09:18 +0900 (16e0168)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "日本"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "本語"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "語!"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log (+28 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log    2015-03-16 16:09:18 +0900 (9f36b74)
@@ -0,0 +1,28 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "日本"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "本語"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "語"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "!!!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log    2015-03-16 16:09:18 +0900 (7a38aa7)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "10"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "00"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "0c"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "ce"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "en"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "nt"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "ts"
+  #     },
+  #     {
+  #       "position": 7,
+  #       "value": "s!"
+  #     },
+  #     {
+  #       "position": 8,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 9,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 10,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log (+48 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log    2015-03-16 16:09:18 +0900 (f166e85)
@@ -0,0 +1,48 @@
+Execution example::
+
+  tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "100"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "ce"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "en"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "nt"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "ts"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "s!"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 7,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 8,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log (+32 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log    2015-03-16 16:09:18 +0900 (0a669f8)
@@ -0,0 +1,32 @@
+Execution example::
+
+  tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "100"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "cents"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "!!"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log (+44 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log    2015-03-16 16:09:18 +0900 (004a3ff)
@@ -0,0 +1,44 @@
+Execution example::
+
+  tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "hello"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "日"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "本"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "語"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 7,
+  #       "value": "777"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log (+40 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log    2015-03-16 16:09:18 +0900 (e1eb629)
@@ -0,0 +1,40 @@
+Execution example::
+
+  tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "hello"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "日"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "本"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "語"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log    2015-03-16 16:09:18 +0900 (1ad4507)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "日"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "本"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "語"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces.log    2015-03-16 16:09:18 +0900 (1ad4507)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "日"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "本"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "語"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "!"
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-trigram.log (+24 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-trigram.log    2015-03-16 16:09:18 +0900 (f03e493)
@@ -0,0 +1,24 @@
+Execution example::
+
+  tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "10000"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "cents"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "!!!!!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-unigram.log (+24 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-unigram.log    2015-03-16 16:09:18 +0900 (501cd2d)
@@ -0,0 +1,24 @@
+Execution example::
+
+  tokenize TokenUnigram "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "100"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "cents"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "!!!"
+  #     }
+  #   ]
+  # ]
  Modified: doc/source/reference/tokenizers.rst (+181 -15)
===================================================================
--- doc/source/reference/tokenizers.rst    2015-03-16 15:10:28 +0900 (7e960f4)
+++ doc/source/reference/tokenizers.rst    2015-03-16 16:09:18 +0900 (6b7a470)
@@ -115,10 +115,10 @@ Here is a list of built-in tokenizers:
   * ``TokenBigramIgnoreBlankSplitSymbol``
   * ``TokenBigramIgnoreBlankSplitAlpha``
   * ``TokenBigramIgnoreBlankSplitAlphaDigit``
+  * ``TokenUnigram``
+  * ``TokenTrigram``
   * ``TokenDelimit``
   * ``TokenDelimitNull``
-  * ``TokenTrigram``
-  * ``TokenUnigram``
   * ``TokenMecab``
   * ``TokenRegexp``
@@ -130,6 +130,29 @@ Here is a list of built-in tokenizers:
 ``TokenBigram`` is a bigram based tokenizer. It's recommended to use
 this tokenizer for most cases.
 
+The bigram tokenize method tokenizes a text into tokens of two
+adjacent characters. For example, ``Hello`` is tokenized to the
+following tokens:
+
+  * ``He``
+  * ``el``
+  * ``ll``
+  * ``lo``
+
+The bigram tokenize method is good for recall because you can find all
+texts by a query that consists of two or more characters.
+
+In general, you can't find all texts by a one-character query because
+no one-character token exists. But in Groonga, you can find all texts
+even by a one-character query, because Groonga finds tokens that start
+with the query by predictive search. For example, Groonga can find the
+``ll`` and ``lo`` tokens by the query ``l``.
+
+The bigram tokenize method isn't good for precision because you can
+find texts that include the query inside a word. For example, you can
+find ``world`` by ``or``. This is more of a problem for ASCII-only
+languages than for non-ASCII languages. ``TokenBigram`` has a solution
+for this problem, described below.
+
 ``TokenBigram`` behavior is different when it's worked with any
 :doc:`/reference/normalizers`.
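(As an aside, the predictive-search behavior described in the added text
above can be illustrated with the same ``tokenize`` command that the new
examples use. This is a minimal sketch, not taken from the commit: the
token list is abbreviated by hand, and the real output is the JSON array
format shown in the example logs above.

  tokenize TokenBigram "Hello"
  # => tokens: "He", "el", "ll", "lo"

A query for the single character ``l`` can still match this text, because
Groonga looks up the indexed tokens that start with ``l``, here ``ll`` and
``lo``, by predictive search.)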
@@ -202,56 +225,199 @@ for non-ASCII characters.
 ``TokenBigramSplitSymbol``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramSplitSymbol`` is similar to :ref:`token-bigram`. The
+difference between them is symbol handling. ``TokenBigramSplitSymbol``
+tokenizes symbols by the bigram tokenize method:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log
+.. tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
+
 .. _token-bigram-split-symbol-alpha:
 
 ``TokenBigramSplitSymbolAlpha``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The
+difference between them is symbol and alphabet
+handling. ``TokenBigramSplitSymbolAlpha`` tokenizes symbols and
+alphabets by the bigram tokenize method:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
+.. tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
+
 .. _token-bigram-split-symbol-alpha-digit:
 
 ``TokenBigramSplitSymbolAlphaDigit``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramSplitSymbolAlphaDigit`` is similar to
+:ref:`token-bigram`. The difference between them is symbol, alphabet
+and digit handling. ``TokenBigramSplitSymbolAlphaDigit`` tokenizes
+symbols, alphabets and digits by the bigram tokenize method:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log
+.. tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
+
 .. _token-bigramIgnoreBlank:
 
 ``TokenBigramIgnoreBlank``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramIgnoreBlank`` is similar to :ref:`token-bigram`. The
+difference between them is blank handling. ``TokenBigramIgnoreBlank``
+ignores white-spaces in continuous symbols and non-ASCII characters.
+
+You can see the difference between them with the text
+``日 本 語 ! ! !`` because it has symbols and non-ASCII characters.
+
+Here is the result by :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces.log
+.. tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+
+Here is the result by ``TokenBigramIgnoreBlank``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log
+.. tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
+
 .. _token-bigramIgnoreBlank-split-symbol:
 
 ``TokenBigramIgnoreBlankSplitSymbol``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-.. _token-bigramIgnoreBlank-split-alpha:
+``TokenBigramIgnoreBlankSplitSymbol`` is similar to
+:ref:`token-bigram`. The differences between them are the following:
 
-``TokenBigramIgnoreBlankSplitAlpha``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  * Blank handling
+  * Symbol handling
 
-.. _token-bigramIgnoreBlank-split-alpha-digit:
+``TokenBigramIgnoreBlankSplitSymbol`` ignores white-spaces in
+continuous symbols and non-ASCII characters.
 
-``TokenBigramIgnoreBlankSplitAlphaDigit``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``TokenBigramIgnoreBlankSplitSymbol`` tokenizes symbols by the bigram
+tokenize method.
 
-.. _token-delimit:
+You can see the difference between them with the text
+``日 本 語 ! ! !`` because it has symbols and non-ASCII characters.
 
-``TokenDelimit``
+Here is the result by :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log
+.. tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+
+Here is the result by ``TokenBigramIgnoreBlankSplitSymbol``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log
+.. tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
+
+.. _token-bigramIgnoreBlank-split-symbol-alpha:
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha`` is similar to
+:ref:`token-bigram`. The differences between them are the following:
+
+  * Blank handling
+  * Symbol and alphabet handling
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha`` ignores white-spaces in
+continuous symbols and non-ASCII characters.
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha`` tokenizes symbols and
+alphabets by the bigram tokenize method.
+
+You can see the difference between them with the text
+``Hello 日 本 語 ! ! !`` because it has symbols and non-ASCII
+characters with white-spaces and alphabets.
+
+Here is the result by :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log
+.. tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
+
+Here is the result by ``TokenBigramIgnoreBlankSplitSymbolAlpha``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log
+.. tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
+
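(The practical point of the ``IgnoreBlank`` variants documented above is
that spaced-out text can be found by an un-spaced query, because both
forms tokenize to the same tokens. A minimal sketch, not taken from the
commit: the un-spaced input ``日本語`` is an assumed example, the expected
tokens are inferred from the logs above, and the output is abbreviated.

  tokenize TokenBigramIgnoreBlank "日本語" NormalizerAuto
  # => expected tokens: "日本", "本語", "語"

These are the same tokens that ``TokenBigramIgnoreBlank`` produces for
``日 本 語`` in the example above, so the two spellings can match each
other.)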
+.. _token-bigramIgnoreBlank-split-symbol-alpha-digit:
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit`` is similar to
+:ref:`token-bigram`. The differences between them are the following:
+
+  * Blank handling
+  * Symbol, alphabet and digit handling
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit`` ignores white-spaces
+in continuous symbols and non-ASCII characters.
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit`` tokenizes symbols,
+alphabets and digits by the bigram tokenize method.
+
+You can see the difference between them with the text
+``Hello 日 本 語 ! ! ! 777`` because it has symbols and non-ASCII
+characters with white-spaces, alphabets and digits.
+
+Here is the result by :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log
+.. tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+
+Here is the result by ``TokenBigramIgnoreBlankSplitSymbolAlphaDigit``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log
+.. tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+
+.. _token-unigram:
+
+``TokenUnigram``
 ^^^^^^^^^^^^^^^^
 
-.. _token-delimit-null:
+``TokenUnigram`` is similar to :ref:`token-bigram`. The difference
+between them is the token unit. :ref:`token-bigram` uses 2 characters
+per token. ``TokenUnigram`` uses 1 character per token.
 
-``TokenDelimitNull``
-^^^^^^^^^^^^^^^^^^^^
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-unigram.log
+.. tokenize TokenUnigram "100cents!!!" NormalizerAuto
 
 .. _token-trigram:
 
 ``TokenTrigram``
 ^^^^^^^^^^^^^^^^
 
-.. _token-unigram:
+``TokenTrigram`` is similar to :ref:`token-bigram`. The difference
+between them is the token unit. :ref:`token-bigram` uses 2 characters
+per token. ``TokenTrigram`` uses 3 characters per token.
 
-``TokenUnigram``
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-trigram.log
+.. tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
+
+.. _token-delimit:
+
+``TokenDelimit``
 ^^^^^^^^^^^^^^^^
 
+.. _token-delimit-null:
+
+``TokenDelimitNull``
+^^^^^^^^^^^^^^^^^^^^
+
 .. _token-mecab:
 
 ``TokenMecab``
 --------------
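(For context on how the tokenizers documented in this commit are used in
practice: a tokenizer is bound to a lexicon table when the schema is
defined. This is a hedged sketch with hypothetical table and column names,
using the standard ``table_create`` and ``column_create`` options.

  table_create Entries TABLE_NO_KEY
  column_create Entries content COLUMN_SCALAR Text
  table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
  column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content

``TABLE_PAT_KEY`` matters here: the predictive search for one-character
queries described above relies on prefix lookup of the indexed tokens,
which a patricia-trie key table provides.)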