[Groonga-commit] groonga/groonga at c39fc0d [master] doc: document more tokenizers


Kouhei Sutou null+****@clear*****
Mon Mar 16 16:09:18 JST 2015


Kouhei Sutou	2015-03-16 16:09:18 +0900 (Mon, 16 Mar 2015)

  New Revision: c39fc0daa761b8085529c2381ac279e4f1823eff
  https://github.com/groonga/groonga/commit/c39fc0daa761b8085529c2381ac279e4f1823eff

  Message:
    doc: document more tokenizers

  Added files:
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log
    doc/source/example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log
    doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log
    doc/source/example/reference/tokenizers/token-bigram-with-white-spaces.log
    doc/source/example/reference/tokenizers/token-trigram.log
    doc/source/example/reference/tokenizers/token-unigram.log
  Modified files:
    doc/source/reference/tokenizers.rst

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log (+68 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log    2015-03-16 16:09:18 +0900 (5e127a2)
@@ -0,0 +1,68 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "he"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "el"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "ll"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "lo"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "o日"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "日本"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "本語"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "語!"
+  #     }, 
+  #     {
+  #       "position": 8, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 9, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 10, 
+  #       "value": "!7"
+  #     }, 
+  #     {
+  #       "position": 11, 
+  #       "value": "77"
+  #     }, 
+  #     {
+  #       "position": 12, 
+  #       "value": "77"
+  #     }, 
+  #     {
+  #       "position": 13, 
+  #       "value": "7"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log    2015-03-16 16:09:18 +0900 (13a38ac)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "he"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "el"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "ll"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "lo"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "o日"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "日本"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "本語"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "語!"
+  #     }, 
+  #     {
+  #       "position": 8, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 9, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 10, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log    2015-03-16 16:09:18 +0900 (16e0168)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "日本"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "本語"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "語!"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log (+28 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log    2015-03-16 16:09:18 +0900 (9f36b74)
@@ -0,0 +1,28 @@
+Execution example::
+
+  tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "日本"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "本語"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "語"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "!!!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log    2015-03-16 16:09:18 +0900 (7a38aa7)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "10"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "00"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "0c"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "ce"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "en"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "nt"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "ts"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "s!"
+  #     }, 
+  #     {
+  #       "position": 8, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 9, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 10, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log (+48 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log    2015-03-16 16:09:18 +0900 (f166e85)
@@ -0,0 +1,48 @@
+Execution example::
+
+  tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "100"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "ce"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "en"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "nt"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "ts"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "s!"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 8, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log (+32 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log    2015-03-16 16:09:18 +0900 (0a669f8)
@@ -0,0 +1,32 @@
+Execution example::
+
+  tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "100"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "cents"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "!!"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log (+44 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log    2015-03-16 16:09:18 +0900 (004a3ff)
@@ -0,0 +1,44 @@
+Execution example::
+
+  tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "hello"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "日"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "本"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "語"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "777"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log (+40 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log    2015-03-16 16:09:18 +0900 (e1eb629)
@@ -0,0 +1,40 @@
+Execution example::
+
+  tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "hello"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "日"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "本"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "語"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log    2015-03-16 16:09:18 +0900 (1ad4507)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "日"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "本"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "語"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-with-white-spaces.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-with-white-spaces.log    2015-03-16 16:09:18 +0900 (1ad4507)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "日"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "本"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "語"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "!"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-trigram.log (+24 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-trigram.log    2015-03-16 16:09:18 +0900 (f03e493)
@@ -0,0 +1,24 @@
+Execution example::
+
+  tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "10000"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "cents"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "!!!!!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-unigram.log (+24 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-unigram.log    2015-03-16 16:09:18 +0900 (501cd2d)
@@ -0,0 +1,24 @@
+Execution example::
+
+  tokenize TokenUnigram "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "100"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "cents"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "!!!"
+  #     }
+  #   ]
+  # ]

  Modified: doc/source/reference/tokenizers.rst (+181 -15)
===================================================================
--- doc/source/reference/tokenizers.rst    2015-03-16 15:10:28 +0900 (7e960f4)
+++ doc/source/reference/tokenizers.rst    2015-03-16 16:09:18 +0900 (6b7a470)
@@ -115,10 +115,10 @@ Here is a list of built-in tokenizers:
   * ``TokenBigramIgnoreBlankSplitSymbol``
   * ``TokenBigramIgnoreBlankSplitAlpha``
   * ``TokenBigramIgnoreBlankSplitAlphaDigit``
+  * ``TokenUnigram``
+  * ``TokenTrigram``
   * ``TokenDelimit``
   * ``TokenDelimitNull``
-  * ``TokenTrigram``
-  * ``TokenUnigram``
   * ``TokenMecab``
   * ``TokenRegexp``
 
@@ -130,6 +130,29 @@ Here is a list of built-in tokenizers:
 ``TokenBigram`` is a bigram based tokenizer. It's recommended to use
 this tokenizer for most cases.
 
+The bigram tokenize method tokenizes a text into tokens of two
+adjacent characters. For example, ``Hello`` is tokenized into the
+following tokens:
+
+  * ``He``
+  * ``el``
+  * ``ll``
+  * ``lo``
+
+The bigram tokenize method is good for recall because you can find
+all texts by a query that consists of two or more characters.
+
+In general, you can't find all texts by a one-character query because
+no one-character tokens exist. But in Groonga you can find all texts
+even by a one-character query, because Groonga finds tokens that start
+with the query by predictive search. For example, Groonga can find the
+``ll`` and ``lo`` tokens by the ``l`` query.
+
+The bigram tokenize method isn't good for precision because you can
+find texts that include the query inside a word. For example, you can
+find ``world`` by ``or``. This is more of a problem for ASCII-only
+languages than for non-ASCII languages. ``TokenBigram`` has a solution
+for this problem, described below.
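+
+As a rough illustration of the points above, here is a minimal Python
+sketch of the bigram method, the one-character prefix search and the
+precision problem. It is not Groonga code; the ``bigrams`` and
+``matches`` helpers are hypothetical::
+
+  def bigrams(text):
+      # Split text into overlapping two-character tokens.
+      return [text[i:i + 2] for i in range(len(text) - 1)] or [text]
+
+  tokens = bigrams("hello")        # ['he', 'el', 'll', 'lo']
+
+  def matches(query, tokens):
+      if len(query) >= 2:
+          # Tokenize the query the same way; every query token must
+          # appear among the indexed tokens.
+          return all(q in tokens for q in bigrams(query))
+      # A one-character query matches any token that starts with it,
+      # mimicking the predictive search described above.
+      return any(t.startswith(query) for t in tokens)
+
+  matches("ell", tokens)           # True
+  matches("l", tokens)             # True: matches 'll' and 'lo'
+  matches("or", bigrams("world"))  # True: this is the precision problem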
+
 ``TokenBigram`` behavior is different when it's worked with any
 :doc:/reference/normalizers .
 
@@ -202,56 +225,199 @@ for non-ASCII characters.
 ``TokenBigramSplitSymbol``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramSplitSymbol`` is similar to :ref:`token-bigram`. The
+difference between them is symbol handling. ``TokenBigramSplitSymbol``
+tokenizes symbols by the bigram tokenize method:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log
+.. tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
+
 .. _token-bigram-split-symbol-alpha
 
 ``TokenBigramSplitSymbolAlpha``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The
+difference between them is symbol and alphabet
+handling. ``TokenBigramSplitSymbolAlpha`` tokenizes symbols and
+alphabets by the bigram tokenize method:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
+.. tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
+
 .. _token-bigram-split-symbol-alpha-digit
 
 ``TokenBigramSplitSymbolAlphaDigit``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramSplitSymbolAlphaDigit`` is similar to
+:ref:`token-bigram`. The difference between them is symbol, alphabet
+and digit handling. ``TokenBigramSplitSymbolAlphaDigit`` tokenizes
+symbols, alphabets and digits by the bigram tokenize method:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log
+.. tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
+
 .. _token-bigramIgnoreBlank
 
 ``TokenBigramIgnoreBlank``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+``TokenBigramIgnoreBlank`` is similar to :ref:`token-bigram`. The
+difference between them is blank handling. ``TokenBigramIgnoreBlank``
+ignores white-spaces between continuous symbols and non-ASCII
+characters.
+
+You can see the difference between them with the ``日 本 語 ! ! !``
+text, because it contains symbols and non-ASCII characters.
+
+Here is the result with :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces.log
+.. tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+
+Here is the result with ``TokenBigramIgnoreBlank``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-with-white-spaces.log
+.. tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
+
 .. _token-bigramIgnoreBlank-split-symbol
 
 ``TokenBigramIgnoreBlankSplitSymbol``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-.. _token-bigramIgnoreBlank-split-alpha
+``TokenBigramIgnoreBlankSplitSymbol`` is similar to
+:ref:`token-bigram`. The differences between them are the following:
 
-``TokenBigramIgnoreBlankSplitAlpha``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  * Blank handling
+  * Symbol handling
 
-.. _token-bigramIgnoreBlank-split-alpha-digit
+``TokenBigramIgnoreBlankSplitSymbol`` ignores white-spaces between
+continuous symbols and non-ASCII characters.
 
-``TokenBigramIgnoreBlankSplitAlphaDigit``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``TokenBigramIgnoreBlankSplitSymbol`` tokenizes symbols by the bigram
+tokenize method.
 
-.. _token-delimit
+You can see the difference between them with the ``日 本 語 ! ! !``
+text, because it contains symbols and non-ASCII characters.
 
-``TokenDelimit``
+Here is the result with :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol.log
+.. tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
+
+Here is the result with ``TokenBigramIgnoreBlankSplitSymbol``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol.log
+.. tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
+
+.. _token-bigramIgnoreBlank-split-symbol-alpha
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha`` is similar to
+:ref:`token-bigram`. The differences between them are the following:
+
+  * Blank handling
+  * Symbol and alphabet handling
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha`` ignores white-spaces
+between continuous symbols and non-ASCII characters.
+
+``TokenBigramIgnoreBlankSplitSymbolAlpha`` tokenizes symbols and
+alphabets by the bigram tokenize method.
+
+You can see the difference between them with the ``Hello 日 本 語 ! ! !``
+text, because it contains alphabets, symbols, and non-ASCII characters
+separated by white-spaces.
+
+Here is the result with :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet.log
+.. tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
+
+Here is the result with ``TokenBigramIgnoreBlankSplitSymbolAlpha``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet.log
+.. tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
+
+.. _token-bigramIgnoreBlank-split-symbol-alpha-digit
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit`` is similar to
+:ref:`token-bigram`. The differences between them are the following:
+
+  * Blank handling
+  * Symbol, alphabet and digit handling
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit`` ignores white-spaces
+between continuous symbols and non-ASCII characters.
+
+``TokenBigramIgnoreBlankSplitSymbolAlphaDigit`` tokenizes symbols,
+alphabets and digits by the bigram tokenize method.
+
+You can see the difference between them with the
+``Hello 日 本 語 ! ! ! 777`` text, because it contains alphabets,
+digits, symbols, and non-ASCII characters separated by white-spaces.
+
+Here is the result with :ref:`token-bigram`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-with-white-spaces-and-symbol-and-alphabet-and-digit.log
+.. tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+
+Here is the result with ``TokenBigramIgnoreBlankSplitSymbolAlphaDigit``:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ignore-blank-split-symbol-with-white-spaces-and-symbol-and-alphabet-digit.log
+.. tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
+
+.. _token-unigram
+
+``TokenUnigram``
 ^^^^^^^^^^^^^^^^
 
-.. _token-delimit-null
+``TokenUnigram`` is similar to :ref:`token-bigram`. The difference
+between them is the token unit. :ref:`token-bigram` uses 2 characters
+per token, while ``TokenUnigram`` uses 1 character per token.
 
-``TokenDelimitNull``
-^^^^^^^^^^^^^^^^^^^^
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-unigram.log
+.. tokenize TokenUnigram "100cents!!!" NormalizerAuto
 
 .. _token-trigram
 
 ``TokenTrigram``
 ^^^^^^^^^^^^^^^^
 
-.. _token-unigram
+``TokenTrigram`` is similar to :ref:`token-bigram`. The difference
+between them is the token unit. :ref:`token-bigram` uses 2 characters
+per token, while ``TokenTrigram`` uses 3 characters per token.
 
-``TokenUnigram``
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-trigram.log
+.. tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
+
+.. _token-delimit
+
+``TokenDelimit``
 ^^^^^^^^^^^^^^^^
 
+.. _token-delimit-null
+
+``TokenDelimitNull``
+^^^^^^^^^^^^^^^^^^^^
+
 .. _token-mecab
 
 ``TokenMecab``


