Using Hiragana, Katakana and Kanji for tokenizing Japanese

People working in international web development (and maybe in other sectors too) often run into the problem of tokenization. It usually comes up when you want to build a search engine. As already mentioned, I am currently working on a library for analyzing texts with PHP. This library is meant to help anyone who has to implement a search field.

The complex method of tokenizing in Japanese

So let’s look at how we can tokenize Japanese. If you have searched for this topic, you have probably found out that it is not easy and nothing you can quickly write down yourself. If you want to implement it really well, you need probabilistic heuristics and advanced techniques such as hidden Markov models or similar.

A simpler approach: Japanese knows several writing systems

So let’s think about an easier approach, one that somebody without an M.Sc. can implement and that also works for somebody who cannot rely on C++ code (e.g. KyTea) or similar.

If you think about the structure of Japanese, you will recognize that it has three writing systems, sometimes even four (the fourth shows up on the web pretty often):

  • Hiragana
  • Katakana
  • Kanji
  • Romaji (Latin characters)

We can use these different writing systems for tokenizing Japanese texts. If you have read Japanese text, you might know that the switch between hiragana and kanji fairly often happens at word boundaries. Of course, verbs mostly consist of both kanji and hiragana, but as I said, we want to keep it all very simple. We can still make it more sophisticated and accurate later. The point is: even if we have a verb consisting of kanji and hiragana, the kanji will carry the meaning. Nouns, on the other hand, are very often delimited by hiragana, because particles are written in hiragana.
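To make the idea of "which writing system does a character belong to" concrete, here is a minimal sketch. It is not part of my class: the function name `writingSystemOf` is hypothetical, it assumes PHP 7 or later, and it relies on Unicode code point ranges (hiragana U+3041 to U+3096, katakana U+30A1 to U+30FA plus the long vowel mark ー, and the basic CJK ideograph block U+4E00 to U+9FFF plus 々) instead of hand-written character lists.

```php
<?php

/**
 * Classify one UTF-8 character by writing system.
 * Assumption: the code point ranges below were chosen for this sketch;
 * rare kanji outside the basic CJK block end up as "other".
 */
function writingSystemOf(string $char): string
{
    // Hiragana letters, including the small ones such as っ and ゃ
    if (preg_match('/^[\x{3041}-\x{3096}]\z/u', $char)) {
        return 'hiragana';
    }
    // Katakana letters plus the long vowel mark ー (U+30FC)
    if (preg_match('/^[\x{30A1}-\x{30FA}\x{30FC}]\z/u', $char)) {
        return 'katakana';
    }
    // Basic CJK ideographs plus the iteration mark 々 (U+3005)
    if (preg_match('/^[\x{4E00}-\x{9FFF}\x{3005}]\z/u', $char)) {
        return 'kanji';
    }
    // Punctuation, Latin letters, digits, whitespace, ...
    return 'other';
}

// Print the writing system of every character of a short phrase.
foreach (preg_split('//u', '約束を忘れない', -1, PREG_SPLIT_NO_EMPTY) as $char) {
    echo $char, ' => ', writingSystemOf($char), PHP_EOL;
}
```

Using ranges keeps the lookup short, at the price of being less explicit than a list of characters you can read and edit by hand.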

Example

Let’s look at this easy sentence (which I just remembered from some song or anime):

約束を忘れないでください。

There we have 約束 (promise, appointment), followed by the hiragana を, a particle that marks 約束 as the object of the sentence. Then we have a form of 忘れる, where the kanji carries the meaning of “to forget”. And then we have the remaining hiragana, carrying the meaning of “please do not”.

Now, why should we not just tokenize the words when the writing system changes? That’s exactly what I thought and what I will also do in my current project. It would give us these tokens:

  • 約束
  • を
  • 忘
  • れないでください。

Of course, it is not perfect, but it is better than not splitting the text at all or splitting it only into whole sentences (which end with 。 in Japanese). As a small improvement, we can treat 。 as a sort of writing system of its own (a “no writing system” class), so that it is considered a border too.
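The splitting idea, including punctuation as a border, can be sketched with a single regular expression. This is only an illustration under a few assumptions: the function name `splitByWritingSystem` is hypothetical, PHP 7 or later is assumed, the code point ranges are my choice for the sketch, and runs of ASCII letters and digits stand in for Romaji. Anything that matches none of the classes (such as 。, 、 or whitespace) is treated as a border and dropped.

```php
<?php

/**
 * Split a UTF-8 string into runs of the same writing system.
 * Assumption: characters outside the four classes below (punctuation,
 * whitespace, ...) only act as borders and do not become tokens.
 */
function splitByWritingSystem(string $text): array
{
    $pattern = '/[\x{3041}-\x{3096}]+'          // run of hiragana
             . '|[\x{30A1}-\x{30FA}\x{30FC}]+'  // run of katakana, incl. ー
             . '|[\x{4E00}-\x{9FFF}\x{3005}]+'  // run of kanji, incl. 々
             . '|[A-Za-z0-9]+'                  // run of Romaji and digits
             . '/u';

    // Each match is one maximal run; $matches[0] holds them in order.
    preg_match_all($pattern, $text, $matches);

    return $matches[0];
}

print_r(splitByWritingSystem('約束を忘れないでください。'));
```

For the example sentence this yields the four tokens 約束, を, 忘 and れないでください, with the final 。 acting as a border instead of sticking to the last token.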

My current class

This is the code I have written up to now. It still lacks recognition of Romaji (which I did not think of before, but which can often be seen on the web), and I do not handle 。, 、 or other special characters yet (but you do not do that in a pure whitespace tokenizer either).

Feel free to work with it. I will also improve it while working on my tokenizing-stemming-searching-PHP-library.

<?php
 
require_once 'Skoch/Tokenizer.php';
 
/**
 * The hiragana (katakana) tokenizer will tokenize words according to their
 * Japanese writings. It will split when the writing system changes (e.g. from
 * hiragana to katakana or from katakana to kanji).
 */
class Skoch_Tokenizer_Hiragana implements Skoch_Tokenizer {
	private $hiragana = array(
		'あ', 'い', 'う', 'え', 'お',
		'か', 'き', 'く', 'け', 'こ',
		'さ', 'し', 'す', 'せ', 'そ',
		'た', 'ち', 'つ', 'て', 'と',
		'な', 'に', 'ぬ', 'ね', 'の',
		'は', 'ひ', 'ふ', 'へ', 'ほ',
		'ま', 'み', 'む', 'め', 'も',
		'や',      'ゆ',      'よ',
		'ら', 'り', 'る', 'れ', 'ろ',
		'わ', 'ゐ',      'ゑ', 'を',
		                    'ん',
		'が', 'ぎ', 'ぐ', 'げ', 'ご',
		'ざ', 'じ', 'ず', 'ぜ', 'ぞ',
		'だ', 'ぢ', 'づ', 'で', 'ど',
		'ば', 'び', 'ぶ', 'べ', 'ぼ',
		'ぱ', 'ぴ', 'ぷ', 'ぺ', 'ぽ',
 
		'ぁ', 'ぃ', 'ぅ', 'ぇ', 'ぉ',
		// small kana for double consonants and contracted sounds
		'っ', 'ゃ', 'ゅ', 'ょ',
	);
	private $katakana = array(
		'ア', 'イ', 'ウ', 'エ', 'オ', 
		'カ', 'キ', 'ク', 'ケ', 'コ', 
		'サ', 'シ', 'ス', 'セ', 'ソ', 
		'タ', 'チ', 'ツ', 'テ', 'ト', 
		'ナ', 'ニ', 'ヌ', 'ネ', 'ノ', 
		'ハ', 'ヒ', 'フ', 'ヘ', 'ホ', 
		'マ', 'ミ', 'ム', 'メ', 'モ', 
		'ヤ',      'ユ',      'ヨ', 
		'ラ', 'リ', 'ル', 'レ', 'ロ', 
		'ワ', 'ヰ',      'ヱ', 'ヲ', 
		                    'ン',
		'ガ', 'ギ', 'グ', 'ゲ', 'ゴ', 
		'ザ', 'ジ', 'ズ', 'ゼ', 'ゾ', 
		'ダ', 'ヂ', 'ヅ', 'デ', 'ド', 
		'バ', 'ビ', 'ブ', 'ベ', 'ボ', 
		'パ', 'ピ', 'プ', 'ペ', 'ポ', 
 
		'ァ', 'ィ', 'ゥ', 'ェ', 'ォ',
		// small kana for double consonants and contracted sounds
		'ッ', 'ャ', 'ュ', 'ョ',
		'ー',
	);
 
	const HIRAGANA = 0x1;
	const KATAKANA = 0x2;
	const KANJI = 0x4;
 
	/**
	 * Initialize a new hiragana katakana tokenizer.
	 */
	public function __construct() {
 
	}
 
	/**
	 * Split the given string into tokens wherever the writing system changes.
	 *
	 * @param string $string The UTF-8 encoded text to tokenize.
	 * @return array The tokens in order of appearance.
	 */
	public function tokenize($string) {
		// ensure that the mb_* functions work on utf-8
		$beforeEncoding = mb_internal_encoding();
		mb_internal_encoding("utf-8");

		$tokens = array();

		$currentSystem = null;
		$currentToken = '';

		// use < (not <=) so that we never read past the last character
		$length = mb_strlen($string);
		for ($i = 0; $i < $length; $i++) {
			$character = mb_substr($string, $i, 1);

			// strict comparison avoids matches caused by type juggling
			if (in_array($character, $this->hiragana, true)) {
				$system = self::HIRAGANA;
			} elseif (in_array($character, $this->katakana, true)) {
				$system = self::KATAKANA;
			} else {
				// everything else (also 。 and Romaji) counts as kanji for now
				$system = self::KANJI;
			}

			// the first character defines the starting system
			if ($currentSystem === null) {
				$currentSystem = $system;
			}

			// if the system still is the same, no boundary has been reached
			if ($currentSystem === $system) {
				$currentToken .= $character;
			} else {
				// Write ended token to tokens and start a new one
				$tokens[] = $currentToken;
				$currentToken = $character;
				$currentSystem = $system;
			}
		}

		// do not lose the token that was still open when the string ended
		if ($currentToken !== '') {
			$tokens[] = $currentToken;
		}

		// reset encoding
		mb_internal_encoding($beforeEncoding);

		return $tokens;
	}
}
