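Building a Virtual-Machine-Based Regular Expression Engine in PHP, Part 1

Hello, this is Kubota.

Do you use regular expressions? Whether you write PHP or any other language, it is hard to find a programmer who has never relied on them. Yet I suspect many people do not know how regular expressions are actually implemented.

In this article I explain one way to implement a regular expression engine, namely running the match on a virtual machine, and build a simple engine along the way.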

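How Regular Expression Engines Are Implemented

There are several ways to implement a regular expression engine. One of them is to execute the matching process on a virtual machine. PCRE, the engine behind PHP's preg_* functions, takes this approach.

The virtual machine approach is better known as a way to implement programming languages than as a way to implement regular expressions. Take CRuby, the most widely used Ruby implementation, from version 1.9 onward: Ruby code is first parsed, then compiled into an internal representation that an internal virtual machine called YARV can execute, and finally run by that virtual machine.

The same approach works for regular expressions. Over this series of articles I will explain how a virtual-machine-based regular expression engine works and implement a simple one.

Regular expression engines are usually written in C or C++, but I am a hopeless case who cannot properly read or write C, so here I will implement one in everybody's favorite language, PHP.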

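Implementing a Regular Expression Engine on a Virtual Machine

The engine built in this series basically follows Regular Expression Matching: the Virtual Machine Approach. That article explains clearly how to implement a regular expression engine on a virtual machine. It is in English, but the vocabulary is plain, so even skimming it gives you a feel for the idea.

The article's overview of the virtual machine is quoted below.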

To start, we'll define a regular expression virtual machine (think Java VM). The VM executes one or more threads, each running a regular expression program, which is just a list of regular expression instructions. Each thread maintains two registers while it runs: a program counter (PC) and a string pointer (SP).
The regular expression instructions are:
char c       If the character SP points at is not c, stop this thread: it failed. Otherwise, advance SP to the next character and advance PC to the next instruction.
match        Stop this thread: it found a match.
jmp x        Jump to (set the PC to point at) the instruction at x.
split x, y   Split execution: continue at both x and y. Create a new thread with SP copied from the current thread. One thread continues with PC x. The other continues with PC y. (Like a simultaneous jump to both locations.)

As this shows, the virtual machine that runs regular expressions is surprisingly simple. It needs only two registers, PC and SP, and only four instructions: match, char, split, and jmp.
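To make the four instructions concrete, here is a minimal sketch of an interpreter for them. It is not the virtual machine this series ends up building: the array encoding of instructions and the function name runProgram are assumptions made up for this example, and "threads" are simulated by plain recursive backtracking. The sample program is the one the referenced article gives for a+b.

```php
<?php
// Hypothetical encoding for this sketch: one array per instruction.
//   ['char', 'a']   ['match']   ['jmp', x]   ['split', x, y]

/**
 * Runs $program against $input starting at instruction $pc and offset $sp.
 * Returns true as soon as any thread reaches `match`.
 */
function runProgram(array $program, string $input, int $pc = 0, int $sp = 0): bool
{
    while (true) {
        $inst = $program[$pc];
        switch ($inst[0]) {
            case 'char':
                if ($sp >= strlen($input) || $input[$sp] !== $inst[1]) {
                    return false; // this thread failed
                }
                $sp++;
                $pc++;
                break;
            case 'match':
                return true;
            case 'jmp':
                $pc = $inst[1];
                break;
            case 'split':
                // Run the x branch as a new "thread" (a recursive call);
                // if it fails, this thread carries on at y.
                if (runProgram($program, $input, $inst[1], $sp)) {
                    return true;
                }
                $pc = $inst[2];
                break;
            default:
                throw new InvalidArgumentException("unknown opcode: {$inst[0]}");
        }
    }
}

// Program for a+b
$program = [
    ['char', 'a'],   // 0
    ['split', 0, 2], // 1: loop back for another "a", or move on
    ['char', 'b'],   // 2
    ['match'],       // 3
];

var_dump(runProgram($program, 'aab')); // bool(true)
var_dump(runProgram($program, 'b'));   // bool(false)
```

Note that match does not check for the end of the input, so this sketch accepts any input that starts with a match.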

Compare that with the Zend Engine, PHP's internal virtual machine, which has around 150 instructions, and you can see just how simple this is.

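Implementation Plan

The implementation proceeds in the following order.

1. Build the regular expression parser
2. Build the virtual machine
3. Build the compiler

This article starts by building the regular expression parser. After that comes the virtual machine that performs the matching, and finally the compiler that translates a regular expression into virtual machine instructions.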
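To give a feel for what the compiler step will produce, the following sketch applies the translation rules from the referenced article to a tiny hand-built syntax tree. Everything about its shape is an assumption for illustration: the node format (['char', c], ['cat', l, r], ['alt', l, r], ['opt', e], ['star', e], ['plus', e]) and the function names compileNode and compileRegex are not taken from this project, whose real compiler works on the parser's output and may look quite different.

```php
<?php
/**
 * Compiles one node into a list of instructions.
 * $base is the absolute address the first emitted instruction will have.
 */
function compileNode(array $node, int $base): array
{
    switch ($node[0]) {
        case 'char':
            return [['char', $node[1]]];
        case 'cat': // e1 e2: code for e1 followed by code for e2
            $left = compileNode($node[1], $base);
            $right = compileNode($node[2], $base + count($left));
            return array_merge($left, $right);
        case 'alt': // split L1, L2; L1: e1; jmp L3; L2: e2; L3:
            $left = compileNode($node[1], $base + 1);
            $l2 = $base + 1 + count($left) + 1;
            $right = compileNode($node[2], $l2);
            $l3 = $l2 + count($right);
            return array_merge([['split', $base + 1, $l2]], $left, [['jmp', $l3]], $right);
        case 'opt': // split L1, L2; L1: e; L2:
            $body = compileNode($node[1], $base + 1);
            return array_merge([['split', $base + 1, $base + 1 + count($body)]], $body);
        case 'star': // L1: split L2, L3; L2: e; jmp L1; L3:
            $body = compileNode($node[1], $base + 1);
            $l3 = $base + 1 + count($body) + 1;
            return array_merge([['split', $base + 1, $l3]], $body, [['jmp', $base]]);
        case 'plus': // L1: e; split L1, L3; L3:
            $body = compileNode($node[1], $base);
            return array_merge($body, [['split', $base, $base + count($body) + 1]]);
    }
    throw new InvalidArgumentException("unknown node type: {$node[0]}");
}

/** Compiles a whole tree and appends the final `match`. */
function compileRegex(array $ast): array
{
    $code = compileNode($ast, 0);
    $code[] = ['match'];
    return $code;
}

// Hand-built tree for a+b
$ast = ['cat', ['plus', ['char', 'a']], ['char', 'b']];
foreach (compileRegex($ast) as $address => $inst) {
    echo $address . ' ' . implode(' ', $inst) . "\n";
}
// 0 char a
// 1 split 0 2
// 2 char b
// 3 match
```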

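Building the Regular Expression Parser

The first step in implementing the engine is a parser for the regular expression grammar.

Here is a brief outline of the regular expression syntax to be implemented. It is meant for explanation, so the grammar is kept minimal.

* hoge|fuga Alternation with "|"
* a(ho|ge)b Grouping with parentheses
* a+b*c? Repetition operators such as "+", "*", and "?"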
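If you want to check what these three constructs are expected to match, PHP's built-in preg_match is a handy reference. The snippet below is only an illustration using PCRE, not the engine built in this series. The ^ and $ anchors are my addition so that the whole subject has to match, and they are not part of the grammar implemented here.

```php
<?php
// Each pattern uses one of the constructs listed above.
$results = [
    'hoge|fuga' => preg_match('/^(hoge|fuga)$/', 'fuga'), // alternation
    'a(ho|ge)b' => preg_match('/^a(ho|ge)b$/', 'ageb'),   // grouping
    'a+b*c?'    => preg_match('/^a+b*c?$/', 'aab'),       // repetition
    'no match'  => preg_match('/^a+b*c?$/', 'bc'),        // "a+" needs at least one "a"
];

echo json_encode($results), "\n";
// {"hoge|fuga":1,"a(ho|ge)b":1,"a+b*c?":1,"no match":0}
```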

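The parser for this grammar is built with PHPPEG, a parser combinator library based on PEG (parsing expression grammars). It makes building a parser easy.

Building the parser is not really the heart of the matter, so I will skip the explanation here. See the PHPPEG documentation for how to use it.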
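For readers who have never seen a parser combinator, the idea fits in a few lines of plain PHP. The sketch below does not use PHPPEG and does not mirror its API: the helper names lit, seqOf, choiceOf, and manyOf are made up for this example. A parser is simply a closure that takes the input and a position and returns [value, newPosition] on success or null on failure, and combinators build bigger parsers out of smaller ones.

```php
<?php
/** Matches the literal string $s. */
function lit(string $s): Closure
{
    return function (string $input, int $pos) use ($s) {
        return substr($input, $pos, strlen($s)) === $s ? [$s, $pos + strlen($s)] : null;
    };
}

/** Sequence: every parser must succeed, one after another. */
function seqOf(Closure ...$parsers): Closure
{
    return function (string $input, int $pos) use ($parsers) {
        $values = [];
        foreach ($parsers as $parser) {
            $result = $parser($input, $pos);
            if ($result === null) {
                return null;
            }
            $values[] = $result[0];
            $pos = $result[1];
        }
        return [$values, $pos];
    };
}

/** Ordered choice (the PEG "/" operator): the first parser that succeeds wins. */
function choiceOf(Closure ...$parsers): Closure
{
    return function (string $input, int $pos) use ($parsers) {
        foreach ($parsers as $parser) {
            $result = $parser($input, $pos);
            if ($result !== null) {
                return $result;
            }
        }
        return null;
    };
}

/** Zero or more repetitions, greedy; stops when no progress is made. */
function manyOf(Closure $parser): Closure
{
    return function (string $input, int $pos) use ($parser) {
        $values = [];
        while (($result = $parser($input, $pos)) !== null && $result[1] > $pos) {
            $values[] = $result[0];
            $pos = $result[1];
        }
        return [$values, $pos];
    };
}

// ab <- ("a" / "b")*
$ab = manyOf(choiceOf(lit('a'), lit('b')));
echo json_encode($ab('abba!', 0)), "\n"; // [["a","b","b","a"],4]
```

The grammar in the listing that follows is assembled in the same spirit, with PHPPEG supplying the building blocks.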

<?php
include_once __DIR__ . '/../vendor/autoload.php';
class RegexSyntaxParser implements PEG_IParser
{
    protected $regexParser;

    function __construct()
    {
        /*
         * regex           <- split*
         * split           <- operations ("|" operations)*
         * operations      <- operation*
         * operation       <- target suffixOperator?
         * target          <- charClass / group / singleCharacter
         * suffixOperator  <- "*" / "+" / "?"
         * group           <- "(" split ")"
         * charClass       <- "[" (!"]" .)+ "]"
         * singleCharacter <- ![+*?|[)] .
         */
        $singleCharacter = self::objectize('singleCharacter',
            PEG::second(PEG::not(PEG::choice('*', '+', '?', '|', '[', ')')), PEG::anything())
        );
        $charClass = self::objectize('charClass', PEG::second(
            '[',
            PEG::many1(PEG::second(PEG::not(']'), PEG::anything())),
            ']'
        ));
        // $split is only assigned further down; PEG::ref() lets group refer to it
        // anyway, which is what makes the group/split recursion possible.
        $group = self::objectize('group', PEG::memo(PEG::second(
            '(', PEG::ref($split), ')'
        )));
        $suffixOperator = self::objectize('suffixOperator', PEG::choice('*', '+', '?'));
        $target = PEG::choice(
            $charClass, $group, $singleCharacter
        );
        $operation = self::objectize('operation',
            PEG::seq($target, PEG::optional($suffixOperator))
        );
        $operations = self::objectize('operations', PEG::many($operation));
        $split = self::objectize('split',
            PEG::choice(PEG::listof($operations, '|'), '')
        );
        $this->regexParser = self::objectize('regex', PEG::many($split));
    }

    /**
     * @return PEG_IParser
     */
    function getParser()
    {
        return $this->regexParser;
    }

    /**
     * @param PEG_IContext $context
     */
    function parse(PEG_IContext $context)
    {
        return $this->regexParser->parse($context);
    }

    /**
     * Wraps a parser so that its result becomes a named RegexSyntaxNode.
     *
     * @param string $name
     * @param PEG_IParser $parser
     * @return PEG_IParser
     */
    protected static function objectize($name, PEG_IParser $parser)
    {
        return PEG::hook(function($result) use($name) {
            return new RegexSyntaxNode($name, $result);
        }, $parser);
    }
}
class RegexSyntaxNode
{
    protected $name, $content;

    function __construct($name, $content)
    {
        $this->name = $name;
        $this->content = $content;
    }

    function __toString()
    {
        $result = '';
        $result .= $this->name . " {\n";
        $result .= $this->dump($this->content);
        $result .= "}";
        return $result;
    }

    // Renders child content one nesting level deeper than this node.
    protected function dump($content)
    {
        $result = '';
        if (is_array($content)) {
            foreach ($content as $element) {
                $result .= $this->dump($element);
            }
        } elseif ($content instanceof self) {
            $result .= self::indent($content->__toString()) . "\n";
        } else {
            $result .= self::indent(var_export($content, true)) . "\n";
        }
        return $result;
    }

    static function indent($str)
    {
        // Try "\r\n" first so that a CRLF pair is not split into two lines.
        $lines = preg_split("/\r\n|\r|\n/", $str);
        foreach ($lines as $i => $line) {
            $lines[$i] = ' ' . $line;
        }
        // Separator first: the reversed argument order was removed in PHP 8.0.
        return implode("\n", $lines);
    }
}

This code and the project are published on GitHub, so have a look there if you want to run it yourself.

Let's feed some regular expressions to this parser.

<?php
include_once __DIR__ . '/../src/PHPRegex.php';
$parser = new RegexSyntaxParser();
echo 'a => ' . $parser->parse(PEG::context('a')) . "\n\n";
echo 'a|b =>' . $parser->parse(PEG::context('a|b')) . "\n\n";
echo 'a(bc) => ' . $parser->parse(PEG::context('a(bc)')) . "\n\n";
echo 'a+b*c? => ' . $parser->parse(PEG::context('a+b*c?')) . "\n\n";

This produces the following output. You can see that each regular expression is parsed correctly into a syntax tree.

a => regex {
 split {
  operations {
   operation {
    singleCharacter {
     'a'
    }
    false
   }
  }
 }
}

a|b =>regex {
 split {
  operations {
   operation {
    singleCharacter {
     'a'
    }
    false
   }
  }
  operations {
   operation {
    singleCharacter {
     'b'
    }
    false
   }
  }
 }
}

a(bc) => regex {
 split {
  operations {
   operation {
    singleCharacter {
     'a'
    }
    false
   }
   operation {
    group {
     split {
      operations {
       operation {
        singleCharacter {
         'b'
        }
        false
       }
       operation {
        singleCharacter {
         'c'
        }
        false
       }
      }
     }
    }
    false
   }
  }
 }
}

a+b*c? => regex {
 split {
  operations {
   operation {
    singleCharacter {
     'a'
    }
    suffixOperator {
     '+'
    }
   }
   operation {
    singleCharacter {
     'b'
    }
    suffixOperator {
     '*'
    }
   }
   operation {
    singleCharacter {
     'c'
    }
    suffixOperator {
     '?'
    }
   }
  }
 }
}

Now that the parser is done, the next step is to build the virtual machine for regular expressions and explain it in earnest. (Continued in Part 2)
