Full-text search of PDF and MS Office files with Elasticsearch

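Until recently I was using Elasticsearch at my part-time job, but given current priorities I probably won't touch it for a while, so I'm going to review it by setting it up on my own PC.

For the basics of Elasticsearch, I recommend the book below, even though it is a bit dated.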

f:id:tom__bo:20150518225640j:plain

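(↑ Together with the official documentation, this book helped me a lot.)

As an experiment, this time I'll register PDF and PowerPoint documents and search them in Japanese.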

  1. Installing Elasticsearch on a Mac

  2. Installing plugins

  3. Japanese support

  4. Support for pdf, ppt, etc.

1. Installing Elasticsearch on a Mac

brew install elasticsearch

That's all.

Unlike yum, this pulls in close to the latest version, which makes life easy.

Though I hear yum itself is on its way out too... (O_O)

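2. Installing plugins

The only plugin you really need for minimal use of Elasticsearch is kuromoji. On top of that I installed plugins for file format support and, because it looked interesting, for crawling.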

elasticsearch-analysis-kuromoji

    Plugin that adds Japanese language support to Elasticsearch.

    https://github.com/elastic/elasticsearch-analysis-kuromoji

elasticsearch-head

    Plugin that exposes Elasticsearch settings and data through a browser-based admin screen.

    http://mobz.github.io/elasticsearch-head/

mapper-attachments

    Plugin for handling files such as .pdf and .pptx.

    It adds the attachment type, which uses Apache Tika to parse these formats.

    https://github.com/elastic/elasticsearch-mapper-attachments

elasticsearch-river-web

    Crawling plugin (for future experiments).

    https://github.com/codelibs/elasticsearch-river-web

elasticsearch-quartz

    Plugin that the river-web plugin depends on.

    https://github.com/codelibs/elasticsearch-quartz

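As the official instructions say, you can install these by running the plugin command.

On a Mac, however, the plugin command ends up at "/usr/local/bin/plugin" by default, and I had trouble finding it. The which command did find it, but on Linux I believe it lived somewhere like "elasticsearch/bin/plugin", so it is a little confusing.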
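Since the location depends on how Elasticsearch was installed, a small helper that probes candidate paths can save some searching. This is only a sketch: `find_plugin_cmd` is a made-up name, and the candidate paths you pass in are your own guesses about your system.

```shell
#!/bin/sh
# find_plugin_cmd: print the first argument that is an executable file.
# Returns non-zero when none of the candidates qualifies.
find_plugin_cmd() {
  for candidate in "$@"; do
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example call (paths taken from the text above; adjust to your layout):
# find_plugin_cmd /usr/local/bin/plugin ./elasticsearch/bin/plugin
```

If nothing is printed, none of the guessed paths exists, and `which plugin` is still the quickest fallback.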

You can check which plugins are installed with:

curl -XGET 'http://localhost:9200/_nodes?plugin=true&pretty'

↓ Result

"plugins" : [ {
  "name" : "analysis-kuromoji",
  "version" : "2.5.0",
  "description" : "Kuromoji analysis support",
  "jvm" : true,
  "site" : false
}, {
  "name" : "mapper-attachments",
  "version" : "2.5.0",
  "description" : "Adds the attachment type allowing to parse difference attachment formats",
  "jvm" : true,
  "site" : false
}, {
  "name" : "QuartzPlugin",
  "version" : "NA",
  "description" : "This is a elasticsearch-quartz plugin.",
  "jvm" : true,
  "site" : false
}, {
  "name" : "WebPlugin",
  "version" : "1.4.0",
  "description" : "This is a elasticsearch-river-web plugin.",
  "jvm" : true,
  "site" : false
}, {
  "name" : "head",
  "version" : "NA",
  "description" : "No description found.",
  "url" : "/_plugin/head/",
  "jvm" : false,
  "site" : true
} ]
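The full node info response is long, so it can help to reduce it to just the plugin names. The sketch below assumes pretty-printed output shaped like the result above, with at least one plugin installed; `list_plugin_names` is a hypothetical helper name, not part of Elasticsearch.

```shell
#!/bin/sh
# list_plugin_names: read pretty-printed node info JSON on stdin and print
# the "name" values found between the "plugins" key and the closing "]".
# Assumption: one key per line, as produced by the &pretty parameter.
list_plugin_names() {
  sed -n '/"plugins" *:/,/\]/ s/.*"name" *: *"\([^"]*\)".*/\1/p'
}

# Example call against a local node:
# curl -s -XGET 'http://localhost:9200/_nodes?plugin=true&pretty' | list_plugin_names
```

Restricting the match to the plugins array keeps other "name" keys in the response, such as the node name, out of the list.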

3. Japanese support

To support Japanese, create an index whose default analyzer is kuromoji. Prepare the index settings for it as a JSON file.

[mapping.json]

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "tokenizer": {
          "kuromoji_search": {
            "type": "kuromoji_tokenizer",
            "mode": "search"
          }
        },
        "analyzer": {
          "default": {
            "tokenizer": "kuromoji_search",
            "filter": [
              "kuromoji_baseform",
              "kuromoji_part_of_speech",
              "cjk_width",
              "stop",
              "kuromoji_stemmer",
              "lowercase"
            ]
          }
        }
      }
    }
  }
}

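The filters registered on the analyzer normalize the tokens produced by morphological analysis (base forms, part-of-speech filtering, character width, and so on), which makes matching more accurate.

To confirm that Japanese text is being analyzed morphologically, create an index with this file, then try the analyze API: you will see that the text is split into proper tokens.

Creating the index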

curl -XPOST 'localhost:9200/test?pretty' -d @mapping.json

Checking that kuromoji is applied

curl -XPOST 'localhost:9200/test/_analyze?pretty' -d '関西国際空港'

(You should see the text split into four tokens: 関西, 国際, 空港, and 関西国際空港.)
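To see only the tokens rather than the whole response, the output can be filtered. This is a rough sketch that assumes the pretty-printed response puts each "token" key on its own line; `extract_tokens` is a hypothetical helper name.

```shell
#!/bin/sh
# extract_tokens: read a pretty-printed _analyze response on stdin and
# print one token per line. The enclosing "tokens" key does not match,
# because the pattern requires the key to be exactly "token".
extract_tokens() {
  sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p'
}

# Example call against the test index:
# curl -s -XPOST 'localhost:9200/test/_analyze?pretty' -d '関西国際空港' | extract_tokens
```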

At this point you can already search plain Japanese text. Also, instead of writing a query by hand every time, you can use the elasticsearch-head plugin to inspect plugins and indices through a GUI.

f:id:tom__bo:20150518234115p:plain

4. Support for pdf, ppt, etc.

On the test index created earlier with kuromoji, create a type named attachment. When doing so, declare the file field with the attachment type provided by mapper-attachments.

[attachment.json]

{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment",
        "fields" : {
          "title" : { "store" : "yes" },
          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }
        }
      }
    }
  }
}

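Creating the type

curl -X PUT "localhost:9200/test/attachment/_mapping" -d @attachment.json

All that remains is to base64-encode the file, which is the form the attachment type expects before it hands the content to Apache Tika, and register it in Elasticsearch.

Following the tutorial ( http://www.elasticsearch.cn/tutorials/2011/07/18/attachment-type-in-action.html ), I wrote a shell script to post the file.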

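#!/bin/sh

# Slurp the whole file (-0777) and encode it as one base64 string with no
# line breaks; encoding line by line would produce invalid base64.
coded=$(perl -MMIME::Base64 -0777 -ne 'print encode_base64($_, "")' test.ppt)
json="{\"file\":\"${coded}\"}"
printf '%s\n' "$json" > json.file
# Register the document under the ID test.ppt.
curl -X POST "localhost:9200/test/attachment/test.ppt" -d @json.file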
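Before posting, it can be worth checking that the base64 in the JSON body really decodes back to the original file, because a broken encoding tends to surface only later as an unsearchable document. The following is a minimal sketch under the assumption that the body is a single line of the form {"file":"..."}; `check_payload` is a hypothetical helper name.

```shell
#!/bin/sh
# check_payload ORIGINAL JSON_BODY
# Succeeds only if the base64 string inside JSON_BODY decodes to exactly
# the bytes of ORIGINAL.
check_payload() {
  sed 's/^{"file":"\(.*\)"}$/\1/' "$2" | base64 --decode | cmp -s - "$1"
}

# Example call with the file names used in the script above:
# check_payload test.ppt json.file && echo "payload OK"
```

A body whose base64 is wrapped over several lines or encoded piecewise fails this check, which is the kind of mistake it is meant to catch.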

With this,

curl "localhost:9200/_search?pretty=true" -d '{
  "fields" : ["title"],
  "query" : {
    "query_string" : {
      "query" : "エージェント"
    }
  },
  "highlight" : {
    "fields" : {
      "file" : {}
    }
  }
}'