The one where I tried to install scRUBYt! on a Sakura server and had a ridiculously hard time

Note

This is a long post!
Feel free to read only the parts you need.

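What is this?

These are my notes on where I struggled while installing scRUBYt! on a Sakura Internet rental server to play around with it.
For everyone struggling with the same thing!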

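What is scRUBYt!?

A relatively standard library among Ruby web scrapers.
Its main features:

XPath support
It can fill values into form tags and scrape the page that comes back, so you can do something close to automating web page operations

For details, see: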

Ruby Scraping - scRUBYt!

scrAPI also seems to be in use, but it doesn't support XPath, so I passed on it this time.

Installing right away

If gem is available in your environment, the single command below does it. Super easy.

%gem install scrubyt

And the install promptly fails

It stops with:
chown/chgrp: Operation not permitted

%gem install scrubyt
Building native extensions.  This could take a while...
ERROR:  Error installing scrubyt:
        ERROR: Failed to build gem native extension.

/usr/local/bin/ruby18 extconf.rb install scrubyt
checking for main() in -lc... yes
creating Makefile

make
cc -I. -I. -I/usr/local/lib/ruby/1.8/i386-freebsd6 -I.  -fPIC -O2 -fno-strict-aliasing -pipe   -fPIC -c hpricot_scan.c
cc -I. -I. -I/usr/local/lib/ruby/1.8/i386-freebsd6 -I.  -fPIC -O2 -fno-strict-aliasing -pipe   -fPIC -c hpricot_gram.c
cc -shared -o hpricot_scan.so hpricot_scan.o hpricot_gram.o -L. -L/usr/local/lib -Wl,-R/usr/local/lib -L.  -rdynamic  -Wl,-soname,hpricot_scan.so  -Wl,-R -Wl,/usr/local/lib -L/usr/local/lib -lruby18 -lc  -lcrypt -lm  -rpath=/usr/lib:/usr/local/lib -pthread  -lc

make install
mkdir -p /home/kasei-san/lib/ruby/gem/gems/hpricot-0.6.164/lib/universal-java1.6
/usr/bin/install -c -o root -g wheel -m 0755 hpricot_scan.so /home/kasei-san/lib/ruby/gem/gems/hpricot-0.6.164/lib/universal-java1.6
install: /home/kasei-san/lib/ruby/gem/gems/hpricot-0.6.164/lib/universal-java1.6/hpricot_scan.so: chown/chgrp: Operation not permitted
*** Error code 71

Stop in /home/kasei-san/lib/ruby/gem/gems/hpricot-0.6.164/ext/hpricot_scan.


Gem files will remain installed in /home/kasei-san/lib/ruby/gem/gems/hpricot-0.6.164 for inspection.
Results logged to /home/kasei-san/lib/ruby/gem/gems/hpricot-0.6.164/ext/hpricot_scan/gem_make.out

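I asked Google and found the following site:
JAM☆ぱん | さくらインターネットで、gem install すると「chown/chgrp: Operation not permitted 」と叱られる件の対応 (on fixing the "chown/chgrp: Operation not permitted" error from gem install on Sakura Internet)

See that page for the details, but in short: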

setenv RB_USER_INSTALL true
Enter something like that on the command line, then install RubyGems.

I did that, and it worked!

%gem install scrubyt
Building native extensions.  This could take a while...
Successfully installed hpricot-0.6.164
Successfully installed scrubyt-0.4.06
2 gems installed
Installing ri documentation for hpricot-0.6.164...
Installing ri documentation for scrubyt-0.4.06...
Installing RDoc documentation for hpricot-0.6.164...
Installing RDoc documentation for scrubyt-0.4.06...

require 'scrubyt' falls over

The install finished normally, but as soon as I required scrubyt, it complained that "firewatir" was missing.

/home/kasei-san/lib/rubygems/custom_require.rb:27:in `gem_original_require': no such file to load -- firewatir (LoadError)
        from /home/kasei-san/lib/rubygems/custom_require.rb:27:in `require'
        from /home/kasei-san/lib/ruby/gem/gems/scrubyt-0.4.06/lib/scrubyt/core/navigation/agents/firewatir.rb:2
        from /home/kasei-san/lib/rubygems/custom_require.rb:27:in `gem_original_require'
        from /home/kasei-san/lib/rubygems/custom_require.rb:27:in `require'
        from /home/kasei-san/lib/ruby/gem/gems/scrubyt-0.4.06/lib/scrubyt.rb:29
        from /home/kasei-san/lib/rubygems/custom_require.rb:32:in `gem_original_require'
        from /home/kasei-san/lib/rubygems/custom_require.rb:32:in `require'
        from test.rb:9

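FireWatir seems to be software for driving a browser automatically from Ruby (Firefox, as the name suggests; plain Watir is the IE one).
(I wondered whether mechanize rather than scrubyt was to blame, but the trace above shows the require coming from scrubyt's own agents/firewatir.rb.)
http://www.moongift.jp/2008/05/firewatir/
Installing it the normal way was enough.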

%gem install firewatir
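
A small sketch for spotting this kind of missing dependency before the real script runs. The helper name missing_libraries is made up for illustration; it only uses require and LoadError, which are plain Ruby.

```ruby
# Hypothetical helper: try to require each library and return the names that fail.
# Note: a LoadError raised from inside a library (like the firewatir one above)
# is reported under the name of the library that was being required.
def missing_libraries(names)
  names.reject do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Check firewatir first, because requiring scrubyt fails if firewatir is absent.
missing = missing_libraries(%w[rubygems firewatir scrubyt])
puts "missing: #{missing.join(', ')}" unless missing.empty?
```

Anything it prints is a candidate for another gem install.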

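Tips for scraping

Notes on what tripped me up when I actually used scRUBYt!

Specify XPath explicitly

When scraping, pass ":example_type => :xpath" as the second argument.
If you don't say so explicitly, the data sometimes isn't extracted properly, so watch out.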

	require 'rubygems'
	require 'scrubyt'

	# Fetch the ranking
	result = Scrubyt::Extractor.define do
		fetch 'http://www.nicovideo.jp/ranking/view/hourly/all'

		lanks '/html' do
			# Aggregation date and time
			date '//div[@id="PAGEBODY"]/div/p[@class="TXT12"][2]/strong[2]', :example_type => :xpath
			# Ranking entries
			videos './' do
				video '//td[@class="data"]/div', :example_type => :xpath do
					lank		'//p[1]/strong[1]'
					link		'//h3/a' do
						url			'./@href'
						title		'./text()'
					end
				end
			end
		end
	end

to_hash doesn't work properly

According to the reference, calling .to_hash turns the scraped result into a Hash.
But when I called .to_hash on a result whose to_xml output looks like this, it didn't build a usable Hash:

<link>
	<a>http://www.google.co.jp</a>
	<text>google</text>
</link>
<link>
	<a>http://www.yahoo.co.jp</a>
	<text>yahoo</text>
</link>

For this input, you get something like:

{
	'a'		=>'http://www.google.co.jp,http://www.yahoo.co.jp',
	'text'	=>'google,yahoo'
}

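The values are joined with commas and the pairing between each URL and its text is lost, which is a bit too painful.
So I went with the brute-force approach of converting to XML with to_xml and then turning that XML into a Hash with XmlSimple: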

hash = XmlSimple.xml_in(result.to_xml)

However, this wraps the contents in an array even for nodes that occur only once, which was a bit of a hassle.
I wonder if that can be configured somewhere...?
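
On the array question: XmlSimple's documentation describes a ForceArray option (for example XmlSimple.xml_in(xml, 'ForceArray' => false)), which appears to be the relevant setting, though it is untested here. As an alternative that sidesteps both the comma-joined values and the forced arrays, here is a minimal sketch using only REXML. The <root> wrapper and the helper name records_from_xml are assumptions for illustration; check what result.to_xml really emits.

```ruby
require 'rexml/document'

# Sample XML in the shape shown above. The <root> wrapper is an assumption.
xml = <<-XML
<root>
  <link>
    <a>http://www.google.co.jp</a>
    <text>google</text>
  </link>
  <link>
    <a>http://www.yahoo.co.jp</a>
    <text>yahoo</text>
  </link>
</root>
XML

# Hypothetical helper: build one Hash per <record_name> element,
# mapping each child element's name to its text.
def records_from_xml(xml, record_name)
  doc = REXML::Document.new(xml)
  records = []
  doc.elements.each("//#{record_name}") do |record|
    row = {}
    record.elements.each { |child| row[child.name] = child.text }
    records << row
  end
  records
end

links = records_from_xml(xml, 'link')
links.each { |link| puts "#{link['text']}: #{link['a']}" }
```

Each link keeps its own URL and text together, and single values stay plain strings. It only handles one level of children, so nested patterns would need more work.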

And then cron

When I set cron to run the scraping every hour, I got the following error:

`require': no such file to load -- rubygems (LoadError)

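rubygems wasn't being found, so I figured it was an environment variable problem. Using the page below as a reference, I added environment variables at the top of the crontab.
cronからrubyを動かす - くりまるwebつくる (on running ruby from cron)

The same error still occurred, and trial and error with various environment variables got me nowhere.
In the end I gave up and added every environment variable from my normal login session to the top of the crontab, and it finally ran.
Working out which ones are actually needed is homework for later...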
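
A rough sketch of that "copy everything" step: run it from a normal login shell and it prints the current environment as NAME=value lines that can be pasted at the top of a crontab. The helper name crontab_env_lines is made up for illustration, and I'm assuming the simple unquoted NAME=value form is enough for the values involved.

```ruby
# Hypothetical helper: turn an environment Hash into crontab "NAME=value" lines.
# Values containing a newline are skipped, since they cannot fit on one line.
def crontab_env_lines(env)
  env.sort.map do |name, value|
    next if value.include?("\n")
    "#{name}=#{value}"
  end.compact
end

# Print the current environment, sorted by name, ready to paste into crontab.
puts crontab_env_lines(ENV.to_hash)
```

Keep in mind that cron does not expand $VARIABLES in these lines, so the values have to be written out in full. Trimming the list down one variable at a time is one way to find which ones rubygems really needs.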

Reference links

scRUBYt! - Scrape. Shape. Integrate. Profit.
- The official site

Reference - Scrubyt
- The Scrubyt reference. Perhaps because it's a wiki, it has a few mistakes and uncertain spots (to_hash, for one!), but it's reasonably helpful

http://github.com/scrubber/scrubyt_examples/tree/master
- Sample code linked from the official site

Summary

Anyway, with this I can finally play with scRUBYt!
I had been sticking with Perl because it has Web::Scraper, so now I can move over to Ruby completely.

That's about it.