Hi everyone,
I have been using Scrubyt for a few days and it has worked well in most cases. However, a problem appears when I scrape URLs from Google and Yahoo in the same script. Here is my code:
require 'rubygems'
require 'scrubyt'

Scrubyt.logger = Scrubyt::Logger.new
query = 'ruby'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', query
  submit
  # retrieve by XPath
  title "/html/body/div/div/div/a" do
    url "href", :type => :attribute
  end
end # end of extractor

google_file = File.open("google.xml", "w")
google_data.to_xml.write(google_file, 1)
google_file.close

yahoo_data = Scrubyt::Extractor.define do
  fetch 'http://search.yahoo.com'
  fill_textfield 'p', query
  submit
  # retrieve by XPath
  title "/html/body/div/div/div/div/div/div/div/ol/li/div/h3/a" do
    url "href", :type => :attribute
  end
end # end of extractor

yahoo_file = File.open("yahoo.xml", "w")
yahoo_data.to_xml.write(yahoo_file, 1)
yahoo_file.close
Running environment: Ubuntu 7.04 + NetBeans 6.0 + Scrubyt
With the Google extractor defined first, as above, the output is:
google.xml
<root>
  <title>
    <url>http://www.ruby-lang.org/</url>
  </title>
  <title>
    <url>http://www.ruby-lang.org/en/20020101.html</url>
  </title>
  ...
</root>
yahoo.xml
<root>
  <title>
    <url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA;_ylu=X3oDMTE5MXY5dDllBHNlYwNzcgRwb3MDMQRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=11ff2e34s/EXP=1200144362/**http%3a//www.ruby-lang.org/en</url>
  </title>
  <title>
    <url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AdBtXNyoA;_ylu=X3oDMTE5cHJpN25qBHNlYwNzcgRwb3MDMgRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=12aq03736/EXP=1200144362/**http%3a//en.wikipedia.org/wiki/Ruby_programming_language</url>
  </title>
  ...
</root>
If I switch the order of the two extractors, that is, define the Yahoo extractor first, the result changes:
google.xml
<root/>
yahoo.xml
<root>
  <title>
    <url>http://www.ruby-lang.org/en</url>
  </title>
  <title>
    <url>http://en.wikipedia.org/wiki/Ruby_programming_language</url>
  </title>
  .....
</root>
It seems the later extractor is influenced by the earlier one. Since the XPath I use for Yahoo is longer than the one for Google, the result from Google is empty when the Yahoo extractor is defined first.
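As a stopgap for the wrapped Yahoo links in the first run, I can unwrap them after the fact. This is only a rough sketch: unwrap_yahoo_url is my own hypothetical helper, not part of Scrubyt, and it assumes the real target always follows a "**" marker in percent-encoded form, which is what the two links above look like. It does not explain why the extractors interfere with each other.

```ruby
# Hypothetical helper (not part of Scrubyt): returns the target URL embedded
# after the "**" marker of a Yahoo redirect link, or the input unchanged
# when no marker is present.
def unwrap_yahoo_url(url)
  marker = url.index('**')
  return url if marker.nil?

  encoded = url[(marker + 2)..-1]
  # Decode %XX escapes byte by byte, then label the result as UTF-8.
  encoded.b.gsub(/%(\h\h)/) { $1.hex.chr }.force_encoding('UTF-8')
end

# Shortened example link; the real ones carry a much longer tracking prefix.
puts unwrap_yahoo_url('http://rds.yahoo.com/x/**http%3a//www.ruby-lang.org/en')
# prints http://www.ruby-lang.org/en
```

Links without the marker, such as the ones from the second run, pass through untouched.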
Why is that and how can I overcome this problem? Thanks in advance.