ahei@ubuntu3:~/nutch-1.0/bin$ ./nutch crawl urls -dir crawl -depth 2
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20091126170222
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20091126170222
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching http://lucene.apache.org/nutch/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20091126170222]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20091126170233
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20091126170233
Fetcher: threads: 10
QueueFeeder finished: total 38 records.
fetching http://wiki.apache.org/nutch/
fetching http://issues.apache.org/jira/browse/Nutch
fetching http://lucene.apache.org/nutch/tutorial.html
-activeThreads=10, spinWaiting=7, fetchQueues.totalSize=35
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=35
fetching http://lucene.apache.org/nutch/skin/breadcrumbs.js
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=34
Error parsing: http://lucene.apache.org/nutch/skin/breadcrumbs.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/nutch/skin/breadcrumbs.js
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=34
fetching http://lucene.apache.org/nutch/version_control.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=33
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=33
fetching http://wiki.apache.org/nutch/FAQ
fetching http://lucene.apache.org/nutch/apidocs-0.8.x/index.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=31
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=31
fetching http://lucene.apache.org/hadoop/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=30
fetching http://forrest.apache.org/
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=29
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=29
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
fetching http://lucene.apache.org/nutch/apidocs-0.9/index.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28
fetching http://lucene.apache.org/nutch/credits.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
fetching http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26