N.B. The most important part of this article is the comparison in Section 3!
http://sannahkvist.se/
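The seed URL is a small, clean Swedish photography site. Another reason for choosing it is that the site is minimal and tidy, without a large number of junk outbound links.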
marker _gnmrk_ : 1440144780-494854742
batchId: 1440144780-494854742
Disappeared since the previous step:
Changed since the previous step:
status: 2 (status_fetched) [0(null)]
fetchTime: 1440144966010 [1440144429269]
prevFetchTime: 1440144429269 [0]
protocolStatus: SUCCESS, args=[] [(null)]
Added since the previous step:
marker _ftcmrk_ : 1440144780-494854742
metadata _rs_ : ####
header: [xxx]
contentType: text/html
content:start:
[xxx]
content:end:
Disappeared since the previous step:
metadata _csh_ : ?�##
Changed since the previous step:
parseStatus: success/ok (1/0), args=[] [(null)]
title: Sannah Kvist [null]
Added since the previous step:
signature: 261c653e067e097acc6dd5dc68072e91
marker __prsmrk__ : 1440144780-494854742
metadata [xxx]
outlink: http://sannahkvist.se/
text:start:
[xxx]
text:end:
Disappeared since the previous step:
Changed since the previous step:
marker _updmrk_ : 1440144780-494854742
metadata _csh_ : ####
inlink: http://sannahkvist.se/
http://sannahkvist.se/commissioned/ key: se.sannahkvist:http/commissioned/
baseUrl: null
status: 1 (status_unfetched) [compare to inject: 0 (null)]
fetchTime: 1440146083913 [compare to inject: 1440144429269]
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0 [compare to inject: 1.0]
[compare to inject: marker _injmrk_ : y]
marker dist : 1 [compare to inject: 0]
reprUrl: null
metadata _csh_ : ####
inlink: http://sannahkvist.se/ commissioned [compare to inject: new]
Disappeared since the previous step:
marker __prsmrk__ : 1440144780-494854742
marker _gnmrk_ : 1440144780-494854742
marker _ftcmrk_ : 1440144780-494854742
2.6.1 The solrindex command
Changed since the previous step:
Added since the previous step:
marker _idxmrk_ : 1440144780-494854742
Disappeared since the previous step:
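The "disappeared / changed / added" lists above can be produced mechanically by comparing two dumps of the same row. The following is a minimal sketch, assuming each dump has already been turned into a field-to-value dictionary; `diff_records` is a hypothetical helper name, the sample values are a subset of the seed URL's fields after generate and after fetch, and `"<binary>"` is a placeholder for the raw bytes the dump prints.

```python
def diff_records(prev, curr):
    """Compare two snapshots of one row, each given as a field -> value dict."""
    removed = {k: prev[k] for k in prev if k not in curr}
    added = {k: curr[k] for k in curr if k not in prev}
    changed = {
        k: (prev[k], curr[k])
        for k in prev
        if k in curr and prev[k] != curr[k]
    }
    return removed, changed, added


# Subset of the seed URL's row after generate (values taken from this article).
after_generate = {
    "status": "0 (null)",
    "fetchTime": 1440144429269,
    "prevFetchTime": 0,
    "protocolStatus": "(null)",
    "marker _gnmrk_": "1440144780-494854742",
    "metadata _csh_": "<binary>",  # placeholder, not the real bytes
}

# The same subset after fetch.
after_fetch = {
    "status": "2 (status_fetched)",
    "fetchTime": 1440144966010,
    "prevFetchTime": 1440144429269,
    "protocolStatus": "SUCCESS, args=[]",
    "marker _gnmrk_": "1440144780-494854742",
    "marker _ftcmrk_": "1440144780-494854742",
    "contentType": "text/html",
}

removed, changed, added = diff_records(after_generate, after_fetch)
print("disappeared:", sorted(removed))
print("changed:", sorted(changed))
print("added:", sorted(added))
```

Unchanged fields such as `marker _gnmrk_` appear in none of the three lists, which matches how the step-by-step notes in this article only record differences.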
lhd@master:~/Nutch/apache-nutch-2.3/runtime/local/bin$ ./nutch fetch-all -crawlId photo
FetcherJob: starting at 2015-08-21 21:52:26
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 3 records. Hit by time limit :0
fetching http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ (queue crawldelay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
10/10 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 10 10 kb/s, 2 URLs in 1 queues
* queue: http://sannahkvist.se
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1440165160919
now = 1440165157995
0. http://sannahkvist.se/commissioned/
1. http://sannahkvist.se/commissioned/flickan/
fetching http://sannahkvist.se/commissioned/ (queue crawldelay=5000ms)
10/10 spinwaiting/active, 2 pages, 0 errors, 0.2 0 pages/s, 9 7 kb/s, 1 URLs in 1 queues
* queue: http://sannahkvist.se
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1440165167259
now = 1440165162997
0. http://sannahkvist.se/commissioned/flickan/
fetching http://sannahkvist.se/commissioned/flickan/ (queue crawldelay=5000ms)
-finishing thread FetcherThread5, activeThreads=9
-finishing thread FetcherThread8, activeThreads=8
-finishing thread FetcherThread6, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread1, activeThreads=5
-finishing thread FetcherThread3, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread9, activeThreads=1
0/1 spinwaiting/active, 2 pages, 0 errors, 0.1 0 pages/s, 6 0 kb/s, 0 URLs in 1 queues
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 3 pages, 0 errors, 0.2 0 pages/s, 8 13 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-08-21 21:52:53, time elapsed: 00:00:27
……
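The queue dumps in the log print `nextFetchTime` and `now` next to `crawlDelay`. Reading both as epoch milliseconds (an assumption based on the field names and magnitudes), their difference is how long the single host queue stays blocked before the next URL may be fetched. A small sketch that parses the `name = value` lines and computes that wait; `parse_queue_dump` and `remaining_wait_ms` are hypothetical helper names, not part of Nutch.

```python
def parse_queue_dump(lines):
    """Collect the integer 'name = value' fields of one queue dump."""
    fields = {}
    for line in lines:
        if "=" in line:
            name, value = line.split("=", 1)
            fields[name.strip()] = int(value.strip())
    return fields


def remaining_wait_ms(fields):
    """Milliseconds until the queue may hand out its next URL (never negative)."""
    return max(0, fields["nextFetchTime"] - fields["now"])


# The two queue dumps from the log above.
first_dump = [
    "maxThreads = 1",
    "inProgress = 0",
    "crawlDelay = 5000",
    "minCrawlDelay = 0",
    "nextFetchTime = 1440165160919",
    "now = 1440165157995",
]
second_dump = [
    "maxThreads = 1",
    "inProgress = 0",
    "crawlDelay = 5000",
    "minCrawlDelay = 0",
    "nextFetchTime = 1440165167259",
    "now = 1440165162997",
]

for dump in (first_dump, second_dump):
    fields = parse_queue_dump(dump)
    print(remaining_wait_ms(fields), "ms left of", fields["crawlDelay"], "ms")
```

Both waits are shorter than the 5000 ms crawl delay, which fits a queue that is partway through its delay when the status line is printed.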
Table 1 | Inject | Generate | Fetch | Parse | Updatedb | Solrindex |
key: | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ | se.sannahkvist:http/ |
baseUrl: | null | null | null | null | null | null |
status: | 0 (null) | 0 (null) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) |
fetchTime: | 1440144429269 | 1440144429269 | 1440144966010 | 1440144966010 | 1440144966010 | 1440144966010 |
prevFetchTime: | 0 | 0 | 1440144429269 | 1440144429269 | 1440144429269 | 1440144429269 |
fetchInterval: | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 | 0 | 0 | 0 | 0 |
modifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
prevModifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
protocolStatus: | (null) | (null) | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] |
signature: |  |  |  | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 |
parseStatus: | (null) | (null) | (null) | success/ok (1/0), args=[] | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
title: | null | null | null | Sannah Kvist | Sannah Kvist | Sannah Kvist |
score: | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
marker _injmrk_ | y | y | y | y | y | y |
marker _updmrk_ |  |  |  |  | 1440144780-494854742 | 1440144780-494854742 |
marker __prsmrk__ |  |  |  | 1440144780-494854742 |  |  |
marker _gnmrk_ |  | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 |  |  |
marker _ftcmrk_ |  |  | 1440144780-494854742 | 1440144780-494854742 |  |  |
marker _idxmrk_ |  |  |  |  |  | 1440144780-494854742 |
marker dist : | 0 | 0 | 0 | 0 | 0 | 0 |
reprUrl: | null | null | null | null | null | null |
batchId: |  | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 | 1440144780-494854742 |
metadata _csh_ : | ?�## | ?�## |  |  | #### | #### |
metadata _rs_ : |  |  | #### | #### | #### | #### |
metadata: |  |  |  | xxx | xxx | xxx |
outlink: |  |  |  | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ |
inlink: |  |  |  |  | http://sannahkvist.se/ | http://sannahkvist.se/ |
header: |  |  | xxx | xxx | xxx | xxx |
contentType: |  |  | text/html | text/html | text/html | text/html |
content:start: content:end: |  |  | xxx | xxx | xxx | xxx |
text:start: text:end: |  |  |  | xxx | xxx | xxx |
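The marker rows of the table above follow a simple pattern: each step adds its own marker, and updatedb also clears the generate, fetch, and parse markers. The sketch below replays exactly those observed changes for the seed URL. It is a model of what the table shows, not Nutch's implementation; `STEP_EFFECTS` and `replay` are names introduced here for illustration.

```python
# Marker changes per step as observed for the seed URL: (step, added, removed).
STEP_EFFECTS = [
    ("inject", {"_injmrk_"}, set()),
    ("generate", {"_gnmrk_"}, set()),
    ("fetch", {"_ftcmrk_"}, set()),
    ("parse", {"__prsmrk__"}, set()),
    ("updatedb", {"_updmrk_"}, {"_gnmrk_", "_ftcmrk_", "__prsmrk__"}),
    ("solrindex", {"_idxmrk_"}, set()),
]


def replay(steps):
    """Return the set of markers present after each step, keyed by step name."""
    markers = set()
    history = {}
    for name, added, removed in steps:
        markers = (markers | added) - removed
        history[name] = set(markers)
    return history


history = replay(STEP_EFFECTS)
for step, markers in history.items():
    print(f"{step:<10} {sorted(markers)}")
```

After the last step only `_injmrk_`, `_updmrk_`, and `_idxmrk_` remain, which is the marker state the seed URL carries into the next round.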
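New URLs are produced.
This table (Table 2) compares the state of the seed URL after the inject step of round 1 with the state of the URLs newly produced after the updatedb steps of round 1 and round 2.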
Table 2 | Inject | New URL added after updatedb round 1 | New URL added after updatedb round 2 |
key: | se.sannahkvist:http/ | se.sannahkvist:http/commissioned/ | com.imdb.www:http/title/tt1342378/ |
baseUrl: | null | null | null |
status: | 0 (null) | 1 (status_unfetched) | 1 (status_unfetched) |
fetchTime: | 1440144429269 | 1440146083913 | 1440166605303 |
prevFetchTime: | 0 | 0 | 0 |
fetchInterval: | 2592000 | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 | 0 |
modifiedTime: | 0 | 0 | 0 |
prevModifiedTime: | 0 | 0 | 0 |
protocolStatus: | (null) | (null) | (null) |
signature: |  |  |  |
parseStatus: | (null) | (null) | (null) |
title: | null | null | null |
score: | 1.0 | 0.0 | 0.0 |
marker _injmrk_ | y |  |  |
marker _updmrk_ |  |  |  |
marker __prsmrk__ |  |  |  |
marker _gnmrk_ |  |  |  |
marker _ftcmrk_ |  |  |  |
marker _idxmrk_ |  |  |  |
marker dist : | 0 | 1 | 2 |
reprUrl: | null | null | null |
batchId: |  |  |  |
metadata _csh_ : | ?�## | #### | #### |
metadata _rs_ : |  |  |  |
metadata: |  |  |  |
outlink: |  |  |  |
inlink: |  | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ |
header: |  |  |  |
contentType: |  |  |  |
content:start: content:end: |  |  |  |
text:start: text:end: |  |  |  |
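The `key:` values in these tables are the URL with its host name reversed, followed by the scheme and the path. The sketch below reproduces the three keys seen in this article. It is an illustrative re-implementation under that reading, not Nutch's own code; `reverse_url` is a name chosen here, the port and query handling are assumptions, and the IMDb URL is inferred from its key.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key: reversed host, ':scheme', optional ':port', then path and query."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"  # assumption: explicit ports are kept in the key
    key += parts.path or "/"
    if parts.query:
        key += "?" + parts.query
    return key


for url in (
    "http://sannahkvist.se/",
    "http://sannahkvist.se/commissioned/",
    "http://www.imdb.com/title/tt1342378/",  # URL inferred from the key in the table
):
    print(reverse_url(url))
```

Reversing the host puts pages of the same domain next to each other when keys are sorted, which is why every sannahkvist.se row starts with `se.sannahkvist`.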
Table 3 | New URL added after updatedb 1 | generate | fetch | parse | updatedb | solrindex |
key: | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ | se.sannahkvist:http/commissioned/ |
baseUrl: | null | null | null | null | null | null |
status: | 1 (status_unfetched) | 1 (status_unfetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) | 2 (status_fetched) |
fetchTime: | 1440146083913 | 1440146083913 | 1440165162261 | 1440165162261 | 1440165162261 | 1440165162261 |
prevFetchTime: | 0 | 0 | 1440146083913 | 1440146083913 | 1440146083913 | 1440146083913 |
fetchInterval: | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 | 0 | 0 | 0 | 0 |
modifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
prevModifiedTime: | 0 | 0 | 0 | 0 | 0 | 0 |
protocolStatus: | (null) | (null) | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] | SUCCESS, args=[] |
signature: |  |  |  | c02daceb65a33aaba8fc075b4e1afe37 | c02daceb65a33aaba8fc075b4e1afe37 | c02daceb65a33aaba8fc075b4e1afe37 |
parseStatus: | (null) | (null) | (null) | success/ok (1/0), args=[] | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
title: | null | null | null | commissioned × Sannah Kvist | commissioned × Sannah Kvist | commissioned × Sannah Kvist |
score: | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
marker _injmrk_ |  |  |  |  |  |  |
marker _updmrk_ |  |  |  |  | 1440164632-449570503 | 1440164632-449570503 |
marker __prsmrk__ |  |  |  | 1440164632-449570503 |  |  |
marker _gnmrk_ |  | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 |  |  |
marker _ftcmrk_ |  |  | 1440164632-449570503 | 1440164632-449570503 |  |  |
marker _idxmrk_ |  |  |  |  |  | 1440164632-449570503 |
marker dist : | 1 | 1 | 1 | 1 | 1 | 1 |
reprUrl: | null | null | null | null | null | null |
batchId: |  | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 | 1440164632-449570503 |
metadata _csh_ : | #### | #### |  |  | #### | #### |
metadata _rs_ : |  |  | #### | #### | #### | #### |
metadata: |  |  |  | xxx | xxx | xxx |
outlink: |  |  |  | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ |
inlink: | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ http://sannahkvist.se/commissioned/ | http://sannahkvist.se/commissioned/flickan/ http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ http://sannahkvist.se/commissioned/ |
header: |  |  | xxx | xxx | xxx | xxx |
contentType: |  |  | text/html | text/html | text/html | text/html |
content:start: content:end: |  |  | xxx | xxx | xxx | xxx |
text:start: text:end: |  |  |  | xxx | xxx | xxx |
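The `fetchTime` and `prevFetchTime` values are easier to compare with the fetch log once converted to wall-clock time. This sketch assumes they are epoch milliseconds, that `fetchInterval` is in seconds, and that the crawl machine's clock was at UTC+8 (inferred from the log timestamps, not stated in the article). `from_epoch_ms` is a helper name introduced here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time on the crawl machine was UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def from_epoch_ms(ms, tz=timezone.utc):
    """Convert an epoch-milliseconds value to an aware datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=tz)


# fetchTime of http://sannahkvist.se/commissioned/ after the round-2 fetch step.
fetch_time = from_epoch_ms(1440165162261, UTC_PLUS_8)
print(fetch_time.strftime("%Y-%m-%d %H:%M:%S"))

# fetchInterval expressed in days, assuming the value is in seconds.
interval_days = timedelta(seconds=2592000).days
print(interval_days, "days")
```

Under these assumptions the converted time falls inside the window the fetch log reports for the job (started 21:52:26, finished 21:52:53 on 2015-08-21).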
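After the updatedb operation in round 2, the seed URL's record also changed, because the seed URL appears among the outlinks of the URLs crawled in this round. The effect is as if a new URL identical to the seed URL had been produced.
This table (Table 4) compares the seed URL's state after the round-2 updatedb operation with its state after the round-1 solrindex step.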
Table 4 | Solrindex 1 | updatedb 2 |
key: | se.sannahkvist:http/ | se.sannahkvist:http/ |
baseUrl: | null | null |
status: | 2 (status_fetched) | 1 (status_unfetched) |
fetchTime: | 1440144966010 | 1440166605317 |
prevFetchTime: | 1440144429269 | 1440144429269 |
fetchInterval: | 2592000 | 2592000 |
retriesSinceFetch: | 0 | 0 |
modifiedTime: | 0 | 0 |
prevModifiedTime: | 0 | 0 |
protocolStatus: | SUCCESS, args=[] | SUCCESS, args=[] |
signature: | 261c653e067e097acc6dd5dc68072e91 | 261c653e067e097acc6dd5dc68072e91 |
parseStatus: | success/ok (1/0), args=[] | success/ok (1/0), args=[] |
title: | Sannah Kvist | Sannah Kvist |
score: | 1.0 | 0.0 |
marker _injmrk_ | y | y |
marker _updmrk_ | 1440144780-494854742 |  |
marker __prsmrk__ |  |  |
marker _gnmrk_ |  |  |
marker _ftcmrk_ |  |  |
marker _idxmrk_ | 1440144780-494854742 |  |
marker dist : | 0 | 2 |
reprUrl: | null | null |
batchId: | 1440144780-494854742 | 1440144780-494854742 |
metadata _csh_ : | #### | #### |
metadata _rs_ : | #### |  |
metadata: | xxx | xxx |
outlink: | http://sannahkvist.se/ | http://sannahkvist.se/ |
inlink: | http://sannahkvist.se/ | http://sannahkvist.se/commissioned/flickan/ http://sannahkvist.se/commissioned/i-rymden-finns-inga-kanslor/ http://sannahkvist.se/commissioned/ |
header: | xxx | xxx |
contentType: | text/html | text/html |
content:start: content:end: | xxx | xxx |
text:start: text:end: | xxx | xxx |