Heritrix 3.1.0 Source Code Analysis (Part 2)

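The previous post, Heritrix 3.1.0 Source Code Analysis (Part 1), covered setting up Heritrix 3.1.0 in Eclipse, which was only a warm-up for the source analysis itself. This post moves on to job configuration. A major difference between Heritrix 3.1.0 and the older Heritrix 1.14.4 is that the job configuration has moved from order.xml to crawler-beans.cxml. The crawler-beans.cxml file is in fact a Spring container configuration file, so the components of a job are managed by Spring. Let's first get familiar with what the file looks like (I have configured one job in it):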

<?xml version="1.0" encoding="UTF-8"?>

<!-- 

  HERITRIX 3 CRAWL JOB CONFIGURATION FILE

  

   This is a relatively minimal configuration suitable for many crawls.

   

   Commented-out beans and properties are provided as an example; values

   shown in comments reflect the actual defaults which are in effect

   if not otherwise specified. (To change from the default 

   behavior, uncomment AND alter the shown values.)   

 -->

<beans xmlns="http://www.springframework.org/schema/beans"

         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

         xmlns:context="http://www.springframework.org/schema/context"

         xmlns:aop="http://www.springframework.org/schema/aop"

         xmlns:tx="http://www.springframework.org/schema/tx"

         xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd

           http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-3.0.xsd

           http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-3.0.xsd

           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd">

 

 <context:annotation-config/>



<!-- 

  OVERRIDES

   Values elsewhere in the configuration may be replaced ('overridden') 

   by a Properties map declared in a PropertiesOverrideConfigurer, 

   using a dotted-bean-path to address individual bean properties. 

   This allows us to collect a few of the most-often changed values

   in an easy-to-edit format here at the beginning of the model

   configuration.    

 -->

 <!-- overrides from a text property list -->

 <bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">

  <property name="properties">

   <value>

# This Properties map is specified in the Java 'property list' text format

# http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29



metadata.operatorContactUrl=http://localhost

metadata.jobName=myjob1

metadata.description=myjob1 description



##..more?..##

   </value>

  </property>

 </bean>



 <!-- overrides from declared <prop> elements, more easily allowing

      multiline values or even declared beans -->

 <bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">

  <property name="properties">

   <props>

    <prop key="seeds.textSource.value">



# URLS HERE

http://www.watiao.net



    </prop>

   </props>

  </property>

 </bean>



 <!-- CRAWL METADATA: including identification of crawler/operator -->

 <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">

       <property name="operatorContactUrl" value="http://localhost"/>

       <property name="jobName" value="myjob1"/>

       <property name="description" value="myjob1 description"/>

  <!-- <property name="robotsPolicyName" value="obey"/> -->

  <!-- <property name="operator" value=""/> -->

  <!-- <property name="operatorFrom" value=""/> -->

  <!-- <property name="organization" value=""/> -->

  <!-- <property name="audience" value=""/> -->

  <!-- <property name="userAgentTemplate" 

         value="Mozilla/5.0 (compatible; heritrix/3.1.0 +@OPERATOR_CONTACT_URL@)"/> -->

       

 </bean>

 

 <!-- SEEDS: crawl starting points 

      ConfigString allows simple, inline specification of a moderate

      number of seeds; see below comment for example of using an

      arbitrarily-large external file. -->

 <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">

     <property name="textSource">

      <bean class="org.archive.spring.ConfigString">

       <property name="value">

        <value>

# [see override above]

        </value>

       </property>

      </bean>

     </property>

<!-- <property name='sourceTagSeeds' value='false'/> -->

<!-- <property name='blockAwaitingSeedLines' value='-1'/> -->

 </bean>

 

 <!-- SEEDS ALTERNATE APPROACH: specifying external seeds.txt file in

      the job directory, similar to the H1 approach. 

      Use either the above, or this, but not both. -->

 <!-- 

 <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">

  <property name="textSource">

   <bean class="org.archive.spring.ConfigFile">

    <property name="path" value="seeds.txt" />

   </bean>

  </property>

  <property name='sourceTagSeeds' value='false'/>

  <property name='blockAwaitingSeedLines' value='-1'/>

 </bean>

  -->

 

 <!-- SCOPE: rules for which discovered URIs to crawl; order is very 

      important because last decision returned other than 'NONE' wins. -->

 <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">

  <!-- <property name="logToFile" value="false" /> -->

  <property name="rules">

   <list>

    <!-- Begin by REJECTing all... -->

    <bean class="org.archive.modules.deciderules.RejectDecideRule">

    </bean>

    <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->

    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">

     <!-- <property name="seedsAsSurtPrefixes" value="true" /> -->

     <!-- <property name="alsoCheckVia" value="false" /> -->

     <!-- <property name="surtsSourceFile" value="" /> -->

     <!-- <property name="surtsDumpFile" value="${launchId}/surts.dump" /> -->

     <!-- <property name="surtsSource">

           <bean class="org.archive.spring.ConfigString">

            <property name="value">

             <value>

              # example.com

              # http://www.example.edu/path1/

              # +http://(org,example,

             </value>

            </property> 

           </bean>

          </property> -->

    </bean>

    <!-- ...but REJECT those more than a configured link-hop-count from start... -->

    <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">

     <!-- <property name="maxHops" value="20" /> -->

    </bean>

    <!-- ...but ACCEPT those reached via a limited number of transclusion hops (embeds, redirects)... -->

    <bean class="org.archive.modules.deciderules.TransclusionDecideRule">

     <!-- <property name="maxTransHops" value="2" /> -->

     <!-- <property name="maxSpeculativeHops" value="1" /> -->

    </bean>

    <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->

    <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">

          <property name="decision" value="REJECT"/>

          <property name="seedsAsSurtPrefixes" value="false"/>

          <property name="surtsDumpFile" value="${launchId}/negative-surts.dump" /> 

     <!-- <property name="surtsSource">

           <bean class="org.archive.spring.ConfigFile">

            <property name="path" value="negative-surts.txt" />

           </bean>

          </property> -->

    </bean>

    <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->

    <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">

          <property name="decision" value="REJECT"/>

     <!-- <property name="listLogicalOr" value="true" /> -->

     <!-- <property name="regexList">

           <list>

           </list>

          </property> -->

    </bean>

    <!-- ...and REJECT those with suspicious repeating path-segments... -->

    <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">

     <!-- <property name="maxRepetitions" value="2" /> -->

    </bean>

    <!-- ...and REJECT those with more than threshold number of path-segments... -->

    <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">

     <!-- <property name="maxPathDepth" value="20" /> -->

    </bean>

    <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->

    <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">

    </bean>

    <!-- ...but always REJECT those with unsupported URI schemes -->

    <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">

    </bean>

   </list>

  </property>

 </bean>

 

 <!-- 

   PROCESSING CHAINS

    Much of the crawler's work is specified by the sequential 

    application of swappable Processor modules. These Processors

    are collected into three 'chains'. The CandidateChain is applied 

    to URIs being considered for inclusion, before a URI is enqueued

    for collection. The FetchChain is applied to URIs when their 

    turn for collection comes up. The DispositionChain is applied 

    after a URI is fetched and analyzed/link-extracted.

  -->

  

 <!-- CANDIDATE CHAIN --> 

 <!-- first, processors are declared as top-level named beans -->

 <bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">

 </bean>

 <bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">

  <!-- <property name="preferenceDepthHops" value="-1" /> -->

  <!-- <property name="preferenceEmbedHops" value="1" /> -->

  <!-- <property name="canonicalizationPolicy"> 

        <ref bean="canonicalizationPolicy" />

       </property> -->

  <!-- <property name="queueAssignmentPolicy"> 

        <ref bean="queueAssignmentPolicy" />

       </property> -->

  <!-- <property name="uriPrecedencePolicy"> 

        <ref bean="uriPrecedencePolicy" />

       </property> -->

  <!-- <property name="costAssignmentPolicy"> 

        <ref bean="costAssignmentPolicy" />

       </property> -->

 </bean>

 <!-- now, processors are assembled into ordered CandidateChain bean -->

 <bean id="candidateProcessors" class="org.archive.modules.CandidateChain">

  <property name="processors">

   <list>

    <!-- apply scoping rules to each individual candidate URI... -->

    <ref bean="candidateScoper"/>

    <!-- ...then prepare those ACCEPTed to be enqueued to frontier. -->

    <ref bean="preparer"/>

   </list>

  </property>

 </bean>

  

 <!-- FETCH CHAIN --> 

 <!-- first, processors are declared as top-level named beans -->

 <bean id="preselector" class="org.archive.crawler.prefetch.Preselector">

      <!-- <property name="recheckScope" value="false" />-->

     <!--  <property name="blockAll" value="false" />-->

     <!--  <property name="blockByRegex" value="" />-->

     <!--  <property name="allowByRegex" value="" />-->

 </bean>

 <bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">

  <!-- <property name="ipValidityDurationSeconds" value="21600" /> -->

  <!-- <property name="robotsValidityDurationSeconds" value="86400" /> -->

  <!-- <property name="calculateRobotsOnly" value="false" /> -->

 </bean>

 <bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS">

  <!-- <property name="acceptNonDnsResolves" value="false" /> -->

  <!-- <property name="digestContent" value="true" /> -->

  <!-- <property name="digestAlgorithm" value="sha1" /> -->

 </bean>

 <!-- <bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">

       <property name="specialQueryTemplates">

        <map>

         <entry key="whois.verisign-grs.com" value="domain %s" />

         <entry key="whois.arin.net" value="z + %s" />

         <entry key="whois.denic.de" value="-T dn %s" />

        </map>

       </property> 

      </bean> -->

 <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">

  <!-- <property name="useHTTP11" value="false" /> -->

  <!-- <property name="maxLengthBytes" value="0" /> -->

  <!-- <property name="timeoutSeconds" value="1200" /> -->

  <!-- <property name="maxFetchKBSec" value="0" /> -->

  <!-- <property name="defaultEncoding" value="ISO-8859-1" /> -->

  <!-- <property name="shouldFetchBodyRule"> 

        <bean class="org.archive.modules.deciderules.AcceptDecideRule"/>

       </property> -->

  <!-- <property name="soTimeoutMs" value="20000" /> -->

  <!-- <property name="sendIfModifiedSince" value="true" /> -->

  <!-- <property name="sendIfNoneMatch" value="true" /> -->

  <!-- <property name="sendConnectionClose" value="true" /> -->

  <!-- <property name="sendReferer" value="true" /> -->

  <!-- <property name="sendRange" value="false" /> -->

  <!-- <property name="ignoreCookies" value="false" /> -->

  <!-- <property name="sslTrustLevel" value="OPEN" /> -->

  <!-- <property name="acceptHeaders"> 

        <list>

         <value>Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>

        </list>

       </property>

  -->

  <!-- <property name="httpBindAddress" value="" /> -->

  <!-- <property name="httpProxyHost" value="" /> -->

  <!-- <property name="httpProxyPort" value="0" /> -->

  <!-- <property name="httpProxyUser" value="" /> -->

  <!-- <property name="httpProxyPassword" value="" /> -->

  <!-- <property name="digestContent" value="true" /> -->

  <!-- <property name="digestAlgorithm" value="sha1" /> -->

 </bean>

 <bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">

 </bean>

 <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">

  <!-- <property name="extractJavascript" value="true" /> -->

  <!-- <property name="extractValueAttributes" value="true" /> -->

  <!-- <property name="ignoreFormActionUrls" value="false" /> -->

  <!-- <property name="extractOnlyFormGets" value="true" /> -->

  <!-- <property name="treatFramesAsEmbedLinks" value="true" /> -->

  <!-- <property name="ignoreUnexpectedHtml" value="true" /> -->

  <!-- <property name="maxElementLength" value="1024" /> -->

  <!-- <property name="maxAttributeNameLength" value="1024" /> -->

  <!-- <property name="maxAttributeValueLength" value="16384" /> -->

 </bean>

 <bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS">

 </bean> 

 <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">

 </bean>

 <bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF">

 </bean>    

 <!-- now, processors are assembled into ordered FetchChain bean -->

 <bean id="fetchProcessors" class="org.archive.modules.FetchChain">

  <property name="processors">

   <list>

    <!-- re-check scope, if so enabled... -->

    <ref bean="preselector"/>

    <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->

    <ref bean="preconditions"/>

    <!-- ...fetch if DNS URI... -->

    <ref bean="fetchDns"/>

    <!-- <ref bean="fetchWhois"/> -->

    <!-- ...fetch if HTTP URI... -->

    <ref bean="fetchHttp"/>

    <!-- ...extract outlinks from HTTP headers... -->

    <ref bean="extractorHttp"/>

    <!-- ...extract outlinks from HTML content... -->

    <ref bean="extractorHtml"/>

    <!-- ...extract outlinks from CSS content... -->

    <ref bean="extractorCss"/>

    <!-- ...extract outlinks from Javascript content... -->

    <ref bean="extractorJs"/>

    <!-- ...extract outlinks from Flash content... -->

    <ref bean="extractorSwf"/>

   </list>

  </property>

 </bean>

  

 <!-- DISPOSITION CHAIN -->

 <!-- first, processors are declared as top-level named beans  -->

 <bean id="warcWriter" class="org.archive.modules.writer.MirrorWriterProcessor">

  <!-- <property name="compress" value="true" /> -->

  <!-- <property name="prefix" value="IAH" /> -->

  <!-- <property name="suffix" value="${HOSTNAME}" /> -->

  <!-- <property name="maxFileSizeBytes" value="1000000000" /> -->

  <!-- <property name="poolMaxActive" value="1" /> -->

  <!-- <property name="MaxWaitForIdleMs" value="500" /> -->

  <!-- <property name="skipIdenticalDigests" value="false" /> -->

  <!-- <property name="maxTotalBytesToWrite" value="0" /> -->

  <!-- <property name="directory" value="${launchId}" /> -->

  <!-- <property name="storePaths">

        <list>

         <value>warcs</value>

        </list>

       </property> -->

  <!-- <property name="writeRequests" value="true" /> -->

  <!-- <property name="writeMetadata" value="true" /> -->

  <!-- <property name="writeRevisitForIdenticalDigests" value="true" /> -->

  <!-- <property name="writeRevisitForNotModified" value="true" /> -->

 </bean>

 <bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">

  <!-- <property name="seedsRedirectNewSeeds" value="true" /> -->

 </bean>

 <bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">

  <!-- <property name="delayFactor" value="5.0" /> -->

  <!-- <property name="minDelayMs" value="3000" /> -->

  <!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> -->

  <!-- <property name="maxDelayMs" value="30000" /> -->

  <!-- <property name="maxPerHostBandwidthUsageKbSec" value="0" /> -->

 </bean>

 <!-- <bean id="rescheduler" class="org.archive.crawler.postprocessor.ReschedulingProcessor">

       <property name="rescheduleDelaySeconds" value="-1" />

      </bean> -->

 <!-- now, processors are assembled into ordered DispositionChain bean -->

 <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">

  <property name="processors">

   <list>

    <!-- write to aggregate archival files... -->

    <ref bean="warcWriter"/>

    <!-- ...send each outlink candidate URI to CandidateChain, 

         and enqueue those ACCEPTed to the frontier... -->

    <ref bean="candidates"/>

    <!-- ...then update stats, shared-structures, frontier decisions -->

    <ref bean="disposition"/>

    <!-- <ref bean="rescheduler" /> -->

   </list>

  </property>

 </bean>

 

 <!-- CRAWLCONTROLLER: Control interface, unifying context -->

 <bean id="crawlController" 

   class="org.archive.crawler.framework.CrawlController">

  <!-- <property name="maxToeThreads" value="25" /> -->

  <!-- <property name="pauseAtStart" value="true" /> -->

  <!-- <property name="runWhileEmpty" value="false" /> -->

  <!-- <property name="recorderInBufferBytes" value="524288" /> -->

  <!-- <property name="recorderOutBufferBytes" value="16384" /> -->

  <!-- <property name="scratchDir" value="scratch" /> -->

 </bean>

 

 <!-- FRONTIER: Record of all URIs discovered and queued-for-collection -->

 <bean id="frontier" 

   class="org.archive.crawler.frontier.BdbFrontier">

  <!-- <property name="queueTotalBudget" value="-1" /> -->

  <!-- <property name="balanceReplenishAmount" value="3000" /> -->

  <!-- <property name="errorPenaltyAmount" value="100" /> -->

  <!-- <property name="precedenceFloor" value="255" /> -->

  <!-- <property name="queuePrecedencePolicy">

        <bean class="org.archive.crawler.frontier.precedence.BaseQueuePrecedencePolicy" />

       </property> -->

  <!-- <property name="snoozeLongMs" value="300000" /> -->

  <!-- <property name="retryDelaySeconds" value="900" /> -->

  <!-- <property name="maxRetries" value="30" /> -->

  <!-- <property name="recoveryLogEnabled" value="true" /> -->

  <!-- <property name="maxOutlinks" value="6000" /> -->

  <!-- <property name="extractIndependently" value="false" /> -->

  <!-- <property name="outbound">

        <bean class="java.util.concurrent.ArrayBlockingQueue">

         <constructor-arg value="200"/>

         <constructor-arg value="true"/>

        </bean>

       </property> -->

  <!-- <property name="inbound">

        <bean class="java.util.concurrent.ArrayBlockingQueue">

         <constructor-arg value="40000"/>

         <constructor-arg value="true"/>

        </bean>

       </property> -->

  <!-- <property name="dumpPendingAtClose" value="false" /> -->

 </bean>

 

 <!-- URI UNIQ FILTER: Used by frontier to remember already-included URIs --> 

 <bean id="uriUniqFilter" 

   class="org.archive.crawler.util.BdbUriUniqFilter">

 </bean>

 

 <!--

   EXAMPLE SETTINGS OVERLAY SHEETS

   Sheets allow some settings to vary by context - usually by URI context,

   so that different sites or sections of sites can be treated differently. 

   Here are some example Sheets for common purposes. The SheetOverlaysManager

   (below) automatically collects all Sheet instances declared among the 

   original beans, but others can be added during the crawl via the scripting 

   interface.

  -->



<!-- forceRetire: any URI to which this sheet's settings are applied 

     will force its containing queue to 'retired' status. -->

<bean id='forceRetire' class='org.archive.spring.Sheet'>

 <property name='map'>

  <map>

   <entry key='disposition.forceRetire' value='true'/>

  </map>

 </property>

</bean>



<!-- smallBudget: any URI to which this sheet's settings are applied 

     will give its containing queue small values for balanceReplenishAmount 

     (causing it to have shorter 'active' periods while other queues are 

     waiting) and queueTotalBudget (causing the queue to enter 'retired' 

     status once that expenditure is reached by URI attempts and errors) -->

<bean id='smallBudget' class='org.archive.spring.Sheet'>

 <property name='map'>

  <map>

   <entry key='frontier.balanceReplenishAmount' value='20'/>

   <entry key='frontier.queueTotalBudget' value='100'/>

  </map>

 </property>

</bean>



<!-- veryPolite: any URI to which this sheet's settings are applied 

     will cause its queue to take extra-long politeness snoozes -->

<bean id='veryPolite' class='org.archive.spring.Sheet'>

 <property name='map'>

  <map>

   <entry key='disposition.delayFactor' value='10'/>

   <entry key='disposition.minDelayMs' value='10000'/>

   <entry key='disposition.maxDelayMs' value='1000000'/>

   <entry key='disposition.respectCrawlDelayUpToSeconds' value='3600'/>

  </map>

 </property>

</bean>



<!-- highPrecedence: any URI to which this sheet's settings are applied 

     will give its containing queue a slightly-higher than default 

     queue precedence value. That queue will then be preferred over 

     other queues for active crawling, never waiting behind lower-

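     precedence queues. NOTE: the two entries below duplicate the
     smallBudget settings (budget values) and do not set a queue
     precedence. -->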

<bean id='highPrecedence' class='org.archive.spring.Sheet'>

 <property name='map'>

  <map>

   <entry key='frontier.balanceReplenishAmount' value='20'/>

   <entry key='frontier.queueTotalBudget' value='100'/>

  </map>

 </property>

</bean>



<!--

   EXAMPLE SETTINGS OVERLAY SHEET-ASSOCIATION

   A SheetAssociation says certain URIs should have certain overlay Sheets

   applied. This example applies two sheets to URIs matching two SURT-prefixes.

   New associations may also be added mid-crawl using the scripting facility.

  -->



<!--

<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>

 <property name='surtPrefixes'>

  <list>

   <value>http://(org,example,</value>

   <value>http://(com,example,www,)/</value>

  </list>

 </property>

 <property name='targetSheetNames'>

  <list>

   <value>veryPolite</value>

   <value>smallBudget</value>

  </list>

 </property>

</bean>

-->



 <!-- 

   OPTIONAL BUT RECOMMENDED BEANS

  -->

  

 <!-- ACTIONDIRECTORY: disk directory for mid-crawl operations

      Running job will watch directory for new files with URIs, 

      scripts, and other data to be processed during a crawl. -->

 <bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">

  <!-- <property name="actionDir" value="action" /> -->

  <!-- <property name="doneDir" value="${launchId}/actions-done" /> -->

  <!-- <property name="initialDelaySeconds" value="10" /> -->

  <!-- <property name="delaySeconds" value="30" /> -->

 </bean> 

 

 <!--  CRAWLLIMITENFORCER: stops crawl when it reaches configured limits -->

 <bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">

  <!-- <property name="maxBytesDownload" value="0" /> -->

  <!-- <property name="maxDocumentsDownload" value="0" /> -->

  <!-- <property name="maxTimeSeconds" value="0" /> -->

 </bean>

 

 <!-- CHECKPOINTSERVICE: checkpointing assistance -->

 <bean id="checkpointService" 

   class="org.archive.crawler.framework.CheckpointService">

  <!-- <property name="checkpointIntervalMinutes" value="-1"/> -->

  <!-- <property name="checkpointsDir" value="checkpoints"/> -->

 </bean>

 

 <!-- 

   OPTIONAL BEANS

    Uncomment and expand as needed, or if non-default alternate 

    implementations are preferred.

  -->

  

 <!-- CANONICALIZATION POLICY -->

 <!--

 <bean id="canonicalizationPolicy" 

   class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">

   <property name="rules">

    <list>

     <bean class="org.archive.modules.canonicalize.LowercaseRule" />

     <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />

     <bean class="org.archive.modules.canonicalize.StripWWWNRule" />

     <bean class="org.archive.modules.canonicalize.StripSessionIDs" />

     <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />

     <bean class="org.archive.modules.canonicalize.FixupQueryString" />

    </list>

  </property>

 </bean>

 -->

 



 <!-- QUEUE ASSIGNMENT POLICY -->

 <!--

 <bean id="queueAssignmentPolicy" 

   class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">

  <property name="forceQueueAssignment" value="" />

  <property name="deferToPrevious" value="true" />

  <property name="parallelQueues" value="1" />

 </bean>

 -->

 

 <!-- URI PRECEDENCE POLICY -->

 <!--

 <bean id="uriPrecedencePolicy" 

   class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">

 </bean>

 -->

 

 <!-- COST ASSIGNMENT POLICY -->

 <!--

 <bean id="costAssignmentPolicy" 

   class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">

 </bean>

 -->

 

 <!-- CREDENTIAL STORE: HTTP authentication or FORM POST credentials -->

 <!-- 

 <bean id="credentialStore" 

   class="org.archive.modules.credential.CredentialStore">

 </bean>

 -->

 

 <!-- DISK SPACE MONITOR: 

      Pauses the crawl if disk space at monitored paths falls below minimum threshold -->

 <!-- 

 <bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">

   <property name="pauseThresholdMiB" value="500" />

   <property name="monitorConfigPaths" value="true" />

   <property name="monitorPaths">

     <list>

       <value>PATH</value>

     </list>

   </property>

 </bean>

 -->

 

 <!-- 

   REQUIRED STANDARD BEANS

    It will be very rare to replace or reconfigure the following beans.

  -->



 <!-- STATISTICSTRACKER: standard stats/reporting collector -->

 <bean id="statisticsTracker" 

   class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">

  <!-- <property name="reports">

        <list>

         <bean id="crawlSummaryReport" class="org.archive.crawler.reporting.CrawlSummaryReport" />

         <bean id="seedsReport" class="org.archive.crawler.reporting.SeedsReport" />

         <bean id="hostsReport" class="org.archive.crawler.reporting.HostsReport" />

         <bean id="sourceTagsReport" class="org.archive.crawler.reporting.SourceTagsReport" />

         <bean id="mimetypesReport" class="org.archive.crawler.reporting.MimetypesReport" />

         <bean id="responseCodeReport" class="org.archive.crawler.reporting.ResponseCodeReport" />

         <bean id="processorsReport" class="org.archive.crawler.reporting.ProcessorsReport" />

         <bean id="frontierSummaryReport" class="org.archive.crawler.reporting.FrontierSummaryReport" />

         <bean id="frontierNonemptyReport" class="org.archive.crawler.reporting.FrontierNonemptyReport" />

         <bean id="toeThreadsReport" class="org.archive.crawler.reporting.ToeThreadsReport" />

        </list>

       </property> -->

  <!-- <property name="reportsDir" value="${launchId}/reports" /> -->

  <!-- <property name="liveHostReportSize" value="20" /> -->

  <!-- <property name="intervalSeconds" value="20" /> -->

  <!-- <property name="keepSnapshotsCount" value="5" /> -->


 </bean>

 

 <!-- CRAWLERLOGGERMODULE: shared logging facility -->

 <bean id="loggerModule" 

   class="org.archive.crawler.reporting.CrawlerLoggerModule">

  <!-- <property name="path" value="${launchId}/logs" /> -->

  <!-- <property name="crawlLogPath" value="crawl.log" /> -->

  <!-- <property name="alertsLogPath" value="alerts.log" /> -->

  <!-- <property name="progressLogPath" value="progress-statistics.log" /> -->

  <!-- <property name="uriErrorsLogPath" value="uri-errors.log" /> -->

  <!-- <property name="runtimeErrorsLogPath" value="runtime-errors.log" /> -->

  <!-- <property name="nonfatalErrorsLogPath" value="nonfatal-errors.log" /> -->

  <!-- <property name="logExtraInfo" value="false" /> -->

 </bean>

 

 <!-- SHEETOVERLAYMANAGER: manager of sheets of contextual overlays

      Autowired to include any SheetAssociation beans, such as

      SurtPrefixesSheetAssociation -->

 <bean id="sheetOverlaysManager" autowire="byType"

   class="org.archive.crawler.spring.SheetOverlaysManager">

 </bean>



 <!-- BDBMODULE: shared BDB-JE disk persistence manager -->

 <bean id="bdb" 

  class="org.archive.bdb.BdbModule">

  <!-- <property name="dir" value="state" /> -->

  <!-- <property name="cachePercent" value="60" /> -->

  <!-- <property name="useSharedCache" value="true" /> -->

  <!-- <property name="expectedConcurrency" value="25" /> -->

 </bean>

 

 <!-- BDBCOOKIESTORAGE: disk-based cookie storage for FetchHTTP -->

 <bean id="cookieStorage" 

   class="org.archive.modules.fetcher.BdbCookieStorage">

  <!-- <property name="cookiesLoadFile"><null/></property> -->

  <!-- <property name="cookiesSaveFile"><null/></property> -->

  <!-- <property name="bdb">

        <ref bean="bdb"/>

       </property> -->

 </bean>

 

 <!-- SERVERCACHE: shared cache of server/host info -->

 <bean id="serverCache" 

   class="org.archive.modules.net.BdbServerCache">

  <!-- <property name="bdb">

        <ref bean="bdb"/>

       </property> -->

 </bean>



 <!-- CONFIG PATH CONFIGURER: required helper making crawl paths relative

      to crawler-beans.cxml file, and tracking crawl files for web UI -->

 <bean id="configPathConfigurer" 

   class="org.archive.spring.ConfigPathConfigurer">

 </bean>

 

</beans>

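At first glance, we can see that in the new version a crawl job is configured through bean properties, and that there are dependencies between the beans.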

The first part of the file evidently loads the job's seeds.
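The seed list in this job is not written inside the seeds bean itself. It is injected through the longerOverrides bean, whose key seeds.textSource.value is a dotted bean path: the seeds bean, its textSource property, and that object's value property. The fragment below is a minimal sketch of the same override carrying two seeds; www.example.org is a placeholder host and is not part of the original job.

```xml
<!-- Sketch: supplying several seeds through the override map.
     www.example.org is a hypothetical placeholder, not from the original job. -->
<bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
 <property name="properties">
  <props>
   <!-- dotted bean path: bean 'seeds' -> property 'textSource' -> property 'value' -->
   <prop key="seeds.textSource.value">
# URLS HERE, one per line
http://www.watiao.net
http://www.example.org/
   </prop>
  </props>
 </property>
</bean>
```

The seeds bean can then keep its "[see override above]" placeholder, since the override replaces that value when the container is built.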

The scope bean decides whether a CrawlURI is ACCEPTed, REJECTed, or left undecided (NONE). It is invoked from the process(CrawlURI uri) method of org.archive.modules.Processor.
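As the comment on the scope bean says, the rules run in order and the last decision other than NONE wins, so the position of a rule matters as much as its decision. The fragment below is a sketch of how the (initially empty) regex REJECT rule from the file above could be filled in. The image-file pattern is a hypothetical example, not something the original job uses.

```xml
<!-- Sketch: the MatchesListRegexDecideRule from the scope above, with one
     hypothetical pattern added. It sits after the SURT-prefix ACCEPT rule,
     so its REJECT overrides that earlier ACCEPT for matching URIs. -->
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
 <property name="decision" value="REJECT"/>
 <property name="regexList">
  <list>
   <!-- hypothetical: skip common image files, case-insensitively -->
   <value>(?i).*\.(jpg|jpeg|png|gif)$</value>
  </list>
 </property>
</bean>
```

Because PrerequisiteAcceptDecideRule comes later in the list, a URI that matches this pattern would still be accepted if it is marked as a prerequisite for another URI.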

Below that come the processor chains and the dependencies between the processors they reference.

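I printed the structure of the Heritrix 3.1.0 processor chains. The output below shows which processors each of the three chains (FetchChain, DispositionChain, and CandidateChain) depends on. (The CandidateChain is actually referenced by the CandidatesProcessor in the DispositionChain.)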

FetchChain:org.archive.modules.FetchChain


org.archive.crawler.prefetch.Preselector

org.archive.crawler.prefetch.PreconditionEnforcer

org.archive.modules.fetcher.FetchDNS

org.archive.modules.fetcher.FetchHTTP

org.archive.modules.extractor.ExtractorHTTP

org.archive.modules.extractor.ExtractorHTML

org.archive.modules.extractor.ExtractorCSS

org.archive.modules.extractor.ExtractorJS

org.archive.modules.extractor.ExtractorSWF



Frontier:org.archive.crawler.frontier.BdbFrontier



DispositionChain:org.archive.modules.DispositionChain


org.archive.modules.writer.MirrorWriterProcessor

org.archive.crawler.postprocessor.CandidatesProcessor

org.archive.crawler.postprocessor.DispositionProcessor



CandidateChain:org.archive.modules.CandidateChain


org.archive.crawler.prefetch.CandidateScoper

org.archive.crawler.prefetch.FrontierPreparer

This matches the crawler-beans.cxml file shown above.
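Because each chain refers to its processors by bean id, a processor can be swapped by changing only the class attribute of its bean. As an illustration, the sketch below declares the same warcWriter bean with the stock WARC writer instead of the MirrorWriterProcessor used in this job. It assumes the org.archive.modules.writer.WARCWriterProcessor class that ships with Heritrix 3; the commented-out properties are copied from the file above.

```xml
<!-- Sketch: same bean id, different writer class (assumed stock WARC writer).
     The dispositionProcessors list still says <ref bean="warcWriter"/>,
     so the chain definition itself does not change. -->
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
 <!-- <property name="compress" value="true" /> -->
 <!-- <property name="prefix" value="IAH" /> -->
 <!-- <property name="maxFileSizeBytes" value="1000000000" /> -->
</bean>
```

With a declaration like this, the printed DispositionChain would be expected to list the WARC writer class in place of MirrorWriterProcessor, while the other two entries stay the same.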

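The later part of crawler-beans.cxml mainly sets values for the processors declared earlier: the Sheet beans define custom mappings from particular crawl addresses to property values on the processor objects.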

The bean named sheetOverlaysManager, further down, manages loading the Sheets above and applying them to the processors' property values.
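A Sheet has no effect until an association ties it to some URIs. The fragment below is a sketch based on the commented-out SurtPrefixesSheetAssociation example in the file above. It applies the veryPolite sheet to this job's seed host, and the SURT prefix is my own rewriting of www.watiao.net in the same form as the file's example.com sample.

```xml
<!-- Sketch: apply the 'veryPolite' sheet to URIs under www.watiao.net.
     The SURT prefix is an assumption derived from the seed URL. -->
<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
 <property name='surtPrefixes'>
  <list>
   <value>http://(net,watiao,www,)/</value>
  </list>
 </property>
 <property name='targetSheetNames'>
  <list>
   <value>veryPolite</value>
  </list>
 </property>
</bean>
```

For URIs under that prefix, the disposition bean would then see the sheet's values, for example delayFactor 10 instead of the default 5.0, while all other URIs keep the defaults.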

The rest is fairly easy to follow; the English comments in crawler-beans.cxml already explain it clearly.

---------------------------------------------------------------------------

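This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯.

Article link: http://www.cnblogs.com/chenying99/archive/2013/04/10/3013284.html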
