peigang

nutch-default.xml Configuration Example

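Nutch has a large number of configuration properties, and they need to be set carefully to match your actual requirements. Below is a configuration file that has been verified in a production environment: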

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- Do not modify this file directly.  Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there.  If nutch-site.xml does not already exist, create it.      -->
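<!--
  Illustrative sketch only (not part of the original file): a minimal
  nutch-site.xml that overrides a single entry, as the note above
  recommends. The value is copied from this file; substitute your own.

  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>jdodrc</value>
    </property>
  </configuration>
-->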

<configuration>

<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
  
<property>
  <name>file.crawl.parent</name>
  <value>true</value>
  <description>If true, the crawler is not restricted to the directories listed in the
    URLs file and also ascends into their parent directories. For your own crawls you can
    change this behavior (set to false) so that only directories beneath the ones you specify get
    crawled.</description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NOT IMPLEMENTED YET !!
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>jdodrc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
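
<!--
  Illustrative sketch (not part of the original file): following the
  description above, the value of http.agent.name set in this file
  (jdodrc) would come first and the default * would stay last:

  <property>
    <name>http.robots.agents</name>
    <value>jdodrc,*</value>
  </property>
-->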

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot; this text is used in
  the User-Agent header.  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parentheses after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.4</value>
  <description>A version string to advertise in the User-Agent 
   header.</description>
</property>

<property>
  <name>http.agent.host</name>
  <value></value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

<property>
  <name>http.proxy.host</name>
  <value></value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value></value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.proxy.username</name>
  <value></value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value></value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.auth.file</name>
  <value>httpclient-auth.xml</value>
  <description>Authentication configuration file for
  'protocol-httpclient' plugin.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>http.useHttp11</name>
  <value>false</value>
  <description>NOTE: at the moment this works only for protocol-httpclient.
  If true, use HTTP 1.1; if false, use HTTP 1.0.
  </description>
</property>

<property>
  <name>http.accept.language</name>
  <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting a non-English language as the default one to retrieve.
  It is a useful setting for search engines built for a particular national group.
  </description>
</property>

<!-- FTP properties -->

<property>
  <name>ftp.username</name>
  <value>anonymous</value>
  <description>ftp login username.</description>
</property>

<property>
  <name>ftp.password</name>
  <value>[email protected]</value>
  <description>ftp login password.</description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>-1</value> 
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never define partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>60000</value>
  <description>Default timeout for ftp client socket, in millisec.
  Please also see ftp.keep.connection below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>100000</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 120000 millisec for many ftp servers out there.
  Better be conservative here. Together with ftp.timeout, it is used to
  decide if we need to delete (annihilate) current ftp.client instance and
  force to start another ftp.client instance anew. This is necessary because
  a fetcher thread may not be able to obtain next request from queue in time
  (due to idleness) before our ftp client times out or remote server
  disconnects. Used only when ftp.keep.connection is true (please see below).
  </description>
</property>

<property>
  <name>ftp.keep.connection</name>
  <value>false</value>
  <description>Whether to keep ftp connection. Useful if crawling same host
  again and again. When set to true, it avoids connection, login and dir list
  parser setup for subsequent urls. If it is set to true, however, you must
  make sure (roughly):
  (1) ftp.timeout is less than ftp.server.timeout
  (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
  Otherwise there will be too many "delete client because idled too long"
  messages in thread logs.</description>
</property>
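
<!--
  Worked check (illustrative, using the values set elsewhere in this file):
    ftp.timeout        = 60000 ms
    ftp.server.timeout = 100000 ms
    fetcher.threads.fetch * fetcher.server.delay = 10 * 5.0 s = 50000 ms
  so 50000 < 60000 < 100000 and both conditions above hold. Under that
  assumption, keeping connections open could be sketched as:

  <property>
    <name>ftp.keep.connection</name>
    <value>true</value>
  </property>
-->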

<property>
  <name>ftp.follow.talk</name>
  <value>false</value>
  <description>Whether to log dialogue between our client and remote
  server. Useful for debugging.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what its status is.
  </description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
  <description>If a page is unmodified, its fetchInterval will be
  increased by this rate. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
  <description>If a page is modified, its fetchInterval will be
  decreased by this rate. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>60.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>31536000.0</value>
  <description>Maximum fetchInterval, in seconds (365 days).
  NOTE: this is limited by db.fetch.interval.max. Pages with
  fetchInterval larger than db.fetch.interval.max
  will be fetched anyway.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.sync_delta</name>
  <value>true</value>
  <description>If true, try to synchronize with the time of page change
  by shifting the next fetchTime by a fraction (sync_delta_rate) of the difference
  between the last modification time and the last fetch time.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
  <value>0.3</value>
  <description>See sync_delta for description. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
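
<!--
  Illustrative sketch (not part of the original file): the adaptive.*
  settings above presumably only take effect when an adaptive schedule
  class is selected in place of DefaultFetchSchedule; the class name
  AdaptiveFetchSchedule is the one assumed here:

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

  Rough arithmetic, assuming each rate is applied multiplicatively to the
  current interval: starting from 2592000 s (30 days), an unmodified page
  moves to 2592000 * (1 + 0.4) = 3628800 s (42 days), and a modified page
  moves to 2592000 * (1 - 0.2) = 2073600 s (24 days).
-->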

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

<property>
  <name>db.update.purge.404</name>
  <value>false</value>
  <description>If true, updatedb will purge records with status DB_GONE
  from the CrawlDb.
  </description>
</property>

<property>
  <name>db.update.max.inlinks</name>
  <value>10000</value>
  <description>Maximum number of inlinks to take into account when updating 
  a URL score in the crawlDB. Only the best scoring inlinks are kept. 
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
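
<!--
  Illustrative sketch (not part of the original file): to keep a crawl
  confined to the initially injected hosts, as the description above
  suggests, the override would look like this:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
-->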

<property>
  <name>db.score.injected</name>
  <value>1.0</value>
  <description>The score of new pages added by the injector.
  </description>
</property>

<property>
  <name>db.score.link.external</name>
  <value>1.0</value>
  <description>The score factor for new pages added due to a link from
  another host relative to the referencing page's score. Scoring plugins
  may use this value to affect initial scores of external links.
  </description>
</property>

<property>
  <name>db.score.link.internal</name>
  <value>1.0</value>
  <description>The score factor for pages added due to a link from the
  same host, relative to the referencing page's score. Scoring plugins
  may use this value to affect initial scores of internal links.
  </description>
</property>

<property>
  <name>db.score.count.filtered</name>
  <value>false</value>
  <description>The score value passed to newly discovered pages is
  calculated as a fraction of the original page score divided by the
  number of outlinks. If this option is false, only the outlinks that passed
  URLFilters will count, if it's true then all outlinks will count.
  </description>
</property>

<property>
  <name>db.max.inlinks</name>
  <value>10000</value>
  <description>Maximum number of Inlinks per URL to be kept in LinkDb.
  If "invertlinks" finds more inlinks than this number, only the first
  N inlinks will be stored, and the rest will be discarded.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>100</value>
  <description>The maximum number of characters permitted in an anchor.
  </description>
</property>

 <property>
  <name>db.parsemeta.to.crawldb</name>
  <value></value>
  <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).
   Assuming for instance that the languageidentifier plugin is enabled, setting the value to 'lang' 
   will copy both the key 'lang' and its value to the corresponding entry in the crawldb.
  </description>
</property>

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.MD5Signature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>

<property>
  <name>db.signature.text_profile.min_token_len</name>
  <value>2</value>
  <description>Minimum token length to be included in the signature.
  </description>
</property>

<property>
  <name>db.signature.text_profile.quant_rate</name>
  <value>0.01</value>
  <description>Profile frequencies will be rounded down to a multiple of
  QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is a maximum token
  frequency. If maxFreq > 1 then QUANT will be at least 2, which means that
  for longer texts tokens with frequency 1 will always be discarded.
  </description>
</property>
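
<!--
  Illustrative sketch (not part of the original file): the text_profile.*
  settings presumably only matter when a text-profile signature replaces
  MD5Signature; the class name TextProfileSignature is the one assumed here:

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>

  Worked example of the formula above with quant_rate = 0.01: for
  maxFreq = 500, QUANT = (int)(0.01 * 500) = 5, so a token frequency of 12
  is rounded down to 10 and a frequency of 4 is rounded down to 0.
-->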

<!-- generate properties -->

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generate.count.mode.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generate.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count 
  per IP in the new version of the Generator.
  </description>
</property>

<property>
  <name>generate.update.crawldb</name>
  <value>false</value>
  <description>For highly-concurrent environments, where several
  generate/fetch/update cycles may overlap, setting this to true ensures
  that generate will create different fetchlists even without intervening
  updatedb-s, at the cost of running an additional job to update CrawlDB.
  If false, running generate twice without intervening
  updatedb will generate identical fetchlists.</description>
</property>

<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
  <description>(Deprecated). Use generate.max.count and generate.count.mode instead.
  The maximum number of urls per host in a single
  fetchlist.  -1 if unlimited.</description>
</property>

<!-- urlpartitioner properties -->
<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>Determines how to partition URLs. Default value is 'byHost', 
  also takes 'byDomain' or 'byIP'. 
  </description>
</property>

<property>
  <name>crawl.gen.delay</name>
  <value>604800000</value>
  <description>
   This value, expressed in days, defines how long we should keep the lock on records 
   in CrawlDb that were just selected for fetching. If these records are not updated 
   in the meantime, the lock is canceled, i.e. they become eligible for selecting. 
   Default value of this is 7 days.
  </description>
</property>

<!-- fetcher properties -->

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between 
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>10</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property> 

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a queue at one time. Replaces 
    deprecated parameter 'fetcher.threads.per.host'.
   </description>
</property>

<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
  <description>Determines how to put URLs into queues. Default value is 'byHost', 
  also takes 'byDomain' or 'byIP'. Replaces the deprecated parameter 
  'fetcher.threads.per.host.by.ip'.
  </description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>

<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped 
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>

<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>-1</value>
  <description>The maximum number of protocol-level exceptions (e.g. timeouts) per
  host (or IP) queue. Once this value is reached, any remaining entries from this
  queue are purged, effectively stopping the fetching from this host/IP. The default
  value of -1 deactivates this limit.
  </description>
</property>

<property>
  <name>fetcher.threads.timeout.divisor</name>
  <value>2</value>
  <description>The thread time-out divisor to use. By default threads have a time-out
  value of mapred.task.timeout / 2. Increase this setting if the fetcher waits too
  long before killing hung threads.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>-1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads fewer
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.retries</name>
  <value>5</value>
  <description>The number of times the fetcher is allowed to fall below fetcher.throughput.threshold.pages.
  This setting prevents accidental slowdowns from immediately killing the fetcher thread.
  </description>
</property>

<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>100</value>
  <description>(EXPERT) The fetcher buffers the incoming URLs into queues based on the [host|domain|IP]
  (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter.
  A large value requires more memory but can improve the performance of the fetch when the order of the URLs in the fetch list
  is not optimal.
  </description>
</property>	

<!-- moreindexingfilter plugin properties -->

<property>
  <name>moreIndexingFilter.indexMimeTypeParts</name>
  <value>true</value>
  <description>Determines whether the index-more plugin will split the mime-type
  in sub parts, this requires the type field to be multi valued. Set to true for backward
  compatibility. False will not split the mime-type.
  </description>
</property>

<!-- AnchorIndexing filter plugin properties -->

<property>
  <name>anchorIndexingFilter.deduplicate</name>
  <value>false</value>
  <description>With this enabled, the indexer will deduplicate anchors case-insensitively
  before indexing. This prevents possibly hundreds or thousands of identical anchors for
  a given page from being indexed, but will affect the search scoring (i.e. tf=1.0f).
  </description>
</property>

<!-- indexingfilter plugin properties -->

<property>
  <name>indexingfilter.order</name>
  <value></value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin.includes and plugin.excludes) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  
  Filter ordering might have impact on result if one filter depends on output of
  another filter.
  </description>
</property>

<property>
  <name>indexer.score.power</name>
  <value>0.5</value>
  <description>Determines the power of link analysis scores.  Each
  page's boost is set to <i>score<sup>scorePower</sup></i> where
  <i>score</i> is its link analysis score and <i>scorePower</i> is the
  value of this parameter.  This is compiled into indexes, so, when
  this is changed, pages must be re-indexed for it to take
  effect.</description>
</property>
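
<!--
  Worked example (illustrative) with the value 0.5 set above: a page with
  link analysis score 4.0 gets boost 4.0^0.5 = 2.0, and a page with score
  0.25 gets boost 0.25^0.5 = 0.5, so the square root compresses the spread
  between high and low scores.
-->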

<property>
  <name>indexer.max.title.length</name>
  <value>200</value>
  <description>The maximum number of characters of a title that are indexed.
  </description>
</property>

<property>
  <name>indexer.max.content.length</name>
  <value>-1</value>
  <description>The maximum number of characters of a content that are indexed.
  Content beyond the limit is truncated. A value of -1 disables this check.
  </description>
</property>

<!-- URL normalizer properties -->

<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  <description>Order in which normalizers will run. If any of these isn't
  activated it will be silently skipped. If other normalizers not on the
  list are activated, they will run in random order after the ones
  specified here are run.
  </description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.
  </description>
</property>

<property>
  <name>urlnormalizer.loop.count</name>
  <value>1</value>
  <description>Optionally loop through normalizers several times, to make
  sure that all transformations have been performed.
  </description>
</property>

<!-- mime properties -->

<!--
<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default Tika config 
  if specified.
  </description>
</property>
-->

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic resolution.
  </description>
</property>

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>./plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines whether plugins that are not activated by
  the plugin.includes and plugin.excludes properties must be automatically
  activated if they are needed by some activated plugins.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>ys-parse-news|ys-parse-pdf|ys-index-filter|protocol-http|urlfilter-regex|parse-(html|tika)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>
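
<!--
  Illustrative sketch (not part of the original file): per the description
  above, HTTPS crawling would swap protocol-http for protocol-httpclient.
  The remaining plugin names are copied unchanged from this file:

  <property>
    <name>plugin.includes</name>
    <value>ys-parse-news|ys-parse-pdf|ys-index-filter|protocol-httpclient|urlfilter-regex|parse-(html|tika)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
-->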

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.  
  </description>
</property>

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a page's outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21. 
  </description>
</property>

<!-- parser properties -->

<property>
  <name>parse.plugin.file</name>
  <value>parse-plugins.xml</value>
  <description>The name of the file that defines the associations between
  content-types and parsers.</description>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available.</description>
</property>

<property>
  <name>encodingdetector.charset.min.confidence</name>
  <value>-1</value>
  <description>An integer between 0 and 100 indicating minimum confidence value
  for charset auto-detection. Any negative value disables auto-detection.
  </description>
</property>

<property>
  <name>parser.caching.forbidden.policy</name>
  <value>content</value>
  <description>If a site (or a page) requests through its robot metatags
  that it should not be shown as cached content, apply this policy. Currently
  three keywords are recognized: "none" ignores any "noarchive" directives.
  "content" doesn't show the content, but shows summaries (snippets).
  "all" doesn't show either content or summaries.</description>
</property>

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

<property>
  <name>parser.html.form.use_action</name>
  <value>false</value>
  <description>If true, HTML parser will collect URLs from form action
  attributes. This may lead to undesirable behavior (submitting empty
  forms during next fetch cycle). If false, form action attribute will
  be ignored.</description>
</property>

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value></value>
  <description>Comma-separated list of HTML tags from which outlinks
  should not be extracted. Nutch takes links from: a, area, form, frame,
  iframe, script, link, img. Links from any of those tags listed here
  are not taken. The default is an empty list. A reasonable value
  for most people is probably "img,script,link".</description>
</property>

<property>
  <name>parser.fix.embeddedparams</name>
  <value>true</value>
  <description>Whether to fix URL embedded params using semi-colons.
  See NUTCH-436 and NUTCH-1115</description>
</property>

<property>
  <name>htmlparsefilter.order</name>
  <value></value>
  <description>The order in which HTMLParse filters are applied.
  If empty, all available HTMLParse filters (as dictated by the properties
  plugin.includes and plugin.excludes above) are loaded and applied in
  system-defined order. If not empty, only the named filters are loaded and
  applied in the given order.
  HTMLParse filter ordering MAY have an impact on the end result, as some
  filters could rely on the metadata generated by a previous filter.
  </description>
</property>

<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Timeout in seconds for parsing a document. When the timeout is
  exceeded, the parse is treated as an exception and Nutch moves on to the
  following documents. This parameter applies to any Parser implementation.
  Set to -1 to deactivate, bearing in mind that this could cause
  the parsing to crash because of a very long or corrupted document.
  </description>
</property>

<!-- urlfilter plugin properties -->

<property>
  <name>urlfilter.domain.file</name>
  <value>domain-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing either top level domains or
  hostnames used by urlfilter-domain (DomainURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.automaton.file</name>
  <value>automaton-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-automaton (AutomatonURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.suffix.file</name>
  <value>suffix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url suffixes
  used by urlfilter-suffix (SuffixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order in which URL filters are applied.
  If empty, all available URL filters (as dictated by the properties
  plugin.includes and plugin.excludes above) are loaded and applied in
  system-defined order. If not empty, only the named filters are loaded and
  applied in the given order. For example, if this property has the value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not affect the
  end result, but it may affect performance, depending on the
  relative cost of the filters.
  </description>
</property>

<!-- scoring filters properties -->

<property>
  <name>scoring.filter.order</name>
  <value></value>
  <description>The order in which scoring filters are applied.
  This may be left empty (in which case all available scoring
  filters are applied in the order defined by plugin.includes
  and plugin.excludes), or set to a space-separated list of
  implementation classes.
  </description>
</property>

<!-- language-identifier plugin properties -->

<property>
  <name>lang.analyze.max.length</name>
  <value>0</value>
  <description>The maximum number of bytes of data to use to identify
  the language (0 means full content analysis).
  The larger this value, the better the analysis, but the
  slower it is.
  </description>
</property>

<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which
  detect and identify are written determines the extraction
  policy. The default (detect,identify) means the plugin will
  first try to extract language info from page headers and metadata;
  if this is not successful, it will try Tika language
  identification. Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>

<property>
  <name>lang.identification.only.certain</name>
  <value>false</value>
  <description>If set to true with lang.extraction.policy containing identify,
  the language code returned by Tika will be assigned to the document ONLY
  if it is deemed certain by Tika.
  </description>
</property>

<!-- index-static plugin properties -->

<property>
  <name>index-static</name>
  <value></value>
  <description>
  A simple plugin, called at indexing time, that adds fields with static data.
  You can specify a list of fieldname:fieldcontent per Nutch job.
  It can be useful when collections cannot be created by URL patterns,
  as in subcollection, but only on a per-job basis.
  </description>
</property>

<!-- Temporary Hadoop 0.17.x workaround. -->

<property>
  <name>hadoop.job.history.user.location</name>
  <value>${hadoop.log.dir}/history/user</value>
  <description>Hadoop 0.17.x comes with a default setting to create
     user logs inside the output path of the job. This breaks some
     Hadoop classes, which expect the output to contain only
     part-XXXXX files. This setting changes the output to a
     subdirectory of the regular log directory.
  </description>
</property>

<!-- solr index properties -->

<property>
  <name>solr.mapping.file</name>
  <value>solrindex-mapping.xml</value>
  <description>
  Defines the name of the file that will be used in the mapping of internal
  Nutch field names to Solr index fields as specified in the target Solr schema.
  </description>
</property>

<property> 
  <name>solr.commit.size</name>
  <value>1000</value>
  <description>
  Defines the number of documents to send to Solr in a single update batch.
  Decrease when handling very large documents to prevent Nutch from running
  out of memory.
  </description>  
</property> 

<property>
  <name>solr.auth</name>
  <value>false</value>
  <description>
  Whether to enable HTTP basic authentication for communicating with Solr.
  Use the solr.auth.username and solr.auth.password properties to configure
  your credentials.
  </description>
</property>
</configuration>
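
The header comment of this file says not to edit it directly, but to copy the entries you want to change into nutch-site.xml. A minimal sketch of such an override file might look like the following. It assumes you want to skip outlinks from img, script and link tags (the value suggested in the parser.html.outlinks.ignore_tags description), pin the URL filter order from the urlfilter.order example, and turn on Solr basic authentication. The user name and password are placeholders, not values from the original configuration.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: illustrative overrides only; adjust to your own crawl. -->
<configuration>

  <!-- Do not extract outlinks from these tags. -->
  <property>
    <name>parser.html.outlinks.ignore_tags</name>
    <value>img,script,link</value>
  </property>

  <!-- Apply the regex filter first, then the prefix filter.
       Assumes both plugins are enabled through plugin.includes. -->
  <property>
    <name>urlfilter.order</name>
    <value>org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter</value>
  </property>

  <!-- Enable HTTP basic authentication for Solr. -->
  <property>
    <name>solr.auth</name>
    <value>true</value>
  </property>
  <property>
    <name>solr.auth.username</name>
    <value>SOLR_USER_PLACEHOLDER</value>
  </property>
  <property>
    <name>solr.auth.password</name>
    <value>SOLR_PASSWORD_PLACEHOLDER</value>
  </property>

</configuration>
```

Keeping overrides in nutch-site.xml leaves the defaults above untouched, so only the properties you have deliberately changed need to be reviewed when upgrading.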
