Merging Log Statistics Across Multiple Servers
Rotating Apache logs with cronolog and merging them for Webalizer

Author: Che Dong (车东) Email: chedongATbigfoot.com/chedongATchedong.com

Written: 2002/07. Last updated: 11/29/2006 17:05:24
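Copyright notice: this article may be freely reproduced, provided that the original source, the author information and this notice are indicated with a hyperlink
http://www.chedong.com/tech/rotate_merge_log.html

Keywords: webalizer apache log analysis sort merge cronolog

Summary: there is no need to read patiently through everything below, because the conclusions come down to two points:
1 Use cronolog to rotate Apache logs by day, cleanly and safely
2 Use sort -m to merge and sort multiple logs

Following my own experience, this article:
1 first describes how to merge Apache logs;
2 then uses the problems this raises to explain why log rotation is needed and how to solve it, showing how to rotate Apache logs with cronolog.
Along the way it covers tips for the tools involved in designing the merge process, as well as some attempts that failed.
I am sure this is not the only route to a solution, and the approach below is certainly not the simplest or the cheapest. I hope to exchange more ideas with readers.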

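Why merged log statistics are needed for multiple servers

More and more large web services use DNS round-robin for load balancing: several servers with the same role act as front-end web servers. This makes capacity planning and scaling much easier, but spreading the service over several servers also complicates log analysis. If a log analysis tool such as webalizer is run separately on each machine:
1 Aggregating the data becomes tedious. For example, the total number of visits has to be obtained by adding up the figures for the given month from SERVER1, SERVER2, ...
2 Metrics such as unique visits and unique sites are badly distorted, because these metrics are not the arithmetic sum of the per-machine figures.

The benefits of unified statistics are obvious, but how can the statistics of all machines be combined into one result?
The first idea might be: can the servers write their logs to the same remote file? Logging over a remote file system is not considered here, because the trouble it causes far outweighs the convenience gained.
So the logs of the servers still have to be: recorded separately => synchronized to a back-end machine at regular intervals => merged => analyzed with a log analysis tool.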

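First, why the logs must be merged: webalizer cannot combine several logs of the same day.
Running, one after another,
webalizer log1
webalizer log2
webalizer log3
leaves only the results of log3.

Can log1, log2 and log3 simply be concatenated?
No. A log analysis tool does not read the whole log at once and then analyze it; it reads the log as a stream and saves intermediate results at certain time intervals. If the timestamps jump too far (for example, two entries more than 5 minutes apart), the algorithms of some tools "forget" the earlier results. So analyzing the plain concatenation of log1, log2 and log3 still yields only the statistics of log3.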

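The multi-server merge problem: sort the records of several logs by time and merge them into one file

Typically the time fields of several log files look like this:
log1      log2      log3
00:15:00  00:14:00  00:11:00
00:16:00  00:15:00  00:12:00
00:17:00  00:18:00  00:13:00
00:18:00  00:19:00  00:14:00
14:18:00  11:19:00  10:14:00
15:18:00  17:19:00  11:14:00
23:18:00  23:19:00  23:14:00

The merge must interleave the logs by time. The merged log should be:
00:11:00 from log3
00:12:00 from log3
00:13:00 from log3
00:14:00 from log2
00:14:00 from log3
00:15:00 from log1
00:15:00 from log2
00:16:00 from log1
....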

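How can several log files be merged?
Take the standard CLF log format (Apache) as an example.
The Apache log format is:
%h %l %u %t \"%r\" %>s %b
A concrete example:
111.222.111.222 - - [03/Apr/2002:10:30:17 +0800] "GET /index.html HTTP/1.1" 200 419

The simplest idea is to read all the logs and sort them by the time field:
cat log1 log2 log3 | sort -k 4 -t " " -o log_all
Notes:
-t " ": the field separator is a space
-k 4: sort starting at the 4th field, i.e. the timestamp [03/Apr/2002:10:30:17 +0800]
-o log_all: write the output to the file log_all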

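But this is inefficient. A service that already needs load balancing often has more than ten million log entries per machine, in files of several hundred MB. Sorting several such files at the same time puts a heavy load on the machine.
There is a way to optimize this: each individual log is already sorted by time, and sort provides an optimized algorithm for merging such files, the -m (merge) option.
So a better way to merge three logs of this format, log1, log2 and log3, into log_all is:
sort -m -t " " -k 4 -o log_all log1 log2 log3
Notes:
-m: use the merge algorithm, which assumes that each input file is already sorted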
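
The merge step can be tried on small sample files. The following is only a minimal sketch, assuming every input is already in time order and all entries fall on the same day. The function name merge_logs is hypothetical, and LC_ALL=C is an addition (not in the command above) so that the comparison is bytewise regardless of locale.

```shell
#!/bin/sh
# merge_logs OUTPUT INPUT... : merge already-sorted CLF logs on field 4 (timestamp).
# Assumption: each input is in time order and covers the same single day.
merge_logs() {
    out=$1
    shift
    # -m merges pre-sorted inputs instead of re-sorting everything.
    LC_ALL=C sort -m -t ' ' -k 4 -o "$out" "$@"
}
# Example (file names as in the text): merge_logs log_all log1 log2 log3
```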

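Note: it is best to compress the merged log before handing it to webalizer.
Some systems can handle files larger than 2 GB and some cannot; the same goes for programs. Avoid files larger than 2 GB unless you have confirmed that every program involved and the operating system can handle them. So if the output is larger than 2 GB, gzip it before passing it to webalizer: file system errors are more likely when analyzing files over 2 GB, and gzip also greatly reduces I/O during the analysis.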
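
One way to avoid ever creating the large uncompressed file is to pipe the merge straight into gzip. This is a sketch under the same assumptions as the sort -m command above (inputs pre-sorted, same day); the function name merge_logs_gz and the output file name are hypothetical.

```shell
#!/bin/sh
# merge_logs_gz OUTPUT.gz INPUT... : merge sorted logs and compress in one pass,
# so the uncompressed merged file never has to exist on disk.
merge_logs_gz() {
    out=$1
    shift
    LC_ALL=C sort -m -t ' ' -k 4 "$@" | gzip -c > "$out"
}
# Example: merge_logs_gz log_all.gz log1 log2 log3
# The result can then be passed to webalizer, which reads '.gz' files directly
# according to its sample configuration file.
```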

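That is how logs are merged in time order.

Log rotation

Now consider the data source. webalizer is essentially a monthly statistics tool that supports incremental processing, so for a large service the Apache logs can be merged once a day and fed to webalizer. How, then, can the web log be cut by day (for example at 00:00:00 every night)?
Suppose crontab is used to back up the log as access_log_yesterday at exactly 0:00 every day:
mv /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday
Then Apache has to be restarted immediately afterwards; otherwise Apache still holds the file handle of the moved file and does not write to a new access_log. Archiving this way means restarting Apache every midnight, which affects the service.
A simpler method that does not interrupt the service is to copy first and then truncate:
cp /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday
echo >/path/to/apache/log/access_log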

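A careful analyst will notice a problem with this:
cp cannot guarantee a cut at exactly 0:00. If the copy takes 6 seconds, the resulting access_log_yesterday also contains the entries logged up to 00:00:06. For a single log, a few hundred extra lines a day do no harm. But when several logs are merged, the day on which the month changes causes a sorting problem:
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]

The [01/Apr/2002:00:00:00 field cannot be sorted across days. The date is written as dd/Mon/yyyy with an English month name, so sorting it as text is likely to give the following result, in which the sort itself has put the entries in the wrong order:
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]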

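To an analysis tool such as webalizer, these out-of-order entries at the day boundary are like swallowing a bug: it may lose all the data for the previous month! Such data is therefore a serious risk when processing the last day of a month.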
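
Before feeding a daily log to an analysis tool, it can help to check whether it really contains a single calendar date. The following is a small sketch for that check, assuming CLF lines whose 4th space-separated field looks like [31/Mar/2002:23:59:59; the function name log_dates is hypothetical.

```shell
#!/bin/sh
# log_dates FILE... : print each calendar date found in field 4 and its line count.
# More than one output line means the log crosses a day boundary.
log_dates() {
    awk '{
        # "[31/Mar/2002:23:59:59" -> t[1] = "31/Mar/2002"
        split(substr($4, 2), t, ":")
        count[t[1]]++
    }
    END { for (d in count) print d, count[d] }' "$@" | LC_ALL=C sort
}
# Example: log_dates access_log_yesterday
```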

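There are several ways to approach the problem:
1 Post-processing:
One post-processing method is to use grep on the first day of each month to remove the entries that crossed into the new month, for example:
grep -v "01/Apr" access_log_04_01 > access_log_new

That is, after sorting, remove all entries that crossed the day boundary. Post-processing the logs may be one way. sort does have a special option for dates, -M (note: capital M), which sorts the given field by English month name rather than alphabetically, but for Apache logs it is awkward to isolate the month field with sort. (I tried using "/" as the separator and sorting on the two fields "month" and "year:time".) Some Perl scripting could certainly do it, but in the end I gave up. It goes against a system administrator's design principle: generality. And you should keep asking yourself: is there a simpler way?
Another option is to change the log format to use a TIMESTAMP (Squid logs, for example, do not have this problem because they use a timestamp as the time field), but there is no guarantee that every log tool will recognize a special format in the date field.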
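
For readers who still want to try the post-processing route, the sort key can be rebuilt outside of sort. The following is only a sketch of that idea using awk rather than Perl, not the method this article settles on. It assumes all servers log in the same time zone (the offset is ignored) and that sort supports the -s (stable) option, as GNU and BSD sort do; the function name clf_sort_by_time is hypothetical.

```shell
#!/bin/sh
# clf_sort_by_time [FILE...] : sort CLF lines chronologically across days and months.
# A numeric key YYYYMMDDhhmmss is prepended, the lines are sorted on it, and the
# key is stripped again. Assumption: all entries use the same time zone.
clf_sort_by_time() {
    awk '
        BEGIN {
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (i = 1; i <= 12; i++) month[names[i]] = sprintf("%02d", i)
        }
        {
            # $4 looks like "[31/Mar/2002:23:59:59"
            split(substr($4, 2), t, "[/:]")
            print t[3] month[t[2]] t[1] t[4] t[5] t[6] "\t" $0
        }
    ' "$@" | LC_ALL=C sort -s -k 1,1 | cut -f 2-
}
# Example: clf_sort_by_time log1 log2 log3 > log_all
```

Unlike sort -m, this re-sorts everything, so it gives up the efficiency gained from the inputs already being in order.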

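2 Optimizing the data source:
The best approach is still to fix the data source: make sure the logs are rotated by day and that every entry in a day's log falls within that day. Then, whatever tool (commercial or free) is used to analyze the logs later, no complicated preprocessing is required.

The first idea might be to control when the log is cut, for example starting exactly at 0:00. But whether the cut starts a minute before or a minute after midnight makes no difference: you still cannot prevent one log from containing entries from two days, nor can you predict how long the archiving takes.
So a log rotation tool has to be considered carefully. It must meet these requirements:
1 No interruption of the web service: no stop Apache => move log => restart Apache
2 Logs rotated by day: one log per day covering 00:00:00-23:59:59
3 Not affected by Apache restarts: creating a new log on every Apache restart is not acceptable
4 Simple to install and configure

I first considered rotatelogs, a rotation tool shipped in the apache/bin directory. It essentially rotates logs by time interval or by size, and cannot control when the cut happens or how the logs are archived by day.
Then I considered logrotate, a tool dedicated to rotating all kinds of system logs (syslogd, mail). Its configuration is rather complex, so I gave up on it; in fact it also cuts and archives logs by sending the corresponding service process a -HUP restart signal.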

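The Apache FAQ recommends cronolog, a tool that has become fairly mature after nearly two years of development. Installation is simple: configure => make => make install

One of its configuration examples shows how well it suits daily rotation. A small change to httpd.conf is all it takes:
TransferLog "|/usr/sbin/cronolog /web/logs/%Y/%m/%d/access.log"
ErrorLog "|/usr/sbin/cronolog /web/logs/%Y/%m/%d/errors.log"

The logs are then written to
/web/logs/2002/12/31/access.log
/web/logs/2002/12/31/errors.log
After midnight they are written to
/web/logs/2003/01/01/access.log
/web/logs/2003/01/01/errors.log
and the directories 2003, 2003/01 and 2003/01/01 are created automatically if they do not exist.

So, as long as you do not do things like adjusting the system clock at midnight, the logs are stored strictly by day (00:00:00-23:59:59). In the later analysis, sorting on the [31/Mar/2002:15:44:59 field then no longer depends on the date, only on the time.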

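Test: given the disk capacity, I decided to rotate the logs by day of the week.
Added to the Apache configuration:
#%w weekday
TransferLog "|/usr/sbin/cronolog /path/to/apache/logs/%w/access_log"

After restarting Apache, the original CustomLog /path/to/apache/logs/access_log kept growing, and a new directory 3/ was created under the log directory (the test was on a Wednesday). After a while I suddenly noticed that the two logs were growing at different rates!
Running tail on both logs showed why:
My CustomLog used the combined format (which includes the extended fields), while TransferLog used the default log format. The Apache manual explains that TransferLog uses the format set by the most recent LogFormat directive that does not define a nickname. My httpd.conf contained:
LogFormat ..... combined
LogFormat ... common
...
CustomLog ... combined
TransferLog ...

Both LogFormat lines define a nickname, so the TransferLog log fell back to the default format. According to the manual, to make TransferLog use a specific format you need:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
TransferLog "|/usr/local/sbin/cronolog /path/to/apache/logs/%w/access_log"

After a restart, OK: the two logs had the same format.
This setup actually records two logs under the logs directory, access_log and %w/access_log. Can only the log under %w/ be recorded?
The Apache manual offers a simpler way: pipe CustomLog straight into cronolog, which also allows the format to be specified.
CustomLog "|/usr/local/sbin/cronolog /path/to/apache/logs/%w/access_log" combined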

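The last issue is log synchronization.

Task: early every morning, find the previous day's log and save it as a separate file, ready to be sent to the log server.
For example, to keep the last week of logs: every day, copy the previous day's log to a given directory and wait for the log server to fetch it:
/bin/cp -f /path/to/apache/logs/`date -v-1d +%w`/access_log /path/for/backup/logs/access_log_yesterday

On FreeBSD, use the following command
date -v-1d +%w
Notes:
-v-1d: the previous day; on GNU/Linux the equivalent is date -d yesterday
+%w: weekday. Since all the tools use the standard time library, they all define the weekday the same way: 0-6 => Sunday-Saturday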
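
If the same script has to run on both kinds of systems, the two date syntaxes can be wrapped in one helper. This is a sketch that assumes a date command accepting --version is GNU date and that anything else accepts the BSD -v option; the function name yesterday_weekday is hypothetical.

```shell
#!/bin/sh
# Print the weekday number of yesterday (0 = Sunday ... 6 = Saturday).
# Assumption: a date that accepts --version is GNU date; otherwise BSD syntax is used.
yesterday_weekday() {
    if date --version >/dev/null 2>&1; then
        date -d yesterday +%w
    else
        date -v-1d +%w
    fi
}
# Example: cp -f "/path/to/apache/logs/$(yesterday_weekday)/access_log" /path/for/backup/logs/access_log_yesterday
```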

Notes:
In a crontab, "%" must be escaped with a preceding "\". The first entry below archives the log at 0:05 every day.
Also, the -exec clause of find must end with rm -f {} \; with a space before the "\;", not rm -f {}\;
5 0 * * * /bin/cp /path/to/logs/`date -v-1d +\%w`/access_log /path/to/for_sync/logs/access_yesterday
37 10 * * * /usr/bin/find /home/apache/logs/ -name access_log -mtime +1 -exec /bin/rm -f {} \;

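cronolog logging first started on a Wednesday, so a week later the log rotates back to 3/access_log.
But is the log appended to 3/access_log, or is a new file created? >>access_log or >access_log?
In my test, the log was appended:
[01/Apr/2002:23:59:59 +0800]
[01/Apr/2002:23:59:59 +0800]
[08/Apr/2002:00:00:00 +0800]
[08/Apr/2002:00:00:00 +0800]

We certainly do not want each log to carry last week's data and have it counted again (even though it does not affect the result), and the logs under %w/ would keep growing as well.
Solution 1: change the daily cp to mv
Solution 2: after the daily copy, delete access_log files older than 6 days
find /path/to/apache/logs -name access_log -mtime +6 -exec rm -f {} \;
It is still worth keeping a few extra days of logs: what if the log analysis server is down for a day?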

Below is a daily statistics script for an Apache installed under /home/apache:
#!/bin/sh

#backup old log
/bin/cp -f /home/apache/logs/`date -d yesterday +%w`/access_log /home/apache/logs/access_log_yesterday

#remove old log
/usr/bin/find /home/apache/logs -name access_log -mtime +6 -exec rm -f {} \;

#analysis with webalizer
/usr/local/sbin/webalizer

Summary:
1 Use cronolog to rotate logs cleanly and safely
2 Use sort -m to sort and merge multiple logs

References:

Log analysis tools:
http://directory.google.com/Top/Computers/Software/Internet/Site_Management/Log_Analysis/

Apache log configuration:
http://httpd.apache.org/docs/mod/mod_log_config.html

Apache log rotation:
http://httpd.apache.org/docs/misc/FAQ.html#rotate

Cronolog
http://www.cronolog.org

Webalizer
http://www.mrunix.net/webalizer/
Windows version of Webalizer
http://www.medasys-lille.com/webalizer/

A short introduction to AWStats
http://www.chedong.com/tech/awstats.html

Appendix 1: the Webalizer configuration file, with the important parts translated and some important settings changed

#
# Webalizer sample configuration file
# Copyright 1997-2000 by Bradford L. Barrett ([email protected])
# Translation: Che Dong ([email protected])
#
# Distributed under the GNU General Public License. See the
# files "Copyright" and "COPYING" provided with the webalizer
# distribution for additional information.
#
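# This is a sample configuration file for the Webalizer (version 2.01).
# Lines starting with '#' are comments and are ignored, and blank lines are
# skipped as well. All other lines are configuration options of the form
# "ConfigOption Value", where ConfigOption is a valid option keyword and
# Value is the value assigned to it. Invalid keywords/values are ignored,
# with a warning. At least one space or tab must separate keyword and value.
#
# As of version 0.98, the Webalizer looks for a default configuration file
# named "webalizer.conf" in the current directory and, if not found there,
# uses /etc/webalizer.conf.

# LogFile defines the web server log file to use. If it is not specified here
# and no file name is given on the command line, STDIN (standard input) is
# used as the input source. If the log file name ends in '.gz' (a gzip
# compressed file), it is decompressed on the fly as it is read.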

LogFile /home/apache/logs/access_log_yesterday

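# LogType defines the log type. The Webalizer is normally used with web
# server logs in CLF or Combined format; with this option it can also
# process FTP logs (xferlog, as produced by wu-ftp) and Squid native logs.
# Values can be 'clf', 'ftp' or 'squid', with 'clf' the default.
# JNH: the new 'iis' type is for IIS. IIS4 uses the standard log format by
# default and IIS5 uses the W3C format. The Webalizer detects the format from
# the log file name: standard-format file names start with I, W3C ones with E.
# Both kinds of logs can be kept in one directory; the Webalizer reads them
# all and produces a single report.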

LogType iis

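# OutputDir is the directory where the reports are written. It should be a
# full path name, although a relative path may also work. If not specified,
# the current directory is used.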

OutputDir /home/apache/htdocs/usage/

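# HistoryName lets you set the name of the history file the Webalizer
# produces. The history file keeps data for up to 12 months, which is used to
# generate the main HTML page (index.html). The default is "webalizer.hist",
# stored in the output directory; an absolute path can be used to place it
# elsewhere.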

#HistoryName webalizer.hist

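# Incremental processing lets you process a large log that has been split
# into multiple smaller files, which is very useful for large sites that
# rotate logs weekly or daily. To continue where it left off, the Webalizer
# saves its current state before exiting and restores it on the next run.
# In this mode the Webalizer also scans for and ignores duplicate records.
# See the README file for a more detailed explanation. Values can be 'yes'
# or 'no', with 'no' the default. The file 'webalizer.current' holds the
# current state and is kept in the directory set by OutputDir. Please read at
# least the incremental processing section of the README before enabling
# this option.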

Incremental yes

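# IncrementalName lets you set the name of the file that holds the current
# state. As with HistoryName, the file is kept in the default output
# directory unless an absolute path is given. This option only has an effect
# when Incremental mode is enabled.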

#IncrementalName webalizer.current

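# ReportTitle is the text displayed as the report title. Unless the string
# is blank, the host name is appended after it, separated by a space. The
# default is (in English) "Usage Statistics for".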

#ReportTitle Usage Statistics for

# HostName defines the host name the report is for. It is used in the report
# title and in the URL statistics, so that clicking a URL in the report goes
# to the proper location even when the report is for a 'virtual' web server,
# or for a server other than the one the report resides on. If not specified
# here, the Webalizer tries to obtain the host name via uname and, if that
# fails, defaults to "localhost".

HostName www.chedong.com

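# HTMLExtension lets you set the file name extension of the generated
# reports. The default is "html", but it can be changed to suit your site
# (for example, for pages with embedded PHP).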

#HTMLExtension html

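# PageType lets you tell the Webalizer which kinds of URLs count as a 'page
# view'. Most people regard a request for an html or cgi document as a page,
# while the images and audio embedded in a page are not. If nothing is
# specified, the defaults are 'htm*' and 'cgi' for web logs and 'txt' for ftp
# logs. Requests without an extension, such as servlets, also count as pages.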

PageType htm*
PageType cgi
PageType asp
PageType p*
#PageType phtml
#PageType php3
#PageType pl

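# UseHTTPS should be used if the analysis is for a secure server, so that
# links to URLs begin with 'https://' instead of the default 'http://'. Set
# it to 'yes' if needed; the default is 'no'. This only affects the links in
# the 'Top URL's' table.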

#UseHTTPS no

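# DNSCache specifies the DNS cache file used for reverse DNS lookups, which
# is needed if you want reverse lookups for all IP addresses found in the log
# file. If no absolute path is given (the name does not start with '/'), the
# file is kept in the output directory. See DNS.README for details.
# JNH: if you use the ServerList option, you must give the full path of
# DNSCache.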

#DNSCache dns_cache.db

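# DNSChildren lets you set how many "child" processes are used for DNS
# lookups and for updating the DNS cache file. If a number is given, the
# Webalizer creates the DNS cache file and updates it on every run, starting
# that many child processes to do the DNS lookups before the log is analyzed.
# If used, the DNS cache file name MUST be specified as well. The default
# value is 0, which disables the DNS cache file. The number of children can
# be anything from 1 to 100, but larger values may affect the system.
# Reasonable values are between 5 and 20. See DNS.README for details.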

#DNSChildren 0

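# HTMLPre defines the HTML code inserted at the very beginning of each
# output page. The default is the DOCTYPE declaration shown below. Each line
# can be at most 80 characters; use several entries if more code is needed.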

#HTMLPre <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

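# HTMLHead defines the HTML code inserted between <HEAD> and </HEAD>, right
# after the <TITLE> line. Each line can be at most 80 characters; use several
# entries if more code is needed.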

#HTMLHead <META NAME="author" CONTENT="The Webalizer">

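# HTMLBody defines the HTML code that starts with the <BODY> tag. The
# default is shown below. Each line can be at most 80 characters; use several
# entries if more code is needed.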

#HTMLBody <BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">

# HTMLPost defines the HTML code inserted right after the first <HR> tag,
# which follows the title and the "summary period"/"Generated on:" lines.
# As with HTMLHead, you can define as many of these as you want and
# they will be inserted in the output stream in order of appearance.
# Each line can be at most 80 characters.

#HTMLPost <BR CLEAR="all">

# HTMLTail defines the HTML code to insert at the bottom of each
# HTML document, usually to include a link back to your home
# page or insert a small graphic. It is inserted as a table
# data element (ie: <TD> your code here </TD>) and is right
# aligned with the page. Max string size is 80 characters.

#HTMLTail <IMG SRC="msfree.png" ALT="100% Micro$oft free!">

# HTMLEnd defines the HTML code to add at the very end of the
# generated files. It defaults to what is shown below. If
# used, you MUST specify the </BODY> and </HTML> closing tags
# as the last lines. Max string length is 80 characters.

#HTMLEnd </BODY></HTML>

# The Quiet option suppresses output messages... Useful when run
# as a cron job to prevent bogus e-mails. Values can be either
# "yes" or "no". Default is "no". Note: this does not suppress
# warnings and errors (which are printed to stderr).

#Quiet no

# ReallyQuiet will suppress all messages including errors and
# warnings. Values can be 'yes' or 'no' with 'no' being the
# default. If 'yes' is used here, it cannot be overridden from
# the command line, so use with caution. A value of 'no' has
# no effect.

#ReallyQuiet no

# TimeMe allows you to force the display of timing information
# at the end of processing. A value of 'yes' will force the
# timing information to be displayed. A value of 'no' has no
# effect.

#TimeMe no

# GMTTime allows reports to show GMT (UTC) time instead of local
# time. Default is to display the time the report was generated
# in the timezone of the local machine, such as EDT or PST. This
# keyword allows you to have times displayed in UTC instead. Use
# only if you really have a good reason, since it will probably
# screw up the reporting periods by however many hours your local
# time zone is off of GMT.

#GMTTime no

# Debug prints additional information for error messages. This
# will cause webalizer to dump bad records/fields instead of just
# telling you it found a bad one. As usual, the value can be
# either "yes" or "no". The default is "no". It shouldn't be
# needed unless you start getting a lot of Warning or Error
# messages and want to see why. (Note: warning and error messages
# are printed to stderr, not stdout like normal messages).

#Debug no

# FoldSeqErr forces the Webalizer to ignore sequence errors.
# This is useful for Netscape and other web servers that cache
# the writing of log records and do not guarantee that they
# will be in chronological order. The use of the FoldSeqErr
# option will cause out of sequence log records to be treated
# as if they had the same time stamp as the last valid record.
# Default is to ignore out of sequence log records.

#FoldSeqErr no

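# VisitTimeout sets the timeout of a visit (session). The default is 30
# minutes. Visits are determined by comparing the time of the current request
# with the time of the last request from the same site (IP). If the
# difference is greater than VisitTimeout, the request is counted as a new
# visit and the visit total is incremented. The value is the timeout in
# seconds (default = 1800 seconds = 30 minutes).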

#VisitTimeout 1800

# IgnoreHist shouldn't be used in a config file, but it is here
# just because it might be useful in certain situations. If the
# history file is ignored, the main "index.html" file will only
# report on the current log files contents. Useful only when you
# want to reproduce the reports from scratch. USE WITH CAUTION!
# Valid values are "yes" or "no". Default is "no".

#IgnoreHist no

# Country Graph allows the usage by country graph to be disabled.
# Values can be 'yes' or 'no', default is 'yes'.

#CountryGraph yes

# DailyGraph and DailyStats allows the daily statistics graph
# and statistics table to be disabled (not displayed). Values
# may be "yes" or "no". Default is "yes".

#DailyGraph yes
#DailyStats yes

# HourlyGraph and HourlyStats allows the hourly statistics graph
# and statistics table to be disabled (not displayed). Values
# may be "yes" or "no". Default is "yes".

#HourlyGraph yes
#HourlyStats yes

# GraphLegend allows the color coded legends to be turned on or off
# in the graphs. The default is for them to be displayed. This only
# toggles the color coded legends, the other legends are not changed.
# If you think they are hideous and ugly, say 'no' here :)

#GraphLegend yes

# GraphLines allows you to have index lines drawn behind the graphs.
# I personally am not crazy about them, but a lot of people requested
# them and they weren't a big deal to add. The number represents the
# number of lines you want displayed. Default is 2, you can disable
# the lines by using a value of zero ('0'). [max is 20]
# Note, due to rounding errors, some values don't work quite right.
# The lower the better, with 1,2,3,4,6 and 10 producing nice results.

#GraphLines 2

# The "Top" options below define the number of entries for each table.
# Defaults are Sites=30, URL's=30, Referrers=30 and Agents=15, and
# Countries=30. TopKSites and TopKURLs (by KByte tables) both default
# to 10, as do the top entry/exit tables (TopEntry/TopExit). The top
# search strings and usernames default to 20. Tables may be disabled
# by using zero (0) for the value.

#TopSites 30
#TopKSites 10
#TopURLs 30
#TopKURLs 10
#TopReferrers 30
#TopAgents 15
#TopCountries 30
#TopEntry 10
#TopExit 10
#TopSearch 20
#TopUsers 20

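# The All* keywords allow listing all URLs, sites (IPs), referrers, user
# agents, search strings and usernames. If enabled, a separate HTML page is
# generated and a link to it is added at the bottom of the corresponding
# table. Two things to note: first, these listings are necessarily much
# larger than the Top tables; second, they are publicly visible. Values can
# be 'yes' or 'no', with 'no' the default. For a public site these monthly
# listings can be very large; they take a lot of disk space and, if visited
# often, a lot of bandwidth.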

#AllSites no
AllURLs yes
#AllReferrers no
#AllAgents no
AllSearchStr yes
#AllUsers no

# The Webalizer normally strips the string 'index.' off the end of
# URL's in order to consolidate URL totals. For example, the URL
# /somedir/index.html is turned into /somedir/ which is really the
# same URL. This option allows you to specify additional strings
# to treat in the same way. You don't need to specify 'index.' as
# it is always scanned for by The Webalizer, this option is just to
# specify _additional_ strings if needed. If you don't need any,
# don't specify any as each string will be scanned for in EVERY
# log record... A bunch of them will degrade performance. Also,
# the string is scanned for anywhere in the URL, so a string of
# 'home' would turn the URL /somedir/homepages/brad/home.html into
# just /somedir/ which is probably not what was intended.

#IndexAlias home.htm
#IndexAlias homepage.htm

# The Hide*, Group* and Ignore* and Include* keywords allow you to
# change the way Sites, URL's, Referrers, User Agents and Usernames
# are manipulated. The Ignore* keywords will cause The Webalizer to
# completely ignore records as if they didn't exist (and thus not
# counted in the main site totals). The Hide* keywords will prevent
# things from being displayed in the 'Top' tables, but will still be
# counted in the main totals. The Group* keywords allow grouping
# similar objects as if they were one. Grouped records are displayed
# in the 'Top' tables and can optionally be displayed in BOLD and/or
# shaded. Groups cannot be hidden, and are not counted in the main
# totals. The Group* options do not, by default, hide all the items
# that it matches. If you want to hide the records that match (so just
# the grouping record is displayed), follow with an identical Hide*
# keyword with the same value. (see example below) In addition,
# Group* keywords may have an optional label which will be displayed
# instead of the keywords value. The label should be separated from
# the value by at least one 'white-space' character, such as a space
# or tab.
#
# The value can have either a leading or trailing '*' wildcard
# character. If no wildcard is found, a match can occur anywhere
# in the string. Given a string "www.yourmama.com", the values "your",
# "*mama.com" and "www.your*" will all match.
# Your own site should be hidden

#HideSite *mrunix.net
#HideSite localhost

# Your own site gives most referrals
#HideReferrer mrunix.net/

# This one hides non-referrers ("-" Direct requests)
#HideReferrer Direct Request

# Usually you want to hide these
HideURL *.gif
HideURL *.GIF
HideURL *.jpg
HideURL *.JPG
HideURL *.png
HideURL *.PNG
HideURL *.ra
HideURL *.css

# Hiding agents is kind of futile
#HideAgent RealPlayer

# You can also hide based on authenticated username
#HideUser root
#HideUser admin

# Grouping options
#GroupURL /cgi-bin/* CGI Scripts
#GroupURL /images/* Images
#GroupSite *.aol.com
#GroupSite *.compuserve.com
#GroupReferrer yahoo.com/ Yahoo!
#GroupReferrer excite.com/ Excite
#GroupReferrer infoseek.com/ InfoSeek
#GroupReferrer webcrawler.com/ WebCrawler

#GroupUser root Admin users
#GroupUser admin Admin users
#GroupUser wheel Admin users

# The following is a great way to get an overall total
# for browsers, and not display all the detail records.
# (You should use MangleAgent to refine further...)

#GroupAgent MSIE Micro$oft Internet Exploder
#HideAgent MSIE
#GroupAgent Mozilla Netscape
#HideAgent Mozilla
#GroupAgent Lynx* Lynx
#HideAgent Lynx*

# HideAllSites allows forcing individual sites to be hidden in the
# report. This is particularly useful when used in conjunction
# with the "GroupDomain" feature, but could be useful in other
# situations as well, such as when you only want to display grouped
# sites (with the GroupSite keywords...). The value for this
# keyword can be either 'yes' or 'no', with 'no' the default,
# allowing individual sites to be displayed.

#HideAllSites no

# The GroupDomains keyword allows you to group individual hostnames
# into their respective domains. The value specifies the level of
# grouping to perform, and can be thought of as 'the number of dots'
# that will be displayed. For example, if a visiting host is named
# cust1.tnt.mia.uu.net, a domain grouping of 1 will result in just
# "uu.net" being displayed, while a 2 will result in "mia.uu.net".
# The default value of zero disable this feature. Domains will only
# be grouped if they do not match any existing "GroupSite" records,
# which allows overriding this feature with your own if desired.

#GroupDomains 0

# The GroupShading allows grouped rows to be shaded in the report.
# Useful if you have lots of groups and individual records that
# intermingle in the report, and you want to differentiate the group
# records a little more. Value can be 'yes' or 'no', with 'yes'
# being the default.

#GroupShading yes

# GroupHighlight allows the group record to be displayed in BOLD.
# Can be either 'yes' or 'no' with the default 'yes'.

#GroupHighlight yes

# The Ignore* keywords allow you to completely ignore log records based
# on hostname, URL, user agent, referrer or username. I hesitated in
# adding these, since the Webalizer was designed to generate _accurate_
# statistics about a web servers performance. By choosing to ignore
# records, the accuracy of reports become skewed, negating why I wrote
# this program in the first place. However, due to popular demand, here
# they are. Use the same as the Hide* keywords, where the value can have
# a leading or trailing wildcard '*'. Use at your own risk ;)

#IgnoreSite bad.site.net
#IgnoreURL /test*
#IgnoreReferrer file:/*
#IgnoreAgent RealPlayer
#IgnoreUser root

# The Include* keywords allow you to force the inclusion of log records
# based on hostname, URL, user agent, referrer or username. They take
# precedence over the Ignore* keywords. Note: Using Ignore/Include
# combinations to selectively process parts of a web site is _extremely
# inefficient_!!! Avoid doing so if possible (ie: grep the records to a
# separate file if you really want that kind of report).

# Example: Only show stats on Joe User's pages...
#IgnoreURL *
#IncludeURL ~joeuser*

# Or based on an authenticated username
#IgnoreUser *
#IncludeUser someuser

# The MangleAgents allows you to specify how much, if any, The Webalizer
# should mangle user agent names. This allows several levels of detail
# to be produced when reporting user agent statistics. There are six
# levels that can be specified, which define different levels of detail
# suppression. Level 5 shows only the browser name (MSIE or Mozilla)
# and the major version number. Level 4 adds the minor version number
# (single decimal place). Level 3 displays the minor version to two
# decimal places. Level 2 will add any sub-level designation (such
# as Mozilla/3.01Gold or MSIE 3.0b). Level 1 will attempt to also add
# the system type if it is specified. The default Level 0 displays the
# full user agent field without modification and produces the greatest
# amount of detail. User agent names that can't be mangled will be
# left unmodified.

#MangleAgents 0

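# The SearchEngine keywords let you specify search engines and the query
# string format of their URLs, so that the report can show which search
# strings were used to find your site. The first value is the string used to
# recognize the search engine in the referrer field of the web log; the
# second is the name of the URL parameter that holds the search string.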

SearchEngine yahoo.com p=
SearchEngine altavista.com q=
SearchEngine google.com q=
SearchEngine eureka.com q=
SearchEngine lycos.com query=
SearchEngine hotbot.com MT=
SearchEngine msn.com MT=
SearchEngine infoseek.com qt=
SearchEngine webcrawler searchText=
SearchEngine excite search=
SearchEngine netscape.com search=
SearchEngine mamma.com query=
SearchEngine alltheweb.com query=
SearchEngine northernlight.com qr=
SearchEngine baidu.com word=
SearchEngine sina.com.cn word=
SearchEngine sohu.com word=
SearchEngine 163.com q=

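# The Dump* keywords export the statistics as tab-delimited text files,
# which makes them easy to import into other applications such as databases
# and statistics software.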

# DumpPath specifies the path to dump the files. If not specified,
# it will default to the current output directory. Do not use a
# trailing slash ('/').

#DumpPath /var/lib/httpd/logs

# The DumpHeader keyword specifies if a header record should be
# written to the file. A header record is the first record of the
# file, and contains the labels for each field written. Normally,
# files that are intended to be imported into a database system
# will not need a header record, while spreadsheets usually do.
# Value can be either 'yes' or 'no', with 'no' being the default.

#DumpHeader no

# DumpExtension allow you to specify the dump filename extension
# to use. The default is "tab", but some programs are picky about
# the filenames they use, so you may change it here (for example,
# some people may prefer to use "csv").

#DumpExtension tab

# The keywords below control which of the statistics are dumped.
# Values can be 'yes' or 'no', with 'no' the default.

#DumpSites no
DumpURLs yes
DumpReferrers yes
#DumpAgents no
#DumpUsers no
DumpSearchStr yes

# End of configuration file... Have a nice day!

# Beginning of JNH modifications
# New entries for the Win32 release

# New entry for NT servers

# Name of the default page on the server
# (replaces the "Index" file of unix systems with another name)

# IndexPage default

# Directory that holds all the log files.
# At most 250 files per directory are handled; to process more, move the
# files and run the program again.

# FolderLog C:\JnhDev\WebAlizer32\Exemple de Logs\IIS4.0\Log Standard\
FolderLog C:\WINNT\system32\LogFiles\W3SVC3\
ExtentionLog log

# when you use mix type of log in same folder, webalizer sort file for order by
# name, but if begin of file file is mix sort didn't make work, then you can disable it
# default is no

# DisableSort yes

# Name of file contain list of server to process like for each line :
# Name of Customer<SPACE>Folder of log<SPACE>Folder output<SPACE>Host Name1;Host Name 2
# sample (extract of production file, who have 255 lines)
# all of option in this file apply to all reports ...
# New in this file you can use coma (") for delimit field
# wA001 c:\WA001\LogIIS\ c:\wA001\stats wa001.LeRelaisInternet.com;www1.jeanlouisaubert.com
# wA002 c:\WA002\LogIIS\ c:\wA002\stats wa002.LeRelaisInternet.com;www.restotel.fr;www.nordpage.fr
# wA003 c:\WA003\LogIIS\ c:\wA003\stats Wa003.LeRelaisInternet.com;www.autobusavapeur.com

#ServerList c:\jnhdev\webalizer\listeserv.txt

# If you have dayly rotation on log name, you can change name after process a file
# to have less no productive work day
# to use this option you need to use "HistoryName" and "Incremental"

RenameLog yes
NewExtension sav

# 2 New Options for optimize DNS resolution : is time to live in data base cache
# for good dns resolution (default is 30 days) and for bad resolution, like
# no reverse IP, in this case it's better to store errors in database file
# cause each day bad dns consume a lot of time (default 7 days)

#TtlDns 30
#TtlDnsError 7

# new option for convert each record date to Local time before process it ...
# Test only
# default = No

ConvertTime yes

# end of JNH .. HAve a nice day !!!

Note: for IIS logs, two fields must be enabled in the logging configuration: the bytes-sent field (sc_size) and the referer field.

Original source: http://www.chedong.com/tech/rotate_merge_log.html

你可能感兴趣的:(apache,应用服务器,Web,浏览器,Access)

理解Gunicorn：Python WSGI服务器的基石范范0825 ipython linux 运维
理解Gunicorn：PythonWSGI服务器的基石介绍Gunicorn，全称GreenUnicorn，是一个为PythonWSGI（WebServerGatewayInterface）应用设计的高效、轻量级HTTP服务器。作为PythonWeb应用部署的常用工具，Gunicorn以其高性能和易用性著称。本文将介绍Gunicorn的基本概念、安装和配置，帮助初学者快速上手。1.什么是Gunico
Long类型前后端数据不一致 igotyback 前端
响应给前端的数据浏览器控制台中response中看到的Long类型的数据是正常的到前端数据不一致前后端数据类型不匹配是一个常见问题，尤其是当后端使用Java的Long类型（64位）与前端JavaScript的Number类型（最大安全整数为2^53-1，即16位）进行数据交互时，很容易出现精度丢失的问题。这是因为JavaScript中的Number类型无法安全地表示超过16位的整数。为了解决这个问
Google earth studio 简介陟彼高冈yu 旅游
GoogleEarthStudio是一个基于Web的动画工具，专为创作使用GoogleEarth数据的动画和视频而设计。它利用了GoogleEarth强大的三维地图和卫星影像数据库，使用户能够轻松地创建逼真的地球动画、航拍视频和动态地图可视化。网址为https://www.google.com/earth/studio/。GoogleEarthStudio是一个基于Web的动画工具，专为创作使用G
PHP环境搭建详细教程好看资源平台前端 php
PHP是一个流行的服务器端脚本语言，广泛用于Web开发。为了使PHP能够在本地或服务器上运行，我们需要搭建一个合适的PHP环境。本教程将结合最新资料，介绍在不同操作系统上搭建PHP开发环境的多种方法，包括Windows、macOS和Linux系统的安装步骤，以及本地和Docker环境的配置。1.PHP环境搭建概述PHP环境的搭建主要分为以下几类：集成开发环境：例如XAMPP、WAMP、MAMP，这
下载github patch到本地小米人er 我的博客 git patch
以下是几种从GitHub上下载以.patch结尾的补丁文件的方法：通过浏览器直接下载打开包含该.patch文件的GitHub仓库。在仓库的文件列表中找到对应的.patch文件。点击该文件，浏览器会显示文件的内容，在页面的右上角通常会有一个“Raw”按钮，点击它可以获取原始文件内容。然后在浏览器中使用快捷键（如Ctrl+S或者Command+S）将原始文件保存到本地，选择保存的文件名并确保后缀为.p
DIV+CSS+JavaScript技术制作网页（旅游主题网页设计与制作）云南大理 STU学生网页设计网页设计期末网页作业 html静态网页 html5期末大作业网页设计 web大作业
️精彩专栏推荐作者主页:【进入主页—获取更多源码】web前端期末大作业：【HTML5网页期末作业(1000套)】程序员有趣的告白方式：【HTML七夕情人节表白网页制作(110套)】文章目录二、网站介绍三、网站效果▶️1.视频演示2.图片演示四、网站代码HTML结构代码CSS样式代码五、更多源码二、网站介绍网站布局方面：计划采用目前主流的、能兼容各大主流浏览器、显示效果稳定的浮动网页布局结构。网站程