By alexjc | April 5, 2007
You’ve finally got your hands on the diverse collection of HTML documents you needed. But the content you’re interested in is hidden amidst adverts, layout tables, formatting markup and various links. Even worse, there’s visible text in the menus, headers and footers that you want to filter out. If you don’t want to write a complex scraping program for each type of HTML file, there is a solution.
This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…
Do you want to find out how statistics and machine learning can save you time and effort mining text?
The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:

1. Parse the HTML code, keeping track of how many bytes have been processed.
2. Store the output text on a per-line (or per-paragraph) basis.
3. Associate with each line of text the number of HTML bytes needed to produce it.
4. Compute the text density of each line as the ratio of text length to HTML bytes.
5. Decide whether each line is content, using a simple threshold or machine learning.
You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning — not to mention that it’s easier to implement!
Let’s take it from the top…
What you need is the core of a text-mode browser, which is already set up to read files with HTML markup and display raw text. By reusing existing code, you won’t have to spend too much time handling invalid XML documents, which, as you’ll quickly realise, are very common.
As a quick example, we’ll be using Python along with a few built-in modules: htmllib for the parsing and formatter for outputting formatted text. This is what the top-level function looks like:
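Here’s a minimal sketch of what that top-level function might look like, assuming Python 2’s htmllib and formatter modules and the TrackingParser and LineWriter classes described below (the function name itself is just illustrative):

```python
import htmllib, formatter

def extract_text(html):
    # Derive from formatter.NullWriter to store paragraphs (defined below).
    writer = LineWriter()
    # The default formatter translates parse events into writer calls.
    fmt = formatter.AbstractFormatter(writer)
    # Derive from htmllib.HTMLParser to track how many bytes were parsed.
    parser = TrackingParser(writer, fmt)
    # Feed the parser the raw HTML data.
    parser.feed(html)
    parser.close()
    # Filter the stored paragraphs and return the interesting ones.
    return writer.output()
```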
The TrackingParser itself overrides the callback functions for parsing start and end tags, as they are given the current parse index in the buffer. You don’t have access to that normally, unless you start diving into frames in the call stack — which isn’t the best approach! Here’s what the class looks like:
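A sketch of such a parser, relying on the parse_starttag and parse_endtag hooks inherited from sgmllib (they receive the current index into the raw buffer, which we forward to the writer):

```python
import htmllib

class TrackingParser(htmllib.HTMLParser):
    """Keep track of the parse position and pass it on to the writer."""
    def __init__(self, writer, *args):
        htmllib.HTMLParser.__init__(self, *args)
        self.writer = writer

    def parse_starttag(self, i):
        # The base class returns the index just past the start tag.
        index = htmllib.HTMLParser.parse_starttag(self, i)
        self.writer.index = index
        return index

    def parse_endtag(self, i):
        self.writer.index = i
        return htmllib.HTMLParser.parse_endtag(self, i)
```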
The LineWriter class does the bulk of the work when called by the default formatter. If you have any improvements or changes to make, most likely they’ll go here. This is where we’ll put our machine learning code later. But you can keep the implementation rather simple and still get good results. Here’s the simplest possible code:
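One way that simplest version might look, with a small Paragraph record accumulating the text and byte count of each line (deriving from formatter.NullWriter so every other formatter event is silently ignored; the byte adjustments are rough heuristics):

```python
import formatter

class Paragraph:
    """One line of output text plus the HTML bytes that produced it."""
    def __init__(self):
        self.text = ''
        self.bytes = 0
        self.density = 0.0

class LineWriter(formatter.NullWriter):
    def __init__(self, *args):
        formatter.NullWriter.__init__(self)
        self.last_index = 0
        self.index = 0              # updated by TrackingParser
        self.lines = [Paragraph()]

    def send_flowing_data(self, data):
        # Length of this chunk of visible text.
        t = len(data)
        # We've output more text, so move the index along too.
        self.index += t
        # Bytes consumed since the last chunk of text.
        b = self.index - self.last_index
        self.last_index = self.index
        # Accumulate the text and byte count in the current line.
        line = self.lines[-1]
        line.text += data
        line.bytes += b

    def send_paragraph(self, blankline):
        """Close the current paragraph and start a new one."""
        if self.lines[-1].text == '':
            return
        self.lines[-1].text += '\n' * (blankline + 1)
        # Rough allowance for the markup that produced the break.
        self.lines[-1].bytes += 2 * (blankline + 1)
        self.lines.append(Paragraph())

    def send_literal_data(self, data):
        self.send_flowing_data(data)

    def send_line_break(self):
        self.send_paragraph(0)
```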
This code doesn’t do any outputting yet; it just gathers the data. We now have a bunch of paragraphs in an array, we know their length, and we know roughly how many bytes of HTML were necessary to create them. Let’s see what emerges from our statistics.
Luckily, there are some patterns in the data. In the raw output below, you’ll notice there are definite spikes in the number of HTML bytes required to encode lines of text, notably around the title, both sidebars, headers and footers.
While the number of HTML bytes spikes in places, it remains below average for quite a few lines. On these lines, the text output is rather high. Calculating the density of text to HTML bytes gives us a better understanding of this relationship.
The patterns are more obvious in this density value, so it gives us something concrete to work with.
The simplest way we can filter lines now is by comparing the density to a fixed threshold, such as 50% or the average density. Finishing the LineWriter class:
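A sketch of those two remaining methods, computing a per-line density and keeping only lines above a fixed threshold of 0.5 (swap in self.average if you prefer the average density):

```python
import StringIO

class LineWriter(formatter.NullWriter):
    # ... __init__ and the send_* methods from the previous listing ...

    def compute(self):
        """Calculate the density of each line, plus the overall average."""
        total = 0.0
        for line in self.lines:
            # Guard against the trailing empty paragraph with zero bytes.
            line.density = len(line.text) / float(line.bytes) if line.bytes else 0.0
            total += line.density
        self.average = total / float(len(self.lines))

    def output(self):
        """Return a string with the low-density lines filtered out."""
        self.compute()
        out = StringIO.StringIO()
        for line in self.lines:
            # Keep the line only if its text density clears the threshold.
            if line.density > 0.5:
                out.write(line.text)
        return out.getvalue()
```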
This rough filter typically gets most of the lines right. All the header, footer and sidebar text is usually stripped, as long as it’s not too long. However, if there are long copyright notices, comments, or descriptions of other stories, then those are output too. Conversely, short lines of genuine content sitting around inline graphics or adverts tend to be filtered out when they shouldn’t be.
To fix this, we need a more complex filtering heuristic. But instead of spending days working out the logic manually, we’ll just grab loads of information about each line and use machine learning to find patterns for us.
The first step is to label some training data, for example with a simple interface for tagging lines of text as content or not.
The idea of supervised learning is to provide examples for an algorithm to learn from. In our case, we give it a set of documents that were tagged by humans, so we know which lines should be output and which should be filtered out. For this we’ll use a simple neural network known as the perceptron. It takes floating-point inputs, filters the information through weighted connections between “neurons”, and outputs another floating-point number. Roughly speaking, the number of neurons and layers affects the ability to approximate functions precisely; we’ll use both single-layer perceptrons (SLP) and multi-layer perceptrons (MLP) for prototyping.
To get the neural network to learn, we need to gather some data. This is where the earlier LineWriter.output() function comes in handy: it gives us a central point to process all the lines at once and make a global decision about which lines to output. Starting with intuition and experimenting a bit, we discover that the following data is useful for deciding how to filter a line:

- the density of the current line;
- the number of HTML bytes of the line;
- the length of output text for this line;
- these three values for the previous line;
- and the same three values for the next line.
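As a rough illustration, here’s how such a feature vector might be assembled for the three-attributes, three-lines case; the helper name and the boundary handling are just placeholders:

```python
def line_features(lines, i):
    """Build the input vector for line i: density, byte count and text length
    of the previous, current and next lines (zeros at the boundaries)."""
    features = []
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(lines):
            line = lines[j]
            features.extend([line.density, float(line.bytes), float(len(line.text))])
        else:
            features.extend([0.0, 0.0, 0.0])
    return features
```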
For the implementation, we’ll be using Python to interface with FANN, the Fast Artificial Neural Network Library. The essence of the learning code goes like this:
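The exact Python wrapper for FANN has changed over the years, so treat the following as a sketch of the general shape rather than a definitive API; it assumes the fann2-style libfann bindings, the line_features() helper sketched earlier, and a placeholder training file name:

```python
from fann2 import libfann  # assumes the fann2-style Python bindings for FANN

# Network layout: 9 inputs (3 attributes x 3 lines), a small hidden layer, 1 output.
ann = libfann.neural_net()
ann.create_standard_array([9, 5, 1])
ann.set_activation_function_hidden(libfann.SIGMOID_SYMMETRIC)
ann.set_activation_function_output(libfann.SIGMOID)

# Load training patterns previously written out in FANN's plain-text format.
data = libfann.training_data()
data.read_train_from_file('train.data')   # placeholder file name

# Train for at most 1000 epochs, reporting every 100, stopping at error 0.001.
ann.train_on_data(data, 1000, 100, 0.001)

# Classify the lines of a new document: treat outputs above 0.5 as content.
for i, line in enumerate(writer.lines):
    keep = ann.run(line_features(writer.lines, i))[0] > 0.5
```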
Trying out different data and different network structures is a rather mechanical process. Don’t use too many neurons, or the network may learn your particular set of documents too well (overfitting); conversely, make sure there are enough to solve the problem properly. Here are the results, varying the number of lines used (1L-3L) and the number of attributes per line (1A-3A):
The interesting thing to note is that 0.5 is already a pretty good guess at a fixed threshold (see the first set of columns). The learning algorithm cannot find a much better solution when comparing the density alone (one attribute, in the second set of columns). With three attributes, the next SLP does better overall, though it gets more false negatives. Using multiple lines also increases the performance of the single-layer perceptron (fourth set of columns). And finally, using a more complex neural network structure works best overall, making 80% fewer errors when filtering the lines.
Note that you can tweak how the error is calculated if you want to punish false positives more than false negatives.
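For example, a small scoring helper with asymmetric penalties could be used when comparing trained networks; the weights here are arbitrary placeholders:

```python
FALSE_POSITIVE_COST = 2.0   # junk that slipped into the output
FALSE_NEGATIVE_COST = 1.0   # real content that was filtered out

def weighted_error(predictions, targets):
    """Score a set of line classifications, punishing false positives more."""
    error = 0.0
    for predicted, expected in zip(predictions, targets):
        if predicted >= 0.5 and expected < 0.5:
            error += FALSE_POSITIVE_COST
        elif predicted < 0.5 and expected >= 0.5:
            error += FALSE_NEGATIVE_COST
    return error / len(targets)
```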
Extracting text from arbitrary HTML files doesn’t necessarily require scraping the file with custom code. You can use statistics to get pretty amazing results, and machine learning to get even better ones. By tweaking the threshold, you can also avoid the worst false positives that pollute your text output. In practice the remaining mistakes aren’t so bad; where the neural network goes wrong, even humans have trouble deciding whether those lines are “content” or not.
Now all you have to figure out is what to do with that clean text content!