Deepti Chopra (India)
Translated by Wang Wei
The applications of computational linguistics include machine translation, speech recognition, intelligent Web search, information retrieval, and intelligent spell checking.
import nltk
from nltk.util import ngrams
from nltk.corpus import alpino
print(alpino.words())
unigrams=ngrams(alpino.words(),1)
for i in unigrams:
    print(i)
import nltk
from nltk.util import ngrams
from nltk.corpus import alpino
print(alpino.words())
quadgrams=ngrams(alpino.words(),4)
for i in quadgrams:
    print(i)
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import webtext
from nltk.metrics import BigramAssocMeasures
tokens=[t.lower() for t in webtext.words('grail.txt')]
words=BigramCollocationFinder.from_words(tokens)
print(words.nbest(BigramAssocMeasures.likelihood_ratio, 10))
from nltk.corpus import stopwords
from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
stopset = set(stopwords.words('english'))
stops_filter = lambda w: len(w) < 3 or w in stopset
tokens=[t.lower() for t in webtext.words('grail.txt')]
words=BigramCollocationFinder.from_words(tokens)
words.apply_word_filter(stops_filter)
print(words.nbest(BigramAssocMeasures.likelihood_ratio, 10))
import nltk
from nltk.collocations import *
text1="Hardwork is the key to success. Never give up!"
word = nltk.wordpunct_tokenize(text1)
finder = BigramCollocationFinder.from_words(word)
bigram_measures = nltk.collocations.BigramAssocMeasures()
value = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(bigram for bigram, score in value))
import nltk
from nltk.util import ngrams
from nltk.corpus import alpino
print(alpino.words())
bigrams_tokens=ngrams(alpino.words(),2)
for i in bigrams_tokens:
    print(i)
import nltk
from nltk.util import ngrams
from nltk.corpus import alpino
print(alpino.words())
trigrams_tokens=ngrams(alpino.words(),3)
for i in trigrams_tokens:
    print(i)
import nltk
from nltk.collocations import *
text="Hello how are you doing ? I hope you find the book interesting"
tokens=nltk.wordpunct_tokenize(text)
fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram, freq)
import nltk
from nltk.util import ngrams
sent=" Hello , please read the book thoroughly . If you have any queries , then don't hesitate to ask . There is no shortcut to success ."
n=5
fivegrams=ngrams(sent.split(),n)
for grams in fivegrams:
    print(grams)
from __future__ import print_function,unicode_literals
__docformat__='epytext en'
try:
    import numpy
except ImportError:
    pass
import tempfile
import os
from collections import defaultdict
from nltk import compat
from nltk.data import gzip_open_unicode
from nltk.util import OrderedDict
from nltk.probability import DictionaryProbDist
from nltk.classify.api import ClassifierI
from nltk.classify.util import CutoffChecker,accuracy,log_likelihood
from nltk.classify.megam import (call_megam,write_megam_file,parse_megam_weights)
from nltk.classify.tadm import call_tadm,write_tadm_file,parse_tadm_weights
| Probability distribution type | Description |
| --- | --- |
| Derived probability distribution | Obtained from a frequency distribution |
| Analytic probability distribution | Obtained from parameters |
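To make the distinction concrete, here is a small sketch (the toy string and the sample list are invented for illustration): a derived distribution such as MLEProbDist is estimated from a frequency distribution, while an analytic distribution such as UniformProbDist is determined entirely by its parameters.
from nltk.probability import FreqDist, MLEProbDist, UniformProbDist
# Derived: probabilities are read off the observed counts in a FreqDist.
fd = FreqDist('abracadabra')
derived = MLEProbDist(fd)
print(derived.prob('a'))   # 5/11, taken directly from the frequency distribution
# Analytic: probabilities come from parameters alone (a uniform choice over 5 samples).
analytic = UniformProbDist(['a', 'b', 'c', 'd', 'r'])
print(analytic.prob('a'))  # 0.2, independent of any observed counts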
class MLEProbDist(ProbDistI):
    def __init__(self, freqdist, bins=None):
        self._freqdist = freqdist

    def freqdist(self):
        """This function returns the frequency distribution on which the probability distribution is based:"""
        return self._freqdist

    def prob(self, sample):
        return self._freqdist.freq(sample)

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def __repr__(self):
        """It will return the string representation of the ProbDist."""
        return '<MLEProbDist based on %d samples>' % self._freqdist.N()
class LidstoneProbDist(ProbDistI):
    """This class is used to obtain a smoothed frequency distribution. The smoothing is controlled
    by a real number gamma, whose value lies between 0 and 1. LidstoneProbDist computes the
    probability of a given sample from the count c, the total number of sample outcomes N, and
    the number of sample values B that can be obtained from the probability distribution, using
    the formula: (c + gamma) / (N + B * gamma).
    This is equivalent to adding gamma to the count of every possible sample outcome and then
    computing the MLE of the resulting frequency distribution:"""
    SUM_TO_ONE = False

    def __init__(self, freqdist, gamma, bins=None):
        """
        Lidstone is used to compute the probability distribution that generated freqdist.
        The parameter freqdist is the frequency distribution on which the probability estimates are based.
        The parameter bins is the number of sample values that can be obtained from the probability
        distribution; the probabilities must sum to 1:
        """
        if (bins == 0) or (bins is None and freqdist.N() == 0):
            name = self.__class__.__name__[:-8]
            raise ValueError('A %s probability distribution ' % name + 'must have at least one bin.')
        if (bins is not None) and (bins < freqdist.B()):
            name = self.__class__.__name__[:-8]
            raise ValueError('\nThe number of bins in a %s distribution ' % name +
                             '(%d) must be greater than or equal to\n' % bins +
                             'the number of bins in the FreqDist used ' +
                             'to create it (%d).' % freqdist.B())
        self._freqdist = freqdist
        self._gamma = float(gamma)
        self._N = self._freqdist.N()
        if bins is None:
            bins = freqdist.B()
        self._bins = bins
        self._divisor = self._N + bins * gamma
        if self._divisor == 0.0:
            # In extreme cases we force the probability to be 0,
            # which it will be, since the count will be 0:
            self._gamma = 0
            self._divisor = 1

    def freqdist(self):
        """
        This function returns the frequency distribution on which the probability distribution is based:
        """
        return self._freqdist

    def prob(self, sample):
        c = self._freqdist[sample]
        return (c + self._gamma) / self._divisor

    def max(self):
        # To obtain the most probable sample, choose the one
        # that occurs most frequently.
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def discount(self):
        gb = self._gamma * self._bins
        return gb / (self._N + gb)

    def __repr__(self):
        """
        String representation of the ProbDist is obtained.
        """
        return '<LidstoneProbDist based on %d samples>' % self._freqdist.N()
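As a quick check of the Lidstone formula (c + gamma) / (N + B * gamma) described above, here is a small sketch; the toy counts, the gamma value, and the bins value are made up for illustration:
from nltk.probability import FreqDist, LidstoneProbDist
# Toy counts: c('a') = 5, c('b') = 2, so N = 7 and 2 observed bins.
fd = FreqDist({'a': 5, 'b': 2})
# gamma = 0.5, bins = 3 (one extra bin for an unseen word 'c').
lid = LidstoneProbDist(fd, gamma=0.5, bins=3)
# Seen sample:   (5 + 0.5) / (7 + 3 * 0.5) = 5.5 / 8.5
print(lid.prob('a'))
# Unseen sample: (0 + 0.5) / (7 + 3 * 0.5) = 0.5 / 8.5
print(lid.prob('c'))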
class LaplaceProbDist(LidstoneProbDist):
    """
    This class is used to obtain a smoothed frequency distribution. It computes the probability
    of a sample from the count c, the total number of sample outcomes N, and the number of
    sample values B that can be generated, using the formula:
    (c + 1) / (N + B)
    This is equivalent to adding 1 to the count of every possible sample outcome and taking
    the maximum likelihood estimate of the resulting frequency distribution:
    """
    def __init__(self, freqdist, bins=None):
        """
        LaplaceProbDist is used to obtain the probability distribution that generated freqdist.
        The parameter freqdist is the frequency distribution on which the probability estimates
        are based. The parameter bins is the number of sample values that can be generated.
        The probabilities must sum to 1:
        """
        LidstoneProbDist.__init__(self, freqdist, 1, bins)

    def __repr__(self):
        """
        String representation of the ProbDist is obtained.
        """
        return '<LaplaceProbDist based on %d samples>' % self._freqdist.N()
class ELEProbDist(LidstoneProbDist):
    """
    This class is used to obtain a smoothed frequency distribution. It computes the probability
    of a sample from the count c, the total number of sample outcomes N, and the number of
    sample values B that can be generated, using the formula:
    (c + 0.5) / (N + B/2)
    This is equivalent to adding 0.5 to the count of every possible sample outcome and taking
    the maximum likelihood estimate of the resulting frequency distribution:
    """
    def __init__(self, freqdist, bins=None):
        """
        The expected likelihood estimate is used to obtain the probability distribution that
        generated freqdist. The parameter freqdist is the frequency distribution on which the
        probability estimates are based. The parameter bins is the number of sample values
        that can be generated. The probabilities must sum to 1:
        """
        LidstoneProbDist.__init__(self, freqdist, 0.5, bins)

    def __repr__(self):
        """
        String representation of the ProbDist is obtained.
        """
        return '<ELEProbDist based on %d samples>' % self._freqdist.N()
class WittenBellProbDist(ProbDistI):
    """
    The WittenBellProbDist class is used to obtain a probability distribution. Based on the
    frequencies of the samples seen so far, it assigns a uniform probability mass to as-yet-unseen
    samples. The probability mass reserved for unseen samples is computed as:
    T / (N + T)
    Here, T is the number of observed sample types and N is the total number of observed events.
    This probability mass equals the maximum likelihood estimate of a new sample occurring.
    All probabilities sum to 1:
    Here,
    p = T / Z (N + T), if count = 0
    p = c / (N + T), otherwise
    """
    def __init__(self, freqdist, bins=None):
        """
        This code obtains the probability distribution. It assigns a uniform probability mass
        to unknown samples. The probability mass of a sample is computed as:
        T / (N + T)
        Here, T is the number of observed sample types and N is the total number of observed
        events. This probability mass equals the maximum likelihood estimate of a new sample
        occurring. All probabilities sum to 1:
        p = T / Z (N + T), if count = 0
        p = c / (N + T), otherwise
        Z is a normalizing factor computed from these values and the bins value. The parameter
        freqdist is the frequency counts from which the probability distribution is estimated.
        The parameter bins is the number of possible sample types:
        """
        assert bins is None or bins >= freqdist.B(), \
            'bins parameter must not be less than %d=freqdist.B()' % freqdist.B()
        if bins is None:
            bins = freqdist.B()
        self._freqdist = freqdist
        self._T = self._freqdist.B()
        self._Z = bins - self._freqdist.B()
        self._N = self._freqdist.N()
        # self._P0 is P(0), precalculated for efficiency:
        if self._N == 0:
            # if freqdist is empty, we approximate P(0) by a UniformProbDist:
            self._P0 = 1.0 / self._Z
        else:
            self._P0 = self._T / float(self._Z * (self._N + self._T))

    def prob(self, sample):
        # inherit docs from ProbDistI
        c = self._freqdist[sample]
        return (c / float(self._N + self._T) if c != 0 else self._P0)

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def freqdist(self):
        return self._freqdist

    def discount(self):
        raise NotImplementedError()

    def __repr__(self):
        """
        String representation of the ProbDist is obtained.
        """
        return '<WittenBellProbDist based on %d samples>' % self._freqdist.N()
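A brief usage sketch of the Witten-Bell estimate described above; the toy counts and the bins value are made up for illustration:
from nltk.probability import FreqDist, WittenBellProbDist
# Toy counts: T = 2 observed sample types, N = 7 observed events.
fd = FreqDist({'a': 5, 'b': 2})
# bins = 4 possible sample types, so Z = bins - T = 2 unseen types.
wb = WittenBellProbDist(fd, bins=4)
# Seen sample:   c / (N + T) = 5 / 9
print(wb.prob('a'))
# Unseen sample: T / (Z * (N + T)) = 2 / (2 * 9) = 1 / 9
print(wb.prob('c'))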
import nltk
from nltk.probability import *
# Wrap MLEProbDist so it matches the (freqdist, bins) estimator signature expected by train_and_test:
mle = lambda fd, bins: MLEProbDist(fd)
print(train_and_test(mle))
print(train_and_test(LaplaceProbDist))
print(train_and_test(ELEProbDist))
def lidstone(gamma):
    return lambda fd, bins: LidstoneProbDist(fd, gamma, bins)
print(train_and_test(lidstone(0.1)))
print(train_and_test(lidstone(0.5)))
print(train_and_test(lidstone(1.0)))
import nltk
corpus = nltk.corpus.brown.tagged_sents(categories='adventure')[:700]
print(len(corpus))
from nltk.util import unique_list
tag_set = unique_list(tag for sent in corpus for (word, tag) in sent)
print(len(tag_set))
symbols = unique_list(word for sent in corpus for (word, tag) in sent)
print(len(symbols))
trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
train_corpus = []
test_corpus = []
for i in range(len(corpus)):
    if i % 10:
        train_corpus += [corpus[i]]
    else:
        test_corpus += [corpus[i]]
print(len(train_corpus))
print(len(test_corpus))
def train_and_test(est):
    hmm = trainer.train_supervised(train_corpus, estimator=est)
    print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
Smoothing is used to handle words that have not been seen before.
import nltk
corpus=u" hello how are you doing ? Hope you find the book interesting. ".split()
sentence=u"how are you doing".split()
vocabulary=set(corpus)
print(len(vocabulary))
cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus))
#The corpus counts of each bigram in the sentence:
print([cfd[a][b] for (a,b) in nltk.bigrams(sentence)])
#The counts for each word in the sentence:
print([cfd[a].N() for (a,b) in nltk.bigrams(sentence)])
#There is already a FreqDist method for MLE probability:
print([cfd[a].freq(b) for (a,b) in nltk.bigrams(sentence)])
#Laplace smoothing of each bigram count:
print([1 + cfd[a][b] for (a,b) in nltk.bigrams(sentence)])
#we need to normalize the counts for each word:
print([len(vocabulary) + cfd[a].N() for (a,b) in nltk.bigrams(sentence)])
#the smoothed Laplace probability for each bigram:
print([1.0 * (1+cfd[a][b]) / (len(vocabulary)+cfd[a].N()) for (a,b) in nltk.bigrams(sentence)])
#MLEProbDist is the unsmoothed probability distribution:
cpd_mle = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist, bins=len(vocabulary))
#Now we can get the MLE probabilities by using the .prob method:
print([cpd_mle[a].prob(b) for (a,b) in nltk.bigrams(sentence)])
#LaplaceProbDist is the add_one smoothed ProbDist:
cpd_laplace = nltk.ConditionalProbDist(cfd, nltk.LaplaceProbDist, bins=len(vocabulary))
#Getting the Laplace probabilities is the same as for MLE:
print([cpd_laplace[a].prob(b) for (a,b) in nltk.bigrams(sentence)])
class SimpleGoodTuringProbDist(ProbDistI):
    """Given a pair (pi, qi), where pi refers to the frequency and qi refers to the frequency
    of frequency, our aim is to minimize the squared variation. E(p) and E(q) are the means of
    pi and qi.
    - slope:     b = sigma((pi - E(p))(qi - E(q))) / sigma((pi - E(p))(pi - E(p)))
    - intercept: a = E(q) - b * E(p)"""
    SUM_TO_ONE = False

    def __init__(self, freqdist, bins=None):
        """The parameter freqdist refers to the frequency counts from which the probability
        distribution is estimated. The parameter bins is used to estimate the possible number
        of samples."""
        assert bins is None or bins > freqdist.B(), \
            'bins parameter must not be less than %d=freqdist.B()+1' % (freqdist.B() + 1)
        if bins is None:
            bins = freqdist.B() + 1
        self._freqdist = freqdist
        self._bins = bins
        r, nr = self._r_Nr()
        self.find_best_fit(r, nr)
        self._switch(r, nr)
        self._renormalize(r, nr)

    def _r_Nr_non_zero(self):
        r_Nr = self._freqdist.r_Nr()
        del r_Nr[0]
        return r_Nr

    def _r_Nr(self):
        """Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0."""
        nonzero = self._r_Nr_non_zero()
        if not nonzero:
            return [], []
        return zip(*sorted(nonzero.items()))
    def find_best_fit(self, r, nr):
        """Use simple linear regression to tune the parameters self._slope and self._intercept
        in the log-log space based on count and Nr(count). (Work in log space to avoid floating
        point underflow.)"""
        # For higher sample frequencies the data points become horizontal
        # along the line Nr=1. To create a more evident linear model in log-log
        # space, we average positive Nr values with the surrounding zero
        # values. (Church and Gale, 1991)
        if not r or not nr:
            # Empty r or nr?
            return
        zr = []
        for j in range(len(r)):
            i = (r[j-1] if j > 0 else 0)
            k = (2 * r[j] - i if j == len(r) - 1 else r[j+1])
            zr_ = 2.0 * nr[j] / (k - i)
            zr.append(zr_)
        log_r = [math.log(i) for i in r]
        log_zr = [math.log(i) for i in zr]
        xy_cov = x_var = 0.0
        x_mean = 1.0 * sum(log_r) / len(log_r)
        y_mean = 1.0 * sum(log_zr) / len(log_zr)
        for (x, y) in zip(log_r, log_zr):
            xy_cov += (x - x_mean) * (y - y_mean)
            x_var += (x - x_mean)**2
        self._slope = (xy_cov / x_var if x_var != 0 else 0.0)
        if self._slope >= -1:
            warnings.warn('SimpleGoodTuring did not find a proper best fit '
                          'line for smoothing probabilities of occurrences. '
                          'The probability estimates are likely to be '
                          'unreliable.')
        self._intercept = y_mean - self._slope * x_mean
    def _switch(self, r, nr):
        """Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr]."""
        for i, r_ in enumerate(r):
            if len(r) == i + 1 or r[i+1] != r_ + 1:
                # We are at the end of r, or there is a gap in r
                self._switch_at = r_
                break
            Sr = self.smoothedNr
            smooth_r_star = (r_ + 1) * Sr(r_ + 1) / Sr(r_)
            unsmooth_r_star = 1.0 * (r_ + 1) * nr[i+1] / nr[i]
            std = math.sqrt(self._variance(r_, nr[i], nr[i+1]))
            if abs(unsmooth_r_star - smooth_r_star) <= 1.96 * std:
                self._switch_at = r_
                break

    def _variance(self, r, nr, nr_1):
        r = float(r)
        nr = float(nr)
        nr_1 = float(nr_1)
        return (r + 1.0)**2 * (nr_1 / nr**2) * (1.0 + nr_1 / nr)
    def _renormalize(self, r, nr):
        """Renormalization is essential for obtaining a proper probability distribution. It is
        done by estimating the probability of the unseen samples as N(1)/N and then renormalizing
        all the probabilities of the previously seen samples:"""
        prob_cov = 0.0
        for r_, nr_ in zip(r, nr):
            prob_cov += nr_ * self._prob_measure(r_)
        if prob_cov:
            self._renormal = (1 - self._prob_measure(0)) / prob_cov

    def smoothedNr(self, r):
        """Return the number of samples with count r."""
        # Nr = a*r^b (with b < -1 to give the appropriate hyperbolic relationship)
        # Estimate a and b by a simple linear regression technique on
        # the logarithmic form of the equation: log Nr = a + b*log(r)
        return math.exp(self._intercept + self._slope * math.log(r))
    def prob(self, sample):
        """Return the sample's probability."""
        count = self._freqdist[sample]
        p = self._prob_measure(count)
        if count == 0:
            if self._bins == self._freqdist.B():
                p = 0.0
            else:
                p = p / (1.0 * self._bins - self._freqdist.B())
        else:
            p = p * self._renormal
        return p

    def _prob_measure(self, count):
        if count == 0 and self._freqdist.N() == 0:
            return 1.0
        elif count == 0 and self._freqdist.N() != 0:
            return 1.0 * self._freqdist.Nr(1) / self._freqdist.N()
        if self._switch_at > count:
            Er_1 = 1.0 * self._freqdist.Nr(count + 1)
            Er = 1.0 * self._freqdist.Nr(count)
        else:
            Er_1 = self.smoothedNr(count + 1)
            Er = self.smoothedNr(count)
        r_star = (count + 1) * Er_1 / Er
        return r_star / self._freqdist.N()
    def check(self):
        prob_sum = 0.0
        for i in range(0, len(self._Nr)):
            prob_sum += self._Nr[i] * self._prob_measure(i) / self._renormal
        print("Probability Sum:", prob_sum)
        #assert prob_sum != 1.0, "probability sum should be one!"

    def discount(self):
        """It is used to provide the total probability transferred from the seen events to the
        unseen events."""
        return 1.0 * self.smoothedNr(1) / self._freqdist.N()

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def freqdist(self):
        return self._freqdist

    def __repr__(self):
        """It obtains the string representation of the ProbDist."""
        return '<SimpleGoodTuringProbDist based on %d samples>' % self._freqdist.N()
gt = lambda fd, bins: SimpleGoodTuringProbDist(fd, bins=1e5)
print(train_and_test(gt))
import nltk
corpus = [[((x[0], y[0], z[0]), (x[1], y[1], z[1])) for x, y, z in nltk.trigrams(sent)] for sent in corpus[:100]]
tag_set = unique_list(tag for sent in corpus for (word, tag) in sent)
print(len(tag_set))
symbols = unique_list(word for sent in corpus for (word, tag) in sent)
print(len(symbols))
trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
train_corpus = []
test_corpus = []
for i in range(len(corpus)):
    if i % 10:
        train_corpus += [corpus[i]]
    else:
        test_corpus += [corpus[i]]
print(len(train_corpus))
print(len(test_corpus))
kn = lambda fd, bins: KneserNeyProbDist(fd)
print(train_and_test(kn))
print(train_and_test(WittenBellProbDist))
The Katz backoff model can be regarded as a productive n-gram language model: given the preceding context of a particular token in an n-gram, the model can compute its conditional probability. According to this model, if the n-gram has occurred more than n times in the training data, the conditional probability of the token given its preceding context is proportional to the MLE of that n-gram. Otherwise, the conditional probability equals the backoff conditional probability of the (n-1)-gram.
def prob(self, word, context):
    """Evaluate the probability of this word in this context using Katz Backoff.
    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)"""
    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self.n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self.backoff.prob(word, context[1:])
The word 'captivating' occurs five times in the training data: three times before 'by' and twice before 'the'. With an additive smoothing model, 'a' and 'new' would have the same frequency of occurrence before 'captivating'. We can build an interpolated model that combines a unigram model and a bigram model.
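A minimal sketch of such an interpolated model (the mixing weight lam and the toy corpus below are invented for illustration; this is not the book's implementation):
import nltk
# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ate the fish".split()
cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus))
fd = nltk.FreqDist(corpus)
def interpolated_prob(prev, word, lam=0.7):
    """Interpolate a bigram MLE estimate with a unigram MLE estimate."""
    bigram_mle = cfd[prev].freq(word)   # P(word | prev)
    unigram_mle = fd.freq(word)         # P(word)
    return lam * bigram_mle + (1 - lam) * unigram_mle
print(interpolated_prob('the', 'cat'))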
The code presented in the nltk.model.ngram module for evaluating the perplexity of a text is as follows:
def perplexity(self, text):
    return pow(2.0, self.entropy(text))
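As a rough standalone illustration of this relationship (the per-word probabilities below are made up and do not use the nltk.model API): a model that assigns probability 0.25 to every word of a text has a per-word entropy of 2 bits, and therefore a perplexity of 4.
import math
# Hypothetical per-word probabilities assigned by some model to a 4-word text.
word_probs = [0.25, 0.25, 0.25, 0.25]
# Per-word cross-entropy in bits.
entropy = -sum(math.log(p, 2) for p in word_probs) / len(word_probs)
# Perplexity = 2 ** entropy; here it prints 4.0.
print(pow(2.0, entropy))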
There are several ways to perform posterior computations in Markov Chain Monte Carlo (MCMC). One approach is to use the Metropolis-Hastings sampler. To implement the Metropolis-Hastings algorithm, we need a standard uniform distribution, a proposal distribution, and a target distribution that is proportional to the posterior probability.
In each iteration, we draw a proposal value for the new value of each particular parameter.
# Imports needed by the MCMC examples below (NumPy, SciPy, matplotlib, functools):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from functools import partial

def bern(theta, z, N):
    """Bernoulli likelihood with N trials and z successes."""
    return np.clip(theta**z * (1 - theta)**(N - z), 0, 1)

def bern2(theta1, theta2, z1, z2, N1, N2):
    """Bernoulli likelihood with N trials and z successes."""
    return bern(theta1, z1, N1) * bern(theta2, z2, N2)

def make_thetas(xmin, xmax, n):
    xs = np.linspace(xmin, xmax, n)
    widths = (xs[1:] - xs[:-1]) / 2.0
    thetas = xs[:-1] + widths
    return thetas

def make_plots(X, Y, prior, likelihood, posterior, projection=None):
    fig, ax = plt.subplots(1, 3, subplot_kw=dict(projection=projection, aspect='equal'), figsize=(12, 3))
    if projection == '3d':
        ax[0].plot_surface(X, Y, prior, alpha=0.3, cmap=plt.cm.jet)
        ax[1].plot_surface(X, Y, likelihood, alpha=0.3, cmap=plt.cm.jet)
        ax[2].plot_surface(X, Y, posterior, alpha=0.3, cmap=plt.cm.jet)
    else:
        ax[0].contour(X, Y, prior)
        ax[1].contour(X, Y, likelihood)
        ax[2].contour(X, Y, posterior)
    ax[0].set_title('Prior')
    ax[1].set_title('Likelihood')
    ax[2].set_title('Posterior')
    plt.tight_layout()

thetas1 = make_thetas(0, 1, 101)
thetas2 = make_thetas(0, 1, 101)
X, Y = np.meshgrid(thetas1, thetas2)
a = 2
b = 3
z1 = 11
N1 = 14
z2 = 7
N2 = 14
prior = lambda theta1, theta2: stats.beta(a, b).pdf(theta1) * stats.beta(a, b).pdf(theta2)
lik = partial(bern2, z1=z1, z2=z2, N1=N1, N2=N2)
target = lambda theta1, theta2: prior(theta1, theta2) * lik(theta1, theta2)
theta = np.array([0.5, 0.5])
niters = 10000
burnin = 500
sigma = np.diag([0.2, 0.2])
thetas = np.zeros((niters - burnin, 2), float)
for i in range(niters):
    new_theta = stats.multivariate_normal(theta, sigma).rvs()
    p = min(target(*new_theta) / target(*theta), 1)
    if np.random.rand() < p:
        theta = new_theta
    if i >= burnin:
        thetas[i - burnin] = theta
kde = stats.gaussian_kde(thetas.T)
XY = np.vstack([X.ravel(), Y.ravel()])
posterior_metropolis = kde(XY).reshape(X.shape)
make_plots(X, Y, prior(X, Y), lik(X, Y), posterior_metropolis)
make_plots(X, Y, prior(X, Y), lik(X, Y), posterior_metropolis, projection='3d')
a = 2
b = 3
z1 = 11
N1 = 14
z2 = 7
N2 = 14
prior = lambda theta1, theta2: stats.beta(a, b).pdf(theta1) * stats.beta(a, b).pdf(theta2)
lik = partial(bern2, z1=z1, z2=z2, N1=N1, N2=N2)
target = lambda theta1, theta2: prior(theta1, theta2) * lik(theta1, theta2)
theta = np.array([0.5, 0.5])
niters = 10000
burnin = 500
sigma = np.diag([0.2, 0.2])
thetas = np.zeros((niters - burnin, 2), float)
for i in range(niters):
    theta = [stats.beta(a + z1, b + N1 - z1).rvs(), theta[1]]
    theta = [theta[0], stats.beta(a + z2, b + N2 - z2).rvs()]
    if i >= burnin:
        thetas[i - burnin] = theta
kde = stats.gaussian_kde(thetas.T)
XY = np.vstack([X.ravel(), Y.ravel()])
posterior_gibbs = kde(XY).reshape(X.shape)
make_plots(X, Y, prior(X, Y), lik(X, Y), posterior_gibbs)
make_plots(X, Y, prior(X, Y), lik(X, Y), posterior_gibbs, projection='3d')
“”"***笔者的话:整理了《精通Python自然语言处理》的第二章内容。总结还算详细,书中的每段代码都有。希望对阅读这本书的人有所帮助。FIGHTING...(热烈欢迎大家批评指正,互相讨论)
(you cannot find peace by avoiding life.) ***"""