Entity resolution is a common but difficult problem in data cleaning and data integration. Here we show how to use Spark to apply powerful, scalable text-analysis techniques and perform entity resolution across datasets. Entity resolution describes the process of combining records from different data sources that refer to the same real-world entity; other common names for it are entity linking, duplicate detection, record matching, object identification, and data fusion. In short, it is the task of finding, across different data sources (e.g., data files, books, websites, databases), the records that describe the same entity.
Here we work with records from two different databases. The Amazon records have the schema:
"id","title","description","manufacturer","price"
and the Google records have the schema:
"id","name","description","manufacturer","price"
import re

DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'

def removeQuotes(s):
    """ Remove quotation marks from an input string """
    return ''.join(i for i in s if i != '"')

def parseDatafileLine(datafileLine):
    """ Parse a data file line into ((id, 'title description manufacturer'), flag) """
    match = re.search(DATAFILE_PATTERN, datafileLine)
    if match is None:
        # The line does not follow the expected CSV layout
        print 'Invalid datafile line: %s' % datafileLine
        return (datafileLine, -1)
    elif match.group(1) == '"id"':
        # Header line
        print 'Header datafile line: %s' % datafileLine
        return (datafileLine, 0)
    else:
        # Concatenate title/name, description and manufacturer into one text field
        product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
        return ((removeQuotes(match.group(1)), product), 1)
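As a quick sanity check, the hypothetical line below (made up to follow the Amazon schema above, not a row from the real file) should come back as the unquoted id paired with the concatenated title/description/manufacturer text and a flag of 1; any quotation marks left inside the text are discarded later during tokenization:
# Hypothetical sample line in the Amazon schema shown above
sampleLine = '"b00000abc1","some product title","a short description","some maker","12.99"'
print parseDatafileLine(sampleLine)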
Now load the data files.
import sys
import os
from databricks_test_helper import Test

data_dir = os.path.join('databricks-datasets', 'cs100', 'lab3', 'data-001')

GOOGLE_PATH = 'Google.csv'
GOOGLE_SMALL_PATH = 'Google_small.csv'
AMAZON_PATH = 'Amazon.csv'
AMAZON_SMALL_PATH = 'Amazon_small.csv'
GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
STOPWORDS_PATH = 'stopwords.txt'

def parseData(filename):
    """ Parse a data file into an RDD of ((id, text), flag) pairs """
    return (sc
            .textFile(filename, 4, 0)
            .map(parseDatafileLine)
            .cache())

def loadData(path):
    """ Load a data file, report parse failures, and return the valid records """
    filename = 'dbfs:/' + os.path.join(data_dir, path)
    raw = parseData(filename).cache()
    failed = (raw
              .filter(lambda s: s[1] == -1)
              .map(lambda s: s[0]))
    for line in failed.take(10):
        print '%s - Invalid datafile line: %s' % (path, line)
    valid = (raw
             .filter(lambda s: s[1] == 1)
             .map(lambda s: s[0])
             .cache())
    print '%s - Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (path,
                                                                                          raw.count(),
                                                                                          valid.count(),
                                                                                          failed.count())
    assert failed.count() == 0
    # Each file should contain exactly one header line plus the valid records
    assert raw.count() == (valid.count() + 1)
    return valid
googleSmall = loadData(GOOGLE_SMALL_PATH)
google = loadData(GOOGLE_PATH)
amazonSmall = loadData(AMAZON_SMALL_PATH)
amazon = loadData(AMAZON_PATH)
Check the loaded data by printing the first three records from each small dataset.
for line in googleSmall.take(3):
    print 'google: %s: %s\n' % (line[0], line[1])

for line in amazonSmall.take(3):
    print 'amazon: %s: %s\n' % (line[0], line[1])
Text similarity: bag-of-words analysis
A simple approach to entity resolution is to treat every record as a single string and measure how related two records are with a distance function. Bag of words is an easy-to-understand yet powerful approach to text analysis: the idea is to treat a piece of text as an unordered collection of words, or tokens, i.e., a bag of words.
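As a minimal sketch of the idea (the two strings and the overlap measure below are made up for illustration and are not part of the lab code), a bag of words can be represented with collections.Counter, and a very crude similarity is the number of token occurrences two bags share:
from collections import Counter

# Two made-up product descriptions turned into bags (multisets) of tokens
bagA = Counter('adobe photoshop cs3 for windows'.split())
bagB = Counter('adobe illustrator cs3 for mac'.split())

# Naive similarity: how many token occurrences the two bags have in common
shared = sum((bagA & bagB).values())
print bagA
print bagB
print 'shared tokens: %d' % shared   # 'adobe', 'cs3', 'for' -> 3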
First, tokenize the strings:
quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
split_regex = r'\W+'

def simpleTokenize(string):
    """ Lower-case a string and split it on non-word characters, dropping empty tokens """
    s = string.lower()
    a = re.split(split_regex, s)
    b = filter(None, a)
    return b

print simpleTokenize(quickbrownfox)
Next, remove stopwords, i.e., common words that carry little meaning:
stopfile = os.path.join(data_dir, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords

def tokenize(string):
    """ Tokenize a string and drop any stopwords """
    a = simpleTokenize(string)
    b = filter(lambda x: x not in stopwords, a)
    return b

print tokenize(quickbrownfox)
Tokenize both datasets and count the tokens:
amazonRecToToken = amazonSmall.map(lambda (k, v): (k, tokenize(v)))
googleRecToToken = googleSmall.map(lambda (k, v): (k, tokenize(v)))

def countTokens(vendorRDD):
    """ Count the total number of tokens in an RDD of (id, token list) pairs """
    return vendorRDD.flatMap(lambda (k, v): v).count()

totalTokens = countTokens(amazonRecToToken) + countTokens(googleRecToToken)
print 'There are %s tokens in the combined datasets' % totalTokens
Find the Amazon record with the most tokens:
def findBiggestRecord(vendorRDD):
    """ Return the (id, token list) pair with the largest number of tokens """
    return vendorRDD.takeOrdered(1, key=lambda (k, v): -len(v))

biggestRecordAmazon = findBiggestRecord(amazonRecToToken)
print 'The Amazon record with ID "%s" has the most tokens (%s)' % (biggestRecordAmazon[0][0],
                                                                   len(biggestRecordAmazon[0][1]))
Text similarity: weighted bag of words with TF-IDF
Treating every token as equally important does not work well, because tokens differ in importance. Once weights are introduced, comparing two documents means summing the weights of their tokens rather than simply counting occurrences. A common weighting scheme is term frequency / inverse document frequency, or TF-IDF.
TF (term frequency) rewards tokens that appear often within a document. For example, if a document contains 100 tokens and token t appears 5 times in it, then the TF of t is 5/100; the more frequently a token occurs within a document, the more important it is for that document.
IDF (inverse document frequency) rewards tokens that appear in few documents across the whole corpus; here it is computed as N / n(t), where N is the total number of documents and n(t) is the number of documents containing token t.
TF-IDF is the product of these two weights.
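As a small worked example with made-up numbers: a document of 100 tokens in which token t appears 5 times gives TF(t) = 5/100 = 0.05; if the corpus holds 200 documents and t occurs in 20 of them, the IDF used in this lab is 200/20 = 10, so the TF-IDF weight of t in that document is 0.05 * 10 = 0.5.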
First, define a tf function that maps a list of tokens to their TF weights:
from collections import defaultdict

def tf(tokens):
    """ Map a list of tokens to a dictionary of token -> term frequency """
    d = defaultdict(int)
    l = len(tokens)
    for t in tokens:
        d[t] += 1
    rd = {k: v / float(l) for k, v in d.items()}
    return rd

print tf(tokenize(quickbrownfox))
# The IDF corpus is the union of the Amazon and Google records
corpusRDD = amazonRecToToken.union(googleRecToToken)
Next, define the IDF function:
def idfs(corpus):
    """ Compute IDF weights: for each token, N divided by the number of documents containing it """
    uniqueTokens = corpus.map(lambda (k, v): set(v))
    tokenCountPairTuple = uniqueTokens.flatMap(lambda x: x).map(lambda x: (x, 1))
    tokenSumPairTuple = tokenCountPairTuple.reduceByKey(lambda a, b: a + b)
    N = corpus.count()
    return (tokenSumPairTuple.map(lambda (k, v): (k, N / float(v))))

idfsSmall = idfs(corpusRDD)
uniqueTokenCount = idfsSmall.count()
print 'There are %s unique tokens in the small datasets.' % uniqueTokenCount
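To get a feel for the weights, it can help to look at the tokens with the smallest IDF values, i.e., the tokens that occur in the most documents; this quick check is an optional illustration and is not needed by the rest of the pipeline:
# The tokens with the smallest IDF values are the most common tokens in the corpus
smallIdfTokens = idfsSmall.takeOrdered(10, key=lambda s: s[1])
print smallIdfTokens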
Now we can define the TF-IDF function:
def tfidf(tokens, idfs):
    """ Compute TF-IDF weights: the product of each token's TF and its IDF """
    tfs = tf(tokens)
    tfIdfDict = {k: v * idfs[k] for k, v in tfs.items()}
    return tfIdfDict

recb000hkgj8k = amazonRecToToken.filter(lambda x: x[0] == 'b000hkgj8k').collect()[0][1]
idfsSmallWeights = idfsSmall.collectAsMap()
rec_b000hkgj8k_weights = tfidf(recb000hkgj8k, idfsSmallWeights)

recb000jz4hqo = amazonRecToToken.filter(lambda x: x[0] == 'b000jz4hqo').collect()[0][1]
rec_b000jz4hqo_weights = tfidf(recb000jz4hqo, idfsSmallWeights)

print 'Amazon record "b000hkgj8k" has tokens and weights:\n%s' % rec_b000hkgj8k_weights
print 'Amazon record "b000jz4hqo" has tokens and weights:\n%s' % rec_b000jz4hqo_weights
Text similarity: cosine similarity
Here we use cosine similarity to measure the distance between strings. Each document is treated as a vector in a high-dimensional space, and two documents are compared by computing the cosine of the angle between their vectors: cossim(a, b) = dotprod(a, b) / (norm(a) * norm(b)). The result is 1 for vectors pointing in the same direction and 0 for orthogonal vectors, and it does not depend on the lengths of the documents.
First, define some helper functions:
import math

def dotprod(a, b):
    """ Dot product of two sparse vectors represented as dictionaries """
    x = 0
    for ak in a.keys():
        if ak in b.keys():
            x += a[ak] * b[ak]
    return x

def norm(a):
    """ Euclidean norm of a sparse vector represented as a dictionary """
    p = 0
    for i in a.keys():
        p += pow(a[i], 2)
    return math.sqrt(p)

def cossim(a, b):
    """ Cosine similarity: the dot product divided by the product of the norms """
    return dotprod(a, b) / norm(a) / norm(b)
testVec1 = {'foo': 2, 'bar': 3, 'baz': 5 }
testVec2 = {'foo': 1, 'bar': 0, 'baz': 20 }
dp = dotprod(testVec1, testVec2)
nm = norm(testVec1)
cs = cossim(testVec1, testVec2)
print dp, nm, cs
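A handy property of cosine similarity, illustrated below with the made-up test vectors from above (this check is just an aside, not part of the lab), is that scaling a vector does not change the result, so long and short documents can be compared on equal footing:
# Scaling a vector changes its length but not its direction,
# so the cosine similarity stays the same (up to floating point)
testVec3 = {k: 10 * v for k, v in testVec1.items()}
print cossim(testVec1, testVec2)
print cossim(testVec3, testVec2)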
Now we can define the cosineSimilarity function:
def cosineSimilarity(string1, string2, idfsDictionary):
    """ Compute the cosine similarity between two strings using their TF-IDF weights """
    w1 = tfidf(tokenize(string1), idfsDictionary)
    w2 = tfidf(tokenize(string2), idfsDictionary)
    return cossim(w1, w2)

cossimAdobe = cosineSimilarity('Adobe Photoshop',
                               'Adobe Illustrator',
                               idfsSmallWeights)
print cossimAdobe