matdodo

[pySpark][note]Text Analysis and Entity Resolution

+

Text Analysis and Entity Resolution

Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products.

Entity Resolution, or “Record linkage” is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity. Our terms with the same meaning include, “entity disambiguation/linking”, duplicate detection”, “deduplication”, “record matching”, “(reference) reconciliation”, “object identification”, “data/information integration”, and “conflation”.

Code

This assignment can be completed using basic Python, pySpark Transformations and actions, and the plotting library matplotlib. Other libraries are not allowed.

Files

Data files for this assignment are from the metric-learning project and can be found at:

cs100/lab3

The directory contains the following files:

Google.csv, the Google Products dataset
Amazon.csv, the Amazon dataset
Google_small.csv, 200 records sampled from the Google data
Amazon_small.csv, 200 records sampled from the Amazon data
Amazon_Google_perfectMapping.csv, the “gold standard” mapping
stopwords.txt, a list of common English words

Besides the complete data files, there are “sample” data files for each dataset - we will use these for Part 1. In addition, there is a “gold standard” file that contains all of the true mappings between entities in the two datasets. Every row in the gold standard file has a pair of record IDs (one Google, one Amazon) that belong to two record that describe the same thing in the real world. We will use the gold standard to evaluate our algorithms.

Part 0: Preliminaries

We read in each of the files and create an RDD consisting of lines.

For each of the data files (“Google.csv”, “Amazon.csv”, and the samples), we want to parse the IDs out of each record. The IDs are the first column of the file (they are URLs for Google, and alphanumeric strings for Amazon). Omitting the headers, we load these data files into pair RDDs where the mapping ID is the key, and the value is a string consisting of the name/title, description, and manufacturer from the record.

The file format of an Amazon line is:

"id","title","description","manufacturer","price"

The file format of a Google line is:

"id","name","description","manufacturer","price"

import re
DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'

def removeQuotes(s):
    """ Remove quotation marks from an input string
    Args:
        s (str): input string that might have the quote "" characters
    Returns:
        str: a string without the quote characters
    """
    return ''.join(i for i in s if i!='"')


def parseDatafileLine(datafileLine):
    """ Parse a line of the data file using the specified regular expression pattern
    Args:
        datafileLine (str): input string that is a line from the data file
    Returns:
        str: a string parsed using the given regular expression and without the quote characters
    """
    match = re.search(DATAFILE_PATTERN, datafileLine)
    if match is None:
        print 'Invalid datafile line: %s' % datafileLine
        return (datafileLine, -1)
    elif match.group(1) == '"id"':
        print 'Header datafile line: %s' % datafileLine
        return (datafileLine, 0)
    else:
        product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
        return ((removeQuotes(match.group(1)), product), 1)


import sys  
import os
from test_helper import Test

baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab3')

GOOGLE_PATH = 'Google.csv'
GOOGLE_SMALL_PATH = 'Google_small.csv'
AMAZON_PATH = 'Amazon.csv'
AMAZON_SMALL_PATH = 'Amazon_small.csv'
GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
STOPWORDS_PATH = 'stopwords.txt'

def parseData(filename):
    """ Parse a data file
    Args:
        filename (str): input file name of the data file
    Returns:
        RDD: a RDD of parsed lines
    """
    return (sc
            .textFile(filename, 4, 0)
            .map(parseDatafileLine)
            .cache())

def loadData(path):
    """ Load a data file
    Args:
        path (str): input file name of the data file
    Returns:
        RDD: a RDD of parsed valid lines
    """
    filename = os.path.join(baseDir, inputPath, path)
    raw = parseData(filename).cache()
    failed = (raw
              .filter(lambda s: s[1] == -1)
              .map(lambda s: s[0]))
    for line in failed.take(10):
        print '%s - Invalid datafile line: %s' % (path, line)
    valid = (raw
             .filter(lambda s: s[1] == 1)
             .map(lambda s: s[0])
             .cache())
    print '%s - Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (path,
                                                                                        raw.count(),
                                                                                        valid.count(),
                                                                                        failed.count())
    assert failed.count() == 0
    assert raw.count() == (valid.count() + 1)
    return valid

googleSmall = loadData(GOOGLE_SMALL_PATH)
google = loadData(GOOGLE_PATH)
amazonSmall = loadData(AMAZON_SMALL_PATH)
amazon = loadData(AMAZON_PATH)

Google_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Google.csv - Read 3227 lines, successfully parsed 3226 lines, failed to parse 0 lines
Amazon_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Amazon.csv - Read 1364 lines, successfully parsed 1363 lines, failed to parse 0 lines

Let’s examine the lines that were just loaded in the two subset (small) files - one from Google and one from Amazon

for line in googleSmall.take(3):
    print 'google: %s: %s\n' % (line[0], line[1])

for line in amazonSmall.take(3):
    print 'amazon: %s: %s\n' % (line[0], line[1])

google: http://www.google.com/base/feeds/snippets/11448761432933644608: spanish vocabulary builder "expand your vocabulary! contains fun lessons that both teach and entertain you'll quickly find yourself mastering new terms. includes games and more!" 

google: http://www.google.com/base/feeds/snippets/8175198959985911471: topics presents: museums of world "5 cd-rom set. step behind the velvet rope to examine some of the most treasured collections of antiquities art and inventions. includes the following the louvre - virtual visit 25 rooms in full screen interactive video detailed map of the louvre ..." 

google: http://www.google.com/base/feeds/snippets/18445827127704822533: sierrahome hse hallmark card studio special edition win 98 me 2000 xp "hallmark card studio special edition (win 98 me 2000 xp)" "sierrahome"

amazon: b000jz4hqo: clickart 950 000 - premier image pack (dvd-rom)  "broderbund"

amazon: b0006zf55o: ca international - arcserve lap/desktop oem 30pk "oem arcserve backup v11.1 win 30u for laptops and desktops" "computer associates"

amazon: b00004tkvy: noah's ark activity center (jewel case ages 3-8)  "victory multimedia"

Part 1: ER as Text Similarity - Bags of Words

A simple approach to entity resolution is to treat all records as strings and compute their similarity with a string distance function. In this part, we will build some components for performing bag-of-words text-analysis, and then use them to compute record similarity.

Bag-of-words is a conceptually simple yet powerful approach to text analysis.

The idea is to treat strings, a.k.a. documents, as unordered collections of words, or tokens, i.e., as bags of words.

Note on terminology: a “token” is the result of parsing the document down to the elements we consider “atomic” for the task at hand. Tokens can be things like words, numbers, acronyms, or other exotica like word-roots or fixed-length character strings.

Bag of words techniques all apply to any sort of token, so when we say “bag-of-words” we really mean “bag-of-tokens,” strictly speaking.

1(a) Tokenize a String

Implement the function `simpleTokenize(string)` that takes a string and returns a list of non-empty tokens in the string. `simpleTokenize` should split strings using the provided regular expression. Since we want to make token-matching case insensitive, make sure all tokens are turned lower-case. Give an interpretation, in natural language, of what the regular expression, `split_regex`, matches.

If you need help with Regular Expressions, try the site regex101 where you can interactively explore the results of applying different regular expressions to strings. Note that \W includes the “_” character. You should use re.split() to perform the string split. Also, make sure you remove any empty tokens.

# TODO: Replace  with appropriate code
quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
split_regex = r'\W+'

def simpleTokenize(string):
    """ A simple implementation of input string tokenization
    Args:
        string (str): input string
    Returns:
        list: a list of tokens
    """
    lstr = string.lower()

    return [s for s in re.split(split_regex,lstr) if s != '']

print simpleTokenize(quickbrownfox) # Should give ['a', 'quick', 'brown', ... ]


['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']



# TEST Tokenize a String (1a)
Test.assertEquals(simpleTokenize(quickbrownfox),
                  ['a','quick','brown','fox','jumps','over','the','lazy','dog'],
                  'simpleTokenize should handle sample text')
Test.assertEquals(simpleTokenize(' '), [], 'simpleTokenize should handle empty string')
Test.assertEquals(simpleTokenize('!!!!123A/456_B/789C.123A'), ['123a','456_b','789c','123a'],
                  'simpleTokenize should handle puntuations and lowercase result')
Test.assertEquals(simpleTokenize('fox fox'), ['fox', 'fox'],
                  'simpleTokenize should not remove duplicates')

1 test passed.
1 test passed.
1 test passed.
1 test passed.

(1b) Removing stopwords

Stopwords are common (English) words that do not contribute much to the content or meaning of a document (e.g., “the”, “a”, “is”, “to”, etc.). Stopwords add noise to bag-of-words comparisons, so they are usually excluded.

Using the included file “stopwords.txt”, implement `tokenize`, an improved tokenizer that does not emit stopwords.

# TODO: Replace  with appropriate code
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords

def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords
    Args:
        string (str): input string
    Returns:
        list: a list of tokens without stopwords
    """
    return filter(lambda s:s not in stopwords,simpleTokenize(string))

print tokenize(quickbrownfox) # Should give ['quick', 'brown', ... ]

These are the stopwords: set([u'all', u'just', u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'with', u'had', u'should', u'to', u'only', u'under', u'ours', u'has', u'do', u'them', u'his', u'very', u'they', u'not', u'during', u'now', u'him', u'nor', u'did', u'these', u't', u'each', u'where', u'because', u'doing', u'theirs', u'some', u'are', u'our', u'ourselves', u'out', u'what', u'for', u'below', u'does', u'above', u'between', u'she', u'be', u'we', u'after', u'here', u'hers', u'by', u'on', u'about', u'of', u'against', u's', u'or', u'own', u'into', u'yourself', u'down', u'your', u'from', u'her', u'whom', u'there', u'been', u'few', u'too', u'themselves', u'was', u'until', u'more', u'himself', u'that', u'but', u'off', u'herself', u'than', u'those', u'he', u'me', u'myself', u'this', u'up', u'will', u'while', u'can', u'were', u'my', u'and', u'then', u'is', u'in', u'am', u'it', u'an', u'as', u'itself', u'at', u'have', u'further', u'their', u'if', u'again', u'no', u'when', u'same', u'any', u'how', u'other', u'which', u'you', u'who', u'most', u'such', u'why', u'a', u'don', u'i', u'having', u'so', u'the', u'yours', u'once'])
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']



# TEST Removing stopwords (1b)
Test.assertEquals(tokenize("Why a the?"), [], 'tokenize should remove all stopwords')
Test.assertEquals(tokenize("Being at the_?"), ['the_'], 'tokenize should handle non-stopwords')
Test.assertEquals(tokenize(quickbrownfox), ['quick','brown','fox','jumps','lazy','dog'],
                    'tokenize should handle sample text')

1 test passed.
1 test passed.
1 test passed.

(1c) Tokenizing the small datasets

Now let’s tokenize the two small datasets. For each ID in a dataset, `tokenize` the values, and then count the total number of tokens.

How many tokens, total, are there in the two datasets?

# TODO: Replace  with appropriate code
amazonRecToToken = amazonSmall.map(lambda line:(line[0],tokenize(line[1])))
googleRecToToken = googleSmall.map(lambda line:(line[0],tokenize(line[1])))

def countTokens(vendorRDD):
    """ Count and return the number of tokens
    Args:
        vendorRDD (RDD of (recordId, tokenizedValue)): Pair tuple of record ID to tokenized output
    Returns:
        count: count of all tokens
    """

    return vendorRDD.map(lambda line:len(line[1])).reduce(lambda a,b:a+b)

totalTokens = countTokens(amazonRecToToken) + countTokens(googleRecToToken)
print 'There are %s tokens in the combined datasets' % totalTokens

There are 22520 tokens in the combined datasets



# TEST Tokenizing the small datasets (1c)
Test.assertEquals(totalTokens, 22520, 'incorrect totalTokens')

1 test passed.

(1d) Amazon record with the most tokens

Which Amazon record has the biggest number of tokens?

In other words, you want to sort the records and get the one with the largest count of tokens.

# TODO: Replace  with appropriate code
def findBiggestRecord(vendorRDD):
    """ Find and return the record with the largest number of tokens
    Args:
        vendorRDD (RDD of (recordId, tokens)): input Pair Tuple of record ID and tokens
    Returns:
        list: a list of 1 Pair Tuple of record ID and tokens
    """
    return [vendorRDD.max(key=lambda (k,v):len(v))]

biggestRecordAmazon = findBiggestRecord(amazonRecToToken)
print 'The Amazon record with ID "%s" has the most tokens (%s)' % (biggestRecordAmazon[0][0],
                                                                   len(biggestRecordAmazon[0][1]))

The Amazon record with ID "b000o24l3q" has the most tokens (1547)



# TEST Amazon record with the most tokens (1d)
Test.assertEquals(biggestRecordAmazon[0][0], 'b000o24l3q', 'incorrect biggestRecordAmazon')
Test.assertEquals(len(biggestRecordAmazon[0][1]), 1547, 'incorrect len for biggestRecordAmazon')

1 test passed.
1 test passed.

Part 2: ER as Text Similarity - Weighted Bag-of-Words using TF-IDF

Bag-of-words comparisons are not very good when all tokens are treated the same: some tokens are more important than others. Weights give us a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of common tokens. A good heuristic for assigning weights is called “Term-Frequency/Inverse-Document-Frequency,” or TF-IDF for short.

TF

TF rewards tokens that appear many times in the same document. It is computed as the frequency of a token in a document, that is, if document d contains 100 tokens and token t appears in d 5 times, then the TF weight of t in d is 5/100 = 1/20. The intuition for TF is that if a word occurs often in a document, then it is more important to the meaning of the document.

IDF

#### Let N be the total number of documents in U
#### Find n(t), the number of documents in U that contain t
#### Then IDF(t) = N/n(t).

Note that n(t)/N is the frequency of t in U, and N/n(t) is the inverse frequency.

Note on terminology: Sometimes token weights depend on the document the token belongs to, that is, the same token may have a different weight when it’s found in different documents. We call these weights local weights. TF is an example of a local weight, because it depends on the length of the source. On the other hand, some token weights only depend on the token, and are the same everywhere that token is found. We call these weights global, and IDF is one such weight.

TF-IDF

Finally, to bring it all together, the total TF-IDF weight for a token in a document is the product of its TF and IDF weights.

(2a) Implement a TF function

Implement `tf(tokens)` that takes a list of tokens and returns a Python dictionary mapping tokens to TF weights.

The steps your function should perform are:

#### Create an empty Python dictionary
#### For each of the tokens in the input tokens list, count 1 for each occurance and add the token to the dictionary
For each of the tokens in the dictionary, divide the token’s count by the total number of tokens in the input tokens list

def tf(tokens):
“”” Compute TF
Args:
tokens (list of str): input list of tokens from tokenize
Returns:
dictionary: a dictionary of tokens to its TF values
“””
d = {}
for s in tokens:
if s in d:
d[s] += 1.0
else:
d[s] = 1.0
```
return {s:d[s]/len(tokens) for s in d}
```
print tf(tokenize(quickbrownfox)) # Should give { ‘quick’: 0.1666 … }

{‘brown’: 0.16666666666666666, ‘lazy’: 0.16666666666666666, ‘jumps’: 0.16666666666666666, ‘fox’: 0.16666666666666666, ‘dog’: 0.16666666666666666, ‘quick’: 0.16666666666666666}

# TEST Implement a TF function (2a)
tf_test = tf(tokenize(quickbrownfox))
Test.assertEquals(tf_test, {‘brown’: 0.16666666666666666, ‘lazy’: 0.16666666666666666,
‘jumps’: 0.16666666666666666, ‘fox’: 0.16666666666666666,
‘dog’: 0.16666666666666666, ‘quick’: 0.16666666666666666},
‘incorrect result for tf on sample text’)
tf_test2 = tf(tokenize(‘one_ one_ two!’))
Test.assertEquals(tf_test2, {‘one_’: 0.6666666666666666, ‘two’: 0.3333333333333333},
‘incorrect result for tf test’)

1 test passed.
1 test passed.

(2b) Create a corpus

Create a pair RDD called `corpusRDD`, consisting of a combination of the two small datasets, `amazonRecToToken` and `googleRecToToken`. Each element of the `corpusRDD` should be a pair consisting of a key from one of the small datasets (ID or URL) and the value is the associated value for that key from the small datasets.

# TODO: Replace  with appropriate code
corpusRDD = amazonRecToToken.union(googleRecToToken)# amazonRecToToken.fullOuterJoin(googleRecToToken).map(lambda (k,v):(k,v[0]if v[0]is not None else v[1]))


# TEST Create a corpus (2b)
Test.assertEquals(corpusRDD.count(), 400, 'incorrect corpusRDD.count()')
#print corpusRDD.first()

1 test passed.

(2c) Implement an IDFs function

Implement `idfs` that assigns an IDF weight to every unique token in an RDD called `corpus`. The function should return an pair RDD where the `key` is the unique token and value is the IDF weight for the token.

Recall that the IDF weight for a token, t, in a set of documents, U, is computed as follows:

#### Let N be the total number of documents in U.
#### Find n(t), the number of documents in U that contain t.
#### Then IDF(t) = N/n(t).

The steps your function should perform are:

#### Calculate N. Think about how you can calculate N from the input RDD.
#### Create an RDD (not a pair RDD) containing the unique tokens from each document in the input corpus. For each document, you should only include a token once, even if it appears multiple times in that document.
#### For each of the unique tokens, count how many times it appears in the document and then compute the IDF for that token: N/n(t)

Use your `idfs` to compute the IDF weights for all tokens in `corpusRDD` (the combined small datasets).

How many unique tokens are there?

# TODO: Replace  with appropriate code
def idfs(corpus):
    """ Compute IDF
    Args:
        corpus (RDD): input corpus
    Returns:
        RDD: a RDD of (token, IDF value)
    """
    N = corpus.count()
    uniqueTokens = corpus.flatMap(lambda (k,v):[(s,k) for s in v])
    tokenCountPairTuple = uniqueTokens.groupByKey()
    tokenSumPairTuple = tokenCountPairTuple.map(lambda (k,v):(k,len(set(v))))
    return (tokenSumPairTuple.map(lambda (k,v):(k,N*1.0/v)))

idfsSmall = idfs(amazonRecToToken.union(googleRecToToken))
uniqueTokenCount = idfsSmall.count()

print 'There are %s unique tokens in the small datasets.' % uniqueTokenCount

There are 4772 unique tokens in the small datasets.



# TEST Implement an IDFs function (2c)
Test.assertEquals(uniqueTokenCount, 4772, 'incorrect uniqueTokenCount')
tokenSmallestIdf = idfsSmall.takeOrdered(1, lambda s: s[1])[0]
Test.assertEquals(tokenSmallestIdf[0], 'software', 'incorrect smallest IDF token')
Test.assertTrue(abs(tokenSmallestIdf[1] - 4.25531914894) < 0.0000000001,
                'incorrect smallest IDF value')

1 test passed.
1 test passed.
1 test passed.

(2d) Tokens with the smallest IDF

Print out the 11 tokens with the smallest IDF in the combined small dataset.

smallIDFTokens = idfsSmall.takeOrdered(11, lambda s: s[1])
print smallIDFTokens

[('software', 4.25531914893617), ('new', 6.896551724137931), ('features', 6.896551724137931), ('use', 7.017543859649122), ('complete', 7.2727272727272725), ('easy', 7.6923076923076925), ('create', 8.333333333333334), ('system', 8.333333333333334), ('cd', 8.333333333333334), ('1', 8.51063829787234), ('windows', 8.51063829787234)]

(2e) IDF Histogram

Plot a histogram of IDF values. Be sure to use appropriate scaling and bucketing for the data.

First plot the histogram using `matplotlib`

import matplotlib.pyplot as plt

small_idf_values = idfsSmall.map(lambda s: s[1]).collect()
fig = plt.figure(figsize=(8,3))
plt.hist(small_idf_values, 50, log=True)
pass

(2f) Implement a TF-IDF function

Use your `tf` function to implement a `tfidf(tokens, idfs)` function that takes a list of tokens from a document and a Python dictionary of IDF weights and returns a Python dictionary mapping individual tokens to total TF-IDF weights.

The steps your function should perform are:

#### Calculate the token frequencies (TF) for tokens
#### Create a Python dictionary where each token maps to the token’s frequency times the token’s IDF weight

Use your `tfidf` function to compute the weights of Amazon product record ‘b000hkgj8k’. To do this, we need to extract the record for the token from the tokenized small Amazon dataset and we need to convert the IDFs for the small dataset into a Python dictionary. We can do the first part, by using a `filter()` transformation to extract the matching record and a `collect()` action to return the value to the driver. For the second part, we use the `collectAsMap()` action to return the IDFs to the driver as a Python dictionary.

# TODO: Replace  with appropriate code
def tfidf(tokens, idfs):
    """ Compute TF-IDF
    Args:
        tokens (list of str): input list of tokens from tokenize
        idfs (dictionary): record to IDF value
    Returns:
        dictionary: a dictionary of records to TF-IDF values
    """
    tfs = tf(tokens)
    tfIdfDict = {t:tfs[t]*idfs[t] for t in tfs}
    return tfIdfDict

recb000hkgj8k = amazonRecToToken.filter(lambda x: x[0] == 'b000hkgj8k').collect()[0][1]
idfsSmallWeights = idfsSmall.collectAsMap() # mark 
rec_b000hkgj8k_weights = tfidf(recb000hkgj8k, idfsSmallWeights)

print 'Amazon record "b000hkgj8k" has tokens and weights:\n%s' % rec_b000hkgj8k_weights

Amazon record "b000hkgj8k" has tokens and weights:
{'autocad': 33.33333333333333, 'autodesk': 8.333333333333332, 'courseware': 66.66666666666666, 'psg': 33.33333333333333, '2007': 3.5087719298245617, 'customizing': 16.666666666666664, 'interface': 3.0303030303030303}



# TEST Implement a TF-IDF function (2f)
Test.assertEquals(rec_b000hkgj8k_weights,
                   {'autocad': 33.33333333333333, 'autodesk': 8.333333333333332,
                    'courseware': 66.66666666666666, 'psg': 33.33333333333333,
                    '2007': 3.5087719298245617, 'customizing': 16.666666666666664,
                    'interface': 3.0303030303030303}, 'incorrect rec_b000hkgj8k_weights')

1 test passed.

Part 3: ER as Text Similarity - Cosine Similarity

Now we are ready to do text comparisons in a formal way. The metric of string distance we will use is called cosine similarity. We will treat each document as a vector in some high dimensional space. Then, to compare two documents we compute the cosine of the angle between their two document vectors. This is much easier than it sounds.

The first question to answer is how do we represent documents as vectors? The answer is familiar: bag-of-words! We treat each unique token as a dimension, and treat token weights as magnitudes in their respective token dimensions. For example, suppose we use simple counts as weights, and we want to interpret the string “Hello, world! Goodbye, world!” as a vector. Then in the “hello” and “goodbye” dimensions the vector has value 1, in the “world” dimension it has value 2, and it is zero in all other dimensions.

The next question is: given two vectors how do we find the cosine of the angle between them? Recall the formula for the dot product of two vectors:

$a \cdot b = ∥ a ∥ ∥ b ∥ cos θ$

Here a⋅b=∑aibi is the ordinary dot product of two vectors, and ∥a∥=∑a2i−−−−√ is the norm of a .

We can rearrange terms and solve for the cosine to find it is simply the normalized dot product of the vectors. With our vector model, the dot product and norm computations are simple functions of the bag-of-words document representations, so we now have a formal way to compute similarity:

$s i m i l a r i t y = cos θ = a \cdot b ∥ a ∥ ∥ b ∥ = \sum a i b i \sum a 2 i - - - - \sqrt \sum b 2 i - - - - \sqrt$

(3a) Implement the components of a `cosineSimilarity` function

Implement the components of a `cosineSimilarity` function.

Use the `tokenize` and `tfidf` functions, and the IDF weights from Part 2 for extracting tokens and assigning them weights.

The steps you should perform are:

#### Define a function dotprod that takes two Python dictionaries and produces the dot product of them, where the dot product is defined as the sum of the product of values for tokens that appear in both dictionaries
#### Define a function norm that returns the square root of the dot product of a dictionary and itself
Define a function cossim that returns the dot product of two dictionaries divided by the norm of the first dictionary and then by the norm of the second dictionary

# TODO: Replace with appropriate code
import math

def dotprod(a, b):
“”” Compute dot product
Args:
a (dictionary): first dictionary of record to value
b (dictionary): second dictionary of record to value
Returns:
dotProd: result of the dot product with the two input dictionaries
“””
return sum(a[k]*b[k]for k in a if k in b)

def norm(a):
“”” Compute square root of the dot product
Args:
a (dictionary): a dictionary of record to value
Returns:
norm: a dictionary of tokens to its TF values
“””
return math.sqrt(sum(a[k]**2 for k in a))

def cossim(a, b):
“”” Compute cosine similarity
Args:
a (dictionary): first dictionary of record to value
b (dictionary): second dictionary of record to value
Returns:
cossim: dot product of two dictionaries divided by the norm of the first dictionary and
then by the norm of the second dictionary
“””
return dotprod(a,b)/(norm(a)*norm(b))

testVec1 = {‘foo’: 2, ‘bar’: 3, ‘baz’: 5 }
testVec2 = {‘foo’: 1, ‘bar’: 0, ‘baz’: 20 }
dp = dotprod(testVec1, testVec2)
nm = norm(testVec1)
print dp, nm

102 6.16441400297

# TEST Implement the components of a cosineSimilarity function (3a)
Test.assertEquals(dp, 102, ‘incorrect dp’)
Test.assertTrue(abs(nm - 6.16441400297) < 0.0000001, ‘incorrrect nm’)

1 test passed.
1 test passed.

(3b) Implement a `cosineSimilarity` function

Implement a `cosineSimilarity(string1, string2, idfsDictionary)` function that takes two strings and a dictionary of IDF weights, and computes their cosine similarity in the context of some global IDF weights.

The steps you should perform are:

#### Apply your tfidf function to the tokenized first and second strings, using the dictionary of IDF weights
Compute and return your cossim function applied to the results of the two tfidf functions

# TODO: Replace with appropriate code
def cosineSimilarity(string1, string2, idfsDictionary):
“”” Compute cosine similarity between two strings
Args:
string1 (str): first string
string2 (str): second string
idfsDictionary (dictionary): a dictionary of IDF values
Returns:
cossim: cosine similarity value
“””
w1 = tfidf(tokenize(string1),idfsDictionary)
w2 = tfidf(tokenize(string2),idfsDictionary)
return cossim(w1, w2)

cossimAdobe = cosineSimilarity(‘Adobe Photoshop’,
‘Adobe Illustrator’,
idfsSmallWeights)

print cossimAdobe

0.0577243382163

# TEST Implement a cosineSimilarity function (3b)
Test.assertTrue(abs(cossimAdobe - 0.0577243382163) < 0.0000001, ‘incorrect cossimAdobe’)

1 test passed.

(3c) Perform Entity Resolution

Now we can finally do some entity resolution!

For every product record in the small Google dataset, use your `cosineSimilarity` function to compute its similarity to every record in the small Amazon dataset. Then, build a dictionary mapping `(Google URL, Amazon ID)` tuples to similarity scores between 0 and 1.

We’ll do this computation two different ways, first we’ll do it without a broadcast variable, and then we’ll use a broadcast variable

The steps you should perform are:

#### Create an RDD that is a combination of the small Google and small Amazon datasets that has as elements all pairs of elements (a, b) where a is in self and b is in other. The result will be an RDD of the form: [ ((Google URL1, Google String1), (Amazon ID1, Amazon String1)), ((Google URL1, Google String1), (Amazon ID2, Amazon String2)), ((Google URL2, Google String2), (Amazon ID1, Amazon String1)), ... ]
#### Define a worker function that given an element from the combination RDD computes the cosineSimlarity for the two records in the element
#### Apply the worker function to every element in the RDD

Now, compute the similarity between Amazon record `b000o24l3q` and Google record `http://www.google.com/base/feeds/snippets/17242822440574356561`.

# TODO: Replace  with appropriate code
crossSmall = (googleSmall
              .cartesian(amazonSmall)
              .cache())

def computeSimilarity(record):
    """ Compute similarity on a combination record
    Args:
        record: a pair, (google record, amazon record)
    Returns:
        pair: a pair, (google URL, amazon ID, cosine similarity value)
    """
    googleRec = record[0]
    amazonRec = record[1]
    googleURL = googleRec[0]
    amazonID = amazonRec[0]
    googleValue = googleRec[1]
    amazonValue = amazonRec[1]
    cs = cosineSimilarity(googleValue,amazonValue,idfsSmallWeights)
    return (googleURL, amazonID, cs)

similarities = (crossSmall
                .map(lambda line:computeSimilarity(line))
                .cache())

def similar(amazonID, googleURL):
    """ Return similarity value
    Args:
        amazonID: amazon ID
        googleURL: google URL
    Returns:
        similar: cosine similarity value
    """
    return (similarities
            .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
            .collect()[0][2])

similarityAmazonGoogle = similar('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogle

Requested similarity is 0.000303171940451.



# TEST Perform Entity Resolution (3c)
Test.assertTrue(abs(similarityAmazonGoogle - 0.000303171940451) < 0.0000001,
                'incorrect similarityAmazonGoogle')

1 test passed.

(3d) Perform Entity Resolution with Broadcast Variables

The solution in (3c) works well for small datasets, but it requires Spark to (automatically) send the `idfsSmallWeights` variable to all the workers. If we didn’t `cache()` similarities, then it might have to be recreated if we run `similar()` multiple times. This would cause Spark to send `idfsSmallWeights` every time.

Instead, we can use a broadcast variable - we define the broadcast variable in the driver and then we can refer to it in each worker. Spark saves the broadcast variable at each worker, so it is only sent once.

The steps you should perform are:

#### Define a computeSimilarityBroadcast function that given an element from the combination RDD computes the cosine simlarity for the two records in the element. This will be the same as the worker function computeSimilarity in (3c) except that it uses a broadcast variable.
#### Apply the worker function to every element in the RDD

Again, compute the similarity between Amazon record `b000o24l3q` and Google record `http://www.google.com/base/feeds/snippets/17242822440574356561`.

# TODO: Replace  with appropriate code
def computeSimilarityBroadcast(record):
    """ Compute similarity on a combination record, using Broadcast variable
    Args:
        record: a pair, (google record, amazon record)
    Returns:
        pair: a pair, (google URL, amazon ID, cosine similarity value)
    """
    googleRec = record[0]
    amazonRec = record[1]
    googleURL = googleRec[0]
    amazonID = amazonRec[0]
    googleValue = googleRec[1]
    amazonValue = amazonRec[1]
    cs = cosineSimilarity(googleValue,amazonValue,idfsSmallBroadcast.value)
    return (googleURL, amazonID, cs)

idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)
similaritiesBroadcast = (crossSmall
                         .map(lambda record:computeSimilarityBroadcast(record))
                         .cache())

def similarBroadcast(amazonID, googleURL):
    """ Return similarity value, computed using Broadcast variable
    Args:
        amazonID: amazon ID
        googleURL: google URL
    Returns:
        similar: cosine similarity value
    """
    return (similaritiesBroadcast
            .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
            .collect()[0][2])

similarityAmazonGoogleBroadcast = similarBroadcast('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogleBroadcast

Requested similarity is 0.000303171940451.



# TEST Perform Entity Resolution with Broadcast Variables (3d)
from pyspark import Broadcast
Test.assertTrue(isinstance(idfsSmallBroadcast, Broadcast), 'incorrect idfsSmallBroadcast')
Test.assertEquals(len(idfsSmallBroadcast.value), 4772, 'incorrect idfsSmallBroadcast value')
Test.assertTrue(abs(similarityAmazonGoogleBroadcast - 0.000303171940451) < 0.0000001,
                'incorrect similarityAmazonGoogle')

1 test passed.
1 test passed.
1 test passed.

(3e) Perform a Gold Standard evaluation

First, we’ll load the “gold standard” data and use it to answer several questions. We read and parse the Gold Standard data, where the format of each line is “Amazon Product ID”,”Google URL”. The resulting RDD has elements of the form (“AmazonID GoogleURL”, ‘gold’)

GOLDFILE_PATTERN = '^(.+),(.+)'

# Parse each line of a data file useing the specified regular expression pattern
def parse_goldfile_line(goldfile_line):
    """ Parse a line from the 'golden standard' data file
    Args:
        goldfile_line: a line of data
    Returns:
        pair: ((key, 'gold', 1 if successful or else 0))
    """
    match = re.search(GOLDFILE_PATTERN, goldfile_line)
    if match is None:
        print 'Invalid goldfile line: %s' % goldfile_line
        return (goldfile_line, -1)
    elif match.group(1) == '"idAmazon"':
        print 'Header datafile line: %s' % goldfile_line
        return (goldfile_line, 0)
    else:
        key = '%s %s' % (removeQuotes(match.group(1)), removeQuotes(match.group(2)))
        return ((key, 'gold'), 1)

goldfile = os.path.join(baseDir, inputPath, GOLD_STANDARD_PATH)
gsRaw = (sc
         .textFile(goldfile)
         .map(parse_goldfile_line)
         .cache())

gsFailed = (gsRaw
            .filter(lambda s: s[1] == -1)
            .map(lambda s: s[0]))
for line in gsFailed.take(10):
    print 'Invalid goldfile line: %s' % line

goldStandard = (gsRaw
                .filter(lambda s: s[1] == 1)
                .map(lambda s: s[0])
                .cache())

print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (gsRaw.count(),
                                                                                 goldStandard.count(),
                                                                                 gsFailed.count())
assert (gsFailed.count() == 0)
assert (gsRaw.count() == (goldStandard.count() + 1))

Read 1301 lines, successfully parsed 1300 lines, failed to parse 0 lines

Using the “gold standard” data we can answer the following questions:

#### How many true duplicate pairs are there in the small datasets?
#### What is the average similarity score for true duplicates?
#### What about for non-duplicates?

The steps you should perform are:

#### Create a new sims RDD from the similaritiesBroadcast RDD, where each element consists of a pair of the form (“AmazonID GoogleURL”, cosineSimilarityScore). An example entry from sims is: (‘b000bi7uqs http://www.google.com/base/feeds/snippets/18403148885652932189‘, 0.40202896125621296)
#### Combine the sims RDD with the goldStandard RDD by creating a new trueDupsRDD RDD that has the just the cosine similarity scores for those “AmazonID GoogleURL” pairs that appear in both the sims RDD and goldStandard RDD. Hint: you can do this using the join() transformation.
#### Count the number of true duplicate pairs in the trueDupsRDD dataset
#### Compute the average similarity score for true duplicates in the trueDupsRDD datasets. Remember to use float for calculation
#### Create a new nonDupsRDD RDD that has the just the cosine similarity scores for those “AmazonID GoogleURL” pairs from the similaritiesBroadcast RDD that do not appear in both the sims RDD and gold standard RDD.
Compute the average similarity score for non-duplicates in the last datasets. Remember to use float for calculation

# TODO: Replace with appropriate code
sims = similaritiesBroadcast.map(lambda line:(‘%s %s’%(line[1],line[0]),line[2]))

trueDupsRDD = (sims.join(goldStandard))
trueDupsCount = trueDupsRDD.count()
avgSimDups = trueDupsRDD.map(lambda (k,v):v[0]).mean()

nonDupsRDD = (sims
.leftOuterJoin(goldStandard).map(lambda (k,v):v[0] if v[1] is None else -1)).filter(lambda v:v!=-1)
avgSimNon = nonDupsRDD.mean()

print ‘There are %s true duplicates.’ % trueDupsCount
print ‘The average similarity of true duplicates is %s.’ % avgSimDups
print ‘And for non duplicates, it is %s.’ % avgSimNon

There are 146 true duplicates.
The average similarity of true duplicates is 0.264332573435.
And for non duplicates, it is 0.00123476304656.

# TEST Perform a Gold Standard evaluation (3e)
Test.assertEquals(trueDupsCount, 146, ‘incorrect trueDupsCount’)
Test.assertTrue(abs(avgSimDups - 0.264332573435) < 0.0000001, ‘incorrect avgSimDups’)
Test.assertTrue(abs(avgSimNon - 0.00123476304656) < 0.0000001, ‘incorrect avgSimNon’)

1 test passed.
1 test passed.
1 test passed.

Part 4: Scalable ER

In the previous parts, we built a text similarity function and used it for small scale entity resolution. Our implementation is limited by its quadratic run time complexity, and is not practical for even modestly sized datasets. In this part, we will implement a more scalable algorithm and use it to do entity resolution on the full dataset.

Inverted Indices

To improve our ER algorithm from the earlier parts, we should begin by analyzing its running time. In particular, the algorithm above is quadratic in two ways. First, we did a lot of redundant computation of tokens and weights, since each record was reprocessed every time it was compared. Second, we made quadratically many token comparisons between records.

The first source of quadratic overhead can be eliminated with precomputation and look-up tables, but the second source is a little more tricky. In the worst case, every token in every record in one dataset exists in every record in the other dataset, and therefore every token makes a non-zero contribution to the cosine similarity. In this case, token comparison is unavoidably quadratic.

But in reality most records have nothing (or very little) in common. Moreover, it is typical for a record in one dataset to have at most one duplicate record in the other dataset (this is the case assuming each dataset has been de-duplicated against itself). In this case, the output is linear in the size of the input and we can hope to achieve linear running time.

An inverted index is a data structure that will allow us to avoid making quadratically many token comparisons. It maps each token in the dataset to the list of documents that contain the token. So, instead of comparing, record by record, each token to every other token to see if they match, we will use inverted indices to look up records that match on a particular token.

Note on terminology: In text search, a forward index maps documents in a dataset to the tokens they contain. An inverted index supports the inverse mapping.

Note: For this section, use the complete Google and Amazon datasets, not the samples

(4a) Tokenize the full dataset

Tokenize each of the two full datasets for Google and Amazon.

# TODO: Replace  with appropriate code
amazonFullRecToToken = amazon.map(lambda (k,v):(k,tokenize(v)))
googleFullRecToToken = google.map(lambda (k,v):(k,tokenize(v)))
print 'Amazon full dataset is %s products, Google full dataset is %s products' % (amazonFullRecToToken.count(),
                                                                                    googleFullRecToToken.count())

Amazon full dataset is 1363 products, Google full dataset is 3226 products



# TEST Tokenize the full dataset (4a)
Test.assertEquals(amazonFullRecToToken.count(), 1363, 'incorrect amazonFullRecToToken.count()')
Test.assertEquals(googleFullRecToToken.count(), 3226, 'incorrect googleFullRecToToken.count()')

1 test passed.
1 test passed.

(4b) Compute IDFs and TF-IDFs for the full datasets

We will reuse your code from above to compute IDF weights for the complete combined datasets.

The steps you should perform are:

#### Create a new fullCorpusRDD that contains the tokens from the full Amazon and Google datasets.
#### Apply your idfs function to the fullCorpusRDD
#### Create a broadcast variable containing a dictionary of the IDF weights for the full dataset.
For each of the Amazon and Google full datasets, create weight RDDs that map IDs/URLs to TF-IDF weighted token vectors.

# TODO: Replace with appropriate code
fullCorpusRDD = amazonFullRecToToken.union(googleFullRecToToken)
idfsFull = idfs(fullCorpusRDD)
idfsFullCount = idfsFull.count()
print ‘There are %s unique tokens in the full datasets.’ % idfsFullCount

# Recompute IDFs for full dataset
idfsFullWeights = idfsFull.collectAsMap()
idfsFullBroadcast = sc.broadcast(idfsFullWeights)

# Pre-compute TF-IDF weights. Build mappings from record ID weight vector.
amazonWeightsRDD = amazonFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value)))
googleWeightsRDD = googleFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value)))
print ‘There are %s Amazon weights and %s Google weights.’ % (amazonWeightsRDD.count(),
googleWeightsRDD.count())

There are 17078 unique tokens in the full datasets.
There are 1363 Amazon weights and 3226 Google weights.

# TEST Compute IDFs and TF-IDFs for the full datasets (4b)
Test.assertEquals(idfsFullCount, 17078, ‘incorrect idfsFullCount’)
Test.assertEquals(amazonWeightsRDD.count(), 1363, ‘incorrect amazonWeightsRDD.count()’)
Test.assertEquals(googleWeightsRDD.count(), 3226, ‘incorrect googleWeightsRDD.count()’)

1 test passed.
1 test passed.
1 test passed.

(4c) Compute Norms for the weights from the full datasets

We will reuse your code from above to compute norms of the IDF weights for the complete combined dataset.

The steps you should perform are:

#### Create two collections, one for each of the full Amazon and Google datasets, where IDs/URLs map to the norm of the associated TF-IDF weighted token vectors.
Convert each collection into a broadcast variable, containing a dictionary of the norm of IDF weights for the full dataset

# TODO: Replace with appropriate code
amazonNorms = amazonWeightsRDD.map(lambda (k,d):(k,norm(d)))
amazonNormsBroadcast = sc.broadcast(amazonNorms.collectAsMap())
googleNorms = googleWeightsRDD.map(lambda (k,d):(k,norm(d)))
googleNormsBroadcast = sc.broadcast(googleNorms.collectAsMap())

# TEST Compute Norms for the weights from the full datasets (4c)
Test.assertTrue(isinstance(amazonNormsBroadcast, Broadcast), ‘incorrect amazonNormsBroadcast’)
Test.assertEquals(len(amazonNormsBroadcast.value), 1363, ‘incorrect amazonNormsBroadcast.value’)
Test.assertTrue(isinstance(googleNormsBroadcast, Broadcast), ‘incorrect googleNormsBroadcast’)
Test.assertEquals(len(googleNormsBroadcast.value), 3226, ‘incorrect googleNormsBroadcast.value’)

1 test passed.
1 test passed.
1 test passed.
1 test passed.

(4d) Create inverted indicies from the full datasets

Build inverted indices of both data sources.

The steps you should perform are:

#### Create an invert function that given a pair of (ID/URL, TF-IDF weighted token vector), returns a list of pairs of (token, ID/URL). Recall that the TF-IDF weighted token vector is a Python dictionary with keys that are tokens and values that are weights.
Use your invert function to convert the full Amazon and Google TF-IDF weighted token vector datasets into two RDDs where each element is a pair of a token and an ID/URL that contain that token. These are inverted indicies.

def invert(record):
“”” Invert (ID, tokens) to a list of (token, ID)
Args:
record: a pair, (ID, token vector)
Returns:
pairs: a list of pairs of token to ID
“””
ID, tokenvector = record
```
return [(k,ID) for k in tokenvector]
```
amazonInvPairsRDD = (amazonWeightsRDD
.flatMap(invert)
.cache())

googleInvPairsRDD = (googleWeightsRDD
.flatMap(invert)
.cache())

print ‘There are %s Amazon inverted pairs and %s Google inverted pairs.’ % (amazonInvPairsRDD.count(),
googleInvPairsRDD.count())

There are 111387 Amazon inverted pairs and 77678 Google inverted pairs.

# TEST Create inverted indicies from the full datasets (4d)
invertedPair = invert((1, {‘foo’: 2}))
Test.assertEquals(invertedPair[0][1], 1, ‘incorrect invert result’)
Test.assertEquals(amazonInvPairsRDD.count(), 111387, ‘incorrect amazonInvPairsRDD.count()’)
Test.assertEquals(googleInvPairsRDD.count(), 77678, ‘incorrect googleInvPairsRDD.count()’)

1 test passed.
1 test passed.
1 test passed.

(4e) Identify common tokens from the full dataset

#### Using the two inverted indicies (RDDs where each element is a pair of a token and an ID or URL that contains that token), create a new RDD that contains only tokens that appear in both datasets. This will yield an RDD of pairs of (token, iterable(ID, URL)).
#### We need a mapping from (ID, URL) to token, so create a function that will swap the elements of the RDD you just created to create this new RDD consisting of ((ID, URL), token) pairs.
Finally, create an RDD consisting of pairs mapping (ID, URL) to all the tokens the pair shares in common
```
""" Swap (token, (ID, URL)) to ((ID, URL), token)
Args:
    record: a pair, (token, (ID, URL))
Returns:
    pair: ((ID, URL), token)
"""
token = record[0]
keys = record[1]
return (keys, token)
```
commonTokens = (amazonInvPairsRDD
.join(googleInvPairsRDD)
.map(swap)
.groupByKey()
.cache())

print ‘Found %d common tokens’ % commonTokens.count()

Found 2441100 common tokens

# TEST Identify common tokens from the full dataset (4e)
Test.assertEquals(commonTokens.count(), 2441100, ‘incorrect commonTokens.count()’)

[((‘b00005lzly’, ‘http://www.google.com/base/feeds/snippets/13823221823254120257‘),

(4f) Identify common tokens from the full dataset

Use the data structures from parts (4a) and (4e) to build a dictionary to map record pairs to cosine similarity scores.

The steps you should perform are:

#### Create two broadcast dictionaries from the amazonWeights and googleWeights RDDs
#### Create a fastCosinesSimilarity function that takes in a record consisting of the pair ((Amazon ID, Google URL), tokens list) and computes the sum for each of the tokens in the token list of the products of the Amazon weight for the token times the Google weight for the token. The sum should then be divided by the norm for the Google URL and then divided by the norm for the Amazon ID. The function should return this value in a pair with the key being the (Amazon ID, Google URL). Make sure you use broadcast variables you created for both the weights and norms
Apply your fastCosinesSimilarity function to the common tokens from the full dataset

# TODO: Replace with appropriate code
amazonWeightsBroadcast = sc.broadcast(amazonWeightsRDD.collectAsMap())
googleWeightsBroadcast = sc.broadcast(googleWeightsRDD.collectAsMap())

def fastCosineSimilarity(record):
“”” Compute Cosine Similarity using Broadcast variables
Args:
record: ((ID, URL), token)
Returns:
pair: ((ID, URL), cosine similarity value)
“””
amazonRec = record[0][0]
googleRec = record[0][1]
tokens = record[1]
```
value = sum((amazonWeightsBroadcast.value[amazonRec][t])*(googleWeightsBroadcast.value[googleRec][t])\
            for t in tokens if t in amazonWeightsBroadcast.value[amazonRec] and t in googleWeightsBroadcast.value[googleRec])\
        /((amazonNormsBroadcast.value[amazonRec])*(googleNormsBroadcast.value[googleRec]))
key = (amazonRec, googleRec)
return (key, value)
```
similaritiesFullRDD = (commonTokens
.map(fastCosineSimilarity)
.cache())

print similaritiesFullRDD.count()

2441100

# TEST Identify common tokens from the full dataset (4f)
similarityTest = similaritiesFullRDD.filter(lambda ((aID, gURL), cs): aID == ‘b00005lzly’ and gURL == ‘http://www.google.com/base/feeds/snippets/13823221823254120257‘).collect()
Test.assertEquals(len(similarityTest), 1, ‘incorrect len(similarityTest)’)
Test.assertTrue(abs(similarityTest[0][1] - 4.286548414e-06) < 0.000000000001, ‘incorrect similarityTest fastCosineSimilarity’)
Test.assertEquals(similaritiesFullRDD.count(), 2441100, ‘incorrect similaritiesFullRDD.count()’)

1 test passed.
1 test passed.
1 test passed.
4.286548414e-06

Part 5: Analysis

Now we have an authoritative list of record-pair similarities, but we need a way to use those similarities to decide if two records are duplicates or not. The simplest approach is to pick a threshold. Pairs whose similarity is above the threshold are declared duplicates, and pairs below the threshold are declared distinct.

To decide where to set the threshold we need to understand what kind of errors result at different levels. If we set the threshold too low, we get more false positives, that is, record-pairs we say are duplicates that in reality are not. If we set the threshold too high, we get more false negatives, that is, record-pairs that really are duplicates but that we miss.

ER algorithms are evaluated by the common metrics of information retrieval and search called precision and recall. Precision asks of all the record-pairs marked duplicates, what fraction are true duplicates? Recall asks of all the true duplicates in the data, what fraction did we successfully find? As with false positives and false negatives, there is a trade-off between precision and recall. A third metric, called F-measure, takes the harmonic mean of precision and recall to measure overall goodness in a single value:

$F m e a s u r e = 2 p r e c i s i o n * r e c a l l p r e c i s i o n + r e c a l l$

Note: In this part, we use the “gold standard” mapping from the included file to look up true duplicates, and the results of Part 4.

Note: In this part, you will not be writing any code. We’ve written all of the code for you. Run each cell and then answer the quiz questions on Studio.

(5a) Counting True Positives, False Positives, and False Negatives

We need functions that count True Positives (true duplicates above the threshold), and False Positives and False Negatives:

#### We start with creating the simsFullRDD from our similaritiesFullRDD that consists of a pair of ((Amazon ID, Google URL), simlarity score)
#### From this RDD, we create an RDD consisting of only the similarity scores
To look up the similarity scores for true duplicates, we perform a left outer join using the goldStandard RDD and simsFullRDD and extract the

# Create an RDD of ((Amazon ID, Google URL), similarity score)
simsFullRDD = similaritiesFullRDD.map(lambda x: (“%s %s” % (x[0][0], x[0][1]), x[1]))
assert (simsFullRDD.count() == 2441100)

# Create an RDD of just the similarity scores
simsFullValuesRDD = (simsFullRDD
.map(lambda x: x[1])
.cache())
assert (simsFullValuesRDD.count() == 2441100)

# Look up all similarity scores for true duplicates

# This helper function will return the similarity score for records that are in the gold standard and the simsFullRDD (True positives), and will return 0 for records that are in the gold standard but not in simsFullRDD (False Negatives).
def gs_value(record):
if (record[1][1] is None):
return 0
else:
return record[1][1]

# Join the gold standard and simsFullRDD, and then extract the similarities scores using the helper function
trueDupSimsRDD = (goldStandard
.leftOuterJoin(simsFullRDD)
.map(gs_value)
.cache())
print ‘There are %s true duplicates.’ % trueDupSimsRDD.count()
assert(trueDupSimsRDD.count() == 1300)

There are 1300 true duplicates.

The next step is to pick a threshold between 0 and 1 for the count of True Positives (true duplicates above the threshold). However, we would like to explore many different thresholds. To do this, we divide the space of thresholds into 100 bins, and take the following actions:

#### We use Spark Accumulators to implement our counting function. We define a custom accumulator type, VectorAccumulatorParam, along with functions to initialize the accumulator’s vector to zero, and to add two vectors. Note that we have to use the += operator because you can only add to an accumulator.
#### We create a helper function to create a list with one entry (bit) set to a value and all others set to 0.
#### We create 101 bins for the 100 threshold values between 0 and 1.
#### Now, for each similarity score, we can compute the false positives. We do this by adding each similarity score to the appropriate bin of the vector. Then we remove true positives from the vector by using the gold standard data.
We define functions for computing false positive and negative and true positives, for a given threshold.

from pyspark.accumulators import AccumulatorParam
class VectorAccumulatorParam(AccumulatorParam):
# Initialize the VectorAccumulator to 0
def zero(self, value):
return [0] * len(value)
```
# Add two VectorAccumulator variables
def addInPlace(self, val1, val2):
    for i in xrange(len(val1)):
        val1[i] += val2[i]
    return val1
```
Return a list with entry x set to value and all other entries set to 0

def set_bit(x, value, length):
bits = []
for y in xrange(length):
if (x == y):
bits.append(value)
else:
bits.append(0)
return bits

Pre-bin counts of false positives for different threshold ranges

BINS = 101
nthresholds = 100
def bin(similarity):
return int(similarity * nthresholds)

fpCounts[i] = number of entries (possible false positives) where bin(similarity) == i

zeros = [0] * BINS
fpCounts = sc.accumulator(zeros, VectorAccumulatorParam())

def add_element(score):
global fpCounts
b = bin(score)
fpCounts += set_bit(b, 1, BINS)

simsFullValuesRDD.foreach(add_element)

Remove true positives from FP counts

def sub_element(score):
global fpCounts
b = bin(score)
fpCounts += set_bit(b, -1, BINS)

trueDupSimsRDD.foreach(sub_element)

def falsepos(threshold):
fpList = fpCounts.value
return sum([fpList[b] for b in range(0, BINS) if float(b) / nthresholds >= threshold])

def falseneg(threshold):
return trueDupSimsRDD.filter(lambda x: x < threshold).count()

def truepos(threshold):
return trueDupSimsRDD.count() - falsenegDict[threshold]

(5b) Precision, Recall, and F-measures

We define functions so that we can compute the Precision, Recall, and F-measure as a function of threshold value:

#### Precision = true-positives / (true-positives + false-positives)
#### Recall = true-positives / (true-positives + false-negatives)
F-measure = 2 x Recall x Precision / (Recall + Precision)

# Precision = true-positives / (true-positives + false-positives)
# Recall = true-positives / (true-positives + false-negatives)
# F-measure = 2 x Recall x Precision / (Recall + Precision)

def precision(threshold):
tp = trueposDict[threshold]
return float(tp) / (tp + falseposDict[threshold])

def recall(threshold):
tp = trueposDict[threshold]
return float(tp) / (tp + falsenegDict[threshold])

def fmeasure(threshold):
r = recall(threshold)
p = precision(threshold)
return 2 * r * p / (r + p)

(5c) Line Plots

We can make line plots of precision, recall, and F-measure as a function of threshold value, for thresholds between 0.0 and 1.0. You can change `nthresholds` (above in part (5a)) to change the threshold values to plot.

thresholds = [float(n) / nthresholds for n in range(0, nthresholds)]
falseposDict = dict([(t, falsepos(t)) for t in thresholds])
falsenegDict = dict([(t, falseneg(t)) for t in thresholds])
trueposDict = dict([(t, truepos(t)) for t in thresholds])

precisions = [precision(t) for t in thresholds]
recalls = [recall(t) for t in thresholds]
fmeasures = [fmeasure(t) for t in thresholds]

print precisions[0], fmeasures[0]
assert (abs(precisions[0] - 0.000532546802671) < 0.0000001)
assert (abs(fmeasures[0] - 0.00106452669505) < 0.0000001)


fig = plt.figure()
plt.plot(thresholds, precisions)
plt.plot(thresholds, recalls)
plt.plot(thresholds, fmeasures)
plt.legend(['Precision', 'Recall', 'F-measure'])
pass

0.000532546802671 0.00106452669505

Discussion

State-of-the-art tools can get an F-measure of about 60% on this dataset. In this lab exercise, our best F-measure is closer to 40%. Look at some examples of errors (both False Positives and False Negatives) and think about what went wrong.

There are several ways we might improve our simple classifier, including:

* Using additional attributes

* Performing better featurization of our textual data (e.g., stemming, n-grams, etc.)

* Using different similarity functions

你可能感兴趣的:(MathModeling,python,apache,spark,pySpark,文本分析,python)

Python中np.vstack和np.hstack的应用解释
Python中np.vstack和np.hstack的应用解释用法说明对于np.vstack和np.hstack各自有两种用法•第1种：np.vstack((a,b))或np.hstack((a,b))，即常规用法，也就是两个维数相等的ndarray在对应的方向上进行合并•第2种：np.vstack(a)或np.hstack(a)，对一个ndarray在其内部对应的方向上进行合并，这种属于非常规用
python np.hstack gz153016 python语法总结
importnumpyasnparr1=np.array([1,2,3])arr2=np.array([4,5,6])#print('np.vstack((arr1,arr2)):',np.vstack((arr1,arr2)))print('np.hstack((arr1,arr2)):',np.hstack((arr1,arr2)))#np.hstack((arr1,arr2)):[12345
Python个人学习基础笔记-3.爬虫（1）孜宸润泽 python 学习笔记
一.爬虫的定义爬虫（crawler/spider）是模拟浏览器行为，按照编写规则，自动接收网页信息的工具。通常而言爬虫首先从初始URL集选择URL，向目标网页发起请求，获取网页的HTML源码，然后将获取的数据进行解析过滤，保存我们所需要的标题、内容等，最后提取新的URL加入待爬序列。爬虫常见所需要的库包括Request库、BeautifulSoup4库、Scrapy库和Selenium库等。二.R
Python开发AI智能体(三)———Langchain定义提示词模板【本人】 Agent智能体 python 人工智能 langchain 语言模型
前言上篇文章给大家介绍AI项目检测平台LangSmish以及开源框架Langchain的使用，并且带领大家编写了一个案例。这篇文章将介绍在Langchain框架中如何定义提示词模板一、什么是提示词模板？提示词模板（PromptTemplate）是大语言模型（LLM）应用开发中的核心概念，本质是预定义的提示结构框架。它通过将静态文本与动态变量结合，实现标准化、可复用的提示生成机制。它提示词可以是一个
python：pydub模块 face丶第三方模块音频 pydub
一、安装1、安装模块pipinstallpydub2、安装插件云盘中下载文件ffmpeg打开电脑上的控制面板-系统-高级系统设置-环境变量然后双击path,看到如下的界面：然后点新建会出现一个新建的地址栏，你需要在这个新建地址栏里输入一个文件地址：打开你下载的ffmpeg文件中的bin文件，你应该可以看到一个这样的界面，把这个界面中地址栏中的地址复制粘贴到上面图片新建的地址栏中，然后点确定，来保存
将Python Tkinter程序转换为手机可运行的Web应用 - 详细教程随机森林404 python 智能手机前端
前言作为一名Python开发者，你可能已经使用Tkinter创建了一些桌面GUI应用。但是如何让这些应用也能在手机上运行呢？本教程将详细介绍如何将基于Tkinter的Python程序转换为手机可访问的Web应用，让你的应用随时随地可用！一、为什么需要转换？Tkinter是Python的标准GUI库，但它主要针对桌面环境。移动设备(Android/iOS)上无法直接运行Tkinter程序，主要原因有
如何使用 langchain 与 openAI 连接海乐学习 langchain python langchain python
上一篇写了如何安装langchainhttps://www.cnblogs.com/hailexuexi/p/18087602这里主要说一个langchain的使用创建一个目录langchain，在这个目录下创建两个文件main.py这段python代码，用到了openAI，需要openAI及FQ。这里只做为示例#-*-coding:utf-8-*-fromlangchain.text_split
Pydub音频处理库核心API详解滕娴殉
Pydub音频处理库核心API详解pydubManipulateaudiowithasimpleandeasyhighlevelinterface项目地址:https://gitcode.com/gh_mirrors/py/pydub概述Pydub是一个功能强大的Python音频处理库，它提供了简洁直观的API来处理各种音频操作。本文将深入解析Pydub的核心功能，帮助开发者快速掌握音频处理的关键
python循环语句for BuckData python
目录1、for循环2、示例1、for循环Pythonfor循环可以遍历任何可迭代对象。通过使用for循环，我们可以为列表、元组、集合中的每个项目等执行一组语句。range()函数如需循环一组代码指定的次数，我们可以使用range()函数，range()函数返回一个数字序列，默认情况下从0开始，并递增1（默认地），并以指定的数字结束。2、示例#遍历字典d={'CNY':'人民币','USD':'美元
python循环语句
Python循环语句文章目录Python循环语句一、实验目的二、实验原理三、实验环境四、实验内容五、实验步骤1.While循环结构2.While无限循环3.For循环语法4.break语句和continue语句一、实验目的掌握循环结构的语法二、实验原理Python中的循环语句有for和while。Python循环语句的控制结构图如下所示：三、实验环境Python3.6以上PyCharm四、实验内容
基于opencv的鱼群检测和数量统计识别鱼群密度带界面
完整项目点文末名片查看获取一、项目简介本项目旨在通过计算机视觉技术，实现对视频中鱼类数量的自动检测与计数。利用OpenCV库进行图像处理，包括背景减除、形态学操作、轮廓检测等步骤，最终在视频帧中标记出鱼类并统计其数量。该系统可广泛应用于水产养殖、生态监测等领域，有助于提高工作效率和数据准确性。二、环境准备在开始项目之前，需要确保以下环境和工具已安装：Python：推荐使用Python3.6及以上版
上位机知识篇---Conda/pip install Atticus-Orion 上位机知识篇上位机操作篇深度学习篇 conda pip
在Python环境中，condainstall和pipinstall是两个常用的包安装命令，它们分别属于不同的包管理系统。下面从多个方面详细介绍它们的区别和使用场景：1.所属系统与适用范围特性condainstallpipinstall所属系统Anaconda/Miniconda生态系统Python标准包管理系统（PyPI）适用语言支持Python、R、Java等多种语言的包仅支持Python包依
目标跟踪领域经典论文解析 ♢.＊目标跟踪人工智能计算机视觉
亲爱的小伙伴们，在求知的漫漫旅途中，若你对深度学习的奥秘、JAVA、PYTHON与SAP的奇妙世界，亦或是读研论文的撰写攻略有所探寻，那不妨给我一个小小的关注吧。我会精心筹备，在未来的日子里不定期地为大家呈上这些领域的知识宝藏与实用经验分享。每一个点赞，都如同春日里的一缕阳光，给予我满满的动力与温暖，让我们在学习成长的道路上相伴而行，共同进步✨。期待你的关注与点赞哟！目标跟踪是计算机视觉领域的一个
【Python从零到壹】Python中的标识符和保留字互联网老辛 #Python从零到壹 Python
保留字，也叫关键字，这些关键字是python直接提供给我们使用的，因此，我们在定义标识符的时候，不能用这些保留字。比如教育局就属于官方用的，你开个公司起名就不能叫教育局怎么查看关键字？importkeywordprint(keyword.kwlist)输出结果：E:\Python_demo\vippython\venv\Scripts\python.exeE:/Python_demo/vippyt
Python中的变量与数据类型難釋懷 python windows 开发语言
一、前言在Python编程中，变量（Variable）和数据类型（DataType）是程序开发中最基本也是最核心的概念。变量用于存储程序运行过程中的各种值，而数据类型则决定了变量可以存储什么样的数据、支持哪些操作。Python作为一门动态类型语言，无需显式声明变量的数据类型，解释器会根据赋给变量的值自动推断其类型。这种特性使得Python更加简洁易用，但也要求开发者对常见数据类型有清晰的认识。本文
Python中的count()方法溪流.ii python 数据库
文章目录Python中的count()方法基本语法在不同数据类型中的使用1.列表(List)中的count()2.元组(Tuple)中的count()3.字符串(String)中的count()高级用法1.指定搜索范围2.统计复杂元素注意事项Python中的count()方法前言：count()是Python中用于序列类型（如列表、元组、字符串等）的内置方法，用于统计某个元素在序列中出现的次数。基
Python中的标识符与保留字難釋懷 python java 数据库
一、前言在学习Python编程语言的过程中，标识符（Identifier）和保留字（Keywords）是两个非常基础但又极其重要的概念。它们是编写程序时必须遵守的语言规则之一。本文将带你深入了解：什么是标识符；标识符的命名规则与规范；Python中有哪些保留字；常见错误与注意事项；实际开发中的命名建议；掌握好这些内容，不仅能帮助你写出更规范、可读性更强的代码，还能避免因使用关键字作为变量名而导致的
浅谈HttpClient weixin_34092455 网络
为什么80%的码农都做不了架构师？>>>HttpClient简介HttpClient是ApacheJakartaCommon下的子项目，可以用来提供高效的、最新的、功能丰富的支持HTTP协议的客户端编程工具包，并且它支持HTTP协议最新的版本和建议。HttpClient支持的功能如下：支持Http0.9、Http1.0和Http1.1协议。实现了Http全部的方法（GET,POST,PUT,HEA
SpringBoot生态全景图：从SpringCloud到云原生技术栈演进 fanxbl957 Web spring boot spring cloud 云原生
博主介绍：Java、Python、js全栈开发“多面手”，精通多种编程语言和技术，痴迷于人工智能领域。秉持着对技术的热爱与执着，持续探索创新，愿在此分享交流和学习，与大家共进步。DeepSeek-行业融合之万象视界(附实战案例详解100+)全栈开发环境搭建运行攻略：多语言一站式指南(环境搭建+运行+调试+发布+保姆级详解)感兴趣的可以先收藏起来，希望帮助更多的人SpringBoot生态全景图：从S
利用大数据领域Doris提升企业数据决策效率大数据洞察大数据网络 ai
利用大数据领域Doris提升企业数据决策效率关键词：大数据、Doris、企业数据决策、数据处理、效率提升摘要：本文围绕利用大数据领域的Doris来提升企业数据决策效率展开。首先介绍了背景，包括目的、预期读者、文档结构和相关术语。接着阐述了Doris的核心概念、架构以及与其他系统的联系。详细讲解了Doris的核心算法原理和具体操作步骤，并给出Python代码示例。同时介绍了相关的数学模型和公式。通过
Python爬虫技术实战：高效市场趋势分析与数据采集 Python爬虫项目 2025年爬虫实战项目 python 爬虫开发语言 easyui 汽车
摘要本文将深入探讨如何利用最新的Python爬虫技术进行市场趋势分析，涵盖异步IO、无头浏览器、智能解析等前沿技术，并提供完整可运行的代码示例。文章将系统介绍从基础爬虫到高级反反爬策略的全套解决方案，帮助读者掌握市场数据采集的核心技能。1.市场趋势分析与爬虫技术概述市场趋势分析已成为现代商业决策的核心环节，而数据采集则是分析的基石。根据2024年最新统计，全球83%的企业已将网络爬虫技术纳入其数据
Nuitka打包python脚本 __如风__ python 开发语言
Python脚本打包Python是解释执行语言，需要解释器才能运行代码，这就导致在开发机上编写的代码在别的电脑上无法直接运行，除非目标机器上也安装了Python解释器，有时候还需要额外安装Python第三方包，相当麻烦。事实上Python并不适合干这种事，但有时候确实需要Python编写的程序打包给他人一键运行。思路通常都是分析脚本依赖（所有使用到的模块），然后收集相关资源，为了能在目标机器上正确
本地搭建WordPress （XAMPP环境） weixin_30577801 数据库运维 php
1，XAMPP是一个流行的PHP开发环境，官网下载：https://www.apachefriends.org/zh_cn/index.html然后安装。官方介绍：XAMPP是最流行的PHP开发环境XAMPP是完全免费且易于安装的Apache发行版，其中包含MariaDB、PHP和Perl。XAMPP开放源码包的设置让安装和使用出奇容易。2，WordPress官网下载：https://cn.wor
燕大《Python机器学习》实验报告：探索机器学习的奥秘温冰礼
燕大《Python机器学习》实验报告：探索机器学习的奥秘【下载地址】燕大Python机器学习实验报告下载这份实验报告是燕山大学软件工程专业的学生在进行机器学习实验时所编写的，内容详实，结构清晰，可以直接下载使用。报告中的实验数据和代码均经过验证，确保下载后可以直接应用于实际项目或作为学习参考项目地址:https://gitcode.com/Open-source-documentation-tut
Python 运用 Matplotlib 绘制动画图的流程 Python编程之道 Python人工智能与大数据 Python编程之道 python matplotlib 开发语言 ai
Python运用Matplotlib绘制动画图的流程关键词：Python、Matplotlib、动画图、绘制流程、动画原理摘要：本文详细介绍了使用Python的Matplotlib库绘制动画图的完整流程。从背景知识入手，阐述了Matplotlib动画绘制的目的和适用读者群体，接着深入剖析了核心概念，包括动画的基本原理和架构。通过核心算法原理的讲解和Python源代码示例，展示了如何实现动画绘制。同
Python Pandas 如何进行数据分组统计 Python编程之道 Python人工智能与大数据 Python编程之道 python pandas 网络 ai
PythonPandas如何进行数据分组统计关键词：PythonPandas、数据分组、groupby、聚合函数、数据透视表、数据统计、数据分析摘要：本文将深入探讨如何使用PythonPandas库进行高效的数据分组统计操作。我们将从基础概念入手，详细讲解groupby机制的原理和使用方法，介绍各种聚合函数的应用，探讨高级分组技巧，并通过实际案例展示如何解决复杂的数据分析问题。文章还将涵盖性能优化
Python可视化环境：Matplotlib_Seaborn+Conda配置 Python编程之道 Python人工智能与大数据 Python编程之道 python matplotlib conda ai
Python可视化环境：Matplotlib/Seaborn+Conda配置关键词：Python可视化、Matplotlib、Seaborn、Conda、环境配置摘要：本文主要探讨了如何利用Conda来配置Python可视化所需的Matplotlib和Seaborn环境。首先介绍了Python可视化的背景和重要性，明确目标读者为想要学习Python可视化的初学者和有一定基础的开发者。接着详细解析了
Nuitka 打包Python程序 Humbunklung 学海泛舟 python 开发语言 nuitka
文章目录Nuitka打包Python程序**一、Nuitka核心优势**⚙️**二、环境准备（Windows示例）****三、基础打包命令****单文件脚本打包****带第三方库的项目**️**四、高级配置选项****示例：完整命令**⚠️**五、常见问题与解决****六、Nuitkavs其他工具****七、最佳实践建议****八、使用举例**总结Nuitka打包Python程序需要把Python
python selenium 滚动页面到定位元素我有一个希哥 python selenium 前端
用js语句target=driver.find_element_by_id("id")driver.execute_script("arguments[0].scrollIntoView();",target)或target=WebDriverWait(driver,3).until(expected_conditions.presence_of_element_located((By.ID,"i
pythonselenium时间选择_使用pythonselenium选择特定日期（滚动日期） xu534328661
所有人我们正在尝试自动化日期选择过程以供参考Clickhere。请参考出生日期和预约日期字段。我们选择日期的方式是不同的。我不知道如何为这两个字段选择日期。你能帮帮我吗？在我已经尽了我的最大努力，它与下面的代码除了日期字段Python版本：2.7硒3.8.0铬：48倍importseleniumimportsysfromseleniumimportwebdriverfromselenium.web
HttpClient 4.3与4.3版本以下版本比较 spjich java httpclient
网上利用java发送http请求的代码很多，一搜一大把，有的利用的是java.net.*下的HttpURLConnection，有的用httpclient，而且发送的代码也分门别类。今天我们主要来说的是利用httpclient发送请求。 httpclient又可分为 httpclient3.x httpclient4.x到httpclient4.3以下 httpclient4.3
Essential Studio Enterprise Edition 2015 v1新功能体验 Axiba .net
概述：Essential Studio已全线升级至2015 v1版本了！新版本为JavaScript和ASP.NET MVC添加了新的文件资源管理器控件，还有其他一些控件功能升级，精彩不容错过，让我们一起来看看吧！ syncfusion公司是世界领先的Windows开发组件提供商，该公司正式对外发布Essential Studio Enterprise Edition 2015 v1版本。新版本
[宇宙与天文]微波背景辐射值与地球温度 comsci 背景
宇宙这个庞大,无边无际的空间是否存在某种确定的,变化的温度呢? 如果宇宙微波背景辐射值是表示宇宙空间温度的参数之一,那么测量这些数值,并观测周围的恒星能量输出值,我们是否获得地球的长期气候变化的情况呢? &nbs
lvs-server 男人50 server
#!/bin/bash # # LVS script for VS/DR # #./etc/rc.d/init.d/functions # VIP=10.10.6.252 RIP1=10.10.6.101 RIP2=10.10.6.13 PORT=80 case $1 in start) /sbin/ifconfig eth2:0 $VIP broadca
java的WebCollector爬虫框架 oloz 爬虫
WebCollector主页： https://github.com/CrawlScript/WebCollector 下载：webcollector-版本号-bin.zip将解压后文件夹中的所有jar包添加到工程既可。接下来看demo package org.spider.myspider; import cn.edu.hfut.dmic.webcollector.cra
jQuery append 与 after 的区别小猪猪08
1、after函数定义和用法： after() 方法在被选元素后插入指定的内容。语法： $(selector).after(content) 实例： <html> <head> <script type="text/javascript" src="/jquery/jquery.js"></scr
mysql知识充电香水浓 mysql
索引索引是在存储引擎中实现的，因此每种存储引擎的索引都不一定完全相同，并且每种存储引擎也不一定支持所有索引类型。根据存储引擎定义每个表的最大索引数和最大索引长度。所有存储引擎支持每个表至少16个索引，总索引长度至少为256字节。大多数存储引擎有更高的限制。MYSQL中索引的存储类型有两种：BTREE和HASH，具体和表的存储引擎相关； MYISAM和InnoDB存储引擎
我的架构经验系列文章索引 agevs 架构
下面是一些个人架构上的总结，本来想只在公司内部进行共享的，因此内容写的口语化一点，也没什么图示，所有内容没有查任何资料是脑子里面的东西吐出来的因此可能会不准确不全，希望抛砖引玉，大家互相讨论。要注意，我这些文章是一个总体的架构经验不针对具体的语言和平台，因此也不一定是适用所有的语言和平台的。（内容是前几天写的，现附上索引）前端架构 http://www.
Android so lib库远程http下载和动态注册 aijuans andorid
一、背景在开发Android应用程序的实现，有时候需要引入第三方so lib库，但第三方so库比较大，例如开源第三方播放组件ffmpeg库, 如果直接打包的apk包里面, 整个应用程序会大很多.经过查阅资料和实验，发现通过远程下载so文件，然后再动态注册so文件时可行的。主要需要解决下载so文件存放位置以及文件读写权限问题。二、主要
linux中svn配置出错 conf/svnserve.conf:12: Option expected 解决方法 baalwolf option
在客户端访问subversion版本库时出现这个错误： svnserve.conf:12: Option expected 为什么会出现这个错误呢，就是因为subversion读取配置文件svnserve.conf时，无法识别有前置空格的配置文件，如### This file controls the configuration of the svnserve daemon, if you##
MongoDB的连接池和连接管理 BigCat2013 mongodb
在关系型数据库中，我们总是需要关闭使用的数据库连接，不然大量的创建连接会导致资源的浪费甚至于数据库宕机。这篇文章主要想解释一下mongoDB的连接池以及连接管理机制，如果正对此有疑惑的朋友可以看一下。通常我们习惯于new 一个connection并且通常在finally语句中调用connection的close()方法将其关闭。正巧，mongoDB中当我们new一个Mongo的时候，会发现它也
AngularJS使用Socket.IO bijian1013 JavaScript AngularJS Socket.IO
目前，web应用普遍被要求是实时web应用，即服务端的数据更新之后，应用能立即更新。以前使用的技术（例如polling）存在一些局限性，而且有时我们需要在客户端打开一个socket，然后进行通信。 Socket.IO(http://socket.io/)是一个非常优秀的库，它可以帮你实
[Maven学习笔记四]Maven依赖特性 bit1129 maven
三个模块为了说明问题，以用户登陆小web应用为例。通常一个web应用分为三个模块，模型和数据持久化层user-core, 业务逻辑层user-service以及web展现层user-web， user-service依赖于user-core user-web依赖于user-core和user-service 依赖作用范围 Maven的dependency定义
【Akka一】Akka入门 bit1129 akka
什么是Akka Message-Driven Runtime is the Foundation to Reactive Applications In Akka, your business logic is driven through message-based communication patterns that are independent of physical locatio
zabbix_api之perl语言写法 ronin47 zabbix_api之perl
zabbix_api网上比较多的写法是python或curl。上次我用java－－http://bossr.iteye.com/blog/2195679，这次用perl。for example: #!/usr/bin/perl use 5.010 ; use strict ; use warnings ; use JSON :: RPC :: Client ; use
比优衣库跟牛掰的视频流出了，兄弟连Linux运维工程师课堂实录，更加刺激，更加实在！ brotherlamp linux运维工程师 linux运维工程师教程 linux运维工程师视频 linux运维工程师资料 linux运维工程师自学
比优衣库跟牛掰的视频流出了，兄弟连Linux运维工程师课堂实录，更加刺激，更加实在！ ----------------------------------------------------- 兄弟连Linux运维工程师课堂实录-计算机基础-1-课程体系介绍1 链接：http://pan.baidu.com/s/1i3GQtGL 密码：bl65 兄弟连Lin
bitmap求哈密顿距离-给定N（1<=N<=100000）个五维的点A(x1,x2,x3,x4,x5)，求两个点X(x1,x2,x3,x4,x5)和Y( bylijinnan java
import java.util.Random; /** * 题目： * 给定N（1<=N<=100000）个五维的点A(x1,x2,x3,x4,x5)，求两个点X(x1,x2,x3,x4,x5)和Y(y1,y2,y3,y4,y5)， * 使得他们的哈密顿距离（d=|x1-y1| + |x2-y2| + |x3-y3| + |x4-y4| + |x5-y5|）最大
map的三种遍历方法 chicony map
package com.test; import java.util.Collection; import java.util.HashMap; import java.util.Iterator; import java.util.Map; import java.util.Set; public class TestMap { public static v
Linux安装mysql的一些坑 chenchao051 linux
1、mysql不建议在root用户下运行 2、出现服务启动不了，111错误，注意要用chown来赋予权限，我在root用户下装的mysql，我就把usr/share/mysql/mysql.server复制到/etc/init.d/mysqld, (同时把my-huge.cnf复制/etc/my.cnf) chown -R cc /etc/init.d/mysql
Sublime Text 3 配置 daizj 配置 Sublime Text
Sublime Text 3 配置解释(默认){// 设置主题文件“color_scheme”: “Packages/Color Scheme – Default/Monokai.tmTheme”,// 设置字体和大小“font_face”: “Consolas”,“font_size”: 12,// 字体选项：no_bold不显示粗体字，no_italic不显示斜体字，no_antialias和
MySQL server has gone away 问题的解决方法 dcj3sjt126com SQL Server
MySQL server has gone away 问题解决方法，需要的朋友可以参考下。应用程序（比如PHP）长时间的执行批量的MYSQL语句。执行一个SQL，但SQL语句过大或者语句中含有BLOB或者longblob字段。比如，图片数据的处理。都容易引起MySQL server has gone away。今天遇到类似的情景，MySQL只是冷冷的说：MySQL server h
javascript/dom:固定居中效果 dcj3sjt126com JavaScript
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml&
使用 Spring 2.5 注释驱动的 IoC 功能 e200702084 spring bean 配置管理 IOC Office
使用 Spring 2.5 注释驱动的 IoC 功能 developerWorks 文档选项将打印机的版面设置成横向打印模式打印本页将此页作为电子邮件发送将此页作为电子邮件发送级别：初级陈雄华 ([email protected]), 技术总监, 宝宝淘网络科技有限公司 2008 年 2 月 28 日 &nb
MongoDB常用操作命令 geeksun mongodb
1. 基本操作 db.AddUser(username,password) 添加用户 db.auth(usrename,password) 设置数据库连接验证 db.cloneDataBase(fromhost)
php写守护进程（Daemon） hongtoushizi PHP
转载自： http://blog.csdn.net/tengzhaorong/article/details/9764655 守护进程（Daemon）是运行在后台的一种特殊进程。它独立于控制终端并且周期性地执行某种任务或等待处理某些发生的事件。守护进程是一种很有用的进程。php也可以实现守护进程的功能。 1、基本概念 &nbs
spring整合mybatis,关于注入Dao对象出错问题 jonsvien DAO spring bean mybatis prototype
今天在公司测试功能时发现一问题：先进行代码说明： 1，controller配置了Scope="prototype"（表明每一次请求都是原子型） @resource/@autowired service对象都可以（两种注解都可以）。 2，service 配置了Scope="prototype"（表明每一次请求都是原子型）
对象关系行为模式之标识映射 home198979 PHP 架构企业应用对象关系标识映射
HELLO!架构一、概念 identity Map:通过在映射中保存每个已经加载的对象，确保每个对象只加载一次，当要访问对象的时候，通过映射来查找它们。其实在数据源架构模式之数据映射器代码中有提及到标识映射，Mapper类的getFromMap方法就是实现标识映射的实现。二、为什么要使用标识映射？在数据源架构模式之数据映射器中 //c
Linux下hosts文件详解 pda158 linux
　1、主机名：　　无论在局域网还是INTERNET上，每台主机都有一个IP地址，是为了区分此台主机和彼台主机，也就是说IP地址就是主机的门牌号。　　公网：IP地址不方便记忆，所以又有了域名。域名只是在公网（INtERNET)中存在，每个域名都对应一个IP地址，但一个IP地址可有对应多个域名。　　局域网：每台机器都有一个主机名，用于主机与主机之间的便于区分，就可以为每台机器设置主机
nginx配置文件粗解 spjich java nginx
#运行用户#user nobody;#启动进程,通常设置成和cpu的数量相等worker_processes 2;#全局错误日志及PID文件#error_log logs/error.log;#error_log logs/error.log notice;#error_log logs/error.log inf
数学函数 w54653520 java
public class S { // 传入两个整数，进行比较，返回两个数中的最大值的方法。 public int get( int num1, int nu

[pySpark][note]Text Analysis and Entity Resolution

+

Text Analysis and Entity Resolution

Entity resolution is a common, yet difficult problem in data cleaning and integration. This lab will demonstrate how we can use Apache Spark to apply powerful and scalable text analysis techniques and perform entity resolution across two datasets of commercial products.

Code

This assignment can be completed using basic Python, pySpark Transformations and actions, and the plotting library matplotlib. Other libraries are not allowed.

Files

Data files for this assignment are from the metric-learning project and can be found at:

The directory contains the following files:

Part 0: Preliminaries

We read in each of the files and create an RDD consisting of lines.

The file format of an Amazon line is:

The file format of a Google line is:

Let’s examine the lines that were just loaded in the two subset (small) files - one from Google and one from Amazon

Part 1: ER as Text Similarity - Bags of Words

A simple approach to entity resolution is to treat all records as strings and compute their similarity with a string distance function. In this part, we will build some components for performing bag-of-words text-analysis, and then use them to compute record similarity.

Bag-of-words is a conceptually simple yet powerful approach to text analysis.

The idea is to treat strings, a.k.a. documents, as unordered collections of words, or tokens, i.e., as bags of words.

Note on terminology: a “token” is the result of parsing the document down to the elements we consider “atomic” for the task at hand. Tokens can be things like words, numbers, acronyms, or other exotica like word-roots or fixed-length character strings.

Bag of words techniques all apply to any sort of token, so when we say “bag-of-words” we really mean “bag-of-tokens,” strictly speaking.

1(a) Tokenize a String

(1b) Removing stopwords

Stopwords are common (English) words that do not contribute much to the content or meaning of a document (e.g., “the”, “a”, “is”, “to”, etc.). Stopwords add noise to bag-of-words comparisons, so they are usually excluded.

Using the included file “stopwords.txt”, implement tokenize, an improved tokenizer that does not emit stopwords.

(1c) Tokenizing the small datasets

Now let’s tokenize the two small datasets. For each ID in a dataset, tokenize the values, and then count the total number of tokens.

How many tokens, total, are there in the two datasets?

(1d) Amazon record with the most tokens

Which Amazon record has the biggest number of tokens?

In other words, you want to sort the records and get the one with the largest count of tokens.

Part 2: ER as Text Similarity - Weighted Bag-of-Words using TF-IDF

TF

IDF

IDF rewards tokens that are rare overall in a dataset. The intuition is that it is more significant if two documents share a rare word than a common one. IDF weight for a token, t, in a set of documents, U, is computed as follows:

Note that n(t)/N is the frequency of t in U, and N/n(t) is the inverse frequency.

TF-IDF

Finally, to bring it all together, the total TF-IDF weight for a token in a document is the product of its TF and IDF weights.

(2a) Implement a TF function

Implement tf(tokens) that takes a list of tokens and returns a Python dictionary mapping tokens to TF weights.

The steps your function should perform are:

For each of the tokens in the dictionary, divide the token’s count by the total number of tokens in the input tokens list

(2b) Create a corpus

(2c) Implement an IDFs function

Implement idfs that assigns an IDF weight to every unique token in an RDD called corpus. The function should return an pair RDD where the key is the unique token and value is the IDF weight for the token.

Recall that the IDF weight for a token, t, in a set of documents, U, is computed as follows:

The steps your function should perform are:

Use your idfs to compute the IDF weights for all tokens in corpusRDD (the combined small datasets).

How many unique tokens are there?

(2d) Tokens with the smallest IDF

Print out the 11 tokens with the smallest IDF in the combined small dataset.

(2e) IDF Histogram

Plot a histogram of IDF values. Be sure to use appropriate scaling and bucketing for the data.

First plot the histogram using matplotlib

(2f) Implement a TF-IDF function

Use your tf function to implement a tfidf(tokens, idfs) function that takes a list of tokens from a document and a Python dictionary of IDF weights and returns a Python dictionary mapping individual tokens to total TF-IDF weights.

The steps your function should perform are:

Part 3: ER as Text Similarity - Cosine Similarity

The next question is: given two vectors how do we find the cosine of the angle between them? Recall the formula for the dot product of two vectors:

a⋅b=∥a∥∥b∥cosθ

Here a⋅b=∑aibi is the ordinary dot product of two vectors, and ∥a∥=∑a2i−−−−√ is the norm of a .

We can rearrange terms and solve for the cosine to find it is simply the normalized dot product of the vectors. With our vector model, the dot product and norm computations are simple functions of the bag-of-words document representations, so we now have a formal way to compute similarity:

similarity=cosθ=a⋅b∥a∥∥b∥=∑aibi∑a2i−−−−√∑b2i−−−−√

(3a) Implement the components of a cosineSimilarity function

Implement the components of a cosineSimilarity function.

Use the tokenize and tfidf functions, and the IDF weights from Part 2 for extracting tokens and assigning them weights.

The steps you should perform are:

Define a function cossim that returns the dot product of two dictionaries divided by the norm of the first dictionary and then by the norm of the second dictionary

(3b) Implement a cosineSimilarity function

Implement a cosineSimilarity(string1, string2, idfsDictionary) function that takes two strings and a dictionary of IDF weights, and computes their cosine similarity in the context of some global IDF weights.

The steps you should perform are:

Compute and return your cossim function applied to the results of the two tfidf functions

(3c) Perform Entity Resolution

Now we can finally do some entity resolution!

For every product record in the small Google dataset, use your cosineSimilarity function to compute its similarity to every record in the small Amazon dataset. Then, build a dictionary mapping (Google URL, Amazon ID) tuples to similarity scores between 0 and 1.

We’ll do this computation two different ways, first we’ll do it without a broadcast variable, and then we’ll use a broadcast variable

The steps you should perform are:

Now, compute the similarity between Amazon record b000o24l3q and Google record http://www.google.com/base/feeds/snippets/17242822440574356561.

(3d) Perform Entity Resolution with Broadcast Variables

Instead, we can use a broadcast variable - we define the broadcast variable in the driver and then we can refer to it in each worker. Spark saves the broadcast variable at each worker, so it is only sent once.

The steps you should perform are:

Using the included file “stopwords.txt”, implement `tokenize`, an improved tokenizer that does not emit stopwords.

Now let’s tokenize the two small datasets. For each ID in a dataset, `tokenize` the values, and then count the total number of tokens.

Implement `tf(tokens)` that takes a list of tokens and returns a Python dictionary mapping tokens to TF weights.

For each of the tokens in the dictionary, divide the token’s count by the total number of tokens in the input `tokens` list

Implement `idfs` that assigns an IDF weight to every unique token in an RDD called `corpus`. The function should return an pair RDD where the `key` is the unique token and value is the IDF weight for the token.

Use your `idfs` to compute the IDF weights for all tokens in `corpusRDD` (the combined small datasets).

First plot the histogram using `matplotlib`

Use your `tf` function to implement a `tfidf(tokens, idfs)` function that takes a list of tokens from a document and a Python dictionary of IDF weights and returns a Python dictionary mapping individual tokens to total TF-IDF weights.

$a \cdot b = ∥ a ∥ ∥ b ∥ cos θ$

$s i m i l a r i t y = cos θ = a \cdot b ∥ a ∥ ∥ b ∥ = \sum a i b i \sum a 2 i - - - - \sqrt \sum b 2 i - - - - \sqrt$

(3a) Implement the components of a `cosineSimilarity` function

Implement the components of a `cosineSimilarity` function.

Use the `tokenize` and `tfidf` functions, and the IDF weights from Part 2 for extracting tokens and assigning them weights.

Define a function `cossim` that returns the dot product of two dictionaries divided by the norm of the first dictionary and then by the norm of the second dictionary

(3b) Implement a `cosineSimilarity` function

Implement a `cosineSimilarity(string1, string2, idfsDictionary)` function that takes two strings and a dictionary of IDF weights, and computes their cosine similarity in the context of some global IDF weights.

Compute and return your `cossim` function applied to the results of the two `tfidf` functions

For every product record in the small Google dataset, use your `cosineSimilarity` function to compute its similarity to every record in the small Amazon dataset. Then, build a dictionary mapping `(Google URL, Amazon ID)` tuples to similarity scores between 0 and 1.

Now, compute the similarity between Amazon record `b000o24l3q` and Google record `http://www.google.com/base/feeds/snippets/17242822440574356561`.

Again, compute the similarity between Amazon record `b000o24l3q` and Google record `http://www.google.com/base/feeds/snippets/17242822440574356561`.

Compute the average similarity score for non-duplicates in the last datasets. Remember to use `float` for calculation

Apply your `fastCosinesSimilarity` function to the common tokens from the full dataset

$F m e a s u r e = 2 p r e c i s i o n * r e c a l l p r e c i s i o n + r e c a l l$

To look up the similarity scores for true duplicates, we perform a left outer join using the `goldStandard` RDD and `simsFullRDD` and extract the