CS190 Scalable Machine Learning Spark - 1: Python Basics


Part 1: NumPy

NumPy is a Python library for working with arrays.

     # It is convention to import NumPy with the alias np
     import numpy as np

(1a) Scalar multiplication

where $ a $ is a scalar (constant) and $ \mathbf{v} $ is a vector:
$$ a \mathbf{v} = \begin{bmatrix} a v_1 \\ a v_2 \\ \vdots \\ a v_n \end{bmatrix} $$

# Create a numpy array with the values 1, 2, 3
simpleArray = np.array([1,2,3])
# Perform the scalar product of 5 and the numpy array
timesFive = simpleArray * 5
print simpleArray
print timesFive
-----
#result
[1 2 3]
[ 5 10 15]

(1b) Element-wise multiplication and dot product

The element-wise calculation is as follows:

$$ \mathbf{x} \odot \mathbf{y} = \begin{bmatrix} x_1 y_1 \\ x_2 y_2 \\ \vdots \\ x_n y_n \end{bmatrix} $$

The dot product is equivalent to performing element-wise multiplication and then summing the result.

$ w \cdot x $ can also be written as $ w^\top x $.

$$ w \cdot x = \sum_{i=1}^n w_i x_i $$

For element-wise multiplication, use the * operator to multiply two ndarray objects of the same length.
For the dot product, you can use either np.dot() or np.ndarray.dot().
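This equivalence is easy to verify directly; here is a small sketch with two arbitrary example vectors (written with print() so it runs under both Python 2 and 3):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# The dot product computed directly...
direct = np.dot(u, v)
# ...equals the sum of the element-wise product: 1*4 + 2*5 + 3*6 = 32
summed = (u * v).sum()

print(direct)  # 32.0
print(summed)  # 32.0
```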


# Create a ndarray based on a range and step size.
u = np.arange(0, 5, .5)
v = np.arange(5, 10, .5)

elementWise = u * v 
dotProduct = np.dot(u,v)

print 'u: {0}'.format(u)
print 'v: {0}'.format(v)
print '\nelementWise\n{0}'.format(elementWise)
print '\ndotProduct\n{0}'.format(dotProduct)

-----
#result
u: [ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5]
v: [ 5.   5.5  6.   6.5  7.   7.5  8.   8.5  9.   9.5]

elementWise
[  0.     2.75   6.     9.75  14.    18.75  24.    29.75  36.    42.75]

dotProduct
183.75

(1c) Matrix math

Use np.matrix() to create a matrix.

You can perform matrix math on NumPy matrices using the * operator.

Transpose a matrix by calling numpy.matrix.transpose() or by using .T on the matrix object (e.g. myMatrix.T).

Transposing a matrix produces a matrix where the new rows are the columns from the old matrix. For example: $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^\mathbf{\top} = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} $$

Inverting a matrix can be done using numpy.linalg.inv().

Note that only square matrices can be inverted, and square matrices are not guaranteed to have an inverse. If the inverse exists, then multiplying a matrix by its inverse will produce the identity matrix. $ \scriptsize ( \mathbf{A}^{-1} \mathbf{A} = \mathbf{I_n} ) $ The identity matrix $ \scriptsize \mathbf{I_n} $ has ones along its diagonal and zero elsewhere. $$ \mathbf{I_n} = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} $$

For this exercise, multiply $ \mathbf{A} $ times its transpose $ ( \mathbf{A}^\top ) $ and then calculate the inverse of the result $ ( [ \mathbf{A} \mathbf{A}^\top ]^{-1} ) $.

from numpy.linalg import inv

A = np.matrix([[1,2,3,4],[5,6,7,8]])
print 'A:\n{0}'.format(A)
# Print A transpose
print '\nA transpose:\n{0}'.format(A.T)

# Multiply A by A transpose
AAt = A * A.T
print '\nAAt:\n{0}'.format(AAt)

# Invert AAt with np.linalg.inv()
AAtInv = np.linalg.inv(AAt)
print '\nAAtInv:\n{0}'.format(AAtInv)

# Show inverse times matrix equals identity
# We round due to numerical precision
print '\nAAtInv * AAt:\n{0}'.format((AAtInv * AAt).round(4))

-----
#result
A:
[[1 2 3 4]
 [5 6 7 8]]

A transpose:
[[1 5]
 [2 6]
 [3 7]
 [4 8]]

AAt:
[[ 30  70]
 [ 70 174]]

AAtInv:
[[ 0.54375 -0.21875]
 [-0.21875  0.09375]]

AAtInv * AAt:
[[ 1.  0.]
 [-0.  1.]]
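Rather than rounding and inspecting by eye, np.allclose can confirm that the product equals the identity up to floating-point error; a minimal sketch repeating the computation above:

```python
import numpy as np

A = np.matrix([[1, 2, 3, 4], [5, 6, 7, 8]])
AAt = A * A.T
AAtInv = np.linalg.inv(AAt)

# allclose tolerates the tiny floating-point error in the off-diagonal zeros
print(np.allclose(AAtInv * AAt, np.eye(2)))  # True
```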


Part 2: Additional NumPy and Spark linear algebra

(2a) Slices

features = np.array([1, 2, 3, 4])
print 'features:\n{0}'.format(features)

# The first three elements of features
firstThree = features[0:3]

# The last three elements of features
lastThree = features[-3:]
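One caveat worth noting here: basic NumPy slices are views into the original array, not copies, so assigning through a slice modifies the source. A small illustration:

```python
import numpy as np

features = np.array([1, 2, 3, 4])
firstThree = features[0:3]

# Writing through the slice changes the original array as well
firstThree[0] = 99
print(features)  # [99  2  3  4]
```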

(2b) Combining ndarray objects

np.hstack() allows you to combine arrays column-wise, while np.vstack() allows you to combine arrays row-wise.
Note that both np.hstack() and np.vstack() take a tuple of arrays as their first argument.
To horizontally combine three arrays a, b, and c, you would run np.hstack((a, b, c)).

If we had two arrays: a = [1, 2, 3, 4] and b = [5, 6, 7, 8], we could use np.vstack((a, b)) to produce the two-dimensional array: $$ \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix} $$

zeros = np.zeros(8)
ones = np.ones(8)
print 'zeros:\n{0}'.format(zeros)
print '\nones:\n{0}'.format(ones)

zerosThenOnes = np.hstack((zeros,ones))   # A 1D array with 16 elements
zerosAboveOnes = np.vstack((zeros,ones))  # A 2 by 8 array

print '\nzerosThenOnes:\n{0}'.format(zerosThenOnes)
print '\nzerosAboveOnes:\n{0}'.format(zerosAboveOnes)

-----
#result
zeros:
[ 0.  0.  0.  0.  0.  0.  0.  0.]

ones:
[ 1.  1.  1.  1.  1.  1.  1.  1.]

zerosThenOnes:
[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.]

zerosAboveOnes:
[[ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]]
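Checking the shapes makes the difference concrete; note that np.hstack of two 1-D arrays yields another 1-D array of shape (16,), not a 1-by-16 matrix:

```python
import numpy as np

zeros = np.zeros(8)
ones = np.ones(8)

print(np.hstack((zeros, ones)).shape)  # (16,)
print(np.vstack((zeros, ones)).shape)  # (2, 8)
```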


(2c) PySpark's DenseVector

PySpark provides a DenseVector class within the module pyspark.mllib.linalg.

DenseVector is used to store arrays of values for use in PySpark. DenseVector actually stores values in a NumPy array and delegates calculations to that object. You can create a new DenseVector using DenseVector() and passing in a NumPy array or a Python list.

Note that DenseVector stores all values as np.float64.

DenseVector objects exist locally and are not inherently distributed. DenseVector objects can be used in the distributed setting by either passing functions that contain them to resilient distributed dataset (RDD) transformations or by distributing them directly as RDDs.

from pyspark.mllib.linalg import DenseVector

numpyVector = np.array([-3, -4, 5])
print '\nnumpyVector:\n{0}'.format(numpyVector)

# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]
myDenseVector = DenseVector([3,4,5])
# Calculate the dot product between the two vectors:
# 3*(-3) + 4*(-4) + 5*5 = -9 - 16 + 25 = 0
denseDotProduct = DenseVector.dot(myDenseVector, numpyVector)

print 'myDenseVector:\n{0}'.format(myDenseVector)
print '\ndenseDotProduct:\n{0}'.format(denseDotProduct)

-----
#result
numpyVector:
[-3 -4  5]
myDenseVector:
[3.0,4.0,5.0]

denseDotProduct:
0.0


Part 3: Python lambda expressions

Lambda expressions are anonymous functions.

Some useful links: Lambda Functions, Lambda Tutorial, and Python Functions.

# Example function
def addS(x):
    return x + 's'
# The equivalent lambda form
addSLambda = lambda x: x + 's'

# Multiplication
multiplyByTen = lambda x: x * 10
print multiplyByTen(5)

# Lambda expressions can be defined in fewer steps than def.
# The first function should add two values, while the second function should subtract the second value from the first value.
def plus(x, y):
    return x + y

def minus(x, y):
    return x - y

functions = [plus, minus]
print functions[0](4, 5)
print functions[1](4, 5)

# lambda
lambdaFunctions = [lambda x, y: x + y, lambda x, y: x - y]
print lambdaFunctions[0](4, 5)
print lambdaFunctions[1](4, 5)

Lambda expressions consist of a single expression statement and cannot contain other simple statements. In short, this means that the lambda expression needs to evaluate to a value and exist on a single logical line. If more complex logic is necessary, use def in place of lambda.
Expression statements evaluate to a value (sometimes that value is None). Lambda expressions automatically return the value of their expression statement. In fact, a return statement in a lambda would raise a SyntaxError.
The following Python keywords refer to simple statements that cannot be used in a lambda expression: assert, pass, del, print, return, yield, raise, break, continue, import, global, and exec. Also, note that assignment statements (=) and augmented assignment statements (e.g. +=) cannot be used either.
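Because the body must be a single expression, branching inside a lambda is only possible with a conditional expression; anything longer calls for def. A small sketch:

```python
# A conditional *expression* is allowed inside a lambda...
absValue = lambda x: x if x >= 0 else -x
print(absValue(-3))  # 3

# ...but multi-statement logic (an if statement, multiple returns) requires def
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0

print(sign(-3))  # -1
```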
