Introduction to Data Science in Python: Study Notes

These are my study notes from the Coursera course Introduction to Data Science in Python. They are for reference only.


1. 50 Years of Data Science

    (1) Data Exploration and Preparation 

    (2) Data Representation and Transformation

    (3) Computing with Data

    (4) Data Modeling

    (5) Data Visualization and Presentation

    (6) Science about Data Science


2. Functions

def add_numbers(x, y, z=None, flag=False):

    if flag:

        print('Flag is true!')

    if z is None:  # compare with None using "is", not "=="

        return x + y

    else:

        return x + y + z

print(add_numbers(1, 2, flag=True))


Assign the function add_numbers to a variable a, then call it through a:

a = add_numbers

a(1, 2, flag=True)
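Because a function is an ordinary object, a name such as a is just a second reference to it, and the function can also be handed to other functions. A minimal sketch of this idea follows; the helper apply_twice is hypothetical and introduced only for illustration.

```python
def add_numbers(x, y, z=None, flag=False):
    """Add two or three numbers; optionally announce the flag."""
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    return x + y + z


def apply_twice(func, value):
    # Hypothetical helper: calls func(value, value) to show that
    # a function can be passed around like any other object.
    return func(value, value)


a = add_numbers               # no parentheses: bind the function itself
result = a(1, 2, flag=True)   # prints 'Flag is true!' and returns 3
doubled = apply_twice(a, 5)   # same as add_numbers(5, 5)
print(result, doubled)
```

Note that a and add_numbers refer to the same function object, so a is add_numbers evaluates to True.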


3. Checking data types

type('This is a string')

-> str

type(None)

-> NoneType


4. Tuple

Tuples are an immutable data structure (cannot be altered).

x = (1, 'a', 2, 'b')

type(x)

->tuple
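To see what "immutable" means in practice, here is a small sketch: assigning to an element of a tuple raises a TypeError, while "adding" to a tuple builds a new one and leaves the original untouched. The variable names mutated and y are only illustrative.

```python
x = (1, 'a', 2, 'b')

try:
    x[0] = 99          # tuples do not support item assignment
    mutated = True
except TypeError:
    mutated = False

y = x + (3.3,)         # concatenation creates a new tuple; x is unchanged
print(mutated, x, y)
```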


5. List

Lists are a mutable data structure.

x = [1, 'a', 2, 'b']

type(x)

->list


6. Append

Use append to add an object to the end of a list.

x.append(3.3)

print(x)

->[1, 'a', 2, 'b', 3.3]


7. Loop through each item in the list

for item in x:

    print(item)

->1

    a

    2

    b

    3.3


8. Using the indexing operator to loop through each item in the list

i = 0

while i != len(x):

    print(x[i])

    i = i + 1

->1

    a

    2

    b

    3.3


9. Basic list operations

(1) Use + to concatenate lists

[1, 2] + [3, 4]

-> [1, 2, 3, 4]

(2) Use * to repeat lists

[1]*3

->[1, 1, 1]

(3) Use the in operator to check if something is inside a list

1 in [1, 2, 3]

->True


10. Basic string operations

(1) Use bracket notation to slice a string.

x = 'This is a string'

print(x[0])

->T

print(x[0:1])

->T

print(x[0:2])

->Th

print(x[-1])  # the last element

->g

print(x[-4:-2])  # start from the 4th element from the end and stop before the 2nd element from the end

->ri

x[:3]  # a slice from the beginning of the string, stopping before index 3 (the 4th character)

->Thi

x[3:]  # a slice starting at index 3 (the 4th character) and going all the way to the end

->s is a string

(2) The +, * and in operators from lists also work on strings

firstname = 'Christopher'

lastname = 'Brooks'

print(firstname + ' ' + lastname)

->Christopher Brooks

print(firstname*3)

->ChristopherChristopherChristopher

print('Chris' in firstname)

->True

(3) Split returns a list of all the words in a string, or a list split on a specific character.

firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0] 

lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1] 

print(firstname)

->Christopher

print(lastname)

->Brooks

(4) Make sure you convert objects to strings before concatenating.

'Chris' + 2

->TypeError (a str and an int cannot be concatenated)

'Chris' + str(2)

->Chris2


11. Dictionary

(1) Dictionaries associate keys with values

x = {'Christopher Brooks': '[email protected]', 'Bill Gates': '[email protected]'}

x['Christopher Brooks']

->[email protected]

x['Kevyn Collins-Thompson'] = None

x['Kevyn Collins-Thompson']

->(no output, because the value is None)

(2) Iterate over all of the keys:

for name in x:

    print(x[name])

->[email protected]

    [email protected]

    None

(3) Iterate over all of the values:

for email in x.values():

    print(email)

->[email protected]

    [email protected]

    None

(4) Iterate over all of the items in the dictionary:

for name, email in x.items():

    print(name)

    print(email)

->Christopher Brooks

    [email protected]

    Bill Gates

    [email protected]

    Kevyn Collins-Thompson

    None

(5) Unpack a sequence into different variables:

x = ('Christopher', 'Brooks', '[email protected]')

fname, lname, email = x

fname

->Christopher

lname

->Brooks

(6) Make sure the number of values you are unpacking matches the number of variables being assigned.

x = ('Christopher', 'Brooks', '[email protected]', 'Ann Arbor')

fname, lname, email = x

->ValueError: too many values to unpack (expected 3)
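When the number of values is not known in advance, a starred name can collect the leftovers instead of raising an error. A minimal sketch follows; the tuple record and its email address user@example.com are placeholders made up for this example, not data from the course.

```python
# Placeholder record; the email address is made up for illustration.
record = ('Christopher', 'Brooks', 'user@example.com', 'Ann Arbor')

try:
    fname, lname, email = record      # 4 values but only 3 names
    unpacked = True
except ValueError:
    unpacked = False

# A starred name gathers whatever is left over into a list.
fname, lname, *rest = record
print(unpacked, fname, lname, rest)
```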


12. More on Strings

(1) Simple Samples

print('Chris' + 2)

->TypeError

print('Chris' + str(2))

->Chris2

(2) Python has a built-in method for convenient string formatting.

sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }

sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
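The same statement can also be written with named fields, which avoids counting positional {} placeholders. The sketch below is one possible variant, not the course's version: it fills the fields from the dictionary with ** unpacking and uses the :.2f format spec to show two decimal places. The names total and statement are introduced here for illustration.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

total = sales_record['num_items'] * sales_record['price']

# Named fields are filled from the dictionary via ** unpacking;
# ':.2f' displays the value rounded to two decimal places.
statement = ('{person} bought {num_items} item(s) at a price of '
             '{price:.2f} each for a total of {total:.2f}').format(
                 total=total, **sales_record)
print(statement)
```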


13. Reading and Writing CSV files

(1) Import csv and read the file

import csv

%precision 2

with open('mpg.csv') as csvfile:

    mpg = list(csv.DictReader(csvfile))  # convert csvfile into a list of dictionaries, one per row

mpg[:3]

->

[OrderedDict([('', '1'),

              ('manufacturer', 'audi'),

              ('model', 'a4'),

              ('displ', '1.8'),

              ('year', '1999'),

              ('cyl', '4'),

              ('trans', 'auto(l5)'),

              ('drv', 'f'),

              ('cty', '18'),

              ('hwy', '29'),

              ('fl', 'p'),

              ('class', 'compact')]),

OrderedDict([('', '2'),

              ('manufacturer', 'audi'),

              ('model', 'a4'),

              ('displ', '1.8'),

              ('year', '1999'),

              ('cyl', '4'),

              ('trans', 'manual(m5)'),

              ('drv', 'f'),

              ('cty', '21'),

              ('hwy', '29'),

              ('fl', 'p'),

              ('class', 'compact')]),

OrderedDict([('', '3'),

              ('manufacturer', 'audi'),

              ('model', 'a4'),

              ('displ', '2'),

              ('year', '2008'),

              ('cyl', '4'),

              ('trans', 'manual(m6)'),

              ('drv', 'f'),

              ('cty', '20'),

              ('hwy', '31'),

              ('fl', 'p'),

              ('class', 'compact')])]

(2) Check the length of the list

len(mpg)

->234

(3) keys gives us the column names of our csv

mpg[0].keys()

->odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

(4) Find the average hwy fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.

sum(float(d['hwy']) for d in mpg) / len(mpg)

->23.44

(5) Use set to return the unique values for the number of cylinders the cars in our dataset have.

cylinders = set(d['cyl'] for d in mpg)

cylinders

->{'4', '5', '6', '8'}

(6) Group the cars by number of cylinders and find the average cty mpg for each group.

CtyMpgByCyl = []

for c in cylinders:  # one pass over the data per cylinder count

    summpg = 0

    cyltypecount = 0

    for d in mpg:

        if d['cyl'] == c:  # the car belongs to this group

            summpg += float(d['cty'])

            cyltypecount += 1

    CtyMpgByCyl.append((c, summpg / cyltypecount))

CtyMpgByCyl.sort(key = lambda x: x[0])

CtyMpgByCyl

->[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
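The nested loops above scan the whole dataset once per cylinder count. As an alternative sketch (not the course's method), the groups can be accumulated in a single pass with collections.defaultdict. The list rows below is a small hypothetical sample in the same shape csv.DictReader produces; it is not the real mpg.csv data, and the names totals, counts and cty_mpg_by_cyl are illustrative.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like csv.DictReader output
# (all values are strings); NOT the real mpg.csv data.
rows = [
    {'cyl': '4', 'cty': '18'},
    {'cyl': '4', 'cty': '22'},
    {'cyl': '6', 'cty': '16'},
    {'cyl': '8', 'cty': '12'},
    {'cyl': '8', 'cty': '14'},
]

totals = defaultdict(float)   # cyl -> running sum of cty
counts = defaultdict(int)     # cyl -> number of cars seen

for d in rows:                # a single pass over the data
    totals[d['cyl']] += float(d['cty'])
    counts[d['cyl']] += 1

# Build (cyl, average) pairs, sorted by cylinder count.
cty_mpg_by_cyl = sorted((c, totals[c] / counts[c]) for c in totals)
print(cty_mpg_by_cyl)
```

With the real data, rows would simply be replaced by the mpg list read from the file.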

(7) Use set to return the unique values for the class types in our dataset

vehicleclass = set(d['class'] for d in mpg)

vehicleclass

->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

(8) Find the average hwy mpg for each class of vehicle in our dataset.

HwyMpgByClass = []

for t in vehicleclass:  # one pass over the data per vehicle class

    summpg = 0

    vclasscount = 0

    for d in mpg:

        if d['class'] == t:  # the car belongs to this class

            summpg += float(d['hwy'])

            vclasscount += 1

    HwyMpgByClass.append((t, summpg / vclasscount))

HwyMpgByClass.sort(key = lambda x: x[1])

HwyMpgByClass

->

[('pickup', 16.88),

('suv', 18.13),

('minivan', 22.36),

('2seater', 24.80),

('midsize', 27.29),

('subcompact', 28.14),

('compact', 28.30)]


14. Dates and Times

(1) Import the datetime and time modules

import datetime as dt

import time as tm

(2) Time returns the current time in seconds since the Epoch

tm.time()

->1583932727.90

(3) Convert the timestamp to datetime

dtnow = dt.datetime.fromtimestamp(tm.time())

dtnow

->

datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)

(4) Handy datetime attributes: get year, month, day, etc. from a datetime

dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second

->(2020, 3, 11, 13, 18, 56)

(5) Timedelta is a duration expressing the difference between two dates.

delta = dt.timedelta(days = 100)

delta

->datetime.timedelta(100)

(6) date.today returns the current local date

today = dt.date.today()

today

->datetime.date(2020, 3, 11)

(7) the date 100 days ago

today - delta

->datetime.date(2019, 12, 2)

(8) compare dates

today > today - delta

-> True
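The date arithmetic above can be checked with fixed dates instead of today's date, so the result is reproducible. A small sketch, assuming the date from the outputs above (March 11, 2020); the names start, earlier and gap are illustrative.

```python
import datetime as dt

start = dt.date(2020, 3, 11)       # fixed date so the result is reproducible
delta = dt.timedelta(days=100)

earlier = start - delta            # date - timedelta gives a date
gap = start - earlier              # date - date gives a timedelta
print(earlier, gap.days, start > earlier)
```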


15. Objects and map()

(1) an example of a class in python:

class Person:

    department = 'School of Information'  # class attribute, shared by all instances

    def set_name(self, new_name):

        self.name = new_name

    def set_location(self, new_location):

        self.location = new_location


person = Person()

person.set_name('Christopher Brooks')

person.set_location('Ann Arbor, MI, USA')

print('{} lives in {} and works in the department {}'.format(person.name, person.location, person.department))

(2) mapping the min function between two lists

store1 = [10.00, 11.00, 12.34, 2.34]

store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)

cheapest

->(a map object; its values are computed lazily, only when it is iterated)

(3) iterate through the map object to see the values

for item in cheapest:

    print(item)

->

9.0

11.0

12.34

2.01
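One consequence worth knowing: in Python 3 a map object is an iterator, so it can be consumed only once. The sketch below illustrates this with the same two lists; first_pass and second_pass are illustrative names.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)   # nothing is computed yet

first_pass = list(cheapest)           # consumes the iterator
second_pass = list(cheapest)          # already exhausted, so this is empty
print(first_pass, second_pass)
```

If the values are needed more than once, converting to a list right away (as in first_pass) is the simplest option.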


16. Lambda and List Comprehensions

(1) an example of lambda that takes in three parameters and adds the first two

my_function = lambda a, b, c: a+b

my_function(1, 2, 3)

->3

(2) iterate from 0 to 999 and return the even numbers.

my_list = []

for number in range(0, 1000):

    if number % 2 == 0:

        my_list.append(number)

my_list

->[0, 2, 4,...]

(3) Now the same thing but with list comprehension

my_list = [number for number in range(0, 1000) if number % 2 == 0]

my_list

->[0, 2, 4,...]
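To confirm that the loop and the list comprehension really build the same list, the sketch below compares them, along with two other common spellings (filter with a lambda, and range with a step). The variable names are illustrative.

```python
# Explicit loop
evens_loop = []
for number in range(0, 1000):
    if number % 2 == 0:
        evens_loop.append(number)

# List comprehension
evens_comp = [number for number in range(0, 1000) if number % 2 == 0]

# filter with a lambda predicate
evens_filter = list(filter(lambda n: n % 2 == 0, range(0, 1000)))

# range with a step of 2
evens_step = list(range(0, 1000, 2))

print(evens_loop == evens_comp == evens_filter == evens_step, len(evens_comp))
```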


17. Numpy

(1) import package

import numpy as np


18. Creating arrays (from a tuple or a list)

(1) create a list and convert it to a numpy array

mylist = [1, 2, 3]

x = np.array(mylist)

x

->array([1, 2, 3])

(2) just pass in a list directly

y = np.array([4, 5, 6])

y

->array([4, 5, 6])

(3) pass in a list of lists to create a multidimensional array

m = np.array([[7, 8, 9], [10, 11, 12]])

m

->

array([[ 7, 8, 9],

      [10, 11, 12]])

(4) use the shape attribute to find the dimensions of the array

m.shape 

->(2,3)

(5) arange returns evenly spaced values within a given interval

n = np.arange(0, 30, 2)

n

->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

(6) reshape returns an array with the same data with a new shape

n = n.reshape(3, 5)

n

->

array([[ 0, 2, 4, 6, 8],

      [10, 12, 14, 16, 18],

      [20, 22, 24, 26, 28]])

(7) linspace returns evenly spaced numbers over a specified interval

o = np.linspace(0, 4, 9)

o

->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
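The difference between arange and linspace is easy to mix up: arange takes a step and excludes the stop value, while linspace takes a number of points and (by default) includes the stop value. A small sketch with the same arguments as above; by_step and by_count are illustrative names.

```python
import numpy as np

by_step = np.arange(0, 30, 2)     # step of 2; the stop value 30 is excluded
by_count = np.linspace(0, 4, 9)   # 9 points; the stop value 4 is included

print(by_step.size, by_step[-1])
print(by_count.size, by_count[-1])
print(by_count[1] - by_count[0])  # spacing between neighbouring points
```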

(8) resize changes the shape and size of the array in-place

o.resize(3, 3)

o

->

array([[ 0. , 0.5, 1. ],

      [ 1.5,  2. ,  2.5],

      [ 3. ,  3.5,  4. ]])

(9) ones returns a new array of given shape and type, filled with ones

np.ones((3, 2))

->

array([[ 1., 1.],

      [ 1.,  1.],

      [ 1.,  1.]])

(10) zeros returns a new array of given shape and type, filled with zeros

np.zeros((2,3))

->

array([[ 0., 0., 0.],

      [ 0.,  0.,  0.]])

(11) eye returns a 2D array with ones on the diagonal and zeros elsewhere

np.eye(3)

->

array([[ 1., 0., 0.],

      [ 0.,  1.,  0.],

      [ 0.,  0.,  1.]])

(12) diag extracts a diagonal or constructs a diagonal array

np.diag(y)

->

array([[4, 0, 0],

      [0, 5, 0],

      [0, 0, 6]])

(13) create an array using a repeating list

np.array([1, 2, 3]*3)

->array([1, 2, 3, 1, 2, 3, 1, 2, 3])

(14) repeat elements of an array using repeat

np.repeat([1, 2, 3], 3)

->array([1, 1, 1, 2, 2, 2, 3, 3, 3])
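The two repetition patterns above differ in order: repeating the list repeats the whole sequence, while np.repeat repeats each element in turn. NumPy also has np.tile, which gives the whole-sequence pattern without going through a Python list first. A short sketch; the variable names are illustrative.

```python
import numpy as np

base = [1, 2, 3]

whole_list = np.array(base * 3)   # Python list repetition, then convert
tiled = np.tile(base, 3)          # NumPy equivalent of the line above
each_elem = np.repeat(base, 3)    # repeats each element in turn

print(whole_list, tiled, each_elem)
```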

(15) combine arrays

p = np.ones([2, 3], int)

p

->

array([[1, 1, 1],

      [1, 1, 1]])

(16) use vstack to stack arrays in sequence vertically (row wise).

np.vstack([p, 2*p])

->

array([[1, 1, 1],

      [1, 1, 1],

      [2, 2, 2],

      [2, 2, 2]])

(17) use hstack to stack arrays in sequence horizontally (column wise).

np.hstack([p, 2*p])

->

array([[1, 1, 1, 2, 2, 2],

      [1, 1, 1, 2, 2, 2]])
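For 2D arrays, vstack and hstack correspond to np.concatenate along axis 0 and axis 1 respectively. The sketch below checks the resulting shapes; v, h, same_v and same_h are illustrative names.

```python
import numpy as np

p = np.ones([2, 3], int)

v = np.vstack([p, 2 * p])     # stacks rows: shape (4, 3)
h = np.hstack([p, 2 * p])     # stacks columns: shape (2, 6)

# For 2D inputs these match concatenate along the given axis.
same_v = np.concatenate([p, 2 * p], axis=0)
same_h = np.concatenate([p, 2 * p], axis=1)

print(v.shape, h.shape)
```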


19. Operations

(1) element wise + - * /

print(x+y)

print(x-y)

->

[5 7 9]

[-3 -3 -3]

print(x*y)

print(x/y)

->

[ 4 10 18]

[ 0.25  0.4  0.5 ]

print(x**2)

->[1 4 9]

(2) Dot Product

x.dot(y) # x1y1+x2y2+x3y3

->32
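The dot product is the sum of the element-wise products, which can be verified by hand. A short sketch using the same x and y as above; manual is an illustrative name, and the @ operator is an equivalent spelling for 1D arrays.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# 1*4 + 2*5 + 3*6, computed element by element
manual = sum(a * b for a, b in zip(x, y))

print(manual, x.dot(y), np.dot(x, y), x @ y)
```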

(3) stack y and y**2 into a 2D array

z = np.array([y, y**2])

print(z)

print(len(z)) #number of rows of array

->

[[ 4 5 6]

[16 25 36]]

2

(4) transpose array

z

->

[[ 4 5 6]

[16 25 36]]

z.T

->

array([[ 4, 16],

      [ 5, 25],

      [ 6, 36]])

(5) use .dtype to see the data type of the elements in the array

z.dtype

->dtype('int64')

(6) use .astype to cast to a specific type 

z = z.astype('f')

z.dtype

->dtype('float32')

(7) math functions 

a = np.array([-4, -2, 1, 3, 5])

a.sum()

->3

a.max()

->5

a.min()

->-4

a.mean()

->0.6

a.std()

->3.2619012860600183

a.argmax()

->4

a.argmin()

->0

(8) indexing / slicing

s = np.arange(13)**2

s

->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])

(9) use bracket notation to get the value at a specific index

s[0], s[4], s[-1]

->(0, 16, 144)

(10) use : to indicate a range: array[start:stop]

s[1:5]

->array([ 1, 4, 9, 16])

(11) use negatives to count from the back

s[-4:]

->array([ 81, 100, 121, 144])

(12) A second : can be used to indicate the step size: array[start:stop:stepsize]

Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.

s[-5::-2]

->array([64, 36, 16, 4, 0])

(13) look at the multidimensional array

r = np.arange(36)

r.resize((6,6))

r

->

array([[ 0, 1, 2, 3, 4, 5],

      [ 6,  7,  8,  9, 10, 11],

      [12, 13, 14, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 31, 32, 33, 34, 35]])

(14) use bracket notation with a row and a column index to get a single element

r[2, 2]

->14

(15) use : to select a range of rows or columns

r[3, 3:6]

->array([21, 22, 23])

(16) select all the rows up to (but not including) row 2, and all the columns up to (but not including) the last column.

r[:2, :-1]

->

array([[ 0, 1, 2, 3, 4],

      [ 6,  7,  8,  9, 10]])

(17) a slice of last row, only every other element

r[-1, ::2]

->array([30, 32, 34])

(18) perform conditional indexing.

r[r > 30]

->array([31, 32, 33, 34, 35])

(19) assigning all values in the array that are greater than 30 to the value of 30

r[r > 30] = 30

r

->

array([[ 0, 1, 2, 3, 4, 5],

      [ 6,  7,  8,  9, 10, 11],

      [12, 13, 14, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 30, 30, 30, 30, 30]])
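Conditional indexing works because the comparison r > 30 itself produces a boolean array of the same shape, which is then used as a mask. The sketch below makes the mask explicit and, as one possible alternative to in-place assignment, uses np.where to build a capped copy while leaving r unchanged. The names mask, selected and clipped are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

mask = r > 30                     # boolean array, same shape as r
selected = r[mask]                # 1D array of the matching values
clipped = np.where(mask, 30, r)   # new array; r itself is not modified

print(mask.sum(), selected, clipped[-1], r[-1, -1])
```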

(20) copy and modify arrays: a slice is a view into the same data, not a copy

r2 = r[:3, :3]

r2

->

array([[ 0, 1, 2],

      [ 6,  7,  8],

      [12, 13, 14]])

(21) set this slice's values to zero ([:] selects the entire array)

r2[:] = 0

r2

->

array([[0, 0, 0],

      [0, 0, 0],

      [0, 0, 0]])

(22) r has also been changed

r

->

array([[ 0, 0, 0, 3, 4, 5],

      [ 0,  0,  0,  9, 10, 11],

      [ 0,  0,  0, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 30, 30, 30, 30, 30]])

(23) to avoid this, use .copy()

r_copy = r.copy()

r_copy

->

array([[ 0, 0, 0, 3, 4, 5],

      [ 0,  0,  0,  9, 10, 11],

      [ 0,  0,  0, 15, 16, 17],

      [18, 19, 20, 21, 22, 23],

      [24, 25, 26, 27, 28, 29],

      [30, 30, 30, 30, 30, 30]])

(24) now when r_copy is modified, r will not be changed

r_copy[:] = 10

print(r_copy, '\n')

print(r)

->

[[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]

[10 10 10 10 10 10]]


[[ 0  0  0  3  4  5]

[ 0  0  0  9 10 11]

[ 0  0  0 15 16 17]

[18 19 20 21 22 23]

[24 25 26 27 28 29]

[30 30 30 30 30 30]]
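Whether two arrays share the same underlying data can be checked directly with np.shares_memory. A minimal sketch of the view-versus-copy behaviour described above, using a fresh 6x6 array; the names view and copy are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]           # basic slicing returns a view
copy = r[:3, :3].copy()    # explicit, independent copy

print(np.shares_memory(r, view))   # the view reuses r's buffer
print(np.shares_memory(r, copy))   # the copy has its own buffer

view[:] = 0                # writes through to r
copy[:] = -1               # does not touch r
print(r[0, 0], r[0, 3])
```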

(25) create a new 4x3 array of random integers from 0 to 9

test = np.random.randint(0, 10, (4,3))

test

->

array([[1, 8, 2],

      [6, 1, 5],

      [7, 8, 0],

      [7, 6, 2]])

(26) iterate by row

for row in test:

    print(row)

->

[1 8 2] 

[6 1 5]

[7 8 0]

[7 6 2]

(27) iterate by index

for i in range(len(test)):

    print(test[i])

->

[1 8 2]

[6 1 5]

[7 8 0]

[7 6 2]

(28) iterate by row and index

for i, row in enumerate(test):

    print('row', i, 'is', row)

->

row 0 is [1 8 2]

row 1 is [6 1 5]

row 2 is [7 8 0]

row 3 is [7 6 2]

(29) use zip to iterate over multiple iterables

test2 = test**2

test2

->

array([[ 1, 64, 4],

      [36,  1, 25],

      [49, 64,  0],

      [49, 36,  4]])


for i, j in zip(test, test2):

    print(i, '+', j, '=', i+j)

->

[1 8 2] + [ 1 64 4] = [ 2 72 6]

[6 1 5] + [36  1 25] = [42  2 30]

[7 8 0] + [49 64  0] = [56 72  0]

[7 6 2] + [49 36  4] = [56 42  6]
