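These are my personal study notes, based on the series of articles by the Jianshu author SeanCheney together with the original book. They serve only as a record of what I have learned and as material for later review. For the original author's work, see SeanCheney on Jianshu. Corrections are welcome if you find any mistakes. (ZJ)
Original author: SeanCheney | Link | Source: Jianshu
GitHub: wesm | GitHub (Chinese edition): BrambleXu
Environment: Python 3.6
Learning goals: learn the specifics and essentials of manipulating, processing, cleaning, and analyzing data with Python. Become proficient in Python programming and in the libraries and tooling used for data work, with the aim of becoming a data analysis expert.
NumPy (short for Numerical Python) is the foundational package for scientific computing in Python. Most of this book builds on NumPy and the libraries built on top of it. It provides the following functionality, among other things:
IPython and Jupyter:
SciPy: a collection of packages addressing a number of standard problem domains in scientific computing.
scikit-learn: the general-purpose machine learning toolkit for Python.
statsmodels: a statistical analysis package containing algorithms for classical statistics and econometrics.
1.4, 1.5 (omitted)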
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm
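Python is an interpreted language. The Python interpreter runs a program by executing one statement at a time. The standard interactive Python interpreter can be started from the command line with the python command: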
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print(input('what\'s your name:'))
what's your name: ZJ
>>> exit()
Running a Python program is as simple as calling python with a .py file as its first argument. Suppose you have created a file hello_world.py with the following contents:
print('Hello world')
$ python hello_world.py
Hello world
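Some Python programmers always run their code this way, but people doing data analysis and scientific computing tend to use IPython, an enhanced Python interpreter, or Jupyter notebooks, web-based code notebooks originally created within the IPython project.
When you use %run, IPython executes the code in the specified file: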
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %run hello.py
Hello World
In [2]: exit()
Running the IPython Shell
In [1]: import numpy as np
In [2]: data = {i : np.random.randn() for i in range(7)}
In [3]: data
{0: -0.7871942220349025,
1: 0.5968958863243701,
2: -0.670023515225677,
3: -0.030930268126603183,
4: 2.0550986476324473,
5: -0.7468422713170355,
6: -0.2948531366214833}
In [4]: exit()
Running the Jupyter Notebook (omitted)
Tab completion (omitted)
In [1]: b = [2,3,4,5]
In [2]: b?
Type: list
String form: [2, 3, 4, 5]
Length: 4
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [3]: exit()
In [1]: def add_number(a, b):
...:     '''
...:     Add two numbers together
...:     Returns
...:     ---------
...:     the_sum : type of arguments
...:     '''
...:     return a + b
In [2]: add_number?
Signature: add_number(a, b)
Add two numbers together
the_sum : type of arguments
File: d:\github\pythonpractice\
Type: function
In [3]: exit()
In [2]: add_number??
Signature: add_number(a, b)
def add_number(a, b):
Add two numbers together
the_sum : type of arguments
return a + b
File: d:\github\pythonpractice\1-0d88bc512be6>
Type: function
def f(x, y, z):
    return (x + y) / z
a = 5
b = 6
c = 7.5
result = f(a,b,c)
The %run command: you can run any file as a Python program with %run.
In [1]: %run ipython_script_test.py
In [2]: c
Out[2]: 7.5
In [3]: result
Out[3]: 1.4666666666666666
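If you want a script to have access to variables already defined in the interactive IPython namespace, use %run -i.
In the Jupyter notebook, you can also use %load, which imports a script into a code cell: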
In [4]: %load ipython_script_test.py
In [5]: # %load ipython_script_test.py
...: def f(x, y, z):
...:     return (x + y) / z
...: a = 5
...: b = 6
...: c = 7.5
...: result = f(a,b,c)
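Interrupting running code: pressing Ctrl-C while code is running, whether a script launched with %run or a long-running command, raises a KeyboardInterrupt.
Executing code from the clipboard: in the Jupyter notebook, you can copy and paste code into any code cell and execute it. It is also possible to run code from the clipboard in the IPython shell. Suppose you had copied the following code in some other application: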
x = 5
y = 7
if x > 5:
    x += 1
    y = 8
%paste takes whatever text is in the clipboard and executes it directly:
In [6]: %paste
x = 5
y = 7
if x > 5:
    x += 1
    y = 8
## -- End pasted text --
%cpaste is similar (press Ctrl-V after entering the command), but it gives you a prompt for pasting the code:
In [18]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:x = 5
:y = 7
:if x > 5:
:    x += 1
:    y = 8
%timeit (discussed in detail later) measures the execution time of any Python statement, such as a matrix multiplication:
In [10]: import numpy as np
In [11]: a = np.random.randn(100, 100)
In [12]: %timeit np.dot(a,a)
53.5 µs ± 451 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
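Outside IPython, a roughly comparable measurement can be made with the standard library's timeit module. The following is a minimal sketch rather than a reproduction of what %timeit does internally; the statement being timed and the repeat counts are arbitrary choices for illustration.

```python
import timeit

# Time a small pure-Python statement; repeat/number are arbitrary illustrative values.
runs = timeit.repeat("sum(range(1000))", repeat=5, number=1000)

# Each entry is the total time in seconds for `number` executions; keep the best run.
best_per_loop = min(runs) / 1000
print("best of 5: {:.2e} s per loop".format(best_per_loop))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes on the machine.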
In [21]: %debug?
%debug [--breakpoint FILE:LINE] [statement [statement ...]]
Activate the interactive debugger.
This magic command support two ways of activating debugger.
One is to activate debugger before executing code. This way, you
can set a break point, to step through the code from the point.
You can use this mode by giving statements to execute and optionally
a breakpoint.
The other one is to activate debugger in post-mortem mode. You can
activate this mode simply running %debug without any argument.
If an exception has just occurred, this lets you inspect its stack
frames interactively. Note that this will always work only on the last
traceback that occurred, so you must call this quickly after an
exception that you wish to inspect has fired, because if another one
occurs, it clobbers the previous one.
If you want IPython to automatically do this on every exception, see
the %pdb magic for more details.
positional arguments:
statement Code to run in debugger. You can omit this in cell
magic mode.
optional arguments:
--breakpoint , -b
Set break point at LINE in FILE.
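Type %magic to learn about all of the special commands.
Matplotlib integration
The %matplotlib magic function configures matplotlib's integration with the IPython shell or Jupyter notebook. This matters because otherwise the plots you create either will not appear (notebook) or will take control of the session until closed (shell).
In the IPython shell, running %matplotlib sets up the integration so that you can create multiple plot windows without interfering with the console session:
In [31]: %matplotlib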
Using matplotlib backend: TkAgg
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [1]: def append_element(some_list, element):
...:     some_list.append(element)
In [2]: data = [1, 2, 3]
In [3]: append_element(data, 4)
In [4]: data
Out[4]: [1, 2, 3, 4]
In [9]: a = 4.5
In [10]: b = 2
In [11]: print('a is {0}, b is {1}'.format(type(a), type(b)))
a is <class 'float'>, b is <class 'int'>
In [12]: a / b
Out[12]: 2.25
In [13]: a = 5
In [14]: isinstance(a, int)
Out[14]: True
In [15]: a = 5; b = 4.5
In [16]: isinstance(a, (int, float))
Out[16]: True
In [18]: isinstance(b, (bool, int))
Out[18]: False
In [23]: getattr(a,'split')
Out[23]: <function str.split>
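You can check whether an object is iterable by looking for the __iter__ magic method; a better alternative is to use the iter function: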
In [24]: def isiterable(obj):
...:     try:
...:         iter(obj)
...:         return True
...:     except TypeError: # not iterable
...:         return False
In [25]: isiterable('a string')
Out[25]: True
In [26]: isiterable([2,3,4,5])
Out[26]: True
In [28]: isiterable(6)
Out[28]: False
In [29]: p = 6
In [30]: isiterable(p)
Out[30]: False
In [31]: p = '6'
In [32]: isiterable(p)
Out[32]: True
In [33]: if not isinstance(x, list) and isiterable(x):
...:     x = list(x)
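The check above can be wrapped into a small helper. This is only a sketch: the name ensure_list is hypothetical and does not come from the original notes. It relies on the isiterable function defined earlier, repeated here so the block runs on its own.

```python
def isiterable(obj):
    """Return True if iter(obj) succeeds, False otherwise."""
    try:
        iter(obj)
        return True
    except TypeError:  # not iterable
        return False


def ensure_list(x):
    """Hypothetical helper: convert any non-list iterable to a list."""
    if not isinstance(x, list) and isiterable(x):
        x = list(x)
    return x


print(ensure_list((1, 2, 3)))  # a tuple becomes a list
print(ensure_list("abc"))      # a string becomes a list of characters
print(ensure_list(6))          # a non-iterable is returned unchanged
```

Note that strings are iterable, so a string is split into characters; whether that is desirable depends on the caller.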
# some_module.py
PI = 3.14159
def f(x):
    return x + 2
def g(a, b):
    return a + b
To access the variables and functions defined in some_module.py from another file in the same directory, you can write:
In [4]: import some_module
In [5]: result = some_module.f(5)
In [6]: result
Out[6]: 7
In [7]: pi = some_module.PI
In [8]: pi
Out[8]: 3.14159
In [9]: from some_module import f,g,PI
In [10]: result = g(5,PI)
In [11]: result
Out[11]: 8.14159
import some_module as sm
from some_module import PI as pi, g as gf
r1 = sm.f(pi)
r2 = gf(6, pi)
In [14]: c =list(a)
In [15]: type(a)
Out[15]: list
In [16]: type(b)
Out[16]: list
In [17]: type(c)
Out[17]: list
In [18]: a is b
Out[18]: True
In [19]: a is c
Out[19]: False
In [20]: a is not c
Out[20]: True
Comparing with is is not the same as using the == operator, as shown here:
In [40]: a == c
Out[40]: True
In [22]: a = None
In [23]: a is None
Out[23]: True
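The session above relies on a, b, and c having been created earlier. A minimal self-contained sketch, assuming a is a list, b is bound to the same object, and c is a copy made with list, could look like this:

```python
a = [1, 2, 3]
b = a          # b refers to the same list object as a
c = list(a)    # list() always builds a new list, so c is a distinct copy

same_object = a is b      # identity check: True
copy_is_same = a is c     # identity check: False, c is a different object
copy_is_equal = a == c    # equality check: True, the contents match

# Mutating a is visible through b but not through c.
a.append(4)
print(b)  # [1, 2, 3, 4]
print(c)  # [1, 2, 3]

# There is only one None object, so `is None` is the usual check.
value = None
print(value is None)  # True
```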
In [29]: a_list = ['foo', 2, [4,5]]
In [30]: a_list[2]
Out[30]: [4, 5]
In [31]: a_list[2] = (3,4)
In [32]: a_list
Out[32]: ['foo', 2, (3, 4)]
In [33]: a = (3,4)
In [34]: type(a)
Out[34]: tuple
In [35]: a_tuple = (3, 5, (4,5))
In [36]: a_tuple[1] = 'what'
TypeError Traceback (most recent call last)
<ipython-input-36-785d8261e70a> in <module>()
----> 1 a_tuple[1] = 'what'
TypeError: 'tuple' object does not support item assignment
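- Single quotes, double quotes, and triple quotes
- The string c actually contains four lines of text; the line breaks after """ and after the word lines are part of the string. You can count the newline characters in c with the count method: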
In [1]: c = """
...: This is a long string that
...: spans multiple lines
...: """
In [2]: c.count('\n')
Out[2]: 3
In [3]: a = 5.6
In [4]: s = str(a)
In [5]: type(s)
Out[5]: str
In [6]: type(a)
Out[6]: float
In [7]: s = 'python'
In [8]: list(s)
Out[8]: ['p', 'y', 't', 'h', 'o', 'n']
In [9]: s[:3]
Out[9]: 'pyt'
In [10]: s[0:3]
Out[10]: 'pyt'
In [11]: s[::]
Out[11]: 'python'
In [12]: s[3:]
Out[12]: 'hon'
In [13]: s = '12//14'
In [14]: s
Out[14]: '12//14'
In [15]: s = '12\\14'
In [16]: s
Out[16]: '12\\14'
In [17]: print(s)
12\14
In [20]: s = r'this\has\no\special\characters'
In [21]: s
Out[21]: 'this\\has\\no\\special\\characters'
In [22]: print(s)
this\has\no\special\characters
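Concatenating two strings produces a new string.
In [24]: template = '{0:.2f} {1:s} are worth US${2:d}'
String formatting is a deep topic; there are several methods and a large number of options for controlling how values are formatted in the resulting string. The official Python documentation is the recommended reference.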
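As a minimal sketch of how the template above can be filled in, the format method substitutes its arguments by position. The sample values below are illustrative.

```python
# {0:.2f} -> first argument formatted as a float with two decimal places
# {1:s}   -> second argument formatted as a string
# {2:d}   -> third argument formatted as an exact integer
template = '{0:.2f} {1:s} are worth US${2:d}'

result = template.format(4.5560, 'Argentine Pesos', 1)
print(result)  # 4.56 Argentine Pesos are worth US$1
```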
Bytes and Unicode
In [26]: val = "español"
In [27]: val
Out[27]: 'español'
In [28]: val_utf8 = val.encode('utf-8')
In [29]: val_utf8
Out[29]: b'espa\xc3\xb1ol'
In [30]: type(val_utf8)
Out[30]: bytes
In [31]: val_utf8.decode('utf-8')
Out[31]: 'español'
In [37]: val.encode('latin1')
Out[37]: b'espa\xf1ol'
In [38]: val.encode('utf-16')
Out[38]: b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
In [39]: val.encode('utf-16le')
Out[39]: b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
In [40]: bytes_val = b'this is bytes'
In [41]: bytes_val
Out[41]: b'this is bytes'
In [42]: decode = bytes_val.decode('utf-8')
In [43]: decode # this is str (Unicode) now
Out[43]: 'this is bytes'
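Type casting: str, bool, int, and float are also functions that can be used to cast values to those types.
The bool() function converts its argument to a Boolean; called with no argument, it returns False.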
In [44]: True and True
Out[44]: True
In [45]: False or True
Out[45]: True
In [46]: False and False
Out[46]: False
In [47]: s = '3.14159'
In [48]: fval = float(s)
In [49]: type(fval)
Out[49]: float
In [50]: int(fval)
Out[50]: 3
In [51]: bool(fval)
Out[51]: True
In [52]: bool(0)
Out[52]: False
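A few more conversions may make the casting rules concrete. This sketch uses only built-in behavior: bool() with no argument is False, zero values and empty containers are falsy, and most other values are truthy.

```python
print(bool())      # False: no argument
print(bool(0.0))   # False: zero is falsy
print(bool(''))    # False: empty string
print(bool([]))    # False: empty list
print(bool('0'))   # True: any non-empty string, even '0'
print(bool(-1))    # True: any non-zero number

# int() truncates toward zero rather than rounding.
print(int(3.99))   # 3
print(int(-3.99))  # -3

# Converting a float-formatted string directly to int raises ValueError.
try:
    int('3.14159')
    converted = True
except ValueError:
    converted = False
print(converted)   # False
```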
In [55]: c
Out[55]: True
In [56]: a = None
In [57]: a is None
Out[57]: True
In [58]: b = 5
In [59]: b is not None
Out[59]: True
In [63]: def add_and_maybe_multiply(a, b, c=None):
...:     result = a + b
...:     if c is not None:
...:         result = result * c
...:     return result
In [64]: add_and_maybe_multiply(4,5,7)
Out[64]: 63
In [65]: add_and_maybe_multiply(4,5,None)
Out[65]: 9
In [66]: type(None)
Out[66]: NoneType
In [72]: from datetime import datetime, date, time
In [73]: dt = datetime(2018, 3, 30, 9, 36 ,21)
In [74]: dt.day
Out[74]: 30
In [75]: dt.month
Out[75]: 3
In [76]: dt.year
Out[76]: 2018
In [77]: dt.minute
Out[77]: 36
In [78]: dt.date()
Out[78]: datetime.date(2018, 3, 30)
In [79]: dt.time()
Out[79]: datetime.time(9, 36, 21)
In [83]: dt.strftime('%m/%d/%Y %H:%M')
Out[83]: '03/30/2018 09:36'
In [84]: dt.strftime('%y/%m/%d %H:%M')
Out[84]: '18/03/30 09:36'
In [85]: dt.strftime('%Y/%m/%d %H:%M')
Out[85]: '2018/03/30 09:36'
In [90]: datetime.strptime('20180330', '%Y%m%d')
Out[90]: datetime.datetime(2018, 3, 30, 0, 0)
In [98]: dt.replace(minute=0, second=0)
Out[98]: datetime.datetime(2018, 3, 30, 9, 0)
In [99]: dt.replace(minute=1, second=1)
Out[99]: datetime.datetime(2018, 3, 30, 9, 1, 1)
Because datetime.datetime is an immutable type, the methods above always produce new objects.
The difference between two datetime objects produces a datetime.timedelta:
In [100]: dt2 = datetime(2018,2, 15, 22, 30)
In [101]: delta = dt2 -dt
In [102]: delta
Out[102]: datetime.timedelta(-43, 46419)
In [103]: type(delta)
Out[103]: datetime.timedelta
In [104]: dt
Out[104]: datetime.datetime(2018, 3, 30, 9, 36, 21)
The result timedelta(-43, 46419) shows that the timedelta encodes an offset of -43 days and 46419 seconds.
Adding a timedelta to a datetime produces a new, shifted datetime:
In [105]: dt + delta
Out[105]: datetime.datetime(2018, 2, 15, 22, 30)
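The fields of a timedelta can be inspected directly. The sketch below reuses the two datetime values from the session above; days, seconds, and total_seconds() are standard members of datetime.timedelta.

```python
from datetime import datetime

dt = datetime(2018, 3, 30, 9, 36, 21)
dt2 = datetime(2018, 2, 15, 22, 30)

delta = dt2 - dt
print(delta.days)             # -43
print(delta.seconds)          # 46419
print(delta.total_seconds())  # -3668781.0

# Adding the difference back to dt recovers dt2.
restored = dt + delta
print(restored == dt2)        # True

# replace() returns a new object; dt itself is unchanged.
rounded = dt.replace(minute=0, second=0)
print(rounded)  # 2018-03-30 09:00:00
print(dt)       # 2018-03-30 09:36:21
```

The negative day count combined with a positive seconds value is how timedelta normalizes negative durations: -43 days plus 46419 seconds equals -3668781 seconds in total.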
In [120]: 4 > 3 > 2 > 1
Out[120]: True
for value in collection:
    # do something with value
In [108]: sequence = [1, 2, None, 4, None, 5]
In [109]: total = 0
In [110]: for value in sequence:
...:     if value is None:
...:         continue
...:     total += value
In [111]: sequence = [1, 2, 0, 4, 6, 5, 2, 1]
In [112]: total_value_5 = 0
In [113]: for value in sequence:
...:     if value == 5:
...:         break
...:     total_value_5 += value
In [114]: for i in range(4):
...:     for j in range(4):
...:         if j > i:
...:             break
...:         print((i, j))
(0, 0)
(1, 0)
(1, 1)
(2, 0)
(2, 1)
(2, 2)
(3, 0)
(3, 1)
(3, 2)
(3, 3)
for a, b, c in iterator:
    # do something
x = 256
total = 0
while x > 0:
    if total > 500:
        break
    total += x
    x = x // 2
if x < 0:
    print('negative!')
elif x == 0:
    # TODO: put something smart here
    pass
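The range function returns a lazy range object that yields a sequence of evenly spaced integers when iterated:
The three arguments to range are (start, stop, step); the stop value itself is not included.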
In [124]: range(10)
Out[124]: range(0, 10)
In [125]: print(range(10))
range(0, 10)
In [126]: list(range(10))
Out[126]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [127]: list(range(0,20, 5))
Out[127]: [0, 5, 10, 15]
In [128]: list(range(10, 2, -2))
Out[128]: [10, 8, 6, 4]
seq = [1, 2, 3, 4]
for i in range(len(seq)):
    val = seq[i]
In [129]: sum = 0
In [130]: for i in range(100000):
...:     if i % 3 == 0 or i % 5 == 0:
...:         sum += i
In [131]: sum
Out[131]: 2333316668
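The same total can be computed without an explicit loop. This is a sketch of an alternative, not the approach used in the notes; it passes a generator expression to the built-in sum, which also avoids rebinding the name sum to an integer as the session above does.

```python
# Sum all integers below 100000 that are multiples of 3 or 5.
total = sum(i for i in range(100000) if i % 3 == 0 or i % 5 == 0)
print(total)  # 2333316668
```

Because range is lazy and the generator expression yields one value at a time, no intermediate list of 100000 elements is built.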
value = true-expr if condition else false-expr
In [132]: x = 5
In [133]: 'Non-negative' if x >= 0 else 'Negative'
Out[133]: 'Non-negative'
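A ternary expression is often convenient inside a function or a comprehension. The helper name sign_label below is hypothetical and only serves as an illustration.

```python
def sign_label(x):
    """Hypothetical helper: label a number using a ternary expression."""
    return 'Non-negative' if x >= 0 else 'Negative'


labels = [sign_label(v) for v in (5, 0, -2)]
print(labels)  # ['Non-negative', 'Non-negative', 'Negative']
```

Only one of the two branch expressions is evaluated, so the ternary form is safe to use even when the unused branch would be expensive or would raise.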