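These are my personal notes, based on the Jianshu article series by SeanCheney together with the original book. They are kept purely as a record of what I learned and for later review. See SeanCheney on Jianshu for the original author's page. Corrections and feedback are welcome. - ZJ
Original author: SeanCheney | Link | Source: Jianshu
GitHub: wesm | GitHub (Chinese translation): BrambleXu |
Jianshu: Python for Data Analysis, 2nd Edition
Environment: Python 3.6
Learning goals: learn the details and essentials of manipulating, processing, cleaning, and analyzing data with Python. Become proficient with Python programming and the libraries and tools used for data work, with the aim of becoming an expert data analyst.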
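This is by no means a complete list. Most datasets can be transformed into a structured form that is better suited to analysis and modeling.
NumPy (short for Numerical Python) is the foundational package for scientific computing in Python. Most of this book builds on NumPy and the libraries built on top of it. It provides a wide range of functionality, including but not limited to its core multidimensional array object, ndarray.
pandas: provides a large set of data structures and functions for working with structured data quickly and conveniently.
matplotlib: the most popular Python library for producing plots and other two-dimensional data visualizations.
IPython and Jupyter: an enhanced interactive Python interpreter and a web-based code notebook.
SciPy: a collection of packages addressing a number of standard problem domains in scientific computing.
scikit-learn: has become the general-purpose machine learning toolkit for Python.
statsmodels: a statistical analysis package containing algorithms for classical statistics and econometrics.
Sections 1.4 and 1.5 (omitted)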
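Interacting with the outside world
Reading and writing a variety of file formats and data stores.
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Transformation
Applying mathematical and statistical operations to existing datasets to derive new datasets (for example, aggregating a large table by group variables).
Modeling and computation
Connecting your data to statistical models, machine learning algorithms, or other computational tools.
Presentation
Creating interactive or static graphical visualizations and textual summaries.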
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm
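Python is an interpreted language. The Python interpreter runs a program by executing one statement at a time. The standard interactive Python interpreter can be started from the command line with the python command: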
C:\Users\qhtf>python
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print(input('what\'s your name:'))
what's your name: ZJ
ZJ
>>> exit()
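To run a Python program, call python with a .py file as its first argument. Suppose you have created a file hello_world.py with the following contents:
print('Hello world')
You can run it with the command below (hello_world.py must be in the terminal's current working directory):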
$ python hello_world.py
Hello world
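Some Python programmers always run their code this way, but people doing data analysis or scientific computing tend to use IPython, an enhanced Python interpreter, or Jupyter notebooks, web-based code notebooks that began as a subproject of IPython.
When you use the %run command, IPython executes the code in the specified file in the same way, and you can interact with the results once it finishes: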
D:\github\pythonpractice>ipython
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %run hello.py
Hello World
In [2]: exit()
Running the IPython Shell
In [1]: import numpy as np
In [2]: data = {i : np.random.randn() for i in range(7)}
In [3]: data
Out[3]:
{0: -0.7871942220349025,
1: 0.5968958863243701,
2: -0.670023515225677,
3: -0.030930268126603183,
4: 2.0550986476324473,
5: -0.7468422713170355,
6: -0.2948531366214833}
In [4]: exit()
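Running the Jupyter Notebook (omitted)
Tab completion (omitted)
Introspection: putting a question mark (?) before or after a variable displays general information about the object: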
In [1]: b = [2,3,4,5]
In [2]: b?
Type: list
String form: [2, 3, 4, 5]
Length: 4
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [3]: exit()
In [1]: def add_number(a, b):
...:     '''
...:     Add two numbers together
...:     Returns
...:     ---------
...:     the_sum : type of arguments
...:     '''
...:     return a + b
...:
In [2]: add_number?
Signature: add_number(a, b)
Docstring:
Add two numbers together
Returns
---------
the_sum : type of arguments
File: d:\github\pythonpractice\
Type: function
In [3]: exit()
In [2]: add_number??
Signature: add_number(a, b)
Source:
def add_number(a, b):
    '''
    Add two numbers together
    Returns
    ---------
    the_sum : type of arguments
    '''
    return a + b
File: d:\github\pythonpractice\1-0d88bc512be6>
Type: function
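The ? and ?? lookups above only work inside IPython. As a rough plain-Python counterpart, the standard-library inspect module can retrieve the same signature and docstring. This is a minimal sketch of an analogous lookup, not a description of how IPython implements ?; add_number is redefined here so the snippet is self-contained.

```python
import inspect


def add_number(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b


# The call signature, rendered as text (what IPython shows after "Signature:").
signature = str(inspect.signature(add_number))

# The docstring with its common leading indentation removed.
docstring = inspect.getdoc(add_number)

print("Signature: add_number" + signature)
print("Docstring:")
print(docstring)
```

inspect.getsource can play the role of ?? when the function lives in a file on disk.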
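def f(x, y, z):
    return (x + y) / z

a = 5
b = 6
c = 7.5

result = f(a, b, c)
The %run command: you can use %run to run any file as a Python program. With the script above saved as ipython_script_test.py, its variables remain available in the session after it runs:
In [1]: %run ipython_script_test.py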
In [2]: c
Out[2]: 7.5
In [3]: result
Out[3]: 1.4666666666666666
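Outside IPython, a loosely similar effect (run a script, then inspect the names it defined) can be had with the standard-library runpy module. The sketch below is an analogy under that assumption rather than what %run does internally; it writes the same script to a temporary directory so that it is self-contained.

```python
import os
import runpy
import tempfile

# Same contents as ipython_script_test.py in the session above.
script_source = """
def f(x, y, z):
    return (x + y) / z

a = 5
b = 6
c = 7.5

result = f(a, b, c)
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "ipython_script_test.py")
    with open(script_path, "w") as handle:
        handle.write(script_source)

    # run_path executes the file and returns the globals it defined as a dict.
    namespace = runpy.run_path(script_path)

print(namespace["c"])
print(namespace["result"])
```

Unlike %run, the names end up in a dictionary instead of the interactive namespace, so they are looked up by key.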
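If you want a script to have access to variables already defined in the interactive IPython namespace, use %run -i.
In the Jupyter notebook, you can also use %load, which imports a script into a code cell: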
In [4]: %load ipython_script_test.py
In [5]: # %load ipython_script_test.py
...: def f(x,y,z):
...:     return (x +y)/z
...:
...: a = 5
...: b = 6
...: c = 7.5
...:
...: result = f(a,b,c)
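Interrupting running code: pressing Ctrl-C while any code is running, whether a script started with %run or a long-running command, raises a KeyboardInterrupt. This causes nearly all Python programs to stop immediately, except in certain unusual cases.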
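Because KeyboardInterrupt is an ordinary exception, a program can catch it to stop cleanly. The sketch below simulates Ctrl-C by raising the exception by hand; long_running_task and its interrupt_at parameter are hypothetical names made up for illustration.

```python
def long_running_task(steps, interrupt_at=None):
    """Count completed steps, stopping cleanly if interrupted (hypothetical helper)."""
    completed = 0
    try:
        for step in range(steps):
            if step == interrupt_at:
                # Stand-in for the user pressing Ctrl-C at this point.
                raise KeyboardInterrupt
            completed += 1
    except KeyboardInterrupt:
        print("interrupted after", completed, "steps")
    return completed


finished = long_running_task(10)
partial = long_running_task(10, interrupt_at=3)
print(finished, partial)
```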
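Executing code from the clipboard: in the Jupyter notebook, you can copy and paste code into any code cell and run it. You can also run code from the clipboard in the IPython shell. Suppose you had copied the following code from some other application:
x = 5
y = 7
if x > 5:
    x += 1

    y = 8
In this case, use the %paste and %cpaste magic functions. %paste runs the code on the clipboard directly:
In [6]: %paste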
x = 5
y = 7
if x > 5:
    x += 1

    y = 8
## -- End pasted text --
%cpaste works similarly (press Ctrl-V after entering the command), except that it gives you a special prompt for pasting the code:
In [18]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:x = 5
:y = 7
:if x > 5:
:    x += 1
:
:    y = 8
:--
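Keyboard shortcuts
Magic commands
Magic commands are prefixed with the % symbol. For example, %timeit (covered in more detail later) measures the execution time of any Python statement, such as a matrix multiplication:
In [10]: import numpy as np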
In [11]: a = np.random.randn(100, 100)
In [12]: %timeit np.dot(a,a)
53.5 µs ± 451 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
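In a plain Python script, where %timeit is not available, the standard-library timeit module offers comparable measurements. This is a minimal sketch: the timed statement is a pure-Python sum, chosen only to avoid third-party dependencies, whereas the session above times np.dot.

```python
import timeit

# Statement to time, given as a string just as on the timeit command line.
statement = "sum(range(1000))"

# Each measurement runs the statement 1,000 times; take 5 measurements.
loops_per_measurement = 1000
timings = timeit.repeat(stmt=statement, repeat=5, number=loops_per_measurement)

# The minimum is usually the most stable estimate of the per-loop cost.
best_per_loop = min(timings) / loops_per_measurement
print(f"best of 5: {best_per_loop * 1e6:.2f} microseconds per loop")
```

The reported number depends on the machine, so no expected output is shown.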
In [21]: %debug?
Docstring:
::
%debug [--breakpoint FILE:LINE] [statement [statement ...]]
Activate the interactive debugger.
This magic command support two ways of activating debugger.
One is to activate debugger before executing code. This way, you
can set a break point, to step through the code from the point.
You can use this mode by giving statements to execute and optionally
a breakpoint.
The other one is to activate debugger in post-mortem mode. You can
activate this mode simply running %debug without any argument.
If an exception has just occurred, this lets you inspect its stack
frames interactively. Note that this will always work only on the last
traceback that occurred, so you must call this quickly after an
exception that you wish to inspect has fired, because if another one
occurs, it clobbers the previous one.
If you want IPython to automatically do this on every exception, see
the %pdb magic for more details.
positional arguments:
statement Code to run in debugger. You can omit this in cell
magic mode.
optional arguments:
--breakpoint , -b
Set break point at LINE in FILE.
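Automagic, which lets you call magic commands without the % prefix, can be turned on or off with %automagic. Use %quickref or %magic to explore all of the special commands.
Matplotlib integration
The %matplotlib magic function configures matplotlib for use in the IPython shell or the Jupyter notebook. This matters because otherwise the plots you create will either not appear (notebook) or take control of the session until they are closed (shell). In the IPython shell, running %matplotlib sets up the integration so that you can create multiple plot windows without interfering with the console session:
In [31]: %matplotlib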
Using matplotlib backend: TkAgg
In Jupyter, the command is a little different:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.plot(np.random.randn(50).cumsum())
[]
In [1]: def append_element(some_list, element):
...:     some_list.append(element)
...:
In [2]: data = [1, 2, 3]
In [3]: append_element(data, 4)
In [4]: data
Out[4]: [1, 2, 3, 4]
In [9]: a = 4.5
In [10]: b = 2
In [11]: print('a is {0}, b is {1}'.format(type(a), type(b)))
a is <class 'float'>, b is <class 'int'>
In [12]: a / b
Out[12]: 2.25
In [13]: a = 5
In [14]: isinstance(a, int)
Out[14]: True
In [15]: a = 5; b = 4.5
In [16]: isinstance(a, (int, float))
Out[16]: True
In [18]: isinstance(b, (bool, int))
Out[18]: False
In [22]: a = 'foo'
In [23]: getattr(a,'split')
Out[23]: <function str.split>
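For example, you can verify that an object is iterable by checking whether it implements the iterator protocol. For many objects, this means it has an __iter__ magic method, though a better way to check is to try the iter function.
The isiterable function below returns True for strings as well as for most Python collection types: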
In [24]: def isiterable(obj):
...:     try:
...:         iter(obj)
...:         return True
...:     except TypeError: # not iterable
...:         return False
...:
In [25]: isiterable('a string')
Out[25]: True
In [26]: isiterable([2,3,4,5])
Out[26]: True
In [28]: isiterable(6)
Out[28]: False
In [29]: p = 6
In [30]: isiterable(p)
Out[30]: False
In [31]: p = '6'
In [32]: isiterable(p)
Out[32]: True
In [33]: if not isinstance(x, list) and isiterable(x):
...:     x = list(x)
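The check above is typically used in functions that accept any kind of sequence. A minimal sketch of that pattern follows; as_list is a hypothetical helper name introduced here, and isiterable is repeated so the snippet runs on its own.

```python
def isiterable(obj):
    """Return True if iter() accepts the object."""
    try:
        iter(obj)
        return True
    except TypeError:  # not iterable
        return False


def as_list(x):
    """Convert any non-list iterable to a list; leave everything else alone (hypothetical helper)."""
    if not isinstance(x, list) and isiterable(x):
        x = list(x)
    return x


original = [1, 2, 3]
print(as_list((1, 2, 3)))            # a tuple becomes a new list
print(as_list('ab'))                 # a string is split into characters
print(as_list(5))                    # a non-iterable value is returned unchanged
print(as_list(original) is original) # an existing list is not copied
```

Note that strings count as iterable, so callers that want to treat a string as a single value need an extra isinstance check.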
# some_module.py
PI = 3.14159

def f(x):
    return x + 2

def g(a, b):
    return a + b
To access the variables and functions defined in some_module.py from another file in the same directory, you can do the following:
In [4]: import some_module
In [5]: result = some_module.f(5)
In [6]: result
Out[6]: 7
In [7]: pi = some_module.PI
In [8]: pi
Out[8]: 3.14159
In [9]: from some_module import f,g,PI
In [10]: result = g(5,PI)
In [11]: result
Out[11]: 8.14159
import some_module as sm
from some_module import PI as pi, g as gf
r1 = sm.f(pi)
r2 = gf(6, pi)
In [12]: a = [1, 2, 3]
In [13]: b = a
In [14]: c = list(a)
In [15]: type(a)
Out[15]: list
In [16]: type(b)
Out[16]: list
In [17]: type(c)
Out[17]: list
In [18]: a is b
Out[18]: True
In [19]: a is c
Out[19]: False
In [20]: a is not c
Out[20]: True
Comparing with is is not the same as using the == operator, as shown here:
In [40]: a == c
Out[40]: True
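The practical consequence of identity versus equality shows up once one of the objects is modified. The sketch below assumes the same setup as the session (b bound to the same list as a, c a copy made with list) and then appends through b.

```python
a = [1, 2, 3]
b = a          # b refers to the very same list object as a
c = list(a)    # c is a new list with equal contents

equal_before = (a == c)   # equal contents, different objects

b.append(4)    # mutating through b also changes what a sees

print(a)                 # [1, 2, 3, 4]
print(c)                 # [1, 2, 3]
print(a is b, a == b)    # True True
print(a is c, a == c)    # False False
```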
In [22]: a = None
In [23]: a is None
Out[23]: True
In [29]: a_list = ['foo', 2, [4,5]]
In [30]: a_list[2]
Out[30]: [4, 5]
In [31]: a_list[2] = (3,4)
In [32]: a_list
Out[32]: ['foo', 2, (3, 4)]
In [33]: a = (3,4)
In [34]: type(a)
Out[34]: tuple
In [35]: a_tuple = (3, 5, (4,5))
In [36]: a_tuple[1] = 'what'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-785d8261e70a> in <module>()
----> 1 a_tuple[1] = 'what'
TypeError: 'tuple' object does not support item assignment
In [37]:
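Scalar types
Numeric types
Strings
- Single quotes, double quotes, and triple quotes
- The string c below actually contains four lines of text; the line breaks after the opening """ and after the word lines are part of the string. You can count the newline characters in c with the count method: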
In [1]: c = """
...: This is a long string that
...: spans multiple lines
...: """
In [2]: c.count('\n')
Out[2]: 3
In [3]: a = 5.6
In [4]: s = str(a)
In [5]: type(s)
Out[5]: str
In [6]: type(a)
Out[6]: float
In [7]: s = 'python'
In [8]: list(s)
Out[8]: ['p', 'y', 't', 'h', 'o', 'n']
In [9]: s[:3]
Out[9]: 'pyt'
In [10]: s[0:3]
Out[10]: 'pyt'
In [11]: s[::]
Out[11]: 'python'
In [12]: s[3:]
Out[12]: 'hon'
In [13]: s = '12//14'
In [14]: s
Out[14]: '12//14'
In [15]: s = '12\\14'
In [16]: s
Out[16]: '12\\14'
In [17]: print(s)
12\14
In [20]: s = r'this\has\no\special\characters'
In [21]: s
Out[21]: 'this\\has\\no\\special\\characters'
In [22]: print(s)
this\has\no\special\characters
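Adding two strings together with + produces a new string.
In [24]: template = '{0:.2f} {1:s} are worth US${2:d}'
In this string, {0:.2f} formats the first argument as a floating-point number with two decimal places, {1:s} formats the second argument as a string, and {2:d} formats the third argument as an exact integer.
String formatting is a deep topic; there are multiple methods and numerous options for controlling how values are formatted in the resulting string. See the official Python documentation for details.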
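To make the template concrete, here is a short sketch that fills it in with str.format and then builds the same text with an f-string (available from Python 3.6). The argument values are illustrative only.

```python
template = '{0:.2f} {1:s} are worth US${2:d}'

# Positional arguments are matched to {0}, {1} and {2} in order.
formatted = template.format(4.5560, 'Argentine Pesos', 1)
print(formatted)   # 4.56 Argentine Pesos are worth US$1

# The same format specifiers work inside an f-string.
rate, currency, amount = 4.5560, 'Argentine Pesos', 1
formatted_f = f'{rate:.2f} {currency:s} are worth US${amount:d}'
print(formatted_f)
```

Passing a float for the {2:d} slot raises a ValueError, because the d specifier accepts only integers.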
Bytes and Unicode
In [26]: val = "español"
In [27]: val
Out[27]: 'español'
In [28]: val_utf8 = val.encode('utf-8')
In [29]: val_utf8
Out[29]: b'espa\xc3\xb1ol'
In [30]: type(val_utf8)
Out[30]: bytes
In [31]: val_utf8.decode('utf-8')
Out[31]: 'español'
In [37]: val.encode('latin1')
Out[37]: b'espa\xf1ol'
In [38]: val.encode('utf-16')
Out[38]: b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
In [39]: val.encode('utf-16le')
Out[39]: b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
In [40]: bytes_val = b'this is bytes'
In [41]: bytes_val
Out[41]: b'this is bytes'
In [42]: decode = bytes_val.decode('utf-8')
In [43]: decode # this is str (Unicode) now
Out[43]: 'this is bytes'
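Type casting: str, bool, int, and float are also functions that can be used to convert values to those types.
The bool() function converts its argument to a Boolean; called with no argument, it returns False.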
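A few conversions are worth trying directly, since the rules for bool are not always obvious. The values in this sketch are chosen for illustration.

```python
# Zero, empty containers, the empty string and None all convert to False.
falsy_values = [0, 0.0, '', [], {}, None]
falsy_results = [bool(v) for v in falsy_values]

# Everything else converts to True, including the non-empty string 'False'.
truthy_values = [1, -2.5, 'False', [0]]
truthy_results = [bool(v) for v in truthy_values]

print(falsy_results)
print(truthy_results)
print(bool())                             # no argument gives False
print(int('42'), float('3.5'), str(10))   # parsing and rendering
print(int(3.99), int(-3.99))              # int() truncates toward zero: 3 -3
```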
In [44]: True and True
Out[44]: True
In [45]: False or True
Out[45]: True
In [46]: False and False
Out[46]: False
In [47]: s = '3.14159'
In [48]: fval = float(s)
In [49]: type(fval)
Out[49]: float
In [50]: int(fval)
Out[50]: 3
In [51]: bool(fval)
Out[51]: True
In [52]: bool(0)
Out[52]: False
In [55]: c
Out[55]: True
In [56]:
In [56]: a = None
In [57]: a is None
Out[57]: True
In [58]: b = 5
In [59]: b is not None
Out[59]: True
In [63]: def add_and_maybe_multiply(a, b, c=None):
...:     result = a + b
...:     if c is not None:
...:         result = result * c
...:     return result
...:
In [64]: add_and_maybe_multiply(4,5,7)
Out[64]: 63
In [65]: add_and_maybe_multiply(4,5,None)
Out[65]: 9
In [66]:
In [66]: type(None)
Out[66]: NoneType
Dates and times
In [72]: from datetime import datetime, date, time
In [73]: dt = datetime(2018, 3, 30, 9, 36 ,21)
In [74]: dt.day
Out[74]: 30
In [75]: dt.month
Out[75]: 3
In [76]: dt.year
Out[76]: 2018
In [77]: dt.minute
Out[77]: 36
In [78]: dt.date()
Out[78]: datetime.date(2018, 3, 30)
In [79]: dt.time()
Out[79]: datetime.time(9, 36, 21)
In [83]: dt.strftime('%m/%d/%Y %H:%M')
Out[83]: '03/30/2018 09:36'
In [84]: dt.strftime('%y/%m/%d %H:%M')
Out[84]: '18/03/30 09:36'
In [85]: dt.strftime('%Y/%m/%d %H:%M')
Out[85]: '2018/03/30 09:36'
In [90]: datetime.strptime('20180330', '%Y%m%d')
Out[90]: datetime.datetime(2018, 3, 30, 0, 0)
In [98]: dt.replace(minute=0, second=0)
Out[98]: datetime.datetime(2018, 3, 30, 9, 0)
In [99]: dt.replace(minute=1, second=1)
Out[99]: datetime.datetime(2018, 3, 30, 9, 1, 1)
Because datetime.datetime is an immutable type, methods like these always produce new objects.
The difference of two datetime objects produces a datetime.timedelta:
In [100]: dt2 = datetime(2018,2, 15, 22, 30)
In [101]: delta = dt2 -dt
In [102]: delta
Out[102]: datetime.timedelta(-43, 46419)
In [103]: type(delta)
Out[103]: datetime.timedelta
In [104]: dt
Out[104]: datetime.datetime(2018, 3, 30, 9, 36, 21)
The result timedelta(-43, 46419) indicates that the timedelta encodes an offset of -43 days and 46,419 seconds.
Adding a timedelta to a datetime produces a new, shifted datetime:
In [104]: dt
Out[104]: datetime.datetime(2018, 3, 30, 9, 36, 21)
In [105]: dt + delta
Out[105]: datetime.datetime(2018, 2, 15, 22, 30)
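The days and seconds encoding can be read off the timedelta itself. This sketch reuses the two datetimes from the session and also confirms that replace leaves the original object untouched.

```python
from datetime import datetime, timedelta

dt = datetime(2018, 3, 30, 9, 36, 21)
dt2 = datetime(2018, 2, 15, 22, 30)

delta = dt2 - dt

# A negative timedelta is stored as negative days plus a non-negative seconds part.
print(delta.days, delta.seconds)   # -43 46419
print(delta.total_seconds())       # -3668781.0
print(delta == timedelta(days=-43, seconds=46419))   # True

# replace() returns a new datetime; dt itself does not change.
rounded = dt.replace(minute=0, second=0)
print(rounded)
print(dt)

# Adding the difference back to dt recovers dt2.
print(dt + delta == dt2)           # True
```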
Control flow
In [120]: 4 > 3 > 2 > 1
Out[120]: True
for value in collection:
    # do something with value
    pass
In [108]: sequence = [1, 2, None, 4, None, 5]
In [109]: total = 0
In [110]: for value in sequence:
...:     if value is None:
...:         continue
...:     total += value
In [111]: sequence = [1, 2, 0, 4, 6, 5, 2, 1]
In [112]: total_value_5 = 0
In [113]: for value in sequence:
...:     if value == 5:
...:         break
...:     total_value_5 += value
In [114]: for i in range(4):
...:     for j in range(4):
...:         if j > i:
...:             break
...:         print((i,j))
...:
(0, 0)
(1, 0)
(1, 1)
(2, 0)
(2, 1)
(2, 2)
(3, 0)
(3, 1)
(3, 2)
(3, 3)
for a, b, c in iterator:
    # do something
    pass

x = 256
total = 0
while x > 0:
    if total > 500:
        break
    total += x
    x = x // 2

if x < 0:
    print('negative!')
elif x == 0:
    # TODO: put something smart here
    pass
else:
    print('positive!')
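The range function returns an iterable range object that yields a sequence of evenly spaced integers.
Its three arguments are (start, stop, step); the stop value is never included: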
In [124]: range(10)
Out[124]: range(0, 10)
In [125]: print(range(10))
range(0, 10)
In [126]: list(range(10))
Out[126]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [127]: list(range(0,20, 5))
Out[127]: [0, 5, 10, 15]
In [128]: list(range(10, 2, -2))
Out[128]: [10, 8, 6, 4]
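A range object behaves like a lazy sequence: it supports len, indexing, and membership tests without building a list. The sketch below illustrates this; the variable names are made up for the example.

```python
# start=10, stop=2 (excluded), step=-2
evens_down = range(10, 2, -2)

print(list(evens_down))   # [10, 8, 6, 4]
print(len(evens_down))    # 4
print(evens_down[0])      # 10
print(6 in evens_down)    # True
print(2 in evens_down)    # False, the stop value is excluded

# Even a very long range uses little memory, because no values are stored.
big = range(10**9)
print(len(big))           # 1000000000
print(big[-1])            # 999999999
```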
seq = [1, 2, 3, 4]
for i in range(len(seq)):
    val = seq[i]
In [129]: sum = 0
In [130]: for i in range(100000):
...:     if i % 3 == 0 or i % 5 == 0:
...:         sum += i
...:
In [131]: sum
Out[131]: 2333316668
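The same total can be computed in one expression by feeding a generator expression to the built-in sum. This is offered as an alternative sketch; note that the session above assigns to a variable named sum, which hides the built-in for the rest of that session, so the name total is used here instead.

```python
# Sum every integer below 100000 that is a multiple of 3 or 5.
total = sum(i for i in range(100000) if i % 3 == 0 or i % 5 == 0)
print(total)   # 2333316668
```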
Ternary expressions
value = true-expr if condition else false-expr
In [132]: x = 5
In [133]: 'Non-negative' if x>=0 else 'Negative'
Out[133]: 'Non-negative'
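Only the branch selected by the condition is evaluated, which makes a ternary expression usable as a guard. The sketch below shows both points; describe_sign and safe_ratio are hypothetical helper names.

```python
def describe_sign(x):
    """Label a number using a ternary expression (hypothetical helper)."""
    return 'Non-negative' if x >= 0 else 'Negative'


def safe_ratio(numerator, denominator):
    """Divide, falling back to infinity when the denominator is zero (hypothetical helper)."""
    # The division is never evaluated when denominator == 0.
    return numerator / denominator if denominator != 0 else float('inf')


print(describe_sign(5))     # Non-negative
print(describe_sign(-1))    # Negative
print(safe_ratio(10, 4))    # 2.5
print(safe_ratio(10, 0))    # inf
```

Ternary expressions are compact, but nesting them quickly hurts readability; a regular if/else block is usually clearer for anything more involved.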