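1 Chapter 1: Data Loading and Initial Exploration
1.4 Know What Your Data Is Called
We are learning the basic operations of pandas. What data type does the data have after it was loaded with pandas in the previous section?
Import numpy and pandas before starting: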
import numpy as np
import pandas as pd
1.4.1 Task 1: pandas has two data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each yourself [open-ended]
data1={'tom':2500,'jack':4500,'mary':5600,'me':15000}
ep1= pd.Series(data1)
print(ep1)
print(ep1.values)
print(ep1.index)
ep1[1:3]
print(ep1['tom'])
tom 2500
jack 4500
mary 5600
me 15000
dtype: int64
[ 2500 4500 5600 15000]
Index(['tom', 'jack', 'mary', 'me'], dtype='object')
2500
data2={'tom':2500,'jack':4500,'mary':5600,'me':15000,'dsfs':4800,'eng':4890}
money_data=pd.DataFrame({'money':data2})
print(money_data)
money_data.index
money
dsfs 4800
eng 4890
jack 4500
mary 5600
me 15000
tom 2500
Index(['dsfs', 'eng', 'jack', 'mary', 'me', 'tom'], dtype='object')
'''
# Our example
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
'''
# Our example
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''
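To make the relationship between the two types concrete, here is a minimal sketch. The names `salaries` and `staff` and their values are hypothetical, chosen only for illustration. It shows that a Series is a single labelled column of values, and that selecting one column from a DataFrame gives back a Series.

```python
import pandas as pd

# Hypothetical salary data, used only for illustration.
salaries = pd.Series({'tom': 2500, 'jack': 4500, 'mary': 5600})

# A DataFrame built from a dict of equal-length lists: one key per column.
staff = pd.DataFrame({
    'name': ['tom', 'jack', 'mary'],
    'salary': [2500, 4500, 5600],
})

# Selecting a single column returns a Series that shares the DataFrame's index.
salary_column = staff['salary']

print(type(salaries).__name__)
print(type(staff).__name__)
print(type(salary_column).__name__)
```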
1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson
df = pd.read_csv('train.csv')
df.head(3)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
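You can also load the "train_chinese.csv" file saved in the previous lesson. Having become familiar with the dataset through the translated train_chinese.csv, we now work with train.csv.
1.4.3 Task 3: View the column names of the DataFrame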
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
1.4.4 Task 4: View all entries in the "Cabin" column [there are several ways]
df['Cabin'].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
df.Cabin.head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
df['Cabin'].unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
'C148'], dtype=object)
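The access styles used above can be compared side by side. The following is a small sketch on a hypothetical five-row stand-in for the "Cabin" column (the frame `cabins` is not part of the dataset). It also shows `value_counts`, which summarises how often each value occurs and can be easier to read than the raw `unique()` array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "Cabin" column of train.csv.
cabins = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123', 'C85']})

by_bracket = cabins['Cabin']      # bracket access, works for any column name
by_attribute = cabins.Cabin       # attribute access, only for names that are valid identifiers
by_loc = cabins.loc[:, 'Cabin']   # label-based access through .loc

# value_counts() counts each distinct value; dropna=False keeps the NaN bucket.
counts = cabins['Cabin'].value_counts(dropna=False)
print(counts)
```

Bracket access is the safest default, since attribute access fails for column names that contain spaces or clash with existing DataFrame attributes.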
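1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns
On inspection, the test set test_1.csv has one superfluous column, which we need to delete.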
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
|   | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
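Instead of comparing the two headers by eye, the extra columns can be computed. This is a minimal sketch on hypothetical miniature frames (`train` and `test` stand in for train.csv and test_1.csv and are not the real files). On the real data, a comparison like this would probably report `Unnamed: 0` as well as `a`, since both are missing from train.csv.

```python
import pandas as pd

# Hypothetical miniature versions of train.csv and test_1.csv.
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test = pd.DataFrame({'Unnamed: 0': [0, 1], 'PassengerId': [1, 2],
                     'Survived': [0, 1], 'a': [100, 100]})

# Index.difference returns the labels present in test but missing from train.
extra_columns = test.columns.difference(train.columns)
print(list(extra_columns))
```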
del test_1['a']
test_1.head(3)
|   | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
[Think] Are there other ways to delete the extra column?
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
test_1.drop(['a'],axis=1,inplace=True)
test_1.head(3)
|   | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
test_1.drop(columns=['a'],inplace=True)
test_1.head(3)
|   | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns
df.drop(['PassengerId','Name','Age','Ticket'],axis=1).head(3)
|   | Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
| 1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
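[Think] Compare Task 5 and Task 6: were different methods (functions) used? If the same function is used, how can it meet these two different requirements?
[Answer]
Both tasks can use drop(). To permanently delete columns from the data structure, pass inplace=True, which overwrites the original data. Task 6 only hides columns for viewing, so inplace is not used there.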
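The difference between the two calls can be checked directly. The following sketch uses a hypothetical two-row frame named `passengers` (not the real dataset) to show that `drop()` without `inplace` leaves the original untouched, while `inplace=True` modifies it and returns `None`.

```python
import pandas as pd

# Hypothetical frame standing in for the Titanic data.
passengers = pd.DataFrame({'PassengerId': [1, 2],
                           'Name': ['A', 'B'],
                           'Fare': [7.25, 71.28]})

# Without inplace, drop() returns a new DataFrame; the original keeps every column.
view_only = passengers.drop(columns=['PassengerId', 'Name'])
print(list(view_only.columns))
print(list(passengers.columns))

# With inplace=True, the original object is modified and drop() returns None.
result = passengers.drop(columns=['Name'], inplace=True)
print(result)
print(list(passengers.columns))
```

An alternative to `inplace=True` that has the same effect is reassigning the result, for example `passengers = passengers.drop(columns=['Name'])`.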
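1.5 Filtering Logic
One of the most important capabilities for tabular data is filtering: selecting the information we need and discarding what is not useful.
Below we again learn this pandas feature through hands-on practice.
1.5.1 Task 1: Using "Age" as the filter condition, display the passengers younger than 10.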
df[df["Age"]<10].head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name this data midage
midage=df[(df.Age>10) &(df.Age<50)]
midage.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
[Hint] Learn how conditional filtering works in pandas and how to use intersection and union operations
dage=df[df.Age<10]
dage.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
| 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
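The intersection and union operations mentioned in the hint correspond to the `&` and `|` operators on boolean masks. Below is a minimal sketch on a hypothetical five-row `people` frame (not the Titanic data). It also includes a row with a missing age, to show how NaN behaves in comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN marks a missing value, as in the Titanic "Age" column.
people = pd.DataFrame({'Age': [2.0, 22.0, 38.0, 61.0, np.nan]})

# Intersection (&): both conditions must hold. Parentheses are required
# because & binds more tightly than the comparison operators.
between = people[(people['Age'] > 10) & (people['Age'] < 50)]

# Union (|): at least one condition holds.
outside = people[(people['Age'] <= 10) | (people['Age'] >= 50)]

# Negation (~) flips a mask. Comparisons with NaN are False, so the row
# with a missing age is excluded by `between` but kept by its negation.
not_between = people[~((people['Age'] > 10) & (people['Age'] < 50))]

print(list(between.index))
print(list(outside.index))
print(list(not_between.index))
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.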
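1.5.3 Task 3: Display the "Pclass" and "Sex" data of row 100 of midage
# Method 1: fails. .ix is deprecated (removed in pandas 1.0), and a 2-D DataFrame accepts at most two indexers.
midage.ix[100,'Pclass','Sex']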
C:\Users\13153\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
after removing the cwd from sys.path.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
in
2 # https://www.cnblogs.com/nxf-rabbit75/p/10105271.html
3 # Method 1:
----> 4 midage.ix[100,'Pclass','Sex']
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
123 key = tuple(com.apply_if_callable(x, self.obj) for x in key)
124 try:
--> 125 values = self.obj._get_value(*key)
126 except (KeyError, TypeError, InvalidIndexError, AttributeError):
127 # TypeError occurs here if the key has non-hashable entries,
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _get_value(self, index, col, takeable)
2823
2824 if takeable:
-> 2825 series = self._iget_item_cache(col)
2826 return com.maybe_box_datetimelike(series._values[index])
2827
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _iget_item_cache(self, item)
3292 ax = self._info_axis
3293 if ax.is_unique:
-> 3294 lower = self._get_item_cache(ax[item])
3295 else:
3296 lower = self.take(item, axis=self._info_axis_number)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __getitem__(self, key)
4278 if is_scalar(key):
4279 key = com.cast_scalar_indexer(key)
-> 4280 return getitem(key)
4281
4282 if isinstance(key, slice):
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
midage.loc[[100],['Pclass','Sex']]
midage = midage.reset_index(drop=True)
midage.head(3)
midage.loc[[100],['Pclass','Sex']]
[Think] What does the reset_index() function do? What would happen in the tasks below without it?
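One way to explore this question is on a tiny example. The sketch below uses a hypothetical five-row `ages` frame (not midage itself). It shows that filtering keeps the original row labels, so `.loc` no longer lines up with row positions until `reset_index(drop=True)` renumbers the rows.

```python
import pandas as pd

# Hypothetical data: filtering keeps the original row labels.
ages = pd.DataFrame({'Age': [5, 22, 8, 35, 40]})
adults = ages[ages['Age'] > 10]
print(list(adults.index))

# .loc looks up labels, so adults.loc[0] would raise KeyError here,
# and adults.loc[1] is the first surviving row, not the second.
first_by_label = adults.loc[1, 'Age']

# reset_index(drop=True) renumbers rows 0..n-1 and discards the old labels.
renumbered = adults.reset_index(drop=True)
print(list(renumbered.index))
second_row = renumbered.loc[1, 'Age']
```

Without `drop=True`, the old labels are kept as a new column named `index` instead of being discarded.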
1.5.4 Task 4: Display the "Pclass", "Name" and "Sex" data of rows 100, 105 and 108 of midage
midage.loc[[100,105,108],['Pclass','Name','Sex']]
|   | Pclass | Name | Sex |
|---|---|---|---|
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
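[Hint] Use the simple approach pandas provides; take a look at the loc method
Compare with the positions in the full dataset: do you notice a problem? How would you solve it?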
from IPython.display import display
display(midage)
display(midage.loc[[100,105,108],['Pclass','Name','Sex']] )
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 571 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
| 572 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 573 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 574 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 575 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
576 rows × 12 columns
|   | Pclass | Name | Sex |
|---|---|---|---|
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
1.5.5 Task 5: Use the iloc method to display the "Pclass", "Name" and "Sex" data of rows 100, 105 and 108 of midage
midage.iloc[[100,105,108],[2,3,4]]
|   | Pclass | Name | Sex |
|---|---|---|---|
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
midage.loc[[100,105,108],["Pclass","Name","Sex"]]
|   | Pclass | Name | Sex |
|---|---|---|---|
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
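The iloc call above hard-codes the column positions [2,3,4], which breaks silently if the column order changes. As a sketch of one possible alternative, the positions can be looked up from the column names with `Index.get_indexer`. The frame named `frame` below is hypothetical and much smaller than midage.

```python
import pandas as pd

# Hypothetical frame with a default 0..n-1 index.
frame = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Pclass': [3, 1, 3],
    'Name': ['A', 'B', 'C'],
    'Sex': ['male', 'female', 'female'],
})

# Translate column labels to integer positions instead of hard-coding them.
positions = frame.columns.get_indexer(['Pclass', 'Name', 'Sex'])

# iloc selects by integer position, loc selects by label.
by_position = frame.iloc[[0, 2], positions]
by_label = frame.loc[[0, 2], ['Pclass', 'Name', 'Sex']]
print(by_position)
```

The two selections agree here only because the row labels equal the row positions, which is the situation midage is in after `reset_index(drop=True)`.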