跟着datawhale学习python数据分析之第一章:第二节pandas基础-课程

复习:数据分析的第一步,加载数据我们已经学习完毕了。当数据展现在我们面前的时候,我们所要做的第一步就是认识他,今天我们要学习的就是了解字段含义以及初步观察数据

1 第一章:数据载入及初步观察

1.4 知道你的数据叫什么

我们学习pandas的基础操作,那么上一节通过pandas加载之后的数据,其数据类型是什么呢?

开始前导入numpy和pandas

import numpy as np
import pandas as pd

1.4.1 任务一:pandas中有两个数据类型DateFrame和Series,通过查找简单了解他们。然后自己写一个关于这两个数据类型的小例子[开放题]

#写入代码
dict1={'aa':25,'bb':10}
s1=pd.Series(dict1)
s1
aa    25
bb    10
dtype: int64
#写入代码
dict2={'aa':[25,12],'bb':[10,56]}
s2=pd.DataFrame(dict2)
s2
aa bb
0 25 10
1 12 56
'''
#我们举的例子
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
'''
#我们举的例子
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''


1.4.2 任务二:根据上节课的方法载入"train.csv"文件

#写入代码
df=pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

也可以加载上一节课保存的"train_chinese.csv"文件。通过翻译版train_chinese.csv熟悉了这个数据集,然后我们对trian.csv来进行操作

1.4.3 任务三:查看DataFrame数据的每列的项

#写入代码
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

1.4.4任务四:查看"cabin"这列的所有项 [有多种方法]

#写入代码
df['Cabin'].head(3)
0    NaN
1    C85
2    NaN
Name: Cabin, dtype: object
#写入代码

1.4.5 任务五:加载文件"test_1.csv",然后对比"train.csv",看看有哪些多出的列,然后将多出的列删除

经过我们的观察发现一个测试集test_1.csv有一列是多余的,我们需要将这个多余的列删去

#写入代码
df2=pd.read_csv('test_1.csv')
df2.head(5)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked a
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 100
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 100
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 100
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 100
4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 100
#写入代码
del df2['a']
df2.head(5)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

【思考】还有其他的删除多余的列的方式吗?

# 思考回答





1.4.6 任务六: 将[‘PassengerId’,‘Name’,‘Age’,‘Ticket’]这几个列元素隐藏,只观察其他几个列元素

#写入代码
df.drop(['PassengerId','Name','Age','Ticket'],axis=1).head()
Survived Pclass Sex SibSp Parch Fare Cabin Embarked
0 0 3 male 1 0 7.2500 NaN S
1 1 1 female 1 0 71.2833 C85 C
2 1 3 female 0 0 7.9250 NaN S
3 1 1 female 1 0 53.1000 C123 S
4 0 3 male 0 0 8.0500 NaN S

【思考】对比任务五和任务六,是不是使用了不一样的方法(函数),如果使用一样的函数如何完成上面的不同的要求呢?

【思考回答】

如果想要完全的删除你的数据结构,使用inplace=True,因为使用inplace就将原数据覆盖了,所以这里没有用

1.5 筛选的逻辑

表格数据中,最重要的一个功能就是要具有可筛选的能力,选出我所需要的信息,丢弃无用的信息。

下面我们还是用实战来学习pandas这个功能。

1.5.1 任务一: 我们以"Age"为筛选条件,显示年龄在10岁以下的乘客信息。

#写入代码
df[df['Age']<=10].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.0 1 2 SC/Paris 2123 41.5792 NaN C
#### 1.5.2 任务二: 以"Age"为条件,将年龄在10岁以上和50岁以下的乘客信息显示出来,并将这个数据命名为midage
#写入代码
midage=df[(df['Age']>10) & (df['Age']<50)]

【提示】了解pandas的条件筛选方式以及如何使用交集和并集操作

#### 1.5.3 任务三:将midage的数据中第100行的"Pclass"和"Sex"的数据显示出来
#写入代码
midage.reset_index(drop=True)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
6 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
7 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
8 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
9 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
10 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
11 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
12 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
13 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
14 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
15 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
16 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
17 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
18 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C
19 36 0 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 NaN S
20 38 0 3 Cann, Mr. Ernest Charles male 21.0 0 0 A./5. 2152 8.0500 NaN S
21 39 0 3 Vander Planke, Miss. Augusta Maria female 18.0 2 0 345764 18.0000 NaN S
22 40 1 3 Nicola-Yarred, Miss. Jamila female 14.0 1 0 2651 11.2417 NaN C
23 41 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.4750 NaN S
24 42 0 2 Turpin, Mrs. William John Robert (Dorothy Ann ... female 27.0 1 0 11668 21.0000 NaN S
25 45 1 3 Devaney, Miss. Margaret Delia female 19.0 0 0 330958 7.8792 NaN Q
26 50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1 0 349237 17.8000 NaN S
27 52 0 3 Nosworthy, Mr. Richard Cater male 21.0 0 0 A/4. 39886 7.8000 NaN S
28 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
29 54 1 2 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin... female 29.0 1 0 2926 26.0000 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
546 854 1 1 Lines, Miss. Mary Conover female 16.0 0 1 PC 17592 39.4000 D28 S
547 855 0 2 Carter, Mrs. Ernest Courtenay (Lilian Hughes) female 44.0 1 0 244252 26.0000 NaN S
548 856 1 3 Aks, Mrs. Sam (Leah Rosen) female 18.0 0 1 392091 9.3500 NaN S
549 857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45.0 1 1 36928 164.8667 NaN S
550 859 1 3 Baclini, Mrs. Solomon (Latifa Qurban) female 24.0 0 3 2666 19.2583 NaN C
551 861 0 3 Hansen, Mr. Claus Peter male 41.0 2 0 350026 14.1083 NaN S
552 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
553 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
554 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
555 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
556 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
557 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
558 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
559 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
560 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
561 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
562 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
563 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
564 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
565 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
566 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
567 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
568 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
569 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
570 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
571 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
572 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
573 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
574 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
575 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

576 rows × 12 columns

midage.loc[[100],['Pclass','Sex']]
Pclass Sex
100 3 female
loc是根据index来索引。iloc并不是根据index来索引,而是根据行号来索引,行号从0开始,逐次加1。

【思考】这个reset_index()函数的作用是什么?如果不用这个函数,下面的任务会出现什么情况?

1.5.4 任务四:将midage的数据中第100,105,108行的"Pclass","Name"和"Sex"的数据显示出来

#写入代码
midage.loc[[100,105,108],['Pclass','Name','Sex']]
Pclass Name Sex
100 3 Petranec, Miss. Matilda female
105 3 Mionoff, Mr. Stoytcho male
108 3 Rekic, Mr. Tido male

【提示】使用pandas提出的简单方式,你可以看看loc方法

对比整体的数据位置,你有发现什么问题吗?那么如何解决?

1.5.5 任务五:使用iloc方法将midage的数据中第100,105,108行的"Pclass","Name"和"Sex"的数据显示出来

#写入代码

你可能感兴趣的:(数据分析)