Data Analysis, Chapter 1, task1: pandas basics

1 Chapter 1: Loading Data and First Observations

1.4 Know what your data is called

We are learning the basic operations of pandas. So what is the data type of the data we loaded with pandas in the previous section?

Before starting, import numpy and pandas:

import numpy as np
import pandas as pd

1.4.1 Task 1: pandas has two core data types, DataFrame and Series. Look them up to get a basic idea of each, then write a small example of your own for both [open-ended]

# Write your code here
data1={'tom':2500,'jack':4500,'mary':5600,'me':15000}
ep1= pd.Series(data1)
print(ep1)
print(ep1.values)
print(ep1.index)
ep1[1:3]
print(ep1['tom'])
tom      2500
jack     4500
mary     5600
me      15000
dtype: int64
[ 2500  4500  5600 15000]
Index(['tom', 'jack', 'mary', 'me'], dtype='object')
2500
# DataFrame
data2={'tom':2500,'jack':4500,'mary':5600,'me':15000,'dsfs':4800,'eng':4890}
money_data=pd.DataFrame({'money':data2})
print(money_data)
money_data.index
      money
dsfs   4800
eng    4890
jack   4500
mary   5600
me    15000
tom    2500





Index(['dsfs', 'eng', 'jack', 'mary', 'me', 'tom'], dtype='object')
'''
# Our example
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
'''
# Our example
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''
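
A minimal sketch of how the two types relate, using made-up salary data (the names `salaries` and `staff` are illustrative and not part of the course material): each column of a DataFrame is itself a Series, and all columns share the same row index.

```python
import pandas as pd

# Hypothetical data, for illustration only
salaries = pd.Series({'tom': 2500, 'jack': 4500, 'mary': 5600})

# Build a DataFrame from two Series that carry the same index labels
staff = pd.DataFrame({
    'money': salaries,
    'bonus': pd.Series({'tom': 100, 'jack': 200, 'mary': 300}),
})

column = staff['money']                   # selecting one column gives back a Series
print(type(staff).__name__)               # DataFrame
print(type(column).__name__)              # Series
print(column.index.equals(staff.index))   # True: the column shares the row index
```

Roughly speaking, a Series is one labeled column of values, and a DataFrame is a collection of such columns aligned on a common index.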


1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson

# Write your code here
df = pd.read_csv('train.csv')
df.head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

You can also load the "train_chinese.csv" file saved in the previous lesson. Having used the translated train_chinese.csv to get familiar with the dataset, we now work on train.csv.

1.4.3 Task 3: View the name of each column of the DataFrame

# Write your code here
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

1.4.4 Task 4: View all the entries in the "Cabin" column [there are several ways]

# Write your code here
df['Cabin'].head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
# Write your code here
df.Cabin.head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
# To see only the distinct values, use dataframe['xxx'].unique()
df['Cabin'].unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)
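
Besides `unique()`, a few other Series methods summarize a column like this one. The sketch below uses a tiny made-up stand-in for the Cabin column (the name `cabins` and its values are assumptions for illustration), so it runs without train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Cabin column, for illustration only
cabins = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123', 'C85']})

print(cabins['Cabin'].nunique())        # number of distinct non-missing values
print(cabins['Cabin'].isna().sum())     # number of missing values
print(cabins['Cabin'].value_counts())   # how often each non-missing value occurs
```

Note that `unique()` includes NaN in its result, whereas `nunique()` and `value_counts()` skip missing values by default.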

1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns

Looking at the data, we find that the test set test_1.csv has one redundant column, which we need to delete.

# Write your code here
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked a
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 100
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 100
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 100
# Write your code here
del test_1['a']
test_1.head(3)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

[Think] Are there other ways to delete the extra column?

# Answer
# Method 1
# https://www.cnblogs.com/datasnail/p/9767158.html
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
test_1.drop(['a'],axis=1,inplace=True)
test_1.head(3)

Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
# Method 2
test_1=pd.read_csv('test_1.csv')
test_1.head(3)
test_1.drop(columns=['a'],inplace=True)
test_1.head(3)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
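
The extra columns can also be found programmatically rather than by eye. The following is a minimal sketch with two small made-up frames standing in for train.csv and test_1.csv (the frame contents and the names `extra` and `cleaned` are assumptions for illustration); it uses `Index.difference` to list the columns that appear only in the second frame.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test_1.csv
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test_1 = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'a': [100, 100]})

# Columns present in test_1 but not in train
extra = test_1.columns.difference(train.columns)
print(list(extra))

# Drop them; drop() returns a new DataFrame unless inplace=True is passed
cleaned = test_1.drop(columns=extra)
print(list(cleaned.columns))
```

On the real files this comparison would also flag the `Unnamed: 0` column visible in the test_1.csv output above, since train.csv has no column with that name.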

1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns

# Write your code here
df.drop(['PassengerId','Name','Age','Ticket'],axis=1).head(3)
   Survived  Pclass     Sex  SibSp  Parch     Fare Cabin Embarked
0         0       3    male      1      0   7.2500   NaN        S
1         1       1  female      1      0  71.2833   C85        C
2         1       3  female      0      0   7.9250   NaN        S

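[Think] Compare Task 5 and Task 6: did they use different methods (functions)? If you use the same function, how can you meet the two different requirements above?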

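[Answer]

Both tasks can be done with the same function, drop(). To delete columns from the DataFrame permanently, pass inplace=True, which overwrites the original data. Task 6 only hides the columns for viewing, so inplace=True is not used there and df itself is left unchanged.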
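
A small sketch of the difference, on a made-up three-column frame (the name `people` and its values are assumptions for illustration): without `inplace=True`, `drop()` returns a modified copy and leaves the original alone; with it, the original is changed and nothing is returned.

```python
import pandas as pd

# Hypothetical data, for illustration only
people = pd.DataFrame({'Name': ['A', 'B'], 'Age': [22, 38], 'Fare': [7.25, 71.28]})

# Without inplace: a new DataFrame comes back, people is untouched
view = people.drop(['Name'], axis=1)
print('Name' in people.columns)   # True

# With inplace=True: people itself loses the column, and drop() returns None
result = people.drop(['Age'], axis=1, inplace=True)
print('Age' in people.columns)    # False
print(result)                     # None
```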

1.5 The logic of filtering

One of the most important things to be able to do with tabular data is filter it: pick out the information you need and discard the rest.

As before, we will learn this pandas feature through hands-on practice.

1.5.1 Task 1: Using "Age" as the filter condition, display the passengers younger than 10.

# Write your code here
df[df["Age"]<10].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.0 1 2 SC/Paris 2123 41.5792 NaN C

1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name the result midage

# Write your code here
# Method 1:
# midage=df[(df["Age"]>10) &(df["Age"]<50)]
# midage.head()
# Method 2:
midage=df[(df.Age>10) &(df.Age<50)]
midage.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

[Hint] Learn how conditional filtering works in pandas and how to combine conditions as an intersection (&) or a union (|)

dage=df[df.Age<10]
dage.head()
# When combining several conditions, each one must be wrapped in ()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
43 44 1 2 Laroche, Miss. Simonne Marie Anne Andree female 3.0 1 2 SC/Paris 2123 41.5792 NaN C
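
The intersection and union mentioned in the hint can be sketched as follows, on a small made-up Age column (the name `ages` and its values are assumptions for illustration). The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python, so `df.Age>10 & df.Age<50` would not be parsed as two separate comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.DataFrame({'Age': [2.0, 22.0, 38.0, 54.0, np.nan, 8.0]})

both = ages[(ages['Age'] > 10) & (ages['Age'] < 50)]     # intersection: both hold
either = ages[(ages['Age'] < 10) | (ages['Age'] > 50)]   # union: at least one holds
negated = ages[~(ages['Age'] < 10)]                      # negation of a condition

print(both['Age'].tolist())      # [22.0, 38.0]
print(either['Age'].tolist())    # [2.0, 54.0, 8.0]
print(len(negated))              # 4: the row with a missing Age is kept
```

Comparisons against NaN evaluate to False, so rows with a missing Age are silently excluded by the first two filters but retained by the negated one.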

1.5.3 Task 3: Display the "Pclass" and "Sex" values of row 100 of midage

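# Write your code here
# https://www.cnblogs.com/nxf-rabbit75/p/10105271.html
# Method 1:
midage.ix[100,'Pclass','Sex']
# This no longer works: .ix is deprecated (and was removed in pandas 1.0); use .loc or .iloc instead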
C:\Users\13153\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  after removing the cwd from sys.path.



---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

 in 
      2 # https://www.cnblogs.com/nxf-rabbit75/p/10105271.html
      3 # Method 1:
----> 4 midage.ix[100,'Pclass','Sex']


~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    123             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    124             try:
--> 125                 values = self.obj._get_value(*key)
    126             except (KeyError, TypeError, InvalidIndexError, AttributeError):
    127                 # TypeError occurs here if the key has non-hashable entries,


~\Anaconda3\lib\site-packages\pandas\core\frame.py in _get_value(self, index, col, takeable)
   2823 
   2824         if takeable:
-> 2825             series = self._iget_item_cache(col)
   2826             return com.maybe_box_datetimelike(series._values[index])
   2827 


~\Anaconda3\lib\site-packages\pandas\core\generic.py in _iget_item_cache(self, item)
   3292         ax = self._info_axis
   3293         if ax.is_unique:
-> 3294             lower = self._get_item_cache(ax[item])
   3295         else:
   3296             lower = self.take(item, axis=self._info_axis_number)


~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __getitem__(self, key)
   4278         if is_scalar(key):
   4279             key = com.cast_scalar_indexer(key)
-> 4280             return getitem(key)
   4281 
   4282         if isinstance(key, slice):


IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
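# This version does not call reset_index() first, so it is wrong:
# it selects the row labeled 100 in the original df (its 101st row), not a row counted within midage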
midage.loc[[100],['Pclass','Sex']]
     Pclass     Sex
100       3  female
# Correct approach:
midage = midage.reset_index(drop=True)
midage.head(3)
midage.loc[[100],['Pclass','Sex']]
     Pclass   Sex
100       2  male

[Think] What does the reset_index() function do? If you do not use it, what happens in the tasks below?
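
One way to explore this question is with a tiny made-up frame (the names `frame`, `adults` and `renumbered` are assumptions for illustration). After filtering, the surviving rows keep their original index labels, so label-based `loc` and position-based `iloc` no longer agree until the index is reset.

```python
import pandas as pd

# Hypothetical data, for illustration only
frame = pd.DataFrame({'Age': [5, 30, 7, 40, 45]})
adults = frame[frame['Age'] > 10]             # surviving rows keep labels 1, 3, 4

print(adults.index.tolist())                  # [1, 3, 4]
print(adults['Age'].iloc[1])                  # position 1 -> 40
print(adults.loc[1, 'Age'])                   # label 1    -> 30

# drop=True discards the old labels instead of adding them as a column
renumbered = adults.reset_index(drop=True)    # labels become 0, 1, 2
print(renumbered.loc[1, 'Age'])               # 40: label and position now agree
```

Without `reset_index()`, asking `loc` for a label that was filtered out (here, 0 or 2) raises a KeyError, and asking for one that survived returns a row counted in the original data rather than in the filtered result.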

1.5.4 Task 4: Display the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage

# Write your code here
midage.loc[[100,105,108],['Pclass','Name','Sex']] 
     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male

[Hint] Use the simple approach pandas provides; take a look at the loc method

Compare the result against the row positions in the full data: do you notice a problem? If so, how would you solve it?

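# In a Jupyter Notebook, a DataFrame that is output directly is rendered as a bordered HTML table.
# But only the last expression in a cell is rendered this way, so two tables cannot be shown in one cell; display() gets around that.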
from IPython.display import display
display(midage)
display(midage.loc[[100,105,108],['Pclass','Name','Sex']] )
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
571 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
572 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
573 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
574 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
575 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

576 rows × 12 columns

     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male

1.5.5 Task 5: Use the iloc method to display the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage

# Write your code here
# Method 1:
midage.iloc[[100,105,108],[2,3,4]]
     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male
# Method 2 (loc, for comparison):
midage.loc[[100,105,108],["Pclass","Name","Sex"]]
     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male
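
Hard-coding the column positions `[2,3,4]` works only as long as the column order does not change. As a hedged alternative, the positions can be looked up from the column names with `Index.get_indexer`; the sketch below uses a small made-up frame (the name `passengers` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data with the same first five column names as train.csv
passengers = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Name': ['A', 'B', 'C'],
    'Sex': ['male', 'female', 'female'],
})

# Translate column names into integer positions for use with iloc
positions = passengers.columns.get_indexer(['Pclass', 'Name', 'Sex'])
print(positions.tolist())        # [2, 3, 4]

# Rows by position, columns by the looked-up positions
subset = passengers.iloc[[0, 2], positions]
print(subset)
```

One caveat: `get_indexer` returns -1 for a name that does not exist instead of raising an error, so it is worth checking the result before passing it to `iloc`.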
