Python pandas basics and practice exercises

####pandas: a data analysis and processing library

import pandas as pd

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')#raw string, so backslashes such as \t are not treated as escapes

df

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande… female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna “Annie” female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia… female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O’Dwyer, Miss. Ellen “Nellie” female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
… … … … … … … … … … … … …
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba… female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie “Jane” female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

#head() returns the first rows (5 by default); pass a number to choose how many

df.head(6)

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

df.info()#info() prints a summary of the DataFrame: index range, columns, non-null counts, dtypes, memory usage


RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

df.index

RangeIndex(start=0, stop=891, step=1)

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')

df.dtypes

PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
[2, 1, 1, ..., 71.2833, 'C85', 'C'],
[3, 1, 3, ..., 7.925, nan, 'S'],
...,
[889, 0, 3, ..., 23.45, nan, 'S'],
[890, 1, 1, ..., 30.0, 'C148', 'C'],
[891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
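The attributes shown so far can also be tried without the Titanic file. The following is only a minimal sketch on a small made-up frame; the name `small` and all of its values are invented for illustration.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic data
small = pd.DataFrame({'Name': ['x', 'y', 'z'],
                      'Age': [22.0, None, 26.0],
                      'Pclass': [3, 1, 3]})

first_two = small.head(2)             # first two rows only
column_names = list(small.columns)    # column labels as a plain list
age_dtype = small['Age'].dtype        # float64; None is stored as NaN
non_null_age = small['Age'].count()   # count() ignores NaN
raw = small.values                    # the data as a NumPy array
```

`count()` returning 2 here mirrors why `info()` reports 714 non-null `Age` values out of 891 rows above.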

#Create a DataFrame

#Build the pandas object from a dict

data={'country':['aaa','bbb','ccc'],#each dict key becomes a column name
      'population':[10,12,24]}

data

{'country': ['aaa', 'bbb', 'ccc'], 'population': [10, 12, 24]}

df_data=pd.DataFrame(data)

df_data

country 	population

0 aaa 10
1 bbb 12
2 ccc 24

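df_data.info#without parentheses this shows the bound method, not the summary; call df_data.info() for the summary

<bound method DataFrame.info of   country  population
0 aaa 10
1 bbb 12
2 ccc 24>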

#Select specific data

age=df['Age']

age[:5]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

#Series: a single row or column of a DataFrame

age.index

RangeIndex(start=0, stop=891, step=1)

age.values[:5]

array([ 22., 38., 26., 35., 35.])

df.head()

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

df['Age'][:5]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

df=df.set_index('Name')#use the Name column as the index; the outputs below are labelled by name

age=df['Age']

age[:5]

Name
Braund, Mr. Owen Harris 22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0
Heikkinen, Miss. Laina 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
Allen, Mr. William Henry 35.0
Name: Age, dtype: float64

#Arithmetic on a Series (applied element by element)

age[:5]+10

Name
Braund, Mr. Owen Harris 32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 48.0
Heikkinen, Miss. Laina 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 45.0
Allen, Mr. William Henry 45.0
Name: Age, dtype: float64

age*10

Name
Braund, Mr. Owen Harris 220.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 380.0
Heikkinen, Miss. Laina 260.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 350.0
Allen, Mr. William Henry 350.0
Moran, Mr. James NaN
McCarthy, Mr. Timothy J 540.0
Palsson, Master. Gosta Leonard 20.0
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 270.0
Nasser, Mrs. Nicholas (Adele Achem) 140.0
Sandstrom, Miss. Marguerite Rut 40.0
Bonnell, Miss. Elizabeth 580.0
Saundercock, Mr. William Henry 200.0
Andersson, Mr. Anders Johan 390.0
Vestrom, Miss. Hulda Amanda Adolfina 140.0
Hewlett, Mrs. (Mary D Kingcome) 550.0
Rice, Master. Eugene 20.0
Williams, Mr. Charles Eugene NaN
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) 310.0
Masselmani, Mrs. Fatima NaN
Fynney, Mr. Joseph J 350.0
Beesley, Mr. Lawrence 340.0
McGowan, Miss. Anna “Annie” 150.0
Sloper, Mr. William Thompson 280.0
Palsson, Miss. Torborg Danira 80.0
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) 380.0
Emir, Mr. Farred Chehab NaN
Fortune, Mr. Charles Alexander 190.0
O’Dwyer, Miss. Ellen “Nellie” NaN
Todoroff, Mr. Lalio NaN

Giles, Mr. Frederick Edward 210.0
Swift, Mrs. Frederick Joel (Margaret Welles Barron) 480.0
Sage, Miss. Dorothy Edith “Dolly” NaN
Gill, Mr. John William 240.0
Bystrom, Mrs. (Karolina) 420.0
Duran y More, Miss. Asuncion 270.0
Roebling, Mr. Washington Augustus II 310.0
van Melkebeke, Mr. Philemon NaN
Johnson, Master. Harold Theodor 40.0
Balkic, Mr. Cerin 260.0
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 470.0
Carlsson, Mr. Frans Olof 330.0
Vander Cruyssen, Mr. Victor 470.0
Abelson, Mrs. Samuel (Hannah Wizosky) 280.0
Najib, Miss. Adele Kiamie “Jane” 150.0
Gustafsson, Mr. Alfred Ossian 200.0
Petroff, Mr. Nedelio 190.0
Laleff, Mr. Kristo NaN
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 560.0
Shelley, Mrs. William (Imanita Parrish Hall) 250.0
Markun, Mr. Johann 330.0
Dahlberg, Miss. Gerda Ulrika 220.0
Banfield, Mr. Frederick James 280.0
Sutehall, Mr. Henry Jr 250.0
Rice, Mrs. William (Margaret Norton) 390.0
Montvila, Rev. Juozas 270.0
Graham, Miss. Margaret Edith 190.0
Johnston, Miss. Catherine Helen “Carrie” NaN
Behr, Mr. Karl Howell 260.0
Dooley, Mr. Patrick 320.0
Name: Age, Length: 891, dtype: float64

age[:5]*10

Name
Braund, Mr. Owen Harris 220.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 380.0
Heikkinen, Miss. Laina 260.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 350.0
Allen, Mr. William Henry 350.0
Name: Age, dtype: float64

age.mean()

29.69911764705882

age.max()

80.0

age.min()

0.41999999999999998

#describe() gives the basic summary statistics of the numeric columns in a simple, readable form

df.describe()

PassengerId 	Survived 	Pclass 	Age 	SibSp 	Parch 	Fare

count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
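The `Age` count of 714 above reflects missing values. As a rough sketch of how missing values behave in Series arithmetic and statistics (the Series `ages` and its values are made up for illustration):

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, None, 35.0], index=['a', 'b', 'c', 'd'])

plus_ten = ages + 10                 # element-wise; NaN stays NaN
mean_skipna = ages.mean()            # NaN is skipped by default
n_valid = ages.count()               # number of non-missing values
manual_mean = ages.sum() / n_valid   # same result as mean()
```

The mean is therefore taken over the valid entries only, not over the full length of the Series.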

####pandas indexing

import pandas as pd

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

df

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande… female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna “Annie” female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia… female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O’Dwyer, Miss. Ellen “Nellie” female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
… … … … … … … … … … … … …
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba… female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith “Dolly” female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie “Jane” female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen “Carrie” female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

df['Age'][:5]

0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64

df[['Age','Fare']][:5]

Age 	Fare

0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500

loc: select by label; iloc: select by integer position

df.iloc[0]

PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object

df.iloc[0:5]

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

#Select a subset of rows and columns by position

df.iloc[0:5,1:3]

Survived 	Pclass

0 0 3
1 1 1
2 1 3
3 1 1
4 0 3
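One detail that is easy to miss is how the two indexers treat the end of a slice. A small sketch on an invented frame (`frame` and its labels are assumptions for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                     index=['w', 'x', 'y', 'z'])

by_position = frame.iloc[0:2]     # rows at positions 0 and 1; the end is excluded
by_label = frame.loc['w':'y']     # rows w, x and y; the end label is included
cell = frame.loc['x', 'B']        # a single value by row and column label
same_cell = frame.iloc[1, 1]      # the same value by position
```

This is why `df.iloc[0:5,1:3]` above returns five rows and two columns.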

df=df.set_index('Name')#use the Name column as the index

#Note: running this line a second time raises KeyError: 'Name', because Name is by then the index and no longer a column

df.loc['Heikkinen, Miss. Laina','Fare']=1000

df.head()

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
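The same pattern can be sketched on a tiny frame. The names `riders`, `indexed` and `restored` and the values are made up; `reset_index()` is shown as one possible way to undo the step.

```python
import pandas as pd

riders = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy'],
                       'Fare': [7.25, 71.28, 7.92]})

indexed = riders.set_index('Name')   # returns a new frame; riders is unchanged
indexed.loc['Cy', 'Fare'] = 1000     # assign through row label and column label
restored = indexed.reset_index()     # turn the index back into an ordinary column
```

Because `set_index` returns a new frame by default, the result has to be assigned, as in `df=df.set_index('Name')`.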

####Boolean indexing

df['Fare']>40

df.head()

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S

df[df['Fare']>40]

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S
Spencer, Mrs. William Augustus (Marie Eugenie) 32 1 1 female NaN 1 0 PC 17569 146.5208 B78 C
Meyer, Mr. Edgar Joseph 35 0 1 male 28.0 1 0 PC 17604 82.1708 NaN C
Holverson, Mr. Alexander Oskar 36 0 1 male 42.0 1 0 113789 52.0000 NaN S
Laroche, Miss. Simonne Marie Anne Andree 44 1 2 female 3.0 1 2 SC/Paris 2123 41.5792 NaN C
Harper, Mrs. Henry Sleeper (Myna Haxtun) 53 1 1 female 49.0 1 0 PC 17572 76.7292 D33 C
Ostby, Mr. Engelhart Cornelius 55 0 1 male 65.0 0 1 113509 61.9792 B30 C
Goodwin, Master. William Frederick 60 0 3 male 11.0 5 2 CA 2144 46.9000 NaN S
Icard, Miss. Amelie 62 1 1 female 38.0 0 0 113572 80.0000 B28 NaN
Harris, Mr. Henry Birkhardt 63 0 1 male 45.0 1 0 36973 83.4750 C83 S
Goodwin, Miss. Lillian Amy 72 0 3 female 16.0 5 2 CA 2144 46.9000 NaN S
Hood, Mr. Ambrose Jr 73 0 2 male 21.0 0 0 S.O.C. 14879 73.5000 NaN S
Bing, Mr. Lee 75 1 3 male 32.0 0 0 1601 56.4958 NaN S
Carrau, Mr. Francisco M 84 0 1 male 28.0 0 0 113059 47.1000 NaN S
Fortune, Miss. Mabel Helen 89 1 1 female 23.0 3 2 19950 263.0000 C23 C25 C27 S
Chaffee, Mr. Herbert Fuller 93 0 1 male 46.0 1 0 W.E.P. 5734 61.1750 E31 S
Greenfield, Mr. William Bertram 98 1 1 male 23.0 0 1 PC 17759 63.3583 D10 D12 C
White, Mr. Richard Frasar 103 0 1 male 21.0 0 1 35281 77.2875 D26 S
Porter, Mr. Walter Chamberlain 111 0 1 male 47.0 0 0 110465 52.0000 C110 S
Baxter, Mr. Quigg Edmond 119 0 1 male 24.0 0 1 PC 17558 247.5208 B58 B60 C
Hickman, Mr. Stanley George 121 0 2 male 21.0 2 0 S.O.C. 14879 73.5000 NaN S
White, Mr. Percival Wayland 125 0 1 male 54.0 0 1 35281 77.2875 D26 S
Futrelle, Mr. Jacques Heath 138 0 1 male 37.0 1 0 113803 53.1000 C123 S
Giglio, Mr. Victor 140 0 1 male 24.0 0 0 PC 17593 79.2000 B86 C
Pears, Mrs. Thomas (Edith Wearne) 152 1 1 female 22.0 1 0 113776 66.6000 C2 S
Williams, Mr. Charles Duane 156 0 1 male 51.0 0 1 PC 17597 61.3792 NaN C
… … … … … … … … … … … …
Endres, Miss. Caroline Louise 717 1 1 female 38.0 0 0 PC 17757 227.5250 C45 C
Chambers, Mr. Norman Campbell 725 1 1 male 27.0 1 0 113806 53.1000 E8 S
Allen, Miss. Elisabeth Walton 731 1 1 female 29.0 0 0 24160 211.3375 B5 S
Lesurer, Mr. Gustave J 738 1 1 male 35.0 0 0 PC 17755 512.3292 B101 C
Cavendish, Mr. Tyrell William 742 0 1 male 36.0 1 0 19877 78.8500 C46 S
Ryerson, Miss. Susan Parker “Suzette” 743 1 1 female 21.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
Crosby, Capt. Edward Gifford 746 0 1 male 70.0 1 1 WE/P 5735 71.0000 B22 S
Marvin, Mr. Daniel Warner 749 0 1 male 19.0 1 0 113773 53.1000 D30 S
Herman, Mrs. Samuel (Jane Laver) 755 1 2 female 48.0 1 2 220845 65.0000 NaN S
Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 760 1 1 female 33.0 0 0 110152 86.5000 B77 S
Carter, Mrs. William Ernest (Lucile Polk) 764 1 1 female 36.0 1 2 113760 120.0000 B96 B98 S
Hogeboom, Mrs. John C (Anna Andrews) 766 1 1 female 51.0 1 0 13502 77.9583 D11 S
Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) 780 1 1 female 43.0 0 1 24160 211.3375 B3 S
Dick, Mrs. Albert Adrian (Vera Gillespie) 782 1 1 female 17.0 1 0 17474 57.0000 B20 S
Guggenheim, Mr. Benjamin 790 0 1 male 46.0 0 0 PC 17593 79.2000 B82 B84 C
Sage, Miss. Stella Anna 793 0 3 female NaN 8 2 CA. 2343 69.5500 NaN S
Carter, Master. William Thornton II 803 1 1 male 11.0 1 2 113760 120.0000 B96 B98 S
Chambers, Mrs. Norman Campbell (Bertha Griggs) 810 1 1 female 33.0 1 0 113806 53.1000 E8 S
Hays, Mrs. Charles Melville (Clara Jennings Gregg) 821 1 1 female 52.0 1 1 12749 93.5000 B69 S
Lam, Mr. Len 827 0 3 male NaN 0 0 1601 56.4958 NaN S
Stone, Mrs. George Nelson (Martha Evelyn) 830 1 1 female 62.0 0 0 113572 80.0000 B28 NaN
Compton, Miss. Sara Rebecca 836 1 1 female 39.0 1 1 PC 17756 83.1583 E49 C
Chip, Mr. Chang 839 1 3 male 32.0 0 0 1601 56.4958 NaN S
Sage, Mr. Douglas Bullen 847 0 3 male NaN 8 2 CA. 2343 69.5500 NaN S
Goldenberg, Mrs. Samuel L (Edwiga Grabowska) 850 1 1 female NaN 1 0 17453 89.1042 C92 C
Wick, Mrs. George Dennick (Mary Hitchcock) 857 1 1 female 45.0 1 1 36928 164.8667 NaN S
Sage, Miss. Dorothy Edith “Dolly” 864 0 3 female NaN 8 2 CA. 2343 69.5500 NaN S
Roebling, Mr. Washington Augustus II 868 0 1 male 31.0 0 0 PC 17590 50.4958 A24 S
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 872 1 1 female 47.0 1 1 11751 52.5542 D35 S
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 880 1 1 female 56.0 0 1 11767 83.1583 C50 C

177 rows × 11 columns

#Filter rows with a boolean condition

df[df['Fare'] > 40][:5]

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S

df[df['Sex'] == 'male'][:5]

PassengerId 	Survived 	Pclass 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Moran, Mr. James 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Palsson, Master. Gosta Leonard 8 0 3 male 2.0 3 1 349909 21.0750 NaN S

df.loc[df['Sex']=='male','Age'].mean()

30.72664459161148

(df['Age'] > 70).sum()

5
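Boolean masks can also be combined. A minimal sketch on invented data (the frame `people` and its values are assumptions, not rows from the Titanic file):

```python
import pandas as pd

people = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                       'Age': [22.0, 38.0, 71.0, 14.0],
                       'Fare': [7.25, 71.28, 34.65, 30.07]})

is_male = people['Sex'] == 'male'    # Boolean Series, one value per row
pricey = people['Fare'] > 30

# Combine masks with & (and) or | (or); when the comparisons are written
# inline, wrap each one in parentheses
male_and_pricey = people[is_male & pricey]
n_over_70 = (people['Age'] > 70).sum()                 # True counts as 1
mean_female_age = people.loc[~is_male, 'Age'].mean()   # ~ negates the mask
```

Summing a mask counts the rows that satisfy the condition, which is what `(df['Age'] > 70).sum()` does above.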

###groupby operations

df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data':[0,5,10,5,10,15,10,15,20]})

df

data 	key

0 0 A
1 5 B
2 10 C
3 5 A
4 10 B
5 15 C
6 10 A
7 15 B
8 20 C

#Manual version: filter each key in turn and sum its rows
for key in ['A','B','C']:
    print(key,df[df['key'] == key].sum())

A data 15
key AAA
dtype: object
B data 30
key BBB
dtype: object
C data 45
key CCC
dtype: object

df.groupby('key').sum()

data

key
A 15
B 30
C 45

import numpy as np

df.groupby('key').aggregate(np.sum)

data

key
A 15
B 30
C 45

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

df.groupby('Sex')['Age'].mean()

Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64

df.groupby('Sex')['Survived'].mean()

Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
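Several statistics per group can be requested in one call. The following is a sketch on a shortened version of the toy frame; `toy`, `totals` and `summary` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': [0, 5, 10, 5, 10, 15]})

totals = toy.groupby('key')['data'].sum()   # one sum per group
# agg() with a list of function names returns one column per statistic
summary = toy.groupby('key')['data'].agg(['sum', 'mean', 'count'])
```

Selecting the column before aggregating, as in `groupby('Sex')['Age']`, keeps the result limited to the column of interest.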

#Numerical operations

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a','b'],columns = ['A','B','C'])

df

A 	B 	C

a 1 2 3
b 4 5 6

df.sum()#sums each column by default

df.sum(axis=0)#axis=0: add down the rows, giving one total per column

A 5
B 7
C 9
dtype: int64

df.sum(axis=1)#axis=1: add across the columns, giving one total per row

a 6
b 15
dtype: int64

df.sum(axis='columns')#same as axis=1

a 6
b 15
dtype: int64

df.mean()

A 2.5
B 3.5
C 4.5
dtype: float64

df.std()

A 2.12132
B 2.12132
C 2.12132
dtype: float64

df.var()

A 4.5
B 4.5
C 4.5
dtype: float64

df.max()

A 4
B 5
C 6
dtype: int64

df.min()

A 1
B 2
C 3
dtype: int64

df.max(axis=0)

A 4
B 5
C 6
dtype: int64

df.min(axis=1)

a 1
b 4
dtype: int64

df.median()

A 2.5
B 3.5
C 4.5
dtype: float64

###Pairwise (bivariate) statistics

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')

df.head()

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

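df.cov()#covariance between numeric columns; on pandas 2.0 and later pass numeric_only=True, since this frame also has text columns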

PassengerId 	Survived 	Pclass 	Age 	SibSp 	Parch 	Fare

PassengerId 66231.000000 -0.626966 -7.561798 138.696504 -16.325843 -0.342697 161.883369
Survived -0.626966 0.236772 -0.137703 -0.551296 -0.018954 0.032017 6.221787
Pclass -7.561798 -0.137703 0.699015 -4.496004 0.076599 0.012429 -22.830196
Age 138.696504 -0.551296 -4.496004 211.019125 -4.163334 -2.344191 73.849030
SibSp -16.325843 -0.018954 0.076599 -4.163334 1.216043 0.368739 8.748734
Parch -0.342697 0.032017 0.012429 -2.344191 0.368739 0.649728 8.661052
Fare 161.883369 6.221787 -22.830196 73.849030 8.748734 8.661052 2469.436846

df.corr()#correlation between numeric columns; on pandas 2.0 and later pass numeric_only=True here as well

PassengerId 	Survived 	Pclass 	Age 	SibSp 	Parch 	Fare

PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
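The two matrices are related: the default (Pearson) correlation is the covariance divided by the two standard deviations. A small check on made-up numbers (`pair` is an invented frame):

```python
import pandas as pd

pair = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'y': [2.0, 4.0, 6.0, 9.0]})

cov_xy = pair.cov().loc['x', 'y']
corr_xy = pair.corr().loc['x', 'y']

# Pearson correlation = covariance scaled by both standard deviations
manual_corr = cov_xy / (pair['x'].std() * pair['y'].std())
```

This scaling is why every diagonal entry of `df.corr()` is 1 while the diagonal of `df.cov()` holds the variances.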

df['Age'].value_counts()

24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
28.00 25
21.00 24
25.00 23
36.00 22
29.00 20
32.00 18
27.00 18
35.00 18
26.00 18
16.00 17
31.00 17
20.00 15
33.00 15
23.00 15
34.00 15
39.00 14
17.00 13
42.00 13
40.00 13
45.00 12
38.00 11
50.00 10
2.00 10
4.00 10
47.00 9

71.00 2
59.00 2
63.00 2
0.83 2
30.50 2
70.00 2
57.00 2
0.75 2
13.00 2
10.00 2
64.00 2
40.50 2
32.50 2
45.50 2
20.50 1
24.50 1
0.67 1
14.50 1
0.92 1
74.00 1
34.50 1
80.00 1
12.00 1
36.50 1
53.00 1
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: Age, Length: 88, dtype: int64

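df['Age'].value_counts(ascending=True,bins=5)#bins=5 groups the ages into five equal-width intervals; ascending=True sorts the counts from smallest to largest, False (the default) from largest to smallest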

(64.084, 80.0] 11
(48.168, 64.084] 69
(0.339, 16.336] 100
(32.252, 48.168] 188
(16.336, 32.252] 346
Name: Age, dtype: int64
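A short sketch of the `bins` and `normalize` parameters on invented ages (the Series and the choice of four bins are assumptions for illustration):

```python
import pandas as pd

ages = pd.Series([1, 5, 20, 22, 25, 40, 41, 80])

# Four equal-width intervals over the range 1..80; sort=False avoids
# reordering the result by frequency
binned_counts = ages.value_counts(bins=4, sort=False)
# Relative frequencies instead of raw counts
shares = ages.value_counts(normalize=True)
```

Every value falls into exactly one interval, so the binned counts add up to the length of the Series.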

print(help(pd.value_counts))

Help on function value_counts in module pandas.core.algorithms:

value_counts(values, sort=True, ascending=False, normalize=False, bins=None, dropna=True)
Compute a histogram of the counts of non-null values.

Parameters
----------
values : ndarray (1-d)
sort : boolean, default True
    Sort by values
ascending : boolean, default False
    Sort in ascending order
normalize: boolean, default False
    If True then compute a relative histogram
bins : integer, optional
    Rather than count values, group them into half-open bins,
    convenience for pd.cut, only works with numeric data
dropna : boolean, default True
    Don't include counts of NaN

Returns
-------
value_counts : Series

None

df['Age'].count()

714

*Series: adding, deleting, modifying, and querying

import pandas as pd

data=[10,11,12]

index=['a','b','c']

s=pd.Series(data=data,index=index)

s

a 10
b 11
c 12
dtype: int64

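s.iloc[0]#first element by position; plain s[0] on a label-indexed Series is deprecated in recent pandas versions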

10

s[0:2]

a 10
b 11
dtype: int64

mask=[True,False,True]

s[mask]

a 10
c 12
dtype: int64

s.loc['b']

11

s.iloc[1]

11

**Modifying

s1=s.copy()

s1

a 10
b 11
c 12
dtype: int64

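s1['a']

10

s1.replace(to_replace=100,value=101,inplace=False)#inplace=False is the default and returns a new Series; set inplace=True to modify s1 itself. s1 holds no 100, so the output is unchanged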

a 10
b 11
c 12
dtype: int64

s1.replace(to_replace=100,value=101,inplace=True)

s1

a 10
b 11
c 12
dtype: int64

s1.index

Index(['a', 'b', 'c'], dtype='object')

s1.index=['a','b','c']

s1

a 10
b 11
c 12
dtype: int64

s1.rename(index={'a':'A'},inplace=True)  # rename the index label 'a' to 'A'

s1

A 10
b 11
c 12
dtype: int64

s2=pd.Series([100,500],index=['g','h'])

s2

g 100
h 500
dtype: int64

**Add operations

data=[100,101]

index=['h','k']

s2=pd.Series(data=data,index=index)

s2

h 100
k 101
dtype: int64

s1.append(s2)  # Series.append was removed in pandas 2.0; pd.concat([s1, s2]) is the replacement

A 10
b 11
c 12
h 100
k 101
dtype: int64

s1['j']=500  # assigning to a new label adds an entry in place

s1

A 10
b 11
c 12
j 500
dtype: int64

s1.append(s2,ignore_index=False)

A 10
b 11
c 12
j 500
h 100
k 101
dtype: int64

s1.append(s2,ignore_index=True)

0 10
1 11
2 12
3 500
4 100
5 101
dtype: int64
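`Series.append` was removed in pandas 2.0, so on a current installation the same results come from `pd.concat`. A minimal sketch, rebuilding `s1` and `s2` with the values shown above:

```python
import pandas as pd

s1 = pd.Series([10, 11, 12, 500], index=['A', 'b', 'c', 'j'])
s2 = pd.Series([100, 101], index=['h', 'k'])

# Equivalent of s1.append(s2): the original labels are kept
kept = pd.concat([s1, s2])

# Equivalent of s1.append(s2, ignore_index=True): labels become 0..n-1
renumbered = pd.concat([s1, s2], ignore_index=True)

print(kept)
print(renumbered)
```

Like `append`, `pd.concat` returns a new Series and leaves `s1` and `s2` unchanged.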

**Delete operations

del s1['j']  # remove the entry labelled 'j' in place

s1

A 10
b 11
c 12
dtype: int64

s1.drop(['j'],inplace=True,errors='ignore')  # 'j' is already gone; errors='ignore' avoids a KeyError

s1

A 10
b 11
c 12
dtype: int64

*DataFrame: add, delete, update, and look up

data=[[1,2,3],[4,5,6]]

index=['a','b']

columns=['A','B','C']

df=pd.DataFrame(data=data,index=index,columns=columns)

df

A 	B 	C

a 1 2 3
b 4 5 6

df['A']  # look up: select column A

a 1
b 4
Name: A, dtype: int64

import pandas as pd

df.iloc[0]

A 1
B 2
C 3
Name: a, dtype: int64

df.loc['a']

A 1
B 2
C 3
Name: a, dtype: int64

**Update operations

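df.loc['a','A']

1

df.loc['a','A']=150  # one .loc call with row and column; the chained form df.loc['a']['A']=150 may assign to a temporary copy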

df

A 	B 	C

a 150 2 3
b 4 5 6
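Setting a single cell is most reliable when the row and column go into one indexer call. A minimal sketch on the same small frame; the use of `.at` is an addition for illustration, not something the original notebook used:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['a', 'b'], columns=['A', 'B', 'C'])

# Row label and column label in a single .loc call
df.loc['a', 'A'] = 150

# .at is a scalar-only accessor for the same kind of update
df.at['b', 'C'] = 60

print(df)
```

Both forms write directly into `df`, so the change is visible immediately.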

df.index=['f','g']

df

A 	B 	C

f 150 2 3
g 4 5 6

**Add operations

df.loc['c']=[1,2,3]  # assigning to a new row label appends a row

df

A 	B 	C

f 150 2 3
g 4 5 6
lc 1 2 3
c 1 2 3

data=[[1,2,3],[4,5,6]]

index=['j','k']

columns=['A','B','C']

df2=pd.DataFrame(data=data,index=index,columns=columns)

df2

A 	B 	C

j 1 2 3
k 4 5 6

df3=pd.concat([df,df2],axis=1)

df3

A 	B 	C 	A 	B 	C

c 1.0 2.0 3.0 NaN NaN NaN
f 150.0 2.0 3.0 NaN NaN NaN
g 4.0 5.0 6.0 NaN NaN NaN
j NaN NaN NaN 1.0 2.0 3.0
k NaN NaN NaN 4.0 5.0 6.0
lc 1.0 2.0 3.0 NaN NaN NaN

df2['cui']=[10,11]  # assigning to a new column name adds a column

df2

A 	B 	C 	cui

j 1 2 3 10
k 4 5 6 11

df4=pd.DataFrame([[10,11],[12,13]],index=['j','k'],columns=['D','E'])

df4

D 	E

j 10 11
k 12 13

df5=pd.concat([df2,df4],axis=1)

df5

A 	B 	C 	cui 	D 	E

j 1 2 3 10 10 11
k 4 5 6 11 12 13

**Delete operations

df5.drop(['j'],axis=0,inplace=True)  # axis=0 drops a row by its index label

df5

A 	B 	C 	cui 	D 	E

k 4 5 6 11 12 13

del df5['cui']

df5

A 	B 	C 	D 	E

k 4 5 6 12 13

**merge operations

import pandas as pd

left=pd.DataFrame({'key':['k0','k1','k2','k3'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})

right=pd.DataFrame({'key':['k0','k1','k2','k3'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']})

left

A 	B 	key

0 A0 B0 k0
1 A1 B1 k1
2 A2 B2 k2
3 A3 B3 k3

right

C 	D 	key

0 C0 D0 k0
1 C1 D1 k1
2 C2 D2 k2
3 C3 D3 k3

pd.merge(left,right)

A 	B 	key 	C 	D

0 A0 B0 k0 C0 D0
1 A1 B1 k1 C1 D1
2 A2 B2 k2 C2 D2
3 A3 B3 k3 C3 D3

pd.merge(left,right,on='key')

A 	B 	key 	C 	D

0 A0 B0 k0 C0 D0
1 A1 B1 k1 C1 D1
2 A2 B2 k2 C2 D2
3 A3 B3 k3 C3 D3

left=pd.DataFrame({'key1':['k10','k11','k12','k13'],
                   'key2':['k20','k21','k22','k23'],
                   'A':['A0','A1','A2','A3'],
                   'B':['B0','B1','B2','B3']})

right=pd.DataFrame({'key1':['k10','k11','k12','k13'],
                    'key2':['k20','k21','k22','k23'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']})

left

A 	B 	key1 	key2

0 A0 B0 k10 k20
1 A1 B1 k11 k21
2 A2 B2 k12 k22
3 A3 B3 k13 k23

right

C 	D 	key1 	key2

0 C0 D0 k10 k20
1 C1 D1 k11 k21
2 C2 D2 k12 k22
3 C3 D3 k13 k23

pd.merge(left,right,on=['key1','key2'],how='outer')

A 	B 	key1 	key2 	C 	D

0 A0 B0 k10 k20 C0 D0
1 A1 B1 k11 k21 C1 D1
2 A2 B2 k12 k22 C2 D2
3 A3 B3 k13 k23 C3 D3

#Merge on both keys; indicator=True adds a _merge column that records which table each row came from

pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)

A 	B 	key1 	key2 	C 	D 	_merge

0 A0 B0 k10 k20 C0 D0 both
1 A1 B1 k11 k21 C1 D1 both
2 A2 B2 k12 k22 C2 D2 both
3 A3 B3 k13 k23 C3 D3 both
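In the examples above every key exists in both tables, so `how` and `indicator` make no visible difference. A minimal sketch with partly overlapping keys shows what they do; the two small frames here are made up for illustration:

```python
import pandas as pd

# Hypothetical tables whose keys only partly overlap
left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'C': ['C1', 'C2', 'C3']})

# Default how='inner': only keys present in both tables survive
inner = pd.merge(left, right, on='key')

# how='outer' keeps every key; _merge says where each row came from
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# how='left' keeps all rows of the left table and fills the gaps with NaN
left_join = pd.merge(left, right, on='key', how='left')

print(inner)
print(outer)
print(left_join)
```

The `_merge` column takes the values `left_only`, `right_only`, and `both`, which makes it a quick check for keys that failed to match.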

#Display settings

import pandas as pd

pd.get_option('display.max_rows')

60

pd.Series(index=range(0,100))

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
29 NaN

70 NaN
71 NaN
72 NaN
73 NaN
74 NaN
75 NaN
76 NaN
77 NaN
78 NaN
79 NaN
80 NaN
81 NaN
82 NaN
83 NaN
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
90 NaN
91 NaN
92 NaN
93 NaN
94 NaN
95 NaN
96 NaN
97 NaN
98 NaN
99 NaN
Length: 100, dtype: float64

pd.get_option('display.max_columns')

20

pd.DataFrame(columns=range(0,30))

0 	1 	2 	3 	4 	5 	6 	7 	8 	9 	... 	20 	21 	22 	23 	24 	25 	26 	27 	28 	29

0 rows × 30 columns

pd.set_option('display.max_columns',30)

pd.Series(index=['A'],data=['t'*70])

A tttttttttttttttttttttttttttttttttttttttttttttt…
dtype: object

pd.get_option('display.precision')

6

pd.set_option('display.precision',5)

pd.Series(data=[1.23456789123456])

0 1.23457
dtype: float64
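`pd.set_option` changes a setting for the rest of the session. When a setting is only needed for one display, `pd.option_context` applies it temporarily. A minimal sketch; the particular values 10 and 3 are arbitrary choices for the example:

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')

# Options set here apply only inside the with-block
with pd.option_context('display.max_rows', 10, 'display.precision', 3):
    inside = pd.get_option('display.max_rows')
    text = repr(pd.Series([1.23456789]))

after = pd.get_option('display.max_rows')

# A session-wide change made with set_option can be undone with reset_option
pd.reset_option('display.precision')

print(inside, after)
print(text)
```

After the block ends, `display.max_rows` is back at whatever it was before.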

*pivot operations

#Pivot tables

import pandas as pd

example=pd.DataFrame({'Month': ["January", "January", "January", "January",
                                "February", "February", "February", "February",
                                "March", "March", "March", "March"],
                      'Category': ["Transportation", "Grocery", "Household", "Entertainment",
                                   "Transportation", "Grocery", "Household", "Entertainment",
                                   "Transportation", "Grocery", "Household", "Entertainment"],
                      'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})

example

Amount 	Category 	Month

0 74.0 Transportation January
1 235.0 Grocery January
2 175.0 Household January
3 100.0 Entertainment January
4 115.0 Transportation February
5 240.0 Grocery February
6 225.0 Household February
7 125.0 Entertainment February
8 90.0 Transportation March
9 260.0 Grocery March
10 200.0 Household March
11 120.0 Entertainment March

example_pivot=example.pivot(index='Category',columns='Month',values='Amount')

example_pivot

Month February January March
Category
Entertainment 125.0 100.0 120.0
Grocery 240.0 235.0 260.0
Household 225.0 175.0 200.0
Transportation 115.0 74.0 90.0

example_pivot.sum(axis=1)

Category
Entertainment 345.0
Grocery 735.0
Household 600.0
Transportation 279.0
dtype: float64

example_pivot.sum(axis=0)

Month
February 705.0
January 584.0
March 670.0
dtype: float64

df=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\titanic.csv')  # raw string, so \t in the path is not read as a tab

#The default aggregation function is the mean

df.pivot_table(index='Sex',columns='Pclass',values='Fare')

Pclass 1 2 3
Sex
female 106.12580 21.97012 16.11881
male 67.22613 19.74178 12.66163

df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='max')

Pclass 1 2 3
Sex
female 512.3292 65.0 69.55
male 512.3292 73.5 69.55

df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='count')

Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347

pd.crosstab(index=df['Sex'],columns=df['Pclass'])  # crosstab counts rows per combination, matching the aggfunc='count' table

Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347
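The relationship between `pivot_table` and `crosstab` can be checked on a frame small enough to verify by hand. A minimal sketch; the `people` records are invented and only borrow the Titanic column names:

```python
import pandas as pd

# Hypothetical passenger records, invented for this example
people = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'female', 'male', 'female'],
    'Pclass': [1, 1, 3, 3, 3, 1],
    'Fare': [80.0, 50.0, 8.0, 10.0, 7.0, 100.0],
})

# Default aggregation is the mean of 'Fare' per (Sex, Pclass) cell
mean_fare = people.pivot_table(index='Sex', columns='Pclass', values='Fare')

# aggfunc='count' counts the non-null fares per cell
counts = people.pivot_table(index='Sex', columns='Pclass',
                            values='Fare', aggfunc='count')

# crosstab counts rows per cell without needing a values column
tab = pd.crosstab(index=people['Sex'], columns=people['Pclass'])

print(mean_fare)
print(counts)
print(tab)
```

The two count tables agree here because no fare is missing; with missing values, `aggfunc='count'` would skip them while `crosstab` would still count the rows.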

df['underaged']=df['Age']<=18  # True for passengers aged 18 or younger; missing ages compare as False

#Time operations

import datetime

dt=datetime.datetime(year=2019,month=4,day=3,hour=15,minute=30)

dt

datetime.datetime(2019, 4, 3, 15, 30)

print(dt)

2019-04-03 15:30:00

import pandas as pd

ts=pd.Timestamp('2019-04-03')

ts

Timestamp('2019-04-03 00:00:00')

ts.month

4

ts.day

3

#Addition and subtraction

ts+pd.Timedelta('5days')

Timestamp('2019-04-08 00:00:00')

pd.to_datetime('2019-4-3')

Timestamp('2019-04-03 00:00:00')

pd.to_datetime('3/4/2019')  # slash-separated dates are read month-first by default, giving March 4

Timestamp('2019-03-04 00:00:00')
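The string `3/4/2019` is ambiguous, and pandas reads it month-first unless told otherwise. A minimal sketch of the two ways to get the day-first reading:

```python
import pandas as pd

# Default: month first, so this is 4 March 2019
month_first = pd.to_datetime('3/4/2019')

# dayfirst=True flips the interpretation to 3 April 2019
day_first = pd.to_datetime('3/4/2019', dayfirst=True)

# An explicit format removes the ambiguity entirely
explicit = pd.to_datetime('3/4/2019', format='%d/%m/%Y')

print(month_first, day_first, explicit)
```

Spelling out `format` is the safer choice when the source of the data is known, since it also fails loudly on strings that do not match.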

s=pd.Series(['2019-04-03 00:00:00','2019-04-03 00:00:00','2019-04-03 00:00:00'])

s

0 2019-04-03 00:00:00
1 2019-04-03 00:00:00
2 2019-04-03 00:00:00
dtype: object

ts=pd.to_datetime(s)

ts

0 2019-04-03
1 2019-04-03
2 2019-04-03
dtype: datetime64[ns]

ts.dt.hour

0 0
1 0
2 0
dtype: int64

ts.dt.weekday

0 2
1 2
2 2
dtype: int64

pd.Series(pd.date_range(start='2019-4-3',periods=10,freq='12H'))  # ten timestamps 12 hours apart; pandas 2.2 and later prefer the lowercase alias '12h'

0 2019-04-03 00:00:00
1 2019-04-03 12:00:00
2 2019-04-04 00:00:00
3 2019-04-04 12:00:00
4 2019-04-05 00:00:00
5 2019-04-05 12:00:00
6 2019-04-06 00:00:00
7 2019-04-06 12:00:00
8 2019-04-07 00:00:00
9 2019-04-07 12:00:00
dtype: datetime64[ns]

data=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\flowdata.csv')  # raw string, so \f in the path is not read as a form feed

data

Time 	L06_347 	LS06_347 	LS06_348

0 2009-01-01 00:00:00 0.13742 0.09750 0.01683
1 2009-01-01 03:00:00 0.13125 0.08883 0.01642
2 2009-01-01 06:00:00 0.11350 0.09125 0.01675
3 2009-01-01 09:00:00 0.13575 0.09150 0.01625
4 2009-01-01 12:00:00 0.14092 0.09617 0.01700
5 2009-01-01 15:00:00 0.09917 0.09167 0.01758
6 2009-01-01 18:00:00 0.13267 0.09017 0.01625
7 2009-01-01 21:00:00 0.10942 0.09117 0.01600
8 2009-01-02 00:00:00 0.13383 0.09042 0.01608
9 2009-01-02 03:00:00 0.09208 0.08867 0.01600
10 2009-01-02 06:00:00 0.11292 0.09142 0.01633
11 2009-01-02 09:00:00 0.14192 0.09708 0.01642
12 2009-01-02 12:00:00 0.14783 0.10192 0.01642
13 2009-01-02 15:00:00 0.10792 0.10025 0.01642
14 2009-01-02 18:00:00 0.14358 0.09842 0.01675
15 2009-01-02 21:00:00 0.11308 0.09808 0.01683
16 2009-01-03 00:00:00 0.13583 0.09217 0.01683
17 2009-01-03 03:00:00 0.08325 0.08000 0.01608
18 2009-01-03 06:00:00 0.11942 0.08025 0.01542
19 2009-01-03 09:00:00 0.12458 0.08442 0.01583
20 2009-01-03 12:00:00 0.09167 0.08825 0.01625
21 2009-01-03 15:00:00 0.12500 0.08467 0.01650
22 2009-01-03 18:00:00 0.12158 0.08208 0.01583
23 2009-01-03 21:00:00 0.10717 0.09250 0.01600
24 2009-01-04 00:00:00 0.13525 0.09117 0.01633
25 2009-01-04 03:00:00 0.13558 0.09158 0.01608
26 2009-01-04 06:00:00 0.11717 0.09517 0.01600
27 2009-01-04 09:00:00 0.10900 0.10517 0.01800
28 2009-01-04 12:00:00 0.15742 0.11075 0.01842
29 2009-01-04 15:00:00 0.16042 0.11375 0.01842
… … … … …
11667 2012-12-29 09:00:00 0.78683 0.78683 0.07700
11668 2012-12-29 12:00:00 0.72375 0.72375 0.07267
11669 2012-12-29 15:00:00 0.69067 0.69067 0.06967
11670 2012-12-29 18:00:00 0.66342 0.66342 0.06967
11671 2012-12-29 21:00:00 0.73592 0.73592 0.07283
11672 2012-12-30 00:00:00 0.75367 0.75367 0.06183
11673 2012-12-30 03:00:00 0.66333 0.66333 0.07367
11674 2012-12-30 06:00:00 0.79683 0.79683 0.09517
11675 2012-12-30 09:00:00 0.91600 0.91600 0.10158
11676 2012-12-30 12:00:00 1.46500 1.46500 0.08683
11677 2012-12-30 15:00:00 1.31417 1.31417 0.08542
11678 2012-12-30 18:00:00 1.23917 1.23917 0.09808
11679 2012-12-30 21:00:00 1.06975 1.06975 0.10142
11680 2012-12-31 00:00:00 0.97333 0.97333 0.08500
11681 2012-12-31 03:00:00 0.85083 0.85083 0.07392
11682 2012-12-31 06:00:00 0.73592 0.73592 0.06942
11683 2012-12-31 09:00:00 0.68275 0.68275 0.06658
11684 2012-12-31 12:00:00 0.65125 0.65125 0.06383
11685 2012-12-31 15:00:00 0.62900 0.62900 0.06183
11686 2012-12-31 18:00:00 0.61733 0.61733 0.06058
11687 2012-12-31 21:00:00 0.84650 0.84650 0.17017
11688 2013-01-01 00:00:00 1.68833 1.68833 0.20733
11689 2013-01-01 03:00:00 2.69333 2.69333 0.20150
11690 2013-01-01 06:00:00 2.22083 2.22083 0.16692
11691 2013-01-01 09:00:00 2.05500 2.05500 0.17567
11692 2013-01-01 12:00:00 1.71000 1.71000 0.12958
11693 2013-01-01 15:00:00 1.42000 1.42000 0.09633
11694 2013-01-01 18:00:00 1.17858 1.17858 0.08308
11695 2013-01-01 21:00:00 0.89825 0.89825 0.07717
11696 2013-01-02 00:00:00 0.86000 0.86000 0.07500

11697 rows × 4 columns

data.head()

Time 	L06_347 	LS06_347 	LS06_348

0 2009-01-01 00:00:00 0.13742 0.09750 0.01683
1 2009-01-01 03:00:00 0.13125 0.08883 0.01642
2 2009-01-01 06:00:00 0.11350 0.09125 0.01675
3 2009-01-01 09:00:00 0.13575 0.09150 0.01625
4 2009-01-01 12:00:00 0.14092 0.09617 0.01700

data['Time']=pd.to_datetime(data['Time'])

data=data.set_index('Time')

data

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 00:00:00 0.13742 0.09750 0.01683
2009-01-01 03:00:00 0.13125 0.08883 0.01642
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-01 15:00:00 0.09917 0.09167 0.01758
2009-01-01 18:00:00 0.13267 0.09017 0.01625
2009-01-01 21:00:00 0.10942 0.09117 0.01600
2009-01-02 00:00:00 0.13383 0.09042 0.01608
2009-01-02 03:00:00 0.09208 0.08867 0.01600
2009-01-02 06:00:00 0.11292 0.09142 0.01633
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-02 12:00:00 0.14783 0.10192 0.01642
2009-01-02 15:00:00 0.10792 0.10025 0.01642
2009-01-02 18:00:00 0.14358 0.09842 0.01675
2009-01-02 21:00:00 0.11308 0.09808 0.01683
2009-01-03 00:00:00 0.13583 0.09217 0.01683
2009-01-03 03:00:00 0.08325 0.08000 0.01608
2009-01-03 06:00:00 0.11942 0.08025 0.01542
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-03 12:00:00 0.09167 0.08825 0.01625
2009-01-03 15:00:00 0.12500 0.08467 0.01650
2009-01-03 18:00:00 0.12158 0.08208 0.01583
2009-01-03 21:00:00 0.10717 0.09250 0.01600
2009-01-04 00:00:00 0.13525 0.09117 0.01633
2009-01-04 03:00:00 0.13558 0.09158 0.01608
2009-01-04 06:00:00 0.11717 0.09517 0.01600
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-04 12:00:00 0.15742 0.11075 0.01842
2009-01-04 15:00:00 0.16042 0.11375 0.01842
… … … …
2012-12-29 09:00:00 0.78683 0.78683 0.07700
2012-12-29 12:00:00 0.72375 0.72375 0.07267
2012-12-29 15:00:00 0.69067 0.69067 0.06967
2012-12-29 18:00:00 0.66342 0.66342 0.06967
2012-12-29 21:00:00 0.73592 0.73592 0.07283
2012-12-30 00:00:00 0.75367 0.75367 0.06183
2012-12-30 03:00:00 0.66333 0.66333 0.07367
2012-12-30 06:00:00 0.79683 0.79683 0.09517
2012-12-30 09:00:00 0.91600 0.91600 0.10158
2012-12-30 12:00:00 1.46500 1.46500 0.08683
2012-12-30 15:00:00 1.31417 1.31417 0.08542
2012-12-30 18:00:00 1.23917 1.23917 0.09808
2012-12-30 21:00:00 1.06975 1.06975 0.10142
2012-12-31 00:00:00 0.97333 0.97333 0.08500
2012-12-31 03:00:00 0.85083 0.85083 0.07392
2012-12-31 06:00:00 0.73592 0.73592 0.06942
2012-12-31 09:00:00 0.68275 0.68275 0.06658
2012-12-31 12:00:00 0.65125 0.65125 0.06383
2012-12-31 15:00:00 0.62900 0.62900 0.06183
2012-12-31 18:00:00 0.61733 0.61733 0.06058
2012-12-31 21:00:00 0.84650 0.84650 0.17017
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

11697 rows × 3 columns

data.index

DatetimeIndex(['2009-01-01 00:00:00', '2009-01-01 03:00:00',
               '2009-01-01 06:00:00', '2009-01-01 09:00:00',
               '2009-01-01 12:00:00', '2009-01-01 15:00:00',
               '2009-01-01 18:00:00', '2009-01-01 21:00:00',
               '2009-01-02 00:00:00', '2009-01-02 03:00:00',
               ...
               '2012-12-31 21:00:00', '2013-01-01 00:00:00',
               '2013-01-01 03:00:00', '2013-01-01 06:00:00',
               '2013-01-01 09:00:00', '2013-01-01 12:00:00',
               '2013-01-01 15:00:00', '2013-01-01 18:00:00',
               '2013-01-01 21:00:00', '2013-01-02 00:00:00'],
              dtype='datetime64[ns]', name='Time', length=11697, freq=None)

data=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\data\flowdata.csv',index_col=0,parse_dates=True)  # read the first column as a datetime index in one step

data.head()

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 00:00:00 0.13742 0.09750 0.01683
2009-01-01 03:00:00 0.13125 0.08883 0.01642
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700

#Time as the index

data[pd.Timestamp('2019-04-03 09:00'):pd.Timestamp('2019-04-03 19:00')]  # the data runs from 2009 to early 2013, so this 2019 window selects no rows

L06_347 	LS06_347 	LS06_348

Time

data.tail(10)

L06_347 	LS06_347 	LS06_348

Time
2012-12-31 21:00:00 0.84650 0.84650 0.17017
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

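data.loc['2013']  # all rows from 2013; pandas 2.0 and later require .loc here, since data['2013'] is looked up as a column name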

L06_347 	LS06_347 	LS06_348

Time
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500
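Selecting by a partial date string works for years, days, and time windows alike. A minimal sketch on a small synthetic frame, since `flowdata.csv` is not available here; the `flow` frame and its `value` column are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-hourly series covering 2012-12-31 and 2013-01-01
idx = pd.date_range('2012-12-31', periods=16, freq=pd.Timedelta(hours=3))
flow = pd.DataFrame({'value': np.arange(16.0)}, index=idx)

# A year string selects every row in that year
year_2013 = flow.loc['2013']

# A date string selects every row on that day
one_day = flow.loc['2013-01-01']

# String slices include both endpoints
window = flow.loc['2013-01-01 06:00':'2013-01-01 12:00']

print(len(year_2013), len(one_day), len(window))
```

A string that matches nothing in a slice simply gives an empty frame, which is why a range outside the data's time span returns no rows.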

data['2019-04-03':'2019-05']  # empty result: the data contains no rows from 2019

L06_347 	LS06_347 	LS06_348

Time

data[data.index.month==1]

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 00:00:00 0.13742 0.09750 0.01683
2009-01-01 03:00:00 0.13125 0.08883 0.01642
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-01 15:00:00 0.09917 0.09167 0.01758
2009-01-01 18:00:00 0.13267 0.09017 0.01625
2009-01-01 21:00:00 0.10942 0.09117 0.01600
2009-01-02 00:00:00 0.13383 0.09042 0.01608
2009-01-02 03:00:00 0.09208 0.08867 0.01600
2009-01-02 06:00:00 0.11292 0.09142 0.01633
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-02 12:00:00 0.14783 0.10192 0.01642
2009-01-02 15:00:00 0.10792 0.10025 0.01642
2009-01-02 18:00:00 0.14358 0.09842 0.01675
2009-01-02 21:00:00 0.11308 0.09808 0.01683
2009-01-03 00:00:00 0.13583 0.09217 0.01683
2009-01-03 03:00:00 0.08325 0.08000 0.01608
2009-01-03 06:00:00 0.11942 0.08025 0.01542
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-03 12:00:00 0.09167 0.08825 0.01625
2009-01-03 15:00:00 0.12500 0.08467 0.01650
2009-01-03 18:00:00 0.12158 0.08208 0.01583
2009-01-03 21:00:00 0.10717 0.09250 0.01600
2009-01-04 00:00:00 0.13525 0.09117 0.01633
2009-01-04 03:00:00 0.13558 0.09158 0.01608
2009-01-04 06:00:00 0.11717 0.09517 0.01600
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-04 12:00:00 0.15742 0.11075 0.01842
2009-01-04 15:00:00 0.16042 0.11375 0.01842
… … … …
2012-01-29 09:00:00 0.29683 0.31583 0.03475
2012-01-29 12:00:00 0.29400 0.31192 0.03433
2012-01-29 15:00:00 0.26950 0.30800 0.03300
2012-01-29 18:00:00 0.25942 0.30442 0.03183
2012-01-29 21:00:00 0.25458 0.29625 0.03133
2012-01-30 00:00:00 0.24350 0.28733 0.03092
2012-01-30 03:00:00 0.23625 0.28167 0.03025
2012-01-30 06:00:00 0.23033 0.27217 0.02942
2012-01-30 09:00:00 0.22183 0.26325 0.02783
2012-01-30 12:00:00 0.22425 0.26258 0.02925
2012-01-30 15:00:00 0.20600 0.25675 0.02892
2012-01-30 18:00:00 0.20042 0.25842 0.02825
2012-01-30 21:00:00 0.19275 0.25108 0.02725
2012-01-31 00:00:00 0.19125 0.24742 0.02592
2012-01-31 03:00:00 0.18108 0.24158 0.02583
2012-01-31 06:00:00 0.18875 0.23675 0.02600
2012-01-31 09:00:00 0.19100 0.23125 0.02558
2012-01-31 12:00:00 0.18333 0.22717 0.02592
2012-01-31 15:00:00 0.16342 0.22100 0.02375
2012-01-31 18:00:00 0.15708 0.22067 0.02317
2012-01-31 21:00:00 0.16008 0.21475 0.02333
2013-01-01 00:00:00 1.68833 1.68833 0.20733
2013-01-01 03:00:00 2.69333 2.69333 0.20150
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958
2013-01-01 15:00:00 1.42000 1.42000 0.09633
2013-01-01 18:00:00 1.17858 1.17858 0.08308
2013-01-01 21:00:00 0.89825 0.89825 0.07717
2013-01-02 00:00:00 0.86000 0.86000 0.07500

1001 rows × 3 columns

data[(data.index.hour>8)&(data.index.hour<12)]

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-05 09:00:00 0.16150 0.11458 0.02158
2009-01-06 09:00:00 0.10008 0.06558 0.01550
2009-01-07 09:00:00 0.13850 0.09392 0.01500
2009-01-08 09:00:00 0.10133 0.06642 0.01683
2009-01-09 09:00:00 0.06175 0.05942 0.01517
2009-01-10 09:00:00 0.19350 0.14700 0.01300
2009-01-11 09:00:00 0.08025 0.07742 0.01358
2009-01-12 09:00:00 0.13250 0.08917 0.01683
2009-01-13 09:00:00 0.19650 0.19267 0.04533
2009-01-14 09:00:00 0.32292 0.29925 0.02933
2009-01-15 09:00:00 0.21075 0.16750 0.02500
2009-01-16 09:00:00 0.15783 0.15392 0.02300
2009-01-17 09:00:00 0.21867 0.17333 0.02292
2009-01-18 09:00:00 0.63300 0.74567 0.07700
2009-01-19 09:00:00 1.04217 1.39850 0.13367
2009-01-20 09:00:00 0.75300 0.77300 0.06558
2009-01-21 09:00:00 0.39850 0.39850 0.04250
2009-01-22 09:00:00 0.36242 0.35125 0.03667
2009-01-23 09:00:00 8.23750 8.56000 0.38375
2009-01-24 09:00:00 1.85750 2.35667 0.09975
2009-01-25 09:00:00 0.57558 0.65775 0.05900
2009-01-26 09:00:00 0.30542 0.27992 0.04417
2009-01-27 09:00:00 0.27992 0.27492 0.03250
2009-01-28 09:00:00 0.28708 0.25383 0.03108
2009-01-29 09:00:00 0.26075 0.22183 0.02817
2009-01-30 09:00:00 0.24200 0.20017 0.02475
… … … …
2012-12-03 09:00:00 0.14450 0.14450 0.07467
2012-12-04 09:00:00 0.29208 0.29208 0.04108
2012-12-05 09:00:00 0.77525 0.77525 0.07567
2012-12-06 09:00:00 0.46792 0.46792 0.06075
2012-12-07 09:00:00 0.50983 0.50983 0.09658
2012-12-08 09:00:00 0.45758 0.45758 0.06467
2012-12-09 09:00:00 0.28875 0.28875 0.05317
2012-12-10 09:00:00 0.28925 0.28925 0.06008
2012-12-11 09:00:00 0.22608 0.22608 0.03783
2012-12-12 09:00:00 0.20133 0.20133 0.03517
2012-12-13 09:00:00 0.17575 0.17575 0.03450
2012-12-14 09:00:00 0.16583 0.16583 0.03542
2012-12-15 09:00:00 0.57683 0.57683 0.06508
2012-12-16 09:00:00 0.38175 0.38175 0.04642
2012-12-17 09:00:00 0.30583 0.30583 0.05092
2012-12-18 09:00:00 0.30217 0.30217 0.07067
2012-12-19 09:00:00 0.28292 0.28292 0.04133
2012-12-20 09:00:00 0.30608 0.30608 0.06825
2012-12-21 09:00:00 0.55033 0.55033 0.05925
2012-12-22 09:00:00 0.37883 0.37883 0.06967
2012-12-23 09:00:00 5.91750 5.91750 0.28658
2012-12-24 09:00:00 1.63833 1.63833 0.15133
2012-12-25 09:00:00 1.71917 1.71917 0.14625
2012-12-26 09:00:00 1.35417 1.35417 0.12758
2012-12-27 09:00:00 1.07667 1.07667 0.10300
2012-12-28 09:00:00 0.96150 0.96150 0.09242
2012-12-29 09:00:00 0.78683 0.78683 0.07700
2012-12-30 09:00:00 0.91600 0.91600 0.10158
2012-12-31 09:00:00 0.68275 0.68275 0.06658
2013-01-01 09:00:00 2.05500 2.05500 0.17567

1462 rows × 3 columns

data.between_time('05:00','12:00')  # rows whose time of day lies between 05:00 and 12:00, both ends included

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-02 06:00:00 0.11292 0.09142 0.01633
2009-01-02 09:00:00 0.14192 0.09708 0.01642
2009-01-02 12:00:00 0.14783 0.10192 0.01642
2009-01-03 06:00:00 0.11942 0.08025 0.01542
2009-01-03 09:00:00 0.12458 0.08442 0.01583
2009-01-03 12:00:00 0.09167 0.08825 0.01625
2009-01-04 06:00:00 0.11717 0.09517 0.01600
2009-01-04 09:00:00 0.10900 0.10517 0.01800
2009-01-04 12:00:00 0.15742 0.11075 0.01842
2009-01-05 06:00:00 0.14650 0.11517 0.01767
2009-01-05 09:00:00 0.16150 0.11458 0.02158
2009-01-05 12:00:00 0.11567 0.11175 0.02017
2009-01-06 06:00:00 0.09175 0.06825 0.01425
2009-01-06 09:00:00 0.10008 0.06558 0.01550
2009-01-06 12:00:00 0.12267 0.08292 0.01733
2009-01-07 06:00:00 0.12242 0.09333 0.01475
2009-01-07 09:00:00 0.13850 0.09392 0.01500
2009-01-07 12:00:00 0.13925 0.09467 0.01642
2009-01-08 06:00:00 0.10433 0.06875 0.01525
2009-01-08 09:00:00 0.10133 0.06642 0.01683
2009-01-08 12:00:00 0.11517 0.07700 0.01492
2009-01-09 06:00:00 0.06983 0.05192 0.01358
2009-01-09 09:00:00 0.06175 0.05942 0.01517
2009-01-09 12:00:00 0.10467 0.06925 0.01667
2009-01-10 06:00:00 0.13658 0.11342 0.01167
2009-01-10 09:00:00 0.19350 0.14700 0.01300
2009-01-10 12:00:00 0.14708 0.10208 0.01875
… … … …
2012-12-23 06:00:00 6.07917 6.07917 0.41633
2012-12-23 09:00:00 5.91750 5.91750 0.28658
2012-12-23 12:00:00 4.28333 4.28333 0.27575
2012-12-24 06:00:00 2.45167 2.45167 0.18958
2012-12-24 09:00:00 1.63833 1.63833 0.15133
2012-12-24 12:00:00 1.39583 1.39583 0.13075
2012-12-25 06:00:00 1.81083 1.81083 0.24717
2012-12-25 09:00:00 1.71917 1.71917 0.14625
2012-12-25 12:00:00 1.46417 1.46417 0.11942
2012-12-26 06:00:00 1.30583 1.30583 0.16708
2012-12-26 09:00:00 1.35417 1.35417 0.12758
2012-12-26 12:00:00 1.45917 1.45917 0.10833
2012-12-27 06:00:00 1.44333 1.44333 0.10450
2012-12-27 09:00:00 1.07667 1.07667 0.10300
2012-12-27 12:00:00 1.24417 1.24417 0.19542
2012-12-28 06:00:00 1.39417 1.39417 0.09958
2012-12-28 09:00:00 0.96150 0.96150 0.09242
2012-12-28 12:00:00 0.88842 0.88842 0.11592
2012-12-29 06:00:00 0.84583 0.84583 0.08058
2012-12-29 09:00:00 0.78683 0.78683 0.07700
2012-12-29 12:00:00 0.72375 0.72375 0.07267
2012-12-30 06:00:00 0.79683 0.79683 0.09517
2012-12-30 09:00:00 0.91600 0.91600 0.10158
2012-12-30 12:00:00 1.46500 1.46500 0.08683
2012-12-31 06:00:00 0.73592 0.73592 0.06942
2012-12-31 09:00:00 0.68275 0.68275 0.06658
2012-12-31 12:00:00 0.65125 0.65125 0.06383
2013-01-01 06:00:00 2.22083 2.22083 0.16692
2013-01-01 09:00:00 2.05500 2.05500 0.17567
2013-01-01 12:00:00 1.71000 1.71000 0.12958

4386 rows × 3 columns

data.head(6)

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 00:00:00 0.13742 0.09750 0.01683
2009-01-01 03:00:00 0.13125 0.08883 0.01642
2009-01-01 06:00:00 0.11350 0.09125 0.01675
2009-01-01 09:00:00 0.13575 0.09150 0.01625
2009-01-01 12:00:00 0.14092 0.09617 0.01700
2009-01-01 15:00:00 0.09917 0.09167 0.01758

#Resampling

data.resample('D').mean()

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 0.12501 0.09228 0.01664
2009-01-02 0.12415 0.09578 0.01641
2009-01-03 0.11356 0.08554 0.01609
2009-01-04 0.14020 0.10271 0.01732
2009-01-05 0.12881 0.10449 0.01817
2009-01-06 0.09577 0.06793 0.01452
2009-01-07 0.11865 0.08364 0.01434
2009-01-08 0.09432 0.07015 0.01507
2009-01-09 0.07816 0.05844 0.01402
2009-01-10 0.11992 0.09384 0.01357
2009-01-11 0.09965 0.07112 0.01383
2009-01-12 0.12826 0.09405 0.01623
2009-01-13 0.50280 0.59831 0.05027
2009-01-14 0.32390 0.30935 0.02859
2009-01-15 0.21419 0.18598 0.02373
2009-01-16 0.18621 0.14951 0.02148
2009-01-17 0.23133 0.20364 0.02932
2009-01-18 0.75227 0.87436 0.08229
2009-01-19 1.00881 1.26615 0.09624
2009-01-20 0.66869 0.78560 0.06565
2009-01-21 0.37927 0.37766 0.04078
2009-01-22 0.39792 0.40364 0.04447
2009-01-23 5.93353 6.19993 0.40471
2009-01-24 1.89376 2.19266 0.10548
2009-01-25 0.57097 0.62696 0.05811
2009-01-26 0.39536 0.39737 0.04149
2009-01-27 0.28993 0.26887 0.03194
2009-01-28 0.26766 0.23915 0.02860
2009-01-29 0.23087 0.18735 0.02529
2009-01-30 0.22431 0.17994 0.02451
… … … …
2012-12-04 0.42692 0.42692 0.06622
2012-12-05 0.85898 0.85898 0.10526
2012-12-06 0.50203 0.50203 0.06621
2012-12-07 0.75386 0.75386 0.10757
2012-12-08 0.45820 0.45820 0.06511
2012-12-09 0.29474 0.29474 0.05185
2012-12-10 0.29318 0.29318 0.04817
2012-12-11 0.22973 0.22973 0.03755
2012-12-12 0.19996 0.19996 0.03577
2012-12-13 0.17421 0.17421 0.03508
2012-12-14 0.48484 0.48484 0.08501
2012-12-15 0.77540 0.77540 0.06838
2012-12-16 0.35960 0.35960 0.04601
2012-12-17 0.32079 0.32079 0.04277
2012-12-18 0.33107 0.33107 0.04792
2012-12-19 0.28434 0.28434 0.03737
2012-12-20 0.61780 0.61780 0.07503
2012-12-21 0.65828 0.65828 0.05967
2012-12-22 1.71293 1.71293 0.20872
2012-12-23 4.76792 4.76792 0.33498
2012-12-24 1.72472 1.72472 0.14780
2012-12-25 1.81454 1.81454 0.19384
2012-12-26 1.34976 1.34976 0.14795
2012-12-27 1.74635 1.74635 0.15516
2012-12-28 1.25864 1.25864 0.11720
2012-12-29 0.80760 0.80760 0.07803
2012-12-30 1.02724 1.02724 0.08800
2012-12-31 0.74836 0.74836 0.08142
2013-01-01 1.73304 1.73304 0.14220
2013-01-02 0.86000 0.86000 0.07500

1463 rows × 3 columns

data.resample('D').mean().head()  # replaces the deprecated resample('D',how='mean') spelling

L06_347 	LS06_347 	LS06_348

Time
2009-01-01 0.12501 0.09228 0.01664
2009-01-02 0.12415 0.09578 0.01641
2009-01-03 0.11356 0.08554 0.01609
2009-01-04 0.14020 0.10271 0.01732
2009-01-05 0.12881 0.10449 0.01817
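Resampling groups the rows by a time interval and then aggregates each group. A minimal sketch on synthetic 3-hourly data, small enough to check by hand; the `flow` frame is invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-hourly measurements: 8 per day over two days
idx = pd.date_range('2009-01-01', periods=16, freq=pd.Timedelta(hours=3))
flow = pd.DataFrame({'value': np.arange(16.0)}, index=idx)

# One row per calendar day, holding the mean of that day's 8 values
daily_mean = flow.resample('D').mean()

# Several aggregations at once give one column per function
daily_stats = flow.resample('D').agg(['min', 'max'])

print(daily_mean)
print(daily_stats)
```

The first day holds the values 0 to 7 and the second 8 to 15, so the daily means are 3.5 and 11.5.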

%matplotlib notebook

data.resample('M').mean().plot()  # plot the monthly means; pandas 2.2 and later prefer the alias 'ME' for month end

#Common operations

import pandas as pd

data = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
                     'data':[4,3,2,1,12,3,4,5,7]})

data

data 	group

0 4 a
1 3 a
2 2 a
3 1 b
4 12 b
5 3 b
6 4 c
7 5 c
8 7 c

data.sort_values(by=['group','data'],ascending=[False,True],inplace=True)  # group in descending order, then data ascending within each group

data

data 	group

6 4 c
7 5 c
8 7 c
3 1 b
5 3 b
4 12 b
2 2 a
1 3 a
0 4 a

data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,4,4]})

data

k1 	k2

0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 4
6 two 4

data.sort_values(by='k2')

k1 	k2

2 one 1
1 one 2
0 one 3
3 two 3
4 two 3
5 two 4
6 two 4

data.drop_duplicates()  # drop rows that repeat an earlier row in every column

k1 	k2

0 one 3
1 one 2
2 one 1
3 two 3
5 two 4

data.drop_duplicates(subset='k1')  # compare on k1 only, keeping the first row for each value

k1 	k2

0 one 3
3 two 3
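Which duplicate survives is controlled by `keep`, and `duplicated()` shows the rows that would be dropped. A minimal sketch on the same `k1`/`k2` frame:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 4, 4]})

# Default keep='first': the first row of each k1 value survives
first = data.drop_duplicates(subset='k1')

# keep='last': the last row of each k1 value survives instead
last = data.drop_duplicates(subset='k1', keep='last')

# Boolean mask marking rows that repeat an earlier row in every column
dup_mask = data.duplicated()

print(first)
print(last)
print(data[dup_mask])
```

Inspecting `data[dup_mask]` before dropping is a cheap way to confirm that only the intended rows go.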

data=pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],'data':[1,2,3,4,5,6,7]})

data

data 	food

0 1 A1
1 2 A2
2 3 B1
3 4 B2
4 5 B3
5 6 C1
6 7 C2

def food_map(series):
    # Map each food code in the row to its category letter
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'

data['food_map'] = data.apply(food_map,axis = 'columns')  # axis='columns' passes one row at a time

data

data 	food 	food_map

0 1 A1 A
1 2 A2 A
2 3 B1 B
3 4 B2 B
4 5 B3 B
5 6 C1 C
6 7 C2 C

#Dictionary mapping

food2Upper={'A1':'A',
            'A2':'A',
            'B1':'B',
            'B2':'B',
            'B3':'B',
            'C1':'C',
            'C2':'C'}

data['upper']=data['food'].map(food2Upper)

data

data 	food 	food_map 	upper

0 1 A1 A A
1 2 A2 A A
2 3 B1 B B
3 4 B2 B B
4 5 B3 B B
5 6 C1 C C
6 7 C2 C C
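Two details of the mapping approach are worth seeing on a small case: keys missing from the dictionary become `NaN`, and when the category is simply the first character, a string accessor does the same job without a dictionary. A minimal sketch; `food_to_group` deliberately leaves one code out:

```python
import pandas as pd

data = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'C2'], 'data': [1, 2, 3, 4]})

# Hypothetical mapping that omits 'C2' on purpose
food_to_group = {'A1': 'A', 'A2': 'A', 'B1': 'B'}

# Values without a dictionary entry are mapped to NaN
mapped = data['food'].map(food_to_group)

# Alternative for this particular pattern: take the first character
first_letter = data['food'].str[0]

print(mapped)
print(first_letter)
```

The dictionary stays the better tool when the codes and their categories are not related by a simple rule.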

#Common operations

import numpy as np

df = pd.DataFrame({'data1':np.random.randn(5),
                   'data2':np.random.randn(5)})

df2 = df.assign(ration = df['data1']/df['data2'])  # assign returns a new frame with the extra column; df itself is unchanged

df

data1 	data2

0 1.80403 0.75122
1 -0.49803 1.43483
2 0.57212 -0.69617
3 0.88256 -0.57113
4 1.31386 -0.10652

df2

data1 	data2 	ration

0 1.80403 0.75122 2.40145
1 -0.49803 1.43483 -0.34710
2 0.57212 -0.69617 -0.82181
3 0.88256 -0.57113 -1.54529
4 1.31386 -0.10652 -12.33390

#Drop the added column

df2.drop('ration',axis='columns',inplace=True)

data=pd.Series([1,2,3,4,5,6,7,8,9])

data

0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64

data.replace(9,np.nan,inplace=True)

data

0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 NaN
dtype: float64

ages=[15,18,20,21,22,345,41,52,63,79]

bins=[10,20,30,40,50,60,70,80]

bin_res=pd.cut(ages,bins)

bin_res

[(10, 20], (10, 20], (10, 20], (20, 30], (20, 30], NaN, (40, 50], (50, 60], (60, 70], (70, 80]]
Categories (7, interval[int64]): [(10, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80]]

bin_res.codes  # integer code of each value's bin; -1 marks a value that falls outside every bin

array([ 0, 0, 0, 1, 1, -1, 3, 4, 5, 6], dtype=int8)

pd.value_counts(bin_res)

(10, 20] 3
(20, 30] 2
(70, 80] 1
(60, 70] 1
(50, 60] 1
(40, 50] 1
(30, 40] 0
dtype: int64

pd.cut(ages,[10,30,50,80])

[(10, 30], (10, 30], (10, 30], (10, 30], (10, 30], NaN, (30, 50], (50, 80], (50, 80], (50, 80]]
Categories (3, interval[int64]): [(10, 30] < (30, 50] < (50, 80]]

group_names = ['Youth','Middle','Old']

#pd.cut(ages,[10,20,50,80],labels=group_names)

pd.value_counts(pd.cut(ages,[10,20,50,80],labels=group_names))

Old 3
Middle 3
Youth 3
dtype: int64
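Which bin a boundary value lands in depends on which side of the interval is closed. A minimal sketch using the same `ages` list; the label spellings and the `right=False` variant are additions for illustration:

```python
import pandas as pd

ages = [15, 18, 20, 21, 22, 345, 41, 52, 63, 79]
labels = ['Youth', 'Middle', 'Old']

# Default right=True: intervals are (10, 20], (20, 50], (50, 80]
groups = pd.cut(ages, bins=[10, 20, 50, 80], labels=labels)

# right=False: intervals become [10, 20), [20, 50), [50, 80)
left_closed = pd.cut(ages, bins=[10, 20, 50, 80], labels=labels, right=False)

counts = pd.Series(groups).value_counts(sort=False)

print(groups)
print(left_closed)
print(counts)
```

The age 20 is `Youth` with the default and `Middle` with `right=False`, while 345 lies outside every bin and becomes `NaN` in both cases.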

df=pd.DataFrame([range(3),[0,np.nan,0],[0,0,np.nan],range(3)])

df

0 	1 	2

0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0

df.isnull()

0 	1 	2

0 False False False
1 False True False
2 False False True
3 False False False

df.isnull().any()

0 False
1 True
2 True
dtype: bool

df.isnull().any(axis=1)

0 False
1 True
2 True
3 False
dtype: bool

df.fillna(5)

0 	1 	2

0 0 1.0 2.0
1 0 5.0 0.0
2 0 0.0 5.0
3 0 1.0 2.0

df[df.isnull().any(axis=1)]  # select the rows that contain at least one NaN

0 	1 	2

1 0 NaN 0.0
2 0 0.0 NaN
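Besides filling every gap with one constant, missing values can be dropped or filled column by column. A minimal sketch on the same frame; using the column mean for column 1 and zero for column 2 is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [0, 1, 2]])

# Keep only the rows without any NaN
complete_rows = df.dropna()

# A dict gives each column its own fill value
filled = df.fillna({1: df[1].mean(), 2: 0})

# Number of missing values per column
missing_per_column = df.isnull().sum()

print(complete_rows)
print(filled)
print(missing_per_column)
```

`mean()` skips `NaN` by default, so the fill value for column 1 is the mean of its three known entries.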

#More on groupby

import pandas as pd

import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df

A 	B 	C 	D

0 foo one 0.71168 0.86983
1 bar one 0.91770 0.69098
2 foo two 0.48605 0.40056
3 bar three 0.20739 0.26912
4 foo two 0.60928 2.15210
5 bar two -0.87134 -0.37828
6 foo one 0.20450 -0.49510
7 foo three 0.33635 1.16671

grouped = df.groupby('A')

grouped

grouped.count()

B 	C 	D

A
bar 3 3 3
foo 5 5 5

grouped=df.groupby(['A','B'])

grouped.count()

	C 	D

A B
bar one 1 1
three 1 1
two 1 1
foo one 2 2
three 1 1
two 2 2

def get_letter_type(letter):
    # vowels go to group 'a', every other column label to group 'b'
    if letter.lower() in 'aeiou':
        return 'a'
    else:
        return 'b'

grouped=df.groupby(get_letter_type,axis=1)

grouped.count().iloc[0]

a 1
b 3
Name: 0, dtype: int64

s = pd.Series([1,2,3,1,2,3],[8,7,5,8,7,5])

s

8 1
7 2
5 3
8 1
7 2
5 3
dtype: int64

grouped = s.groupby(level = 0)

grouped

grouped.first()

5 3
7 2
8 1
dtype: int64

grouped.last()

5 3
7 2
8 1
dtype: int64

grouped.sum()

5 6
7 4
8 2
dtype: int64

grouped=s.groupby(level=0,sort=False)

grouped.first()

8 1
7 2
5 3
dtype: int64

df2=pd.DataFrame({'x':['A','B','A','B'],'Y':[1,2,3,4]})

df2

Y 	x

0 1 A
1 2 B
2 3 A
3 4 B

df2.groupby(['x']).get_group('A')

Y 	x

0 1 A
2 3 A

df2.groupby(['x']).get_group('B')

Y 	x

1 2 B
3 4 B

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

arrays

[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
 ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index=pd.MultiIndex.from_arrays(arrays,names=['first','second'])

index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

s=pd.Series(np.random.randn(8),index=index)

s

first second
bar one 0.94576
two 0.23519
baz one -1.26270
two -0.60026
foo one -1.08169
two 1.86771
qux one -0.57534
two -0.31680
dtype: float64

grouped=s.groupby(level=0)

grouped

grouped=s.groupby(level='first')

grouped.sum()

first
bar 1.18096
baz -1.86296
foo 0.78603
qux -0.89214
dtype: float64

grouped=df.groupby(['A','B'])

df.aggregate(np.sum)

A foobarfoobarfoobarfoofoo
B oneonetwothreetwotwoonethree
C 2.6016
D 4.6759
dtype: object

grouped = df.groupby(['A','B'],as_index = False)

grouped.aggregate(np.sum)

A 	B 	C 	D

0 bar one 0.91770 0.69098
1 bar three 0.20739 0.26912
2 bar two -0.87134 -0.37828
3 foo one 0.91618 0.37473
4 foo three 0.33635 1.16671
5 foo two 1.09532 2.55266

df.groupby(['A','B']).sum().reset_index()

A 	B 	C 	D

0 bar one 0.91770 0.69098
1 bar three 0.20739 0.26912
2 bar two -0.87134 -0.37828
3 foo one 0.91618 0.37473
4 foo three 0.33635 1.16671
5 foo two 1.09532 2.55266

grouped.size()

A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64

grouped.describe().head()

C 	D
count 	mean 	std 	min 	25% 	50% 	75% 	max 	count 	mean 	std 	min 	25% 	50% 	75% 	max

0 1.0 0.91770 NaN 0.91770 0.91770 0.91770 0.91770 0.91770 1.0 0.69098 NaN 0.69098 0.69098 0.69098 0.69098 0.69098
1 1.0 0.20739 NaN 0.20739 0.20739 0.20739 0.20739 0.20739 1.0 0.26912 NaN 0.26912 0.26912 0.26912 0.26912 0.26912
2 1.0 -0.87134 NaN -0.87134 -0.87134 -0.87134 -0.87134 -0.87134 1.0 -0.37828 NaN -0.37828 -0.37828 -0.37828 -0.37828 -0.37828
3 2.0 0.45809 0.35863 0.20450 0.33129 0.45809 0.58489 0.71168 2.0 0.18737 0.96515 -0.49510 -0.15387 0.18737 0.52860 0.86983
4 1.0 0.33635 NaN 0.33635 0.33635 0.33635 0.33635 0.33635 1.0 1.16671 NaN 1.16671 1.16671 1.16671 1.16671 1.16671

grouped = df.groupby('A')

grouped['C'].agg([np.sum,np.mean,np.std])

sum 	mean 	std

A
bar 0.25375 0.08458 0.90082
foo 2.34785 0.46957 0.20397
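
When the result columns need descriptive names, recent pandas versions (0.25 and later) accept named aggregation. The following is a minimal sketch on a small hypothetical frame with fixed numbers, so the result can be verified by hand; the names `total` and `average` are illustrative only.

```python
import pandas as pd

# Small hypothetical frame with fixed values (not the random data above)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 6.0]})

# Named aggregation: result_name=(source_column, function)
summary = df.groupby('A').agg(total=('C', 'sum'), average=('C', 'mean'))
print(summary)
```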

#String operations

import numpy as np

import pandas as pd

s=pd.Series(['A','a','b','B','gaer','AGER',np.nan])

s

0 A
1 a
2 b
3 B
4 gaer
5 AGER
6 NaN
dtype: object

s.str.lower()

0 a
1 a
2 b
3 b
4 gaer
5 ager
6 NaN
dtype: object

s.str.upper()

0 A
1 A
2 B
3 B
4 GAER
5 AGER
6 NaN
dtype: object

s.str.len()

0 1.0
1 1.0
2 1.0
3 1.0
4 4.0
5 4.0
6 NaN
dtype: float64

index=pd.Index(['cui' ,' li' , 'jun'])

index

Index(['cui', ' li', 'jun'], dtype='object')

index.str.strip()

Index(['cui', 'li', 'jun'], dtype='object')

index.str.lstrip()

Index(['cui', 'li', 'jun'], dtype='object')

df=pd.DataFrame(np.random.randn(3,2),columns=['A','B'],index=range(3))

df

A 	B

0 -0.172169 1.626435
1 -0.604493 0.374151
2 0.716009 2.219520

df.columns=df.columns.str.replace('','_')

df

_A_ 	_B_

0 -0.172169 1.626435
1 -0.604493 0.374151
2 0.716009 2.219520

s=pd.Series(['a_b-C','c_d_e','f_g_h'])

s

0 a_b-C
1 c_d_e
2 f_g_h
dtype: object

s.str.split('_')

0 [a, b-C]
1 [c, d, e]
2 [f, g, h]
dtype: object

s.str.split('_',expand=True,n=1)

0 	1

0 a b-C
1 c d_e
2 f g_h

s=pd.Series(['A','Aas','Afgew','Ager','Agre','Aw'])

s

0 A
1 Aas
2 Afgew
3 Ager
4 Agre
5 Aw
dtype: object

s.str.contains('Aa')

0 False
1 True
2 False
3 False
4 False
5 False
dtype: bool

s=pd.Series(['a','a|b','a|c'])

s

0 a
1 a|b
2 a|c
dtype: object

s.str.get_dummies(sep='|')

a 	b 	c

0 1 0 0
1 1 1 0
2 1 0 1

#Indexing

s = pd.Series(np.arange(5),index = np.arange(5)[::-1],dtype='int64')

s

4 0
3 1
2 2
1 3
0 4
dtype: int64

s.isin([1,23,4])

4 False
3 True
2 False
1 False
0 True
dtype: bool

s[s.isin([1,3,4])]

3 1
1 3
0 4
dtype: int64

s2 = pd.Series(np.arange(6),index = pd.MultiIndex.from_product([[0,1],['a','b','c']]))

s2

0 a 0
b 1
c 2
1 a 3
b 4
c 5
dtype: int32

s2.iloc[s2.index.isin([(1,'a'),(2,'b')])]

s2

0 a 0
b 1
c 2
1 a 3
b 4
c 5
dtype: int32

s[s>2]

s

4 0
3 1
2 2
1 3
0 4
dtype: int64

dates=pd.date_range('2019-04-05',periods=8)

df=pd.DataFrame(np.random.randn(8,4),index=dates)

columns=['A','B','C','D']

df

0 	1 	2 	3

2019-04-05 0.145422 0.342281 0.971241 -0.041731
2019-04-06 -2.102217 0.778930 -1.972598 -0.694885
2019-04-07 -0.158922 1.619844 -0.689797 -0.934461
2019-04-08 0.636213 -0.681186 0.089263 0.550155
2019-04-09 -0.094493 0.721435 1.333688 -0.069475
2019-04-10 1.197129 -0.697439 -0.884878 1.433160
2019-04-11 -0.968315 0.430566 -0.930414 -0.153921
2019-04-12 -0.129315 -0.056980 0.572650 -0.016057

df.select(lambda x:x=='A',axis='columns')

2019-04-05
2019-04-06
2019-04-07
2019-04-08
2019-04-09
2019-04-10
2019-04-11
2019-04-12
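
`DataFrame.select` was deprecated and later removed from pandas, so the call above only runs on old versions. A minimal sketch of two equivalent selections follows; unlike the cell above, it assumes the column labels `A` to `D` are actually passed to the constructor, so the selection is not empty.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-04-05', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# Boolean mask over the column labels, the same test the lambda performed
only_a = df.loc[:, df.columns == 'A']

# filter(items=...) keeps the listed labels that exist in the frame
also_a = df.filter(items=['A'])

print(only_a.head())
```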

df.where(df<0)

df

0 	1 	2 	3

2019-04-05 0.145422 0.342281 0.971241 -0.041731
2019-04-06 -2.102217 0.778930 -1.972598 -0.694885
2019-04-07 -0.158922 1.619844 -0.689797 -0.934461
2019-04-08 0.636213 -0.681186 0.089263 0.550155
2019-04-09 -0.094493 0.721435 1.333688 -0.069475
2019-04-10 1.197129 -0.697439 -0.884878 1.433160
2019-04-11 -0.968315 0.430566 -0.930414 -0.153921
2019-04-12 -0.129315 -0.056980 0.572650 -0.016057

df.where(df<0,-df)

df

0 	1 	2 	3

2019-04-05 0.145422 0.342281 0.971241 -0.041731
2019-04-06 -2.102217 0.778930 -1.972598 -0.694885
2019-04-07 -0.158922 1.619844 -0.689797 -0.934461
2019-04-08 0.636213 -0.681186 0.089263 0.550155
2019-04-09 -0.094493 0.721435 1.333688 -0.069475
2019-04-10 1.197129 -0.697439 -0.884878 1.433160
2019-04-11 -0.968315 0.430566 -0.930414 -0.153921
2019-04-12 -0.129315 -0.056980 0.572650 -0.016057
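
The outputs above show `df` itself, which is unchanged because `where` returns a new object rather than modifying the frame in place. A minimal sketch on a small hypothetical frame shows what the two calls return:

```python
import pandas as pd

# Small hypothetical frame with fixed values
df = pd.DataFrame({'x': [1.0, -2.0, 3.0], 'y': [-4.0, 5.0, -6.0]})

# where keeps values where the condition holds and puts NaN elsewhere
negatives_only = df.where(df < 0)

# With a second argument, failing positions take the replacement value
all_negative = df.where(df < 0, -df)

print(negatives_only)
print(all_negative)
```

To keep a result, assign it to a name as done here; `df` still holds the original values afterwards.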

df=pd.DataFrame(np.random.randn(10,3),columns=list('abc'))

df

a 	b 	c

0 0.233600 -0.118476 1.910718
1 0.453123 0.328837 1.967945
2 -0.719929 -1.564187 0.457447
3 1.464841 1.631935 0.351648
4 -0.977479 -1.000130 -0.275709
5 -0.253827 0.032827 -1.997572
6 -0.322984 0.226921 0.465433
7 0.018869 -1.393526 1.270390
8 -1.213045 -0.418379 0.584319
9 0.662430 0.761807 -0.990689

df.query(’(a

df

a 	b 	c

0 0.233600 -0.118476 1.910718
1 0.453123 0.328837 1.967945
2 -0.719929 -1.564187 0.457447
3 1.464841 1.631935 0.351648
4 -0.977479 -1.000130 -0.275709
5 -0.253827 0.032827 -1.997572
6 -0.322984 0.226921 0.465433
7 0.018869 -1.393526 1.270390
8 -1.213045 -0.418379 0.584319
9 0.662430 0.761807 -0.990689

df.query(’(a

df

a 	b 	c

0 0.233600 -0.118476 1.910718
1 0.453123 0.328837 1.967945
2 -0.719929 -1.564187 0.457447
3 1.464841 1.631935 0.351648
4 -0.977479 -1.000130 -0.275709
5 -0.253827 0.032827 -1.997572
6 -0.322984 0.226921 0.465433
7 0.018869 -1.393526 1.270390
8 -1.213045 -0.418379 0.584319
9 0.662430 0.761807 -0.990689
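
The `query` expressions above are truncated in the source, so their conditions cannot be recovered. As a generic illustration only, the sketch below uses a hypothetical condition on a small fixed frame and compares `query` with the equivalent boolean mask.

```python
import pandas as pd

# Small hypothetical frame with fixed values
df = pd.DataFrame({'a': [1, 5, 2, 7], 'b': [3, 4, 2, 9], 'c': [4, 6, 1, 8]})

# Hypothetical condition, NOT the one truncated in the source
by_query = df.query('(a < b) & (b < c)')

# The same filter written as a boolean mask
by_mask = df[(df.a < df.b) & (df.b < df.c)]

print(by_query)
```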

#Plotting with pandas

%matplotlib inline

import pandas as pd

import numpy as np

s = pd.Series(np.random.randn(10),index = np.arange(0,100,10))

s.plot()

df=pd.DataFrame(np.random.randn(10,4).cumsum(0),index=np.arange(0,100,10),columns=['A','B','C','D'])

df.head()

A 	B 	C 	D

0 -0.600440 0.862748 -0.902197 -0.372323
10 -0.543945 0.229546 -0.963724 0.452196
20 -1.744248 0.161023 0.073936 -0.225950
30 -3.167785 0.514504 1.225721 0.756929
40 -1.606017 0.472679 1.758449 -0.160899

df.plot()

import matplotlib.pyplot as plt

fig,axes = plt.subplots(2,1)

data = pd.Series(np.random.rand(16),index=list('abcdefghijklmnop'))

data.plot(ax = axes[0],kind='bar')

data.plot(ax = axes[1],kind='barh')

df = pd.DataFrame(np.random.rand(6, 4),

           index = ['one', 'two', 'three', 'four', 'five', 'six'], 

           columns = pd.Index(['A', 'B', 'C', 'D'], name = 'Genus'))

df.head()

Genus A B C D
one 0.350508 0.225946 0.141177 0.353882
two 0.390222 0.989578 0.332295 0.474077
three 0.837848 0.944297 0.442973 0.730698
four 0.604615 0.099858 0.390346 0.336698
five 0.736496 0.055303 0.108531 0.251296

df.plot(kind='bar')

tips=pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\tips.csv')

tips.head()

total_bill 	tip 	sex 	smoker 	day 	time 	size

0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

tips.total_bill.plot(kind='hist',bins=20)

macro = pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\macrodata.csv')

macro.head()

year 	quarter 	realgdp 	realcons 	realinv 	realgovt 	realdpi 	cpi 	m1 	tbilrate 	unemp 	pop 	infl 	realint

0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19

data = macro[['quarter','realgdp','realcons']]

data.plot.scatter('quarter','realgdp')

pd.plotting.scatter_matrix(macro,color='g',alpha=0.3)

#Handling large datasets

import pandas as pd

gl = pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\game_logs.csv')

gl.head()

D:\program\Anaconda\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

date 	number_of_game 	day_of_week 	v_name 	v_league 	v_game_number 	h_name 	h_league 	h_game_number 	v_score 	... 	h_player_7_name 	h_player_7_def_pos 	h_player_8_id 	h_player_8_name 	h_player_8_def_pos 	h_player_9_id 	h_player_9_name 	h_player_9_def_pos 	additional_info 	acquisition_info

0 18710504 0 Thu CL1 na 1 FW1 na 1 0 … Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 18710505 0 Fri BS1 na 1 WS3 na 1 20 … Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 18710506 0 Sat CL1 na 2 RC1 na 1 12 … Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 18710508 0 Mon CL1 na 3 CH1 na 1 12 … Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 18710509 0 Tue BS1 na 2 TRO na 1 9 … Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

5 rows × 161 columns

gl.info(memory_usage='deep')


RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB

for dtype in ['float64','object','int64']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))

Average memory usage for float64 columns: 1.29 MB
Average memory usage for object columns: 9.51 MB
Average memory usage for int64 columns: 1.12 MB

import numpy as np

int_types = ["uint8", "int8", "int16","int32","int64"]

for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8

min = 0
max = 255

Machine parameters for int8

min = -128
max = 127

Machine parameters for int16

min = -32768
max = 32767

Machine parameters for int32

min = -2147483648
max = 2147483647

Machine parameters for int64

min = -9223372036854775808
max = 9223372036854775807

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_int = gl.select_dtypes(include=['int64'])

#pd.to_numeric reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html

converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')

print(mem_usage(gl_int))

print(mem_usage(converted_int))

7.87 MB
1.48 MB

gl_float = gl.select_dtypes(include=['float64'])

converted_float = gl_float.apply(pd.to_numeric,downcast='float')

print(mem_usage(gl_float))

print(mem_usage(converted_float))

100.99 MB
50.49 MB

optimized_gl = gl.copy()

optimized_gl[converted_int.columns] = converted_int

optimized_gl[converted_float.columns] = converted_float

print(mem_usage(gl))

print(mem_usage(optimized_gl))

860.50 MB
803.61 MB
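
To see what the downcasting step does to the dtypes themselves, here is a minimal sketch on a tiny hypothetical frame (the column names `games` and `score` are made up for the illustration). `pd.to_numeric` picks the smallest type that can hold every value in the column.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the game log
small = pd.DataFrame({'games': np.array([0, 12, 255], dtype='int64'),
                      'score': np.array([0.5, 1.25, 3.0], dtype='float64')})

# Non-negative integers up to 255 fit in uint8
ints = small.select_dtypes(include=['int64']).apply(
    pd.to_numeric, downcast='unsigned')

# Floats are downcast to float32 at the smallest
floats = small.select_dtypes(include=['float64']).apply(
    pd.to_numeric, downcast='float')

print(ints.dtypes)
print(floats.dtypes)
```

The overall saving above is small (860.50 MB to 803.61 MB) because most of the memory is held by the object columns, which downcasting does not touch.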

for dtype in ['float64','int64','object']:
    selected_dtype = gl.select_dtypes(include = [dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b/1024**2
    print ('Average memory usage',dtype,mean_usage_mb)

Average memory usage float64 1.2947326073279748
Average memory usage int64 1.1241934640066964
Average memory usage object 9.514454069016855

gl_obj = gl.select_dtypes(include = ['object']).copy()

gl_obj.describe()

day_of_week 	v_name 	v_league 	h_name 	h_league 	day_night 	completion 	forefeit 	protest 	park_id 	... 	h_player_6_id 	h_player_6_name 	h_player_7_id 	h_player_7_name 	h_player_8_id 	h_player_8_name 	h_player_9_id 	h_player_9_name 	additional_info 	acquisition_info

count 171907 171907 171907 171907 171907 140150 116 145 180 171907 … 140838 140838 140838 140838 140838 140838 140838 140838 1456 140841
unique 7 148 7 148 7 2 116 3 5 245 … 4774 4720 5253 5197 4760 4710 5193 5142 332 1
top Sat CHN NL CHN NL D 19810610,CHI11,1,2,45 H V STL07 … grimc101 Charlie Grimm grimc101 Charlie Grimm lopea102 Al Lopez spahw101 Warren Spahn HTBF Y
freq 28891 8870 88866 9024 88867 82724 1 69 90 7022 … 427 427 491 491 676 676 339 339 1112 140841

4 rows × 78 columns

dow = gl_obj.day_of_week

dow.head()

0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: object

dow_cat = dow.astype('category')

dow_cat.head()

0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]

dow_cat.head(10).cat.codes

0 4
1 0
2 2
3 1
4 5
5 4
6 2
7 2
8 1
9 5
dtype: int8

print (mem_usage(dow))

print (mem_usage(dow_cat))

9.84 MB
0.16 MB

converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    # convert to category only when fewer than half of the values are unique
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]

print(mem_usage(gl_obj))

print(mem_usage(converted_obj))

751.64 MB
51.67 MB

date = optimized_gl.date

date[:5]

0 18710504
1 18710505
2 18710506
3 18710508
4 18710509
Name: date, dtype: uint32

print (mem_usage(date))

0.66 MB

optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')

print (mem_usage(optimized_gl['date']))

1.31 MB

optimized_gl['date'][:5]

0 1871-05-04
1 1871-05-05
2 1871-05-06
3 1871-05-08
4 1871-05-09
Name: date, dtype: datetime64[ns]
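
Once suitable types are known, they can also be requested while reading the file, so the large intermediate frame is never built. The sketch below is an assumption-laden illustration: it reads a three-row in-memory CSV (invented here, using three column names from the game log) instead of the real file, and the `column_types` mapping is hypothetical.

```python
import io
import pandas as pd

# Three invented rows standing in for game_logs.csv
csv_text = ("date,day_of_week,v_score\n"
            "18710504,Thu,0\n"
            "18710505,Fri,20\n"
            "18710506,Sat,12\n")

# Hypothetical mapping of column name to the compact dtype found earlier
column_types = {'date': str, 'day_of_week': 'category', 'v_score': 'uint8'}

games = pd.read_csv(io.StringIO(csv_text), dtype=column_types)

# Parse the YYYYMMDD strings into real timestamps
games['date'] = pd.to_datetime(games['date'], format='%Y%m%d')

print(games.dtypes)
```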

#The apply operation

import pandas as pd

import numpy as np

titanic = pd.read_csv(r'E:\pyhon\pandas\Pandas%E4%BB%A3%E7%A0%81\titanic_train.csv')

titanic.head()

PassengerId 	Survived 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

def hundredth_row(columns):
    # apply passes one column at a time; position 99 is the 100th row
    item = columns.iloc[99]
    return item

hundredth_row = titanic.apply(hundredth_row)

hundredth_row

PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object

def null_count(columns):
    # count the missing values in one column
    columns_null = pd.isnull(columns)
    null = columns[columns_null]
    return len(null)

columns_null_count = titanic.apply(null_count)

columns_null_count

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
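
The per-column null count can also be computed without a custom function. This is a minimal sketch on a small hypothetical frame (not the Titanic file), comparing the `apply` approach with the vectorized `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the Titanic data
people = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                       'Cabin': [np.nan, 'C85', np.nan, np.nan],
                       'Sex': ['male', 'female', 'female', 'male']})

def null_count(column):
    # number of missing values in one column
    return len(column[pd.isnull(column)])

via_apply = people.apply(null_count)

# Vectorized equivalent: True counts as 1 when summed
via_sum = people.isnull().sum()

print(via_sum)
```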

def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First class'
    elif pclass == 2:
        return 'Second class'
    elif pclass == 3:
        return 'Third class'

classes = titanic.apply(which_class,axis = 1)

classes

0 Third class
1 First class
2 Third class
3 First class
4 Third class
5 Third class
6 First class
7 Third class
8 Third class
9 Second class
10 Third class
11 First class
12 Third class
13 Third class
14 Third class
15 Second class
16 Third class
17 Second class
18 Third class
19 Third class
20 Second class
21 Second class
22 Third class
23 First class
24 Third class
25 Third class
26 Third class
27 First class
28 Third class
29 Third class

861 Second class
862 First class
863 Third class
864 Second class
865 Second class
866 Second class
867 First class
868 Third class
869 Third class
870 Third class
871 First class
872 First class
873 Third class
874 Second class
875 Third class
876 Third class
877 Third class
878 Third class
879 First class
880 Second class
881 Third class
882 Third class
883 Second class
884 Third class
885 Third class
886 Second class
887 First class
888 Third class
889 First class
890 Third class
Length: 891, dtype: object

def is_minor(row):
    if row['Age'] < 18:
        return True
    else:
        return False

minors = titanic.apply(is_minor,axis = 1)

minors

0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 True
11 False
12 False
13 False
14 True
15 False
16 True
17 False
18 False
19 False
20 False
21 False
22 True
23 False
24 True
25 False
26 False
27 False
28 False
29 False

861 False
862 False
863 False
864 False
865 False
866 False
867 False
868 False
869 True
870 False
871 False
872 False
873 False
874 False
875 True
876 False
877 False
878 False
879 False
880 False
881 False
882 False
883 False
884 False
885 False
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
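
Row-wise `apply` calls a Python function once per row, which is slow on large frames. Both row-wise functions above have column-wise equivalents; the sketch below shows them on a small hypothetical frame (not the Titanic file). Note that a missing age compares as `False`, the same result `is_minor` gives.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the Titanic data
people = pd.DataFrame({'Age': [22.0, 2.0, np.nan, 17.0],
                       'Pclass': [3, 1, 2, 3]})

# Column-wise comparison replaces is_minor; NaN < 18 evaluates to False
minors = people['Age'] < 18

# A dictionary lookup replaces the if/elif chain in which_class
class_names = people['Pclass'].map(
    {1: 'First class', 2: 'Second class', 3: 'Third class'})

print(minors)
print(class_names)
```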

****pandas exercises

import pandas as pd

#Show version information

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

#Create a DataFrame

df = pd.DataFrame(data,index = labels)

df.head()

age 	animal 	priority 	visits

a 2.5 cat yes 1
b 3.0 cat yes 3
c 0.5 snake no 2
d NaN dog yes 3
e 5.0 dog no 2

#Show detailed information

df.info()


Index: 10 entries, a to j
Data columns (total 4 columns):
age 8 non-null float64
animal 10 non-null object
priority 10 non-null object
visits 10 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

#Indexing: the first three rows

df.iloc[:3]

age 	animal 	priority 	visits

a 2.5 cat yes 1
b 3.0 cat yes 3
c 0.5 snake no 2

#Select rows by a condition

df[df['visits'] > 2]

age 	animal 	priority 	visits

b 3.0 cat yes 3
d NaN dog yes 3
f 2.0 cat no 3

#Show the rows with a missing age

df[df['age'].isnull()]

age 	animal 	priority 	visits

d NaN dog yes 3
h NaN cat yes 1

#Filter on several conditions at once

df[(df['animal'] =='cat') & (df['age'] < 3)]

age 	animal 	priority 	visits

a 2.5 cat yes 1
f 2.0 cat no 3

#Change a value

df.loc['f','age'] = 1.5

df[(df['animal'] =='cat') & (df['age'] < 3)]

age 	animal 	priority 	visits

a 2.5 cat yes 1
f 1.5 cat no 3

#Mean age per animal with groupby

df.groupby('animal')['age'].mean()

animal
cat 2.333333
dog 5.000000
snake 2.500000
Name: age, dtype: float64

#Count the occurrences of each value

df['animal'].value_counts()

cat 4
dog 4
snake 2
Name: animal, dtype: int64

#Map values to new values

df['priority'] = df['priority'].map({'yes':True,'no':False})

df.head()

age 	animal 	priority 	visits

a 2.5 cat True 1
b 3.0 cat True 3
c 0.5 snake False 2
d NaN dog True 3
e 5.0 dog False 2

#Replace a value

df['animal'] = df['animal'].replace('snake','tangyudi')

df.head()

age 	animal 	priority 	visits

a 2.5 cat True 1
b 3.0 cat True 3
c 0.5 tangyudi False 2
d NaN dog True 3
e 5.0 dog False 2

#Pivot table

df.pivot_table(index = 'animal',columns = 'visits',values='age',aggfunc = 'mean')

visits 1 2 3
animal
cat 2.5 NaN 2.25
dog 3.0 6.0 NaN
tangyudi 4.5 0.5 NaN

#Subtract each row's mean from every element of that row

df = pd.DataFrame(np.random.random(size = (5,3)))

df.head()

0 	1 	2

0 0.787464 0.544326 0.763849
1 0.574682 0.880216 0.688106
2 0.947957 0.526658 0.704592
3 0.073148 0.601730 0.721848
4 0.592968 0.835612 0.710174

df.sub(df.mean(axis = 1),axis = 0)

0 	1 	2

0 0.088918 -0.154221 0.065303
1 -0.139652 0.165881 -0.026229
2 0.221554 -0.199744 -0.021810
3 -0.392427 0.136155 0.256273
4 -0.119950 0.122694 -0.002744

#Count the rows that have no duplicate

len(df.drop_duplicates(keep=False))

5

#Given the data, compute a rolling-window mean within each group (NaN filled with 0)

import numpy as np

df = pd.DataFrame({'group': list('aabbabbbabab'),
                   'value': [1, 2, 3, np.nan, 2, 3,
                             np.nan, 1, 7, 3, np.nan, 8]})

df.head(12)

group 	value

0 a 1.0
1 a 2.0
2 b 3.0
3 b NaN
4 a 2.0
5 b 3.0
6 b NaN
7 b 1.0
8 a 7.0
9 b 3.0
10 a NaN
11 b 8.0

g1 = df.groupby(['group'])['value']

g2 = df.fillna(0).groupby(['group'])['value']

s = g2.rolling(3,min_periods=1).sum()/g2.rolling(3,min_periods=1).count()

s.reset_index(level = 0,drop=True).sort_index()

0 1.000000
1 1.500000
2 3.000000
3 1.500000
4 1.666667
5 2.000000
6 1.000000
7 1.333333
8 3.666667
9 1.333333
10 3.000000
11 4.000000
Name: value, dtype: float64
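
Because both the sum and the count above are taken from the zero-filled groups (`g2`), the ratio is simply the rolling mean of the zero-filled values. The sketch below checks that equivalence on the same data; it is a side-by-side comparison, not a replacement for the author's approach.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': list('aabbabbbabab'),
                   'value': [1, 2, 3, np.nan, 2, 3,
                             np.nan, 1, 7, 3, np.nan, 8]})

filled = df.fillna(0).groupby('group')['value']

# Sum divided by count, as in the cell above
ratio = (filled.rolling(3, min_periods=1).sum()
         / filled.rolling(3, min_periods=1).count())

# The same quantity computed directly
mean = filled.rolling(3, min_periods=1).mean()

# Drop the group level and restore the original row order
ratio = ratio.reset_index(level=0, drop=True).sort_index()
mean = mean.reset_index(level=0, drop=True).sort_index()

print(mean)
```

If missing values should be excluded from the denominator instead, the count would have to come from the unfilled groups (`g1`), which gives different numbers from those shown above.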

#Fill missing values automatically by interpolation

import pandas as pd

import numpy as np

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})

df.head()

Airline 	FlightNumber 	From_To 	RecentDelays

0 KLM(!) 10045.0 LoNDon_paris [23, 47]
1 <Air France> (12) NaN MAdrid_miLAN []
2 (British Airways. ) 10065.0 londON_StockhOlm [24, 43, 87]
3 12. Air France NaN Budapest_PaRis [13]
4 “Swiss Air” 10085.0 Brussels_londOn [67, 32]

df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)

df.head()

Airline 	FlightNumber 	From_To 	RecentDelays

0 KLM(!) 10045 LoNDon_paris [23, 47]
1 <Air France> (12) 10055 MAdrid_miLAN []
2 (British Airways. ) 10065 londON_StockhOlm [24, 43, 87]
3 12. Air France 10075 Budapest_PaRis [13]
4 “Swiss Air” 10085 Brussels_londOn [67, 32]

#Split the From_To column into two features

temp = df.From_To.str.split('_',expand = True)

temp.columns = ['From','To']

#Capitalize the first letter and lower-case the rest

temp['From'] = temp['From'].str.capitalize()

temp['To'] = temp['To'].str.capitalize()

df = df.join(temp)

df.head()

Airline 	FlightNumber 	From_To 	RecentDelays 	From 	To

0 KLM(!) 10045 LoNDon_paris [23, 47] London Paris
1 <Air France> (12) 10055 MAdrid_miLAN [] Madrid Milan
2 (British Airways. ) 10065 londON_StockhOlm [24, 43, 87] London Stockholm
3 12. Air France 10075 Budapest_PaRis [13] Budapest Paris
4 “Swiss Air” 10085 Brussels_londOn [67, 32] Brussels London

#Drop the From_To column, now replaced by the capitalized From and To columns

df = df.drop('From_To',axis = 1)

#Remove the extra characters from Airline

df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)',expand = False).str.strip()

df.head()

Airline 	FlightNumber 	RecentDelays 	From 	To

0 KLM 10045 [23, 47] London Paris
1 Air France 10055 [] Madrid Milan
2 British Airways 10065 [24, 43, 87] London Stockholm
3 Air France 10075 [13] Budapest Paris
4 Swiss Air 10085 [67, 32] Brussels London

#Expand the RecentDelays lists into separate columns

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n) for n in range(1,len(delays.columns)+1)]

delays

delay_1 	delay_2 	delay_3

0 23.0 47.0 NaN
1 NaN NaN NaN
2 24.0 43.0 87.0
3 13.0 NaN NaN
4 67.0 32.0 NaN
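
`apply(pd.Series)` builds one Series per row, which can be slow on long frames. A possible alternative, sketched here under the assumption that every cell holds a plain Python list, is to hand the lists to the `DataFrame` constructor, which pads the shorter rows with missing values.

```python
import pandas as pd

df = pd.DataFrame({'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]]})

# Build the columns in one step; shorter lists are padded with NaN
delays = pd.DataFrame(df['RecentDelays'].tolist(), index=df.index)

delays.columns = ['delay_{}'.format(n)
                  for n in range(1, len(delays.columns) + 1)]

print(delays)
```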

#MultiIndex

letters = ['A','B','C']

numbers = list(range(10))

mi = pd.MultiIndex.from_product([letters,numbers])

s = pd.Series(np.random.rand(30),index=mi)

s

A 0 0.935360
1 0.197775
2 0.095093
3 0.465841
4 0.907051
5 0.260017
6 0.439027
7 0.051335
8 0.825270
9 0.554543
B 0 0.335595
1 0.913604
2 0.894998
3 0.489151
4 0.322718
5 0.475781
6 0.727297
7 0.065137
8 0.488248
9 0.386090
C 0 0.264502
1 0.826158
2 0.479834
3 0.893296
4 0.058635
5 0.499101
6 0.873221
7 0.877330
8 0.524506
9 0.256802
dtype: float64

#Locate data with a slice on each index level

s.loc[pd.IndexSlice[:'B',5:]]

A 5 0.677665
6 0.533658
7 0.326082
8 0.071546
9 0.434138
B 5 0.339513
6 0.901485
7 0.529628
8 0.409966
9 0.650863
dtype: float64

#Aggregate by index level (pandas 2.0 removed the level argument; there, use s.groupby(level = 1).sum())

s.sum(level = 1)

0 2.062543
1 1.803863
2 1.762901
3 1.795556
4 2.305935
5 1.446287
6 1.854993
7 1.723217
8 0.964694
9 1.742379
dtype: float64
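
On current pandas the same per-level total is written with `groupby(level=...)`. The sketch below uses a small index with fixed integer values (an assumption made so the sums can be checked by hand) instead of the random series above.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B', 'C'], range(3)])
s = pd.Series(np.arange(9), index=mi)

# Sum across letters for each number (inner level)
by_number = s.groupby(level=1).sum()

# Sum across numbers for each letter (outer level)
by_letter = s.groupby(level=0).sum()

print(by_number)
print(by_letter)
```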

#Swap the index levels

new = s.swaplevel(0,1)

new

0 A 0.660232
1 A 0.749437
2 A 0.907719
3 A 0.617251
4 A 0.837966
5 A 0.677665
6 A 0.533658
7 A 0.326082
8 A 0.071546
9 A 0.434138
0 B 0.779293
1 B 0.924057
2 B 0.425088
3 B 0.652168
4 B 0.811879
5 B 0.339513
6 B 0.901485
7 B 0.529628
8 B 0.409966
9 B 0.650863
0 C 0.623017
1 C 0.130369
2 C 0.430094
3 C 0.526137
4 C 0.656089
5 C 0.429108
6 C 0.419850
7 C 0.867507
8 C 0.483183
9 C 0.657378
dtype: float64
