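Table of Contents
- 4. pandas
- 4.1. Environment setup
- 4.2. pandas basics
- 4.2.1. A simple pandas example
- 4.2.2. A simple pandas Series example
- 4.2.3. Creating a Series from a dictionary
- 4.2.4. DataFrame: without an explicit index, rows are numbered from 0 by default
- 4.2.5. Getting data from a DataFrame with df.loc[0]
- 4.2.6. Returning multiple rows from a DataFrame
- 4.2.7. Writing a DataFrame to CSV
- 4.2.8. Reading CSV with pandas: read_csv, head, tail
- 4.2.9. Building a DataFrame from JSON
- 4.2.10. Building a DataFrame from dictionary-style JSON
- 4.2.11. Dropping missing data from a DataFrame with dropna()
- 4.2.12. Specifying which values count as missing when reading a CSV
- 4.2.13. Dropping entire rows that have missing data in specified columns
- 4.2.14. Filling NaN (missing data) with a replacement value
- 4.2.15. Fixing incorrect data in a DataFrame
- 4.2.16. Iterating over a DataFrame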
4. pandas
4.1. Environment setup
pip3 install pandas
4.2. pandas basics
4.2.1. A simple pandas example
import pandas as pd
test_data = {
    "country": ["China", "China", "China"],
    "sites": ["baidu", "sougou", "hao123"],
    "rank": [1, 3, 2]
}
df1 = pd.DataFrame(test_data)
print(df1)
country sites rank
0 China baidu 1
1 China sougou 3
2 China hao123 2
4.2.2. A simple pandas Series example
se1 = pd.Series(["Hello","world"], index=["a","b"])
print(se1)
print(se1["a"])
a Hello
b world
dtype: object
Hello
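Looking up a label that is not in the index with square brackets raises a KeyError. The following is a minimal sketch of two safer checks, reusing the Series from this section; the label "z" and the fallback string "missing" are made up for illustration.

```python
import pandas as pd

se1 = pd.Series(["Hello", "world"], index=["a", "b"])

# The `in` operator tests membership against the index labels, not the values.
has_a = "a" in se1
has_z = "z" in se1

# Series.get returns a default instead of raising when the label is absent.
value_a = se1.get("a", "missing")
value_z = se1.get("z", "missing")

print(has_a, has_z)
print(value_a, value_z)
```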
4.2.3. Creating a Series from a dictionary
dict1 = {1: "hello", 2: "world", "3": "kelvin"}
se1 = pd.Series(dict1)
print(se1)
1 hello
2 world
3 kelvin
dtype: object
4.2.4. DataFrame: without an explicit index, rows are numbered from 0 by default
my_data = [["kelvin", "31"], ["tom", "29"], ["kipper", "13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data=my_data, columns=my_column)
print(df)
name age
0 kelvin 31
1 tom 29
2 kipper 13
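For comparison, here is a small sketch of the same data with an explicit index. The labels "a", "b" and "c" are arbitrary choices for illustration and do not come from the original example.

```python
import pandas as pd

my_data = [["kelvin", "31"], ["tom", "29"], ["kipper", "13"]]
my_column = ["name", "age"]

# No index given: pandas numbers the rows 0, 1, 2.
default_df = pd.DataFrame(data=my_data, columns=my_column)

# Explicit index: the rows are labeled with the values we pass in.
labeled_df = pd.DataFrame(data=my_data, columns=my_column, index=["a", "b", "c"])

print(list(default_df.index))
print(list(labeled_df.index))
```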
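4.2.5. Getting data from a DataFrame with df.loc[0]
my_data = [["kelvin", "31"], ["tom", "29"], ["kipper", "13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data=my_data, columns=my_column)
print(df)
print()
# df.loc[0] returns the row labeled 0 as a Series
print(df.loc[0])
print()
# Select a single cell by row label and column name
print(df.loc[0, "name"])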
name age
0 kelvin 31
1 tom 29
2 kipper 13
name kelvin
age 31
Name: 0, dtype: object
kelvin
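With the default index, a row's label and its position are the same number, which hides the difference between loc (label-based) and iloc (position-based). The sketch below uses made-up labels "a", "b" and "c" to make that difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    [["kelvin", "31"], ["tom", "29"], ["kipper", "13"]],
    columns=["name", "age"],
    index=["a", "b", "c"],  # hypothetical labels, for illustration only
)

row_by_label = df.loc["b"]      # select by index label
row_by_position = df.iloc[1]    # select by integer position
cell = df.loc["b", "name"]      # single cell: row label, column name

print(row_by_label)
print(cell)
```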
4.2.6. Returning multiple rows from a DataFrame
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
# Passing a list of labels to loc returns those rows as a new DataFrame
print(df.loc[["day1", "day2"]])
calories duration
day1 420 50
day2 380 40
day3 390 45
calories duration
day1 420 50
day2 380 40
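Besides a list of labels, loc also accepts a label slice and an optional column selector. A brief sketch on the same data; note that, unlike ordinary Python slices, a label slice includes its end label.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Label slices are inclusive on both ends: this returns day1 and day2.
first_two = df.loc["day1":"day2"]

# Rows by list of labels, restricted to one column (returns a Series).
picked_calories = df.loc[["day1", "day3"], "calories"]

print(first_two)
print(picked_calories)
```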
4.2.7. Writing a DataFrame to CSV
name = ["kelvin", "tom", "kipper"]
age = [31, 29, 15]
region = ["江苏", "上海", "杭州"]
dict1 = {"name":name, "age":age, "region":region}
df = pd.DataFrame(dict1)
print(df)
df.to_csv("test_csv.csv")
name age region
0 kelvin 31 江苏
1 tom 29 上海
2 kipper 15 杭州
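By default to_csv also writes the row index as an unnamed first column. If that column is not wanted, index=False leaves it out. The sketch below writes to an in-memory buffer rather than a file so it can run anywhere; the two-row frame is a shortened stand-in for the data above.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["kelvin", "tom"], "age": [31, 29]})

# Write without the index column; a file path works the same way as a buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back to confirm the round trip.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```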
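4.2.8. Reading CSV with pandas: read_csv, head, tail
df = pd.read_csv("nba.csv")
# head(10) returns the first 10 rows as a new DataFrame; tail(n) does the same for the last n rows.
# The result is not assigned or printed here, so only the full frame below is shown.
df.head(10)
print(df)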
Name Team Number Position Age Height Weight \
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0
.. ... ... ... ... ... ... ...
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0
457 NaN NaN NaN NaN NaN NaN NaN
College Salary
0 Texas 7730337.0
1 Marquette 6796117.0
2 Boston University NaN
3 Georgia State 1148640.0
4 NaN 5000000.0
.. ... ...
453 Butler 2433333.0
454 NaN 900000.0
455 NaN 2900000.0
456 Kansas 947276.0
457 NaN NaN
[458 rows x 9 columns]
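Since nba.csv is not bundled here, the following sketch shows head and tail on a small made-up frame instead. Both return a new DataFrame and leave the original untouched, and both default to 5 rows when no count is given.

```python
import pandas as pd

# Hypothetical stand-in data: 20 rows with a single column "n".
df = pd.DataFrame({"n": range(20)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```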
4.2.9. Building a DataFrame from JSON
json1 = [
    {"name":"kelvin","age":"31","region":"江苏"},
    {"name":"tom","age":"29","region":"上海","unkonwn":"111"},
    {"name":"kipper","age":"15","region":"杭州"}
]
df = pd.DataFrame(json1)
print(df)
df.to_json()
name age region unkonwn
0 kelvin 31 江苏 NaN
1 tom 29 上海 111
2 kipper 15 杭州 NaN
'{"name":{"0":"kelvin","1":"tom","2":"kipper"},"age":{"0":"31","1":"29","2":"15"},"region":{"0":"\\u6c5f\\u82cf","1":"\\u4e0a\\u6d77","2":"\\u676d\\u5dde"},"unkonwn":{"0":null,"1":"111","2":null}}'
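The default to_json output is keyed by column and escapes non-ASCII characters. As a sketch of an alternative, orient="records" produces a list of row objects and force_ascii=False keeps the characters readable. The two-record list is a shortened version of the data above.

```python
import json

import pandas as pd

records = [
    {"name": "kelvin", "region": "江苏"},
    {"name": "tom", "region": "上海"},
]
df = pd.DataFrame(records)

# One JSON object per row, with non-ASCII text left unescaped.
text = df.to_json(orient="records", force_ascii=False)

# Parse it back with the standard library to check the structure.
round_trip = json.loads(text)

print(text)
```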
4.2.10. Building a DataFrame from dictionary-style JSON
json1 = {
    "2021-07-01":{"name":"kelvin","age":"31","region":"江苏"},
    "2021-07-02":{"name":"tom","age":"29","region":"上海","unkonwn":"111"},
    "2021-07-03":{"name":"kipper","age":"15","region":"杭州"}
}
df = pd.DataFrame(json1).T
print(df)
name age region unkonwn
2021-07-01 kelvin 31 江苏 NaN
2021-07-02 tom 29 上海 111
2021-07-03 kipper 15 杭州 NaN
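Transposing with .T works here; another option is DataFrame.from_dict with orient="index", which treats the outer keys as row labels directly. A minimal sketch with a shortened copy of the data, where the key "extra" is a made-up stand-in for the optional field:

```python
import pandas as pd

by_date = {
    "2021-07-01": {"name": "kelvin", "age": "31"},
    "2021-07-02": {"name": "tom", "age": "29", "extra": "111"},
}

# Outer keys become the index, inner keys become the columns.
df = pd.DataFrame.from_dict(by_date, orient="index")

print(df)
```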
4.2.11. Dropping missing data from a DataFrame with dropna()
df = pd.read_csv("property-data.csv")
print(df)
print(df.isnull())
print()
print(df.dropna())
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 False False False False False False False
1 False False False False False False False
2 False True False False True False False
3 False False False False False True False
4 True False False False False False False
5 False False False False True False False
6 False True False True False False False
7 False False False False False False True
8 False False False False False False False
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
8 100009000.0 215.0 TREMONT Y na 2 1800
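Called with no arguments, dropna removes every row that has at least one missing value, which is why only three rows survive above. The sketch below, on a small made-up frame, shows how to count missing values per column and two less aggressive options: how="all" and thresh.

```python
import pandas as pd

nan = float("nan")
# Hypothetical data: row 0 is complete, row 1 has one gap, row 2 is empty.
df = pd.DataFrame({
    "a": [1, nan, nan],
    "b": [2, 5, nan],
    "c": [3, 6, nan],
})

missing_per_column = df.isnull().sum()      # number of NaN values in each column
drop_any = df.dropna()                      # drop rows with any NaN
drop_all_missing = df.dropna(how="all")     # drop rows only if every value is NaN
keep_two_or_more = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values

print(missing_per_column)
print(drop_all_missing)
```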
4.2.12. Specifying which values count as missing when reading a CSV
missing_value = ["na", "--"]
df = pd.read_csv("property-data.csv", na_values=missing_value)
print(df)
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3.0 1 1000.0
1 100002000.0 197.0 LEXINGTON N 3.0 1.5 NaN
2 100003000.0 NaN LEXINGTON N NaN 1 850.0
3 100004000.0 201.0 BERKELEY 12 1.0 NaN 700.0
4 NaN 203.0 BERKELEY Y 3.0 2 1600.0
5 100006000.0 207.0 BERKELEY Y NaN 1 800.0
6 100007000.0 NaN WASHINGTON NaN 2.0 HURLEY 950.0
7 100008000.0 213.0 TREMONT Y 1.0 1 NaN
8 100009000.0 215.0 TREMONT Y NaN 2 1800.0
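To see the effect of na_values without the original file, the sketch below reads a tiny made-up CSV from memory twice. The column names and values are invented for illustration; only the "na" and "--" markers mirror the example above.

```python
import io

import pandas as pd

csv_text = "id,beds,sq_ft\n1,3,1000\n2,na,--\n3,2,850\n"

# Without na_values, "na" and "--" are kept as ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, both markers are parsed as NaN and the columns become numeric.
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

raw_missing = int(raw.isnull().sum().sum())
cleaned_missing = int(cleaned.isnull().sum().sum())

print(raw_missing, cleaned_missing)
print(cleaned.dtypes)
```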
4.2.13. Dropping entire rows that have missing data in specified columns
df = pd.read_csv("property-data.csv")
print(df)
print()
# dropna returns a new DataFrame, so assign the result back
df = df.dropna(subset=["ST_NUM"])
print(df)
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
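Rows removed by dropna leave gaps in the index, because the surviving rows keep their original labels. If consecutive numbering is preferred, reset_index(drop=True) renumbers them. A small sketch on made-up data with lowercase column names chosen for illustration:

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "st_num": [104.0, nan, 201.0],
    "st_name": ["PUTNAM", "LEXINGTON", "BERKELEY"],
})

# Only st_num is checked; the original frame is left unchanged.
kept = df.dropna(subset=["st_num"])

# Renumber the surviving rows from 0 and discard the old labels.
renumbered = kept.reset_index(drop=True)

print(kept)
print(renumbered)
```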
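4.2.14. Filling NaN (missing data) with a replacement value
df = pd.read_csv("property-data.csv")
print(df)
# Assign the filled column back; fillna(..., inplace=True) on a selected column raises a chained-assignment warning in recent pandas versions
df["ST_NUM"] = df["ST_NUM"].fillna(10000)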
print()
print(df)
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 10000.0 LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 10000.0 WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
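A fixed constant is not always the best filler. As one possible variation, fillna accepts a dictionary that maps column names to fill values, so each column can get its own replacement, such as the column mean. The frame and the "unknown" placeholder below are made up for illustration.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "st_num": [104.0, nan, 201.0],
    "owner": ["Y", None, "N"],
})

# Per-column fill values: the mean for the numeric column, a placeholder for the text column.
fill_values = {"st_num": df["st_num"].mean(), "owner": "unknown"}
filled = df.fillna(fill_values)

print(filled)
```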
4.2.15. Fixing incorrect data in a DataFrame
person = {
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 40, 12345]
}
df = pd.DataFrame(person)
print(df)
print()
df.loc[2,"age"] = 30
print(df)
name age
0 Google 50
1 Runoob 40
2 Taobao 12345
name age
0 Google 50
1 Runoob 40
2 Taobao 30
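Setting a cell with loc requires knowing the row label. When the bad value itself is known but its location is not, Series.replace is one alternative. A sketch on the same data:

```python
import pandas as pd

person = {
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 40, 12345],
}
df = pd.DataFrame(person)

# Replace every occurrence of the bad value, wherever it appears in the column.
df["age"] = df["age"].replace(12345, 30)

print(df)
```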
4.2.16. Iterating over a DataFrame
person = {
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 200, 12345]
}
df = pd.DataFrame(person)
print(df)
print(df.index)
# Walk the index labels and reset any implausible age (over 120) to 1
for x in df.index:
    print(x)
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 1
print(df)
name age
0 Google 50
1 Runoob 200
2 Taobao 12345
RangeIndex(start=0, stop=3, step=1)
0
1
2
name age
0 Google 50
1 Runoob 1
2 Taobao 1
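The loop above works, but the same rule can usually be written without explicit iteration by passing a boolean mask to loc. A minimal sketch that should give the same result on this data:

```python
import pandas as pd

person = {
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 200, 12345],
}
df = pd.DataFrame(person)

# The comparison yields one True/False per row; loc assigns only where it is True.
too_old = df["age"] > 120
df.loc[too_old, "age"] = 1

print(df)
```

On large frames this form tends to be faster than looping over df.index, since the comparison and assignment are handled by pandas in bulk.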