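Stock Analysis with Python (4): pandas

Contents

  • 4. pandas
    • 4.1. Environment setup
    • 4.2. pandas basics
      • 4.2.1. A simple pandas example
      • 4.2.2. A simple pandas Series example
      • 4.2.3. Creating a Series from a dictionary
      • 4.2.4. DataFrame: the index defaults to numbering from 0 when not specified
      • 4.2.5. Getting data from a DataFrame with df.loc[0]
      • 4.2.6. Returning multiple rows from a DataFrame
      • 4.2.7. Writing a DataFrame to CSV
      • 4.2.8. Reading CSV with pandas: read_csv, head, tail
      • 4.2.9. Building a DataFrame from JSON
      • 4.2.10. Building a DataFrame from dictionary-style JSON
      • 4.2.11. Removing null data from a DataFrame with dropna()
      • 4.2.12. Specifying which values count as null when reading CSV
      • 4.2.13. Dropping whole rows that have null data in specified columns
      • 4.2.14. Filling and replacing NaN (null data)
      • 4.2.15. Correcting wrong data in a DataFrame
      • 4.2.16. Iterating over a DataFrame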

4. pandas

4.1. Environment setup

pip3 install pandas

4.2. pandas basics

4.2.1. A simple pandas example

# A simple pandas example: build a DataFrame from a dict of columns
import pandas as pd
test_data = {
    "country":["China","China","China"],
    "sites":["baidu","sougou","hao123"],
    "rank":[1,3,2]
}
df1 = pd.DataFrame(test_data)
print(df1)

  country   sites  rank
0   China   baidu     1
1   China  sougou     3
2   China  hao123     2
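
Once a DataFrame exists, a few attributes and methods help to inspect it. The following is a minimal sketch on the same data; sorting by "rank" is just an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": ["China", "China", "China"],
    "sites": ["baidu", "sougou", "hao123"],
    "rank": [1, 3, 2],
})

print(df1.shape)                   # (number of rows, number of columns)
print(list(df1.columns))           # column names
by_rank = df1.sort_values("rank")  # returns a new DataFrame sorted by rank
print(by_rank)
```

sort_values leaves df1 itself unchanged, which is why the result is stored in a new variable.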

4.2.2. A simple pandas Series example

# A simple pandas Series example
# pandas.Series(data, index, dtype, name, copy)
se1 = pd.Series(["Hello","world"], index=["a","b"])
print(se1)
print(se1["a"])

a    Hello
b    world
dtype: object
Hello
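
With a custom index, a value can be reached either by its label or by its position. A small sketch of both, assuming the same two-element Series; the name "greeting" is only illustrative.

```python
import pandas as pd

# Same data as above, with an illustrative name attached to the Series.
se = pd.Series(["Hello", "world"], index=["a", "b"], name="greeting")

by_label = se.loc["a"]     # look up by index label
by_position = se.iloc[1]   # look up by integer position
labels = list(se.index)    # the index labels as a plain list

print(by_label, by_position, labels)
```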

4.2.3. Creating a Series from a dictionary

# Create a Series from a dictionary; the keys become the index labels
dict1 = {1:"hello", 2:"world", "3":"kelvin"}
se1 = pd.Series(dict1)
print(se1)

1     hello
2     world
3    kelvin
dtype: object

4.2.4. DataFrame: the index defaults to numbering from 0 when not specified

# DataFrame: if no index is specified, rows are numbered from 0 by default
# pandas.DataFrame(data, index, columns, dtype, copy)
my_data = [["kelvin","31"],["tom","29"],["kipper","13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data = my_data, columns = my_column)
print(df)

     name age
0  kelvin  31
1     tom  29
2  kipper  13
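
Note that the ages above were supplied as strings, so the column holds text rather than numbers. If numeric operations are needed, one option (a sketch, not the only way) is to convert the column with astype:

```python
import pandas as pd

my_data = [["kelvin", "31"], ["tom", "29"], ["kipper", "13"]]
df = pd.DataFrame(data=my_data, columns=["name", "age"])

# The ages were given as strings; convert them so numeric operations work.
df["age"] = df["age"].astype(int)

print(df.dtypes)
print(df["age"].max())
```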

4.2.5. Getting data from a DataFrame with df.loc[0]

# Get data from a DataFrame with df.loc[0]
my_data = [["kelvin","31"],["tom","29"],["kipper","13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data = my_data, columns = my_column)
print(df)
print()
print(df.loc[0]) # the row with label 0
print()
print(df.loc[0, "name"]) # row label and column name in a single lookup

     name age
0  kelvin  31
1     tom  29
2  kipper  13

name    kelvin
age         31
Name: 0, dtype: object

kelvin
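
Besides label-based .loc, pandas also offers position-based .iloc. The sketch below shows both on the same data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["kelvin", "31"], ["tom", "29"], ["kipper", "13"]],
    columns=["name", "age"],
)

first_name = df.loc[0, "name"]  # row label 0, column "name"
last_age = df.iloc[-1, 1]       # last row, second column, by position
ages = df["age"].tolist()       # a whole column as a Python list

print(first_name, last_age, ages)
```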

4.2.6. Returning multiple rows from a DataFrame

# Return multiple rows from a DataFrame
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print(df.loc[["day1","day2"]]) # pass a list of labels to select several rows

      calories  duration
day1       420        50
day2       380        40
day3       390        45
      calories  duration
day1       420        50
day2       380        40
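
Rows can also be selected with a label slice, and rows and columns can be selected in one step. A minimal sketch on the same calories/duration data:

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# Label slices with .loc include both endpoints.
first_two = df.loc["day1":"day2"]

# Rows and a column can be selected together.
cal_subset = df.loc[["day1", "day3"], "calories"]

print(first_two)
print(cal_subset)
```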

4.2.7. Writing a DataFrame to CSV

# Write a DataFrame to a CSV file
name = ["kelvin", "tom", "kipper"]
age = [31, 29, 15]
region = ["江苏", "上海", "杭州"]
dict1 = {"name":name, "age":age, "region":region}
df = pd.DataFrame(dict1)
print(df)
df.to_csv("test_csv.csv")

     name  age region
0  kelvin   31     江苏
1     tom   29     上海
2  kipper   15     杭州
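
By default to_csv also writes the index as an extra, unnamed first column. The sketch below passes index=False to leave it out and reads the text back to check the round trip. It uses an in-memory buffer instead of a file so that it runs without touching the disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["kelvin", "tom"], "age": [31, 29]})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back through an in-memory buffer.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```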

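4.2.8. Reading CSV with pandas: read_csv, head, tail

# Read a CSV file with pandas: read_csv, head, tail
df = pd.read_csv("nba.csv")
df.head(10) # returns the first 10 rows; the result is not stored or printed here
print(df)   # prints the whole DataFrame (pandas truncates long output)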

              Name            Team  Number Position   Age Height  Weight  \
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0   
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0   
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0   
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0   
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0   
..             ...             ...     ...      ...   ...    ...     ...   
453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0   
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0   
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0   
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0   
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN   

               College     Salary  
0                Texas  7730337.0  
1            Marquette  6796117.0  
2    Boston University        NaN  
3        Georgia State  1148640.0  
4                  NaN  5000000.0  
..                 ...        ...  
453             Butler  2433333.0  
454                NaN   900000.0  
455                NaN  2900000.0  
456             Kansas   947276.0  
457                NaN        NaN  

[458 rows x 9 columns]
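
head and tail each return a new DataFrame, so the result has to be printed or stored to be of use. The sketch below uses a small stand-in DataFrame (an assumption, since nba.csv is not included here) to show both.

```python
import pandas as pd

# Stand-in data; the real nba.csv is not needed for this sketch.
df = pd.DataFrame({"n": range(20)})

top = df.head(3)     # first 3 rows
bottom = df.tail(2)  # last 2 rows
default = df.head()  # with no argument, head() returns 5 rows

print(top)
print(bottom)
print(len(default))
```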

4.2.9. Building a DataFrame from JSON

# Build a DataFrame from JSON-style data (a list of records)
# df = pd.read_json('sites.json') would build a DataFrame from the file sites.json
json1 = [
    {"name":"kelvin","age":"31","region":"江苏"},
    {"name":"tom","age":"29","region":"上海","unknown":"111"},
    {"name":"kipper","age":"15","region":"杭州"}
]
df = pd.DataFrame(json1)
print(df)
df.to_json()

     name age region unknown
0  kelvin  31     江苏     NaN
1     tom  29     上海     111
2  kipper  15     杭州     NaN
'{"name":{"0":"kelvin","1":"tom","2":"kipper"},"age":{"0":"31","1":"29","2":"15"},"region":{"0":"\\u6c5f\\u82cf","1":"\\u4e0a\\u6d77","2":"\\u676d\\u5dde"},"unknown":{"0":null,"1":"111","2":null}}'
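
The default to_json output is column-oriented and escapes non-ASCII characters. As a sketch of an alternative, orient="records" emits one object per row and force_ascii=False keeps the text readable. The two-row data set is a shortened version of the one above.

```python
import json
import pandas as pd

records = [
    {"name": "kelvin", "age": "31", "region": "江苏"},
    {"name": "tom", "age": "29", "region": "上海"},
]
df = pd.DataFrame(records)

# orient="records" emits a list of row objects;
# force_ascii=False keeps non-ASCII text instead of \uXXXX escapes.
text = df.to_json(orient="records", force_ascii=False)
print(text)

# The standard json module can parse the result back into Python objects.
parsed = json.loads(text)
```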

4.2.10. Building a DataFrame from dictionary-style JSON

# Build a DataFrame from dictionary-style JSON
# The outer keys become columns, so .T transposes them into row labels
json1 = {
    "2021-07-01":{"name":"kelvin","age":"31","region":"江苏"},
    "2021-07-02":{"name":"tom","age":"29","region":"上海","unknown":"111"},
    "2021-07-03":{"name":"kipper","age":"15","region":"杭州"}
}
df = pd.DataFrame(json1).T
print(df)

              name age region unknown
2021-07-01  kelvin  31     江苏     NaN
2021-07-02     tom  29     上海     111
2021-07-03  kipper  15     杭州     NaN
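
An alternative to transposing is DataFrame.from_dict with orient="index", which treats the outer keys as row labels directly. The sketch below also converts the date strings into timestamps, which tends to be useful for time-indexed data; the two-row data set is a shortened stand-in.

```python
import pandas as pd

by_date = {
    "2021-07-01": {"name": "kelvin", "age": "31"},
    "2021-07-02": {"name": "tom", "age": "29"},
}

# orient="index" uses the outer keys as row labels, so no transpose is needed.
df = pd.DataFrame.from_dict(by_date, orient="index")

# Convert the string labels into real timestamps for date-based work.
df.index = pd.to_datetime(df.index)

print(df)
print(df.index.dtype)
```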

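4.2.11. Removing null data from a DataFrame with dropna()

# Remove null data from a DataFrame
# As the output shows, these are NOT treated as null: na, --
# These ARE treated as null: NaN, empty cells, NA, n/a
# Empty cells and NA are converted to NaN when read into a DataFrame
df = pd.read_csv("property-data.csv")
print(df)
print(df.isnull()) # not null: na, --; null: NaN, empty cells, NA, n/a
print()
print(df.dropna()) # rows with lowercase "na" are not dropped; neither are rows with "--"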

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800
     PID  ST_NUM  ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
0  False   False    False         False         False     False  False
1  False   False    False         False         False     False  False
2  False    True    False         False          True     False  False
3  False   False    False         False         False      True  False
4   True   False    False         False         False     False  False
5  False   False    False         False          True     False  False
6  False    True    False          True         False     False  False
7  False   False    False         False         False     False   True
8  False   False    False         False         False     False  False

           PID  ST_NUM    ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0     PUTNAM            Y            3        1  1000
1  100002000.0   197.0  LEXINGTON            N            3      1.5    --
8  100009000.0   215.0    TREMONT            Y           na        2  1800
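
A quick way to see how many nulls each column holds is to sum the boolean mask from isnull(). The sketch below uses a small inline CSV with made-up rows in place of property-data.csv.

```python
import io
import pandas as pd

# A small inline CSV standing in for property-data.csv (made-up rows).
csv_text = """ST_NUM,NUM_BEDROOMS,SQ_FT
104,3,1000
,n/a,--
201,na,
"""
df = pd.read_csv(io.StringIO(csv_text))

# isnull() marks each cell; summing the booleans counts nulls per column.
null_counts = df.isnull().sum()
print(null_counts)

# dropna() keeps only rows with no null in any column.
complete_rows = df.dropna()
print(len(complete_rows))
```

Here "n/a" and the empty cells count as null, while "na" and "--" do not, matching the behaviour described above.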

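4.2.12. Specifying which values count as null when reading CSV

# When reading a CSV, you can specify which values count as null.
# By default, na and -- are not treated as null; once listed in na_values, they are read as NaN.
# NaN, empty cells, NA and n/a are still treated as null.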
missing_value = ["na","--"]
df = pd.read_csv("property-data.csv", na_values = missing_value)
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH   SQ_FT
0  100001000.0   104.0      PUTNAM            Y           3.0        1  1000.0
1  100002000.0   197.0   LEXINGTON            N           3.0      1.5     NaN
2  100003000.0     NaN   LEXINGTON            N           NaN        1   850.0
3  100004000.0   201.0    BERKELEY           12           1.0      NaN   700.0
4          NaN   203.0    BERKELEY            Y           3.0        2  1600.0
5  100006000.0   207.0    BERKELEY            Y           NaN        1   800.0
6  100007000.0     NaN  WASHINGTON          NaN           2.0   HURLEY   950.0
7  100008000.0   213.0     TREMONT            Y           1.0        1     NaN
8  100009000.0   215.0     TREMONT            Y           NaN        2  1800.0
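
If the data has already been loaded, placeholder strings in a numeric column can still be turned into NaN afterwards. One possible approach, sketched here on made-up values, is pd.to_numeric with errors="coerce":

```python
import pandas as pd

# Made-up values mixing numbers with placeholder strings.
sq_ft = pd.Series(["1000", "--", "850", "na"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
cleaned = pd.to_numeric(sq_ft, errors="coerce")
print(cleaned)
```

Unlike na_values, this treats every unparseable value as null, so it is best applied to columns that are meant to be purely numeric.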

4.2.13. Dropping whole rows that have null data in specified columns

# Drop the whole row when a specified column contains null data
# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df = pd.read_csv("property-data.csv")
print(df)
print()
df = df.dropna(subset=["ST_NUM"]) # dropna returns a new DataFrame, so assign the result back
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

           PID  ST_NUM    ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0     PUTNAM            Y            3        1  1000
1  100002000.0   197.0  LEXINGTON            N            3      1.5    --
3  100004000.0   201.0   BERKELEY           12            1      NaN   700
4          NaN   203.0   BERKELEY            Y            3        2  1600
5  100006000.0   207.0   BERKELEY            Y          NaN        1   800
7  100008000.0   213.0    TREMONT            Y            1        1   NaN
8  100009000.0   215.0    TREMONT            Y           na        2  1800
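
The how and thresh parameters in the signature above control how strict the row filter is. A small sketch on made-up numeric data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

all_null_removed = df.dropna(how="all")  # drop rows where every column is null
at_least_three = df.dropna(thresh=3)     # keep rows with at least 3 non-null values

print(all_null_removed)
print(at_least_three)
```

Both calls return new DataFrames and leave df unchanged.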

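4.2.14. Filling and replacing NaN (null data)

# Fill NaN (null data) with a replacement value
df = pd.read_csv("property-data.csv")
print(df)
df["ST_NUM"] = df["ST_NUM"].fillna(10000) # assign back; inplace=True on a selected column may not update df in newer pandas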
print()
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

           PID   ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0    104.0      PUTNAM            Y            3        1  1000
1  100002000.0    197.0   LEXINGTON            N            3      1.5    --
2  100003000.0  10000.0   LEXINGTON            N          NaN        1   850
3  100004000.0    201.0    BERKELEY           12            1      NaN   700
4          NaN    203.0    BERKELEY            Y            3        2  1600
5  100006000.0    207.0    BERKELEY            Y          NaN        1   800
6  100007000.0  10000.0  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0    213.0     TREMONT            Y            1        1   NaN
8  100009000.0    215.0     TREMONT            Y           na        2  1800
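
A fixed value such as 10000 is only one choice. Another common option is to fill with a statistic of the column itself; the sketch below uses the mean on made-up data, which is an assumption about what suits the data rather than a general recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ST_NUM": [104.0, 197.0, np.nan, 201.0]})

# mean() skips NaN by default, so it is computed from the three known values.
mean_value = df["ST_NUM"].mean()
df["ST_NUM"] = df["ST_NUM"].fillna(mean_value)
print(df)
```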

4.2.15. Correcting wrong data in a DataFrame

# Correct wrong data in a DataFrame
person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 40, 12345]    # the age 12345 is wrong
}
df = pd.DataFrame(person)
print(df)
print()
df.loc[2,"age"] = 30 # overwrite the wrong value
print(df)

     name    age
0  Google     50
1  Runoob     40
2  Taobao  12345

     name  age
0  Google   50
1  Runoob   40
2  Taobao   30
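
When the row label of the bad value is not known in advance, a boolean mask can locate it. A minimal sketch; the cutoff of 120 is an arbitrary assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 40, 12345],
})

# Boolean mask: True for rows whose age is implausible.
# The cutoff of 120 is an arbitrary choice for this sketch.
bad = df["age"] > 120
df.loc[bad, "age"] = 30
print(df)
```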

4.2.16. Iterating over a DataFrame

# Iterate over a DataFrame by its index labels
person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 200, 12345]
}
df = pd.DataFrame(person)
print(df)
print(df.index)
for x in df.index:
    print(x)
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 1
        # df.drop(x, inplace = True) would delete the row instead
print(df)

     name    age
0  Google     50
1  Runoob    200
2  Taobao  12345
RangeIndex(start=0, stop=3, step=1)
0
1
2
     name  age
0  Google   50
1  Runoob    1
2  Taobao    1
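
For read-only iteration, itertuples() is a common alternative to looping over the index, and boolean indexing can filter rows without an explicit loop. A sketch on the same data, again assuming 120 as the cutoff:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 200, 12345],
})

# itertuples() yields one lightweight named tuple per row.
for row in df.itertuples():
    print(row.Index, row.name, row.age)

# Boolean indexing keeps only rows with a plausible age (threshold assumed),
# which has the same effect as dropping the offending rows one by one.
valid = df[df["age"] <= 120]
print(valid)
```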
