Python Data Analysis Study Series, Part 1: Getting Started with Pandas

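  • Pandas is a powerful Python toolkit for data analysis, built on top of NumPy
  • Main features of Pandas
    • Data structures with built-in alignment: DataFrame and Series
    • Integrated time series functionality
    • A rich set of mathematical operations
    • Flexible handling of missing data
  • Installation: pip install pandas
  • Import: import pandas as pd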

1 Series - One-Dimensional Data Object

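  • A Series is an object similar to a one-dimensional array, made up of one-dimensional data and an associated set of data labels (the index).
  • Creation: pd.Series([1,2,3])
  • Getting the value array and the index array: the values and index attributes
  • A Series behaves like a combination of a list (array) and a dictionary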

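1.1 Series - Usage Features

  • A Series supports array features (positional access):
    • Creating a Series from an array: Series(array)
    • Arithmetic with a scalar: sr*2
    • Arithmetic between two Series of the same length: sr1+sr2
    • Indexing: sr[0], sr[[1,2,4]]
    • Slicing: sr[0:2]
    • Universal functions: np.abs(sr)
    • Boolean indexing: sr[sr>5]
  • A Series supports dictionary features (labels):
    • Creating a Series from a dictionary: Series(dict)
    • The in operator: 'a' in sr
    • Key indexing: sr['a'], sr[['a','b','d']]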
import pandas as pd
pd.Series([2,3,4,5])
0    2
1    3
2    4
3    5
dtype: int64
pd.Series([2,3,4,5], index=['a','b','c','d'])
a    2
b    3
c    4
d    5
dtype: int64
import numpy as np
pd.Series(np.arange(5))
0    0
1    1
2    2
3    3
4    4
dtype: int32
pd.Series({'a':1, 'b':2, 'c':3, 'd':4})
a    1
b    2
c    3
d    4
dtype: int64
sr = pd.Series([2,3,4,5], index=['a','b','c','d'])
sr
a    2
b    3
c    4
d    5
dtype: int64
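sr[0] # index by position (recent pandas versions deprecate this fallback on a labeled index; sr.iloc[0] is the explicit form)
2
sr['a'] # index by label (the index)
2
sr+2 # like an ndarray, supports +, -, *, / with a scalar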
a    4
b    5
c    6
d    7
dtype: int64
sr+sr # like ndarrays, two Series of the same length can be combined element-wise
a     4
b     6
c     8
d    10
dtype: int64
sr[0:2] # slicing works too
a    2
b    3
dtype: int64
np.sum(sr) # NumPy functions accept a Series
14
sr[sr>3] # Boolean indexing is supported
c    4
d    5
dtype: int64
sr = pd.Series({'a':1, 'b':2})
sr
a    1
b    2
dtype: int64
'a' in sr # like a dictionary, "in" checks the index (keys)
True
sr.index # get the index
Index(['a', 'b'], dtype='object')
sr.values # get the values
array([1, 2], dtype=int64)
sr = pd.Series([2,3,4,5], index=['a','b','c','d'])
sr
a    2
b    3
c    4
d    5
dtype: int64
sr[['a','c']]
a    2
c    4
dtype: int64
sr['a':'c'] # slicing by label includes both endpoints
a    2
b    3
c    4
dtype: int64
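
The difference between label slices and positional slices is easy to miss, so here is a minimal sketch that puts the two side by side. It reuses the sr defined above; the names by_label and by_position are only illustrative.

```python
import pandas as pd

sr = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints are included.
by_label = sr.loc['a':'c']

# Position-based slice: the stop position is excluded, as in a Python list.
by_position = sr.iloc[0:2]

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
```

Using loc and iloc spells out which of the two rules applies, which plain square brackets leave implicit.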

1.2 Series - Integer Index

sr = pd.Series(np.arange(20))
sr
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32
sr2 = sr[10:].copy()
sr2
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32
sr2[10] # an integer key could mean a label or a position; with an integer index, pandas always treats it as a label
10
sr2.loc[10] # explicitly by label
10
sr2.iloc[9] # explicitly by position
19
sr2.iloc[0:3]
10    10
11    11
12    12
dtype: int32
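
To make the label-versus-position rule concrete, the following sketch rebuilds sr2 and shows what happens when a position is passed to plain square brackets. The variable names are illustrative, and the try/except is only there so the snippet runs to the end.

```python
import numpy as np
import pandas as pd

sr2 = pd.Series(np.arange(20))[10:].copy()  # labels 10..19, positions 0..9

first_by_position = sr2.iloc[0]   # position 0 holds the value 10
first_by_label = sr2.loc[10]      # label 10 is that same element

# Plain [] uses labels here, so asking for label 0 fails: no such label exists.
try:
    sr2[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, first_by_label, lookup_failed)
```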

1.3 Series - Data Alignment

sr1 = pd.Series([12,23,34],index=['c','a','d']) # values are added label by label: a with a, c with c, ...
sr2 = pd.Series([11,20,10],index=['d','c','a'])
sr1+sr2
a    33
c    32
d    45
dtype: int64

When Pandas performs an operation on two Series objects, it first aligns them by index (label) and then computes.

sr1 = pd.Series([12,23,34],index=['c','a','d'])
sr2 = pd.Series([11,20,10,21],index=['d','c','a','b'])
sr1+sr2
a    33.0
b     NaN
c    32.0
d    45.0
dtype: float64
sr1 = pd.Series([12,23,34],index=['c','a','d'])
sr2 = pd.Series([11,20,10],index=['b','c','a'])
sr1+sr2
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
sr1 = pd.Series([12,23,34],index=['c','a','d'])
sr2 = pd.Series([11,20,10],index=['b','c','a'])
sr1.add(sr2, fill_value=0)
# To avoid NaN for labels present in only one Series, replace "+" with add() and pass fill_value; sub (-), div (/) and mul (*) work the same way.
a    33.0
b    11.0
c    32.0
d    34.0
dtype: float64
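
The other arithmetic methods mentioned in the comment can be sketched the same way. The fill values below (0 for subtraction, 1 for multiplication and division) are a choice made for this illustration, not something the original prescribes.

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['b', 'c', 'a'])

# A label present in only one Series is paired with fill_value instead of NaN.
diff = sr1.sub(sr2, fill_value=0)
prod = sr1.mul(sr2, fill_value=1)
quot = sr1.div(sr2, fill_value=1)

print(diff)
print(prod)
print(quot)
```

Picking the neutral element of each operation keeps the unmatched value unchanged where that makes sense, for example in prod, where b stays 11 and d stays 34.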

1.4 Series - Missing Data

# Approach 1: drop missing data with isnull() / dropna()
sr = sr1+sr2
sr.isnull() # notnull() is the inverse
a    False
b     True
c    False
d     True
dtype: bool
sr[sr.notnull()]
a    33.0
c    32.0
dtype: float64
sr[~sr.isnull()]
a    33.0
c    32.0
dtype: float64
sr.dropna()
a    33.0
c    32.0
dtype: float64
# Approach 2: fill missing values with fillna()
sr.fillna(0)
a    33.0
b     0.0
c    32.0
d     0.0
dtype: float64
sr # fillna returns a new object and leaves sr unchanged, so assign the result to keep it: sr = sr.fillna(0)
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
sr.fillna(sr.mean())
a    33.0
b    32.5
c    32.0
d    32.5
dtype: float64
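
As a compact recap of this section, the sketch below counts the missing values and shows that fillna leaves the original untouched until the result is reassigned. The Series is rebuilt by hand with the same values as sr1+sr2; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series([33.0, np.nan, 32.0, np.nan], index=['a', 'b', 'c', 'd'])

n_missing = int(sr.isnull().sum())   # True counts as 1, so this counts the NaNs

filled = sr.fillna(sr.mean())        # returns a new Series; sr is untouched
still_missing = int(sr.isnull().sum())

sr = sr.fillna(0)                    # reassign to keep the result
print(n_missing, still_missing, filled.tolist(), sr.tolist())
```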

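2 DataFrame - Two-Dimensional Data Object

  • A DataFrame is a tabular data structure containing an ordered set of columns. It can be viewed as a dictionary of Series that share one row index.
  • Creation:
    • pd.DataFrame({'one':[1,3,5,7],'two':[2,4,6,8]})
  • Reading and writing csv files:
    • pd.read_csv('filename.csv')
    • df.to_csv('filename.csv')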
d1 = pd.DataFrame({'one':[1,3,5,7],'two':[2,4,6,8]}) # when building from a dict of lists, every column must have the same number of rows
d1
   one  two
0    1    2
1    3    4
2    5    6
3    7    8
d1 = pd.DataFrame({'one':[1,3,5,7],'two':[2,4,6,8]},index=['a','b','c','d'])
d1
   one  two
a    1    2
b    3    4
c    5    6
d    7    8
d1 = pd.DataFrame({'one':pd.Series([1,3,5],index=['a','b','c']),
                   'two':pd.Series([2,4,6,8],index=['a','b','c','d'])})
d1
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  NaN    8
d1.dtypes
one    float64
two      int64
dtype: object

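Notes:

  • 1. When a DataFrame is built from Series, the Series are aligned by label;
  • 2. Because NaN is a floating-point value, the whole "one" column is automatically converted to float.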
d2 = pd.read_csv('test.csv')
d2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
d1.to_csv('test2.csv')
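
The write-then-read cycle can be tried without touching the disk. The sketch below is an illustration that uses an in-memory io.StringIO buffer in place of test2.csv; the buffer and the name restored are assumptions of this example, not part of the original session.

```python
import io
import pandas as pd

d1 = pd.DataFrame({'one': [1, 3, 5, 7], 'two': [2, 4, 6, 8]},
                  index=['a', 'b', 'c', 'd'])

buffer = io.StringIO()          # stands in for a file on disk
d1.to_csv(buffer)               # the index is written as the first column
csv_text = buffer.getvalue()

# index_col=0 turns that first column back into the row index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(csv_text)
print(restored)
```

Without index_col=0, the saved index would come back as an ordinary unnamed column next to a fresh 0..3 index.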

2.1 DataFrame - Common Attributes

  • index: the row index
  • T: transpose
  • columns: the column index
  • values: the array of values
  • describe(): quick summary statistics (this one is a method)
d1 = pd.DataFrame({'one':pd.Series([1,3,5],index=['a','b','c']),
                   'two':pd.Series([2,4,6,8],index=['a','b','c','d'])})
d1
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  NaN    8
d1.values
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.],
       [nan,  8.]])
d1.T
       a    b    c    d
one  1.0  3.0  5.0  NaN
two  2.0  4.0  6.0  8.0
d1.dtypes
one    float64
two      int64
dtype: object
d1.describe()
       one       two
count  3.0  4.000000
mean   3.0  5.000000
std    2.0  2.581989
min    1.0  2.000000
25%    2.0  3.500000
50%    3.0  5.000000
75%    4.0  6.500000
max    5.0  8.000000
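
The index and columns attributes from the list above are not exercised in the session, so here is a small sketch of them on a DataFrame without missing values. The shape attribute is an addition of this example and is not in the list above.

```python
import pandas as pd

d1 = pd.DataFrame({'one': [1, 3, 5, 7], 'two': [2, 4, 6, 8]},
                  index=['a', 'b', 'c', 'd'])

row_labels = list(d1.index)        # ['a', 'b', 'c', 'd']
column_labels = list(d1.columns)   # ['one', 'two']

# Transposing swaps the two: row labels become column labels.
transposed = d1.T
print(row_labels, column_labels)
print(d1.shape, transposed.shape)          # (4, 2) (2, 4)
print(d1.describe().loc['mean'].tolist())  # [4.0, 5.0]
```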

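2.2 DataFrame - Indexing and Slicing

  • A DataFrame is a two-dimensional data type, so it has both a row index and a column index.
  • A DataFrame can likewise be indexed and sliced either by label or by position
  • The loc and iloc attributes
    • Usage: separate with a comma; the row indexer comes first, the column indexer second
    • The row and column parts can freely mix plain indexing, slices, Boolean indexing, and fancy indexing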
d1 = pd.DataFrame({'one':pd.Series([1,3,5],index=['a','b','c']),
                   'two':pd.Series([2,4,6,8],index=['a','b','c','d'])})
d1
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  NaN    8
d1['one']['a']
1.0
d1.loc['a','one']
1.0
d1.loc['a',:]
one    1.0
two    2.0
Name: a, dtype: float64
d1.loc['d','one']
nan
d1.loc[d1.one.isnull(),'one']
d   NaN
Name: one, dtype: float64
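
The session above only uses loc. As a rough sketch of the other combinations described in the bullet list, the following mixes iloc, a label slice, a Boolean mask, and fancy indexing on the same d1; the result names are illustrative.

```python
import pandas as pd

d1 = pd.DataFrame({
    'one': pd.Series([1, 3, 5], index=['a', 'b', 'c']),
    'two': pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd']),
})

cell = d1.iloc[0, 1]                      # row position 0, column position 1
block = d1.loc['a':'c', ['two']]          # label slice (inclusive) + column list
large = d1.loc[d1['two'] > 4, 'one']      # Boolean mask on rows, one column
picked = d1.iloc[[0, 2], :]               # fancy indexing by row position

print(cell)
print(block)
print(large)
print(picked)
```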

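2.3 DataFrame - Data Alignment and Missing Values

  • DataFrame objects are also aligned during operations; row index and column index are aligned separately.
  • DataFrame methods for handling missing data:
    • dropna(axis=0,how='any')
    • fillna()
    • isnull()
    • notnull()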
d2 = pd.DataFrame({'two':[1,2,3,4],'one':[4,5,6,7]},index=['c','d','b','a'])
d2
   two  one
c    1    4
d    2    5
b    3    6
a    4    7
d1+d2 # both rows and columns are aligned
   one  two
a  8.0    6
b  9.0    7
c  9.0    7
d  NaN   10
d1.fillna(0)
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  0.0    8
d1.dropna() # a row is dropped as soon as it contains a single missing value
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
import numpy as np
d1.loc['d','two'] = np.nan
d1.loc['c','two'] = np.nan
d1
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
d1.dropna(how='all')
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d2 = d1.dropna(how='all')
d2.dropna(axis=1)
   one
a  1.0
b  3.0
c  5.0
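
isnull() and notnull() are listed above but not shown on a DataFrame. The sketch below applies them to a hand-built copy of the final d1 and also fills each column with its own value by passing a dict to fillna; the fill values chosen are assumptions of this example.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'one': [1.0, 3.0, 5.0, np.nan],
                   'two': [2.0, 4.0, np.nan, np.nan]},
                  index=['a', 'b', 'c', 'd'])

missing_per_column = d1.isnull().sum()          # count of NaN in each column
complete_rows = d1[d1.notnull().all(axis=1)]    # rows with no NaN at all
filled = d1.fillna({'one': 0, 'two': d1['two'].mean()})  # per-column fill values

print(missing_per_column)
print(complete_rows)
print(filled)
```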

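2.4 pandas - Other Common Methods

  • mean(axis=0,skipna=True): mean of each column (axis=0) or each row (axis=1); skipna=True, the default, ignores missing values
  • sum(axis=1): sum of each row (axis=0 sums each column)
  • sort_index(axis,…,ascending=True): sort by row (or column) index
  • sort_values(by,axis,ascending): sort by the values of a column (or row)
  • NumPy universal functions also work on pandas objects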
d1
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
d1.mean() # returns a Series with the mean of each column (or of each row with axis=1)
one    3.0
two    3.0
dtype: float64
d1.mean(axis=1)
a    1.5
b    3.5
c    5.0
d    NaN
dtype: float64
d1.mean(axis='columns')
a    1.5
b    3.5
c    5.0
d    NaN
dtype: float64
d1.sort_values(by='two')
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
d1.sort_values(by='two', ascending=False) # rows with missing values are not ranked and always go last
   one  two
b  3.0  4.0
a  1.0  2.0
c  5.0  NaN
d  NaN  NaN
d1.sort_values(by='a', axis=1,ascending=False)
   two  one
a  2.0  1.0
b  4.0  3.0
c  NaN  5.0
d  NaN  NaN
d1.sort_index()
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
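
A few items from the list at the top of this section do not appear in the session: sum, the skipna switch, descending sort_index, and a NumPy universal function. The following sketch covers them on a hand-built copy of d1; the result names are illustrative.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'one': [1.0, 3.0, 5.0, np.nan],
                   'two': [2.0, 4.0, np.nan, np.nan]},
                  index=['a', 'b', 'c', 'd'])

row_sums = d1.sum(axis=1)                 # NaN is skipped, so row d sums to 0.0
strict_means = d1.mean(skipna=False)      # any NaN in a column makes its mean NaN
reversed_rows = d1.sort_index(ascending=False)   # row labels d, c, b, a
roots = np.sqrt(d1)                       # a NumPy ufunc applied element-wise

print(row_sums)
print(strict_means)
print(reversed_rows)
print(roots)
```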

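3 pandas - Time Objects

3.1 pandas - Handling Time Objects

  • Time series types
    • Timestamp: a specific moment
    • Fixed period: e.g. December 2020
    • Time interval: start time to end time
  • Python standard library for time objects: datetime
  • Flexible parsing of time objects: dateutil
    • dateutil.parser.parse()
  • Batch processing of time objects: pandas
    • pd.to_datetime()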
import datetime
datetime.datetime.strptime('2020-01-01','%Y-%m-%d') # the p in strptime stands for parse, the f in strftime for format
datetime.datetime(2020, 1, 1, 0, 0)
import dateutil.parser # import the submodule explicitly so that dateutil.parser is available
dateutil.parser.parse('2020-01-01')
datetime.datetime(2020, 1, 1, 0, 0)
dateutil.parser.parse('02/03/2020')
datetime.datetime(2020, 2, 3, 0, 0)
dateutil.parser.parse('20200203')
datetime.datetime(2020, 2, 3, 0, 0)
pd.to_datetime(['2001-01-01','2002/01/01']) # mixed formats in one call; pandas 2.0+ may need format='mixed' here
DatetimeIndex(['2001-01-01', '2002-01-01'], dtype='datetime64[ns]', freq=None)
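
When every string follows the same layout, passing that layout explicitly removes any guessing. The sketch below assumes ISO-style input and shows strptime and strftime as inverses; the variable names and sample dates are illustrative.

```python
import datetime
import pandas as pd

# One explicit format for the whole batch avoids any guessing.
stamps = pd.to_datetime(['2020-01-01', '2020-02-03'], format='%Y-%m-%d')

# strptime parses text into a datetime; strftime formats it back to text.
parsed = datetime.datetime.strptime('2020-02-03', '%Y-%m-%d')
text = parsed.strftime('%d/%m/%Y')

print(stamps)
print(stamps[1].to_pydatetime() == parsed)  # True
print(text)                                 # 03/02/2020
```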

3.2 pandas - Generating Time Objects

pd.date_range?
pd.date_range('2010-01-01','2010-05-01')
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10',
               ...
               '2010-04-22', '2010-04-23', '2010-04-24', '2010-04-25',
               '2010-04-26', '2010-04-27', '2010-04-28', '2010-04-29',
               '2010-04-30', '2010-05-01'],
              dtype='datetime64[ns]', length=121, freq='D')
pd.date_range('2010-01-01',periods=10)
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10'],
              dtype='datetime64[ns]', freq='D')
pd.date_range('2010-01-01',periods=10,freq='w') # every Sunday
DatetimeIndex(['2010-01-03', '2010-01-10', '2010-01-17', '2010-01-24',
               '2010-01-31', '2010-02-07', '2010-02-14', '2010-02-21',
               '2010-02-28', '2010-03-07'],
              dtype='datetime64[ns]', freq='W-SUN')
pd.date_range('2010-01-01',periods=10,freq='w-MON') # every Monday
DatetimeIndex(['2010-01-04', '2010-01-11', '2010-01-18', '2010-01-25',
               '2010-02-01', '2010-02-08', '2010-02-15', '2010-02-22',
               '2010-03-01', '2010-03-08'],
              dtype='datetime64[ns]', freq='W-MON')
pd.date_range('2010-01-01',periods=10,freq='B') # every business day
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14'],
              dtype='datetime64[ns]', freq='B')
dt = pd.date_range('2010-01-01',periods=10,freq='B')
dt[0]
Timestamp('2010-01-01 00:00:00', freq='B')
dt[0].to_pydatetime()
datetime.datetime(2010, 1, 1, 0, 0)
pd.date_range('2010-01-01',periods=10,freq='1h20min') # a step of 1 hour 20 minutes also works
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:20:00',
               '2010-01-01 02:40:00', '2010-01-01 04:00:00',
               '2010-01-01 05:20:00', '2010-01-01 06:40:00',
               '2010-01-01 08:00:00', '2010-01-01 09:20:00',
               '2010-01-01 10:40:00', '2010-01-01 12:00:00'],
              dtype='datetime64[ns]', freq='80T')
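
date_range also accepts a start, an end, and a frequency together. The following sketch uses two frequency aliases that do not appear above, 'MS' (month start) and '2D' (every two days), purely as an illustration.

```python
import pandas as pd

# start + end + freq: every month start in the first half of 2010
month_starts = pd.date_range('2010-01-01', '2010-06-30', freq='MS')

# start + periods + freq: five timestamps, two days apart
every_other_day = pd.date_range('2010-01-01', periods=5, freq='2D')

print(len(month_starts), month_starts[-1])
print(every_other_day[-1])
```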

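4 pandas - Time Series

  • A time series is a Series or DataFrame indexed by time objects.
  • When datetime objects are used as an index, they are stored in a DatetimeIndex object.
  • Special features of time series:
    • Selecting by passing a "year" or "month"
    • Slicing by passing a date range
    • Rich function support: resample(), truncate(), …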
pd.date_range('2010-01-01','2010-05-01')
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10',
               ...
               '2010-04-22', '2010-04-23', '2010-04-24', '2010-04-25',
               '2010-04-26', '2010-04-27', '2010-04-28', '2010-04-29',
               '2010-04-30', '2010-05-01'],
              dtype='datetime64[ns]', length=121, freq='D')
sr = pd.Series(np.arange(1000), index=pd.date_range('2020-01-01', periods=1000))
sr
2020-01-01      0
2020-01-02      1
2020-01-03      2
2020-01-04      3
2020-01-05      4
             ... 
2022-09-22    995
2022-09-23    996
2022-09-24    997
2022-09-25    998
2022-09-26    999
Freq: D, Length: 1000, dtype: int32
sr.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2022-09-17', '2022-09-18', '2022-09-19', '2022-09-20',
               '2022-09-21', '2022-09-22', '2022-09-23', '2022-09-24',
               '2022-09-25', '2022-09-26'],
              dtype='datetime64[ns]', length=1000, freq='D')
sr['2020-03'] # select the data for a given year or month
2020-03-01    60
2020-03-02    61
2020-03-03    62
2020-03-04    63
2020-03-05    64
2020-03-06    65
2020-03-07    66
2020-03-08    67
2020-03-09    68
2020-03-10    69
2020-03-11    70
2020-03-12    71
2020-03-13    72
2020-03-14    73
2020-03-15    74
2020-03-16    75
2020-03-17    76
2020-03-18    77
2020-03-19    78
2020-03-20    79
2020-03-21    80
2020-03-22    81
2020-03-23    82
2020-03-24    83
2020-03-25    84
2020-03-26    85
2020-03-27    86
2020-03-28    87
2020-03-29    88
2020-03-30    89
2020-03-31    90
Freq: D, dtype: int32
sr['2020']
2020-01-01      0
2020-01-02      1
2020-01-03      2
2020-01-04      3
2020-01-05      4
             ... 
2020-12-27    361
2020-12-28    362
2020-12-29    363
2020-12-30    364
2020-12-31    365
Freq: D, Length: 366, dtype: int32
sr['2020-05':'2020-10']
2020-05-01    121
2020-05-02    122
2020-05-03    123
2020-05-04    124
2020-05-05    125
             ... 
2020-10-27    300
2020-10-28    301
2020-10-29    302
2020-10-30    303
2020-10-31    304
Freq: D, Length: 184, dtype: int32
sr['2020-05-01':'2020-10-31']
2020-05-01    121
2020-05-02    122
2020-05-03    123
2020-05-04    124
2020-05-05    125
             ... 
2020-10-27    300
2020-10-28    301
2020-10-29    302
2020-10-30    303
2020-10-31    304
Freq: D, Length: 184, dtype: int32
sr.resample('W').sum() # weekly sums
2020-01-05      10
2020-01-12      56
2020-01-19     105
2020-01-26     154
2020-02-02     203
              ... 
2022-09-04    6818
2022-09-11    6867
2022-09-18    6916
2022-09-25    6965
2022-10-02     999
Freq: W-SUN, Length: 144, dtype: int32
sr.resample('M').mean() # monthly means (pandas 2.2+ spells this alias 'ME')
2020-01-31     15.0
2020-02-29     45.0
2020-03-31     75.0
2020-04-30    105.5
2020-05-31    136.0
2020-06-30    166.5
2020-07-31    197.0
2020-08-31    228.0
2020-09-30    258.5
2020-10-31    289.0
2020-11-30    319.5
2020-12-31    350.0
2021-01-31    381.0
2021-02-28    410.5
2021-03-31    440.0
2021-04-30    470.5
2021-05-31    501.0
2021-06-30    531.5
2021-07-31    562.0
2021-08-31    593.0
2021-09-30    623.5
2021-10-31    654.0
2021-11-30    684.5
2021-12-31    715.0
2022-01-31    746.0
2022-02-28    775.5
2022-03-31    805.0
2022-04-30    835.5
2022-05-31    866.0
2022-06-30    896.5
2022-07-31    927.0
2022-08-31    958.0
2022-09-30    986.5
Freq: M, dtype: float64
sr.truncate(before='2020-05-01',after='2020-10-01')
2020-05-01    121
2020-05-02    122
2020-05-03    123
2020-05-04    124
2020-05-05    125
             ... 
2020-09-27    270
2020-09-28    271
2020-09-29    272
2020-09-30    273
2020-10-01    274
Freq: D, Length: 154, dtype: int32
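
The same operations are easier to verify by hand on a shorter series. This sketch uses 60 days instead of 1000 and goes through loc for the string selections; the result names are illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(60), index=pd.date_range('2020-01-01', periods=60))

january = sr.loc['2020-01']                 # partial-string selection: 31 days
window = sr.loc['2020-01-30':'2020-02-02']  # date-range slice, both ends included
weekly = sr.resample('W').sum()             # weekly bins ending on Sunday
cut = sr.truncate(before='2020-02-20')      # keep 2020-02-20 and later

print(len(january), window.tolist())
print(weekly.head())
print(cut.tolist())
```

The first weekly bin covers only 2020-01-01 (a Wednesday) through 2020-01-05, which is why its sum is 0+1+2+3+4 = 10, matching the first row of the longer example above.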

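5 pandas - File Handling

5.1 pandas - Reading Files

  • Common data file format: csv (data separated by some delimiter)

  • Reading files with pandas: load data from a file name, a URL, or a file object

    • read_csv: the default separator is a comma
    • read_table: the default separator is a tab
  • Main parameters of read_csv and read_table:

    • sep: the separator; a regular expression such as '\s+' is allowed
    • header=None: the file has no column-name row
    • names: the column names
    • index_col: the column to use as the index
    • skiprows: rows to skip
    • na_values: strings to treat as missing values
    • parse_dates: which columns to parse as dates; accepts a Boolean or a list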
pd.read_csv('399300.csv')
日期 股票代码 名称 收盘价 最高价 最低价 开盘价 前收盘 涨跌额 涨跌幅 成交量 成交金额
0 2021/1/29 '399300 沪深300 5351.9646 5430.2015 5288.0955 5413.9684 5377.1427 -25.1781 -0.4682 18217878400 390,287,690,019.00
1 2021/1/28 '399300 沪深300 5377.1427 5462.2352 5360.3766 5450.3695 5528.0034 -150.8607 -2.729 17048558500 376,166,523,178.00
2 2021/1/27 '399300 沪深300 5528.0034 5534.9928 5449.6385 5505.7708 5512.9678 15.0356 0.2727 16019084100 376,892,605,839.00
3 2021/1/26 '399300 沪深300 5512.9678 5600.9017 5505.9962 5600.9017 5625.9232 -112.9554 -2.0078 17190459000 415,008,069,865.00
4 2021/1/25 '399300 沪深300 5625.9232 5655.4795 5543.2663 5564.1237 5569.776 56.1472 1.0081 19704701900 508,166,980,802.00
... ... ... ... ... ... ... ... ... ... ... ... ...
4625 2002/1/10 '399300 沪深300 1281.2600 1281.2600 1281.2600 1281.2600 1272.65 8.61 0.6765 0 -
4626 2002/1/9 '399300 沪深300 1272.6500 1272.6500 1272.6500 1272.6500 1292.71 -20.06 -1.5518 0 -
4627 2002/1/8 '399300 沪深300 1292.7100 1292.7100 1292.7100 1292.7100 1302.08 -9.37 -0.7196 0 -
4628 2002/1/7 '399300 沪深300 1302.0800 1302.0800 1302.0800 1302.0800 1316.46 -14.38 -1.0923 0 -
4629 2002/1/4 '399300 沪深300 1316.4600 1316.4600 1316.4600 1316.4600 None None None 0 -

4630 rows × 12 columns

pd.read_csv('399300.csv', index_col='日期')
股票代码 名称 收盘价 最高价 最低价 开盘价 前收盘 涨跌额 涨跌幅 成交量 成交金额
日期
2021/1/29 '399300 沪深300 5351.9646 5430.2015 5288.0955 5413.9684 5377.1427 -25.1781 -0.4682 18217878400 390,287,690,019.00
2021/1/28 '399300 沪深300 5377.1427 5462.2352 5360.3766 5450.3695 5528.0034 -150.8607 -2.729 17048558500 376,166,523,178.00
2021/1/27 '399300 沪深300 5528.0034 5534.9928 5449.6385 5505.7708 5512.9678 15.0356 0.2727 16019084100 376,892,605,839.00
2021/1/26 '399300 沪深300 5512.9678 5600.9017 5505.9962 5600.9017 5625.9232 -112.9554 -2.0078 17190459000 415,008,069,865.00
2021/1/25 '399300 沪深300 5625.9232 5655.4795 5543.2663 5564.1237 5569.776 56.1472 1.0081 19704701900 508,166,980,802.00
... ... ... ... ... ... ... ... ... ... ... ...
2002/1/10 '399300 沪深300 1281.2600 1281.2600 1281.2600 1281.2600 1272.65 8.61 0.6765 0 -
2002/1/9 '399300 沪深300 1272.6500 1272.6500 1272.6500 1272.6500 1292.71 -20.06 -1.5518 0 -
2002/1/8 '399300 沪深300 1292.7100 1292.7100 1292.7100 1292.7100 1302.08 -9.37 -0.7196 0 -
2002/1/7 '399300 沪深300 1302.0800 1302.0800 1302.0800 1302.0800 1316.46 -14.38 -1.0923 0 -
2002/1/4 '399300 沪深300 1316.4600 1316.4600 1316.4600 1316.4600 None None None 0 -

4630 rows × 11 columns

df = _
df.index # the dates were read as strings
Index(['2021/1/29', '2021/1/28', '2021/1/27', '2021/1/26', '2021/1/25',
       '2021/1/22', '2021/1/21', '2021/1/20', '2021/1/19', '2021/1/18',
       ...
       '2002/1/17', '2002/1/16', '2002/1/15', '2002/1/14', '2002/1/11',
       '2002/1/10', '2002/1/9', '2002/1/8', '2002/1/7', '2002/1/4'],
      dtype='object', name='日期', length=4630)
# df = pd.read_csv('399300.csv',index_col='日期',parse_dates=True) # with parse_dates=True, pandas tries to parse the index as dates
df = pd.read_csv('399300.csv',index_col='日期',parse_dates=['日期']) # parse the listed columns as time objects
df
股票代码 名称 收盘价 最高价 最低价 开盘价 前收盘 涨跌额 涨跌幅 成交量 成交金额
日期
2021-01-29 '399300 沪深300 5351.9646 5430.2015 5288.0955 5413.9684 5377.1427 -25.1781 -0.4682 18217878400 390,287,690,019.00
2021-01-28 '399300 沪深300 5377.1427 5462.2352 5360.3766 5450.3695 5528.0034 -150.8607 -2.729 17048558500 376,166,523,178.00
2021-01-27 '399300 沪深300 5528.0034 5534.9928 5449.6385 5505.7708 5512.9678 15.0356 0.2727 16019084100 376,892,605,839.00
2021-01-26 '399300 沪深300 5512.9678 5600.9017 5505.9962 5600.9017 5625.9232 -112.9554 -2.0078 17190459000 415,008,069,865.00
2021-01-25 '399300 沪深300 5625.9232 5655.4795 5543.2663 5564.1237 5569.776 56.1472 1.0081 19704701900 508,166,980,802.00
... ... ... ... ... ... ... ... ... ... ... ...
2002-01-10 '399300 沪深300 1281.2600 1281.2600 1281.2600 1281.2600 1272.65 8.61 0.6765 0 -
2002-01-09 '399300 沪深300 1272.6500 1272.6500 1272.6500 1272.6500 1292.71 -20.06 -1.5518 0 -
2002-01-08 '399300 沪深300 1292.7100 1292.7100 1292.7100 1292.7100 1302.08 -9.37 -0.7196 0 -
2002-01-07 '399300 沪深300 1302.0800 1302.0800 1302.0800 1302.0800 1316.46 -14.38 -1.0923 0 -
2002-01-04 '399300 沪深300 1316.4600 1316.4600 1316.4600 1316.4600 None None None 0 -

4630 rows × 11 columns

df.index
DatetimeIndex(['2021-01-29', '2021-01-28', '2021-01-27', '2021-01-26',
               '2021-01-25', '2021-01-22', '2021-01-21', '2021-01-20',
               '2021-01-19', '2021-01-18',
               ...
               '2002-01-17', '2002-01-16', '2002-01-15', '2002-01-14',
               '2002-01-11', '2002-01-10', '2002-01-09', '2002-01-08',
               '2002-01-07', '2002-01-04'],
              dtype='datetime64[ns]', name='日期', length=4630, freq=None)
df = pd.read_csv('399300-2.csv',header=None) # if the file has no column names, header=None generates them automatically
df
0 1 2 3 4 5 6 7 8 9 10 11
0 2021/1/29 '399300 沪深300 5351.9646 5430.2015 5288.0955 5413.9684 5377.1427 -25.1781 -0.4682 18217878400 390,287,690,019.00
1 2021/1/28 '399300 沪深300 5377.1427 5462.2352 5360.3766 5450.3695 5528.0034 -150.8607 -2.729 17048558500 376,166,523,178.00
2 2021/1/27 '399300 沪深300 5528.0034 5534.9928 5449.6385 5505.7708 5512.9678 15.0356 0.2727 16019084100 376,892,605,839.00
3 2021/1/26 '399300 沪深300 5512.9678 5600.9017 5505.9962 5600.9017 5625.9232 -112.9554 -2.0078 17190459000 415,008,069,865.00
4 2021/1/25 '399300 沪深300 5625.9232 5655.4795 5543.2663 5564.1237 5569.776 56.1472 1.0081 19704701900 508,166,980,802.00
... ... ... ... ... ... ... ... ... ... ... ... ...
4625 2002/1/10 '399300 沪深300 1281.2600 1281.2600 1281.2600 1281.2600 1272.65 8.61 0.6765 0 -
4626 2002/1/9 '399300 沪深300 1272.6500 1272.6500 1272.6500 1272.6500 1292.71 -20.06 -1.5518 0 -
4627 2002/1/8 '399300 沪深300 1292.7100 1292.7100 1292.7100 1292.7100 1302.08 -9.37 -0.7196 0 -
4628 2002/1/7 '399300 沪深300 1302.0800 1302.0800 1302.0800 1302.0800 1316.46 -14.38 -1.0923 0 -
4629 2002/1/4 '399300 沪深300 1316.4600 1316.4600 1316.4600 1316.4600 None None None 0 -

4630 rows × 12 columns

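# If the file has no column names, pass header=None and supply your own names.
# Only 11 names are given for 12 data columns here, so the leftover first column (the date) becomes the index.
df = pd.read_csv('399300-2.csv',header=None, names=['股票代码', '名称', '收盘价', '最高价', '最低价',
                                                    '开盘价', '前收盘', '涨跌额', '涨跌幅', '成交量','成交金额'])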
df
股票代码 名称 收盘价 最高价 最低价 开盘价 前收盘 涨跌额 涨跌幅 成交量 成交金额
2021/1/29 '399300 沪深300 5351.9646 5430.2015 5288.0955 5413.9684 5377.1427 -25.1781 -0.4682 18217878400 390,287,690,019.00
2021/1/28 '399300 沪深300 5377.1427 5462.2352 5360.3766 5450.3695 5528.0034 -150.8607 -2.729 17048558500 376,166,523,178.00
2021/1/27 '399300 沪深300 5528.0034 5534.9928 5449.6385 5505.7708 5512.9678 15.0356 0.2727 16019084100 376,892,605,839.00
2021/1/26 '399300 沪深300 5512.9678 5600.9017 5505.9962 5600.9017 5625.9232 -112.9554 -2.0078 17190459000 415,008,069,865.00
2021/1/25 '399300 沪深300 5625.9232 5655.4795 5543.2663 5564.1237 5569.776 56.1472 1.0081 19704701900 508,166,980,802.00
... ... ... ... ... ... ... ... ... ... ... ...
2002/1/10 '399300 沪深300 1281.2600 1281.2600 1281.2600 1281.2600 1272.65 8.61 0.6765 0 -
2002/1/9 '399300 沪深300 1272.6500 1272.6500 1272.6500 1272.6500 1292.71 -20.06 -1.5518 0 -
2002/1/8 '399300 沪深300 1292.7100 1292.7100 1292.7100 1292.7100 1302.08 -9.37 -0.7196 0 -
2002/1/7 '399300 沪深300 1302.0800 1302.0800 1302.0800 1302.0800 1316.46 -14.38 -1.0923 0 -
2002/1/4 '399300 沪深300 1316.4600 1316.4600 1316.4600 1316.4600 None None None 0 -

4630 rows × 11 columns

df = pd.read_csv('399300.csv',index_col='日期',parse_dates=['日期'], skiprows=[1,2,3])
df
股票代码 名称 收盘价 最高价 最低价 开盘价 前收盘 涨跌额 涨跌幅 成交量 成交金额
日期
2021-01-26 '399300 沪深300 5512.9678 5600.9017 5505.9962 5600.9017 5625.9232 -112.9554 -2.0078 17190459000 415,008,069,865.00
2021-01-25 '399300 沪深300 5625.9232 5655.4795 5543.2663 5564.1237 5569.776 56.1472 1.0081 19704701900 508,166,980,802.00
2021-01-22 '399300 沪深300 5569.7760 5573.6594 5513.8769 5562.3790 5564.9693 4.8067 0.0864 19930002000 456,622,193,436.00
2021-01-21 '399300 沪深300 5564.9693 5593.1058 5490.5626 5492.9587 5476.4336 88.5357 1.6167 20995019700 453,183,684,479.00
2021-01-20 '399300 沪深300 5476.4336 5496.0493 5426.5357 5439.9111 5437.5234 38.9102 0.7156 17091326000 373,770,384,496.00
... ... ... ... ... ... ... ... ... ... ... ...
2002-01-10 '399300 沪深300 1281.2600 1281.2600 1281.2600 1281.2600 1272.65 8.61 0.6765 0 -
2002-01-09 '399300 沪深300 1272.6500 1272.6500 1272.6500 1272.6500 1292.71 -20.06 -1.5518 0 -
2002-01-08 '399300 沪深300 1292.7100 1292.7100 1292.7100 1292.7100 1302.08 -9.37 -0.7196 0 -
2002-01-07 '399300 沪深300 1302.0800 1302.0800 1302.0800 1302.0800 1316.46 -14.38 -1.0923 0 -
2002-01-04 '399300 沪深300 1316.4600 1316.4600 1316.4600 1316.4600 None None None 0 -

4627 rows × 11 columns

df = pd.read_csv('399300.csv',index_col='日期',parse_dates=['日期'], na_values=['None','NA','nan']) # strings to treat as missing values
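
The sep parameter with a regular expression is described above but not demonstrated. The sketch below combines it with the other parameters on a few made-up, whitespace-separated rows held in an io.StringIO buffer; the data and the column names date, close, and volume are hypothetical.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated data with no header row.
raw = """2021-01-29  5351.96  None
2021-01-28  5377.14  17048
2021-01-27  5528.00  16019
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s+',                 # regular-expression separator
    header=None,                # the data has no column-name row
    names=['date', 'close', 'volume'],
    index_col='date',
    parse_dates=['date'],       # parse this column into timestamps
    na_values=['None'],         # treat the string 'None' as missing
)

print(df)
print(df.index)
```

Because one value in volume is missing, that column comes back as floating point, the same effect noted earlier for the "one" column.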

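5.2 pandas - Writing Files

  • Writing to a csv file: the to_csv function
  • Main parameters of the writing function
    • sep: the field separator
    • na_rep: the string written for missing values; defaults to an empty string
    • header=False: do not write the column-name row
    • index=False: do not write the index column
    • columns: the columns to write, passed as a list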
df.iloc[1,1] = np.nan
df.to_csv('test.csv', header=False, index=False, na_rep='None',encoding='ANSI')
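
To see exactly what these parameters produce, the output can be written to an in-memory buffer and inspected. The small DataFrame, the ';' separator, and the column names in this sketch are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'close': [5351.96, np.nan], 'volume': [18217, 17048]},
                  index=['2021-01-29', '2021-01-28'])

buffer = io.StringIO()
df.to_csv(
    buffer,
    sep=';',              # field separator
    na_rep='None',        # text written in place of missing values
    header=False,         # no column-name row
    index=False,          # no index column
    columns=['volume', 'close'],   # which columns to write, in this order
)
lines = buffer.getvalue().splitlines()
print(lines)  # ['18217;5351.96', '17048;None']
```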
