05 Reading Excel, CSV, and TXT files with pandas

pandas | Reading and saving data

  • Read an Excel file: pandas.read_excel()
  • Save an Excel file: DataFrame.to_excel()

pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None,
squeeze=False, dtype=None, engine=None, converters=None, true_values=None,
false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True,
na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True)

# View the help documentation
import pandas as pd
help(pd.read_excel)
Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True)
    Read an Excel file into a pandas DataFrame.
    
    Supports `xls`, `xlsx`, `xlsm`, `xlsb`, `odf`, `ods` and `odt` file extensions
    read from a local filesystem or URL. Supports an option to read
    a single sheet or a list of sheets.
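  • Common parameters:
    • io: path to the Excel file. Right-click the file, find its location under "Properties", and append the file name to get the full path. Mind the direction of the slashes ★★★★★
    • sheet_name: the worksheet to read. Defaults to the first worksheet when omitted
  • Less common parameters:
    • index_col: column to use as the index, e.g. index_col=1
    • names: column names, passed as a list
    • header: row to use for the column names; the default 0 is the first row. header=[1,2] builds multi-level column names
    • usecols: columns to read, e.g. usecols="A,B" (Excel column letters) or a list of column names
    • skiprows: number of rows to skip at the top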

Common pd.read_excel() parameters

io

Location of the file to read

  • A string
  • Note the difference between \ and /
# Read the file
# Import the pandas package
import pandas as pd
# Method 1 ★★★
data1 = pd.read_excel('C:/Users/yyz/Desktop/python数据分析基础/data/泰坦尼克数据.xlsx')
data1.head()
乘客ID 是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
data2 = pd.read_excel('C:\\Users\\yyz\\Desktop\\python数据分析基础\\data\\泰坦尼克数据.xlsx')
data3 = pd.read_excel(r'C:\Users\yyz\Desktop\python数据分析基础\data\泰坦尼克数据.xlsx')
data3.head()
乘客ID 是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Method 2 ★★★
# Import the os package
import os
# Set the working directory; files in this folder can then be read by file name alone
os.chdir('C:/Users/yyz/Desktop/python数据分析基础/data/')
data4 = pd.read_excel('泰坦尼克数据.xlsx')
data4.head()
乘客ID 是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
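
The three path spellings used above can be compared without touching the filesystem. The following is a minimal sketch: the paths are hypothetical stand-ins and no file is opened.

```python
from pathlib import PureWindowsPath

# Hypothetical path used only for illustration; no file is opened.
forward = 'C:/Users/yyz/Desktop/data/example.xlsx'        # forward slashes
escaped = 'C:\\Users\\yyz\\Desktop\\data\\example.xlsx'   # doubled backslashes
raw = r'C:\Users\yyz\Desktop\data\example.xlsx'           # raw string

# The doubled-backslash and raw-string spellings are the same Python string.
same_string = escaped == raw

# On Windows, "/" and "\" are both path separators, so all three match.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

# In an ordinary string, "\t" is a tab, so this path is silently corrupted.
corrupted = 'C:\table.xlsx'
has_tab = '\t' in corrupted

print(same_string, same_path, has_tab)
```

A path copied straight from Windows Explorer therefore needs either the r prefix or doubled backslashes. In an ordinary string, 'C:\Users' does not even compile, because \U starts a Unicode escape.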

sheet_name

The worksheet to read, given either as the worksheet name or as its position; 0 means the first worksheet.

  • Format: integer or string
# Method 1 ★★★
data5 = pd.read_excel('泰坦尼克数据.xlsx',sheet_name='Sheet1')
data5.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
1 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 PC 17569 146.5208 B78 C
2 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
3 98 1 1 Greenfield, Mr. William Bertram male 23.0 0 1 PC 17759 63.3583 D10 D12 C
4 195 1 1 Brown, Mrs. James Joseph (Margaret Tobin) female 44.0 0 0 PC 17610 27.7208 B4 C
# Method 2: by position (0-based, so 1 is the second worksheet)
data6 = pd.read_excel('泰坦尼克数据.xlsx',sheet_name=1)
data6.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
1 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 PC 17569 146.5208 B78 C
2 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
3 98 1 1 Greenfield, Mr. William Bertram male 23.0 0 1 PC 17759 63.3583 D10 D12 C
4 195 1 1 Brown, Mrs. James Joseph (Margaret Tobin) female 44.0 0 0 PC 17610 27.7208 B4 C
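
To try sheet_name without the course workbook, a small workbook can be built in memory and read back. This is a sketch under assumptions: the sheet names and rows are made up, and the openpyxl package is assumed to be installed as the .xlsx engine.

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (sheet names are made up).
first = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
second = pd.DataFrame({'id': [3], 'name': ['c']})
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine='openpyxl') as writer:
    first.to_excel(writer, sheet_name='Sheet1', index=False)
    second.to_excel(writer, sheet_name='Sheet2', index=False)
workbook = buffer.getvalue()

# Select a worksheet by name or by 0-based position.
by_name = pd.read_excel(io.BytesIO(workbook), sheet_name='Sheet2')
by_position = pd.read_excel(io.BytesIO(workbook), sheet_name=1)

# sheet_name=None loads every worksheet into a dict keyed by sheet name.
all_sheets = pd.read_excel(io.BytesIO(workbook), sheet_name=None)
print(list(all_sheets))
```

Selecting by name is less fragile than selecting by position when worksheets may be reordered in the workbook.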

Less common pd.read_excel() parameters

index_col

The column to use as the index; not set by default

  • Format: integer, string, or a list of these
data7 = pd.read_excel('泰坦尼克数据.xlsx',index_col='乘客ID')
data7
是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
乘客ID
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns
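
The effect of index_col can be checked on a few rows. The sketch below uses made-up data and reads it with read_csv from an in-memory string so that it is self-contained; read_csv takes the same index_col argument as read_excel.

```python
import io
import pandas as pd

# Made-up rows standing in for the workbook.
text = "id,name,age\n1,Ann,22\n2,Bob,38\n3,Cid,26\n"

by_label = pd.read_csv(io.StringIO(text), index_col='id')   # column name
by_position = pd.read_csv(io.StringIO(text), index_col=0)   # 0-based position

# The index column is no longer a regular column; rows are looked up by it.
print(by_label.columns.tolist())
print(by_label.loc[2, 'name'])
```

This is why the output above reports 11 columns rather than 12: 乘客ID has moved out of the columns and into the index.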

names

Sets the column names

  • Format: list
data8 = pd.read_excel('泰坦尼克数据.xlsx',
                      names=['变量1','变量2','变量3','变量4','变量5','变量6','变量7','变量8','变量9','变量10','变量11','变量12'])
data8.head()
变量1 变量2 变量3 变量4 变量5 变量6 变量7 变量8 变量9 变量10 变量11 变量12
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Column names are usually changed through the DataFrame's columns attribute ★★★
data8.columns = ['乘客ID', '是否存活', '票类', '姓名', '性别', '年龄', '乘客兄弟姐妹个数',
                 '乘客父母/孩子的个数', '票号','票价', '仓位', '登船港口']
data8.head()
乘客ID 是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
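
How names interacts with header depends on whether the file already has a header row. The sketch below uses made-up in-memory CSV text and read_csv to stay self-contained. One difference to keep in mind: read_excel defaults to header=0, so names replaces the existing header, whereas read_csv needs header=0 spelled out for the same effect.

```python
import io
import pandas as pd

with_header = "id,name\n1,Ann\n2,Bob\n"
without_header = "1,Ann\n2,Bob\n"
new_names = ['col1', 'col2']

# File has a header row: header=0 tells pandas to discard it and use names.
renamed = pd.read_csv(io.StringIO(with_header), header=0, names=new_names)

# File has no header row: header=None keeps the first line as data.
labelled = pd.read_csv(io.StringIO(without_header), header=None, names=new_names)

# Rename selected columns after loading, without listing all of them.
partly = renamed.rename(columns={'col2': 'name'})
print(partly.columns.tolist())
```

Assigning to columns requires a name for every column, while rename only touches the columns listed in the mapping.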

usecols

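Selects the columns to read

  • Format: list of column names or positions, or a string of Excel column letters such as "A,C:E"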
data7 = pd.read_excel('泰坦尼克数据.xlsx',usecols=['姓名','性别','年龄'])
data7.head()
姓名 性别 年龄
0 Braund, Mr. Owen Harris male 22.0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
2 Heikkinen, Miss. Laina female 26.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
4 Allen, Mr. William Henry male 35.0
# Method 2 ★★★
data8 = pd.read_excel('泰坦尼克数据.xlsx')[['姓名','性别','年龄']]  # note the double square brackets
data8.head()
姓名 性别 年龄
0 Braund, Mr. Owen Harris male 22.0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
2 Heikkinen, Miss. Laina female 26.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
4 Allen, Mr. William Henry male 35.0
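
The two approaches can be compared side by side on a small table. This sketch uses made-up in-memory CSV text and read_csv, which accepts usecols as a list of names or positions in the same way.

```python
import io
import pandas as pd

text = "id,name,sex,age\n1,Ann,female,22\n2,Bob,male,38\n"

# Filter while parsing: only the listed columns are loaded.
by_name = pd.read_csv(io.StringIO(text), usecols=['name', 'age'])

# Positions work too (0-based).
by_position = pd.read_csv(io.StringIO(text), usecols=[1, 3])

# Load everything, then select; the double brackets return a DataFrame.
selected = pd.read_csv(io.StringIO(text))[['name', 'age']]
print(selected.columns.tolist())
```

Filtering with usecols avoids loading columns that are never used, which matters for wide files. With read_csv the result keeps the file's column order whatever order usecols lists, whereas double-bracket selection can also reorder the columns.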

header

The row that holds the column names

  • Format: integer, list of integers, or None
data9 = pd.read_excel('泰坦尼克数据.xlsx',header=1)
data9.head()
1 0 3 Braund, Mr. Owen Harris male 22 1.1 0.1 A/5 21171 7.25 Unnamed: 10 S
0 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
1 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
2 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
3 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
4 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
data10 = pd.read_excel('泰坦尼克数据.xlsx',header=None)
data10.head()
0 1 2 3 4 5 6 7 8 9 10 11
0 乘客ID 是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
1 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 NaN S
2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
3 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 NaN S
4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S

skiprows

Skip the first few rows: use this when the leading rows are blank or hold data that should not be read

data11 = pd.read_excel(r'C:\Users\yyz\Desktop\python数据分析基础\data\泰坦尼克数据.xlsx',
                      skiprows=1)
data11.head()
1 0 3 Braund, Mr. Owen Harris male 22 1.1 0.1 A/5 21171 7.25 Unnamed: 10 S
0 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
1 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
2 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
3 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
4 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
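
In the workbook above the column names are already on the first row, so header=1 and skiprows=1 both promote the first passenger to column names, which is where labels such as 1.1, 0.1 and Unnamed: 10 come from. The parameters are meant for files like the made-up one below, where a title line precedes the real header. The sketch uses read_csv on in-memory text to stay self-contained.

```python
import io
import pandas as pd

# Made-up file whose first line is a title, not the column names.
text = "Passenger report\nid,name\n1,Ann\n2,Bob\n"

via_header = pd.read_csv(io.StringIO(text), header=1)      # names are on row 1
via_skiprows = pd.read_csv(io.StringIO(text), skiprows=1)  # drop row 0 first

# header=None treats every line as data and numbers the columns 0, 1, ...
no_header = pd.read_csv(io.StringIO("1,Ann\n2,Bob\n"), header=None)
print(via_header.columns.tolist(), no_header.columns.tolist())
```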

Common DataFrame.to_excel() parameters

# Import the pandas package under the alias pd
import pandas as pd
# Read the data
data1 = pd.read_excel('C:/Users/yyz/Desktop/python数据分析基础/data/泰坦尼克数据.xlsx')
data1.head()
乘客ID 是否存活 票类 姓名 性别 年龄 乘客兄弟姐妹个数 乘客父母/孩子的个数 票号 票价 仓位 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   乘客ID        891 non-null    int64  
 1   是否存活        891 non-null    int64  
 2   票类          891 non-null    int64  
 3   姓名          891 non-null    object 
 4   性别          891 non-null    object 
 5   年龄          714 non-null    float64
 6   乘客兄弟姐妹个数    891 non-null    int64  
 7   乘客父母/孩子的个数  891 non-null    int64  
 8   票号          891 non-null    object 
 9   票价          891 non-null    float64
 10  仓位          204 non-null    object 
 11  登船港口        889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
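# Pivot table 1: passenger counts by sex and port of embarkation
result1 = data1.pivot_table('姓名',      # the variable (field) to aggregate
                            index='性别',   # row variable; pass a list ['',''] for several
                            columns='登船港口', # column variable; pass a list ['',''] for several
                            aggfunc='count', # the statistic to compute: 'sum', 'mean', 'max', etc.
                            margins=True)   # whether to show totals

# Pivot table 2: passenger counts by sex and ticket class
result2 = data1.pivot_table('姓名',
                            index='性别',
                            columns='票类',
                            aggfunc='count',
                            margins=True)   # whether to show totals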
result1.head()
登船港口 C Q S All
性别
female 73 36 203 312
male 95 41 441 577
All 168 77 644 889
result2
票类 1 2 3 All
性别
female 94 76 144 314
male 122 108 347 577
All 216 184 491 891
# Most common usage
result1.to_excel('C:/Users/yyz/Desktop/保存数据1.xlsx')
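# Set index=False when the index is a plain numeric sequence that should not be exported
# (result2 is indexed by 性别, so index=False leaves those row labels out of the file)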
result2.to_excel('C:/Users/yyz/Desktop/保存数据2.xlsx',index=False)
result2.to_excel('C:/Users/yyz/Desktop/保存数据3.xlsx',sheet_name='汇总')
with pd.ExcelWriter('C:/Users/yyz/Desktop/保存数据4.xlsx') as writer:
    result1.to_excel(writer,sheet_name='第1个表')
    result2.to_excel(writer,sheet_name='第2个表')
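
Because a pivot table keeps its row labels in the index, index=False removes them from the exported file. The sketch below, on made-up data with illustrative column names, shows one way to keep the labels: move them into an ordinary column with reset_index() before exporting.

```python
import pandas as pd

# Made-up passengers; the column names are illustrative, not the workbook's.
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
    'sex': ['female', 'male', 'male', 'female', 'female'],
    'port': ['S', 'S', 'C', 'S', 'C'],
})

# Count names per sex (rows) and port (columns), with an "All" total.
counts = people.pivot_table('name', index='sex', columns='port',
                            aggfunc='count', margins=True)

# The row labels live in the index, so index=False would drop them on export.
labels_in_columns = 'sex' in counts.columns

# reset_index() turns the labels into an ordinary column first.
flat = counts.reset_index()
labels_after_reset = 'sex' in flat.columns
print(flat)
```

Calling flat.to_excel(path, index=False) then writes the sex labels as a regular column and leaves out only the meaningless 0, 1, 2 row numbers.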

pandas | Reading CSV and TXT files

  • Large datasets are usually stored as CSV or TXT files;
  • They are read with pandas.read_csv(), whose parameters are as follows:

pandas.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None,
header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None,
mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None,
false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None,
na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True,
parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None,
dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer',
thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True,
escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True,
warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False,
float_precision=None)

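  • Common parameters:
    • filepath_or_buffer: the file path, same as the io parameter of read_excel
    • sep: the delimiter, a comma by default. Other special delimiters:
      • Carriage return: \r
      • Line feed: \n
      • Tab: \t
      • Whitespace character (regular expression): \s
      • One or more whitespace characters (regular expression): \s+
    • encoding: usually utf-8 or gbk
  • The remaining parameters are similar to those of pd.read_excel()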
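
The sep and encoding parameters listed above can be tried on in-memory data. This is a minimal sketch with made-up rows: the same table is stored once with tabs, once with runs of spaces, and once as GBK-encoded bytes.

```python
import io
import pandas as pd

# Tab-separated text, as in a typical exported .txt file.
tab_text = "id\tname\n1\tAnn\n2\tBob\n"
tabbed = pd.read_csv(io.StringIO(tab_text), sep='\t')

# Columns aligned with a varying number of spaces: use the regex \s+.
space_text = "id   name\n1    Ann\n2  Bob\n"
spaced = pd.read_csv(io.StringIO(space_text), sep=r'\s+')

# Bytes written in GBK must be decoded with encoding='gbk'.
gbk_bytes = "编号,姓名\n1,张三\n2,李四\n".encode('gbk')
decoded = pd.read_csv(io.BytesIO(gbk_bytes), encoding='gbk')
print(decoded)
```

A file saved by Excel on a Chinese-language Windows system is often GBK-encoded, so a decoding error under the default utf-8 is a common sign that encoding='gbk' is needed.
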
# View the help documentation
import pandas as pd
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
# Read a CSV file
data1 = pd.read_csv('C:/Users/yyz/Desktop/python数据分析基础/data/titanic_train.csv')
data1.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Read a TXT file: the sep parameter
data2 = pd.read_csv('C:/Users/yyz/Desktop/python数据分析基础/data/titanic_train.txt',sep='\t')  # without sep the file is not parsed correctly
data2.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
data2.sample(10).to_csv('C:/Users/yyz/Desktop/导出txt文件.txt',sep=':')
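
A file written with a custom delimiter has to be read back with the same one. The round trip below is a sketch on made-up data that writes to an in-memory buffer instead of the desktop path used above.

```python
import io
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cid']})

# Write with ':' as the delimiter; index=False leaves out the row labels.
buffer = io.StringIO()
data.to_csv(buffer, sep=':', index=False)
print(buffer.getvalue())

# Read it back with the same delimiter.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=':')
```

Without index=False, as in the export above, to_csv also writes the row index as an unnamed first column, which then reappears as an extra column on reading unless index_col=0 is passed.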

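Tip for typing code faster

  • Microsoft Pinyin: Settings → General → enable "Use English punctuation in Chinese input mode"
  • Sogou Pinyin: Settings → Common → enable "Use English punctuation in Chinese mode"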
