Life is short, I use Python: pandas file format conversion
- Preface
- Example 1: Converting between Excel and CSV
- I/O methods for common formats
- Example 2: Converting between common formats
- Appendix: I/O methods for other formats
  - HTML
  - Pickling
  - Clipboard
  - Latex
  - HDFStore: PyTables (HDF5)
  - Feather
  - Parquet
  - ORC
  - SAS
  - SPSS
  - SQL
  - Google BigQuery
  - STATA
Preface
pandas supports many file formats, and its I/O methods make it straightforward to convert between them. This article works through an Excel/CSV conversion example, lists the file formats pandas supports, and builds a simple file format conversion tool.
Example 1: Converting between Excel and CSV
A previous post showed how to convert Excel to CSV with pandas; the reverse direction, CSV to Excel, works just as easily. Below is example code for converting between Excel and CSV:
import os
import pathlib

import pandas as pd


def export_csv(input_file, output_path):
    # Split each sheet of an Excel workbook into its own CSV file
    with pd.ExcelFile(input_file) as xls:
        for i, sheet_name in enumerate(xls.sheet_names):
            df = pd.read_excel(xls, sheet_name=sheet_name)
            output_file = os.path.join(output_path, f'{i + 1}-{sheet_name}.csv')
            df.to_csv(output_file, index=False)


def export_excel(input_file, output_file):
    # Convert a CSV file to an Excel workbook; default output sits next to the input
    if not output_file:
        input_path = pathlib.Path(input_file)
        output_path = input_path.parent / (input_path.stem + '.xlsx')
        output_file = str(output_path)
    df = pd.read_csv(input_file)
    df.to_excel(output_file, index=False)
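A minimal usage sketch (the file and directory names below are placeholders, not from the original post):

# Split every sheet of data.xlsx into CSV files under the out/ directory
export_csv('data.xlsx', 'out')
# Convert a CSV back to Excel; passing None derives data.xlsx next to the source file
export_excel('data.csv', None)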
I/O methods for common formats
The following tables are taken from the Input/Output section of the pandas documentation.
Flat file
| Method | Description |
| --- | --- |
| read_table(filepath_or_buffer, *[, sep, …]) | Read general delimited file into DataFrame. |
| read_csv(filepath_or_buffer, *[, sep, …]) | Read a comma-separated values (csv) file into DataFrame. |
| DataFrame.to_csv([path_or_buf, sep, na_rep, …]) | Write object to a comma-separated values (csv) file. |
| read_fwf(filepath_or_buffer, *[, colspecs, …]) | Read a table of fixed-width formatted lines into DataFrame. |
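As a quick illustration of the less common read_fwf, here is a sketch that parses fixed-width text from an in-memory buffer; the column positions and names are made up for the example:

import io
import pandas as pd

text = "alice  30\nbob    25\n"
# colspecs lists the (start, end) character positions of each column
df = pd.read_fwf(io.StringIO(text), colspecs=[(0, 5), (7, 9)], names=['name', 'age'])
print(df)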
Excel
| Method | Description |
| --- | --- |
| read_excel(io[, sheet_name, header, names, …]) | Read an Excel file into a pandas DataFrame. |
| DataFrame.to_excel(excel_writer, *[, …]) | Write object to an Excel sheet. |
| ExcelFile(path_or_buffer[, engine, …]) | Class for parsing tabular Excel sheets into DataFrame objects. |
| ExcelFile.book |  |
| ExcelFile.sheet_names |  |
| ExcelFile.parse([sheet_name, header, names, …]) | Parse specified sheet(s) into a DataFrame. |

| Method | Description |
| --- | --- |
| Styler.to_excel(excel_writer[, sheet_name, …]) | Write Styler to an Excel sheet. |

| Method | Description |
| --- | --- |
| ExcelWriter(path[, engine, date_format, …]) | Class for writing DataFrame objects into excel sheets. |
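For example, ExcelWriter lets several DataFrames share one workbook, each on its own sheet (the file and sheet names below are placeholders):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4]})
# One writer, several sheets in the same .xlsx file (needs openpyxl installed)
with pd.ExcelWriter('report.xlsx') as writer:
    df1.to_excel(writer, sheet_name='first', index=False)
    df2.to_excel(writer, sheet_name='second', index=False)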
JSON
| Method | Description |
| --- | --- |
| read_json(path_or_buf, *[, orient, typ, …]) | Convert a JSON string to pandas object. |
| json_normalize(data[, record_path, meta, …]) | Normalize semi-structured JSON data into a flat table. |
| DataFrame.to_json([path_or_buf, orient, …]) | Convert the object to a JSON string. |

| Method | Description |
| --- | --- |
| build_table_schema(data[, index, …]) | Create a Table schema from data. |
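A small sketch of json_normalize flattening nested records into columns (the data is invented for the example):

import pandas as pd

records = [
    {'id': 1, 'user': {'name': 'alice', 'city': 'Beijing'}},
    {'id': 2, 'user': {'name': 'bob', 'city': 'Shanghai'}},
]
# Nested keys become dotted column names: id, user.name, user.city
df = pd.json_normalize(records)
print(df.columns.tolist())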
XML
| Method | Description |
| --- | --- |
| read_xml(path_or_buffer, *[, xpath, …]) | Read XML document into a DataFrame object. |
| DataFrame.to_xml([path_or_buffer, index, …]) | Render a DataFrame to an XML document. |
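A round-trip sketch for XML, read with read_xml and written back with to_xml (requires lxml; the element names are arbitrary):

import io
import pandas as pd

xml = """<data>
  <row><name>alice</name><age>30</age></row>
  <row><name>bob</name><age>25</age></row>
</data>"""
df = pd.read_xml(io.StringIO(xml))   # each <row> element becomes one record
print(df.to_xml(index=False))        # serialize the DataFrame back to XML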
Example 2: Converting between common formats
Using the I/O methods for the common formats listed above, we can build a simple format conversion feature. Step one: read the data from a file in the source format into a DataFrame. Step two: write the DataFrame out to a file in the target format.
Requirements
- Detect the conversion from the input and output file extensions and convert automatically; print a message if a format is not supported.
- Supported formats: csv, xlsx, json, xml.
Dependencies
pip install pandas
pip install openpyxl
pip install lxml
The export function
import os

import pandas as pd


def export(input_file, output_file):
    if not os.path.isfile(input_file):
        print('Input file does not exist')
        return
    # Pick the reader based on the input file extension
    if input_file.endswith('.csv'):
        df = pd.read_csv(input_file, encoding='utf-8')
    elif input_file.endswith('.json'):
        df = pd.read_json(input_file, encoding='utf-8')
    elif input_file.endswith('.xlsx'):
        df = pd.read_excel(input_file)
    elif input_file.endswith('.xml'):
        df = pd.read_xml(input_file, encoding='utf-8')
    else:
        print('Input file type not supported')
        return
    # Pick the writer based on the output file extension
    if output_file.endswith('.csv'):
        df.to_csv(output_file, index=False)
    elif output_file.endswith('.json'):
        df.to_json(output_file, orient='records', force_ascii=False)
    elif output_file.endswith('.xlsx'):
        df.to_excel(output_file, index=False)
    elif output_file.endswith('.xml'):
        df.to_xml(output_file, index=False)
    elif output_file.endswith('.html'):
        df.to_html(output_file, index=False, encoding='utf-8')
    else:
        print('Output file type not supported')
        return
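A usage sketch (the file names are placeholders): the conversion direction follows purely from the extensions.

export('data.csv', 'data.json')   # csv  -> json
export('data.json', 'data.xlsx')  # json -> xlsx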
The main function
import getopt
import sys


def main(argv):
    input_path = None
    output_path = None
    try:
        shortopts = "hi:o:"
        longopts = ["help", "ipath=", "opath="]
        opts, args = getopt.getopt(argv, shortopts, longopts)
    except getopt.GetoptError:
        print('usage: export.py -i <input_file> -o <output_file>')
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            print('usage: export.py -i <input_file> -o <output_file>')
            sys.exit()
        elif opt in ("-i", "--ipath"):
            input_path = arg
        elif opt in ("-o", "--opath"):
            output_path = arg
    print(f'Input path: {input_path}')
    print(f'Output path: {output_path}')
    export(input_path, output_path)
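The script presumably ends with a standard entry-point guard along these lines, so it can be run as, e.g., python export.py -i data.csv -o data.xlsx:

if __name__ == '__main__':
    # Forward everything after the script name to main()
    main(sys.argv[1:])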
Appendix: I/O methods for other formats
The following tables are taken from the Input/Output section of the pandas documentation.
HTML
| Method | Description |
| --- | --- |
| read_html(io, *[, match, flavor, header, …]) | Read HTML tables into a list of DataFrame objects. |
| DataFrame.to_html([buf, columns, col_space, …]) | Render a DataFrame as an HTML table. |

| Method | Description |
| --- | --- |
| Styler.to_html([buf, table_uuid, …]) | Write Styler to a file, buffer or string in HTML-CSS format. |
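read_html returns a list because a page may contain several tables; a minimal sketch parsing an in-memory fragment (requires lxml or html5lib; the markup is invented):

import io
import pandas as pd

html = "<table><tr><th>name</th><th>age</th></tr><tr><td>alice</td><td>30</td></tr></table>"
tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> found
print(tables[0])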
Pickling
| Method | Description |
| --- | --- |
| read_pickle(filepath_or_buffer[, …]) | Load pickled pandas object (or any object) from file. |
| DataFrame.to_pickle(path, *[, compression, …]) | Pickle (serialize) object to file. |
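A pickle round trip preserves the DataFrame exactly, including dtypes; a sketch (the file name is a placeholder):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_pickle('frame.pkl')            # serialize the whole object
same = pd.read_pickle('frame.pkl')   # load it back unchanged
print(same.equals(df))               # True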
Clipboard
| Method | Description |
| --- | --- |
| read_clipboard([sep, dtype_backend]) | Read text from clipboard and pass to read_csv(). |
| DataFrame.to_clipboard(*[, excel, sep]) | Copy object to the system clipboard. |
Latex
| Method | Description |
| --- | --- |
| DataFrame.to_latex([buf, columns, header, …]) | Render object to a LaTeX tabular, longtable, or nested table. |

| Method | Description |
| --- | --- |
| Styler.to_latex([buf, column_format, …]) | Write Styler to a file, buffer or string in LaTeX format. |
HDFStore: PyTables (HDF5)
| Method | Description |
| --- | --- |
| read_hdf(path_or_buf[, key, mode, errors, …]) | Read from the store, close it if we opened it. |
| HDFStore.put(key, value[, format, index, …]) | Store object in HDFStore. |
| HDFStore.append(key, value[, format, axes, …]) | Append to Table in file. |
| HDFStore.get(key) | Retrieve pandas object stored in file. |
| HDFStore.select(key[, where, start, stop, …]) | Retrieve pandas object stored in file, optionally based on where criteria. |
| HDFStore.info() | Print detailed information on the store. |
| HDFStore.keys([include]) | Return a list of keys corresponding to objects stored in HDFStore. |
| HDFStore.groups() | Return a list of all the top-level nodes. |
| HDFStore.walk([where]) | Walk the pytables group hierarchy for pandas objects. |
Warning: One can store a subclass of DataFrame or Series to HDF5, but the type of the subclass is lost upon storing.
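A minimal HDFStore sketch (requires the tables package, i.e. PyTables; the file and key names are arbitrary):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
with pd.HDFStore('store.h5') as store:
    store.put('frame', df, format='table')  # store df under the key 'frame'
    print(store.keys())                     # ['/frame']
    back = store.get('frame')               # read it back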
Feather
| Method | Description |
| --- | --- |
| read_feather(path[, columns, use_threads, …]) | Load a feather-format object from the file path. |
| DataFrame.to_feather(path, **kwargs) | Write a DataFrame to the binary Feather format. |
Parquet
| Method | Description |
| --- | --- |
| read_parquet(path[, engine, columns, …]) | Load a parquet object from the file path, returning a DataFrame. |
| DataFrame.to_parquet([path, engine, …]) | Write a DataFrame to the binary parquet format. |
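A Parquet round-trip sketch (requires pyarrow or fastparquet; the file name is a placeholder):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_parquet('frame.parquet', index=False)              # columnar, compressed on disk
back = pd.read_parquet('frame.parquet', columns=['a'])   # read only a column subset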
ORC
| Method | Description |
| --- | --- |
| read_orc(path[, columns, dtype_backend, …]) | Load an ORC object from the file path, returning a DataFrame. |
| DataFrame.to_orc([path, engine, index, …]) | Write a DataFrame to the ORC format. |
SAS
| Method | Description |
| --- | --- |
| read_sas(filepath_or_buffer, *[, format, …]) | Read SAS files stored as either XPORT or SAS7BDAT format files. |
SPSS
| Method | Description |
| --- | --- |
| read_spss(path[, usecols, …]) | Load an SPSS file from the file path, returning a DataFrame. |
SQL
| Method | Description |
| --- | --- |
| read_sql_table(table_name, con[, schema, …]) | Read SQL database table into a DataFrame. |
| read_sql_query(sql, con[, index_col, …]) | Read SQL query into a DataFrame. |
| read_sql(sql, con[, index_col, …]) | Read SQL query or database table into a DataFrame. |
| DataFrame.to_sql(name, con, *[, schema, …]) | Write records stored in a DataFrame to a SQL database. |
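A sketch using the standard-library sqlite3 driver, which pandas accepts directly (the database file and table name are placeholders):

import sqlite3
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob'], 'age': [30, 25]})
with sqlite3.connect('demo.db') as con:
    df.to_sql('people', con, if_exists='replace', index=False)      # write the table
    back = pd.read_sql('SELECT * FROM people WHERE age > 26', con)  # query it back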
Google BigQuery
| Method | Description |
| --- | --- |
| read_gbq(query[, project_id, index_col, …]) | (DEPRECATED) Load data from Google BigQuery. |
STATA
| Method | Description |
| --- | --- |
| read_stata(filepath_or_buffer, *[, …]) | Read Stata file into DataFrame. |
| DataFrame.to_stata(path, *[, convert_dates, …]) | Export DataFrame object to Stata dta format. |

| Method | Description |
| --- | --- |
| StataReader.data_label | Return data label of Stata file. |
| StataReader.value_labels() | Return a nested dict associating each variable name to its value and label. |
| StataReader.variable_labels() | Return a dict associating each variable name with corresponding label. |
| StataWriter.write_file() | Export DataFrame object to Stata dta format. |