CSV is a great format for data exchange. It’s understood all around the world and editable in a regular notepad. That doesn’t mean that it’s suitable for persisting all DataFrames. CSVs can be slow to read and write, they take more disk space, and most importantly CSVs don’t store information about data types.
Advantages of CSV:
- Commonly understandable
- Parsing and creation supported by most programming languages
Disadvantages of CSV:
- Bigger disk usage
- Slower reads and writes
- Doesn't store information about data types
Ilia Zaitsev did a great analysis of the speed and memory usage of various pandas persistence methods. I'll touch on performance at the end, when we program an automated performance test. First, I want to focus on a different thing — how these file formats handle various data types.
Pandas Data types
Pandas supports a rich set of data types and some of them have multiple subtypes to make work with big data frames more efficient. The fundamental data types are:
- object — strings or mixed types
- string — since pandas 1.0.0
- int — integer numbers
- float — floating-point numbers
- bool — boolean True and False values
- datetime — date and time values
- timedelta — time difference between two datetimes
- category — a limited list of values stored as a memory-efficient lookup
Since pandas is using numpy arrays as its backend structures, the ints and floats can be differentiated into more memory-efficient types like int8, int16, int32, int64, uint8, uint16, uint32 and uint64 as well as float32 and float64.
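As a quick, hedged illustration (the column name colA and the value range are made up for this example), downcasting to a narrower subtype can cut the memory footprint considerably:

import numpy as np
import pandas as pd

# one million small integers; numpy gives them the default integer type
df = pd.DataFrame({"colA": np.random.randint(-127, 127, size=1_000_000)})

print(df["colA"].dtype)                    # int64 on most platforms
print(df.memory_usage(deep=True)["colA"])  # roughly 8 MB

# downcast to the smallest type that fits the value range
df["colA"] = df["colA"].astype("int8")
print(df.memory_usage(deep=True)["colA"])  # roughly 1 MB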
CSV doesn't store information about the data types and you have to specify them with each read_csv(). Without being told, the CSV reader will infer all integer columns as the least efficient int64, floats as float64, and categories as well as datetimes will be loaded as strings.
# for each loaded dataset you have to specify the formats to make the DataFrame efficient
df = pd.read_csv(new_file,
                 dtype={"colA": "int8",
                        "colB": "int16",
                        "colC": "uint8",
                        "colcat": "category"},
                 parse_dates=["colD", "colDt"])
Timedeltas are stored as strings in CSVs, e.g. -5 days +18:59:39.000000, and you would have to write a special parser to turn these strings back into pandas's timedelta format. Timezone-aware datetimes look like 2020-08-06 15:35:06-06:00 and also require special handling in read_csv().
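A minimal sketch of that post-processing, assuming hypothetical column names colDt and colTd:

import pandas as pd

df = pd.read_csv("data.csv")

# strings like "2020-08-06 15:35:06-06:00" come back as plain objects;
# utc=True converts them into a single timezone-aware column (normalized to UTC)
df["colDt"] = pd.to_datetime(df["colDt"], utc=True)

# strings like "-5 days +18:59:39.000000" may need a custom parser,
# though pd.to_timedelta handles many of the common representations
df["colTd"] = pd.to_timedelta(df["colTd"])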
CSV alternatives
Luckily, CSV is not the only option for persisting data frames. Reading through pandas's IO tools you will see that a data frame can be written into many formats, databases, or even a clipboard.
You can run the code yourself using this GitHub notebook. In the end, I’ll describe in detail how the data were created and I’ll guide you through the performance tests and sanity check using a real dataframe.
Pickle and to_pickle()
Pickle is the Python-native format for object serialization. It allows the Python code to implement any kind of enhancement, like the latest protocol 5 described in PEP 574, pickling out-of-band data buffers.
It also means that pickling outside of the Python ecosystem is difficult. However, if you want to store some pre-processed data for later or you just don’t want to lose hours of analytical work while not having any immediate use for the data, just pickle them.
# Pandas's to_pickle method
df.to_pickle(path)
Contrary to .to_csv(), the .to_pickle() method accepts only three parameters.
- path — where the data will be stored
- compression — allows choosing from various compression methods
- protocol — a higher protocol can process a wider range of data more efficiently
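For instance, a small sketch (the file name is chosen arbitrarily) combining compression and protocol with the matching read_pickle:

import pickle
import pandas as pd

# write with gzip compression and the highest protocol this interpreter supports
df.to_pickle("df.pkl.gz", compression="gzip", protocol=pickle.HIGHEST_PROTOCOL)

# read it back; the compression is inferred from the file extension
df_loaded = pd.read_pickle("df.pkl.gz")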
Advantages of pickle:
- Faster than CSV (5–300% of CSV write time and 15–200% of CSV read time, depending on the compression method)
- The resulting file is smaller (~50% of the csv)
- It keeps the information about data types (100%)
- No need to specify a plethora of parameters
Disadvantages of pickle:
- Native to Python, so it lacks support in other programming languages
- Not reliable even across different Python versions
Parquet and to_parquet()
Apache Parquet is a compressed binary columnar storage format used in the Hadoop ecosystem. It allows serializing complex nested structures, supports column-wise compression and column-wise encoding, and offers fast reads because it's not necessary to read the whole column when you need only part of the data.
# Pandas's to_parquet method
df.to_parquet(path, engine, compression, index, partition_cols)
The .to_parquet() method accepts only a handful of parameters.
- path — where the data will be stored
- engine — pyarrow or fastparquet; pyarrow is usually faster, but it struggles with the timedelta format, while fastparquet can be significantly slower
- compression — allows choosing from various compression methods
- index — whether to store the data frame's index
- partition_cols — specify the order of the column partitioning
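A short sketch of how these parameters fit together (the file and column names are hypothetical, and pyarrow is assumed to be installed):

# write with the pyarrow engine and the default snappy compression
df.to_parquet("df.parquet", engine="pyarrow", compression="snappy", index=False)

# read back only the columns you need - one benefit of the columnar layout
df_subset = pd.read_parquet("df.parquet", engine="pyarrow", columns=["colA", "colB"])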
Advantages of parquet:
- Faster than CSV (starting at 10 rows, pyarrow is about 5 times faster)
- The resulting file is smaller (~50% of the CSV)
- It keeps the information about data types (pyarrow cannot handle timedelta, which the slower fastparquet can)
- Wide support in the Hadoop ecosystem, allowing fast filtering across many partitions
Disadvantages of parquet:
- Duplicated column names are not supported
- The pyarrow engine doesn't support some data types
More specifics on the pandas IO page.
Excel and to_excel()
Sometimes it's handy to export your data into Excel. It adds the benefit of easy manipulation, at the cost of the slowest reads and writes. It also ignores many data types: timezone-aware datetimes cannot be written into Excel at all.
# exporting a dataframe to excel
df.to_excel(excel_writer, sheet_name, many_other_parameters)
Useful Parameters:
- excel_writer — pandas ExcelWriter object or file path
- sheet_name — name of the sheet where the data will be output
- float_format — Excel's native number formatting
- columns — option to alias the data frame's columns
- startrow — option to shift the starting cell downward
- engine — openpyxl or xlsxwriter
- freeze_panes — option to freeze rows and columns
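A hedged sketch of a few of these parameters in use (the file and sheet names are arbitrary, and the openpyxl engine is assumed to be installed):

with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
    # freeze the header row and format floats with two decimals
    df.to_excel(writer, sheet_name="data", float_format="%.2f", freeze_panes=(1, 0))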
Advantages of Excel:
- Allows custom formatting and cell freezing
- Human-readable and editable format
Disadvantages of Excel:
- Very slow reads/writes (20 times / 40 times slower)
- Limited to 1048576 rows
- Serialization of datetimes with timezones fails
More information on the pandas IO page.
HDF5 and to_hdf()
A compressed format using an internal file-like structure suitable for huge heterogeneous data. It's also ideal if you need to randomly access various parts of the dataset. If the data are stored as a table (PyTables), you can directly query the HDF store using store.select(key, where="A>0 or B<5").
# exporting a dataframe to hdf
df.to_hdf(path_or_buf, key, mode, complevel, complib, append ...)
Useful Parameters:
- path_or_buf — file path or HDFStore object
- key — identifier of the group in the store
- mode — write, append or read-append
- format — fixed for fast writing and reading, while table allows selecting just a subset of the data
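A minimal sketch of the queryable table format (the key and column names are hypothetical, and PyTables is assumed to be installed):

# write in the queryable "table" (PyTables) format and index colA for querying
df.to_hdf("store.h5", key="df", mode="w", format="table", data_columns=["colA"])

# read back only the rows matching a condition
subset = pd.read_hdf("store.h5", key="df", where="colA > 0")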
Advantages of HDF5:
- For some data structures, the size and access speed can be awesome
Disadvantages of HDF5:
- Data frames can be very big in size (even 300 times bigger than CSV)
- HDFStore is not thread-safe for writing
- The fixed format cannot handle categorical values
SQL and to_sql()
Quite often it's useful to persist your data into a database. Libraries like sqlalchemy are dedicated to this task.
from sqlalchemy import create_engine

# Set up the sqlalchemy engine
engine = create_engine(
    'mssql+pyodbc://user:pass@localhost/DB?driver=ODBC+Driver+13+for+SQL+server',
    isolation_level="REPEATABLE READ"
)

# connect to the DB
connection = engine.connect()

# export the dataframe to SQL
df.to_sql(name="test", con=connection)
Useful Parameters:
- name — name of the SQL table
- con — connection engine, usually from sqlalchemy.engine
- chunksize — optionally load data in batches of chunksize rows
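For bigger frames, a sketch of a chunked write (the table name and chunk size are arbitrary):

# write in batches of 10,000 rows, replacing the table if it already exists
df.to_sql(name="test", con=connection, chunksize=10_000, if_exists="replace", index=False)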
Advantages of SQL:
- Databases are understandable by all programmers
Disadvantages of SQL:
- Slower than persisting on disk (read 10 times / write 5 times slower, but this can be optimized)
- Some data formats are not kept — category, int, floats and timedeltas
- Depending on the database, performance can be slow
- You may struggle to set up a DB connection in some cases
If you would like to speed up the write time of .to_sql(), try Kiran Kumar Chilla's method described in the Speed up Bulk inserts article.
Other methods
Pandas offers even more persistence and reading methods. I've omitted json and fixed-width files because they have characteristics similar to csv. You can try writing directly to Google BigQuery with .to_gbq() or to the stata format. New formats will definitely appear to address the need to communicate with a variety of cloud providers.
Thanks to this article, I started to like .to_clipboard() when I copy one-liners to emails, Excel, or Google Docs.
Performance Test
Many of the methods have benefits over CSV, but is it worth using these unusual approaches when CSV is so readily understood around the world? Let's have a look at the performance.
During the performance test I focus on 4 key measures:
- data type preservation — what percentage of columns kept their original type after reading
- compression/size — how big the file is, as a percentage of the csv
- write_time — how long it takes to write this format, as a percentage of the csv writing time
- read_time — how long it takes to read this format, as a percentage of the csv reading time
For this reason, I have prepared a dataset with 50K random numbers, strings, categories, datetimes and bools. The ranges of the numerical values come from the numpy data types overview.
data = []
for i in range(1000000):
    data.append(
        [random.randint(-127, 127),      # int8
         random.randint(-32768, 32767),  # int16
         ...
Generating random samples is a skill used in almost every test.
You can check the support functions generating random strings and dates in the GitHub notebook; I'll only mention one here:
import random
import string

def get_random_string(length: int) -> str:
    """Generate a random string up to the specified length."""
    letters = string.ascii_letters
    result_str = ''.join(random.choice(letters) for _ in range(random.randint(3, length)))
    return result_str
Full code to generate the data frame is described in this gist:

Generate random data and measure the read/write speed in 7 iterations.

Once we have some data, we want to process them over and over again with different algorithms. You can write each of the tests separately, but let's squeeze the test into one line:
# performance test
performance_df = performance_test(exporting_types)

# results
performance_df.style.format("{:.2%}")
The performance_test function accepts a dictionary with the test definition, which looks like this:
d = {
    ...
    "parquet_fastparquet": {
        "type": "Parquet via fastparquet",
        "extension": ".parquet.gzip",
        "write_function": pd.DataFrame.to_parquet,
        "write_params": {"engine": "fastparquet", "compression": "GZIP"},
        "read_function": pd.read_parquet,
        "read_params": {"engine": "fastparquet"}
    },
    ...
}
The dictionary contains the functions which should be run, e.g. pd.DataFrame.to_parquet, and their parameters. We iterate over the dict and run the functions one after another:
path = "output_file"
# df is our performance test sample dataframe

# persist the df
d["write_function"](df, path, **d["write_params"])

# load the df
df_loaded = d["read_function"](path, **d["read_params"])
I store the results in a dataframe to leverage Plotly Express's power to display them with a few lines of code:
# display the graph with the results
fig = pe.bar(performance_df.T, barmode='group', text="value")

# format the labels
fig.update_traces(texttemplate='%{text:.2%}', textposition='auto')

# add a title
fig.update_layout(title=f"Statistics for {dataset_size} records")
fig.show()
Sanity Check
Testing things on random samples is useful to get a first impression of how good your application or tool is, but at some point it has to meet reality. To avoid any surprises, you should try your code on real data. I've picked my favorite dataset — the US Securities and Exchange Commission quarterly data dump — and run it through the performance test. I achieved very similar results, which persuaded me that my assumption was not completely wrong.
Conclusion
Even though pickle clearly won the performance competition, there can be use cases when you would prefer to pick another format. We also had quite a special dataset, and on real data the performance may differ.
Feel free to play with the code (GitHub) and try the performance on your favorite dataset. For me personally, .to_pickle() is great when I store a pre-processed dataset, because I don't have to worry about the data format; I simply read_pickle() and hours of work materialize back in my notebook.
Translated from: https://towardsdatascience.com/stop-persisting-pandas-data-frames-in-csvs-f369a6440af5