Stop Persisting Pandas Data Frames in CSVs

CSV is a great format for data exchange. It’s understood all around the world and editable in a regular notepad. That doesn’t mean that it’s suitable for persisting all DataFrames. CSVs can be slow to read and write, they take more disk space, and most importantly CSVs don’t store information about data types.

Advantages of CSV:

  • Commonly understandable
  • Parsing and creation supported by most programming languages

Disadvantages of CSV:

  • Bigger disk usage
  • Slower reads and writes
  • Doesn’t store information about data types

Ilia Zaitsev did a great analysis of the speed and memory usage of various pandas persistence methods. I’ll touch on performance at the end, when we program an automated performance test. First, I want to focus on a different thing — how these file formats handle various data types.

Pandas Data types

Pandas supports a rich set of data types, and some of them have multiple subtypes to make working with big data frames more efficient. The fundamental data types are:

  • object — strings or mixed types
  • string — dedicated string type since pandas 1.0.0
  • int — integer numbers
  • float — floating-point numbers
  • bool — boolean True and False values
  • datetime — date and time values
  • timedelta — time difference between two datetimes
  • category — a limited list of values stored in a memory-efficient lookup

Since pandas uses numpy arrays as its backend structures, ints and floats can be differentiated into more memory-efficient subtypes like int8, int16, int32, int64, uint8, uint16, uint32 and uint64, as well as float32 and float64.
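
As a quick illustration (not from the article's notebook; the column names here are made up), pandas can downcast to the smallest sufficient subtype for you:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 127], "b": [0.5, 1.5, 2.5]})

# let pandas pick the smallest subtype that still holds the values
df["a"] = pd.to_numeric(df["a"], downcast="integer")  # int8
df["b"] = pd.to_numeric(df["b"], downcast="float")    # float32

print(df.dtypes)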

CSV doesn’t store information about data types, so you have to specify them with each read_csv(). Without being told, the CSV reader will infer all integer columns as the least efficient int64, floats as float64, and both categories and datetimes will be loaded as plain strings.

# for each loaded dataset you have to specify the formats to make the DataFrame efficient
df = pd.read_csv(new_file,
                 dtype={"colA": "int8",
                        "colB": "int16",
                        "colC": "uint8",
                        "colcat": "category"},
                 parse_dates=["colD", "colDt"])

Timedeltas are stored as strings in CSVs, e.g. -5 days +18:59:39.000000, and you would have to write a special parser to turn these strings back into pandas’s timedelta format.

Timezone-aware datetimes look like 2020-08-06 15:35:06-06:00 and also require special handling in read_csv().
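
A minimal sketch of that post-processing, assuming hypothetical column names colTd and colDtTz (pd.to_timedelta should accept the string representation above):

import pandas as pd

df = pd.read_csv("data.csv")

# turn strings like "-5 days +18:59:39.000000" back into timedelta64
df["colTd"] = pd.to_timedelta(df["colTd"])

# parse timezone-aware strings; utc=True yields a single UTC-aware datetime dtype
df["colDtTz"] = pd.to_datetime(df["colDtTz"], utc=True)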

Comparison of the original dtypes and the automatically inferred types using read_csv() without parameters

CSV alternatives

Luckily, CSV is not the only option for persisting data frames. Reading through pandas’s IO tools, you see that a data frame can be written to many formats, to databases, or even to the clipboard.

You can run the code yourself using this GitHub notebook. At the end, I’ll describe in detail how the data were created, and I’ll guide you through the performance tests and a sanity check using a real dataframe.

In this article, we test many types of persisting methods with several parameters. Thanks to Plotly’s interactive features, you can explore any combination of methods and the chart will automatically update.

Pickle and to_pickle()

Pickle is Python’s native format for object serialization. It allows Python code to implement any kind of enhancement, like the latest protocol 5 described in PEP 574, which pickles out-of-band data buffers.

It also means that pickling outside of the Python ecosystem is difficult. However, if you want to store some pre-processed data for later, or you just don’t want to lose hours of analytical work even though you have no immediate use for the data, just pickle them.

# Pandas's to_pickle method
df.to_pickle(path)

In contrast to .to_csv(), the .to_pickle() method accepts only 3 parameters.

  • path — where the data will be stored
  • compression — allows choosing various compression methods
  • protocol — a higher protocol can process a wider range of data more efficiently
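
A small sketch, assuming a df in memory and a made-up file name; on read, the compression is also inferred from the extension:

import pandas as pd

# write with gzip compression and the highest pickle protocol available
df.to_pickle("df.pkl.gz", compression="gzip", protocol=-1)

# read it back; dtypes, categories, timedeltas and timezones survive untouched
df_loaded = pd.read_pickle("df.pkl.gz")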

Advantages of pickle:

  • Faster than CSV (5–300% of CSV write time and 15–200% of CSV read time, depending on the compression method)
  • The resulting file is smaller (~50% of the CSV)
  • It keeps the information about data types (100%)
  • No need to specify a plethora of parameters

Disadvantages of pickle:

  • Native to Python, so it lacks support in other programming languages
  • Not reliable even across different Python versions

Pickle is able to serialize 100% of the pandas data types

Parquet and to_parquet()

Apache Parquet is a compressed binary columnar storage format used in the Hadoop ecosystem. It allows serializing complex nested structures, supports column-wise compression and column-wise encoding, and offers fast reads because it’s not necessary to read the whole file when you need only part of the data.

# Pandas's to_parquet method
df.to_parquet(path, engine, compression, index, partition_cols)

The .to_parquet() method accepts several parameters.

  • path — where the data will be stored
  • engine — pyarrow or fastparquet. pyarrow is usually faster, but it struggles with the timedelta format; fastparquet can be significantly slower.
  • compression — allows choosing various compression methods
  • index — whether to store the data frame’s index
  • partition_cols — specify the order of the column partitioning
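
A minimal sketch, assuming pyarrow is installed and reusing the hypothetical column names from the read_csv example above; because Parquet is columnar, the columns you don't request are never read from disk:

import pandas as pd

# write with the pyarrow engine and snappy compression, dropping the index
df.to_parquet("df.parquet", engine="pyarrow", compression="snappy", index=False)

# read back only the columns we actually need
df_subset = pd.read_parquet("df.parquet", engine="pyarrow", columns=["colA", "colcat"])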

Advantages of parquet:

  • Faster than CSV (starting at 10 rows, pyarrow is about 5 times faster)
  • The resulting file is smaller (~50% of the CSV)
  • It keeps the information about data types (pyarrow cannot handle timedelta, which the slower fastparquet can)
  • Wide support in the Hadoop ecosystem allows fast filtering across many partitions

Disadvantages of parquet:

  • duplicated column names not supported
  • pyarrow engine doesn’t support some data types

More specifics on the pandas IO page.

Excel and to_excel()

Sometimes it’s handy to export your data into an Excel file. It adds the benefit of easy manipulation, at the cost of the slowest reads and writes. It also ignores many data types. Timezone-aware datetimes cannot be written into Excel at all.

# exporting a dataframe to excel
df.to_excel(excel_writer, sheet_name, many_other_parameters)

Useful Parameters:

  • excel_writer — pandas ExcelWriter object or file path
  • sheet_name — name of the sheet where the data will be output
  • float_format — Excel’s native number formatting
  • columns — option to alias the data frame’s columns
  • startrow — option to shift the starting cell downward
  • engine — openpyxl or xlsxwriter
  • freeze_panes — option to freeze rows and columns
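
For instance, a small sketch (the file and sheet names are made up) that writes one sheet with formatted floats and freezes the header row:

import pandas as pd

# openpyxl needs to be installed for xlsx files
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="data", float_format="%.2f", freeze_panes=(1, 0))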

Advantages of Excel:

  • allows custom formatting and cell freezing
  • human-readable and editable format

Disadvantages of Excel:

  • very slow reads/writes (20 times/40 times slower)
  • limited to 1,048,576 rows
  • serialization of datetimes with timezones fails

More information on the pandas IO page.

Performance test results for Excel. Only 54% of the columns kept their original data type, the file took 90% of the CSV’s size, but it took 20 times longer to write and 42 times longer to read.

HDF5 and to_hdf()

HDF5 is a compressed format using an internal file-like structure, suitable for huge heterogeneous data. It’s also ideal if we need to randomly access various parts of the dataset. If the data are stored as a table (PyTables), you can directly query the HDF store using store.select(key, where="A>0 or B<5").

# exporting a dataframe to hdf
df.to_hdf(path_or_buf, key, mode, complevel, complib, append ...)

Useful Parameters:

  • path_or_buf — file path or HDFStore object
  • key — identifier of the group in the store
  • mode — write, append or read-append
  • format — fixed for fast writing and reading, while table allows selecting just a subset of the data
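
A minimal sketch, assuming a numeric column colA that we register as a data column so it can be used in a where query:

import pandas as pd

# table format supports appending and querying; data_columns makes colA searchable
df.to_hdf("store.h5", key="df", mode="w", format="table",
          data_columns=["colA"], complevel=9, complib="blosc")

# read back only the rows matching the condition, without loading the whole file
subset = pd.read_hdf("store.h5", key="df", where="colA > 0")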

Advantages of HDF5:

  • for some data structures, the size and access speed can be awesome

Disadvantages of HDF5:

  • dataframes can be very big in size (even 300 times bigger than the CSV)
  • HDFStore is not thread-safe for writing
  • the fixed format cannot handle categorical values

SQL and to_sql()

Quite often it’s useful to persist your data into a database. Libraries like sqlalchemy are dedicated to this task.

from sqlalchemy import create_engine

# Set up sqlalchemy engine
engine = create_engine(
    'mssql+pyodbc://user:pass@localhost/DB?driver=ODBC+Driver+13+for+SQL+server',
    isolation_level="REPEATABLE READ"
)

# connect to the DB
connection = engine.connect()

# exporting dataframe to SQL
df.to_sql(name="test", con=connection)

Useful Parameters:

  • name — name of the SQL table
  • con — connection engine, usually provided by sqlalchemy
  • chunksize — optionally write the data in batches of chunksize rows

Advantages of SQL:

  • Databases are understandable by all programmers

Disadvantages of SQL:

  • Slower than persisting on disk (roughly 10 times slower reads and 5 times slower writes, but this can be optimized)
  • Some data formats are not kept — category, int, floats and timedeltas
  • depending on the database, performance can be slow
  • you may struggle to set up a DB connection in some cases

If you would like to speed up the write time of .to_sql(), try Kiran Kumar Chilla’s method described in the Speed up Bulk inserts article.
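
Before reaching for that, a hedged sketch of the built-in knobs: chunked writes combined with multi-row INSERT statements (reusing the connection from the snippet above) can already help:

# write in batches of 1000 rows, packing many rows into each INSERT statement
df.to_sql(name="test", con=connection, if_exists="replace",
          chunksize=1000, method="multi", index=False)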

Other methods

Pandas offers even more persistence and reading methods. I’ve omitted JSON and fixed-width files because they have characteristics similar to CSV. You can try to write directly to Google BigQuery with .to_gbq() or to the stata format. New formats will definitely appear to address the need to communicate with a variety of cloud providers.

Thanks to this article, I started to like .to_clipboard() when I copy one-liners into emails, Excel, or Google Docs.
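
A one-liner sketch of that workflow:

# copy a small summary to the clipboard, ready to paste into an email or a spreadsheet
df.describe().to_clipboard()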

Performance Test

Many of the methods have benefits over CSV, but is it worth using these less common approaches when CSV is so readily understood around the world? Let’s have a look at the performance.

During the performance test I focus on 4 key measures:

  • data type preservation — what percentage of columns kept their original type after reading
  • compression/size — how big the file is, as a % of the CSV
  • write_time — how long it takes to write this format, as a % of the CSV write time
  • read_time — how long it takes to read this format, as a % of the CSV read time

For this reason, I have prepared a dataset with 50K random numbers, strings, categories, datetimes and bools. The ranges of the numerical values come from the numpy data types overview.

data = []
for i in range(1000000):
    data.append(
        [random.randint(-127, 127),      # int8
         random.randint(-32768, 32767),  # int16
         ...

Generating random samples is a skill used in almost every test.

You can check the support functions generating random strings and dates in the GitHub notebook; I’ll only mention one here:

def get_random_string(length: int) -> str:
    """Generate a random string up to the specified length"""
    letters = string.ascii_letters
    result_str = ''.join([random.choice(letters) for i in range(random.randint(3, length))])
    return result_str

Full code to generate the data frame is described in this gist:

Generate random data and measure the read/write speed in 7 iterations

Once we have some data, we want to process it over and over again with different algorithms. You can write each of the tests separately, but let’s squeeze the test into one line:

# performance test
performance_df = performance_test(exporting_types)

# results
performance_df.style.format("{:.2%}")

The performance_test function accepts a dictionary with the test definition, which looks like:

d = {
    ...
    "parquet_fastparquet": {
        "type": "Parquet via fastparquet",
        "extension": ".parquet.gzip",
        "write_function": pd.DataFrame.to_parquet,
        "write_params": {"engine": "fastparquet", "compression": "GZIP"},
        "read_function": pd.read_parquet,
        "read_params": {"engine": "fastparquet"}
    },
    ...
}

The dictionary contains the functions which should be run, e.g. pd.DataFrame.to_parquet, and their parameters. We iterate over the dict and run one function after another:

 path = "output_file"
# df is our performance test sample dataframe # persist the df
d["write_function"](df, path, **d["write_params"]) # load the df
df_loaded = d["read_function"](path, **d["read_params"]

I store the results into a dataframe to leverage Plotly Express’s power to display them with a few lines of code:

# display the graph with the results
fig = pe.bar(performance_df.T, barmode='group', text="value")

# format the labels
fig.update_traces(texttemplate='%{text:.2%}', textposition='auto')

# add a title
fig.update_layout(title=f"Statistics for {dataset_size} records")
fig.show()

Performance test results. The data format preservation measure shows the % of columns kept successfully; size and speed are compared to the CSV.

Sanity Check

Testing things on random samples is useful to get a first impression of how good your application or tool is, but eventually it will have to meet reality. To avoid any surprises, you should try your code on real data. I’ve picked my favorite dataset — the US Securities and Exchange Commission quarterly data dump — and ran it through the performance test. I achieved very similar results, which persuaded me that my assumption was not completely wrong.

The results of the sanity check on the 8 columns of the SEC quarterly data dump confirmed the results of the performance test.

Conclusion

Even though pickle clearly won the performance competition, there can be use cases when you would prefer to pick another format. We also had quite a special dataset, and on real data the performance may differ.

Feel free to play with the code (GitHub) and try the performance on your favorite dataset. For me personally, .to_pickle() is great when I store a pre-processed dataset, because I don’t have to worry about the data format; I simply read_pickle() and hours of work materialize back in my notebook.


Source: https://towardsdatascience.com/stop-persisting-pandas-data-frames-in-csvs-f369a6440af5
