This post is translated from: “Large data” work flows using pandas
I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard drive.
My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:
What are some best-practice workflows for accomplishing the following:
Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".
Edit -- an example of how I would like this to work:
I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.
Edit -- Responding to Jeff's questions specifically:
For example:

if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'

The result of these operations is a new column for every record in my dataset. It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
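For reference, a minimal pandas sketch of that kind of conditional column creation (the column names var1, var2 and newvar come from the example above; the use of np.select is my own illustration, not something from the thread):

import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': [1, 3, 0], 'var2': [4, 1, 4]})

# equivalent of: if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'
conditions = [df['var1'] > 2, df['var2'] == 4]
choices = ['A', 'B']
df['newvar'] = np.select(conditions, choices, default='')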
Reference: https://stackoom.com/question/xqJF/使用熊猫的-大数据-工作流程
I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data and append back.
It's worth reading the docs and later in this thread for several suggestions for how to store your data.
Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.
Ensure you have pandas at least 0.10.1 installed.
Read about iterating files chunk-by-chunk and multiple table queries.
Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (this will work with a big table too, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... it is more intuitive anyhow):
(The following is pseudocode.)
import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',...... ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
Reading in the files and creating the storage (essentially doing what append_to_multiple does):
for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):

        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():

            # create the frame for this group
            frame = chunk.reindex(columns = v['fields'], copy = False)

            # append it
            store.append(g, frame, index=False, data_columns = v['dc'])
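Since the appends are done with index=False, one follow-up you may want (a sketch based on my reading of the HDFStore API, not something spelled out in the answer) is to build the table indexes in a single pass once all files have been loaded:

# once all files are appended, create the indexes for each table
for g in group_map:
    store.create_table_index(g, optlevel=9, kind='full')
store.flush()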
Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn't necessary).
This is how you get columns and create new ones:
frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only, say, 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones,
# e.g. so you don't overlap with the other tables)

# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)
When you are ready for post_processing:
# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)
About data_columns: you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:
store.select(group, where = ['field_1000=foo', 'field_1001>0'])
They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).
You also might want to:
Let me know when you have questions!
This is the case for pymongo. I have also prototyped using sql server, sqlite, HDF, ORM (SQLAlchemy) in python. First and foremost pymongo is a document-based DB, so each person would be a document (a dict of attributes). Many people form a collection and you can have many collections (people, stock market, income).
pd.DataFrame -> pymongo. Note: I use the chunksize in read_csv to keep it to 5 to 10k records (pymongo drops the socket if larger).
aCollection.insert((a[1].to_dict() for a in df.iterrows()))
querying: gt = greater than...
pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))
.find() returns an iterator so I commonly use ichunked to chop it into smaller iterators.
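A minimal sketch of that pattern, under the assumption that ichunked comes from the more_itertools package (the answer does not say which library it uses) and reusing the mongoCollection and query from above:

import pandas as pd
from more_itertools import ichunked

cursor = mongoCollection.find({'anAttribute': {'$gt': 2887000, '$lt': 2889000}})

# walk the cursor in slices of 10k documents so memory stays bounded
for chunk in ichunked(cursor, 10000):
    df = pd.DataFrame(list(chunk))
    # ... work on this chunk's DataFrame ...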
How about a join since I normally get 10 data sources to paste together:
aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))
then (in my case sometimes I have to agg on aJoinDF first before it is "mergeable").
df = pandas.merge(df, aJoinDF, on=aKey, how='left')
And you can then write the new info back to your main collection via the update method below (logical collection vs physical data sources).
collection.update({primarykey:foo},{key:change})
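For illustration, a sketch of that write-back using the newer pymongo update_one with $set (the field names primarykey and newvar, and iterating the DataFrame row by row, are my own assumptions; the answer itself uses the older update call shown above):

# write a computed column back to the main collection, one document per row
for _, row in df.iterrows():
    collection.update_one(
        {'primarykey': row['primarykey']},     # match the document
        {'$set': {'newvar': row['newvar']}},   # add/overwrite the computed field
    )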
On smaller lookups, just denormalize. For example, you have code in the document and you just add the field code text and do a dict lookup as you create documents.
Now that you have a nice dataset based around a person, you can unleash your logic on each case and make more attributes. Finally you can read your 3-to-memory-max key indicators into pandas and do pivots/agg/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...
You can also use the two methods built into MongoDB (MapReduce and the aggregation framework). See here for more info about the aggregation framework, as it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas, python toolset, MongoDB helps me just get to work :)
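As a small illustration of the aggregation framework mentioned above (the field names category, amount and anAttribute are hypothetical, and the pipeline is my own example rather than one from the answer):

import pandas as pd

# group documents by a category field and sum a numeric field server-side
pipeline = [
    {'$match': {'anAttribute': {'$gt': 0}}},
    {'$group': {'_id': '$category', 'total': {'$sum': '$amount'}}},
]
agg_df = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))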
I spotted this a little late, but I work with a similar problem (mortgage prepayment models). My solution has been to skip the pandas HDFStore layer and use straight pytables. I save each column as an individual HDF5 array in my final file.
My basic workflow is to first get a CSV file from the database. I gzip it, so it's not as huge. Then I convert that to a row-oriented HDF5 file, by iterating over it in python, converting each row to a real data type, and writing it to an HDF5 file. That takes some tens of minutes, but it doesn't use any memory, since it only operates row-by-row. Then I "transpose" the row-oriented HDF5 file into a column-oriented HDF5 file.
The table transpose looks like:
import logging

import numpy as np
import pandas as pd
import tables

# imports above are shared by the two helpers below
logger = logging.getLogger(__name__)


def transpose_table(h_in, table_path, h_out, group_name="data", group_path="/"):
    # Get a reference to the input data.
    tb = h_in.getNode(table_path)
    # Create the output group to hold the columns.
    grp = h_out.createGroup(group_path, group_name, filters=tables.Filters(complevel=1))
    for col_name in tb.colnames:
        logger.debug("Processing %s", col_name)
        # Get the data.
        col_data = tb.col(col_name)
        # Create the output array.
        arr = h_out.createCArray(grp,
                                 col_name,
                                 tables.Atom.from_dtype(col_data.dtype),
                                 col_data.shape)
        # Store the data.
        arr[:] = col_data
    h_out.flush()
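A usage sketch under assumed file and node names (input.h5 with a row-oriented table at /rows; the calls follow the older camelCase PyTables API that the answer uses):

h_in = tables.openFile('input.h5', mode='r')
h_out = tables.openFile('columns.h5', mode='w')
transpose_table(h_in, '/rows', h_out, group_name='data')
h_in.close()
h_out.close()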
Reading it back in then looks like:
def read_hdf5(hdf5_path, group_path="/data", columns=None):
    """Read a transposed data set from a HDF5 file."""
    if isinstance(hdf5_path, tables.file.File):
        hf = hdf5_path
    else:
        hf = tables.openFile(hdf5_path)

    grp = hf.getNode(group_path)
    if columns is None:
        data = [(child.name, child[:]) for child in grp]
    else:
        data = [(child.name, child[:]) for child in grp
                if child.name in columns]

    # Convert any float32 columns to float64 for processing.
    for i in range(len(data)):
        name, vec = data[i]
        if vec.dtype == np.float32:
            data[i] = (name, vec.astype(np.float64))

    if not isinstance(hdf5_path, tables.file.File):
        hf.close()
    return pd.DataFrame.from_items(data)
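And a usage sketch for pulling a subset of columns back in (the file and column names are invented; note that DataFrame.from_items only exists in older pandas, so on a recent version you would build the frame from a dict instead):

df = read_hdf5('columns.h5', group_path='/data',
               columns=['loan_age', 'balance', 'rate'])
print(df.head())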
Now, I generally run this on a machine with a ton of memory, so I may not be careful enough with my memory usage. For example, by default the load operation reads the whole data set.
This generally works for me, but it's a bit clunky, and I can't use the fancy pytables magic.
Edit: The real advantage of this approach, over the array-of-records pytables default, is that I can then load the data into R using h5r, which can't handle tables. Or, at least, I've been unable to get it to load heterogeneous tables.
If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then Pandas can hold the entire dataset in RAM. I know it's not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.
I think the answers above are missing a simple approach that I've found very useful.
When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by rows or columns).
Example: in the case of 30 days' worth of trading data of ~30GB size, I break it into one file per day of ~1GB size. I subsequently process each file separately and aggregate the results at the end.
One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes); see the sketch after this list.
The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished by regular shell commands, which is not possible with more advanced/complicated file formats.
This approach doesn't cover all scenarios, but it is very useful in a lot of them.
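A minimal sketch of the split-then-process-in-parallel idea, assuming one gzipped CSV per day and a hypothetical per-file summary step (the file pattern, the column names symbol/volume, and the use of multiprocessing are my own choices, not the answer's):

import glob
from multiprocessing import Pool

import pandas as pd

def process_one_file(path):
    # load one ~1GB daily file and reduce it to a small per-symbol summary
    df = pd.read_csv(path)
    return df.groupby('symbol')['volume'].sum()

if __name__ == '__main__':
    files = sorted(glob.glob('trades_2013-01-*.csv.gz'))
    with Pool(processes=4) as pool:
        partials = pool.map(process_one_file, files)
    # aggregate the per-file results at the end
    totals = pd.concat(partials).groupby(level=0).sum()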