Working with Large Excel Files in Pandas

Today we are going to learn how to work with large files in Pandas, focusing on reading and analyzing an Excel file and then working with a subset of the original data.

This tutorial utilizes Python (tested with 64-bit versions of v2.7.9 and v3.4.3), Pandas (v0.16.1), and XlsxWriter (v0.7.3). We recommend using the Anaconda distribution to quickly get started, as it comes pre-installed with all the needed libraries.

This is a collaboration piece between Shantnu Tiwari, founder of Python For Engineers, and the fine folks at Real Python.

Reading the File

The first file we’ll work with is a compilation of all the car accidents in England from 1979 to 2004. From it, we’ll extract all accidents that happened in London in the year 2000.

Excel

Start by downloading the source ZIP file from data.gov.uk, and extract the contents. Then try to open Accidents7904.csv in Excel. Be careful. If you don’t have enough memory, this could very well crash your computer.

What happens?

You should see a “File Not Loaded Completely” error, since an Excel worksheet can hold at most 1,048,576 rows (roughly one million).
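
To check ahead of time whether a file will hit this limit, you can count its lines without opening it in a spreadsheet. The snippet below is a minimal sketch, not part of the original tutorial code: the helper names `count_rows` and `fits_in_excel` are our own, the path assumes the CSV was extracted into the working directory, and the constant is the per-worksheet row limit of Excel 2007 and later.

```python
import os

# Row limit of a single worksheet in Excel 2007 and later (.xlsx).
EXCEL_MAX_ROWS = 1048576


def count_rows(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def fits_in_excel(path):
    """Return True if every line of the file fits on one worksheet."""
    return count_rows(path) <= EXCEL_MAX_ROWS


# Assumed location of the extracted file; adjust to your setup.
csv_path = "Accidents7904.csv"
if os.path.exists(csv_path):
    print(count_rows(csv_path), fits_in_excel(csv_path))
```

Note that this counts physical lines, header included, so a file with quoted multi-line fields would be over-counted.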

We tested this in LibreOffice as well and received a similar error – “The data could not be loaded completely because the maximum number of rows per sheet was exceeded.”

To solve this, we can open the file in Pandas. Before we start, note that the source code is available on GitHub.
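
As a rough idea of what opening the file in Pandas can look like once the library is installed (see the next section), here is a minimal sketch. It is not the tutorial's own script: the `load_csv` helper and the chunk size are our assumptions, and reading in chunks is just one way to keep memory use predictable on a file this large.

```python
import os

import pandas as pd


def load_csv(path, chunksize=100000):
    """Read a large CSV piece by piece and combine it into one DataFrame."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of parsing the whole file in a single step.
    chunks = pd.read_csv(path, chunksize=chunksize)
    return pd.concat(list(chunks), ignore_index=True)


# Assumed location of the extracted file; adjust to your setup.
csv_path = "Accidents7904.csv"
if os.path.exists(csv_path):
    accidents = load_csv(csv_path)
    print(len(accidents))            # number of data rows
    print(list(accidents.columns))   # column names found in the header
```

Unlike a spreadsheet, a DataFrame has no fixed row cap; the practical limit is available memory.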

Pandas

Within a new project directory, activate a virtualenv, and then install Pandas:

pip install pandas==0.16.1

Now let’s build the scr
