Introduction to pandas
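pandas is a powerful Python library for structured data analysis. Its most commonly used object is the DataFrame, which resembles a database table or an Excel worksheet. It makes it easy to read two-dimensional data (xls, csv, hdf, and so on), to insert, delete, update, and query records, and to draw basic plots. For data analysis in Python, pandas is close to indispensable. The figure below shows an example DataFrame.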
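As a minimal sketch of what working with a DataFrame looks like in code, the snippet below builds a tiny table and then adds, queries, and drops data. The column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [10, 25, 40]})

# Add a derived column.
df["doubled"] = df["value"] * 2

# Query: keep only the rows whose value exceeds 20.
big = df[df["value"] > 20]
print(big["name"].tolist())  # ['b', 'c']

# Delete the derived column again.
df = df.drop(columns=["doubled"])
print(list(df.columns))  # ['name', 'value']
```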
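Introduction to geopandas
A DataFrame plays the same role as an attribute table in GIS data. geopandas was created to bring the features of pandas to spatial data; its goal is to make working with geographic data in Python easier.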
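To illustrate the step from an attribute table to spatial data, here is a small sketch that turns a plain DataFrame with coordinate columns into a GeoDataFrame. The column names (name, x, y) and the coordinates are assumptions chosen for the example; points_from_xy is the geopandas helper that builds one point per row.

```python
import pandas as pd
import geopandas
from shapely.geometry import Point

# Hypothetical attribute table; names and coordinates are illustrative only.
df = pd.DataFrame({"name": ["p1", "p2"], "x": [0.0, 3.0], "y": [0.0, 4.0]})

# Build one point per row and attach the points as the geometry column.
gdf = geopandas.GeoDataFrame(
    df, geometry=geopandas.points_from_xy(df["x"], df["y"])
)

# A GeoDataFrame is still a pandas DataFrame, with spatial methods added.
print(isinstance(gdf, pd.DataFrame))  # True
print(gdf.geometry.distance(Point(0, 0)).tolist())  # [0.0, 5.0]
```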
Below we open a map stored locally in .shp format:
import geopandas

# Path to a local shapefile; adjust it to your own data.
tpath = 'D:/shapefile/province.shp'
# Read the file, decoding the attribute text as GB18030.
shp_df = geopandas.read_file(tpath, encoding='gb18030')
shp_df.plot()  # draw the geometries
print(shp_df.head())  # show the first five rows
The output looks like this:
GeoPandas implements two main data structures, GeoSeries and GeoDataFrame. They are subclasses of the pandas Series and DataFrame, respectively.
A GeoSeries is essentially a vector in which each entry is a set of shapes corresponding to one observation. An entry may consist of a single shape (such as one polygon) or of multiple shapes that are meant to be treated as one observation (such as the many polygons that make up the State of Hawaii, or a country like Indonesia).
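The idea of "one entry, one observation" can be sketched with a GeoSeries whose first entry is a single polygon and whose second entry is a MultiPolygon. The two squares are arbitrary shapes chosen for illustration.

```python
import pandas as pd
import geopandas
from shapely.geometry import MultiPolygon, Polygon

# Two arbitrary unit squares, far apart from each other.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
far_square = Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])

# Entry 0 is one polygon; entry 1 is two polygons treated as one observation.
gs = geopandas.GeoSeries([square, MultiPolygon([square, far_square])])

print(isinstance(gs, pd.Series))  # True
print(len(gs))  # 2
print(gs.geom_type.tolist())  # ['Polygon', 'MultiPolygon']
print(gs.area.tolist())  # [1.0, 2.0]
```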
geopandas has three basic classes of geometric objects (which are actually shapely objects): points / multi-points, lines / multi-lines, and polygons / multi-polygons.
Note that the entries in a GeoSeries need not all be of the same geometric type, although certain export operations will fail if they are not.
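A short sketch of a mixed-type GeoSeries follows. Filtering by geom_type before exporting is one possible workaround suggested here, not something the original text prescribes.

```python
import geopandas
from shapely.geometry import LineString, Point, Polygon

# One GeoSeries holding three different geometry types.
mixed = geopandas.GeoSeries(
    [
        Point(0, 0),
        LineString([(0, 0), (1, 1)]),
        Polygon([(0, 0), (1, 0), (1, 1)]),
    ]
)
print(mixed.geom_type.tolist())  # ['Point', 'LineString', 'Polygon']

# Possible workaround before an export that needs a single type:
# keep only the entries of one geometry type.
polygons_only = mixed[mixed.geom_type == "Polygon"]
print(len(polygons_only))  # 1
```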
The GeoSeries class implements nearly all of the attributes and methods of Shapely objects. When applied to a GeoSeries, they are applied elementwise to all geometries in the series. Binary operations can be applied between two GeoSeries, in which case the operation is carried out elementwise and the two series are aligned by matching indices. Binary operations can also be applied to a single geometry, in which case the operation is carried out between each element of the series and that geometry. In either case, a Series or a GeoSeries is returned, as appropriate.
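The behaviour described above can be sketched as follows. The points and index labels are arbitrary; b is deliberately stored in a different row order so that the pairing by index label is visible. geopandas may emit a warning that the two indices differ, which is expected here.

```python
import geopandas
from shapely.geometry import Point

a = geopandas.GeoSeries([Point(0, 0), Point(1, 0)], index=[0, 1])
# Same labels as `a`, but stored in the opposite order.
b = geopandas.GeoSeries([Point(1, 4), Point(0, 3)], index=[1, 0])

# GeoSeries vs GeoSeries: entries are paired by index label, not by position.
pairwise = a.distance(b)
print(pairwise.loc[0], pairwise.loc[1])  # 3.0 4.0

# GeoSeries vs a single geometry: every entry is compared with that geometry.
to_origin = a.distance(Point(0, 0))
print(to_origin.tolist())  # [0.0, 1.0]

# Operations that produce numbers return a plain Series;
# operations that produce geometries return a GeoSeries.
buffered = a.buffer(1.0)
print(type(pairwise).__name__, type(buffered).__name__)  # Series GeoSeries
```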
A short summary of a few GeoSeries attributes and methods follows; the full list is on the “all attributes and methods” page of the geopandas documentation. There is also a family of methods for creating new shapes by expanding existing shapes or by applying set-theoretic operations such as “union”, described under geometric manipulations.
Attributes
area: shape area, in the units of the projection (see projections)
bounds: the minimum and maximum coordinates on each axis for each shape (one row of minx, miny, maxx, maxy per shape)
total_bounds: the minimum and maximum coordinates on each axis for the entire GeoSeries
geom_type: the type of geometry
is_valid: tests whether the coordinates form a valid geometric shape
Basic Methods
distance(other): returns a Series with the minimum distance from each entry to other
centroid: returns a GeoSeries of centroids
representative_point(): returns a GeoSeries of points that are guaranteed to lie within each geometry; it does NOT return centroids
to_crs(): changes the coordinate reference system (see projections)
plot(): plots the GeoSeries (see mapping)
Relationship Tests
geom_almost_equals(other): is the shape almost the same as other? (useful when floating-point precision issues make shapes slightly different)
contains(other): does the shape contain other?
intersects(other): does the shape intersect other?
A GeoDataFrame is a tabular data structure that contains a GeoSeries.
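To make the attribute and method summary above concrete, here is a small sketch on a GeoSeries of two squares. The shapes and test points are arbitrary, and no coordinate reference system is set, so areas and distances are in plain coordinate units.

```python
import geopandas
from shapely.geometry import Point, Polygon

gs = geopandas.GeoSeries(
    [
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),  # 2 x 2 square
        Polygon([(4, 0), (5, 0), (5, 1), (4, 1)]),  # 1 x 1 square
    ]
)

# Attributes
print(gs.area.tolist())  # [4.0, 1.0]
print(gs.bounds)  # one row of minx, miny, maxx, maxy per shape
print(gs.total_bounds.tolist())  # [0.0, 0.0, 5.0, 2.0]
print(gs.geom_type.tolist())  # ['Polygon', 'Polygon']
print(gs.is_valid.tolist())  # [True, True]

# Basic methods
print(gs.distance(Point(3, 1)).tolist())  # [1.0, 1.0]
centroids = gs.centroid
inner_points = gs.representative_point()

# Relationship tests
print(gs.contains(Point(1, 1)).tolist())  # [True, False]
print(gs.intersects(Point(4.5, 0.5)).tolist())  # [False, True]
```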
The most important property of a GeoDataFrame is that it always has one GeoSeries column that holds a special status. This GeoSeries is referred to as the GeoDataFrame’s “geometry”. When a spatial method is applied to a GeoDataFrame (or a spatial attribute such as area is accessed), the command always acts on the “geometry” column.
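A minimal sketch of this behaviour: accessing area on the frame gives the same result as accessing it on the geometry column. The frame and its name column are invented for the example.

```python
import geopandas
from shapely.geometry import Polygon

# Hypothetical two-row frame with a 1 x 1 and a 3 x 3 square.
gdf = geopandas.GeoDataFrame(
    {"name": ["small", "large"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(0, 0), (3, 0), (3, 3), (0, 3)]),
    ],
)

# The spatial attribute on the frame is forwarded to the geometry column.
print(gdf.area.tolist())  # [1.0, 9.0]
print(gdf.geometry.area.tolist())  # [1.0, 9.0]
```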
The “geometry” column, whatever its name, can be accessed through the geometry attribute (gdf.geometry), and the name of the geometry column can be found with gdf.geometry.name.
A GeoDataFrame may also contain other columns with geometrical (shapely) objects, but only one column can be the active geometry at a time. To change which column is the active geometry column, use the set_geometry method.
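Let's look at an example from the official documentation (it relies on geopandas.datasets, which was removed in geopandas 1.0, so it needs an older release):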
In [1]: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
In [2]: world.head()
Out[2]:
pop_est ... geometry
0 28400000.0 ... POLYGON ((61.21081709172574 35.65007233330923,...
1 12799293.0 ... (POLYGON ((16.32652835456705 -5.87747039146621...
2 3639453.0 ... POLYGON ((20.59024743010491 41.85540416113361,...
3 4798491.0 ... POLYGON ((51.57951867046327 24.24549713795111,...
4 40913584.0 ... (POLYGON ((-65.50000000000003 -55.199999999999...
[5 rows x 6 columns]
#Plot countries
In [3]: world.plot();
Currently, the column named “geometry”, which holds the country borders, is the active geometry column:
In [4]: world.geometry.name
Out[4]: 'geometry'
We can also rename this column to “borders”:
In [5]: world = world.rename(columns={'geometry': 'borders'}).set_geometry('borders')
In [6]: world.geometry.name
Out[6]: 'borders'
Now we create centroids and make them the active geometry:
In [7]: world['centroid_column'] = world.centroid
In [8]: world = world.set_geometry('centroid_column')
In [9]: world.plot();
Note: A GeoDataFrame keeps track of the active column by name, so if you rename the active geometry column, you must also reset the geometry:
gdf = gdf.rename(columns={'old_name': 'new_name'}).set_geometry('new_name')
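The one-liner above can be tried on a small self-contained frame, as sketched here. The column names id and location are made up. The rename_geometry method shown at the end is not mentioned in the original text; it is a geopandas convenience that renames the active geometry column and keeps it active in one step.

```python
import geopandas
from shapely.geometry import Point

# Hypothetical frame; the geometry column gets the default name "geometry".
gdf = geopandas.GeoDataFrame({"id": [1, 2]}, geometry=[Point(0, 0), Point(1, 1)])
print(gdf.geometry.name)  # geometry

# Rename the column, then tell the frame which column is the geometry.
renamed = gdf.rename(columns={"geometry": "location"}).set_geometry("location")
print(renamed.geometry.name)  # location

# Alternative (not from the original text): rename and keep active in one call.
renamed_too = gdf.rename_geometry("location")
print(renamed_too.geometry.name)  # location
```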
Note 2: Somewhat confusingly, when you use the read_file command, the column containing spatial objects from the file is named “geometry” by default and is set as the active geometry column. Although the column name and the special attribute that tracks the active column share the same term, they are distinct. You can easily shift the active geometry column to a different GeoSeries with the set_geometry command. Further, gdf.geometry always returns the active geometry column, not the column named geometry. If you want the column named “geometry” while a different column is the active geometry column, use gdf['geometry'], not gdf.geometry.
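The distinction can be sketched with a frame that has a polygon column named geometry and a second, active column of centroids. The column name centroid_column follows the documentation example; the polygon itself is arbitrary.

```python
import geopandas
from shapely.geometry import Polygon

gdf = geopandas.GeoDataFrame(
    {"name": ["a"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
)

# Add a second geometry column and make it the active one.
gdf["centroid_column"] = gdf.centroid
gdf = gdf.set_geometry("centroid_column")

# gdf.geometry follows the active column ...
print(gdf.geometry.name)  # centroid_column
print(gdf.geometry.geom_type.tolist())  # ['Point']

# ... while gdf['geometry'] is simply the column with that name.
print(gdf["geometry"].geom_type.tolist())  # ['Polygon']
```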
Any of the attribute calls or methods described for a GeoSeries will also work on a GeoDataFrame; in effect, they are simply applied to the “geometry” GeoSeries.
However, GeoDataFrames also have a few extra methods for input and output, which are described on the Input and Output page, and for geocoding, which are described on the Geocoding page.