[pandas] Tutorial 8: How to combine data from multiple tables

Combining data from multiple tables with pandas

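This section uses the data in data/air_quality_no2_long.csv; the later steps also read data/air_quality_pm25_long.csv, data/air_quality_stations.csv and data/air_quality_parameters.csv. The data set is available as "pandas案例和教程所使用的数据" (data used by the pandas examples and tutorials), a resource on CSDN文库.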

Loading the data

  • NO2
import pandas as pd 
air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
                            parse_dates=True)
air_quality_no2 = air_quality_no2[["date.utc", "location",
                                "parameter", "value"]]

air_quality_no2
                       date.utc            location parameter  value
0     2019-06-21 00:00:00+00:00             FR04014       no2   20.0
1     2019-06-20 23:00:00+00:00             FR04014       no2   21.8
2     2019-06-20 22:00:00+00:00             FR04014       no2   26.5
3     2019-06-20 21:00:00+00:00             FR04014       no2   24.9
4     2019-06-20 20:00:00+00:00             FR04014       no2   21.4
...                         ...                 ...       ...    ...
2063  2019-05-07 06:00:00+00:00  London Westminster       no2   26.0
2064  2019-05-07 04:00:00+00:00  London Westminster       no2   16.0
2065  2019-05-07 03:00:00+00:00  London Westminster       no2   19.0
2066  2019-05-07 02:00:00+00:00  London Westminster       no2   19.0
2067  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0

[2068 rows x 4 columns]
  • PM2.5
air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
                            parse_dates=True)
air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                    "parameter", "value"]]
air_quality_pm25
                       date.utc            location parameter  value
0     2019-06-18 06:00:00+00:00             BETR801      pm25   18.0
1     2019-06-17 08:00:00+00:00             BETR801      pm25    6.5
2     2019-06-17 07:00:00+00:00             BETR801      pm25   18.5
3     2019-06-17 06:00:00+00:00             BETR801      pm25   16.0
4     2019-06-17 05:00:00+00:00             BETR801      pm25    7.5
...                         ...                 ...       ...    ...
1105  2019-05-07 06:00:00+00:00  London Westminster      pm25    9.0
1106  2019-05-07 04:00:00+00:00  London Westminster      pm25    8.0
1107  2019-05-07 03:00:00+00:00  London Westminster      pm25    8.0
1108  2019-05-07 02:00:00+00:00  London Westminster      pm25    8.0
1109  2019-05-07 01:00:00+00:00  London Westminster      pm25    8.0

[1110 rows x 4 columns]
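
One detail of the loading step may be worth checking: as far as I know, parse_dates=True only asks pandas to try parsing the index, so the date.utc column is probably still read as text here. The sketch below uses a hypothetical two-row sample (csv_text, not the tutorial's file) to compare that with naming the column explicitly; it is an illustration under that assumption, not a required change.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the tutorial's CSV files.
csv_text = (
    "date.utc,location,parameter,value\n"
    "2019-06-21 00:00:00+00:00,FR04014,no2,20.0\n"
    "2019-06-20 23:00:00+00:00,FR04014,no2,21.8\n"
)

# parse_dates=True targets the index only, so the column stays text.
as_text = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column asks pandas to convert it to datetimes.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

print(as_text["date.utc"].dtype)
print(as_dates["date.utc"].dtype)
```

Sorting works in both cases for timestamps in this fixed format, but a real datetime column is needed for date arithmetic or resampling.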

Concatenating tables (concat)

  • Concatenate two tables that have the same structure
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)
air_quality
                       date.utc            location parameter  value
0     2019-06-18 06:00:00+00:00             BETR801      pm25   18.0
1     2019-06-17 08:00:00+00:00             BETR801      pm25    6.5
2     2019-06-17 07:00:00+00:00             BETR801      pm25   18.5
3     2019-06-17 06:00:00+00:00             BETR801      pm25   16.0
4     2019-06-17 05:00:00+00:00             BETR801      pm25    7.5
...                         ...                 ...       ...    ...
2063  2019-05-07 06:00:00+00:00  London Westminster       no2   26.0
2064  2019-05-07 04:00:00+00:00  London Westminster       no2   16.0
2065  2019-05-07 03:00:00+00:00  London Westminster       no2   19.0
2066  2019-05-07 02:00:00+00:00  London Westminster       no2   19.0
2067  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0

[3178 rows x 4 columns]

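concat joins the two tables. By default it concatenates along the rows (axis=0, so the number of rows grows); it can also concatenate along the columns (axis=1, so the number of columns grows). The original index labels are kept, which is why labels such as 0 to 4 appear in the result for both source tables.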
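
The difference between the two axes is easier to see on a small example. The sketch below uses two hypothetical miniature tables (pm25 and no2, not the tutorial's files); ignore_index=True is one possible way to renumber the rows when the repeated labels are not wanted.

```python
import pandas as pd

# Hypothetical miniature tables with the same columns as each other.
pm25 = pd.DataFrame({"location": ["BETR801", "BETR801"], "value": [18.0, 6.5]})
no2 = pd.DataFrame({"location": ["FR04014", "FR04014"], "value": [20.0, 21.8]})

# axis=0 (the default) stacks the rows; the original index labels are kept.
by_rows = pd.concat([pm25, no2], axis=0)

# axis=1 places the tables side by side, aligning them on the index labels.
by_cols = pd.concat([pm25, no2], axis=1)

# ignore_index=True replaces the repeated labels with a fresh 0..n-1 index.
renumbered = pd.concat([pm25, no2], axis=0, ignore_index=True)

print(by_rows.shape, by_cols.shape)   # (4, 2) (2, 4)
print(list(by_rows.index))            # [0, 1, 0, 1]
print(list(renumbered.index))         # [0, 1, 2, 3]
```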

  • Sort by date
air_quality = air_quality.sort_values("date.utc")
air_quality.head()
                       date.utc            location parameter  value
2067  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0
1003  2019-05-07 01:00:00+00:00             FR04014       no2   25.0
100   2019-05-07 01:00:00+00:00             BETR801      pm25   12.5
1098  2019-05-07 01:00:00+00:00             BETR801       no2   50.5
1109  2019-05-07 01:00:00+00:00  London Westminster      pm25    8.0

Joining tables on a common identifier (merge)

stations_coord = pd.read_csv("data/air_quality_stations.csv")
stations_coord
                      location  coordinates.latitude  coordinates.longitude
0                      BELAL01              51.23619                4.38522
1                      BELHB23              51.17030                4.34100
2                      BELLD01              51.10998                5.00486
3                      BELLD02              51.12038                5.02155
4                      BELR833              51.32766                4.36226
..                         ...                   ...                    ...
61             Southend-on-Sea              51.54420                0.67841
62  Southwark A2 Old Kent Road              51.48050               -0.05955
63                    Thurrock              51.47707                0.31797
64      Tower Hamlets Roadside              51.52253               -0.04216
65        Groton Fort Griswold              41.35360              -72.07890

[66 rows x 3 columns]
air_quality_merge = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality_merge
                       date.utc            location parameter  value  \
0     2019-05-07 01:00:00+00:00  London Westminster       no2   23.0   
1     2019-05-07 01:00:00+00:00             FR04014       no2   25.0   
2     2019-05-07 01:00:00+00:00             FR04014       no2   25.0   
3     2019-05-07 01:00:00+00:00             BETR801      pm25   12.5   
4     2019-05-07 01:00:00+00:00             BETR801       no2   50.5   
...                         ...                 ...       ...    ...   
4177  2019-06-20 23:00:00+00:00             FR04014       no2   21.8   
4178  2019-06-20 23:00:00+00:00             FR04014       no2   21.8   
4179  2019-06-21 00:00:00+00:00  London Westminster      pm25    7.0   
4180  2019-06-21 00:00:00+00:00             FR04014       no2   20.0   
4181  2019-06-21 00:00:00+00:00             FR04014       no2   20.0   

      coordinates.latitude  coordinates.longitude  
0                 51.49467               -0.13193  
1                 48.83724                2.39390  
2                 48.83722                2.39390  
3                 51.20966                4.43182  
4                 51.20966                4.43182  
...                    ...                    ...  
4177              48.83724                2.39390  
4178              48.83722                2.39390  
4179              51.49467               -0.13193  
4180              48.83724                2.39390  
4181              48.83722                2.39390  

[4182 rows x 6 columns]

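With merge, the matching coordinates from the stations_coord table are added to each row of the air_quality table. Both tables share the location column, which serves as the key for combining the information. With how="left", every row of air_quality (the left table) is kept. Note that the result has 4182 rows while air_quality has 3178: in the output, each FR04014 measurement appears twice with slightly different latitudes (48.83724 and 48.83722), which suggests that stations_coord lists FR04014 more than once, and a left merge repeats a row once for every matching key. The merge function supports several join options, similar to database-style operations.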
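
To make the behaviour of a left merge concrete, here is a minimal sketch on hypothetical miniature tables (measurements and stations are made-up names and the rows are only a small sample in the spirit of the data above). It shows a row being repeated for a duplicated key, a missing key producing NaN, and the validate argument of pd.merge as one possible guard against unexpected duplicates.

```python
import pandas as pd

# Hypothetical miniature versions of the measurement and station tables.
measurements = pd.DataFrame({
    "location": ["London Westminster", "FR04014", "BETR801"],
    "value": [23.0, 25.0, 12.5],
})
stations = pd.DataFrame({
    "location": ["London Westminster", "FR04014", "FR04014"],
    "coordinates.latitude": [51.49467, 48.83724, 48.83722],
})

# how="left" keeps every row of the left table; a key that matches twice
# in the right table is repeated, and a key with no match gets NaN.
merged = pd.merge(measurements, stations, how="left", on="location")
print(len(measurements), len(merged))   # 3 4

# validate="many_to_one" makes pandas raise instead of silently adding rows
# when the keys of the right table are not unique.
try:
    pd.merge(measurements, stations, how="left", on="location",
             validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)          # True
```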

air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality_merge2 = pd.merge(air_quality_merge, air_quality_parameters, 
                            how="left", left_on="parameter", right_on='id')
air_quality_merge2
                      date.utc            location parameter  value  \
0     2019-05-07 01:00:00+00:00  London Westminster       no2   23.0   
1     2019-05-07 01:00:00+00:00             FR04014       no2   25.0   
2     2019-05-07 01:00:00+00:00             FR04014       no2   25.0   
3     2019-05-07 01:00:00+00:00             BETR801      pm25   12.5   
4     2019-05-07 01:00:00+00:00             BETR801       no2   50.5   
...                         ...                 ...       ...    ...   
4177  2019-06-20 23:00:00+00:00             FR04014       no2   21.8   
4178  2019-06-20 23:00:00+00:00             FR04014       no2   21.8   
4179  2019-06-21 00:00:00+00:00  London Westminster      pm25    7.0   
4180  2019-06-21 00:00:00+00:00             FR04014       no2   20.0   
4181  2019-06-21 00:00:00+00:00             FR04014       no2   20.0   

      coordinates.latitude  coordinates.longitude    id  \
0                 51.49467               -0.13193   no2   
1                 48.83724                2.39390   no2   
2                 48.83722                2.39390   no2   
3                 51.20966                4.43182  pm25   
4                 51.20966                4.43182   no2   
...                    ...                    ...   ...   
4177              48.83724                2.39390   no2   
4178              48.83722                2.39390   no2   
4179              51.49467               -0.13193  pm25   
4180              48.83724                2.39390   no2   
4181              48.83722                2.39390   no2   
...
4179  Particulate matter less than 2.5 micrometers i...  PM2.5  
4180                                   Nitrogen Dioxide    NO2  
4181                                   Nitrogen Dioxide    NO2  

[4182 rows x 9 columns]
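
In this second merge the key has a different name in each table (parameter on the left, id on the right), so left_on and right_on are used instead of on. The sketch below repeats the idea on hypothetical miniature tables; dropping the redundant id column afterwards is an optional tidy-up, not something the tutorial does.

```python
import pandas as pd

measurements = pd.DataFrame({
    "parameter": ["no2", "pm25", "no2"],
    "value": [23.0, 12.5, 50.5],
})
# Hypothetical lookup table; its key column is called "id", not "parameter".
parameters = pd.DataFrame({
    "id": ["no2", "pm25"],
    "name": ["NO2", "PM2.5"],
})

# left_on / right_on name the key column in each table separately.
merged = pd.merge(measurements, parameters, how="left",
                  left_on="parameter", right_on="id")
print(list(merged.columns))   # ['parameter', 'value', 'id', 'name']

# Both key columns hold the same values here, so one of them can be dropped.
tidy = merged.drop(columns="id")
print(list(tidy.columns))     # ['parameter', 'value', 'name']
```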

Remember

Multiple tables can be concatenated with the concat function, either row-wise or column-wise.

For database-like merging/joining of tables, use the merge function.

References

  1. How to combine data from multiple tables? — pandas 1.5.2 documentation (pydata.org)
