《Pandas 1.x Cookbook · 第二版》第03章 创建和持久化DataFrame

第01章 Pandas基础
第02章 DataFrame基础运算
第03章 创建和持久化DataFrame
第04章 开始数据分析
第05章 探索性数据分析
第06章 选取数据子集
第07章 过滤行
第08章 索引对齐


3.1 创建DataFrame

使用平行的列表创建DataFrame

>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]

创建字典:

>>> people = {"first": fname, "last": lname, "birth": birth}

用该字典创建DataFrame:

>>> beatles = pd.DataFrame(people)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

原理

当调用DataFrame构造器时,Pandas会创建一个RangeIndex对象:

>>> beatles.index
RangeIndex(start=0, stop=4, step=1)

重新指定索引:

>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
     first       last  birth
a     Paul  McCartney   1942
b     John     Lennon   1940
c  Richard    Starkey   1940
d   George   Harrison   1943

更多

还可以用字典构成的列表构建DataFrame:

>>> pd.DataFrame(
...     [
...         {
...             "first": "Paul",
...             "last": "McCartney",
...             "birth": 1942,
...         },
...         {
...             "first": "John",
...             "last": "Lennon",
...             "birth": 1940,
...         },
...         {
...             "first": "Richard",
...             "last": "Starkey",
...             "birth": 1940,
...         },
...         {
...             "first": "George",
...             "last": "Harrison",
...             "birth": 1943,
...         },
...     ]
... )
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison

使用columns参数指定列的顺序:

>>> pd.DataFrame(
...     [
...         {
...             "first": "Paul",
...             "last": "McCartney",
...             "birth": 1942,
...         },
...         {
...             "first": "John",
...             "last": "Lennon",
...             "birth": 1940,
...         },
...         {
...             "first": "Richard",
...             "last": "Starkey",
...             "birth": 1940,
...         },
...         {
...             "first": "George",
...             "last": "Harrison",
...             "birth": 1943,
...         },
...     ],
...     columns=["last", "first", "birth"],
... )
        last    first  birth
0  McCartney     Paul   1942
1     Lennon     John   1940
2    Starkey  Richard   1940
3   Harrison   George   1943

3.2 写入CSV

使用.to_csv方法将DataFrame写入CSV文件:

>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # 这里写入StringIO,也可以直接传入文件名

查看文件内容:

>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943

更多

如果读取刚刚保存的CSV,会读入列名为Unnamed: 0的冗余列:

>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943

使用index_col参数,可以指定哪一列作为行索引:

>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

如果保存CSV文件时不想保存行索引,可以将index参数设为False:

>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943
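
上面的做法可以组合成一个完整的往返示例:用index=False写出后,读回时无需index_col参数(下面的DataFrame是为演示构造的示例数据):

```python
import pandas as pd
from io import StringIO

# 构造示例DataFrame(非书中数据,仅用于演示往返读写)
df = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})

# index=False:写出的CSV中不含行索引列
buf = StringIO()
df.to_csv(buf, index=False)

# 读回时直接得到原始列,无需index_col参数
_ = buf.seek(0)
back = pd.read_csv(buf)
```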

3.3 读取大CSV文件

Pandas是在内存中处理文件的,通常来讲,内存的大小需要是文件大小的3至10倍。

这里使用的是diamonds数据集。使用nrows参数读取1000行数据。

>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
     carat      cut color clarity  ...  price     x     y     z
0     0.23    Ideal     E     SI2  ...    326  3.95  3.98  2.43
1     0.21  Premium     E     SI1  ...    326  3.89  3.84  2.31
2     0.23     Good     E     VS1  ...    327  4.05  4.07  2.31
3     0.29  Premium     I     VS2  ...    334  4.20  4.23  2.63
4     0.31     Good     J     SI2  ...    335  4.34  4.35  2.75
..     ...      ...   ...     ...  ...    ...   ...   ...   ...
995   0.54    Ideal     D    VVS2  ...   2897  5.30  5.34  3.26
996   0.72    Ideal     E     SI1  ...   2897  5.69  5.74  3.57
997   0.72     Good     F     VS1  ...   2897  5.82  5.89  3.48
998   0.74  Premium     D     VS2  ...   2897  5.81  5.77  3.58
999   1.12  Premium     J     SI2  ...   2898  6.68  6.61  4.03

使用.info方法查看消耗的内存量:

>>> diamonds.info()

RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float64
cut        1000 non-null object
color      1000 non-null object
clarity    1000 non-null object
depth      1000 non-null float64
table      1000 non-null float64
price      1000 non-null int64
x          1000 non-null float64
y          1000 non-null float64
z          1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB

可以看到1000行数据使用了78.2KB内存。如果有10亿行数据,则要占用78GB的内存。
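
这种外推可以直接用.memory_usage方法计算(下面的DataFrame是演示用的示例数据;注意deep=True才会统计object列中Python字符串的真实开销):

```python
import pandas as pd

# 构造一个小样本(示例数据,非书中的diamonds数据集)
df = pd.DataFrame({
    "price": range(1000),
    "cut": ["Ideal"] * 1000,
})

# deep=True 统计 object 列中 Python 字符串对象的实际大小
bytes_per_row = df.memory_usage(deep=True).sum() / len(df)

# 按行数线性外推到10亿行的占用(单位:GB)
estimated_gb = bytes_per_row * 1_000_000_000 / 1024**3
```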

使用dtype参数,设置读取的数值类型:

>>> diamonds2 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "x": np.float32,
...         "y": np.float32,
...         "z": np.float32,
...         "price": np.int16,
...     },
... )
>>> diamonds2.info()

RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float32
cut        1000 non-null object
color      1000 non-null object
clarity    1000 non-null object
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
x          1000 non-null float32
y          1000 non-null float32
z          1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB

改变了数值类型,对比下新的DataFrame和原先的DataFrame:

>>> diamonds.describe()
             carat        depth  ...            y            z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean      0.689280    61.722800  ...     5.599180     3.457530
std       0.195291     1.758879  ...     0.611974     0.389819
min       0.200000    53.000000  ...     3.750000     2.270000
25%       0.700000    60.900000  ...     5.630000     3.450000
50%       0.710000    61.800000  ...     5.760000     3.550000
75%       0.790000    62.600000  ...     5.910000     3.640000
max       1.270000    69.500000  ...     7.050000     4.330000
>>> diamonds2.describe()
             carat        depth  ...            y            z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean      0.689453    61.718750  ...     5.601562     3.457031
std       0.195312     1.759766  ...     0.611816     0.389648
min       0.199951    53.000000  ...     3.750000     2.269531
25%       0.700195    60.906250  ...     5.628906     3.449219
50%       0.709961    61.812500  ...     5.761719     3.550781
75%       0.790039    62.593750  ...     5.910156     3.640625
max       1.269531    69.500000  ...     7.050781     4.328125

可以看到,通过改变数值类型,节省了约37%的内存。
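
除了在dtype参数中手工指定类型,也可以在加载后用pd.to_numeric的downcast参数自动选出能容纳数据的最小类型(下面是演示用的示例数据):

```python
import pandas as pd

# 示例数据(非书中的diamonds数据集)
df = pd.DataFrame({"price": [326, 327, 2897], "carat": [0.23, 0.21, 1.12]})

# downcast="integer" 会选出能容纳所有值的最小整数类型
df["price"] = pd.to_numeric(df["price"], downcast="integer")

# downcast="float" 最小会降到 float32
df["carat"] = pd.to_numeric(df["carat"], downcast="float")
```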

使用dtype参数,将数据类型改为category。使用.value_counts先统计数据个数:

>>> diamonds2.cut.value_counts()
Ideal       333
Premium     290
Very Good   226
Good         89
Fair         62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E    240
F    226
G    139
D    129
H    125
I     95
J     46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1     306
VS2     218
VS1     159
SI2     154
VVS2     62
VVS1     58
I1       29
IF       14
Name: clarity, dtype: int64

因为这几列是低基数(不同取值很少)的,将其转换为category类型,可以再节省约40%的内存:

>>> diamonds3 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "x": np.float32,
...         "y": np.float32,
...         "z": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
... )
>>> diamonds3.info()

RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float32
cut        1000 non-null category
color      1000 non-null category
clarity    1000 non-null category
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
x          1000 non-null float32
y          1000 non-null float32
z          1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB

使用usecols参数,可以指定加载哪些列。这里忽略了x、y、z三列:

>>> cols = [
...     "carat",
...     "cut",
...     "color",
...     "clarity",
...     "depth",
...     "table",
...     "price",
... ]
>>> diamonds4 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
...     usecols=cols,
... )
>>> diamonds4.info()

RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat      1000 non-null float32
cut        1000 non-null category
color      1000 non-null category
clarity    1000 non-null category
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB

现在使用的内存约为原始的23%。

如果每次只处理数据的一部分,可以使用chunksize参数:

>>> cols = [
...     "carat",
...     "cut",
...     "color",
...     "clarity",
...     "depth",
...     "table",
...     "price",
... ]
>>> diamonds_iter = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
...     usecols=cols,
...     chunksize=200,
... )
>>> def process(df):
...     return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
...     process(chunk)
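
分块迭代时常见的做法是在各块上累加统计量,最后合并结果(下面用StringIO模拟一个小CSV,数据是演示用的):

```python
import pandas as pd
from io import StringIO

# 用StringIO模拟CSV文件(示例数据)
csv_data = StringIO("price\n1\n2\n3\n4\n5\n6\n")

# 逐块累加,避免一次性载入全部数据
total = 0
count = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["price"].sum()
    count += len(chunk)

mean_price = total / count
```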

因为CSV文件不保存数据类型,Pandas需要推断每列的数据类型是什么。如果一列的值都是整数,并且没有缺失值,则Pandas将其认定为int64。如果一列是数值类型,但不是整数,或存在缺失值,Pandas使用的是float64。这两种数据类型占用的内存比较大。例如,如果所有数都在200以下,你可以使用一个小的数据类型,比如np.int16(或np.int8,如果都是正数)。

如果某列都是非数值类型,Pandas会将其转换为object类型。object类型占用内存很多,因为它是将数据以Python字符串存储的,将类型改为category,可以大大节省空间,因为它对每个字符串只存储一次。
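
object与category的内存差异可以直接验证(下面的Series是演示用的示例数据):

```python
import pandas as pd

# 低基数的字符串列(示例数据):只有3种取值,重复1000次
s = pd.Series(["Ideal", "Premium", "Good"] * 1000)

# category 对每个不同的字符串只存储一次,外加小整数编码
cat = s.astype("category")

obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
```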

更多

如果价格使用int8,会导致丢失信息。你可以使用NumPy的iinfo函数列出NumPy整数类型的范围:

>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)

使用np.finfo可以查看浮点数类型的范围:

>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
      max=6.55040e+04, dtype=float16)

还可以用.memory_usage方法查询DataFrame或Series使用了多少字节。注意,其中是包含行索引的。另外,传入deep=True,可以查询带有对象类型的Series的内存用量:

>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413

一旦确定了数据类型,可以将数据以带有类型信息的二进制格式保存下来,比如Feather格式(Pandas通过pyarrow库读写):

>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")

另一种方法是使用Parquet格式。

>>> diamonds4.to_parquet("/tmp/d.pqt")

3.4 使用Excel文件

需要安装xlwt(写入.xls)或openpyxl(写入.xlsx)才能保存Excel文件。

使用.to_excel方法,进行存储:

>>> beatles.to_excel("/tmp/beat.xls")
>>> beatles.to_excel("/tmp/beat.xlsx")

使用read_excel读取Excel文件:

>>> beat2 = pd.read_excel("/tmp/beat.xls")
>>> beat2
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943

用参数index_col,指定行索引:

>>> beat2 = pd.read_excel("/tmp/beat.xls", index_col=0)
>>> beat2
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

Excel保存了数据类型:

>>> beat2.dtypes
first    object
last     object
birth     int64
dtype: object

更多

使用sheet_name参数命名表单:

>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
...     xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()

这个Excel文件包含两个表单,一个名字是All,一个名字是1940。


3.5 使用ZIP文件

如果CSV文件是ZIP文件中的唯一文件,可以直接使用read_csv函数:

>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
       barrels08  barrelsA08  ...  phevHwy  phevComb
0      15.695714         0.0  ...        0         0
1      29.964545         0.0  ...        0         0
2      12.207778         0.0  ...        0         0
3      29.964545         0.0  ...        0         0
4      17.347895         0.0  ...        0         0
...          ...         ...  ...      ...       ...
41139  14.982273         0.0  ...        0         0
41140  14.330870         0.0  ...        0         0
41141  15.695714         0.0  ...        0         0
41142  15.695714         0.0  ...        0         0
41143  18.311667         0.0  ...        0         0
>>> autos.modifiedOn.dtype
dtype('O')

CSV文件中的日期列是以字符串形式存储的。可以在调用read_csv时用parse_dates参数解析日期,另一种方法是加载后用to_datetime方法解析:

>>> autos = pd.read_csv(
...     "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0       2013-01-0...
1       2013-01-0...
2       2013-01-0...
3       2013-01-0...
4       2013-01-0...
            ...     
41139   2013-01-0...
41140   2013-01-0...
41141   2013-01-0...
41142   2013-01-0...
41143   2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
如果不使用parse_dates,该列读入后是字符串:

>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos.modifiedOn
0        Tue Jan 01 00:00:00 EST 2013
1        Tue Jan 01 00:00:00 EST 2013
2        Tue Jan 01 00:00:00 EST 2013
3        Tue Jan 01 00:00:00 EST 2013
4        Tue Jan 01 00:00:00 EST 2013
                     ...
39096    Tue Jan 01 00:00:00 EST 2013
39097    Tue Jan 01 00:00:00 EST 2013
39098    Tue Jan 01 00:00:00 EST 2013
39099    Tue Jan 01 00:00:00 EST 2013
39100    Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0       2013-01-01
1       2013-01-01
2       2013-01-01
3       2013-01-01
4       2013-01-01
           ...
39096   2013-01-01
39097   2013-01-01
39098   2013-01-01
39099   2013-01-01
39100   2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
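
如果事先知道日期的格式,给to_datetime传入format参数可以跳过逐条推断,解析大列时明显更快(下面是演示用的示例数据):

```python
import pandas as pd

# 示例日期字符串(非书中数据集)
s = pd.Series(["2013-01-01", "2013-01-02", "2013-01-03"])

# 指定format,避免pandas逐条猜测日期格式
parsed = pd.to_datetime(s, format="%Y-%m-%d")
```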

如果ZIP文件中有多个文件,可以使用zipfile模块。由于该数据集的第二行(即第一行数据)是调查问题本身,将其存入kag_questions,其余行存入survey:

>>> import zipfile
>>> with zipfile.ZipFile(
...     "data/kaggle-survey-2018.zip"
... ) as z:
...     print("\n".join(z.namelist()))
...     kag = pd.read_csv(
...         z.open("multipleChoiceResponses.csv")
...     )
...     kag_questions = kag.iloc[0]
...     survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1          2
Time from...          710        434
Q1                 Female       Male
Q1_OTHER_...           -1         -1
Q2                  45-49      30-34
Q3            United S...  Indonesia
...                   ...        ...
Q50_Part_5            NaN        NaN
Q50_Part_6            NaN        NaN
Q50_Part_7            NaN        NaN
Q50_Part_8            NaN        NaN
Q50_OTHER...           -1         -1

更多

如果压缩文件中只有一个文件,read_csv函数还可以直接读取GZIP、BZ2、XZ格式的文件。


3.6 使用数据库

创建SQLite数据库,存储Beatles信息:

>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
...     cur = con.cursor()
...     cur.execute("""DROP TABLE Band""")
...     cur.execute(
...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
...         fname TEXT, lname TEXT, birthyear INT)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         0, 'Paul', 'McCartney', 1942)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         1, 'John', 'Lennon', 1940)"""
...     )
...     _ = con.commit()

从数据库读取数据到DataFrame,这里使用的是SQLAlchemy:

>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
...     "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
...     "Band", sa_connection, index_col="id"
... )
>>> beat
   fname      lname  birthyear
id                            
0   Paul  McCartney       1942
1   John     Lennon       1940

使用SQL语句读取数据。可以使用SQLite或SQLAlchemy连接:

>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
  fname  birthyear
0  Paul       1942
1  John       1940
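
反过来,用.to_sql方法可以把DataFrame写回数据库(下面用内存SQLite和假设的表名Band2演示,数据是示例数据):

```python
import sqlite3
import pandas as pd

# 内存数据库;Band2是演示用的表名,数据非书中的beat.db
con = sqlite3.connect(":memory:")
df = pd.DataFrame(
    {"fname": ["Paul", "John"], "birthyear": [1942, 1940]}
)

# index=False:不把行索引写成一列
df.to_sql("Band2", con, index=False)

# 再用SQL读回验证
back = pd.read_sql("SELECT fname, birthyear FROM Band2", con)
```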

3.7 读取JSON

JSON数据的编码和加载:

>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}

使用read_json读取JSON数据,字典的键名会作为列名。

>>> beatles = pd.read_json(encoded)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

读写JSON时,Pandas的orient参数支持以下几种格式:

  • columns —— (默认)将列名映射为该列的值组成的列表;
  • records —— 行的列表。每行是一个字典,将列名映射到该行的值;
  • split —— columns映射到列名,index映射到行索引值,data映射到每行数据组成的列表;
  • index —— 将行索引值映射到行,每行是一个将列名映射到值的字典;
  • values —— 数据行组成的列表(每行也是列表),不包含列名和行索引的值;
  • table —— schema映射到DataFrame的纲要,data映射为字典组成的列表。

参考下面的代码:

>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
         0          1     2
0     Paul  McCartney  1942
1     John     Lennon  1940
2  Richard    Starkey  1940
3   George   Harrison  1943
>>> (
...     pd.read_json(values, orient="values").rename(
...         columns=dict(
...             enumerate(["first", "last", "birth"])
...         )
...     )
... )
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

更多

如果要将数据转换为字典,可以使用.to_dict方法,方便添加数据:

>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
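
.to_dict同样接受orient参数,records形式便于逐行处理或写入JSON行文件(下面是演示用的示例数据):

```python
import pandas as pd

# 示例数据(非书中的beat表)
df = pd.DataFrame({"fname": ["Paul", "John"], "birth": [1942, 1940]})

# 每行转成一个 列名->值 的字典
recs = df.to_dict(orient="records")
```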

3.8 读取HTML表格

可以使用Pandas读取HTML中的表格:

《Pandas 1.x Cookbook · 第二版》第03章 创建和持久化DataFrame_第2张图片
Wikipedia中的表格
# 译者注:en.wikipedia.org 需要科学上网
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51

一共读取了51个DataFrame,检查一下第一个:

>>> dfs[0]
  The Beatles discography The Beatles discography.1
0  The Beat...             The Beat...
1  Studio a...                      23
2  Live albums                       5
3  Compilat...                      53
4  Video al...                      15
5  Music vi...                      64
6          EPs                      21
7      Singles                      63
8     Mash-ups                       2
9     Box sets                      15

.read_html有一个match参数,可以是字符串或正则表达式。还有一个attrs参数,用于定位HTML标签。

检查HTML的table元素后,将Release列的内容分别提取到release_date和label两列:

>>> res = (
...     df.pipe(
...         lambda df_: df_[
...             ~df_.Title.str.startswith("Released")
...         ]
...     )
...     .assign(
...         release_date=lambda df_: pd.to_datetime(
...             df_.Release.str.extract(
...                 r"Released: (.*) Label"
...             )[0].str.replace(r"\[E\]", "")
...         ),
...         label=lambda df_: df_.Release.str.extract(
...             r"Label: (.*)"
...         ),
...     )
...     .loc[
...         :,
...         [
...             "Title",
...             "UK",
...             "AUS",
...             "CAN",
...             "FRA",
...             "GER",
...             "NOR",
...             "US",
...             "release_date",
...             "label",
...         ],
...     ]
... )
>>> res
          Title   UK  ... release_date        label
0   Please P...    1  ...   1963-03-22  Parlopho...
1   With the...    1  ...   1963-11-22  Parlopho...
2   Introduc...  NaN  ...   1964-01-10  Vee-Jay ...
3   Meet the...  NaN  ...   1964-01-20  Capitol ...
4   Twist an...  NaN  ...   1964-02-03  Capitol ...
..          ...  ...  ...          ...          ...
21  Magical ...   31  ...   1967-11-27  Parlopho...
22  The Beat...    1  ...   1968-11-22        Apple
23  Yellow S...    3  ...   1969-01-13  Apple (U...
24   Abbey Road    1  ...   1969-09-26        Apple
25    Let It Be    1  ...   1970-05-08        Apple

更多

还可以用read_html直接读取线上的CSV文件(GitHub会把CSV文件渲染为网页中的表格,可用attrs参数定位):

>>> url = "https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv"
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
    Unnamed: 0 quadrant     x     y
0          NaN        I  10.0  8.04
1          NaN        I  14.0  9.96
2          NaN        I   6.0  7.24
3          NaN        I   9.0  8.81
4          NaN        I   4.0  4.26
..         ...      ...   ...   ...
39         NaN       IV   8.0  6.58
40         NaN       IV   8.0  7.91
41         NaN       IV   8.0  8.47
42         NaN       IV   8.0  5.25
43         NaN       IV   8.0  6.89



在页面的HTML中,目标表格的标题和表头是这样的:

List of studio albums,[A] with selected chart positions and certifications
Title Release ...

虽然没有属性,但可以使用字符串List of studio albums来匹配,缺失值na_values用"—"表示:

>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
...     url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

使用前两行作为列名,但还是很乱:

>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
...     url,
...     match="List of studio albums",
...     na_values="—",
...     header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
          Title      Release  ... Peak chart positions Certifications
          Title      Release  ...             US[8][9] Certifications
0   Please P...  Released...  ...          NaN          BPI: Gol...
1   With the...  Released...  ...          NaN          BPI: Gol...
2   Introduc...  Released...  ...            2          RIAA: Pl...
3   Meet the...  Released...  ...            1          MC: Plat...
4   Twist an...  Released...  ...          NaN          MC: 3× P...
..          ...          ...  ...          ...                  ...
22  The Beat...  Released...  ...            1          BPI: 2× ...
23  Yellow S...  Released...  ...            2          BPI: Gol...
24   Abbey Road  Released...  ...            1          BPI: 2× ...
25    Let It Be  Released...  ...            1          BPI: Gol...
26  "—" deno...  "—" deno...  ...  "—" deno...          "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
  codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])

对于这种情况,最容易的方法是更新列索引:

>>> df = dfs[0]
>>> df.columns = [
...     "Title",
...     "Release",
...     "UK",
...     "AUS",
...     "CAN",
...     "FRA",
...     "GER",
...     "NOR",
...     "US",
...     "Certifications",
... ]
>>> df
          Title      Release  ...           US Certifications
0   Please P...  Released...  ...          NaN  BPI: Gol...
1   With the...  Released...  ...          NaN  BPI: Gol...
2   Introduc...  Released...  ...            2  RIAA: Pl...
3   Meet the...  Released...  ...            1  MC: Plat...
4   Twist an...  Released...  ...          NaN  MC: 3× P...
..          ...          ...  ...          ...          ...
22  The Beat...  Released...  ...            1  BPI: 2× ...
23  Yellow S...  Released...  ...            2  BPI: Gol...
24   Abbey Road  Released...  ...            1  BPI: 2× ...
25    Let It Be  Released...  ...            1  BPI: Gol...
26  "—" deno...  "—" deno...  ...  "—" deno...  "—" deno...

继续清理Release这列,在HTML中,代码是这样的:

A Hard Day's Night double-dagger