pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool
1. Reading Data from Text Files
import pandas as pd
1.1 Reading a CSV file, with a comma as the delimiter
file_path = "./ratings.csv"
datas = pd.read_csv(file_path)
datas.head()
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 296 | 5.0 | 1147880044 |
| 1 | 1 | 306 | 3.5 | 1147868817 |
| 2 | 1 | 307 | 5.0 | 1147868828 |
| 3 | 1 | 665 | 5.0 | 1147878820 |
| 4 | 1 | 899 | 3.5 | 1147868510 |
datas.shape
(25000095, 4)
datas.columns
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
datas.index
RangeIndex(start=0, stop=25000095, step=1)
datas.dtypes
userId int64
movieId int64
rating float64
timestamp int64
dtype: object
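The same calls can be tried without the ratings.csv file. This is a minimal sketch, assuming a small in-memory CSV string; `csv_text` and `sample` are hypothetical names, and the rows only mirror the first three rows shown above.

```python
import io

import pandas as pd

# Hypothetical sample: the first three rows shown above, as an in-memory CSV.
csv_text = """userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
1,307,5.0,1147868828
"""

# read_csv accepts a file-like object as well as a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types are inferred from the text
```

Because the types are inferred, a column such as rating that contains decimal points comes back as a float column, while the id columns stay integers.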
1.2 Reading a TXT file, with \t (tab) as the delimiter
file_path = "./demo.txt"
datas = pd.read_csv(file_path, sep='\t', header=None, names=['Year', 'month', 'day'])
datas
| | Year | month | day |
|---|---|---|---|
| 0 | 2019 | 1 | 2 |
| 1 | 2020 | 2 | 3 |
| 2 | 2021 | 3 | 7 |
| 3 | 2022 | 4 | 9 |
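The role of header=None and names in the call above can be sketched as follows. The in-memory string `txt` is a hypothetical stand-in for demo.txt, assumed to hold tab-separated values with no header row.

```python
import io

import pandas as pd

# Hypothetical stand-in for demo.txt: tab-separated values, no header row.
txt = "2019\t1\t2\n2020\t2\t3\n"

# header=None: the first line is data, not column names.
# names=[...] supplies the column labels instead.
demo = pd.read_csv(io.StringIO(txt), sep="\t", header=None,
                   names=["Year", "month", "day"])

# Without header=None, the first data row is consumed as the header.
no_header_arg = pd.read_csv(io.StringIO(txt), sep="\t")

print(len(demo), len(no_header_arg))
```

The second call silently loses one row of data, which is why files without a header line should be read with header=None.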
1.3 Reading an Excel file
file_path = "./books.xlsx"
datas = pd.read_excel(file_path)
datas
| | 年 | 月 | 日 |
|---|---|---|---|
| 0 | 2019 | 1 | 1 |
| 1 | 2020 | 2 | 2 |
| 2 | 2021 | 3 | 3 |
| 3 | 2022 | 4 | 4 |
1.4 Reading a MySQL table
!pip install sqlalchemy
1.4.1 Older approach (a raw pymysql connection)
import pymysql
conn = pymysql.connect(host='127.0.0.1', user='root', password='wangpeng', database='mydatabase', charset='utf8')
datas = pd.read_sql('select * from books', con=conn)
datas
D:\Software\anaconda3\lib\site-packages\pandas\io\sql.py:761: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
warnings.warn(
| | id | name | price |
|---|---|---|---|
| 0 | 1 | 数据结构 | 45.0 |
| 1 | 2 | 操作系统 | 48.0 |
| 2 | 3 | 计算机网络 | 56.0 |
| 3 | 4 | 计算机组成原理 | 54.0 |
| 4 | 5 | 编译原理 | 65.0 |
1.4.2 Newer approach (a SQLAlchemy engine)
from sqlalchemy import create_engine
sql_statement = 'select * from books'
engine = create_engine('mysql+pymysql://root:wangpeng@localhost:3306/mydatabase?charset=utf8')
datas = pd.read_sql(sql_statement, engine)
datas
| | id | name | price |
|---|---|---|---|
| 0 | 1 | 数据结构 | 45.0 |
| 1 | 2 | 操作系统 | 48.0 |
| 2 | 3 | 计算机网络 | 56.0 |
| 3 | 4 | 计算机组成原理 | 54.0 |
| 4 | 5 | 编译原理 | 65.0 |
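To experiment with pd.read_sql without a running MySQL server, an in-memory SQLite database can serve as a stand-in. This is a sketch under that assumption, not the setup used above: the table is created on the fly with two of the rows shown, and the price filter is added only to illustrate the params argument.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books (id, name, price) VALUES (?, ?, ?)",
    [(1, "数据结构", 45.0), (2, "操作系统", 48.0)],
)
conn.commit()

# params keeps values out of the SQL string; sqlite3 uses '?' placeholders.
books = pd.read_sql("SELECT * FROM books WHERE price > ?", conn, params=(46.0,))
conn.close()

print(books)
```

sqlite3 connections are among the connection types pandas supports directly, so this call does not raise the UserWarning seen with the raw pymysql connection. The placeholder syntax depends on the driver; pymysql, for example, uses %s rather than ?.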
2. DataFrame and Series
2.1 Series
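A Series is a one-dimensional, array-like object. It consists of a sequence of values (which may be of different data types) and an associated set of data labels (the index).
2.1.1 Creating a Series with the default index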
s1 = pd.Series(['hello', 1, True, 3.5])
s1
0 hello
1 1
2 True
3 3.5
dtype: object
s1.index
RangeIndex(start=0, stop=4, step=1)
s1.values
array(['hello', 1, True, 3.5], dtype=object)
2.1.2 Creating a Series with a specified index
s2 = pd.Series(['a', False, 5], index=[1, 2, 3])
s2
1 a
2 False
3 5
dtype: object
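With an integer index that does not start at 0, label-based and position-based access give different results. A minimal sketch of the difference, reusing the s2 definition above (`by_label` and `by_position` are illustrative names):

```python
import pandas as pd

s2 = pd.Series(["a", False, 5], index=[1, 2, 3])

# .loc selects by index label; .iloc selects by integer position.
by_label = s2.loc[1]      # the element whose label is 1 (the first element)
by_position = s2.iloc[1]  # the element at position 1 (the second element)

print(by_label, by_position)
```

Using .loc and .iloc explicitly avoids the ambiguity of plain s2[1] when the labels are themselves integers.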
2.1.3 Creating a Series from a Python dict
dict_one = {'name': 'wangpeng', 'age': 18, 'province': 'JiangXi'}
s3 = pd.Series(dict_one)
s3
name wangpeng
age 18
province JiangXi
dtype: object
s3['name']
'wangpeng'
s3[['name', 'age']]
name wangpeng
age 18
dtype: object
type(s3['name'])
str
2.2 DataFrame
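A DataFrame is a tabular data structure:
- Each column can hold a different value type (numeric, string, boolean, etc.)
- It has both a row index (index) and a column index (columns)
- It can be thought of as a dict of Series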
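The "dict of Series" view can be made concrete with a short sketch. The variable names `state`, `gdp`, and `frame` are illustrative only, and the values are a small made-up sample.

```python
import pandas as pd

# Two Series that share the same default index (0, 1).
state = pd.Series(["New York", "Michigan"])
gdp = pd.Series([14406, 13321])

# A dict of Series: keys become column labels, values are aligned on the index.
frame = pd.DataFrame({"State": state, "GDP": gdp})

# Each column keeps its own dtype, and selecting one returns a Series.
print(frame.dtypes)
print(type(frame["GDP"]))
```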
2.2.1 Creating a DataFrame from a dict of sequences
datas = {
    'State': ['New York', 'Michigan', 'Nevada', 'California', 'Florida'],
    'GDP': [14406, 13321, 10003, 12563, 15364]
}
df = pd.DataFrame(datas)
df
| | State | GDP |
|---|---|---|
| 0 | New York | 14406 |
| 1 | Michigan | 13321 |
| 2 | Nevada | 10003 |
| 3 | California | 12563 |
| 4 | Florida | 15364 |
df.dtypes
State object
GDP int64
dtype: object
df.columns
Index(['State', 'GDP'], dtype='object')
df.index
RangeIndex(start=0, stop=5, step=1)
2.3 Selecting a Series from a DataFrame
- Selecting a single row or a single column returns a pd.Series
- Selecting multiple rows or multiple columns returns a pd.DataFrame
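The two rules can be checked directly by inspecting the type of each result. A minimal sketch on a three-row frame (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"State": ["New York", "Michigan", "Nevada"],
                   "GDP": [14406, 13321, 10003]})

one_column = df["GDP"]      # single label     -> Series
one_row = df.loc[0]         # single row label -> Series
many_columns = df[["GDP"]]  # list of labels   -> DataFrame, even with one item
many_rows = df.loc[0:1]     # slice of rows    -> DataFrame

for result in (one_column, one_row, many_columns, many_rows):
    print(type(result).__name__)
```

What decides the return type is the form of the selector (a scalar versus a list or slice), not how many items end up in the result.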
2.3.1 Selecting columns
df['State']
0 New York
1 Michigan
2 Nevada
3 California
4 Florida
Name: State, dtype: object
df[['State', 'GDP']]
df.loc[0]
State New York
GDP 14406
Name: 0, dtype: object
df.loc[0:3]
| | State | GDP |
|---|---|---|
| 0 | New York | 14406 |
| 1 | Michigan | 13321 |
| 2 | Nevada | 10003 |
| 3 | California | 12563 |
df.loc[:, 'GDP']
0 14406
1 13321
2 10003
3 12563
4 15364
Name: GDP, dtype: int64
df.loc[:, 'State']
0 New York
1 Michigan
2 Nevada
3 California
4 Florida
Name: State, dtype: object
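Note that df.loc[0:3] above returned four rows: label-based slices include the end label, unlike ordinary Python slices. A short sketch comparing it with position-based .iloc, rebuilding the same frame so the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["New York", "Michigan", "Nevada", "California", "Florida"],
    "GDP": [14406, 13321, 10003, 12563, 15364],
})

by_label = df.loc[0:3]      # label slice: the end label 3 is included
by_position = df.iloc[0:3]  # position slice: the end position 3 is excluded

print(len(by_label), len(by_position))
```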