Data Analysis: pandasql

pandasql

pandasql (http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html) is a Python library written by Yhat that mimics the R package sqldf.

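The idea behind pandasql is to let you run SQL from Python, querying pandas DataFrames as if they were tables. For people who come from a SQL background, or who still "think in SQL", pandasql is a good way to combine the strengths of both languages.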
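To make the idea concrete, the sketch below runs a SQL query against a DataFrame without pandasql, by copying the frame into an in-memory SQLite database with the standard `sqlite3` module. The pandasql README states that it uses SQLite syntax, so this is only a rough approximation of what happens behind `sqldf`, not its actual implementation. The `pets` table and its values are made up for illustration.

```python
import sqlite3

import pandas as pd

# Made-up data, used only for this illustration.
pets = pd.DataFrame({"name": ["Rex", "Tom", "Ada"], "age": [3, 5, 2]})

# Copy the DataFrame into an in-memory SQLite table named "pets".
conn = sqlite3.connect(":memory:")
pets.to_sql("pets", conn, index=False)

# Run plain SQL against it and get a DataFrame back.
older = pd.read_sql_query(
    "SELECT name, age FROM pets WHERE age >= 3 ORDER BY age;", conn
)
conn.close()
print(older)
```

With pandasql the copy step is hidden: `sqldf` looks up the DataFrames named in the query and returns the result as a DataFrame.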

Loading the data

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
from pandasql import sqldf
from pandasql import load_meat, load_births
import re
iris = load_iris()  # load the iris dataset

Exploring the data with pandas

In [4]:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)  # build a DataFrame from the raw data
print(iris.feature_names)  # feature names (a list)
print(iris_df.columns)
iris_df.head()
 
       
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
Out[4]:
  sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Querying the data with pandasql

In [7]:
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
iris_df.columns = [re.sub("[() ]", "", col) for col in iris_df.columns]
print("==============================SELECT * FROM iris_df LIMIT 10==============================")
print(sqldf("SELECT * FROM iris_df LIMIT 10;", locals()))
print("==================SELECT sepalwidthcm, species FROM iris_df LIMIT 10======================")
print(sqldf("SELECT sepalwidthcm, species FROM iris_df LIMIT 10;", locals()))
 
       
==============================SELECT * FROM iris_df LIMIT 10==============================
  sepallengthcm sepalwidthcm petallengthcm petalwidthcm species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
==================SELECT sepalwidthcm, species FROM iris_df LIMIT 10======================
  sepalwidthcm species
0 3.5 setosa
1 3.0 setosa
2 3.2 setosa
3 3.1 setosa
4 3.6 setosa
5 3.9 setosa
6 3.4 setosa
7 3.4 setosa
8 2.9 setosa
9 3.1 setosa
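The cell above strips parentheses and spaces from the column names before querying, so that each name can be used in SQL as a plain identifier without quoting. A minimal sketch of that renaming step, followed by the pandas counterpart of a `SELECT ... LIMIT` query; the three rows are the first iris rows shown earlier, and the variable names are chosen for illustration only:

```python
import re

import pandas as pd

raw_columns = ["sepal length (cm)", "sepal width (cm)"]

# Remove "(", ")" and spaces so the names work as bare SQL identifiers.
clean_columns = [re.sub("[() ]", "", col) for col in raw_columns]
print(clean_columns)

df = pd.DataFrame([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]], columns=clean_columns)

# pandas counterpart of: SELECT sepalwidthcm FROM df LIMIT 2;
first_two = df[["sepalwidthcm"]].head(2)
print(first_two)
```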

Computing statistics with pandasql

In [10]:
grpSql = """
      select
        species
        , avg(sepalwidthcm)
        , min(sepalwidthcm)
        , max(sepalwidthcm)
      from
        iris_df
      group by
        species;     
"""
print("*" * 80)
print("sql : ")
print(grpSql)
print("*" * 80)
print(sqldf(grpSql, locals()))
 
       
********************************************************************************
sql : 
      select
        species
        , avg(sepalwidthcm)
        , min(sepalwidthcm)
        , max(sepalwidthcm)
      from
        iris_df
      group by
        species;
********************************************************************************
  species avg(sepalwidthcm) min(sepalwidthcm) max(sepalwidthcm)
0 setosa 3.418 2.3 4.4
1 versicolor 2.770 2.0 3.4
2 virginica 2.974 2.2 3.8
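For comparison, the same kind of grouped summary can be written directly in pandas with `groupby` and `agg`. This is a sketch on a few made-up rows standing in for `iris_df`, so the numbers differ from the output above:

```python
import pandas as pd

# Made-up rows standing in for iris_df; only the column names follow the post.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepalwidthcm": [3.5, 3.0, 3.0, 2.5],
})

# pandas counterpart of:
#   SELECT species, avg(sepalwidthcm), min(sepalwidthcm), max(sepalwidthcm)
#   FROM df GROUP BY species;
stats = (
    df.groupby("species")["sepalwidthcm"]
    .agg(["mean", "min", "max"])
    .reset_index()
)
print(stats)
```

Which form reads better is largely a matter of background: the SQL version names the aggregates in the select list, while the pandas version chains them onto the grouped column.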

Computing statistics through a pandasql helper function

In [13]:
def pysqldf(sql):
    "add this to your script if you get tired of calling locals()"
    return sqldf(sql, globals())
print("*" * 80)
print("calling from a helper function")
print('''def pysqldf(sql):
    "add this to your script if you get tired of calling locals()"
    return sqldf(sql, globals())''')
print("*" * 80)
print(grpSql)
print("*" * 80)
print(pysqldf(grpSql))

********************************************************************************
calling from a helper function
def pysqldf(sql):
    "add this to your script if you get tired of calling locals()"
    return sqldf(sql, globals())
********************************************************************************
      select
        species
        , avg(sepalwidthcm)
        , min(sepalwidthcm)
        , max(sepalwidthcm)
      from
        iris_df
      group by
        species;
********************************************************************************
  species avg(sepalwidthcm) min(sepalwidthcm) max(sepalwidthcm)
0 setosa 3.418 2.3 4.4
1 versicolor 2.770 2.0 3.4
2 virginica 2.974 2.2 3.8
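The helper passes `globals()` rather than `locals()` because, inside a function, `locals()` only holds that function's own variables (here just `sql`), so a DataFrame defined at the top level of the script would not be found there. A small sketch of that scoping behaviour, assuming it is run as a script or notebook cell; `demo_table` and `visible_scopes` are hypothetical names, and a string stands in for a real DataFrame:

```python
# Module-level name, playing the role of a DataFrame such as iris_df.
demo_table = "stands in for a DataFrame"


def visible_scopes(sql):
    """Report where the module-level name can be seen from inside a function."""
    return {
        "in_locals": "demo_table" in locals(),    # function scope: only `sql`
        "in_globals": "demo_table" in globals(),  # module scope
    }


result = visible_scopes("SELECT * FROM demo_table;")
print(result)
```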
In [15]:
births = load_births()
births.head()
Out[15]:
  date births
0 1975-01-01 265775
1 1975-02-01 241045
2 1975-03-01 268849
3 1975-04-01 247455
4 1975-05-01 254545
In [16]:
meat = load_meat()
meat.head()
Out[16]:
  date beef veal pork lamb_and_mutton broilers other_chicken turkey
0 1944-01-01 751.0 85.0 1280.0 89.0 NaN NaN NaN
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN NaN NaN
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN NaN NaN
3 1944-04-01 650.0 89.0 978.0 66.0 NaN NaN NaN
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN NaN NaN

Join operations

In [19]:
joinSql = """
    SELECT
        m.*
        , b.births
    FROM
        meat m
    INNER JOIN
        births b
            on m.date = b.date
    ORDER BY
        m.date;
"""
print("*" * 80)
print(joinSql)
print("*" * 80)
pysqldf(joinSql).head()
 
       
********************************************************************************
    SELECT
        m.*
        , b.births
    FROM
        meat m
    INNER JOIN
        births b
            on m.date = b.date
    ORDER BY
        m.date;
********************************************************************************
Out[19]:
  date beef veal pork lamb_and_mutton broilers other_chicken turkey births
0 1975-01-01 00:00:00.000000 2106.0 59.0 1114.0 36.0 646.2 NaN 64.9 265775
1 1975-02-01 00:00:00.000000 1845.0 50.0 954.0 31.0 570.2 NaN 47.1 241045
2 1975-03-01 00:00:00.000000 1891.0 57.0 976.0 35.0 616.6 NaN 54.4 268849
3 1975-04-01 00:00:00.000000 1895.0 60.0 1100.0 34.0 688.3 NaN 68.7 247455
4 1975-05-01 00:00:00.000000 1849.0 59.0 934.0 31.0 690.1 NaN 81.9 254545
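The same inner join can be expressed in pandas with `merge`. The sketch below uses two tiny frames in place of the `meat` and `births` datasets; the 1975 values are taken from the tables above, and the 1974 row is invented so that the join has something to drop:

```python
import pandas as pd

# Small stand-ins for the meat and births datasets.
meat = pd.DataFrame({
    "date": pd.to_datetime(["1975-02-01", "1975-01-01", "1974-12-01"]),
    "beef": [1845.0, 2106.0, 1900.0],  # the 1974 value is invented
})
births = pd.DataFrame({
    "date": pd.to_datetime(["1975-01-01", "1975-02-01"]),
    "births": [265775, 241045],
})

# pandas counterpart of:
#   SELECT m.*, b.births FROM meat m
#   INNER JOIN births b ON m.date = b.date ORDER BY m.date;
joined = (
    meat.merge(births[["date", "births"]], on="date", how="inner")
    .sort_values("date")
    .reset_index(drop=True)
)
print(joined)
```

One visible difference: in the pandasql output above the dates come back as text such as `1975-01-01 00:00:00.000000`, whereas the pandas merge keeps the `date` column as datetimes.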
