Analyzing Two-Dimensional Data with NumPy and Pandas

1. Data Description

  • UNIT
    Remote unit that collects turnstile information. Can collect from multiple banks of turnstiles. Large subway stations can have more than one unit.

  • DATEn
    Date in "yyyy-mm-dd" (2011-05-21) format.

  • TIMEn
    Time in “hh:mm:ss” (08:05:02) format.

  • ENTRIESn
    Raw reading of cumulative turnstile entries from the remote unit. Occasionally resets to 0.

  • EXITSn
    Raw reading of cumulative turnstile exits from the remote unit. Occasionally resets to 0.

  • ENTRIESn_hourly
    Difference in ENTRIES from the previous REGULAR reading.

  • EXITSn_hourly
    Difference in EXITS from the previous REGULAR reading.

  • datetime
    Date and time in "yyyy-mm-dd hh:mm:ss" format (2011-05-01 00:00:00). Can be parsed into a Pandas datetime object without modifications.

  • hour
    Hour of the timestamp from TIMEn. Truncated rather than rounded.

  • day_week
    Integer (0-6, Mon-Sun) corresponding to the day of the week.

  • weekday
    Indicator (0 or 1) if the date is a weekday (Mon-Fri).

  • station
    Subway station corresponding to the remote unit.

  • latitude
    Latitude of the subway station corresponding to the remote unit.

  • longitude
    Longitude of the subway station corresponding to the remote unit.

  • conds
    Categorical variable of the weather conditions (Clear, Cloudy, etc.) for the time and location.

  • fog
    Indicator (0 or 1) if there was fog at the time and location.

  • precipi
    Precipitation in inches at the time and location.

  • pressurei
    Barometric pressure in inches Hg at the time and location.

  • rain
    Indicator (0 or 1) if rain occurred within the calendar day at the location.

  • tempi
    Temperature in ℉ at the time and location.

  • wspdi
    Wind speed in mph at the time and location.

  • meanprecipi
    Daily average of precipi for the location.

  • meanpressurei
    Daily average of pressurei for the location.

  • meantempi
    Daily average of tempi for the location.

  • meanwspdi
    Daily average of wspdi for the location.

  • weather_lat
    Latitude of the weather station the weather data is from.

  • weather_lon
    Longitude of the weather station the weather data is from.

2. Questions I thought of

  • What variables are related to subway ridership?
    -- Which stations have the most riders?
    -- What are the ridership patterns over time?
    -- How does the weather affect ridership?

  • What patterns can I find in the weather?
    -- Is the temperature rising throughout the month?
    -- How does weather vary across the city?

3. Two-Dimensional NumPy Arrays

Two-dimensional data representations:

  • Python: list of lists
  • NumPy: 2D array
  • Pandas: DataFrame

2D arrays, as opposed to arrays of arrays:

  • more memory efficient
  • accessing an element is a bit different: a[1, 3]
  • mean() and std() operate on the entire array

import numpy as np


ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])
print(ridership)
print(ridership[1, 3])
print(ridership[1:3, 3:5])
print(ridership[1, :])
    
# Vectorized operations on rows or columns
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])
    
# Vectorized operations on entire arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)


Write a function that finds the station with the most riders on the first day, then returns the mean riders per day for that station, along with the overall mean:

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()
    max_station = ridership[0, :].argmax()  # column index of the busiest station on day 1
    mean_for_max = ridership[:, max_station].mean()
    return (overall_mean, mean_for_max)

4. NumPy Axes

Mean of each row:

ridership.mean(axis=1)

Mean of each column:

ridership.mean(axis=0)
import numpy as np


a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
    
print(a.sum())
print(a.sum(axis=0))
print(a.sum(axis=1))
    

ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

def min_and_max_riders_per_day(ridership):
    mean_ridership_per_station = ridership.mean(axis=0)  # mean daily ridership for each station

    max_daily_ridership = mean_ridership_per_station.max()
    min_daily_ridership = mean_ridership_per_station.min()

    return (max_daily_ridership, min_daily_ridership)

5. NumPy and Pandas Data Types

Each column of a Pandas DataFrame can have a different data type.
dataframe.mean() computes the mean of each column.
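A minimal sketch of this (the column names here are made up): each column keeps its own dtype, and mean() averages the numeric columns.

```python
import pandas as pd

# Hypothetical mixed-type DataFrame
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],   # object (string) column
    'count': [10, 20, 30],     # integer column
    'ratio': [0.5, 1.5, 2.5]   # float column
})
print(df.dtypes)
# mean() averages each numeric column; numeric_only=True skips the string column
print(df.mean(numeric_only=True))
```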

6. Accessing DataFrame Elements

.loc['index_label']  # access the row with that index label
.iloc[9]             # get a row by position
.iloc[1, 3]          # get a single value by position
df['column_name']    # get a column
df.values            # the underlying 2D NumPy array of values, without the column names or
                     # row index, so you can compute statistics over the whole DataFrame
import pandas as pd

# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)


# Accessing elements
print(ridership_df.iloc[0])
print(ridership_df.loc['05-05-11'])
print(ridership_df['R003'])
print(ridership_df.iloc[1, 3])

print(ridership_df[['R003', 'R005']])

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df.sum())
print(df.sum(axis=1))
print(df.values.sum())
    
def mean_riders_for_max_station(ridership):
    overall_mean = ridership.values.mean()
    max_station = ridership.iloc[0].idxmax()  # idxmax() returns the column name
    mean_for_max = ridership.loc[:, max_station].mean()
    return (overall_mean, mean_for_max)

7. Loading Data into a DataFrame

A DataFrame can naturally represent the contents of a CSV file, since each column can hold its own data type:

df = pd.read_csv('filename.csv')

8. Computing Correlation

By default, Pandas' std() function applies Bessel's correction when computing the standard deviation. Calling std(ddof=0) turns Bessel's correction off. When computing the Pearson coefficient this way, use ddof=0.

NumPy's corrcoef() function computes the Pearson product-moment correlation coefficient, often just called the "correlation coefficient."

import pandas as pd
def correlation(x, y):
    x_standard = (x-x.mean())/x.std(ddof=0) 
    y_standard = (y-y.mean())/y.std(ddof=0)
    return (x_standard * y_standard).mean()
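As a sanity check (with made-up sample data), the hand-rolled formula above should agree with np.corrcoef():

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()

x = pd.Series([10, 20, 30, 40, 50])
y = pd.Series([12, 24, 33, 41, 52])
print(correlation(x, y))        # close to 1: y rises almost linearly with x
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy
```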

9. Pandas Axis Names

axis=1, or axis='columns': operate across the columns, producing one result per row
axis=0, or axis='index': operate down the rows, producing one result per column
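A small illustration (data made up here) of the two axis names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
print(df.sum(axis='index'))    # one sum per column: a -> 3, b -> 30
print(df.sum(axis='columns'))  # one sum per row: 11, 22
```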

10. DataFrame Vectorized Operations

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1 + df2)
    
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                    index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
                    index=['row4', 'row3', 'row2'])
print(df1 + df2)


# Cumulative entries and exits for one station for a few hours.
entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})

def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits (entries in the
    first column, exits in the second) and return a DataFrame with
    hourly entries and exits (same column order).
    '''
    return entries_and_exits - entries_and_exits.shift(1)
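Note that shift(1) leaves the first row as NaN, since there is no previous reading. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'ENTRIESn': [100, 110, 125], 'EXITSn': [50, 54, 60]})
hourly = df - df.shift(1)  # first row becomes NaN: no previous reading
print(hourly)
```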

11. DataFrame applymap()

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30],
    'c': [5, 10, 15]
})
    
def add_one(x):
    return x + 1
        
print(df.applymap(add_one))
    
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
 
def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'

def convert_grades(grades):
    return grades.applymap(convert_grade)

12. DataFrame apply()

def standardize_column(column):
    return (column - column.mean()) / column.std(ddof=0)

def standardize(df):
    return df.apply(standardize_column)

The default standard deviation differs between NumPy's .std() and Pandas' .std(). By default, NumPy computes the population standard deviation, with ddof=0. Pandas, on the other hand, computes the sample standard deviation, with ddof=1. If we know all of the scores, then we have the population, so to standardize with Pandas we need to set ddof to 0.
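A quick demonstration of the difference (values made up):

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5]
print(np.std(values))                 # population std, ddof=0 by default
print(pd.Series(values).std())        # sample std, ddof=1 by default (larger)
print(pd.Series(values).std(ddof=0))  # matches NumPy
```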

13. DataFrame apply() Use Case 2

Reducing a column to a single value.

def column_second_largest(column):
    sorted_values = column.sort_values(ascending=False)
    return sorted_values.iloc[1]

def second_largest(df):
    '''
    Return the second-largest value of each column of the input DataFrame.
    '''
    return df.apply(column_second_largest)
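An equivalent one-liner sketch (with made-up data) using pandas' Series.nlargest():

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2], 'b': [10, 30, 20]})
# nlargest(2) returns the two largest values; iloc[-1] picks the second-largest
print(df.apply(lambda col: col.nlargest(2).iloc[-1]))
```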

14. Adding a DataFrame and a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    0: [10, 20, 30, 40],
    1: [50, 60, 70, 80],
    2: [90, 100, 110, 120],
    3: [130, 140, 150, 160]
})

# Adding a Series to a square DataFrame    
print(df + s)
    
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
# Adding a Series to a one-row DataFrame 
print(df + s)

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
# Adding a Series to a one-column DataFrame
print(df + s)
    

    
# Adding when DataFrame column names match Series index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})
    
print(df + s)
    
# Adding when DataFrame column names don't match Series index
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})
print(df + s)

df.add(s)                  # same as df + s
df.add(s, axis='columns')  # same as the default: match the Series index to the column names
df.add(s, axis='index')    # match the Series index to the row index instead

Adding a DataFrame and a Series adds each value of the Series to the corresponding column of the DataFrame; the DataFrame and Series are matched up by the Series index and the DataFrame column names.
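A sketch (values made up) of the axis='index' case, which matches the Series against the row index instead of the column names:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]}, index=['r1', 'r2'])
s = pd.Series([1, 2], index=['r1', 'r2'])

# Adds 1 to every value in row r1 and 2 to every value in row r2
print(df.add(s, axis='index'))
```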

15. Standardizing Each Column, Revisited

def standardize(df):
    '''
    Standardize each column.
    '''
    return (df - df.mean()) / df.std(ddof=0)

def standardize_rows(df):
    '''
    Standardize each row. A plain df - mean would match the Series
    against the column names and produce NaNs, so use sub()/div()
    with axis='index' to match the row index instead.
    '''
    mean = df.mean(axis='columns')
    mean_difference = df.sub(mean, axis='index')
    std = df.std(axis='columns', ddof=0)
    return mean_difference.div(std, axis='index')

16. Pandas groupby()

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])


print(example_df)

grouped_data = example_df.groupby('even')
print(grouped_data.groups)

# Group by multiple columns
grouped_data = example_df.groupby(['even', 'above_three'])
print(grouped_data.groups)

# Get sum of each group
grouped_data = example_df.groupby('even')
print(grouped_data.sum())


grouped_data = example_df.groupby('even')
print(grouped_data.sum()['value'])
print(grouped_data['value'].sum())

17. Hourly Entries and Exits

def hourly(column):
    return column - column.shift(1)

def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits and return a
    DataFrame with hourly entries and exits, calculated separately for
    each station (the 'UNIT' column).
    '''
    return entries_and_exits.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(hourly)
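An equivalent and arguably more idiomatic version (shown here with made-up data) uses the built-in groupby diff(), which computes differences within each group:

```python
import pandas as pd

df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R004', 'R004'],
    'ENTRIESn': [100, 110, 200, 230],
    'EXITSn': [50, 54, 90, 97]
})
# The first row of each UNIT group is NaN: no previous reading for that station
print(df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].diff())
```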

18. Merging Pandas DataFrames

import pandas as pd

subway_df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
             'R004', 'R004'],
    'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
              '05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'ENTRIESn': [ 4388333,  4388348,  4389885,  4391507,  4393043, 14656120,
                 14656174, 14660126, 14664247, 14668301],
    'EXITSn': [ 2911002,  2911036,  2912127,  2913223,  2914284, 14451774,
               14451851, 14454734, 14457780, 14460818],
    'latitude': [ 40.689945,  40.689945,  40.689945,  40.689945,  40.689945,
                  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ,  40.69132 ],
    'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
                  -73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})

weather_df = pd.DataFrame({
    'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
              '05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
    'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'latitude': [ 40.689945,  40.69132 ,  40.689945,  40.69132 ,  40.689945,
                  40.69132 ,  40.689945,  40.69132 ,  40.689945,  40.69132 ],
    'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
                  -73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
    'pressurei': [ 30.24,  30.24,  30.32,  30.32,  30.14,  30.14,  29.98,  29.98,
                   30.01,  30.01],
    'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'tempi': [ 52. ,  52. ,  48.9,  48.9,  54. ,  54. ,  57.2,  57.2,  48.9,  48.9],
    'wspdi': [  8.1,   8.1,   6.9,   6.9,   3.5,   3.5,  15. ,  15. ,  15. ,  15. ]
})

def combine_dfs(subway_df, weather_df):
    '''
    Take two DataFrames, one with subway data and one with weather data,
    and return a single DataFrame with one row for each date, hour, and
    location. Only include times and locations that have both subway
    data and weather data available (an inner join).
    '''
    return subway_df.merge(weather_df,
        on=['DATEn','hour','latitude','longitude'],
        how='inner')
