Using NumPy's log10 with a DataFrame
- Import numpy using the standard alias np.
- Convert the DataFrame df to an array np_vals using the attribute .values.
- Pass np_vals into the NumPy function log10() and store the result in np_vals_log10.
- Pass the entire DataFrame df into np.log10() and store the result in df_log10.
- Call print() and type() on both np_vals_log10 and df_log10, and compare. This has been done for you.
# Import numpy
import numpy as np
# Create array of DataFrame values: np_vals
np_vals = df.values
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)
# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)
# Print original and new data containers
print(type(np_vals), type(np_vals_log10))
print(type(df), type(df_log10))
Creating a DataFrame from lists with zip
- Zip list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped.
- Inspect the contents of zipped using print(). This has been done for you.
- Build a dictionary from zipped and store the result as data.
- Build and inspect a DataFrame df from the dictionary data.
# Zip the 2 lists together into one list of (key,value) tuples: zipped
zipped = list(zip(list_keys, list_values))
# Inspect the list using print()
print(zipped)
# Build a dictionary with the zipped list: data
data = dict(zipped)
# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)
Preprocessing while reading files with pandas
- Use pd.read_csv() without any keyword arguments to read file_messy into a DataFrame df1.
- Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess.
- Using the keyword arguments delimiter=' ', header=3 and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2.
- Print the output of df2.head() to verify the file was read correctly.
- Use .to_csv() to save the DataFrame df2 to the file named by the variable file_clean. Be sure to specify index=False.
- Use .to_excel() to save the DataFrame df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False.
# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
print(df1.head())
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=" ", header=3, comment='#')
# Print the output of df2.head()
print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an Excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
Plotting and saving figures with pandas
- Create the plot with df.plot(). Specify a color of 'red'. (Note: c and color are interchangeable as parameters here, but be explicit and specify color.)
- Use plt.title() to give the plot the title 'Temperature in Austin'.
- Use plt.xlabel() to give the plot the x-axis label 'Hours since midnight August 1, 2010'.
- Use plt.ylabel() to give the plot the y-axis label 'Temperature (degrees F)'.
- Display the plot with plt.show().
# Create a plot with color='red'
df.plot(color='red')
# Add a title
plt.title("Temperature in Austin")
# Specify the x-axis label
plt.xlabel("Hours since midnight August 1, 2010")
# Specify the y-axis label
plt.ylabel("Temperature (degrees F)")
# Display the plot
plt.show()
- Plot all columns together with df.plot(), noting the vertical scaling problem.
- Plot each column on its own axes by passing subplots=True inside .plot().
- Define a list column_list1 containing the single column name 'Dew Point (deg F)', and call df[column_list1].plot().
- Define a list column_list2 containing the column names 'Temperature (deg F)' and 'Dew Point (deg F)', and pass it into df[], as df[column_list2].plot().
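These notes don't include solution code for this exercise; below is a minimal sketch of the three plotting calls, using a small made-up weather DataFrame in place of the exercise's df (the column names follow the exercise text, the values are invented):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical weather DataFrame standing in for the exercise's df
df = pd.DataFrame({
    'Temperature (deg F)': np.random.uniform(60, 100, 24),
    'Dew Point (deg F)': np.random.uniform(40, 70, 24),
})

# subplots=True gives every column its own axes, avoiding the
# vertical-scaling problem of plotting both on one pair of axes
axes = df.plot(subplots=True)

# Plot a single column via a one-element list (stays a DataFrame)
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()

# Plot a chosen pair of columns
column_list2 = ['Temperature (deg F)', 'Dew Point (deg F)']
df[column_list2].plot()
plt.show()
```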
- Create a list y_columns consisting of 'AAPL' and 'IBM'.
- Generate a line plot with x='Month' and y=y_columns as inputs.
- Give the plot the title 'Monthly stock prices'.
- Generate a scatter plot with 'hp' on the x-axis and 'mpg' on the y-axis. Specify s=sizes to scale the marker sizes.
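A sketch of the scatter plot with per-point marker sizes; the df values and the sizes list are invented for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical auto-mpg-style data standing in for the exercise's df
df = pd.DataFrame({
    'hp': [88, 193, 60, 150, 97],
    'mpg': [27.0, 18.1, 38.0, 22.5, 30.4],
})
# Marker areas in points^2, e.g. scaled from another column such as weight
sizes = [30, 120, 15, 90, 45]

# kind='scatter' requires explicit x and y; s sets per-point marker size
ax = df.plot(kind='scatter', x='hp', y='mpg', s=sizes)
plt.show()
```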
- Make a list cols of the column names to be plotted: 'weight' and 'mpg'. You can then access those columns using df[cols].
- Generate box plots of the two columns, specifying subplots=True so each gets its own axes.
# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']
# Generate the box plots
df[cols].plot(kind="box", subplots=True)
# Display the plot
plt.show()
- Plot a PDF of the fraction column with 30 bins between 0 and 30%. The range has been taken care of for you; ax=axes[0] means this plot appears in the first row.
- Plot a CDF of the fraction column with 30 bins over the same range. To make the CDF appear on the second row, specify ax=axes[1] and cumulative=True.
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF (density=True replaces the normed=True argument removed in matplotlib 3.1)
df.fraction.plot(ax=axes[0], kind='hist', density=True, bins=30, range=(0,.3))
# Plot the CDF on the second row
df.fraction.plot(ax=axes[1], kind='hist', density=True, cumulative=True, bins=30, range=(0,.3))
# Display both subplots in one figure
plt.show()
- Print the minimum and maximum values of the 'Engineering' column.
- Construct the mean percentage per year with .mean(axis='columns'). Assign the result to mean.
- Plot the average percentage per year. Since 'Year' is the index of df, it will appear on the x-axis of the plot; no keyword arguments are needed in your call to .plot().
# Print the minimum value of the Engineering column
print(df['Engineering'].min())
# Print the maximum value of the Engineering column
print(df['Engineering'].max())
# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')
# Plot the average percentage per year
mean.plot()
# Display the plot
plt.show()
Exploring patterns in the data
- Print the number of countries reported in 2015 by calling the .count() method on the '2015' column of df.
- Print the 5th and 95th percentiles of df. To do this, use the .quantile() method with the list [0.05, 0.95].
- Generate a box plot using the list of columns provided in years. This has already been done for you, so click on 'Submit Answer' to view the result!
# Print the number of countries reported in 2015
print(df['2015'].count())
# Print the 5th and 95th percentiles
print(df.quantile([0.05, 0.95]))
# Generate a box plot
years = ['1800','1850','1900','1950','2000']
df[years].plot(kind='box')
plt.show()
Working with time series data
- Prepare a format string time_format, using '%Y-%m-%d %H:%M' as the desired format.
- Convert date_list into a datetime object by using the pd.to_datetime() function. Specify the format string you defined above and assign the result to my_datetimes.
- Construct a pandas Series time_series using pd.Series() with temperature_list and my_datetimes. Set the index of the Series to be my_datetimes.
# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'
# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)
# Construct a pandas Series using temperature_list and my_datetimes: time_series
time_series = pd.Series(temperature_list, index=my_datetimes)
- Create a new time series ts3 by reindexing ts2 with the index of ts1. To do this, call .reindex() on ts2 and pass in the index of ts1 (ts1.index).
- Create another new time series, ts4, with the same .reindex() call as above, but also specifying a fill method, using the keyword argument method='ffill' to forward-fill values.
- Combine ts1 + ts2. Assign the result to sum12.
- Combine ts1 + ts3. Assign the result to sum13.
- Combine ts1 + ts4. Assign the result to sum14.
.# Reindex without fill method: ts3
ts3 = ts2.reindex(ts1.index)
# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts1.index, method='ffill')
# Combine ts1 + ts2: sum12
sum12 = ts1 + ts2
# Combine ts1 + ts3: sum13
sum13 = ts1 + ts3
# Combine ts1 + ts4: sum14
sum14 = ts1 + ts4
Aggregation operations
- Downsample the 'Temperature' column of df to 6-hour data using .resample('6h') and .mean(). Assign the result to df1.
- Downsample the 'Temperature' column of df to daily data using .resample('D'), then count the number of data points in each day with .count(). Assign the result to df2.
- Extract the August temperature data into august, then downsample it to get the daily highs august_highs.
- Extract the February temperature data into february, then downsample it to get the daily lows february_lows.
- Extract the data from 2010-Aug-01 to 2010-Aug-15 into unsmoothed.
- Use .rolling() with a 24 hour window to smooth the mean temperature data. Assign the result to smoothed.
- Create a new DataFrame august with the time series smoothed and unsmoothed as columns.
- Plot both time series in august as line plots using the .plot() method.
# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']['2010-08-01':'2010-08-15']
# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window=24).mean()
# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})
# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
plt.show()
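The resampling tasks at the top of this section (df1, df2, august_highs, february_lows) have no solution code in these notes; here is a minimal sketch against a synthetic hourly Temperature series standing in for df:

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures covering Feb-Aug 2010, standing in for df
index = pd.date_range('2010-02-01', '2010-08-31 23:00', freq='h')
df = pd.DataFrame({'Temperature': np.random.uniform(30, 100, len(index))},
                  index=index)

# Downsample to 6-hour means: df1
df1 = df['Temperature'].resample('6h').mean()
# Downsample to daily data and count points per day: df2
df2 = df['Temperature'].resample('D').count()

# August daily highs: august -> august_highs
august = df.loc['2010-08', 'Temperature']
august_highs = august.resample('D').max()

# February daily lows: february -> february_lows
february = df.loc['2010-02', 'Temperature']
february_lows = february.resample('D').min()
```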
Smoothing time series data
- Extract the August 2010 temperature data into august.
- Downsample august to obtain the daily highs, daily_highs.
- Combine daily_highs with .rolling() to apply a 7 day .mean() (with window=7 inside .rolling()) so as to smooth the daily highs. Assign the result to daily_highs_smoothed and print the result.
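No solution code accompanies these steps in the notes; a sketch with synthetic hourly data (the DatetimeIndex and 'Temperature' column name are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly August 2010 temperature data standing in for df
index = pd.date_range('2010-08-01', '2010-08-31 23:00', freq='h')
df = pd.DataFrame({'Temperature': np.random.uniform(60, 100, len(index))},
                  index=index)

# Extract August 2010: august
august = df.loc['2010-08', 'Temperature']
# Downsample to daily highs: daily_highs
daily_highs = august.resample('D').max()
# Smooth with a 7 day rolling mean: daily_highs_smoothed
daily_highs_smoothed = daily_highs.rolling(window=7).mean()
print(daily_highs_smoothed)
```

The first six entries of the smoothed series are NaN, since a 7-day window is not full until the seventh day.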
- Use .str.strip() to strip extra whitespace from df.columns. Assign the result back to df.columns.
- In the 'Destination Airport' column, extract all entries where Dallas ('DAL') is the destination airport. Use .str.contains('DAL') for this and store the result in dallas.
- Downsample dallas by day such that you get the total number of departures each day. Store the result in daily_departures.
- Generate summary statistics with .describe(). Store the result in stats.
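Again no code is given for this exercise; a sketch with a tiny invented flights table (the column name and whitespace problem follow the exercise text):

```python
import pandas as pd

# Hypothetical flight-departure data standing in for the exercise's df;
# note the extra whitespace in the column name
df = pd.DataFrame(
    {' Destination Airport ': ['DAL', 'HOU', 'DAL', 'DAL']},
    index=pd.to_datetime(['2015-07-01 06:30', '2015-07-01 07:10',
                          '2015-07-02 08:05', '2015-07-02 09:00']))

# Strip extra whitespace from the column names
df.columns = df.columns.str.strip()

# Boolean Series: True where Dallas is the destination
dallas = df['Destination Airport'].str.contains('DAL')

# Total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample('D').sum()

# Summary statistics: stats
stats = daily_departures.describe()
```

Summing a Boolean Series counts its True values, which is why .sum() after the daily resample yields departures per day.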
- Replace the index of ts2 with that of ts1, and then fill in the missing values of ts2 by interpolating linearly with .interpolate(). Save the result as ts2_interp.
- Compute the difference between ts1 and ts2_interp. Take the absolute value of the difference with np.abs(), and assign the result to differences.
- Inspect differences with .describe() and print().
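A sketch of the reindex-and-interpolate comparison, with two made-up series in place of ts1/ts2. Note that in current pandas the linear method is selected with method='linear' (which is also the default), not how='linear':

```python
import numpy as np
import pandas as pd

# Two hypothetical series on different time grids, standing in for ts1/ts2
ts1 = pd.Series(np.arange(7.0),
                index=pd.date_range('2016-07-01', periods=7, freq='D'))
ts2 = ts1.iloc[[0, 2, 4, 6]]  # ts2 only has every other day

# Reset ts2's index to ts1's, then linearly interpolate the gaps: ts2_interp
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')

# Absolute differences between ts1 and the interpolated series: differences
differences = np.abs(ts1 - ts2_interp)

# Summarize the differences
print(differences.describe())
```

Because ts1 is itself linear, the interpolation reconstructs the missing values exactly and every difference is zero.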
Handling time zones with pandas
- Create a Boolean mask, mask, such that if the 'Destination Airport' column of df equals 'LAX', the result is True, and otherwise, it is False.
- Use the mask to extract only the LAX rows. Assign the result to la.
- Concatenate la['Date (MM/DD/YYYY)'] and la['Wheels-off Time'] with a ' ' space in between. Pass this to pd.to_datetime() to create a datetime array of all the times the LAX-bound flights left the ground.
- Use Series.dt.tz_localize() to localize the time to 'US/Central'.
- Use the .dt.tz_convert() method to convert the datetimes from 'US/Central' to 'US/Pacific'.
# Build a Boolean mask to filter out all the 'LAX' departure flights: mask
mask = df['Destination Airport'] == 'LAX'
# Use the mask to subset the data: la
la = df[mask]
# Combine two columns of data to create a datetime series: times_tz_none
times_tz_none = pd.to_datetime( la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'] )
# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize('US/Central')
# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert("US/Pacific")
- Use pd.to_datetime() to convert the 'Date' column to a collection of datetime objects, and assign back to df.Date.
- Set the index to this converted 'Date' column, using df.set_index() with the optional keyword argument inplace=True, so that you don't have to assign the result back to df.
# Plot the raw data before setting the datetime index
df.plot()
plt.show()
# Convert the 'Date' column into a collection of datetime objects: df.Date
df.Date = pd.to_datetime(df['Date'])
# Set the index to be the converted 'Date' column
df.set_index('Date', inplace=True)
# Re-plot the DataFrame to see that the axis is now datetime aware!
df.plot()
plt.show()
- Downsample df_clean with daily frequency and aggregate by the mean. Store the result as daily_mean_2011.
- Extract the 'dry_bulb_faren' column from daily_mean_2011 as a NumPy array using .values. Store the result as daily_temp_2011. Note: .values is an attribute, not a method, so you don't have to use ().
- Downsample df_climate with daily frequency and aggregate by the mean. Store the result as daily_climate.
- Extract the 'Temperature' column from daily_climate using the .reset_index() method. To do this, first reset the index of daily_climate, and then use bracket slicing to access 'Temperature'. Store the result as daily_temp_climate.
# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample("D").mean()
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011.dry_bulb_faren.values
# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df_climate.resample("D").mean()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index().Temperature
# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())
- Use .loc[] to select sunny days and assign to sunny. If 'sky_condition' contains 'CLR', the day is sunny.
- Use .loc[] to select overcast days and assign to overcast. If 'sky_condition' contains 'OVC', the day is overcast.
- Resample sunny and overcast, aggregating by the maximum (.max()) daily ('D') temperature. Assign to sunny_daily_max and overcast_daily_max.
- Print the difference between the means of sunny_daily_max and overcast_daily_max. This has already been done for you, so click 'Submit Answer' to view the result!
# Select days that are sunny: sunny
sunny = df_clean.loc[df_clean['sky_condition'].str.contains('CLR')]
# Select days that are overcast: overcast
overcast = df_clean.loc[df_clean['sky_condition'].str.contains('OVC')]
# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample('D').max()
overcast_daily_max = overcast.resample('D').max()
# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())
Plot analysis
- Import matplotlib.pyplot as plt.
- Select the 'visibility' and 'dry_bulb_faren' columns and resample them by week, aggregating the mean. Assign the result to weekly_mean.
- Print the output of weekly_mean.corr().
- Plot the weekly_mean DataFrame with .plot(), specifying subplots=True.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[['visibility','dry_bulb_faren']].resample("W").mean()
# Print the output of weekly_mean.corr()
print(weekly_mean.corr())
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
plt.show()
- Create a Boolean Series for sunny days and assign to sunny.
- Resample sunny by day and compute the sum. Assign the result to sunny_hours.
- Resample sunny by day and compute the count. Assign the result to total_hours.
- Divide sunny_hours by total_hours. Assign to sunny_fraction.
- Make a box plot of sunny_fraction.
# Create a Boolean Series for sunny days: sunny
sunny = df_clean.sky_condition.str.contains("CLR")
# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample('D').sum()
# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample("D").count()
# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours
# Make a box plot of sunny_fraction
sunny_fraction.plot(kind='box')
plt.show()
- From df_climate, extract the maximum temperature observed in August 2010. The relevant column here is 'Temperature'. You can select the rows corresponding to August 2010 in multiple ways: for example, df_climate.loc['2011-Feb'] selects all rows corresponding to February 2011, while df_climate.loc['2009-09', 'Pressure'] selects the rows corresponding to September 2009 from the 'Pressure' column.
- From df_clean, select the August 2011 temperature data from the 'dry_bulb_faren' column. Resample this data by day and aggregate the maximum value. Store the result in august_2011.
- Filter august_2011 down to the days where the value exceeded august_max. Store the result in august_2011_high.
- Construct a CDF of august_2011_high using 25 bins. Remember to specify the kind, density (called normed in older matplotlib), and cumulative parameters in addition to bins.
# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df_climate.loc['2010-08','Temperature'].max()
print(august_max)
# Resample the August 2011 temperatures in df_clean by day and aggregate the maximum value: august_2011
august_2011 = df_clean.loc['2011-08','dry_bulb_faren'].resample("D").max()
# Filter out days in august_2011 where the value exceeded august_max: august_2011_high
august_2011_high = august_2011[august_2011.values>august_max]
# Construct a CDF of august_2011_high
august_2011_high.plot(kind='hist', bins=25, density=True, cumulative=True)  # density=True replaces the removed normed=True
# Display the plot
plt.show()