When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.
- .head() returns the first few rows (the “head” of the DataFrame).
- .info() shows information on each of the columns, such as the data type and number of missing values.
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state’s total population.
Instruction

- Print the head of the homelessness DataFrame.
- Print information about the column types and missing values in homelessness.
- Print the number of rows and columns in homelessness.
- Print some summary statistics that describe the homelessness DataFrame.
print(homelessness.head())
print(homelessness.info())
print(homelessness.shape)
print(homelessness.describe())
To better understand DataFrame objects, it’s useful to know that they consist of three components, stored as attributes:
- .values: A two-dimensional NumPy array of values.
- .columns: An index of columns: the column names.
- .index: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)
Instruction

- Import pandas using the alias pd.
- Print the values of homelessness.
- Print the column index of homelessness.
- Print the row index of homelessness.

# Import pandas using the alias pd
import pandas as pd
# Print the values of homelessness
print(homelessness.values)
# Print the column index of homelessness
print(homelessness.columns)
# Print the row index of homelessness
print(homelessness.index)
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().
In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
| Sort on | Syntax |
|---|---|
| one column | df.sort_values("breed") |
| multiple columns | df.sort_values(["breed", "weight_kg"]) |
By combining .sort_values() with .head(), you can answer questions in the form, “What are the top cases where…?”.
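As a quick sketch (assuming the dogs DataFrame used in the syntax table above, with breed and weight_kg columns), a top-cases query chains the two methods:

# Heaviest dogs first; .head() then keeps only the top rows
print(dogs.sort_values("weight_kg", ascending=False).head())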
Instruction 1

Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as homelessness_ind.

# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")
# Print the top few rows
print(homelessness_ind.head())
Instruction 2

Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.

# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending = False)
# Print the top few rows
print(homelessness_fam.head())
Instruction 3

Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region",
"family_members"],
ascending = [True, False])
# Print the top few rows
print(homelessness_reg_fam.head())
When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you.

- To select only "col_a" of the DataFrame df, use df["col_a"].
- To select "col_a" and "col_b" of df, use df[["col_a", "col_b"]].
Instruction 1

Create individuals, containing only the individuals column of homelessness.

# Select the individuals column
individuals = homelessness['individuals']
# Print the head of the result
print(individuals.head())
Instruction 2

Create state_fam, containing only the state and family_members columns of homelessness, in that order.

# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]
# Print the head of the result
print(state_fam.head())
Instruction 3

Create ind_state, containing the individuals and state columns of homelessness, in that order.

# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]
# Print the head of the result
print(ind_state.head())
A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.
There are many ways to subset a DataFrame; perhaps the most common is to use relational operators to return True or False for each row, then pass that Boolean Series inside square brackets.
dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]
You can filter for multiple conditions at once by using the “bitwise and” operator, &.
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
Instruction 1

Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]
# See the result
print(ind_gt_10k)
Instruction 2

Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]
# See the result
print(mountain_reg)
Instruction 3

Filter homelessness for cases where the number of family_members is less than one thousand and the region is “Pacific”, assigning to fam_lt_1k_pac. View the printed result.
# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) &
(homelessness["region"] == "Pacific")]
# See the result
print(fam_lt_1k_pac)
Subsetting data based on a categorical variable often involves using the “or” operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example.
Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
Instruction 1

Filter homelessness for cases where the USA census region is “South Atlantic” or it is “Mid-Atlantic”, assigning to south_mid_atlantic. View the printed result.
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[(homelessness["region"] == "South Atlantic") |
(homelessness["region"] == "Mid-Atlantic")]
# See the result
print(south_mid_atlantic)
Instruction 2

Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, assigning to mojave_homelessness. View the printed result.
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]
# See the result
print(mojave_homelessness)
You aren’t stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.
You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
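As a minimal sketch (assuming the dogs DataFrame from the earlier examples, with height_cm and weight_kg columns), deriving new columns might look like this:

# Change units: height in meters instead of centimeters
dogs["height_m"] = dogs["height_cm"] / 100
# Derive a column from two others: mass divided by height squared
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2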
Instruction

- Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
- Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals.

# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
# Add p_individuals col as proportion of individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]
# See the result
print(homelessness)
You’ve seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.
In this exercise, you’ll answer the question, “Which state has the highest number of homeless individuals per 10,000 people in the state?” Combine your new pandas skills to find out.
Instruction

- Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state.
- Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.
- Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
- Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)
# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]
# See the result
print(result)
Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there’s a lot of it.
Instruction

- Print the head of the sales DataFrame.
- Print information about the columns in sales.
- Print the mean of the weekly_sales column.
- Print the median of the weekly_sales column.

# Print the head of the sales DataFrame
print(sales.head())
# Print the info about the sales DataFrame
print(sales.info())
# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())
# Print the median of weekly_sales
print(sales["weekly_sales"].median())
Summary statistics can also be calculated on date columns that have values with the data type datetime64. Some summary statistics — like mean — don’t make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers.
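For instance, a hedged sketch (assuming the dogs DataFrame has a hypothetical date_of_birth column of type datetime64):

# The minimum and maximum bound the time range the data covers
print(dogs["date_of_birth"].min())
print(dogs["date_of_birth"].max())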
Instruction

- Print the maximum of the date column.
- Print the minimum of the date column.

# Print the maximum of the date column
print(sales["date"].max())
# Print the minimum of the date column
print(sales["date"].min())
While pandas and NumPy have tons of functions, sometimes you may need a different function to summarize your data.
The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example: df['column'].agg(function).
In the custom function for this exercise, “IQR” is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It’s an alternative to standard deviation that is helpful if your data contains outliers.
Instruction 1

Use the custom iqr function defined for you along with .agg() to print the IQR of the temperature_c column of sales.
# A custom IQR function
def iqr(column):
return column.quantile(0.75) - column.quantile(0.25)
# Print IQR of the temperature_c column
print(sales['temperature_c'].agg(iqr))
Instruction 2

Update the column selection to use the custom iqr function with .agg() to print the IQR of temperature_c, fuel_price_usd_per_l, and unemployment, in that order.
# A custom IQR function
def iqr(column):
return column.quantile(0.75) - column.quantile(0.25)
# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c",
"fuel_price_usd_per_l",
"unemployment"]].agg(iqr))
Instruction 3

Update the aggregation functions called by .agg(): include iqr and np.median in that order.
# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
return column.quantile(0.75) - column.quantile(0.25)
# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c",
"fuel_price_usd_per_l",
"unemployment"]].agg([iqr, np.median]))
Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you’ll calculate the cumulative sum and cumulative max of a department’s weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.
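As a tiny sketch of these methods (again assuming the dogs DataFrame with a weight_kg column), cumulative methods return a Series as long as the original column:

# Running total and running maximum of weight_kg
print(dogs["weight_kg"].cumsum())
print(dogs["weight_kg"].cummax())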
Instruction

- Sort the rows of sales_1_1 by the date column in ascending order.
- Get the cumulative sum of weekly_sales and add it as a new column of sales_1_1 called cum_weekly_sales.
- Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales.
- Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns.

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date")
# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()
# See the columns you calculated
print(sales_1_1[["date",
"weekly_sales",
"cum_weekly_sales",
"cum_max_sales"]])
Removing duplicates is an essential skill to get accurate counts because often, you don’t want to count the same thing multiple times. In this exercise, you’ll create some new DataFrames using unique values from sales.
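The general pattern, sketched with the dogs DataFrame (assuming breed and color columns): .drop_duplicates() keeps the first row for each unique combination of the subset columns.

# One row per breed/color combination
unique_dogs = dogs.drop_duplicates(subset=["breed", "color"])
print(unique_dogs)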
Instruction

- Remove rows of sales with duplicate pairs of store and type and save as store_types and print the head.
- Remove rows of sales with duplicate pairs of store and department and save as store_depts and print the head.
- Subset the rows that are holiday weeks using the is_holiday column, and drop the duplicate dates, saving as holiday_dates.
- Select the date column of holiday_dates, and print.

# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())
# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())
# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")
# Print date col of holiday_dates
print(holiday_dates["date"])
Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you’ll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
Instruction

- Count the number of stores of each store type in store_types.
- Count the proportion of stores of each store type in store_types.
- Count the number of different departments in store_depts, sorting the counts in descending order.
- Count the proportion of different departments in store_depts, sorting the proportions in descending order.

# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)
# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)
# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)
# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True,
normalize=True)
print(dept_props_sorted)
While .groupby() is useful, you can calculate grouped summary statistics without it.
Walmart distinguishes three types of stores: “supercenters,” “discount stores,” and “neighborhood markets,” encoded in this dataset as type “A,” “B,” and “C.” In this exercise, you’ll calculate the total sales made at each store type, without using .groupby(). You can then use these numbers to see what proportion of Walmart’s total sales were made at each type.
Instruction

- Calculate the total weekly_sales over the whole dataset.
- Subset for type "A" stores, and calculate their total weekly sales.
- Do the same for type "B" and type "C" stores.
- Combine the A/B/C results into a list, and divide by sales_all to get the proportion of sales by type.

# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()
# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()
# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()
# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)
The .groupby() method makes life much easier. In this exercise, you’ll perform the same calculations as last time, except you’ll use the .groupby() method. You’ll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it’s a holiday week or not.
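The basic pattern, sketched with dogs (assuming color and weight_kg columns): group by a column, select a column, then call a summary method.

# One total per color, replacing one manual subset per color
print(dogs.groupby("color")["weight_kg"].sum())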
Instruction 1

- Group sales by "type", take the sum of "weekly_sales", and store as sales_by_type.
- Calculate the proportion of sales at each store type by dividing by the sum of sales_by_type. Assign to sales_propn_by_type.
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
# Get proportion for each type
sales_propn_by_type = sales_by_type/sales['weekly_sales'].sum()
print(sales_propn_by_type)
Instruction 2

Group sales by "type" and "is_holiday", take the sum of weekly_sales, and store as sales_by_type_is_holiday.
# From previous step
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()
print(sales_by_type_is_holiday)
Earlier in this chapter, you saw that the .agg() method is useful to compute multiple statistics on multiple variables. It also works with grouped data. NumPy, which is imported as np, has many different summary statistics functions, including np.min, np.max, np.mean, and np.median.
Instruction

- Import numpy with the alias np.
- Get the min, max, mean, and median of weekly_sales for each store type using .groupby() and .agg(). Store this as sales_stats. Make sure to use numpy functions!
- Get the min, max, mean, and median of unemployment and fuel_price_usd_per_l for each store type. Store this as unemp_fuel_stats.

# Import NumPy with the alias np
import numpy as np
# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg([np.min,
np.max,
np.mean,
np.median])
# Print sales_stats
print(sales_stats)
# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")[["unemployment",
"fuel_price_usd_per_l"]].agg([np.min,
np.max,
np.mean,
np.median])
# Print unemp_fuel_stats
print(unemp_fuel_stats)
Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the .pivot_table() method is just an alternative to .groupby().
In this exercise, you’ll perform calculations using .pivot_table() to replicate the calculations you performed in the last lesson using .groupby().
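To see the equivalence, here’s a hedged sketch with dogs (assuming color and weight_kg columns). Since .pivot_table() aggregates with the mean by default, both lines below compute the same grouped means:

# Grouped means, two ways
print(dogs.groupby("color")["weight_kg"].mean())
print(dogs.pivot_table(values="weight_kg", index="color"))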
Instruction 1

Get the mean weekly_sales by type using .pivot_table() and store as mean_sales_by_type.
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values="weekly_sales",
index="type")
# Print mean_sales_by_type
print(mean_sales_by_type)
Instruction 2

Get the mean and median (using NumPy functions) of weekly_sales by type using .pivot_table() and store as mean_med_sales_by_type.
# Import NumPy as np
import numpy as np
# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values="weekly_sales",
index="type",
aggfunc=[np.mean, np.median])
# Print mean_med_sales_by_type
print(mean_med_sales_by_type)
Instruction 3

Get the mean of weekly_sales by type and is_holiday using .pivot_table() and store as mean_sales_by_type_holiday.
# Pivot for mean weekly_sales by store type and holiday
mean_sales_by_type_holiday = sales.pivot_table(values="weekly_sales",
index="type",
columns="is_holiday")
# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)
Pivot tables are another way to do the same thing as a group-by-then-summarize operation.
The .pivot_table() method has several useful arguments, including fill_value and margins.

- fill_value replaces missing values with a real value (known as imputation). What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
- margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it gives the row and column totals of the pivot table contents.

In this exercise, you’ll practice using these arguments to up your pivot table skills, which will help you crunch numbers more efficiently!
Instruction 1

Print the mean weekly_sales by department and type, filling in any missing values with 0.
# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values = "weekly_sales",
index = "type",
columns = "department",
fill_value = 0))
Instruction 2

Print the mean weekly_sales by department and type, filling in any missing values with 0 and summing all rows and columns.
# Print the mean weekly_sales by department and type; fill missing values with 0s; sum all rows and cols
print(sales.pivot_table(values = "weekly_sales",
index = "department",
columns = "type",
fill_value = 0,
margins = True))
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).
In this chapter, you’ll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.
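As a sketch of why indexes make subsetting cleaner (assuming a dogs DataFrame with a name column; the dog names here are hypothetical):

# Move the name column into the index
dogs_ind = dogs.set_index("name")
# Square-bracket subsetting vs. the terser index lookup
print(dogs[dogs["name"].isin(["Bella", "Stella"])])
print(dogs_ind.loc[["Bella", "Stella"]])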
Instruction

- Look at temperatures.
- Set the index of temperatures to "city", assigning to temperatures_ind.
- Look at temperatures_ind. How is it different from temperatures?
- Reset the index of temperatures_ind, keeping its contents.
- Reset the index of temperatures_ind, dropping its contents.

# Look at temperatures
print(temperatures)
# Index temperatures by city
temperatures_ind = temperatures.set_index("city")
# Look at temperatures_ind
print(temperatures_ind)
# Reset the index, keeping its contents
print(temperatures_ind.reset_index())
# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop = True))
Setting an index allows more concise code for subsetting rows of a categorical variable via .loc[].

The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.
pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is indexed by city.
Instruction

- Create a list called cities that contains "Moscow" and "Saint Petersburg".
- Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.
- Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.

# Make a list of cities to subset on
cities = ['Moscow','Saint Petersburg']
# Subset temperatures using square brackets
print(temperatures[temperatures['city'].isin(cities)])
# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.
The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.
The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.
Instruction

- Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.
- Make a list of tuples containing "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
- Subset temperatures_ind for rows_to_keep using .loc[].

# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [('Brazil','Rio De Janeiro'),('Pakistan','Lahore')]
# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It’s also useful to be able to sort by elements in the index. For this, you need to use .sort_index().
Instruction

- Sort temperatures_ind by the index values.
- Sort temperatures_ind by the index values at the "city" level.
- Sort temperatures_ind by ascending country then descending city.

# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level = 'city'))
# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level = ['country', 'city'],
ascending=[True, False]))
Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number; we’ll start with the first case. This involves slicing inside the .loc[] method.
Compared to slicing lists, there are a few things to remember.

- You can only slice an index if the index is sorted (using .sort_index()).
- To slice at the outer level, first and last can be strings.
- If you pass a single slice to .loc[], it will slice the rows.

temperatures and temperatures_ind are available; the latter is indexed by city.
Instruction

- Sort the index of temperatures_ind.
- Use slicing with .loc[] to get these subsets:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()
# Subset rows from Pakistan to Russia
print(temperatures_srt.loc['Pakistan':'Russia'])
# Incorrectly subset rows from Lahore to Moscow (slicing the inner level directly returns nonsense)
print(temperatures_srt.loc['Lahore':'Moscow'])
# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')])
You’ve seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.
Instruction

- Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
- Use .loc[] slicing to subset columns from date to avg_temp_c.
- Slice in both directions at once, from India, Hyderabad to Iraq, Baghdad, and date to avg_temp_c.

# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad')])
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, 'date':'avg_temp_c'])
# Subset in both directions at once
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad'),
                           'date':'avg_temp_c'])
Slicing is particularly useful for time series since it’s a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, yyyy-mm-dd.
Recall from Chapter 1 that you can combine multiple Boolean conditions using logical operators (such as &). To do so in one line of code, you’ll need to add parentheses () around each condition.
Instruction

- Use Boolean conditions (not .isin() or .loc[]) to subset temperatures for rows in 2010 and 2011, and print the results. Note that df["date"] == "2009" will check if the date is equal to the first day of the first month of the year (e.g. 2009-01-01), rather than checking whether the date occurs within the given year. We recommend writing out the full date when using Boolean conditions (e.g., 2009-12-31).
- Set the index of temperatures to the date column.
- Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.
- Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011.

# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
print(temperatures[(temperatures["date"] >= "2010") & (temperatures["date"] < "2012")])
# Set date as an index
temperatures_ind = temperatures.set_index("date")
# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])
The most common ways to subset rows are the ways we’ve previously discussed: using a Boolean condition or by index labels. However, it is also occasionally useful to pass row numbers.
This is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.
Instruction

Use .iloc[] on temperatures to take subsets.
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22,1])
# Use slicing to get the first 5 rows
print(temperatures.iloc[0:5])
# Use slicing to get columns 2 to 3
print(temperatures.iloc[:,2:4])
# Use slicing in both directions at once
print(temperatures.iloc[0:5,2:4])
It’s interesting to see how temperatures for each city change over time—looking at every month results in a big table, which can be tricky to reason about. Instead, let’s look at how temperatures change by year.
You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.
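A short sketch (assuming a hypothetical date_of_birth column of type datetime64 in the dogs DataFrame):

# Pull out individual date components with the .dt accessor
dogs["birth_year"] = dogs["date_of_birth"].dt.year
dogs["birth_month"] = dogs["date_of_birth"].dt.month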
Once you have the year column, you can create a pivot table with the data aggregated by city and year, which you’ll explore in the coming exercises.
Instruction

- Add a year column to temperatures, from the year component of the date column.
- Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

# Add a year column to temperatures
temperatures['year'] = temperatures['date'].dt.year
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table('avg_temp_c',
index=['country', 'city'],
columns='year')
# See the result
print(temp_by_country_city_vs_year)
A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.
Instruction

Use .loc[] on temp_by_country_city_vs_year to take subsets.
# Subset for Egypt to India
print(temp_by_country_city_vs_year.loc['Egypt':'India'])
# Subset for Egypt, Cairo to India, Delhi
print(temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi')])
# Subset in both directions at once (the year column labels are ints, so slice without quotes)
print(temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi'),
                                       2005:2010])
Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you’ll need to perform further calculations on them. A common thing to do is to find the rows or columns where the highest or lowest value occurs.
Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].
Instruction

- Calculate the mean temperature for each year, assigning to mean_temp_by_year.
- Filter mean_temp_by_year for the year that had the highest mean temperature.
- Calculate the mean temperature for each city, assigning to mean_temp_by_city.
- Filter mean_temp_by_city for the city that had the lowest mean temperature.

# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean(axis = 'index')
# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])
# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis = 'columns')
# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
Avocados are increasingly popular and delicious in guacamole and on toast. The Hass Avocado Board keeps track of avocado supply and demand across the USA, including the sales of three different sizes of avocado. In this exercise, you’ll use a bar plot to figure out which size is the most popular.
Bar plots are great for revealing relationships between categorical (size) and numeric (number sold) variables, but you’ll often have to manipulate your data first in order to get the numbers you need for plotting.
Instruction

- Print the head of the avocados dataset. What columns are available?
- For each avocado size group, calculate the total number sold, storing as nb_sold_by_size.
- Create a bar plot of the number of avocados sold by size, and show the plot.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
# Look at the first few rows of data
print(avocados.head())
# Get the total number of avocados sold of each size
nb_sold_by_size = avocados.groupby('size')['nb_sold'].sum()
# Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind = 'bar')
# Show the plot
plt.show()
Line plots are designed to visualize the relationship between two numeric variables, where each data value is connected to the next one. They are especially useful for visualizing the change in a number over time since each time point is naturally connected to the next time point. In this exercise, you’ll visualize the change in avocado sales over three years.
Instruction

- Get the total number of avocados sold on each date, storing as nb_sold_by_date.
- Create a line plot of the number of avocados sold by date, and show the plot.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
# Get the total number of avocados sold on each date
nb_sold_by_date = avocados.groupby('date')['nb_sold'].sum()
# Create a line plot of the number of avocados sold by date
nb_sold_by_date.plot(kind='line')
# Show the plot
plt.show()
Scatter plots are ideal for visualizing relationships between numerical variables. In this exercise, you’ll compare the number of avocados sold to average price and see if they’re at all related. If they’re related, you may be able to use one number to predict the other.
Instruction

Create a scatter plot with nb_sold on the x-axis and avg_price on the y-axis. Title it "Number of avocados sold vs. average price". Show the plot.

# Scatter plot of nb_sold vs avg_price with title
avocados.plot(x = 'nb_sold',
y = 'avg_price',
kind = 'scatter',
title = "Number of avocados sold vs. average price")
# Show the plot
plt.show()
Creating multiple plots for different subsets of data allows you to compare groups. In this exercise, you’ll create multiple histograms to compare the prices of conventional and organic avocados.
Instruction 1

- Subset avocados for the conventional type, and the average price column. Create a histogram.
- Create a histogram of avg_price for organic type avocados.
- Add a legend and show the plot.

# Histogram of conventional avg_price
avocados[avocados['type'] == 'conventional']['avg_price'].hist()
# Histogram of organic avg_price
avocados[avocados['type'] == 'organic']['avg_price'].hist()
# Add a legend
plt.legend(['conventional','organic'])
# Show the plot
plt.show()
Instruction 2
Modify your code to adjust the transparency of both histograms to 0.5 to see how much overlap there is between the two distributions.
# Modify histogram transparency to 0.5
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha = 0.5)
# Modify histogram transparency to 0.5
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha = 0.5)
# Add a legend
plt.legend(["conventional", "organic"])
# Show the plot
plt.show()
Instruction 3
Modify your code to use 20 bins in both histograms.
# Modify bins to 20
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5,
bins = 20)
# Modify bins to 20
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5,
bins = 20)
# Add a legend
plt.legend(["conventional", "organic"])
# Show the plot
plt.show()
Missing values are everywhere, and you don’t want them interfering with your work. Some functions ignore missing data by default, but that’s not always the behavior you might want. Some functions can’t handle missing values at all, so these values need to be taken care of before you can use them. If you don’t know where your missing values are, or if they exist, you could make mistakes in your analysis. In this exercise, you’ll determine if there are missing values in the dataset, and if so, how many.
Instruction

- Print a DataFrame that shows whether each value in avocados_2016 is missing or not.
- Print a summary that shows whether any value in each column is missing or not.
- Create a bar plot of the total number of missing values in each column, and show the plot.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
# Check individual values for missing values
print(avocados_2016.isna())
# Check each column for missing values
print(avocados_2016.isna().any())
# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind='bar')
# Show plot
plt.show()
Now that you know there are some missing values in your DataFrame, you have a few options to deal with them. One way is to remove them from the dataset completely. In this exercise, you’ll remove missing values by removing all rows that contain missing values.
Instruction

- Remove the rows of avocados_2016 that contain missing values and store the remaining rows in avocados_complete.
- Check whether any columns of avocados_complete still contain missing values, and print the result.

# Remove rows with missing values
avocados_complete = avocados_2016.dropna()
# Check if any columns contain missing values
print(avocados_complete.isna().any())
Removing observations with missing values is a quick and dirty way to deal with missing data, but this can introduce bias to your data if the values are not missing at random.
Another way of handling missing values is to replace them all with the same value. For numerical variables, one option is to replace values with 0— you’ll do this here. However, when you replace missing values, you make assumptions about what a missing value means. In this case, you will assume that a missing number sold means that no sales for that avocado type were made that week.
In this exercise, you’ll see how replacing missing values can affect the distribution of a variable using histograms. You can plot histograms for multiple variables at a time as follows:
dogs[["height_cm", "weight_kg"]].hist()
Instruction 1

- Create a list, cols_with_missing, containing the names of columns with missing values: "small_sold", "large_sold", and "xl_sold".
- Create histograms showing the distributions of these columns, and show the plot.

# List the columns with missing values
cols_with_missing = ['small_sold', 'large_sold', 'xl_sold']
# Create histograms showing the distributions of cols_with_missing
avocados_2016[cols_with_missing].hist()
# Show the plot
plt.show()
Instruction 2

- Replace the missing values of avocados_2016 with 0s and store the result as avocados_filled.
- Create histograms of the cols_with_missing columns of avocados_filled, and show the plot.

# Fill in missing values with 0
avocados_filled = avocados_2016.fillna(0)
# Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()
# Show the plot
plt.show()
You recently got some new avocado data from 2019 that you’d like to put in a DataFrame using the list of dictionaries method. Remember that with this method, you go through the data row by row.
| date | small_sold | large_sold |
|---|---|---|
| “2019-11-03” | 10376832 | 7835071 |
| “2019-11-10” | 10717154 | 8561348 |
Instruction

- Create a list of dictionaries with the new data called avocados_list.
- Convert the list into a DataFrame called avocados_2019.
- Print your new DataFrame.

# Create a list of dictionaries with new data
avocados_list = [{"date": "2019-11-03",
"small_sold": 10376832,
"large_sold": 7835071},
{"date": "2019-11-10",
"small_sold": 10717154,
"large_sold": 8561348},]
# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)
# Print the new DataFrame
print(avocados_2019)
Some more data just came in! This time, you’ll use the dictionary of lists method, parsing the data column by column.
| date | small_sold | large_sold |
|---|---|---|
| “2019-11-17” | 10859987 | 7674135 |
| “2019-12-01” | 9291631 | 6238096 |
Instruction

- Create a dictionary of lists with the new data called avocados_dict.
- Convert the dictionary into a DataFrame called avocados_2019.
- Print your new DataFrame.

# Create a dictionary of lists with new data
avocados_dict = {"date": ["2019-11-17", "2019-12-01"],
"small_sold": [10859987, 9291631],
"large_sold": [7674135, 6238096]}
# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)
# Print the new DataFrame
print(avocados_2019)
You work for an airline, and your manager has asked you to do a competitive analysis and see how often passengers flying on other airlines are involuntarily bumped from their flights.
You got a CSV file (airline_bumping.csv) from the Department of Transportation containing data on passengers that were involuntarily denied boarding in 2016 and 2017, but it doesn’t have the exact numbers you want. In order to figure this out, you’ll need to get the CSV into a pandas DataFrame and do some manipulation!
Instruction 1

- Read the file "airline_bumping.csv" and store it as a DataFrame called airline_bumping.
- Take a look at the first few rows of airline_bumping.

# Read CSV as DataFrame called airline_bumping
airline_bumping = pd.read_csv("airline_bumping.csv")
# Take a look at the DataFrame
print(airline_bumping.head())
Instruction 2

For each airline group, select the nb_bumped and total_passengers columns, and calculate the sum (for both years). Store this as airline_totals.
# For each airline, select nb_bumped and total_passengers and sum
airline_totals = airline_bumping.groupby('airline')[['nb_bumped',
'total_passengers']].sum()
Instruction 3

Create a new column of airline_totals called bumps_per_10k, which is the number of passengers bumped per 10,000 passengers in 2016 and 2017.
airline_totals["bumps_per_10k"] = airline_totals['nb_bumped'] / airline_totals['total_passengers'] * 10000
Instruction 4

Print airline_totals to see the results of your manipulations.
# Print airline_totals
print(airline_totals)
You’re almost there! To make things easier to read, you’ll need to sort the data and export it to CSV so that your colleagues can read it.
Instruction

- Sort airline_totals by the values of bumps_per_10k from highest to lowest, storing as airline_totals_sorted.
- Print your sorted DataFrame.
- Save the sorted DataFrame as a CSV called "airline_totals_sorted.csv".

# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values('bumps_per_10k',
ascending=False)
# Print airline_totals_sorted
print(airline_totals_sorted)
# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")