radar_sun

Cleaning-Data-in-Python

文章目录

1. Common data problems
- 1.1 Data type constraints
- 1.2 Common data types
- 1.3 Numeric data or ...?
- 1.4 Summing string and concatenating number
- 1.5 Data range constraints
- 1.6 Tier size constraints
- 1.7 Back to the future
- 1.8 Uniqueness constraints
- 1.9 How big is your subest?
- 1.10 Finding duplicates
- 1.11 Treating duplicates
2. Text and categorical data problems
- 2.1 Membership constraints
- 2.2 Members only
- 2.3 Finding consistency
- 2.4 Categorical variables
- 2.5 Categorical of errors
- 2.6 Inconsistent categories
- 2.7 Remapping categories
- 2.8 Clening text data
- 2.9 Removing titles and taking names
- 2.10 Keeping it descriptive
3. Advanced data problems
- 3.1 Uniformity
- 3.2 Ambiguous dates
- 3.3 Uniform currencies
- 3.4 Uniform dates
- 3.5 Cross field validation
- 3.6 Cross field or no cross field?
- 3.7 How's our data integrity?
- 3.8 Completeness
- 3.9 Is this missing at random?
- 3.10 Missing Investors
- 3.11 Follow the money
4. Record linkage
- 4.1 Comparing strings
- 4.2 Minimum edit distance
- 4.3 The cutoff point
- 4.4 Remapping categories II
- 4.5 Generating pairs
- 4.6 To link or not to link?
- 4.7 Pair of restaurants
- 4.8 Similiar restairants
- 4.9 Linking DataFrames
- 4.10 Getting the right index
- 4.11 Linking them together!
- 4.12 Congratulations

1. Common data problems

1.1 Data type constraints

1.2 Common data types

Manipulating and analyzing data with incorrect data types could lead to compromised analysis as you go along the data science workflow.

When working with new data, you should always check the data types of your columns using the .dtypes attribute or the .info() method which you’ll see in the next exercise. Often times, you’ll run into columns that should be converted to different data types before starting any analysis.

In this exercise, you’ll first identify different types of data and correctly map them to their respective types.

Numeric data types
- Number of points on cusomer loyalty card
- Salary earned monthly
- Number of items bought in a basket
Text
- First name
- Shipping address of a customer
- City of residence
Datas
- Order date of a product
- Birthdates of clients

1.3 Numeric data or …?

In this exercise, and throughout this chapter, you’ll be working with bicycle ride sharing data in San Francisco called ride_sharing. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.
The user_type column contains information on whether a user is taking a free ride and takes on the following values:

1 for free riders.
2 for pay per ride.
3 for monthly subscribers.

In this instance, you will print the information of ride_sharing using .info() and see a firsthand example of how an incorrect data type can flaw your analysis of the dataset.

Instruction 1

Print the information of ride_sharing.
Use .describe() to print the summary statistics of the user_type column from ride_sharing.

# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

Instruction 2
Question
By looking at the summary statistics - they don’t really seem to offer much description on how users are distributed along their purchase type, why do you think that is?

Possible Answers

$\blacksquare$ The user_type column is not of the correct type, it should be converted to str.

$\square$ The user_type column has an infinite set of possible values, it should be converted to category.

$\square$ The user_type column has an finite set of possible values that represent groupings of data, it should be converted to category.

Instruction 3

Convert user_type into categorical by assigning it the 'category' data type and store it in the user_type_cat column.
Make sure you converted user_type_cat correctly by using an assert statement.

# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

1.4 Summing string and concatenating number

In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type.

Another common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.

In this exercise, you’ll be converting the string column duration to the type int. Before that however, you will need to make sure to strip "minutes" from the column in order to make sure pandas reads it as numerical.

Instruction

Use the .strip() method to strip duration of "minutes" and store it in the duration_trim column.
Convert duration_trim to int and store it in the duration_time column.
Write an assert statement that checks if duration_time's data type is now an int.
Print the average ride duration.

# Strip duration of minute
sride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes")

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration 
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())

1.5 Data range constraints

1.6 Tier size constraints

In this lesson, you’re going to build on top of the work you’ve been doing with the ride_sharing DataFrame. You’ll be working with the tire_sizes column which contains data on each bike’s tire size.

Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.

Instruction

Convert the tire_sizes column from category to 'int'.
Use .loc[] to set all values of tire_sizes above 27 to 27.
Reconvert back tire_sizes to 'category' from int.
Print the description of the tire_sizes.

# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27，'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

1.7 Back to the future

A new update to the data pipeline feeding into the ride_sharing DataFrame has been updated to register each ride’s date. This information is stored in the ride_date column of the type object, which represents strings in pandas.

A bug was discovered which was relaying rides taken today as taken next year. To fix this, you will find all instances of the ride_date column that occur anytime in the future, and set the maximum possible value of this column to today’s date. Before doing so, you would need to convert ride_date to a datetime object.

Instruction

Convert ride_date to a datetime object and store it in ride_dt column using to_datetime().
Create the variable today, which stores today’s date by using the dt.date.today() function.
For all instances of ride_dt in the future, set them to today’s date.
Print the maximum date in the ride_dt column.

# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date'])
# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

1.8 Uniqueness constraints

1.9 How big is your subest?

You have the following loans DataFrame which contains loan and credit score data for consumers, and some metadata such as their first and last names. You want to find both complete and incomplete duplicates using .duplicated().

first_name	last_name	credit_score	has_loan
Justin	Sadd;emeyer	600	1
Hadrien	Lacroix	450	0

Choose the correct usage of .duplicated() below:

$\square$ loans.duplicated()
Because the default method returns both complete and incomplete duplicates.

$\square$ loans.duplicated(subset = 'first_name')
Because constraining the duplicate rows to the first name lets me find incomplete duplicates as well.

$\blacksquare$ loans.duplicated(subset = ['first_name', 'last_name'], keep = False)
Because subsetting on consumer metadata and not discarding any duplicate returns all duplicated rows.

$\square$ loans.duplicated(subset = ['first_name', 'last_name'], keep = 'first')
Because this drops all duplicates.

Subsetting on metadata and keeping all duplicate records gives you a better bird-eye’s view over your data and how to duplicate it ! You can even subset the loans DataFrame using bracketing and sort the values so you can properly identify the duplicates.

1.10 Finding duplicates

A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.

The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates.

Instruction

Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False.
Subset ride_sharing on duplicates and sort by ride_id and assign the results to duplicated_rides.
Print the ride_id, duration and user_birth_year columns of duplicated_rides in that order.

# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id', 
                                     keep = False)
                                     
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])

1.11 Treating duplicates

In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.

In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows.

Instruction

Drop complete duplicates in ride_sharing and store the results in ride_dup.
Create the statistics dictionary which holds minimum aggregation for user_birth_year and mean aggregation for duration.
Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics.
Find duplicates again and run the assert statement to verify de-duplication.

# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby(by = 'ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0

2. Text and categorical data problems

2.1 Membership constraints

2.2 Members only

Throughout the course so far, you’ve been exposed to some common problems that you may encounter with your data, from data type constraints, data range constrains, uniqueness constraints, and now membership constraints for categorical values.

In this exercise, you will map hypothetical problems to their respective categories.

Membership Constraint
- A month column with the value 14.
- A day_of_week column with the value Suntermonday
- A has_loan column with the value 12.
- A GPA column containing a Z- grade.
Other Constraint
- An age column with values above 130.
- A revenue column represneted as a string.
- A birthdate columns with values in the future.

2.3 Finding consistency

In this exercise and throughout this chapter, you’ll be working with the airlines DataFrame which contains survey responses on the San Francisco Airport from airline customers.

The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction. Another DataFrame named categories was created, containing all correct possible values for the survey columns.

In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values, and drop them, effectively performing an outer and inner join on both these DataFrames as seen in the video exercise.

Instruction 1

Print the categories DataFrame and take a close look at all possible correct categories of the survey columns.
Print the unique values of the survey columns in airlines using the .unique() method.

# Print categories DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', 
       airlines['cleanliness'].unique(), 
       "\n")
       
print('Safety: ', 
       airlines['safety'].unique(), 
       "\n")
       
print('Satisfaction: ', 
       airlines['satisfaction'].unique(), 
       "\n")

Instruction 2
Question
Take a look at the output. Out of the cleanliness, safety and satisfaction columns, which one has an inconsistent category and what is it?

$\blacksquare$ cleanliness because it has an Unacceptable category.

$\square$ cleanliness because it has a Terribly dirty category.

$\square$ satisfaction because it has a Very satisfied category.

$\square$ safety because it has a Neutral category.

Instruction 3

Create a set out of the cleanliness column in airlines using set() and find the inconsistent category by finding the difference in the cleanliness column of categories.
Find rows of airlines with a cleanliness value not in categories and print the output.
Print the rows with the consistent categories of cleanliness only.

# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])

2.4 Categorical variables

2.5 Categorical of errors

In the video exercise, you saw how to address common problems affecting categorical variables in your data, including white spaces and inconsistencies in your categories, and the problem of creating new categories and mapping existing ones to new ones.

To get a better idea of the toolkit at your disposal, you will be mapping functions and methods from pandas and Python used to address each type of problem.

White spaces and Inconsistency
- .str.strip()
- .str.lower()
- .str.upper()
Creating or remapping caregories
- .replace()
- pandas.qcut()
- pandas.cut()

2.6 Inconsistent categories

In this exercise, you’ll be revisiting the airlines DataFrame from the previous lesson.

As a reminder, the DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction on the San Francisco Airport.

In this exercise, you will examine two categorical columns from this DataFrame, dest_region and dest_size respectively, assess how to address them and make sure that they are cleaned and ready for analysis.

Instruction 1
Print the unique values in dest_region and dest_size respectively.

# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

Instruction 2
Question
From looking at the output, what do you think is the problem with these columns?

$\square$ The dest_region column has only inconsistent values due to capitalization.

$\square$ The dest_region column has inconsistent values due to capitalization and has one value that needs to be remapped.

$\square$ The dest_size column has only inconsistent values due to leading and trailing spaces.

$\blacksquare$ Both 2 and 3 are correct.

Instruction 3

Change the capitalization of all values of dest_region to lowercase.
Replace the 'eur' with 'europe' in dest_region using the .replace() method.

# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

Instruction 4

Strip white spaces from the dest_size column using the .strip() method.
Verify that the changes have been into effect by printing the unique values of the columns using .unique() .

# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

2.7 Remapping categories

To better understand survey respondents from airlines, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate.

The airlines DataFrame contains the day and wait_min columns, which are categorical and numerical respectively. The day column contains the exact day a flight took place, and wait_min contains the amount of minutes it took travelers to wait at the gate. To make your analysis easier, you want to create two new categorical variables:

wait_type: 'short' for 0-60 min, 'medium' for 60-180 and long for 180+
day_week: 'weekday' if day is in the weekday, 'weekend' if day is in the weekend.

Instruction

Create the ranges and labels for the wait_type column mentioned in the description above.
Create the wait_type column by from wait_min by using pd.cut(), while inputting label_ranges and label_names in the correct arguments.
Create the mapping dictionary mapping weekdays to 'weekday' and weekend days to 'weekend'.
Create the day_week column by using .replace().

# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], 
                               bins = label_ranges,                                
                               labels = label_names)
                               
# Create mappings and replace
mappings = {'Monday':'weekday', 
            'Tuesday':'weekday', 
            'Wednesday': 'weekday',             
            'Thursday': 'weekday', 
            'Friday': 'weekday',             
            'Saturday': 'weekend', 
            'Sunday': 'weekend'}
            
airlines['day_week'] = airlines['day'].replace(mappings)

Don’t forget, you can always use an assert statement to check your changes passed.

2.8 Clening text data

2.9 Removing titles and taking names

While collecting survey respondent metadata in the airlines DataFrame, the full name of respondents was saved in the full_name column. However upon closer inspection, you found that a lot of the different names are prefixed by honorifics such as "Dr.", "Mr.", "Ms." and "Miss".

Your ultimate objective is to create two new columns named first_name and last_name, containing the first and last names of respondents respectively. Before doing so however, you need to remove honorifics.

Instruction

Remove "Dr.", "Mr.", "Miss" and "Ms." from full_name by replacing them with an empty string "" in that order.
Run the assert statement using .str.contains() that tests whether full_name still contains any of the honorifics.

# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss","")

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")

# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False

2.10 Keeping it descriptive

To further understand travelers’ experiences in the San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport.

Their response is stored in the survey_response column. Upon a closer look, you realized a few of the answers gave the shortest possible character amount without much substance. In this exercise, you will isolate the responses with a character count higher than 40 , and make sure your new DataFrame contains responses with 40 characters or more using an assert statement.

Instruction

Using the airlines DataFrame, store the length of each instance in the survey_response column in resp_length by using .str.len().
Isolate the rows of airlines with resp_length higher than 40.
Assert that the smallest survey response length in airlines_survey is now bigger than 40.

# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

3. Advanced data problems

3.1 Uniformity

3.2 Ambiguous dates

You have a DataFrame containing a subscription_date column that was collected from various sources with different Date formats such as YYYY-mm-dd and YYYY-dd-mm. What is the best way to unify the formats for ambiguous values such as 2019-04-07?

$\square$ Set them to NA and drop them.

$\square$ Infer the format of the data in question by checking the format of subsequent and previous values.

$\square$ Infer the format from the original data source.

$\blacksquare$ All of the above are possible, as long as we investigate where our data comes from, and understand the dynamics affecting it before cleaning it.

3.3 Uniform currencies

In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the banking DataFrame. The dataset contains data on the amount of money stored in accounts, their currency, amount invested, account opening date and last transaction date that were consolidated from American and European branches.
You are tasked with understanding the average account size and how investments vary by the size of account, however in order to produce this analysis accurately, you first need to unify the currency amount into dollars.

Instruction

Find the rows of acct_cur in banking that are equal to 'euro' and store them in acct_eu.
Find all the rows of acct_amount in banking that fit the acct_eu condition, and convert them to USD by multiplying them with 1.1.
Find all the rows of acct_cur in banking that fit the acct_eu condition, set them to 'dollar'.

# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[banking['acct_cur'] == 'euro', 'acct_cur'] = 'dollar'

# Print unique values of acct_cur
assert banking['acct_cur'].unique() == 'dollar'

3.4 Uniform dates

After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The account_opened column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.

However, since this data was consolidated from multiple sources, you need to make sure that all dates are of the same format. You will do so by converting this column into a datetime object, while making sure that the format is inferred and potentially incorrect formats are set to missing.

Instruction 1
Print the header of account_opened from the banking DataFrame and take a look at the different results.

# Print the header of account_opened
print(banking.account_opened.head())

Instruction 2
Take a look at the output. You tried converting the values to datetime using the default to_datetime() function without changing any argument, however received the following error:

ValueError: month must be in 1..12

Why do you think that is?

$\square$ The to_datetime() function needs to be explicitly told which date format each row is in.

$\square$ The to_datetime() function can only be applied on YY-mm-dd date formats.

$\blacksquare$ The 21-14-17 entry is erroneous and leads to an error.

Instruction 3
Convert the account_opened column to datetime, while making sure the date format is inferred and that erroneous formats that raise error return a missing value.

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce')

Instruction 4

Extract the year from the amended account_opened column and assign it to the acct_year column.
Print the newly created acct_year column.

# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'])

3.5 Cross field validation

3.6 Cross field or no cross field?

Throughout this course, you’ve been immersed in a variety of data cleaning problems from range constraints, data type constraints, uniformity and more.

In this lesson, you were introduced to cross field validation as a means to sanity check your data and making sure you have strong data integrity.

Now, you will map different applicable concepts and techniques to their respective categories.

Cross field validation
- Confirming the Age provided by users by cross checking their birthdays.
- Row wise operations such as .sum(axis=1)
Not cross field validation
- Making sure that a revenue column is a numeric column.
- Making sure a subscription_data column has no values set in the future.
- The use of the .astype method.

3.7 How’s our data integrity?

New data has been merged into the banking DataFrame that contains details on how investments in the inv_amount column are allocated across four different funds A, B, C and D.

Furthermore, the age and birthdays of customers are now stored in the age and birth_date columns respectively.

You want to understand how customers of different age groups invest. However, you want to first make sure the data you’re analyzing is correct. You will do so by cross field checking values of inv_amount and age against the amount invested in different funds and customers’ birthdays.

Instruction 1

Find the rows where the sum of all rows of the fund_columns in banking are equal to the inv_amount column.
Store the values of banking with consistent inv_amount in consistent_inv, and those with inconsistent ones in inconsistent_inv.

# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Store consistent and inconsistent data
print("Number of inconsistent investments: ",
       inconsistent_inv.shape[0])

Instruction 2

Store today’s date into today, and manually calculate customers’ ages and store them in ages_manual.
Find all rows of banking where the age column is equal to ages_manual and then filter banking into consistent_ages and inconsistent_ages.

# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = ages_manual == banking['age']

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Store consistent and inconsistent data
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

3.8 Completeness

3.9 Is this missing at random?

You’ve seen in the video exercise how there are a variety of missingness types when observing missing data. As a reminder, missingness types can be described as the following:

Missing Completely at Random: No systematic relationship between a column’s missing values and other or own values.
Missing at Random: There is a systematic relationship between a column’s missing values and other observed values.
Missing not at Random: There is a systematic relationship between a column’s missing values and unobserved values.

You have a DataFrame containing customer satisfaction scores for a service. What type of missingness is the following?

A customer satisfaction_score column with missing values for highly dissatisfied customers.

$\square$ Missing completely at random

$\square$ Missing at random

$\blacksquare$ Missing not at random

3.10 Missing Investors

Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

You just received a new version of the banking DataFrame containing data on the amount held and invested for new and existing customers. However, there are rows with missing inv_amount values.

You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness.

Instruction 1

Print the number of missing values by column in the banking DataFrame.
Plot and show the missingness matrix of banking with the msno.matrix() function.
Isolate the values of banking missing values of inv_amount into missing_investors and with non-missing inv_amount values into investors.

# Print number of missing values in banking
print(banking.isna().sum())

# Visualize missingness matrixmsno.matrix(banking)
plt.show()

# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

Instruction 2
Now that you’ve isolated banking into investors and missing_investors, use the .describe() method on both of these DataFrames in the console to understand whether there are structural differences between them. What do you think is going on?

$\square$ The data is missing completely at random and there are no drivers behind the missingness.

$\square$ The inv_amount is missing only for young customers, since the average age in missing_investors is 22 and the maximum age is 25.

$\blacksquare$ The inv_amount is missing only for old customers, since the average age in missing_investors is 42 and the maximum age is 59.

Instruction 3
Sort the banking DataFrame by the age column and plot the missingness matrix of banking_sorted.

# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)plt.show()

3.11 Follow the money

In this exercise, you’re working with another version of the banking DataFrame that contains missing values for both the cust_id column and the acct_amount column.

You want to produce analysis on how many unique customers the bank has, the average amount held by customers and more. You know that rows with missing cust_id don’t really help you, and that on average acct_amount is usually 5 times the amount of inv_amount.

Instruction

Use .dropna() to drop missing values of the cust_id column in banking and store the results in banking_fullid.
Compute the estimated acct_amount of banking_fullid knowing that acct_amount is usually inv_amount * 5 and assign the results to acct_imp.
Impute the missing values of acct_amount in banking_fullid with the newly created acct_imp using .fillna().

# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())

4. Record linkage

4.1 Comparing strings

4.2 Minimum edit distance

In the video exercise, you saw how minimum edit distance is used to identify how similar two strings are. As a reminder, minimum edit distance is the minimum number of steps needed to reach from String A to String B, with the operations available being:

Insertion of a new character.
Deletion of an existing character.
Substitution of an existing character.
Transposition of two existing consecutive characters.

What is the minimum edit distance from 'sign' to 'sing', and which operation(s) gets you there?

$\square$ 2 by substituting 'g' with 'n' and 'n' with 'g'.

$\blacksquare$ 1 by transposing 'g' with 'n'.

$\square$ 1 by substituting 'g' with 'n'.

$\square$ 2 by deleting 'g' and inserting a new 'g' at the end.

4.3 The cutoff point

In this exercise, and throughout this chapter, you’ll be working with the restaurants DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.

This version of restaurants has been collected from many sources, where the cuisine_type column is riddled with typos, and should contain only italian, american and asian cuisine types. There are so many unique categories that remapping them manually isn’t scalable, and it’s best to use string similarity instead.
Before doing so, you want to establish the cutoff point for the similarity score using the fuzzywuzzy's process.extract() function by finding the similarity score of the most distant typo of each category.

Instruction 1

Import process from fuzzywuzzy.
Store the unique cuisine_types into unique_types.
Calculate the similarity of 'asian', 'american', and 'italian' to all possible cuisine_types using process.extract(), while returning all possible matches.

# Import process from fuzzywuzzy
from fuzzywuzzy import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', 
                       unique_types, 
                       limit = len(unique_types)))
                 
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', 
                       unique_types, 
                       limit = len(unique_types)))
                       
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', 
                       unique_types, 
                       limit = len(unique_types)))

Instruction 2
Take a look at the output, what do you think should be the similarity cutoff point when remapping categories?

$\blacksquare$ 80

$\square$ 70

$\square$ 60

4.4 Remapping categories II

In the last exercise, you determined that the distance cutoff point for remapping typos of 'american', 'asian', and 'italian' cuisine types stored in the cuisine_type column should be 80.

In this exercise, you’re going to put it all together by finding matches with similarity scores equal to or higher than 80 by using fuzywuzzy.process's extract() function, for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using process.extract(), the output is a list of tuples where each of tuple is as such:

(closest match, similarity score, index of match)

Instruction 1

Using the method shown in the video, return all of the unique values for the cuisine_type column of restaurants.
Okay! Looks like you will need to use some string matching to correct these misspellings!
As a first step, create a list of matches comparing 'italian' with the restaurant types listed in the cuisine_type column.

# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())

# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', 
                           restaurants['cuisine_type'],
                           limit=len(restaurants.cuisine_type))

# Inspect the first 5 matches
print(matches[0:5])

Instruction 2
Now you’re getting somewhere! Now you can iterate through matches to reassign similar entries.

Within the for loop, use an if statement to check whether the similarity score in each match is greater than or equal to 80.
If it is, use .loc to select rows where cusine_type in restaurants is equal to the current match (which is the first element of match), and reassign them to be 'italian'.

# Iterate through the list of matches to italian
for match in matches: 
 # Check whether the similarity score is greater than or equal to 80  
 if match[1] >= 80:    
   # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine    
   restaurants.loc[restaurants['cuisine_type'] == match[0]] = 'italian'

Instruction 3
Finally, you’ll adapt your code to work with every restaurant type in categories.

Using the variable cuisine to iterate through categories, embed your code from the previous step in an outer for loop.
Inspect the final result. This has been done for you.

# Iterate through categories
for cuisine in categories:    
  # Create a list of matches, comparing cuisine with the cuisine_type column  
  matches = process.extract(cuisine, 
                            restaurants['cuisine_type'], 
                            limit=len(restaurants.cuisine_type))
                            
    # Iterate through the list of matches  
    for match in matches:     
      # Check whether the similarity score is greater than or equal to 80    
      if match[1] >= 80:      
        # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine     
        restaurants.loc[restaurants['cuisine_type'] == match[0]] = cuisine      

# Inspect the final result
restaurants['cuisine_type'].unique()

4.5 Generating pairs

4.6 To link or not to link?

Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there are no common unique keys between the data sources you can rely upon when linking data sources such as a unique identifier.

In this exercise, you will classify each card whether it is a traditional join problem, or a record linkage one.

Record linkage
- Usubg an address column to join tow DataFrames, with the address in each DataFrame being formatted slightly differently.
- Two customer DataFrames containing names and address, one with a unique identifier per customer, one without.
- Merging two basketball DataFrames, with columns team_A, team_B, and time and differently formatted team names betwwen each DataFrames.
Regular join
- Consolidating two DataFrames containing details on DataCamp courses, with each DataCamp course having its own unique identifier.
- Two basketball DataFrames with a common unique identifier per game.

4.7 Pair of restaurants

In the last lesson, you cleaned the restaurants dataset to make it ready for building a restaurants recommendation engine. You have a new DataFrame named restaurants_new with new restaurants to train your model on, that’s been scraped from a new data source.

You’ve already cleaned the cuisine_type and city columns using the techniques learned throughout the course. However you saw duplicates with typos in restaurants names that require record linkage instead of joins with restaurants.

In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between restaurants and restaurants_new.

Instruction 1

Instantiate an indexing object by using the Index() function from recordlinkage.
Block your pairing on cuisine_type by using indexer's’ .block() method.
Generate pairs by indexing restaurants and restaurants_new in that order.

# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

Instruction 2
Now that you’ve generated your pairs, you’ve achieved the first step of record linkage. What are the steps remaining to link both restaurants DataFrames, and in what order?

$\blacksquare$ Compare between columns, score the comparison, then link the DataFrames.

$\square$ Clean the data, compare between columns, link the DataFrames, then score the comparison.

$\square$ Clean the data, compare between columns, score the comparison, then link the DataFrames.

4.8 Similiar restairants

In the last exercise, you generated pairs between restaurants and restaurants_new in an effort to cleanly merge both DataFrames using record linkage.

When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.

Now that your pairs have been generated and stored in pairs, you will find exact matches in the city and cuisine_type columns between each pair, and similar strings for each pair in the rest_name column.

Instruction 1
Instantiate a comparison object using the recordlinkage.Compare() function.

# Create a comparison object
comp_cl = recordlinkage.Compare()

Instruction 2

Use the appropriate comp_cl method to find exact matches between the city and cuisine_type columns of both DataFrames.
Use the appropriate comp_cl method to find similar strings with a 0.8 similarity threshold in the rest_name column of both DataFrames.

# Find exact matches on city, cuisine_types 
comp_cl.exact('city', 
              'city', 
              label='city')
comp_cl.exact('cuisine_type', 
              'cuisine_type', 
              label='cuisine_type')
              
# Find similar matches of rest_name
comp_cl.string('rest_name', 
               'rest_name', 
                label='name', 
                threshold = 0.8)

Instruction 3
Compute the comparison of the pairs by using the .compute() method of comp_cl.

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, 
                                    restaurants, 
                                    restaurants_new)
print(potential_matches)

Instruction 4
Print out potential_matches, the columns are the columns being compared, with values being 1 for a match, and 0 for not a match for each pair of rows in your DataFrames. To find potential matches, you need to find rows with more than matching value in a column. You can find them with

potential_matches[potential_matches.sum(axis = 1) >= n]

Where n is the minimum number of columns you want matching to ensure a proper duplicate find, what do you think should the value of n be?

$\blacksquare$ 3 because I need to have matches in all my columns.

$\square$ 2 because matching on any of the 2 columns or more is enough to find potential duplicates.

$\square$ 1 because matching on just 1 column like the restaurant name is enough to find potential duplicates.

4.9 Linking DataFrames

4.10 Getting the right index

Here’s a DataFrame named matches containing potential matches between two DataFrames, users_1 and users_2. Each DataFrame’s row indices is stored in uid_1 and uid_2 respectively.

		first_name	address_1	address_2	marriage_status	data_of_birth
uid_1	uid_2
0	3	1	1	1	1	0
	···	···	···	···	···	···
	···	···	···	···	···	···
1	3	1	1	1	1	0
	···	···	···	···	···	···
	···	···	···	···	···	···

How do you extract all values of the uid_1 index column?

$\square$ matches.index.get_level_values(0)

$\square$ matches.index.get_level_values(1)

$\square$ matches.index.get_level_values(‘uid_1’)

$\blacksquare$ Both 1 and 3 are correct.

4.11 Linking them together!

In the last lesson, you’ve finished the bulk of the work on your effort to link restaurants and restaurants_new. You’ve generated the different pairs of potentially matching rows, searched for exact matches between the cuisine_type and city columns, but compared for similar strings in the rest_name column. You stored the DataFrame containing the scores in potential_matches.

Now it’s finally time to link both DataFrames. You will do so by first extracting all row indices of restaurants_new that are matching across the columns mentioned above from potential_matches. Then you will subset restaurants_new on these indices, then append the non-duplicate values to restaurants

Instruction

Isolate instances of potential_matches where the row sum is above or equal to 3 by using the .sum() method.
Extract the second column index from matches, which represents row indices of matching record from restaurants_new by using the .get_level_values() method.
Subset restaurants_new for rows that are not in matching_indices.
Append non_dup to restaurants.

# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)

4.12 Congratulations

你可能感兴趣的:(python,pandas)

langchain系列 - FewShotPromptTemplate 少量示例码--到成功大语言模型 langchain
导读环境：OpenEuler、Windows11、WSL2、Python3.12.3langchain0.3背景：前期忙碌的开发阶段结束，需要沉淀自己的应用知识，过一遍LangChain时间：20250220说明：技术梳理，针对FewShotPromptTemplate专门来写一篇博客概念说明few-shot最初来源于机器学习的概念，还有one-shot、zero-shot概念，概念如下：机器学习
nginx ngx_http_module(9) 指令详解 s_fox_ nginx nginx http 运维
nginxngx_http_module(9)指令详解nginx模块目录nginx全指令目录一、目录1.1模块简介ngx_http_uwsgi_module：uWSGI支持模块，允许Nginx与uWSGI服务器进行通信。uWSGI是一种应用服务器协议，广泛用于PythonWeb应用的部署。通过该模块，Nginx可以将动态请求转发给uWSGI服务器处理，并将响应返回给客户端。常用的指令包括uwsgi
sql注入之python脚本进行时间盲注和布尔盲注温柔小胖 sql 数据库网络安全
一、什么是时间盲注和布尔盲注？答：时间盲注是攻击者通过构造恶意sql语句利用sleep()等延迟函数来观察数据库响应时间差异来进行推断信息和条件判断。如果条件为真，数据库会执行延时操作，如果为假则立即返回。响应时间较短。SELECTIF(1=1,SLEEP(5),0);如果条件为真、数据库会暂停5s如果条件为假、数据库会立即返回布尔盲注通过观察数据库返回的不同响应（如真或假）来推断信息。攻击者构造
Python中的生成器表达式（generator expression） Java资深爱好者 python python 开发语言
Python中的生成器表达式（generatorexpression）是一种类似于列表解析（listcomprehension）的语法结构，但它返回的是一个生成器（generator）对象，而不是一个完整的列表。生成器对象是一个迭代器，它可以逐个产生元素，而不是一次性生成所有元素，从而节省内存空间。生成器表达式在形式上与列表解析非常相似，但是它们使用圆括号()而不是方括号[]。当你迭代生成器表达式
Ollama部署大模型，本地调用居7然 android 人工智能 chatgpt 爬虫开发语言 AI编程
Ollama简单介绍Ollama是一个强大的大型语言模型平台，它允许用户轻松地下载、安装和运行各种大型语言模型。在本文中，我将指导你如何在你的本地机器上部署Ollama，并展示如何使用Python进行简单的API调用以访问这些模型最近很多人在学习大模型的时候，也遇到这个问题了，Ollama下载的模型，如果不想在命令行里面直接使用，而是想用Python去调用大模型该如何去使用？这是Ollama的官网
PyInstaller参数大揭秘：一文读懂打包神器的核心密码 Abossss Python python
一、引言在Python开发的广阔领域中，我们常常会面临这样一个问题：如何将自己精心编写的Python脚本，分享给那些没有Python环境的小伙伴，或者部署到生产环境中呢？这时候，PyInstaller库就如同一位救星，闪亮登场。PyInstaller是一个功能强大的跨平台打包工具，它可以将Python脚本及其所有依赖项，打包成一个独立的可执行文件。这意味着，无论目标系统是否安装了Python环境，
量化交易策略都有哪些？怎么运用？股票程序化交易接口 Python股票量化交易股票API接口量化交易量化交易策略均值回归动量策略风险控制股票量化接口股票API接口
Python股票接口实现查询账户，提交订单，自动交易（1）Python股票程序交易接口查账，提交订单，自动交易（2）股票量化，Python炒股，CSDN交流社区>>>均值回归策略：寻找价格的回归点均值回归的原理均值回归策略是基于一种市场现象，即价格不会永远偏离其长期的平均值。从市场的历史数据来看，无论是股票、期货还是其他金融资产，价格总是围绕着一个均值上下波动。这就像一个有弹性的绳子，当价格被拉伸
【全栈】SprintBoot+vue3迷你商城-细节解析（2）：分页杰九 vue.js spring boot java
【全栈】SprintBoot+vue3迷你商城-细节解析（2）：分页往期的文章都在这里啦，大家有兴趣可以看一下后端部分：【全栈】SprintBoot+vue3迷你商城（1）【全栈】SprintBoot+vue3迷你商城（2）【全栈】SprintBoot+vue3迷你商城-扩展：利用python爬虫爬取商品数据【全栈】SprintBoot+vue3迷你商城（3）【全栈】SprintBoot+vue3
有需要2025年参加蓝桥杯比赛的同学往下看！！！岱宗夫up 教程蓝桥杯职场和发展
有需要2025年参加蓝桥杯比赛的同学往下下看！！！以下是关于近两年（2023年和2024年）蓝桥杯Python组考点的详细总结：一、2023年蓝桥杯Python考点分析在2023年的蓝桥杯Python竞赛中，考点主要集中在基础算法、数据结构、动态规划、数学、高精度计算以及二分查找等方面。（一）基础算法基础算法是竞赛的基石，包括枚举、排序（如冒泡排序、选择排序、插入排序等）、搜索（如BFS和DFS）
Ubuntu22 安装多个版本的python 莫忘初心丶 python 数据库开发语言
前言使用pyenv是一个很好的选择，尤其是在需要管理多个Python版本时。它提供了一个简单的方法来安装、切换和管理多个版本的Python，而不必依赖系统的包管理器或update-alternatives。特别是当你需要在同一系统中频繁切换Python版本时，pyenv会显得更加方便。目录前言为什么使用`pyenv`安装`pyenv`1.安装依赖2.安装`pyenv`3.配置shell环境4.安装
python的继承 zhangbeizhen18 L01-基础
记录：备忘录。1.继承classPerson(object):def__init__(self,p_name,p_addr,p_age):self.name=p_nameself.addr=p_addrself.age=p_ageclassGirl(Person):def__init__(self,g_name,g_addr,g_age,g_bra_cup):Person.__init__(sel
【MySQL】表空间丢失处理（Tablespace is missing for table 错误处理） m0_74824823 面试学习路线阿里巴巴 mysql 数据库
问题背景最近，我在运行一个基于Python爬虫的项目时，爬虫需要频繁与MySQL数据库交互。不幸的是，在数据爬取过程中，Windows系统突然强制更新并重启。这次意外中断导致MySQL数据库的三个表格（2022年、2023年和2024年的数据表）出现了“Tablespaceismissing”的错误。起初，我尝试了常规的CHECKTABLE和REPAIRTABLE方法，但这些都没有解决问题。最终，
Python 继承详解江湖一条鱼 python
继承是面向对象编程（OOP）的一个重要特性，允许一个类（子类）从另一个类（父类）继承属性和方法。继承可以提高代码的重用性，增强程序的可扩展性和可维护性。目录一、继承的作用二、继承的语法1.单继承2.多继承三、子类扩展1.添加新功能2.重写父类方法3.调用父类方法四、继承的特殊情况1.子类初始化父类2.方法解析顺序（MRO）五、抽象类与接口1.抽象类2.接口3.ABC类4.使用方法1.定义抽象基类2
【如何学习商城源码】启山智软商城源码微信小程序小程序 java
学习商城源码是一个系统而深入的过程，需要掌握多种方法和技巧。以下是一些建议，帮助你有效地学习商城源码：一、搭建学习环境准备开发工具编程语言相关：根据商城源码使用的编程语言，安装相应的集成开发环境（IDE）。例如，若源码是Java语言编写的，可安装IntelliJIDEA或Eclipse；若是Python语言，可选择PyCharm等。这些IDE能帮助你高效地编辑、调试代码，提供语法高亮、自动补全等功
从零创建一个 Django 项目 m0_74824823 面试学习路线阿里巴巴 django python 后端
1.准备环境在开始之前，确保你的开发环境满足以下要求：安装了Python(推荐3.8或更高版本)。安装pip包管理工具。如果要使用MySQL或PostgreSQL，确保对应的数据库已安装。创建虚拟环境在项目目录中创建并激活虚拟环境，保证项目依赖隔离：#创建虚拟环境python-mvenvenv#激活虚拟环境#WindowsenvScriptsactivate#Linux/Macsourceenv/
anaconda中的python在pycharm中用不了_Pycharm中使用Anaconda 白白前
Pycharm中使用Anaconda问题：安装完Pycharm和Anaconda后，想让Pycharm能调用Anaconda中包含的各种包。这样就不用重复安装各种包了。Anaconda下载安装Anaconda指的是一个开源的Python发行版本，其包含了conda、Python等180多个科学包及其依赖项。因为包含了大量的科学包，Anaconda的下载文件比较大(约515MB)。安装Anacond
python - 永久存储 susie0815 python python 服务器
打开文件使用open()函数打开文件时，openfilemode（文件打开模式）是一个决定了以何种方式打开文件以及对文件可以进行哪些操作的重要参数。基本模式只读模式（‘r’）默认的打开模式，用于读取文件。如果文件不存在，会抛出FileNotFoundError异常。try:file=open('test.txt','r')content=file.read()print(content)file.
自动化测试的学习路线 Ws＿学习
自动化测试是提高软件开发效率和质量的关键手段。学习自动化测试通常涉及多个方面的技能，从基础的编程语言知识到测试工具的使用，再到实际的测试脚本编写和执行。以下是一个学习自动化测试的路线图，帮助你有条不紊地掌握相关技能：1.基础知识在开始自动化测试之前，首先要具备一定的编程和软件测试基础：编程语言：Python、Java、JavaScript或者Ruby（根据你选择的自动化测试工具决定）软件测试基础：
Python自动化测试 Ws＿ python python
Python自动化测试是软件开发中的重要组成部分，可以帮助提高测试效率和准确性。以下是学习Python自动化测试的基本路线，以及相关资料的链接：学习路线1.基础知识Python基础：掌握Python语言的基本语法、数据类型、控制流、函数、面向对象编程等。你可以先确保对Python的基本语法有清晰的理解。参考资料：Python官方文档书籍推荐：《Python编程：从入门到实践》2.了解自动化测试的基
Python实现Excel表格保存到不同文件夹 Leo_Aqu excel python
"""点击“上传”按钮，从本地上传待处理的Excel表格点击“处理”按钮，对Excel表格进行处理点击“保存A”按钮，保存处理后的Excel表格到A文件夹下点击“保存B”按钮，保存处理后的Excel表格到B文件夹下"""#作者:Leo#时间:2024/9/2621:52importtkinterastkfromtkinterimportfiledialog,messageboximportpand
通义灵码AI程序员天天向上杰 AI编程 AIGC 人工智能
通义灵码是阿里云与通义实验室联合打造的智能编码辅助工具，基于通义大模型技术，为开发者提供多种编程辅助功能。它支持多种编程语言，包括Java、Python、Go、TypeScript、JavaScript、C/C++、PHP、C#、Ruby等200多种编码语言。通义灵码AI程序员：今年1月，通义灵码AI程序员全面上线，同时支持VSCode、JetBrainsIDEs，是国内首个真正落地的AI程序员。
python使用技巧超超是超超 python
1、耗时装饰器importtimedefdecorate(func):definner():begin=time.time()result=func()end=time.time()print(f'函数{func}耗时{end-begin}')returnresultreturninner2、查看代码运行耗时fromline_profilerimportLineProfilerdefoperati
Anaconda与python和pycharm的安装及其关系 Daylight.. 学习笔记 pycharm python ide
Anaconda与python和pycharm的安装及其关系一、Anaconda与python和pycharm的关系：1.Anaconda包含python，并且里面含有许多常用的库。（安装了Anaconda就不需要安装python了）2.pycharm是一种IDE（集成开发环境），在其中可以编写Python程序。（工具和语言的关系）。二、如何安装？Anaconda的安装Anaconda官网下载地址
ImportError: cannot import name ‘Mapping‘ from ‘collections‘ AI算法网奇 python基础前端 javascript 数据库
ImportError:cannotimportname'Mapping'from'collections'解决方法：fromcollections.abcimportMapping#正确导入Mappingdefprocess_mapping(data):ifisinstance(data,Mapping):#使用Mapping进行类型检查#处理映射类型的代码pass测试命令：python-c"f
python图形界面化编程GUI（二）常用的组件(Text、Radiobutton、Checkbutton、Canvas)和布局管理器(gird、pack、place) hwwaizs python-GUI图形化编程 python 开发语言
Text文本框Text(多行文本框)的主要用于显示多行文本，还可以显示网页链接,图片,HTML页面,甚至CSS样式表，添加组件等。主要用来显示信息，也常被当做简单的文本处理器、⽂本编辑器或者网页浏览器来使用。IDLE就是Text组件构成的。insert插入的时候可以用INSERT代表当前光标的位置，END代表在结尾的位置，也可以用插入小数的形式，2.3代表第二行第三列后插入。fromtkinter
【深度解析】最短路径算法：Dijkstra与Floyd-Warshall 吴师兄大模型算法数据结构 python 最短路径算法 Dijkstra算法 Floyd-Warshall 开发语言
系列文章目录01-从零开始掌握Python数据结构：提升代码效率的必备技能！02-算法复杂度全解析：时间与空间复杂度优化秘籍03-线性数据结构解密：数组的定义、操作与实际应用04-深入浅出链表：Python实现与应用全面解析05-栈数据结构详解：Python实现与经典应用场景06-深入理解队列数据结构：从定义到Python实现与应用场景07-双端队列（Deque）详解：Python实现与滑动窗口应
CSE 231 Computer Python program 后端
CSE231Spring2025ComputerProject#4LearningobjectivesThisassignmentfocusesonthedesign,implementationandtestingofaPythonprogramthatusescharacterstringsforlookingattheDNAsequencesforkeyproteinsandseeingho
全网最全！DeepSeek 新手入门教程合集人工智能deepseek
如果你是初次接触DeepSeek的普通用户或开发者，面对海量教程却无从下手？别担心！本文为你整理全网最易懂、最实用的DeepSeek学习资源，涵盖快速上手、编程实战、系统手册等，附直达链接，收藏这一篇就够了！一、快速入门指南《DeepSeek入门教程》-博客园亮点：手把手教你注册账号、获取APIKey，并提供Python调用多轮对话的代码示例，适合初级开发者。直达链接：点击查看核心内容：API调用
【Python】Python入门——判断语句 zhoushanguhe Python python 编程开发语言
Python入门——判断语句。内容包括if语句、条件表达式、三元运算、match语句等。目录一、if语句1.基本if-else语句2.常用比较运算符3.if-else连写4.pass语句5.变量的作用域二、条件表达式三、三元运算四、match语句五、其他一、if语句1.基本if-else语句当条件成立时，执行某些语句；否则执行另一些语句。注意：if和else后需要加上冒号:if语句的代码块需要缩进
兄弟们，我的deepseek终于可以控制浏览器了：Part 1/n，含代码几道之旅 Dify：智能体（Agent）工作流知识库全搞定几道之旅AI专栏VVVIP 人工智能
文章目录前言helloworld前言其实，deepseek控制浏览器咱之前就发过，只不过当时没有想到这么好的标题，哈哈。所依赖的，依然是BrowserUse这个项目BrowserUse项目官网helloworld按照官网配置好环境后，只需新建一个python文件（例如，叫main.py?）然后运行即可。fromlangchain_openaiimportChatOpenAIfrombrowser_
项目中枚举与注解的结合使用飞翔的马甲 java enum annotation
前言：版本兼容，一直是迭代开发头疼的事，最近新版本加上了支持新题型，如果新创建一份问卷包含了新题型，那旧版本客户端就不支持，如果新创建的问卷不包含新题型，那么新旧客户端都支持。这里面我们通过给问卷类型枚举增加自定义注解的方式完成。顺便巩固下枚举与注解。一、枚举 1.在创建枚举类的时候，该类已继承java.lang.Enum类，所以自定义枚举类无法继承别的类，但可以实现接口。
【Scala十七】Scala核心十一：下划线_的用法 bit1129 scala
下划线_在Scala中广泛应用，_的基本含义是作为占位符使用。_在使用时是出问题非常多的地方，本文将不断完善_的使用场景以及所表达的含义 1. 在高阶函数中使用 scala> val list = List(-3,8,7,9) list: List[Int] = List(-3, 8, 7, 9) scala> list.filter(_ > 7) r
web缓存基础：术语、http报头和缓存策略 dalan_123 Web
对于很多人来说，去访问某一个站点，若是该站点能够提供智能化的内容缓存来提高用户体验，那么最终该站点的访问者将络绎不绝。缓存或者对之前的请求临时存储，是http协议实现中最核心的内容分发策略之一。分发路径中的组件均可以缓存内容来加速后续的请求，这是受控于对该内容所声明的缓存策略。接下来将讨web内容缓存策略的基本概念，具体包括如如何选择缓存策略以保证互联网范围内的缓存能够正确处理的您的内容，并谈论下
crontab 问题周凡杨 linux crontab unix
一： 0481-079 Reached a symbol that is not expected. 背景： */5 * * * * /usr/IBMIHS/rsync.sh
让tomcat支持2级域名共享session g21121 session
tomcat默认情况下是不支持2级域名共享session的，所有有些情况下登陆后从主域名跳转到子域名会发生链接session不相同的情况，但是只需修改几处配置就可以了。打开tomcat下conf下context.xml文件找到Context标签,修改为如下内容如果你的域名是www.test.com <Context sessionCookiePath="/path&q
web报表工具FineReport常用函数的用法总结（数学和三角函数）老A不折腾 Web finereport 总结
ABS ABS(number):返回指定数字的绝对值。绝对值是指没有正负符号的数值。 Number:需要求出绝对值的任意实数。示例: ABS(-1.5)等于1.5。 ABS(0)等于0。 ABS(2.5)等于2.5。 ACOS ACOS(number):返回指定数值的反余弦值。反余弦值为一个角度，返回角度以弧度形式表示。 Number:需要返回角
linux 启动java进程 sh文件墙头上一根草 linux shell jar
#!/bin/bash #初始化服务器的进程PId变量 user_pid=0; robot_pid=0; loadlort_pid=0; gateway_pid=0; ######### #检查相关服务器是否启动成功 #说明： #使用JDK自带的JPS命令及grep命令组合，准确查找pid #jps 加 l 参数，表示显示java的完整包路径 #使用awk，分割出pid
我的spring学习笔记5-如何使用ApplicationContext替换BeanFactory aijuans Spring 3 系列
如何使用ApplicationContext替换BeanFactory？ package onlyfun.caterpillar.device; import org.springframework.beans.factory.BeanFactory; import org.springframework.beans.factory.xml.XmlBeanFactory; import
Linux 内存使用方法详细解析 annan211 linux 内存 Linux内存解析
来源 http://blog.jobbole.com/45748/ 我是一名程序员，那么我在这里以一个程序员的角度来讲解Linux内存的使用。一提到内存管理，我们头脑中闪出的两个概念，就是虚拟内存，与物理内存。这两个概念主要来自于linux内核的支持。 Linux在内存管理上份为两级，一级是线性区，类似于00c73000-00c88000，对应于虚拟内存，它实际上不占用
数据库的单表查询常用命令及使用方法(-) 百合不是茶 oracle 函数单表查询
创建数据库; --建表 create table bloguser(username varchar2(20),userage number(10),usersex char(2)); 创建bloguser表,里面有三个字段 &nbs
多线程基础知识 bijian1013 java 多线程 thread java多线程
一．进程和线程进程就是一个在内存中独立运行的程序，有自己的地址空间。如正在运行的写字板程序就是一个进程。 “多任务”：指操作系统能同时运行多个进程（程序）。如WINDOWS系统可以同时运行写字板程序、画图程序、WORD、Eclipse等。线程：是进程内部单一的一个顺序控制流。线程和进程 a. 每个进程都有独立的
fastjson简单使用实例 bijian1013 fastjson
一.简介阿里巴巴fastjson是一个Java语言编写的高性能功能完善的JSON库。它采用一种“假定有序快速匹配”的算法，把JSON Parse的性能提升到极致，是目前Java语言中最快的JSON库；包括“序列化”和“反序列化”两部分，它具备如下特征：
【RPC框架Burlap】Spring集成Burlap bit1129 spring
Burlap和Hessian同属于codehaus的RPC调用框架，但是Burlap已经几年不更新，所以Spring在4.0里已经将Burlap的支持置为Deprecated,所以在选择RPC框架时，不应该考虑Burlap了。这篇文章还是记录下Burlap的用法吧，主要是复制粘贴了Hessian与Spring集成一文，【RPC框架Hessian四】Hessian与Spring集成
【Mahout一】基于Mahout 命令参数含义 bit1129 Mahout
1. mahout seqdirectory $ mahout seqdirectory --input (-i) input Path to job input directory(原始文本文件). --output (-o) output The directory pathna
linux使用flock文件锁解决脚本重复执行问题 ronin47 linux lock　重复执行
linux的crontab命令，可以定时执行操作，最小周期是每分钟执行一次。关于crontab实现每秒执行可参考我之前的文章《linux crontab 实现每秒执行》现在有个问题，如果设定了任务每分钟执行一次，但有可能一分钟内任务并没有执行完成，这时系统会再执行任务。导致两个相同的任务在执行。例如： <? // test .php
java-74-数组中有一个数字出现的次数超过了数组长度的一半，找出这个数字 bylijinnan java
public class OcuppyMoreThanHalf { /** * Q74 数组中有一个数字出现的次数超过了数组长度的一半，找出这个数字 * two solutions: * 1.O(n) * see <beauty of coding>--每次删除两个不同的数字，不改变数组的特性 * 2.O(nlogn) * 排序。中间
linux 系统相关命令 candiio linux
系统参数 cat /proc/cpuinfo cpu相关参数 cat /proc/meminfo 内存相关参数 cat /proc/loadavg 负载情况性能参数 1）top M：按内存使用排序 P：按CPU占用排序 1：显示各CPU的使用情况 k：kill进程 o：更多排序规则回车：刷新数据 2）ulimit ulimit -a：显示本用户的系统限制参
[经营与资产]保持独立性和稳定性对于软件开发的重要意义 comsci 软件开发
一个软件的架构从诞生到成熟，中间要经过很多次的修正和改造如果在这个过程中，外界的其它行业的资本不断的介入这种软件架构的升级过程中那么软件开发者原有的设计思想和开发路线
在CentOS5.5上编译OpenJDK6 Cwind linux OpenJDK
几番周折终于在自己的CentOS5.5上编译成功了OpenJDK6，将编译过程和遇到的问题作一简要记录，备查。 0. OpenJDK介绍 OpenJDK是Sun（现Oracle）公司发布的基于GPL许可的Java平台的实现。其优点： 1、它的核心代码与同时期Sun（-> Oracle）的产品版基本上是一样的，血统纯正，不用担心性能问题，也基本上没什么兼容性问题；（代码上最主要的差异是
java乱码问题 dashuaifu java乱码问题 js中文乱码
swfupload上传文件参数值为中文传递到后台接收中文乱码在js中用setPostParams（{"tag" : encodeURI( document.getElementByIdx_x("filetag").value，"utf-8")}）; 然后在servlet中String t
cygwin很多命令显示command not found的解决办法 dcj3sjt126com cygwin
cygwin很多命令显示command not found的解决办法修改cygwin.BAT文件如下 @echo off D: set CYGWIN=tty notitle glob set PATH=%PATH%;d:\cygwin\bin;d:\cygwin\sbin;d:\cygwin\usr\bin;d:\cygwin\usr\sbin;d:\cygwin\us
[介绍]从 Yii 1.1 升级 dcj3sjt126com PHP yii2
2.0 版框架是完全重写的，在 1.1 和 2.0 两个版本之间存在相当多差异。因此从 1.1 版升级并不像小版本间的跨越那么简单，通过本指南你将会了解两个版本间主要的不同之处。如果你之前没有用过 Yii 1.1，可以跳过本章，直接从"入门篇"开始读起。请注意，Yii 2.0 引入了很多本章并没有涉及到的新功能。强烈建议你通读整部权威指南来了解所有新特性。这样有可能会发
Linux SSH免登录配置总结 eksliang ssh-keygen Linux SSH免登录认证 Linux SSH互信
转载请出自出处：http://eksliang.iteye.com/blog/2187265 一、原理我们使用ssh-keygen在ServerA上生成私钥跟公钥，将生成的公钥拷贝到远程机器ServerB上后,就可以使用ssh命令无需密码登录到另外一台机器ServerB上。生成公钥与私钥有两种加密方式，第一种是
手势滑动销毁Activity gundumw100 android
老是效仿ios，做android的真悲催！有需求：需要手势滑动销毁一个Activity 怎么办尼？自己写？不用~，网上先问一下百度。结果： http://blog.csdn.net/xiaanming/article/details/20934541 首先将你需要的Activity继承SwipeBackActivity，它会在你的布局根目录新增一层SwipeBackLay
JavaScript变换表格边框颜色 ini JavaScript html Web html5 css
效果查看：http://hovertree.com/texiao/js/2.htm代码如下，保存到HTML文件也可以查看效果： <html> <head> <meta charset="utf-8"> <title>表格边框变换颜色代码-何问起</title> </head> <body&
Kafka Rest : Confluent kane_xie kafka REST confluent
最近拿到一个kafka rest的需求，但kafka暂时还没有提供rest api（应该是有在开发中，毕竟rest这么火），上网搜了一下，找到一个Confluent Platform，本文简单介绍一下安装。这里插一句，给大家推荐一个九尾搜索，原名叫谷粉SOSO，不想fanqiang谷歌的可以用这个。以前在外企用谷歌用习惯了，出来之后用度娘搜技术问题，那匹配度简直感人。环境声明：Ubu
Calender不是单例 men4661273 单例 Calender
在我们使用Calender的时候，使用过Calendar.getInstance()来获取一个日期类的对象，这种方式跟单例的获取方式一样，那么它到底是不是单例呢，如果是单例的话，一个对象修改内容之后，另外一个线程中的数据不久乱套了吗？从试验以及源码中可以得出，Calendar不是单例。测试： Calendar c1 =
线程内存和主内存之间联系 qifeifei java thread
1， java多线程共享主内存中变量的时候，一共会经过几个阶段， lock:将主内存中的变量锁定，为一个线程所独占。 unclock:将lock加的锁定解除，此时其它的线程可以有机会访问此变量。 read:将主内存中的变量值读到工作内存当中。 load:将read读取的值保存到工作内存中的变量副本中。
schedule和scheduleAtFixedRate tangqi609567707 java timer schedule
原文地址：http://blog.csdn.net/weidan1121/article/details/527307 import java.util.Timer;import java.util.TimerTask;import java.util.Date; /** * @author vincent */public class TimerTest {
erlang 部署 wudixiaotie erlang
1.如果在启动节点的时候报这个错： {"init terminating in do_boot",{'cannot load',elf_format,get_files}} 则需要在reltool.config中加入 {app, hipe, [{incl_cond, exclude}]}, 2.当generate时，遇到： ERROR