Categorical variables, also known as qualitative variables, are a fundamental concept in statistics and data analysis. Here's a breakdown to help you understand them:
What are they?
Types of Categorical Variables:
Why are they important?
Things to remember:
In machine learning, categorical data often needs to be converted into a numerical representation before algorithms can process it. Ordinal encoding is a technique that assigns an integer to each category according to the categories' inherent order or ranking. It is suitable for variables where the order between categories matters, such as clothing sizes (S < M < L).
Key Points:
Example in Python:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green', 'Blue'],
    'Size': ['S', 'L', 'M', 'M', 'M', 'L']
}
df = pd.DataFrame(data)
print(df)

# Ordinal encode 'Color' with the order Red < Blue < Green.
# This order is arbitrary and only for illustration: colors have no natural ranking.
color_encoder = OrdinalEncoder(categories=[['Red', 'Blue', 'Green']])
encoded_color = color_encoder.fit_transform(df[['Color']])
print("\nEncoded 'Color':")
print(encoded_color)

# Ordinal encode 'Size' (S < M < L). A separate encoder is required because
# the category list must match the values found in the column being encoded.
size_encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
encoded_size = size_encoder.fit_transform(df[['Size']])
print("\nEncoded 'Size':")
print(encoded_size)
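The same ordinal mapping can also be sketched with pandas alone. This is a minimal illustration, not a replacement for the scikit-learn approach; the names `sizes`, `size_order`, and `size_codes` are chosen here for the example. It relies on an ordered `pd.Categorical`, whose integer codes follow the order of the category list.

```python
import pandas as pd

# The 'Size' values from the example, with the assumed order S < M < L.
sizes = pd.Series(['S', 'L', 'M', 'M', 'M', 'L'])
size_order = ['S', 'M', 'L']

# An ordered categorical stores each value as the index of its category.
ordered_sizes = pd.Categorical(sizes, categories=size_order, ordered=True)
size_codes = list(ordered_sizes.codes)

print(size_codes)  # [0, 2, 1, 1, 1, 2]
```

Values that are missing from the category list receive the code -1, so it is worth checking for that before training a model.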
In machine learning, one-hot encoding is a technique used to represent categorical variables as numerical vectors suitable for use in algorithms that expect numerical inputs. It works by creating a new binary vector, with one position for each category in the original variable. The position corresponding to the actual category value is set to 1, while all other positions are set to 0.
Illustration:
Imagine you have a categorical variable representing eye color with three possible values: "blue", "brown", and "green". Here's how one-hot encoding would work:
| Original Value | One-Hot Encoded Vector |
|---|---|
| "blue" | [1, 0, 0] |
| "brown" | [0, 1, 0] |
| "green" | [0, 0, 1] |
You can see that each vector has a length equal to the number of categories (3 in this case), and only one value in the vector is 1, indicating the actual category.
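The mapping in the table can be reproduced in a few lines of plain Python. This is a sketch for illustration only; the helper name `one_hot` and the fixed category order are assumptions made for this example.

```python
# Categories in a fixed order; a category's position is its index in this list.
categories = ["blue", "brown", "green"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


for color in categories:
    print(color, one_hot(color, categories))
```

Each vector has one entry per category and sums to 1, which matches the table.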
Advantages:
Disadvantages:
Python Example:
# Import libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Create sample data
data = {'color': ['blue', 'brown', 'green', 'blue', 'brown']}
df = pd.DataFrame(data)
# One-hot encode the 'color' column
encoder = OneHotEncoder(sparse_output=False) # Dense array; scikit-learn versions before 1.2 used 'sparse=False'
encoded_df = pd.DataFrame(encoder.fit_transform(df[['color']]), columns=encoder.categories_[0])
# Combine original and encoded data
df_combined = pd.concat([df, encoded_df], axis=1)
print(df_combined)
Cardinality refers to the number of unique elements in a set. Think of it as the size of a bucket containing unique items. In different contexts, cardinality can refer to:
1. Set Cardinality:
In Python, the len() function tells you the cardinality of a set.
my_set = {1, 2, 3, 2, 4}  # Duplicate values are removed
print(len(my_set))  # Output: 4 (unique elements: 1, 2, 3, 4)
2. Cardinality in Relations (Databases):
3. Cardinality in Statistics:
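For a categorical column in a dataset, cardinality is the number of distinct categories the column takes. The sketch below, which assumes pandas and uses a small made-up DataFrame, counts the unique values per column and shows why this matters for one-hot encoding: the encoded table gets one column per category.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "color": ["blue", "brown", "green", "blue", "brown"],
    "size": ["S", "L", "S", "S", "L"],
})

# Number of unique values in each column.
cardinality = df.nunique()
print(cardinality)

# One-hot encoding produces one column per category, so the width of the
# encoded table equals the sum of the column cardinalities.
encoded_width = pd.get_dummies(df).shape[1]
print(encoded_width)
```

Here `color` has cardinality 3 and `size` has cardinality 2, so the one-hot encoded table has 5 columns.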
Independent variable
An independent variable is a variable that the experimenter deliberately changes in a scientific experiment. It is the variable that is tested to see how it affects the dependent variable. The independent variable is also called the "manipulated variable." It should not be confused with a "controlled variable," which is a variable held constant so that it does not influence the result.
Dependent variable
A dependent variable is a variable that is affected by the independent variable. It is the variable that is measured in a scientific experiment. The dependent variable is also called the "responding variable" or the "measured variable."
Scatter plot
A scatter plot is a type of graph that shows the relationship between two variables. The independent variable is plotted on the x-axis, and the dependent variable is plotted on the y-axis. Each data point is represented by a dot on the graph. The dots are not connected to one another; instead, a trend line (line of best fit) can be drawn through them to show the overall trend of the data.
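A scatter plot of this kind can be sketched with matplotlib and NumPy. The data below are made up for illustration (hours of sunlight as the independent variable, plant growth as the dependent variable), and the output file name is arbitrary. The trend line is a least-squares straight line computed with `np.polyfit`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: hours of sunlight (independent) and plant growth in cm (dependent).
sunlight_hours = np.array([1, 2, 3, 4, 5])
growth_cm = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line through the points (degree-1 polynomial fit).
slope, intercept = np.polyfit(sunlight_hours, growth_cm, 1)

fig, ax = plt.subplots()
ax.scatter(sunlight_hours, growth_cm, label="observations")  # one dot per data point
ax.plot(sunlight_hours, slope * sunlight_hours + intercept, label="trend line")
ax.set_xlabel("Hours of sunlight (independent variable)")
ax.set_ylabel("Growth in cm (dependent variable)")
ax.legend()
fig.savefig("scatter.png")  # arbitrary file name; use plt.show() in an interactive session
```

The individual dots stay unconnected; only the fitted line is drawn through them.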