Kaggle: Detect toxicity - Basic EDA -1

This Kaggle competition is:

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview


The goal is:

to classify toxic comments
and, especially, to recognize unintended bias towards identities

A toxic comment is a comment that is rude, disrespectful, or otherwise likely to make someone leave a discussion.


The challenge is:

some neutral comments that mention an identity term such as "gay" can be classified as toxic, e.g. "I am a gay woman".

The reason is:
in the training data, toxic comments mentioning these identities outnumber neutral comments mentioning the same identities, so a model tends to associate the identity terms themselves with toxicity.


Dataset

The comments are labeled with a toxicity score (target); a subset is additionally labeled with the identities mentioned in each comment.

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data

These columns are subtypes of toxicity; they do not need to be predicted:

severe_toxicity
obscene
threat
insult
identity_attack
sexual_explicit

These columns correspond to identity attributes,
representing the identities that are mentioned in the comment:

male
female
transgender
other_gender
heterosexual
homosexual_gay_or_lesbian
bisexual
other_sexual_orientation
christian
jewish
muslim
hindu
buddhist
atheist
other_religion
black
white
asian
latino
other_race_or_ethnicity
physical_disability
intellectual_or_learning_disability
psychiatric_or_mental_illness
other_disability

Additional columns:

toxicity_annotator_count and identity_annotator_count, plus metadata from Civil Comments: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, disagree. The rating label is the civility rating that Civil Comments users gave the comment.

Example:

Comment: Continue to stand strong LGBT community. Yes, indeed, you'll overcome and you have.
Toxicity Labels: All 0.0
Identity Mention Labels: homosexual_gay_or_lesbian: 0.8, bisexual: 0.6, transgender: 0.3 (all others 0.0)
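
To pull similar examples out of the training data, here is a minimal, self-contained sketch (it assumes the ../input path and the comment_text, target and homosexual_gay_or_lesbian column names from the data page; the data is loaded properly in section 1 below):

import pandas as pd

df = pd.read_csv('../input/train.csv')
# non-toxic comments (target < .5) that strongly mention the homosexual_gay_or_lesbian identity
mask = (df['homosexual_gay_or_lesbian'] >= 0.5) & (df['target'] < 0.5)
print(df.loc[mask, ['comment_text', 'target', 'homosexual_gay_or_lesbian']].head())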


1. Libs and Data:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
print(os.listdir("../input"))
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')

2. Shape of data

train_len, test_len = len(train_df.index), len(test_df.index)
print(f'train size: {train_len}, test size: {test_len}')

train size: 1804874, test size: 97320

train_df.head()

3. Count the amount of missing values

miss_val_train_df = train_df.isnull().sum(axis=0) / train_len
miss_val_train_df = miss_val_train_df[miss_val_train_df > 0] * 100
miss_val_train_df
  • a large portion of the data doesn't have the identity tags
  • the missing-value percentage is the same for every identity column, i.e. for a given comment the identity tags are either all present or all missing (see the quick check below)
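
A quick check of that second observation (a sketch, not part of the original kernel; it spot-checks a few identity columns by name):

# if the missing percentages are identical, the identity columns should be
# missing on exactly the same rows; spot-check a few of them
cols = ['male', 'female', 'black', 'white', 'muslim']   # small subset, for illustration
missing = train_df[cols].isnull()
print((missing.nunique(axis=1) == 1).all())   # True: each row is either fully labeled or fully unlabeled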

4. Visualization

Q1: which identity appears the most in the dataset?

According to the data description, we only care about the identities tagged in this dataset, so we make a list of them:

identities = ['male','female','transgender','other_gender','heterosexual','homosexual_gay_or_lesbian',
              'bisexual','other_sexual_orientation','christian','jewish','muslim','hindu','buddhist',
              'atheist','other_religion','black','white','asian','latino','other_race_or_ethnicity',
              'physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness',
              'other_disability']

From the diagram below we can see the distribution of toxic and non-toxic comments for each identity:

# getting the dataframe with identities tagged
train_labeled_df = train_df.loc[:, ['target'] + identities ].dropna()
# let's define a toxic comment as one with a target score greater than or equal to .5
# we then split the data into two dataframes so we can count toxic vs non-toxic comments per identity
toxic_df = train_labeled_df[train_labeled_df['target'] >= .5][identities]
non_toxic_df = train_labeled_df[train_labeled_df['target'] < .5][identities]

# at first, we only consider the identity tags in binary form: if a tag has any value other than 0 we treat it as 1
toxic_count = toxic_df.where(toxic_df == 0, other=1).sum()
non_toxic_count = non_toxic_df.where(non_toxic_df == 0, other=1).sum()

# now we can concat the two series together to get a toxic count vs non toxic count for each identity
toxic_vs_non_toxic = pd.concat([toxic_count, non_toxic_count], axis=1)
toxic_vs_non_toxic = toxic_vs_non_toxic.rename(index=str, columns={1: "non-toxic", 0: "toxic"})
# here we plot the stacked graph but we sort it by toxic comments to (perhaps) see something interesting
toxic_vs_non_toxic.sort_values(by='toxic').plot(kind='bar', stacked=True, figsize=(30,10), fontsize=20).legend(prop={'size': 20})
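
The absolute counts above are dominated by the most frequently mentioned identities. As a follow-up (a sketch, not in the original kernel), we can reuse toxic_count and non_toxic_count to plot the share of toxic comments per identity:

# fraction of identity-labeled comments that are toxic, per identity
toxic_share = toxic_count / (toxic_count + non_toxic_count)
toxic_share.sort_values(ascending=False).plot(kind='bar', figsize=(30, 10), fontsize=20)
plt.ylabel('share of toxic comments')
plt.show()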

Q2: which identities are more frequently related to toxic comments?

  • take into account the toxicity score (target) of each comment
  • and also the value of each identity tag, i.e. how strongly the comment is associated with that identity

# first, we multiply each identity column by the target and sum over all comments
weighted_toxic = train_labeled_df.iloc[:, 1:].multiply(train_labeled_df.iloc[:, 0], axis="index").sum() 
# binarize the identity tags (0 stays 0, anything else becomes 1) to get the comment count per identity
identity_label_count = train_labeled_df[identities].where(train_labeled_df[identities] == 0, other=1).sum()
# then we divide the target-weighted sum by the number of times each identity appears
weighted_toxic = weighted_toxic / identity_label_count
weighted_toxic = weighted_toxic.sort_values(ascending=False)
# plot the data as a horizontal bar chart using seaborn
plt.figure(figsize=(30,20))
sns.set(font_scale=3)
ax = sns.barplot(x = weighted_toxic.values , y = weighted_toxic.index, alpha=0.8)
plt.ylabel('Demographics')
plt.xlabel('Weighted Toxicity')
plt.title('Weighted Analysis of Most Frequent Identities')
plt.show()
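
To read the exact values behind the bars, a quick follow-up (not in the original kernel):

# top identities by weighted toxicity, as numbers rather than bars
print(weighted_toxic.head(10).round(3))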

Conclusion: race-based identities (white and black) and religion-based identities (Muslim and Jewish) are the ones most heavily associated with toxic comments.
