PySpark DataFrame sampling methods

Method 1 (note: despite the title, this first snippet uses the pandas API — it oversamples the minority class with replacement until it matches the majority class):

import pandas as pd

# Split by class, then oversample the minority (label == 1) with replacement
df_class_0 = df_train[df_train['label'] == 0]
df_class_1 = df_train[df_train['label'] == 1]
count_class_0 = len(df_class_0)  # majority-class size (undefined in the original)
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_train_over = pd.concat([df_class_0, df_class_1_over], axis=0)
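A minimal self-contained sketch of the pandas approach above, using a toy imbalanced frame (the data and the `random_state` value are made up for illustration):

```python
import pandas as pd

# Toy imbalanced training set: 6 majority rows, 2 minority rows
df_train = pd.DataFrame({
    'feature': range(8),
    'label':   [0, 0, 0, 0, 0, 0, 1, 1],
})

df_class_0 = df_train[df_train['label'] == 0]
df_class_1 = df_train[df_train['label'] == 1]

# Draw with replacement until the minority count equals the majority count
count_class_0 = len(df_class_0)
df_class_1_over = df_class_1.sample(count_class_0, replace=True, random_state=2018)

df_train_over = pd.concat([df_class_0, df_class_1_over], axis=0)
print(df_train_over['label'].value_counts())  # both classes now have 6 rows
```

After the `concat`, both classes contribute the same number of rows, so downstream training sees a balanced label distribution.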

Method 2 (PySpark — oversample the minority class with replacement, then union with the majority class):

from pyspark.sql.functions import col

# Step 1: oversample the label == 1 rows with replacement;
# with withReplacement=True the fraction may exceed 1.0 (~10x here)
train_1 = train_initial.where(col('label') == 1).sample(True, 10.0, seed=2018)

# Step 2: merge this data with the label == 0 data
train_0 = train_initial.where(col('label') == 0)
train_final = train_0.union(train_1)
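The `10.0` above is a hard-coded sampling fraction. Since `sample(True, f)` draws roughly `f * n` rows, the fraction can instead be computed from the class counts so the expected sizes match; a small sketch of that arithmetic (the helper name `oversample_fraction` and the example counts are hypothetical):

```python
# Hypothetical helper: choose the fraction for DataFrame.sample so the
# oversampled minority class roughly matches the majority class in size.
def oversample_fraction(count_majority, count_minority):
    # sample(withReplacement=True, fraction=f) yields ~f * n rows,
    # so f = majority / minority equalises the expected class counts
    return count_majority / count_minority

# e.g. 9,000 negatives vs 900 positives gives a fraction of 10.0,
# matching the hard-coded value in the snippet above
print(oversample_fraction(9000, 900))
```

Note that `sample` is approximate: the returned row count varies around the expectation, so the resulting classes will be roughly, not exactly, balanced.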

References:

  1. Stack Overflow
