YUI0908

基于Xgboost的不均衡数据分类

1.项目分析与设计

该项目通过美国人口普查数据训练一个模型来预测美国人口收入水平。数据集上包含199523个训练数据和99762个测试数据，各包含了41个属性。经分析，该数据包含了人口统计信息、年龄、贷款信息、国籍、种族等信息。属性数据中有包含空值和有偏分布等问题，处理思路如下：
1.读取数据，观察特征及其分布
2.分析缺失情况，处理缺失值
3.异常值处理
4.对分类变量进行哑编码
5.用随机森林进行重要特征筛选
6.重采样对不均衡数据进行处理
7.构建XGBOOST模型，并进行建模分析预测

2. 数据探索

      In [1]: 
    

import numpy as np
import pandas as pd

      In [2]: 
    

train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
print ('train_df:%s,%s'%train_df.shape)
print ('test_df:%s,%s'%test_df.shape)

train_df:199523,41
test_df:99762,41

      In [3]: 
    

##检查因变量

train_df.income_level.unique()
test_df.income_level.unique()

        Out[3]: 
      

array(['-50000', '50000+.'], dtype=object)

      In [4]: 
    

#为了便于分析，将变量编码为0,1
train_df.loc[train_df['income_level']==-50000,'income_level']=0
train_df.loc[train_df['income_level']== 50000,'income_level']=1
test_df.loc[test_df['income_level']=='-50000','income_level']=0
test_df.loc[test_df['income_level']=='50000+.','income_level']=1

      In [5]: 
    

##检查样本不均衡程度
a=train_df['income_level'].sum()*100.0/train_df['income_level'].count()
b=test_df['income_level'].sum()*100.0/test_df['income_level'].count()
print ('train_df  (1,0):(%s,%s)'%(a,100-a))
print ('test_df  (1,0):(%s,%s)'%(b,100-b))

train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)

观察数据

      In [6]: 
    

train_df.info()


RangeIndex: 199523 entries, 0 to 199522
Data columns (total 41 columns):
age                                 199523 non-null int64
class_of_worker                     199523 non-null object
industry_code                       199523 non-null int64
occupation_code                     199523 non-null int64
education                           199523 non-null object
wage_per_hour                       199523 non-null int64
enrolled_in_edu_inst_lastwk         199523 non-null object
marital_status                      199523 non-null object
major_industry_code                 199523 non-null object
major_occupation_code               199523 non-null object
race                                199523 non-null object
hispanic_origin                     198649 non-null object
sex                                 199523 non-null object
member_of_labor_union               199523 non-null object
reason_for_unemployment             199523 non-null object
full_parttime_employment_stat       199523 non-null object
capital_gains                       199523 non-null int64
capital_losses                      199523 non-null int64
dividend_from_Stocks                199523 non-null int64
tax_filer_status                    199523 non-null object
region_of_previous_residence        199523 non-null object
state_of_previous_residence         198815 non-null object
d_household_family_stat             199523 non-null object
d_household_summary                 199523 non-null object
migration_msa                       99827 non-null object
migration_reg                       99827 non-null object
migration_within_reg                99827 non-null object
live_1_year_ago                     199523 non-null object
migration_sunbelt                   99827 non-null object
num_person_Worked_employer          199523 non-null int64
family_members_under_18             199523 non-null object
country_father                      192810 non-null object
country_mother                      193404 non-null object
country_self                        196130 non-null object
citizenship                         199523 non-null object
business_or_self_employed           199523 non-null int64
fill_questionnaire_veteran_admin    199523 non-null object
veterans_benefits                   199523 non-null int64
weeks_worked_in_year                199523 non-null int64
year                                199523 non-null int64
income_level                        199523 non-null int64
dtypes: int64(13), object(28)
memory usage: 62.4+ MB

**数值数据观察

      In [7]: 
    

import matplotlib.pyplot as plt
def num_tr(filed,n):
    fig=plt.figure(figsize=(10,5))
    train_df[filed].hist(bins=n) 
    plt.title('%s'%filed) 
    plt.show()

1.age

1.1 分布

      In [8]: 
    

num_tr('age',100)

如图可以观察到，年龄在0-90之间，并且随着年龄的增大，人数减少

我猜测20岁以下和步入工作不久的人，比较不可能>50K，但是也不一定
现在将其分组，0-22,22-35,35-60,60-90,对应编码为：0,1,2,3(22岁为本科毕业平均年龄，35为工作初期（前10年），60岁为退休年龄）

      In [9]: 
    

'''
#创建年龄分组字段
labels=[0,1,2,3,4,5,6,7,8,9]
train_df['age_class']=pd.cut(train_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
test_df['age_class']=pd.cut(test_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
'''

        Out[9]: 
      

"\n#\xe5\x88\x9b\xe5\xbb\xba\xe5\xb9\xb4\xe9\xbe\x84\xe5\x88\x86\xe7\xbb\x84\xe5\xad\x97\xe6\xae\xb5\nlabels=[0,1,2,3,4,5,6,7,8,9]\ntrain_df['age_class']=pd.cut(train_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)\ntest_df['age_class']=pd.cut(test_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)\n"

1.2 年龄与目标变量的关系

收入水平为1的主要集中在30-50岁之间,并且可以看出，收入水平为1的人群年龄分布是接近正态的，均值为50.

      In [10]: 
    

'''
fig=plt.figure(figsize=(12,6))
train_df.groupby(['age_class','income_level'])['income_level'].count().unstack().plot(kind='bar')

plt.title('income_level wrt age') 
plt.show()
    
'''

        Out[10]: 
      

"\nfig=plt.figure(figsize=(12,6))\ntrain_df.groupby(['age_class','income_level'])['income_level'].count().unstack().plot(kind='bar')\n\nplt.title('income_level wrt age') \nplt.show()\n    \n"

      In [11]: 
    

#查看收入水平的人群的年龄分布
fig=plt.figure(figsize=(12,6))
train_df.age[train_df.income_level==0].plot(kind='kde')
train_df.age[train_df.income_level==1].plot(kind='kde')
plt.legend(('0','1'))
plt.show()

2.capital_losses&capital_gains

右偏数据。后续有待进一步分析。

      In [12]: 
    

fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.capital_gains.plot(kind='box')
plt.subplot2grid((1,2),(0,1))
train_df.capital_losses.plot(kind='box')
plt.show()

3.weeks_worked_in_year

水平为0的主要集中在0,50，而水平位1的则主要为50. 这里可以看出，水平为1的该变量几乎没有取值为1的。

      In [13]: 
    

#查看收入水平的人群的周工作时长分布
fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.weeks_worked_in_year[train_df.income_level==0].hist(bins=20)
plt.subplot2grid((1,2),(0,1))
train_df.weeks_worked_in_year[train_df.income_level==1].hist(bins=20,color='r')
plt.show()

4.divided_from_stocks

右偏数据。后续有待进一步分析。

      In [14]: 
    

fig=plt.figure(figsize=(12,6))
train_df.dividend_from_Stocks[train_df.income_level==0].hist(bins=100)
train_df.dividend_from_Stocks[train_df.income_level==1].hist(bins=100)
plt.legend(('0','1'))
plt.show()

5.num_person_Worked_employment

level为0的主要为0，level为1的则主要是6.

      In [15]: 
    

fig=plt.figure(figsize=(12,6))
#train_df.num_person_Worked_employer[train_df.income_level==0].hist(bins=100)
#train_df.num_person_Worked_employer[train_df.income_level==1].hist(bins=100)
train_df.groupby(['num_person_Worked_employer','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.legend(('0','1'))
plt.show()

**分类数据观察

1.class_of_worker

虽然没有提供有关Not in universe类别的具体信息。我们假设这个答案是由填写人口普查数据而感到沮丧的人（由于任何原因）给出的。
这个变量看起来不平衡，即只有两个类别似乎占主导地位。在这种情况下，一个好的做法是将总分类频率的频率小于5％的水平组合起来。后续处理。

      In [16]: 
    

fig=plt.figure(figsize=(18,12))
train_df.groupby(['class_of_worker','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=120)
plt.legend(('0','1'))
plt.show()

2.education

Bachelors degree学士学位有最多level为1的

      In [17]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['education','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

3.marital_status

Married-civilian spouse present已婚且配偶在场的人中level=1最多

      In [18]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['marital_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

4.race种族

白人占统样本的多数，并且，白人的level=1也最多。

      In [19]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['race','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

5.sex

样本中，整体女性人数占比较大，但是level为1的主要为男性。

      In [20]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['sex','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

6.member_of_labor_union

两类都集中在Not in universe中

      In [21]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['member_of_labor_union','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

7.full_parttime_employment_stat

      In [22]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['full_parttime_employment_stat','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

8.tax_filer_status

      In [23]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['tax_filer_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

9.business_or_self_employed

      In [24]: 
    

fig=plt.figure(figsize=(18,8))
train_df.groupby(['business_or_self_employed','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()

2. 数据预处理

2.1 缺失值处理

test数据没有缺失值,但是有?，首先将?置为空

      In [25]: 
    

s=pd.Series(train_df.isnull().sum())
print s
ss=pd.Series(test_df.isnull().sum())
print ss

age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       874
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           708
d_household_family_stat                 0
d_household_summary                     0
migration_msa                       99696
migration_reg                       99696
migration_within_reg                99696
live_1_year_ago                         0
migration_sunbelt                   99696
num_person_Worked_employer              0
family_members_under_18                 0
country_father                       6713
country_mother                       6119
country_self                         3393
citizenship                             0
business_or_self_employed               0
fill_questionnaire_veteran_admin        0
veterans_benefits                       0
weeks_worked_in_year                    0
year                                    0
income_level                            0
dtype: int64
age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       405
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           330
d_household_family_stat                 0
d_household_summary                     0
migration_msa                       49946
migration_reg                       49946
migration_within_reg                49946
live_1_year_ago                         0
migration_sunbelt                   49946
num_person_Worked_employer              0
family_members_under_18                 0
country_father                       3429
country_mother                       3072
country_self                         1764
citizenship                             0
business_or_self_employed               0
fill_questionnaire_veteran_admin        0
veterans_benefits                       0
weeks_worked_in_year                    0
year                                    0
income_level                            0
dtype: int64

计算出缺失百分比

      In [26]: 
    

##检查样本缺失比例
m=train_df.shape[0]
for i,j in s.iteritems():
    if j>0:
        print i,j*100.0/m
print '----------------------------'
#检查测试样本缺失比例
n=test_df.shape[0]
for i,j in ss.iteritems():
    if j>0:
        print i,j*100.0/n

hispanic_origin 0.438044736697
state_of_previous_residence 0.354846308446
migration_msa 49.9671717045
migration_reg 49.9671717045
migration_within_reg 49.9671717045
migration_sunbelt 49.9671717045
country_father 3.36452439067
country_mother 3.06681435223
country_self 1.70055582564
----------------------------
hispanic_origin 0.405966199555
state_of_previous_residence 0.330787273711
migration_msa 50.0651550691
migration_reg 50.0651550691
migration_within_reg 50.0651550691
migration_sunbelt 50.0651550691
country_father 3.43718048957
country_mother 3.07932880255
country_self 1.76820833584

删掉缺失值过多的列（4列接近50%）

      In [27]: 
    

del train_df['migration_msa']
del train_df['migration_reg']
del train_df['migration_within_reg']
del train_df['migration_sunbelt']

del test_df['migration_msa']
del test_df['migration_reg']
del test_df['migration_within_reg']
del test_df['migration_sunbelt']

其他五个变量的缺失值占比极小，决定添加字段（不均衡样本尽量塑造数据而不是删除宝贵的数据）

      In [28]: 
    

train_df['hispanic_origin']= train_df['hispanic_origin'].fillna('others')
train_df['state_of_previous_residence']= train_df['state_of_previous_residence'].fillna('others')
train_df['country_father']= train_df['country_father'].fillna('others')
train_df['country_mother']= train_df['country_mother'].fillna('others')
train_df['country_self']= train_df['country_self'].fillna('others')

test_df['hispanic_origin']= test_df['hispanic_origin'].fillna('others')
test_df['state_of_previous_residence']= test_df['state_of_previous_residence'].fillna('others')
test_df['country_father']= test_df['country_father'].fillna('others')
test_df['country_mother']= test_df['country_mother'].fillna('others')
test_df['country_self']= test_df['country_self'].fillna('others')

2.2 异常值处理

*变量转换处理极度右偏数据（对数处理）

      In [29]: 
    

def outliner(df,filed):
    df[filed]=np.log(df[filed]+1)
    df[filed].plot(kind='kde')

      In [30]: 
    

#训练数据
fig=plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(train_df,'capital_losses') #.capital_losses&capital_gains
plt.subplot2grid((2,2),(0,1))
outliner(train_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(train_df,'dividend_from_Stocks')
plt.show()

      In [31]: 
    

##测试数据
fig=plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(test_df,'capital_losses') #.capital_losses&capital_gains
plt.subplot2grid((2,2),(0,1))
outliner(test_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(test_df,'dividend_from_Stocks')
plt.show()

2.3 哑编码

      In [32]: 
    

def dummy_encode(df,filed,a):
    dummies=pd.get_dummies(df[filed],prefix=a)
    len=dummies.shape[1]-1
    a= dummies.iloc[:,0:len]
    b=pd.concat([df, a], axis=1)
    del b[filed]
    return b

      In [33]: 
    

print train_df.shape,test_df.shape

(199523, 37) (99762, 37)

      In [34]: 
    

df_all=pd.concat([train_df,test_df])

      In [35]: 
    

#对训练和测试数据进行哑编码
df_all=dummy_encode(df_all,'fill_questionnaire_veteran_admin','fill_questionnaire_veteran_admin')
df_all=dummy_encode(df_all,'citizenship','citizenship')
df_all=dummy_encode(df_all,'country_self','country_self')
df_all=dummy_encode(df_all,'country_mother','country_mother')
df_all=dummy_encode(df_all,'country_father','country_father')
df_all=dummy_encode(df_all,'family_members_under_18','family_members_under_18')
df_all=dummy_encode(df_all,'live_1_year_ago','live_1_year_ago')
df_all=dummy_encode(df_all,'d_household_summary','d_household_summary')
df_all=dummy_encode(df_all,'class_of_worker','class_of_worker')
df_all=dummy_encode(df_all,'education','education')
df_all=dummy_encode(df_all,'enrolled_in_edu_inst_lastwk','enrolled_in_edu_inst_lastwk')
df_all=dummy_encode(df_all,'marital_status','marital_status')
df_all=dummy_encode(df_all,'major_industry_code','major_industry_code')
df_all=dummy_encode(df_all,'major_occupation_code','major_occupation_code')
df_all=dummy_encode(df_all,'race','race')
df_all=dummy_encode(df_all,'hispanic_origin','hispanic_origin')
df_all=dummy_encode(df_all,'sex','sex')
df_all=dummy_encode(df_all,'member_of_labor_union','member_of_labor_union')
df_all=dummy_encode(df_all,'reason_for_unemployment','reason_for_unemployment')
df_all=dummy_encode(df_all,'full_parttime_employment_stat','full_parttime_employment_stat')
df_all=dummy_encode(df_all,'tax_filer_status','tax_filer_status')
df_all=dummy_encode(df_all,'region_of_previous_residence','region_of_previous_residence')
df_all=dummy_encode(df_all,'state_of_previous_residence','state_of_previous_residence')
df_all=dummy_encode(df_all,'d_household_family_stat','d_household_family_stat')

      In [36]: 
    

train_df=df_all.iloc[0:199523,:]
test_df=df_all.iloc[199523:,:]
print train_df.shape,test_df.shape

(199523, 352) (99762, 352)

      In [37]: 
    

#test_df.to_csv('testooooo.csv')
#train_df.to_csv('trainooooo.csv')

3. 特征选择

随机森林做特征选择

      In [38]: 
    

##将目标变量提到最后一列
Y=train_df['income_level']
del train_df['income_level']
train_df['income_level']=Y

YT=test_df['income_level']
del test_df['income_level']
test_df['income_level']=YT

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

      In [39]: 
    

y=Y
X=train_df.iloc[:,0:351]

      In [73]: 
    

from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

selected_feat_names=set()
for i in range(10):                           #这里我们进行十次循环取交集
    tmp = set()
    rfc = RandomForestClassifier(n_jobs=-1)
    rfc.fit(X, y)

    #print("training finished")

    importances = rfc.feature_importances_
    indices = np.argsort(importances)[::-1]   # 降序排列
    S={}
    for f in range(X.shape[1]):
        if  importances[indices[f]] >=0.0001:
            tmp.add(X.columns[indices[f]])
            S[X.columns[indices[f]]]=importances[indices[f]]
            #print("%2d) %-*s %f" % (f + 1, 30, X.columns[indices[f]], importances[indices[f]]))
    selected_feat_names |= tmp
    imp_fea=pd.Series(S)
print(len(selected_feat_names), "features are selected")

(285, 'features are selected')

      In [41]: 
    

train_new=train_df[['income_level']]
test_new=test_df[['income_level']]
for i in selected_feat_names:
    train_new[i]=train_df[i]
    try :
        test_new[i]=test_df[i]
    except Exception :
        print '----------------'
        print i
        del train_new[i]
        
print train_new.shape,test_new.shape

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

(199523, 292) (99762, 292)

      In [42]: 
    

##将目标变量提到最后一列
Y=train_new['income_level']
del train_new['income_level']
train_new['income_level']=Y

YT=test_new['income_level']
del test_new['income_level']
test_new['income_level']=YT

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

      In [43]: 
    

#train_new.to_csv('train_new.csv')
#test_new.to_csv('test_new.csv')

4. 机器学习

首先:不均衡技术处理（欠采样和过采样技术）
其次：模型选择与训练(xgboost)
最后：调参，调参我希望在保证精确度》0.94的前提下，AUC越大越好

4.1不均衡技术:欠采样和过采样

4.1.1欠采样

train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
正例：12382 负例：187141
抽样比例：25%
发生率为：21%

      In [44]: 
    

def down_sample(df):
    df1=df[df['income_level']==1]#正例
    df2=df[df['income_level']==0]##负例
    df3=df2.sample(frac=0.25)##抽负例
    return pd.concat([df1,df3],ignore_index=True)

      In [45]: 
    

down_train_df=down_sample(train_df)
down_train_new=down_sample(train_new)

4.1.2过采样

train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
正例：12382 负例：187141
正例复制5次
发生率为：25%

      In [46]: 
    

def up_sample(df):
    df1=df[df['income_level']==1]#正例
    df2=df[df['income_level']==0]##负例
    df3=pd.concat([df1,df1,df1,df1,df1],ignore_index=True)
    return pd.concat([df2,df3],ignore_index=True)

      In [47]: 
    

up_train_df=up_sample(train_df)
up_train_new=up_sample(train_new)

4.2 Xgboost 进行机器学习

      In [48]: 
    

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
    
from sklearn import metrics
from sklearn.cross_validation import train_test_split
#记录程序运行时间
import time 

##定义模型参数
param = {}
# use logistic regression loss
param['objective'] = 'binary:logistic'
# scale weight of positive examples
param['scale_pos_weight'] = 1
param['bst:eta'] = 0.2
param['bst:max_depth'] = 6
param['eval_metric'] = 'logloss'
param['silent'] = 1
param['nthread'] = 10

Threshold=0.5

D:\Anaconda2\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

      In [74]: 
    

def xgb_model(train,tests,list,pam,Threshold):
    train_xy,val = train_test_split(train, test_size = 0.3,random_state=1)
    #random_state is of big influence for val-auc
    y = train_xy['income_level']
    X = train_xy.drop(['income_level'],axis=1)
    val_y = val['income_level']
    val_X = val.drop(['income_level'],axis=1)

    weight1 = np.ones(len(y))
    weight2 = np.ones(len(val_y))
    
    xgb_val = xgb.DMatrix(val_X,label=val_y,weight = weight2)
    xgb_train = xgb.DMatrix(X, label=y,weight = weight1)#,weight = weight1
    test_y = tests['income_level']
    test_X = tests.drop(['income_level'],axis=1)
    xgb_test = xgb.DMatrix(test_X)##不要label
    

    
    watchlist = [(xgb_train, 'train'),(xgb_val, 'val')]
    
    num_round = 100 # 迭代次数

    # training model 
    print ("training xgboost")
    threads = list
    ##调参
    for i in threads:
        param[pam] = i
        tmp = time.time()
        plst = param.items()+[('eval_metric', '[email protected]')]
        model = xgb.train( plst, xgb_train, num_round, watchlist,verbose_eval=False )
        preds = model.predict(xgb_test,ntree_limit=model.best_ntree_limit)
        
        print pam,i
        print ("XGBoost with %d thread costs: %s seconds" % (i, str(time.time() - tmp)))
        for j in range(len(preds)):
            if preds[j]>=Threshold :
                preds[j]=1
            else :
                preds[j]=0
       
        print 'AUC: %.4f' % metrics.roc_auc_score(test_y,preds)
        print 'ACC: %.4f' % metrics.accuracy_score(test_y,preds)
        return model

**分别在特征筛选后的train_new,以及在此基础上的上采样和下采样的数据集上进行调参，以此选出最佳模型

4.2.1max_depth调参

list=[6,7,8,9]
上采样时：树深为9,10可以继续调参 因为跨过了ACC为0.94
下采样时，选择树深为4

      In [54]: 
    

### 未经过重采样的训练集
list1=[4,6,8,9,10]
pam='max_depth'

xgb_model(train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 56.1489999294 seconds
AUC: 0.7273
ACC: 0.9572
max_depth 6
XGBoost with 6 thread costs: 70.1800000668 seconds
AUC: 0.7364
ACC: 0.9581
max_depth 8
XGBoost with 8 thread costs: 90.8049998283 seconds
AUC: 0.7435
ACC: 0.9585
max_depth 9
XGBoost with 9 thread costs: 101.267000198 seconds
AUC: 0.7387
ACC: 0.9579
max_depth 10
XGBoost with 10 thread costs: 112.95600009 seconds
AUC: 0.7417
ACC: 0.9577

      In [55]: 
    

### 下采样的训练集
xgb_model(down_train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 14.1289999485 seconds
AUC: 0.8398
ACC: 0.9395
max_depth 6
XGBoost with 6 thread costs: 19.8990001678 seconds
AUC: 0.8398
ACC: 0.9385
max_depth 8
XGBoost with 8 thread costs: 26.1499998569 seconds
AUC: 0.8410
ACC: 0.9376
max_depth 9
XGBoost with 9 thread costs: 29.2709999084 seconds
AUC: 0.8396
ACC: 0.9373
max_depth 10
XGBoost with 10 thread costs: 32.6649999619 seconds
AUC: 0.8380
ACC: 0.9368

      In [56]: 
    

### 上采样的训练集
xgb_model(up_train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 66.6740000248 seconds
AUC: 0.8555
ACC: 0.9333
max_depth 6
XGBoost with 6 thread costs: 88.0920000076 seconds
AUC: 0.8545
ACC: 0.9352
max_depth 8
XGBoost with 8 thread costs: 113.984999895 seconds
AUC: 0.8507
ACC: 0.9380
max_depth 9
XGBoost with 9 thread costs: 127.167999983 seconds
AUC: 0.8485
ACC: 0.9398
max_depth 10
XGBoost with 10 thread costs: 142.092000008 seconds
AUC: 0.8416
ACC: 0.9410

4.2.2scale_pos_weight调参

list=[0.2,0.4,0.5,0.6,0.8,1]
经过上个步骤，上下采样的AUC都有不错的表现，但是ACC不足
而未经重采样的数据集的AUC的峰值大约在0.74左右，需要舍弃这个数据集
下采样则是ACC达不到0.94
下面着重在树深为9,10 的基础上调整上采样的ACC


结果：上采样选择scale_pos_weight=1.0，树深为9，AUC:0.8485,ACC：0.9398

      In [59]: 
    

### 上采样
param['max_depth'] = 10
list2=[0.8,0.9,1.0,1.1,1.2]
pam='scale_pos_weight'
xgb_model(up_train_new,test_new,list2,pam,Threshold)

training xgboost
scale_pos_weight 0.8
XGBoost with 0 thread costs: 145.473999977 seconds
AUC: 0.8300
ACC: 0.9456
scale_pos_weight 0.9
XGBoost with 0 thread costs: 141.223999977 seconds
AUC: 0.8359
ACC: 0.9430
scale_pos_weight 1.0
XGBoost with 1 thread costs: 143.575999975 seconds
AUC: 0.8416
ACC: 0.9410
scale_pos_weight 1.1
XGBoost with 1 thread costs: 171.828999996 seconds
AUC: 0.8442
ACC: 0.9392
scale_pos_weight 1.2
XGBoost with 1 thread costs: 165.302000046 seconds
AUC: 0.8500
ACC: 0.9364

      In [60]: 
    

### 上采样
param['max_depth'] = 9
list2=[0.8,0.9,1.0,1.1,1.2]
pam='scale_pos_weight'
Threshold=0.5
xgb_model(up_train_new,test_new,list2,pam,Threshold)

training xgboost
scale_pos_weight 0.8
XGBoost with 0 thread costs: 142.86500001 seconds
AUC: 0.8362
ACC: 0.9445
scale_pos_weight 0.9
XGBoost with 0 thread costs: 145.427999973 seconds
AUC: 0.8412
ACC: 0.9413
scale_pos_weight 1.0
XGBoost with 1 thread costs: 141.220000029 seconds
AUC: 0.8485
ACC: 0.9398
scale_pos_weight 1.1
XGBoost with 1 thread costs: 144.178000212 seconds
AUC: 0.8519
ACC: 0.9363
scale_pos_weight 1.2
XGBoost with 1 thread costs: 143.682999849 seconds
AUC: 0.8547
ACC: 0.9349

4.2.3阈值调参

Threshold 0.45-0.55,步进0.01
最终选择：Threshold= 0.51
此时：
AUC: 0.8466
ACC: 0.9409

      In [64]: 
    

for m in np.arange(0.45,0.55,0.01):
    param['scale_pos_weight'] = 1.0
    list3=[9]
    pam='max_depth'
    Threshold=m
    print 'Threshold=',m
    xgb_model(up_train_new,test_new,list3,pam,Threshold)

Threshold= 0.45
training xgboost
max_depth 9
XGBoost with 9 thread costs: 162.524000168 seconds
AUC: 0.8565
ACC: 0.9337
Threshold= 0.46
training xgboost
max_depth 9
XGBoost with 9 thread costs: 160.478999853 seconds
AUC: 0.8555
ACC: 0.9352
Threshold= 0.47
training xgboost
max_depth 9
XGBoost with 9 thread costs: 153.284000158 seconds
AUC: 0.8537
ACC: 0.9363
Threshold= 0.48
training xgboost
max_depth 9
XGBoost with 9 thread costs: 153.253999949 seconds
AUC: 0.8523
ACC: 0.9375
Threshold= 0.49
training xgboost
max_depth 9
XGBoost with 9 thread costs: 152.437000036 seconds
AUC: 0.8499
ACC: 0.9384
Threshold= 0.5
training xgboost
max_depth 9
XGBoost with 9 thread costs: 143.042000055 seconds
AUC: 0.8485
ACC: 0.9398
Threshold= 0.51
training xgboost
max_depth 9
XGBoost with 9 thread costs: 136.401000023 seconds
AUC: 0.8466
ACC: 0.9409
Threshold= 0.52
training xgboost
max_depth 9
XGBoost with 9 thread costs: 139.329999924 seconds
AUC: 0.8445
ACC: 0.9418
Threshold= 0.53
training xgboost
max_depth 9
XGBoost with 9 thread costs: 138.738999844 seconds
AUC: 0.8425
ACC: 0.9428
Threshold= 0.54
training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.219000101 seconds
AUC: 0.8404
ACC: 0.9436
Threshold= 0.55
training xgboost
max_depth 9
XGBoost with 9 thread costs: 164.50999999 seconds
AUC: 0.8381
ACC: 0.9446

4.3 重要特征可视化

      In [72]: 
    

from xgboost import plot_importance
import matplotlib.pyplot as plt
from graphviz import Digraph
import pydot

      In [76]: 
    

param['scale_pos_weight'] = 1.0
list3=[9]
pam='max_depth'
Threshold=0.51
model=xgb_model(up_train_new,test_new,list3,pam,Threshold)

training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.400000095 seconds
AUC: 0.8466
ACC: 0.9409

      In [102]: 
    

imp_feat=imp_fea.sort_values()[::-1]
feat_imp=imp_feat[:30]

      In [103]: 
    

feat_imp.plot(kind='bar')

        Out[103]: 
      

你可能感兴趣的:(个人)

斤斤计较的婚姻到底有多难？白心之岂必有为
很多人私聊我会问到在哪个人群当中斤斤计较的人最多？我都会回答他，一般婚姻出现问题的斤斤计较的人士会非常多，以我多年经验，在婚姻落的一塌糊涂的人当中，斤斤计较的人数占比在20～30%以上，也就是说10个婚姻出现问题的斤斤计较的人有2-3个有多不减。在婚姻出问题当中，有大量的心理不平衡的、尖酸刻薄的怨妇。在婚姻中仅斤斤计较有两种类型：第一种是物质上的，另一种是精神上的。在物质与精神上抠门已经严重的影响
随笔 | 仙一般的灵气海思沧海
仙岛今天，我看了你全部，似乎已经进入你的世界我不知道，这是否是梦幻，还是你仙一般的灵气吸引了我也许每一个人都要有一份属于自己的追求，这样才能够符合人生的梦想，生活才能够充满着阳光与快乐我不知道，我为什么会这样的感叹，是在感叹自己的人生，还是感叹自己一直没有孜孜不倦的追求只感觉虚度了光阴，每天活在自己的梦中，活在一个不真实的世界是在逃避自己，还是在逃避周围的一切有时候我嘲笑自己，嘲笑自己如此的虚无，
10月|愿你的青春不负梦想-读书笔记-01 Tracy的小书斋
本书的作者是俞敏洪，大家都很熟悉他了吧。俞敏洪老师是我行业的领头羊吧，也是我事业上的偶像。本日摘录他书中第一章中的金句：『一个人如果什么目标都没有，就会浑浑噩噩，感觉生命中缺少能量。能给我们能量的，是对未来的期待。第一件事，我始终为了进步而努力。与其追寻全世界的骏马，不如种植丰美的草原，到时骏马自然会来。第二件事，我始终有阶段性的目标。什么东西能给我能量？答案是对未来的期待。』读到这里的时候，我便
2021年12月19日，春蕾教育集团团建活动感受——黄晓丹黄错错加油
感受:1.从陌生到熟悉的过程。游戏环节让我们在轻松的氛围中得到了锻炼，也增长了不少知识。2.游戏过程中，我们贡献的是个人力量，展现的是团队的力量。它磨合的往往不止是工作的熟悉，更是观念上契合度的贴近。3.这和工作是一样的道理。在各自的岗位上，每个人摆正自己的位置、各司其职充分发挥才能，并团结一致劲往一处使，才能实现最大的成功。新知:1.团队精神需要不断地创新。过去，人们把创新看作是冒风险，现在人们
三大师传 beca酱
巴尔扎克的作品被誉为“法国社会的一面镜子”。文学大师维克多·雨果对巴尔扎克的评价是：“在最伟大的人物中间，巴尔扎克是名列前茅者；在最优秀的人物中间，巴尔扎克是佼佼者之一。”一个原本寂寂无名的小人物，从地中海的某个海岛上，只身一人来到巴黎，没有朋友，也没有名望。作为一个一文不名的外乡人，凭着赤手空拳赢得了巴黎，征服了整个法兰西，并且赢得了世界。这个人就是十九世纪法国伟大的军事家、政治家，法兰西第一帝
我的烦恼余建梅
我的烦恼。女儿问我：“你给学生布置什么作文题目？”“《我的烦恼》。”“他们都这么大了，你觉得他们还有烦恼吗？”“有啊！每个人都会有自己烦恼。”“我不相信，大人是没有烦恼的，如果说一定有的话，你的烦恼和我写作业有关，而且是小烦恼。不像我，天天被你说，有这样的妈妈，烦恼是没完没了。”女儿愤愤不平。每个人都会有自己的烦恼，处在上有老下有小的年纪，烦恼多的数不完。想干好工作带好孩子，想孝顺父母又想经营好自
直抒《紫罗兰永恒花园外传》雷姆的黑色童话
没看过《紫罗兰永恒花园》的我莫名的看完了《紫罗兰永恒花园外传》，又莫名的被故事中的姐妹之情狠狠地感动了的一把。感动何在：困苦中相依为命的姐妹二人被迫分离，用一个人的自由换取另一个人的幸福。之后，虽相隔不知几许依旧心心念念彼此牵挂。这种深深的姐妹情谊就是令我为之动容的所在。贝拉和泰勒分别影片开始，海天之间一个孩童凭栏眺望，手中拿着折旧的信纸。镜头一转，挑灯伏案的薇尔莉特正在打字机前奋笔疾书。这些片段
我的黑历史袖手围观有来有去
孩子同学与我们一起共进晚餐，俩孩子加我三个人。小同学是一个大方率性礼貌的小孩，我们也都非常喜欢。好了，回到正题上来让我把这个故事讲完。俩孩子都喜欢吃鱼，所以就发生了小孩子之间常会发生的事。我狠狠的盯了我家孩子，孩子表情有些狼狈。和孩子单独一起的时候，见她尚未释怀，并谴责我不该狠盯她，让她没面子。也许是她触动了我的童年往事吧。由此，一狠心，给她讲了一段埋藏心里极深的黑历史：我奶奶有四个儿子，四个儿子
《中华小厨师》单行VS爱藏：姜是老的辣，书是新的好 cicoky
《汉书·郦食其传》有曰：“王者以民为天，而民以食为天。”自古以来，吃饱饭是每一个人的基本要求，而吃好饭却是每一个人的最终追求。于是，厨师这一职业孕育而生，其渊源之久，甚至可追溯到4000年前的奴隶时代。职业本身无贵贱，但职业能力却有高低之分。所以一家餐馆生意好不好，厨师的水平决定一切，而站在所有厨师顶端的就被称之为“特级厨师”。今天要说的就是一个关于“特级厨师刘昴星”的故事。连载历程1995年第4
拥有断舍离的心态，过精简生活--《断舍离》读书笔记爱吃丸子的小樱桃
不知不觉间房间里的东西越来越多，虽然摆放整齐，但也时常会觉得空间逼仄，令人心生烦闷。抱着断舍离的态度，我开始阅读《断舍离》这本书，希望从书中能找到一些有效的方法，帮助我实现空间、物品上的断舍离。《断舍离》是日本作家山下英子通过自己的经历、思考和实践总结而成的，整体内涵也从刚开始的私人生活哲学的“断舍离”升华成了“人生实践哲学”，接着又成为每个人都能实行的“改变人生的断舍离”，从“哲学”逐渐升华成“
【华为OD机试真题2023B卷 JAVA&JS】We Are A Team 若博豆 java 算法华为 javascript
华为OD2023（B卷）机试题库全覆盖，刷题指南点这里WeAreATeam时间限制：1秒|内存限制：32768K|语言限制：不限题目描述：总共有n个人在机房，每个人有一个标号（1<=标号<=n），他们分成了多个团队，需要你根据收到的m条消息判定指定的两个人是否在一个团队中，具体的：1、消息构成为：abc，整数a、b分别代
万物难度不度己边度512
你好，陌生人！你是否有过迷茫，在别人的面前自己却不曾展示！你是否自己承担着所有的痛苦，却又笑对人生！你是否在很多时候想找人诉说，翻开手机却发现，手机里面空无一人！你是否有很多事情想做，最后却因你自己拖延，最后发现自己什么都做不了！对没有错，我的名字就叫你是否！不要怀疑！不要悲伤！我们的生活可是还有很到要继续的呢！还有很多那个人，很多地方我们都没有去过！所以我们已经没有退路了！那就继续向前吧！加油！
2019-08-08 65454
东莞家庭聚会出行旅游去哪里玩住？想起来有很久没有和家里人聚会啦，这次组织家人来到威廉古堡别墅轰趴，一大家子27个人，在别墅订了一天办，玩的非常的开心，小孩子玩游戏机，也很放心不会丢，我们就在唱歌、打麻将、打桌球一系列的活动，还准备小次等小孩生日在别墅举办，还可以给孩子做一个生日的策划
2.0践行没有你的参与就不完美 x秀丽x
亲爱的伙伴们早上好，今天早上我们开了一次班委竞选的会议，全程只有20多个人参与，宫班本着对大家负责任的态度告诉我们，此次竞选作废，原因是这没有达到2.0的100%参会要求，如果没有大家的参与那么这个班委选出来还有什么意义，这说明选出来的人也是不一定是我们大家心目中认可的那个人，所以为了让大家的这个90天能够更好的激发出自己的的“做”的能力，那么要从第一次竞选班委的会议开始做到100%出席会议，竞选
其二十八尾喵
你知道吗？图片发自App我今天知道了你有喜欢的人，不是我。心空空的，整个人都不是我的了。可，怎么办？还是要好好的活着，毕竟你喜欢的人，我不能杀，可是我可以杀其他喜欢你的人呀！也罢，此生无缘，来世再见。鱼干
人怎么才能认识自己？阿尚青子自由写作人
人怎么才能认识自己？（原问题）我从不愿意上纲上线地确定偌大的话题，就直接说吧。纵使你能认识世界上的万事万物，你很难做到真实地认识自己。因为即使就这个世界，基本上每个人也很难做到客观、公正、科学地认识。对你好的人就是好吗？一件事情是否能够保持永远原来的样子？借不到钱的男友，女友想离开他就理直气壮？父母对子女有几分慷慨，又有几分是无私？工作的意义究竟是什么？是工作需要你，还是你需要工作呢？诸如此类的问
2022-11-17 无奇君
又去了一次社康，这次是急性支气管炎……太难了。半夜就猛咳，天天咳醒，还好他戴海绵耳塞睡吵不到他，要不然对他来说也是种煎熬。一累也会猛咳，希望这次是最后一次吃药，吃完就好。又想把头发剪短了，顺便染个色。可是刚刚去看人家还没开门，不是休息日老板好佛系。理发店是个夫妻店，一年多前刚搬来的时候老板还没对象呢，当时聊天老板就说希望能找个对象一起两个人守着店都比上班强。不久后再去他已经有对象了，而且在店里帮忙
如何成为段子手欣雅阅读
我是一个尬聊大师，与朋友聊天经常把话题聊死，留我一个人在群里，望着自己打下的最后一句话无语凝噎。看到风趣幽默的朋友与人聊天，很是艳羡，觉得自己何时才能成为这样的段子手呢？一、段子是什么？“段子”一词在百度百科上的解释：本是相声中的一个艺术术语，指的是相声作品中一节或一段艺术内容。我的理解：段子就是一些搞笑的故事或者笑话。二、为什么要会说段子？不知道大家有没有这样的朋友，本来很无趣的聚会，只要有他参
python os 环境变量 CV矿工 python 开发语言 numpy
环境变量：环境变量是程序和操作系统之间的通信方式。有些字符不宜明文写进代码里，比如数据库密码，个人账户密码，如果写进自己本机的环境变量里，程序用的时候通过os.environ.get（）取出来就行了。os.environ是一个环境变量的字典。环境变量的相关操作importos"""设置/修改环境变量：os.environ[‘环境变量名称’]=‘环境变量值’#其中key和value均为string类
闲鱼鱼小铺怎么开通？鱼小铺开通需要哪些流程？高省APP大九
闲鱼鱼小铺是平台推出的一个专业程度的店铺，与普通店铺相比会有更多的权益，比如说发布的商品数量从50增加到500；拥有专业的店铺数据看板与分析的功能，这对于专门在闲鱼做生意的用户来说是非常有帮助的，那么鱼小铺每个人都能开通吗？大家好，我是高省APP联合创始人蓓蓓导师，高省APP是2021年推出的电商导购平台，0投资，0风险、高省APP佣金更高，模式更好，终端用户不流失。【高省】是一个可省钱佣金高，能
第六集如何安装CentOS7.0，3分钟学会centos7安装教程 date分享
从光盘引导系统按回车键继续进入引导程序安装界面，选择语言这里选择简体中文版点击继续选择桌面安装下面给系统分区选择磁盘，点击完成选择基本分区，点击加号swap分区,大小填内存的两倍在选择根分区，使用所有可用的磁盘空间选择文件系统ext4点击完成，点击开始安装设置root密码，点击完成设置普通用户和密码，点击完成整个过程持续八分钟左右根据个人配置不同，时间长短不同好，现在点击重启系统进入重启状态点击本
心有蓝天白云，爱情便会晴空万里，然后有花香有鸟鸣有美好的未来曹十二吖
丁南的婚姻，来自于一场她对生命的对比。她曾经说过，当她最爱的母亲用生命去逼迫她结婚的时候，她曾一度不理解到愤怒，甚至于想过用轻生来对抗母亲的不理智。庆幸的是，丁南是一个自我调节能力非常强的人，她想如果我连死亡都不怕，还怕不能经营好一段婚姻吗？抱着这样的念头，24年没有谈过恋爱的她，用短短三个月的时间，完成了少女到女人的蜕变。她曾经说过：“我要把自己最珍贵的东西留给自己命中注定的那个人。”闺蜜几人中
01-Git初识 Meereen Git git
01-Git初识概念：一个免费开源，分布式的代码版本控制系统，帮助开发团队维护代码作用：记录代码内容。切换代码版本，多人开发时高效合并代码内容如何学：个人本机使用：Git基础命令和概念多人共享使用：团队开发同一个项目的代码版本管理Git配置用户信息配置：用户名和邮箱，应用在每次提交代码版本时表明自己的身份命令：查看git版本号git-v配置用户名gitconfig--globaluser.name
似乎老是忘记什么东西灰台
S带上了耳机，眼前的一切都与她隔绝开来。虽是初春的好天气，花都开的正鲜艳，行人也都驻足欣赏，还有不少怀着好心情的年轻人在花树下打闹。不过S似乎并不在意这些，连耳机传来的rap也没有调动起她的兴致。一瞬间，心脏好像变成了黑洞，“啊，我身边还有几个人呢，似乎没有了吧”。阳光的温度覆盖到了脖子上，S抬头看了看开满花的树，“我妈好像还挺喜欢花的”，S随手拍了一张照片，微信发到自己一家三口的群里。过了一会，
一个历史事件和查理一世走上断头台有很大关系，这个事件是什么？王老师聊围棋
今天我要讲的历史事件，查理一世被处死的始末。其实查理一世给被处死的时候，与一个事件有很大的联系。这个事件是“普莱德清洗”。提到这个事件，我们不得不提到一个人，这个人就是克伦威尔。可以说，查理一世能够走上断头台，克伦威尔有很大的功劳。为什么这么说呢。那我们就成英国内战的终结说起吧。我们都知道英国的内战是有保王党挑起来。在保王党军队一路凯歌进攻的同时。就在1645年6月14日，在纳西比荒原上进行最后的
《大兴安岭猎人传说》今年最好看的东北鬼怪故事，很优秀一部电影
《大兴安岭猎人传说》是最新上映于愚人节的网剧，别看是网剧却远超出我的个人预料。该片由民俗故事改编，这点就很吸引人，因为民俗故事口口相传，比那些编造而成的鬼故事更具有了真实性，网大做的电影还不错哦，如果可以我打四星好评。大兴安岭的故事我们经常听老人提起，那里有原始大森林，物产丰富，更流传着精灵怪物的传说。什么红黄白柳灰，出马仙、人参娃娃的故事层出不穷，以大兴安岭为背景的故事真不少。可很多鬼片看到最后
读《人间鲁迅》有感琳语读书
上周读完《闻一多传》后，我对中国近代知识分子产生了兴趣，这周继续读了《人间鲁迅》。厚厚的两本书，记录了一个人的一生，苦痛，彷徨和挣扎，虽然只读了一小部分，却也心潮澎湃。闻一多和鲁迅是完全不同的。鲁迅是沉郁的，现实的，寂寞的，抗争的。除了天生性格的不同外，环境的塑造也是非常之大。鲁迅少年经历了家庭的变故，看尽了人间冷暖，世态炎凉。这种经历促使他很早就观察思考人生，立志用文学来改变中国国民的劣根。闻一
【讲解】怎么消除妊娠纹 poyan7160
女人是脆弱的，尤其是孕期的女性。辛辛苦苦怀胎十月，经历一次深到骨子里的痛还不够，无奈还要留下一身的妊娠纹。母亲是伟大的，但也是要付出代价的，妊娠纹就是最好的证明。可是，难道真的要带着妊娠纹过一辈子吗?不，坚决不!接下来新时代辣妈告诉你怎么去除妊娠纹?怎么去除妊娠纹——根据肌肤需要补充水分就像敷面膜那样，大家都知道敷面膜的目的是为了给肌肤补充水分。水分对一个人的肌肤很重要，只有有了足够的水分，肌肤才
2019-01-19 王小康KK
姓名:王康公司:扬州市方圆建筑工程有限公司2018年3月16日～3月18日上海361期《六项精进》感谢二组学员【日精进打卡第307天】【知～学习】《六项精进》大纲3遍共862遍《大学》通篇3遍共860遍《六项精进》全书40页【经典名句】思想决定行为，行为决定习惯，习惯决定性格，性格决定命运。【行～实践】一、修身：（对自己个人）1、践行六项精进的理念。二、齐家：（对家庭和家人）1、和女朋友视频聊天。
2019-10-24 柒月的可可
今日上班无事，人又懒怠动，不知道如何打发这个下午，终于打开了。我大概是把当日记来写的。重庆的天气骤然凉了。早上出门的时候，满地都是落叶，脚踩上去，却是刚下过雨，叶子已润掉，走不出声响。白天在办公室不见天日，对温度也无甚感觉，晚上一个人回到家，屋子里窗户都开着，被冷风吹了一天，一迈进屋，便觉冷气森然。将近二十度的天气，竟要裹着毯子才觉温暖。再过一周，就到十一月。扛过十一月，就可以开暖气了。然而我真的
Maven Array_06 eclipse jdk maven
Maven Maven是基于项目对象模型(POM)，信息来管理项目的构建，报告和文档的软件项目管理工具。 Maven 除了以程序构建能力为特色之外，还提供高级项目管理工具。由于 Maven 的缺省构建规则有较高的可重用性，所以常常用两三行 Maven 构建脚本就可以构建简单的项目。由于 Maven 的面向项目的方法，许多 Apache Jakarta 项目发文时使用 Maven，而且公司
ibatis的queyrForList和queryForMap区别 bijian1013 java ibatis
一.说明 iBatis的返回值参数类型也有种：resultMap与resultClass，这两种类型的选择可以用两句话说明之： 1.当结果集列名和类的属性名完全相对应的时候，则可直接用resultClass直接指定查询结果类
LeetCode[位运算] - #191 计算汉明权重 Cwind java 位运算 LeetCode Algorithm 题解
原题链接：#191 Number of 1 Bits 要求：写一个函数，以一个无符号整数为参数，返回其汉明权重。例如，‘11’的二进制表示为'00000000000000000000000000001011', 故函数应当返回3。汉明权重：指一个字符串中非零字符的个数；对于二进制串，即其中‘1’的个数。难度：简单分析：将十进制参数转换为二进制，然后计算其中1的个数即可。 “
浅谈java类与对象 15700786134 java
java是一门面向对象的编程语言，类与对象是其最基本的概念。所谓对象，就是一个个具体的物体，一个人，一台电脑，都是对象。而类，就是对象的一种抽象，是多个对象具有的共性的一种集合，其中包含了属性与方法，就是属于该类的对象所具有的共性。当一个类创建了对象，这个对象就拥有了该类全部的属性，方法。相比于结构化的编程思路，面向对象更适用于人的思维
linux下双网卡同一个IP 被触发 linux
转自： http://q2482696735.blog.163.com/blog/static/250606077201569029441/ 由于需要一台机器有两个网卡，开始时设置在同一个网段的IP，发现数据总是从一个网卡发出，而另一个网卡上没有数据流动。网上找了下，发现相同的问题不少：一、关于双网卡设置同一网段IP然后连接交换机的时候出现的奇怪现象。当时没有怎么思考、以为是生成树
安卓按主页键隐藏程序之后无法再次打开肆无忌惮_ 安卓
遇到一个奇怪的问题，当SplashActivity跳转到MainActivity之后，按主页键，再去打开程序，程序没法再打开（闪一下），结束任务再开也是这样，只能卸载了再重装。而且每次在Log里都打印了这句话"进入主程序"。后来发现是必须跳转之后再finish掉SplashActivity 本来代码： // 销毁这个Activity fin
通过cookie保存并读取用户登录信息实例知了ing JavaScript html
通过cookie的getCookies()方法可获取所有cookie对象的集合；通过getName()方法可以获取指定的名称的cookie；通过getValue()方法获取到cookie对象的值。另外，将一个cookie对象发送到客户端，使用response对象的addCookie()方法。下面通过cookie保存并读取用户登录信息的例子加深一下理解。（1）创建index.jsp文件。在改
JAVA 对象池矮蛋蛋 java ObjectPool
原文地址： http://www.blogjava.net/baoyaer/articles/218460.html Jakarta对象池 ☆为什么使用对象池恰当地使用对象池化技术，可以有效地减少对象生成和初始化时的消耗，提高系统的运行效率。Jakarta Commons Pool组件提供了一整套用于实现对象池化
ArrayList根据条件+for循环批量删除的方法 alleni123 java
场景如下： ArrayList<Obj> list Obj-> createTime, sid. 现在要根据obj的createTime来进行定期清理。（释放内存） ------------------------- 首先想到的方法就是 for(Obj o:list){ if(o.createTime-currentT>xxx){
阿里巴巴“耕地宝”大战各种宝百合不是茶平台战略
“耕地保”平台是阿里巴巴和安徽农民共同推出的一个 “首个互联网定制私人农场”，“耕地宝”由阿里巴巴投入一亿，主要是用来进行农业方面，将农民手中的散地集中起来不仅加大农民集体在土地上面的话语权，还增加了土地的流通与利用率，提高了土地的产量，有利于大规模的产业化的高科技农业的发展，阿里在农业上的探索将会引起新一轮的产业调整，但是集体化之后农民的个体的话语权将更少，国家应出台相应的法律法规保护
Spring注入有继承关系的类（1） bijian1013 java spring
一个类一个类的注入 1.AClass类 package com.bijian.spring.test2; public class AClass { String a; String b; public String getA() { return a; } public void setA(Strin
30岁转型期你能否成为成功人士 bijian1013 成功
很多人由于年轻时走了弯路，到了30岁一事无成，这样的例子大有人在。但同样也有一些人，整个职业生涯都发展得很优秀，到了30岁已经成为职场的精英阶层。由于做猎头的原因，我们接触很多30岁左右的经理人，发现他们在职业发展道路上往往有很多致命的问题。在30岁之前，他们的职业生涯表现很优秀，但从30岁到40岁这一段，很多人
[Velocity三]基于Servlet+Velocity的web应用 bit1129 velocity
什么是VelocityViewServlet 使用org.apache.velocity.tools.view.VelocityViewServlet可以将Velocity集成到基于Servlet的web应用中，以Servlet+Velocity的方式实现web应用 Servlet + Velocity的一般步骤 1.自定义Servlet，实现VelocityViewServl
【Kafka十二】关于Kafka是一个Commit Log Service bit1129 service
Kafka is a distributed, partitioned, replicated commit log service.这里的commit log如何理解？ A message is considered "committed" when all in sync replicas for that partition have applied i
NGINX + LUA实现复杂的控制 ronin47 lua nginx 控制
安装lua_nginx_module 模块 lua_nginx_module 可以一步步的安装，也可以直接用淘宝的OpenResty Centos和debian的安装就简单了。。这里说下freebsd的安装： fetch http://www.lua.org/ftp/lua-5.1.4.tar.gz tar zxvf lua-5.1.4.tar.gz cd lua-5.1.4 ma
java-14.输入一个已经按升序排序过的数组和一个数字，在数组中查找两个数，使得它们的和正好是输入的那个数字 bylijinnan java
public class TwoElementEqualSum { /** * 第 14 题：题目：输入一个已经按升序排序过的数组和一个数字，在数组中查找两个数，使得它们的和正好是输入的那个数字。要求时间复杂度是 O(n) 。如果有多对数字的和等于输入的数字，输出任意一对即可。例如输入数组 1 、 2 、 4 、 7 、 11 、 15 和数字 15 。由于
Netty源码学习-HttpChunkAggregator-HttpRequestEncoder-HttpResponseDecoder bylijinnan java netty
今天看Netty如何实现一个Http Server org.jboss.netty.example.http.file.HttpStaticFileServerPipelineFactory： pipeline.addLast("decoder", new HttpRequestDecoder()); pipeline.addLast(&quo
java敏感词过虑-基于多叉树原理 cngolon 违禁词过虑替换违禁词敏感词过虑多叉树
基于多叉树的敏感词、关键词过滤的工具包，用于java中的敏感词过滤 1、工具包自带敏感词词库，第一次调用时读入词库，故第一次调用时间可能较长，在类加载后普通pc机上html过滤5000字在80毫秒左右，纯文本35毫秒左右。 2、如需自定义词库，将jar包考入WEB-INF工程的lib目录，在WEB-INF/classes目录下建一个 utf-8的words.dict文本文件，
多线程知识 cuishikuan 多线程
T1，T2，T3三个线程工作顺序，按照T1，T2，T3依次进行 public class T1 implements Runnable{ @Override
spring整合activemq dalan_123 java spring jms
整合spring和activemq需要搞清楚如下的东东1、ConnectionFactory分： a、spring管理连接到activemq服务器的管理ConnectionFactory也即是所谓产生到jms服务器的链接 b、真正产生到JMS服务器链接的ConnectionFactory还得
MySQL时间字段究竟使用INT还是DateTime？ dcj3sjt126com mysql
环境：Windows XPPHP Version 5.2.9MySQL Server 5.1 第一步、创建一个表date_test（非定长、int时间） CREATE TABLE `test`.`date_test` (`id` INT NOT NULL AUTO_INCREMENT ,`start_time` INT NOT NULL ,`some_content`
Parcel: unable to marshal value dcj3sjt126com marshal
在两个activity直接传递List<xxInfo>时，出现Parcel: unable to marshal value异常。在MainActivity页面（MainActivity页面向NextActivity页面传递一个List<xxInfo>）： Intent intent = new Intent(this, Next
linux进程的查看上（ps） eksliang linux ps linux ps -l linux ps aux
ps:将某个时间点的进程运行情况选取下来转载请出自出处：http://eksliang.iteye.com/admin/blogs/2119469 http://eksliang.iteye.com ps 这个命令的man page 不是很好查阅，因为很多不同的Unix都使用这儿ps来查阅进程的状态，为了要符合不同版本的需求，所以这个
为什么第三方应用能早于System的app启动 gqdy365 System
Android应用的启动顺序网上有一大堆资料可以查阅了，这里就不细述了，这里不阐述ROM启动还有bootloader，软件启动的大致流程应该是启动kernel -> 运行servicemanager 把一些native的服务用命令启动起来（包括wifi, power, rild, surfaceflinger, mediaserver等等）-> 启动Dalivk中的第一个进程Zygot
App Framework发送JSONP请求(3) hw1287789687 jsonp 跨域请求发送jsonp ajax请求越狱请求
App Framework 中如何发送JSONP请求呢? 使用jsonp,详情请参考:http://json-p.org/ 如何发送Ajax请求呢? (1)登录 /*** * 会员登录 * @param username * @param password */ var user_login=function(username,password){ // aler
发福利，整理了一份关于“资源汇总”的汇总 justjavac 资源
觉得有用的话，可以去github关注：https://github.com/justjavac/awesome-awesomeness-zh_CN 通用 free-programming-books-zh_CN 免费的计算机编程类中文书籍精彩博客集合 hacke2/hacke2.github.io#2 ResumeSample 程序员简历
用 Java 技术创建 RESTful Web 服务 macroli java 编程 Web REST
转载：http://www.ibm.com/developerworks/cn/web/wa-jaxrs/ JAX-RS (JSR-311) 【 Java API for RESTful Web Services 】是一种 Java™ API，可使 Java Restful 服务的开发变得迅速而轻松。这个 API 提供了一种基于注释的模型来描述分布式资源。注释被用来提供资源的位
CentOS6.5-x86_64位下oracle11g的安装详细步骤及注意事项超声波 oracle linux
前言：这两天项目要上线了，由我负责往服务器部署整个项目，因此首先要往服务器安装oracle，服务器本身是CentOS6.5的64位系统，安装的数据库版本是11g，在整个的安装过程中碰到很多的坑，不过最后还是通过各种途径解决并成功装上了。转别写篇博客来记录完整的安装过程以及在整个过程中的注意事项。希望对以后那些刚刚接触的菜鸟们能起到一定的帮助作用。安装过程中可能遇到的问题（注
HttpClient 4.3 设置keeplive 和 timeout 的方法 supben httpclient
ConnectionKeepAliveStrategy kaStrategy = new DefaultConnectionKeepAliveStrategy() { @Override public long getKeepAliveDuration(HttpResponse response, HttpContext context) { long keepAlive
Spring 4.2新特性-@Import注解的升级 wiselyman spring 4
3.1 @Import @Import注解在4.2之前只支持导入配置类在4.2,@Import注解支持导入普通的java类,并将其声明成一个bean 3.2 示例演示java类 package com.wisely.spring4_2.imp; public class DemoService { public void doSomethin