Connecting to the database
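import pandas as pd
from sqlalchemy import create_engine
# host must include the port number
# the resulting URL looks like mysql+pymysql://<username>:<password>@<host>:3306/mydata?charset=utf8
url = "mysql+pymysql://" + username + ":" + password + "@" + host + "/" + dbname + "?charset=utf8"
# the encoding argument was removed in SQLAlchemy 2.0; the charset is set in the URL instead
engine = create_engine(url, echo = True)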
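Building the URL by plain concatenation breaks when the password contains characters such as `@` or `/`. A minimal sketch of one way to guard against that is shown here; the helper name `build_mysql_url` and the credentials are hypothetical placeholders, not part of the original code. It percent-encodes the user name and password before assembling the URL:

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, dbname, charset="utf8"):
    """Build a mysql+pymysql URL; host is expected to include the port."""
    # Percent-encode the credentials so characters like '@' or ':' cannot be
    # mistaken for URL delimiters.
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mysql+pymysql://{user}:{pwd}@{host}/{dbname}?charset={charset}"


# Placeholder values, for illustration only.
url = build_mysql_url("user", "p@ss", "localhost:3306", "mydata")
print(url)
```

The resulting string can be passed to `create_engine` in the same way as the hand-built one.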
Reading data from the database directly into a DataFrame
sql_cmd = "SELECT id, crowdsource_label, model_label FROM query_pair_score WHERE status = 2;"
labeled_tasks = pd.read_sql(sql = sql_cmd, con = engine)
The result is:
      id  crowdsource_label  model_label
0    300                  0      1.87808
1    600                  0     12.15180
2    900                  0      8.46058
3   1200                  0      9.22723
4   1500                  0      9.81807
5   1800                  0      7.52028
6   2100                  0     10.51960
7   2400                  0      5.27248
8   2700                  0      6.48207
9   3000                  0     10.96560
10  3300                  0      7.67834
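The query step can be tried without a MySQL server. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database in place of MySQL, and the three sample rows (in particular the row with status 0) are made up for illustration. It also passes the status as a bound parameter rather than writing it into the SQL string:

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL server; the rows are sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_pair_score ("
    "id INTEGER, crowdsource_label INTEGER, model_label REAL, status INTEGER)"
)
conn.executemany(
    "INSERT INTO query_pair_score VALUES (?, ?, ?, ?)",
    [(300, 0, 1.87808, 2), (600, 0, 12.15180, 2), (900, 0, 8.46058, 0)],
)

# Pass the status as a bound parameter instead of formatting it into the SQL.
sql_cmd = (
    "SELECT id, crowdsource_label, model_label "
    "FROM query_pair_score WHERE status = ?"
)
labeled_tasks = pd.read_sql(sql_cmd, conn, params=(2,))
conn.close()
print(labeled_tasks)
```

The `?` placeholder is SQLite's parameter style; other drivers, including pymysql, use a different placeholder syntax.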
Creating a DataFrame directly
Create a DataFrame with 2 columns whose index runs from 0 to n-1
index = range(n)
interval_diff = pd.DataFrame(index = index, columns=['id','diff'])
The result (only the diff column is shown, here with n = 5):
diff
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
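As a rough check of what this construction produces, the sketch below builds the empty frame with an illustrative `n = 5` and inspects it. The second frame, `typed`, is an assumed alternative that is not in the original: it fills the columns at creation time so that they get numeric dtypes instead of all-NaN placeholder columns.

```python
import pandas as pd

n = 5  # illustrative size

# Empty frame: every cell starts out as NaN.
interval_diff = pd.DataFrame(index=range(n), columns=["id", "diff"])
print(interval_diff.shape)
print(interval_diff.isna().all().all())

# Alternative sketch: give the columns concrete values (and dtypes) up front.
typed = pd.DataFrame({"id": range(n), "diff": float("nan")})
print(typed.dtypes)
```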
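Changing a value in a DataFrame
# change the diff value of the row whose index label is 1
interval_diff.loc[1, 'diff'] = 10
# or change the diff value of the second row by position
# (chained indexing such as interval_diff.iloc[1]['diff'] = 10 may write to a temporary copy and leave the DataFrame unchanged)
interval_diff.iloc[1, interval_diff.columns.get_loc('diff')] = 10
The result is: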
diff
0 NaN
1 10
2 NaN
3 NaN
4 NaN
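The difference between label-based and position-based assignment can be sketched as follows. The frame size and the values 10, 20 and 30 are illustrative; `.at` is shown as an additional single-cell accessor that the original does not use.

```python
import pandas as pd

interval_diff = pd.DataFrame(index=range(5), columns=["id", "diff"])

# Label-based: the row with index label 1, column "diff".
interval_diff.loc[1, "diff"] = 10

# Position-based: third row, with the column position looked up by name.
diff_pos = interval_diff.columns.get_loc("diff")
interval_diff.iloc[2, diff_pos] = 20

# Single-cell accessor by label.
interval_diff.at[3, "diff"] = 30

print(interval_diff["diff"].tolist())
```

Each of these writes through a single indexing call, which avoids assigning into a temporary copy.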
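Sorting by a column
Sort by the diff column
interval_diff.sort_values("diff", inplace=True)
inplace controls whether the DataFrame itself is modified (True) or a sorted copy is returned and the original is left unchanged (False, the default)
The original order is: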
diff
0 0
1 10
2 2
3 3
4 4
After sorting, the result is:
diff
0 0
2 2
3 3
4 4
1 10
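To make the effect of `inplace` concrete, here is a small sketch using the same five diff values as above. The descending sort and the `reset_index` call are additions for illustration, showing how to renumber the rows once the order has changed.

```python
import pandas as pd

interval_diff = pd.DataFrame({"diff": [0, 10, 2, 3, 4]})

# Without inplace=True a sorted copy is returned; the original keeps its order.
sorted_copy = interval_diff.sort_values("diff")
print(interval_diff["diff"].tolist())
print(sorted_copy.index.tolist())

# Sort in descending order and renumber the index from 0.
renumbered = interval_diff.sort_values("diff", ascending=False).reset_index(drop=True)
print(renumbered)
```

Note that sorting keeps the original index labels attached to their rows, which is why the row labeled 1 ends up last in the output above.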
Selecting a column
As a Series:
labeled_tasks['diff']
As a DataFrame:
labeled_tasks[['diff']]
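The two forms differ only in the type and shape of what comes back. A quick sketch, using a two-row sample frame made up for illustration:

```python
import pandas as pd

labeled_tasks = pd.DataFrame({"id": [300, 600], "diff": [1.87808, 12.15180]})

as_series = labeled_tasks["diff"]    # single brackets: one-dimensional Series
as_frame = labeled_tasks[["diff"]]   # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```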
Adding a column
We want the difference between crowdsource_label and model_label, so we add a diff column to hold its absolute value
labeled_tasks['diff'] = abs(labeled_tasks['crowdsource_label'] - labeled_tasks['model_label'])
The result is:
      id  crowdsource_label  model_label      diff
0    300                  0      1.87808   1.87808
1    600                  0     12.15180  12.15180
2    900                  0      8.46058   8.46058
3   1200                  0      9.22723   9.22723
4   1500                  0      9.81807   9.81807
5   1800                  0      7.52028   7.52028
6   2100                  0     10.51960  10.51960
7   2400                  0      5.27248   5.27248
8   2700                  0      6.48207   6.48207
9   3000                  0     10.96560  10.96560
10  3300                  0      7.67834   7.67834
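The same column can be computed on a small sample to check the arithmetic. The sketch below uses the first three rows of the table above; the `assign` variant is an assumed alternative, not from the original, that returns a new frame instead of modifying the existing one.

```python
import pandas as pd

labeled_tasks = pd.DataFrame({
    "id": [300, 600, 900],
    "crowdsource_label": [0, 0, 0],
    "model_label": [1.87808, 12.15180, 8.46058],
})

# Add the column in place; Series.abs() is the method form of abs(...).
labeled_tasks["diff"] = (
    labeled_tasks["crowdsource_label"] - labeled_tasks["model_label"]
).abs()

# Alternative: build a new frame and leave the source frame untouched.
source = labeled_tasks.drop(columns="diff")
with_diff = source.assign(
    diff=lambda d: (d["crowdsource_label"] - d["model_label"]).abs()
)
print(with_diff)
```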
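Computing the mean / row count / sum under a condition
Mean of diff over the rows whose model_label falls within a given range
labeled_tasks[(labeled_tasks['model_label'] >= lower) & (labeled_tasks['model_label'] < upper)]['diff'].mean()
# for the number of rows use .count()
# for the sum use .sum()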
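A worked sketch of the filtered aggregation follows. It uses four rows from the earlier table; the bounds `lower = 5.0` and `upper = 10.0` are illustrative assumptions. Each comparison needs its own parentheses because `&` binds more tightly than `>=` and `<`.

```python
import pandas as pd

labeled_tasks = pd.DataFrame({
    "model_label": [1.87808, 12.15180, 8.46058, 9.22723],
    "diff": [1.87808, 12.15180, 8.46058, 9.22723],
})

lower, upper = 5.0, 10.0  # illustrative bounds

# Boolean mask: True for rows with lower <= model_label < upper.
mask = (labeled_tasks["model_label"] >= lower) & (labeled_tasks["model_label"] < upper)
in_range = labeled_tasks.loc[mask, "diff"]

print(in_range.count())  # number of matching rows
print(in_range.sum())    # sum of diff over those rows
print(in_range.mean())   # mean of diff over those rows
```

Selecting with `.loc[mask, "diff"]` does the row filter and the column selection in one step.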
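Concatenating DataFrames
This means appending rows, as with list concatenation; it is not a join
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
chosen_tasks = pd.concat([chosen_tasks, tmp_tasks], ignore_index= True)
ignore_index = True discards the existing index and renumbers the rows from 0
To append only one column of a DataFrame:
chosen_tasks = pd.concat([chosen_tasks, tmp_tasks[["id"]]], ignore_index= True)
Tip: always pass a DataFrame here (double brackets). A column selected as a Series (single brackets) is not stacked under id in the same way; with the older DataFrame.append it was inserted as a single row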
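The row-wise concatenation can be sketched as below, written with `pd.concat` because `DataFrame.append` is no longer available in pandas 2.0 and later. The sample ids and index labels are illustrative.

```python
import pandas as pd

chosen_tasks = pd.DataFrame({"id": [59, 5213]})
tmp_tasks = pd.DataFrame(
    {"id": [6700, 14539], "status": [0, 0]},
    index=[6698, 14537],
)

# Double brackets keep the selection a one-column DataFrame, so its rows are
# stacked under the existing "id" column.
combined = pd.concat([chosen_tasks, tmp_tasks[["id"]]], ignore_index=True)
print(combined)
```

With `ignore_index=True` the original labels 6698 and 14537 are dropped and the rows are renumbered consecutively.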
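Random sampling
First use numpy to generate a shuffled permutation of the row positions,
then take the rows at the first few of those positions
import numpy as np
sample_tmp_tasks = np.random.permutation(len(tmp_tasks))
# number of rows still needed to bring chosen_tasks up to task_num
remaining = task_num - len(chosen_tasks.index)
chosen_tasks = pd.concat([chosen_tasks, tmp_tasks.take(sample_tmp_tasks[:remaining])], ignore_index = True)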
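The shuffle-then-take approach is sketched below on a ten-row sample frame; the sample size `k = 3` and the seed are illustrative assumptions. For comparison, the sketch also shows pandas' own `DataFrame.sample`, which the original does not use but which draws rows without replacement in a single call.

```python
import numpy as np
import pandas as pd

tmp_tasks = pd.DataFrame({"id": range(100, 110)})
k = 3  # illustrative number of rows to draw

# Approach from the text: shuffle the row positions, then take the first k.
rng = np.random.default_rng(0)
positions = rng.permutation(len(tmp_tasks))
picked = tmp_tasks.take(positions[:k])

# Built-in alternative: sample k rows without replacement.
picked_alt = tmp_tasks.sample(n=k, random_state=0)

print(picked)
print(picked_alt)
```

Both results keep the original index labels of the drawn rows, so they can still be matched back to `tmp_tasks`.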
Updating one DataFrame from another
# update modifies all_tasks in place, matching rows by index label and columns by name
all_tasks.update(chosen_tasks)
where
all_tasks:
     id  query_id    doc_id  crowdsource_label  model_label  status
0     2         1  16680753                  0     1.632180       0
1     3         1  19015748                  0     0.969919       0
2     4         1  13011519                  0     1.209210       0
3     5         1  33807962                  0     0.544877       0
4     6         1  17383343                  0     2.000000       0
5     7         1  17006704                  0     1.425010       0
6     8         1  18684298                  0     1.363920       0
7     9         1  16133129                  0     1.205120       0
8    10         1    790034                  0     1.382500       0
9    11         1   8143635                  0     0.758511       0
10   12         1   4838134                  0     1.449270       0
11   13         1   4366254                  0     1.294460       0
12   14         1  22414724                  0     0.125651       0
13   15         1  12880897                  0     0.737783       0
14   16         1  26354296                  0     0.965067       0
15   17         1  30599973                  0     1.140350       0
...  ...      ...       ...                ...          ...     ...
chosen_tasks:
          id  status
25053  25055       1
20224  20226       1
14537  14539       1
25511  25513       1
18141  18143       1
5211    5213       1
57        59       1
6698    6700       1
26102  26104       1
14900  14902       1
Rows of the updated all_tasks where status is 1:
          id  query_id    doc_id  crowdsource_label  model_label  status
57        59         1  40587779                  0     0.918214     1.0
5211    5213        53  18775179                  0     0.955059     1.0
6698    6700        68  37017957                  0     0.995368     1.0
14537  14539       146  40129713                  0     0.843316     1.0
14900  14902       150  37066546                  0     0.968835     1.0
18141  18143       182  21561576                  0     0.853430     1.0
20224  20226       203  23267492                  0     0.819788     1.0
25053  25055       251  25513497                  0     0.941556     1.0
25511  25513       256  18468419                  0     0.897300     1.0
26102  26104       262  36742024                  0     0.889142     1.0
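The update step can be reproduced on a small scale. In the sketch below the four-row `all_tasks` and the two-row `chosen_tasks` are made-up samples. The point it illustrates is that `update` matches rows by index label, not by the `id` column, and only touches the columns present in the second frame.

```python
import pandas as pd

all_tasks = pd.DataFrame({
    "id": [2, 3, 4, 5],
    "model_label": [1.632180, 0.969919, 1.209210, 0.544877],
    "status": [0, 0, 0, 0],
})

# The index labels 1 and 3 decide which rows of all_tasks are overwritten.
chosen_tasks = pd.DataFrame({"id": [3, 5], "status": [1, 1]}, index=[1, 3])

# update works in place and returns None.
all_tasks.update(chosen_tasks)

print(all_tasks[all_tasks["status"] == 1])
```

Depending on the pandas version, the updated column may come back as float, which would explain the `1.0` values in the output above.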