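When you cast data with numpy.astype, integers longer than about 16 digits can change silently if the cast passes through a floating-point type, because float64 cannot represent every integer of that size exactly.
See the following example: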
import numpy as np
a=np.random.randint(10000000000000000,100000000000000000,6,dtype=np.int64).reshape(3,-1)
a
Out[250]:
array([[84627891441616237, 76092046570743607],
[98092567621991294, 29336557186973849],
[27275086880071664, 17713014931142607]], dtype=int64)
a==a.astype(np.float64).astype(np.int64)
Out[251]:
array([[False, False],
[False, False],
[ True, False]])
Compare a closely with the result of casting it to float64 and back:
a
Out[252]:
array([[84627891441616237, 76092046570743607],
[98092567621991294, 29336557186973849],
[27275086880071664, 17713014931142607]], dtype=int64)
a.astype(np.float64).astype(np.int64)
Out[253]:
array([[84627891441616240, 76092046570743600],
[98092567621991296, 29336557186973848],
[27275086880071664, 17713014931142608]], dtype=int64)
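All the values agree in roughly their first 15 to 16 significant digits, and the digits after that cannot be trusted. The cause is not a 32-bit limit but the float64 format itself: it has a 53-bit significand, so integers whose magnitude exceeds 2^53 (about 9.007e15) cannot all be represented exactly and are rounded to the nearest representable value.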
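The 2^53 threshold can be checked directly. The snippet below is a minimal sketch, not part of the original session: the helper name `round_trips_exactly` and the constant `FLOAT64_EXACT_LIMIT` are illustrative names chosen here, and the third sample is the first element of `a` from the output above.

```python
import numpy as np

# Up to this magnitude every integer is exactly representable in float64.
FLOAT64_EXACT_LIMIT = 2 ** 53  # 9007199254740992


def round_trips_exactly(values):
    """Return a boolean array: True where int64 -> float64 -> int64 is lossless."""
    arr = np.asarray(values, dtype=np.int64)
    return arr.astype(np.float64).astype(np.int64) == arr


samples = np.array(
    [FLOAT64_EXACT_LIMIT, FLOAT64_EXACT_LIMIT + 1, 84627891441616237],
    dtype=np.int64,
)

# Only the first sample survives the round trip unchanged.
exact = round_trips_exactly(samples)

# np.spacing reports the gap between a float64 and its next representable
# neighbour, which is the granularity integers are rounded to at that size.
gap = np.spacing(samples.astype(np.float64))

print(exact)
print(gap)
```

For values around 8e16 the gap is 16, which matches the converted output above, where every value in the first row is a multiple of 16.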
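How can this be fixed?
In my tests, using the object dtype of a pandas DataFrame solves the problem, as follows:
When converting the NumPy arrays to a DataFrame, specify dtype=object, so the int64 values are stored as-is instead of being promoted to float64 together with the float columns.
Once the DataFrame has been created, use astype to convert the column back to int64.
The following example shows this in practice: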
rl
Out[255]:
array([-8049777870090522920, -5440935078746751688, -3933548592432029974,
-2462334750121545038, -1190291399416696655, 501852907112055918,
1104104769051714879, 1318804999709453069, 1643349955204012180,
1985695761539862128, 2177922432728714602, 2539438373990063976,
2757041686965216513, 2930804226408986280, 4652176466101519414,
5587216625180694234, 6110778615839656518, 8414204104888822915],
dtype=int64)
V
Out[256]:
array([[ 1.17613153e+222, -5.20143643e+220, -2.56059855e+218,
-2.56059878e+218, 1.82560909e+211, 1.01358871e+211,
1.82560909e+211, -1.05320730e+221, 8.07676648e+221,
3.23330432e+194, -2.48561946e+218, -1.18058699e+219,
2.65113824e+164, 9.86541855e+219, -3.21047863e+219,
7.98645998e+193, -8.12021191e+210, 1.01358815e+211],
[-8.92833386e+221, -1.00145726e+221, 2.14225335e+218,
2.14225304e+218, -1.40702046e+211, -1.60296698e+211,
-1.40702046e+211, 1.04484451e+221, 9.40007615e+221,
2.73018012e+194, -3.14728928e+218, -1.05611169e+219,
9.09859019e+163, -5.68088783e+219, 5.46366951e+219,
-2.49687040e+194, -1.95946525e+210, -1.60296712e+211]])
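from pandas import DataFrame
# dtype=object keeps the int64 IDs from being promoted to float64 alongside V
rl=DataFrame([rl,V[0],V[1]],dtype=object).T
rl.columns=['SOURCEID','ax','ay']
# Cast the ID column back to int64; the values are unchanged
rl.SOURCEID=rl.SOURCEID.astype('int64')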
rl.SOURCEID.values
Out[258]:
array([-8049777870090522920, -5440935078746751688, -3933548592432029974,
-2462334750121545038, -1190291399416696655, 501852907112055918,
1104104769051714879, 1318804999709453069, 1643349955204012180,
1985695761539862128, 2177922432728714602, 2539438373990063976,
2757041686965216513, 2930804226408986280, 4652176466101519414,
5587216625180694234, 6110778615839656518, 8414204104888822915],
dtype=int64)
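The session above can be condensed into a small self-contained sketch. The arrays `ids` and `ax` below are made-up stand-ins for `rl` and `V[0]`, and the last step, building the DataFrame from a dict of columns, is an alternative not used in the original session. It is shown only because it appears to avoid the mixed-dtype stacking altogether.

```python
import numpy as np
import pandas as pd

# Illustrative data: two large int64 IDs and a float column.
ids = np.array([84627891441616237, -8049777870090522920], dtype=np.int64)
ax = np.array([1.5, -2.5])

# Stacking int64 and float64 rows in one NumPy array promotes everything
# to float64, so the IDs are already rounded at this point.
stacked = np.array([ids, ax])
lossy_ids = stacked[0].astype(np.int64)

# With dtype=object each element is stored as-is, so nothing is rounded.
frame = pd.DataFrame([ids, ax], dtype=object).T
frame.columns = ["SOURCEID", "ax"]
frame["SOURCEID"] = frame["SOURCEID"].astype("int64")
frame["ax"] = frame["ax"].astype("float64")

# Alternative (an assumption, not the original approach): pass one array
# per column so each column keeps its own dtype from the start.
by_column = pd.DataFrame({"SOURCEID": ids, "ax": ax})

print(lossy_ids == ids)
print(frame["SOURCEID"].to_numpy() == ids)
print(by_column.dtypes)
```

The object-dtype route costs extra memory and speed while the data sits in object columns, so casting each column back to a concrete dtype right after construction, as done here, is probably worthwhile.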