None of the US and UK PLA data after the 5th made it into ES; it turned out the date format did not match.
When reading, the column is already of timestamp type (see the dataFrame.printSchema output below).
The key issue now is the format conversion between the DataFrame and the CSV.
It looks like an explicit conversion is the more reliable approach:
val userBiddingResultSchema = FbUserBiddingResultPojo.structType
val userBiddingResultDf = sparkSession.read.schema(userBiddingResultSchema).csv(mdlResultPath)
It was a data-format problem: among the several hundred thousand US rows, some had a comma inside the URL field, which shifted the columns. We currently set inferSchema to true in Spark, i.e. we rely on Spark to parse each column's data and infer its type; once the US data contains such malformed rows, that column is no longer uniform and Spark infers it as string.
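The column shift can be reproduced with a plain-Scala sketch (the rows below are hypothetical; real rows have many more columns): a naive split on ',' yields the wrong field count as soon as a field contains an unescaped comma, which is exactly why the inferred type fell back to string.

```scala
object CommaShiftDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical 4-column rows; the second URL contains an unescaped comma
    val goodRow = "Google,136,PLA,http://example.com/a"
    val badRow  = "Google,136,PLA,http://example.com/a,b=1"

    // Naive comma split: the bad row yields 5 fields instead of 4,
    // so every column after the URL is shifted by one
    println(goodRow.split(",").length) // 4
    println(badRow.split(",").length)  // 5
  }
}
```

Spark's CSV reader can only cope with such a field if it is quoted in the source file (see the reader's quote/escape options); an unquoted comma cannot be recovered at parse time.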
ISO 8601 https://blog.csdn.net/dai451954706/article/details/46930167
Converting String to Date, and Date back to String:
https://www.cnblogs.com/mlfh1234/p/9210046.html
my code:

import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Locale

object Demo {
  def main(args: Array[String]): Unit = {
    println("begin...")
    val loc = new Locale("en")
    // ISO 8601 pattern: 'T' separator, millisecond precision, zone offset (XXX -> -07:00)
    val fm = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", loc)
    val tm = "2019-10-14T11:35:41.005-07:00"
    // String -> Date
    val dt2 = fm.parse(tm)
    println(dt2.getTime())
    // reverse: Timestamp -> String
    val ts = new Timestamp(System.currentTimeMillis())
    println(ts)
    println(fm.format(ts))
  }
}
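For reference, the same round trip can also be written with java.time, which parses the ISO 8601 offset format directly (a sketch, not part of the original pipeline):

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

object IsoParseDemo {
  def main(args: Array[String]): Unit = {
    val tm = "2019-10-14T11:35:41.005-07:00"
    // ISO_OFFSET_DATE_TIME accepts the same shape as yyyy-MM-dd'T'HH:mm:ss.SSSXXX
    val odt = OffsetDateTime.parse(tm, DateTimeFormatter.ISO_OFFSET_DATE_TIME)
    println(odt.toInstant.toEpochMilli) // epoch millis, matching SimpleDateFormat.parse(tm).getTime
    // reverse: format back to an ISO 8601 string
    println(odt.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME))
  }
}
```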
The comparison test, however, showed no difference between the date-conversion and direct string-handling approaches: both return 966 rows, while the source file actually has 962, i.e. some records are duplicated.
Parsing the file with a small Java program found the duplicated keys:
1572995047000Cj0KCQiAtf_tBRDtARIsAIbAKe187WBetzRXJ-It8hL053IZKhOiIs2eWazHPBlxLNeUkDsLCjwoZ4waAlHREALw_wcB | count : 4
1573091679000Cj0KCQjwr-_tBRCMARIsAN413WTp0pgFVL16yWU9VowmLmPL9gvuLox0DSBHS0yeCNcNQSJkbnsim84aAlIjEALw_wcB | count : 4
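The duplicate check that the Java program performs boils down to grouping rows by their key and keeping counts greater than one; a minimal Scala sketch (keys shortened here; the real key concatenates the timestamp and the click id):

```scala
object DupCountDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical shortened keys: timestamp + click id
    val keys = Seq("1572995047000:CjA", "1572995047000:CjA",
                   "1573091679000:CjB", "1573091679000:CjB",
                   "1572000000000:CjC")
    // Count occurrences per key, keep only the duplicated ones
    val dups = keys.groupBy(identity).map { case (k, v) => k -> v.size }.filter(_._2 > 1)
    dups.foreach { case (k, c) => println(s"$k | count : $c") }
  }
}
```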
The duplicate records were found in the original file (the primary keys are completely identical, but the trailing amount fields differ, some being 0):
Line 121:
Google,136,PLA,2,Cj0KCQiAtf_tBRDtARIsAIbAKe187WBetzRXJ-It8hL053IZKhOiIs2eWazHPBlxLNeUkDsLCjwoZ4waAlHREALw_wcB,2156222200820561,1140946,1824172998,1,2019-11-05,2019-11-05T16:04:07.000-07:00,1,101,47.99,48.85382,48.85382,,1,154.3780712
Line 606:
Google,136,PLA,2,Cj0KCQiAtf_tBRDtARIsAIbAKe187WBetzRXJ-It8hL053IZKhOiIs2eWazHPBlxLNeUkDsLCjwoZ4waAlHREALw_wcB,2156222200820561,1140946,1824172998,0,2019-11-05,2019-11-05T16:04:07.000-07:00,1,101,47.99,0.0,0.0,0.0,1,0.0
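Given such pairs, one possible cleanup (an assumption about the rule, not the pipeline's actual logic) is to keep, per primary key, the row whose amounts are populated rather than zeroed out:

```scala
object DedupDemo {
  // Simplified row: real records carry the full set of amount columns
  case class Row(key: String, status: Int, gmb: Double)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row("2156222200820561-2019-11-05", 1, 48.85382), // amounts populated
      Row("2156222200820561-2019-11-05", 0, 0.0)       // zeroed-out duplicate
    )
    // Keep one row per key, preferring the one with the larger gmb
    val deduped = rows.groupBy(_.key).values.map(_.maxBy(_.gmb)).toSeq
    deduped.foreach(println)
  }
}
```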
+-------+-------+-----------+----------------+------+-----------+--------------------+-------------------+----------------------+------+---+------------------+------------------+------------------+------------------+--------------------+----------+
|partner|channel|tgt_site_id|     rotation_id|abc_id|campaign_id|         ck_trans_dt|     conversionName|conversionCurrencyCode|status|cnt|           igmbsum|            gmbsum|           dgmbsum|         ibuyersum|conversion_value_sum| batchDate|
+-------+-------+-----------+----------------+------+-----------+--------------------+-------------------+----------------------+------+---+------------------+------------------+------------------+------------------+--------------------+----------+
|   Bing|    PLA|         71|7091533165446554|  null|  350769209|2019-10-29 00:00:...|offline_conversions|                   EUR|     0|  6|          37.38476| 49.21999999999999|12.041509999999999|1.5713719999999998|  118.13584159999999|2019-10-10|
|   Bing|    PLA|         71|7091533165446554|  null|  350769219|2019-10-27 00:00:...|offline_conversions|                   EUR|     0| 44|2645.4809920000007|       2878.731397|2544.0942840000007|11.541987999999998|   8359.719934719998|2019-10-10|
+-------+-------+-----------+----------------+------+-----------+--------------------+-------------------+----------------------+------+---+------------------+------------------+------------------+------------------+--------------------+----------+
Printing with dataFrame.printSchema gives:
root
|-- partner: string (nullable = true)
|-- channel: string (nullable = true)
|-- tgt_site_id: integer (nullable = true)
|-- rotation_id: long (nullable = true)
|-- abc_id: string (nullable = true)
|-- campaign_id: integer (nullable = true)
|-- ck_trans_dt: timestamp (nullable = true)
|-- conversionName: string (nullable = true)
|-- conversionCurrencyCode: string (nullable = true)
|-- status: integer (nullable = true)
|-- cnt: long (nullable = false)
|-- igmbsum: double (nullable = true)
|-- gmbsum: double (nullable = true)
|-- dgmbsum: double (nullable = true)
|-- ibuyersum: double (nullable = true)
|-- conversion_value_sum: double (nullable = true)
|-- batchDate: string (nullable = false)