Spark JDBC(1) MySQL Database RDD
Let's try to understand how JdbcRDD works in Spark.
First of all, notice from the logs below that the Spark master itself never connects to the database.
First step:
The driver connects to MySQL and fetches the minId and maxId, which will be used as the partition bounds.
150612 17:21:55 58 Connect
[email protected] on lmm
select coalesce(min(d.id), 0) from device d where d.last_updated >= '2014-06-12 00:00:00.0000' and d.last_updated < '2014-06-13 00:00:00.0000'
select coalesce(max(d.id), 0) from device d
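Here is a minimal sketch of what the driver side does, assuming mysql-connector-java is on the classpath; the JDBC host, port, and password are hypothetical, since the log only reveals the database name lmm and the user cluster:

import java.sql.DriverManager

// Driver-side sketch: fetch the partition bounds before building the RDD.
// Host/port and password are hypothetical; only db "lmm" and user "cluster"
// appear in the log above.
val url  = "jdbc:mysql://ubuntu-master:3306/lmm"
val conn = DriverManager.getConnection(url, "cluster", "password")
try {
  val stmt  = conn.createStatement()
  val minRs = stmt.executeQuery(
    "select coalesce(min(d.id), 0) from device d " +
    "where d.last_updated >= '2014-06-12 00:00:00.0000' " +
    "and d.last_updated < '2014-06-13 00:00:00.0000'")
  minRs.next()
  val minId = minRs.getLong(1)

  val maxRs = stmt.executeQuery("select coalesce(max(d.id), 0) from device d")
  maxRs.next()
  val maxId = maxRs.getLong(1)

  println(s"partition bounds: $minId .. $maxId")
} finally {
  conn.close()
}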
Second step: each worker fetches its own slice of the data, issuing a range query bounded by its partition's lower and upper id.
150612 17:22:13 59 Connect cluster@ubuntu-dev2 on lmm
select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
375001 <= d.id and
d.id <= 750001
select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
750002 <= d.id and
d.id <= 1125002
62 Connect cluster@ubuntu-dev1 on lmm
62 Query select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
0 <= d.id and
d.id <= 375000
63 Query select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
1500004 <= d.id and
d.id <= 1875004
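The id ranges in the log match what JdbcRDD computes for lowerBound = 0, upperBound = 1875004 and numPartitions = 5 (the fifth query, covering 1125003 to 1500003, just does not appear in this excerpt). Here is a minimal sketch of the RDD construction, again with hypothetical host and password; the two "?" placeholders in the SQL are required by JdbcRDD and get replaced with each partition's bounds:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext("spark://ubuntu-master:7077", "jdbc-rdd-demo")

// Each partition runs this query with its own bounds substituted for the two
// "?" placeholders, which is exactly what shows up in the MySQL log above.
val deviceRDD = new JdbcRDD(
  sc,
  () => {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection(
      "jdbc:mysql://ubuntu-master:3306/lmm", "cluster", "password")
  },
  "select id, tenant_id, date_created, last_updated, device_id, os_type, " +
  "os_version, search_radius, sdk_major_version, last_time_zone, sendable " +
  "from device d where ? <= d.id and d.id <= ?",
  0L,        // lowerBound (the minId fetched on the driver)
  1875004L,  // upperBound (the maxId fetched on the driver)
  5,         // numPartitions: splits the id range into 5 slices of ~375,001 ids
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("device_id"))
)

println("total devices: " + deviceRDD.count())

Because JdbcRDD divides [lowerBound, upperBound] into numPartitions equal slices, each executor pulls only its own id range; the full table never flows through the master or the driver.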
The sample JdbcRDD code is here:
https://github.com/luohuazju/sillycat-spark/tree/streaming
References:
http://spark.apache.org/docs/1.4.0/tuning.html
http://stackoverflow.com/questions/27619230/how-to-split-the-input-file-in-apache-spark