Spark JDBC(1)MySQL Database RDD

Let's try to understand how JdbcRDD works in Spark.
First of all, notice from the logs that the Spark master itself never connects to the database; only the driver and the workers do.

First step,
the driver connects to MySQL and fetches the minimum and maximum IDs (minId and maxId), which become the bounds of the RDD.
150612 17:21:55   58 Connect [email protected] on lmm
select coalesce(min(d.id), 0) from device d where d.last_updated >= '2014-06-12 00:00:00.0000' and d.last_updated < '2014-06-13 00:00:00.0000'
select coalesce(max(d.id), 0) from device d
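The two queries above only establish the ID range that will later be split into partitions. A minimal sketch of this bounding step, in Python with an in-memory SQLite database standing in for MySQL (the table and column names follow the log above; the helper name `fetch_id_bounds` is my own, not part of Spark):

```python
import sqlite3

def fetch_id_bounds(conn, lower_sql, upper_sql):
    """Run the two boundary queries and return (min_id, max_id)."""
    cur = conn.cursor()
    min_id = cur.execute(lower_sql).fetchone()[0]
    max_id = cur.execute(upper_sql).fetchone()[0]
    return min_id, max_id

# Stand-in for the MySQL `device` table seen in the log.
conn = sqlite3.connect(":memory:")
conn.execute("create table device (id integer primary key)")
conn.executemany("insert into device (id) values (?)",
                 [(i,) for i in (3, 42, 99)])

bounds = fetch_id_bounds(
    conn,
    "select coalesce(min(d.id), 0) from device d",
    "select coalesce(max(d.id), 0) from device d",
)
print(bounds)  # (3, 99)
```

The `coalesce(..., 0)` matters: on an empty table `min`/`max` return NULL, and the fallback keeps the bounds numeric.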

Second step,
all the workers fetch their share of the data according to the partition ranges:
150612 17:22:13   59 Connect cluster@ubuntu-dev2 on lmm
select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
          search_radius, sdk_major_version, last_time_zone, sendable
         from
          device d
         where
          375001 <= d.id and
          d.id <= 750001

select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
          search_radius, sdk_major_version, last_time_zone, sendable
         from
          device d
         where
          750002 <= d.id and
          d.id <= 1125002


62 Connect cluster@ubuntu-dev1 on lmm
62 Query select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
          search_radius, sdk_major_version, last_time_zone, sendable
         from
          device d
         where
          0 <= d.id and
          d.id <= 375000
63 Query select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
          search_radius, sdk_major_version, last_time_zone, sendable
         from
          device d
         where
          1500004 <= d.id and
          d.id <= 1875004
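The ranges in the log line up with JdbcRDD's partition-splitting arithmetic: the [lowerBound, upperBound] range is divided into numPartitions roughly equal, inclusive slices, one WHERE clause per partition. A sketch of that arithmetic in Python (the function name is mine; assuming lowerBound = 0, upperBound = 1875004 and 5 partitions, which reproduces the ranges seen above):

```python
def jdbc_partition_ranges(lower, upper, num_partitions):
    """Split the inclusive range [lower, upper] into per-partition
    (start, end) pairs, mirroring the boundary math JdbcRDD uses to
    build its per-partition WHERE clauses."""
    length = upper - lower + 1
    ranges = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        ranges.append((start, end))
    return ranges

for start, end in jdbc_partition_ranges(0, 1875004, 5):
    print(f"{start} <= d.id and d.id <= {end}")
# 0 <= d.id and d.id <= 375000
# 375001 <= d.id and d.id <= 750001
# 750002 <= d.id and d.id <= 1125002
# 1125003 <= d.id and d.id <= 1500003
# 1500004 <= d.id and d.id <= 1875004
```

Note the ranges filter on `d.id`, not on row counts, so if IDs are sparse or skewed the partitions can end up very unbalanced even though the ID slices are equal.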

The sample JdbcRDD code is here:
https://github.com/luohuazju/sillycat-spark/tree/streaming

References:
http://spark.apache.org/docs/1.4.0/tuning.html
http://stackoverflow.com/questions/27619230/how-to-split-the-input-file-in-apache-spark



