Spark:Task Locality参考

Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast. Butif code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data because code size is much smaller than data. Spark builds its scheduling around this general principle of data locality.

Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

PROCESS_LOCAL- data is in the same JVM as the running code. This is the best locality possible

NODE_LOCAL- data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes

NO_PREF- data is accessed equally quickly from anywhere and has no locality preference

RACK_LOCAL- data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch

ANY- data is elsewhere on the network and not in the same rack

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels. There are two options:

a) wait until a busy CPU frees up to start a task on data on the same server, or

b) immediately start a new task in a farther away place that requires moving data there

What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback between each level can be configured individually or all together in one parameter; see thespark.locality parameterson the configuration page for details. You should increase these settings if your tasks are long and see poor locality, but the default usually works well.

#

你可能感兴趣的:(Spark:Task Locality参考)