06, Spark speed-up techniques: time to read a 16 GB file

1, Plain data format: uncompressed (we use CSV)

Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources.
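As a rough sketch of how a plain CSV file might be read with PySpark: the helper names `csv_read_options` and `load_csv` and the option values are assumptions for illustration, not the setup used in this test. Turning `inferSchema` off avoids the extra pass Spark makes over the data to guess column types, which can matter on a large file.

```python
def csv_read_options(has_header=True):
    """Build reader options for a plain, uncompressed CSV file (illustrative values)."""
    return {
        "header": "true" if has_header else "false",
        # Schema inference needs an extra pass over the data, so it is off here;
        # every column is then read as a string.
        "inferSchema": "false",
    }


def load_csv(spark, path, has_header=True):
    """Load a CSV file into a DataFrame using an existing SparkSession."""
    return spark.read.options(**csv_read_options(has_header)).csv(path)
```

With an existing session this would be called as `df = load_csv(spark, path)`, where `path` points at the CSV file in the cluster's storage.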

2, Storage container: distributed storage (S3 works)

When you create a new Spark cluster, you can select Azure Blob Storage or Azure Data Lake Storage as your cluster’s default storage.

3, Use caching: we cache data that is used multiple times, but caching is not suitable for all data

Spark provides its own native caching mechanisms, which can be used through different methods such as .persist(), .cache(), and CACHE TABLE.
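The rule of thumb in this step (cache what is reused, not everything) could be sketched as follows. The helper `cache_if_reused` and its `times_used` parameter are hypothetical names introduced for illustration; only `DataFrame.cache()` is a Spark call.

```python
def cache_if_reused(df, times_used):
    """Cache a DataFrame only when it will be read more than once."""
    if times_used > 1:
        # DataFrame.cache() only marks the data for caching; it is materialized
        # by the first action that runs afterwards.
        return df.cache()
    # Data read a single time gains nothing from caching and would only
    # occupy memory, so it is returned unchanged.
    return df
```

Once the cached data is no longer needed, `df.unpersist()` releases it.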

4, Result: reading the 16 GB file and counting the total number of rows took 43 seconds
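One simple way to take a measurement like this is to time the `count()` action with a wall-clock timer. This is a minimal sketch under that assumption, not necessarily how the 43 seconds above were measured; `timed_count` is a hypothetical helper.

```python
import time


def timed_count(df):
    """Return (row_count, elapsed_seconds) for a full count of df."""
    start = time.perf_counter()
    # count() is an action, so it triggers the actual read of the data.
    rows = df.count()
    elapsed = time.perf_counter() - start
    return rows, elapsed
```

Because Spark evaluates lazily, the elapsed time covers both reading the file and counting, as long as the DataFrame has not been cached beforehand.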

