Introduction to Dask

Overview

  • Dask provides multi-core and distributed parallel execution on larger-than-memory datasets.
    We can think of Dask at two levels: a high level and a low level.

  • High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.

  • Low-level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads (see the sketch below). These schedulers are low-latency (around 1 ms per task) and work hard to run computations in a small memory footprint. Dask’s schedulers are an alternative to direct use of threading or multiprocessing libraries in complex cases, or to other task-scheduling systems like Luigi or IPython Parallel.
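For a taste of the low level, here is a minimal sketch of a hand-written task graph executed by the threaded scheduler (inc and add are made-up functions; a graph is just a dict mapping keys to values or task tuples):

from dask.threaded import get

def inc(x):
    return x + 1

def add(a, b):
    return a + b

# A task graph: keys name results, tuples are (function, *arguments).
dsk = {
    "x": 1,
    "y": (inc, "x"),      # y = inc(x)
    "z": (add, "y", 10),  # z = add(y, 10)
}

print(get(dsk, "z"))  # 12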

Tutorials:

0. High-Level Data Structures

  1. Array (vs. NumPy)
  • Dask arrays support most of the NumPy interface (see the sketch after this list)
  2. Bag (vs. list) https://docs.dask.org/en/latest/bag-overview.html
  • Parallel: similar to a Spark RDD or the toolz library
  • As with a list, we can map, filter, fold, and groupby (see the sketch after this list)
  • Common uses: Dask Bags are often used to parallelize simple computations on unstructured or semi-structured data such as text, log files, JSON records, or user-defined Python objects. They make processing big files easy.
  3. DataFrame (vs. Pandas)
  • Same API as Pandas: Pandas is so influential that even Spark is releasing Koalas to let users drive Spark through the Pandas API.
  • Common uses and anti-uses: in short, if a single machine can handle the data, just use Pandas. Also, avoid DataFrames when you can, since their computation is relatively inefficient (though this is almost unavoidable in data analytics). Details and a sketch follow below.
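A minimal Dask Array sketch (the shapes and chunk sizes are arbitrary illustrations):

import dask.array as da

# A 10000x10000 random array, split into 1000x1000 chunks.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# NumPy-style operations only build a lazy task graph...
y = (x + x.T).mean(axis=0)

# ...which runs, chunk by chunk and in parallel, on compute().
print(y.compute()[:5])

And a minimal Dask Bag sketch of map/filter-style operations (the records are made up):

import dask.bag as db

records = db.from_sequence(
    [{"name": "alice", "amount": 100},
     {"name": "bob", "amount": -5},
     {"name": "alice", "amount": 42}],
    npartitions=2,
)

total = (records
         .filter(lambda r: r["amount"] > 0)  # keep positive amounts
         .map(lambda r: r["amount"])         # project the amount
         .sum())                             # parallel reduction

print(total.compute())  # 142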
Dask DataFrame is used in situations where Pandas is commonly needed, usually when Pandas fails due to data size or computation speed:
  • Manipulating large datasets, even when those datasets don’t fit in memory
  • Accelerating long computations by using many cores
  • Distributed computing on large datasets with standard Pandas operations like groupby, join, and time-series computations

Dask DataFrame may not be the best choice in the following situations:
  • If your dataset fits comfortably into RAM on your laptop, then you may be better off just using Pandas; there may be simpler ways to improve performance than parallelism
  • If your dataset doesn’t fit neatly into the Pandas tabular model, then you might find more use in dask.bag or dask.array
  • If you need functions that are not implemented in Dask DataFrame, then you might want to look at dask.delayed, which offers more flexibility
  • If you need a proper database with all that databases offer, you might prefer something like Postgres
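A minimal Dask DataFrame sketch; the file pattern and column names are hypothetical:

import dask.dataframe as dd

# Read many CSV files as one logical dataframe (the glob pattern is hypothetical).
df = dd.read_csv("data/2024-*.csv")

# Standard Pandas-style operations build a lazy graph...
result = df.groupby("user_id")["amount"].sum()

# ...and compute() returns an ordinary in-memory Pandas object.
print(result.compute().head())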

1. Dask Scheduler (delayed, submit) - Lazy Execution

https://github.com/dask/dask-examples/blob/master/delayed.ipynb
https://github.com/dask/dask-examples/blob/master/futures.ipynb
Like Airflow, Dask's delayed interface is useful when you want computations to run in a particular order, expressed as a dependency graph; both Dask and Airflow are easy to scale up.
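A minimal sketch of lazy execution with dask.delayed; the functions load and process are made-up examples:

from dask import delayed

@delayed
def load(i):
    return i * 10

@delayed
def process(a, b):
    return a + b

# Nothing runs yet: calling delayed functions only builds a task graph.
a = load(1)
b = load(2)
total = process(a, b)

# Execution is triggered explicitly; independent tasks (a and b) can run in parallel.
print(total.compute())  # 30

The futures interface (client.submit from dask.distributed) is the eager counterpart: it starts work immediately on a cluster instead of building a lazy graph first.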

2. Distributed

Dask comes with four available schedulers (a selection sketch follows the list):

  • "threaded": a scheduler backed by a thread pool
  • "processes": a scheduler backed by a process pool
  • "single-threaded" (aka "sync"): a synchronous scheduler, good for debugging
  • distributed: a distributed scheduler for executing graphs on multiple machines
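A minimal sketch of selecting each scheduler; Client() below assumes the dask.distributed package is installed and starts a local cluster:

import dask
import dask.array as da

x = da.random.random((1000, 1000), chunks=(100, 100)).sum()

# Choose a scheduler per call...
x.compute(scheduler="threads")
x.compute(scheduler="processes")
x.compute(scheduler="single-threaded")  # aka "sync"; good for debugging with pdb

# ...or set one globally.
dask.config.set(scheduler="threads")

# The distributed scheduler: creating a Client makes it the default.
from dask.distributed import Client
client = Client()  # or Client("scheduler-address:8786") for a remote cluster
print(x.compute())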

3. Distributed ML

https://github.com/dask/dask-examples/blob/master/machine-learning.ipynb
https://github.com/dask/dask-examples/blob/master/dataframe.ipynb
Method 1: in-memory data (parallelize scikit-learn with joblib)
import joblib  # note: sklearn.externals.joblib is deprecated; import joblib directly
from dask.distributed import Client

client = Client()  # the 'dask' joblib backend needs a running distributed Client
with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)  # grid_search, X, y are assumed defined as usual in scikit-learn

Method 2: larger-than-memory data (Dask-ML)
Most estimators in scikit-learn are designed to work on in-memory arrays. Training with larger datasets may require different algorithms.

All of the algorithms implemented in Dask-ML work well on larger-than-memory datasets, which you might store in a Dask Array or DataFrame. A sketch follows.
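A minimal sketch, assuming the dask-ml package is installed; the dataset is synthetic:

from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Synthetic classification data stored as chunked Dask arrays.
X, y = make_classification(n_samples=100_000, chunks=10_000)

# Dask-ML estimators accept Dask collections, so training works
# over chunks instead of loading everything into memory at once.
clf = LogisticRegression()
clf.fit(X, y)
print(clf.score(X, y))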

4. Summary: Dask vs. Spark

After going through the tutorials, Dask and Spark feel very similar, except that Spark spans many platforms and languages while Dask is Python-only. Dask's delayed functionality can also be replaced by Airflow. The more detailed official comparison is here: https://docs.dask.org/en/stable/spark.html
