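pip install "dask[complete]"  # plain "pip install dask" installs only the core; the dataframe and distributed parts used below need the extras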
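High-level collections are used to generate task graphs, and those graphs can be executed by a scheduler on a single machine or on a cluster.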
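As a rough illustration of that sentence, the sketch below builds a small `dask.array` collection. The array shape and chunk size are arbitrary values picked for this example, not something the original text specifies. Creating `total` only records a task graph; the default scheduler runs it when `compute()` is called.

```python
import dask.array as da

# Assumed sizes for illustration: a 1000 x 1000 array split into 250 x 250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))

# Lazy: this only describes the work as a task graph, nothing is summed yet.
total = x.sum()

# Eager: the scheduler executes the graph and returns a concrete number.
value = total.compute()

print(x.numblocks)  # number of chunks along each axis
print(value)
```

The same pattern (build lazily, then call `compute()`) applies to the DataFrame examples that follow.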
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
import dask.bag as db
index = pd.date_range("2021-09-01", periods=2400, freq="1H")  # 2400 hourly timestamps starting at 2021-09-01
print(index.shape)  # (2400,)
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
# Split the pandas DataFrame into 10 partitions
ddf = dd.from_pandas(df, npartitions=10)
This creates a Dask DataFrame with 2 columns and 2400 rows, split into 10 partitions of 240 rows each. Each partition holds one chunk of the data.
ddf.divisions  # the index values that bound each partition
Indexing a Dask collection feels just like slicing a NumPy array or a pandas DataFrame.
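print(ddf.b)  # lazy: prints only the structure of the Dask Series
print(ddf.b.compute())  # compute() returns the actual pandas Series
ddf.a.mean().compute()  # 1199.5
ddf.b.unique().compute()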
"""
0 a
1 b
2 c
3 d
4 e
Name: b, dtype: object
"""
result = ddf["2021-10-01": "2021-10-09 5:00"].a.cumsum() - 100
result.compute()
'''
2021-10-01 00:00:00 620
2021-10-01 01:00:00 1341
2021-10-01 02:00:00 2063
2021-10-01 03:00:00 2786
2021-10-01 04:00:00 3510
...
2021-10-09 01:00:00 158301
2021-10-09 02:00:00 159215
2021-10-09 03:00:00 160130
2021-10-09 04:00:00 161046
2021-10-09 05:00:00 161963
Freq: H, Name: a, Length: 198, dtype: int32
'''
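# Rendering the graph needs the graphviz Python package (pip install graphviz)
# as well as the Graphviz system binaries.
result.visualize()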
Dask Delayed lets you wrap individual function calls into a lazily evaluated task graph.
import dask
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

a = inc(1)  # no work has happened yet
b = inc(2)  # no work has happened yet
c = add(a, b)  # no work has happened yet
c = c.compute()  # this triggers all of the above computations
print(c)  # 5
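Delayed becomes more useful when the graph is built in a loop. The following sketch extends the `inc` example to a list of inputs and reduces the results with a delayed `sum`; the input range is an arbitrary choice for illustration.

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

# Build four independent lazy tasks: inc(0), inc(1), inc(2), inc(3).
values = [inc(i) for i in range(4)]

# Wrapping the built-in sum makes the reduction part of the same graph.
total = dask.delayed(sum)(values)

# Nothing has run so far; compute() executes the whole graph.
result = total.compute()
print(result)  # 10
```

Because the four `inc` calls do not depend on each other, the scheduler is free to run them in parallel.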
Futures: unlike Delayed, computation starts as soon as a function is submitted.
from dask.distributed import Client
client = Client()
def inc(x):
    return x + 1

def add(x, y):
    return x + y

a = client.submit(inc, 1)  # work starts immediately
b = client.submit(inc, 2)  # work starts immediately
c = client.submit(add, a, b)  # work starts immediately
c = c.result()  # block until work finishes, then gather result
print(c)  # 5
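For many inputs, `client.map` submits one future per item and `client.gather` collects the results. The sketch below starts an in-process cluster with `processes=False`, which is an assumption made here to keep the example light; the original text uses a default `Client()`.

```python
from dask.distributed import Client

def inc(x):
    return x + 1

# processes=False keeps the scheduler and workers as threads in this process
# (a choice made for this example, not taken from the text above).
with Client(processes=False) as client:
    # One future per input; the work starts as soon as it is submitted.
    futures = client.map(inc, range(5))

    # gather blocks until every future has finished and returns the values.
    results = client.gather(futures)

print(results)  # [1, 2, 3, 4, 5]
```

Using the client as a context manager shuts the local cluster down when the block exits.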
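Once a task graph has been generated, executing it is the scheduler's job.
By default, when you call compute on a Dask object, Dask uses a thread pool on your machine to run the computation in parallel.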
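The local scheduler can also be chosen explicitly, either per call or through the configuration. The sketch below is a minimal illustration using a hypothetical `square` function; `"threads"` and `"synchronous"` are two of the scheduler names Dask accepts, the second one running everything in the calling thread, which helps when debugging.

```python
import dask

@dask.delayed
def square(x):
    return x * x

# A small graph: square five numbers, then add them up.
total = dask.delayed(sum)([square(i) for i in range(5)])

# Choose the scheduler for a single call.
threaded = total.compute(scheduler="threads")
single = total.compute(scheduler="synchronous")

# Or set it for every compute() inside a block.
with dask.config.set(scheduler="synchronous"):
    configured = total.compute()

print(threaded, single, configured)  # 30 30 30
```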
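If you want more control, you can use the distributed scheduler instead. Despite having "distributed" in its name, it works well on a single machine as well as on several; think of it as the "advanced scheduler".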
from dask.distributed import Client
client = Client()
print(client)
Once you have created a client, every computation runs on the cluster it points to.
from dask.distributed import Client
client = Client("")  # put the address of a running scheduler between the quotes
print(client)
When you use a distributed cluster, Dask provides a diagnostic dashboard where you can watch your tasks being processed.
print(client.dashboard_link)
#'http://127.0.0.1:61518/status'
Further reading: the Dask documentation.