Estimating GPT3 API Cost

Do you know how much the GPT3 API will cost?

A rough calculation suggests it can serve at most 790 requests/$.

GPT3 is pretty huge (175B parameters = 700GB), and you know how costly GPU inference can be. Even if we find a use case for it, we still need to justify the ROI. There are many blogs on its potential applications, but I haven't found anything on its pricing.

Let's try to estimate it from the fundamentals of cloud pricing.

Note: You can use this methodology to calculate the API cost for any model. People also like to use the AWS TCO (Total Cost of Ownership) calculator, but I enjoy doing it manually.

STEP 0 — Use case

Transformer compute is quadratic in sequence length. So it's extremely crucial to decide on the use case first, because the use case determines the sequence length.

The best use case for GPT3 is text generation given a prompt.

The prompt can be of any length, but 128 is a sensible guess. People also generate recursively, appending the previously generated text to the prompt to produce more.

GPT3 can take a seq_length of up to 1024 (the maximum supported), but due to the quadratic nature of the transformer, longer sequences make inference even costlier.

Let's fix the sequence length at 128 and then scale the result to estimate for 1024.

STEP 1 — Getting GPT2 inferences per hour

Assumptions

  • Seq length — 128
  • GPU + XLA inference on Tensorflow
  • V100 GPU instance
  • 12 vCPUs, 40GB of RAM
  • Batch size — 8

From the HuggingFace experiment sheet, GPT2 achieves an inference time of 0.02s for a batch size of 8 on Tensorflow GPU + XLA.

Hence it can serve 8 * 3600 / 0.02 = 1,440,000 inferences/hour.
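The throughput arithmetic above can be sketched in a few lines of Python (the 0.02s batch latency and batch size 8 come from the HuggingFace numbers cited above; variable names are my own):

```python
# GPT2 throughput on one V100 (Tensorflow GPU + XLA, seq length 128)
batch_size = 8
batch_latency_s = 0.02  # seconds per batch, from the HuggingFace experiment sheet

gpt2_inferences_per_hour = batch_size * 3600 / batch_latency_s
print(gpt2_inferences_per_hour)  # 1440000.0
```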

STEP 2 — Getting GPT3 inferences per hour

GPT2 — 1.5B parameters

GPT3 — 175B parameters

Since GPT3 cannot fit on 1 GPU, it's split across many. For simplicity, let's assume we can extrapolate the inference time linearly with parameter count, although multi-GPU inference can be slower due to passing activations from one GPU to another.

Equivalent GPT3 inferences/hour/GPU

= 1440000 * 1.5 / 175

= ~12,400
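Under that linear-scaling assumption, the same estimate in code (the exact quotient is ~12,343, which the article rounds up to ~12,400):

```python
# Scale GPT2 throughput down by the parameter ratio (linear-scaling assumption)
gpt2_inferences_per_hour = 1_440_000
gpt2_params_b = 1.5    # GPT2, billions of parameters
gpt3_params_b = 175.0  # GPT3, billions of parameters

gpt3_inferences_per_hour = gpt2_inferences_per_hour * gpt2_params_b / gpt3_params_b
print(round(gpt3_inferences_per_hour))  # 12343, rounded in the article to ~12400
```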

STEP 3 — Inference optimisation

HuggingFace mentions AMP (fp16) can increase throughput by 1.5x.

New inferences/hour/GPU

= 12400 * 1.5

= 18,600
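Applying the claimed fp16 speedup to the rounded figure from the previous step:

```python
# Apply the 1.5x AMP (fp16) throughput gain reported by HuggingFace
gpt3_inferences_per_hour = 12_400
amp_speedup = 1.5

optimised_inferences_per_hour = gpt3_inferences_per_hour * amp_speedup
print(optimised_inferences_per_hour)  # 18600.0
```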

STEP 4 — Cost per hour at full load

AWS p3.2xlarge costs $3.06/hour. A 1-year reserved instance with all-upfront payment brings up to a 36% discount.

Discounted cost = $3.06 × (1 − 0.36) = $1.96/hour

(An Azure V100 1-year reserved instance costs $1.72/hour.)
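The discounted hourly rate works out as follows (prices as quoted above; they change over time):

```python
# AWS p3.2xlarge: on-demand price with a 1-year all-upfront reserved discount
on_demand_per_hour = 3.06  # USD/hour
reserved_discount = 0.36   # up to 36% off for 1-year, all-upfront

discounted_per_hour = on_demand_per_hour * (1 - reserved_discount)
print(round(discounted_per_hour, 2))  # 1.96
```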

STEP 5 — Cost per inference

Cost per inference

= instance cost / inferences

= 1.96 / 18600

= $0.00010537634

It will cost you a minimum of $0.00010537634 per GPT3 API call.

For $1 you will be able to serve 9,490 API requests.
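Putting the last two steps together (variable names are my own):

```python
# Cost per inference and requests per dollar at full load
instance_cost_per_hour = 1.96  # USD/hour, discounted p3.2xlarge
inferences_per_hour = 18_600   # fp16-optimised GPT3 estimate

cost_per_inference = instance_cost_per_hour / inferences_per_hour
requests_per_dollar = 1 / cost_per_inference
print(round(cost_per_inference, 11))  # 0.00010537634
print(round(requests_per_dollar))     # 9490
```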

Longer sequence API

GPT2 with seq length 1024 and batch size 8 takes 0.195s, which is roughly 10x the time for seq length 128.

Hence you will be able to serve about 949 requests/$.

Conclusion

I hope this gives you a good idea of how to justify the use case for your business.

We haven't added any OpenAI profit margin to the API cost. But adding a 20% margin means it will be able to serve 949/1.2 = ~790 requests/$.
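The final figure, scaling the 128-token estimate by the ~10x longer-sequence latency and a hypothetical 20% margin:

```python
# Requests per dollar at seq length 1024, with a hypothetical 20% OpenAI margin
requests_per_dollar_128 = 9_490
seq_scaling = 10          # 0.195s vs 0.02s per batch, roughly 10x
profit_margin = 0.20

requests_per_dollar_1024 = requests_per_dollar_128 / seq_scaling  # 949.0
requests_with_margin = requests_per_dollar_1024 / (1 + profit_margin)
print(int(requests_with_margin))  # 790
```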

Do you think 790 requests/$ is good enough for your business?

Translated from: https://towardsdatascience.com/estimating-gpt3-api-cost-50282f869ab8
