Learn how to customize a model for your application.
This guide is intended for users of the new OpenAI fine-tuning API. If you are a legacy fine-tuning user, please refer to our legacy fine-tuning guide.
Fine-tuning lets you get more out of the models available through the API by providing:
GPT models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called "few-shot learning."
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.
At a high level, fine-tuning involves the following steps:
Visit our pricing page to learn more about how fine-tuned model training and usage are billed.
We are working on enabling fine-tuning for GPT-4 and expect this feature to be available later this year.
Fine-tuning is currently available for the following models:
gpt-3.5-turbo-0613
(recommended)babbage-002
davinci-002
We expect gpt-3.5-turbo
to be the right model for most users in terms of results and ease of use, unless you are migrating a legacy fine-tuned model.
Fine-tuning GPT models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, with the key reasons being:
Our GPT best practices guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.
Some common use cases where fine-tuning can improve results:
One high-level way to think about these cases is when it’s easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.
Another scenario where fine-tuning is effective is in reducing costs and / or latency, by replacing GPT-4 or by utilizing shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned gpt-3.5-turbo
model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.
Once you have determined that fine-tuning is the right solution (i.e. you’ve optimized your prompt as far as it can take you and identified problems that the model still has), you’ll need to prepare data for training the model. You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.
Each example in the dataset should be a conversation in the same format as our Chat completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses, these are three training examples (conversations) we could create for a dataset:
1
2
3
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
We do not currently support function calling examples but are working to enable this.
The conversational chat format is required to fine-tune gpt-3.5-turbo
. For babbage-002
and davinci-002
, you can follow the prompt completion pair format used for legacy fine-tuning as shown below.
1
2
3
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.
If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.
It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo
but the right number varies greatly based on the exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.
After collecting the initial dataset, we recommend splitting it into a training and test portion. When submitting a fine-tuning job with both training and test files, we will provide statistics on both during the course of training. These statistics will be your initial signal of how much the model is improving. Additionally, constructing a test set early on will be useful in making sure you are able to evaluate the model after training, by generating samples on the test set.
Each training example is limited to 4096 tokens. Examples longer than this will be truncated to the first 4096 tokens when training. To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under 4,000. The maximum number of total tokens trained per job is 50 million tokens (tokens_in_dataset * n_epochs
).
You can compute token counts using our counting tokens notebook from the OpenAI cookbook.
In order to estimate the cost of a fine-tuning job, please refer to the pricing page for details on the cost per 1k tokens. To estimate the costs for a specific fine-tuning job, use the following formula:
base cost per 1k tokens * number of tokens in the input file * number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40.
Once you have compiled a dataset and before you create a fine-tuning job, it is important to check the data formatting. To do this, we created a simple Python script which you can use to find potential errors, review token counts, and estimate the cost of a fine-tuning job.
Data formatting script
Once you have the data validated, the file needs to be uploaded in order to be used with a fine-tuning jobs:
1
2
3
4
5
6
7
8
import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.File.create(
file=open("mydata.jsonl", "rb"),
purpose='fine-tune'
)
After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job.
Start your fine-tuning job using the OpenAI SDK:
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
model
is the name of the model you're starting from (gpt-3.5-turbo
, babbage-002
, or davinci-002
). You can customize your fine-tuned model's name using the suffix parameter.
After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size. After the model training is completed, the user who created the fine-tuning job will receive an email confirmation.
In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)
# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ft-abc123")
# Cancel a job
openai.FineTuningJob.cancel("ft-abc123")
# List up to 10 events from a fine-tuning job
openai.FineTuningJob.list_events(id="ft-abc123", limit=10)
# Delete a fine-tuned model (must be an owner of the org the model was created in)
openai.Model.delete("ft-abc123")
When a job has succeeded, you will see the fine_tuned_model
field populated with the name of the model when you retrieve the job details. You may now specify this model as a parameter to in the Chat completions (for gpt-3.5-turbo
) or legacy Completions API (for babbage-002
and davinci-002
), and make requests to it using the Playground.
After your job is completed, the model should be available right away for inference use. In some cases, it may take several minutes for your model to become ready to handle requests. If requests to your model time out or the model name cannot be found, it is likely because your model is still being loaded. If this happens, try again in a few minutes.
1
2
3
4
5
6
7
8
9
completion = openai.ChatCompletion.create(
model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
You can start making requests by passing the model name as shown above and in our GPT guide.
We provide the following training metrics computed over the course of training: training loss, training token accuracy, test loss, and test token accuracy. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase).
However we think that evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model for inference. If manual evaluation is too time-consuming, consider using our Evals library for how to use GPT-4 to perform evaluations.
If the results from a fine-tuning job are not as good as you expected, consider the following ways to adjust the training dataset:
Once you’re satisfied with the quality and distribution of the examples, you can consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases". We expect a similar amount of improvement every time you double the number of training examples. You can loosely estimate the expected quality gain from increasing the training data size by:
In general, if you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.
We allow you to specify the number of epochs to fine-tune a model for. We recommend initially training without specifying the number of epochs, allowing us to pick a default for you based on dataset size, then adjusting if you observe the following:
Now that we have explored the basics of the fine-tuning API, let’s look at going through the fine-tuning lifecycle for a few different use cases.
Style and tone
Structured output
For users migrating from /v1/fine-tunes
to the updated /v1/fine_tuning/jobs
API and newer models, the main difference you can expect is the updated API. The legacy prompt completion pair data format has been retained for the updated babbage-002
and davinci-002
models to ensure a smooth transition. The new models will support fine-tuning with 4k token context and have a knowledge cutoff of September 2021.
For most tasks, you should expect to get better performance from gpt-3.5-turbo
than from the GPT base models.
Embeddings with retrieval is best suited for cases when you need to have a large database of documents with relevant context and information.
By default OpenAI’s models are trained to be helpful generalist assistants. Fine-tuning can be used to make a model which is narrowly focused, and exhibits specific ingrained behavior patterns. Retrieval strategies can be used to make new information available to a model by providing it with relevant context before generating its response. Retrieval strategies are not an alternative to fine-tuning and can in fact be complementary to it.
We plan to release support for fine-tuning both of these models later this year.
We recommend generating samples from both the base model and the fine-tuned model on a test set of chat conversations, and comparing the samples side by side. For more comprehensive evaluations, consider using the OpenAI evals framework to create an eval specific to your use case.
No, we do not currently support continuing the fine-tuning process once a job has finished. We plan to support this in the near future.
Please refer to the estimate cost section above.
No, we do not currently support this integration but are working to enable it in the near future.
Please refer to our rate limit guide for the most up to date information on the limits.