Code-along 2023-12-22: Fine-tuning GPT-3.5 with the OpenAI API
Some background context
Fine-tuning models lets you customize them for new tasks. By fine-tuning GPT-3.5, you can improve the accuracy of its responses, give it a particular tone of voice, have it talk about niche topics, and more. This first part of a two-part series covers how to use the OpenAI API and Python to get started fine-tuning GPT-3.5.
Data
This case study uses the Yahoo Non-Factoid Question Dataset, derived from Yahoo's Webscope L6 collection.
- It has 87,361 questions and their corresponding answers.
- It is freely available from Hugging Face.
Main tasks
The main tasks include:
- Loading data from Hugging Face
- Preprocessing the data for fine-tuning
- Fine-tuning the GPT-3.5 model
- Interacting with the fine-tuned model
Target audience
This case study would be of interest to:
- AI and Machine Learning Enthusiasts
- Data Scientists and Analysts
- Academics and Students
- Industry Professionals
- Software Developers
Key takeaways:
- Learn when fine-tuning large language models can be beneficial
- Learn how to use the fine-tuning tools in the OpenAI API
- Understand the workflow for fine-tuning
Task 0: Installing and Importing Relevant Packages
The main packages that need to be installed are:
- datasets: to load datasets from Hugging Face.
- openai: to interact with OpenAI models and their built-in functions.
- time: to track the fine-tuning time.
- random: to select random observations from the training data.
- json: the format of the training and validation data.
Instructions
Complete the following tasks to install the required packages:
- Use the --upgrade option of the pip command, run via python3, to upgrade pip
- Install the datasets package
- Install version 0.28 of the openai package
%%bash
# Upgrade pip, then install the pinned package versions
python3 -m pip install --upgrade pip
pip -q install -U datasets
pip -q install openai==0.28
- Note: Restart the kernel from the top-left tab by selecting
Run > Restart kernel
- This ensures that all the changes take effect
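After restarting, you can optionally confirm that the pinned openai version was installed. This quick check uses only the standard library and is not part of the original instructions:
from importlib.metadata import version
print(version("openai"))  # expect a 0.28.x release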
- Import the following packages:
  - FineTuningJob and ChatCompletion from openai
  - the load_dataset function from datasets
  - sleep from time
  - random
  - json
from openai import FineTuningJob, ChatCompletion
from datasets import load_dataset
from time import sleep
import random
import json
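Note: calls to the OpenAI API require an API key. The code-along environment may set this up for you; in your own environment you would typically configure it yourself. A minimal sketch, assuming the key is stored in an OPENAI_API_KEY environment variable:
import os
import openai

# Read the key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]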
Task 1: Data Loading
In this section, you will load the yahoo_answers_qa dataset from Hugging Face using the load_dataset function.
Instructions
- Acquire the train split of the yahoo_answers_qa data
yahoo_answers_qa = load_dataset("yahoo_answers_qa", split="train")
- Check the features/columns and the total number of rows of the data
yahoo_answers_qa
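The printed representation should look roughly like the following (field names taken from the Hugging Face dataset card; treat the exact layout as illustrative):
# Dataset({
#     features: ['id', 'question', 'answer', 'nbestanswers', 'main_category'],
#     num_rows: 87362
# })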
- From the above command, you will notice that there are 87,362 rows in the dataset, and such a large amount of data can take a long time to process, especially during the fine-tuning process.
- For simplicity's sake, let's use a subset of 150 rows from the previously loaded dataset.
- Use the .select and the range functions to select a subset of 150 rows, as shown in the sketch below
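A minimal sketch of that step, reusing the yahoo_answers_qa object loaded above:
# Keep only the first 150 rows to speed up preprocessing and fine-tuning
yahoo_answers_qa = yahoo_answers_qa.select(range(150))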
- Use the