Colab : https://colab.research.google.com/github/AI4Finance-Foundation/FinGPT/blob/master/FinGPT_Training_LoRA_with_ChatGLM2_6B_for_Beginners.ipynb
Welcome to this comprehensive guide aimed at beginners diving into the realm of Financial Large Language Models (FinLLMs) with FinGPT.
This blog post demystifies the process of training FinGPT using Low-Rank Adaptation (LoRA) with the robust base model ChatGLM2-6B.
Data preparation is a crucial step when it comes to training Financial Large Language Models.
Here, we’ll guide you on how to get your dataset ready for FinGPT using Python.
In this section, you’ve initialized your working directory and loaded a financial sentiment dataset. Let’s break down the steps:
pip install datasets transformers torch tqdm pandas huggingface_hub
pip install sentencepiece
pip install protobuf transformers==4.30.2 cpm_kernels torch>=2.0 gradio mdtex2html sentencepiece accelerate
Initialize Directories:
This block checks if certain paths exist; if they do, it deletes them to avoid data conflicts, and then creates a new directory for the upcoming data.
# If files from a previous run already exist here, delete them
import os
import shutil
jsonl_path = "../data/dataset_new.jsonl"
save_path = '../data/dataset_new'
if os.path.exists(jsonl_path):
    os.remove(jsonl_path)
if os.path.exists(save_path):
    shutil.rmtree(save_path)

directory = "../data"
if not os.path.exists(directory):
    os.makedirs(directory)
from datasets import load_dataset
import datasets
dic = {
0:"negative",
1:'positive',
2:'neutral',
}
tfns = load_dataset('zeroshot/twitter-financial-news-sentiment')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns['label'] = tfns['label'].apply(lambda x:dic[x])
tfns['instruction'] = 'What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.'
tfns.columns = ['input', 'output', 'instruction']
tfns = datasets.Dataset.from_pandas(tfns)
tfns
tmp_dataset = datasets.concatenate_datasets([tfns]*2)
train_dataset = tmp_dataset
print(tmp_dataset.num_rows)
all_dataset = train_dataset.shuffle(seed = 42)
all_dataset.shape
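To quickly confirm the structure, you can print a single example; this small check is not part of the original notebook, but the field names match the renaming above:
# Peek at one example to verify the input/output/instruction fields
all_dataset[0]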
Now that your training data is loaded and prepared, the next steps involve formatting the dataset for model ingestion and tokenizing the input data. Below, we provide a step-by-step breakdown of the code snippets.
You need to structure your data in a specific format that aligns with the training process.
import json
from tqdm.notebook import tqdm
def format_example(example: dict) -> dict:
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    target = example["output"]
    return {"context": context, "target": target}
data_list = []
for item in all_dataset.to_pandas().itertuples():
    tmp = {}
    tmp["instruction"] = item.instruction
    tmp["input"] = item.input
    tmp["output"] = item.output
    data_list.append(tmp)

# save to a jsonl file
with open("../data/dataset_new.jsonl", 'w') as f:
    for example in tqdm(data_list, desc="formatting.."):
        f.write(json.dumps(format_example(example)) + '\n')
Tokenization is the process of converting input text into tokens that can be fed into the model.
import datasets
from transformers import AutoTokenizer, AutoConfig
model_name = "THUDM/chatglm2-6b"
jsonl_path = "../data/dataset_new.jsonl" # updated path
save_path = '../data/dataset_new' # updated path
max_seq_length = 512
skip_overlength = True
The preprocess function tokenizes the prompt and target, combines them into input IDs, and then trims or pads the sequence to the maximum sequence length.
def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}
The read_jsonl function reads each line from the JSONL file, preprocesses it using the preprocess function, and then yields each preprocessed example.
def read_jsonl(path, max_seq_length, skip_overlength=False):
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(
        model_name, trust_remote_code=True, device_map='auto')
    with open(path, "r") as f:
        for line in tqdm(f.readlines()):
            example = json.loads(line)
            feature = preprocess(tokenizer, config, example, max_seq_length)
            if skip_overlength and len(feature["input_ids"]) > max_seq_length:
                continue
            feature["input_ids"] = feature["input_ids"][:max_seq_length]
            yield feature
The script then creates a Hugging Face Dataset object from the generator and saves it to disk.
save_path = '../data/dataset_new'
dataset = datasets.Dataset.from_generator(
lambda: read_jsonl(jsonl_path, max_seq_length, skip_overlength)
)
dataset.save_to_disk(save_path)
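As an optional sanity check (not in the original notebook), you can reload the saved dataset from disk and inspect one tokenized example:
# Optional: reload the tokenized dataset and inspect the first example
check = datasets.load_from_disk(save_path)
print(check)
print(check[0]["seq_len"], len(check[0]["input_ids"]))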
Training a model can be resource-intensive.
Ensure you have a powerful GPU. You may need to purchase a Google Colab GPU plan: Colab Pro is sufficient, or you can simply buy 100 compute units for $10. An NVIDIA A100 is recommended due to its high memory capacity.
pip install torch torchvision torchaudio
pip install transformers
pip install loguru
pip install datasets
pip install peft
pip install bitsandbytes
pip install tensorboard
pip install sentencepiece
pip install accelerate -U
Ensure CUDA is accessible in the system path
Only for Windows Subsystem for Linux (WSL)
import os
os.environ["PATH"] = f"{os.environ['PATH']}:/usr/local/cuda/bin"
os.environ['LD_LIBRARY_PATH'] = "/usr/lib/wsl/lib:/usr/local/cuda/lib64"
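Before going further, it helps to confirm that PyTorch actually sees a GPU; this is a minimal check, not part of the original notebook:
import torch

# Verify that CUDA is available and which GPU was assigned
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("No GPU detected")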
Initialize and set training arguments.
from typing import List, Dict, Optional
import torch
from loguru import logger
from transformers import (
AutoModel,
AutoTokenizer,
TrainingArguments,
Trainer,
BitsAndBytesConfig
)
from peft import (
TaskType,
LoraConfig,
get_peft_model,
set_peft_model_state_dict,
prepare_model_for_kbit_training,
prepare_model_for_int8_training,
)
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
training_args = TrainingArguments(
output_dir='./finetuned_model', # saved model path
logging_steps = 500,
# max_steps=10000,
num_train_epochs = 2,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
learning_rate=1e-4,
weight_decay=0.01,
warmup_steps=1000,
save_steps=500,
fp16=True,
# bf16=True,
torch_compile = False,
load_best_model_at_end = True,
evaluation_strategy="steps",
remove_unused_columns=False,
)
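Note that with per_device_train_batch_size=4 and gradient_accumulation_steps=8, the effective batch size is 4 × 8 = 32 examples per optimizer update.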
Set quantization configuration to reduce model size without losing significant precision.
# Quantization
q_config = BitsAndBytesConfig(load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16
)
Load the base model and tokenizer, and prepare the model for INT8 training.
Loading the tokenizer & model requires a large amount of memory and disk space.
model_name = "THUDM/chatglm2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_name,
quantization_config=q_config,
trust_remote_code=True,
device='cuda'
)
model = prepare_model_for_int8_training(model, use_gradient_checkpointing=True)
Implement Low-Rank Adaptation (LoRA) and print trainable parameters.
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
LoRA
target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['chatglm']
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
target_modules=target_modules,
bias='none',
)
model = get_peft_model(model, lora_config)
print_trainable_parameters(model)
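With r=8 and lora_alpha=32, each LoRA update is scaled by lora_alpha / r = 4, and only the low-rank adapter matrices injected into ChatGLM2's attention projection (query_key_value) are trained, which is why the trainable-parameter count printed above is a tiny fraction of the 6B total.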
resume_from_checkpoint = None
if resume_from_checkpoint is not None:
    checkpoint_name = os.path.join(resume_from_checkpoint, 'pytorch_model.bin')
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, 'adapter_model.bin'
        )
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        logger.info(f'Restarting from {checkpoint_name}')
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        logger.info(f'Checkpoint {checkpoint_name} not found')
model.print_trainable_parameters()
In this segment, we’ll load your pre-processed data and then launch the training of your FinGPT model. Here’s a stepwise breakdown of the script:
# load data
from datasets import load_from_disk
import datasets
dataset = datasets.load_from_disk("../data/dataset_new")
dataset = dataset.train_test_split(0.2, shuffle=True, seed = 42)
class ModifiedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        ).loss

    def prediction_step(self, model: torch.nn.Module, inputs, prediction_loss_only: bool, ignore_keys=None):
        with torch.no_grad():
            res = model(
                input_ids=inputs["input_ids"].to(model.device),
                labels=inputs["labels"].to(model.device),
            ).loss
        return (res, None, None)

    def save_model(self, output_dir=None, _internal_call=False):
        from transformers.trainer import TRAINING_ARGS_NAME
        os.makedirs(output_dir, exist_ok=True)
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        # Save only the trainable (LoRA adapter) parameters instead of the full model
        saved_params = {
            k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        }
        torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))
def data_collator(features: list) -> dict:
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids)
    input_ids = []
    labels_list = []
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        # Replace the prompt positions and the right padding in the labels with the pad token,
        # and pad every sequence in the batch to the length of the longest one
        labels = (
            [tokenizer.pad_token_id] * (seq_len - 1) + ids[(seq_len - 1):] + [tokenizer.pad_token_id] * (longest - ids_l)
        )
        ids = ids + [tokenizer.pad_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
    }
from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback
Train
Training took about 10 compute units.
writer = SummaryWriter()
trainer = ModifiedTrainer(
model=model,
args=training_args, # Trainer args
train_dataset=dataset["train"], # Training set
eval_dataset=dataset["test"], # Testing set
data_collator=data_collator, # Data Collator
callbacks=[TensorBoardCallback(writer)],
)
trainer.train()
writer.close()
# save model
model.save_pretrained(training_args.output_dir)
After training, save and download your model. You can also check the model’s size.
zip -r /content/saved_model.zip /content/{training_args.output_dir}
download to local
from google.colab import files
files.download('/content/saved_model.zip')
save to google drive
from google.colab import drive
drive.mount('/content/drive')
save the finetuned model to google drive
!cp -r "/content/finetuned_model" "/content/drive/MyDrive"
def get_folder_size(folder_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / 1024 / 1024  # Size in MB
model_size = get_folder_size(training_args.output_dir)
print(f"Model size: {model_size} MB")
Now your model is trained and saved! You can download it and use it for generating financial insights or any other relevant tasks in the finance domain.
The usage of TensorBoard allows you to deeply understand and visualize the training dynamics and performance of your model in real-time.
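To view those curves in Colab during or after training, you can point TensorBoard at the SummaryWriter's default log directory (./runs); a minimal sketch, assuming the default writer settings used above:
# Run in a Colab/Jupyter cell
%load_ext tensorboard
%tensorboard --logdir runs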
Happy FinGPT Training!
Now that your model is trained, let’s understand how to use it to infer and run benchmarks.
pip install transformers==4.30.2 peft==0.4.0
pip install sentencepiece
pip install accelerate
pip install torch
pip install peft
pip install datasets
pip install bitsandbytes
clone the FinNLP repository
git clone https://github.com/AI4Finance-Foundation/FinNLP.git
import sys
sys.path.append('/content/FinNLP/')
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
# Load benchmark datasets from FinNLP
from finnlp.benchmarks.fpb import test_fpb
from finnlp.benchmarks.fiqa import test_fiqa , add_instructions
from finnlp.benchmarks.tfns import test_tfns
from finnlp.benchmarks.nwgi import test_nwgi
pip install --upgrade peft
load model from google drive
from google.colab import drive
drive.mount('/content/drive')
Define the path you want to check
path_to_check = "/content/drive/My Drive/finetuned_model"
# Check if the specified path exists
if os.path.exists(path_to_check):
    print("Path exists.")
else:
    print("Path does not exist.")
load the chatglm2-6b base model
base_model = "THUDM/chatglm2-6b"
peft_model = training_args.output_dir
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, peft_model)
model = model.eval()
load our finetuned model
base_model = "THUDM/chatglm2-6b"
peft_model = "/content/drive/My Drive/finetuned_model"
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, peft_model)
model = model.eval()
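Before running the full benchmarks, you can try a single prompt by hand. The sketch below reuses the Instruction/Input/Answer template from training; the example tweet is made up for illustration:
# Build a prompt in the same format used during fine-tuning
prompt = (
    "Instruction: What is the sentiment of this tweet? "
    "Please choose an answer from {negative/neutral/positive}.\n"
    "Input: $AAPL shares jump after record quarterly earnings.\n"  # hypothetical tweet
    "Answer: "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=8)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))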
batch_size = 8
# TFNS test set, len 2388
# Available: 84.85 compute units
res = test_tfns(model, tokenizer, batch_size = batch_size)
# Available: 83.75 compute units
# Took about 1 compute unit for inference
# FPB, len 1212
res = test_fpb(model, tokenizer, batch_size = batch_size)
# FiQA, len 275
res = test_fiqa(model, tokenizer, prompt_fun = add_instructions, batch_size = batch_size)
# NWGI, len 4047
res = test_nwgi(model, tokenizer, batch_size = batch_size)
Benchmark results (for comparison with FinGPT V3.1, see https://github.com/AI4Finance-Foundation/FinGPT/tree/master/fingpt/FinGPT-v3):

TFNS:
- FinGPT V3.1:
- This notebook:

FPB:
- FinGPT V3.1:
- This notebook:

FiQA:
- FinGPT V3.1:
- This notebook:

Since the FiQA dataset wasn’t part of our training set, our model’s zero-shot performance is relatively poor compared to FinGPT V3.1.

NWGI:
- FinGPT V3.1:
- This notebook:
This exercise provided insights into the performance of your trained FinGPT model across various benchmarks. While there are areas where it excels, certain benchmarks highlight opportunities for improvement and tuning. Exploring additional training data and refining the model further will likely lead to enhanced performance across different financial NLP tasks, making it a powerful tool for various applications in the finance sector.
Happy Experimenting with FinGPT!