FinGPT: Training LoRA with ChatGLM2-6B for Beginners



Colab: https://colab.research.google.com/github/AI4Finance-Foundation/FinGPT/blob/master/FinGPT_Training_LoRA_with_ChatGLM2_6B_for_Beginners.ipynb


Getting Started with FinGPT

Welcome to this comprehensive guide aimed at beginners diving into the realm of Financial Large Language Models (FinLLMs) with FinGPT.

This blog post demystifies the process of training FinGPT using Low-Rank Adaptation (LoRA) on top of the ChatGLM2-6B base model.


Part 1: Preparing the Data

Data preparation is a crucial step when it comes to training Financial Large Language Models.

Here, we’ll guide you on how to get your dataset ready for FinGPT using Python.

In this section, you'll initialize your working directory and load a financial sentiment dataset. Let's break down the steps:


pip install datasets transformers torch tqdm pandas huggingface_hub
pip install sentencepiece
pip install protobuf transformers==4.30.2 cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate


1.1 Initialize Directories

This block checks whether certain paths exist; if they do, it deletes them to avoid data conflicts, and then creates a new directory for the upcoming data.

If these files already exist from a previous run, delete them first:

import os
import shutil

jsonl_path = "../data/dataset_new.jsonl"
save_path = '../data/dataset_new'


if os.path.exists(jsonl_path):
    os.remove(jsonl_path)

if os.path.exists(save_path):
    shutil.rmtree(save_path)

directory = "../data"
if not os.path.exists(directory):
    os.makedirs(directory)


1.2 Load and Prepare Dataset:
  • Import necessary libraries from the datasets package: https://huggingface.co/docs/datasets/index
  • Load the Twitter Financial News Sentiment (TFNS) dataset and convert it to a Pandas dataframe. https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment
  • Map numerical labels to their corresponding sentiments (negative, positive, neutral).
  • Add an instruction to each data entry, which is crucial for instruction tuning.
  • Convert the Pandas dataframe back to a Hugging Face Dataset object.
from datasets import load_dataset
import datasets

dic = {
    0: "negative",
    1: "positive",
    2: "neutral",
}

tfns = load_dataset('zeroshot/twitter-financial-news-sentiment')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns['label'] = tfns['label'].apply(lambda x:dic[x])
tfns['instruction'] = 'What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.'
tfns.columns = ['input', 'output', 'instruction']
tfns = datasets.Dataset.from_pandas(tfns)
tfns
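
As a quick sanity check, you can print one converted record to confirm the columns were renamed and the labels mapped as intended (optional; the expected keys are input, output, and instruction):

# Peek at one record; expect the keys: input, output, instruction
print(tfns[0])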


1.3 Concatenate and Shuffle Dataset

Concatenating two copies of the TFNS data simply doubles the number of training rows; the combined dataset is then shuffled with a fixed seed for reproducibility.

tmp_dataset = datasets.concatenate_datasets([tfns]*2)
train_dataset = tmp_dataset
print(tmp_dataset.num_rows)

all_dataset = train_dataset.shuffle(seed = 42)
all_dataset.shape

Your training data is now loaded and prepared.


Part 2: Dataset Formatting and Tokenization

Once your data is prepared, the next steps involve formatting the dataset for model ingestion and tokenizing the input data.
Below, we provide a step-by-step breakdown of the code snippets shared.


2.1 Dataset Formatting:

Each record needs to be structured in the format used during training: a context (the instruction plus the input text) and a target (the expected answer).

import json
from tqdm.notebook import tqdm


def format_example(example: dict) -> dict:
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    target = example["output"]
    return {"context": context, "target": target}


data_list = []
for item in all_dataset.to_pandas().itertuples():
    tmp = {}
    tmp["instruction"] = item.instruction
    tmp["input"] = item.input
    tmp["output"] = item.output
    data_list.append(tmp)


# save to a jsonl file
with open("../data/dataset_new.jsonl", 'w') as f:
    for example in tqdm(data_list, desc="formatting.."):
        f.write(json.dumps(format_example(example)) + '\n')
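
To verify the file was written in the expected context/target format, you can read back the first line (an optional check):

# Each line should be a JSON object with "context" and "target" fields
with open("../data/dataset_new.jsonl") as f:
    print(json.loads(f.readline()))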

2.2 Tokenization

Tokenization is the process of converting input text into tokens that can be fed into the model.

import datasets
from transformers import AutoTokenizer, AutoConfig

model_name = "THUDM/chatglm2-6b"
jsonl_path = "../data/dataset_new.jsonl"  # updated path
save_path = '../data/dataset_new'  # updated path
max_seq_length = 512
skip_overlength = True


The preprocess function tokenizes the prompt and the target, concatenates them into a single sequence of input IDs ending with the EOS token, and records the prompt length so the prompt portion can be masked out later by the data collator.

def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


The read_jsonl function reads each line from the JSONL file, preprocesses it with preprocess, optionally skips examples that exceed max_seq_length, trims the rest to that length, and yields each feature.

def read_jsonl(path, max_seq_length, skip_overlength=False):
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(
        model_name, trust_remote_code=True, device_map='auto')
    with open(path, "r") as f:
        for line in tqdm(f.readlines()):
            example = json.loads(line)
            feature = preprocess(tokenizer, config, example, max_seq_length)
            if skip_overlength and len(feature["input_ids"]) > max_seq_length:
                continue
            feature["input_ids"] = feature["input_ids"][:max_seq_length]
            yield feature


2.3 Save the dataset

The script then creates a Hugging Face Dataset object from the generator and saves it to disk.

save_path = '../data/dataset_new'

dataset = datasets.Dataset.from_generator(
    lambda: read_jsonl(jsonl_path, max_seq_length, skip_overlength)
    )
dataset.save_to_disk(save_path)
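
As an optional check, reload the dataset from disk and inspect its size and features to confirm it was saved correctly:

# Reload the tokenized dataset and inspect it
check = datasets.load_from_disk(save_path)
print(check.num_rows)
print(check.features)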

Part 3: Set up FinGPT Training Parameters with LoRA on ChatGLM2-6B

Training a model can be resource-intensive, so make sure you have access to a powerful GPU.
On Google Colab this requires a paid GPU plan: Colab Pro is sufficient, or you can simply buy 100 compute units for $10. An NVIDIA A100 is recommended due to its high memory capacity.

pip install torch torchvision torchaudio
pip install transformers
pip install loguru
pip install datasets
pip install peft
pip install bitsandbytes
pip install tensorboard
pip install sentencepiece
pip install accelerate -U

Ensure CUDA is accessible in the system path (only needed on Windows Subsystem for Linux, WSL):

import os
os.environ["PATH"] = f"{os.environ['PATH']}:/usr/local/cuda/bin"
os.environ['LD_LIBRARY_PATH'] = "/usr/lib/wsl/lib:/usr/local/cuda/lib64"
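
Whichever environment you use, a quick check that PyTorch can actually see the GPU saves time before a long run (a minimal sketch):

import torch

# Confirm a CUDA device is visible before starting training
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))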

3.1 Training Arguments Setup:

Initialize and set training arguments.


from typing import List, Dict, Optional
import torch
from loguru import logger
from transformers import (
    AutoModel,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)

from peft import (
    TaskType,
    LoraConfig,
    get_peft_model,
    set_peft_model_state_dict,
    prepare_model_for_kbit_training,
    prepare_model_for_int8_training,
)

from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

training_args = TrainingArguments(
        output_dir='./finetuned_model',    # saved model path
        logging_steps = 500,
        # max_steps=10000,
        num_train_epochs = 2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        weight_decay=0.01,
        warmup_steps=1000,
        save_steps=500,
        fp16=True,
        # bf16=True,
        torch_compile = False,
        load_best_model_at_end = True,
        evaluation_strategy="steps",
        remove_unused_columns=False,
    )
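
Note that the effective batch size per optimizer update is the per-device batch size multiplied by the gradient accumulation steps; a tiny illustration using the numbers above:

# Effective batch size per weight update (single GPU)
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
print(per_device_train_batch_size * gradient_accumulation_steps)  # 32 sequences per update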

3.2 Quantization Config Setup:

Set quantization configuration to reduce model size without losing significant precision.


# Quantization
q_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type='nf4',
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_compute_dtype=torch.float16
                                )

3.3 Model Loading & Preparation:

Load the base model and tokenizer, and prepare the quantized model for training. (The older peft helper prepare_model_for_int8_training is used here; recent peft versions replace it with prepare_model_for_kbit_training.)

  • Runtime -> Change runtime type -> A100 GPU
  • Restart the runtime and run again if it does not work.

Loading the tokenizer and the model requires a large amount of GPU memory and disk space.

model_name = "THUDM/chatglm2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
        model_name,
        quantization_config=q_config,
        trust_remote_code=True,
        device='cuda'
    )
model = prepare_model_for_int8_training(model, use_gradient_checkpointing=True)

3.4 LoRA Config & Setup:

Implement Low-Rank Adaptation (LoRA) and print trainable parameters.


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print( f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )



Set up the LoRA configuration and wrap the model:

target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['chatglm']
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    bias='none',
)
model = get_peft_model(model, lora_config)
print_trainable_parameters(model) 

resume_from_checkpoint = None
if resume_from_checkpoint is not None:
    checkpoint_name = os.path.join(resume_from_checkpoint, 'pytorch_model.bin')
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, 'adapter_model.bin'
        )
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        logger.info(f'Restarting from {checkpoint_name}')
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        logger.info(f'Checkpoint {checkpoint_name} not found')
 

model.print_trainable_parameters()

Part 4: Loading Data and Training FinGPT

In this segment, we load the pre-processed data and launch the training of your FinGPT model. Here is a stepwise breakdown of the script:

  • Requires a Colab GPU plan (Colab Pro, or 100 compute units for $10), as noted in Part 3.

4.1 Loading Your Data:

# load data
import datasets

dataset = datasets.load_from_disk("../data/dataset_new")
dataset = dataset.train_test_split(0.2, shuffle=True, seed=42)
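
It is worth confirming the split sizes before launching training (an optional check):

# dataset is now a DatasetDict with "train" and "test" splits
print(dataset["train"].num_rows, dataset["test"].num_rows)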

4.2 Training Configuration and Launch:
  • Customize the Trainer class for specific loss computation, prediction step, and model-saving methods.
  • Define a data collator function to process batches of data during training.
  • Set up TensorBoard for logging, instantiate your modified trainer, and begin training.

class ModifiedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        ).loss 
    def prediction_step(self, model: torch.nn.Module, inputs, prediction_loss_only: bool, ignore_keys = None):
        with torch.no_grad():
            res = model(
                input_ids=inputs["input_ids"].to(model.device),
                labels=inputs["labels"].to(model.device),
            ).loss
        return (res, None, None)
    def save_model(self, output_dir=None, _internal_call=False):
        from transformers.trainer import TRAINING_ARGS_NAME

        os.makedirs(output_dir, exist_ok=True)
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        saved_params = {
            k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        }
        torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))

def data_collator(features: list) -> dict:
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids)
    input_ids = []
    labels_list = []
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        labels = (
            [tokenizer.pad_token_id] * (seq_len - 1) + ids[(seq_len - 1) :] + [tokenizer.pad_token_id] * (longest - ids_l)
        )
        ids = ids + [tokenizer.pad_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
    } 
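
To see what the collator produces, you can run it on two dummy features of different lengths (a sketch; it only relies on the global tokenizer from Part 3 for the pad token id):

# The collator pads both input_ids and labels to the longest sequence in the batch
dummy_batch = [
    {"input_ids": [10, 11, 12, 13, 14], "seq_len": 3},
    {"input_ids": [20, 21, 22], "seq_len": 2},
]
out = data_collator(dummy_batch)
print(out["input_ids"].shape, out["labels"].shape)  # both torch.Size([2, 5])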


from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback 

Train (this took about 10 compute units):

writer = SummaryWriter()
trainer = ModifiedTrainer(
    model=model,
    args=training_args,             # Trainer args
    train_dataset=dataset["train"], # Training set
    eval_dataset=dataset["test"],   # Testing set
    data_collator=data_collator,    # Data Collator
    callbacks=[TensorBoardCallback(writer)],
)
trainer.train()
writer.close()

# save model
model.save_pretrained(training_args.output_dir)

4.3 Model Saving and Download:

After training, save and download your model. You can also check the model’s size.


!zip -r /content/saved_model.zip /content/{training_args.output_dir}

Download it to your local machine:

from google.colab import files
files.download('/content/saved_model.zip')

Alternatively, save it to Google Drive. First mount your drive:

from google.colab import drive
drive.mount('/content/drive')

Then copy the fine-tuned model to Google Drive:

!cp -r "/content/finetuned_model" "/content/drive/MyDrive"

def get_folder_size(folder_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / 1024 / 1024  # Size in MB

model_size = get_folder_size(training_args.output_dir)
print(f"Model size: {model_size} MB")


Now your model is trained and saved! You can download it and use it for generating financial insights or any other relevant tasks in the finance domain.
The usage of TensorBoard allows you to deeply understand and visualize the training dynamics and performance of your model in real-time.

Happy FinGPT Training!


Part 5: Inference and Benchmarks using FinGPT

Now that your model is trained, let’s understand how to use it to infer and run benchmarks.

  • Took about 10 compute units

pip install transformers==4.30.2 peft==0.4.0
pip install sentencepiece
pip install accelerate
pip install torch
pip install peft
pip install datasets
pip install bitsandbytes

5.1 Load the model

Clone the FinNLP repository, which provides the benchmark helpers:

git clone https://github.com/AI4Finance-Foundation/FinNLP.git

import sys
sys.path.append('/content/FinNLP/') 

from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

from peft import PeftModel
import torch

# Load benchmark datasets from FinNLP
from finnlp.benchmarks.fpb import test_fpb
from finnlp.benchmarks.fiqa import test_fiqa , add_instructions
from finnlp.benchmarks.tfns import test_tfns
from finnlp.benchmarks.nwgi import test_nwgi

pip install --upgrade peft 

Mount Google Drive so the fine-tuned adapter can be loaded from it:

from google.colab import drive
drive.mount('/content/drive')



Define the path you want to check:

import os

path_to_check = "/content/drive/My Drive/finetuned_model"

# Check if the specified path exists
if os.path.exists(path_to_check):
    print("Path exists.")
else:
    print("Path does not exist.")




Load the ChatGLM2-6B base model and attach the LoRA adapter trained in this session:

base_model = "THUDM/chatglm2-6b"
peft_model = training_args.output_dir

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()



Or load the fine-tuned adapter saved to Google Drive:

base_model = "THUDM/chatglm2-6b"
peft_model = "/content/drive/My Drive/finetuned_model"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()
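
Before running the full benchmarks, you can sanity-check the loaded model on a single prompt that mirrors the training format (a minimal sketch; the example tweet is made up):

# Build a prompt in the same Instruction/Input/Answer format used for training
prompt = (
    "Instruction: What is the sentiment of this tweet? "
    "Please choose an answer from {negative/neutral/positive}.\n"
    "Input: $AAPL shares surge after record quarterly earnings.\n"  # hypothetical tweet
    "Answer: "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16)
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))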


5.2 Run Benchmarks:

batch_size = 8 

# TFNS test set, len 2388 (compute units available before: 84.85)
res = test_tfns(model, tokenizer, batch_size=batch_size)
# Compute units available after: 83.75, i.e. about 1 compute unit for inference

# FPB, len 1212
res = test_fpb(model, tokenizer, batch_size = batch_size)

# FiQA, len 275
res = test_fiqa(model, tokenizer, prompt_fun = add_instructions, batch_size = batch_size)



# NWGI, len 4047
res = test_nwgi(model, tokenizer, batch_size = batch_size)
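
For reference, the accuracy and F1 numbers discussed below can be reproduced from label lists with scikit-learn, which is pre-installed on Colab (a sketch with toy labels; the FinNLP test_* helpers already report these metrics):

from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions, just to show how the metrics are defined
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]

print("Acc:        ", accuracy_score(y_true, y_pred))
print("F1 macro:   ", f1_score(y_true, y_pred, average="macro"))
print("F1 weighted:", f1_score(y_true, y_pred, average="weighted"))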

5.3 Compare it with FinGPT V3.1 results

https://github.com/AI4Finance-Foundation/FinGPT/tree/master/fingpt/FinGPT-v3


Comparison

TFNS:

FinGPT V3.1:

  • Acc: 0.876
  • F1 macro: 0.841
  • F1 weighted (follow BloombergGPT): 0.875

This notebook:

  • Acc: 0.856
  • F1 macro: 0.806
  • F1 weighted (follow BloombergGPT): 0.850

Since we trained on the TFNS dataset, strong results on its test set are expected.

FPB:

FinGPT V3.1:

  • Acc: 0.856
  • F1 macro: 0.841
  • F1 weighted: 0.855

This notebook:

  • Acc: 0.741
  • F1 macro: 0.655
  • F1 weighted: 0.694

Since the FPB dataset was not included in our training set, these zero-shot results are acceptable.

FiQA:

FinGPT V3.1:

  • Acc: 0.836
  • F1 macro: 0.746
  • F1 weighted: 0.850

This notebook:

  • Acc: 0.48
  • F1 macro: 0.5
  • F1 weighted: 0.49

Since the FiQA dataset wasn’t part of our training set, our model’s zero-shot performance is relatively poor compared to FinGPT V3.1.


NWGI:

FinGPT V3.1:

  • Acc: 0.642
  • F1 macro: 0.650
  • F1 weighted: 0.642

This notebook:

  • Acc: 0.521
  • F1 macro: 0.500
  • F1 weighted: 0.490

Since NWGI was not part of the training set either, these zero-shot results are reasonable.

Conclusion:

  • The training and testing of FinGPT in this exercise demanded a total of 20 compute units, broken down into 10 for training and another 10 for inference.
  • At 100 compute units for $10, that works out to about $2 to train and test FinGPT.
  • This cost-effective approach is primarily attributable to the utilization of the Low-Rank Adaptation (LoRA) method, which proves to be economical while ensuring efficient model training and inference.

This exercise provided insights into the performance of your trained FinGPT model across various benchmarks. While there are areas where it excels, certain benchmarks highlight opportunities for improvement and tuning. Exploring additional training data and refining the model further will likely lead to enhanced performance across different financial NLP tasks, making it a powerful tool for various applications in the finance sector.

Happy Experimenting with FinGPT!

