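Keywords: medical datasets, LLM fine-tuning
Open-source project: llm-medical-data, a collection of medical datasets for fine-tuning large language models
Project URL: https://github.com/donote/llm-medical-data
This project draws on several papers and open-source projects about medical LLMs and collects the fine-tuning sample data they use. The datasets are described below: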
Source: https://github.com/Toyhom/Chinese-medical-dialogue-data
File | Samples |
---|---|
IM_内科.csv | 307,596 |
andriatria_男科.csv | 113,877 |
obgyn_妇产科.csv | 229,706 |
oncology_肿瘤科.csv | 96,627 |
pediatric_儿科.csv | 117,099 |
surgical_外科.csv | 149,576 |
Sample file: chinese_medical_dialogue_data/sample_h100.csv
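The CSV files above hold question/answer dialogues, so one way to use them for fine-tuning is to map each row onto an instruction/input/output record. The following is a minimal sketch under assumptions: the column names `ask` and `answer`, the helper `rows_to_records`, and the demo rows are hypothetical and should be checked against the header row of the real files.

```python
import csv
import io

# Hypothetical column names; verify them against the real CSV header.
QUESTION_COL = "ask"
ANSWER_COL = "answer"


def rows_to_records(csv_file, question_col=QUESTION_COL, answer_col=ANSWER_COL):
    """Convert dialogue rows into instruction/input/output records."""
    records = []
    for row in csv.DictReader(csv_file):
        question = (row.get(question_col) or "").strip()
        answer = (row.get(answer_col) or "").strip()
        if not question or not answer:
            continue  # skip rows with a missing question or answer
        records.append({"instruction": question, "input": "", "output": answer})
    return records


# Small in-memory stand-in for a real file; the rows are made up for the demo.
demo = io.StringIO(
    "department,title,ask,answer\n"
    "Internal medicine,Headache,I keep getting headaches. What should I do?,"
    "Please measure your blood pressure and rest.\n"
    "Internal medicine,Cough,,\n"
)
records = rows_to_records(demo)
```

For a real file, pass an open file object instead of `demo`, opened with `newline=""` and whatever text encoding the file actually uses.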
Source: https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese
File | Samples |
---|---|
liver_cancer.json | 1,000 |
llama_data.json | 8,568 |
{
"instruction":" : 我听说免疫疗法是治疗晚期肝癌的一种很有效的方法,真的吗? : 是的,免疫疗法是一种新兴的治疗方法,它可以提高人体免疫系统对癌细胞的反应能力。对于晚期肝癌患者,免疫疗法已证明具有很高的治疗效果。 : 但是,这是否适用于所有的肝癌患者?",
"input":"",
"output":"不是所有的肝癌患者都适合进行免疫疗法。根据患者的临床表现和病情,医生会对患者进行评估,从而选择最适合的治疗方案。"
}
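Records in this instruction/input/output shape are usually flattened into a single prompt string before training. The sketch below shows one way this might be done. The template is a simplified illustration, not the one used by Huatuo-Llama-Med-Chinese, and `build_prompt` and the demo record are hypothetical.

```python
import json


def build_prompt(record):
    """Render one record as a prompt; the template is illustrative only."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    # Records with an empty "input" drop the Input section entirely.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# Made-up record in the same shape as the sample above.
raw = (
    '{"instruction": "What is a fever?", "input": "", '
    '"output": "A body temperature above the normal range."}'
)
record = json.loads(raw)
prompt = build_prompt(record)
target = record["output"]  # the text the model is trained to produce
```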
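Source: https://github.com/MediaBrain-SJTU/MedicalGPT-zh
The project does not ship fine-tuning data, but it provides the data generation scripts and seed data files. After adding an OpenAI API key, you can generate the fine-tuning samples yourself.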
pip install -r medicalgpt_data/requirements.txt
Add your OpenAI API keys in `book_based_question_generation.py` and `dialogue_generation.py`:
### add your openai key list, more key can generate data faster
api_keys=[]
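Then run the corresponding sample generation script from the `medicalgpt_data` directory. For the output format, see the example files `book_based_qa.json` and `dialogue_task.json`.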
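The comment in the scripts notes that more keys make generation faster. One plausible reason is that requests can be spread across keys so that no single key hits its rate limit. The sketch below illustrates that idea with a round-robin assignment. It is not taken from the MedicalGPT-zh scripts; `assign_keys`, the task names, and the placeholder keys are all hypothetical.

```python
from itertools import cycle


def assign_keys(tasks, api_keys):
    """Pair each generation task with a key, rotating through the key list."""
    if not api_keys:
        raise ValueError("api_keys is empty; add at least one key")
    key_cycle = cycle(api_keys)
    return [(task, next(key_cycle)) for task in tasks]


# Placeholder values only; real keys should never be committed to a repository.
demo_keys = ["key-a", "key-b"]
demo_tasks = ["seed-1", "seed-2", "seed-3"]
assignments = assign_keys(demo_tasks, demo_keys)
```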
Source: https://github.com/Kent0n-Li/ChatDoctor
File | Samples |
---|---|
GenMedGPT-5k.json | 2,354 |
iCliniq.json | 7,342 |
HealthCareMagic-100k.json | 24,665 |
`GenMedGPT-5k` contains doctor-patient dialogues generated by ChatGPT; `iCliniq` and `HealthCareMagic-100k` contain real conversations from online consultation sites.
{
"instruction": "If you are a doctor, please answer the medical questions based on the patient's description.",
"input": "Doctor, I have been experiencing sudden and frequent panic attacks. I don't know what to do.",
"output": "Well, based on what you're telling me, it sounds like you may be suffering from panic disorder. The best course of action is to start with psychotherapy and mental health counseling. Additionally, we should conduct an electrocardiogram to make sure that there are no physical issues causing your panic attacks. We will also need to perform a depression screen and a toxicology screen to rule out any other underlying causes. Finally, I would recommend a comprehensive psychological and psychiatric evaluation and therapy to help manage your symptoms."
}
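Before mixing files from different sources, it can help to check that every record has the three expected string fields and a non-empty answer, and to count what survives. The following is a minimal sketch of such a check; `split_valid` and the demo records are hypothetical, and the rule for what counts as valid is an assumption that can be tightened or relaxed.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def split_valid(records):
    """Separate well-formed records from those missing fields or an answer."""
    valid, rejected = [], []
    for rec in records:
        ok = (
            isinstance(rec, dict)
            and all(isinstance(rec.get(key), str) for key in REQUIRED_KEYS)
            and bool(rec["output"].strip())
        )
        (valid if ok else rejected).append(rec)
    return valid, rejected


# Made-up records: one complete, one without "output", one with a blank answer.
raw = """[
  {"instruction": "Answer the question.", "input": "What is a fever?", "output": "A raised body temperature."},
  {"instruction": "Answer the question.", "input": "What is a cough?"},
  {"instruction": "Answer the question.", "input": "What is a rash?", "output": "   "}
]"""
valid, rejected = split_valid(json.loads(raw))
print(f"valid: {len(valid)}, rejected: {len(rejected)}")
```

Running the same function over a real file loaded with `json.load` gives a count that can be compared with the sample counts listed in the table.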
Project URL: https://github.com/CogStack/opengpt
File | Type | Samples |
---|---|---|
prepared_generated_data_for_nhs_uk_qa.csv | QA | 24,665 |
prepared_generated_data_for_nhs_uk_conversations.csv | Conversation | 2,354 |
prepared_generated_data_for_medical_tasks.csv | Task | 4,688 |
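The samples were generated with ChatGPT from NHS website data; the prompts used to generate them are in the prompts dataset.
In the `text` column, `<|user|>` and `<|ai|>` correspond to `input` and `output` in the Stanford Alpaca sample format.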
text,raw_data_id
"<|user|> What is high blood pressure? <|eos|> <|ai|> High blood pressure is a condition where the force at which your heart pumps blood around your body is high. It is recorded with 2 numbers, the systolic pressure and the diastolic pressure, both measured in millimetres of mercury (mmHg).
References:
- https://www.nhs.uk/conditions/Blood-pressure-(high)/Pages/Introduction.aspx <|eos|> <|eod|>",0
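Since `<|user|>` and `<|ai|>` map onto `input` and `output`, the `text` column can be split back into question/answer pairs. The sketch below does this with a regular expression, assuming each turn is closed by `<|eos|>` as in the sample above. `text_to_pairs` is a hypothetical helper and the demo string is shortened from that sample.

```python
import re

# One turn: a role marker, the turn body, then the closing <|eos|> marker.
TURN_RE = re.compile(r"<\|(user|ai)\|>(.*?)<\|eos\|>", re.DOTALL)


def text_to_pairs(text):
    """Split an OpenGPT-style text field into input/output pairs."""
    pairs = []
    pending_question = None
    for role, body in TURN_RE.findall(text):
        body = body.strip()
        if role == "user":
            pending_question = body
        elif pending_question is not None:
            # An <|ai|> turn answers the most recent <|user|> turn.
            pairs.append({"input": pending_question, "output": body})
            pending_question = None
    return pairs


sample = (
    "<|user|> What is high blood pressure? <|eos|> "
    "<|ai|> High blood pressure is a condition where the force at which "
    "your heart pumps blood around your body is high. <|eos|> <|eod|>"
)
pairs = text_to_pairs(sample)
```

Multi-turn conversations yield one pair per user/AI exchange. Whether earlier turns should be carried along as context is a design choice this sketch leaves out.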
----------END----------
Also published on: AI加油站