On Choosing the Database and Retrieval Method
Evaluation Metrics and Datasets
Feasibility Analysis Based on an Example Project
Dataset
Image Feature Extraction
Text Processing
Image-Text Feature Fusion
Training Objectives
✅ Optimization Points
| Task | Evaluation Metric | Purpose |
| --- | --- | --- |
| Medical QA quality | BLEU, ROUGE, METEOR | Evaluates answer fluency and text quality |
| Medical soundness | Review scores from medical experts | Measures the medical correctness of answers |
| Department-recommendation accuracy | Precision, Recall, F1-score | Evaluates whether recommended departments are accurate |
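As a concrete illustration of the last row, here is a minimal sketch of computing Precision/Recall/F1 for department recommendation with scikit-learn; the gold and predicted department labels are made-up examples, not from this project's data.

from sklearn.metrics import precision_recall_fscore_support

# Illustrative gold vs. predicted department labels
y_true = ["心内科", "神经内科", "心内科", "呼吸科"]
y_pred = ["心内科", "神经内科", "呼吸科", "呼吸科"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")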
Strategies for Improving BLEU/ROUGE/METEOR
Medical Soundness
Department-Recommendation F1-score
Implement named entity recognition, knowledge graph construction, and text generation using only the Qwen model.
Step 1: Data preprocessing.
Step 2: Model initialization. Load the pretrained Qwen model with `AutoModelForCausalLM.from_pretrained("Qwen-model")`, and use `AutoTokenizer.from_pretrained("Qwen-model")` to handle text input.
Step 3: Fine-tuning setup. Use `torch.optim.AdamW` with suitable weight-decay and gradient-clipping settings (see the sketch below).
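A minimal sketch of this optimizer setup, assuming a Hugging Face causal-LM checkpoint; the name "Qwen-model" is the placeholder used in this document, and `train_step` is an illustrative helper, not part of the original plan.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen-model")  # placeholder checkpoint name
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

def train_step(batch):
    # One update with gradient clipping; batch comes from a DataLoader
    # yielding input_ids / attention_mask / labels tensors
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()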
Step 4: Training strategy. Use the `Trainer` class and `TrainingArguments` to manage training; evaluate on the validation set at regular intervals and keep the best model.
Step 5: Checkpoint saving and hyperparameter tuning. To summarize the tooling: `torch.optim.AdamW` performs parameter updates; `AutoModelForCausalLM` loads the Qwen model; `AutoTokenizer` handles text input; `Trainer` and `TrainingArguments` manage the training loop; `pandas` handles data cleaning and format conversion; the `datasets` library loads and manages the dataset.
Metrics:
Testing method: call `Trainer.evaluate()` on the validation set at regular intervals to confirm that training remains stable.
Predicted improvement:
How the optimization is implemented:
Dataset usage:
When to query the database:
Construction workflow:
Integration with the model:
Justification of the optimization results:
Comparison baseline:
The following is a detailed, executable plan that runs from requirements analysis through data preprocessing, model fine-tuning, and system integration to testing and evaluation. Main modules: front-end input interface, data preprocessing, Qwen model service, back-end data storage (Neo4j & a relational database), and a post-processing rule engine.
Operation 1.2.2: Knowledge graph schema
CREATE CONSTRAINT ON (d:Disease) ASSERT d.name IS UNIQUE;
CREATE (:Disease {name: '高血压'});
CREATE (:Symptom {name: '头痛'});
MATCH (d:Disease {name: '高血压'}), (s:Symptom {name: '头痛'})
CREATE (d)-[:HAS_SYMPTOM]->(s);
Operation 1.2.3: Define interface and security policies
import pandas as pd

# Read the raw CSV and apply simple de-identification
# (e.g., drop the name and national-ID columns)
data = pd.read_csv('raw_medical_reports.csv')
data.drop(columns=['name', 'id_number'], inplace=True)
data.to_csv('clean_medical_reports.csv', index=False)
Operation 2.2.1: Clean the data
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    text = text.strip()
    return text

data['report_text'] = data['report_text'].apply(clean_text)
data.to_csv('cleaned_medical_reports.csv', index=False)
Operation 2.2.2: Sentence splitting and word segmentation
import jieba

# Segment each report into words with jieba
data['tokens'] = data['report_text'].apply(lambda x: list(jieba.cut(x)))
data.to_csv('tokenized_medical_reports.csv', index=False)
Operation 2.3.1: Prepare the annotation tool
Operation 2.3.2: Run the annotation task
{
  "text": "患者患有高血压,伴随头痛症状。",
  "entities": [[4, 7, "Disease"], [10, 12, "Symptom"]],
  "relations": [[[4, 7], [10, 12], "HAS_SYMPTOM"]]
}
import json
import csv

# Convert span annotations into node/relation CSVs for Neo4j import
with open('annotations.json', 'r', encoding='utf-8') as f:
    annotations = json.load(f)

with open('nodes.csv', 'w', newline='', encoding='utf-8') as node_file, \
     open('relations.csv', 'w', newline='', encoding='utf-8') as rel_file:
    node_writer = csv.writer(node_file)
    rel_writer = csv.writer(rel_file)
    node_writer.writerow(['entity_id', 'label', 'name'])
    rel_writer.writerow(['start_entity', 'end_entity', 'relation_type'])

    entity_id = 1
    entity_map = {}  # entity text -> assigned id, to deduplicate nodes
    for ann in annotations:
        for entity in ann['entities']:
            entity_text = ann['text'][entity[0]:entity[1]]
            label = entity[2]
            if entity_text not in entity_map:
                entity_map[entity_text] = entity_id
                node_writer.writerow([entity_id, label, entity_text])
                entity_id += 1
        for rel in ann['relations']:
            e1_text = ann['text'][rel[0][0]:rel[0][1]]
            e2_text = ann['text'][rel[1][0]:rel[1][1]]
            rel_writer.writerow([entity_map[e1_text], entity_map[e2_text], rel[2]])
// Import nodes
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CREATE (n:Entity {id: toInteger(row.entity_id), label: row.label, name: row.name});
// Import relationships
LOAD CSV WITH HEADERS FROM 'file:///relations.csv' AS row
MATCH (a:Entity {id: toInteger(row.start_entity)}),
      (b:Entity {id: toInteger(row.end_entity)})
CREATE (a)-[:RELATION {type: row.relation_type}]->(b);
Operation 3.1.1: Load and split the dataset
from sklearn.model_selection import train_test_split
import pandas as pd

# 80/10/10 split: hold out 20%, then split it evenly into validation and test
df = pd.read_csv('cleaned_medical_reports.csv')
train, temp = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
train.to_csv('train.csv', index=False)
val.to_csv('val.csv', index=False)
test.to_csv('test.csv', index=False)
Operation 3.1.2: Text tokenization
from transformers import AutoTokenizer
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("qwen-base")

def tokenize_text(text):
    return tokenizer.encode_plus(text, max_length=512, truncation=True, padding="max_length")

# Example: tokenize the training set
train_data = pd.read_csv('train.csv')
train_data['tokens'] = train_data['report_text'].apply(lambda x: tokenize_text(x)['input_ids'])
train_data.to_csv('train_tokenized.csv', index=False)
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Qwen is a decoder-only (causal) LM, so load it with AutoModelForCausalLM,
# not AutoModelForSeq2SeqLM
model = AutoModelForCausalLM.from_pretrained("qwen-base")
import torch
from torch.utils.data import Dataset

class MedicalReportDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=512):
        self.data = pd.read_csv(csv_file)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # For a causal LM, concatenate the prompt and the target report into
        # one sequence; the labels are the input ids themselves
        enc = self.tokenizer(row['input_text'] + row['target_report'],
                             max_length=self.max_length, truncation=True,
                             padding="max_length", return_tensors="pt")
        input_ids = enc.input_ids.squeeze()
        attention_mask = enc.attention_mask.squeeze()
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding tokens in the loss
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }
training_args = TrainingArguments(
    output_dir='./qwen_medical_report',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    logging_steps=100,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=2,
    fp16=True
)
# train.csv / val.csv are assumed to contain 'input_text' and 'target_report' columns
train_dataset = MedicalReportDataset('train.csv', tokenizer)
eval_dataset = MedicalReportDataset('val.csv', tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
trainer.save_model('./qwen_medical_report_final')
sample_input = "患者主诉头痛,既往有高血压病史。"
inputs = tokenizer.encode(sample_input, return_tensors="pt")
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Operation 3.3.1: Entity extraction
def extract_entities(text):
    # Pseudocode: ner_model stands for whatever NER service is deployed;
    # call its API or internal prediction method
    entities = ner_model.predict(text)
    return entities
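If the NER model is served through Hugging Face transformers, `extract_entities` could be implemented roughly as follows; the checkpoint path "./ner_model" is a placeholder assumption, not a path from this project.

from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- pieces into whole entity spans
ner = pipeline("ner", model="./ner_model", aggregation_strategy="simple")

def extract_entities(text):
    # Each result carries entity_group (e.g., "Disease"), word, and start/end offsets
    return ner(text)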
Operation 3.3.2: Relation identification and knowledge-graph validation
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "password"))

def check_relation(entity1, entity2, relation_type):
    query = (
        "MATCH (a:Entity {name:$name1})-[r:RELATION {type:$rel}]->(b:Entity {name:$name2}) "
        "RETURN r"
    )
    with driver.session() as session:
        result = session.run(query, name1=entity1, name2=entity2, rel=relation_type)
        return result.single() is not None
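Example usage, with the entity names from the knowledge-graph example above:

# True if the edge 高血压 -[HAS_SYMPTOM]-> 头痛 exists in the graph
print(check_relation("高血压", "头痛", "HAS_SYMPTOM"))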
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/generate_report', methods=['POST'])
def generate_report():
    data = request.json
    input_text = data.get("input_text")
    inputs = tokenizer.encode(input_text, return_tensors="pt")
    outputs = model.generate(inputs, max_length=200)
    report = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return jsonify({"report": report})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
@Service
public class ReportService {
    @Autowired
    private RestTemplate restTemplate;

    public String generateReport(String inputText) {
        String url = "http://qwen-service:5000/generate_report";
        Map<String, String> request = new HashMap<>();
        request.put("input_text", inputText);
        // The original snippet is truncated here; a typical completion posts
        // the request and unwraps the "report" field from the JSON response
        ResponseEntity<Map> response = restTemplate.postForEntity(url, request, Map.class);
        return (String) response.getBody().get("report");
    }
}
Operation 4.2.1: Configure the relational database connection — set the connection details in `application.properties` (an illustrative example follows).
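A minimal sketch of `application.properties`, assuming the MySQL and Neo4j services defined in the Docker Compose file below; all values are illustrative and should come from your environment.

spring.datasource.url=jdbc:mysql://mysql:3306/medical_db
spring.datasource.username=root
spring.datasource.password=rootpassword
neo4j.uri=bolt://neo4j:7687
neo4j.username=neo4j
neo4j.password=neo4jpassword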
Operation 4.2.2: Integrate the Neo4j interface
@Service
public class Neo4jService {
    private final Driver driver;

    @Autowired
    public Neo4jService(@Value("${neo4j.uri}") String uri,
                        @Value("${neo4j.username}") String username,
                        @Value("${neo4j.password}") String password) {
        this.driver = GraphDatabase.driver(uri, AuthTokens.basic(username, password));
    }

    // parameters(...) is org.neo4j.driver.Values.parameters (static import)
    public boolean checkRelation(String entity1, String entity2, String relation) {
        try (Session session = driver.session()) {
            String cypher = "MATCH (a:Entity {name: $entity1})-[r:RELATION {type: $relation}]->(b:Entity {name: $entity2}) RETURN r";
            return session.run(cypher, parameters("entity1", entity1, "entity2", entity2, "relation", relation))
                          .hasNext();
        }
    }
}
Operation 4.3.1: Docker Compose orchestration
version: '3.8'
services:
  qwen-service:
    build: ./qwen_service
    ports:
      - "5000:5000"
  backend-service:
    build: ./backend_service
    ports:
      - "8080:8080"
    depends_on:
      - mysql
      - neo4j
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: rootpassword
      MYSQL_DATABASE: medical_db
    ports:
      - "3306:3306"
  neo4j:
    image: neo4j:4.4
    environment:
      NEO4J_AUTH: neo4j/neo4jpassword
    ports:
      - "7687:7687"
      - "7474:7474"
Operation 4.3.2: CI/CD deployment
Operation 5.1.1: Back-end service unit tests
@SpringBootTest
public class ReportServiceTest {
    @Autowired
    private ReportService reportService;

    @Test
    public void testGenerateReport() {
        String input = "患者主诉胸痛...";
        String report = reportService.generateReport(input);
        assertNotNull(report);
        assertTrue(report.contains("胸痛"));
    }
}
Operation 5.1.2: Python module unit tests
from nltk.translate.bleu_score import sentence_bleu

reference = [['高血压', '患者', '症状']]
candidate = ['患者', '高血压', '症状']
# For such short sequences, restrict BLEU to unigrams and bigrams; the
# default 4-gram weights would yield a near-zero score here
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print("BLEU score:", score)
With explicit operation-by-operation instructions and code samples, this plan offers not only theoretical guidance but also maps directly onto concrete development tasks, making it executable in real system development.
Evaluation Metric Analysis
What is ROUGE?
How is it measured in this task?
Why is ROUGE suitable for this task, and what range is feasible?
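Standard ROUGE tooling assumes English tokenization, so for Chinese it is common to compute the overlap directly over jieba tokens. A minimal ROUGE-1 F1 sketch follows; the token lists are illustrative.

from collections import Counter

def rouge1_f(reference_tokens, candidate_tokens):
    # Unigram overlap between reference and candidate token lists
    ref, cand = Counter(reference_tokens), Counter(candidate_tokens)
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f(["患者", "患有", "高血压"], ["患者", "高血压", "症状"]))  # ≈ 0.67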
We first explain what a sequence labeling task is, then compare CRF with token-level classification.
Example sentence: 患者出现 高血压 和 头痛 。 (The patient presents with hypertension and headache.)
What is a CRF? A Conditional Random Field layer can be added on top of an encoder, either with the `torchcrf` library or by combining `transformers` with a CRF layer (a sketch follows below).
How is it measured in this task?
Why is a CRF suitable, and what range is feasible?
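A minimal sketch of the CRF layer with the pytorch-crf package (`pip install pytorch-crf`); the random tensors stand in for encoder outputs and gold tags and are assumptions for illustration only.

import torch
from torchcrf import CRF

num_tags = 5  # e.g., O, B-Disease, I-Disease, B-Symptom, I-Symptom
emissions = torch.randn(2, 10, num_tags)    # (batch, seq_len, num_tags) from an encoder
tags = torch.zeros(2, 10, dtype=torch.long) # gold tag ids

crf = CRF(num_tags, batch_first=True)
loss = -crf(emissions, tags)        # negative log-likelihood for training
best_paths = crf.decode(emissions)  # Viterbi decoding at inference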
What is token-level classification? Given the tokenized sentence "患者 出现 高 血 压 和 头 痛 。", each token is classified independently as `O`, `B-Disease`, or `I-Disease` (an illustrative tagging follows below).
How is it measured in this task?
Why is token-level classification sometimes not good enough?
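For the example sentence above, a character-level BIO tagging would look like this; the tags are written out by hand for illustration.

text = "患者出现高血压和头痛。"
tags = ["O", "O", "O", "O",
        "B-Disease", "I-Disease", "I-Disease",
        "O", "B-Symptom", "I-Symptom", "O"]
assert len(list(text)) == len(tags)  # one tag per character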
| Metric | Purpose | Applicable Task | Typical Range |
| --- | --- | --- | --- |
| ROUGE | Evaluates text-generation quality | Medical QA, medical-record completion | ROUGE-1: 50-80, ROUGE-L: 40-60 |
| F1-score | Evaluates NER and classification tasks | Medical entity recognition in Task 2 | 85%-95% |
| BLEU | Evaluates text translation | Medical QA in Task 1 | 30-60 |