v_JULY_v

Source Code Walkthrough of Academic-Paper GPTs: From chatpaper and chatwithpaper to gpt_academic

Preface

In mid-July I said on Weibo that I planned to write "source code walkthroughs of 20 large LLM projects".

Here is where that plan currently stands:

Already done: LLaMA, Alpaca, ChatGLM-6B, deepspeedchat, transformer, langchain, and the langchain-chatglm knowledge base
Up next: chatpaper, deepspeed, Megatron-LM
After that: BERT, GPT, pytorch, chatdoctor, baichuan, BLOOM/BELLE, Chinese LLaMA, PEFT, BLIP2, llama.cpp

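In short, that is enough to keep me busy for the next six months. To speed things up, this article walks through two GPT projects for academic papers. (Every week my company receives several sign-ups for 1-on-1 paper publication coaching, from people applying to PhD programs, seeking promotion, or trying to graduate, covering Chinese journals, EI conferences, EI journals, SCI journals, and so on. We have therefore followed this area closely and are building a similar LLM product ourselves, so stay tuned.)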

The first is chatpaper: https://github.com/kaixindelele/ChatPaper
The second is gpt_academic: https://github.com/binary-husky/gpt_academic

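I have broken down and analyzed the structure of both projects and added a comment to nearly every line of the original code. If any line is unclear, leave a note in the comments section of this article and I will add an explanation promptly.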

Part 1 ChatPaper: Paper Chat, Summarization, and Translation

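ChatPaper positions itself as a tool for speeding up the whole research workflow: paper summarization, professional-grade translation, polishing, reviewing, and review responses. Because papers are mostly distributed as PDFs, chatting with, summarizing, or translating them inevitably involves PDF parsing.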

1.1 ChatPaper/ChatReviewerAndResponse

1.1.1 Parsing PDFs: ChatReviewerAndResponse/get_paper.py

// To be updated

1.1.2 Paper review: ChatReviewerAndResponse/chat_reviewer.py

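This script reviews papers with an OpenAI GPT model. It first defines a Reviewer class that handles the review work; then, under the if __name__ == '__main__': guard, it parses command-line arguments with argparse and calls chat_reviewer_main to start the review.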

Imports: similar to the first piece of code, with a few additional libraries such as jieba and tenacity

Named tuple definition: holds the parameters for a paper review

ReviewerParams = namedtuple(
    "ReviewerParams",
    [
        "paper_path",
        "file_format",
        "research_fields",
        "language"
    ],
)

Checking whether a text contains Chinese characters:

def contains_chinese(text):
    for ch in text:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False

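Inserting a sentence into a text
This function inserts a given sentence after every fixed number of words in each line of the text. If a line contains Chinese characters, it is segmented into words with jieba; otherwise it is split on whitespace: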

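def insert_sentence(text, sentence, interval):
    # Split the input text into lines
    lines = text.split('\n')
    # Collect the rewritten lines here
    new_lines = []

    # Process each line
    for line in lines:
        # Check whether the line contains Chinese characters
        if contains_chinese(line):
            # Chinese: segment the line into words with jieba
            words = list(jieba.cut(line))
            # Join with an empty string (Chinese words need no separator)
            separator = ''
        else:
            # No Chinese: split the line on whitespace
            words = line.split()
            # Join with a space (English and other non-Chinese text)
            separator = ' '

        # Collect the words of the rewritten line here
        new_words = []
        # Word counter, reset for every line
        count = 0

        # Walk over the words of the current line
        for word in words:
            # Keep the current word
            new_words.append(word)
            # Advance the counter
            count += 1

            # Check whether the insertion interval has been reached
            if count % interval == 0:
                # Append the sentence after every `interval` words
                new_words.append(sentence)

        # Join the words back into a line and store it
        new_lines.append(separator.join(new_words))

    # Join the lines back into a single text and return it
    return '\n'.join(new_lines)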
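To make the counting logic concrete, here is a minimal sketch of the whitespace (non-Chinese) branch only. The helper name `insert_every` is hypothetical and not part of ChatPaper; the jieba branch is left out so the snippet runs with the standard library alone.

```python
def insert_every(line, sentence, interval):
    """Insert `sentence` after every `interval` whitespace-separated words."""
    new_words = []
    for count, word in enumerate(line.split(), start=1):
        new_words.append(word)
        # Same test as in insert_sentence: fire on every multiple of `interval`
        if count % interval == 0:
            new_words.append(sentence)
    return ' '.join(new_words)


demo = insert_every("one two three four five", "[MARK]", 2)
print(demo)  # one two [MARK] three four [MARK] five
```

Because the original function resets its counter for every line, a line with fewer than `interval` words receives no marker at all.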

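Reviewer class: defines a Reviewer class with the following functionality:
$\rightarrow$ First-stage review: based on the paper's title and abstract, choose which sections to review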

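# Define the Reviewer class
class Reviewer:
    # Initializer: set up attributes
    def __init__(self, args=None):
        if args.language == 'en':
            self.language = 'English'
        elif args.language == 'zh':
            self.language = 'Chinese'
        else:
            # Any other value falls back to Chinese
            self.language = 'Chinese'
        # Create a ConfigParser object
        self.config = configparser.ConfigParser()
        # Read the configuration file
        self.config.read('apikey.ini')
        # Read the key-list string, drop the surrounding brackets and the quotes, then split on commas
        self.chat_api_list = self.config.get('OpenAI', 'OPENAI_API_KEYS')[1:-1].replace('\'', '').split(',')
        # Strip whitespace and discard entries too short to be a key
        self.chat_api_list = [api.strip() for api in self.chat_api_list if len(api) > 5]
        # Index of the API key to use next
        self.cur_api = 0
        self.file_format = args.file_format
        self.max_token_num = 4096
        # Tokenizer used to count tokens ("gpt2" encoding, so counts are only approximate for gpt-3.5-turbo)
        self.encoding = tiktoken.get_encoding("gpt2")

    def validateTitle(self, title):
        # Replace characters that are not allowed in file names
        rstr = r"[\/\\\:\*\?\"\<\>\|]" # '/ \ : * ? " < > |'
        new_title = re.sub(rstr, "_", title) # replace with underscores
        return new_title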
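The key-parsing line in `__init__` is easier to follow on a concrete value. The sketch below assumes `apikey.ini` stores the keys as a Python-list-like string, which is what the `[1:-1]` slice and the quote removal suggest; the file content and the placeholder keys are made up for illustration.

```python
import configparser

# Hypothetical file content; the keys are placeholders, not real keys.
SAMPLE_INI = """
[OpenAI]
OPENAI_API_KEYS = ['sk-aaaaaaaa', 'sk-bbbbbbbb', '']
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_INI)

raw = config.get('OpenAI', 'OPENAI_API_KEYS')
# Same steps as Reviewer.__init__: drop the brackets, drop the quotes, split on commas
keys = raw[1:-1].replace('\'', '').split(',')
# Discard entries too short to be a key (len is checked before stripping), then strip
keys = [k.strip() for k in keys if len(k) > 5]
print(keys)  # ['sk-aaaaaaaa', 'sk-bbbbbbbb']
```

The `len(api) > 5` filter is what removes the empty trailing entry, so a stray comma or an empty placeholder in the file does not end up in the key list.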

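Two methods are then implemented.
The first, stage_1, talks to the gpt-3.5-turbo model to get its choice of the (at most) two most important sections of the paper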

def stage_1(self, paper):
    # List for generated HTML content (not used further in this method)
    htmls = []
    
    # String that will hold the paper's title and abstract
    text = ''
    # Add the title
    text += 'Title: ' + paper.title + '. '
    # Add the abstract
    text += 'Abstract: ' + paper.section_texts['Abstract']
    
    # Count the tokens in the text
    text_token = len(self.encoding.encode(text))
    # Check whether the count exceeds half the token limit minus 800 (4096/2 - 800 = 1248)
    if text_token > self.max_token_num/2 - 800:
        # Scale the character length by the ratio of the token budget to the token count
        input_text_index = int(len(text)*((self.max_token_num/2)-800)/text_token)
        # Truncate the text so that it fits the budget
        text = text[:input_text_index]
    
    # Set the OpenAI API key
    openai.api_key = self.chat_api_list[self.cur_api]
    # Advance the index of the key to use next
    self.cur_api += 1
    # Reset to 0 once the index reaches len(list) - 1; as written, the last key is never selected when there are two or more keys
    self.cur_api = 0 if self.cur_api >= len(self.chat_api_list)-1 else self.cur_api
    
    # Build the chat messages for gpt-3.5-turbo
    # NOTE: `args` is not a parameter of this method; it only resolves if a module-level `args` exists
    messages = [
        {"role": "system",
         "content": f"You are a professional reviewer in the field of {args.research_fields}. "
                    f"I will give you a paper. You need to review this paper and discuss the novelty and originality of ideas, correctness, clarity, the significance of results, potential impact and quality of the presentation. "
                    f"Due to the length limitations, I am only allowed to provide you the abstract, introduction, conclusion and at most two sections of this paper."
                    f"Now I will give you the title and abstract and the headings of potential sections. "
                    f"You need to reply at most two headings. Then I will further provide you the full information, includes aforementioned sections and at most two sections you called for.\n\n"
                    f"Title: {paper.title}\n\n"
                    f"Abstract: {paper.section_texts['Abstract']}\n\n"
                    f"Potential Sections: {paper.section_names[2:-1]}\n\n"
                    f"Follow the following format to output your choice of sections:"
                    f"{{chosen section 1}}, {{chosen section 2}}\n\n"},
        {"role": "user", "content": text},
    ]
    
    # Call the OpenAI chat completion API
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    
    # String that will hold the model's reply
    result = ''
    # Concatenate the content of every returned choice
    for choice in response.choices:
        result += choice.message.content
    # Print the model's reply
    print(result)
    
    # Split the reply on commas and return the pieces (whitespace around each piece is kept)
    return result.split(',')
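The truncation step in stage_1 can be worked through with concrete numbers. The sketch below is a simplified stand-in: the helper name `truncate_to_budget` is hypothetical, and the token count is supplied by hand (one token per word) instead of coming from tiktoken, so the snippet runs without extra dependencies.

```python
def truncate_to_budget(text, token_count, budget):
    """Cut `text` proportionally so that it holds roughly `budget` tokens."""
    if token_count <= budget:
        return text
    # Characters to keep = total characters * (token budget / actual tokens)
    keep_chars = int(len(text) * budget / token_count)
    return text[:keep_chars]


max_token_num = 4096
budget = max_token_num / 2 - 800      # 1248.0 tokens for title + abstract

text = "word " * 3000                 # 15000 characters
token_count = 3000                    # assumption: one token per word
short = truncate_to_budget(text, token_count, budget)
print(budget, len(short))             # 1248.0 6240
```

The cut is made on characters, not tokens, so the result only approximates the budget; it relies on tokens being spread fairly evenly through the text.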

The second, chat_review, calls the gpt-3.5-turbo model to review the input paper text and produce review comments in a predefined format

def chat_review(self, text):
    # Set the OpenAI API key
    openai.api_key = self.chat_api_list[self.cur_api]
    
    # Advance the index of the key to use next
    self.cur_api += 1
    # Reset to 0 once the index reaches len(list) - 1; as written, the last key is never selected when there are two or more keys
    self.cur_api = 0 if self.cur_api >= len(self.chat_api_list)-1 else self.cur_api

    # Number of tokens reserved for the review prompt
    review_prompt_token = 1000
    
    # Count the tokens in the input text
    text_token = len(self.encoding.encode(text))
    # Character position at which to cut the input text (proportional to the remaining token budget)
    input_text_index = int(len(text)*(self.max_token_num-review_prompt_token)/text_token)
    # Truncate the text and prepend a prefix
    input_text = "This is the paper for your review:" + text[:input_text_index]
    
    # Read the review format from 'ReviewFormat.txt'
    with open('ReviewFormat.txt', 'r') as file:
        review_format = file.read()
    
    # Build the chat messages for gpt-3.5-turbo
    # NOTE: `args` is not a parameter of this method; it only resolves if a module-level `args` exists
    messages=[
        {"role": "system", 
         "content": "You are a professional reviewer in the field of "+args.research_fields+". Now I will give you a paper. You need to give a complete review opinion according to the following requirements and format:"+ review_format +" Please answer in {}.".format(self.language)},
        {"role": "user", "content": input_text},
    ]
    
    # Call the OpenAI chat completion API
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    
    # String that will hold the model's reply
    result = ''
    # Concatenate the content of every returned choice
    for choice in response.choices:
        result += choice.message.content

    # Insert a no-copying warning after every 15 words of the result
    result = insert_sentence(result, '**Generated by ChatGPT, no copying allowed!**', 15)
    # Append the ethics statement
    result += "\n\n⚠伦理声明/Ethics statement：\n--禁止直接复制生成的评论用于任何论文审稿工作！\n--Direct copying of generated comments for any paper review work is prohibited!"
    
    # Print separators and the result
    print("********"*10)
    print(result)
    print("********"*10)
    # Print token usage and response time
    print("prompt_token_used:", response.usage.prompt_tokens)
    print("completion_token_used:", response.usage.completion_tokens)
    print("total_token_used:", response.usage.total_tokens)
    print("response_time:", response.response_ms/1000.0, 's')
    
    # Return the generated review
    return result
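Both stage_1 and chat_review rotate through the API keys with the same index update. The sketch below replays that update for a list of three keys and compares it with a plain modulo rotation. The function names are hypothetical, and the modulo variant is an illustrative alternative, not something taken from ChatPaper.

```python
def rotate_as_written(cur, n_keys):
    """Index update copied from the excerpt: reset once cur reaches n_keys - 1."""
    cur += 1
    return 0 if cur >= n_keys - 1 else cur


def rotate_modulo(cur, n_keys):
    """Hypothetical alternative: plain round-robin over every key."""
    return (cur + 1) % n_keys


def indices_used(step, n_keys, calls):
    """Return the key index used on each of `calls` consecutive requests."""
    used, cur = [], 0
    for _ in range(calls):
        used.append(cur)          # key index used for this request
        cur = step(cur, n_keys)   # index for the next request
    return used


print(indices_used(rotate_as_written, 3, 6))  # [0, 1, 0, 1, 0, 1]
print(indices_used(rotate_modulo, 3, 6))      # [0, 1, 2, 0, 1, 2]
```

With the update as written, index 2 is never reached, so the last configured key stays idle; the modulo form spreads requests over all keys.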

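$\rightarrow$ Reviewing with ChatGPT, with a tenacity retry mechanism and further features (the retry decorator itself does not appear in the excerpts here). review_by_chatgpt calls the two methods shown above, stage_1 and chat_review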

def review_by_chatgpt(self, paper_list):
    # List that collects the review output for each paper
    htmls = []
    
    # Process every paper in paper_list
    for paper_index, paper in enumerate(paper_list):
        # First-stage review: let the model pick the key sections
        sections_of_interest = self.stage_1(paper)
        
        # String that will hold the main parts of the paper
        text = ''
        # Add the title
        text += 'Title:' + paper.title + '. '
        # Add the abstract
        text += 'Abstract: ' + paper.section_texts['Abstract']
        
        # Find and add the "Introduction" section
        intro_title = next((item for item in paper.section_names if 'ntroduction' in item.lower()), None)
        if intro_title is not None:
            text += 'Introduction: ' + paper.section_texts[intro_title]
        
        # Likewise, find and add the "Conclusion" section
        conclusion_title = next((item for item in paper.section_names if 'onclusion' in item), None)
        if conclusion_title is not None:
            text += 'Conclusion: ' + paper.section_texts[conclusion_title]
        
        # Add the other sections of interest
        # Note: stage_1 splits on ',' without stripping, so a heading with leading whitespace will not match here
        for heading in sections_of_interest:
            if heading in paper.section_names:
                text += heading + ': ' + paper.section_texts[heading]
        
        # Run the review and get the review text
        chat_review_text = self.chat_review(text=text)
        
        # Append the paper number and the review text to htmls
        htmls.append('## Paper:' + str(paper_index+1))
        htmls.append('\n\n\n')
        htmls.append(chat_review_text)
        
        # Current date and hour as a string, e.g. 'YYYY-MM-DD-HH'
        date_str = str(datetime.datetime.now())[:13].replace(' ', '-')
        try:
            # Create the output folder
            export_path = os.path.join('./', 'output_file')
            os.makedirs(export_path)
        except:
            # The folder already exists: nothing to do
            pass
        
        # Write mode is 'w' for the first paper and 'a' for the rest
        mode = 'w' if paper_index == 0 else 'a'
        
        # Build the file name from the date and the sanitized paper title
        file_name = os.path.join(export_path, date_str+'-'+self.validateTitle(paper.title)+"."+self.file_format)
        
        # Export the review content and save it
        self.export_to_markdown("\n".join(htmls), file_name=file_name, mode=mode)
        
        # Clear htmls for the next paper
        htmls = []
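One detail of the section lookup is worth illustrating: stage_1 returns `result.split(',')`, so a reply in the prompt's `heading 1, heading 2` format leaves a leading space on the second heading, and the exact-match test `heading in paper.section_names` then skips it. The sketch below uses made-up section names and a hypothetical helper, `match_headings`, that strips each piece before matching.

```python
def match_headings(reply, section_names):
    """Split a comma-separated reply and keep headings that exist in the paper."""
    candidates = [part.strip() for part in reply.split(',')]
    return [h for h in candidates if h in section_names]


# Made-up data for illustration
section_names = ['Abstract', 'Introduction', 'Method', 'Experiments', 'Conclusion']
reply = 'Method, Experiments'

unstripped = [h for h in reply.split(',') if h in section_names]
stripped = match_headings(reply, section_names)
print(unstripped)  # ['Method']
print(stripped)    # ['Method', 'Experiments']
```

Without the strip, only the first chosen section reaches the review prompt whenever the model follows the requested format literally.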

Main program:
Defines a chat_reviewer_main function that creates a Reviewer object and reviews the PDF files found at the given path

def chat_reviewer_main(args):            

    reviewer1 = Reviewer(args=args)
    # Decide whether the argument is a single file or a directory
    paper_list = []     
    if args.paper_path.endswith(".pdf"):
        paper_list.append(Paper(path=args.paper_path))            
    else:
        for root, dirs, files in os.walk(args.paper_path):
            print("root:", root, "dirs:", dirs, 'files:', files) # directory currently being walked
            for filename in files:
                # If the file is a PDF, parse it into a Paper object and add it to the list
                if filename.endswith(".pdf"):
                    paper_list.append(Paper(path=os.path.join(root, filename)))        
    print("------------------paper_num: {}------------------".format(len(paper_list)))        
    # Print the index and file name of each paper (splitting on '\\' assumes Windows-style paths)
    [print(paper_index, paper_name.path.split('\\')[-1]) for paper_index, paper_name in enumerate(paper_list)]
    reviewer1.review_by_chatgpt(paper_list=paper_list)
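The file-or-directory logic can be tried in isolation. The sketch below is a minimal stand-in that returns plain paths instead of `Paper` objects; `collect_pdfs` is a hypothetical helper, and the temporary directory with empty files exists only for the demonstration.

```python
import os
import tempfile


def collect_pdfs(paper_path):
    """Return the path itself if it is a PDF, else every PDF found under it."""
    if paper_path.endswith('.pdf'):
        return [paper_path]
    found = []
    for root, _dirs, files in os.walk(paper_path):
        for filename in files:
            if filename.endswith('.pdf'):
                found.append(os.path.join(root, filename))
    return found


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    # Create two PDFs (one nested) and one unrelated file
    for rel in ('a.pdf', 'notes.txt', os.path.join('sub', 'b.pdf')):
        open(os.path.join(tmp, rel), 'w').close()
    names = sorted(os.path.basename(p) for p in collect_pdfs(tmp))

print(names)  # ['a.pdf', 'b.pdf']
```

`os.path.basename` is used here to get the file name because it works with both `/` and `\` separators, whereas splitting on `'\\'` only shortens Windows-style paths.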

The entry point parses the command-line arguments and calls chat_reviewer_main
It also measures how long the review takes

if __name__ == '__main__':    
    parser = argparse.ArgumentParser()
    parser.add_argument("--paper_path", type=str, default='', help="path of papers")
    parser.add_argument("--file_format", type=str, default='txt', help="output file format")
    parser.add_argument("--research_fields", type=str, default='computer science, artificial intelligence and reinforcement learning', help="the research fields of paper")
    parser.add_argument("--language", type=str, default='en', help="output language, en or zh")
    
    reviewer_args = ReviewerParams(**vars(parser.parse_args()))
    start_time = time.time()
    chat_reviewer_main(args=reviewer_args)
    print("review time:", time.time() - start_time)
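The conversion from argparse's namespace to the `ReviewerParams` named tuple can be seen in a short sketch. It passes an explicit argument list to `parse_args` instead of reading the real command line; the argument values (`demo.pdf`, `zh`) and the shortened default for `--research_fields` are made up for the demonstration.

```python
import argparse
from collections import namedtuple

ReviewerParams = namedtuple(
    "ReviewerParams",
    ["paper_path", "file_format", "research_fields", "language"],
)

parser = argparse.ArgumentParser()
parser.add_argument("--paper_path", type=str, default='')
parser.add_argument("--file_format", type=str, default='txt')
parser.add_argument("--research_fields", type=str, default='computer science')
parser.add_argument("--language", type=str, default='en')

# Explicit argv list for demonstration; the script itself calls parse_args() with no arguments
namespace = parser.parse_args(["--paper_path", "demo.pdf", "--language", "zh"])
# vars() turns the namespace into a dict, which is unpacked into the named tuple's fields
params = ReviewerParams(**vars(namespace))
print(params)
```

This works only because every argparse destination name matches a field of the named tuple; an extra or missing argument would make the `ReviewerParams(**...)` call raise a `TypeError`.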