Latest ChatGPT GPT-4 NLU in Practice: ChatPDF-Style Document Q&A (with ipynb, Python source code, and video). Part 5 of the open-source DataWhale beginner's guide to ChatGPT, from 0 to 1

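Contents

  • Preface
  • Latest ChatGPT GPT-4 Natural Language Understanding (NLU) in Practice: ChatPDF-Style Document Q&A
    • Introduction
    • ChatGPT API
    • Storing Embeddings in the Qdrant Database
    • Core Code
    • Testing
  • Other NLU Applications and Practice
    • Related Literature
  • References
  • Other Downloads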

Preface

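Recently, researchers have begun exploring ChatGPT for document question answering (QA). Compared with traditional document QA systems, this approach can draw on ChatGPT's strong generative ability to produce more accurate and more detailed answers. ChatGPT reads the relevant documents and extracts the information the question needs in order to find the answer. Beyond improving QA accuracy, this can also make the system more scalable and adaptable.

The most popular example is ChatPDF, an application built by overseas developer Mathis Lichtenberger. After you upload a PDF file to ChatPDF, you can converse with the PDF across languages and get answers to your questions based on its content. In other words, ChatPDF lets you chat with a PDF. Cross-language means that if the PDF is in English you can ask in Chinese, and vice versa. The core method is built on OpenAI's Chat API: create a semantic index over every paragraph of the PDF, then use the most relevant paragraphs to prompt the Chat API.

Here is how the ChatPDF homepage describes it: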

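  • ChatPDF helps us learn better. Textbooks, handouts, and presentations all become easy to understand. There is no need to spend hours paging through research papers and academic articles, so academic growth is supported more effectively.

  • With ChatPDF we can easily unlock endless knowledge. From historical documents to poetry and literature, in any language, ChatPDF can understand the content and reply in the language you prefer. Satisfy your curiosity and broaden your horizons: this tool can answer any question drawn from a PDF file.

This article walks through, from 0 to 1, the underlying techniques and the application of document QA as an NLU use case.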

Latest ChatGPT GPT-4 Natural Language Understanding (NLU) in Practice: ChatPDF-Style Document Q&A

Introduction

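  Document QA is similar to ordinary QA, but a little more complex. It first retrieves a relevant document using the QA approach, then has the model find the answer to the question inside that document. The usual pipeline is therefore retrieval first, then reading comprehension. Reading comprehension resembles entity extraction, except that what it predicts is not a specific label but the index of the answer, that is, its start and end positions.

  Here is an example. Suppose the question is: "In which year were the Beijing Olympics held?"

  The retrieved document might be a news item about the Beijing Olympics, something like this:

The 29th Summer Olympic Games (Beijing 2008; Games of the XXIX Olympiad), also known as the 2008 Beijing Olympics, opened at 8 p.m. sharp on 8 August 2008 in Beijing, the capital of China, and closed on 24 August.

  The label is the index of the answer "2008".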
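
  To make the start/end indexing concrete, here is a minimal sketch that locates an answer span in a shortened English version of the passage using plain string search. This is only an illustration of what the label looks like: a trained reading-comprehension model would predict the two positions rather than search for a known answer, and the variable names are made up for this example.

```python
# Illustrative only: a real extractive model predicts start/end positions;
# here we look them up from a known answer to show what the label is.
passage = (
    "The 29th Summer Olympic Games, also known as the 2008 Beijing Olympics, "
    "opened in Beijing on 8 August 2008 and closed on 24 August."
)
answer = "2008"

start = passage.find(answer)   # index of the first character of the answer
end = start + len(answer)      # exclusive end index

print(start, end, passage[start:end])
```

  Slicing the passage with the two indices recovers the answer text, which is exactly what an extractive QA label encodes.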

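  Of course, a single document can support more than one question. For the document above you could also ask "When did the Beijing Olympics open?", "When did the Beijing Olympics close?", or "Which edition of the Olympic Games was the Beijing Olympics?"

  With earlier NLP methods there are many possible designs here, each with some complexity, although overall it is still a classification task. Now that we have LLMs, the problem becomes simpler. There are still two steps:

  • Retrieval: as in QA, except that what we retrieve is a document. This step simply picks the documents whose embeddings are most similar to the query's.
  • Answering: submit the retrieved document and the question as a prompt to the Completion/ChatCompletion API and get the answer directly.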
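
  The retrieval step can be sketched without any external service. The snippet below is a minimal illustration, assuming toy 3-dimensional vectors in place of real embeddings; the function names `cosine_similarity` and `retrieve` and all the numbers are invented for the example and are not taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, top_k=1):
    """Return the indices of the top_k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy vectors standing in for real embeddings (values made up for illustration).
docs = ["doc about high jump", "doc about swimming", "doc about cycling"]
doc_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
query_vec = [0.8, 0.2, 0.1]

best = retrieve(query_vec, doc_vecs, top_k=1)
print(docs[best[0]])
```

  A vector database does the same ranking, but with an index that avoids comparing the query against every stored vector.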

ChatGPT API

  We give one example for each of the two APIs. First, the Completion API:

import openai
OPENAI_API_KEY = "your-api-key"  # fill in your own API key

openai.api_key = OPENAI_API_KEY
def complete(prompt):
    response = openai.Completion.create(
        prompt=prompt,
        temperature=0,
        max_tokens=300,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        model="text-davinci-003"
    )
    ans = response["choices"][0]["text"].strip(" \n")
    return ans
# From the official documentation
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""
complete(prompt)
'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'

  The Context above is the document we retrieved.

  Now the ChatCompletion API:

prompt = """请根据以下Context回答问题,直接输出答案即可,不用附带任何上下文。

Context:
诺曼人(诺曼人:Nourmands;法语:Normands;拉丁语:Normanni)是在10世纪和11世纪将名字命名为法国诺曼底的人。他们是北欧人的后裔(丹麦人,挪威人和挪威人)的海盗和海盗,他们在首相罗洛(Rollo)的领导下向西弗朗西亚国王查理三世宣誓效忠。经过几代人的同化,并与法兰克和罗马高卢人本地居民融合,他们的后代将逐渐与以西卡罗来纳州为基础的加洛林人文化融合。诺曼人独特的文化和种族身份最初出现于10世纪上半叶,并在随后的几个世纪中持续发展。

问题:
诺曼底在哪个国家/地区?
"""
def ask(content):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", 
        messages=[{"role": "user", "content": content}]
    )

    ans = response.get("choices")[0].get("message").get("content")
    return ans
ans = ask(prompt)
print(ans)
法国。

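  That looks fine. Next we string the whole pipeline together, starting with the Completion API; swapping in ChatCompletion later is easy, since the input is a prompt in both cases.

  First, load the dataset, taken from: openai-cookbook/olympics-1-collect-data.ipynb at 1f6c2304b401e931928e74e978d9a0b8a40d1cf7 · openai/openai-cookbook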

import pandas as pd
df = pd.read_csv("./dataset/olympics_sections_text.csv")
df.shape
(3964, 4)
df.head()
   title                 heading                                         content                                             tokens
0  2020 Summer Olympics  Summary                                         The 2020 Summer Olympics (Japanese: 2020年夏季オリン...  726
1  2020 Summer Olympics  Host city selection                             The International Olympic Committee (IOC) vote...  126
2  2020 Summer Olympics  Impact of the COVID-19 pandemic                 In January 2020, concerns were raised about th...  374
3  2020 Summer Olympics  Qualifying event cancellation and postponement  Concerns about the pandemic began to affect qu...  298
4  2020 Summer Olympics  Effect on doping tests                          Mandatory doping tests were being severely res...  163

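Storing Embeddings in the Qdrant Database

  This time we do not use Redis and switch to another tool: Qdrant - Vector Search Engine. Compared with single-threaded Redis, Qdrant is easier to scale. Keep in mind, though, that tools should be chosen to fit the actual situation. Over-optimization is often the root of the problem, and whatever fits is best. What we really need to do is abstract the business logic so that it depends on no particular tool as far as possible, and changing tools only means swapping an adapter.

  We again use Docker, and starting it is simple: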
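
  One way to read the adapter idea is sketched below. It is a minimal illustration under assumptions, not the author's code: the `VectorStore` protocol, the `InMemoryStore` class, and `find_context` are hypothetical names introduced here, and the in-memory class simply stands in for a real backend.

```python
import math
from typing import Any, Dict, List, Protocol, Sequence

Vector = Sequence[float]

class VectorStore(Protocol):
    """The only surface the business logic is allowed to depend on (hypothetical)."""
    def add(self, vectors: List[Vector], payloads: List[Dict[str, Any]]) -> None: ...
    def search(self, query_vector: Vector, limit: int) -> List[Dict[str, Any]]: ...

def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class InMemoryStore:
    """Hypothetical adapter that keeps everything in a Python list."""
    def __init__(self) -> None:
        self._items: List[tuple] = []

    def add(self, vectors, payloads) -> None:
        self._items.extend(zip(vectors, payloads))

    def search(self, query_vector, limit):
        ranked = sorted(
            self._items,
            key=lambda item: _cosine(query_vector, item[0]),
            reverse=True,
        )
        return [payload for _, payload in ranked[:limit]]

def find_context(store: VectorStore, query_vector: Vector, limit: int = 1):
    """Business logic: it only knows about the VectorStore interface."""
    return store.search(query_vector, limit)

store = InMemoryStore()
store.add([[1.0, 0.0], [0.0, 1.0]], [{"content": "high jump"}, {"content": "swimming"}])
top = find_context(store, [0.9, 0.1])
print(top)
```

  A class wrapping Qdrant or Redis with the same two methods could then be passed to `find_context` without changing it.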

docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

  The client also needs to be installed:

pip install qdrant-client

  First, though, we generate the embeddings. This step can use the get_embedding helper:

from openai.embeddings_utils import get_embedding, cosine_similarity

  Alternatively, you can call the native Embedding API directly, which also accepts multiple inputs in one request:

def get_embedding_direct(inputs):
    embed_model = "text-embedding-ada-002"

    res = openai.Embedding.create(
        input=inputs, engine=embed_model
    )
    return res
texts = [v.content for v in df.itertuples()]
len(texts)
3964
import pnlp
emds = []
for idx, batch in enumerate(pnlp.generate_batches_by_size(texts, 200)):
    response = get_embedding_direct(batch)
    for v in response.data:
        emds.append(v.embedding)
    print(f"batch: {idx} done")
batch: 0 done
batch: 1 done
batch: 2 done
batch: 3 done
batch: 4 done
batch: 5 done
batch: 6 done
batch: 7 done
batch: 8 done
batch: 9 done
batch: 10 done
batch: 11 done
batch: 12 done
batch: 13 done
batch: 14 done
batch: 15 done
batch: 16 done
batch: 17 done
batch: 18 done
batch: 19 done
len(emds), len(emds[0])
(3964, 1536)
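
  The batching in the loop above relies on `pnlp.generate_batches_by_size`. As a rough sketch of what fixed-size batching does, here is a plain-Python stand-in; the `batches` helper is hypothetical and is not pnlp's implementation, and the dummy strings only reproduce the row count of the dataset.

```python
from typing import Iterator, List, Sequence

def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

# Dummy stand-ins for the 3964 section texts.
dummy_texts = [f"text {i}" for i in range(3964)]
sizes = [len(b) for b in batches(dummy_texts, 200)]

print(len(sizes), sizes[-1])
```

  With 3964 items and a batch size of 200 this gives 20 batches, the last one holding 164 items, which lines up with the batch 0 to batch 19 log above.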

  Next, create the index:

from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)

  Note that qdrant also supports in-memory and local-file storage, so you can simply write:

# client = QdrantClient(":memory:")
# or
# client = QdrantClient(path="path/to/db")

  We will stick with the server approach:

from qdrant_client.models import Distance, VectorParams

client.recreate_collection(
    collection_name="doc_qa",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
True
# client.delete_collection("doc_qa")

  Then load the vectors into the collection:

payload=[
    {"content": v.content, "heading": v.heading, "title": v.title, "tokens": v.tokens} for v in df.itertuples()
]
client.upload_collection(
    collection_name="doc_qa",
    vectors=emds,
    payload=payload
)

  Next, run a query:

query = "Who won the 2020 Summer Olympics men's high jump?"
query_vector = get_embedding(query, engine="text-embedding-ada-002")
hits = client.search(
    collection_name="doc_qa",
    query_vector=query_vector,
    limit=5
)
hits
[ScoredPoint(id=236, version=3, score=0.90316474, payload={'content': 'The men\'s high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a \'jump off\'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men\'s high jump for Italy and Belarus, the first gold in the men\'s high jump for Italy and Qatar, and the third consecutive medal in the men\'s high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).', 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's high jump", 'tokens': 275}, vector=None),
 ScoredPoint(id=313, version=4, score=0.88258004, payload={'content': "The men's long jump event at the 2020 Summer Olympics took place between 31 July and 2 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (1 universality place was used in 2016). 31 athletes from 20 nations competed. Miltiadis Tentoglou won the gold medal, Greece's first medal in the men's long jump. Cuban athletes Juan Miguel Echevarría and Maykel Massó earned silver and bronze, respectively, the nation's first medals in the event since 2008.", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's long jump", 'tokens': 136}, vector=None),
 ScoredPoint(id=284, version=4, score=0.8821836, payload={'content': "The men's pole vault event at the 2020 Summer Olympics took place between 31 July and 3 August 2021 at the Japan National Stadium. 29 athletes from 18 nations competed. Armand Duplantis of Sweden won gold, with Christopher Nilsen of the United States earning silver and Thiago Braz of Brazil taking bronze. It was Sweden's first victory in the event and first medal of any color in the men's pole vault since 1952. Braz, who had won in 2016, became the ninth man to earn multiple medals in the pole vault.", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's pole vault", 'tokens': 112}, vector=None),
 ScoredPoint(id=222, version=3, score=0.876395, payload={'content': "The men's triple jump event at the 2020 Summer Olympics took place between 3 and 5 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (2 universality places were used in 2016). 32 athletes from 19 nations competed. Pedro Pichardo of Portugal won the gold medal, the nation's second victory in the men's triple jump (after Nelson Évora in 2008). China's Zhu Yaming took silver, while Hugues Fabrice Zango earned Burkina Faso's first Olympic medal in any event.", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's triple jump", 'tokens': 139}, vector=None),
 ScoredPoint(id=205, version=3, score=0.86075026, payload={'content': "The men's 110 metres hurdles event at the 2020 Summer Olympics took place between 3 and 5 August 2021 at the Olympic Stadium. Approximately forty athletes were expected to compete; the exact number was dependent on how many nations used universality places to enter athletes in addition to the 40 qualifying through time or ranking (1 universality place was used in 2016). 40 athletes from 29 nations competed. Hansle Parchment of Jamaica won the gold medal, the nation's second consecutive victory in the event. His countryman Ronald Levy took bronze. American Grant Holloway earned silver, placing the United States back on the podium in the event after the nation missed the medals for the first time in Rio 2016 (excluding the boycotted 1980 Games).", 'heading': 'Summary', 'title': "Athletics at the 2020 Summer Olympics – Men's 110 metres hurdles", 'tokens': 149}, vector=None)]

Core Code

  Next, we wrap this retrieval step inside the prompt construction:

MAX_SECTION_LEN = 500
SEPARATOR = "\n* "
separator_len = 3
def construct_prompt(question: str):
    query_vector = get_embedding(question, engine="text-embedding-ada-002")
    hits = client.search(
        collection_name="doc_qa",
        query_vector=query_vector,
        limit=5
    )
    
    choose = []
    length = 0
    indexes = []
     
    for hit in hits:
        doc = hit.payload
        length += doc["tokens"] + separator_len
        if length > MAX_SECTION_LEN:
            break
            
        choose.append(SEPARATOR + doc["content"].replace("\n", " "))
        indexes.append(doc["title"] + doc["heading"])
            
    # Useful diagnostic information
    print(f"Selected {len(choose)} document sections:")
    print("\n".join(indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(choose) + "\n\n Q: " + question + "\n A:"
prompt = construct_prompt("Who won the 2020 Summer Olympics men's high jump?")

print("===\n", prompt)
Selected 2 document sections:
Athletics at the 2020 Summer Olympics – Men's high jumpSummary
Athletics at the 2020 Summer Olympics – Men's long jumpSummary
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).
* The men's long jump event at the 2020 Summer Olympics took place between 31 July and 2 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (1 universality place was used in 2016). 31 athletes from 20 nations competed. Miltiadis Tentoglou won the gold medal, Greece's first medal in the men's long jump. Cuban athletes Juan Miguel Echevarría and Maykel Massó earned silver and bronze, respectively, the nation's first medals in the event since 2008.

 Q: Who won the 2020 Summer Olympics men's high jump?
 A:
def complete(prompt):
    response = openai.Completion.create(
        prompt=prompt,
        temperature=0,
        max_tokens=300,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        model="text-davinci-003"
    )
    ans = response["choices"][0]["text"].strip(" \n")
    return ans
complete(prompt)
'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal.'
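
  The section-selection loop inside construct_prompt can be isolated to see why only two of the five hits made it into the prompt. The sketch below reuses the same budget logic on the token counts shown in the search output earlier; the `select_sections` function is a hypothetical extraction written for illustration.

```python
MAX_SECTION_LEN = 500   # token budget for the context, as in construct_prompt
SEPARATOR_LEN = 3       # tokens charged per separator, as in construct_prompt

def select_sections(token_counts, max_len=MAX_SECTION_LEN, sep_len=SEPARATOR_LEN):
    """Return indices of the hits that fit the budget, stopping at the first overflow."""
    chosen = []
    length = 0
    for i, tokens in enumerate(token_counts):
        length += tokens + sep_len
        if length > max_len:
            break
        chosen.append(i)
    return chosen

# Token counts of the five hits returned for the high jump query.
hit_tokens = [275, 136, 112, 139, 149]
print(select_sections(hit_tokens))
```

  Two consequences of this design are worth noting: the loop stops at the first section that overflows, even if a later, shorter one would still fit, and if the top hit alone exceeds the budget, the context is empty.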

  Now try the ChatCompletion (ChatGPT) API:

def ask(content):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", 
        messages=[{"role": "user", "content": content}]
    )

    ans = response.get("choices")[0].get("message").get("content")
    return ans

Testing

ans = ask(prompt)
ans
"Gianmarco Tamberi and Mutaz Essa Barshim shared the gold medal in the men's high jump event at the 2020 Summer Olympics."

  A few more examples:

query = "Why was the 2020 Summer Olympics originally postponed?"
prompt = construct_prompt(query)
answer = complete(prompt)

print(f"\nQ: {query}\nA: {answer}")
Selected 1 document sections:
Concerns and controversies at the 2020 Summer OlympicsSummary

Q: Why was the 2020 Summer Olympics originally postponed?
A: The 2020 Summer Olympics were originally postponed due to the COVID-19 pandemic.
query = "In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?"
prompt = construct_prompt(query)
answer = complete(prompt)

print(f"\nQ: {query}\nA: {answer}")
Selected 2 document sections:
2020 Summer Olympics medal tableSummary
List of 2020 Summer Olympics medal winnersSummary

Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?
A: The United States won the most medals overall, with 113, and the most gold medals, with 39.
# ChatGPT
answer = ask(prompt)

print(f"\nQ: {query}\nA: {answer}")
Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?
A: The country that won the most medals at the 2020 Summer Olympics was the United States, with 113 medals, including 39 gold medals.
query = "What is the tallest mountain in the world?"
prompt = construct_prompt(query)
answer = complete(prompt)

print(f"\nQ: {query}\nA: {answer}")
Selected 3 document sections:
Sport climbing at the 2020 Summer Olympics – Men's combinedRoute-setting
Ski mountaineering at the 2020 Winter Youth Olympics – Boys' individualSummary
Ski mountaineering at the 2020 Winter Youth Olympics – Girls' individualSummary

Q: What is the tallest mountain in the world?
A: I don't know.
# ChatGPT
answer = ask(prompt)

print(f"\nQ: {query}\nA: {answer}")
Q: What is the tallest mountain in the world?
A: I don't know.
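
  Because both model calls take a single prompt string, switching between them can be reduced to passing a different function. The sketch below shows that shape with stand-in functions so it runs offline; `answer_question`, `fake_build_prompt`, and `fake_llm` are hypothetical names, and in the article's setting the real arguments would be construct_prompt and either complete or ask.

```python
from typing import Callable

def answer_question(
    question: str,
    build_prompt: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    """Build the prompt for a question, then hand it to whichever model function is given."""
    prompt = build_prompt(question)
    return llm(prompt)

# Offline stand-ins (assumptions for illustration, no network calls).
def fake_build_prompt(question: str) -> str:
    return f"Context:\n(no relevant context)\n\n Q: {question}\n A:"

def fake_llm(prompt: str) -> str:
    return "I don't know."

result = answer_question("What is the tallest mountain in the world?", fake_build_prompt, fake_llm)
print(result)
```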

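Other NLU Applications and Practice

Latest ChatGPT GPT-4 NLU in Practice: Entity Classification and Recognition, and Model Fine-Tuning

Latest ChatGPT GPT-4 NLU in Practice: Intelligent Multi-Turn Dialogue Bot

Related Literature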

  • 【1】GPT3 和它的 In-Context Learning | Yam
  • 【2】ChatGPT Prompt 工程:设计、实践与思考 | Yam
  • 【3】一些 ChatGPT Prompt 示例 | Yam
  • 【4】dair-ai/Prompt-Engineering-Guide: Guides, papers, lecture, notebooks and resources for prompt engineering
  • 【5】Best practices for prompt engineering with OpenAI API | OpenAI Help Center
  • 【6】ChatGPT Prompts and Products | PromptVine
  • 【7】Prompt Vibes
  • 【8】ShareGPT: Share your wildest ChatGPT conversations with one click.
  • 【9】Awesome ChatGPT Prompts | This repo includes ChatGPT prompt curation to use ChatGPT better.
  • 【10】Learn Prompting | Learn Prompting
  • 【11】Fine-tuning - OpenAI API
  • 【12】[PUBLIC] Best practices for fine-tuning GPT-3 to classify text - Google Docs
  • 【13】hscspring/chatbot: Lab for Chatbot

memo:

  • openai-cookbook/text_explanation_examples.md at main · openai/openai-cookbook
  • openai-cookbook/Fine-tuned_classification.ipynb at main · openai/openai-cookbook
  • openai-cookbook/Gen_QA.ipynb at main · openai/openai-cookbook
  • openai-cookbook/Question_answering_using_embeddings.ipynb at main · openai/openai-cookbook

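References

ChatGPT 使用指南:句词分类 (ChatGPT Usage Guide: Sentence and Token Classification) @长琴

Related video walkthrough

Other Downloads

If you would like to learn more about AI learning paths and the related body of knowledge, have a look at my other blog post on a complete AI learning roadmap for foundational knowledge, where all materials can be downloaded directly from a cloud drive with no follow required.
That post draws on well-known open-source platforms on GitHub, AI technology platforms, and experts in the field, including Datawhale, ApacheCN, AI有道, and Dr. Huang Haiguang, and collects close to 100 GB of material. I hope it helps everyone.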
