GPT-4 Robot Development

This document describes GPT-4 interaction development for the mini pupper robot. Development progress will be updated here, and the final program will be open-sourced in the github gpt4_ros and github mini pupper projects.

mini pupper GPT-4 development project

Abstract

Important features

  1. The robot asks questions to make the interaction more engaging; it genuinely feels like chatting with a real person.
  2. Exit mode: the voice command audio[hello loona] exits ChatGPT mode.
  3. Nothing in the referenced video is pre-scripted: the model is given a basic prompt describing Ameca, giving the robot a description of itself. It is pure AI.

Alternative technology libraries

1. For processing

[Video Processing] Amazon KVS:

Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices. It durably stores, encrypts, and indexes video data in your streams, and allows you to access your data through easy-to-use APIs. Kinesis Video Streams enables you to playback video for live and on-demand viewing, and quickly build applications that take advantage of computer vision and video analytics through integration with Amazon Rekognition Video, and libraries for ML frameworks such as Apache MxNet, TensorFlow, and OpenCV. Kinesis Video Streams also supports WebRTC, an open-source project that enables real-time media streaming and interaction between web browsers, mobile applications, and connected devices via simple APIs. Typical uses include video chat and peer-to-peer media streaming.
Overall, Amazon Kinesis Video Streams (KVS) is used for securely streaming, storing, indexing, and processing video data from connected devices at scale, enabling analytics, machine learning, playback, and the development of applications that leverage computer vision and real-time media streaming.

  1. Data Collection (at the camera/device): The process starts with the camera or other video devices capturing images.

  2. Data Upload (from the device to the cloud): The data is then securely transmitted to Amazon Kinesis Video Streams.

  3. Data Storage and Indexing (in the cloud on AWS servers): KVS automatically stores, encrypts, and indexes the video data, making it accessible through APIs.

  4. Data Processing (in the cloud on AWS servers): The video data is processed and analyzed using Amazon Rekognition Video or other machine learning frameworks such as Apache MxNet, TensorFlow, and OpenCV. In this case, the processing would include facial recognition.

  5. Result Presentation (on the user’s display device): The processed results (for example, recognized faces) can be retrieved via APIs and displayed on the user’s screen through an application.

The main components involved in this process include the video device, Amazon KVS, the machine learning/facial recognition system, and the user interface.
Amazon Kinesis Video Streams
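As a minimal sketch of step 5 above (retrieving stored/processed video for display), the AWS SDK for Python (boto3) can fetch a live HLS playback URL from a KVS stream. The stream name and region below are placeholders, not values from this project, and assume credentials are already configured:

# Minimal sketch: obtain a live HLS playback URL from Amazon Kinesis Video Streams.
# Assumptions: boto3 is installed, AWS credentials are configured, and a stream
# named "mini-pupper-camera" (placeholder name) is already receiving video.
import boto3

REGION = "us-east-1"           # placeholder region
STREAM = "mini-pupper-camera"  # placeholder stream name

kvs = boto3.client("kinesisvideo", region_name=REGION)

# KVS first returns the service endpoint that serves live/archived media for this stream.
endpoint = kvs.get_data_endpoint(
    StreamName=STREAM,
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

media = boto3.client(
    "kinesis-video-archived-media",
    endpoint_url=endpoint,
    region_name=REGION,
)

# The returned URL can be opened in a video player or passed to a display application.
hls_url = media.get_hls_streaming_session_url(
    StreamName=STREAM,
    PlaybackMode="LIVE",
)["HLSStreamingSessionURL"]
print(hls_url)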

[Command Processing] code-as-policies

Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions (“faster”) depending on context (i.e., behavioral commonsense). This paper presents code as policies: a robot-centric formalization of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
code-as-policies
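To illustrate the idea (this is not the authors' code), a prompt can pair example commands written as comments with example policy code, and a new command is appended as a final comment for the code-completion model. The primitives detect(), goto(), grab(), dist() and the llm_complete() stub below are hypothetical placeholders for real robot APIs and a real LLM call:

# Illustrative sketch of few-shot "code as policies" prompting (not the authors' implementation).
# detect(), goto(), grab(), dist() and llm_complete() are hypothetical placeholders.
PROMPT = '''
# move 10 cm to the left of the red block
block = detect("red block")
goto(block.x - 0.10, block.y)

# pick up the object closest to the bowl
objects = detect_all("object")
bowl = detect("bowl")
target = min(objects, key=lambda o: dist(o, bowl))
goto(target.x, target.y)
grab(target)
'''

def llm_complete(prompt: str) -> str:
    """Placeholder for a code-completion LLM call; the real project would call a model API here."""
    raise NotImplementedError("plug in your LLM client here")

def generate_policy(command: str) -> str:
    """Append the new natural-language command as a comment and let the
    code-completion LLM write the corresponding policy code."""
    return llm_complete(PROMPT + f"\n# {command}\n")

# e.g. exec(generate_policy("move a bit faster toward the blue bowl"), robot_api)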

[ROS Command Process] ROSGPT by Anis Koubaa

ROSGPT is a pioneering approach that combines the power of ChatGPT and ROS (Robot Operating System) to redefine human-robot interaction. By leveraging large language models like ChatGPT, ROSGPT enables the conversion of unstructured human language into actionable robotic commands. This repository contains the implementation of ROSGPT, allowing developers to explore and contribute to the project.
youtube rosgpt
github rosgpt

Paper: Anis Koubaa. “ROSGPT: Next-Generation Human-Robot Interaction with ChatGPT and ROS”, to appear.

2. For voice input and output

[ASR] Amazon Lex:

Amazon Lex is a service provided by Amazon Web Services (AWS) that enables developers to build conversational interfaces for applications using voice and text. It uses advanced deep learning algorithms in natural language understanding (NLU) and automatic speech recognition (ASR) to interpret user input and convert it into actionable commands or responses.

Amazon Lex is designed to make it easy for developers to create chatbots or virtual assistants for various applications, such as customer support, task automation, or even as part of a larger AI-driven system. With Amazon Lex, developers can build and deploy conversational interfaces on platforms like mobile devices, web applications, and messaging platforms.

The service provides tools for designing conversation flows, defining intents and slots, and specifying prompts and responses. It also integrates with other AWS services, such as AWS Lambda, to facilitate processing of user input and execution of backend logic.
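A minimal sketch of sending one text utterance to a Lex V2 bot with boto3 follows; the bot ID, alias ID, locale, and session ID are placeholders, not values from this project, and a bot must already exist in the Lex console:

# Minimal sketch: send one utterance to an Amazon Lex V2 bot and read the reply.
# All IDs below are placeholders; a real bot must be created and built in Lex first.
import boto3

lex = boto3.client("lexv2-runtime", region_name="us-east-1")

response = lex.recognize_text(
    botId="BOT_ID_PLACEHOLDER",
    botAliasId="ALIAS_ID_PLACEHOLDER",
    localeId="en_US",
    sessionId="mini-pupper-session",
    text="walk forward one meter",
)

# Lex returns the matched intent plus any messages the bot was configured to say.
for message in response.get("messages", []):
    print(message["content"])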

[ASR] Whisper by OpenAI

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. ASR technology is designed to convert spoken language into written text, and Whisper aims to provide high-quality and accurate speech recognition. It is trained on a massive amount of multilingual and multitask supervised data collected from the web.

Whisper itself covers only the recognition side. Speech synthesis, also known as text-to-speech (TTS), is the separate process of converting written text into audible speech, so a TTS system must be paired with Whisper to produce spoken output.

By combining the capabilities of both ASR and TTS systems, it becomes possible to create applications and tools that facilitate communication, improve accessibility, and enable voice-activated systems like virtual assistants, transcription services, and more.
official whisper openai
github whisper openai
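A minimal sketch of local transcription with the open-source whisper package (installed via pip install openai-whisper; ffmpeg is required); the audio file name is a placeholder:

# Minimal sketch: transcribe a recorded utterance locally with OpenAI's whisper package.
import whisper

model = whisper.load_model("base")          # small multilingual model
result = model.transcribe("command.wav")    # placeholder recording from the robot's microphone
print(result["text"])                       # recognized text to forward to GPT-4 / ROSGPT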

[TTS] ElevenLabs

Prime Voice AI
The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling.
ElevenLabs

[local TTS / possibly online TTS] WaveGlow by NVIDIA

In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.

Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.

Visit our website for audio samples.
github waveglow nvidia
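If WaveGlow is run locally as the vocoder behind Tacotron 2, a rough sketch via PyTorch Hub could look like the following. The entry-point names and the model_math argument follow NVIDIA's published hub entries and should be verified against the repository; a CUDA GPU is assumed:

# Rough sketch: Tacotron 2 (text -> mel-spectrogram) + WaveGlow (mel -> waveform) via PyTorch Hub.
# Entry-point names are assumptions taken from NVIDIA's hub listings; verify against the repo.
import torch

tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tacotron2", model_math="fp32")
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow", model_math="fp32")
utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")

waveglow = waveglow.remove_weightnorm(waveglow)
tacotron2 = tacotron2.to("cuda").eval()
waveglow = waveglow.to("cuda").eval()

sequences, lengths = utils.prepare_input_sequence(["Hello, I am mini pupper."])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)   # text -> mel-spectrogram
    audio = waveglow.infer(mel)                       # mel-spectrogram -> raw audio samples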

[TTS] Google Cloud Text-to-Speech

Google Text-to-Speech (TTS) is a speech synthesis technology developed by Google as part of its suite of machine learning and artificial intelligence services. It converts written text into natural-sounding speech, allowing applications and devices to provide auditory output instead of relying solely on visual text displays. Google Text-to-Speech is used in various applications, such as virtual assistants, navigation systems, e-book readers, and accessibility tools for people with visual impairments.

Google’s TTS system is based on advanced machine learning techniques, including deep neural networks, which enable it to generate human-like speech with appropriate intonation, rhythm, and pacing. The technology supports multiple languages and offers a range of voices to choose from, allowing developers to customize the user experience in their applications.

Developers can integrate Google Text-to-Speech into their applications using the Google Cloud Text-to-Speech API, which provides a straightforward way to access and utilize the TTS functionality in their products and services.
text to speech Google cloud
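A minimal sketch of synthesizing a short reply with the Google Cloud Text-to-Speech client library; the output file name is a placeholder, and it is assumed that google-cloud-texttospeech is installed and GOOGLE_APPLICATION_CREDENTIALS is set:

# Minimal sketch: synthesize a short reply with the Google Cloud Text-to-Speech API.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Hello, I am mini pupper.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)

with open("reply.mp3", "wb") as out:   # placeholder output file for the robot's speaker
    out.write(response.audio_content)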

Reference

[Two Wheeled Robot] loona by Keyi Tech

1. answer any question
2. tell interactive stories
3. play a variety of games

more than 700 emoji
state -> audio/robot reaction
loop:
video/audio -> Amazon Lex -> robot -> Amazon KVS -> feedback
audio[chatgpt] -> robot reaction[thinking][screen display state change]
audio[hello loona] -> robot reaction[shake head]
problem:
Loona regards all voice input as a question
youtube loona
kickstarter loona
facebook loona
official loona

[Two Wheeled Robot] Sarcastic robot powered by GPT-4

github selenasun1618/GPT-3PO
youtube Sarcastic robot

[Humanoid Robot] ameca human-like robot

This #ameca demo couples automated speech recognition with GPT-3, a large language model that generates meaningful answers. The output is fed to an online TTS service that generates the voice and visemes for lip-sync timing. The team at Engineered Arts Ltd pose the questions.
youtube ameca
youtube Ameca expressions with GPT3 / 4

[Quadruped Robot Dog] Boston Dynamics Spot by Underfitted

file[json] -> GPT
explain what the structure is and how to read the JSON
GPT can read the JSON and answer questions such as "battery level" or "next mission"
describe the [next mission, such as location]
one prompt is the system prompt, the other comes from Whisper
question: [Spot, are you standing up? yes or no] -> robot reaction[shake head]
youtube Boston Dynamics
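A sketch of the prompting pattern described in these notes: the system prompt explains the robot-status JSON, and the user message is the question transcribed by Whisper. ask_gpt() is a hypothetical placeholder for a chat-model API call, and the JSON path is a placeholder:

# Sketch of the two-part prompt described above (not the code from the video).
import json

def build_messages(status_json_path: str, whisper_question: str) -> list:
    with open(status_json_path) as f:
        status = json.load(f)
    system_prompt = (
        "You are a quadruped robot. The following JSON describes your current state, "
        "including battery level and the next mission with its location:\n"
        + json.dumps(status, indent=2)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": whisper_question},  # e.g. "Spot, are you standing up? yes or no"
    ]

def ask_gpt(messages: list) -> str:
    """Hypothetical placeholder for the chat-completion call."""
    raise NotImplementedError("plug in your GPT-4 client here")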

[Biped Servo Robot] EMO desktop Robot

official emo

[robotic arm] GPT robotic arm by vlangogh

twitter robotic arm

Progress:

Amazon KVS

The face-recognition data flow with Amazon Kinesis Video Streams, from camera capture to display on the user's screen, is the same five-step pipeline already listed in the [Video Processing] Amazon KVS section above: data collection, data upload, data storage and indexing, data processing, and result presentation.

ROSGPT

  1. Uses prompt engineering with LLMs (specifically ChatGPT), exploiting their distinctive capabilities such as ability elicitation, chain-of-thought reasoning, and instruction tuning.
  2. Develops an ontology to convert unstructured natural-language commands into structured robot instructions specific to the application context.
  3. Exploits the zero-shot / few-shot learning ability of LLMs to elicit structured robot instructions from unstructured human language input.
  4. Well-known LLMs include OpenAI's GPT-3 [6] and GPT-4 [9, 10], and Google's BERT [11] and T5 [12].
  5. The remarkable in-context learning ability of LLMs is built on prompt-engineering techniques.
  6. Structured command data is usually represented in the standard JSON format.
  7. GPTROSProxy: the prompt-engineering module, which processes unstructured text input using prompt-engineering methods.
  8. The robot needs to move or rotate. Precisely defining move and rotate commands requires developing an ontology that covers domain-specific concepts such as Robot Motion and Robot Rotation. To fully describe these commands, the ontology must also include key parameters such as Distance, Linear Velocity, Direction, Angle, Angular Velocity, and Orientation. By exploiting such an ontology, natural-language commands can be structured more accurately and consistently, improving the performance and reliability of the robot system.
  9. The ROSParser module is a key component of the ROSGPT system. It processes the structured data elicited from unstructured commands and converts it into executable code. From a software-engineering perspective, ROSParser can be viewed as middleware that facilitates communication between the high-level processing modules and the low-level robot control modules. It is designed to interface with the ROS nodes responsible for controlling low-level robot hardware components, such as motor controllers or sensors, using predefined ROS programming primitives.
  10. Considering the navigation use case above, an ontology can be proposed that captures the basic concepts, relations, and attributes relevant to spatial navigation tasks, as shown in Figure 3 of the paper. The ontology specifies three actions for this use case: move, go-to-goal, and rotate, where each action has one or more parameters that can be inferred from the voice prompt. It includes concepts such as location, motion, and velocity, and can be viewed as the finite range of possible robot-system actions that human commands are expected to map onto. This translates into the following ROS 2 primitives: move(linear velocity, distance, is_forward), rotate(angular velocity, angle, is_clockwise), and go_to_goal(location). JSON-serialized structured command design: with a clear picture of how to design a structured command format consistent with the ontology and ROS 2 primitives above, a JSON-serialized command format can be proposed for prompt engineering with ChatGPT, for example (with illustrative parameter values):
{
  "action": "go_to_goal",
  "params": {
    "location": {
      "type": "str",
      "value": "Kitchen"
    }
  }
}
{
  "action": "move",
  "params": {
    "linear_speed": 0.5,
    "distance": 1.0,
    "is_forward": true,
    "unit": "meter"
  }
}
{
  "action": "rotate",
  "params": {
    "angular_velocity": 0.35,
    "angle": 40,
    "is_clockwise": true,
    "unit": "degrees"
  }
}
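As a sketch of what a ROSParser-style mapping from these JSON commands onto ROS 2 primitives could look like (not the ROSGPT implementation itself), using rclpy and geometry_msgs; the /cmd_vel topic and the simplified handling of distance and angle are assumptions:

# Sketch (not the ROSGPT code): map a JSON-serialized command onto a ROS 2 velocity publisher.
import json

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist


class SimpleParser(Node):
    def __init__(self):
        super().__init__("simple_rosparser")
        self.pub = self.create_publisher(Twist, "/cmd_vel", 10)

    def execute(self, command_json: str):
        cmd = json.loads(command_json)
        twist = Twist()
        if cmd["action"] == "move":
            speed = cmd["params"]["linear_speed"]
            twist.linear.x = speed if cmd["params"]["is_forward"] else -speed
        elif cmd["action"] == "rotate":
            omega = cmd["params"]["angular_velocity"]
            twist.angular.z = -omega if cmd["params"]["is_clockwise"] else omega
        # go_to_goal would be forwarded to a navigation stack instead of /cmd_vel.
        self.pub.publish(twist)


if __name__ == "__main__":
    rclpy.init()
    node = SimpleParser()
    node.execute('{"action": "move", "params": {"linear_speed": 0.5, "distance": 1.0, '
                 '"is_forward": true, "unit": "meter"}}')
    rclpy.shutdown()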
  11. Even though this particular action, "take a photo", was not defined in the learning samples, ChatGPT still generated it. This result highlights the model's potential limitation in interpreting and generating contextually accurate commands without the guidance of a structured framework such as an ontology. Without ontology keywords, ChatGPT generates actions based on its own understanding, which may not always match the intended actions defined in the learning samples, and can therefore produce output that violates their specific constraints and requirements.
  12. Applying the ontology effectively constrains the model's output, ensuring that it matches the required schema and adheres to the context-specific requirements. This experiment illustrates the important role of ontologies in guiding large language models such as ChatGPT to generate contextually relevant and accurate structured robot commands. By incorporating ontologies and other structured frameworks during training and fine-tuning, a model's ability to generate output that fits the application scenario and complies with predefined constraints and requirements can be significantly improved.
  13. The package consists of the ROS 2 Python code for the ROSGPT REST server and its corresponding ROS 2 node, plus a web application that uses the Web Speech API to convert human speech into text commands. The web app communicates with ROSGPT through its REST API to submit text commands; ROSGPT then uses the ChatGPT API to convert the human text into a JSON-serialized command, which the robot uses to move or navigate. The ROSGPT implementation is intended to be an open platform for further development in human-robot interaction.
  14. An ontology is a formal representation of what is known in a specific domain, allowing the relationships between entities and concepts to be understood more consistently and explicitly. By exploiting this structured knowledge, the ROSGPT system can process natural-language commands more effectively and translate them into precise, executable robot actions.
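A minimal client-side sketch of item 13, submitting a text command to a ROSGPT-style REST server with requests; the URL, port, and form-field name are assumptions to check against the github rosgpt code:

# Minimal client sketch for item 13: submit a text command to a ROSGPT-style REST server.
# The server address, route, and field name below are assumptions, not confirmed values.
import requests

resp = requests.post(
    "http://localhost:5000/rosgpt",                     # assumed server address and route
    data={"text_command": "Move forward one meter."},   # assumed field name
)
print(resp.text)  # expected: the JSON-serialized robot command produced via the ChatGPT API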
