This document describes GPT-4 interaction development for the Mini Pupper robot. Development progress is updated here in sync, and the final program will be open-sourced in the GitHub gpt4_ros and GitHub mini pupper projects.
Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices. It durably stores, encrypts, and indexes video data in your streams, and allows you to access your data through easy-to-use APIs. Kinesis Video Streams enables you to playback video for live and on-demand viewing, and quickly build applications that take advantage of computer vision and video analytics through integration with Amazon Rekognition Video, and libraries for ML frameworks such as Apache MxNet, TensorFlow, and OpenCV. Kinesis Video Streams also supports WebRTC, an open-source project that enables real-time media streaming and interaction between web browsers, mobile applications, and connected devices via simple APIs. Typical uses include video chat and peer-to-peer media streaming.
Overall, Amazon Kinesis Video Streams (KVS) is used for securely streaming, storing, indexing, and processing video data from connected devices at scale, enabling analytics, machine learning, playback, and the development of applications that leverage computer vision and real-time media streaming. When KVS is used to process facial recognition data, the flow from camera recording to final display on the user's screen is as follows:
Data Collection (at the camera/device): The process starts with the camera or other video devices capturing images.
Data Upload (from the device to the cloud): The data is then securely transmitted to Amazon Kinesis Video Streams.
Data Storage and Indexing (in the cloud on AWS servers): KVS automatically stores, encrypts, and indexes the video data, making it accessible through APIs.
Data Processing (in the cloud on AWS servers): The video data is processed and analyzed using Amazon Rekognition Video or other machine learning frameworks such as Apache MxNet, TensorFlow, and OpenCV. In this case, the processing would include facial recognition.
Result Presentation (on the user’s display device): The processed results (for example, recognized faces) can be retrieved via APIs and displayed on the user’s screen through an application.
The main components involved in this process include the video device, Amazon KVS, the machine learning/facial recognition system, and the user interface.
Amazon Kinesis Video Streams
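As a rough sketch of the cloud side of this pipeline (an upload endpoint for step 2 and a playback URL for step 5), the boto3 calls below create a stream and fetch its endpoints. This is only a sketch: the stream name mini-pupper-stream is a placeholder and AWS credentials/region are assumed to be configured.

import boto3

STREAM_NAME = "mini-pupper-stream"   # placeholder stream name

kvs = boto3.client("kinesisvideo")

# Create the stream once (retention in hours); raises if it already exists.
kvs.create_stream(StreamName=STREAM_NAME, DataRetentionInHours=24)

# Device side: the PUT_MEDIA endpoint is where video is uploaded.
# The actual frame upload is normally done with the KVS Producer SDK /
# GStreamer plugin rather than boto3.
put_endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="PUT_MEDIA"
)["DataEndpoint"]
print("Upload endpoint:", put_endpoint)

# Viewer side: fetch an HLS URL for live playback in a browser or app.
hls_endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="GET_HLS_STREAMING_SESSION_URL"
)["DataEndpoint"]
archived_media = boto3.client(
    "kinesis-video-archived-media", endpoint_url=hls_endpoint
)
hls_url = archived_media.get_hls_streaming_session_url(
    StreamName=STREAM_NAME, PlaybackMode="LIVE"
)["HLSStreamingSessionURL"]
print("Playback URL:", hls_url)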
Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions (“faster”) depending on context (i.e., behavioral commonsense). This paper presents code as policies: a robot-centric formalization of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
code-as-policies
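A minimal sketch of the few-shot prompting pattern described in the abstract above, not the authors' implementation: example commands appear as comments followed by policy code, and the LLM is asked to continue the pattern for a new command. The robot API names (get_object_position, move_to, set_speed) are hypothetical, and the call uses the pre-1.0 openai Python client.

import openai   # assumes OPENAI_API_KEY is set in the environment

# Few-shot prompt: comments are natural-language commands,
# followed by the policy code that implements them.
PROMPT = '''
# available API: get_object_position(name), move_to(xy), set_speed(mps)

# move to the red block.
block_xy = get_object_position("red block")
move_to(block_xy)

# move to the point midway between the bowl and the cup, a bit faster.
import numpy as np
midpoint = np.mean([get_object_position("bowl"), get_object_position("cup")], axis=0)
set_speed(0.3)   # "a bit faster" than a 0.2 m/s default
move_to(midpoint)

# {command}
'''

def generate_policy_code(command: str) -> str:
    # Ask the LLM to continue the prompt with policy code for a new command.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(command=command)}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

print(generate_policy_code("go to the spot 0.5 m to the left of the banana"))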
ROSGPT is a pioneering approach that combines the power of ChatGPT and ROS (Robot Operating System) to redefine human-robot interaction. By leveraging large language models like ChatGPT, ROSGPT enables the conversion of unstructured human language into actionable robotic commands. This repository contains the implementation of ROSGPT, allowing developers to explore and contribute to the project.
youtube rosgpt
github rosgpt
Paper: Anis Koubaa. “ROSGPT: Next-Generation Human-Robot Interaction with ChatGPT and ROS”, to appear.
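A hedged sketch of the ROSGPT idea rather than the repository's actual code: the LLM converts a spoken sentence into a structured JSON command (like the ontology examples at the end of this document) and a ROS 2 node publishes it on a topic. The topic name /voice_cmd and the prompt wording are assumptions.

import json
import openai                      # pre-1.0 openai client; OPENAI_API_KEY must be set
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

ONTOLOGY_PROMPT = (
    "Convert the following human command into a JSON object with the fields "
    '"action" and "params", using only the actions go_to_goal, move and rotate. '
    "Command: {command}"
)

class ROSGPTBridge(Node):
    # Publishes LLM-generated JSON commands on a String topic.
    def __init__(self):
        super().__init__("rosgpt_bridge")
        self.pub = self.create_publisher(String, "/voice_cmd", 10)

    def handle_command(self, text: str):
        reply = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": ONTOLOGY_PROMPT.format(command=text)}],
            temperature=0,
        )["choices"][0]["message"]["content"]
        json.loads(reply)                  # fail early if the LLM output is not valid JSON
        self.pub.publish(String(data=reply))

def main():
    rclpy.init()
    node = ROSGPTBridge()
    node.handle_command("go to the kitchen")
    rclpy.shutdown()

if __name__ == "__main__":
    main()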
Amazon Lex is a service provided by Amazon Web Services (AWS) that enables developers to build conversational interfaces for applications using voice and text. It uses advanced deep learning algorithms in natural language understanding (NLU) and automatic speech recognition (ASR) to interpret user input and convert it into actionable commands or responses.
Amazon Lex is designed to make it easy for developers to create chatbots or virtual assistants for various applications, such as customer support, task automation, or even as part of a larger AI-driven system. With Amazon Lex, developers can build and deploy conversational interfaces on platforms like mobile devices, web applications, and messaging platforms.
The service provides tools for designing conversation flows, defining intents and slots, and specifying prompts and responses. It also integrates with other AWS services, such as AWS Lambda, to facilitate processing of user input and execution of backend logic.
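As an illustration, a single text turn can be sent to a Lex V2 bot with boto3; the bot ID, alias ID and session ID below are placeholders.

import boto3

# Placeholder identifiers; a real Lex V2 bot ID, alias ID and locale are required.
lex = boto3.client("lexv2-runtime")
response = lex.recognize_text(
    botId="BOT_ID",
    botAliasId="BOT_ALIAS_ID",
    localeId="en_US",
    sessionId="mini-pupper-session",
    text="turn left forty degrees",
)

# The interpreted intent and slot values drive the backend logic (e.g., a Lambda).
for interpretation in response.get("interpretations", []):
    intent = interpretation["intent"]
    print(intent["name"], intent.get("slots"))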
Whisper is an automatic speech recognition (ASR) system developed by OpenAI. ASR technology is designed to convert spoken language into written text, and Whisper aims to provide high-quality and accurate speech recognition. It is trained on a massive amount of multilingual and multitask supervised data collected from the web.
Whisper itself only performs speech recognition, so “Whisper for speech recognition and synthesis” in practice means combining Whisper for ASR with a separate speech synthesis system. Speech synthesis, also known as text-to-speech (TTS), is the process of converting written text into audible speech.
By combining the capabilities of both ASR and TTS systems, it becomes possible to create applications and tools that facilitate communication, improve accessibility, and enable voice-activated systems like virtual assistants, transcription services, and more.
official whisper openai
github whisper openai
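A minimal transcription sketch with the open-source whisper package; the model size and audio file name are arbitrary choices.

import whisper

# Load one of the published checkpoints ("tiny", "base", "small", ...).
model = whisper.load_model("base")

# Transcribe a local recording; Whisper detects the language automatically.
result = model.transcribe("command.wav")
print(result["text"])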
Prime Voice AI
The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling.
ElevenLabs
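One way to use the service is its HTTP text-to-speech endpoint; the sketch below follows the publicly documented endpoint at the time of writing, and the API key and voice ID are placeholders.

import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"     # placeholder
VOICE_ID = "YOUR_VOICE_ID"              # placeholder voice identifier

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": "Hello, I am Mini Pupper."},
    timeout=30,
)
resp.raise_for_status()

# The response body is audio (MP3 by default).
with open("reply.mp3", "wb") as f:
    f.write(resp.content)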
In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.
Visit our website for audio samples.
github waveglow nvidia
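NVIDIA also publishes pretrained Tacotron 2 and WaveGlow checkpoints on Torch Hub; a rough synthesis sketch, assuming a CUDA GPU and the NVIDIA/DeepLearningExamples hub entry points, could look like this.

import torch
from scipy.io.wavfile import write

# Pretrained checkpoints published by NVIDIA on Torch Hub (downloaded on first use).
hub = "NVIDIA/DeepLearningExamples:torchhub"
tacotron2 = torch.hub.load(hub, "nvidia_tacotron2", model_math="fp32").to("cuda").eval()
waveglow = torch.hub.load(hub, "nvidia_waveglow", model_math="fp32")
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()
utils = torch.hub.load(hub, "nvidia_tts_utils")

# Text -> mel-spectrogram (Tacotron 2) -> waveform (WaveGlow).
sequences, lengths = utils.prepare_input_sequence(["Hello from Mini Pupper."])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)

# 22050 Hz is the sampling rate of the published checkpoints.
write("waveglow_sample.wav", 22050, audio[0].cpu().numpy())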
Google Text-to-Speech (TTS) is a speech synthesis technology developed by Google as part of its suite of machine learning and artificial intelligence services. It converts written text into natural-sounding speech, allowing applications and devices to provide auditory output instead of relying solely on visual text displays. Google Text-to-Speech is used in various applications, such as virtual assistants, navigation systems, e-book readers, and accessibility tools for people with visual impairments.
Google’s TTS system is based on advanced machine learning techniques, including deep neural networks, which enable it to generate human-like speech with appropriate intonation, rhythm, and pacing. The technology supports multiple languages and offers a range of voices to choose from, allowing developers to customize the user experience in their applications.
Developers can integrate Google Text-to-Speech into their applications using the Google Cloud Text-to-Speech API, which provides a straightforward way to access and utilize the TTS functionality in their products and services.
text to speech Google cloud
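A minimal example with the google-cloud-texttospeech client library; Google Cloud credentials are assumed to be configured.

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Text in, voice and output format selected, MP3 bytes out.
synthesis_input = texttospeech.SynthesisInput(text="Hello, I am Mini Pupper.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)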
1 answer any question
2 tell interactive stories
3 play a variety of games
more than 700 emoji expressions
state -> audio/robot reaction
loop:
video/audio -> AWS Lex -> robot -> AWS KVS -> feedback
audio[chatgpt] -> robot reaction[thinking][screen display state change]
audio[hello loona] -> robot reaction[shake head]
problem:
Loona regards all voice input as a question
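A hedged sketch of the interaction loop noted above; every function here (listen, transcribe, ask_chatgpt, speak, perform_reaction, stream_to_kvs) is a stub standing in for the components discussed elsewhere in this document.

import time

# Hypothetical stubs for the real components.
_FAKE_INPUT = iter(["hello loona", "what is the weather today"])

def listen() -> bytes: return b""                                   # microphone capture
def transcribe(audio: bytes) -> str: return next(_FAKE_INPUT, "")   # Whisper / AWS Lex
def ask_chatgpt(text: str) -> str: return "Sunny, 22 degrees."      # LLM answer
def speak(reply: str) -> None: print("TTS:", reply)                 # TTS output
def perform_reaction(name: str) -> None: print("reaction:", name)   # motion / screen state
def stream_to_kvs() -> None: pass                                   # video feedback to AWS KVS

def interaction_loop(max_turns: int = 3) -> None:
    # audio in -> LLM -> robot reaction -> video feedback, as in the loop above
    for _ in range(max_turns):
        text = transcribe(listen())
        if not text:
            time.sleep(0.1)
            continue
        if "hello loona" in text.lower():
            perform_reaction("shake_head")     # wake-word style reaction
            continue
        perform_reaction("thinking")           # screen display state change
        speak(ask_chatgpt(text))
        stream_to_kvs()

if __name__ == "__main__":
    interaction_loop()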
youtube loona
kickstarter loona
facebook loona
official loona
github selenasun1618/GPT-3PO
youtube Sarcastic robot
This #ameca demo couples automated speech recognition with GPT-3 - a large language model that generates meaningful answers - the output is fed to an online TTS service which generates the voice and visemes for lip-sync timing. The team at Engineered Arts Ltd pose the questions.
youtube ameca
youtube Ameca expressions with GPT3 / 4
file[json] -> GPT
explain what the structure is and how to read the JSON
GPT can read the JSON and answer "battery level or next mission"
describe the [next mission, such as location]
one prompt is the system prompt, the other one comes from Whisper
question: [Spot, are you standing up? yes or no] -> robot reaction [shake head]
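A hedged sketch of the file[json] -> GPT idea: the robot's state file goes into the system prompt, the Whisper transcription becomes the user question, and the answer can be mapped to a reaction. The state fields are invented for illustration and the call uses the pre-1.0 openai client.

import json
import openai   # pre-1.0 openai client; OPENAI_API_KEY must be set

# Hypothetical robot state; in practice this would be read from a file the robot updates.
robot_state = {"battery_level": 82, "is_standing": False,
               "next_mission": {"action": "go_to_goal", "location": "Kitchen"}}

SYSTEM_PROMPT = (
    "You are the voice of a quadruped robot. The following JSON describes its current "
    "state; fields: battery_level (percent), is_standing (bool), next_mission. "
    "Answer questions about the robot using only this data.\n"
    + json.dumps(robot_state)
)

whisper_text = "Spot, are you standing up? Yes or no."   # transcription from Whisper

answer = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": whisper_text},
    ],
)["choices"][0]["message"]["content"]

print(answer)   # expected: "No." -> mapped to robot reaction [shake head]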
youtube Boston Dynamics
official emo
twitter robotic arm
{
  "action": "go_to_goal",
  "params": {
    "location": {
      "type": "str",
      "value": "Kitchen"
    }
  }
}
{
  "action": "move",
  "params": {
    "linear_speed": 0.5,
    "distance": 1.0,
    "is_forward": true,
    "unit": "meter"
  }
}
{
  "action": "rotate",
  "params": {
    "angular_velocity": 0.35,
    "angle": 40,
    "is_clockwise": true,
    "unit": "degrees"
  }
}
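A short sketch of how such JSON commands could be dispatched on the robot side; the primitive functions go_to_goal, move and rotate are placeholders for the real control API.

import json

# Placeholder motion primitives; the real robot control API would go here.
def go_to_goal(location): print("navigating to", location["value"])
def move(linear_speed, distance, is_forward, unit): print("moving", distance, unit)
def rotate(angular_velocity, angle, is_clockwise, unit): print("rotating", angle, unit)

HANDLERS = {"go_to_goal": go_to_goal, "move": move, "rotate": rotate}

def dispatch(command_json: str) -> None:
    # Parse a GPT-generated command like the examples above and run the matching primitive.
    command = json.loads(command_json)
    handler = HANDLERS[command["action"]]
    handler(**command["params"])

dispatch('{"action": "rotate", "params": {"angular_velocity": 0.35, "angle": 40, '
         '"is_clockwise": true, "unit": "degrees"}}')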