Open Source Project Walkthrough — Self-Operating Computer Framework # Long-termism # Value

Value: generate mind maps of the main function and the business-logic functions to aid understanding, and contribute them back to the open source project as a PR. The goal is to help readers understand how IPA (intelligent process automation) works. Since there are no good Chinese-language walkthroughs of this project, I am translating, analyzing, and annotating it here to get the discussion started. The mind maps will be produced with ERNIE Bot (文心一言) and its mind-map plugin.

Project link: OthersideAI/self-operating-computer: A framework to enable multimodal models to operate a computer. (github.com)


Contents

Overall code structure

Core code logic

  capture_screen_with_cursor # captures the screen, including the mouse cursor

  capture_mini_screenshot_with_cursor # captures a mini screenshot around a given point and saves it together with a grid overlay

  add_grid_to_image # overlays a grid on an image

  keyboard_type # simulates keyboard input programmatically

  search # simulates searching in the operating system: press the Start/Super key (Windows/Linux) or Command+Space (macOS), type the given text, then press Enter

  extract_json_from_string and convert_percent_to_decimal # extract JSON from a string / convert a percentage string to a decimal

  draw_label_with_background # a helper nested inside add_grid_to_image that draws a text label on a white rectangle background, used to place a percentage label at every grid intersection

  click_at_percentage # clicks at a screen position given as percentages of the screen size

  mouse_click # parses the model's CLICK payload and clicks at the given percentage position

  summarize # generates a summary using a pretrained model; it captures a screenshot and feeds it to the model, and supports `gpt-4-vision-preview` and `gemini-pro-vision`

  parse_response # parses the model's response into a dictionary holding the response type (click, type, search, done) and its associated data

  get_next_action_from_gemini_pro_vision # generates the next action using the pretrained `gemini-pro-vision` model; it captures a screenshot and feeds it to the model

  get_next_action_from_openai # generates the next action using OpenAI's GPT-4 model; it captures a screenshot and feeds it to the model

  accurate_mode_double_check # in accurate mode, re-prompts `gpt-4-vision-preview` with an extra mini screenshot centered on the cursor to fine-tune the click location

  get_last_assistant_message # retrieves the last assistant message from the messages array

  get_next_action # generates the next action based on the given model, messages array, objective, and accurate mode

  format_accurate_mode_vision_prompt # builds the accurate-mode vision prompt from the previous click coordinates and the screen size

  format_vision_prompt # builds the vision prompt from the objective and the previous action

  format_summary_prompt # builds the summary prompt from the objective; called as a helper inside summarize

  main # the entry point of the Self-Operating Computer

  validation # verifies that the model, accurate mode, and voice mode are correctly configured

  ModelNotRecognizedException # a class inheriting from the base `Exception`, raised when an unrecognized model is encountered

Prompt definitions in the code

Common variable settings

Imported packages

  dotenv # mainly used to load environment variables

  Xlib # used to interact with the X Window System

  prompt_toolkit # a library for building powerful interactive command-line applications in Python, with text-terminal UIs

  PyAutoGUI # a simple, cross-platform library for automating the keyboard and mouse: it can control the mouse and keyboard, handle message boxes, take screenshots, locate elements on screen, and more

  Pydantic # a widely used library for defining and validating data-interface schemas, which makes large projects easier to maintain

  Pygetwindow # provides methods and properties for performing window operations from Python

Business logic

Architecture / Modules



Overall code structure

(Figure 1: overall code structure)

Core code logic

capture_screen_with_cursor # captures the screen, including the mouse cursor

def capture_screen_with_cursor(file_path):
    user_platform = platform.system()

    if user_platform == "Windows":
        screenshot = pyautogui.screenshot()
        screenshot.save(file_path)
    elif user_platform == "Linux":
        # Use xlib to prevent scrot dependency for Linux
        screen = Xlib.display.Display().screen()
        size = screen.width_in_pixels, screen.height_in_pixels
        monitor_size["width"] = size[0]
        monitor_size["height"] = size[1]
        screenshot = ImageGrab.grab(bbox=(0, 0, size[0], size[1]))
        screenshot.save(file_path)
    elif user_platform == "Darwin":  # (Mac OS)
        # Use the screencapture utility to capture the screen with the cursor
        subprocess.run(["screencapture", "-C", file_path])
    else:
        print(f"The platform you're using ({user_platform}) is not currently supported")

capture_mini_screenshot_with_cursor # captures a mini screenshot around a given point and saves it together with a grid overlay

def capture_mini_screenshot_with_cursor(
    file_path=os.path.join("screenshots", "screenshot_mini.png"), x=0, y=0
):
    user_platform = platform.system()

    if user_platform == "Linux":
        x = float(x[:-1])  # convert x from "50%" to 50.
        y = float(y[:-1])

        x = (x / 100) * monitor_size[
            "width"
        ]  # convert x from 50 to 0.5 * monitor_width
        y = (y / 100) * monitor_size["height"]

        # Define the coordinates for the rectangle
        x1, y1 = int(x - ACCURATE_PIXEL_COUNT / 2), int(y - ACCURATE_PIXEL_COUNT / 2)
        x2, y2 = int(x + ACCURATE_PIXEL_COUNT / 2), int(y + ACCURATE_PIXEL_COUNT / 2)

        screenshot = ImageGrab.grab(bbox=(x1, y1, x2, y2))
        screenshot = screenshot.resize(
            (screenshot.width * 2, screenshot.height * 2), Image.LANCZOS
        )  # upscale the image so it's easier to see and percentage marks more visible
        screenshot.save(file_path)

        screenshots_dir = "screenshots"
        grid_screenshot_filename = os.path.join(
            screenshots_dir, "screenshot_mini_with_grid.png"
        )

        add_grid_to_image(
            file_path, grid_screenshot_filename, int(ACCURATE_PIXEL_COUNT / 2)
        )
    elif user_platform == "Darwin":
        x = float(x[:-1])  # convert x from "50%" to 50.
        y = float(y[:-1])

        x = (x / 100) * monitor_size[
            "width"
        ]  # convert x from 50 to 0.5 * monitor_width
        y = (y / 100) * monitor_size["height"]

        x1, y1 = int(x - ACCURATE_PIXEL_COUNT / 2), int(y - ACCURATE_PIXEL_COUNT / 2)

        width = ACCURATE_PIXEL_COUNT
        height = ACCURATE_PIXEL_COUNT
        # Use the screencapture utility to capture the screen with the cursor
        rect = f"-R{x1},{y1},{width},{height}"
        subprocess.run(["screencapture", "-C", rect, file_path])

        screenshots_dir = "screenshots"
        grid_screenshot_filename = os.path.join(
            screenshots_dir, "screenshot_mini_with_grid.png"
        )

        add_grid_to_image(
            file_path, grid_screenshot_filename, int(ACCURATE_PIXEL_COUNT / 2)
        )
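
To make the numbers concrete: with the defaults shown later (ACCURATE_PIXEL_COUNT = 200 and the 1920×1080 monitor_size), a call with x="50%" and y="50%" resolves to the pixel center (960, 540), so the Linux branch grabs the box (860, 440, 1060, 640), a 200×200 crop, and upscales it to 400×400 before the grid is added.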


add_grid_to_image # overlays a grid on an image

def add_grid_to_image(original_image_path, new_image_path, grid_interval):
    """
    Add a grid to an image
    """
    # Load the image
    image = Image.open(original_image_path)

    # Create a drawing object
    draw = ImageDraw.Draw(image)

    # Get the image size
    width, height = image.size

    # Reduce the font size a bit
    font_size = int(grid_interval / 10)  # Reduced font size

    # Calculate the background size based on the font size
    bg_width = int(font_size * 4.2)  # Adjust as necessary
    bg_height = int(font_size * 1.2)  # Adjust as necessary

    # Function to draw text with a white rectangle background
    def draw_label_with_background(
        position, text, draw, font_size, bg_width, bg_height
    ):
        # Adjust the position based on the background size
        text_position = (position[0] + bg_width // 2, position[1] + bg_height // 2)
        # Draw the text background
        draw.rectangle(
            [position[0], position[1], position[0] + bg_width, position[1] + bg_height],
            fill="white",
        )
        # Draw the text
        draw.text(text_position, text, fill="black", font_size=font_size, anchor="mm")

    # Draw vertical lines and labels at every `grid_interval` pixels
    for x in range(grid_interval, width, grid_interval):
        line = ((x, 0), (x, height))
        draw.line(line, fill="blue")
        for y in range(grid_interval, height, grid_interval):
            # Calculate the percentage of the width and height
            x_percent = round((x / width) * 100)
            y_percent = round((y / height) * 100)
            draw_label_with_background(
                (x - bg_width // 2, y - bg_height // 2),
                f"{x_percent}%,{y_percent}%",
                draw,
                font_size,
                bg_width,
                bg_height,
            )

    # Draw horizontal lines - labels are already added with vertical lines
    for y in range(grid_interval, height, grid_interval):
        line = ((0, y), (width, y))
        draw.line(line, fill="blue")

    # Save the image with the grid
    image.save(new_image_path)
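
A minimal usage sketch, mirroring how the action functions below call this helper (the 500-pixel interval is the value used by get_next_action_from_openai and get_next_action_from_gemini_pro_vision; the file paths are illustrative):

add_grid_to_image(
    "screenshots/screenshot.png",            # input: the raw screenshot
    "screenshots/screenshot_with_grid.png",  # output: the same image with blue grid lines
    500,                                     # grid interval in pixels; each intersection gets an "x%,y%" label
)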

keyboard_type # simulates keyboard input programmatically

def keyboard_type(text):
    text = text.replace("\\n", "\n")
    for char in text:
        pyautogui.write(char)
    pyautogui.press("enter")
    return "Type: " + text


search # simulates searching in the operating system. Specifically, it presses the Start key (on Windows; the same Super key on Linux) or Command+Space (on macOS), then types the given text and presses Enter.

def search(text):
    if platform.system() == "Windows":
        pyautogui.press("win")
    elif platform.system() == "Linux":
        pyautogui.press("win")
    else:
        # Press and release Command and Space separately
        pyautogui.keyDown("command")
        pyautogui.press("space")
        pyautogui.keyUp("command")

    time.sleep(1)

    # Now type the text
    for char in text:
        pyautogui.write(char)

    pyautogui.press("enter")
    return "Open program: " + text


"extract_json_from_string" and "convert_percent_to_decimal"# 从json提取字符与把百分数转换为小数点

def extract_json_from_string(s):
    # print("extracting json from string", s)
    try:
        # Find the start of the JSON structure
        json_start = s.find("{")
        if json_start == -1:
            return None

        # Extract the JSON part and convert it to a dictionary
        json_str = s[json_start:]
        return json.loads(json_str)
    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return None


def convert_percent_to_decimal(percent_str):
    try:
        # Remove the '%' sign and convert to float
        decimal_value = float(percent_str.strip("%"))

        # Convert to decimal (e.g., 20% -> 0.20)
        return decimal_value / 100
    except ValueError as e:
        print(f"Error converting percent to decimal: {e}")
        return None
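
A few illustrative calls showing the expected behavior of both helpers:

extract_json_from_string('some preamble CLICK {"x": "50%", "y": "60%"}')
# -> {'x': '50%', 'y': '60%'}  (everything before the first "{" is discarded)

convert_percent_to_decimal("60%")   # -> 0.6
convert_percent_to_decimal("oops")  # prints the ValueError and returns None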

draw_label_with_background # a helper nested inside add_grid_to_image that draws a text label on a white rectangle background; the grid code above uses it to place a percentage label at every intersection. The excerpt below repeats just the helper for reference.

    def draw_label_with_background(
        position, text, draw, font_size, bg_width, bg_height
    ):
        # Adjust the position based on the background size
        text_position = (position[0] + bg_width // 2, position[1] + bg_height // 2)
        # Draw the text background
        draw.rectangle(
            [position[0], position[1], position[0] + bg_width, position[1] + bg_height],
            fill="white",
        )
        # Draw the text
        draw.text(text_position, text, fill="black", font_size=font_size, anchor="mm")



click_at_percentage # clicks at a screen position given as percentages of the screen size

def click_at_percentage(
    x_percentage, y_percentage, duration=0.2, circle_radius=50, circle_duration=0.5
):
    # Get the size of the primary monitor
    screen_width, screen_height = pyautogui.size()

    # Calculate the x and y coordinates in pixels
    x_pixel = int(screen_width * float(x_percentage))
    y_pixel = int(screen_height * float(y_percentage))

    # Move to the position smoothly
    pyautogui.moveTo(x_pixel, y_pixel, duration=duration)

    # Circular movement
    start_time = time.time()
    while time.time() - start_time < circle_duration:
        angle = ((time.time() - start_time) / circle_duration) * 2 * math.pi
        x = x_pixel + math.cos(angle) * circle_radius
        y = y_pixel + math.sin(angle) * circle_radius
        pyautogui.moveTo(x, y, duration=0.1)

    # Finally, click
    pyautogui.click(x_pixel, y_pixel)
    return "Successfully clicked"


mouse_click # parses the model's CLICK payload and clicks at the given percentage position

def mouse_click(click_detail):
    try:
        x = convert_percent_to_decimal(click_detail["x"])
        y = convert_percent_to_decimal(click_detail["y"])

        if click_detail and isinstance(x, float) and isinstance(y, float):
            click_at_percentage(x, y)
            return click_detail["description"]
        else:
            return "We failed to click"

    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return "We failed to click"


summarize # generates a summary using a pretrained model. It captures a screenshot and feeds it to the model, and supports two models: `gpt-4-vision-preview` and `gemini-pro-vision`.

def summarize(model, messages, objective):
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)

        screenshot_filename = os.path.join(screenshots_dir, "summary_screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)

        summary_prompt = format_summary_prompt(objective)
        
        if model == "gpt-4-vision-preview":
            with open(screenshot_filename, "rb") as img_file:
                img_base64 = base64.b64encode(img_file.read()).decode("utf-8")

            summary_message = {
                "role": "user",
                "content": [
                    {"type": "text", "text": summary_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ],
            }
            # create a copy of messages and save to pseudo_messages
            messages.append(summary_message)

            response = client.chat.completions.create(
                model="gpt-4-vision-preview",
                messages=messages,
                max_tokens=500,
            )

            content = response.choices[0].message.content
        elif model == "gemini-pro-vision":
            model = genai.GenerativeModel("gemini-pro-vision")
            summary_message = model.generate_content(
                [summary_prompt, Image.open(screenshot_filename)]
            )
            content = summary_message.text
        return content
    
    except Exception as e:
        print(f"Error in summarize: {e}")
        return "Failed to summarize the workflow"


parse_response # parses the response from the AI interaction, handling the different response types such as a click, typed text, or a search query. In short, parse_response turns the response into a dictionary containing a string for the response type and the data associated with that type.

def parse_response(response):
    if response == "DONE":
        return {"type": "DONE", "data": None}
    elif response.startswith("CLICK"):
        # Adjust the regex to match the correct format
        click_data = re.search(r"CLICK \{ (.+) \}", response).group(1)
        click_data_json = json.loads(f"{{{click_data}}}")
        return {"type": "CLICK", "data": click_data_json}

    elif response.startswith("TYPE"):
        # Extract the text to type
        try:
            type_data = re.search(r"TYPE (.+)", response, re.DOTALL).group(1)
        except:
            type_data = re.search(r'TYPE "(.+)"', response, re.DOTALL).group(1)
        return {"type": "TYPE", "data": type_data}

    elif response.startswith("SEARCH"):
        # Extract the search query
        try:
            search_data = re.search(r'SEARCH "(.+)"', response).group(1)
        except:
            search_data = re.search(r"SEARCH (.+)", response).group(1)
        return {"type": "SEARCH", "data": search_data}

    return {"type": "UNKNOWN", "data": response}


get_next_action_from_gemini_pro_vision # generates the next action using the pretrained `gemini-pro-vision` model. It captures a screenshot and feeds it to the model.

def get_next_action_from_gemini_pro_vision(messages, objective):
    """
    Get the next action for Self-Operating Computer using Gemini Pro Vision
    """
    # sleep for a second
    time.sleep(1)
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)

        screenshot_filename = os.path.join(screenshots_dir, "screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)

        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_with_grid.png"
        )

        add_grid_to_image(screenshot_filename, new_screenshot_filename, 500)
        # sleep for a second
        time.sleep(1)

        previous_action = get_last_assistant_message(messages)

        vision_prompt = format_vision_prompt(objective, previous_action)

        model = genai.GenerativeModel("gemini-pro-vision")

        response = model.generate_content(
            [vision_prompt, Image.open(new_screenshot_filename)]
        )

        # create a copy of messages and save to pseudo_messages
        pseudo_messages = messages.copy()
        pseudo_messages.append(response.text)

        messages.append(
            {
                "role": "user",
                "content": "`screenshot.png`",
            }
        )
        content = response.text[1:]

        return content

    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return "Failed take action after looking at the screenshot"


get_next_action_from_openai # generates the next action using OpenAI's GPT-4 model. It captures a screenshot and feeds it to the model.

# (a step that get_next_action_from_gemini_pro_vision does not have) Finally, if `accurate_mode` is `True` and the model returned a CLICK action, it calls `accurate_mode_double_check` to re-prompt the model with a mini screenshot and refine the click position.

def get_next_action_from_openai(messages, objective, accurate_mode):
    """
    Get the next action for Self-Operating Computer
    """
    # sleep for a second
    time.sleep(1)
    try:
        screenshots_dir = "screenshots"
        if not os.path.exists(screenshots_dir):
            os.makedirs(screenshots_dir)

        screenshot_filename = os.path.join(screenshots_dir, "screenshot.png")
        # Call the function to capture the screen with the cursor
        capture_screen_with_cursor(screenshot_filename)

        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_with_grid.png"
        )

        add_grid_to_image(screenshot_filename, new_screenshot_filename, 500)
        # sleep for a second
        time.sleep(1)

        with open(new_screenshot_filename, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")

        previous_action = get_last_assistant_message(messages)

        vision_prompt = format_vision_prompt(objective, previous_action)

        vision_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": vision_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
            ],
        }

        # create a copy of messages and save to pseudo_messages
        pseudo_messages = messages.copy()
        pseudo_messages.append(vision_message)

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=pseudo_messages,
            presence_penalty=1,
            frequency_penalty=1,
            temperature=0.7,
            max_tokens=300,
        )

        messages.append(
            {
                "role": "user",
                "content": "`screenshot.png`",
            }
        )

        content = response.choices[0].message.content

        if accurate_mode:
            if content.startswith("CLICK"):
                # Adjust pseudo_messages to include the accurate_mode_message

                click_data = re.search(r"CLICK \{ (.+) \}", content).group(1)
                click_data_json = json.loads(f"{{{click_data}}}")
                prev_x = click_data_json["x"]
                prev_y = click_data_json["y"]

                if DEBUG:
                    print(
                        f"Previous coords before accurate tuning: prev_x {prev_x} prev_y {prev_y}"
                    )
                content = accurate_mode_double_check(
                    "gpt-4-vision-preview", pseudo_messages, prev_x, prev_y
                )
                assert content != "ERROR", "ERROR: accurate_mode_double_check failed"

        return content

    except Exception as e:
        print(f"Error parsing JSON: {e}")
        return "Failed take action after looking at the screenshot"


accurate_mode_double_check # in accurate mode, regenerates the action using the pretrained `gpt-4-vision-preview` model. It re-prompts OpenAI with an additional mini screenshot centered on the cursor so the click position can be fine-tuned.

def accurate_mode_double_check(model, pseudo_messages, prev_x, prev_y):
    """
    Reprompt OAI with additional screenshot of a mini screenshot centered around the cursor for further finetuning of clicked location
    """
    print("[get_next_action_from_gemini_pro_vision] accurate_mode_double_check")
    try:
        screenshot_filename = os.path.join("screenshots", "screenshot_mini.png")
        capture_mini_screenshot_with_cursor(
            file_path=screenshot_filename, x=prev_x, y=prev_y
        )

        new_screenshot_filename = os.path.join(
            "screenshots", "screenshot_mini_with_grid.png"
        )

        with open(new_screenshot_filename, "rb") as img_file:
            img_base64 = base64.b64encode(img_file.read()).decode("utf-8")

        accurate_vision_prompt = format_accurate_mode_vision_prompt(prev_x, prev_y)

        accurate_mode_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": accurate_vision_prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
            ],
        }

        pseudo_messages.append(accurate_mode_message)

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=pseudo_messages,
            presence_penalty=1,
            frequency_penalty=1,
            temperature=0.7,
            max_tokens=300,
        )

        content = response.choices[0].message.content

        return content

    except Exception as e:
        print(f"Error reprompting model for accurate_mode: {e}")
        return "ERROR"


get_last_assistant_message # retrieves the last assistant message from the messages array

def get_last_assistant_message(messages):
    """
    Retrieve the last message from the assistant in the messages array.
    If the last assistant message is the first message in the array, return None.
    """
    for index in reversed(range(len(messages))):
        if messages[index]["role"] == "assistant":
            if index == 0:  # Check if the assistant message is the first in the array
                return None
            else:
                return messages[index]
    return None  # Return None if no assistant message is found
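
An illustrative call (the messages are shaped like the ones main builds below):

messages = [
    {"role": "assistant", "content": "Hello, I can help you with anything. What would you like done?"},
    {"role": "user", "content": "Objective: open Chrome"},
    {"role": "assistant", "content": "Open program: Chrome"},
]
get_last_assistant_message(messages)
# -> {'role': 'assistant', 'content': 'Open program: Chrome'}
# If the only assistant message were at index 0 (the greeting), the function would return None.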

get_next_action # generates the next action based on the given model, messages array, objective, and accurate mode

def get_next_action(model, messages, objective, accurate_mode):
    if model == "gpt-4-vision-preview":
        content = get_next_action_from_openai(messages, objective, accurate_mode)
        return content
    elif model == "agent-1":
        return "coming soon"
    elif model == "gemini-pro-vision":
        content = get_next_action_from_gemini_pro_vision(
            messages, objective
        )
        return content

    raise ModelNotRecognizedException(model)


format_accurate_mode_vision_prompt # builds the accurate-mode vision prompt from the previous click coordinates and the screen size

In short, this function uses the screen size and the previous click coordinates to produce a formatted accurate-mode vision prompt, which serves as input to the GPT-4 model.

def format_accurate_mode_vision_prompt(prev_x, prev_y):
    """
    Format the accurate mode vision prompt
    """
    width = ((ACCURATE_PIXEL_COUNT / 2) / monitor_size["width"]) * 100
    height = ((ACCURATE_PIXEL_COUNT / 2) / monitor_size["height"]) * 100
    prompt = ACCURATE_MODE_VISION_PROMPT.format(
        prev_x=prev_x, prev_y=prev_y, width=width, height=height
    )
    return prompt
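
With the defaults shown later (ACCURATE_PIXEL_COUNT = 200 and the 1920×1080 monitor_size), width = (100 / 1920) × 100 ≈ 5.21 and height = (100 / 1080) × 100 ≈ 9.26. In other words, half the mini screenshot spans about 5.2% of the screen horizontally and 9.3% vertically, which is exactly the delta the ACCURATE_MODE_VISION_PROMPT tells the model to add or subtract from its previous answer.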


format_vision_prompt # builds the vision prompt from the objective and the previous action

def format_vision_prompt(objective, previous_action):
    """
    Format the vision prompt
    """
    if previous_action:
        previous_action = f"Here was the previous action you took: {previous_action}"
    else:
        previous_action = ""
    prompt = VISION_PROMPT.format(objective=objective, previous_action=previous_action)
    return prompt

format_summary_prompt # builds the summary prompt from the objective; it is called as a helper inside summarize.

def format_summary_prompt(objective):
    """
    Format the summary prompt
    """
    prompt = SUMMARY_PROMPT.format(objective=objective)
    return prompt

1. It fills the `{objective}` placeholder in the `SUMMARY_PROMPT` template via `str.format`.

2. It returns the resulting prompt string.

In short, this function produces a formatted summary prompt from the objective, which serves as input to the GPT-4 or gemini-pro-vision model.

main # the entry point of the Self-Operating Computer

def main(model, accurate_mode, terminal_prompt, voice_mode=False):
    """
    Main function for the Self-Operating Computer
    """
    mic = None
    # Initialize `WhisperMic`, if `voice_mode` is True 

    validation(model, accurate_mode, voice_mode)

    if voice_mode:
        try:
            from whisper_mic import WhisperMic

            # Initialize WhisperMic if import is successful
            mic = WhisperMic()
        except ImportError:
            print(
                "Voice mode requires the 'whisper_mic' module. Please install it using 'pip install -r requirements-audio.txt'"
            )
            sys.exit(1)

    # Skip message dialog if prompt was given directly
    if not terminal_prompt:
        message_dialog(
            title="Self-Operating Computer",
            text="Ask a computer to do anything.",
            style=style,
        ).run()
    else:
        print("Running direct prompt...")

    print("SYSTEM", platform.system())
    # Clear the console
    if platform.system() == "Windows":
        os.system("cls")
    else:
        print("\033c", end="")

    if terminal_prompt:  # Skip objective prompt if it was given as an argument
        objective = terminal_prompt
    elif voice_mode:
        print(
            f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RESET} Listening for your command... (speak now)"
        )
        try:
            objective = mic.listen()
        except Exception as e:
            print(f"{ANSI_RED}Error in capturing voice input: {e}{ANSI_RESET}")
            return  # Exit if voice input fails
    else:
        print(f"{ANSI_GREEN}[Self-Operating Computer]\n{ANSI_RESET}{USER_QUESTION}")
        print(f"{ANSI_YELLOW}[User]{ANSI_RESET}")
        objective = prompt(style=style)

    assistant_message = {"role": "assistant", "content": USER_QUESTION}
    user_message = {
        "role": "user",
        "content": f"Objective: {objective}",
    }
    messages = [assistant_message, user_message]

    loop_count = 0

    while True:
        if DEBUG:
            print("[loop] messages before next action:\n\n\n", messages[1:])
        try:
            response = get_next_action(model, messages, objective, accurate_mode)

            action = parse_response(response)
            action_type = action.get("type")
            action_detail = action.get("data")

        except ModelNotRecognizedException as e:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
            )
            break
        except Exception as e:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] -> {e} {ANSI_RESET}"
            )
            break

        if action_type == "DONE":
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BLUE} Objective complete {ANSI_RESET}"
            )
            summary = summarize(model, messages, objective)
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BLUE} Summary\n{ANSI_RESET}{summary}"
            )
            break

        if action_type != "UNKNOWN":
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA} [Act] {action_type} {ANSI_RESET}{action_detail}"
            )

        function_response = ""
        if action_type == "SEARCH":
            function_response = search(action_detail)
        elif action_type == "TYPE":
            function_response = keyboard_type(action_detail)
        elif action_type == "CLICK":
            function_response = mouse_click(action_detail)
        else:
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] something went wrong :({ANSI_RESET}"
            )
            print(
                f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_RED}[Error] AI response\n{ANSI_RESET}{response}"
            )
            break

        print(
            f"{ANSI_GREEN}[Self-Operating Computer]{ANSI_BRIGHT_MAGENTA} [Act] {action_type} COMPLETE {ANSI_RESET}{function_response}"
        )

        message = {
            "role": "assistant",
            "content": function_response,
        }
        messages.append(message)

        loop_count += 1
        if loop_count > 15:
            break
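
The excerpt stops at main itself. Below is a minimal, hypothetical launcher sketch illustrating how the four parameters could be wired to command-line flags; the flag names are illustrative, not necessarily the project's actual CLI:

# Hypothetical launcher sketch; flag names are illustrative, not the project's actual CLI.
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Self-Operating Computer")
    parser.add_argument("--model", default="gpt-4-vision-preview")
    parser.add_argument("--accurate", action="store_true", help="enable accurate mode")
    parser.add_argument("--voice", action="store_true", help="take the objective by voice")
    parser.add_argument("--prompt", default=None, help="pass the objective directly")
    args = parser.parse_args()
    main(args.model, args.accurate, args.prompt, voice_mode=args.voice)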


validation # verifies that the model, accurate mode, and voice mode are correctly configured

def validation(
    model,
    accurate_mode,
    voice_mode,
):
    if accurate_mode and model != "gpt-4-vision-preview":
        print("To use accuracy mode, please use gpt-4-vision-preview")
        sys.exit(1)

    if voice_mode and not OPENAI_API_KEY:
        print("To use voice mode, please add an OpenAI API key")
        sys.exit(1)

    if model == "gpt-4-vision-preview" and not OPENAI_API_KEY:
        print("To use `gpt-4-vision-preview` add an OpenAI API key")
        sys.exit(1)

    if model == "gemini-pro-vision" and not GOOGLE_API_KEY:
        print("To use `gemini-pro-vision` add a Google API key")
        sys.exit(1)

1. First, it checks whether `accurate_mode` is `True` while the model is not `"gpt-4-vision-preview"`. If so, it prints an error message and exits.

2. Next, it checks whether `voice_mode` is `True` without an OpenAI API key being set. If so, it prints an error message and exits.

3. It then checks whether the model is `"gpt-4-vision-preview"` without an OpenAI API key being set. If so, it prints an error message and exits.

4. Finally, it checks whether the model is `"gemini-pro-vision"` without a Google API key being set. If so, it prints an error message and exits.

In short, this function ensures that the model, accurate mode, and voice mode are correctly configured before the program runs.

ModelNotRecognizedException # a class inheriting from the base `Exception`; it is raised when an unrecognized model is encountered

class ModelNotRecognizedException(Exception):
    """Exception raised for unrecognized models."""

    def __init__(self, model, message="Model not recognized"):
        self.model = model
        self.message = message
        super().__init__(self.message)

    def __str__(self):
        return f"{self.message} : {self.model} "


# Define style
style = PromptStyle.from_dict(
    {
        "dialog": "bg:#88ff88",
        "button": "bg:#ffffff #000000",
        "dialog.body": "bg:#44cc44 #ffffff",
        "dialog shadow": "bg:#003800",
    }
)


# Check if on a windows terminal that supports ANSI escape codes
def supports_ansi():
    """
    Check if the terminal supports ANSI escape codes
    """
    plat = platform.system()
    supported_platform = plat != "Windows" or "ANSICON" in os.environ
    is_a_tty = hasattr(sys.stdout, "isatty") and sys.stdout.isatty()
    return supported_platform and is_a_tty


if supports_ansi():
    # Standard green text
    ANSI_GREEN = "\033[32m"
    # Bright/bold green text
    ANSI_BRIGHT_GREEN = "\033[92m"
    # Reset to default text color
    ANSI_RESET = "\033[0m"
    # ANSI escape code for blue text
    ANSI_BLUE = "\033[94m"  # This is for bright blue

    # Standard yellow text
    ANSI_YELLOW = "\033[33m"

    ANSI_RED = "\033[31m"

    # Bright magenta text
    ANSI_BRIGHT_MAGENTA = "\033[95m"
else:
    ANSI_GREEN = ""
    ANSI_BRIGHT_GREEN = ""
    ANSI_RESET = ""
    ANSI_BLUE = ""
    ANSI_YELLOW = ""
    ANSI_RED = ""
    ANSI_BRIGHT_MAGENTA = ""
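
Back to ModelNotRecognizedException: an illustrative trigger through the get_next_action dispatcher shown earlier:

try:
    get_next_action("not-a-model", [], "open Chrome", accurate_mode=False)
except ModelNotRecognizedException as e:
    print(e)  # prints: Model not recognized : not-a-model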


Prompt definitions in the code

SUMMARY_PROMPT = """
You are a Self-Operating Computer. A user request has been executed. Present the results succinctly.

Include the following key contexts of the completed request:

1. State the original objective.
2. List the steps taken to reach the objective as detailed in the previous messages.
3. Reference the screenshot that was used.

Summarize the actions taken to fulfill the objective. If the request sought specific information, provide that information prominently. NOTE: Address directly any question posed by the user.

Remember: The user will not interact with this summary. You are solely reporting the outcomes.

Original objective: {objective}

Display the results clearly:
"""

This multi-line string is the Self-Operating Computer's summary prompt. It guides the model to present a clear, concise summary after a user request has been executed.

The prompt opens by casting the model as a Self-Operating Computer and asks it to present the results succinctly, including the key context of the completed request.

It lists three key points:

1. The original objective of the user request.
2. The steps taken to reach that objective, as detailed in the previous messages.
3. A reference to the screenshot that was used.

The prompt then asks the model to summarize the actions taken to fulfill the objective, to present prominently any specific information the request asked for, and to directly address any question the user posed. It also reminds the model that the user will not interact with this summary; it is purely a report of the outcomes.

Finally, the prompt closes with the original objective, which the model can reference in its summary.

USER_QUESTION = "Hello, I can help you with anything. What would you like done?"
ACCURATE_MODE_VISION_PROMPT = """
It looks like your previous attempted action was clicking on "x": {prev_x}, "y": {prev_y}. This has now been moved to the center of this screenshot.
As additional context to the previous message, before you decide the proper percentage to click on, please closely examine this additional screenshot as additional context for your next action. 
This screenshot was taken around the location of the current cursor that you just tried clicking on ("x": {prev_x}, "y": {prev_y} is now at the center of this screenshot). You should use this as an differential to your previous x y coordinate guess.

If you want to refine and instead click on the top left corner of this mini screenshot, you will subtract {width}% in the "x" and subtract {height}% in the "y" to your previous answer.
Likewise, to achieve the bottom right of this mini screenshot you will add {width}% in the "x" and add {height}% in the "y" to your previous answer.

There are four segmenting lines across each dimension, divided evenly. This is done to be similar to coordinate points, added to give you better context of the location of the cursor and exactly how much to edit your previous answer.

Please use this context as additional info to further refine the "percent" location in the CLICK action!
"""
VISION_PROMPT = """
You are a Self-Operating Computer. You use the same operating system as a human.

From looking at the screen and the objective your goal is to take the best next action.

To operate the computer you have the four options below.

1. CLICK - Move mouse and click
2. TYPE - Type on the keyboard
3. SEARCH - Search for a program on Mac and open it
4. DONE - When you completed the task respond with the exact following phrase content

Here are the response formats below.

1. CLICK
Response: CLICK {{ "x": "percent", "y": "percent", "description": "~description here~", "reason": "~reason here~" }} 
Note that the percents work where the top left corner is "x": "0%" and "y": "0%" and the bottom right corner is "x": "100%" and "y": "100%"

2. TYPE
Response: TYPE "value you want to type"

2. SEARCH
Response: SEARCH "app you want to search for on Mac"

3. DONE
Response: DONE

Here are examples of how to respond.
__
Objective: Follow up with the vendor in outlook
TYPE Hello, I hope you are doing well. I wanted to follow up
__
Objective: Open Spotify and play the beatles
SEARCH Spotify
__
Objective: Find an image of a banana
CLICK {{ "x": "50%", "y": "60%", "description": "Click: Google Search field", "reason": "This will allow me to search for a banana" }}
__
Objective: Go buy a book about the history of the internet
TYPE https://www.amazon.com/
__

A few important notes:

- Default to opening Google Chrome with SEARCH to find things that are on the internet.
- Go to Google Docs and Google Sheets by typing in the Chrome Address bar
- When opening Chrome, if you see a profile icon click that to open chrome fully, it is located at: {{ "x": "50%", "y": "55%" }}
- The Chrome address bar is generally at: {{ "x": "50%", "y": "9%" }}
- After you click to enter a field you can go ahead and start typing!
- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.

{previous_action}

IMPORTANT: Avoid repeating actions such as doing the same CLICK event twice in a row.

Objective: {objective}
"""

The vision prompt above casts the model as a Self-Operating Computer that uses the same operating system as a human: from the screenshot and the objective, its goal is to take the best next action. Four actions are available:

1. CLICK - move the mouse and click
2. TYPE - type on the keyboard
3. SEARCH - search for a program on the Mac and open it
4. DONE - respond with exactly that phrase once the task is complete

Each action has a fixed response format. CLICK must return a JSON-like payload with percentage coordinates (the top-left corner is "x": "0%", "y": "0%", and the bottom-right corner is "x": "100%", "y": "100%"), plus a description and a reason. TYPE and SEARCH each take a quoted string, and DONE is the bare word.

The prompt then shows one worked example per action: following up with a vendor via TYPE, opening Spotify via SEARCH, clicking the Google search field via CLICK, and navigating to amazon.com via TYPE. It closes the examples with a few practical notes: default to opening Google Chrome with SEARCH for anything on the internet; reach Google Docs and Google Sheets by typing into the Chrome address bar, which generally sits at {{ "x": "50%", "y": "9%" }}; if Chrome opens onto a profile picker, click the profile icon (around {{ "x": "50%", "y": "55%" }}) to open it fully; start typing as soon as a field has been clicked; and never respond that it is unable to assist, since it can interact with the user's OS indirectly through its text responses.

Finally, the prompt injects {previous_action}, warns the model to avoid repeating actions such as issuing the same CLICK event twice in a row, and states the {objective}.


Common variable settings

ACCURATE_PIXEL_COUNT = (
    200  # mini_screenshot is ACCURATE_PIXEL_COUNT x ACCURATE_PIXEL_COUNT big
)
monitor_size = {
    "width": 1920,
    "height": 1080,
}
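
Note that monitor_size is only a default: on Linux, capture_screen_with_cursor overwrites it with the real screen dimensions queried through Xlib, and capture_mini_screenshot_with_cursor and format_accurate_mode_vision_prompt then read from it (on macOS the 1920×1080 default is kept).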

LLM API configuration

DEBUG = False

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

if OPENAI_API_KEY:
    client = OpenAI()
    client.api_key = OPENAI_API_KEY
    client.base_url = os.getenv("OPENAI_API_BASE_URL", client.base_url)

Imported packages

"""
Self-Operating Computer
"""
import os
import time
import base64
import json
import math
import re
import subprocess
import pyautogui
import argparse
import platform
import Xlib.display
import Xlib.X
import Xlib.Xutil  # not sure if Xutil is necessary
import google.generativeai as genai
from prompt_toolkit import prompt
from prompt_toolkit.shortcuts import message_dialog
from prompt_toolkit.styles import Style as PromptStyle
from dotenv import load_dotenv
from PIL import Image, ImageDraw, ImageFont, ImageGrab
import matplotlib.font_manager as fm
from openai import OpenAI
import sys

python-dotenv的详细用法_python dotenv-CSDN博客

The dotenv package is mainly used to load environment variables.

Xlib 函数库简介--x window 工作原理简介_xlib 库-CSDN博客

The Xlib package is used to interact with the X Window System.


In the X Window world, essentially every action is triggered and carried out by events, for the X client and the X server alike. From the X client's point of view, every X application contains an event-handling loop: the program quietly waits for events, and whenever Xlib intercepts an event belonging to that application and delivers it, the event loop dispatches the corresponding action; once handled, the program returns to the start of the loop and waits for the next event. Many kinds of events can occur: messages arriving from other windows, keyboard or mouse activity, the window manager asking to change a window's size or state, and so on.

Python Module — prompt_toolkit CLI 库-CSDN博客

python prompt toolkit-用于构建功能强大的交互式命令行的python库 (360doc.com)

The prompt_toolkit package is a library for building powerful interactive command-line applications in Python, i.e. text-terminal UIs.

Python自动操作 GUI 神器——PyAutoGUI (baidu.com)

一个神奇的GUI自动化测试库-PyAutoGui-CSDN博客 # this link has many illustrated examples and is easy to follow

The PyAutoGUI package is a simple, easy-to-use, cross-platform Python library for automating the keyboard and mouse. It can control the mouse and keyboard, handle message boxes, take screenshots, locate elements on screen, and more: anything from grinding loot chests in a game to writing documents automatically.

Python笔记:Pydantic库简介-CSDN博客 # this link includes a demo and is easy to follow

The Pydantic package is a widely used library for defining and validating data-interface schemas. With pydantic, data interfaces can be defined and used in a more disciplined way, which is friendlier for developing large projects.

python操作windows窗口,python库pygetwindow使用详解-CSDN博客

The Pygetwindow package provides methods and properties that make it easy to perform various window operations from a Python program.


Business logic

Architecture / Modules
