[晓理紫] Daily Paper Digest (with abstracts and source code or project links) -- Large Models, Diffusion Models, Vision-Language Navigation

Subscribe to papers in your field of interest

Follow {晓理紫|小李子} for daily paper updates. If you find this useful, please forward it to others who may need it. Thank you for your support.

If you find this helpful, please follow me; the latest papers will be pushed to you on time every day.


Categories:

  • Large Language Models (LLM)
  • Vision-Language Models (VLM)
  • Diffusion Models
  • Vision-Language Navigation (VLN)
  • Embodied AI, Robotics
  • Reinforcement Learning
  • Open Vocabulary, Detection and Segmentation

[晓理紫] Daily Paper Digest (with abstracts and source code or project links)

== LLM ==

Title: SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Authors: Xin Zhang, Dong Zhang, Shimin Li

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2308.16692v2

Project: https://0nutation.github.io/SpeechTokenizer.github.io/|

GitHub: https://github.com/ZhangXInFD/SpeechTokenizer/.|

Abstract: Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot text-to-speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
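
Because the abstract hinges on residual vector quantization, a minimal sketch may help make the idea concrete. The NumPy snippet below is an illustrative RVQ encoder with random placeholder codebooks, not SpeechTokenizer's trained model or its semantic distillation; it only shows the mechanism of each layer quantizing the residual left by the previous layers.

```python
# Minimal residual vector quantization (RVQ) sketch with random codebooks.
import numpy as np

rng = np.random.default_rng(0)
num_layers, codebook_size, dim = 4, 256, 64
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_layers)]

def rvq_encode(x, codebooks):
    """Quantize a frame vector x with a stack of residual codebooks."""
    residual = x.copy()
    codes, quantized = [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized          # next layer models what is left over
    return codes, quantized

frame = rng.normal(size=dim)              # stand-in for one encoder output frame
codes, recon = rvq_encode(frame, codebooks)
print(codes, np.linalg.norm(frame - recon))
```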


Title: Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2310.00647v2

Project: https://evalign-icl.github.io/|

GitHub: https://github.com/mshukor/EvALign-ICL.|

Abstract: Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performance (e.g., VQA accuracy) alone does not provide enough clues to understand their real capabilities, limitations, and the extent to which such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm and (1) evaluate 10 recent open-source LMMs from the 3B up to the 80B parameter scale, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs. While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMM flaws is nuanced: despite its effectiveness for improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL.
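
For readers unfamiliar with the proposed ICL variants, the snippet below sketches how a Chain-of-Hindsight-ICL style prompt might be assembled for an LMM that accepts interleaved image placeholders and text. The template, field names, and demonstration are hypothetical illustrations, not the exact format used in EvALign-ICL.

```python
# Hypothetical prompt assembly for a Chain-of-Hindsight-style multimodal ICL query.
def chain_of_hindsight_icl(demos, query):
    """Each demonstration shows a bad and a corrected answer before the query."""
    parts = []
    for d in demos:
        parts.append(f"<image:{d['image']}> Question: {d['question']}\n"
                     f"Bad answer: {d['bad_answer']}\n"
                     f"Good answer: {d['good_answer']}")
    parts.append(f"<image:{query['image']}> Question: {query['question']}\nGood answer:")
    return "\n\n".join(parts)

demos = [{"image": "demo1.jpg", "question": "What is on the table?",
          "bad_answer": "A dog.", "good_answer": "A red mug."}]
query = {"image": "test.jpg", "question": "What is the person holding?"}
print(chain_of_hindsight_icl(demos, query))
```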


Title: GenSim: Generating Robotic Simulation Tasks via Large Language Models

Authors: Lirui Wang, Yiyang Ling, Zhecheng Yuan

PubTime: 2024-01-21

Downlink: http://arxiv.org/abs/2310.01361v2

Project: https://liruiw.github.io/gensim | https://huggingface.co/spaces/Gen-Sim/Gen-Sim

GitHub: https://github.com/liruiw/GenSim

Abstract: Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model's (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks. Furthermore, we observe that LLM-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
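
The exploratory-generation mode can be pictured as a simple bootstrap loop over a growing task library. The sketch below is schematic: call_llm and runs_in_sim are hypothetical stand-ins for an LLM client and a simulator-based validation gate, not GenSim's actual interfaces.

```python
# Schematic exploratory task generation loop (placeholder LLM and validator).
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client; here we echo a dummy task.
    return "def task_stack_two_blocks(env): ..."

def runs_in_sim(task_code: str) -> bool:
    # Placeholder validation gate: in practice, execute the task in simulation.
    return "def task_" in task_code

def exploratory_generation(seed_tasks, rounds=10):
    library = list(seed_tasks)                # task code strings the LLM can build on
    for _ in range(rounds):
        prompt = ("Existing tasks:\n" + "\n\n".join(library[-5:]) +
                  "\n\nPropose one new, harder manipulation task as Python code.")
        candidate = call_llm(prompt)
        if runs_in_sim(candidate):            # keep only tasks that actually run
            library.append(candidate)
    return library

print(len(exploratory_generation(["def task_place_block(env): ..."])))
```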


Title: LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

Authors: Chengsong Huang, Qian Liu, Bill Yuchen Lin

PubTime: 2024-01-18

Downlink: http://arxiv.org/abs/2307.13269v2

Project: https://huggingface.co/lorahub.|

GitHub: https://github.com/sail-sg/lorahub,|

Abstract: Low-rank adaptation (LoRA) is often employed to fine-tune large language models (LLMs) for new tasks. This paper investigates LoRA composability for cross-task generalization and introduces LoraHub, a simple framework devised for the purposive assembly of LoRA modules trained on diverse given tasks, with the objective of achieving adaptable performance on unseen tasks. With just a few examples from a new task, LoraHub can fluidly combine multiple LoRA modules, eliminating the need for human expertise and assumptions. Notably, the composition requires neither additional model parameters nor gradients. Empirical results on the Big-Bench Hard benchmark suggest that LoraHub, while not surpassing the performance of in-context learning, offers a notable performance-efficiency trade-off in few-shot scenarios by employing a significantly reduced number of tokens per example during inference. Notably, LoraHub establishes a better upper bound compared to in-context learning when paired with different demonstration examples, demonstrating its potential for future development. Our vision is to establish a platform for LoRA modules, empowering users to share their trained LoRA modules. This collaborative approach facilitates the seamless application of LoRA modules to novel tasks, contributing to an adaptive ecosystem. Our code is available at https://github.com/sail-sg/lorahub, and all the pre-trained LoRA modules are released at https://huggingface.co/lorahub.
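
The core operation, composing LoRA modules without gradients, can be illustrated in a few lines. The NumPy sketch below merges random placeholder LoRA modules as a weighted sum of their low-rank updates and tunes the mixing weights with a naive random search, standing in for LoraHub's gradient-free optimizer; shapes and the toy objective are illustrative only.

```python
# Weighted composition of LoRA modules with a gradient-free weight search.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, n_modules = 32, 16, 4, 3
W0 = rng.normal(size=(d_out, d_in))                       # frozen base weight
loras = [(rng.normal(size=(d_out, rank)) * 0.1,           # B_i
          rng.normal(size=(rank, d_in)) * 0.1)            # A_i
         for _ in range(n_modules)]

def merged_weight(W0, loras, w):
    delta = sum(wi * (B @ A) for wi, (B, A) in zip(w, loras))
    return W0 + delta

def few_shot_loss(W, X, Y):
    return float(np.mean((X @ W.T - Y) ** 2))             # toy proxy objective

X, Y = rng.normal(size=(8, d_in)), rng.normal(size=(8, d_out))  # "few-shot" data
best_w, best_loss = None, float("inf")
for _ in range(200):                                      # crude gradient-free search
    w = rng.uniform(-1.5, 1.5, size=n_modules)
    loss = few_shot_loss(merged_weight(W0, loras, w), X, Y)
    if loss < best_loss:
        best_w, best_loss = w, loss
print(best_w, best_loss)
```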


Title: CLadder: Assessing Causal Reasoning in Language Models

Authors: Zhijing Jin, Yuen Chen, Felix Leeb

PubTime: 2024-01-17

Downlink: http://arxiv.org/abs/2312.04350v3

Project: https://huggingface.co/datasets/causalNLP/cladder,|

GitHub: https://github.com/causalNLP/cladder.|

Abstract: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
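
To make the kind of formal query concrete, here is a toy worked example (made-up numbers) on a three-variable confounded graph, contrasting the interventional quantity obtained by backdoor adjustment with plain observational conditioning; this mirrors the interventional-rung questions CLadder poses in natural language.

```python
# Toy backdoor adjustment on the graph Z -> X, Z -> Y, X -> Y (synthetic numbers).
P_Z = {0: 0.6, 1: 0.4}                       # P(Z)
P_X_given_Z = {0: 0.2, 1: 0.8}               # P(X=1 | Z)
P_Y_given_XZ = {(0, 0): 0.1, (0, 1): 0.5,    # P(Y=1 | X, Z)
                (1, 0): 0.4, (1, 1): 0.9}

# Backdoor adjustment: sum_z P(Y=1 | X=1, Z=z) * P(Z=z)
p_do = sum(P_Y_given_XZ[(1, z)] * P_Z[z] for z in (0, 1))

# Observational conditioning: weights Z by P(Z=z | X=1) instead of P(Z=z)
p_x1 = sum(P_X_given_Z[z] * P_Z[z] for z in (0, 1))
p_obs = sum(P_Y_given_XZ[(1, z)] * P_X_given_Z[z] * P_Z[z] / p_x1 for z in (0, 1))

print(f"P(Y=1 | do(X=1)) = {p_do:.3f}")      # 0.4*0.6 + 0.9*0.4 = 0.600
print(f"P(Y=1 | X=1)     = {p_obs:.3f}")     # ~0.764, so the two queries differ
```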


Title: EarthPT: a time series foundation model for Earth Observation

Authors: Michael J. Smith, Luke Fleming, James E. Geach

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2309.07207v2

Project: https://www.climatechange.ai/papers/neurips2023/2|

GitHub: https://github.com/aspiaspace/EarthPT|

Abstract: We introduce EarthPT, an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use-cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range well into the future. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 to 1) at the pixel level over a five-month test set horizon, out-performing simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with, in theory, quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar 'Large Observation Models'.
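
Since the abstract quotes NDVI forecasting errors, a small sketch of the quantity being predicted and of the kind of phase-folded baseline it is compared against may be useful. The reflectance values below are synthetic and the baseline is deliberately naive; none of this reproduces EarthPT itself.

```python
# NDVI from synthetic reflectances plus a "historical monthly mean" baseline.
import numpy as np

def ndvi(nir, red, eps=1e-6):
    return (nir - red) / (nir + red + eps)   # bounded to roughly [-1, 1]

rng = np.random.default_rng(0)
months = np.arange(48) % 12                  # four years of monthly observations
seasonal = 0.3 + 0.25 * np.sin(2 * np.pi * months / 12)
nir = seasonal + 0.05 * rng.normal(size=48)  # synthetic NIR reflectance
red = 0.15 + 0.02 * rng.normal(size=48)      # synthetic red reflectance
series = ndvi(nir, red)

# Phase-folded baseline: predict each month with its historical monthly mean
# from the first three years, then score the final year.
history, test = series[:36], series[36:]
monthly_mean = np.array([history[months[:36] == m].mean() for m in range(12)])
pred = monthly_mean[months[36:]]
print("baseline MAE:", float(np.mean(np.abs(pred - test))))
```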


== VLM ==

Title: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Authors: Lihe Yang, Bingyi Kang, Zilong Huang

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10891v1

Project: https://depth-anything.github.io|

GitHub: https://github.com/LiheYoung/Depth-Anything.|

Abstract: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including on six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
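
Training a monocular depth model on heterogeneous pseudo-labels typically relies on an affine-invariant objective. The sketch below follows the common MiDaS-style scale-and-shift-invariant loss as an illustration; the exact loss used in Depth Anything may differ in detail.

```python
# Scale-and-shift-invariant (affine-invariant) depth regression loss sketch.
import torch

def affine_invariant_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: (B, H, W) depth/disparity maps
    def normalize(d):
        b = d.flatten(1)
        t = b.median(dim=1, keepdim=True).values            # per-image shift
        s = (b - t).abs().mean(dim=1, keepdim=True) + 1e-6   # per-image scale
        return ((b - t) / s).view_as(d)
    return (normalize(pred) - normalize(target)).abs().mean()

pred = torch.rand(2, 64, 64)
target = 3.0 * pred + 0.5              # same structure, different scale and shift
print(affine_invariant_loss(pred, target))   # ~0: the loss ignores affine changes
```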


Title: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

Authors: Xiang Li, Varun Belagali, Jinghuan Shang

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2307.01849v3

Project: https://youtu.be/9deKHueZBuk|

GitHub: https://github.com/LostXine/crossway_diffusion|

Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.
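
The training objective described in the abstract, a denoising loss plus an auxiliary state-reconstruction loss, can be sketched as follows. The tiny linear modules, the toy corruption rule, and the 0.1 weighting are placeholders, not the paper's architecture, noise schedule, or hyperparameters.

```python
# Joint denoising + state-reconstruction objective, schematic only.
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    def __init__(self, obs_dim=32, act_dim=7, hid=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hid)
        self.denoiser = nn.Linear(hid + act_dim + 1, act_dim)   # predicts the noise
        self.state_decoder = nn.Linear(hid, obs_dim)            # auxiliary SSL head

    def forward(self, obs, noisy_action, t):
        h = torch.relu(self.encoder(obs))
        eps_hat = self.denoiser(torch.cat([h, noisy_action, t], dim=-1))
        obs_recon = self.state_decoder(h)
        return eps_hat, obs_recon

model = TinyPolicy()
obs, action = torch.randn(16, 32), torch.randn(16, 7)
t = torch.rand(16, 1)
noise = torch.randn_like(action)
noisy_action = action + t * noise                 # toy corruption, not a real schedule

eps_hat, obs_recon = model(obs, noisy_action, t)
diffusion_loss = nn.functional.mse_loss(eps_hat, noise)
ssl_loss = nn.functional.mse_loss(obs_recon, obs)  # reconstruct the input state
loss = diffusion_loss + 0.1 * ssl_loss             # jointly optimized
loss.backward()
```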


Title: AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed and Low Tolerance

Authors: Joao P. C. Bertoldo, Dick Ameln, Ashwin Vaidya

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.01984v2

Project: https://summerofcode.withgoogle.com/archive/2023/projects/SPMopugd|

GitHub: https://github.com/jpcbertoldo/aupimo.|

Abstract: Recent advances in visual anomaly detection research have seen AUROC and AUPRO scores on public benchmark datasets such as MVTec and VisA converge towards perfect recall, giving the impression that these benchmarks are near-solved. However, high AUROC and AUPRO scores do not always reflect qualitative performance, which limits the validity of these metrics in real-world applications. We argue that the artificial ceiling imposed by the lack of an adequate evaluation metric restrains progression of the field, and it is crucial that we revisit the evaluation metrics used to rate our algorithms. In response, we introduce Per-IMage Overlap (PIMO), a novel metric that addresses the shortcomings of AUROC and AUPRO. PIMO retains the recall-based nature of the existing metrics but introduces two distinctions: the assignment of curves (and respective area under the curve) is per-image, and its X-axis relies solely on normal images. Measuring recall per image simplifies instance score indexing and is more robust to noisy annotations. As we show, it also accelerates computation and enables the usage of statistical tests to compare models. By imposing low tolerance for false positives on normal images, PIMO provides an enhanced model validation procedure and highlights performance variations across datasets. Our experiments demonstrate that PIMO offers practical advantages and nuanced performance insights that redefine anomaly detection benchmarks, notably challenging the perception that MVTec AD and VisA datasets have been solved by contemporary models. Available on GitHub: https://github.com/jpcbertoldo/aupimo.
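
The abstract's two design choices, thresholds defined by false-positive rates on normal images only and recall measured per anomalous image, can be illustrated with synthetic scores. This is a rough sketch of the idea, not the official metric; consult the linked aupimo repository for the reference implementation.

```python
# Per-image recall at thresholds derived from normal-image false-positive rates.
import numpy as np

rng = np.random.default_rng(0)
normal_scores = rng.normal(0.0, 1.0, size=(20, 64 * 64))     # per-pixel anomaly scores
anom_scores = rng.normal(0.0, 1.0, size=(5, 64 * 64))
anom_masks = rng.random((5, 64 * 64)) < 0.05                  # ground-truth defect pixels
anom_scores[anom_masks] += 3.0                                # defects score higher

def threshold_at_fpr(normal_scores, fpr):
    # Shared threshold such that a fraction `fpr` of all normal pixels exceed it.
    return np.quantile(normal_scores.ravel(), 1.0 - fpr)

def per_image_recall(scores, masks, thr):
    preds = scores >= thr
    return [(p & m).sum() / m.sum() for p, m in zip(preds, masks)]

for fpr in (1e-2, 1e-3):                    # low tolerance for false positives
    thr = threshold_at_fpr(normal_scores, fpr)
    recalls = per_image_recall(anom_scores, anom_masks, thr)
    print(f"FPR={fpr:.0e}  mean per-image recall={np.mean(recalls):.3f}")
```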


Title: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Authors: David Junhao Zhang, Dongxu Li, Hung Le

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01827v1

Project: https://showlab.github.io/Moonshot/|

GitHub: https://github.com/salesforce/LAVIS.|

Abstract: Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over the visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatiotemporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without needing extra training overhead, in contrast to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
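
A decoupled cross-attention layer of the kind described for the MVB can be sketched with two standard attention modules, one over text tokens and one over image-condition tokens, whose outputs are combined with the video tokens. Dimensions and the plain MultiheadAttention modules are illustrative assumptions, not Moonshot's actual implementation.

```python
# Decoupled cross-attention sketch: separate text and image conditioning paths.
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, text_tokens, image_tokens, image_scale=1.0):
        text_out, _ = self.text_attn(video_tokens, text_tokens, text_tokens)
        img_out, _ = self.image_attn(video_tokens, image_tokens, image_tokens)
        return video_tokens + text_out + image_scale * img_out  # residual update

layer = DecoupledCrossAttention()
video = torch.randn(2, 16 * 8, 256)   # (batch, frames * patches, dim)
text = torch.randn(2, 77, 256)        # text-encoder tokens
image = torch.randn(2, 16, 256)       # tokens from the conditioning image
print(layer(video, text, image).shape)
```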


Title: SVGDreamer: Text Guided SVG Generation with Diffusion Model

Authors: Ximing Xing, Haitao Zhou, Chuang Wang

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2312.16476v2

Project: https://ximinng.github.io/SVGDreamer-project/

Abstract: Recently, text-guided scalable vector graphics (SVG) synthesis has shown promise in domains such as iconography and sketching. However, existing text-to-SVG generation methods lack editability and struggle with visual quality and result diversity. To address these limitations, we propose a novel text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer incorporates a semantic-driven image vectorization (SIVE) process that enables the decomposition of synthesis into foreground objects and background, thereby enhancing editability. Specifically, the SIVE process introduces attention-based primitive control and an attention-mask loss function for effective control and manipulation of individual elements. Additionally, we propose a Vectorized Particle-based Score Distillation (VPSD) approach to tackle the challenges of color over-saturation, vector primitive over-smoothing, and limited result diversity in existing text-to-SVG generation methods. Furthermore, on the basis of VPSD, we introduce Reward Feedback Learning (ReFL) to accelerate VPSD convergence and improve aesthetic appeal. Extensive experiments have been conducted to validate the effectiveness of SVGDreamer, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. The code and demo of SVGDreamer can be found at https://ximinng.github.io/SVGDreamer-project/.


Title: ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Authors: Yaozong Zheng, Bineng Zhong, Qihua Liang

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01686v1

GitHub: https://github.com/GXNU-ZhongLab/ODTrack|

Abstract: Online contextual reasoning and association across consecutive video frames are critical to perceiving instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for inference in the next video frame, whereby past information is leveraged to guide future inference; 2) complex online update strategies are effectively avoided by the iterative propagation of token sequences, so we can achieve more efficient model representation and computation. ODTrack achieves new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.


== diffusion model ==

Title: VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun

PubTime: 2024-01-17

Downlink: http://arxiv.org/abs/2401.09047v1

Project: https://ailab-cvc.github.io/videocrafter;|

GitHub: https://github.com/AILab-CVC/VideoCrafter|

Abstract: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
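
The disentangled finetuning step described in the abstract (freeze temporal modules, update only spatial modules on high-quality images) amounts to selecting parameters before building the optimizer. The name-based selection below assumes module names contain "temporal", which is an illustrative convention rather than the actual VideoCrafter2 code.

```python
# Freeze temporal modules and finetune only spatial ones, schematic only.
import torch
import torch.nn as nn

class ToyVDMBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.spatial_conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.temporal_conv = nn.Conv1d(dim, dim, 3, padding=1)

model = nn.ModuleList([ToyVDMBlock() for _ in range(4)])

for name, param in model.named_parameters():
    param.requires_grad = "temporal" not in name     # freeze temporal modules

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
print(trainable)   # only spatial parameters get finetuned on high-quality images
```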


Title: InstantID: Zero-shot Identity-Preserving Generation in Seconds

Authors: Qixun Wang, Xu Bai, Haofan Wang

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2401.07519v1

Project: https://instantid.github.io/|

GitHub: https://github.com/InstantID/InstantID.|

Abstract: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our code and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.


Title: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

Authors: Xiang Li, Varun Belagali, Jinghuan Shang

PubTime: 2024-01-11

Downlink: http://arxiv.org/abs/2307.01849v3

Project: https://youtu.be/9deKHueZBuk|

GitHub: https://github.com/LostXine/crossway_diffusion|

Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.


Title: ControlDreamer: Stylized 3D Generation with Multi-View ControlNet

Authors: Yeongtak Oh, Jooyoung Choi, Yongsung Kim

PubTime: 2024-01-05

Downlink: http://arxiv.org/abs/2312.01129v2

Project: https://controldreamer.github.io/|

Abstract: Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in generating 3D models with creative geometry and styles. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate research on diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by human evaluations and CLIP score metrics.


Title: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Authors: David Junhao Zhang, Dongxu Li, Hung Le

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01827v1

Project: https://showlab.github.io/Moonshot/|

GitHub: https://github.com/salesforce/LAVIS.|

Abstract: Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over the visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatiotemporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without needing extra training overhead, in contrast to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.


Title: CoMoSVC: Consistency Model-based Singing Voice Conversion

Authors: Yiwen Lu, Zhen Ye, Wei Xue

PubTime: 2024-01-03

Downlink: http://arxiv.org/abs/2401.01792v1

Project: https://comosvc.github.io/.|

Abstract: Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference speed, and acceleration thus becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method, which aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first specially designed for SVC, and a student model is further distilled under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA GTX4090 GPU reveal that although CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system, it still achieves comparable or superior conversion performance based on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
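
Consistency distillation, the mechanism behind the one-step sampling, can be sketched as enforcing agreement between the student's outputs at two adjacent points on the teacher's denoising trajectory, with an EMA copy of the student providing the target. All networks, the noising rule, and the teacher step below are toy placeholders, not the CoMoSVC models.

```python
# Toy consistency-distillation objective for one-step sampling.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(81, 128), nn.ReLU(), nn.Linear(128, 80))
ema_student = copy.deepcopy(student).requires_grad_(False)   # target network

def consistency_fn(net, x_t, t):
    return net(torch.cat([x_t, t], dim=-1))          # predicts the clean sample

def teacher_ode_step(x_t, t_hi, t_lo):
    # Placeholder for one step of the pretrained diffusion teacher's ODE solver.
    return x_t * (t_lo / t_hi)

x0 = torch.randn(32, 80)                              # e.g. mel-spectrogram frames
t_hi, t_lo = torch.full((32, 1), 0.8), torch.full((32, 1), 0.6)
noise = torch.randn_like(x0)
x_hi = x0 + t_hi * noise                              # toy noising, not a real schedule
x_lo = teacher_ode_step(x_hi, t_hi, t_lo)             # adjacent point on the trajectory

loss = nn.functional.mse_loss(
    consistency_fn(student, x_hi, t_hi),
    consistency_fn(ema_student, x_lo, t_lo).detach()) # self-consistency target
loss.backward()
```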


== VLN ==

Title: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Authors: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13311v1

Project: https://con-textual.github.io/|

Abstract: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping, and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide a qualitative analysis which offers a robust framework for future advancements in LMM design. https://con-textual.github.io/


Title: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

Authors: Mingyang Li, Yue Ma, Qinru Qiu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13076v1

GitHub: https://github.com/Leomingyangli/SemanticSLAM|

Abstract: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity and hence require frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn't require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM
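
Since the abstract attributes the gradual map refinement to a ConvLSTM, a generic ConvLSTM cell is sketched below. This is a standard textbook-style implementation for illustration only, not the SemanticSLAM network, which additionally fuses observations into the map frame.

```python
# Generic ConvLSTM cell that accumulates a spatial memory over time steps.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g                      # update the cell (map) memory
        h = o * c.tanh()
        return h, c

cell = ConvLSTMCell(in_ch=8, hid_ch=16)
obs = torch.randn(1, 8, 32, 32)                # per-step semantic observation grid
h = c = torch.zeros(1, 16, 32, 32)             # running map memory
for _ in range(5):                             # refine the map over several steps
    h, c = cell(obs, (h, c))
print(h.shape)
```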


Title: ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Authors: Dong An, Hanqing Wang, Wenguan Wang

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2304.03047v3

GitHub: https://github.com/MarSaKi/ETPNav

Abstract: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.


Title: Multimotion Visual Odometry (MVO)

Authors: Kevin M. Judd, Jonathan D. Gammell

PubTime: 2024-01-15

Downlink: http://arxiv.org/abs/2110.15169v3

Project: https://www.youtube.com/watch?v=mNj3s1nf-6A|https://www.youtube.com/playlist?list=PLbaQBz4TuPcxMIXKh5Q80s0N9ISezFcpi|

Abstract: Visual motion estimation is a well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation in highly dynamic environments. These environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Estimating third-party motions simultaneously with the sensor egomotion is difficult because an object's observed motion consists of both its true motion and the sensor motion. Most previous works in multimotion estimation simplify this problem by relying on appearance-based object detection or application-specific motion constraints. These approaches are effective in specific applications and environments but do not generalize well to the full multimotion estimation problem (MEP). This paper presents Multimotion Visual Odometry (MVO), a multimotion estimation pipeline that estimates the full SE(3) trajectory of every motion in the scene, including the sensor egomotion, without relying on appearance-based information. MVO extends the traditional visual odometry (VO) pipeline with multimotion segmentation and tracking techniques. It uses physically founded motion priors to extrapolate motions through temporary occlusions and identify the reappearance of motions through motion closure. Evaluations on real-world data from the Oxford Multimotion Dataset (OMD) and the KITTI Vision Benchmark Suite demonstrate that MVO achieves good estimation accuracy compared to similar approaches and is applicable to a variety of multimotion estimation challenges.


Title: Learning Interactive Real-World Simulators

Authors: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour

PubTime: 2024-01-13

Downlink: http://arxiv.org/abs/2310.06114v2

Project: https://universal-simulator.github.io

Abstract: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls such as "move by x, y" from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world zero-shot after training purely in simulation. We also show that other types of intelligence, such as video captioning models, can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.


Title: Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

Authors: Beilei Cui, Mobarakol Islam, Long Bai

PubTime: 2024-01-12

Downlink: http://arxiv.org/abs/2401.06013v2

GitHub: https://github.com/BeileiCui/SurgicalDINO.|

Abstract: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although foundation models exhibit outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed their limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt to surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on the MICCAI challenge dataset SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies shows evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO sheds some light on the successful adaptation of foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on weights pre-trained on computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.
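
The adapter recipe summarized in the abstract (freeze the pretrained encoder, add trainable low-rank updates to selected linear layers, and train only the LoRA parameters plus a small depth head) can be sketched as follows. The toy MLP stands in for DINOv2; loading the real backbone and the actual depth decoder are out of scope here.

```python
# LoRA injection into a frozen encoder, with only adapters and a head trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

encoder = nn.Sequential(nn.Linear(384, 384), nn.GELU(), nn.Linear(384, 384))
for p in encoder.parameters():
    p.requires_grad = False                               # freeze the whole encoder
encoder[0] = LoRALinear(encoder[0])                       # adapt selected layers
encoder[2] = LoRALinear(encoder[2])
depth_head = nn.Linear(384, 1)                            # trainable decoder head

trainable = [n for n, p in nn.ModuleList([encoder, depth_head]).named_parameters()
             if p.requires_grad]
print(trainable)    # only LoRA A/B matrices and the depth head are updated
```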


Subscribe to papers in your field of interest

Follow {晓理紫|小李子} for daily paper updates. If you find this useful, please forward it to others who may need it. Thank you for your support, and thank you for your suggestions.

If you find this helpful, please follow me; the latest papers will be pushed to you on time every day.
