Preface: UT Austin Villa has been the steady world champion of the RoboCup 3D Simulation League in recent years. After each championship they publish one or two papers describing their progress, and these papers follow a fixed template. First comes the Introduction, recounting how many titles they have won recently, which you can skim. Then the Domain Description, restating the RoboCup 3D simulation environment, which you can also skim. Then Changes for 20xx, which explains how that year's improvements were implemented and is the part to read closely. Finally, Main Competition Results and Analysis and Technical Challenges showcase their results and can be skimmed.
In one sentence: reading Changes for 20xx is enough.
Below I give an interpretive translation of the parts we are likely to need when reproducing their work.
For the parts of the papers I did not understand, I emailed their author, Professor Peter Stone of the University of Texas at Austin. He forwarded my questions to the team's lead, postdoctoral researcher Dr. Patrick MacAlpine, who gave very patient and detailed answers. Many thanks to both of them; the exchange also made me appreciate the huge gap between us and the world champions (for example, this post's original title used 品读, roughly "critique", which I have since changed to 拜读, "read with respect").
Note: this post is only annotations and interpretation; you should still read the original papers several times yourself.
The main content of the papers, and my understanding of some parts, follows:
One significant change for the 2019 RoboCup 3D Simulation League competition was penalizing self-collisions. While the simulator’s physics model can detect and simulate self-collisions—when a robot’s body part such as a leg or arm collides with another part of its own body—having the physics model try to process and handle the large number of self-collisions occurring during games often leads to instability in the simulator causing it to crash. To preserve stability of the simulator self-collisions are purposely ignored by the physics model. However, not modeling self-collisions can result in robots performing physically impossible motions such as one leg passing through the other when kicking the ball.

In order to discourage teams from having robots with self-colliding behaviors, a new feature was added to the simulator this year to detect and penalize self-collisions when they happen. This feature signals a self-collision as having occurred if two body parts of a robot overlap by more than 0.04 meters, and then all joints in any arm or leg of the robot involved in the self-collision are frozen and not allowed to move for one second. Freezing the joints in an arm or leg that has started to collide with another body part is an approximation of the physics model preventing body parts from moving through each other, and also detracts from the performance of the robot due to its limb being “numb” and immobile. After the second passes, the joints are unfrozen, and the robot is allowed to move its self-colliding body parts for two seconds without any self-collisions being reported. This two second period, during which previously collided body parts are no longer penalized and frozen for self-collisions, allows a robot time to reposition its body to no longer have a self-collision.
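The freeze/grace cycle above can be sketched as a small state tracker. The 0.04 m threshold and the 1 s / 2 s timings come from the text; the class and method names are hypothetical and not from the actual simulator source.

```python
# Illustrative sketch of the self-collision penalty rules described above.
# Thresholds and timings are from the text; all names here are hypothetical,
# not taken from the real simulator (SimSpark) code.

OVERLAP_THRESHOLD = 0.04   # meters of overlap that counts as a self-collision
FREEZE_DURATION   = 1.0    # seconds the colliding limb's joints stay frozen
GRACE_DURATION    = 2.0    # seconds after unfreezing with no penalty reported

class LimbPenaltyTracker:
    """Tracks one limb (arm or leg) through the freeze -> grace cycle."""

    def __init__(self):
        self.frozen_until = -1.0   # simulation time when the freeze ends
        self.grace_until  = -1.0   # simulation time when the grace period ends

    def report_overlap(self, overlap, now):
        """Call each step with the limb's current overlap (meters).
        Returns True if the limb's joints must be frozen at time `now`."""
        if now < self.frozen_until:                    # still frozen
            return True
        if overlap > OVERLAP_THRESHOLD and now >= self.grace_until:
            # New self-collision: freeze for 1 s, then allow 2 s of grace.
            self.frozen_until = now + FREEZE_DURATION
            self.grace_until  = self.frozen_until + GRACE_DURATION
            return True
        return False
```

For example, a limb that overlaps at t=0 is frozen until t=1, unpenalized until t=3, and penalized again if it still overlaps after that.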
A pass play mode was added:
A player may initiate the pass play mode as long as the following conditions are all met:
– The current play mode is PlayOn.
– The agent is within 0.5 meters of the ball.
– No opponents are within a meter of the ball.
– The ball is stationary as measured by having a speed no greater than 0.05 meters per second.
– At least three seconds have passed since the last time a player’s team has been in pass mode.
Once pass mode for a team has started the following happens:
– Players from the opponent team are prevented from getting within a meter of the ball.
– The pass play mode ends as soon as a player touches the ball or four seconds have passed.
– After pass mode has ended the team who initiated the pass mode is unable to score for ten seconds—this prevents teams from trying to take a shot on goal out of pass mode.
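The five activation conditions can be written as one predicate. This is a minimal sketch; the function name, the `(x, y)` position representation, and the `"PlayOn"` string are my assumptions, not the actual agent API.

```python
# Sketch of the pass-mode activation conditions listed above.
# All names and the state representation are hypothetical.
import math

def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def can_initiate_pass_mode(play_mode, agent_pos, ball_pos, ball_speed,
                           opponent_positions, time_since_last_pass_mode):
    """True iff all five conditions from the rules are met."""
    return (play_mode == "PlayOn"
            and dist(agent_pos, ball_pos) <= 0.5           # within 0.5 m of ball
            and all(dist(o, ball_pos) >= 1.0               # no opponent within 1 m
                    for o in opponent_positions)
            and ball_speed <= 0.05                         # ball stationary
            and time_since_last_pass_mode >= 3.0)          # 3 s team cooldown
```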
By playing several thousand games against various other teams, they recorded which motions and which player numbers were involved whenever a self-collision occurred.
The strategies they adopted are as follows:
1. Arm adjustment: roughly half of the self-colliding kicking motions involve the arms, while the legs do the main work of a kick, so the self-collision can be avoided by adjusting the arms' joint angles without changing the original kicking motion.
When a self-collision occurs, the simulator reports which body parts of a robot collided with each other. For kicking skills the body parts that matter the most are those in the legs, so if a robot’s arm is involved in a self-collision the arm’s movement can probably be adjusted without affecting the kicking motion. Roughly half the kicking skills that had self-collisions involved the robots’ arms in the self-collisions, so we were able to manually adjust the arms’ joint angle positions to no longer self-collide while still exhibiting the same kicking motion through the ball.
2. Re-optimize the currently colliding motion: in many cases it is hard to remove a self-collision from a motion by hand tuning, so instead the skill is re-optimized with CMA-ES using the current motion as the starting point, adding a large penalty to the agent's fitness whenever a self-collision occurs.
In many cases it is not easy to hand adjust the motions of a skill to avoid a self-collision as doing so fundamentally changes the performance of the skill (e.g. adjusting the position of the legs of a robot for a kicking skill when the robot’s legs self-collide). Instead of trying to fix things by hand, the current skill can be relearned with CMA-ES using the current self-colliding behavior as a starting point for learning, while also adding a large penalty value to the fitness of an agent if it has any self-collisions while performing the optimization task it is trying to learn.
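The penalized-fitness idea can be sketched as follows. A simple stochastic hill-climber stands in for full CMA-ES so the example stays self-contained, and `evaluate_kick`, `had_self_collision`, and the penalty value of 100 are all stand-ins of my own (the paper says only "a large penalty value").

```python
# Sketch of relearning a skill with a large self-collision penalty added to
# the fitness. A stochastic hill-climber stands in for CMA-ES here;
# evaluate_kick() and had_self_collision() are stand-ins for running the
# optimization task in the simulator.
import random

SELF_COLLISION_PENALTY = 100.0   # "large penalty value"; exact value not given

def fitness(params, evaluate_kick, had_self_collision):
    score = evaluate_kick(params)          # e.g. -distance(ball, target)
    if had_self_collision(params):
        score -= SELF_COLLISION_PENALTY    # colliding behaviors rank last
    return score

def relearn(seed_params, evaluate_kick, had_self_collision,
            iters=200, sigma=0.1, rng=random.Random(0)):
    """Start from the current (possibly self-colliding) behavior and climb."""
    best = list(seed_params)
    best_f = fitness(best, evaluate_kick, had_self_collision)
    for _ in range(iters):
        cand = [p + rng.gauss(0.0, sigma) for p in best]
        f = fitness(cand, evaluate_kick, had_self_collision)
        if f > best_f:
            best, best_f = cand, f
    return best, best_f
```

Because the penalty only changes fitness values, not the optimizer, the same trick drops into a real CMA-ES run unchanged.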
3. If the current motion contains many self-collisions, the optimization may fail to find a collision-free motion when starting from it; in that case, start the optimization from a similar motion instead.
If the previous strategy does not work—possibly because the current behavior has too many self-collisions such that it is hard to find a behavior that does not have self-collisions when using the current self-colliding behavior as a starting point—one can instead attempt to learn using a similar related skill (e.g. similar distance kick) that has fewer collisions as a starting point for learning.
Some skills have infrequent enough self-collisions that they do not always occur during a learning trial, but still experience a significant number of self-collisions during games. It can be especially hard to reduce the number of self-collisions for skills when self-collisions are not always detected during learning. As a way to decrease the chance of the robot assuming body positions that are right on the border of having self-collisions, one can decrease the allowed amount of overlap between body parts in the simulator before a self-collision is considered to have occurred. By decreasing the amount of allowed overlap between body parts during learning it is less likely that a learned behavior will have self-collisions exceeding the actual allowed amount of overlap.
Heuristics for when to use pass mode:
1. Only activate pass mode when an opponent is within 1.25 meters of the ball. Activating pass mode before the opponent is close is unnecessary as the opponent is not yet a threat to interfere with a kick, and the later pass mode is activated the later it will time out leaving more time to kick the ball before pass mode eventually ends.
2. Do not activate pass mode when a player is close enough to the opponent's goal to shoot and score directly; otherwise the team has to wait 10 s before it can score.
Do not use pass mode when a player is close enough to take a shot on goal and score. Goals cannot be scored for ten seconds after pass mode ends, so it is better to attempt a shot and try to score than to pass the ball and then have to wait ten seconds to score.
3. When the player is not yet behind the ball, use pass mode even if the player is close enough to the opponent's goal to shoot directly: walking from in front of the ball around to the kicking position behind it takes time, and without pass mode an opponent could threaten the kick in the meantime.
Do use pass mode if a player is not behind the ball even if the player is close enough to the opponent’s goal to take a shot and score. The player will have to take some time to walk around the ball to get in position to take a shot, and at that point it is likely the opponent will have gotten close enough to the ball to interfere with a potential shot.
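The three heuristics combine into one small decision function. This is my own condensation; the predicate names (`can_shoot_and_score`, `agent_behind_ball`) are hypothetical stand-ins for the team's actual world-model queries.

```python
# Sketch combining the three pass-mode heuristics above.
# All parameter names are hypothetical stand-ins for world-model queries.

def should_use_pass_mode(opponent_dist_to_ball, can_shoot_and_score,
                         agent_behind_ball):
    if opponent_dist_to_ball > 1.25:
        return False   # heuristic 1: no nearby threat yet, don't bother
    if can_shoot_and_score and agent_behind_ball:
        return False   # heuristic 2: shoot now, avoid the 10 s no-goal window
    return True        # heuristic 3: protect the ball while repositioning
```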
Main progress in 2018:
During the kick optimizations, a penalty is incurred when:
1. The robot falls over.
2. The robot walks too far and touches the ball, or stops short and misses it.
3. The kick takes too long (more than 12 s without touching the ball) and times out.
When a penalty is incurred, the fitness is set to the same value as if the ball had not moved at all. Since CMA-ES uses only the rank ordering of the fitness values within a training run, the relative error between the fitness values of different kicks has no effect.
Number of generations: 300;
Individuals per generation: 300;
Optimization result: fitness > -1, i.e. the ball's final position is on average less than one meter from the target point.
Optimization order: first, seed with the existing long-distance kick parameters to learn a good set of long-distance parameters; then reduce the target kick distance step by step, each time seeding with the parameters from the previous run. For example: use the 19 m parameters as the seed to optimize an 18 m kick, then the 18 m parameters as the seed for 17 m, and so on.
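The seeding chain is simple to express in code. Here `optimize_kick` is a stand-in for a full CMA-ES run (300 generations, population 300) against one target distance; the function and variable names are my own.

```python
# Sketch of the seeding chain described above: learn the longest kick first,
# then reuse each learned parameter set as the seed for the next shorter
# distance. optimize_kick(seed, distance) stands in for a full CMA-ES run.

def learn_kick_family(optimize_kick, long_kick_seed, distances):
    """distances: target kick distances, longest first, e.g. [19, 18, 17, ...]."""
    params_by_distance = {}
    seed = long_kick_seed
    for d in distances:
        seed = optimize_kick(seed, d)   # previous result seeds the next run
        params_by_distance[d] = seed
    return params_by_distance
```

The point of the chain is that a good 19 m kick is already close, in parameter space, to a good 18 m kick, so each run starts near a solution.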
How the dataset was obtained: let S be a dataset of size m: {(x^i, y^i)}_{i=1}^m. Each input x^i is a 49-dimensional feature vector representing the game state: the play mode, the coordinates of the 22 players, the ball's coordinates, and a candidate pass location (my understanding: 1 dimension for the play mode, 44 for the x and y coordinates of the 22 players, 2 for the ball's x and y, and 2 for the candidate pass location's x and y, totaling 49). Each output y^i is a scalar in [0, 1] representing the value of that candidate pass location ("score" would be a more apt translation). During data collection, the field is first reset to the game state specified by x^i, and y^i is determined over 10 repetitions: in each repetition a reward of +1 is given if a goal is scored within 20 s and 0 otherwise, and y^i is the average of the 10 rewards. Clearly, for any given placement of the players and the ball there are many valid pass locations, so one placement yields many training examples. (Here, a valid pass location is one within 20 m of the ball's initial position and inside the field.)
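Under my 1 + 44 + 2 + 2 reading of the feature layout, assembling x^i looks like the sketch below. Encoding the play mode as a single number is my assumption, not something the paper confirms.

```python
# Sketch of assembling the 49-dimensional input x^i described above:
# 1 (play mode) + 44 (x, y of 22 players) + 2 (ball) + 2 (candidate pass
# target). The single-number play-mode encoding is an assumption.

def build_features(play_mode_id, player_positions, ball_pos, pass_target):
    """player_positions: list of 22 (x, y) tuples."""
    assert len(player_positions) == 22
    x = [float(play_mode_id)]
    for px, py in player_positions:
        x.extend([px, py])
    x.extend([ball_pos[0], ball_pos[1]])
    x.extend([pass_target[0], pass_target[1]])
    assert len(x) == 49
    return x
```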
In addition, the dataset was improved as follows:
1. The network's inputs are normalized: specifically, the player coordinates fed to the network are ordered from left to right along the field's x-axis.
2. Symmetry is ensured by preprocessing the data: specifically, if the ball's y coordinate is negative, all y coordinates are negated so that the ball's y coordinate fed to the network is always positive. This is equivalent to only ever considering the ball in the upper half of the field, which halves the number of possible situations and speeds up convergence.
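The two preprocessing steps can be sketched together. Whether the two teams are sorted separately or jointly is not stated, so the joint sort here is my assumption.

```python
# Sketch of the two dataset normalizations described above: order player
# coordinates by x (left to right), and mirror the field across the x-axis
# whenever the ball's y coordinate is negative. Sorting all 22 players
# jointly (rather than per team) is an assumption.

def normalize_state(player_positions, ball_pos, pass_target):
    players = sorted(player_positions, key=lambda p: p[0])  # left-to-right
    if ball_pos[1] < 0:
        # Flip every y so the ball always sits in the top half of the field.
        players = [(x, -y) for x, y in players]
        ball_pos = (ball_pos[0], -ball_pos[1])
        pass_target = (pass_target[0], -pass_target[1])
    return players, ball_pos, pass_target
```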
Details of optimizing the pass evaluation: first a suitable network size must be chosen. Two factors constrain the choice: whether the network overfits, and whether it can finish its computation within 0.02 s.
The table below lists, for each network size, the average computation time, the maximum computation time, and the maximum number of dropped messages; times are in milliseconds. (The dropped-message count is the number of messages lost in the server-agent communication.)
UT chose the third configuration in the table above.
The training details for configuration 3 are as follows:
Once training is complete, the network can, at any moment, compute a score for each candidate pass location from the current game state, and the robot passes toward the highest-scoring location.
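The inference step then reduces to scoring candidates and taking the argmax. In this sketch, `score_fn` stands in for the trained network's forward pass; the candidate grid follows the "within 20 m of the ball and inside the field" definition given earlier, and the 30 m x 20 m field dimensions and 2 m grid step are my assumptions.

```python
# Sketch of the inference step: score every candidate pass location with the
# trained network and pass toward the best one. score_fn stands in for the
# network's forward pass; field size and grid step are assumptions.
import math

FIELD_X, FIELD_Y = 15.0, 10.0    # half-field extents (assumed 30 m x 20 m field)

def valid_candidates(ball_pos, step=2.0, max_dist=20.0):
    """Grid of pass targets within 20 m of the ball and inside the field."""
    out = []
    bx, by = ball_pos
    x = -FIELD_X
    while x <= FIELD_X:
        y = -FIELD_Y
        while y <= FIELD_Y:
            if math.hypot(x - bx, y - by) <= max_dist:
                out.append((x, y))
            y += step
        x += step
    return out

def best_pass_target(score_fn, game_state, candidates):
    """Return the candidate pass location with the highest predicted score."""
    return max(candidates, key=lambda c: score_fn(game_state, c))
```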
So much for the paper itself. After reading it I had one question:
When collecting the dataset, we first fix x^i as the input; next, we need to construct an environment on the RoboCup 3D simulation platform according to x^i and design a policy to test whether a goal is scored within 20 seconds; finally, y^i is obtained from the test results.
What I did not understand is: in the second step above, how is that policy designed?
Dr. Patrick MacAlpine's detailed reply (a translation would lose nuance, so the original is pasted below):
Main progress in 2017:
Main progress in 2016:
Main progress in 2015:
Appendix: the URL for all of UT's publications
Worth mentioning:
Essential reading for getting started with RoboCup 3D simulation: the user manual; for the communication parts, see "Introduction to the RoboCup 3D Simulation Low-Level Communication Modules" (Part 1) and (Part 2).
Essential reading for getting hands-on with RoboCup 3D simulation: the detailed introduction to UT's open-source base code.