The Saga of Ryzom is a persistent massively multiplayer online role-playing game (MMORPG) released in September 2004 throughout Europe and North America, localised in three languages so far. It was developed by Nevrax from 2000, and was taken over by Gameforge in late 2006.
The Nevrax team built Ryzom from scratch, starting with the Nevrax Library (NeL), made available as free software under the GNU General Public License (GPL). On top of NeL, a server technology was created to handle an immersive virtual world. This article focuses on the network techniques developed to display a smooth scene of moving entities, whose dynamic properties are propagated to all connected end-user clients having an avatar in the same virtual area.
I would like to thank Daniel Miller, CTO of Nevrax and Gameforge France, for his enlightening contributions and reviews of this article.
Our client software had to display an animated 3D scene showing moving, animated entities. Players would have direct control over their avatars, as point & click control (the alternative usually found in 2D or ¾ perspective games, and sometimes even in 3D games) was considered not immersive enough. While the role-playing genre usually does not require input as fast as first-person shooters (where the first player to shoot is the one who survives) or sports games (where synchronization is very important for collective actions), movements had to be consistent and smooth.
We wanted visible micro-teleportations, or position jolts, to be less frequent than in other MMOGs, in which, at least at the time the Ryzom project started, entities often seemed to go back in time or jump forward. Another important feature was to let characters move in a seamless environment, without loading time between small predetermined geographical zones, during which the player would be idle.
In the present article, which explains the techniques we tried and the ones we eventually chose, a visual property refers to any dynamic state or attribute of a player or non-player entity that client software must know to render it, such as position and heading, 3D model identifier, clothes set, current animation, etc. We assume the client software has local access to static data such as 3D models, textures and animations. For most of the document we will focus on the propagation of position, because the continuous nature of position paths makes it highly noticeable when smooth propagation is not achieved.
In an online game, especially a massively multiplayer one, bandwidth constraints prevent trivially sending every position change to all players. At the planned time of the Ryzom launch, 56k modems were still widespread among players and ADSL was still in its infancy. Our MMORPG had to work nominally on a 56k or even 14.4k connection.
Besides, in an online game, cheating utterly destroys the game experience of its victims (the players put at a disadvantage) and must be prevented. Given the ever-popular adage "Don't trust the client", all the more relevant since Nevrax's founders intended to release the Ryzom client under a free software license, it quickly became clear that the client would be a display and input device, while all game logic would have to live on the server.
Handling dynamic game information on the server side would prevent the appearance of player-made radars, for instance. Moreover, once a peer-to-peer network solution was excluded, the bandwidth constraints originated not only from the low bandwidth available in consumers' homes, but also from the server farms we would have to rent.
MMORPG playing was known to come with lag, an unpleasant, noticeable delay between an action and its display, often visible as a stopped animation or an animation jerk. This lag phenomenon had to be minimized by design. It usually originated either from an inadequate information propagation system under tight bandwidth constraints on latency-prone networks such as the Internet, or from CPU-processing delays on the server.
Our work then had to focus on studying the known techniques of visual property propagation, and possibly creating better ones. The CPU performance of the software was also critical, to avoid time-consuming peaks.
Early distributed simulation projects, such as Distributed Interactive Simulation (DIS), dealt with moving objects having high-inertia movements. In this aircraft simulation context, uniform rectilinear movements are the norm, and variations from them occur gradually. It is then possible to replicate the movements by simulating a replica of an actuated entity and applying behaviour change events to it.
For example, if the actuated entity reduces its speed by s, an update is sent to the replica to make it reduce its speed by s as well. Of course, the update can take a non-negligible time to travel from the actuator to the observer. The position of the replica is extrapolated until the update is received, leading to a position correction; thus the actual path observed on the replica may differ slightly from the original one.
This positional inaccuracy may be lowered by including additional information in the update message: for example, a timestamp stating the point at which the speed change occurred will make the replicated path eventually rejoin the original one, although the path will differ during the temporary phase while the update message is in transit.
We then need an algorithm that blends the temporary inaccurate path with the corrected path; the result depends on the depth of the blending (using the history of previously received positions, for example [Bernier]).
Figure 1: Dead Reckoning: sending an update when the replicated path differs too much from the original path [Aronson]
To reduce the frequency of updates, we can send an update only when the original path and the replicated path diverge by a defined amount (Figure 1); it may then be necessary for the master process to run both the master entity and the replica in order to compare them. We now have a dead reckoning system (see the prototype screenshot in Figure 2), named after the ancient naval navigation technique used when the stars were not visible.
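The loop can be sketched as follows (a minimal C++ illustration; the structure, names and threshold are ours, not actual DIS or Ryzom code). The master process advances the real entity, extrapolates a local replica exactly as remote observers would, and emits an update when the two diverge too much:

#include <cmath>

struct State
{
    float x, y;   // position
    float vx, vy; // velocity
};

// Extrapolate a state by dt seconds, assuming uniform rectilinear movement.
State extrapolate(const State &s, float dt)
{
    State r = s;
    r.x += s.vx * dt;
    r.y += s.vy * dt;
    return r;
}

// Called every simulation step by the master process.
void masterTick(const State &original, State &replica, float dt, float threshold)
{
    replica = extrapolate(replica, dt);
    float dx = original.x - replica.x;
    float dy = original.y - replica.y;
    if (std::sqrt(dx * dx + dy * dy) > threshold)
    {
        // The replicated path differs too much: send an update and
        // snap the local replica to what observers will converge to.
        // sendUpdateToObservers(original); // hypothetical network call
        replica = original;
    }
}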
Figure 2: Nevrax's 2D Dead Reckoning prototype from early 2000, showing the original and replicated paths superimposed
However, in most MMORPGs the primary controlled avatar is a humanoid, known not for high-inertia movements but for erratic ones. Using Dead Reckoning would lead to frequent state correction updates, increasing both network traffic and visible jolts (because a quick position jump is sometimes needed when the blended replicated path differs too much from the original path).
Dead Reckoning may still be used for vehicles and for AI entities with steady movements (although overly steady movements for AI entities might not be wanted, because they feel unnatural). To allow for a large number of player-controlled entities, we focused on a different strategy: transmitting good old position updates, but simulating interpolated natural movement in a virtual time space and maintaining total control of the update frequency.
Yahn W. Bernier of Valve made a presentation about Half-Life & Team Fortress networking at the Game Developers Conference in 2000 [Bernier]. The idea was: instead of trying to predict the future (the principle of Dead Reckoning), why not make the past look like the present? When a client has received all updates for all entities in a scene, it is able to display an accurate and smooth animated view of the scene.
That is why a delay, called the Lag Compensation Time (LCT), is introduced between the real action and its display on the other side of the transmission pipeline. The higher the LCT, the more updates have been received by display time and the smoother the movements. Conversely, if an avatar reaches a position before the next position has been received from the server, it will have to stop and wait.
This results in jerky start-stop movement, especially if the position update rhythm varies, which is unavoidable over the Internet. The goal of the LCT is to ensure that this does not happen, yet the LCT needs to be as small as possible, as it represents a lag, or temporal inconsistency.
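To make the idea concrete, here is a minimal sketch (our own formulation, not Valve's or Ryzom's actual code) of a client that buffers timestamped positions and renders each entity LCT seconds in the past, interpolating between the two samples that bracket the render time:

#include <deque>

struct TimedPos { double t; float x, y; };

class InterpolatedEntity
{
public:
    // Called whenever a position update arrives from the server.
    void onUpdate(double t, float x, float y) { mBuf.push_back({t, x, y}); }

    // Position at (now - lct); assumes at least one sample was received.
    TimedPos sample(double now, double lct) const
    {
        double renderTime = now - lct;
        if (renderTime <= mBuf.front().t)
            return mBuf.front();
        for (size_t i = 1; i < mBuf.size(); ++i)
        {
            if (mBuf[i].t >= renderTime)
            {
                // Interpolate between the two samples bracketing renderTime.
                const TimedPos &a = mBuf[i - 1], &b = mBuf[i];
                double u = (renderTime - a.t) / (b.t - a.t);
                return { renderTime,
                         float(a.x + u * (b.x - a.x)),
                         float(a.y + u * (b.y - a.y)) };
            }
        }
        // Buffer ran dry: the entity has to stop and wait (the case a
        // well-chosen LCT is meant to avoid).
        return mBuf.back();
    }

private:
    std::deque<TimedPos> mBuf; // old samples are pruned elsewhere
};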
If two players put their screens side by side (Figure 3), they will obviously notice that their displays are shifted in time. It may be noticed in other situations too: for example, when two characters try to run together, each player will feel the other one is always behind, because one's own controlled character is not shifted in time (that would make for a terrible control experience!).
Figure 3: Lag Compensation Time and Interpolation
This raises the problem of interaction between objects displayed in the present time space (the player's avatar) and objects displayed in a past time space (remote characters, AI entities). One solution is to make the LCT vary according to the distance from the player's avatar. This idea is called temporal perception, presentation time or sometimes local perception filters, and comes from an analogy with the appearance of stars in the sky: the greater the distance, the longer light takes to reach us [Singhal-Zyda].
In any case, if we want more accurate movement around the player's avatar than at a distance, updates of entities in the immediate vicinity of the observer should have a higher priority than updates of distant entities. The goal is to minimize the perceived error magnitude: a foreground entity (such as a melee adversary) may be displayed in a strikingly wrong position with less than 1 metre of positional error, while an entity farther away may not look badly positioned despite a positional error of several metres, which may correspond to only a few pixels or less. As position updates are less frequent for distant entities, the required LCT is greater for them. A flexible LCT allows one to minimise the lag for nearby characters while still avoiding jerkiness for background characters.
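As a sketch, a distance-dependent LCT can be as simple as a linear ramp (the constants here are illustrative, not Ryzom's tuned values):

// "Temporal perception": the farther the entity, the further in the
// past it is displayed, allowing sparser updates without jerkiness.
double lagCompensationTime(double distanceToViewer)
{
    const double baseLct = 0.3;          // seconds, for close-range entities
    const double secondsPerMetre = 0.01; // extra display delay per metre
    return baseLct + secondsPerMetre * distanceToViewer;
}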
It means we need a way to control the frequency of updates of viewed entities.
Maintaining consistent times across several machines requires a synchronization mechanism. Common synchronization schemes compute a delta between the local time of the client machine and a reference time, and transmit timestamps relative to the reference time. However, we noticed that many consumer PCs had internal clocks running at slightly different speeds, leading to desynchronization.
Moreover, for our server applications we adopted a flexible time system based on "ticks" sent by a conductor service, which would increase the current game cycle at a rate sustainable by all server applications: if a service had a sudden increase in workload, all services would wait for it, avoiding a vicious circle of congestion. Client time synchronization was thus based on the average time between two received messages, assuming the server regularly sends a message to each client.
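A minimal sketch of such a client-side estimator follows (our formulation; the smoothing gain is an assumed value): the client smooths the interval between received messages rather than comparing machine clocks.

class ServerPaceEstimator
{
public:
    // Call on every message received from the server, with the local
    // reception time in seconds.
    void onMessageReceived(double localNow)
    {
        if (mHaveLast)
        {
            double interval = localNow - mLastArrival;
            // Exponentially-weighted moving average of the interval.
            mAvgInterval = (mAvgInterval == 0.0)
                               ? interval
                               : 0.9 * mAvgInterval + 0.1 * interval;
        }
        mLastArrival = localNow;
        mHaveLast = true;
    }

    // Estimated time between two server game cycles, as seen by this client.
    double averageInterval() const { return mAvgInterval; }

private:
    double mLastArrival = 0.0;
    double mAvgInterval = 0.0;
    bool   mHaveLast = false;
};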
Although everyone believed consumer bandwidth would probably increase a lot in the following years, keeping a low transfer rate had the advantage of keeping server bandwidth costs low. After all, running thousands of players on a single server (which is the essence of MMOGs) would certainly need fat Internet connections, which are inevitably expensive.
Transmitting position updates of a populated 3D scene in 13 kbit/s (the throttle that was eventually set) raised the following problems:
Which updates would have priority?
How could the update selection be kept CPU-efficient?
Several algorithms were tried. The principle of each is the same. For a particular observer, at a given time:
Determine which entities are in the neighbourhood of the observer, within the seamless environment, and communicate the changes to this list (called the "vision") to the client.
Associate a priority with each entity in this list, based on distance.
Iterate over the entities in priority order (highest to lowest), and compare their states with a copy of the state as known by the client. Where a significant difference is found, add an update record to the send buffer. Stop when the send buffer size has reached the bandwidth threshold (see the sketch below).
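The following sketch illustrates this third step (the names, record format and thresholds are ours, not the actual front-end code):

#include <cstdint>
#include <cstdlib>
#include <vector>

struct PropertyState { int32_t x, y; uint16_t animId; };

struct EntityView
{
    uint32_t id;
    PropertyState current;  // current state, fed by the game services
    PropertyState lastSent; // copy of the state as known by the client
};

static bool differsSignificantly(const PropertyState &a, const PropertyState &b)
{
    const int32_t posThreshold = 16; // millimetres, illustrative value
    return std::abs(a.x - b.x) > posThreshold ||
           std::abs(a.y - b.y) > posThreshold ||
           a.animId != b.animId;
}

// 'entities' must already be sorted by descending priority.
void fillSendBuffer(std::vector<EntityView> &entities,
                    std::vector<uint8_t> &sendBuffer, size_t byteBudget)
{
    for (EntityView &e : entities)
    {
        if (sendBuffer.size() >= byteBudget)
            break; // bandwidth threshold reached for this cycle
        if (differsSignificantly(e.current, e.lastSent))
        {
            // Serialize a minimal update record: entity id + new state.
            const uint8_t *p = reinterpret_cast<const uint8_t *>(&e.id);
            sendBuffer.insert(sendBuffer.end(), p, p + sizeof e.id);
            p = reinterpret_cast<const uint8_t *>(&e.current);
            sendBuffer.insert(sendBuffer.end(), p, p + sizeof e.current);
            e.lastSent = e.current;
        }
    }
}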
Two main algorithms are used in Ryzom.
The first one is used in "mainland" Ryzom. Space is partitioned by a grid, and computing the list of visible entities consists of taking the entities from the observer's cell and iterating over the neighbouring cells along a spiral, until the desired number of visible entities is reached.
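A sketch of this scan follows (our simplification: walking square rings of increasing radius around the observer's cell visits the same cells as a spiral, ring by ring; the grid representation is illustrative):

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

using Grid = std::map<std::pair<int, int>, std::vector<uint32_t>>; // cell -> entity ids

std::vector<uint32_t> computeVision(const Grid &grid, int cx, int cy,
                                    size_t maxEntities, int maxRadius)
{
    std::vector<uint32_t> vision;
    for (int r = 0; r <= maxRadius && vision.size() < maxEntities; ++r)
    {
        for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
        {
            // Visit only the ring at Chebyshev distance r; inner cells
            // were already taken at smaller radii.
            if (std::max(std::abs(dx), std::abs(dy)) != r) continue;
            auto it = grid.find({cx + dx, cy + dy});
            if (it == grid.end()) continue;
            for (uint32_t id : it->second)
            {
                vision.push_back(id);
                if (vision.size() >= maxEntities) return vision;
            }
        }
    }
    return vision;
}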
For Ryzom Ring instances, where areas are much smaller, a newer algorithm is used: dynamic groups are formed, based on the distance from an entity to the centre of gravity of its group. If an entity moves too far away from the group, the group is split in two: a new group is created.
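A sketch of the splitting rule (the data layout and threshold are illustrative; the real algorithm also handles group bookkeeping):

#include <cmath>
#include <vector>

struct Pos { float x, y; };

Pos centreOfGravity(const std::vector<Pos> &members)
{
    Pos c{0.f, 0.f};
    for (const Pos &p : members) { c.x += p.x; c.y += p.y; }
    c.x /= members.size();
    c.y /= members.size();
    return c;
}

// Moves the members that strayed too far from the centre of gravity
// into a newly created group, splitting the original group in two.
std::vector<Pos> splitFarMembers(std::vector<Pos> &group, float maxDist)
{
    std::vector<Pos> newGroup;
    if (group.empty()) return newGroup;
    Pos c = centreOfGravity(group);
    for (size_t i = 0; i < group.size();)
    {
        float dx = group[i].x - c.x, dy = group[i].y - c.y;
        if (std::sqrt(dx * dx + dy * dy) > maxDist)
        {
            newGroup.push_back(group[i]);
            group.erase(group.begin() + i);
        }
        else ++i;
    }
    return newGroup;
}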
In both cases the result is the same: a per-client set of associations between game entities and viewed entities. Explaining these algorithms in detail is beyond the scope of this article, but now we will see how and when the associations and their properties are transmitted to clients.
One of the algorithms we tried was built on the following approach: correlating the update frequency of a property with the "number of pixels" eventually affected by a property change on the viewer's screen. In practice, the first budget-based algorithm that was developed dealt with each property individually: each tuple (viewer client, viewed entity, property) was assigned a priority, and a data structure representing "shelves & buckets" was browsed when sending updates to the client. The priority helped assign a property change event to the right bucket. It was computed from several criteria:
Distance from the observer to the entity (the Manhattan distance approximation is used to improve performance).
Magnitude of the difference between the copy of the state as known by the client and the actual value.
For positions, we tackled a problem that would occur if the final positional difference was small although the entity had travelled a long path: for example, when walking around a wall, the starting position and the ending position might be very close, while the change is actually "important". Instead of comparing positions, we compared a "quantity of movement", called mileage, accumulated every time an entity movement was detected.
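A sketch of the mileage accumulation (our formulation):

#include <cmath>

// Instead of comparing raw positions, accumulate the travelled distance,
// so that a walk around a wall that ends near its starting point still
// counts as a big change.
struct MileageTracker
{
    float lastX = 0.f, lastY = 0.f;
    float mileage = 0.f;           // total quantity of movement
    float mileageAtLastSend = 0.f;

    void onMove(float x, float y)
    {
        mileage += std::sqrt((x - lastX) * (x - lastX) +
                             (y - lastY) * (y - lastY));
        lastX = x; lastY = y;
    }

    // Used in place of a positional delta when arbitrating updates.
    bool movedSignificantly(float threshold) const
    {
        return mileage - mileageAtLastSend > threshold;
    }

    void onSent() { mileageAtLastSend = mileage; }
};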
Other properties were taken into account as well. For example, if a viewed entity suddenly put on a hat, without moving, the entity would get a higher update priority.
This algorithm proved too CPU-intensive for our target of 25 x 250 x 1000 tuples (1000 clients per front-end service, each displaying 250 entities having 25 dynamic properties each). Hence a different algorithm was developed, computing a score for each (viewer, viewed entity) pair from the following criterion:
Distance from the observer to the entity (the Manhattan distance approximation is used to improve performance).
The score of an entity was incremented proportionally to the inverse of its distance to the viewer, and was reset when its state was sent to the viewer client. Another step was then required to filter which properties would be sent to a viewer when processing a particular viewed entity.
When iterating over viewed entities sorted by descending priority, we compare the current values of the properties with the previous values stored during the last send. If they differ by a sufficient threshold, we include them in the update record to be sent. The mileage trick described in the previous section was retained for position arbitration. As this comparison is critical, being called a very large number of times, we used several tricks to optimise it. We later optimised the whole step by comparing only the properties relevant to the entity type: for example, we know in advance that an intelligent plant is a motionless entity in Ryzom, so we do not need to compare its positions.
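In sketch form (our names), the scoring reduces to a few lines per pair and per game cycle:

// Score of a (viewer, viewed entity) pair: grows faster for close
// entities, and is reset whenever the entity's state is sent.
struct ViewPair
{
    float distance;    // Manhattan-approximated, refreshed elsewhere
    float score = 0.f;

    void accumulate()
    {
        score += 1.f / (distance + 0.001f); // +epsilon avoids division by zero
    }

    void onStateSent() { score = 0.f; }
};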
Evaluating the strategies and their implementations was not an easy task because, for a fixed bandwidth usage, they resulted in a subjective level of movement and animation credibility in the client. Moreover, comparing different live moving scenes over time was not possible without quantitative data. That is why we made measurements inside the front-end service and plotted graphs to ease these tasks.
Figure 4 shows an example in which we measured the priority of the position of an entity moving at constant speed within a large group of entities, with a viewer moving past it. In this graph, the higher the priority, the lower the y-value and the higher the probability that an update will be sent. The most common pattern looks like a saw tooth (a decreasing straight line interrupted by a vertical line): the watched entity is moving, so its priority increases until an update is triggered.
Figure 4: Priorities of an entity’s position over time
The graph shows a couple of interesting phenomena. First, the first two patterns look bigger than the following ones. We can explain this by the fact that at the beginning the server has to send all initial properties to the client, while the bandwidth limit stays the same. The watched entity having more high-priority competitors, sending was triggered at a higher priority (lower y-value) than on average. Besides this competition, the first updates must compete for bandwidth with other network messages, called impulsions, used to transmit commands (unrelated to visual properties) from the server to the client.
Another visible cue is the dip in the middle of the graph. It occurred when the viewer went past the entity, in the middle of the group. With a reduced distance between viewer and viewed entity, the priority and the update frequency increased. Most of the time, the watched entity was at a similar distance from the viewer as the surrounding entities, but in the short time frame in the middle of the curve, it was closer to the watcher than the majority of the others.
A light client, based on Snowballs (the technology demo for NeL), was developed to visually confirm the quantitative evaluation results before a full Ryzom client was ready. As this prototype ran on the client side, it provided a test application for the front-end code implementing the frequency control. The update frequency of an entity's position was represented by its colour.
Figure 5: visual check of update frequencies
The above screenshot (Figure 5) shows an example where the entities closer to the viewer (in white) are in blue and green, indicating a higher update frequency than for the farther entities in yellow and orange.
To reduce the number of operations to perform in one game cycle, a system to split the computation over several game cycles was added. As a result, only an incremental subset of visibility pairs is prioritised at a time.
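In sketch form (our structure, reusing the ViewPair type from the previous sketch), a cursor remembers where the previous cycle stopped, and each cycle rescores only a slice of the pairs:

#include <vector>

// Rescore only a slice of the visibility pairs per game cycle; the
// cursor resumes where the previous cycle stopped.
void prioritiseIncrementally(std::vector<ViewPair> &pairs, size_t &cursor,
                             size_t slicesPerFullPass)
{
    if (pairs.empty()) return;
    size_t sliceSize = (pairs.size() + slicesPerFullPass - 1) / slicesPerFullPass;
    for (size_t n = 0; n < sliceSize; ++n)
    {
        pairs[cursor].accumulate();
        cursor = (cursor + 1) % pairs.size();
    }
}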
Another optimisation consisted of taking advantage of the multiprocessor machines we had (and later, hyper-threaded multiprocessors): the computation part and the sending part were pipelined using several threads.
Low-level optimisations were also done: by displaying the assembly code generated by the C++ compiler, we could rework the data structures so that the processor had minimal overhead and cache misses when accessing them.
In the end, each instance of our front-end service was able to filter and prioritise dynamic properties, and to deliver them to roughly a thousand clients. The scalability of this system, fed with properties by a mirror system, made it possible to multiply the number of front-end services depending on the number of players to be supported in a single seamless environment (shard). We can run several instances of the front-end service in parallel on a single multi-processor server, and several such servers per game shard.
Of course, this system does not guarantee the order in which property change events are sent to the client. To synchronize some property changes, other information such as timestamps are used.
Online games based on small geographical zones can use position coordinates local to each zone, with low range needs. However, a large seamless environment theoretically calls for wide-range coordinates. On the server side, we have 32-bit integer positions with millimetre precision. One approach would be to alternate absolute positions and "smaller" positions relative to the latest absolute position. However, this technique has strong ordering constraints, which cannot easily be adapted to our best-effort (non-reliable) lightweight network protocol.
A better-suited approach was based on the assumption that a client will only display entities in the neighbourhood, within the square kilometre surrounding the viewer (Figure 6). A virtual local basis can then be used. Instead of translating positions from the absolute basis to a basis relative to the viewer, a more CPU-efficient method consists of using a "truncated absolute coordinate" algorithm.
The server transmits short entity coordinates and, to save more bandwidth, also lowers the precision to 16 millimetres within a 1024 m window by dividing the coordinate by 16 (i.e. shifting it right by 4 bits). No reference to the viewer's position is needed:
uint16 ec_short = (uint16)(ec_full >> 4); // 16 mm precision within a 1024 m window
To reconstruct the full position of an entity, the client needs the following inputs:
Full viewer coordinates (uint32¹ vc_full).
Short entity coordinates (uint16 ec_short).
The client computes the short viewer coordinates and the short entity delta coordinates from the transmitted values, and finally deduces the full entity coordinates from the full delta (obtained by sign extension) and the damped viewer coordinates (damping avoids entity flickering when the viewer moves):
uint32 vc_full_damped = vc_full & ~0xF;                 // viewer position, rounded to the 16 mm grid
uint16 vc_short = (uint16)(vc_full >> 4);               // short viewer coordinates
sint16 delta_short = (sint16)(ec_short - vc_short);     // signed short delta (16-bit wraparound is correct)
sint32 delta_full = ((sint32)delta_short) << 4;         // sign-extend and restore millimetre units
uint32 ec_full = (uint32)(vc_full_damped + delta_full); // full entity coordinates
For the example in Figure 6:
Full X coordinates:  (values shown in Figure 6)
Short coordinates:   Viewer: 0xC000        Entity A: 0xE000        Entity B: 0x6000        Entity C: 0x2000
Short deltas:        Entity A: 0x2000      Entity B: 0xA000        Entity C: 0x6000
Full deltas:         Entity A: 0x00020000  Entity B: 0xFFFA0000    Entity C: 0x00060000
For ground entities, the z-value (altitude) may be compressed even further by referring to a layer index, provided the client software can efficiently access the 3D topography structure corresponding to a 2D position.
¹ Basic types shown are from the NeL library: uint32 is a 32-bit unsigned int, sint16 is a 16-bit signed int, etc.
Now, our client animation system is fed with property updates (especially position updates) that, at a given time, are more frequent for some entities than for others. The actual animation of an entity is interpolated between the received position updates, starting with a delay, and the movement speed is set so that the client has already received at least one position in advance when the entity is displayed at an updated position.
This would be easy with a constant update frequency. However, to allow smooth movement of a particular entity, the time control of movement animations has to withstand frequency changes in the incoming data. If the client has not received an entity's next position in time, the entity will be forced to stop its course and will probably have to jump to the next position once it is received; this is exactly what we try to avoid.
Thus we need to anticipate frequency changes, to adjust the movement speed in advance. We compared various algorithms to do so.
A test bench was developed to simulate various movement cases, and the update frequency was measured on a simulated client. Several prediction formulas were then computed from the latest received data, in order to adjust the movement speed using a predicted interval that should be, most of the time, slightly greater than the actual update interval. While our early attempts gave acceptable results, we browsed research papers to benefit from theoretical work.
Voice-over-IP technology was found to face similar challenges: in that case the aim is to transmit sound with the smallest possible lag while avoiding voice cuts [Moon-Kurose-Towsley]. We then adapted an algorithm from a research paper [Kansal-Karandikar], which proved more efficient than our first attempts (Figure 7).
Figure 7: Measured interval between position updates (in blue) and predicted maximum interval (in red), over time. If the predicted curve crosses the measured one, a jerk will occur in the movement speed.
Nevertheless, this technique needed refinement to handle the case where an entity stops sending position updates because it stops moving (not because its priority gets lower).
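As an illustration, here is a minimal adaptive estimator in the spirit of the audio playout algorithms cited above (the classic EWMA mean-plus-deviation form; not necessarily the exact formula we adapted, and the gains are assumed values):

#include <cmath>

class IntervalPredictor
{
public:
    // Call with the measured time since the previous position update.
    void onIntervalMeasured(double interval)
    {
        const double g = 0.125, h = 0.25; // smoothing gains, assumed values
        mMean = (1.0 - g) * mMean + g * interval;
        mDev  = (1.0 - h) * mDev + h * std::fabs(interval - mMean);
    }

    // Predicted maximum interval: kept above the measured curve so the
    // interpolated movement speed never has to jerk (cf. Figure 7).
    double predictedMaxInterval() const { return mMean + 4.0 * mDev; }

private:
    double mMean = 0.0;
    double mDev  = 0.0;
};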
The newly developed techniques were successful at providing a smooth scene of moving entities, and were sometimes acclaimed: when members of the Nevrax team saw the tests with hundreds of animated entities, they were enthusiastic about the mass-combat feature they would allow.
Later, the property propagation system became part of the Realtime Army Invasion Deployment (RAID) engine, a label for the chain ranging from the AI service conducting the behaviour of non-player characters and creatures to the 3D client engine animating the entities. It was used for large-scale events featuring thousands of evil Kitins (crab-like creatures) against hundreds of player characters (Figure 8), producing scenes reminiscent of the movie Starship Troopers, as can be seen in an in-game film.
- [Aronson] Aronson, J., "Dead Reckoning: Latency Hiding for Networked Games", 1997
http://www.gamasutra.com/features/19970919/aronson_01.htm
- [Moon-Kurose-Towsley] Moon, S.B., Kurose, J., Towsley, D., "Packet audio playout delay adjustment: performance bounds and algorithms", 1998
http://an.kaist.ac.kr/~sbmoon/paper/intl-journal/1998-acm-multimedia-voip.pdf
- [Kansal-Karandikar] Kansal, A., Karandikar, A., "Jitter-free Audio Playout over Best Effort Packet Networks", ATM Forum International Symposium, New Delhi, 2001
http://citeseer.ist.psu.edu/557994.html
- [Bernier] Bernier, Y. B., “Latency Compensating Methods in Client/Server In-Game Protocol Design and Optimization”, Proceedings of the Game Developer Conference, 2001
http://www.resourcecode.de/stuff/clientsideprediction.pdf
- [Singhal-Zyda] Singhal, S., Zyda, M., "Networked Virtual Environments: Design and Implementation", Addison-Wesley Professional, 1999, ISBN 0201325578
(Web links verified in May 2007)