Introduction
kcp-go is a Production-Grade Reliable-UDP library for golang.
This library intends to provide smooth, resilient, ordered, error-checked and anonymous delivery of streams over UDP packets. It has been battle-tested with the open-source project kcptun. Millions of devices (from low-end MIPS routers to high-end servers) have deployed kcp-go powered programs in a variety of forms, such as online games, live broadcasting, file synchronization and network acceleration.
Latest Release
Features
Designed for latency-sensitive scenarios.
Compatible with skywind3000's C version with various improvements.
Documentation
For complete documentation, see the associated Godoc.
Specification
+-----------------+
| SESSION         |
+-----------------+
| KCP(ARQ)        |
+-----------------+
| FEC(OPTIONAL)   |
+-----------------+
| CRYPTO(OPTIONAL)|
+-----------------+
| UDP(PACKET)     |
+-----------------+
| IP              |
+-----------------+
| LINK            |
+-----------------+
| PHY             |
+-----------------+
(LAYER MODEL OF KCP-GO)
Usage
Client: full demo
kcpconn, err := kcp.DialWithOptions("192.168.0.1:10000", nil, 10, 3)
Server: full demo
lis, err := kcp.ListenWithOptions(":10000", nil, 10, 3)
Performance
Model Identifier: MacBookPro14,1
Processor Name: Intel Core i5
Processor Speed: 3.1 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB
beginning tests, encryption:salsa20, fec:10/3
goos: darwin
goarch: amd64
pkg: github.com/xtaci/kcp-go
BenchmarkSM4-4 50000 32180 ns/op 93.23 MB/s 0 B/op 0 allocs/op
BenchmarkAES128-4 500000 3285 ns/op 913.21 MB/s 0 B/op 0 allocs/op
BenchmarkAES192-4 300000 3623 ns/op 827.85 MB/s 0 B/op 0 allocs/op
BenchmarkAES256-4 300000 3874 ns/op 774.20 MB/s 0 B/op 0 allocs/op
BenchmarkTEA-4 100000 15384 ns/op 195.00 MB/s 0 B/op 0 allocs/op
BenchmarkXOR-4 20000000 89.9 ns/op 33372.00 MB/s 0 B/op 0 allocs/op
BenchmarkBlowfish-4 50000 26927 ns/op 111.41 MB/s 0 B/op 0 allocs/op
BenchmarkNone-4 30000000 45.7 ns/op 65597.94 MB/s 0 B/op 0 allocs/op
BenchmarkCast5-4 50000 34258 ns/op 87.57 MB/s 0 B/op 0 allocs/op
Benchmark3DES-4 10000 117149 ns/op 25.61 MB/s 0 B/op 0 allocs/op
BenchmarkTwofish-4 50000 33538 ns/op 89.45 MB/s 0 B/op 0 allocs/op
BenchmarkXTEA-4 30000 45666 ns/op 65.69 MB/s 0 B/op 0 allocs/op
BenchmarkSalsa20-4 500000 3308 ns/op 906.76 MB/s 0 B/op 0 allocs/op
BenchmarkCRC32-4 20000000 65.2 ns/op 15712.43 MB/s
BenchmarkCsprngSystem-4 1000000 1150 ns/op 13.91 MB/s
BenchmarkCsprngMD5-4 10000000 145 ns/op 110.26 MB/s
BenchmarkCsprngSHA1-4 10000000 158 ns/op 126.54 MB/s
BenchmarkCsprngNonceMD5-4 10000000 153 ns/op 104.22 MB/s
BenchmarkCsprngNonceAES128-4 100000000 19.1 ns/op 837.81 MB/s
BenchmarkFECDecode-4 1000000 1119 ns/op 1339.61 MB/s 1606 B/op 2 allocs/op
BenchmarkFECEncode-4 2000000 832 ns/op 1801.83 MB/s 17 B/op 0 allocs/op
BenchmarkFlush-4 5000000 272 ns/op 0 B/op 0 allocs/op
BenchmarkEchoSpeed4K-4 5000 259617 ns/op 15.78 MB/s 5451 B/op 149 allocs/op
BenchmarkEchoSpeed64K-4 1000 1706084 ns/op 38.41 MB/s 56002 B/op 1604 allocs/op
BenchmarkEchoSpeed512K-4 100 14345505 ns/op 36.55 MB/s 482597 B/op 13045 allocs/op
BenchmarkEchoSpeed1M-4 30 34859104 ns/op 30.08 MB/s 1143773 B/op 27186 allocs/op
BenchmarkSinkSpeed4K-4 50000 31369 ns/op 130.57 MB/s 1566 B/op 30 allocs/op
BenchmarkSinkSpeed64K-4 5000 329065 ns/op 199.16 MB/s 21529 B/op 453 allocs/op
BenchmarkSinkSpeed256K-4 500 2373354 ns/op 220.91 MB/s 166332 B/op 3554 allocs/op
BenchmarkSinkSpeed1M-4 300 5117927 ns/op 204.88 MB/s 310378 B/op 6988 allocs/op
PASS
ok github.com/xtaci/kcp-go 50.349s
Key Design Considerations
slice vs. container/list
kcp.flush() loops through the send queue to check for retransmissions every 20ms (the interval).
I've written a benchmark comparing a sequential loop through a slice with one through a container/list here:
https://github.com/xtaci/notes/blob/master/golang/benchmark2/cachemiss_test.go
BenchmarkLoopList-4 100000000 54.6 ns/op
A list introduces heavy cache misses compared to a slice, which has better locality. 5000 connections with a window size of 32 and a 20ms interval will cost 6us/0.03%(cpu) using a slice, and 8.7ms/43.5%(cpu) using a list, for each kcp.flush().
Timing accuracy vs. syscall clock_gettime
Timing is critical to the RTT estimator: inaccurate timing leads to false retransmissions in KCP. However, calling time.Now() costs 42 cycles (10.5ns on a 4GHz CPU, 15.6ns on my MacBook Pro 2.7GHz).
The benchmark for time.Now() lies here:
https://github.com/xtaci/notes/blob/master/golang/benchmark2/syscall_test.go
BenchmarkNow-4 100000000 15.6 ns/op
In kcp-go, the current clock time is updated upon return from each kcp.output() function call, and a single kcp.flush() operation queries the system time once. Most of the time, 5000 connections cost 5000 * 15.6ns = 78us (a fixed cost while no packet needs to be sent). For a 10MB/s data transfer with a 1400 MTU, kcp.output() will be called around 7500 times per second, costing 117us for time.Now() every second.
Connection Termination
Control messages like SYN/FIN/RST in TCP are not defined in KCP, so you need a keepalive/heartbeat mechanism at the application level. A real-world example is to run a multiplexing protocol over the session, such as smux (which has an embedded keepalive mechanism); see kcptun for an example.
FAQ
Q: I'm handling >5K connections on my server, and the CPU utilization is very high.
A: A standalone agent or gate server for running kcp-go is suggested, not only for CPU utilization but also for the precision of RTT measurements (timing), which indirectly affects retransmission. Increasing the update interval with SetNoDelay, like conn.SetNoDelay(1, 40, 1, 1), will dramatically reduce system load but lower the performance.
Who is using this?
https://github.com/xtaci/kcptun -- A Secure Tunnel Based On KCP over UDP.
https://play.google.com/store/apps/details?id=com.k17game.k3 -- Battle Zone - Earth 2048, a world-wide strategy game.
Links
https://github.com/xtaci/libkcp -- FEC enhanced KCP session library for iOS/Android in C++