What is a tun/tap device
Among the virtual network devices supported by Linux, tun/tap devices are somewhat special: they give user-space programs the ability to send and receive network packets. Such a device can act either as a point-to-point device (TUN) or as an Ethernet device (TAP). Through a tun device a user-space program reads and writes IP packets only, while through a tap device it reads and writes link-layer frames; much like the difference between an ordinary socket and a raw socket, the two handle different packet formats.
Once the tun driver is loaded, /dev/net/tun appears in the system; it is in fact a character device with major number 10. Functionally, the tun driver consists of two parts: a virtual NIC driver, which encapsulates and decapsulates skbs inside the virtual NIC, and a character device driver, which handles the exchange between kernel space and user space.
Note: the tun/tap device is registered as a misc device. Most devices fall into a clear class, such as character devices or block devices, but some are hard to classify; devices that do not fit anywhere else end up in the misc (miscellaneous) class. All miscdevices share the major number MISC_MAJOR (10) and are distinguished by their minor numbers.
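As a quick sanity check, the following small user-space snippet (a sketch; TUN_MINOR is normally 200, but verify on your own system) prints the device numbers of /dev/net/tun:
/* Quick check of /dev/net/tun: on a typical system this prints a character
 * device with major 10 (MISC_MAJOR) and minor 200 (TUN_MINOR). */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
    struct stat st;

    if (stat("/dev/net/tun", &st) < 0) {
        perror("stat /dev/net/tun");
        return 1;
    }
    printf("char device: %s, major: %u, minor: %u\n",
           S_ISCHR(st.st_mode) ? "yes" : "no",
           major(st.st_rdev), minor(st.st_rdev));
    return 0;
}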
Using tun/tap devices
Quite a few well-known open-source projects use tun/tap devices, such as openvpn and qemu, and looking at how they use them helps in understanding the overall mechanism and flow.
The following is the function in tun.c of the openvpn open-source project that creates a tun/tap device.
int open_tun (const char *dev, char *actual, int size)
{
struct ifreq ifr;
int fd;
char *device = "/dev/net/tun";
if ((fd = open (device, O_RDWR)) < 0) // open the character device
msg (M_ERR, "Cannot open TUN/TAP dev %s", device);
memset (&ifr, 0, sizeof (ifr));
ifr.ifr_flags = IFF_NO_PI;
if (!strncmp (dev, "tun", 3)) {
ifr.ifr_flags |= IFF_TUN;
} else if (!strncmp (dev, "tap", 3)) {
ifr.ifr_flags |= IFF_TAP;
} else {
msg (M_FATAL, "I don't recognize device %s as a TUN or TAP device",dev);
}
if (strlen (dev) > 3) /* unit number specified? */
strncpy (ifr.ifr_name, dev, IFNAMSIZ);
if (ioctl (fd, TUNSETIFF, (void *) &ifr) < 0) // create/attach the virtual NIC
msg (M_ERR, "Cannot ioctl TUNSETIFF %s", dev);
set_nonblock (fd);
msg (M_INFO, "TUN/TAP device %s opened", ifr.ifr_name);
strncpynt (actual, ifr.ifr_name, size);
return fd;
}
After the tap/tun device has been created with the function above, the application can simply use read and write on the descriptor to receive packets from, or send packets to, the virtual NIC. Of course, to get the protocol stack's traffic into the device: for a tun device, configure an IP address and routes on it to steer L3 traffic in; for a tap device, add it to a bridge so traffic enters through the L2 path.
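Before moving on to the routed example, here is a minimal, self-contained sketch (not OpenVPN code; the interface name "tun1" is only an example, and root/CAP_NET_ADMIN is required) that opens a tun device in the same way and then reads raw IP packets from it:
/* Minimal sketch: create/attach a tun interface and read raw IP packets.
 * Assumes IFF_NO_PI (no packet-info header); "tun1" is just an example name. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>

int main(void)
{
    struct ifreq ifr;
    unsigned char buf[2048];
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) { perror("open /dev/net/tun"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI;          /* L3 packets, no extra header */
    strncpy(ifr.ifr_name, "tun1", IFNAMSIZ);
    if (ioctl(fd, TUNSETIFF, (void *)&ifr) < 0) { perror("TUNSETIFF"); return 1; }

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));   /* one read() == one IP packet */
        if (n <= 0) break;
        printf("got %zd-byte packet, IP version %d\n", n, buf[0] >> 4);
    }
    return 0;
}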
In the example below, an ICMP packet 10.1.1.1 --> 10.1.1.2 travels from the user-space ping, through the kernel protocol stack, out via tun1, and into the user-space openvpn process.
[dev@debian:] ip addr add 192.168.1.1/24 dev tun1
[dev@debian:] ip route add 10.1.1.0/24 via 192.168.1.2
[dev@debian:] ping 10.1.1.2 -I 10.1.1.1
This process can be understood with the figure below. It also illustrates the earlier statement that "tun/tap devices are somewhat special: they give user-space programs the ability to send and receive network packets", and shows how they differ from an ordinary NIC driver.
In the figure, the two applications A and B are the user-space processes of ping and openvpn respectively (openvpn's local overlay/encapsulation address is 10.33.0.11), while the sockets, the protocol stack (Network Protocol Stack) and the network devices (eth0 and tun0) all live in the kernel. Strictly speaking the sockets are part of the protocol stack; they are drawn separately only to make the picture clearer.
tun0 is a tun/tap virtual device. The figure shows how it differs from the physical device eth0: both have one end attached to the protocol stack, but the other ends differ. The other end of eth0 is the physical network, typically a switch, whereas the other end of tun0 is a user-space program: packets the stack sends to tun0 can be read by that program, and the program can write data directly into tun0.
The transmit path of a packet is:
- Running ping 10.1.1.2 -I 10.1.1.1 sends a packet through socket A, and the socket hands it to the protocol stack;
- The stack looks up the local routing table by the packet's destination IP, determines that it must leave via tun0, and hands the packet to tun0;
- tun0 finds that its other end has been opened by process B, so it hands the packet to process B, i.e. openvpn;
- openvpn, according to its VPN configuration, wraps the inner IP packet in another IP(+UDP) header to build a new packet and forwards it through socket B; the new packet's source address is now eth0's address and its destination is some other address, e.g. 10.33.0.1 (see the relay sketch after the return-path list below);
- Socket B hands the packet to the protocol stack;
- The stack consults the local routing table, finds that this packet should go out via eth0, and hands it to eth0;
- eth0 sends the packet out over the physical network.
The return (receive) path is the reverse:
- eth0 receives a UDP packet 10.33.0.1 --> 10.33.0.11 and passes it up to the protocol stack;
- The stack delivers the UDP payload to the owner of the UDP socket, openvpn (application B); openvpn then writes the inner packet (the 10.1.1.2 --> 10.1.1.1 ICMP) into the character device backing tun0;
- The kernel-side virtual device tun0 receives the returning packet and passes it into the protocol stack;
- Routing identifies it as a local packet, so it enters the ICMP receive path, and icmp_rcv delivers it through socket A to the ping program.
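The essence of what an application like openvpn does in both directions above can be sketched as a simple relay loop. This is purely illustrative (real openvpn adds authentication, encryption and framing); tun_fd and the peer address are assumed to come from configuration:
/* Illustrative relay loop: packets read from the tun fd are sent as UDP
 * payload to the peer (encapsulation), and UDP payloads received from the
 * peer are written back to the tun fd (decapsulation). */
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

void relay(int tun_fd, const char *peer_ip, int peer_port)
{
    unsigned char buf[2048];
    struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(peer_port) };
    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);       /* "socket B" in the figure */

    inet_pton(AF_INET, peer_ip, &peer.sin_addr);
    connect(udp_fd, (struct sockaddr *)&peer, sizeof(peer));

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(tun_fd, &rfds);
        FD_SET(udp_fd, &rfds);
        if (select((tun_fd > udp_fd ? tun_fd : udp_fd) + 1, &rfds, NULL, NULL, NULL) < 0)
            break;
        if (FD_ISSET(tun_fd, &rfds)) {                 /* tun0 -> wrap in UDP -> eth0 */
            ssize_t n = read(tun_fd, buf, sizeof(buf));
            if (n > 0)
                send(udp_fd, buf, n, 0);
        }
        if (FD_ISSET(udp_fd, &rfds)) {                 /* eth0 -> unwrap -> tun0 */
            ssize_t n = recv(udp_fd, buf, sizeof(buf), 0);
            if (n > 0)
                write(tun_fd, buf, n);
        }
    }
    close(udp_fd);
}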
tun/tap device driver implementation
The tun/tap device driver consists of two parts: a virtual NIC driver, which encapsulates and decapsulates skbs inside the virtual NIC, and a character device driver, which handles the exchange between kernel space and user space.
Device creation
Like other drivers, the tun/tap driver starts from its init function, whose main job is to call misc_register to register a miscdevice.
After this registration the file "/dev/net/tun" appears in the system. As with an ordinary character device, when an application opens this file with the open system call, a file object is created whose file_operations point to tun_fops.
static int __init tun_init(void)
{
int ret = 0;
pr_info("%s, %s\n", DRV_DESCRIPTION, DRV_VERSION);
ret = rtnl_link_register(&tun_link_ops);
if (ret) {
pr_err("Can't register link_ops\n");
goto err_linkops;
}
ret = misc_register(&tun_miscdev);
if (ret) {
pr_err("Can't register misc device %d\n", TUN_MINOR);
goto err_misc;
}
ret = register_netdevice_notifier(&tun_notifier_block);
if (ret) {
pr_err("Can't register netdevice notifier\n");
goto err_notifier;
}
return 0;
err_notifier:
misc_deregister(&tun_miscdev);
err_misc:
rtnl_link_unregister(&tun_link_ops);
err_linkops:
return ret;
}
static const struct file_operations tun_fops = {
.owner = THIS_MODULE,
.llseek = no_llseek,
.read = do_sync_read,
.aio_read = tun_chr_aio_read,
.write = do_sync_write,
.aio_write = tun_chr_aio_write,
.poll = tun_chr_poll,
.unlocked_ioctl = tun_chr_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = tun_chr_compat_ioctl,
#endif
.open = tun_chr_open,
.release = tun_chr_close,
.fasync = tun_chr_fasync
};
static struct miscdevice tun_miscdev = {
.minor = TUN_MINOR,
.name = "tun",
.nodename = "net/tun",
.fops = &tun_fops,
};
When user space opens "/dev/net/tun", the open eventually lands in tun_fops' open handler, tun_chr_open.
static int tun_chr_open(struct inode *inode, struct file * file)
{
struct tun_file *tfile;
DBG1(KERN_INFO, "tunX: tun_chr_open\n");
/* Allocate the tun_file structure. It connects the character device and the
 * tuntap device and contains both a sock and a socket structure: the former
 * is closer to the driver layer, the latter to the application layer. */
tfile = (struct tun_file *)sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL,
&tun_proto);
if (!tfile)
return -ENOMEM;
rcu_assign_pointer(tfile->tun, NULL);
tfile->net = get_net(current->nsproxy->net_ns);
tfile->flags = 0;
rcu_assign_pointer(tfile->socket.wq, &tfile->wq);
init_waitqueue_head(&tfile->wq.wait);
/* Tie the socket to the character device file and set the socket's ops. */
tfile->socket.file = file;
tfile->socket.ops = &tun_socket_ops;
sock_init_data(&tfile->socket, &tfile->sk);
sk_change_net(&tfile->sk, tfile->net);
tfile->sk.sk_write_space = tun_sock_write_space;
tfile->sk.sk_sndbuf = INT_MAX;
/* Store tun_file as the file's private data. The file object is created each
 * time an application opens /dev/net/tun; from its private_data we can easily
 * get the tun_file, and from there the device's sock and socket structures. */
file->private_data = tfile;
set_bit(SOCK_EXTERNALLY_ALLOCATED, &tfile->socket.flags);
INIT_LIST_HEAD(&tfile->next);
sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
return 0;
}
The main things it does:
- Allocate the tun_file structure, a key data structure that links the character device file with the tuntap netdevice; it contains the sock and socket structures (the former closer to the driver layer, the latter to the application layer) as well as a pointer to the device's tun_struct (the tun field);
- tun_file is stored as the file's private data, while tun_struct is the private data of the tuntap net_device, so file <-> tun_file <-> tun_struct <-> net_device are chained together;
- Tie the socket to the character device file and set the socket's ops;
- Note that the file object is created each time an application opens /dev/net/tun; through its private_data the tun_file, and from it the device's sock and socket structures, are easily reached.
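The resulting pointer chain can be illustrated with a short sketch; this is not kernel code, just an approximation of what helpers such as tun_get() and the read/write handlers do:
/* Not kernel code: a simplified sketch of how the driver walks the chain
 * file -> tun_file -> tun_struct -> net_device (reference counting omitted). */
static struct net_device *tun_dev_from_file(struct file *file)
{
    struct tun_file *tfile = file->private_data;        /* set in tun_chr_open() */
    struct tun_struct *tun;
    struct net_device *dev = NULL;

    rcu_read_lock();
    tun = rcu_dereference(tfile->tun);                   /* set by TUNSETIFF via tun_attach() */
    if (tun)
        dev = tun->dev;                                  /* and tun == netdev_priv(dev) */
    rcu_read_unlock();

    return dev;
}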
/* A tun_file connects an open character device to a tuntap netdevice. It
* also contains all socket related strctures (except sock_fprog and tap_filter)
* to serve as one transmit queue for tuntap device. The sock_fprog and
* tap_filter were kept in tun_struct since they were used for filtering for the
* netdevice not for a specific queue (at least I didn't see the requirement for
* this).
*
* RCU usage:
* The tun_file and tun_struct are loosely coupled, the pointer from one to the
* other can only be read while rcu_read_lock or rtnl_lock is held.
*/
struct tun_file {
struct sock sk;
struct socket socket;
struct socket_wq wq;
struct tun_struct __rcu *tun;
struct net *net;
struct fasync_struct *fasync;
/* only used for fasnyc */
unsigned int flags;
u16 queue_index;
struct list_head next;
struct tun_struct *detached;
};
As the function above shows, opening /dev/net/tun does not yet create the tuntap device; tun_file->tun is still NULL. The device is normally created by a subsequent ioctl(fd, TUNSETIFF, (void *)&ifr), which ends up in tun_fops' tun_chr_ioctl. tun_chr_ioctl calls __tun_chr_ioctl, which in turn calls tun_set_iff to create the interface. tun_set_iff mainly does the following:
- Create the net_device and set its private data (netdev_priv) to the tun_struct; for performance, multiple queues can be requested;
- Call tun_net_init, which, depending on the device type (tun or tap), installs the net_device's netdev_ops: tun_netdev_ops for a tun device, tap_netdev_ops for a tap device. It also sets the link-layer attributes; a tun device is an L3 device with no link-layer header and no ARP;
- Call tun_attach to associate the tun_struct with the tun_file, so that file, tun_file, tun_struct and net_device are tied together and the path from user space to the kernel driver is complete;
- Call register_netdevice to register the device, after which the tuntap device becomes visible to the system.
static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
{
struct tun_struct *tun;
struct tun_file *tfile = file->private_data;
struct net_device *dev;
int err;
......
dev = __dev_get_by_name(net, ifr->ifr_name);
if (dev) {
/* The device already exists; here the tun file is attached to it.
 * A user-space process gets a new file each time it opens /dev/net/tun,
 * but the tun/tap device itself is created only once.
 */
} else {
char *name;
unsigned long flags = 0;
int queues = ifr->ifr_flags & IFF_MULTI_QUEUE ?
MAX_TAP_QUEUES : 1;
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
err = security_tun_dev_create();
if (err < 0)
return err;
/* Set dev type */
// tun device or tap device
if (ifr->ifr_flags & IFF_TUN) {
/* TUN device */
flags |= IFF_TUN;
name = "tun%d";
} else if (ifr->ifr_flags & IFF_TAP) {
/* TAP device */
flags |= IFF_TAP;
name = "tap%d";
} else
return -EINVAL;
if (*ifr->ifr_name)
name = ifr->ifr_name;
dev = alloc_netdev_mqs(sizeof(struct tun_struct), name,
NET_NAME_UNKNOWN, tun_setup, queues,
queues);
if (!dev)
return -ENOMEM;
err = dev_get_valid_name(net, dev, name);
if (err < 0)
goto err_free_dev;
dev_net_set(dev, net);
dev->rtnl_link_ops = &tun_link_ops;
dev->ifindex = tfile->ifindex;
dev->sysfs_groups[0] = &tun_attr_group;
tun = netdev_priv(dev);
tun->dev = dev;
tun->flags = flags;
tun->txflt.count = 0;
tun->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
tun->align = NET_SKB_PAD;
tun->filter_attached = false;
tun->sndbuf = tfile->socket.sk->sk_sndbuf;
tun->rx_batched = 0;
RCU_INIT_POINTER(tun->steering_prog, NULL);
tun->pcpu_stats = netdev_alloc_pcpu_stats(struct tun_pcpu_stats);
if (!tun->pcpu_stats) {
err = -ENOMEM;
goto err_free_dev;
}
spin_lock_init(&tun->lock);
err = security_tun_dev_alloc_security(&tun->security);
if (err < 0)
goto err_free_stat;
tun_net_init(dev);
tun_flow_init(tun);
dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
TUN_USER_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_TX;
dev->features = dev->hw_features | NETIF_F_LLTX;
dev->vlan_features = dev->features &
~(NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_TX);
tun->flags = (tun->flags & ~TUN_FEATURES) |
(ifr->ifr_flags & TUN_FEATURES);
INIT_LIST_HEAD(&tun->disabled);
err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI);
if (err < 0)
goto err_free_flow;
err = register_netdevice(tun->dev);
if (err < 0)
goto err_detach;
}
netif_carrier_on(tun->dev);
tun_debug(KERN_INFO, tun, "tun_set_iff\n");
/* Make sure persistent devices do not get stuck in
* xoff state.
*/
if (netif_running(tun->dev))
netif_tx_wake_all_queues(tun->dev);
strcpy(ifr->ifr_name, tun->dev->name);
return 0;
......
}
static void tun_net_init(struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
switch (tun->flags & TUN_TYPE_MASK) {
case IFF_TUN:
dev->netdev_ops = &tun_netdev_ops;
/* Point-to-Point TUN Device */
dev->hard_header_len = 0;
dev->addr_len = 0;
dev->mtu = 1500;
/* Zero header length */
dev->type = ARPHRD_NONE;
dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
break;
case IFF_TAP:
dev->netdev_ops = &tap_netdev_ops;
/* Ethernet TAP Device */
ether_setup(dev);
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
eth_hw_addr_random(dev);
break;
}
dev->min_mtu = MIN_MTU;
dev->max_mtu = MAX_MTU - dev->hard_header_len;
}
Packet transmit and receive
Transmit from user space
The flow is as follows: a write from user space goes through the file's write handler and eventually calls netif_rx_ni to receive the packet. It does not call netif_receive_skb directly; instead netif_rx_ni calls enqueue_to_backlog to queue the packet on the CPU's softnet_data receive queue, and the rest of the receive processing runs in softirq context, which later calls process_backlog --> __netif_receive_skb to handle the packet, just as for a physical port.
# kernel 4.18.2
static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct tun_file *tfile = file->private_data;
struct tun_struct *tun = tun_get(tfile);
ssize_t result;
if (!tun)
return -EBADFD;
result = tun_get_user(tun, tfile, NULL, from,
file->f_flags & O_NONBLOCK, false);
tun_put(tun);
return result;
}
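tun_get_user is where the user data is turned into an skb and injected into the stack. The following is a heavily condensed, purely illustrative sketch, not the real kernel function: the actual code additionally handles the virtio-net header, GSO, XDP, zero-copy, batching, and derives the protocol from the IP version.
/* Not the real kernel function: a condensed sketch of what tun_get_user()
 * conceptually does for an IFF_NO_PI tun device carrying IPv4. */
static ssize_t tun_get_user_sketch(struct tun_struct *tun, struct tun_file *tfile,
                                   struct iov_iter *from)
{
    size_t len = iov_iter_count(from);
    struct sk_buff *skb = alloc_skb(len + NET_SKB_PAD, GFP_KERNEL);

    if (!skb)
        return -ENOMEM;
    skb_reserve(skb, NET_SKB_PAD);
    if (copy_from_iter(skb_put(skb, len), len, from) != len) {  /* copy user data */
        kfree_skb(skb);
        return -EFAULT;
    }
    skb->dev = tun->dev;
    skb->protocol = htons(ETH_P_IP);      /* assumed IPv4 for this sketch */
    skb_reset_mac_header(skb);
    netif_rx_ni(skb);                     /* queue to the backlog, softirq does the rest */
    return len;
}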
This path enters via a file write from user space. In other scenarios the packet comes in through the socket interface instead, e.g. vhost-net: when the virtual machine transmits a packet, vhost-net's handle_tx is invoked, which goes through tun_socket_ops->tun_sendmsg->tun_get_user->netif_rx_ni.
static void handle_tx(struct vhost_net *net)
{
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned out, in;
int head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
.msg_control = NULL,
.msg_controllen = 0,
.msg_flags = MSG_DONTWAIT,
};
size_t len, total_len = 0;
int err;
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
int sent_pkts = 0;
mutex_lock(&vq->mutex);
// the tun/tap socket structure
sock = vq->private_data;
if (!sock)
goto out;
......
for (;;) {
......
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(sock, &msg, len);
......
}
out:
mutex_unlock(&vq->mutex);
}
As can be seen, this path does not involve the tuntap netdevice's transmit routine: after the packet is handed in via the file or the socket interface, it is injected directly into the protocol stack.
From kernel space to user space
The kernel-to-user direction is the path where the kernel protocol stack transmits through the tuntap device and the packet is handed up to user space.
Interface transmission follows the generic path: dev_queue_xmit eventually calls the device's net_device_ops->ndo_start_xmit, i.e. the tun_netdev_ops and tap_netdev_ops installed when the tuntap interface was created.
# kernel 4.18.2
static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
struct sk_buff *skb, struct net_device *dev,
bool more)
{
skb->xmit_more = more ? 1 : 0;
return ops->ndo_start_xmit(skb, dev);
}
static const struct net_device_ops tun_netdev_ops = {
.ndo_uninit = tun_net_uninit,
.ndo_open = tun_net_open,
.ndo_stop = tun_net_close,
.ndo_start_xmit = tun_net_xmit,
.ndo_fix_features = tun_net_fix_features,
.ndo_select_queue = tun_select_queue,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = tun_poll_controller,
#endif
.ndo_set_rx_headroom = tun_set_headroom,
.ndo_get_stats64 = tun_net_get_stats64,
};
static const struct net_device_ops tap_netdev_ops = {
.ndo_uninit = tun_net_uninit,
.ndo_open = tun_net_open,
.ndo_stop = tun_net_close,
.ndo_start_xmit = tun_net_xmit,
.ndo_fix_features = tun_net_fix_features,
.ndo_set_rx_mode = tun_net_mclist,
.ndo_set_mac_address = eth_mac_addr,
.ndo_validate_addr = eth_validate_addr,
.ndo_select_queue = tun_select_queue,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = tun_poll_controller,
#endif
.ndo_features_check = passthru_features_check,
.ndo_set_rx_headroom = tun_set_headroom,
.ndo_get_stats64 = tun_net_get_stats64,
.ndo_bpf = tun_xdp,
.ndo_xdp_xmit = tun_xdp_xmit,
};
The packet is cached in the tun_file's tx_ring, and the sock's sk_data_ready is then called to notify the user-space reader.
# kernel 4.18.2
/* Net device start xmit */
static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
int txq = skb->queue_mapping;
struct tun_file *tfile;
int len = skb->len;
rcu_read_lock();
tfile = rcu_dereference(tun->tfiles[txq]);
/* Drop packet if interface is not attached */
if (txq >= tun->numqueues)
goto drop;
if (!rcu_dereference(tun->steering_prog))
tun_automq_xmit(tun, skb);
tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
BUG_ON(!tfile);
/* Drop if the filter does not like it.
* This is a noop if the filter is disabled.
* Filter can be enabled only for the TAP devices. */
if (!check_filter(&tun->txflt, skb))
goto drop;
if (tfile->socket.sk->sk_filter &&
sk_filter(tfile->socket.sk, skb))
goto drop;
len = run_ebpf_filter(tun, skb, len);
if (len == 0 || pskb_trim(skb, len))
goto drop;
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
goto drop;
skb_tx_timestamp(skb);
/* Orphan the skb - required as we might hang on to it
* for indefinite time.
*/
skb_orphan(skb);
nf_reset(skb);
if (ptr_ring_produce(&tfile->tx_ring, skb))
goto drop;
/* Notify and wake up reader process */
if (tfile->flags & TUN_FASYNC)
kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
rcu_read_unlock();
return NETDEV_TX_OK;
drop:
this_cpu_inc(tun->pcpu_stats->tx_dropped);
skb_tx_error(skb);
kfree_skb(skb);
rcu_read_unlock();
return NET_XMIT_DROP;
}
In the 3.10 kernel, the packet is instead queued on sock->sk_receive_queue, and wake_up_interruptible_poll is called to wake the processes/threads waiting on the tun_file's wait queue to receive the packet. This is essentially the same as 4.18: in 4.18, sk->sk_data_ready is sock_def_readable, which also calls wake_up_interruptible_poll.
/* Net device start xmit */
static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
int txq = skb->queue_mapping;
struct tun_file *tfile;
......
/* Enqueue packet */
skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
/* Notify and wake up reader process */
if (tfile->flags & TUN_FASYNC)
kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
POLLRDNORM | POLLRDBAND);
......
}
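Seen from user space, that wakeup is what makes a blocking read or a poll() on the tun fd return. A minimal reader loop (tun_fd is assumed to be a descriptor opened and configured as in the earlier snippets) could look like this:
/* User-space side of the wakeup: poll() sleeps until tun_net_xmit() makes the
 * fd readable, then each read() returns one packet. */
#include <poll.h>
#include <unistd.h>

void tun_read_loop(int tun_fd)
{
    unsigned char buf[2048];
    struct pollfd pfd = { .fd = tun_fd, .events = POLLIN };

    for (;;) {
        if (poll(&pfd, 1, -1) < 0)        /* woken by sk_data_ready / wake_up_interruptible_poll */
            break;
        if (pfd.revents & POLLIN) {
            ssize_t n = read(tun_fd, buf, sizeof(buf));
            if (n <= 0)
                break;
            /* process one packet of n bytes here */
        }
    }
}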