Linux内核套接字(Socket)的设计与实现

一个套接字是一个通信端点的抽象,就如使用文件描述符来访问一个文件一样,应用程序使用套接字描述符来访问套接字。

在Linux中实现了一套机制,使套接字的实现与文件描述符实现一样,使应用程序可以像访问文件一样访问套接字。许多用文件描述符访问文件的函数如read和write,也同样可以用于套接字访问。

1. Tcp socket是如何工作的

先从建好的连接开始开始介绍:

  1. 内核管理的每一个TCP文件描述符都是一个struct, 它记录TCP相关的信息(如序列号、当前窗口大小等等),以及一个接收缓冲区(receive buffer,或者叫receive queue)和一个写缓冲区(write buffer,或者叫write queue)
  2. 当内核从NIC获取数据包时,它会对数据包进行解码,并根据源IP、源端口、目标IP和目标端口找出与该数据包相关联的TCP连接。此信息用于查找与该连接关联的内存中的struct sock。假设数据包是按顺序的到来的,那么数据有效负载就被复制到套接字的接收缓冲区中。此时,内核将执行read(2)或使用诸如select(2)或epoll_wait(2)等I/O多路复用方式系统调用,唤醒等待此套接字的进程。
  3. 当用户态的进程实际调用文件描述符上的read(2)时,它会导致内核从其接收缓冲区中删除数据,并将该数据复制到此进程调用read(2)所提供的缓冲区中。
  4. 发送数据的工作原理类似。当应用程序调用write(2)时,它将数据从用户提供的缓冲区复制到内核写入队列中。随后,内核将把数据从写队列复制到NIC中,并实际发送数据。如果网络繁忙,如果TCP发送窗口已满,或者如果有流量整形策略等等,从用户实际调用write(2)开始,到向NIC传输数据的实际时间可能会有所延迟。

1.1 读语义

如果接收缓冲区为空,并且用户调用read(2),则系统调用将被阻塞,直到数据可用。

如果接收缓冲区是非空的,并且用户调用read(2),系统调用将立即返回这些可用的数据。如果读取队列中准备好的数据量小于用户提供的缓冲区的大小,则可能发生部分读取。调用方可以通过检查read(2)的返回值来检测到这一点。

如果接收缓冲区已满,而TCP连接的另一端尝试发送更多的数据,内核将拒绝对数据包进行ACK。这只是常规的TCP拥塞控制。

1.2 写语义

如果写入队列未满,并且用户调用写入,则系统调用将成功。如果写入队列有足够的空间,则将复制所有数据。如果写入队列只有部分数据的空间,那么将发生部分写入,并且只有部分数据将被复制到缓冲区。调用方通过检查write(2)的返回值来检查这一点。

如果写入队列已满,并且用户调用写入write(2)),则系统调用将被阻塞。

2. 套接字管理结构:socket和sock

在Linux中定义了一系列的数据结构来描述套接字本身、套接字传送的数据格式、套接字的属性和管理套接字连接状态的数据结构。

  • 管理套接字传送数据的结构:Socket Buffer(skb),struct sk_buff数据结构中存放了套接字接收/发送的数据。
  • 管理协议本身套接字的结构:对每一个新创建的套接字,内核协议栈都会创建struct socket和struct sock两个数据结构。这两个结构就像孪生兄弟,struct socket面向用户空间,struct sock面向内核空间。
struct socket {
	unsigned long flags;
	const struct proto_ops *ops; // 即当用户调用sendmsg时,内核会找到描述符fd对应的struct socket结构,然后调用sock->ops->sendmsg执行特定协议的发送。
	struct file   *file; // file指针指向上面那张图中的struct file结构,通过它,socket便与文件系统关联了起来。
	struct sock   *sk;
    short type;
};

Linux内核套接字(Socket)的设计与实现_第1张图片

/**
  *	struct sock - network layer representation of sockets
  *	@__sk_common: shared layout with inet_timewait_sock
  *	@sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
  *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
  *	@sk_lock:	synchronizer
  *	@sk_rcvbuf: size of receive buffer in bytes
  *	@sk_wq: sock wait queue and async head
  *	@sk_dst_cache: destination cache
  *	@sk_dst_lock: destination cache lock
  *	@sk_policy: flow policy
  *	@sk_rmem_alloc: receive queue bytes committed
  *	@sk_receive_queue: incoming packets
  *	@sk_wmem_alloc: transmit queue bytes committed
  *	@sk_write_queue: Packet sending queue
  *	@sk_async_wait_queue: DMA copied packets
  *	@sk_omem_alloc: "o" is "option" or "other"
  *	@sk_wmem_queued: persistent queue size
  *	@sk_forward_alloc: space allocated forward
  *	@sk_allocation: allocation mode
  *	@sk_sndbuf: size of send buffer in bytes
  *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
  *	@sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
  *	@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
  *	@sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
  *	@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
  *	@sk_gso_max_size: Maximum GSO segment size to build
  *	@sk_lingertime: %SO_LINGER l_linger setting
  *	@sk_backlog: always used with the per-socket spinlock held
  *	@sk_callback_lock: used with the callbacks in the end of this struct
  *	@sk_error_queue: rarely used
  *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
  *			  IPV6_ADDRFORM for instance)
  *	@sk_err: last error
  *	@sk_err_soft: errors that don't cause failure but are the cause of a
  *		      persistent failure not just 'timed out'
  *	@sk_drops: raw/udp drops counter
  *	@sk_ack_backlog: current listen backlog
  *	@sk_max_ack_backlog: listen backlog set in listen()
  *	@sk_priority: %SO_PRIORITY setting
  *	@sk_type: socket type (%SOCK_STREAM, etc)
  *	@sk_protocol: which protocol this socket belongs in this network family
  *	@sk_peercred: %SO_PEERCRED setting
  *	@sk_rcvlowat: %SO_RCVLOWAT setting
  *	@sk_rcvtimeo: %SO_RCVTIMEO setting
  *	@sk_sndtimeo: %SO_SNDTIMEO setting
  *	@sk_rxhash: flow hash received from netif layer
  *	@sk_filter: socket filtering instructions
  *	@sk_protinfo: private area, net family specific, when not using slab
  *	@sk_timer: sock cleanup timer
  *	@sk_stamp: time stamp of last packet received
  *	@sk_socket: Identd and reporting IO signals
  *	@sk_user_data: RPC layer private data
  *	@sk_sndmsg_page: cached page for sendmsg
  *	@sk_sndmsg_off: cached offset for sendmsg
  *	@sk_send_head: front of stuff to transmit
  *	@sk_security: used by security modules
  *	@sk_mark: generic packet mark
  *	@sk_write_pending: a write to stream socket waits to start
  *	@sk_state_change: callback to indicate change in the state of the sock
  *	@sk_data_ready: callback to indicate there is data to be processed
  *	@sk_write_space: callback to indicate there is bf sending space available
  *	@sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
  *	@sk_backlog_rcv: callback to process the backlog
  *	@sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
 */
struct sock {
	/*
	 * Now struct inet_timewait_sock also uses sock_common, so please just
	 * don't add nothing before this first member (__sk_common) --acme
	 */
	struct sock_common	__sk_common;
    /*...*/
	int			sk_rcvbuf;
	socket_lock_t		sk_lock;
	/*
	 * The backlog queue is special, it is always used with
	 * the per-socket spinlock held and requires low latency
	 * access. Therefore we special case it's implementation.
	 */
	struct {
		struct sk_buff *head;
		struct sk_buff *tail;
		int len;
	} sk_backlog;
	struct sk_buff_head	sk_receive_queue;
	struct sk_buff_head	sk_write_queue;
    /*...*/
	void			(*sk_state_change)(struct sock *sk);
	void			(*sk_data_ready)(struct sock *sk, int bytes);
	void			(*sk_write_space)(struct sock *sk);
	void			(*sk_error_report)(struct sock *sk);
  	int			(*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb);  
	void                    (*sk_destruct)(struct sock *sk);
};



/**
 *	struct sock_common - minimal network layer representation of sockets
 *	@skc_node: main hash linkage for various protocol lookup tables
 *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
 *	@skc_refcnt: reference count
 *	@skc_tx_queue_mapping: tx queue number for this connection
 *	@skc_hash: hash value used with various protocol lookup tables
 *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
 *	@skc_family: network address family
 *	@skc_state: Connection state
 *	@skc_reuse: %SO_REUSEADDR setting
 *	@skc_bound_dev_if: bound device index if != 0
 *	@skc_bind_node: bind hash linkage for various protocol lookup tables
 *	@skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
 *	@skc_prot: protocol handlers inside a network family
 *	@skc_net: reference to the network namespace of this socket
 *
 *	This is the minimal network layer representation of sockets, the header
 *	for struct sock and struct inet_timewait_sock.
 */
struct sock_common {
	/*
	 * first fields are not copied in sock_copy()
	 */
	union {
		struct hlist_node	skc_node;
		struct hlist_nulls_node skc_nulls_node;
	};
	atomic_t		skc_refcnt;
	int			skc_tx_queue_mapping;

	union  {
		unsigned int	skc_hash;
		__u16		skc_u16hashes[2];
	};
	unsigned short		skc_family;
	volatile unsigned char	skc_state;
	unsigned char		skc_reuse;
	int			skc_bound_dev_if;
	union {
		struct hlist_node	skc_bind_node;
		struct hlist_nulls_node skc_portaddr_node;
	};
	struct proto		*skc_prot;
#ifdef CONFIG_NET_NS
	struct net	 	*skc_net;
#endif
};

Linux内核套接字(Socket)的设计与实现_第2张图片

3. 套接字和文件

Linux内核套接字(Socket)的设计与实现_第3张图片
其中task_struct表示一个进程,files_struct中的fd_array[]表示该进程打开的所有描述符,对于套接字来说,与其他类型文件的区别就是最终f_op指向的是socket_file_ops。


/*
 *	Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
 *	in the operation structures but are done directly via the socketcall() multiplexor.
 */

static const struct file_operations socket_file_ops = {
	.owner =	THIS_MODULE,
	.llseek =	no_llseek,
	.aio_read =	sock_aio_read,
	.aio_write =	sock_aio_write,
	.poll =		sock_poll,
	.unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl = compat_sock_ioctl,
#endif
	.mmap =		sock_mmap,
	.open =		sock_no_open,	/* special open code to disallow open via /proc */
	.release =	sock_close,
	.fasync =	sock_fasync,
	.sendpage =	sock_sendpage,
	.splice_write = generic_splice_sendpage,
	.splice_read =	sock_splice_read,
};

参考

走进Linux内核网络 套接字的秘密—socket与sock

TCP Socket 是如何工作的?

你可能感兴趣的:(linux内核,网络协议)