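TCMU Study Notes

  • An introduction to TCMU, the userspace implementation on the iSCSI target side
  • How TCMU works
  • A walkthrough of sample code that connects TCMU to a custom backend store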

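Table of Contents

      • TCMU Overview
      • Advantages of TCMU
      • Design Constraints
      • Implementation Overview
        • MailBox
        • Command Ring
          • TCMU_OP_CMD
          • TCMU_OP_PAD
        • Data Area
      • Dynamic Configuration Reloading in TCMU
      • TCMU Plugin Handler
        • Approach 1: Take Over Command Handling Entirely
        • Approach 2: Implement Only the Concrete I/O Operations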

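TCMU Overview

  • TCM is another name for LIO, the in-kernel iSCSI target implementation. Existing TCM targets run in the kernel; TCMU (TCM in Userspace) allows userspace programs to act as the storage behind an LIO target.
  • The kernel provides modules for the different SCSI transport protocols. TCM also modularizes data storage: there are existing modules for files (fileio), block devices (block), RAM (ramdisk), and other SCSI devices (pscsi). These are called backstores or storage engines, and the built-in ones are implemented entirely in the kernel.
  • LIO can only use in-kernel backstores. Other userspace target solutions, such as tgt, can use Gluster's GLFS or Ceph's RBD as backstores. The target then acts as a translator, letting initiators store data in these non-traditional networked storage systems while still speaking only standard protocols.
  • Adding such support to LIO is harder because LIO is entirely kernel code. There are two ways to solve this: port the GLFS and RBD APIs and protocol libraries to the kernel, or create a userspace pass-through backstore for LIO, called TCMU. The first option is far more work than the second.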

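Advantages of TCMU

  • Besides making it easy to support RBD and GLFS, TCMU makes it much simpler to develop new backstores. Combined with the LIO loopback fabric, TCMU becomes a mechanism similar to FUSE (Filesystem in Userspace), except that it sits at the SCSI layer instead of the filesystem layer.
  • The disadvantage is that there are more distinct components to configure, and each of them can malfunction. This is unavoidable if the amount of work is to stay small; the hope is that such failures are not fatal.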

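Design Constraints

  • High performance: high throughput, low latency.
  • Cleanly handle the cases where userspace:
    • 1) never attaches
    • 2) hangs
    • 3) dies
    • 4) misbehaves
  • Allow future flexibility in both the userspace and kernel implementations.
  • Reasonable memory usage.
  • Simple to configure and run.
  • Simple to write a userspace backend.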
    [Figure: TCMU-LIO.png (https://github.com/zjs1224522500/files-and-images/blob/master/blog/pic/TCMU-LIO.png?raw=true)]

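Implementation Overview

  • The core of the TCMU interface is a memory region shared between the kernel and userspace. The region contains:
      1. a control area (the mailbox);
      2. a lockless producer/consumer circular buffer for passing commands and returning status;
      3. an in/out data buffer area.
  • TCMU uses the pre-existing UIO subsystem. UIO allows device drivers to be developed in userspace, which is conceptually very close to the TCMU use case, except that instead of a physical device, TCMU implements a memory-mapped layout designed for SCSI commands. Using UIO also helps TCMU with device introspection (for example, letting userspace determine how large the shared region is) and with the signaling mechanisms in both directions.
  • The memory region contains no embedded pointers, only offsets relative to the start of the region. This lets the ring keep working if the user process dies and is restarted with the region mapped at a different virtual address. See target_core_user.h for the struct definitions.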
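The offset-only rule can be illustrated with a small sketch. The helper names region_ptr and region_off are hypothetical and not part of TCMU; the code only shows why an offset stays meaningful when the same region is mapped at a different address, while a raw pointer would not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn an offset stored in the shared region into a
 * pointer that is valid in this process's mapping of the region.
 */
static inline void *region_ptr(void *map_base, size_t map_len, uint64_t off)
{
	assert(off < map_len);	/* never trust an offset blindly */
	return (uint8_t *)map_base + off;
}

/*
 * Hypothetical helper: the reverse direction, used before a reference is
 * stored into the shared region.
 */
static inline uint64_t region_off(const void *map_base, const void *p)
{
	return (uint64_t)((const uint8_t *)p - (const uint8_t *)map_base);
}
```

If a process restarts and maps the region at a new base, calling region_ptr with the new base and the stored offset reaches the same byte of shared memory.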

MailBox

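  • The mailbox always sits at the start of the shared memory region. It contains a version, the offset at which the command ring starts, the size of the command ring, and the head and tail pointers that the kernel and userspace use to place commands on the ring and to mark them as completed.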
struct tcmu_mailbox {
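	__u16 version; // 2 (userspace should abort if it sees any other value)
	__u16 flags; // TCMU_MAILBOX_FLAG_CAP_OOOC: out-of-order completion is supported
	__u32 cmdr_off; // offset of the command ring from the start of the memory region
	__u32 cmdr_size; // size of the command ring; it does NOT need to be a power of two

	__u32 cmd_head; // modified by the kernel to indicate that a command has been placed on the ring

	/* Updated by user. On its own cacheline */
	__u32 cmd_tail __attribute__((__aligned__(ALIGN_SIZE))); // modified by userspace to indicate that it has finished processing a command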

} __attribute__((packed));
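
A minimal sketch of the sanity checks a userspace consumer might run right after mapping the region. The struct mailbox_view below is a simplified, hypothetical copy of the leading mailbox fields (real code should use the definition from target_core_user.h), and mailbox_usable is an illustrative name. Only the version rule comes from the notes above; the bounds checks are defensive assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the leading fields of struct tcmu_mailbox. */
struct mailbox_view {
	uint16_t version;
	uint16_t flags;
	uint32_t cmdr_off;
	uint32_t cmdr_size;
};

/* Return true if the mailbox looks safe to use for a mapping of map_len bytes. */
static bool mailbox_usable(const struct mailbox_view *mb, size_t map_len)
{
	assert(mb != NULL);

	if (mb->version != 2)		/* any other version: give up */
		return false;
	if (mb->cmdr_size == 0)		/* an empty ring cannot carry commands */
		return false;
	/* The whole command ring must lie inside the mapped region. */
	if (mb->cmdr_off > map_len || mb->cmdr_size > map_len - mb->cmdr_off)
		return false;
	return true;
}
```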

Command Ring

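  • Commands are placed on the ring by the kernel, which advances mailbox.cmd_head by the size of the command, modulo cmdr_size, and then signals userspace via uio_event_notify(). Once a command has completed, userspace updates mailbox.cmd_tail in the same way and signals the kernel with a 4-byte write(). When cmd_head equals cmd_tail, the ring is empty: no commands are waiting to be processed by userspace.
  • TCMU commands are 8-byte aligned. Each starts with a common header containing len_op, a 32-bit value that stores the length of the command and uses its lowest 3 bits for the opcode. The header also contains cmd_id and two flags fields: kflags, set by the kernel, and uflags, set by userspace.
  • Only two opcodes are currently defined: TCMU_OP_CMD and TCMU_OP_PAD.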
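
The ring arithmetic described above can be sketched in a few lines. The names OP_MASK, entry_op, entry_len, ring_empty and ring_advance are made up for this illustration; a real consumer should rely on the definitions in target_core_user.h instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OP_MASK 0x7u	/* the low 3 bits of len_op hold the opcode */

/* Split len_op into its two parts. */
static inline uint32_t entry_op(uint32_t len_op)  { return len_op & OP_MASK; }
static inline uint32_t entry_len(uint32_t len_op) { return len_op & ~OP_MASK; }

/* The ring is empty when head and tail meet. */
static inline bool ring_empty(uint32_t head, uint32_t tail)
{
	return head == tail;
}

/* Advance a head or tail index by one entry, wrapping at ring_size. */
static inline uint32_t ring_advance(uint32_t idx, uint32_t len, uint32_t ring_size)
{
	assert(ring_size != 0 && len % 8 == 0);	/* entries are 8-byte aligned */
	return (idx + len) % ring_size;
}
```

Because entry lengths are multiples of 8, the low three bits of the length are always zero, which is what frees them up for the opcode. The modulo (rather than a bit mask) is what allows cmdr_size to be something other than a power of two.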
TCMU_OP_CMD
  • When the opcode is CMD, the entry in the command ring is a struct tcmu_cmd_entry:
struct tcmu_cmd_entry {
	struct tcmu_cmd_entry_hdr hdr;

	union {
		struct {
			uint32_t iov_cnt;  // The size of iov array.
			uint32_t iov_bidi_cnt; // size of iov array for Data-In
			uint32_t iov_dif_cnt; //
			uint64_t cdb_off;  // Command Descriptor Block (CDB) offset from the start of the shared region
			uint64_t __pad1;
			uint64_t __pad2;
			struct iovec iov[0]; // iov array - buffer
		} req;
		
		struct {
			uint8_t scsi_status; // Command executed, return status.
			uint8_t __pad1;
			uint16_t __pad2;
			uint32_t __pad3;
			char sense_buffer[TCMU_SENSE_BUFFERSIZE]; // response data
		} rsp;
	};

} __attribute__((packed));


/*
 * Only a few opcodes, and length is 8-byte aligned, so use low bits for opcode.
 */
struct tcmu_cmd_entry_hdr {
	__u32 len_op;
	__u16 cmd_id;
	__u8 kflags;
#define TCMU_UFLAG_UNKNOWN_OP 0x1
	__u8 uflags;

} __attribute__((packed));
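  • Userspace finds the SCSI CDB (Command Descriptor Block) via tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the overall shared memory region, not from the entry or from the ring.
  • The data in/out buffers are accessed through the req.iov[] array. iov_cnt holds the number of iov[] entries needed to describe either the Data-In or the Data-Out buffers. For bidirectional commands, iov_cnt specifies how many iovec entries cover the Data-Out area, and iov_bidi_cnt specifies how many entries immediately after those cover the Data-In area. As with other fields, iov.iov_base is an offset from the start of the region.
  • When a command completes, userspace sets rsp.scsi_status and, if necessary, rsp.sense_buffer.
  • Userspace then increments mailbox.cmd_tail by entry.hdr.length (the length stored in hdr.len_op), modulo cmdr_size, and signals the kernel through the UIO method: a 4-byte write to the file descriptor.
  • If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel can handle out-of-order completions, and userspace may complete commands in an order different from the one in which they arrived. Because the kernel still processes entries in the order they appear in the command ring, userspace must update cmd->id when completing a command (in effect taking over the original command's entry).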
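
Since iov_base arrives as an offset, a consumer has to convert each entry before touching the data. The following is a sketch under that assumption; fixup_iovecs is a hypothetical name, and the bounds check is an extra precaution rather than something the interface mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Rewrite iov_base from "offset into the shared region" to a pointer in
 * this process, and return the total number of bytes described.
 */
static size_t fixup_iovecs(void *map_base, size_t map_len,
			   struct iovec *iov, size_t cnt)
{
	size_t total = 0;

	for (size_t i = 0; i < cnt; i++) {
		uintptr_t off = (uintptr_t)iov[i].iov_base;

		/* Each buffer must lie entirely inside the mapping. */
		assert(off <= map_len && iov[i].iov_len <= map_len - off);
		iov[i].iov_base = (uint8_t *)map_base + off;
		total += iov[i].iov_len;
	}
	return total;
}
```

For a bidirectional command, the function would be called twice: once on the first iov_cnt entries (Data-Out) and once on the iov_bidi_cnt entries that follow them (Data-In).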
TCMU_OP_PAD
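  • When the opcode is PAD, userspace only updates cmd_tail, because the entry is a no-op. The kernel inserts PAD entries to ensure that each CMD entry is contiguous within the command ring.
  • More opcodes may be added in the future. If userspace encounters an opcode it cannot handle, it must set the UNKNOWN_OP bit (bit 0) in hdr.uflags, update cmd_tail, and go on to process any further commands.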
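
Putting the opcode rules together, a consumer loop might look roughly like the sketch below. It works on a plain byte array instead of a real UIO mapping. The struct entry_hdr, the numeric opcode values, and the helper names are simplified stand-ins chosen for this illustration; the real definitions live in target_core_user.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OP_PAD = 0, OP_CMD = 1 };	/* values assumed for this sketch */
#define OP_MASK          0x7u
#define UFLAG_UNKNOWN_OP 0x1u

/* Simplified stand-in for struct tcmu_cmd_entry_hdr. */
struct entry_hdr {
	uint32_t len_op;
	uint16_t cmd_id;
	uint8_t  kflags;
	uint8_t  uflags;
};

/*
 * Walk the ring from tail to head and return the new tail.
 * *handled counts CMD entries; a real consumer would execute the SCSI
 * command at that point and fill in the response before moving on.
 */
static uint32_t drain_ring(uint8_t *ring, uint32_t ring_size,
			   uint32_t head, uint32_t tail, unsigned *handled)
{
	while (tail != head) {
		struct entry_hdr hdr;
		memcpy(&hdr, ring + tail, sizeof(hdr));

		uint32_t op = hdr.len_op & OP_MASK;
		uint32_t len = hdr.len_op & ~OP_MASK;
		assert(len != 0);	/* a zero length would loop forever */

		if (op == OP_CMD) {
			(*handled)++;
		} else if (op != OP_PAD) {
			/* Unknown opcode: flag it and keep going. */
			hdr.uflags |= UFLAG_UNKNOWN_OP;
			memcpy(ring + tail, &hdr, sizeof(hdr));
		}
		/* PAD entries fall through: only the tail moves. */
		tail = (tail + len) % ring_size;
	}
	return tail;
}

/* Hypothetical helper for experiments: write one entry header into the ring. */
static void put_entry(uint8_t *ring, uint32_t off, uint32_t len, uint32_t op)
{
	struct entry_hdr hdr = { .len_op = len | op };
	memcpy(ring + off, &hdr, sizeof(hdr));
}
```

After the loop, a real consumer would store the returned tail into mailbox.cmd_tail and notify the kernel with the 4-byte write described earlier.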

Data Area

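  • The data area comes after the command ring. The TCMU interface does not define any structure for this area, and userspace should access only the parts referenced by pending iovs.

[Figure: User-Kernel-Communication.png (https://github.com/zjs1224522500/files-and-images/blob/master/blog/pic/User-Kernel-Communication.png?raw=true)]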

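Dynamic Configuration Reloading in TCMU

  • TCMU runs a daemon that continuously watches the TCMU configuration file for changes and updates the corresponding settings, such as the log level and the log output path.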

TCMU Plugin Handler

  • To plug handling for a custom backend store into the TCMU interfaces, you implement a custom TCMU handler:
struct tcmur_handler {
	const char *name;	/* Human-friendly name */
	const char *subtype;	/* Name for cfgstring matching */
	const char *cfg_desc;	/* Description of this backstore's config string */

	void *opaque;		/* Handler private data. */

	/*
	 * As much as possible, check that the cfgstring will result
	 * in a working device when given to us as dev->cfgstring in
	 * the ->open() call.
	 *
	 * This function is optional but gives configuration tools a
	 * chance to warn users in advance if the device they're
	 * trying to create is invalid.
	 *
	 * Returns true if string is valid. Only if false, set *reason
	 * to a string that says why. The string will be free()ed.
	 * Suggest using asprintf().
	 */
	bool (*check_config)(const char *cfgstring, char **reason);

	int (*reconfig)(struct tcmu_device *dev, struct tcmulib_cfg_info *cfg);

	/* Per-device added/removed callbacks */
	int (*open)(struct tcmu_device *dev, bool reopen);
	void (*close)(struct tcmu_device *dev);

	/*
	 * If > 0, runner will execute up to nr_threads IO callouts from
	 * threads.
	 * if 0, runner will call IO callouts from the cmd proc thread or
	 * completion context for compound commands.
	 */
	int nr_threads;

	/*
	 * Async handle_cmd only handlers return:
	 *
	 * - TCMU_STS_OK if the command has been executed successfully
	 * - TCMU_STS_NOT_HANDLED if opcode is not handled
	 * - TCMU_STS_ASYNC_HANDLED if opcode is handled asynchronously
	 * - Non TCMU_STS_OK code indicating failure
	 * - TCMU_STS_PASSTHROUGH_ERR For handlers that require low level
	 *   SCSI processing and want to setup their own sense buffers.
	 *
	 * Handlers that set nr_threads > 0 and async handlers
	 * that implement handle_cmd and the IO callouts below return:
	 *
	 * - TCMU_STS_OK if the handler has queued the command.
	 * - TCMU_STS_NOT_HANDLED if the command is not supported.
	 * - TCMU_STS_NO_RESOURCE if the handler was not able to allocate
	 *   resources for the command.
	 *
	 * If TCMU_STS_OK is returned from the callout the handler must call
	 * the tcmulib_cmd->done function with TCMU_STS return code.
	 */
	handle_cmd_fn_t handle_cmd;

	/*
	 * Below callbacks are only executed by generic_handle_cmd.
	 * Returns:
	 * - TCMU_STS_OK if the handler has queued the command.
	 * - TCMU_STS_NO_RESOURCE if the handler was not able to allocate
	 *   resources for the command.
	 *
	 * If TCMU_STS_OK is returned from the callout the handler must call
	 * the tcmulib_cmd->done function with TCMU_STS return code.
	 */
	rw_fn_t write;
	rw_fn_t read;
	flush_fn_t flush;
	unmap_fn_t unmap;

	/*
	 * If the lock is acquired and the tag is not TCMU_INVALID_LOCK_TAG,
	 * it must be associated with the lock and returned by get_lock_tag on
	 * local and remote nodes. When unlock is successful, the tag
	 * associated with the lock must be deleted.
	 *
	 * Returns a TCMU_STS indicating success/failure.
	 */
	int (*lock)(struct tcmu_device *dev, uint16_t tag);
	int (*unlock)(struct tcmu_device *dev);

	/*
	 * Return tag set in lock call in tag buffer and a TCMU_STS
	 * indicating success/failure.
	 */
	int (*get_lock_tag)(struct tcmu_device *dev, uint16_t *tag);

	/*
	 * Must return TCMUR_DEV_LOCK state value.
	 */
	int (*get_lock_state)(struct tcmu_device *dev);

	/*
	 * internal field, don't touch this
	 *
	 * indicates to tcmu-runner whether this is an internal handler loaded
	 * via dlopen or an external handler registered via dbus. In the
	 * latter case opaque will point to a struct dbus_info.
	 */
	bool _is_dbus_handler;

	/*
	 * Update the logdir called by dynamic config thread.
	 */
	bool (*update_logdir)(void);
};

Approach 1: Take Over Command Handling Entirely

  • Taking file_optical.c as an example:
/*
 * Return scsi status or TCMU_STS_NOT_HANDLED
 */
static int fbo_handle_cmd(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
{
	uint8_t *cdb = cmd->cdb;
	struct iovec *iovec = cmd->iovec;
	size_t iov_cnt = cmd->iov_cnt;
	uint8_t *sense = cmd->sense_buf;
	struct fbo_state *state = tcmur_dev_get_private(dev);
	bool do_verify = false;
	int ret;

	/* Check for format in progress */
	/* Certain commands can be executed even if a format is in progress */
	if (state->flags & FBO_FORMATTING &&
	    cdb[0] != INQUIRY &&
	    cdb[0] != REQUEST_SENSE &&
	    cdb[0] != GET_CONFIGURATION &&
	    cdb[0] != GPCMD_GET_EVENT_STATUS_NOTIFICATION) {
		tcmu_sense_set_key_specific_info(sense, state->format_progress);
		ret = TCMU_STS_FRMT_IN_PROGRESS;
		return ret;
	}

	switch(cdb[0]) {
	case TEST_UNIT_READY:
		ret = tcmu_emulate_test_unit_ready(cdb, iovec, iov_cnt);
		break;
	case REQUEST_SENSE:
		ret = fbo_emulate_request_sense(dev, cdb, iovec, iov_cnt, sense);
		break;
	case FORMAT_UNIT:
		ret = fbo_emulate_format_unit(dev, cdb, iovec, iov_cnt, sense);
		break;
	case READ_6:
	case READ_10:
	case READ_12:
		ret = fbo_read(dev, cdb, iovec, iov_cnt, sense);
		break;
	case WRITE_VERIFY:
		do_verify = true;
	case WRITE_6:
	case WRITE_10:
	case WRITE_12:
		ret = fbo_write(dev, cdb, iovec, iov_cnt, sense, do_verify);
		break;
	case INQUIRY:
		ret = fbo_emulate_inquiry(cdb, iovec, iov_cnt, sense);
		break;
	case MODE_SELECT:
	case MODE_SELECT_10:
		ret = fbo_emulate_mode_select(cdb, iovec, iov_cnt, sense);
		break;
	case MODE_SENSE:
	case MODE_SENSE_10:
		ret = fbo_emulate_mode_sense(cdb, iovec, iov_cnt, sense);
		break;
	case START_STOP:
		ret = tcmu_emulate_start_stop(dev, cdb);
		break;
	case ALLOW_MEDIUM_REMOVAL:
		ret = fbo_emulate_allow_medium_removal(dev, cdb, sense);
		break;
	case READ_FORMAT_CAPACITIES:
		ret = fbo_emulate_read_format_capacities(dev, cdb, iovec,
							 iov_cnt, sense);
		break;
	case READ_CAPACITY:
		if ((cdb[1] & 0x01) || (cdb[8] & 0x01))
			/* Reserved bits for MM logical units */
			return TCMU_STS_INVALID_CDB;
		else
			ret = tcmu_emulate_read_capacity_10(state->num_lbas,
							    state->block_size,
							    cdb, iovec,
							    iov_cnt);
		break;
	case VERIFY:
		ret = fbo_verify(dev, cdb, iovec, iov_cnt, sense);
		break;
	case SYNCHRONIZE_CACHE:
		ret = fbo_synchronize_cache(dev, cdb, sense);
		break;
	case READ_TOC:
		ret = fbo_emulate_read_toc(dev, cdb, iovec, iov_cnt, sense);
		break;
	case GET_CONFIGURATION:
		ret = fbo_emulate_get_configuration(dev, cdb, iovec, iov_cnt,
						    sense);
		break;
	case GPCMD_GET_EVENT_STATUS_NOTIFICATION:
		ret = fbo_emulate_get_event_status_notification(dev, cdb,
								iovec, iov_cnt,
								sense);
		break;
	case READ_DISC_INFORMATION:
		ret = fbo_emulate_read_disc_information(dev, cdb, iovec,
							iov_cnt, sense);
		break;
	case READ_DVD_STRUCTURE:
		ret = fbo_emulate_read_dvd_structure(dev, cdb, iovec, iov_cnt,
						     sense);
		break;
	case MECHANISM_STATUS:
		ret = fbo_emulate_mechanism_status(dev, cdb, iovec, iov_cnt,
						   sense);
		break;
	default:
		ret = TCMU_STS_NOT_HANDLED;
	}
	return ret;
}
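
A handler that takes over command processing has to decode each CDB itself. As a small illustration, the sketch below extracts the logical block address and transfer length from a READ(10) or WRITE(10) CDB, where the LBA occupies bytes 2 to 5 and the transfer length bytes 7 to 8, both big-endian. The struct and function names are made up for this example and are not taken from file_optical.c.

```c
#include <assert.h>
#include <stdint.h>

#define OP_READ_10  0x28
#define OP_WRITE_10 0x2A

/* Decoded fields of a 10-byte read/write CDB. */
struct rw10 {
	uint32_t lba;		/* first logical block */
	uint16_t blocks;	/* number of blocks to transfer */
};

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode a READ(10) or WRITE(10) CDB. */
static struct rw10 decode_rw10(const uint8_t *cdb)
{
	assert(cdb[0] == OP_READ_10 || cdb[0] == OP_WRITE_10);

	struct rw10 r = {
		.lba = get_be32(&cdb[2]),
		.blocks = get_be16(&cdb[7]),
	};
	return r;
}

/* Convert an LBA to a byte offset in the backing store. */
static uint64_t byte_offset(uint32_t lba, uint32_t block_size)
{
	return (uint64_t)lba * block_size;
}
```

The byte offset and the length in bytes (blocks times block size) are what a read or write routine ultimately passes to the backing store.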

Approach 2: Implement Only the Concrete I/O Operations

  • Taking file_example.c as an example:
/*
 * Copyright (c) 2014 Red Hat, Inc.
 *
 * This file is licensed to you under your choice of the GNU Lesser
 * General Public License, version 2.1 or any later version (LGPLv2.1 or
 * later), or the Apache License 2.0.
 */

/*
 * Example code to demonstrate how a TCMU handler might work.
 *
 * Using the example of backing a device by a file to demonstrate:
 *
 * 1) Registering with tcmu-runner
 * 2) Parsing the handler-specific config string as needed for setup
 * 3) Opening resources as needed
 * 4) Handling SCSI commands and using the handler API
 */

#define _GNU_SOURCE
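#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <errno.h>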

#include "scsi_defs.h"
#include "libtcmu.h"
#include "tcmu-runner.h"
#include "tcmur_device.h"

struct file_state {
	int fd;
};

static int file_open(struct tcmu_device *dev, bool reopen)
{
	struct file_state *state;
	char *config;

	state = calloc(1, sizeof(*state));
	if (!state)
		return -ENOMEM;

	// Attach the per-device state to the tcmu device.
	tcmur_dev_set_private(dev, state);

	// The file path is the part of the cfgstring after the first '/'.
	config = strchr(tcmu_dev_get_cfgstring(dev), '/');
	if (!config) {
		tcmu_err("no configuration found in cfgstring\n");
		goto err;
	}
	config += 1; /* get past '/' */

    // Enable the tcmu write cache.(Set the value of tcmu_device as true)
	tcmu_dev_set_write_cache_enabled(dev, 1);

    // Open the file with path.(With mode)
	state->fd = open(config, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
	if (state->fd == -1) {
		tcmu_err("could not open %s: %m\n", config);
		goto err;
	}

	tcmu_dbg("config %s\n", tcmu_dev_get_cfgstring(dev));

	return 0;

err:
	free(state);
	return -EINVAL;
}

static void file_close(struct tcmu_device *dev)
{
	// Get the file state of tcmu_device.
	struct file_state *state = tcmur_dev_get_private(dev);

    // Close the file
	close(state->fd);

	// free the state
	free(state);
}

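/**
 * Read `length` bytes starting at `offset` from the backing file.
 *
 * @param *dev              tcmu device
 * @param *cmd              the command being handled (not used in this function)
 * @param *iov              iovec array that receives the data
 * @param iov_cnt           number of entries in the iovec array
 * @param length            number of bytes to read
 * @param offset            byte offset in the file at which to start
 *
 * return                   TCMU_STS_OK on success, TCMU_STS_RD_ERR on failure.
 */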
static int file_read(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
		     struct iovec *iov, size_t iov_cnt, size_t length,
		     off_t offset)
{
	// Get the file state of the tcmu device.
	struct file_state *state = tcmur_dev_get_private(dev);
	size_t remaining = length;
	ssize_t ret;

    // Read the file in loop
	while (remaining) {
		// Read the data and store in the iov array. Return the data size.
		ret = preadv(state->fd, iov, iov_cnt, offset);
		if (ret < 0) {
			tcmu_err("read failed: %m\n");
			ret = TCMU_STS_RD_ERR;
			goto done;
		}

		if (ret == 0) {
			/* EOF, then zeros the iovecs left */
			tcmu_iovec_zero(iov, iov_cnt);
			break;
		}

        // Consume the iov array. 
		tcmu_iovec_seek(iov, ret);
		// Move the offset
		offset += ret;
		// Reduce the remaining length and continue reading.
		remaining -= ret;
	}
	// Read finished, and status is OK.
	ret = TCMU_STS_OK;
done:
	return ret;
}

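/**
 * Write `length` bytes starting at `offset` to the backing file.
 *
 * @param *dev              tcmu device
 * @param *cmd              the command being handled (not used in this function)
 * @param *iov              iovec array that holds the data to write
 * @param iov_cnt           number of entries in the iovec array
 * @param length            number of bytes to write
 * @param offset            byte offset in the file at which to start
 *
 * return                   TCMU_STS_OK on success, TCMU_STS_WR_ERR on failure.
 */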
static int file_write(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
		      struct iovec *iov, size_t iov_cnt, size_t length,
		      off_t offset)
{
	// Get the file state of the tcmu device.
	struct file_state *state = tcmur_dev_get_private(dev);
	size_t remaining = length;
	ssize_t ret;

    // Write the file in loop
	while (remaining) {
		// Write the data in the iov array to the file. Returns the number of bytes written.
		ret = pwritev(state->fd, iov, iov_cnt, offset);
		if (ret < 0) {
			tcmu_err("write failed: %m\n");
			ret = TCMU_STS_WR_ERR;
			goto done;
		}
		// Consume the written bytes from the iov array.
		tcmu_iovec_seek(iov, ret);
		// Move the offset.
		offset += ret;
		// Change the length and continue to write.
		remaining -= ret;
	}
	ret = TCMU_STS_OK;
done:
	return ret;
}

static int file_flush(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
{
	// Get the file state of tcmu_device.
	struct file_state *state = tcmur_dev_get_private(dev);
	int ret;

	// Sync(Flush) the data in page cache to disk.
	if (fsync(state->fd)) {
		tcmu_err("sync failed\n");
		ret = TCMU_STS_WR_ERR;
		goto done;
	}
	ret = TCMU_STS_OK;
done:
	return ret;
}

static int file_reconfig(struct tcmu_device *dev, struct tcmulib_cfg_info *cfg)
{
	switch (cfg->type) {
	// Extend or Reduce the size of file.
	case TCMULIB_CFG_DEV_SIZE:
		/*
		 * TODO - For open/reconfig we should make sure the FS the
		 * file is on is large enough for the requested size. For
		 * now assume we can grow the file and return 0.
		 */
		return 0;
	case TCMULIB_CFG_DEV_CFGSTR:
	// Handle the write cache.
	case TCMULIB_CFG_WRITE_CACHE:
	default:
		return -EOPNOTSUPP;
	}
}

static const char file_cfg_desc[] =
	"The path to the file to use as a backstore.";

// Fill in the tcmur_handler with the static functions defined in this file.
static struct tcmur_handler file_handler = {
	.cfg_desc = file_cfg_desc,

	.reconfig = file_reconfig,

	.open = file_open,
	.close = file_close,
	.read = file_read,
	.write = file_write,
	.flush = file_flush,
	.name = "File-backed Handler (example code)",
	.subtype = "file",
	.nr_threads = 2,
};

/* Entry point must be named "handler_init". */
int handler_init(void)
{
	// Register file_handler with tcmu-runner.
	return tcmur_register_handler(&file_handler);
}
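
The read and write loops in this example depend on the iovec array being advanced after every partial transfer, so that the next preadv() or pwritev() call continues where the previous one stopped. The sketch below shows one way such an advance can work. iovec_seek and iov_remaining are stand-ins written for this illustration; they are not the libtcmu implementation of tcmu_iovec_seek, and the extra cnt parameter is an assumption added to allow a bounds check.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Total number of bytes still described by the iovec array. */
static size_t iov_remaining(const struct iovec *iov, size_t cnt)
{
	size_t total = 0;

	for (size_t i = 0; i < cnt; i++)
		total += iov[i].iov_len;
	return total;
}

/*
 * Consume `count` bytes from the front of the iovec array in place.
 * Fully consumed entries end up with iov_len == 0; a partially consumed
 * entry has its base moved forward and its length reduced.
 */
static void iovec_seek(struct iovec *iov, size_t cnt, size_t count)
{
	assert(count <= iov_remaining(iov, cnt));

	while (count) {
		if (count >= iov->iov_len) {
			count -= iov->iov_len;
			iov->iov_len = 0;
			iov++;
		} else {
			iov->iov_base = (char *)iov->iov_base + count;
			iov->iov_len -= count;
			count = 0;
		}
	}
}
```

Zero-length entries are simply skipped by preadv() and pwritev(), which is why the loops can keep passing the same array and count on every iteration.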
