kernel 3.1.5
block/blk-core.c: manages request queues and requests.
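1. The EXPORT_SYMBOL macro:
#define __EXPORT_SYMBOL(sym, sec) \
extern typeof(sym) sym; \
__CRC_SYMBOL(sym, sec) \
static const char __kstrtab_##sym[] \
__attribute__((section("__ksymtab_strings"), aligned(1))) \
= MODULE_SYMBOL_PREFIX #sym; \
static const struct kernel_symbol __ksymtab_##sym \
__used \
__attribute__((section("___ksymtab" sec "+" #sym), unused)) \
= { (unsigned long)&sym, __kstrtab_##sym }
The name of sym is first stored as the string __kstrtab_##sym in the __ksymtab_strings section. The address of sym and the pointer __kstrtab_##sym are then stored in __ksymtab_##sym, a struct kernel_symbol placed in the "___ksymtab" sec "+" #sym section.
__used: tells the compiler to emit the object even if it finds no reference to it.
unused: applies to functions and variables; it marks them as possibly unused so that the compiler does not warn about them.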
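The stringify (#sym) and token-paste (##sym) steps of the macro can be tried out in userspace. The following is only a simplified sketch: the demo_ names are hypothetical, the section placement, CRC and MODULE_SYMBOL_PREFIX are left out, and the address is kept as a pointer instead of an unsigned long.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct kernel_symbol. */
struct demo_kernel_symbol {
    const void *value;  /* address of the exported object */
    const char *name;   /* points at the name string */
};

/* Simplified export macro: no sections, no CRC, no symbol prefix. */
#define DEMO_EXPORT_SYMBOL(sym)                                \
    static const char demo_kstrtab_##sym[] = #sym;             \
    static const struct demo_kernel_symbol demo_ksymtab_##sym  \
        __attribute__((unused)) = { &sym, demo_kstrtab_##sym }

int demo_counter = 42;
DEMO_EXPORT_SYMBOL(demo_counter);

/* Looks up the single exported object by name; NULL if the name differs. */
const void *demo_lookup(const char *name)
{
    assert(name != NULL);
    if (strcmp(demo_ksymtab_demo_counter.name, name) == 0)
        return demo_ksymtab_demo_counter.value;
    return NULL;
}
```

The real macro additionally places both objects in dedicated ELF sections so that the module loader can walk them as tables; this sketch only shows how the name string and the address end up in one record.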
2. drive_stat_acct: updates the I/O statistics of the partition (part) that a request targets.
3. blk_queue_congestion_threshold: computes the queue's congestion thresholds nr_congestion_on and nr_congestion_off.
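As far as I recall the 3.x source, the two thresholds are derived from nr_requests with simple integer arithmetic. The sketch below reproduces that arithmetic in userspace; struct demo_queue and demo_congestion_threshold are hypothetical names, and the formula should be checked against the actual blk-core.c.

```c
#include <assert.h>

/* Hypothetical subset of struct request_queue. */
struct demo_queue {
    int nr_requests;
    int nr_congestion_on;   /* mark congested at or above this count */
    int nr_congestion_off;  /* clear congested below this count */
};

void demo_congestion_threshold(struct demo_queue *q)
{
    int nr;

    assert(q->nr_requests > 0);

    /* "on" threshold: 7/8 of the queue depth plus one, capped at the depth */
    nr = q->nr_requests - (q->nr_requests / 8) + 1;
    if (nr > q->nr_requests)
        nr = q->nr_requests;
    q->nr_congestion_on = nr;

    /* "off" threshold: lower than "on" to give hysteresis, at least 1 */
    nr = q->nr_requests - (q->nr_requests / 8) - (q->nr_requests / 16) - 1;
    if (nr < 1)
        nr = 1;
    q->nr_congestion_off = nr;
}
```

With nr_requests = 128 this gives nr_congestion_on = 113 and nr_congestion_off = 103, so the congested flag is set later than it is cleared.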
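4. blk_get_backing_dev_info: returns the backing_dev_info of the request_queue that belongs to a block_device.
5. blk_rq_init: initializes a request and fills in some of its fields from the request_queue.
6. req_bio_endio: end-io handling for one bio of a request. It subtracts the completed bytes from bio->bi_size; if bio->bi_size reaches zero and the request is not flagged REQ_FLUSH_SEQ, it calls bio_endio.
bio_endio calls bio->bi_end_io, which performs the actual end-io work (it differs for disks, filesystems, flush and so on).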
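The byte accounting in item 6 can be modelled in a few lines. This is a userspace sketch under assumptions: struct demo_bio, DEMO_REQ_FLUSH_SEQ and the counter field are made up for illustration, and the error and integrity handling of the real function is omitted.

```c
#include <assert.h>

#define DEMO_REQ_FLUSH_SEQ 0x1u  /* hypothetical flag value */

/* Hypothetical subset of struct bio. */
struct demo_bio {
    unsigned int bi_size;     /* bytes still outstanding */
    unsigned long bi_sector;  /* current start sector */
    int ended;                /* how often the end-io callback ran */
};

/* Stands in for bio_endio(), which would call bio->bi_end_io. */
static void demo_bio_endio(struct demo_bio *bio)
{
    bio->ended++;
}

void demo_req_bio_endio(unsigned int rq_flags, struct demo_bio *bio,
                        unsigned int nbytes)
{
    assert(bio);
    if (nbytes > bio->bi_size)       /* never complete more than is left */
        nbytes = bio->bi_size;

    bio->bi_size -= nbytes;
    bio->bi_sector += nbytes >> 9;   /* 512-byte sectors */

    /* a bio that is part of a flush sequence is finished elsewhere */
    if (bio->bi_size == 0 && !(rq_flags & DEMO_REQ_FLUSH_SEQ))
        demo_bio_endio(bio);
}
```

Completing an 8192-byte bio in two 4096-byte steps triggers the callback only on the second step.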
7. blk_delay_work: obtains the request_queue from the struct work_struct and calls __blk_run_queue() to dispatch requests; the actual work is done by q->request_fn. request_fn is set while the queue is allocated and initialized (see item 19).
8. blk_delay_queue: queues the delayed work on the work queue.
9. blk_start_queue: restarts a previously stopped queue. It clears the stopped flag and calls q->request_fn(q) to start a single device queue.
10. blk_stop_queue: cancels the delayed work and sets the stopped flag.
11. blk_sync_queue: deletes the queue's timer (del_timer_sync), then cancels the delayed work and waits for it to finish (cancel_delayed_work_sync).
12. __blk_run_queue: runs a single device queue.
13. blk_run_queue_async: if the queue is not stopped, cancels the pending delayed work and queues it again with a delay of 0.
14. blk_run_queue: same as __blk_run_queue, except that it takes the request queue_lock around the call.
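How the stopped flag gates request_fn (items 9, 10 and 12) can be sketched as a small state machine. All names below are hypothetical; locking, the delayed work and the real request list are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_dev_queue;
typedef void (*demo_request_fn)(struct demo_dev_queue *q);

/* Hypothetical subset of struct request_queue. */
struct demo_dev_queue {
    bool stopped;               /* models QUEUE_FLAG_STOPPED */
    int pending;                /* requests waiting on the queue */
    int dispatched;             /* requests handed to the "device" */
    demo_request_fn request_fn; /* the driver's strategy routine */
};

/* Example request_fn: dispatches everything that is pending. */
void demo_drain(struct demo_dev_queue *q)
{
    q->dispatched += q->pending;
    q->pending = 0;
}

/* Models __blk_run_queue: a stopped queue is not run. */
void demo_run_queue(struct demo_dev_queue *q)
{
    assert(q->request_fn);
    if (q->stopped)
        return;
    q->request_fn(q);
}

/* Models blk_stop_queue (without cancelling any delayed work). */
void demo_stop_queue(struct demo_dev_queue *q)
{
    q->stopped = true;
}

/* Models blk_start_queue: clear the flag, then run the queue. */
void demo_start_queue(struct demo_dev_queue *q)
{
    q->stopped = false;
    demo_run_queue(q);
}
```

A driver would typically stop the queue when the hardware cannot take more work and start it again from its completion path.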
15. blk_put_queue: drops a reference on the queue's kobj.
16. blk_cleanup_queue: first calls blk_sync_queue, then calls del_timer_sync to delete q->backing_dev_info.laptop_mode_wb_timer, sets the queue flag QUEUE_FLAG_DEAD, and drops a reference on the queue's kobj.
17. blk_init_free_list: initializes the queue's request_list (rl) and creates its rq_pool mempool.
18. blk_alloc_queue: calls blk_alloc_queue_node.
blk_alloc_queue_node:
Allocates the queue with kmem_cache_alloc_node; node is the memory node, which matters on NUMA systems.
Initializes the queue's backing_dev_info and calls blk_throtl_init to initialize the queue's throttling state.
Sets up the timer q->backing_dev_info.laptop_mode_wb_timer with the callback laptop_mode_timer_fn (which writes back the backing_dev_info if it is dirty).
Sets up the queue timer q->timeout with the callback blk_rq_timed_out_timer. The callback walks the requests on the queue's timeout list; for each request that has not completed it calls blk_rq_timed_out, which, depending on the return value of the queue's rq_timed_out_fn, either calls __blk_complete_request or re-arms the timer.
Initializes q->timeout_list, q->flush_queue[0], q->flush_queue[1] and q->flush_data_in_flight.
Initializes the delayed work with blk_delay_work as its function.
Initializes the queue's kobj with type blk_queue_ktype (
struct kobj_type blk_queue_ktype = {
.sysfs_ops = &queue_sysfs_ops,
.default_attrs = default_attrs,
.release = blk_release_queue,
};
)
Initializes sysfs_lock.
Initializes __queue_lock.
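The timeout dispatch described for blk_rq_timed_out can be sketched as a switch over the handler's return value. The enum below mirrors the kernel's BLK_EH_* values as I remember them, but every demo_ name is hypothetical, and the third "not handled" outcome is not mentioned in the notes above.

```c
#include <assert.h>

/* Hypothetical mirror of enum blk_eh_timer_return. */
enum demo_eh_return {
    DEMO_EH_NOT_HANDLED,
    DEMO_EH_HANDLED,
    DEMO_EH_RESET_TIMER
};

struct demo_request {
    int completed;    /* set once the request is completed */
    int timer_armed;  /* how often the timer was re-armed */
    enum demo_eh_return (*timed_out_fn)(struct demo_request *rq);
};

/* Example handler: asks for one more timeout period, then gives up. */
enum demo_eh_return demo_retry_once(struct demo_request *rq)
{
    return rq->timer_armed == 0 ? DEMO_EH_RESET_TIMER : DEMO_EH_HANDLED;
}

/* Models blk_rq_timed_out: act on what the driver's handler decided. */
void demo_rq_timed_out(struct demo_request *rq)
{
    assert(rq->timed_out_fn);
    switch (rq->timed_out_fn(rq)) {
    case DEMO_EH_HANDLED:
        rq->completed = 1;    /* stands in for __blk_complete_request */
        break;
    case DEMO_EH_RESET_TIMER:
        rq->timer_armed++;    /* stands in for re-arming the timer */
        break;
    case DEMO_EH_NOT_HANDLED:
        break;                /* leave the request alone */
    }
}
```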
19. blk_init_queue: see the comment in the source:
* blk_init_queue - prepare a request queue for use with a block device
* @rfn: The function to be called to process requests that have been
* placed on the queue.
* @lock: Request queue spin lock
*
* Description:
* If a block device wishes to use the standard request handling procedures,
* which sorts requests and coalesces adjacent requests, then it must
* call blk_init_queue(). The function @rfn will be called when there
* are requests on the queue that need to be processed. If the device
* supports plugging, then @rfn may not be called immediately when requests
* are available on the queue, but may be called at some time later instead.
* Plugged queues are generally unplugged when a buffer belonging to one
* of the requests on the queue is needed, or due to memory pressure.
*
* @rfn is not required, or even expected, to remove all requests off the
* queue, but only as many as it can handle at a time. If it does leave
* requests on the queue, it is responsible for arranging that the requests
* get dealt with eventually.
*
* The queue spin lock must be held while manipulating the requests on the
* request queue; this lock will be taken also from interrupt context, so irq
* disabling is needed for it.
*
* Function returns a pointer to the initialized request queue, or %NULL if
* it didn't succeed.
*
* Note:
* blk_init_queue() must be paired with a blk_cleanup_queue() call
* when the block device is deactivated (such as at module unload).
Calls blk_init_queue_node to initialize the queue.
blk_init_queue_node:
First calls blk_alloc_queue_node (described above).
Then calls blk_init_allocated_queue_node.
blk_init_allocated_queue_node:
Sets the request_fn function.
Calls blk_queue_make_request to install __make_request as the queue's make_request function.
20. blk_get_queue: takes a reference on the queue's kobj.
21. blk_free_request: if the request's cmd_flags include REQ_ELVPRIV, calls elv_put_request, which invokes the elevator's elevator_put_req_fn.
Calls mempool_free to return the request to rq_pool.
22. blk_alloc_request:
Calls mempool_alloc to allocate a request.
Calls blk_rq_init to initialize the request.
If priv != 0, sets REQ_ELVPRIV in cmd_flags (see item 21).
23. ioc_batching: returns true if the ioc is a valid batching request.
return ioc->nr_batch_requests == q->nr_batching ||
(ioc->nr_batch_requests > 0
&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
time_before(a, b): returns true if time a is earlier than time b.
24. ioc_set_batching: returns immediately if the ioc is already a valid batcher; otherwise it sets:
ioc->nr_batch_requests = q->nr_batching;
ioc->last_waited = jiffies;
After that the ioc is a valid, fresh batcher.
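Items 23 and 24 can be exercised in userspace by passing the current time in explicitly. This is a sketch under assumptions: DEMO_HZ = 1000 is an arbitrary choice, the demo_ names are hypothetical, and demo_time_before uses the usual signed-difference trick so that the comparison survives counter wraparound.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_HZ 1000UL                        /* assumed tick rate */
#define DEMO_BLK_BATCH_TIME (DEMO_HZ / 50UL)  /* 20 ticks */

/* True if a is before b, even when the counter has wrapped. */
#define demo_time_before(a, b) ((long)((a) - (b)) < 0)

/* Hypothetical subset of struct io_context. */
struct demo_ioc {
    int nr_batch_requests;
    unsigned long last_waited;
};

bool demo_ioc_batching(int nr_batching, const struct demo_ioc *ioc,
                       unsigned long now)
{
    if (!ioc)
        return false;
    /* a fresh batcher always qualifies; an older one only within the window */
    return ioc->nr_batch_requests == nr_batching ||
           (ioc->nr_batch_requests > 0 &&
            demo_time_before(now, ioc->last_waited + DEMO_BLK_BATCH_TIME));
}

void demo_ioc_set_batching(int nr_batching, struct demo_ioc *ioc,
                           unsigned long now)
{
    assert(nr_batching > 0);
    if (!ioc || demo_ioc_batching(nr_batching, ioc, now))
        return;
    ioc->nr_batch_requests = nr_batching;
    ioc->last_waited = now;
}
```

A batcher that has used part of its budget stays valid only until last_waited + DEMO_BLK_BATCH_TIME; a batcher with its full budget stays valid regardless of the time.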
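25. __freed_request:
If rl->count[sync] < queue_congestion_off_threshold(q), clears the queue's congested flag.
If rl->count[sync] + 1 <= q->nr_requests, wakes up the rl wait queue and clears QUEUE_FLAG_SYNCFULL/QUEUE_FLAG_ASYNCFULL in queue_flags.
26. freed_request: called after a request has been released (from get_request and __blk_put_request); updates the full/congestion state and wakes up waiters.
rl->count[sync]--
Calls __freed_request to update the state and wake up waiters.
If rl->starved[sync ^ 1] is set, calls __freed_request for the other direction as well.
27. blk_rq_should_init_elevator: if the bio's bi_rw does not contain REQ_FLUSH or REQ_FUA, the elevator data has to be initialized.
28. get_request: gets a free request; called by get_request_wait and blk_get_request.
request_list: the free request list, with two entries, roughly one for writes and one for reads; this is what the sync index stands for.
rw_is_sync: returns true if the request is not a write (REQ_WRITE) or is synchronous (REQ_SYNC).
If may_queue is ELV_MQUEUE_NO, the starved path is taken: if rl->count[is_sync] == 0, rl->starved[is_sync] is set to 1, and no request is returned.
If rl->count[is_sync] + 1 >= q->nr_requests: if the full flag is not yet set, calls ioc_set_batching and marks the queue full; otherwise, if the caller is not a batcher, returns a NULL request.
If rl->count[is_sync] + 1 >= queue_congestion_on_threshold(q), sets the queue congested flag.
If rl->count[is_sync] >= (3 * q->nr_requests / 2), that is, the number of requests has reached 1.5 times nr_requests, a NULL request is returned as well. This formula shows that the number of requests can exceed the nominal maximum nr_requests.
Increments the request_list count.
Since a request is now being handed out, the list is no longer starved: sets rl->starved[is_sync] = 0.
If the bio is a flush, the elevator does not need to be initialized; otherwise rl->elvpriv++.
Why is queue_lock unlocked first? get_request is called with queue_lock held (the computations and flag updates above need its protection); the lock is taken in blk_get_request. It is dropped before blk_alloc_request because the mempool allocation may sleep.
If blk_alloc_request succeeds, queue_lock stays unlocked. If the ioc is within its BLK_BATCH_TIME window or ioc_set_batching has just been called, ioc->nr_batch_requests is decremented to account for one batched request.
Calls trace_block_getrq to record the bio in the trace.
If the allocation fails, queue_lock is taken again and the accounting above is rolled back; waiters on the queue are woken up where possible; if rl->count[is_sync] == 0 the starved flag is set; a NULL request is returned.
blk_alloc_request: see item 22. It takes a request from the mempool and initializes it. Later, when the request is inserted into the request list, queue_lock has to be taken first.
get_request_wait: when the first get_request fails, puts the caller on the wait queue and calls get_request again after being woken up; by then at least one request should be available.
blk_get_request: called by many drivers. It calls get_request_wait or get_request and takes queue_lock before the call. If get_request fails, it unlocks queue_lock.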
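The admission checks of get_request (item 28) are easier to follow as code. The sketch below models one direction of the request_list only; the elevator's may_queue answer, locking and the actual allocation are left out, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one direction (sync or async) of request_list. */
struct demo_rl {
    int count;          /* requests currently allocated */
    int nr_requests;    /* nominal queue depth */
    int congestion_on;  /* congestion "on" threshold */
    bool full;
    bool congested;
    bool starved;
};

/* Returns true if the caller may go on to allocate a request. */
bool demo_may_allocate(struct demo_rl *rl, bool is_batcher)
{
    assert(rl->nr_requests > 0);

    if (rl->count + 1 >= rl->congestion_on) {
        if (rl->count + 1 >= rl->nr_requests) {
            if (!rl->full)
                rl->full = true;   /* first caller to fill it may proceed */
            else if (!is_batcher)
                return false;      /* full, and the caller is no batcher */
        }
        rl->congested = true;
    }

    /* hard limit: even batchers stop at 1.5 times the nominal depth */
    if (rl->count >= 3 * rl->nr_requests / 2)
        return false;

    rl->count++;
    rl->starved = false;
    return true;
}
```

With nr_requests = 4 a non-batcher is refused once four requests are out, while a batcher can push the count up to 6, which illustrates the remark that the count may exceed nr_requests.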
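29. blk_make_request: first gets a request, then creates a bounce buffer where needed, then attaches the bio to the request; the resulting request can then be placed on the request queue (where the elevator algorithm applies).
bounce buffer: the Chinese edition of LDD (p. 439, DMA mapping) says that a bounce buffer is created when a driver attempts to perform DMA on an address that the peripheral device cannot reach.
blk_rq_append_bio: adds a bio to a request. If rq->bio is NULL, calls blk_rq_bio_prep to set the bio up as the request's first bio; otherwise calls ll_back_merge_fn to merge the bio at the back of the request's bio chain. If the merge succeeds, the request's biotail and data size are updated.
In that code, rq->biotail->bi_next = bio; links the old biotail to the new bio.
The expression high = page_to_pfn(bv->bv_page) > queue_bounce_pfn(q); (from __blk_recalc_rq_segments) is true when the page lies above the queue's bounce limit.
30. blk_insert_request: calls add_acct_request to add the request to the queue, then calls __blk_run_queue to start processing it.
31. part_round_stats: updates the partition's I/O performance statistics.
32. blk_put_request: releases a request.
33. blk_add_request_payload: adds a payload to a request. From the source comment: "this is a quite horrible hack and nothing but handling of discard requests should ever use it."
34. __make_request: called from generic_make_request.
First creates a bounce buffer for the bio where needed (only for the DMA and ISA cases; otherwise none is created).
Calls attempt_plug_merge to try to merge the bio into a request on the task's plug list whose queue is the queue passed to the function; returns on success.
Otherwise the bio is merged into the queue passed to the function:
(1) If bio->bi_rw & (REQ_FLUSH | REQ_FUA) is false, calls elv_merge to find a candidate request, then bio_attempt_back_merge or bio_attempt_front_merge to merge the bio into it.
(2) Otherwise, calls get_request_wait to obtain a new request and init_request_from_bio to initialize it. If current->plug exists, the request is added to the tail of the plug list; otherwise it is added to the queue and __blk_run_queue is called to start it.
bio_attempt_back_merge: merges the bio at the request's biotail.
bio_attempt_front_merge: merges the bio at the front of the request's bio chain.
35. blk_partition_remap: if the bio targets a partition, remaps it to the whole device by adding the partition's start_sect.
36. handle_bad_sector: sets BIO_EOF in bio->bi_flags.
37. setup_fail_make_request/should_fail_request/fail_make_request_debugfs: request failure handling (fault injection for make_request).
38. bio_check_eod: checks whether the bio extends beyond the end of the device.
39. generic_make_request: submits a bio.
If current->bio_list is already set (a make_request is active), the new bio is appended to the tail of current->bio_list; otherwise current->bio_list is pointed at bio_list_on_stack.
Calls __generic_make_request to issue the request (through the queue's make_request_fn).
submit_bio does the same as generic_make_request plus count_vm_events accounting.
40. blk_rq_check_limits: checks whether the request's sectors, bytes or segments exceed the queue's limits.
41. blk_insert_cloned_request: inserts a cloned request (judging by the parameters a different queue could be chosen, but that does not seem to happen?!).
Judging from dispatch_queued_ios: it first removes the request from the queuelist, and the queue it passes to blk_insert_cloned_request is the one the request already points to; blk_insert_cloned_request then sets the request's q to that same queue. So the queue does not actually change; what changes is that the request is inserted at the tail of the queue (this may differ for flush).
42. blk_rq_err_bytes: determines how many bytes from the start of the request can be failed before reaching the next failure boundary.
43. blk_account_io_completion: completion accounting; updates the byte counters.
blk_account_io_done: accounts the duration and related statistics.
44. blk_peek_request: picks the next request according to the elevator algorithm and marks it started where necessary.
45. blk_dequeue_request: removes a request from the queue.
46. blk_start_request: dequeues a request (item 45) and starts its timer with blk_add_timer (which adds the request to the queue's timeout list).
47. blk_fetch_request: calls blk_peek_request to get the top request and starts it with blk_start_request.
48. blk_update_bidi_request: calls blk_update_request to update the request's data_len, segments and so on. If the request's next_rq is not NULL, that request is updated as well (with its own byte count). Returns true while data is still pending; otherwise adds disk randomness and returns false.
49. blk_unprep_request: called once error handling is complete; makes a request ready for resubmission or completion by releasing the resources it holds (such as buffers).
50. blk_finish_request: blk_delete_timer, blk_unprep_request, end_io, __blk_put_request.
51. blk_end_bidi_request: blk_update_bidi_request, then blk_finish_request.
Related helpers: __blk_end_bidi_request, blk_end_request, blk_end_request_all, blk_end_request_cur, blk_end_request_err, __blk_end_request, __blk_end_request_all, __blk_end_request_cur, __blk_end_request_err.
52. blk_rq_bio_prep: sets up a request from a bio.
53. rq_flush_dcache_pages: flushes the dcache for the request's pages.
54. blk_rq_unprep_clone: removes the bios of a cloned request.
__blk_rq_prep_clone: copies the src request's attributes to the dest request.
blk_rq_prep_clone: clones the bios of the src request into the dest request and calls __blk_rq_prep_clone.
55. Scheduling queue work: kblockd_schedule_work, kblockd_schedule_delayed_work.
56. blk_start_plug: sets up a plug.
plug_rq_cmp: compares the queues of two requests.
57. queue_unplugged: runs the queue.
flush_plug_callbacks: runs the callbacks on the plug list.
blk_flush_plug_list: flush_plug_callbacks, then queue_unplugged.
blk_finish_plug: calls blk_flush_plug_list and sets current->plug to NULL.
58. blk_dev_init: allocates kblockd_workqueue and creates the slab caches request_cachep (which backs the request pool) and blk_requestq_cachep (from which blk_alloc_queue_node allocates queues).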
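The current->bio_list handling in generic_make_request (item 39) turns recursion through stacked devices into a loop. The userspace sketch below models that idea; a global pointer stands in for current->bio_list, demo_make_request_fn plays a stacked driver that resubmits a remapped bio, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_LEVELS 8

struct demo_bio {
    int level;              /* stacking levels still below this bio */
    struct demo_bio *next;
};

struct demo_bio_list { struct demo_bio *head, *tail; };

static struct demo_bio_list *demo_current_bio_list; /* models current->bio_list */
static struct demo_bio demo_children[DEMO_MAX_LEVELS];
int demo_depth, demo_max_depth, demo_handled;

void demo_generic_make_request(struct demo_bio *bio);

static void demo_list_add(struct demo_bio_list *l, struct demo_bio *bio)
{
    bio->next = NULL;
    if (l->tail)
        l->tail->next = bio;
    else
        l->head = bio;
    l->tail = bio;
}

static struct demo_bio *demo_list_pop(struct demo_bio_list *l)
{
    struct demo_bio *bio = l->head;

    if (bio) {
        l->head = bio->next;
        if (!l->head)
            l->tail = NULL;
        bio->next = NULL;
    }
    return bio;
}

/* Models a stacked driver: handle the bio, then resubmit one level down. */
static void demo_make_request_fn(struct demo_bio *bio)
{
    assert(bio->level >= 0 && bio->level <= DEMO_MAX_LEVELS);
    if (++demo_depth > demo_max_depth)
        demo_max_depth = demo_depth;
    demo_handled++;
    if (bio->level > 0) {
        struct demo_bio *child = &demo_children[bio->level - 1];

        child->level = bio->level - 1;
        demo_generic_make_request(child);  /* queued, not recursed into */
    }
    demo_depth--;
}

void demo_generic_make_request(struct demo_bio *bio)
{
    struct demo_bio_list on_stack = { NULL, NULL };

    if (demo_current_bio_list) {           /* a submission is already active */
        demo_list_add(demo_current_bio_list, bio);
        return;
    }
    demo_current_bio_list = &on_stack;
    do {
        demo_make_request_fn(bio);
        bio = demo_list_pop(demo_current_bio_list);
    } while (bio);
    demo_current_bio_list = NULL;          /* deactivate */
}
```

However deep the stacking, demo_make_request_fn never nests more than one level, which is the point of keeping the list on the first caller's stack.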
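Notes on item 29 (bounce buffers): a bounce buffer acts as an intermediate buffer. For example, some old devices can only address memory below 16 MB; when the DMA target address lies above 16 MB, a buffer has to be set up inside the 16 MB range that the device can reach, and the data is bounced through it.
With an IOMMU available to newer PCI devices, this problem is expected to disappear.
rq->biotail = bio; makes the new bio the request's biotail.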
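The two biotail statements discussed for blk_rq_append_bio form a plain tail append on a singly linked list. A minimal userspace sketch, with hypothetical demo_ structures and without the ll_back_merge_fn check that the kernel performs before appending:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subsets of struct bio and struct request. */
struct demo_bio {
    unsigned int bi_size;
    struct demo_bio *bi_next;
};

struct demo_request {
    struct demo_bio *bio;      /* head of the bio chain */
    struct demo_bio *biotail;  /* last bio of the chain */
    unsigned int data_len;     /* total bytes in the request */
};

void demo_rq_append_bio(struct demo_request *rq, struct demo_bio *bio)
{
    assert(bio != NULL);
    bio->bi_next = NULL;
    if (!rq->bio)
        rq->bio = bio;               /* first bio of the request */
    else
        rq->biotail->bi_next = bio;  /* link the old tail to the new bio */
    rq->biotail = bio;               /* the new bio becomes the tail */
    rq->data_len += bio->bi_size;
}
```

Keeping a tail pointer makes the append O(1) regardless of how many bios the request already carries.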
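In addition, if a bio page lies in high memory (above the queue's bounce limit), its bv is never treated as part of another segment, because a bounce buffer will be used for it; see this fragment of __blk_recalc_rq_segments:
if (high || highprv)
goto new_segment;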
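The effect of the high/highprv test on the segment count can be sketched as follows. This is a simplified model: each vector is represented only by its page frame number, neighbouring vectors are assumed to be mergeable whenever the rule allows it, and demo_count_segments is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Counts segments for n vectors given by page frame number. A vector whose
 * pfn is above bounce_pfn always starts a new segment, and so does the
 * vector that follows it.
 */
unsigned int demo_count_segments(const unsigned long *pfn, size_t n,
                                 unsigned long bounce_pfn)
{
    unsigned int nr_segs = 0;
    bool highprv = false;

    assert(pfn != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool high = pfn[i] > bounce_pfn;

        if (i == 0 || high || highprv)
            nr_segs++;      /* the "new_segment" path */
        highprv = high;
    }
    return nr_segs;
}
```

For the pfns 10, 11, 200, 12, 13 with a bounce limit of 100 the result is three segments: the high page splits what would otherwise be a single run.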