Ceph RGW Object Upload: Source Code Analysis

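Every OP request in RGW enters through process_request. It obtains a handler from the RGWREST instance passed in, asks that handler for the matching RGWOp, and, once the request has passed validation, calls rgw_process_authenticated to carry out the OP.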

Request handling
int  process_request(RGWRados*  const  store,
                     RGWREST*  const  rest,
                     RGWRequest*  const  req,
                     const  std::string& frontend_prefix,
                     const  rgw_auth_registry_t& auth_registry,
                     RGWRestfulIO*  const  client_io,
                     OpsLogSocket*  const  olog)
{
     ...
     RGWOp* op = NULL;
     int  init_error = 0;
     bool  should_log =  false ;
     RGWRESTMgr *mgr;
     RGWHandler_REST *handler = rest->get_handler(store, s,
                                                auth_registry,
                                                frontend_prefix,
                                                client_io, &mgr, &init_error);
     ...
     op = handler->get_op(store); // resolve the concrete OP for this request
     ...
     ret = rgw_process_authenticated(handler, op, req, s); // verify the OP, then execute it
     if (ret < 0) {
         abort_early(s, op, ret, handler);
         goto done;
     }
     ...
}

Inside rgw_process_authenticated, the OP is first initialized and checked (op mask, permissions, parameters) and then runs in three phases: pre_exec, execute, and complete.

OP execution
int  rgw_process_authenticated(RGWHandler_REST *  const  handler,
                               RGWOp *& op,
                               RGWRequest *  const  req,
                               req_state *  const  s,
                               const  bool  skip_retarget)
{
     ...
     req-> log (s,  "init op" );
   ret = op->init_processing();
   if  (ret < 0) {
     return  ret;
   }
   req-> log (s,  "verifying op mask" );
   ret = op->verify_op_mask();
   if  (ret < 0) {
     return  ret;
   }
   req-> log (s,  "verifying op permissions" );
   ret = op->verify_permission();
   if  (ret < 0) {
     if (s->system_request) {
       dout(2) << "overriding permissions due to system operation" << dendl;
     } else if (s->auth.identity->is_admin_of(s->user->user_id)) {
       dout(2) << "overriding permissions due to admin operation" << dendl;
     } else {
       return ret;
     }
   }
   req-> log (s,  "verifying op params" );
   ret = op->verify_params();
   if  (ret < 0) {
     return  ret;
   }
   req-> log (s,  "pre-executing" );
   op->pre_exec();
   req-> log (s,  "executing" );
   op->execute();
   req-> log (s,  "completing" );
   op->complete();
     ...
}

All RGW operations derive from the base class RGWOp, and each concrete OP overrides its own execute method.
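
As a minimal sketch of this pattern (not RGW code: `Op`, `PutObjOp` and `run_op` are hypothetical stand-ins for RGWOp, RGWPutObj and the tail of rgw_process_authenticated), a base class can fix the order of the phases while each concrete OP only supplies execute:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for RGWOp: the base class declares the phases.
class Op {
public:
  virtual ~Op() = default;
  virtual int verify_permission() { return 0; }
  virtual void pre_exec() { trace.push_back("pre_exec"); }
  virtual void execute() = 0;               // each concrete OP supplies this
  virtual void complete() { trace.push_back("complete"); }
  std::vector<std::string> trace;           // records the phases that ran
};

// Hypothetical stand-in for RGWPutObj: only execute() is overridden.
class PutObjOp : public Op {
public:
  void execute() override { trace.push_back("execute:put_obj"); }
};

// Same order as the end of rgw_process_authenticated:
// checks first, then pre_exec -> execute -> complete.
int run_op(Op& op) {
  int ret = op.verify_permission();
  if (ret < 0) {
    return ret;   // a failed check stops the OP before any phase runs
  }
  op.pre_exec();
  op.execute();
  op.complete();
  return 0;
}
```

Calling `run_op` on a `PutObjOp` leaves three entries in `trace`, with the overridden execute in the middle.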

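An object can be uploaded whole (a single PUT) or in parts (multipart upload). Which one to use depends on how much data the client uploads in one operation: an object larger than 5 GB must go through the multipart upload API, while anything smaller can use the single-PUT API. This rule is there for compatibility with the S3 object operations.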
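
A client-side view of that rule can be sketched as follows. This is an illustration only: `choose_upload_mode` and `part_count` are hypothetical helpers, and the 5 GB limit is taken here as 5 * 2^30 bytes, which is an assumption about how the limit is counted.

```cpp
#include <cassert>
#include <cstdint>

// 5 GB single-PUT limit from the text (assumed to mean 5 * 2^30 bytes).
constexpr uint64_t kMaxSinglePutBytes = 5ULL * 1024 * 1024 * 1024;

enum class UploadMode { SinglePut, Multipart };

// Hypothetical helper: pick the upload API from the object size.
UploadMode choose_upload_mode(uint64_t object_size) {
  return object_size > kMaxSinglePutBytes ? UploadMode::Multipart
                                          : UploadMode::SinglePut;
}

// Number of parts needed for a given part size (ceiling division).
// part_size must be greater than zero.
uint64_t part_count(uint64_t object_size, uint64_t part_size) {
  return (object_size + part_size - 1) / part_size;
}
```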

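A whole-object upload is handled by the RGWPutObjProcessor_Atomic class, and a multipart upload by the RGWPutObjProcessor_Multipart class. Together with the filters chained in front of them, these classes take care of encryption, compression and the rest of the data processing, which makes them the core of the upload path. For details on compressing uploaded objects, see the official Ceph documentation.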
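
The way a filter sits in front of a processor can be sketched as a small decorator chain. Everything here is hypothetical and simplified: `DataProcessor` stands in for RGWPutObjDataProcessor, `StoreProcessor` for the final Atomic/Multipart processor, and `RleFilter` is a toy run-length encoder used in place of a real compression plugin.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for RGWPutObjDataProcessor.
struct DataProcessor {
  virtual ~DataProcessor() = default;
  virtual void handle_data(const std::string& data) = 0;
};

// Stand-in for the final processor: it just stores the bytes it receives.
struct StoreProcessor : DataProcessor {
  std::string stored;
  void handle_data(const std::string& data) override { stored += data; }
};

// Toy "compression" filter: run-length encodes the data, then forwards
// the result to the next processor in the chain.
struct RleFilter : DataProcessor {
  explicit RleFilter(DataProcessor* n) : next(n) {}
  void handle_data(const std::string& data) override {
    std::string out;
    for (std::size_t i = 0; i < data.size();) {
      std::size_t j = i;
      while (j < data.size() && data[j] == data[i]) ++j;
      out += std::to_string(j - i) + data[i];   // e.g. "aaa" -> "3a"
      i = j;
    }
    next->handle_data(out);
  }
  DataProcessor* next;
};
```

The caller only ever holds a `DataProcessor*`: it starts out pointing at the processor and is re-pointed at the filter when one is needed, so the rest of the upload code does not care whether a filter is present.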

The OP class for object upload is RGWPutObj. Its execute function is the entry point of the whole upload; the calls it makes nest as follows:

RGWPutObj::execute
  • RGWPutObj_Compress::prepare
    • RGWPutObjProcessor_Atomic::prepare
  • put_data_and_throttle
    • RGWPutObj_Compress::handle_data
      • RGWPutObjProcessor_Atomic::handle_data
        • RGWPutObjProcessor_Atomic::write_data
          • RGWPutObjProcessor_Aio::handle_obj_data
            • RGWRados::aio_put_obj_data
              • librados aio API
    • RGWPutObjProcessor_Aio::throttle_data
  • RGWPutObjProcessor::complete
    • RGWPutObjProcessor_Atomic::complete

RGWPutObj::execute
void  RGWPutObj::execute()
{
  
   RGWPutObjProcessor *processor = NULL;
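   // filter processes the data on its way in, e.g. encryption or compression
   RGWPutObjDataProcessor *filter = nullptr;
   std::unique_ptr<RGWPutObjDataProcessor> encrypt;
   
   // buffers for the client-supplied MD5 and the computed MD5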
   char  supplied_md5_bin[CEPH_CRYPTO_MD5_DIGESTSIZE + 1];
   char  supplied_md5[CEPH_CRYPTO_MD5_DIGESTSIZE * 2 + 1];
   char  calc_md5[CEPH_CRYPTO_MD5_DIGESTSIZE * 2 + 1];
   unsigned  char  m[CEPH_CRYPTO_MD5_DIGESTSIZE];
   MD5 hash;
 
   bufferlist bl, aclbl, bs;
   int  len;
   map<string, string>::iterator iter;
   bool  multipart;
   // offsets used for copy source range
   off_t fst;
   off_t lst;
   // pick the object's compression type from the zone config: "none" or the name of a compressor plugin
   // http://docs.ceph.com/docs/kraken/radosgw/compression/
   const auto& compression_type = store->get_zone_params().get_compression_type(
       s->bucket_info.placement_rule);
   CompressorRef plugin;
   boost::optional<RGWPutObj_Compress> compressor;
 
   bool  need_calc_md5 = (dlo_manifest == NULL) && (slo_info == NULL);
   perfcounter->inc(l_rgw_put);
   op_ret = -EINVAL;
 
   //---------------------------------------------------------
   // parse the request parameters and check that they are complete and valid
   //---------------------------------------------------------
   
   // check that the object name and the bucket in the request are valid
   if  (s->object.empty()) {
     goto  done;
   }
   if  (!s->bucket_exists) {
     op_ret = -ERR_NO_SUCH_BUCKET;
     return ;
   }
   // parse and validate the HTTP request parameters: the copy-object, tagging and versioning cases, plus basic object-name and bucket-name parsing
   op_ret = get_params();
   if  (op_ret < 0) {
     ldout(s->cct, 20) <<  "get_params() returned ret="  << op_ret << dendl;
     goto  done;
   }
 
   op_ret = get_system_versioning_params(s, &olh_epoch, &version_id); // versioning parameters
   if  (op_ret < 0) {
     ldout(s->cct, 20) <<  "get_system_versioning_params() returned ret="
               << op_ret << dendl;
     goto  done;
   }
 
   // if the request supplied an MD5 for integrity checking, decode it here
   if  (supplied_md5_b64) {
     need_calc_md5 =  true ;
 
     ldout(s->cct, 15) <<  "supplied_md5_b64="  << supplied_md5_b64 << dendl;
     op_ret = ceph_unarmor(supplied_md5_bin, &supplied_md5_bin[CEPH_CRYPTO_MD5_DIGESTSIZE + 1],
                        supplied_md5_b64, supplied_md5_b64 +  strlen (supplied_md5_b64));
     ldout(s->cct, 15) <<  "ceph_armor ret="  << op_ret << dendl;
     if  (op_ret != CEPH_CRYPTO_MD5_DIGESTSIZE) {
       op_ret = -ERR_INVALID_DIGEST;
       goto  done;
     }
 
     buf_to_hex(( const  unsigned  char  *)supplied_md5_bin, CEPH_CRYPTO_MD5_DIGESTSIZE, supplied_md5);
     ldout(s->cct, 15) <<  "supplied_md5="  << supplied_md5 << dendl;
   }
 
   // If the request does not use HTTP chunked transfer, quota can be checked up front
   // against the content length; otherwise the check has to wait until all chunks have arrived.
   // https://www.cnblogs.com/ribavnu/p/5084458.html
   if (!chunked_upload) { /* with chunked upload we don't know how big is the upload.
                             we also check sizes at the end anyway */
     // check the user and bucket quota
     op_ret = store->check_quota(s->bucket_owner.get_id(), s->bucket,
                 user_quota, bucket_quota, s->content_length);
     if  (op_ret < 0) {
       ldout(s->cct, 20) <<  "check_quota() returned ret="  << op_ret << dendl;
       goto  done;
     }
     // check the bucket index shard limit, see http://docs.ceph.com/docs/master/radosgw/dynamicresharding/
     op_ret = store->check_bucket_shards(s->bucket_info, s->bucket, bucket_quota);
     if  (op_ret < 0) {
       ldout(s->cct, 20) <<  "check_bucket_shards() returned ret="  << op_ret << dendl;
       goto  done;
     }
   }
   
   // If the client supplied an ETag with the request, keep it here;
   // it is compared with the computed ETag once the data has been written.
   // About ETag: https://baike.baidu.com/item/ETag/4419019?fr=aladdin
   if  (supplied_etag) {
     strncpy (supplied_md5, supplied_etag,  sizeof (supplied_md5) - 1);
     supplied_md5[ sizeof (supplied_md5) - 1] =  '\0' ;
   }
   // choose the processor for this upload, RGWPutObjProcessor_Atomic or RGWPutObjProcessor_Multipart,
   // and record in multipart (bool) whether this is a multipart upload
   processor = select_processor(*static_cast<RGWObjectCtx *>(s->obj_ctx), &multipart);
 
   // no filters by default
   filter = processor;
 
   /* Handle object versioning of Swift API. */
   if  (! multipart) {
     rgw_obj obj(s->bucket, s->object);
     op_ret = store->swift_versioning_copy(*static_cast<RGWObjectCtx *>(s->obj_ctx),
                                           s->bucket_owner.get_id(),
                                           s->bucket_info,
                                           obj);
     if  (op_ret < 0) {
       goto  done;
     }
   }
 
   // Call prepare on RGWPutObjProcessor_Atomic or RGWPutObjProcessor_Multipart.
   // RGWPutObjProcessor_Atomic:
   //   pre-write setup: generate the object name prefix, set the placement rule,
   //   create the in-memory object, set the sizes used to split head and tail objects, and so on
   // RGWPutObjProcessor_Multipart:
   //   same as Atomic, plus the handling of uploadId and partNumber, see
   //   http://docs.ceph.com/docs/master/radosgw/s3/objectops/#initiate-multi-part-upload
   // After its own work, each of them calls RGWPutObjProcessor_Aio::prepare,
   //   which sets the aio window size from the configuration,
   // and that in turn calls RGWPutObjProcessor::prepare,
   //   which sets the processor's store pointer
   op_ret = processor->prepare(store, NULL);
   if  (op_ret < 0) {
     ldout(s->cct, 20) <<  "processor->prepare() returned ret="  << op_ret
               << dendl;
     goto  done;
   }
 
   // for a copy source range request, take the start and end offsets in the source object
   fst = copy_source_range_fst;
   lst = copy_source_range_lst;
 
   // SSE: if the request asks for server-side encryption, set up the encryption filter
   op_ret = get_encrypt_filter(&encrypt, filter);
   if (op_ret < 0) {
     goto done;
   }
   // with encryption, filter encrypts the data
   if (encrypt != nullptr) {
     filter = encrypt.get();
   } else {
     //no encryption, we can try compression
     if (compression_type != "none") {
       // without encryption and with a compression_type set, filter compresses the data
       plugin = get_compressor_plugin(s, compression_type);
       if (!plugin) {
         ldout(s->cct, 1) << "Cannot load plugin for compression type "
             << compression_type << dendl;
       } else {
         // everything is in place, construct the compressor
         compressor.emplace(s->cct, plugin, filter);
         filter = &*compressor;
       }
     }
   }
 
   //-------------------------------------------------------------------------
   // Parameter parsing, tool setup, head obj initialization and write ctx setup are done.
   // Next: read the data from req, process it, and store it in RADOS.
   //-------------------------------------------------------------------------
 
   do  {
     bufferlist data;
     if  (fst > lst)
       break ;
     if  (!copy_source) {
       // a regular PUT, not a copy:
       // read up to rgw_max_chunk_size bytes of the request body into data
       /* the option description of rgw_max_chunk_size:
           "The chunk size is the size of RADOS I/O requests that RGW sends when accessing "
           "data objects. RGW read and write operation will never request more than this amount "
           "in a single request. This also defines the rgw object head size, as head operations "
           "need to be atomic, and anything larger than this would require more than a single "
           "operation."),
       */
       len = get_data(data);
     } else {
       // otherwise, read the range from the source object
       uint64_t cur_lst = min(fst + s->cct->_conf->rgw_max_chunk_size - 1, lst);
       op_ret = get_data(fst, cur_lst, data);
       if  (op_ret < 0)
         goto  done;
       len = data.length();
       s->content_length += len;
       fst += len;
     }
     if  (len < 0) {
       op_ret = len;
       ldout(s->cct, 20) <<  "get_data() returned ret="  << op_ret << dendl;
       goto  done;
     }
     // feed this chunk into the running MD5
     if  (need_calc_md5) {
       hash.Update(( const  byte *)data.c_str(), data.length());
     }
     
     /* update torrrent */
     torrent.update(data);
 
     /* do we need this operation to be synchronous? if we're dealing with an object with immutable
      * head, e.g., multipart object we need to make sure we're the first one writing to this object
      */
     bool  need_to_wait = (ofs == 0) && multipart;
 
     bufferlist orig_data;
 
     if  (need_to_wait) {
       orig_data = data;
     }
 
     // First RGWPutObj_Compress::handle_data compresses the data
     // (or the data is encrypted, or passed through unchanged),
     // then RGWPutObjProcessor_Atomic::handle_data splits the processed data into one head and multiple tail objects.
     // handle_data ends up calling `store->aio_put_obj_data`, which writes the object to RADOS.
     
     // An asynchronous librados write first needs aio_create_completion, which
     // returns a rados_completion_t object that tracks the state of the write.
     // rados_completion_t: Represents the state of an asynchronous operation
     //  - it contains the return value once the operation completes,
     // and can be used to block until the operation is complete or safe.
     
     // put_data_and_throttle passes a pointer to this object (handle) on to throttle_data.
     // For the first chunk of a multipart upload, need_to_wait is true,
     // which means the call returns only after that chunk is written to RADOS (a synchronous write).
     op_ret = put_data_and_throttle(filter, data, ofs, need_to_wait);
     
     if  (op_ret < 0) {
       if  (!need_to_wait || op_ret != -EEXIST) {
         ldout(s->cct, 20) <<  "processor->thottle_data() returned ret="
               << op_ret << dendl;
         goto  done;
       }
 
       
       /* need_to_wait == true and op_ret == -EEXIST */
       ldout(s->cct, 5) <<  "NOTICE: processor->throttle_data() returned -EEXIST, need to restart write"  << dendl;
 
       /* restore original data */
       data.swap(orig_data);
 
       /* restart processing with different oid suffix */
 
       dispose_processor(processor);
       processor = select_processor(*static_cast<RGWObjectCtx *>(s->obj_ctx), &multipart);
       filter = processor;
 
       string oid_rand;
       char  buf[33];
       gen_rand_alphanumeric(store->ctx(), buf,  sizeof (buf) - 1);
       oid_rand.append(buf);
 
       op_ret = processor->prepare(store, &oid_rand);
       if  (op_ret < 0) {
         ldout(s->cct, 0) <<  "ERROR: processor->prepare() returned "
              << op_ret << dendl;
         goto  done;
       }
 
       op_ret = get_encrypt_filter(&encrypt, filter);
       if  (op_ret < 0) {
         goto  done;
       }
       if  (encrypt != nullptr) {
         filter = encrypt.get();
       } else {
         if  (compressor) {
           compressor.emplace(s->cct, plugin, filter);
           filter = &*compressor;
         }
       }
       op_ret = put_data_and_throttle(filter, data, ofs,  false );
       if  (op_ret < 0) {
         goto  done;
       }
     }
     // ofs is the number of bytes read from the request body so far
     ofs += len;
     // len == 0 means the object data has been read completely
   } while (len > 0);
 
   {
     // flush the buffer
     bufferlist flush;
     op_ret = put_data_and_throttle(filter, flush, ofs,  false );
     if  (op_ret < 0) {
       goto  done;
     }
   }
   // without chunked upload, a received size that differs from the content length means the transfer failed
   if  (!chunked_upload && ofs != s->content_length) {
     op_ret = -ERR_REQUEST_TIMEOUT;
     goto  done;
   }
   s->obj_size = ofs;
 
   perfcounter->inc(l_rgw_put_b, s->obj_size);
 
   // as the name says: complete the AWS v4 authentication
   op_ret = do_aws4_auth_completion();
   if  (op_ret < 0) {
     goto  done;
   }
   // check the quota again, now with the actual object size
   op_ret = store->check_quota(s->bucket_owner.get_id(), s->bucket,
                               user_quota, bucket_quota, s->obj_size);
   if  (op_ret < 0) {
     ldout(s->cct, 20) <<  "second check_quota() returned op_ret="  << op_ret << dendl;
     goto  done;
   }
   // check whether a bucket index shard would exceed its maximum number of objects
   op_ret = store->check_bucket_shards(s->bucket_info, s->bucket, bucket_quota);
   if  (op_ret < 0) {
     ldout(s->cct, 20) <<  "check_bucket_shards() returned ret="  << op_ret << dendl;
     goto  done;
   }
 
   hash.Final(m);
 
   // add the compression info to attrs
   if  (compressor && compressor->is_compressed()) {
     bufferlist tmp;
     RGWCompressionInfo cs_info;
     cs_info.compression_type = plugin->get_type_name();
     cs_info.orig_size = s->obj_size;
     cs_info.blocks = move(compressor->get_compression_blocks());
     ::encode(cs_info, tmp);
     attrs[RGW_ATTR_COMPRESSION] = tmp;
     ldout(s->cct, 20) <<  "storing "  << RGW_ATTR_COMPRESSION
         <<  " with type="  << cs_info.compression_type
         <<  ", orig_size="  << cs_info.orig_size
         <<  ", blocks="  << cs_info.blocks.size() << dendl;
   }
 
   buf_to_hex(m, CEPH_CRYPTO_MD5_DIGESTSIZE, calc_md5);
 
   etag = calc_md5;
 
   // verify the computed MD5 against the one the client supplied
   if  (supplied_md5_b64 &&  strcmp (calc_md5, supplied_md5)) {
     op_ret = -ERR_BAD_DIGEST;
     goto  done;
   }
 
   // store the ACL as an xattr
   policy.encode(aclbl);
   emplace_attr(RGW_ATTR_ACL, std::move(aclbl));
 
 
   if  (dlo_manifest) {
     op_ret = encode_dlo_manifest_attr(dlo_manifest, attrs);
     if  (op_ret < 0) {
       ldout(s->cct, 0) <<  "bad user manifest: "  << dlo_manifest << dendl;
       goto  done;
     }
     complete_etag(hash, &etag);
     ldout(s->cct, 10) << __func__ <<  ": calculated md5 for user manifest: "  << etag << dendl;
   }
 
   if  (slo_info) {
     bufferlist manifest_bl;
     ::encode(*slo_info, manifest_bl);
     emplace_attr(RGW_ATTR_SLO_MANIFEST, std::move(manifest_bl));
 
     hash.Update((byte *)slo_info->raw_data, slo_info->raw_data_len);
     complete_etag(hash, &etag);
     ldout(s->cct, 10) << __func__ <<  ": calculated md5 for user manifest: "  << etag << dendl;
   }
 
   // compare with the client-supplied ETag, then store the ETag as an xattr
   if  (supplied_etag && etag.compare(supplied_etag) != 0) {
     op_ret = -ERR_UNPROCESSABLE_ENTITY;
     goto  done;
   }
   bl.append(etag.c_str(), etag.size() + 1);
   emplace_attr(RGW_ATTR_ETAG, std::move(bl));
 
   // store the remaining attrs the object needs (taken from the HTTP request) as xattrs
   populate_with_generic_attrs(s, attrs);
   op_ret = rgw_get_request_metadata(s->cct, s->info, attrs);
   if  (op_ret < 0) {
     goto  done;
   }
   encode_delete_at_attr(delete_at, attrs);
   encode_obj_tags_attr(obj_tags.get(), attrs);
 
   /* Add a custom metadata to expose the information whether an object
    * is an SLO or not. Appending the attribute must be performed AFTER
    * processing any input from user in order to prohibit overwriting. */
   if  (slo_info) {
     bufferlist slo_userindicator_bl;
     slo_userindicator_bl.append( "True" , 4);
     emplace_attr(RGW_ATTR_SLO_UINDICATOR, std::move(slo_userindicator_bl));
   }
 
   // finish the outstanding head and tail writes and set the xattrs on the head object
   op_ret = processor->complete(s->obj_size, etag, &mtime, real_time(), attrs,
                                (delete_at ? *delete_at : real_time()), if_match, if_nomatch,
                                (user_data.empty() ? nullptr : &user_data));
 
   // only atomic upload will update version_id here
   if (!multipart)
     version_id = (static_cast<RGWPutObjProcessor_Atomic *>(processor))->get_version_id();
 
   /* produce torrent */
   if  (s->cct->_conf->rgw_torrent_flag && (ofs == torrent.get_data_len()))
   {
     torrent.init(s, store);
     torrent.set_create_date(mtime);
     op_ret =  torrent.complete();
     if  (0 != op_ret)
     {
       ldout(s->cct, 0) <<  "ERROR: torrent.handle_data() returned "  << op_ret << dendl;
       goto  done;
     }
   }
 
done:
   // release the processor
   dispose_processor(processor);
   perfcounter->tinc(l_rgw_put_lat,
             (ceph_clock_now() - s->time));
}

The notes below cover the main functions that execute calls:

prepare phase: this phase mainly initializes the manifest structure:

prepare phase
int  RGWPutObjProcessor_Atomic::prepare(RGWRados *store, string *oid_rand)
{
   head_obj.init(bucket, obj_str);
   int  r = prepare_init(store, oid_rand);
   if  (r < 0) {
     return  r;
   }
   if  (!version_id.empty()) {
     head_obj.key.set_instance(version_id);
   } else if (versioned_object) {
     store->gen_rand_obj_instance_name(&head_obj);
   }
   manifest.set_trivial_rule(max_chunk_size, store->ctx()->_conf->rgw_obj_stripe_size);
   r = manifest_gen.create_begin(store->ctx(), &manifest, bucket_info.placement_rule, head_obj.bucket, head_obj); // initialize the manifest
   if  (r < 0) {
     return  r;
   }
   return  0;
}

put_data_and_throttle

Within execute, this function is called as follows:

put_data_and_throttle (call site in execute)
do {
   bufferlist data;
   // read up to rgw_max_chunk_size bytes of the request body into data
   len = get_data(data);
   ......
   op_ret = put_data_and_throttle(filter, data, ofs, need_to_wait);
   ......
} while (len > 0);

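The loop keeps reading up to rgw_max_chunk_size bytes from the request into a bufferlist and hands each chunk to the processor, which splits it up or writes it to the underlying RADOS layer.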
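
The shape of that loop can be sketched without any RGW types. `BodyReader` and `read_body` are hypothetical: the reader returns at most `max_chunk` bytes per call and an empty string once the body is exhausted, and the recorded (offset, length) pairs stand in for the calls to put_data_and_throttle. Unlike the real loop, this sketch records nothing for the final empty read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request body: hands out at most max_chunk bytes per call.
struct BodyReader {
  std::string body;
  std::size_t pos = 0;
  std::string get_data(std::size_t max_chunk) {
    std::size_t n = std::min(max_chunk, body.size() - pos);
    std::string chunk = body.substr(pos, n);
    pos += n;
    return chunk;
  }
};

// Same loop shape as in execute: read a chunk, hand it over at offset ofs,
// advance ofs, and stop once a read returns no data.
std::vector<std::pair<std::size_t, std::size_t>> read_body(BodyReader& req,
                                                           std::size_t max_chunk) {
  std::vector<std::pair<std::size_t, std::size_t>> writes;  // (ofs, len)
  std::size_t ofs = 0;
  std::size_t len = 0;
  do {
    std::string data = req.get_data(max_chunk);
    len = data.size();
    if (len > 0) {
      writes.emplace_back(ofs, len);  // stands in for put_data_and_throttle
    }
    ofs += len;
  } while (len > 0);
  return writes;
}
```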

put_data_and_throttle
static  inline  int  put_data_and_throttle(RGWPutObjDataProcessor *processor,
                     bufferlist& data, off_t ofs,
                     bool  need_to_wait)
{
   bool  again =  false ;
   do  {
     void  *handle = nullptr;
     rgw_raw_obj obj;
 
     uint64_t size = data.length();
     // handle points to the object returned by aio; it tells whether the aio has completed
     int  ret = processor->handle_data(data, ofs, &handle, &obj, &again);
     if  (ret < 0)
       return  ret;
     if  (handle != nullptr)
     {
       // wrap obj and handle and put them on the pending queue of the Aio class,
       // keeping the size of the pending queue within window_size
       ret = processor->throttle_data(handle, obj, size, need_to_wait);
       if  (ret < 0)
         return  ret;
     }
     else
       break ;
     need_to_wait = false; /* the need to wait only applies to the first
                * iteration */
   } while (again);
 
   return 0;
} /* put_data_and_throttle */
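
The throttling step can be pictured with a small model of the pending queue. This is a sketch under stated assumptions, not the RGW implementation: `AioWindow` is a hypothetical class, the window is counted in bytes here, and "waiting" for a write is modelled by simply retiring the oldest pending entry.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical model of the pending queue kept by the Aio processor.
class AioWindow {
public:
  explicit AioWindow(uint64_t window_size) : window_size_(window_size) {}

  // Register one in-flight write of `size` bytes, then retire the oldest
  // writes until the pending bytes fit in the window. With need_to_wait
  // set, everything is drained, which makes the write synchronous.
  void throttle(uint64_t size, bool need_to_wait) {
    pending_.push_back(size);
    pending_bytes_ += size;
    while (!pending_.empty() &&
           (need_to_wait || pending_bytes_ > window_size_)) {
      wait_oldest();
    }
  }

  uint64_t pending_bytes() const { return pending_bytes_; }
  uint64_t completed() const { return completed_; }

private:
  // Real code would block on the aio completion handle here.
  void wait_oldest() {
    pending_bytes_ -= pending_.front();
    pending_.pop_front();
    ++completed_;
  }

  uint64_t window_size_;
  uint64_t pending_bytes_ = 0;
  uint64_t completed_ = 0;
  std::deque<uint64_t> pending_;
};
```

The point of the window is that the reader of the request body is slowed down only when too much data is in flight; below the limit, writes are issued without waiting for earlier ones.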

RGWPutObjProcessor_Atomic::handle_data

This function splits one RGW object into a head object and multiple tail objects, then calls write_data to write them to RADOS asynchronously.

RGWPutObjProcessor_Atomic::handle_data
int  RGWPutObjProcessor_Atomic::handle_data(bufferlist &bl, off_t ofs,  void  **phandle, rgw_raw_obj *pobj,  bool  *again)
{
   *phandle = NULL;
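   // data_ofs is the offset up to which data has already been submitted for writing
   // next_part_ofs is where the next RADOS object starts, i.e. the end of the RADOS object being written now
   // in other words, the bytes from cur_part_ofs up to next_part_ofs go into the RADOS object cur_obj points to
 
   // this is needed because one RGW object is split into several RADOS objects (one head, multiple tails), 4 MB each by default
   uint64_t max_write_size = std::min(max_chunk_size, (uint64_t)next_part_ofs - data_ofs);
 
   // move the data in bl to the tail of pending_data_bl
   pending_data_bl.claim_append(bl);
   // if the total is still below the write threshold (max_write_size) even with the data from bl, return and wait for the next handle_data call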
   if  (pending_data_bl.length() < max_write_size)
   {
     *again =  false ;
     return  0;
   }
   // move the first max_write_size bytes of pending_data_bl into bl
   pending_data_bl.splice(0, max_write_size, &bl);
 
   // check whether what remains in pending_data_bl still reaches the write threshold (max_chunk_size)
   /* do we have enough data pending accumulated that needs to be written? */
   *again = (pending_data_bl.length() >= max_chunk_size);
 
   // head object and immutable_head() is false:
   // data_ofs == 0 means this is the first write
   // immutable_head()
   //   returns false by default in RGWPutObjProcessor_Atomic,
   //   but a subclass may override it
   if  (!data_ofs && !immutable_head())
   {
     // move the data in bl into first_chunk
     first_chunk.claim(bl);
     obj_len = (uint64_t)first_chunk.length();
     // update next_part_ofs and cur_part_ofs, and point cur_obj at the RADOS object to write next
     int  r = prepare_next_part(obj_len);
 
     if  (r < 0)  return  r;
     // advance data_ofs, the total offset written so far
     data_ofs = obj_len;
     return  0;
   }
   
   off_t write_ofs = data_ofs;
   data_ofs = write_ofs + bl.length();
 
   // for an object with an immutable head, flag the write of its head object so that it gets special handling later
   bool  exclusive = (!write_ofs && immutable_head());  /* immutable head object, need to verify nothing exists there
                                                         we could be racing with another upload, to the same
                                                         object and cleanup can be messy */
   // write_data first checks whether write_ofs has reached or passed next_part_ofs;
   // if so, it calls prepare_next_part to update cur_obj, cur_part_ofs and next_part_ofs.
   // It then sets pobj to cur_obj
   // and finally calls handle_obj_data,
   // which goes through aio_put_obj_data to the librados aio API and writes the data to RADOS asynchronously
   int ret = write_data(bl, write_ofs, phandle, pobj, exclusive);
   if (ret >= 0) { /* we might return, need to clear bl as it was already sent */
     bl.clear();
   }
   return  ret;
}
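
The buffering at the top of handle_data can be modelled in a few lines. `PendingBuffer` is a hypothetical type in which a `std::string` plays the role of pending_data_bl; it only shows the accumulate-then-splice step and the meaning of `again`, not the head/tail bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of pending_data_bl: bytes wait here until a full
// write of max_write_size is available.
struct PendingBuffer {
  std::string pending;

  // Append the incoming data. If enough has accumulated, move the first
  // max_write_size bytes into *out and return true. *again tells the
  // caller that another full write is already waiting.
  bool handle_data(const std::string& in, std::size_t max_write_size,
                   std::string* out, bool* again) {
    pending += in;
    if (pending.size() < max_write_size) {
      *again = false;
      return false;            // not enough yet, wait for the next call
    }
    *out = pending.substr(0, max_write_size);
    pending.erase(0, max_write_size);
    *again = pending.size() >= max_write_size;
    return true;
  }
};
```

A caller that sees `again == true` calls handle_data once more with empty input, which is what the do/while loop in put_data_and_throttle does.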

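The complete function simply calls do_complete, which does the wrap-up work. put_data_and_throttle started the asynchronous writes; at wrap-up, do_complete first waits for all of them to finish, then writes the attrs of the uploaded RGW object into the xattrs of the head object, which completes the upload.

RGWPutObjProcessor_Atomic::do_complete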
int  RGWPutObjProcessor_Atomic::do_complete( size_t  accounted_size,  const  string& etag,
                                            real_time *mtime, real_time set_mtime,
                                            map<string, bufferlist>& attrs,
                                            real_time delete_at,
                                            const  char  *if_match,
                                            const  char  *if_nomatch,  const  string *user_data,
                                            rgw_zone_set *zones_trace) {
   int r = complete_writing_data(); // wait for all asynchronous writes of this RGW object to finish
   if (r < 0)
     return r;
   obj_ctx.obj.set_atomic(head_obj); // mark the object as atomic
   RGWRados::Object op_target(store, bucket_info, obj_ctx, head_obj);
   /* some object types shouldn't be versioned, e.g., multipart parts */
   op_target.set_versioning_disabled(!versioned_object);
   RGWRados::Object::Write obj_op(&op_target); // write op that stores the RGW object's attrs in the head object's xattrs
   obj_op.meta.data = &first_chunk;
   obj_op.meta.manifest = &manifest;
   obj_op.meta.ptag = &unique_tag;  /* use req_id as operation tag */
   obj_op.meta.if_match = if_match;
   obj_op.meta.if_nomatch = if_nomatch;
   obj_op.meta.mtime = mtime;
   obj_op.meta.set_mtime = set_mtime;
   obj_op.meta.owner = bucket_info.owner;
   obj_op.meta.flags = PUT_OBJ_CREATE;
   obj_op.meta.olh_epoch = olh_epoch;
   obj_op.meta.delete_at = delete_at;
   obj_op.meta.user_data = user_data;
   obj_op.meta.zones_trace = zones_trace;
   r = obj_op.write_meta(obj_len, accounted_size, attrs); // the attrs are written here
   if  (r < 0) {
     return  r;
   }
   canceled = obj_op.meta.canceled;
   return  0;
}

complete_writing_data is defined as follows:

complete_writing_data
int  RGWPutObjProcessor_Atomic::complete_writing_data()
{
   if  (!data_ofs && !immutable_head()) {
     /* only claim if pending_data_bl() is not empty. This is needed because we might be called twice
      * (e.g., when a retry due to race happens). So a second call to first_chunk.claim() would
      * clobber first_chunk
      */
     if  (pending_data_bl.length() > 0) {
       first_chunk.claim(pending_data_bl);
     }
     obj_len = (uint64_t)first_chunk.length();
   }
   while (pending_data_bl.length()) { // write to the RADOS objects in several I/Os
     void *handle = nullptr;
     rgw_raw_obj obj;
     uint64_t max_write_size = MIN(max_chunk_size, (uint64_t)next_part_ofs - data_ofs); // size of each I/O
     if  (max_write_size > pending_data_bl.length()) {
       max_write_size = pending_data_bl.length();
     }
     bufferlist bl;
     pending_data_bl.splice(0, max_write_size, &bl);
     uint64_t write_len = bl.length();
     int  r = write_data(bl, data_ofs, &handle, &obj,  false );
     if  (r < 0) {
       ldout(store->ctx(), 0) <<  "ERROR: write_data() returned "  << r << dendl;
       return  r;
     }
     data_ofs += write_len;
     r = throttle_data(handle, obj, write_len,  false );
     if  (r < 0) {
       ldout(store->ctx(), 0) <<  "ERROR: throttle_data() returned "  << r << dendl;
       return  r;
     }
     if (data_ofs >= next_part_ofs) { // reached the user-data offset where the next RADOS object starts
       r = prepare_next_part(data_ofs);
       if  (r < 0) {
         ldout(store->ctx(), 0) <<  "ERROR: prepare_next_part() returned "  << r << dendl;
         return  r;
       }
     }
   }
   int  r = complete_parts();
   if  (r < 0) {
     return  r;
   }
   r = drain_pending();
   if  (r < 0)
     return  r;
   return  0;
}

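Summary:

The whole process consists of three phases:

  1. prepare: the main job of the prepare phase is to initialize the manifest data structure.
  2. handle_data: in this phase RGW repeatedly takes up to rgw_max_chunk_size bytes from the HTTP server buffer into a bufferlist and submits them to the RADOS layer as one or more asynchronous I/Os. Each I/O has size MIN(rgw_max_chunk_size, next_part_ofs - data_ofs), where next_part_ofs is the user-data offset at which the next RADOS object starts and data_ofs is the current data offset.
  3. complete: once all the data has been uploaded, the upload enters the complete phase, whose main job is to write the object's metadata to head_obj and to add the object's entry to the bucket index object so that the object can be listed.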
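
The I/O size rule from step 2 can be worked through with a short function. This is a sketch under assumptions, not RGW code: `plan_io_sizes` is a hypothetical helper, the head object is assumed to hold the first max_chunk_size bytes, every following RADOS object is assumed to hold stripe_size bytes, and max_chunk_size must be greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Sizes of the writes issued for one object, following
//   io_size = MIN(max_chunk_size, next_part_ofs - data_ofs).
std::vector<uint64_t> plan_io_sizes(uint64_t obj_size,
                                    uint64_t max_chunk_size,
                                    uint64_t stripe_size) {
  std::vector<uint64_t> sizes;
  uint64_t data_ofs = 0;                    // bytes already handed to RADOS
  uint64_t next_part_ofs = max_chunk_size;  // where the next RADOS object starts
  while (data_ofs < obj_size) {
    uint64_t io = std::min(max_chunk_size, next_part_ofs - data_ofs);
    io = std::min(io, obj_size - data_ofs); // the last write may be short
    sizes.push_back(io);
    data_ofs += io;
    if (data_ofs >= next_part_ofs) {
      next_part_ofs += stripe_size;         // move on to the next RADOS object
    }
  }
  return sizes;
}
```

With a 12-byte object, a chunk size of 4 and a stripe size of 6, the writes come out as 4, 4, 2, 2: the third write is cut short because it reaches the end of a RADOS object, and the fourth starts the next one. When the chunk size and the stripe size are equal, every write except possibly the last has the full chunk size.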
