Inspecting a dataset's structure in Python (using a dict as a switch-case)

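Machine learning work often involves datasets in formats such as JSON, MAT, and HDF5 (h5), each with its own nested label structure.
Knowing a dataset's structure, format, and value types makes it much easier to process.
I wrote a small, general-purpose program for this.
Here it is used to inspect the JSON annotations of the MSCOCO dataset; entries at the same nesting level are printed with the same indentation.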

# -*- coding: utf-8 -*-
"""
Created on Tue Nov  6 22:23:17 2018

@author: BigFly
"""
import json

def process_dict(obj,level):
    print("")
    for key in obj.keys():
        print("  "*level, "\"%s\""%(key), end=":   ")
        process(obj[key],level+1)
        
def process_list(obj,level):
    print(""," len=",len(obj))
    samplenum=1 # number of items to show for each list
    for idx in range(min(samplenum,len(obj))):
        print("  "*level, "item",idx, end=":   ")
        process(obj[idx], level+1)
    if len(obj)>samplenum:
        print("  "*level, "item ...")
        
def process_str(obj,level):
    print("",obj)
    
def process_num(obj,level):
    print("",obj)
    
# Dispatch table: maps a value's exact type to its handler (the "switch-case").
switch={dict  : process_dict,
        list  : process_list,
        str   : process_str,
        int   : process_num,
        float : process_num }

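def process(obj,level=0):
    obj_typ=type(obj)
    # Look the handler up first, so that a KeyError raised inside a handler
    # is not mistaken for an unsupported type.
    handler=switch.get(obj_typ)
    if handler is None:
        print("ERROR: NO ", obj_typ)
    else:
        handler(obj,level+1)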


path="E:\\dataset\\MSCOCO\\annotations_trainval2017\\annotations\\instances_val2017.json"
# To inspect the training split instead, use this path:
# path="E:\\dataset\\MSCOCO\\annotations_trainval2017\\annotations\\instances_train2017.json"

# Read the whole file; readline() would return only the first line.
with open(path, encoding="utf-8") as f:
    jsonstr=f.read()
print("jsonstr",type(jsonstr),len(jsonstr))
annotations=json.loads(jsonstr)

# Inspect the structure of annotations
process(annotations) #['licenses', 'categories', 'annotations', 'info', 'images']

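The code above handles five types; to support another type, add a handler for it in the same way.
Python had no switch-case statement when this was written (the match statement only arrived in Python 3.10), but a dict that maps types to functions does the same job.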
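As a minimal sketch of how another type could be added, the example below registers handlers for bool and None, which is what JSON true/false and null become after json.loads. The names describe, handlers, and the describe_* functions are hypothetical and not part of the program above; the sketch returns short strings instead of printing, so the result is easy to check.

```python
import json

def describe_bool(obj):
    # JSON true/false are loaded as Python bool
    return "bool " + str(obj)

def describe_none(obj):
    # JSON null is loaded as Python None
    return "null"

def describe_num(obj):
    return "num " + str(obj)

# Dispatch table keyed by exact type. bool needs its own entry:
# type(True) is bool, so the int entry is never used for it.
handlers = {
    bool: describe_bool,
    type(None): describe_none,
    int: describe_num,
    float: describe_num,
}

def describe(obj):
    # Unknown types fall through to an error message, as in process()
    handler = handlers.get(type(obj))
    if handler is None:
        return "ERROR: NO " + str(type(obj))
    return handler(obj)

values = json.loads('[true, null, 3, 0.5, "text"]')
results = [describe(v) for v in values]
print(results)
```

Because the lookup uses the exact type, a str value is reported as unsupported here until a handler for str is registered too.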

Output:


   licenses:     len= 8
       item 0:   
           name:    Attribution-NonCommercial-ShareAlike License
           id:    1
           url:    http://creativecommons.org/licenses/by-nc-sa/2.0/
       item ...
   categories:     len= 80
       item 0:   
           supercategory:    person
           name:    person
           id:    1
       item ...
   annotations:     len= 36781
       item 0:   
           id:    1768
           bbox:     len= 4
               item 0:    473.07
               item ...
           image_id:    289343
           iscrowd:    0
           area:    702.1057499999998
           category_id:    18
           segmentation:     len= 1
               item 0:     len= 134
                   item 0:    510.66
                   item ...
       item ...
   info:   
       version:    1.0
       date_created:    2017/09/01
       description:    COCO 2017 Dataset
       year:    2017
       contributor:    COCO Consortium
       url:    http://cocodataset.org
   images:     len= 5000
       item 0:   
           file_name:    000000397133.jpg
           id:    397133
           date_captured:    2013-11-14 17:02:52
           license:    4
           height:    427
           flickr_url:    http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg
           coco_url:    http://images.cocodataset.org/val2017/000000397133.jpg
           width:    640
       item ...

The output makes it clear that annotations is a dict with 5 keys, and it shows the type and a sample of the contents of each entry.
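The same top-level facts can also be checked directly in code. The sketch below is only an illustration: summarize_top_level is a hypothetical helper, and the small sample dict merely mimics the five top-level keys of the COCO annotations rather than loading the real file.

```python
def summarize_top_level(data):
    """Map each top-level key to (type name, length or None)."""
    summary = {}
    for key, value in data.items():
        # Only containers and strings have a meaningful length here
        size = len(value) if isinstance(value, (dict, list, str)) else None
        summary[key] = (type(value).__name__, size)
    return summary

# Tiny stand-in with the same five top-level keys as the output above
# (assumed shape only; the real lists are far longer).
sample = {
    "licenses": [{"id": 1}],
    "categories": [{"id": 1, "name": "person"}],
    "annotations": [{"id": 1768}],
    "info": {"year": 2017},
    "images": [{"id": 397133, "height": 427, "width": 640}],
}

summary = summarize_top_level(sample)
for key, (type_name, size) in summary.items():
    print(key, type_name, size)
```

With the real file, passing the loaded annotations dict to the helper would give the actual lengths (for example the number of images) in place of the stand-in values.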
