This article focuses on the file-extraction script extractor.py, which lives in the /sources/extractor directory.
The script is essentially a binwalk wrapper for extracting files. It has two switches, nf and nk: the former governs extraction of the files/directories needed to build an image (the root filesystem), the latter governs extraction of the kernel from the firmware, and both can be enabled at once.
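For orientation, an invocation is shaped roughly like the line below. Treat every flag spelling here as an assumption rather than a reference: the exact options differ across firmadyne/FirmAE versions, so check the argparse block in extractor.py itself.

python3 ./sources/extractor/extractor.py -b Netgear -sql 127.0.0.1 -nk firmware.zip images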
The source runs to 700+ lines; for convenience, the analysis below proceeds class by class and function by function.
The script's functions live mainly in the Extractor and ExtractionItem classes; beyond those there are only psql_check, which tests database connectivity, and main.
The functions of the two classes are analyzed in turn below.
The core functions are io_find_rootfs and extract; the rest are mostly simple utilities: __getstate__ (drops attributes that should not be pickled), io_dd (reads a given number of bytes at a given offset and writes them to an output file), magic (determines and returns the type of an input file), io_md5 (computes a file's hash), io_rm (deletes files), and _io_err (reports errors).
__init__ needs little comment: it sets parameters, spins up the worker pool, and so on. The only thing worth noting is the list of Unix-like directories:
UNIX_DIRS = ["bin", "etc", "dev", "home", "lib", "mnt", "opt", "root",
"run", "sbin", "tmp", "usr", "var"]
UNIX_THRESHOLD = 4
Finding the root filesystem.
def io_find_rootfs(start, recurse=True):
"""
Attempts to find a Linux root directory.
获取Linux形式的根文件系统。
"""
'''Recurse into single directory chains, e.g. jffs2-root/fs_1/.../
深度遍历,目的是为了获取单目录链 原因是解包出来的根文件目录可能被包含在单目录链内,
通过这个循环找到根文件目录所在的位置
'''
path = start
while (len(os.listdir(path)) == 1 and
os.path.isdir(os.path.join(path, os.listdir(path)[0]))):
path = os.path.join(path, os.listdir(path)[0])
# count number of unix-like directories
    '''
    At this point path is the candidate root directory; scan its immediate
    children and count how many Unix-like directory names it contains.
    '''
count = 0
for subdir in os.listdir(path):
if subdir in Extractor.UNIX_DIRS and \
os.path.isdir(os.path.join(path, subdir)) and \
len(os.listdir(os.path.join(path, subdir))) > 0:
count += 1
# check for extracted filesystem, otherwise update queue
    '''A bit odd: extraction completeness is judged purely by the number of Unix-like directories.'''
if count >= Extractor.UNIX_THRESHOLD:
return (True, path)
# in some cases, multiple filesystems may be extracted, so recurse to
# find best one
    '''
    Some firmware contains multiple filesystems, e.g. a read-only filesystem
    paired with a writable one for /tmp, or a jffs2 inside an LZMA block, so a
    recursive pass is needed. The logic looks flawed, though: if a qualifying
    root filesystem is found in a subdirectory, it is returned immediately
    without being compared against the one found first, which could hurt
    accuracy whenever the child filesystem is less complete than the parent.
    '''
if recurse:
for subdir in os.listdir(path):
if os.path.isdir(os.path.join(path, subdir)):
res = Extractor.io_find_rootfs(os.path.join(path, subdir),
False)
                '''Note: the recursion goes only one level deep (recurse=False).'''
if res[0]:
return res
return (False, start)
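A quick way to see the heuristic in action, using a hypothetical layout that wraps the rootfs in a single-directory chain (run against the class above):

import os, tempfile

# Hypothetical extraction result: jffs2-root/fs_1/ wraps the actual rootfs.
base = tempfile.mkdtemp()
root = os.path.join(base, "jffs2-root", "fs_1")
for d in ["bin", "etc", "lib", "usr", "var"]:      # 5 names >= UNIX_THRESHOLD (4)
    os.makedirs(os.path.join(root, d))
    open(os.path.join(root, d, "f"), "w").close()  # counted dirs must be non-empty

print(Extractor.io_find_rootfs(base))              # (True, '<base>/jffs2-root/fs_1')

The while loop collapses the two-link chain, and the five populated Unix-like directories push the count past UNIX_THRESHOLD.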
extract() populates self._list and fans the work out over the pool, with _extract_item doing the actual job. At this point self._list is just the list of input files/directories; nothing has been extracted yet.
def extract(self):
"""
Perform extraction of firmware updates from input to tarballs in output
directory using a thread pool.
多线程提取固件文件系统并压缩为tar.gz
"""
if os.path.isdir(self._input):
for path, _, files in os.walk(self._input):
for item in files:
self._list.append(os.path.join(path, item))
elif os.path.isfile(self._input):
self._list.append(self._input)
    '''If the input is a directory, walk it and append every file to self._list;
    if it is a single file, append it directly.'''
if self.output_dir and not os.path.isdir(self.output_dir):
os.makedirs(self.output_dir)
if self._pool:
# since we have to handle multiple files in one firmware image, it
# is better to use chunk_size=1
chunk_size = 1
list(self._pool.imap_unordered(self._extract_item, self._list,
chunk_size))
else:
for item in self._list:
self._extract_item(item)
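The chunk_size = 1 comment deserves a note: imap_unordered pre-assigns items to workers in chunks, so with wildly uneven extraction times a larger chunksize can strand quick jobs in a batch behind a slow one. A standalone toy illustration (timings hypothetical):

import time
from multiprocessing import Pool

def work(seconds):
    time.sleep(seconds)          # stand-in for extracting one firmware image
    return seconds

if __name__ == "__main__":
    jobs = [3, 0.1, 0.1, 0.1, 0.1, 0.1]   # one slow image among quick ones
    with Pool(2) as pool:
        # chunksize=1: a free worker pulls the next single item immediately,
        # so the quick items never wait in a batch behind the slow one.
        for done in pool.imap_unordered(work, jobs, chunksize=1):
            print(done)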
_extract_item simply instantiates an ExtractionItem and calls its extract() method; that class in turn calls back into the Extractor class's functions.
def _extract_item(self, path):
"""
Wrapper function that creates an ExtractionItem and calls the extract()
method.
"""
ExtractionItem(self, path, 0, None, self.debug).extract()
__init__ likewise just sets parameters. Note that self here is the ExtractionItem instance, which stores a reference to the Extractor passed in; this is composition rather than inheritance. There are also a few status-initialization lines worth quoting:
# Tag
self.tag = tag if tag else self.generate_tag()
......
# Status, with terminate indicating early termination for this item
self.terminate = False
self.status = None
self.update_status()
There are also a couple of spellings that are easy to misread:
self.extractor = extractor
......
self.checksum = Extractor.io_md5(path)
Here the lowercase extractor is the Extractor instance that was passed in, while the capitalized Extractor is the class itself; io_md5 is a static method, so it is called through the class.
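Reduced to its skeleton, the relationship is plain composition; a minimal sketch (the io_md5 body is a guess at what such a utility would do, not the actual implementation):

import hashlib

class Extractor:
    @staticmethod
    def io_md5(path):
        # hash the file in chunks (sketch of a typical static utility)
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.hexdigest()

class ExtractionItem:
    def __init__(self, extractor, path):
        self.extractor = extractor              # lowercase: the shared *instance*
        self.checksum = Extractor.io_md5(path)  # capitalized: static method on the *class*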
There is again a pile of utility functions: __del__ (closes the database connection and deletes temporary files), printf (prints only when debugging), get_kernel_status/get_rootfs_status (report whether the kernel/rootfs has been extracted, used to detect completion), get_kernel_path/get_rootfs_path (return the corresponding output paths), and get_status (checks whether a terminate signal arrived or extraction has finished).
This class matters far more than the previous one: it contains most of the functions central to extraction.
generate_tag updates the tag; it is called from __init__ and, along the way, updates brand and image in the database.
As an aside: the tables in the firmadyne database, such as object, brand and image, are kept relatively independent of one another rather than forced into one rigid structure, which I suspect helps robustness.
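For reference, the slice of the schema that this class touches can be reconstructed from its queries. The DDL below is a sketch with assumed types and constraints, not firmadyne's actual schema file:

# Sketch of the tables implied by the queries in generate_tag/update_status;
# column types and constraints here are assumptions.
SCHEMA_SKETCH = """
CREATE TABLE brand (
    id   SERIAL PRIMARY KEY,
    name TEXT
);
CREATE TABLE image (
    id       SERIAL PRIMARY KEY,
    filename TEXT,
    brand_id INTEGER,          -- filled from brand.id on insert
    hash     TEXT,             -- MD5 checksum of the input file
    kernel_version   TEXT,     -- written by _check_kernel via update_database
    kernel_extracted BOOLEAN,  -- written by update_status
    rootfs_extracted BOOLEAN
);
"""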
def generate_tag(self):
"""
Generate the filename tag.
生成文件名tag
"""
if not self.database:
return os.path.basename(self.item) + "_" + self.checksum
    '''Without a database, the tag is simply the file name concatenated with
    the checksum; in practice this branch is rarely taken.'''
try:
image_id = None
cur = self.database.cursor()
if self.extractor.brand:
brand = self.extractor.brand
else:
brand = os.path.relpath(self.item).split(os.path.sep)[0]
cur.execute("SELECT id FROM brand WHERE name=%s", (brand, ))
brand_id = cur.fetchone()
if not brand_id:
cur.execute("INSERT INTO brand (name) VALUES (%s) RETURNING id",
(brand, ))
            '''If no brand was specified, the first component of the input's
            relative path is used as the brand (run.sh can pass brand as auto
            for this). An unknown brand is inserted as a new row so later
            emulation runs can look it up.'''
brand_id = cur.fetchone()
if brand_id:
            '''Since a missing brand was just inserted above, this branch is effectively always taken.'''
cur.execute("SELECT id FROM image WHERE hash=%s",
(self.checksum, ))
image_id = cur.fetchone()
if not image_id:
cur.execute("INSERT INTO image (filename, brand_id, hash) \
VALUES (%s, %s, %s) RETURNING id",
(os.path.basename(self.item), brand_id[0],
self.checksum))
image_id = cur.fetchone()
                '''No image id means this firmware has not been analyzed
                before, so a new row is inserted into image.'''
self.database.commit()
except BaseException:
traceback.print_exc()
self.database.rollback()
finally:
if cur:
cur.close()
if image_id:
        '''A previously analyzed image yields its id directly.'''
        self.printf(">> Database Image ID: %s" % image_id[0])
    return str(image_id[0]) if \
        image_id else os.path.basename(self.item) + "_" + self.checksum
This is the status updater: it sets the kernel_done and rootfs_done flags according to the conditions and mirrors them into the database. Updates only happen when the corresponding extraction option is enabled; otherwise there is nothing to do.
def update_status(self):
"""
Updates the status flags using the tag to determine completion status.
"""
kernel_done = os.path.isfile(self.get_kernel_path()) \
if self.extractor.do_kernel and self.output \
else not self.extractor.do_kernel
rootfs_done = os.path.isfile(self.get_rootfs_path()) \
if self.extractor.do_rootfs and self.output \
else not self.extractor.do_rootfs
self.status = (kernel_done, rootfs_done)
self.extractor.kernel_done = kernel_done
self.extractor.rootfs_done = rootfs_done
if self.database and kernel_done and self.extractor.do_kernel:
self.update_database("kernel_extracted", "True")
if self.database and rootfs_done and self.extractor.do_rootfs:
self.update_database("rootfs_extracted", "True")
return self.get_status()
    '''Return the freshly computed status rather than the two done flags alone:
    the item may have been terminated early, so get_status() has the final say.'''
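The conditional expressions above are dense; spelled out for the kernel flag, they are equivalent to:

# "done" means: the artifact file already exists on disk, or extracting
# that artifact was never requested (so there is nothing left to wait for).
if self.extractor.do_kernel and self.output:
    kernel_done = os.path.isfile(self.get_kernel_path())
else:
    kernel_done = not self.extractor.do_kernel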
The blacklist check tests whether a single file's type is blacklisted, so that files that cannot be extracted are not fed in as firmware.
def _check_blacklist(self):
"""
Check if this file is blacklisted for analysis based on file type.
"""
real_path = os.path.realpath(self.item)
# print ("------ blacklist checking --------------")
# print (self.item)
# print (real_path)
# First, use MIME-type to exclude large categories of files
filetype = Extractor.magic(real_path.encode("utf-8", "surrogateescape"),
mime=True)
# print (filetype)
if filetype:
if any(s in filetype for s in ["application/x-executable",
"application/x-dosexec",
"application/x-object",
"application/x-sharedlib",
"application/pdf",
"application/msword",
"image/", "text/", "video/"]):
self.printf(">> Skipping: %s..." % filetype)
            '''What gets skipped is mostly executables plus document/media
            files that take up space but cannot yield a filesystem.'''
return True
# Next, check for specific file types that have MIME-type
# 'application/octet-stream'
filetype = Extractor.magic(real_path.encode("utf-8", "surrogateescape"))
if filetype:
if any(s in filetype for s in ["executable", "universal binary",
"relocatable", "bytecode", "applet",
"shared"]):
            '''A file-type blacklist; a little odd that "executable" shows up here too...'''
self.printf(">> Skipping: %s..." % filetype)
return True
# print (filetype)
# print ('-=----------------------------')
# Finally, check for specific file extensions that would be incorrectly
# identified
black_lists = ['.dmg', '.so', '.so.0']
    '''File-extension blacklist.'''
for black in black_lists:
if self.item.endswith(black):
self.printf(">> Skipping: %s..." % (self.item))
return True
return False
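Extractor.magic wraps libmagic; the same two-stage check can be reproduced with the python-magic package. A sketch with a hypothetical input path, not the wrapper's actual body:

import magic  # the python-magic package

path = "firmware.bin"                    # hypothetical input
mime = magic.from_file(path, mime=True)  # e.g. "application/octet-stream"
desc = magic.from_file(path)             # e.g. "Squashfs filesystem, little endian, ..."
print(mime, "|", desc)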
Judging by the comments, this handles firmware whose extraction recipe is already known: one case packs the filesystem and kernel together (flash- or ramdisk-style images), the other manually dd's out the kernel and the filesystem. In this era of encrypted firmware and custom compression algorithms, though, such tricks feel rather feeble...
def _check_firmware(self, module, entry):
"""
If this file is of a known firmware type, directly attempt to extract
the kernel and root filesystem.
"""
dir_name = module.extractor.directory
desc = entry.description
if 'header' in desc:
# uImage
        '''Check whether the entry describes a kernel that has not been extracted yet.'''
if "uImage header" in desc:
if not self.get_kernel_status() and "OS Kernel Image" in desc:
kernel_offset = entry.offset + 64
kernel_size = 0
                '''Skip the 64-byte uImage header in front of the kernel.'''
for stmt in desc.split(','):
if "image size:" in stmt:
kernel_size = int(''.join(
i for i in stmt if i.isdigit()), 10)
if kernel_size != 0 and kernel_offset + kernel_size \
<= os.path.getsize(self.item):
self.printf(">>>> %s" % desc)
tmp_fd, tmp_path = tempfile.mkstemp(dir=self.temp)
os.close(tmp_fd)
Extractor.io_dd(self.item, kernel_offset,
kernel_size, tmp_path)
kernel = ExtractionItem(self.extractor, tmp_path,
self.depth, self.tag, self.debug)
return kernel.extract()
                    '''Carve the kernel straight out of the image into a temp
                    file, then extract it as a new ExtractionItem; this handles
                    images where the filesystem and kernel are packed together.'''
# elif "RAMDisk Image" in entry.description:
# self.printf(">>>> %s" % entry.description)
# self.printf(">>>> Skipping: RAMDisk / initrd")
# self.terminate = True
# return True
# TP-Link or TRX
elif not self.get_kernel_status() and \
not self.get_rootfs_status() and \
"rootfs offset: " in desc and "kernel offset: " in desc:
image_size = os.path.getsize(self.item)
header_size = 0
kernel_offset = 0
kernel_size = 0
rootfs_offset = 0
rootfs_size = 0
for stmt in desc.split(','):
if "header size" in stmt:
header_size = int(stmt.split(':')[1].split()[0])
elif "kernel offset:" in stmt:
kernel_offset = int(stmt.split(':')[1], 16)
elif "kernel length:" in stmt:
kernel_size = int(stmt.split(':')[1], 16)
elif "rootfs offset:" in stmt:
rootfs_offset = int(stmt.split(':')[1], 16)
elif "rootfs length:" in stmt:
rootfs_size = int(stmt.split(':')[1], 16)
# add entry offset
kernel_offset += entry.offset
rootfs_offset += entry.offset + header_size
# compute sizes if only offsets provided
if rootfs_offset < kernel_offset:
if rootfs_size == 0:
rootfs_size = kernel_offset - rootfs_offset
if kernel_size == 0:
kernel_size = image_size - kernel_offset
elif rootfs_offset > kernel_offset:
if kernel_size == 0:
kernel_size = rootfs_offset - kernel_offset
if rootfs_size == 0:
rootfs_size = image_size - rootfs_offset
self.printf('image size: %d' % image_size)
self.printf('rootfs offset: %d' % rootfs_offset)
self.printf('rootfs size: %d' % rootfs_size)
self.printf('kernel offset: %d' % kernel_offset)
self.printf('kernel size: %d' % kernel_size)
# ensure that computed values are sensible
if kernel_size > 0 and rootfs_size > 0 and \
kernel_offset + kernel_size <= image_size and \
rootfs_offset + rootfs_size <= image_size:
self.printf(">>>> %s" % desc)
tmp_fd, tmp_path = tempfile.mkstemp(dir=self.temp)
os.close(tmp_fd)
Extractor.io_dd(self.item, kernel_offset, kernel_size,
tmp_path)
kernel = ExtractionItem(self.extractor, tmp_path,
self.depth, self.tag, self.debug)
kernel.extract()
tmp_fd, tmp_path = tempfile.mkstemp(dir=self.temp)
os.close(tmp_fd)
Extractor.io_dd(self.item, rootfs_offset, rootfs_size,
tmp_path)
rootfs = ExtractionItem(self.extractor, tmp_path,
self.depth, self.tag, self.debug)
rootfs.extract()
return True
                '''Each region is carved out with io_dd and then run through
                extract() again, presumably because the carved data may itself
                be compressed.'''
return False
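As background for the offset + 64 above: the legacy U-Boot uImage header is exactly 64 bytes, and its ih_size field is the "image size:" that binwalk reports. A minimal reference parser:

import struct

UIMAGE_MAGIC = 0x27051956   # legacy U-Boot image magic
UIMAGE_HEADER_SIZE = 64     # the 64 bytes _check_firmware skips

def parse_uimage_header(buf):
    # All fields are big-endian, per U-Boot's image_header_t layout.
    magic, hcrc, timestamp, size, load, entry, dcrc = struct.unpack(">7I", buf[:28])
    os_id, arch, img_type, comp = struct.unpack(">4B", buf[28:32])
    name = buf[32:64].rstrip(b"\x00").decode("ascii", "replace")
    if magic != UIMAGE_MAGIC:
        raise ValueError("not a uImage header")
    return {"data_size": size, "load": load, "entry": entry, "name": name}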
If no kernel has been extracted yet, any file whose description contains kernel and a kernel version string is simply pressed into service as the kernel... and VxWorks is skipped outright, for what that's worth...
def _check_kernel(self, module, entry):
"""
If this file contains a kernel version string, assume it is a kernel.
Only Linux kernels are currently extracted.
"""
dir_name = module.extractor.directory
desc = entry.description
if 'kernel' in desc:
if self.get_kernel_status(): return True
else:
if "kernel version" in desc:
self.update_database("kernel_version", desc)
if "Linux" in desc:
if self.get_kernel_path():
shutil.copy(self.item, self.get_kernel_path())
else:
self.extractor.do_kernel = False
self.printf(">>>> %s" % desc)
return True
# VxWorks, etc
else:
self.printf(">>>> Ignoring: %s" % desc)
return False
Locates the filesystem directory and packs it into a tar.gz.
def _check_rootfs(self, module, entry):
"""
If this file contains a known filesystem type, extract it.
"""
dir_name = module.extractor.directory
desc = entry.description
if 'filesystem' in desc or 'archive' in desc or 'compressed' in desc:
        '''Compressed filesystems are handled here as well.'''
if self.get_rootfs_status(): return True
else:
if dir_name:
unix = Extractor.io_find_rootfs(dir_name)
                '''The unpacking itself already happened in extract(); this
                step only locates the filesystem directory.'''
if not unix[0]:
self.printf(">>>> Extraction failed!")
return False
self.printf(">>>> Found Linux filesystem in %s!" % unix[1])
if self.output:
shutil.make_archive(self.output, "gztar",
root_dir=unix[1])
                    '''Pack the filesystem into a tar.gz archive.'''
else:
self.extractor.do_rootfs = False
return True
return False
# treat both archived and compressed files using the same pathway. this is
# because certain files may appear as e.g. "xz compressed data" but still
# extract into a root filesystem.
'''The reason: some filesystems are identified as plain compressed data yet
still unpack into a root filesystem.'''
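One note on shutil.make_archive: it appends the archive suffix itself, so the call above produces <output>.tar.gz, which is presumably what get_rootfs_path() later tests for. A usage example with hypothetical paths:

import shutil

# Packs the tree rooted at root_dir into /outputs/1.tar.gz.
shutil.make_archive("/outputs/1", "gztar", root_dir="/tmp/tmpX/jffs2-root/fs_1")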
Recursive extraction, for filesystems nested inside one another.
def _check_recursive(self, module, entry):
"""
Unified implementation for checking both "archive" and "compressed"
items.
"""
dir_name = module.extractor.directory
desc = entry.description
# filesystem for the netgear WNR2000 firmware (kernel in Squashfs)
if 'filesystem' in desc or 'archive' in desc or 'compressed' in desc:
        '''Decide whether this is a compressed filesystem that needs recursive
        extraction; other file types also end up here when the earlier checks
        fail to complete extraction.'''
if dir_name:
self.printf(">> Recursing into %s ..." % desc)
count = 0
for root, dirs, files in os.walk(dir_name):
# sort both descending alphabetical and increasing
# length
            '''Walk the extraction directory.'''
files.sort()
files.sort(key=len)
if (not self.extractor.do_rootfs or self.get_rootfs_status()) and 'bin' in dirs and 'lib' in dirs:
break
                '''Stop once rootfs extraction is unnecessary or already done.'''
# handle case where original file name is restored; put
# it to front of queue
if desc and "original file name:" in desc:
orig = None
for stmt in desc.split(","):
if "original file name:" in stmt:
orig = stmt.split("\"")[1]
if orig and orig in files:
files.remove(orig)
files.insert(0, orig)
                '''Files restored under their gzip "original file name" are
                moved to the front of the queue, presumably because that file
                is most likely the real payload.'''
for filename in files:
# if count > ExtractionItem.RECURSION_BREADTH:
# self.printf(">> Skipping: recursion breadth %d"\
# % ExtractionItem.RECURSION_BREADTH)
# return False
path = os.path.join(root, filename)
if not pathlib.Path(path).is_file():
continue
new_item = ExtractionItem(self.extractor,
path,
self.depth + 1,
self.tag,
self.debug)
if new_item.extract():
                    '''Recurse with depth + 1 to guard against infinite recursion.'''
if self.update_status():
return True
count += 1
return False
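The double sort above leans on Python's stable sort: sorting by length second preserves the alphabetical order among equal-length names, so the walk visits short names first, alphabetically within each length:

files = ["zImage", "rootfs.bin", "a.bin", "b.bin"]
files.sort()          # ['a.bin', 'b.bin', 'rootfs.bin', 'zImage']
files.sort(key=len)   # ['a.bin', 'b.bin', 'zImage', 'rootfs.bin']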
extract() is the core function. It essentially walks the extracted files/directories with _check_firmware, _check_rootfs, _check_kernel and _check_recursive, forming a standardized firmware-extraction pipeline, which is also why those functions re-enter it.
Every invocation runs binwalk on the input file and puts the results in a fresh temporary directory. Since binwalk both carves files and decompresses compressed data, the helper functions also carry the logic for handling directories and for skipping data that is already decompressed.
The whole flow keys off get_status: as soon as completion is detected it returns, presumably to avoid redundant work given the multi-worker setup.
def extract(self):
"""
Perform the actual extraction of firmware updates, recursively. Returns
True if extraction complete, otherwise False.
"""
self.printf("\n" + self.item.encode("utf-8", "replace").decode("utf-8"))
# check if item is complete
if self.get_status():
        '''Check the status first: extraction may already be complete.'''
self.printf(">> Skipping: completed!")
return True
# check if exceeding recursion depth
if self.depth > ExtractionItem.RECURSION_DEPTH:
        '''Cap the recursion depth, since _check_firmware and _check_recursive
        both spawn new ExtractionItems.'''
self.printf(">> Skipping: recursion depth %d" % self.depth)
return self.get_status()
# check if checksum is in visited set
self.printf(">> MD5: %s" % self.checksum)
with Extractor.visited_lock:
        '''Hold the lock while touching the shared visited set.'''
# Skip the same checksum only in the same status
# asus_latest(FW_RT_N12VP_30043804057.zip) firmware
if (self.checksum in self.extractor.visited and
self.extractor.visited[self.checksum] == self.status):
            '''The hash together with the status identifies a file that was
            already processed in the same state.'''
self.printf(">> Skipping: %s..." % self.checksum)
return self.get_status()
else:
self.extractor.visited[self.checksum] = self.status
# check if filetype is blacklisted
if self._check_blacklist():
        '''_check_blacklist runs on each input file; here it keeps recursive
        extraction from wasting time on files that cannot matter.'''
return self.get_status()
# create working directory
self.temp = tempfile.mkdtemp()
# Move to temporary directory so binwalk does not write to input
os.chdir(self.temp)
try:
self.printf(">> Tag: %s" % self.tag)
self.printf(">> Temp: %s" % self.temp)
self.printf(">> Status: Kernel: %s, Rootfs: %s, Do_Kernel: %s, \
Do_Rootfs: %s" % (self.get_kernel_status(),
self.get_rootfs_status(),
self.extractor.do_kernel,
self.extractor.do_rootfs))
for module in binwalk.scan(self.item, "--run-as=root", "--preserve-symlinks",
"-e", "-r", "-C", self.temp, signature=True, quiet=True):
            '''binwalk unpacks the firmware into the temp directory. Only the
            signature module is enabled here, so in theory this loop body runs
            exactly once.'''
prev_entry = None
for entry in module.results:
                '''Iterate over the scan results.'''
desc = entry.description
dir_name = module.extractor.directory
                if prev_entry and prev_entry.description == desc and \
                        'Zlib compressed data' in desc:
                    continue
                '''binwalk decompresses Zlib data itself, so duplicate Zlib
                entries can be skipped; the check functions scan the extraction
                directory anyway.'''
prev_entry = entry
self.printf('========== Depth: %d ===============' % self.depth)
self.printf("Name: %s" % self.item)
self.printf("Desc: %s" % desc)
self.printf("Directory: %s" % dir_name)
self._check_firmware(module, entry)
                '''Try the known-firmware fast path first.'''
if not self.get_rootfs_status():
self._check_rootfs(module, entry)
if not self.get_kernel_status():
self._check_kernel(module, entry)
if self.update_status():
self.printf(">> Skipping: completed!")
return True
else:
                    '''If one pass could not finish the job, recurse into the
                    extracted data.'''
self._check_recursive(module, entry)
except Exception:
print ("ERROR: ", self.item)
traceback.print_exc()
return False
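For completeness, the binwalk Python API used here can be exercised on its own; each enabled module yields one object whose results carry an offset and a description, exactly what the loops above consume:

import binwalk

# Signature scan only, no extraction; mirrors how extract() reads results.
for module in binwalk.scan("firmware.bin", signature=True, quiet=True):
    for entry in module.results:
        print("0x%08X  %s" % (entry.offset, entry.description))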
To me the genuinely tricky part is the recursion, though real firmware rarely recurses very deep. The core flow is really the extract() function analyzed last: iterate over every entry and run the checks. Only _check_firmware and _check_recursive recurse, and only under specific conditions.
Personally I find the script a bit verbose, but given that it targets a test corpus of tens of thousands of firmware images, the style is presumably in service of broad compatibility (see _check_firmware, for example).
For future work, I would start with firmware decryption. Since the database records a brand and the image table stores the firmware filename, there is real room to experiment there (plain extraction usually works and needs little improvement). The inability to handle VxWorks is another pain point: VxWorks brings symbol tables and all sorts of other mess, and this script is clearly usable only for small Linux-based firmware.