Angr Source Code Analysis: Generating the DDG

This blog post was written by the idle white-hat 胖胖鹏鹏胖胖鹏潜力, and is shared for personal technical exchange only; it must not be used commercially. Please credit the source when reposting; reposting or commercial use of any content on this blog without permission is prohibited.

In the previous post we analyzed how the static analysis in the DTaint paper works. The paper also remarks that angr's DDG is slow, so this time let's look at how angr actually generates its DDG, compare the two approaches, and figure out whether the slowness comes from the implementation or from the method itself.

First, here is how angr itself describes the DDG:

This is a fast way to generate a data dependence graph from the CFG. The only reason to use it is speed. We make no guarantees about its correctness or accuracy. It is only recommended when you want to roughly track data flow and do not care about precision.

If you want a better data dependence graph, consider running a better static analysis first (for example, value-set analysis) and then building the dependence graph on top of its results (for example, angr's VFG).

Also note that since the data dependence graph is generated from the CFG, any technique that improves the accuracy of the CFG will benefit DDG generation as well.

  • cfg – the control flow graph on which the DDG is built
  • start – an address specifying where DDG generation starts
  • call_depth – None or an integer; limits how deep into calls the analysis goes
  • block_addrs (iterable or None) – the addresses of the blocks the DDG analysis is restricted to
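The `start` parameter only changes how the initial worklist is seeded: when no start is given, every CFG node with in-degree 0 becomes a seed. A stdlib-only sketch of that seeding rule (the toy graph and node names are invented for illustration; the real code walks angr's CFG object):

```python
# Toy directed graph: node -> list of successor nodes.
graph = {
    "entry": ["a"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def seed_worklist(graph, start=None):
    """Mimic the DDG seeding: the start node if given, else all in-degree-0 nodes."""
    if start is not None:
        return [start]
    # Compute in-degrees, then keep the nodes nothing jumps to.
    in_deg = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            in_deg[s] += 1
    return [n for n, d in in_deg.items() if d == 0]

print(seed_worklist(graph))       # only "entry" has in-degree 0
print(seed_worklist(graph, "b"))  # an explicit start wins
```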

So, everything starts with init. self.__init__ sets up a few member variables and then calls self._construct(); let's step into it.

def _construct(self):
	"""
	Construct the data dependence graph.
	We track the following types of dependence:
	- (Intra-IRSB) temporary variable dependencies
	- Register dependencies
	- Very limited memory dependencies
	Types of memory access we track:
	- (Intra-functional) stack reads and writes
	  Trace changes of the stack pointer inside a function, plus dependencies on the stack pointer
	- (Global) accesses to statically-known memory addresses
	  For each function, record the memory addresses accessed and the source statements. Afterwards, traverse the CFG and link each read/write pair in control-flow order
	Types of memory access we do not track:
	- Accesses through symbolic addresses
	  These are not resolved in fastpath mode anyway
	"""

	worklist = []
	worklist_set = set()

	# Initialize the worklist: every node with in-degree 0 goes in, making this a forward analysis
	if self._start is None:
		# initial nodes are those nodes in CFG that has no in-degrees
		for n in self._cfg.graph.nodes():
			if self._cfg.graph.in_degree(n) == 0:
				# Put it into the worklist
				job = DDGJob(n, 0)
				self._worklist_append(job, worklist, worklist_set)
	else: # otherwise, seed the worklist with the given start node(s)
		for n in self._cfg.get_all_nodes(self._start):
			job = DDGJob(n, 0)
			self._worklist_append(job, worklist, worklist_set)

	# A dict storing the definitions for each node
	# CFGNode -> LiveDefinitions
	live_defs_per_node = {}

	while worklist:
		# Pop a node off the worklist
		ddg_job = worklist[0]
		l.debug("Processing %s.", ddg_job)
		node, call_depth = ddg_job.cfg_node, ddg_job.call_depth
		worklist = worklist[ 1 : ]
		worklist_set.remove(node)

		# Grab all final states; there is usually exactly one, but we process all of them,
		# and create (or fetch) the definitions for this node in the dict
		final_states = node.final_states

		if node in live_defs_per_node:
			live_defs = live_defs_per_node[node]
		else:
			live_defs = LiveDefinitions()
			live_defs_per_node[node] = live_defs

		successing_nodes = list(self._cfg.graph.successors(node))

		# Match each final state with its successor(s), and vice versa
		match_suc = defaultdict(bool)
		match_state = defaultdict(set)

		# Iterate over all successors and all final states
		for suc in successing_nodes:
			matched = False
			for state in final_states:
				try:
					if state.solver.eval(state.ip) == suc.addr: # the state's jump target is this successor: a match
						match_suc[suc.addr] = True              # record the match on both sides
						match_state[state].add(suc)
						matched = True
				except (SimUnsatError, SimSolverModeError, ZeroDivisionError):
					# ignore
					matched = matched
			if not matched:
				break

		# Check whether the states and the successors match up completely
		matches = len(match_suc) == len(successing_nodes) and len(match_state) == len(final_states)

		# Iterate over all final states
		for state in final_states:
			if not matches and state.history.jumpkind == 'Ijk_FakeRet' and len(final_states) > 1:
				# Skip fakerets when other control-flow transitions are available
				continue

			# Update the call depth
			new_call_depth = call_depth
			if state.history.jumpkind == 'Ijk_Call':
				new_call_depth += 1
			elif state.history.jumpkind == 'Ijk_Ret':
				new_call_depth -= 1

			# Check whether we have exceeded the call-depth limit
			if self._call_depth is not None and call_depth > self._call_depth:
				l.debug('Do not trace into %s due to the call depth limit', state.ip)
				continue

			# Compute the new definitions
			new_defs = self._track(state, live_defs, node.irsb.statements if node.irsb is not None else None)

			#corresponding_successors = [n for n in successing_nodes if
			#                            not state.ip.symbolic and n.addr == state.solver.eval(state.ip)]
			#if not corresponding_successors:
			#    continue

			changed = False

			# If every successor can be matched to one or more final states (by the ip
			# instruction address), only the matching states' LiveDefinitions are carried over
			if matches:
				add_state_to_sucs = match_state[state]
			else:
				add_state_to_sucs = successing_nodes

			for successing_node in add_state_to_sucs: # iterate over the relevant successors

				# If the jumpkind is Ijk_Call or Ijk_Sys*, and the IP is not symbolic,
				# filter out the definitions that do not survive across the call site
				if (state.history.jumpkind == 'Ijk_Call' or state.history.jumpkind.startswith('Ijk_Sys')) and \
						(state.ip.symbolic or successing_node.addr != state.solver.eval(state.ip)):
					suc_new_defs = self._filter_defs_at_call_sites(new_defs)
				else:
					suc_new_defs = new_defs

				# Check whether this successor is already in live_defs_per_node, i.e. whether definitions have been created for it before
				if successing_node in live_defs_per_node:
					defs_for_next_node = live_defs_per_node[successing_node]
				else:
					defs_for_next_node = LiveDefinitions()
					live_defs_per_node[successing_node] = defs_for_next_node

				# Iterate over each (var, code_loc_set) pair in the new definitions
				# and check whether any definition has changed
				for var, code_loc_set in suc_new_defs.items():
					# l.debug("Adding %d new definitions for variable %s.", len(code_loc_set), var)
					changed |= defs_for_next_node.add_defs(var, code_loc_set)

			# If anything changed, there may be new dependencies, so put all successors back onto the worklist for re-analysis
			if changed:
				if (self._call_depth is None) or \
						(self._call_depth is not None and 0 <= new_call_depth <= self._call_depth):
					# Put all reachable successors back to our work-list again
					for successor in self._cfg.get_all_successors(node):
						nw = DDGJob(successor, new_call_depth)
						self._worklist_append(nw, worklist, worklist_set)

The _track() function is where the main data-flow bookkeeping happens. Let's step into it:

def _track(self, state, live_defs, statements):
	"""
	Given all live definitions prior to this program point, track the changes made by
	the new state, and return a new list of live definitions. Changes are discovered by
	scanning the actions recorded on the new state.
	:param state:           The input state at this program point.
	:param live_defs:       All live definitions prior to reaching this program point.
	:param list statements: A list of VEX statements.
	:returns:               The new list of live definitions.
	"""

	# Make a copy of the existing live_defs
	self._live_defs = live_defs.copy()
    
	# Fetch the actions recorded on the state's history
	action_list = list(state.history.recent_actions)

	# Since all temporary variables are local in scope, a single dict is enough to track them
	self._temp_variables = { }        # temporary variables
	self._temp_register_symbols = { } # temporary register symbols

	# All dependence edges are added to the graph either at the end of this method,
	# or right before they are about to be overwritten by a new edge. This is because
	# we sometimes have to modify a previously-added edge (e.g. attach new labels to it)
	self._temp_edges = defaultdict(list)
	self._register_edges = defaultdict(list)

	last_statement_id = None
	self._variables_per_statement = None  # variables read within the same statement are collected here,
										  # so we can link them to the tmp_write action
	self._custom_data_per_statement = None

	for a in action_list: # iterate over all recorded actions
		if last_statement_id is None or last_statement_id != a.stmt_idx:
			# Update the statement ID
			last_statement_id = a.stmt_idx
			statement = statements[last_statement_id] if statements and last_statement_id < len(statements) else None

			# Initialize the per-statement data structures
			self._variables_per_statement = [ ]
			self._custom_data_per_statement = None

		if a.sim_procedure is None:
			current_code_location = CodeLocation(a.bbl_addr, a.stmt_idx, ins_addr=a.ins_addr)
		else:
			current_code_location = CodeLocation(None, None, sim_procedure=a.sim_procedure)

		# Update the data dependence according to the action type
		if a.type == 'exit': # handle exit actions
			self._handle_exit(a, current_code_location, state, statement)
		elif a.type == 'operation': # handle operation actions
			self._handle_operation(a, current_code_location, state, statement)
		elif a.type == 'constraint':
			pass
		else:
			handler_name = "_handle_%s_%s" % (a.type, a.action)
			if hasattr(self, handler_name):
				getattr(self, handler_name)(a, current_code_location, state, statement)
			else:
				l.debug("Skip an unsupported action %s.", a)

	return self._live_defs
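The `handler_name` lookup at the bottom of `_track()` is a small dynamic-dispatch idiom: the action's type and direction are concatenated into a method name and resolved with `getattr`, and unsupported combinations fall through to a log message. A minimal sketch of the pattern (the `Action` and `Tracker` classes here are invented for illustration):

```python
class Action:
    def __init__(self, type_, action):
        self.type = type_
        self.action = action

class Tracker:
    def __init__(self):
        self.seen = []

    # Handlers follow the _handle_<type>_<action> naming convention.
    def _handle_reg_write(self, a):
        self.seen.append("reg_write")

    def _handle_mem_read(self, a):
        self.seen.append("mem_read")

    def dispatch(self, a):
        handler_name = "_handle_%s_%s" % (a.type, a.action)
        if hasattr(self, handler_name):
            getattr(self, handler_name)(a)  # resolved at runtime by name
        else:
            self.seen.append("skipped")     # unsupported action type

t = Tracker()
for a in [Action("reg", "write"), Action("mem", "read"), Action("tmp", "write")]:
    t.dispatch(a)
print(t.seen)  # ['reg_write', 'mem_read', 'skipped']
```

The upside of this style is that supporting a new action kind only requires adding a method with the right name; the downside is that a typo in a handler name silently turns into "unsupported action" instead of an error.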

As the angr authors admit, the DDG generation really is crude. The overall flow is a top-down traversal of the CFG: for each block, iterate over its successors, track the operations recorded along the way, and update the definitions; whenever a definition changes, add an edge to the data dependence graph. It is fast, but the granularity is very coarse.
