LLVM-TransformUtils-Mem2Reg

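  It is well known that the LLVM IR produced by clang's codegen is not yet in strict SSA form. At that stage local variables appear as alloca instructions and are read and written through load and store, so a local variable can have several defs (several store instructions), whereas SSA requires exactly one def per variable. LLVM then applies the standard SSA construction algorithm to convert the original IR into minimal SSA and ultimately into pruned SSA. All of this is implemented in the mem2reg pass.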

Preprocessing

  Mem2Reg.cpp is only the entry point; the main functionality is implemented in PromoteMemoryToRegister.cpp.
  The main code in Mem2Reg.cpp:

    // Find allocas that are safe to promote, by looking at all instructions in
    // the entry node
    for (BasicBlock::iterator I = BB.begin(), E = --BB.end(); I != E; ++I)
      if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) // Is it an alloca?
        // If it can be promoted, add it to the list
        if (isAllocaPromotable(AI))
          Allocas.push_back(AI);

    if (Allocas.empty())
      break;
    // Promote every promotable alloca
    PromoteMemToReg(Allocas, DT, &AC);

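  Here isAllocaPromotable decides whether an alloca can be promoted. When can a local variable not be promoted to a register? First, when it is accessed by volatile loads or stores: such a variable obviously must keep its memory slot. Second, when its address is taken, that is, when the alloca pointer is used as a value (for example stored somewhere else or fed into pointer arithmetic) instead of only being loaded from and stored to. There are further conditions on bitcast and gep users; gep is how arrays and structs are accessed, and those are too hard for this SSA construction to handle.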
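  To make these rules concrete, here is a minimal sketch in plain C++ that does not use LLVM at all. The names UseKind, SlotUse and isSlotPromotable are made up for illustration, and the model only covers the volatile and address-taken cases, not the bitcast/gep conditions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model of the users of one stack slot (not LLVM types).
enum class UseKind {
  Load,           // a value is read from the slot
  StoreInto,      // a value is written into the slot
  AddressEscapes  // the slot's address itself is used as a value
};

struct SlotUse {
  UseKind Kind;
  bool IsVolatile;
};

// The slot can live in a register only if every user is a non-volatile
// load from it or a non-volatile store into it.
bool isSlotPromotable(const std::vector<SlotUse> &Uses) {
  for (const SlotUse &U : Uses) {
    if (U.IsVolatile)
      return false; // volatile accesses must really touch memory
    if (U.Kind == UseKind::AddressEscapes)
      return false; // someone else may read or write through the pointer
  }
  return true;
}

// Small usage example for the toy predicate.
void promotableExamples() {
  std::vector<SlotUse> Plain = {{UseKind::StoreInto, false},
                                {UseKind::Load, false}};
  std::vector<SlotUse> Volatile = {{UseKind::Load, true}};
  std::vector<SlotUse> Escaping = {{UseKind::StoreInto, false},
                                   {UseKind::AddressEscapes, false}};
  assert(isSlotPromotable(Plain));
  assert(!isSlotPromotable(Volatile));
  assert(!isSlotPromotable(Escaping));
}
```

  A slot with no users at all is also reported as promotable here; in the real pass such an alloca is simply deleted as dead.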
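  Once all promotable allocas have been collected, PromoteMemToReg promotes them.
  PromoteMem2Reg runs the standard SSA construction algorithm. I will not go through the theory here; if you are interested, read the paper "Efficiently computing static single assignment form and the control dependence graph". The process has the following steps: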

  1. Placing PHI nodes.
  2. Rename.
  3. Instruction simplification.

Placing PHI nodes

  First check whether the alloca is dead code. In SSA this is easy: just see whether its user list is empty:

		if (AI->use_empty()) {
			// If there are no uses of the alloca, just delete it now.
			AI->eraseFromParent();

			// Remove the alloca from the Allocas list, since it has been processed
			RemoveFromAllocasList(AllocaNum);
			++NumDeadAlloca;
			continue;
		}

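  LLVM then uses a helper class to compute some properties of the alloca, because certain cases allow shortcuts that skip the full algorithm. Papers and textbooks rarely cover this kind of engineering experience, so reading the source is worthwhile.
  Here AllocaInfo records all DefiningBlocks and UsingBlocks, as well as whether there is only a single store (def) and whether the alloca is only used in one block. Below we see how this information drives the shortcuts: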

		// Analyze some properties of the alloca
		Info.AnalyzeAlloca(AI);

		// If there is only one store, replace the users of every load it dominates with the stored value
		if (Info.DefiningBlocks.size() == 1) {
			if (rewriteSingleStoreAlloca(AI, Info, LBI, SQ.DL, DT, AC)) {
				// The alloca has been processed, move on.
				RemoveFromAllocasList(AllocaNum);
				++NumSingleStore;
				continue;
			}
		}

		// If the alloca is only used within a single block, replace each load with the operand of the nearest preceding store
		if (Info.OnlyUsedInOneBlock &&
			promoteSingleBlockAlloca(AI, Info, LBI, SQ.DL, DT, AC)) {
			// The alloca has been processed, move on.
			RemoveFromAllocasList(AllocaNum);
			continue;
		}

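  Case one: there is only one store. If every load is dominated by that store, the code already meets the strict-SSA requirement, so it suffices to find all loads dominated by the store and replace their uses with the store's operand. There is one more situation: a load that executes before the only store and therefore reads an undefined value. We cannot claim that such a program is wrong; the LLVM comments give an example: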

///  for (...) {
///    int t = *A;
///    if (!first_iteration)
///      use(t);
///    *A = 42;
///  }

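  Here the first use of A (int t = *A) reads an undefined value, yet the program is correct. So a load like this, which the store does not dominate, still has to go through the normal flow.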

static bool rewriteSingleStoreAlloca(AllocaInst* AI, AllocaInfo& Info,
	LargeBlockInfo& LBI, const DataLayout& DL,
	DominatorTree& DT, AssumptionCache* AC) {
	StoreInst* OnlyStore = Info.OnlyStore;
	// Is the stored value a non-instruction (e.g. a constant, argument or global)? Such a value dominates every load
	bool StoringGlobalVal = !isa<Instruction>(OnlyStore->getOperand(0));
	BasicBlock* StoreBB = OnlyStore->getParent();
	int StoreIndex = -1;

	// Clear out UsingBlocks.  We will reconstruct it here if needed.
	Info.UsingBlocks.clear();

	for (auto UI = AI->user_begin(), E = AI->user_end(); UI != E;) {
		Instruction* UserInst = cast<Instruction>(*UI++);
		// An alloca collected here should only have loads and stores as users.
		// Skip the store; we only check whether each load is dominated by the store
		if (!isa<LoadInst>(UserInst)) {
			assert(UserInst == OnlyStore && "Should only have load/stores");
			continue;
		}
		LoadInst* LI = cast<LoadInst>(UserInst);

		// Okay, if we have a load from the alloca, we want to replace it with the
		// only value stored to the alloca.  We can do this if the value is
		// dominated by the store.  If not, we use the rest of the mem2reg machinery
		// to insert the phi nodes as needed.
		if (!StoringGlobalVal) { // Non-instructions are always dominated.
			// The load and the store are in the same block, so their order matters.
			// If the load comes before the store, leave this load alone
			if (LI->getParent() == StoreBB) {

				if (StoreIndex == -1)
					StoreIndex = LBI.getInstructionIndex(OnlyStore);

				if (unsigned(StoreIndex) > LBI.getInstructionIndex(LI)) {
					// The load appears before the store; it cannot be handled here
					Info.UsingBlocks.push_back(StoreBB);
					continue;
				}
			}
			// Different blocks: if the store does not dominate this load, give up on it as well
			else if (LI->getParent() != StoreBB &&
				!DT.dominates(StoreBB, LI->getParent())) {

				Info.UsingBlocks.push_back(LI->getParent());
				continue;
			}
		}

		// Otherwise, we *can* safely rewrite this load.
		Value* ReplVal = OnlyStore->getOperand(0);
		
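		// The store dominates the load, yet the stored value is the load itself, so some path that
		// does not pass through the function entry reaches the load before the store. A function is
		// only ever entered at its entry, so this code is unreachable and undef is used.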
		if (ReplVal == LI)
			ReplVal = UndefValue::get(LI->getType());

		// Replace all users of the load
		LI->replaceAllUsesWith(ReplVal);
		LI->eraseFromParent();
		LBI.deleteValue(LI);
	}

	// If some loads could not be rewritten, fall back to the general algorithm
	if (!Info.UsingBlocks.empty())
		return false; // If not, we'll have to fall back for the remainder.

	// No loads remain, so delete both the store and the alloca
	Info.OnlyStore->eraseFromParent();
	LBI.deleteValue(Info.OnlyStore);

	AI->eraseFromParent();
	LBI.deleteValue(AI);
	return true;
}

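  Case two: all loads and stores are in the same block. The comments note that many allocas are only used within one block, so it is enough to walk the loads in order and replace the users of each with the operand of the nearest store before it. A load with no store before it is handled much as in case one: the function gives up and falls back to the general flow, unless the block contains no store at all, in which case the load becomes undef: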

static bool promoteSingleBlockAlloca(AllocaInst * AI, const AllocaInfo & Info,
	LargeBlockInfo & LBI,
	const DataLayout & DL,
	DominatorTree & DT,
	AssumptionCache * AC) {

	using StoresByIndexTy = SmallVector<std::pair<unsigned, StoreInst*>, 64>;
	StoresByIndexTy StoresByIndex;

	// Record the position of every store
	for (User* U : AI->users())
		if (StoreInst * SI = dyn_cast<StoreInst>(U))
			StoresByIndex.push_back(std::make_pair(LBI.getInstructionIndex(SI), SI));

	// Sort the stores by index so the nearest store can be found by binary search later
	llvm::sort(StoresByIndex, less_first());

	// Walk all of the loads from this alloca, replacing them with the nearest
	// store above them, if any.
	for (auto UI = AI->user_begin(), E = AI->user_end(); UI != E;) {
		LoadInst* LI = dyn_cast<LoadInst>(*UI++);
		if (!LI)
			continue;

		unsigned LoadIdx = LBI.getInstructionIndex(LI);

		// Find the nearest store that has a lower index than this load.
		// (the store closest to the load)
		StoresByIndexTy::iterator I =
			std::lower_bound(StoresByIndex.begin(), StoresByIndex.end(),
				std::make_pair(LoadIdx,
					static_cast<StoreInst*>(nullptr)),
				less_first());
		if (I == StoresByIndex.begin()) {
			// No store precedes this load
			if (StoresByIndex.empty())
				// If there are no stores, the load takes the undef value.
				LI->replaceAllUsesWith(UndefValue::get(LI->getType()));
			// Otherwise stores exist, but only after the load; give up and fall back
			else
				return false;
		}
		else {
			// Otherwise, there was a store before this load, the load takes its value.
			// Note, if the load was marked as nonnull we don't want to lose that
			// information when we erase it. So we preserve it with an assume.
			Value* ReplVal = std::prev(I)->second->getOperand(0);
			// Same reasoning as in rewriteSingleStoreAlloca
			if (ReplVal == LI)
				ReplVal = UndefValue::get(LI->getType());

			LI->replaceAllUsesWith(ReplVal);
		}

		LI->eraseFromParent();
		LBI.deleteValue(LI);
	}

	// Remove the (now dead) stores and alloca.
	while (!AI->use_empty()) {
		StoreInst* SI = cast<StoreInst>(AI->user_back());
		SI->eraseFromParent();
		LBI.deleteValue(SI);
	}

	AI->eraseFromParent();
	LBI.deleteValue(AI);
	++NumLocalPromoted;
	return true;
}
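  The "nearest preceding store" lookup can be tried out in isolation. The following is a small sketch in plain C++ that assumes a toy representation: IndexedStore and valueSeenByLoad are hypothetical names, values are plain strings, and "undef" stands in for UndefValue. It is not the LLVM implementation, only the same sort-then-binary-search idea.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// (instruction index within the block, name of the stored value).
using IndexedStore = std::pair<unsigned, std::string>;

// Compare stores by instruction index only.
static bool byIndex(const IndexedStore &A, const IndexedStore &B) {
  return A.first < B.first;
}

// Computes the value observed by a load at LoadIdx.
// Returns false when stores exist but none precedes the load, which is the
// case where the single-block shortcut has to give up.
bool valueSeenByLoad(const std::vector<IndexedStore> &SortedStores,
                     unsigned LoadIdx, std::string &Out) {
  assert(std::is_sorted(SortedStores.begin(), SortedStores.end(), byIndex) &&
         "stores must be sorted by instruction index");

  // First store whose index is not smaller than the load's index.
  auto It = std::lower_bound(
      SortedStores.begin(), SortedStores.end(), LoadIdx,
      [](const IndexedStore &S, unsigned Idx) { return S.first < Idx; });

  if (It == SortedStores.begin()) {
    if (!SortedStores.empty())
      return false;  // a store exists, but only after this load
    Out = "undef";   // no store at all: the load reads undef
    return true;
  }

  // The store just before It is the nearest one above the load.
  Out = std::prev(It)->second;
  return true;
}
```

  With stores at indices 1 and 4, a load at index 3 observes the first stored value and a load at index 5 observes the second, while a load at index 0 makes the function return false.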

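  With the shortcuts out of the way, the normal flow begins. The normal way to build strict SSA is to find the dominance frontiers of all defs, insert PHI nodes there, and then rename the variables; LLVM does the same.
  There is one more pruning step: finding the blocks where the local variable is live-in. Only these live-in blocks are real insertion candidates; if the variable is not live at the block entry, there is no need to insert a PHI node. Consider a dominance-frontier block such as: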

BB1:
	****
	store %t, %var1
	****
	%2 = load %var1
	****

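  Here BB1 is a dominance frontier, but var1 is clearly not live at its entry, so a PHI inserted there would be dead code. If it is not pruned here, a later DCE pass has to spend time deleting it.
  Because the code is not yet strict SSA at this point, the usual data-flow liveness equations are sufficient. LLVM's version is even simpler, since it handles a single variable and only computes live-in.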

void PromoteMem2Reg::ComputeLiveInBlocks(
	AllocaInst * AI, AllocaInfo & Info,
	const SmallPtrSetImpl<BasicBlock*> & DefBlocks,
	SmallPtrSetImpl<BasicBlock*> & LiveInBlocks) {
	// Worklist
	SmallVector<BasicBlock*, 64> LiveInBlockWorklist(Info.UsingBlocks.begin(),
		Info.UsingBlocks.end());

	// For use blocks that also contain a def, check whether the def comes before the use; if so, the variable is not live-in there
	for (unsigned i = 0, e = LiveInBlockWorklist.size(); i != e; ++i) {
		BasicBlock* BB = LiveInBlockWorklist[i];
		if (!DefBlocks.count(BB))
			continue;

		// BB contains both a use and a def here; scan the instructions to see whether a load or a store comes first
		for (BasicBlock::iterator I = BB->begin();; ++I) {
			if (StoreInst * SI = dyn_cast<StoreInst>(I)) {
				if (SI->getOperand(1) != AI)
					continue;

				LiveInBlockWorklist[i] = LiveInBlockWorklist.back();
				LiveInBlockWorklist.pop_back();
				--i;
				--e;
				break;
			}

			if (LoadInst * LI = dyn_cast<LoadInst>(I)) {
				if (LI->getOperand(0) != AI)
					continue;

				// A load was found before any store, so the variable is live-in
				break;
			}
		}
	}

	// Iteratively add the predecessors of live-in blocks to the live-in set
	while (!LiveInBlockWorklist.empty()) {
		BasicBlock* BB = LiveInBlockWorklist.pop_back_val();

		// Do not process a block twice
		if (!LiveInBlocks.insert(BB).second)
			continue;

		// If a predecessor defines the value, there is no need to continue upwards through it
		for (BasicBlock* P : predecessors(BB)) {
			// The value is not live into a predecessor if it defines the value.
			if (DefBlocks.count(P))
				continue;

			// Otherwise it is, add to the worklist.
			LiveInBlockWorklist.push_back(P);
		}
	}
}
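  The backward propagation in the second half of this function is easy to reproduce on a toy CFG. The sketch below is plain C++ with made-up names (Block, PredMap, computeLiveInBlocks); blocks are integers, and the list of blocks that read the variable before writing it is assumed to be given, which corresponds to the worklist after the first loop above has filtered it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Block = int;
using PredMap = std::map<Block, std::vector<Block>>;

// UseBeforeDefBlocks: blocks that read the variable before any write in the
// same block. DefBlocks: blocks that write the variable.
std::set<Block> computeLiveInBlocks(const PredMap &Preds,
                                    const std::set<Block> &DefBlocks,
                                    const std::vector<Block> &UseBeforeDefBlocks) {
  std::vector<Block> Worklist(UseBeforeDefBlocks.begin(),
                              UseBeforeDefBlocks.end());
  std::set<Block> LiveIn;

  while (!Worklist.empty()) {
    Block BB = Worklist.back();
    Worklist.pop_back();

    // Each block is processed once.
    if (!LiveIn.insert(BB).second)
      continue;

    auto It = Preds.find(BB);
    if (It == Preds.end())
      continue;

    for (Block P : It->second) {
      // A predecessor that defines the variable kills liveness above it.
      if (DefBlocks.count(P))
        continue;
      Worklist.push_back(P);
    }
  }
  return LiveIn;
}

// Usage example: 0 -> 1, 1 -> {2, 3}, {2, 3} -> 4, def in 2, use in 4.
void liveInExample() {
  PredMap Preds;
  Preds[1] = {0};
  Preds[2] = {1};
  Preds[3] = {1};
  Preds[4] = {2, 3};
  std::set<Block> Defs = {2};
  std::vector<Block> Uses = {4};
  std::set<Block> Expected = {0, 1, 3, 4};
  assert(computeLiveInBlocks(Preds, Defs, Uses) == Expected);
}
```

  Block 2 is missing from the result because it defines the variable itself, so nothing needs to be live on entry to it.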

  Next, compute the dominance frontiers and insert PHIs at the entry of the live-in blocks:

		// Copy the def blocks.
		SmallPtrSet<BasicBlock*, 32> DefBlocks;
		DefBlocks.insert(Info.DefiningBlocks.begin(), Info.DefiningBlocks.end());

		// Compute the set of blocks where the local variable is live-in
		SmallPtrSet<BasicBlock*, 32> LiveInBlocks;
		ComputeLiveInBlocks(AI, Info, DefBlocks, LiveInBlocks);

		// Compute the IDF and insert the PHI nodes
		IDF.setLiveInBlocks(LiveInBlocks);
		IDF.setDefiningBlocks(DefBlocks);
		SmallVector<BasicBlock*, 32> PHIBlocks;
		IDF.calculate(PHIBlocks);
		if (PHIBlocks.size() > 1)
			llvm::sort(PHIBlocks, [this](BasicBlock * A, BasicBlock * B) {
			return BBNumbers.lookup(A) < BBNumbers.lookup(B);
		});

		unsigned CurrentVersion = 0;
		for (BasicBlock* BB : PHIBlocks)
			// This is where the PHI node is actually inserted and the PHI-to-alloca mapping is recorded.
			QueuePhiNode(BB, AllocaNum, CurrentVersion);

  At this point PHI placement is done, but the PHI nodes have no IncomingValues yet, and the users of the loads have not been redirected to the PHI nodes.
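  As a rough illustration of what the IDF step computes, here is a worklist sketch in plain C++. It assumes the dominance frontier of every block has already been computed and is passed in as a map, which is a simplification for the example and not how the listing above obtains it; computePhiBlocks and the integer Block type are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Block = int;
using FrontierMap = std::map<Block, std::set<Block>>;

// Iterated dominance frontier of DefBlocks, restricted to blocks where the
// variable is live-in (the pruning described in the text).
std::set<Block> computePhiBlocks(const FrontierMap &DF,
                                 const std::set<Block> &DefBlocks,
                                 const std::set<Block> &LiveIn) {
  std::set<Block> PhiBlocks;
  std::set<Block> Enqueued(DefBlocks);
  std::vector<Block> Worklist(DefBlocks.begin(), DefBlocks.end());

  while (!Worklist.empty()) {
    Block B = Worklist.back();
    Worklist.pop_back();

    auto It = DF.find(B);
    if (It == DF.end())
      continue;

    for (Block F : It->second) {
      if (!LiveIn.count(F))
        continue; // a PHI here would be dead, so skip it entirely
      if (!PhiBlocks.insert(F).second)
        continue; // this block already has a PHI
      // A PHI is itself a def, so its frontier must be visited too.
      if (Enqueued.insert(F).second)
        Worklist.push_back(F);
    }
  }
  return PhiBlocks;
}

// Usage example: diamond 0 -> {1, 2} -> 3 with defs in 1 and 2.
void phiPlacementExample() {
  FrontierMap DF;
  DF[1] = {3};
  DF[2] = {3};
  std::set<Block> Defs = {1, 2};
  std::set<Block> LiveAtJoin = {3};
  std::set<Block> NotLive;
  assert(computePhiBlocks(DF, Defs, LiveAtJoin) == std::set<Block>({3}));
  assert(computePhiBlocks(DF, Defs, NotLive).empty());
}
```

  The second assertion shows the effect of pruning: the join block is in the dominance frontier of both defs, but gets no PHI when the variable is not live on entry to it.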
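  Note that when LLVM places PHI nodes, both the liveness computation and the IDF computation are done per variable (as of LLVM 8.0), so many steps are repeated and performance may suffer. Readers who are interested could look into implementing batched PHI placement in LLVM; see the paper "A practical and fast iterative algorithm for φ-function computation using DJ graphs" and the Chinese paper 《基于支配边界逆转的多变量Φ函数摆放算法》.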

Rename

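  Now for the next stage, rename. By the definition of SSA, every value has an implicit initial def at the start of the function whose value is undefined, so when rename sets up the per-variable state it also needs an undef initial value, which avoids problems with use before definition: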

	RenamePassData::ValVector Values(Allocas.size());
	for (unsigned i = 0, e = Allocas.size(); i != e; ++i)
		Values[i] = UndefValue::get(Allocas[i]->getAllocatedType());

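  The rename process does a DFS over the CFG. While traversing, it keeps collecting the value each variable has at the exit of a block, and uses that outgoing value to replace the users of load instructions in successor blocks or to fill in a PHI's IncomingVal. Unlike the textbook algorithm, LLVM does not keep a version stack per variable; instead it carries an array holding the current value of every alloca at the block exit, which successors use to update the local variables. Each store updates the outgoing value, so this implicitly implements the version stack.
  Start with a single basic block. Within a block, the PHI nodes are handled first, followed by the loads and stores.
  For the PHI nodes: the PHIs placed earlier have empty IncomingVals, so they now have to be filled in from the outgoing values of the predecessor: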

	// The block starts with a PHI node
	if (PHINode * APN = dyn_cast<PHINode>(BB->begin())) {
		// This PHI is one that needs to be updated
		if (PhiToAllocaMap.count(APN)) {
			// We want to be able to distinguish between PHI nodes being inserted by
			// this invocation of mem2reg from those phi nodes that already existed in
			// the IR before mem2reg was run.  We determine that APN is being inserted
			// because it is missing incoming edges.  All other PHI nodes being
			// inserted by this pass of mem2reg will have the same number of incoming
			// operands so far.  Remember this count.
			unsigned NewPHINumOperands = APN->getNumOperands();

			unsigned NumEdges = std::count(succ_begin(Pred), succ_end(Pred), BB);
			// Update the PHI nodes
			BasicBlock::iterator PNI = BB->begin();
			do {
				// Get the alloca number this PHI corresponds to
				unsigned AllocaNo = PhiToAllocaMap[APN];

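				// Set the incoming value for every edge from Pred to the current block; the value is
				// whatever the alloca holds at the exit of Pred. Two blocks can be connected by
				// several edges, e.g. a switch whose cases share a target.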
				for (unsigned i = 0; i != NumEdges; ++i)
					APN->addIncoming(IncomingVals[AllocaNo], Pred);

				// The alloca's outgoing value for the current block is now this PHI.
				// This is provisional: a later store will update it again
				IncomingVals[AllocaNo] = APN;

				// Move on to the next PHI
				++PNI;
				APN = dyn_cast<PHINode>(PNI);
				if (!APN)
					break;

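				// Per the NewPHINumOperands comment above, all PHIs inserted by this run are updated in
				// lockstep and have the same operand count; a different count means mem2reg did not place the PHI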
			} while (APN->getNumOperands() == NewPHINumOperands);
		}
	}

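  Because the traversal is a DFS, a block may be reached more than once, for example when the CFG has a loop. A PHI in a loop header receives some of its values from the loop body, so the IncomingVals are filled in on every arrival. Only after that does the code check whether the block has already been processed, and if so it returns immediately.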

	// Don't revisit blocks.
	if (!Visited.insert(BB).second)
		return;

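  Then it walks all instructions in the block. A load is replaced by the variable's current value (the PHI at the top of the block, or the operand of a store seen since); a store updates the variable's current version: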

	for (BasicBlock::iterator II = BB->begin(); !II->isTerminator();) {
		Instruction* I = &*II++; // get the instruction, increment iterator

		if (LoadInst * LI = dyn_cast<LoadInst>(I)) {
			AllocaInst* Src = dyn_cast<AllocaInst>(LI->getPointerOperand());
			if (!Src)
				continue;

			DenseMap<AllocaInst*, unsigned>::iterator AI = AllocaLookup.find(Src);
			if (AI == AllocaLookup.end())
				continue;

			// Having found the alloca for this load, get the alloca's current value
			Value * V = IncomingVals[AI->second];

			// Replace all users with that value
			LI->replaceAllUsesWith(V);
			BB->getInstList().erase(LI);
		}
		else if (StoreInst * SI = dyn_cast<StoreInst>(I)) {
			// Delete this instruction and mark the name as the current holder of the
			// value
			AllocaInst* Dest = dyn_cast<AllocaInst>(SI->getPointerOperand());
			if (!Dest)
				continue;

			DenseMap<AllocaInst*, unsigned>::iterator ai = AllocaLookup.find(Dest);
			if (ai == AllocaLookup.end())
				continue;

			// Having found the alloca for this store, use the stored operand as the alloca's new outgoing value
			unsigned AllocaNo = ai->second;
			IncomingVals[AllocaNo] = SI->getOperand(0);

			BB->getInstList().erase(SI);
		}
	}

  That completes the handling of a single block. The whole process is wrapped in a DFS:

/// Recursively traverse the CFG of the function, renaming loads and
/// stores to the allocas which we are promoting.
///
/// IncomingVals indicates what value each Alloca contains on exit from the
/// predecessor block Pred.
void PromoteMem2Reg::RenamePass(BasicBlock * BB, BasicBlock * Pred,
	RenamePassData::ValVector & IncomingVals,
	RenamePassData::LocationVector & IncomingLocs,
	std::vector<RenamePassData> & Worklist) {
NextIteration:

	// Handle PHIs (code shown earlier)
	// Handle loads (code shown earlier)
	// Handle stores (code shown earlier)

	// No successors left: return
	succ_iterator I = succ_begin(BB), E = succ_end(BB);
	if (I == E)
		return;

	// Keep track of the successors so we don't visit the same successor twice
	SmallPtrSet<BasicBlock*, 8> VisitedSuccs;

	// The first successor is not queued; it is handled directly in the next iteration, matching DFS order
	VisitedSuccs.insert(*I);
	Pred = BB;
	BB = *I;
	++I;

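	// The remaining successors are pushed onto the worklist together with the current block's outgoing values.
	// This neatly saves memory for the outgoing values, because IncomingVals itself is reused for the first successor.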
	for (; I != E; ++I)
		if (VisitedSuccs.insert(*I).second)
			Worklist.emplace_back(*I, Pred, IncomingVals, IncomingLocs);

	goto NextIteration;
}
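  To see the value-carrying DFS in action without LLVM, here is a deliberately small sketch for a single variable. Everything in it is a toy assumption: ToyOp, ToyBlock and renameToy are invented names, values are strings, a PHI in block N is called "phiN", and each pair of blocks is assumed to be connected by at most one edge. It copies the current value into every worklist item instead of reusing one array as the real pass does.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct ToyOp {
  bool IsStore;     // true: store of Val; false: load, Val receives the result
  std::string Val;
};

struct ToyBlock {
  bool HasPhi = false;  // PHI placement has already decided this
  std::vector<std::pair<int, std::string>> PhiIncoming;  // (pred, value)
  std::vector<ToyOp> Ops;
  std::vector<int> Succs;
};

// Renames one variable over the CFG, starting at block 0 with value "undef".
void renameToy(std::vector<ToyBlock> &Blocks) {
  assert(!Blocks.empty() && "block 0 is the entry block");

  struct Item { int BB; int Pred; std::string Val; };
  std::vector<Item> Worklist;
  Worklist.push_back({0, -1, "undef"});
  std::vector<bool> Visited(Blocks.size(), false);

  while (!Worklist.empty()) {
    Item It = Worklist.back();
    Worklist.pop_back();
    ToyBlock &B = Blocks[It.BB];
    std::string Cur = It.Val;

    // PHIs are filled on every arrival, even if the block was seen before.
    if (B.HasPhi) {
      B.PhiIncoming.push_back(std::make_pair(It.Pred, Cur));
      Cur = "phi" + std::to_string(It.BB);
    }

    // The body of a block is only rewritten once.
    if (Visited[It.BB])
      continue;
    Visited[It.BB] = true;

    for (ToyOp &Op : B.Ops) {
      if (Op.IsStore)
        Cur = Op.Val;   // a store defines a new version
      else
        Op.Val = Cur;   // a load observes the current version
    }

    // Hand the outgoing value to every successor.
    for (int S : B.Succs)
      Worklist.push_back({S, It.BB, Cur});
  }
}
```

  On a diamond where block 0 stores a, block 1 stores b, block 2 stores nothing and block 3 has a PHI followed by a load, the load ends up reading phi3, and phi3 receives b from block 1 and a from block 2.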

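  With that, the rename stage, including the PHIs, is complete.
  This rename algorithm is essentially a refined version of the standard stack-based rename: it traverses the CFG and tracks the value of every variable along the way. The SSA book describes a different, slot-based rename algorithm that traverses the DT and only maintains how values flow through the DT. I implemented it myself and found it simpler and more efficient than the algorithm in LLVM; that is a topic for a later post.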

Instruction simplification

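  We are not done yet. The generated PHIs may still be unnecessary, for example when every IncomingVal is undef or when all IncomingVals are the same value, so one more cleanup is needed. It is very simple, and you can read the source for the details: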

	// Loop over all of the PHI nodes and see if there are any that we can get
	// rid of because they merge all of the same incoming values.  This can
	// happen due to undef values coming into the PHI nodes.  This process is
	// iterative, because eliminating one PHI node can cause others to be removed.
	bool EliminatedAPHI = true;
	while (EliminatedAPHI) {
		EliminatedAPHI = false;

		// Iterating over NewPhiNodes is deterministic, so it is safe to try to
		// simplify and RAUW them as we go.  If it was not, we could add uses to
		// the values we replace with in a non-deterministic order, thus creating
		// non-deterministic def->use chains.
		for (DenseMap<std::pair<unsigned, unsigned>, PHINode*>::iterator
			I = NewPhiNodes.begin(),
			E = NewPhiNodes.end();
			I != E;) {
			PHINode* PN = I->second;

			// If this PHI node merges one value and/or undefs, get the value.
			if (Value * V = SimplifyInstruction(PN, SQ)) {
				PN->replaceAllUsesWith(V);
				PN->eraseFromParent();
				NewPhiNodes.erase(I++);
				EliminatedAPHI = true;
				continue;
			}
			++I;
		}
	}
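  The kind of PHI that this loop removes can be described by a tiny rule. The sketch below is a toy in plain C++ with string values; simplifyToyPhi is a hypothetical helper written for this post, not the SimplifyInstruction API, and it ignores the dominance checks that a real implementation has to make before replacing a PHI that mixes a value with undef.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Tries to collapse a PHI named PhiName with the given incoming values.
// Returns true and sets Out when the PHI merges at most one distinct value
// besides "undef" and references to itself; returns false otherwise.
bool simplifyToyPhi(const std::string &PhiName,
                    const std::vector<std::string> &Incoming,
                    std::string &Out) {
  assert(!Incoming.empty() && "a PHI needs at least one incoming value");

  std::string Common;
  bool HasCommon = false;

  for (const std::string &V : Incoming) {
    if (V == PhiName || V == "undef")
      continue;       // self-references and undef do not block merging
    if (HasCommon && V != Common)
      return false;   // two different real values: the PHI is needed
    Common = V;
    HasCommon = true;
  }

  // Only undef (or only itself) flows in: the PHI is undef as well.
  Out = HasCommon ? Common : "undef";
  return true;
}
```

  Replacing one PHI can make another PHI that used it collapsible in turn, which is why the loop above repeats until nothing changes.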

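  That is the whole mem2reg process, and the result is pruned SSA. The code may well still contain many alloca instructions (for example variables marked volatile, or allocas with bitcast users). This does not break later optimizations, since LLVM can work under less strict conditions, but it inevitably reduces the optimization opportunities.
  If the source program only has scalars, it is basically pruned SSA after mem2reg. Arrays and structs, however, are not handled by mem2reg, so such a program is still not strict SSA; there are SSA variants designed for array-like accesses (Array SSA in the SSA book). In practice this does not affect SSA-based optimizations, because LLVM treats an operation on an array or struct as just another instruction.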
