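Preface
While troubleshooting performance problems on our company's Hadoop cluster recently, I found that the whole cluster had become very slow: jobs that normally finish in a few tens of minutes suddenly took an hour or more. At first I suspected the network, and that did turn out to be part of the cause. A few days later the problem came back, and this time it was harder to pin down. After analyzing the HDFS request logs and the Ganglia monitoring metrics, I saw that the NameNode's backlog of queued requests stayed high, which meant the NameNode was processing requests abnormally slowly. Further analysis traced this to slow editlog writes to the JournalNodes. The root cause was that the editlog directory had not been created on one JournalNode, so every editlog write to that node threw FileNotFoundException. The lesson is to take the minor roles seriously, JournalNode included. While investigating, I also studied the JournalNode-related code in HDFS. Below are some notes from that study; some of the analysis may be wrong, so please bear with me.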
JournalNode
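Some readers may have heard of Hadoop's DataNode and NameNode but not of JournalNode. That is because the concept is new in Hadoop 2.x (the generation that introduced YARN/MR2), where it belongs to HDFS high availability. A JournalNode stores the EditLog. In Hadoop 1.x the editlog was stored alongside the fsimage and the SecondaryNameNode merged them periodically; an HA deployment of Hadoop 2.x no longer uses a SecondaryNameNode for this, because the Standby NameNode takes over the checkpointing. Below is the current HA architecture diagram; pay particular attention to the role of the JournalNode.
In the diagram above, the green area between the Active NameNode and the Standby NameNode is the JournalNode tier. There is usually more than one JournalNode, and together they play a role similar to an NFS shared file system: the Active NameNode writes editlog data to them, and the Standby NameNode reads that data back to stay in sync.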
QJM
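Next, let's look at the JournalNode mechanism from the source code. Because the configuration can define several JournalNodes, something has to coordinate them. That coordinator is QJM, short for QuorumJournalManager. Its field definitions are: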
    @InterfaceAudience.Private
    public class QuorumJournalManager implements JournalManager {
      static final Log LOG = LogFactory.getLog(QuorumJournalManager.class);

      private final int startSegmentTimeoutMs;
      private final int prepareRecoveryTimeoutMs;
      private final int acceptRecoveryTimeoutMs;
      private final int finalizeSegmentTimeoutMs;
      private final int selectInputStreamsTimeoutMs;
      private final int getJournalStateTimeoutMs;
      private final int newEpochTimeoutMs;
      private final int writeTxnsTimeoutMs;

      private static final int FORMAT_TIMEOUT_MS = 60000;
      private static final int HASDATA_TIMEOUT_MS = 60000;
      private static final int CAN_ROLL_BACK_TIMEOUT_MS = 60000;
      private static final int FINALIZE_TIMEOUT_MS = 60000;
      private static final int PRE_UPGRADE_TIMEOUT_MS = 60000;
      private static final int ROLL_BACK_TIMEOUT_MS = 60000;
      private static final int UPGRADE_TIMEOUT_MS = 60000;
      private static final int GET_JOURNAL_CTIME_TIMEOUT_MS = 60000;
      private static final int DISCARD_SEGMENTS_TIMEOUT_MS = 60000;

      private final Configuration conf;
      private final URI uri;
      private final NamespaceInfo nsInfo;
      private boolean isActiveWriter;

      private final AsyncLoggerSet loggers;

      private int outputBufferCapacity = 512 * 1024;
      private final URLConnectionFactory connectionFactory;
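The class defines many operation timeouts because these operations all go over RPC. The client proxies for all JournalNodes are held in the AsyncLoggerSet object, which contains a list of AsyncLogger objects; each logger manages one JournalNode. QJM creates the logger objects from the configuration as follows: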
    static List<AsyncLogger> createLoggers(Configuration conf,
        URI uri, NamespaceInfo nsInfo, AsyncLogger.Factory factory)
        throws IOException {
      List<AsyncLogger> ret = Lists.newArrayList();
      // One address per JournalNode listed in the qjournal:// URI
      List<InetSocketAddress> addrs = getLoggerAddresses(uri);
      String jid = parseJournalId(uri);
      for (InetSocketAddress addr : addrs) {
        ret.add(factory.createLogger(conf, nsInfo, jid, addr));
      }
      return ret;
    }
The list is then wrapped in an AsyncLoggerSet:
    QuorumJournalManager(Configuration conf,
        URI uri, NamespaceInfo nsInfo,
        AsyncLogger.Factory loggerFactory) throws IOException {
      Preconditions.checkArgument(conf != null, "must be configured");

      this.conf = conf;
      this.uri = uri;
      this.nsInfo = nsInfo;
      this.loggers = new AsyncLoggerSet(createLoggers(loggerFactory));
      ...
The AsyncLoggerSet definition is simple: it is a wrapper around the list of logger objects.
    class AsyncLoggerSet {
      static final Log LOG = LogFactory.getLog(AsyncLoggerSet.class);

      private final List<AsyncLogger> loggers;

      private static final long INVALID_EPOCH = -1;
      private long myEpoch = INVALID_EPOCH;

      public AsyncLoggerSet(List<AsyncLogger> loggers) {
        this.loggers = ImmutableList.copyOf(loggers);
      }
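Since AsyncLoggerSet wraps one logger per JournalNode, it is also the natural place to decide how many acknowledgements count as a quorum. The following is a minimal sketch of that arithmetic, assuming a simple strict-majority rule; QuorumMath and its method names are hypothetical and are not taken from the Hadoop source.

```java
// Hypothetical helper (not a Hadoop class): sizing a majority quorum.
class QuorumMath {
    /** Smallest number of JournalNodes that forms a strict majority of n. */
    static int majoritySize(int n) {
        return n / 2 + 1;
    }

    /** How many JournalNodes may fail while a majority can still acknowledge a write. */
    static int toleratedFailures(int n) {
        return n - majoritySize(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 4, 5}) {
            System.out.println(n + " JournalNodes: majority=" + majoritySize(n)
                + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

Under this rule, going from 3 to 4 JournalNodes does not let the cluster survive more failures, which is why odd counts are the usual choice.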
Back to the logger class itself: AsyncLogger is an interface, and the class that does the real work is the channel class below.
    @InterfaceAudience.Private
    public class IPCLoggerChannel implements AsyncLogger {

      private final Configuration conf;

      protected final InetSocketAddress addr;
      private QJournalProtocol proxy;

      private final ListeningExecutorService singleThreadExecutor;

      private final ListeningExecutorService parallelExecutor;
      private long ipcSerial = 0;
      private long epoch = -1;
      private long committedTxId = HdfsConstants.INVALID_TXID;

      private final String journalId;
      private final NamespaceInfo nsInfo;

      private URL httpServerURL;

      private final IPCLoggerChannelMetrics metrics;
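As its name suggests, this class is the channel that connects the NameNode-side caller to the code on the JournalNode that actually executes each operation; note that the channel class does not execute the operation itself. It also defines many useful monitoring variables, and the journal metrics shown in Ganglia are taken from here: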
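      ...
      // Bytes of edits still waiting in the queue for this JournalNode
      private int queuedEditsSizeBytes = 0;

      // Highest txid that this JournalNode has acknowledged
      private long highestAckedTxId = 0;

      // Nanotime of the last successful acknowledgement
      private long lastAckNanos = 0;

      // Nanotime of the last committedTxId update, used to compute lag as time
      private long lastCommitNanos = 0;

      // Upper bound on the bytes that may be pending in the queue
      private final int queueSizeLimitBytes;

      // Set when this logger has missed edits; cleared in startLogSegment
      private boolean outOfSync = false;
      ...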
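To make the relationship between queuedEditsSizeBytes, queueSizeLimitBytes and outOfSync more concrete, here is a simplified model of per-JournalNode back-pressure. It is only a sketch under assumptions: the class BoundedEditQueue and its methods are hypothetical, and marking the node out of sync as soon as the queue overflows is a simplification chosen for illustration, not a description of how IPCLoggerChannel decides this.

```java
// Hypothetical, simplified model of a bounded edit queue for one JournalNode.
// Field names mirror the article; the policy itself is an assumption.
class BoundedEditQueue {
    private final int queueSizeLimitBytes;
    private int queuedEditsSizeBytes = 0;
    private boolean outOfSync = false;

    BoundedEditQueue(int queueSizeLimitBytes) {
        this.queueSizeLimitBytes = queueSizeLimitBytes;
    }

    /** Reserves space for a batch of edits; returns false if the batch is rejected. */
    synchronized boolean tryReserve(int sizeBytes) {
        if (outOfSync) {
            return false; // a lagging node gets no more edits in this segment
        }
        // An empty queue always accepts one batch, even an oversized one.
        if (queuedEditsSizeBytes > 0
                && queuedEditsSizeBytes + sizeBytes > queueSizeLimitBytes) {
            outOfSync = true;
            return false;
        }
        queuedEditsSizeBytes += sizeBytes;
        return true;
    }

    /** Called once a batch has been sent (or dropped) to free its space. */
    synchronized void release(int sizeBytes) {
        queuedEditsSizeBytes -= sizeBytes;
    }

    /** A new segment gives a lagging node the chance to rejoin. */
    synchronized void onStartLogSegment() {
        outOfSync = false;
    }

    synchronized boolean isOutOfSync() {
        return outOfSync;
    }

    synchronized int queuedBytes() {
        return queuedEditsSizeBytes;
    }
}
```

In a model like this, a JournalNode whose disk writes keep failing (as in the incident from the preface) falls behind on its own without blocking the healthy nodes, and a persistently large queuedBytes value is the signal worth alerting on.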
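Because the channel class and the class that really executes the operation follow the same protocol, their method definitions match. A few common methods are listed below.
Starting a log segment for writing: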
    @Override
    public ListenableFuture<Void> startLogSegment(final long txid,
        final int layoutVersion) {
      return singleThreadExecutor.submit(new Callable<Void>() {
        @Override
        public Void call() throws IOException {
          getProxy().startLogSegment(createReqInfo(), txid, layoutVersion);
          synchronized (IPCLoggerChannel.this) {
            // A new segment lets a previously lagging logger rejoin
            if (outOfSync) {
              outOfSync = false;
              QuorumJournalManager.LOG.info(
                  "Restarting previously-stopped writes to " +
                  IPCLoggerChannel.this + " in segment starting at txid " +
                  txid);
            }
          }
          return null;
        }
      });
    }
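The method above hands its work to singleThreadExecutor instead of calling the proxy directly. A single-threaded executor runs tasks in submission order, which is presumably why it suits write operations that must reach a JournalNode in txid order. The sketch below demonstrates that ordering property with plain java.util.concurrent (not Guava's ListeningExecutorService); the class OrderedWrites is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class: tasks on a single-threaded executor run in submission order.
class OrderedWrites {
    /** Submits txids 1..count asynchronously and returns the order in which they ran. */
    static List<Long> run(int count) {
        ExecutorService single = Executors.newSingleThreadExecutor();
        List<Long> applied = Collections.synchronizedList(new ArrayList<Long>());
        List<Future<Boolean>> futures = new ArrayList<>();
        try {
            for (long txid = 1; txid <= count; txid++) {
                final long id = txid;
                Callable<Boolean> write = () -> applied.add(id); // stands in for an RPC
                futures.add(single.submit(write)); // returns immediately
            }
            for (Future<Boolean> f : futures) {
                f.get(); // wait for every "write" to finish
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            single.shutdown();
        }
        return applied;
    }

    public static void main(String[] args) {
        System.out.println(run(5));
    }
}
```

The caller never blocks while submitting, yet the recorded order always matches the submission order; a multi-threaded pool would not give that guarantee.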
After writing, finalizing the log segment:
    @Override
    public ListenableFuture<Void> finalizeLogSegment(
        final long startTxId, final long endTxId) {
      return singleThreadExecutor.submit(new Callable<Void>() {
        @Override
        public Void call() throws IOException {
          throwIfOutOfSync();

          getProxy().finalizeLogSegment(createReqInfo(), startTxId, endTxId);
          return null;
        }
      });
    }
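The single-threaded executor singleThreadExecutor generally runs the write-related operations, while the parallel executor handles reads. All of these operations run asynchronously, which keeps them efficient. When QJM invokes an operation, it immediately gets back a set of pending calls (a QuorumCall) and then waits for the replies: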
    @Override
    public void finalizeLogSegment(long firstTxId, long lastTxId)
        throws IOException {
      QuorumCall<AsyncLogger, Void> q = loggers.finalizeLogSegment(
          firstTxId, lastTxId);
      // Block until a quorum of JournalNodes replies or the timeout expires
      loggers.waitForWriteQuorum(q, finalizeSegmentTimeoutMs,
          String.format("finalizeLogSegment(%s-%s)", firstTxId, lastTxId));
    }
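The pattern above (fire one asynchronous call per JournalNode, then block until enough of them reply) can be sketched with standard CompletableFuture objects. This is an illustration under assumptions, not the Hadoop implementation: the class QuorumWait, its signatures, and the choice to fail fast once a majority is no longer reachable are all hypothetical.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of waiting for a write quorum over asynchronous calls.
class QuorumWait {
    /** Returns once a majority of calls succeeded; throws otherwise. */
    static void waitForWriteQuorum(List<CompletableFuture<Void>> calls, long timeoutMs) {
        final int majority = calls.size() / 2 + 1;
        final int maxFailures = calls.size() - majority;
        final AtomicInteger successes = new AtomicInteger();
        final AtomicInteger failures = new AtomicInteger();
        final CountDownLatch decided = new CountDownLatch(1);

        for (CompletableFuture<Void> call : calls) {
            call.whenComplete((ignored, err) -> {
                if (err == null) {
                    if (successes.incrementAndGet() >= majority) {
                        decided.countDown(); // quorum reached
                    }
                } else if (failures.incrementAndGet() > maxFailures) {
                    decided.countDown(); // quorum can no longer be reached
                }
            });
        }
        try {
            if (!decided.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("timed out waiting for quorum");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (successes.get() < majority) {
            throw new IllegalStateException("too many failures to reach quorum");
        }
    }

    /** Convenience for demos: a call that has already failed. */
    static CompletableFuture<Void> failed(String why) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        f.completeExceptionally(new IOException(why));
        return f;
    }
}
```

With three calls of which two have completed and one never answers, waitForWriteQuorum returns immediately, so one slow or broken JournalNode does not stall the write by itself. If two of three are broken, every write fails.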
JournalNode and Journal
On the other end of the RPC, the class that carries out the operations on each JournalNode is JournalNode:
    @InterfaceAudience.Private
    public class JournalNode implements Tool, Configurable, JournalNodeMXBean {
      public static final Log LOG = LogFactory.getLog(JournalNode.class);
      private Configuration conf;
      private JournalNodeRpcServer rpcServer;
      private JournalNodeHttpServer httpServer;
      private final Map<String, Journal> journalsById = Maps.newHashMap();
      private ObjectName journalNodeInfoBeanName;
      private String httpServerURI;
      private File localDir;

      static {
        HdfsConfiguration.init();
      }

      private int resultCode = 0;
It defines the log operation methods that correspond to those on the QJM side:
      ...
      public void discardSegments(String journalId, long startTxId)
          throws IOException {
        getOrCreateJournal(journalId).discardSegments(startTxId);
      }

      public void doPreUpgrade(String journalId) throws IOException {
        getOrCreateJournal(journalId).doPreUpgrade();
      }

      public void doUpgrade(String journalId, StorageInfo sInfo) throws IOException {
        getOrCreateJournal(journalId).doUpgrade(sInfo);
      }

      public void doFinalize(String journalId) throws IOException {
        getOrCreateJournal(journalId).doFinalize();
      }
      ...
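Every method above goes through getOrCreateJournal, which together with the journalsById field points to a lazily populated map. A minimal sketch of that pattern follows. JournalRegistry and JournalStub are hypothetical stand-ins, and placing each journal in a subdirectory of localDir named after its id is an assumption made for illustration.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a lazily populated journal map, keyed by journal id.
class JournalRegistry {
    /** Stand-in for Hadoop's Journal: it only remembers its storage directory. */
    static final class JournalStub {
        final File storageDir;

        JournalStub(File storageDir) {
            this.storageDir = storageDir;
        }
    }

    private final File localDir;
    private final Map<String, JournalStub> journalsById = new HashMap<>();

    JournalRegistry(File localDir) {
        this.localDir = localDir;
    }

    /** Returns the journal for jid, creating it on first use. */
    synchronized JournalStub getOrCreateJournal(String jid) {
        JournalStub journal = journalsById.get(jid);
        if (journal == null) {
            // Assumption: one subdirectory per journal id under localDir.
            journal = new JournalStub(new File(localDir, jid));
            journalsById.put(jid, journal);
        }
        return journal;
    }

    synchronized int size() {
        return journalsById.size();
    }
}
```

Creating a JournalStub here does not touch the file system, so a missing directory would only surface later, on the first real write.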
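These methods all delegate to the Journal class, and they all take a journalId parameter. Does journalId identify the JournalNode host itself? That was my first guess, but it turned out to be wrong.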
    File[] journalDirs = localDir.listFiles(new FileFilter() {
      @Override
      public boolean accept(File file) {
        return file.isDirectory();
      }
    });
    for (File journalDir : journalDirs) {
      // The journal id is simply the name of the subdirectory
      String jid = journalDir.getName();
      if (!status.containsKey(jid)) {
        Map<String, String> jMap = new HashMap<String, String>();
        jMap.put("Formatted", "true");
        status.put(jid, jMap);
      }
    }
The answer is that journalId names the target directory to write to. The test code in hadoop-hdfs-project shows this as well:
    public URI getQuorumJournalURI(String jid) {
      List<String> addrs = Lists.newArrayList();
      for (JNInfo info : nodes) {
        addrs.add("127.0.0.1:" + info.ipcAddr.getPort());
      }
      String addrsVal = Joiner.on(";").join(addrs);
      LOG.debug("Setting logger addresses to: " + addrsVal);
      try {
        return new URI("qjournal://" + addrsVal + "/" + jid);
      } catch (URISyntaxException e) {
        throw new AssertionError(e);
      }
    }
The journal URI has the form qjournal://host/jid, with multiple hosts separated by semicolons, and is set through the following property:
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal:</value>
    </property>
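To show how a URI of this form breaks down into JournalNode addresses plus a journal id, here is a small sketch using only java.net.URI. The class QJournalUri, the example host names, and falling back to port 8485 when no port is given are assumptions for illustration; this is not the parsing code from QuorumJournalManager.

```java
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for qjournal://host1:port;host2:port/journalId URIs.
class QJournalUri {
    static final int DEFAULT_PORT = 8485; // assumed default JournalNode RPC port

    /** Splits the authority part on ';' into one address per JournalNode. */
    static List<InetSocketAddress> loggerAddresses(URI uri) {
        String authority = uri.getAuthority();
        if (!"qjournal".equals(uri.getScheme()) || authority == null) {
            throw new IllegalArgumentException("not a qjournal URI: " + uri);
        }
        List<InetSocketAddress> addrs = new ArrayList<>();
        for (String part : authority.split(";")) {
            String[] hostPort = part.trim().split(":");
            int port = hostPort.length > 1
                ? Integer.parseInt(hostPort[1]) : DEFAULT_PORT;
            // Unresolved: no DNS lookup is performed while parsing.
            addrs.add(InetSocketAddress.createUnresolved(hostPort[0], port));
        }
        return addrs;
    }

    /** The journal id is the URI path without its leading slash. */
    static String journalId(URI uri) {
        String path = uri.getPath();
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("missing journal id: " + uri);
        }
        return path.substring(1);
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com/mycluster");
        System.out.println(loggerAddresses(uri) + " -> " + journalId(uri));
    }
}
```

The semicolon-separated host list is accepted by java.net.URI as a registry-based authority, so getAuthority() returns the whole list while getHost() would return null.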
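JournalNode keeps a map of Journal objects keyed by journalId, so that different journals are written to different editlog directories. The Journal object is what finally executes each operation, and it directly owns the EditLogOutputStream that writes the editlog file. One of its methods is shown below: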
    public synchronized void startLogSegment(RequestInfo reqInfo, long txid,
        int layoutVersion) throws IOException {
      assert fjm != null;
      checkFormatted();
      checkRequest(reqInfo);

      if (curSegment != null) {
        LOG.warn("Client is requesting a new log segment " + txid +
            " though we are already writing " + curSegment + ". " +
            "Aborting the current segment in order to begin the new one.");
        abortCurSegment();
      }

      EditLogFile existing = fjm.getLogFile(txid);
      if (existing != null) {
        if (!existing.isInProgress()) {
          throw new IllegalStateException("Already have a finalized segment " +
              existing + " beginning at " + txid);
        }
        ...
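Readers can look up the detailed write logic themselves; this article only outlines the overall JournalNode write flow. Below is a simple architecture diagram I prepared to help with understanding.
For the full code analysis, see https://github.com/linyiqun/hadoop-yarn. I will keep adding analysis of other parts of the YARN code.
Reference source code
Apache Hadoop 2.7.1 (hadoop-hdfs-project)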