以水稻为例教你如何使用BSA方法进行遗传定位（下篇）

接着上一篇留下的问题，如何按照以1M为窗口，每次移动10Kb计算均值。我们以KY131是"0/0", 而 "DN422" 是 "1/1"的位点结果为例，我们来增加这条线

原本我的想法是使用split根据每个位点的位置信息进行拆分，但是这里有一个问题是，如果只是每个窗口没有重叠的话，那么每个位点应该都有一个唯一的因子，但这里是以10kb将长度为1Mb的窗口从左往右移动，也就意味着有很多位点时属于多个窗口的，也就是无法直接用split-apply-combine的方法进行处理。

于是乎，我写了一个函数用于根据窗口来计算均值，它的逻辑就是先创建好窗口，然后根据窗口的起始位置和终止位置去筛选符合要求的位点，最后进行求和。同时考虑有些位置可能没有SNP，那么最后还进行了一波过滤。

函数的输入是之前snp-index的位置和对应的值，以及确定窗口的大小和步长，

calcValueByWindow <- function(pos, value,
                              window_size = 1000000,
                              step_size = 100000){
  # get the max position in the postion
  max_pos <- max(pos)
  
  # construct the window
  window_start <- seq(0, max_pos + window_size,step_size)
  window_end <- window_start + step_size
  mean_value <- vector(mode = "numeric", length = length(window_start))
  
  # select the value inside the window
  for (j in seq_along(window_start)){
    
    pos_in_window <- which(pos > window_start[j] &
                             pos < window_end[j])
    value_in_window <- value[pos_in_window]
    
    mean_value[j] <- mean(value_in_window)
    
  }
  # remove the Not A Number position
  nan_pos <-  is.nan(mean_value)
  mean_value <- mean_value[! nan_pos]
  window_pos <- ((window_start + window_end)/ 2)[!nan_pos]
  df <- data.frame(pos   = window_pos,
                   value = mean_value)
  return(df)
}

得到结果就可以用lines在之前结果上加上均值线

par(mfrow = c(3,4))

for (i in paste0("chr", formatC(1:12, width = 2, flag=0)) ){
  
  freq_flt <- freq2[grepl(i,row.names(freq2)), ]
  pos <- as.numeric(substring(row.names(freq_flt), 7))
  snp_index <- freq_flt[,1] - freq_flt[,2]
  
  # bin
  df <- calcValueByWindow(pos = pos, value = snp_index)
  
  plot(x = pos, y =snp_index, 
       ylim = c(-1,1),
       pch = 20, cex = 0.2,
       xlab = i,
       ylab = expression(paste(Delta, " " ,"SNP index")))
  lines(x = df$pos, y = df$value, col = "red")
}

最后再对文章总结一下。文章并不是只用了BSA的方法进行定位，他们花了几年的时间用SSR分子标记确定了候选基因可能区间，用BSA的方法在原有基础上缩小了定位区间。当然即便如此，候选基因也有上百个，作者通过BLAST的方式，对这些基因进行了注释。尽管中间还加了一些GO富集分析的内容，说这些基因富集在某个词条里，有一个是DNA metabolic processes(GO:0006259)，但我觉得如果作者用clusterProfiler做富集分析，它肯定无法得到任何富集结果。他做富集分析的目的是其实下面这个描述，也就是找到和抗冻相关的基因

LOC_Os06g39740 and LOC_Os06g39750,were annotated as the function of “response to cold (GO: 0009409)”, suggesting their key roles in regulating cold tolerance in rice. "

当然他还做了qRT-PCR进行了验证，最后推测LOC_Os06g39750应该是目标基因，这个基因里还有8个SNP位点。

扫码即刻交流

以水稻为例教你如何使用BSA方法进行遗传定位（下篇）

你可能感兴趣的:(以水稻为例教你如何使用BSA方法进行遗传定位（下篇）)