R语言---对称矩阵快速导出cytoscape网络格式

目录

<!--more--><!--more-->

背景

最近用R语言做了一万多个基因彼此之间的相关性,结果是对称矩阵,行名和列名是基因名称,矩阵中每个元素M[i,j],表示矩阵M的第i行第j列对应的两个基因相关性值。后续分析中希望将该矩阵转换为基因-基因-相关性值的网络文件,以便后续分析。

常规方法

通过两重循环遍历对称三角每个元素,这种方法在矩阵行列很多时效率太低,循环到2000多行基本不动。

#corMat is the symmetric matrix that needs to be transformed
#corNet is the output file that is transformed into network
corNet <- data.frame()
geneLists <- colnames(corNet)
for(i in 1:(length(geneLists)-1))
{
  for(j in (i+1):length(geneLists))
  {
    tmpDf <- c(geneLists[i], geneLists[j], corMat[i,j])
    corNet <- rbind(corMat, tmpDf)
  }
}

优化循环

通过分析对称矩阵循环规律,可以发现按照行选择数据时存在规律,如选择第i行的元素时,选择的元素集合是从第i+1列一直到最后一列,总共选择了n-i个元素,这里n是基因总数。所以可以优化成一重循环,提高很多效率。但是发现当循环到5000多行时,仍然变得很慢。

#matrix into network
geneList1 <- c()
geneList2 <- c()
corValue <- c()
geneLists <- colnames(corNet)
for(rowNum in 1:(length(geneLists)-1)){
  colNum <- length(geneLists)-rowNum
  geneList1 <- append(geneList1, rep(geneLists[rowNum], colNum))
  geneList2 <- append(geneList2, geneLists[(rowNum+1):length(geneLists)])
  corValue <- append(corValue, corMat[rowNum, (rowNum+1):length(geneLists)])
  print(rowNum)
}
corNet <- data.frame(geneList1, geneList2, corValue)
colnames(corNet) <- c('GeneList1', 'GeneList2', 'corValue')

再优化

避免让一个向量在计算中不断延长

因为在计算中会形成一个不断增长的向量,这样对于R的处理就会非常浪费时间。我们可以改变我们的代码的写法,就是我们不要在每次计算值的时候都要读取上一次的值,如下方示例。见如何提高代码效率。

x <- c(x, b) #b is a new value

脚本优化思路,通过追加写入方式,循环把每次提出来的列表整理成临时的数据框格式,再写入文件,这些可能避免让一个向量不断延长,以提高效率。

#matrix into network
#resFileName is the changed cytoscape network file
#geneLists is the whole gene list in the matrix
#corMat is the matrix to be transformed
resFileName <- 'Output/correlation_analysis_result.txt'
#write the header in the first row
tmpDat <- t(c('geneList1', 'geneList2', 'corValue'))
write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
            col.names = F)
geneLists <- colnames(corNet)
for(rowNum in 1:(length(geneLists)-1)){
  colNum <- length(geneLists)-rowNum
  geneList1 <- rep(geneLists[rowNum], colNum)
  geneList2 <- geneLists[(rowNum+1):length(geneLists)]
  corValue <- corMat[rowNum, (rowNum+1):length(geneLists)]  
  tmpDat <- data.frame(geneList1, geneList2, corValue)
  write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
              append = T, col.names = F)
  print(rowNum)
}

封装成函数

将上述脚本封装成函数便于日后复用 。

# symmetric matrix transformed into the paired network --------------------
# outPath is the file path that will save the output file
# corMat is the matrix to be transformed
symMat_intoNetwork <- function(outPath, corNet){
  #today
  today <- as.character(Sys.Date())
  #check the outPath
  if(!dir.exists(outPath)){
    return(paste(outPath, 'The path is not exist!', sep = '\n'))
    
  }
  #resFileName is the changed cytoscape network file
  resFileName <- paste(outPath, '/', 'symmetricMatrix_transformed_into_network_',
                       today, '.txt', sep = '')
  #write the header in the first row
  tmpDat <- t(c('geneList1', 'geneList2', 'corValue'))
  write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
              col.names = F)
  geneLists <- colnames(corNet)
  for(rowNum in 1:(length(geneLists)-1)){
    colNum <- length(geneLists)-rowNum
    geneList1 <- rep(geneLists[rowNum], colNum)
    geneList2 <- geneLists[(rowNum+1):length(geneLists)]
    corValue <- corMat[rowNum, (rowNum+1):length(geneLists)]  
    tmpDat <- data.frame(geneList1, geneList2, corValue)
    write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
                append = T, col.names = F)
    print(rowNum)
  }
}