<!--more-->
, <!--more-->
背景
最近用R语言做了一万多个基因彼此之间的相关性,结果是对称矩阵,行名和列名是基因名称,矩阵中每个元素M[i,j],表示矩阵M的第i行第j列对应的两个基因相关性值。后续分析中希望将该矩阵转换为基因-基因-相关性值的网络文件,以便后续分析。
常规方法
通过两重循环遍历对称三角每个元素,这种方法在矩阵行列很多时效率太低,循环到2000多行基本不动。
#corMat is the symmetric matrix that needs to be transformed
#corNet is the output file that is transformed into network
corNet <- data.frame()
geneLists <- colnames(corNet)
for(i in 1:(length(geneLists)-1))
{
for(j in (i+1):length(geneLists))
{
tmpDf <- c(geneLists[i], geneLists[j], corMat[i,j])
corNet <- rbind(corMat, tmpDf)
}
}
优化循环
通过分析对称矩阵循环规律,可以发现按照行选择数据时存在规律,如选择第i行的元素时,选择的元素集合是从第i+1列一直到最后一列,总共选择了n-i个元素,这里n是基因总数。所以可以优化成一重循环,提高很多效率。但是发现当循环到5000多行时,仍然变得很慢。
#matrix into network
geneList1 <- c()
geneList2 <- c()
corValue <- c()
geneLists <- colnames(corNet)
for(rowNum in 1:(length(geneLists)-1)){
colNum <- length(geneLists)-rowNum
geneList1 <- append(geneList1, rep(geneLists[rowNum], colNum))
geneList2 <- append(geneList2, geneLists[(rowNum+1):length(geneLists)])
corValue <- append(corValue, corMat[rowNum, (rowNum+1):length(geneLists)])
print(rowNum)
}
corNet <- data.frame(geneList1, geneList2, corValue)
colnames(corNet) <- c('GeneList1', 'GeneList2', 'corValue')
再优化
避免让一个向量在计算中不断延长
因为在计算中会形成一个不断增长的向量,这样对于R的处理就会非常浪费时间。我们可以改变我们的代码的写法,就是我们不要在每次计算值的时候都要读取上一次的值,如下方示例。见如何提高代码效率。
x <- c(x, b) #b is a new value
脚本优化思路,通过追加写入方式,循环把每次提出来的列表整理成临时的数据框格式,再写入文件,这些可能避免让一个向量不断延长,以提高效率。
#matrix into network
#resFileName is the changed cytoscape network file
#geneLists is the whole gene list in the matrix
#corMat is the matrix to be transformed
resFileName <- 'Output/correlation_analysis_result.txt'
#write the header in the first row
tmpDat <- t(c('geneList1', 'geneList2', 'corValue'))
write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
col.names = F)
geneLists <- colnames(corNet)
for(rowNum in 1:(length(geneLists)-1)){
colNum <- length(geneLists)-rowNum
geneList1 <- rep(geneLists[rowNum], colNum)
geneList2 <- geneLists[(rowNum+1):length(geneLists)]
corValue <- corMat[rowNum, (rowNum+1):length(geneLists)]
tmpDat <- data.frame(geneList1, geneList2, corValue)
write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
append = T, col.names = F)
print(rowNum)
}
封装成函数
将上述脚本封装成函数便于日后复用 。
# symmetric matrix transformed into the paired network --------------------
# outPath is the file path that will save the output file
# corMat is the matrix to be transformed
symMat_intoNetwork <- function(outPath, corNet){
#today
today <- as.character(Sys.Date())
#check the outPath
if(!dir.exists(outPath)){
return(paste(outPath, 'The path is not exist!', sep = '\n'))
}
#resFileName is the changed cytoscape network file
resFileName <- paste(outPath, '/', 'symmetricMatrix_transformed_into_network_',
today, '.txt', sep = '')
#write the header in the first row
tmpDat <- t(c('geneList1', 'geneList2', 'corValue'))
write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
col.names = F)
geneLists <- colnames(corNet)
for(rowNum in 1:(length(geneLists)-1)){
colNum <- length(geneLists)-rowNum
geneList1 <- rep(geneLists[rowNum], colNum)
geneList2 <- geneLists[(rowNum+1):length(geneLists)]
corValue <- corMat[rowNum, (rowNum+1):length(geneLists)]
tmpDat <- data.frame(geneList1, geneList2, corValue)
write.table(tmpDat, file = resFileName, quote = F, sep = '\t', row.names = F,
append = T, col.names = F)
print(rowNum)
}
}