R语言实现4道编程题

目录

第一题

题目

统计一段序列A、C、G、T出现次数,序列长度至多1000nt

Given: A DNA string “s” of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols ‘A’, ‘C’, ‘G’, and ‘T’ occur in “s”.

示例

input: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

output: 20 12 17 21

R代码实现

library(dplyr)
library(stringr)
basesCount <- function(char_list){
  if(nchar(char_list) > 1000){
    stop('The length of input string greater than 1000!')
  }
  a_count <- str_extract_all(char_list, 'A')[[1]] %>% length
  c_count <- str_extract_all(char_list, 'C')[[1]] %>% length
  g_count <- str_extract_all(char_list, 'G')[[1]] %>% length
  t_count <- str_extract_all(char_list, 'T')[[1]] %>% length
  return(c(a_count, c_count, g_count, t_count))
}
# Input
Q1_input <- read.table('Q1_input.txt', header = F, sep = '')[ ,1]
res <- basesCount(Q1_input)
res

通过构建basesCount函数,输入一个序列返回对应的A、C、G、T各个碱基个数的结果

该函数首先对输入序列进行判断,如果序列长度超过1000则报错。然后通过stringr包中str_extract_all函数提取出各个碱基再统计其个数。

shell脚本方法

#!/bin/sh
# $1 传入序列信息的样本名
if [ $# -lt 1 ]
then
	echo "Please input the seqence file name!"
	exit
fi
input_seq=`cat $1`
input_seq=`echo $input_seq | tr 'acgt' 'ACGT'`
seq_length=`echo $input_seq | wc -l`
if [ $seq_length -gt 1000 ]
then
	echo "The input seqence's length greater than 1000!"
	exit
fi
a_count=`echo $input_seq | grep -o 'A' | wc -l`
c_count=`echo $input_seq | grep -o 'C' | wc -l`
g_count=`echo $input_seq | grep -o 'G' | wc -l`
t_count=`echo $input_seq | grep -o 'T' | wc -l`
echo "$a_count $c_count $g_count $t_count"

首先判断调用的脚本是否传入序列文件名的变量;然后对序列进行大小写的转换;再判断序列是否大于1000,最后统计各碱基的个数并输出

第二题

题目

计算两个序列的Hamming distance,即两个序列中对应位置的碱基不同的个数

Given two strings “s” and “t” of equal length, the Hamming distance between “s” and “t” is the number of corresponding symbols that differ in “s” and “t”.

Given: Two DNA strings “s” and “t” of equal length (not exceeding 1 kbp).

Return: The Hamming distance between “s” and “t”.

示例

Example input: “s”: GAGCCTACTAACGGGAT “t”: CATCGTAATGACGGCCT

Should output: 7

R脚本实现

library(dplyr)
library(stringr)
hammingDistance <- function(str1, str2){
  # remove the special char
  str1 <- trimws(str1, which = c("both", "left", "right"), 
                 whitespace = "[ \t\r\n]") 
  str2 <- trimws(str2, which = c("both", "left", "right"), 
                 whitespace = "[ \t\r\n]")
  if(nchar(str1) != nchar(str2)){
    stop("the two string's length not equal!")
  }
  if(nchar(str1) > 1000){
    stop("the string's length greater than 1 kbp!")
  }
  count <- 0
  for(i in 1:nchar(str1)){
    if(substr(str1, i, i) != substr(str2, i, i)){
      count <- count + 1
    }
  }
  return(count)
}
Q2_input <- read.table('Q2_input.txt', header = F, sep = ':', row.names = 1)
hammingDistance(Q2_input['s', ], Q2_input['t', ])

说明:

  1. 构建hammingDistance函数,输入是两个序列,输出是计算的距离值
  2. 函数中首先先去除序列两端的特殊字符如空格等
  3. 然后检查两个碱基序列长度是否相等,不相等报错
  4. 检查序列长度是否超过1000,超过报错
  5. 依序遍历,判断两个序列对应等碱基是否相同,不同则计数+1
  6. 最终返回计算的结果作为hammingDistance距离

第三题

题目

找给定几个序列中的最长共同模体(motif)序列

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”. (Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.)

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

示例

Example input:

1 GATTACA 2 TAGACCA 3 ATACA

Should output: TA

R脚本实现

library(dplyr)
library(stringr)
# function
fastaToList <- function(fasta_file){
  fasta_list <- seqinr::read.fasta(file = fasta_file, as.string = TRUE)
  str_len <- c()
  for(i in 1:length(fasta_list)){
    fasta_list[[i]] <- fasta_list[[i]] %>% as.character
    str_len[i] <- nchar(fasta_list[[i]])
  }
  return(fasta_list)
}
getMotif <- function(fasta_file){
  fasta_list <- fastaToList(fasta_file)
  motif_list = c()
  n <- 1
  start_loc <- 1
  while(TRUE){
    motif <- ''
    for(stop_loc in (start_loc+1):nchar(fasta_list[[1]])){
      test_motif <- substr(fasta_list[[1]], start_loc, stop_loc)
      for(i in 2:length(fasta_list)){
        res <- str_detect(fasta_list[[i]], test_motif)
        if(res == FALSE){
          break
        }
      }
      if(res == FALSE){
        break
      }
      motif <- test_motif
    }
    if(res == FALSE){
      start_loc <- stop_loc
      motif_list[n] <- motif
      n <- n+1
    }else{
      motif_list[n] <- motif
      break
    }
    if(start_loc >= (nchar(fasta_list[[1]]))){
      break
    }
  }
  return(motif_list)
}
# Input
fasta_file <- 'test.txt'
motif_list <- getMotif(fasta_file)
longest_motif <- motif_list[nchar(motif_list) == max(nchar(motif_list))]

说明:

  1. 首先通过fastaToList函数读入fasta文件,函数输入是fasta文件名,输出是包含各个序列字符串格式的列表
    • 使用seqinr包读入fasta格式的文件,输入名是文件名,as.string控制读入的格式是字符还是字符串
    • 读入后的数据是类的格式,需要自定义将每个元素转换成字符的格式
  2. 然后通过getMotif函数遍历获取各个序列共同的motif序列
    • 思路:
      • 以第一条序列(或最短序列长度)为参考序列,通过控制碱基开始位置和终止位置指针依次遍历,判断之间的序列是否在各个序列中存在对应,不对应则跳出循环
      • 终止位置指针滑动规律是从开始位置+1开始依次顺着参考序列滑动一个碱基,直到最后一个碱基
      • 开始位置指针滑动规律是从第一个碱基开始,后续都跳到上一次循环的终止指针处
    • 首先初始化变量
      • motif_list用于存储遍历得到的各个序列间共同的motif序列;
      • n用于记录下一个查找的motif序列应该存放在motif_list中的位置;
      • start_loc是查找motif序列的开始位置指针
    • 开始指针循环
      • 由于不确定具体循环次数,使用while无限循环,通过设置条件终止循环
      • 终止条件:1. 终止指针已经查到最后一个碱基并检验该序列是motif;2. 下一次循环的开始指针大于等于参考序列长度
      • 这里一次循环完成一次从开始指针开始,对终止指针的依序遍历,查找该潜在motif序列在所有序列中是否存在相应的潜在motif序列
    • 终止指针循环
      • 从开始指针+1处开始往后依次遍历,每次将从参考序列中获得的潜在motif序列与所有序列比对,如果在某个序列中找不到一致的序列(res 判断为FALSE),则终止在其他序列中继续查询比对,并停止此终止指针的遍历。
      • 否则依次往后遍历,直至到最后一个碱基,每次遍历的潜在motif序列如果在所有序列中都存在,则将此次遍历的潜在motif序列存入motif变量中

第四题

问题

蛋白质翻译,将RNA序列翻译成氨基酸序列

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings. Given: An RNA string “s” corresponding to a strand of mRNA (of length at most 10 kbp). Return: The protein string encoded by “s”.

示例

Example input: AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA Should output: MAMAPRTEINSTRING

R脚本

library(dplyr)
library(stringr)
pt_code <- read.csv('translationCode.csv')
pt_code['Codon'] <- apply(pt_code, 1, 
                          function(x){return(paste0(x[1], x[2], x[3]))}
                          )
seq_str <- read.table('Q4_input.txt', header = F, sep = '')
proteinTranslation <- function(seq_str, stop_codon = TRUE){
  if(nchar(seq_str) > 10000){
    stop('The length of input string greater than 10 kbp!')
  }
  i <- 1
  translation_res <- ''
  while ((nchar(seq_str)-i)>=2) {
    codon <- substr(seq_str, i, i+2)
    tmp <- pt_code$Aacid[pt_code$Codon == codon]
    if(tmp == 'STOP' & stop_codon == TRUE){
      break
    }
    translation_res <- paste0(translation_res, tmp)
    i <- i+3
  }
  return(translation_res)
}

proteinTranslation(seq_str)

说明:

1. 制作了一个mRNA密码子与翻译后蛋白质序列对应关系的codebook
2. proteinTranslation函数,输入待翻译的RNA序列(和stop_codon参数,默认TRUE,即遇到终止密码就停止翻译)
3. 函数返回翻译后的氨基酸序列