Annovar

Posted by Icey的博客 on September 20, 2019

Annotate_variation.pl

Annotate_variation.pl的注释方式分为三种:

Gene-based annotation Region-based annotation Filter-based annotation

参数分别为-geneanno,-regionanno,-filter

annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 example/ex1.avinput humandb/
$ annotate_variation.pl -geneanno \
      -dbtype refGene -out ex1 \
      -build hg19 \
      example/ex1.avinput humandb/
# -geneanno 表示使用基于基因的注释
# -dbtype refGene 表示使用"refGene"数据库
# -out ex1 表示输出文件以ex1为前缀
# --buildver <string>    表示基因组版本,一般根据你做Call SNP时所用基因组版本      specify genome build version (default: hg18 for human)

默认使用gene-based注释类型以及refGene数据库,所以上面的命令可以缺省-geneanno -dbtype refGene

产生两个文件:variant_function和exonic_variant_function

结果文件

variant_function包含所有变异结果 exonic_variant只包含变异在外显子区域的

cat A.indel.input.variant_function|grep "^exonic"|wc -l
cat A.indel.input.exonic_variant_function|wc -l

一般来说

Exons = gene - introns
CDS = gene - introns - UTRs
CDS = Exons - UTRs

但文档说

the “exonic” here refers only to coding exonic portion , but not UTR portion, as there are two keywords (UTR5, UTR3) that are specifically reserved for UTR annotations. 因此这里的exonic = cds

variant文件:相比于input仅在前面加了两列

  • exonic 1 variant overlaps a coding exon_variant (SO:0001791)
  • splicing 1 variant is within 2-bp of a splicing junction (use -splicing_threshold to change this)
  • ncRNA 2 variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation)
  • UTR5 3 variant overlaps a 5’ untranslated region
  • UTR3 3 variant overlaps a 3’ untranslated region
  • intronic 4 variant overlaps an intron
  • upstream 5 variant overlaps 1-kb region upstream of transcription start site
  • downstream 5 variant overlaps 1-kb region downtream of transcription end site
  • intergenic 6 variant is in intergenic region

exonic文件

  • frameshift insertion 1 an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence
  • frameshift deletion 2 a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence
  • frameshift block substitution 3 a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence
  • stopgain 4 a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as “stopgain”!
  • stoploss 5 a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site
  • nonframeshift insertion 6 an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence
  • nonframeshift deletion 7 a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence
  • nonframeshift block substitution 8 a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence
  • nonsynonymous SNV 9 a single nucleotide change that cause an amino acid change
  • synonymous SNV 10 a single nucleotide change that does not cause an amino acid
  • unknown 11 unknown function (due to various errors in the gene structure

自建库

db中需要有species_refGeneMrna.fa和species_refGene.txt两个文件。

-buildver参数 写 下划线前的 这里是 species

参考

1.annovar 文档