Annotate_variation.pl
Annotate_variation.pl的注释方式分为三种:
Gene-based annotation Region-based annotation Filter-based annotation
参数分别为-geneanno,-regionanno,-filter
annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 example/ex1.avinput humandb/
$ annotate_variation.pl -geneanno \
-dbtype refGene -out ex1 \
-build hg19 \
example/ex1.avinput humandb/
# -geneanno 表示使用基于基因的注释
# -dbtype refGene 表示使用"refGene"数据库
# -out ex1 表示输出文件以ex1为前缀
# --buildver <string> 表示基因组版本,一般根据你做Call SNP时所用基因组版本 specify genome build version (default: hg18 for human)
默认使用gene-based注释类型以及refGene数据库,所以上面的命令可以缺省-geneanno -dbtype refGene
产生两个文件:variant_function和exonic_variant_function
结果文件
variant_function包含所有变异结果 exonic_variant只包含变异在外显子区域的
cat A.indel.input.variant_function|grep "^exonic"|wc -l
cat A.indel.input.exonic_variant_function|wc -l
一般来说
Exons = gene - introns
CDS = gene - introns - UTRs
CDS = Exons - UTRs
但文档说
the “exonic” here refers only to coding exonic portion , but not UTR portion, as there are two keywords (UTR5, UTR3) that are specifically reserved for UTR annotations. 因此这里的exonic = cds
variant文件:相比于input仅在前面加了两列
- exonic 1 variant overlaps a coding exon_variant (SO:0001791)
- splicing 1 variant is within 2-bp of a splicing junction (use -splicing_threshold to change this)
- ncRNA 2 variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation)
- UTR5 3 variant overlaps a 5’ untranslated region
- UTR3 3 variant overlaps a 3’ untranslated region
- intronic 4 variant overlaps an intron
- upstream 5 variant overlaps 1-kb region upstream of transcription start site
- downstream 5 variant overlaps 1-kb region downtream of transcription end site
- intergenic 6 variant is in intergenic region
exonic文件
- frameshift insertion 1 an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence
- frameshift deletion 2 a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence
- frameshift block substitution 3 a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence
- stopgain 4 a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as “stopgain”!
- stoploss 5 a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site
- nonframeshift insertion 6 an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence
- nonframeshift deletion 7 a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence
- nonframeshift block substitution 8 a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence
- nonsynonymous SNV 9 a single nucleotide change that cause an amino acid change
- synonymous SNV 10 a single nucleotide change that does not cause an amino acid
- unknown 11 unknown function (due to various errors in the gene structure
自建库
db中需要有species_refGeneMrna.fa和species_refGene.txt两个文件。
-buildver参数 写 下划线前的 这里是 species