Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple-event-annotation.R: doesn't add SVLEN header #561

Open
savalipa opened this issue Feb 8, 2022 · 5 comments
Open

simple-event-annotation.R: doesn't add SVLEN header #561

savalipa opened this issue Feb 8, 2022 · 5 comments

Comments

@savalipa
Copy link

savalipa commented Feb 8, 2022

Hello,

I am having trouble with the simple-event-annotation.R script. I have already tried to run the script for two different batches with SV results .vcf file generated with GRIDSS. One batch had one sample only, and the another had 4 samples. Nevertheless, I got this same error message for both batches, where "provided 68 variables to replace 67 variables" error is repeated:

Error in .subassign_columns(x, nsbs, value) : provided 68 variables to replace 67 variables Calls: [<- ... [<- -> mergeROWS -> mergeROWS -> .subassign_columns In addition: Warning message: info fields with no header: SVLEN Execution halted

I do not know how to start solving this. Is it due to the info field without header? Or something more complicated?

Best regards,
Salla

@d-cameron
Copy link
Member

I do not know how to start solving this. Is it due to the info field without header?

GRIDSS no longer writes the SVLEN header. The workaround is to add the SVLEN header before running the simple annotation script.

@d-cameron d-cameron changed the title simple-event-annotation.R: Error in .subassign_columns(x, nsbs, value) simple-event-annotation.R: doesn't add SVLEN header Feb 9, 2022
@chunlinxiao
Copy link

chunlinxiao commented Feb 18, 2022

I got same error - after adding SVLEN, the error was gone.

One question related to this script, which generated one annotated.vcf file (turning those BND to respective simple SV types without filtering), and the other file was a simple.bed file, which presumably a good SV calls in bed format with additional filtering - what was really implemented for this filtering? should we just use this simple.bed file as a confident SV call set by gridss?

Also, I noticed that in annotated.vcf, for a given DEL, DUP, or INV, it still contains two rows (one for each BND), but I saw all the INS events, also have two rows - so two BNDs were used for defining an INS event, or something else?

@savalipa
Copy link
Author

Thank you for your suggestions. Could you help me with providing your exact solutions (tool and/or script to run?)

@chunlinxiao
Copy link

what I did for the vcf, added a line to the header:

##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="SV_length">

then the error disappeared.

@HuangYihang1222
Copy link

Can someone help me see what's the problem with my R script?

# 定义需要安装的包列表
packages <- c("BiocManager", "VariantAnnotation", "stringr", "StructuralVariantAnnotation")

# 检查包是否已安装,如果没有则安装
installed_packages <- rownames(installed.packages())
to_install <- packages[!packages %in% installed_packages]
if (length(to_install) > 0) {
  BiocManager::install(to_install)  # 使用BiocManager安装bioconductor包
  install.packages("stringr")  # stringr不是bioconductor包,使用install.packages安装
}

# 加载所需的包
library(BiocManager)
library(VariantAnnotation)
library(stringr)
library(StructuralVariantAnnotation)

# 定义简单的SV类型分类器
simpleEventType <- function(gr) {
  pgr <- partner(gr)
  return(ifelse(seqnames(gr) != seqnames(pgr), "CTX", # inter-chromosomal
                ifelse(strand(gr) == strand(pgr), "INV",
                       ifelse(gr$insLen >= abs(gr$svLen) * 0.7, "INS", # TODO: improve classification of complex events
                              ifelse(xor(start(gr) < start(pgr), strand(gr) == "-"), "DEL",
                                     "DUP")))))
}

# 定义处理VCF文件的函数
process_vcf <- function(vcf_path, output_dir, ref_path) {
  # 读取VCF文件
  vcf <- readVcf(vcf_path, ref_path)
  
  # 创建新的INFO字段
  new_info_fields <- data.frame(
    row.names = c("SIMPLE_TYPE", "SVLEN"),
    Number = c("1", "1"),
    Type = c("String", "Integer"),
    Description = c("Simple event type annotation based purely on breakend position and orientation.",
                    "SV length.")
  )
  
  # 将新的INFO字段与现有的INFO字段合并
  existing_info <- as.data.frame(info(header(vcf)))
  info_fields <- unique(as(rbind(existing_info, new_info_fields), "DataFrame"))
  
  # 更新VCF头部的INFO字段
  info(header(vcf)) <- info_fields

  # 计算SV类型并更新VCF对象
  gr <- breakpointRanges(vcf)
  svtype <- simpleEventType(gr)
  
  # 添加SIMPLE_TYPE信息到VCF的variants中
  info(vcf)$SIMPLE_TYPE <- NA_character_
  info(vcf[gr$sourceId])$SIMPLE_TYPE <- svtype
  info(vcf[gr$sourceId])$SVLEN <- gr$svLen
  
  # 写入VCF文件
  output_vcf_filename <- file.path(output_dir, paste0(basename(output_dir), "_annotated.vcf"))
  writeVcf(vcf, output_vcf_filename)
  
  # 过滤和生成BED文件
  gr <- gr[gr$FILTER == "PASS" & partner(gr)$FILTER == "PASS"]
  simplegr <- gr[svtype %in% c("INS", "INV", "DEL", "DUP")]
  simplebed <- data.frame(
    chrom = seqnames(simplegr),
    # call the centre of the homology/inexact interval
    start = as.integer((start(simplegr) + end(simplegr)) / 2),
    end = as.integer((start(partner(simplegr)) + end(partner(simplegr))) / 2),
    name = simpleEventType(simplegr),
    score = simplegr$QUAL,
    strand = "."
  )
  
  simplebed <- simplebed[simplebed$start < simplebed$end,]
  
  output_bed_filename <- file.path(output_dir, paste0(basename(output_dir), "_simple.bed"))
  write.table(simplebed, output_bed_filename, quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}

# 获取所有ERR*文件夹的路径
base_dir <- "/home/huangyh/EU_ragweed/SV_results/gridss"
err_dirs <- list.dirs(path = base_dir, full.names = FALSE, recursive = FALSE)
err_dirs <- err_dirs[grepl("^ERR", err_dirs)]  # 过滤出所有以"ERR"开头的目录

# 获取参考基因组的路径
ref_path <- "/home/huangyh/raw_data/ref_seq/Ragweed.genome.fa"

# 遍历每个ERR*文件夹中的VCF文件
for (err_dir in err_dirs) {
  vcf_files <- list.files(path = err_dir, pattern = "\\.vcf$", full.names = TRUE)
  
  for (vcf_file in vcf_files) {
    # 获取VCF文件所在的文件夹路径
    output_dir <- dirname(vcf_file)
    
    # 处理VCF文件并生成输出
    process_vcf(vcf_path = vcf_file, output_dir = output_dir, ref_path = ref_path)
  }
}

Then I got this:Error in .subassign_columns(x, nsbs, value) :
provided 68 variables to replace 67 variables
Calls: process_vcf ... [<- -> mergeROWS -> mergeROWS -> .subassign_columns
此外: Warning message:
In .breakpointRanges(x, ...) :
Removing 1 unpaired breakend variants. Use inferMissingBreakends=TRUE to recover with inferred partner breakends. Missing breakends: gridss30bb_2291h
停止执行

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants