introduction-to-SNPutils
Source:vignettes/introduction-to-SNPutils.Rmd
introduction-to-SNPutils.Rmd
Overview
SNPutils is an R package designed to convert numerical genotype data (0, 1, 2) to allele pairs. The package provides tools for:
- Converting genotype codes to actual allele pairs
- Validating allele assignments according to standard genetic rules
- Generating standardized output formats (TYP and COM)
- Handling indels (insertions/deletions) correctly
- Performing quality control checks on genotype data
Installation
You can install SNPutils from GitHub:
# install.packages("remotes")
remotes::install_github("LabeoFD/SNPutils")
Basic Concepts
In genetic data processing, genotypes are often represented as numerical codes:
- 0: Homozygous for the first allele (A/A)
- 1: Heterozygous (A/B)
- 2: Homozygous for the second allele (B/B)
These codes are convenient for computation. SNPutils converts these codes back to biologically meaningful allele pairs using annotation data that specifies which nucleotides (A, C, G, T) correspond to each allele.
Example Data
SNPutils comes with example datasets to demonstrate its functionality:
library(SNPutils)
# Load the example data
data(example_genotypes)
data(example_annotations)
# Examine the genotype data
head(example_genotypes)
#> probeset_id Sample1 Sample2 Sample3
#> 1 SNP1 0 2 2
#> 2 SNP2 0 1 1
#> 3 SNP3 1 1 0
#> 4 SNP4 1 0 2
#> 5 SNP5 2 0 0
#> 6 SNP6 1 1 2
The genotype data contains:
- A
probeset_id
column with SNP identifiers - Columns for each sample (Sample1, Sample2, Sample3) with genotype codes (0, 1, 2)
Examine the annotation data
head(example_annotations)
#> Probe.Set.ID Chromosome Physical.Position Allele.A Allele.B
#> 1 SNP1 9 35958693 T G
#> 2 SNP2 11 61564642 A C
#> 3 SNP3 19 99793111 A C
#> 4 SNP4 14 22319761 T G
#> 5 SNP5 16 41066538 A G
#> 6 SNP6 13 20182616 T C
The annotation data contains:
- A
Probe.Set.ID
column matching the probeset_id values - Chromosome and position information
-
Allele.A
andAllele.B
columns specifying the nucleotides
The example data contains both standard SNPs and indels. For indels,
the Allele.A
column contains a deletion marker (“-”) and
Allele.B
contains the insertion sequence.
Converting Genotypes to Alleles
The main function genotype2allele() converts numeric genotype codes to allele pairs:
# Convert genotypes to alleles
results <- genotype2allele(example_genotypes, example_annotations)
#> No rule violations found - all rules are respected.
# View the structure of the results
str(results)
#> List of 2
#> $ validation:List of 2
#> ..$ allele_validation: tibble [20 × 12] (S3: tbl_df/tbl/data.frame)
#> .. ..$ probeset_id : chr [1:20] "SNP1" "SNP10" "SNP11" "SNP12" ...
#> .. ..$ allele_a : chr [1:20] "T" "T" "-" "-" ...
#> .. ..$ allele_b : chr [1:20] "G" "C" "A" "G" ...
#> .. ..$ is_indel : logi [1:20] FALSE FALSE TRUE TRUE TRUE TRUE ...
#> .. ..$ is_multibase: logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> .. ..$ is_AT_SNP : logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> .. ..$ is_CG_SNP : logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> .. ..$ is_other_SNP: logi [1:20] TRUE TRUE FALSE FALSE FALSE FALSE ...
#> .. ..$ rule2_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> .. ..$ rule3_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> .. ..$ rule4_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> .. ..$ rule5_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> ..$ rule_violations : tibble [0 × 13] (S3: tbl_df/tbl/data.frame)
#> .. ..$ probeset_id : chr(0)
#> .. ..$ allele_a : chr(0)
#> .. ..$ allele_b : chr(0)
#> .. ..$ is_indel : logi(0)
#> .. ..$ is_multibase: logi(0)
#> .. ..$ is_AT_SNP : logi(0)
#> .. ..$ is_CG_SNP : logi(0)
#> .. ..$ is_other_SNP: logi(0)
#> .. ..$ rule2_check : logi(0)
#> .. ..$ rule3_check : logi(0)
#> .. ..$ rule4_check : logi(0)
#> .. ..$ rule5_check : logi(0)
#> .. ..$ violations : chr(0)
#> $ results :Classes 'data.table' and 'data.frame': 60 obs. of 3 variables:
#> ..$ probeset_id : chr [1:60] "SNP1" "SNP1" "SNP1" "SNP10" ...
#> ..$ Sample_ID : chr [1:60] "Sample1" "Sample2" "Sample3" "Sample1" ...
#> ..$ allele_genotype: chr [1:60] "T/T" "G/G" "G/G" "T/T" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
The genotype2allele()
function returns a list with two
components:
-
validation
: Contains validation results and rule violations -
results
: Contains the converted genotype data
Let’s examine the conversion results:
# Look at the first few rows of the results
head(results$results)
#> probeset_id Sample_ID allele_genotype
#> <char> <char> <char>
#> 1: SNP1 Sample1 T/T
#> 2: SNP1 Sample2 G/G
#> 3: SNP1 Sample3 G/G
#> 4: SNP10 Sample1 T/T
#> 5: SNP10 Sample2 T/C
#> 6: SNP10 Sample3 C/C
str(results)
#> List of 2
#> $ validation:List of 2
#> ..$ allele_validation: tibble [20 × 12] (S3: tbl_df/tbl/data.frame)
#> .. ..$ probeset_id : chr [1:20] "SNP1" "SNP10" "SNP11" "SNP12" ...
#> .. ..$ allele_a : chr [1:20] "T" "T" "-" "-" ...
#> .. ..$ allele_b : chr [1:20] "G" "C" "A" "G" ...
#> .. ..$ is_indel : logi [1:20] FALSE FALSE TRUE TRUE TRUE TRUE ...
#> .. ..$ is_multibase: logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> .. ..$ is_AT_SNP : logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> .. ..$ is_CG_SNP : logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> .. ..$ is_other_SNP: logi [1:20] TRUE TRUE FALSE FALSE FALSE FALSE ...
#> .. ..$ rule2_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> .. ..$ rule3_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> .. ..$ rule4_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> .. ..$ rule5_check : logi [1:20] TRUE TRUE TRUE TRUE TRUE TRUE ...
#> ..$ rule_violations : tibble [0 × 13] (S3: tbl_df/tbl/data.frame)
#> .. ..$ probeset_id : chr(0)
#> .. ..$ allele_a : chr(0)
#> .. ..$ allele_b : chr(0)
#> .. ..$ is_indel : logi(0)
#> .. ..$ is_multibase: logi(0)
#> .. ..$ is_AT_SNP : logi(0)
#> .. ..$ is_CG_SNP : logi(0)
#> .. ..$ is_other_SNP: logi(0)
#> .. ..$ rule2_check : logi(0)
#> .. ..$ rule3_check : logi(0)
#> .. ..$ rule4_check : logi(0)
#> .. ..$ rule5_check : logi(0)
#> .. ..$ violations : chr(0)
#> $ results :Classes 'data.table' and 'data.frame': 60 obs. of 3 variables:
#> ..$ probeset_id : chr [1:60] "SNP1" "SNP1" "SNP1" "SNP10" ...
#> ..$ Sample_ID : chr [1:60] "Sample1" "Sample2" "Sample3" "Sample1" ...
#> ..$ allele_genotype: chr [1:60] "T/T" "G/G" "G/G" "T/T" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
For each genotype, the function has created an
allele_genotype
showing the actual nucleotide alleles.
Validating Allele Assignments
SNPutils automatically validates allele assignments according to standard rules:
Rule 1: SNPs are fixed on the forward strand of the design-time reference genome Rule 2: For AT/CG SNPs, alleles should be in alphabetical order Rule 3: For other SNPs, Allele.A should be A or T, and Allele.B should be C or G Rule 4: For indels, Allele.B should not be a deletion marker Rule 5: For multi-base alleles, alleles should be in alphabetical order
Let’s check if any validation issues were found:
# Check for validation issues
if (!is.null(results$validation$rule_violations) && nrow(results$validation$rule_violations) > 0) {
head(results$validation$rule_violations)
} else {
cat("No validation issues found\n")
}
#> No validation issues found
Updated function: genotype2allelev2()
The package also includes an updated version with additional features:
# Convert genotypes to alleles using enhanced function
results_v2 <- genotype2allelev2(example_genotypes, example_annotations, verbose = FALSE)
# View the structure of the results
str(results_v2, max.level = 2)
#> List of 2
#> $ validation:List of 2
#> ..$ allele_validation: tibble [20 × 12] (S3: tbl_df/tbl/data.frame)
#> ..$ rule_violations : tibble [0 × 13] (S3: tbl_df/tbl/data.frame)
#> $ results :Classes 'data.table' and 'data.frame': 60 obs. of 3 variables:
#> ..$ probeset_id : chr [1:60] "SNP1" "SNP1" "SNP1" "SNP10" ...
#> ..$ Sample_ID : chr [1:60] "Sample1" "Sample2" "Sample3" "Sample1" ...
#> ..$ allele_genotype: chr [1:60] "T/T" "G/G" "G/G" "T/T" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
Key improvements in v2: - Space-to-dash conversion for alleles - Missing genotype handling (-1 → “–”) - Extended indel detection for multi-character alleles - Improved validation and error handling
Let’s examine the conversion results:
# Look at the first few rows of the results (using v2)
head(results_v2$results)
#> probeset_id Sample_ID allele_genotype
#> <char> <char> <char>
#> 1: SNP1 Sample1 T/T
#> 2: SNP1 Sample2 G/G
#> 3: SNP1 Sample3 G/G
#> 4: SNP10 Sample1 T/T
#> 5: SNP10 Sample2 T/C
#> 6: SNP10 Sample3 C/C
For each genotype, the function has created an
allele_genotype
showing the actual nucleotide alleles.
Validating Allele Assignments
SNPutils automatically validates allele assignments according to standard rules:
- Rule 1: SNPs are fixed on the forward strand of the design-time reference genome
- Rule 2: For AT/CG SNPs, alleles should be in alphabetical order
- Rule 3: For other SNPs, Allele.A should be A or T, and Allele.B should be C or G
- Rule 4: For indels, Allele.B should not be a deletion marker
- Rule 5: For multi-base alleles, alleles should be in alphabetical order
Let’s check if any validation issues were found:
Converting to TYP Format
The allele2typ()
function converts the results to TYP
format: NOTE: It’s a TYP format missing the header
(Pre-TYP)
# Convert to TYP format
typ_format <- allele2typ(results)
# View the TYP format data
head(typ_format)
#> Sample_ID SNP_Name Allele1_Forward Allele2_Forward
#> <char> <char> <char> <char>
#> 1: Sample1 SNP1 T T
#> 2: Sample2 SNP1 G G
#> 3: Sample3 SNP1 G G
#> 4: Sample1 SNP10 T T
#> 5: Sample2 SNP10 T C
#> 6: Sample3 SNP10 C C
The TYP format includes:
-
Sample_ID
: Sample identifier -
SNP_Name
: SNP identifier -
Allele1_Forward
andAllele2_Forward
: The two alleles
Handling Special Cases
Missing Genotypes
The enhanced function handles missing genotypes (-1) gracefully:
# Create test data with missing genotype
test_genotypes <- data.frame(
probeset_id = "SNP1",
Sample1 = -1
)
test_annotations <- data.frame(
Probe.Set.ID = "SNP1",
Chromosome = "1",
Physical.Position = 1000,
Allele.A = "A",
Allele.B = "G"
)
# Convert with missing genotype
result_missing <- genotype2allelev2(test_genotypes, test_annotations, verbose = FALSE)
result_missing$results
#> probeset_id Sample_ID allele_genotype
#> <char> <char> <char>
#> 1: SNP1 Sample1 --
Handling Indels
Let’s focus on how SNPutils handles indels:
# Filter for indels (SNP11-SNP20 in our example data)
indel_results <- subset(results$results, grepl("SNP1[1-9]|SNP20", probeset_id))
head(indel_results)
#> probeset_id Sample_ID allele_genotype
#> <char> <char> <char>
#> 1: SNP11 Sample1 D/D
#> 2: SNP11 Sample2 I/I
#> 3: SNP11 Sample3 D/D
#> 4: SNP12 Sample1 D/I
#> 5: SNP12 Sample2 D/D
#> 6: SNP12 Sample3 D/I
# Filter for indels (SNP11-SNP20 in our example data)
indel_results <- subset(results_v2$results, grepl("SNP1[1-9]|SNP20", probeset_id))
head(indel_results)
#> probeset_id Sample_ID allele_genotype
#> <char> <char> <char>
#> 1: SNP11 Sample1 D/D
#> 2: SNP11 Sample2 I/I
#> 3: SNP11 Sample3 D/D
#> 4: SNP12 Sample1 D/I
#> 5: SNP12 Sample2 D/D
#> 6: SNP12 Sample3 D/I
For indels, the genotype codes are interpreted as:
- 0: D/D (homozygous deletion)
- 1: D/I (heterozygous deletion/insertion)
- 2: I/I (homozygous insertion)
Function Comparison
Feature | genotype2allele() |
genotype2allelev2() |
---|---|---|
Basic conversion | ✓ | ✓ |
Validation | ✓ | ✓ |
Missing genotype handling | Limited | ✓ (-1 → “–”) |
Space-to-dash conversion | ✗ | ✓ |
Extended indel detection | Basic | ✓ (multi-character) |
Error handling | Basic | Basic |
Performance | Standard | Standard |
Working with Your Own Data
To use SNPutils with your own data, follow these steps:
- Read Your Genotype Data Use
readr::read_delim()
or similar to read your genotype data, especially if you need to skip header lines:
# Read genotype data
genotype_data <- read_delim(
"path/to/genotype.txt",
delim = "\t",
skip = 583, # Adjust based on your file structure
col_names = TRUE
)
# Read annotation data
annotation_data <- read.table(
"path/to/annotation.csv",
header = TRUE,
sep = ","
)
- Check and Prepare Your Data
Ensure your data has the required column names:
# For genotype data, ensure there's a column named "probeset_id"
if (!"probeset_id" %in% colnames(genotype_data)) {
# Rename the appropriate column
genotype_data <- rename(genotype_data, probeset_id = your_id_column_name)
}
# For annotation data, ensure there are columns named "Probe.Set.ID", "Allele.A", and "Allele.B"
required_cols <- c("Probe.Set.ID", "Allele.A", "Allele.B")
missing_cols <- required_cols[!required_cols %in% colnames(annotation_data)]
if (length(missing_cols) > 0) {
stop("Required columns missing from annotation data: ",
paste(missing_cols, collapse = ", "))
}
- Process Your Data Convert genotypes to alleles:
# Convert genotypes to alleles
results <- genotype2allele(genotype_data, annotation_data)
# Check for validation issues
if (nrow(results$validation$rule_violations) > 0) {
print(paste("Found", nrow(results$validation$rule_violations), "validation issues"))
head(results$validation$rule_violations)
}
# Convert to TYP format
typ_format <- allele2typ(results)
Using the updated function genotype2allelev2()
:
# Convert genotypes to alleles using enhanced function
results <- genotype2allelev2(genotype_data, annotation_data)
# Check for validation issues
if (nrow(results$validation$rule_violations) > 0) {
print(paste("Found", nrow(results$validation$rule_violations), "validation issues"))
head(results$validation$rule_violations)
}
# Convert to TYP format
typ_format <- allele2typ(results)
- Save Your Results Write the results to a file:
# Write TYP format to a file
write.csv(typ_format, "output_typ.csv", row.names = FALSE)