Class VariantContext

  • All Implemented Interfaces:
    Locatable, Feature, Serializable

    public class VariantContext
    extends Object
    implements Feature, Serializable

    High-level overview

    The VariantContext object is a single general class system for representing genetic variation data composed of:
    • Allele: representing single genetic haplotypes (A, T, ATC, -) (note that null alleles are used here for illustration; see the Allele class for how to represent indels)
    • Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus
    • VariantContext: a class holding all segregating alleles at a locus as well as genotypes for multiple individuals containing alleles at that locus

    The class system works by defining segregating alleles, creating a variant context representing the segregating information at a locus, and potentially creating and associating genotypes with individuals in the context.

    All of the classes are highly validating -- call validate() if you modify them -- so you can rely on the self-consistency of the data once you have a VariantContext in hand. The system has a rich set of accessor and manipulator routines, as well as more complex static support routines in VariantContextUtils.

    The VariantContext (and Genotype) objects are attributed (supporting addition of arbitrary key/value pairs) and filtered (can represent a variation that is viewed as suspect).

    VariantContexts are dynamically typed, so whether a VariantContext is a SNP, Indel, or NoVariant depends on the properties of the alleles in the context. See the detailed documentation on the Type parameter below.

    It's also easy to create subcontexts based on selected genotypes.

    Working with Variant Contexts

    By default, VariantContexts are immutable. In the rare circumstances where you need setter routines, you must create MutableVariantContexts and MutableGenotypes.

    Some example data

     Allele A, Aref, T, Tref;
     Allele del, delRef, ATC, ATCref;
    

    A [ref] / T at 10

     
     GenomeLoc snpLoc = GenomeLocParser.createGenomeLoc("chr1", 10, 10);
    

    A / ATC [ref] from 20-22

     GenomeLoc delLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 22);
    

    A [ref] / ATC immediately after 20

     GenomeLoc insLoc = GenomeLocParser.createGenomeLoc("chr1", 20, 20);
    

    Alleles

    See the documentation in the Allele class itself

    What are they?

    Alleles can be either reference or non-reference

    Examples of alleles used here:

       A = new Allele("A");
       Aref = new Allele("A", true);
       T = new Allele("T");
       ATC = new Allele("ATC");
    

    Creating variant contexts

    By hand

    Here's an example of an A/T polymorphism with the A being reference:
     VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref, T));
     
    If you want to create a non-variant site, just put in a single reference allele:
     VariantContext vc = new VariantContext(name, snpLoc, Arrays.asList(Aref));
     
    A deletion is just as easy:
     VariantContext vc = new VariantContext(name, delLoc, Arrays.asList(ATCref, del));
     
    The only thing that distinguishes an insertion from a deletion is which allele is the reference. An insertion has a reference allele that is shorter than the non-reference allele, and vice versa for deletions.
     VariantContext vc = new VariantContext("name", insLoc, Arrays.asList(delRef, ATC));
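    The length rule above can be sketched without the library. This is a minimal illustration, assuming alleles are plain base strings and the null allele ("-" in the examples) is modelled as an empty string; the class and method names are hypothetical and are not part of the VariantContext API.

```java
// Hypothetical helper, not part of the library: classifies a biallelic event
// purely from the lengths of the reference and alternate allele strings.
class IndelKindSketch {
    static String classify(String ref, String alt) {
        if (ref.length() == alt.length()) {
            // Same length: either identical bases or a substitution.
            return ref.equals(alt) ? "NO_VARIATION" : "SUBSTITUTION";
        }
        // Reference shorter than the alternate means bases were inserted.
        return ref.length() < alt.length() ? "INSERTION" : "DELETION";
    }

    public static void main(String[] args) {
        System.out.println(classify("", "ATC"));  // null ref, ATC alt
        System.out.println(classify("ATC", ""));  // ATC ref, null alt
        System.out.println(classify("A", "T"));   // single-base change
    }
}
```

    Swapping which allele is marked as reference flips the result, which is the point made in the paragraph above.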
     

    Converting rods and other data structures to VariantContexts

    You can convert many common types into VariantContexts using the general function:
     VariantContextAdaptors.convertToVariantContext(name, myObject)
     
    dbSNP and VCFs, for example, can be passed in as myObject and a VariantContext corresponding to that object will be returned. A null return value indicates that the type isn't yet supported. This is the best and easiest way to create contexts using RODs.

    Working with genotypes

     List<Allele> alleles = Arrays.asList(Aref, T);
     Genotype g1 = new Genotype(Arrays.asList(Aref, Aref), "g1", 10);
     Genotype g2 = new Genotype(Arrays.asList(Aref, T), "g2", 10);
     Genotype g3 = new Genotype(Arrays.asList(T, T), "g3", 10);
     VariantContext vc = new VariantContext(snpLoc, alleles, Arrays.asList(g1, g2, g3));
     
    At this point we have 3 genotypes in our context, g1-g3. You can access a good deal of information about the genotypes through the VariantContext:
     vc.hasGenotypes()
     vc.isMonomorphicInSamples()
     vc.isPolymorphicInSamples()
     vc.getSamples().size()
    
     vc.getGenotypes()
     vc.getGenotypes().get("g1")
     vc.hasGenotype("g1")
    
     vc.getCalledChrCount()
     vc.getCalledChrCount(Aref)
     vc.getCalledChrCount(T)
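    To make the chromosome counts concrete, here is a toy model of what the getCalledChrCount calls would be expected to report for g1-g3. It is a sketch only: genotypes are modelled as lists of allele strings and "." is an assumed stand-in for a NO_CALL allele; none of these names come from the library.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not the library's classes: a genotype is a list of allele
// strings, and "." stands in for a NO_CALL allele.
class ChrCountSketch {
    static final String NO_CALL = ".";

    // Counts every called chromosome, skipping NO_CALL entries.
    static int calledChrCount(List<List<String>> genotypes) {
        int n = 0;
        for (List<String> genotype : genotypes) {
            for (String allele : genotype) {
                if (!NO_CALL.equals(allele)) n++;
            }
        }
        return n;
    }

    // Counts the chromosomes carrying one specific allele.
    static int calledChrCount(List<List<String>> genotypes, String target) {
        int n = 0;
        for (List<String> genotype : genotypes) {
            for (String allele : genotype) {
                if (target.equals(allele)) n++;
            }
        }
        return n;
    }

    // The example genotypes g1-g3 (A/A, A/T, T/T) plus an all-NO_CALL sample.
    static List<List<String>> exampleGenotypes() {
        return Arrays.asList(
            Arrays.asList("A", "A"),
            Arrays.asList("A", "T"),
            Arrays.asList("T", "T"),
            Arrays.asList(NO_CALL, NO_CALL));
    }

    public static void main(String[] args) {
        List<List<String>> gts = exampleGenotypes();
        System.out.println(calledChrCount(gts));       // 6
        System.out.println(calledChrCount(gts, "A"));  // 3
        System.out.println(calledChrCount(gts, "T"));  // 3
    }
}
```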
     

    NO_CALL alleles

    The system allows one to create Genotypes carrying special NO_CALL alleles that aren't present in the set of context alleles and that represent undetermined alleles in a genotype:
     Genotype g4 = new Genotype(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL), "NO_DATA_FOR_SAMPLE", 10);
    

    Subcontexts

    It's also very easy to get a subcontext based only on the data in a subset of the genotypes:
     VariantContext vc12 = vc.subContextFromGenotypes(Arrays.asList(g1,g2));
     VariantContext vc1 = vc.subContextFromGenotypes(Arrays.asList(g1));
     

    Fully decoding

    Currently, VariantContexts allow some fields, particularly those stored as generic attributes, to be of any type. For example, a field AB might naturally be a floating point number, 0.51, but when it's read into a VC it's not decoded into the Java representation but left as the string "0.51". A fully decoded VariantContext is one where all values have been converted to their corresponding Java object types, based on the types declared in a VCFHeader. The fullyDecode(...) method takes a header object and creates a new fully decoded VariantContext where all fields are converted to their true Java representation. The VCBuilder can be told that all fields are fully decoded, in which case no work is done when asking for a fully decoded version of the VC.
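    The idea of decoding can be illustrated with a small stand-alone sketch. It assumes a header that is nothing more than a map from field name to a declared type; the FieldType enum and both methods are hypothetical and do not reflect how VCFHeader or fullyDecode(...) are implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of "fully decoding": string attributes are converted
// to Java types according to a declared type per field.
class FullyDecodeSketch {
    enum FieldType { INTEGER, FLOAT, STRING }

    // Converts one raw value; non-strings and undeclared fields pass through.
    static Object decodeOne(Object value, FieldType type) {
        if (!(value instanceof String) || type == null) return value;
        String s = (String) value;
        switch (type) {
            case INTEGER: return Integer.valueOf(s);
            case FLOAT:   return Double.valueOf(s);
            default:      return s;
        }
    }

    // Builds a new map and leaves the input untouched, mirroring the idea
    // that decoding yields a new context rather than mutating the old one.
    static Map<String, Object> decode(Map<String, Object> raw,
                                      Map<String, FieldType> declared) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : raw.entrySet()) {
            out.put(e.getKey(), decodeOne(e.getValue(), declared.get(e.getKey())));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("AB", "0.51");
        Map<String, FieldType> declared = new LinkedHashMap<>();
        declared.put("AB", FieldType.FLOAT);
        Object ab = decode(raw, declared).get("AB");
        System.out.println(ab instanceof Double);  // true
    }
}
```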
    See Also:
    Serialized Form
    • Field Detail

      • PASSES_FILTERS

        public static final Set<String> PASSES_FILTERS
      • contig

        protected final String contig
        The location of this VariantContext
      • start

        protected final long start
      • stop

        protected final long stop
      • type

        protected VariantContext.Type type
        The type (cached for performance reasons) of this context
      • alleles

        protected final List<Allele> alleles
        A set of the alleles segregating in this context
      • genotypes

        protected GenotypesContext genotypes
        A mapping from sampleName -> genotype objects for all genotypes associated with this context
      • genotypeCounts

        protected int[] genotypeCounts
        Counts for each of the possible Genotype types in this context
      • VALID_FILTER

        public static final Pattern VALID_FILTER
    • Constructor Detail

      • VariantContext

        protected VariantContext​(VariantContext other)
        Copy constructor
        Parameters:
        other - the VariantContext to copy
      • VariantContext

        protected VariantContext​(String source,
                                 String ID,
                                 String contig,
                                 long start,
                                 long stop,
                                 Collection<Allele> alleles,
                                 GenotypesContext genotypes,
                                 double log10PError,
                                 Set<String> filters,
                                 Map<String,​Object> attributes,
                                 boolean fullyDecoded,
                                 EnumSet<VariantContext.Validation> validationToPerform)
        the actual constructor. Private access only
        Parameters:
        source - source
        contig - the contig
        start - the start base (one based)
        stop - the stop reference base (one based)
        alleles - alleles
        genotypes - genotypes map
        log10PError - qual
        filters - filters: use null for unfiltered and empty set for passes filters
        attributes - attributes
        validationToPerform - set of validation steps to take
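        The filters parameter distinguishes three states: null (unfiltered), an empty set (passes filters) and a non-empty set (filtered). The sketch below shows one way to interpret such a value; the class and its methods are hypothetical illustrations, not the library's own code.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical interpretation of the three filter states described above.
class FilterStateSketch {
    // null means no filtering was ever run on the record.
    static boolean filtersWereApplied(Set<String> filters) {
        return filters != null;
    }

    // Only a non-empty set marks the record as failing a filter.
    static boolean isFiltered(Set<String> filters) {
        return filters != null && !filters.isEmpty();
    }

    static String describe(Set<String> filters) {
        if (filters == null) return "unfiltered";
        return filters.isEmpty() ? "PASS" : "filtered";
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(Collections.<String>emptySet()));
        System.out.println(describe(Collections.singleton("LowQual")));
    }
}
```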
    • Method Detail

      • subContextFromSamples

        public VariantContext subContextFromSamples​(Set<String> sampleNames,
                                                    boolean rederiveAllelesFromGenotypes)
        This method subsets down to a set of samples. If rederiveAllelesFromGenotypes is true, it also reduces the alleles to just those in use by the samples; otherwise the full set of alleles in this VC is returned as the set of alleles in the subContext, even if some of those alleles aren't in the samples. WARNING: BE CAREFUL WITH rederiveAllelesFromGenotypes UNLESS YOU KNOW WHAT YOU ARE DOING
        Parameters:
        sampleNames - the sample names
        rederiveAllelesFromGenotypes - if true, reduces the alleles to just those in use by the samples; true should be the default
        Returns:
        new VariantContext subsetting to just the given samples
      • getType

        public VariantContext.Type getType()
        Determines (if necessary) and returns the type of this variation by examining the alleles it contains.
        Returns:
        the type of this VariantContext
      • isSNP

        public boolean isSNP()
        convenience method for SNPs
        Returns:
        true if this is a SNP, false otherwise
      • isVariant

        public boolean isVariant()
        convenience method for variants
        Returns:
        true if this is a variant allele, false if it's reference
      • isPointEvent

        public boolean isPointEvent()
        convenience method for point events
        Returns:
        true if this is a SNP or ref site, false if it's an indel or mixed event
      • isIndel

        public boolean isIndel()
        convenience method for indels
        Returns:
        true if this is an indel, false otherwise
      • isSimpleInsertion

        public boolean isSimpleInsertion()
        Returns:
        true if the alleles indicate a simple insertion (i.e., the reference allele is Null)
      • isSimpleDeletion

        public boolean isSimpleDeletion()
        Returns:
        true if the alleles indicate a simple deletion (i.e., a single alt allele that is Null)
      • isSimpleIndel

        public boolean isSimpleIndel()
        Returns:
        true if the alleles indicate a simple indel, false otherwise.
      • isComplexIndel

        public boolean isComplexIndel()
        Returns:
        true if the alleles indicate neither a simple deletion nor a simple insertion
      • isSymbolic

        public boolean isSymbolic()
      • isStructuralIndel

        public boolean isStructuralIndel()
      • isSymbolicOrSV

        public boolean isSymbolicOrSV()
        Returns:
        true if the variant is symbolic or a large indel
      • isMNP

        public boolean isMNP()
      • isMixed

        public boolean isMixed()
        convenience method for mixed events
        Returns:
        true if this is a mixed variation, false otherwise
      • hasID

        public boolean hasID()
      • emptyID

        public boolean emptyID()
      • getID

        public String getID()
      • getSource

        public String getSource()
      • getFiltersMaybeNull

        public Set<String> getFiltersMaybeNull()
      • getFilters

        public Set<String> getFilters()
      • isFiltered

        public boolean isFiltered()
      • isNotFiltered

        public boolean isNotFiltered()
      • filtersWereApplied

        public boolean filtersWereApplied()
      • hasLog10PError

        public boolean hasLog10PError()
      • getLog10PError

        public double getLog10PError()
      • getPhredScaledQual

        public double getPhredScaledQual()
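        The relationship between the two quality accessors follows the usual Phred convention, Q = -10 * log10(P(error)). The sketch below is a stand-alone illustration of that arithmetic, with a hypothetical class name; it is not the library's source.

```java
// Stand-alone illustration of Phred scaling; not the library's code.
class PhredSketch {
    // Converts a log10 error probability into a Phred-scaled quality.
    // Adding 0.0 turns a possible -0.0 result into plain 0.0.
    static double phredScaledQual(double log10PError) {
        return (-10.0 * log10PError) + 0.0;
    }

    public static void main(String[] args) {
        // An error probability of 10^-3 corresponds to Q30.
        System.out.println(phredScaledQual(-3.0));  // 30.0
    }
}
```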
      • hasAttribute

        public boolean hasAttribute​(String key)
      • getAttributeAsString

        public String getAttributeAsString​(String key,
                                           String defaultValue)
      • getAttributeAsInt

        public int getAttributeAsInt​(String key,
                                     int defaultValue)
      • getAttributeAsDouble

        public double getAttributeAsDouble​(String key,
                                           double defaultValue)
      • getAttributeAsBoolean

        public boolean getAttributeAsBoolean​(String key,
                                             boolean defaultValue)
      • getAttributeAsList

        public List<Object> getAttributeAsList​(String key)
        Returns the value as an empty list if the key was not found, as a java.util.List if the value is a List or an Array, or as a Collections.singletonList if there is only one value.
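        The three cases can be sketched as a small normalizing function. This is an assumption-laden illustration: it takes the already looked-up value (null when the key is missing), handles only object arrays, and uses a hypothetical class name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical normalizer mirroring the three documented cases.
class AttributeListSketch {
    @SuppressWarnings("unchecked")
    static List<Object> asList(Object value) {
        if (value == null) return Collections.emptyList();      // key not found
        if (value instanceof List) return (List<Object>) value; // already a list
        if (value instanceof Object[]) return Arrays.asList((Object[]) value);
        return Collections.singletonList(value);                // single value
    }

    public static void main(String[] args) {
        System.out.println(asList(null).size());
        System.out.println(asList("0.51").size());
        System.out.println(asList(new String[] {"a", "b"}).size());
    }
}
```

        In this sketch a primitive array such as int[] would be wrapped as a single element, so callers who need per-element access should pass boxed arrays or lists.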
      • getAttributeAsStringList

        public List<String> getAttributeAsStringList​(String key,
                                                     String defaultValue)
      • getAttributeAsIntList

        public List<Integer> getAttributeAsIntList​(String key,
                                                   int defaultValue)
      • getAttributeAsDoubleList

        public List<Double> getAttributeAsDoubleList​(String key,
                                                     double defaultValue)
      • getCommonInfo

        public CommonInfo getCommonInfo()
      • getReference

        public Allele getReference()
        Returns:
        the reference allele for this context
      • isBiallelic

        public boolean isBiallelic()
        Returns:
        true if the context is strictly bi-allelic
      • getNAlleles

        public int getNAlleles()
        Returns:
        The number of segregating alleles in this context
      • getMaxPloidy

        public int getMaxPloidy​(int defaultPloidy)
        Returns the maximum ploidy of all samples in this VC, or default if there are no genotypes. This function is caching, so it's only expensive on the first call.
        Parameters:
        defaultPloidy - the default ploidy, if all samples are no-called
        Returns:
        default, or the max ploidy
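        As a rough sketch of the fallback behaviour, assume each sample's ploidy is already available as an integer. The helper below returns the default both when there are no samples and when the maximum is zero; the names are hypothetical and caching is omitted.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: maximum ploidy with a default fallback.
class MaxPloidySketch {
    static int maxPloidy(List<Integer> samplePloidies, int defaultPloidy) {
        int max = 0;
        for (int ploidy : samplePloidies) {
            max = Math.max(max, ploidy);
        }
        // No samples, or nothing with a usable ploidy: use the default.
        return max == 0 ? defaultPloidy : max;
    }

    public static void main(String[] args) {
        System.out.println(maxPloidy(Arrays.asList(2, 1, 2), 2));
        System.out.println(maxPloidy(Collections.<Integer>emptyList(), 2));
    }
}
```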
      • getAllele

        public Allele getAllele​(String allele)
        Returns:
        The allele sharing the same bases as this String. A convenience method; better to use byte[]
      • getAllele

        public Allele getAllele​(byte[] allele)
        Returns:
        The allele sharing the same bases as this byte[], or null if no such allele is present.
      • hasAllele

        public boolean hasAllele​(Allele allele)
        Returns:
        True if this context contains Allele allele, or false otherwise
      • hasAllele

        public boolean hasAllele​(Allele allele,
                                 boolean ignoreRefState)
      • hasAlternateAllele

        public boolean hasAlternateAllele​(Allele allele)
      • hasAlternateAllele

        public boolean hasAlternateAllele​(Allele allele,
                                          boolean ignoreRefState)
      • getAlleles

        public List<Allele> getAlleles()
        Gets the alleles. This method should return all of the alleles present at the location, including the reference allele. There are no constraints imposed on the ordering of alleles in the set. If the reference is not an allele in this context it will not be included.
        Returns:
        the set of alleles
      • getAlternateAlleles

        public List<Allele> getAlternateAlleles()
        Gets the alternate alleles. This method should return all the alleles present at the location, NOT including the reference allele. There are no constraints imposed on the ordering of alleles in the set.
        Returns:
        the set of alternate alleles
      • getIndelLengths

        public List<Integer> getIndelLengths()
        Gets the sizes of the alternate alleles if they are insertion/deletion events, and returns a list of their sizes
        Returns:
        a list of indel lengths ( null if not of type indel or mixed )
      • getAlternateAllele

        public Allele getAlternateAllele​(int i)
        Parameters:
        i - the ith allele (from 0 to n - 2 for a context with n alleles including a reference allele)
        Returns:
        the ith non-reference allele in this context
        Throws:
        IllegalArgumentException - if i is invalid
      • hasSameAllelesAs

        public boolean hasSameAllelesAs​(VariantContext other)
        Parameters:
        other - VariantContext whose alleles to compare against
        Returns:
        true if this VariantContext has the same alleles (both ref and alts) as other, regardless of ordering. Otherwise returns false.
      • hasSameAlternateAllelesAs

        public boolean hasSameAlternateAllelesAs​(VariantContext other)
        Parameters:
        other - VariantContext whose alternate alleles to compare against
        Returns:
        true if this VariantContext has the same alternate alleles as other, regardless of ordering. Otherwise returns false.
      • getNSamples

        public int getNSamples()
        Returns:
        the number of samples in the context
      • hasGenotypes

        public boolean hasGenotypes()
        Returns:
        true if the context has associated genotypes
      • hasGenotypes

        public boolean hasGenotypes​(Collection<String> sampleNames)
      • getGenotypes

        public GenotypesContext getGenotypes()
        Returns:
        set of all Genotypes associated with this context
      • getGenotypesOrderedByName

        public Iterable<Genotype> getGenotypesOrderedByName()
      • getGenotypes

        public GenotypesContext getGenotypes​(String sampleName)
        Returns a map from sampleName -> Genotype for the genotype associated with sampleName. Returns a map for consistency with the multi-get function.
        Parameters:
        sampleName - the sample name
        Returns:
        mapping from sample name to genotype
        Throws:
        IllegalArgumentException - if sampleName isn't bound to a genotype
      • getGenotypes

        protected GenotypesContext getGenotypes​(Collection<String> sampleNames)
        Returns a map from sampleName -> Genotype for each sampleName in sampleNames. Returns a map for consistency with the multi-get function. For testing convenience only
        Parameters:
        sampleNames - a unique list of sample names
        Returns:
        subsetting genotypes context
        Throws:
        IllegalArgumentException - if sampleName isn't bound to a genotype
      • getSampleNames

        public Set<String> getSampleNames()
        Returns:
        the set of all sample names in this context, not ordered
      • getSampleNamesOrderedByName

        public List<String> getSampleNamesOrderedByName()
      • getGenotype

        public Genotype getGenotype​(String sample)
        Parameters:
        sample - the sample name
        Returns:
        the Genotype associated with the given sample in this context or null if the sample is not in this context
      • hasGenotype

        public boolean hasGenotype​(String sample)
      • getGenotype

        public Genotype getGenotype​(int ith)
        Parameters:
        ith - the sample index
        Returns:
        the ith genotype in this context or null if there aren't that many genotypes
      • getCalledChrCount

        public int getCalledChrCount()
        Returns the number of chromosomes carrying any allele in the genotypes (i.e., excluding NO_CALLS)
        Returns:
        chromosome count
      • getCalledChrCount

        public int getCalledChrCount​(Set<String> sampleIds)
        Returns the number of chromosomes carrying any allele in the genotypes (i.e., excluding NO_CALLS)
        Parameters:
        sampleIds - IDs of samples to take into account. If empty then all samples are included.
        Returns:
        chromosome count
      • getCalledChrCount

        public int getCalledChrCount​(Allele a)
        Returns the number of chromosomes carrying allele A in the genotypes
        Parameters:
        a - allele
        Returns:
        chromosome count
      • getCalledChrCount

        public int getCalledChrCount​(Allele a,
                                     Set<String> sampleIds)
        Returns the number of chromosomes carrying allele A in the genotypes
        Parameters:
        a - allele
        sampleIds - IDs of samples to take into account. If empty then all samples are included.
        Returns:
        chromosome count
      • isMonomorphicInSamples

        public boolean isMonomorphicInSamples()
        Genotype-specific functions -- are the genotypes monomorphic w.r.t. the alleles segregating at this site? That is, is the number of alternate alleles among all of the genotypes == 0?
        Returns:
        true if it's monomorphic
      • isPolymorphicInSamples

        public boolean isPolymorphicInSamples()
        Genotype-specific functions -- are the genotypes polymorphic w.r.t. the alleles segregating at this site? That is, is the number of alternate alleles among all of the genotypes > 0?
        Returns:
        true if it's polymorphic
      • getNoCallCount

        public int getNoCallCount()
        Genotype-specific functions -- how many no-calls are there in the genotypes?
        Returns:
        number of no calls
      • getHomRefCount

        public int getHomRefCount()
        Genotype-specific functions -- how many hom ref calls are there in the genotypes?
        Returns:
        number of hom ref calls
      • getHetCount

        public int getHetCount()
        Genotype-specific functions -- how many het calls are there in the genotypes?
        Returns:
        number of het calls
      • getHomVarCount

        public int getHomVarCount()
        Genotype-specific functions -- how many hom var calls are there in the genotypes?
        Returns:
        number of hom var calls
      • getMixedCount

        public int getMixedCount()
        Genotype-specific functions -- how many mixed calls are there in the genotypes?
        Returns:
        number of mixed calls
      • extraStrictValidation

        public void extraStrictValidation​(Allele reportedReference,
                                          Allele observedReference,
                                          Set<String> rsIDs)
        Run all extra-strict validation tests on a Variant Context object
        Parameters:
        reportedReference - the reported reference allele
        observedReference - the observed reference allele
        rsIDs - the true dbSNP IDs
      • validateReferenceBases

        public void validateReferenceBases​(Allele reportedReference,
                                           Allele observedReference)
      • validateRSIDs

        public void validateRSIDs​(Set<String> rsIDs)
      • validateAlternateAlleles

        public void validateAlternateAlleles()
      • validateChromosomeCounts

        public void validateChromosomeCounts()
      • toStringDecodeGenotypes

        public String toStringDecodeGenotypes()
      • toStringWithoutGenotypes

        public String toStringWithoutGenotypes()
      • fullyDecode

        public VariantContext fullyDecode​(VCFHeader header,
                                          boolean lenientDecoding)
        Return a VC equivalent to this one but where all fields are fully decoded. See the VariantContext documentation about fully decoded contexts.
        Parameters:
        header - containing types about all fields in this VC
        Returns:
        a fully decoded version of this VC
      • isFullyDecoded

        public boolean isFullyDecoded()
        See VariantContext document about fully decoded
        Returns:
        true if this is a fully decoded VC
      • getContig

        public String getContig()
        Description copied from interface: Locatable
        Gets the contig name for the contig this is mapped to. May return null if there is no unique mapping.
        Specified by:
        getContig in interface Locatable
        Returns:
        name of the contig this is mapped to, potentially null
      • getStart

        public int getStart()
        Returns 1-based inclusive start position of the variant.

        INDEL events usually start on the first unaltered reference base before the INDEL.

        Warning: be aware that the start position of the VariantContext is defined in terms of the start position specified in the underlying VCF file. VariantContexts representing the same biological event may have different start positions depending on the specifics of the VCF file they are derived from.

        Warning: Note also that the VCF spec allows 0 and N + 1 for POS field for telomeric event, where N is the length of the chromosome. The "0" value returned should be interpreted as telomere, and does not violate the above "1-based" comment. Code consuming the returned start should be prepared for such out-of-the-ordinary values.

        Specified by:
        getStart in interface Locatable
        Returns:
        0 or greater.
      • getEnd

        public int getEnd()
        Specified by:
        getEnd in interface Locatable
        Returns:
        1-based closed end position of the variant. If the END info field is specified, that value is returned; otherwise the end is the start + reference allele length - 1. For VariantContexts with a single alternate allele, if that allele is an insertion, the end position will be on the reference base before the insertion event. If the single alt allele is a deletion, the end will be on the final deleted reference base.
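        The end-position rule can be checked with a little arithmetic. The sketch below assumes VCF-style alleles that include the leading unaltered reference base, and it passes the END value as a nullable Integer; the class and method are hypothetical.

```java
// Hypothetical helper reproducing the documented end-position rule.
class EndPositionSketch {
    // endInfo models the END info field; null means it is absent.
    static int end(int start, String refBases, Integer endInfo) {
        if (endInfo != null) return endInfo;
        return start + refBases.length() - 1;
    }

    public static void main(String[] args) {
        // Insertion A -> ATC at 20: end stays on the base before the insertion.
        System.out.println(end(20, "A", null));    // 20
        // Deletion ATC -> A at 20: end is the final deleted reference base.
        System.out.println(end(20, "ATC", null));  // 22
        // An explicit END value takes precedence.
        System.out.println(end(100, "A", 5000));   // 5000
    }
}
```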
      • isReferenceBlock

        public boolean isReferenceBlock()
        Returns:
        true if the variant context is a reference block
      • hasSymbolicAlleles

        public boolean hasSymbolicAlleles()
      • hasSymbolicAlleles

        public static boolean hasSymbolicAlleles​(List<Allele> alleles)
      • getAltAlleleWithHighestAlleleCount

        public Allele getAltAlleleWithHighestAlleleCount()
      • getAlleleIndex

        public int getAlleleIndex​(Allele allele)
        Lookup the index of allele in this variant context
        Parameters:
        allele - the allele whose index we want to get
        Returns:
        the index of the allele into getAlleles(), or -1 if it cannot be found
      • getAlleleIndices

        public List<Integer> getAlleleIndices​(Collection<Allele> alleles)
        Return the allele index (see getAlleleIndex) for each allele in alleles
        Parameters:
        alleles - the alleles we want to look up
        Returns:
        a list of indices for each allele, in order
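        A minimal sketch of the lookup, assuming alleles are compared as plain strings and that a missing allele maps to -1 as getAlleleIndex documents. The class name is hypothetical, and how the real method treats missing alleles is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical index lookup over a context's ordered allele list.
class AlleleIndexSketch {
    static List<Integer> indices(List<String> contextAlleles, List<String> query) {
        List<Integer> out = new ArrayList<>();
        for (String allele : query) {
            // List.indexOf returns -1 when the allele is absent.
            out.add(contextAlleles.indexOf(allele));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> context = Arrays.asList("A", "T", "G");
        System.out.println(indices(context, Arrays.asList("G", "A", "C")));
    }
}
```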
      • getGLIndicesOfAlternateAllele

        public int[] getGLIndicesOfAlternateAllele​(Allele targetAllele)
      • getStructuralVariantType

        public StructuralVariantType getStructuralVariantType()
        Search for the INFO=SVTYPE and return the type of Structural Variant
        Returns:
        the StructuralVariantType, or null if there is no property SVTYPE