Package picard.util

Class SingleBarcodeDistanceMetric


  • public class SingleBarcodeDistanceMetric
    extends Object
    A class for finding the distance between a single barcode and a barcode-read (with base qualities)
    • Constructor Detail

      • SingleBarcodeDistanceMetric

        public SingleBarcodeDistanceMetric​(byte[] barcodeBases,
                                           byte[] readBases,
                                           byte[] readQualities,
                                           int minimumBaseQuality,
                                           int maximalInterestingDistance)
    • Method Detail

      • hammingDistance

        public int hammingDistance()
      • leniantHammingDistance

        @Deprecated
        public int leniantHammingDistance()
        Deprecated.
      • lenientHammingDistance

        public int lenientHammingDistance()
        Similar to Hamming distance but this version doesn't penalize matching bases with low quality for the read.
        Returns:
        the edit distance between the barcode(s) and the read(s)
      • freeDistance

        public int freeDistance()
        +++lifted from Commons Lang Text +++

        Modified to specific to comparing barcodes to reads. This means ignoring insertions or deletions in the last position as it would amount to double counting otherwise. Based on https://www.pnas.org/content/115/27/E6217

        Also, in case of passing the threshold this algorithm has been modified to return maximalInterestingDistance + 1

        This implementation only computes the distance if it is less than or equal to the threshold value, returning maximalInterestingDistance + 1 otherwise. The advantage is performance: unbounded distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only computing a diagonal stripe of width 2k + 1 of the cost table. It is also possible to use this to compute the unbounded Levenshtein distance by starting the threshold at 1 and doubling each time until the distance is found; this is O(dm), where d is the distance.

        One subtlety comes from needing to ignore entries on the border of our stripe eg. p[] = |#|#|#|* d[] = *|#|#|#| We must ignore the entry to the leftRev of the leftmost member We must ignore the entry above the rightmost member. * As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1. In this case we're going to walk a stripe of length 3. The matrix would look like so:

            1 2 3 4 5
         1 |#|#| | | |
         2 |#|#|#| | |
         3 | |#|#|#| |
         4 | | |#|#|#|
         5 | | | |#|#|
         
        in addition, we terminate the calculation early and return threshold +1 if we detect that there is no way to get to the bottom right corner without going over the thershold. This implementation decreases memory usage by using two single-dimensional arrays and swapping them back and forth instead of allocating an entire n by m matrix. This requires a few minor changes, such as immediately returning when it's detected that the stripe has run off the matrix and initially filling the arrays with large values so that entries we don't compute are ignored. See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.