Class IlluminaDataProviderFactory


  • public class IlluminaDataProviderFactory
    extends Object
    IlluminaDataProviderFactory accepts options for parsing Illumina data files for a lane and creates an IlluminaDataProvider, an iterator over the ClusterData for that lane, which utilizes these options.

    Note: IlluminaDataProviderFactory tends to be used in multithreaded environments (e.g. IlluminaBasecallsToSam calls makeDataProvider in a different thread per tile), so it has been made essentially immutable. makeDataProvider/getTiles are now idempotent, at least as far as IlluminaDataProviderFactory itself is concerned (many file handles and other resources are opened when makeDataProvider is called). In the future, dataTypes may be provided to the makeDataProvider factory methods so that configuration is not done multiple times for the same basecallDirectory in client code.
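The per-tile threading pattern described in the note can be sketched as follows. This is a minimal illustration, not the code used by IlluminaBasecallsToSam: `TileFactorySketch` is a hypothetical stand-in interface that models only getAvailableTiles and makeDataProvider, clusters are represented as plain strings instead of ClusterData, and `countClustersPerTile` is an invented helper name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the factory: only the two calls used below are modeled,
// and clusters are plain strings rather than ClusterData.
interface TileFactorySketch {
    List<Integer> getAvailableTiles();
    Iterable<String> makeDataProvider(List<Integer> requestedTiles);
}

class PerTileThreadingSketch {
    // Shares one factory across worker threads and creates one provider per tile.
    // Returns the number of clusters read for each tile, in tile order.
    static List<Integer> countClustersPerTile(TileFactorySketch factory, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Integer tile : factory.getAvailableTiles()) {
                Callable<Integer> task = () -> {
                    int count = 0;
                    // Each task builds its own provider; the factory itself is not mutated.
                    for (String ignored : factory.makeDataProvider(Collections.singletonList(tile))) {
                        count++;
                    }
                    return count;
                };
                pending.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> result : pending) {
                counts.add(result.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every task only reads from the shared factory, no locking is needed around makeDataProvider in this sketch; the real provider still opens its own file handles per call.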

    • Constructor Detail

      • IlluminaDataProviderFactory

        public IlluminaDataProviderFactory​(File basecallDirectory,
                                           int lane,
                                           ReadStructure readStructure,
                                           BclQualityEvaluationStrategy bclQualityEvaluationStrategy,
                                           Set<IlluminaDataType> dataTypes)
        Create factory with the specified options, one that favors using QSeqs over all other files
        Parameters:
        basecallDirectory - The baseCalls directory of a complete Illumina directory. Files are found by searching relative to this folder (some of them higher up in the directory tree).
        lane - Which lane to iterate over.
        readStructure - The read structure to which output clusters will conform. When not using QSeqs, EAMSS masking (see BclParser) is run on individual reads as found in the readStructure. If the readStructure specified does not match the readStructure implied by the sequencer's output, then the quality scores output may differ from what would be found in a run's QSeq files.
        bclQualityEvaluationStrategy - The basecall quality evaluation strategy that is applied to decoded base calls.
        dataTypes - Which data types to read
      • IlluminaDataProviderFactory

        public IlluminaDataProviderFactory​(File basecallDirectory,
                                           File barcodesDirectory,
                                           int lane,
                                           ReadStructure readStructure,
                                           BclQualityEvaluationStrategy bclQualityEvaluationStrategy,
                                           Set<IlluminaDataType> dataTypes)
        Create factory with the specified options, one that favors using QSeqs over all other files
        Parameters:
        basecallDirectory - The baseCalls directory of a complete Illumina directory. Files are found by searching relative to this folder (some of them higher up in the directory tree).
        barcodesDirectory - The directory containing barcode files extracted by 'ExtractIlluminaBarcodes'. This will be set to `basecallDirectory` if null.
        lane - Which lane to iterate over.
        readStructure - The read structure to which output clusters will conform. When not using QSeqs, EAMSS masking (see BclParser) is run on individual reads as found in the readStructure. If the readStructure specified does not match the readStructure implied by the sequencer's output, then the quality scores output may differ from what would be found in a run's QSeq files.
        bclQualityEvaluationStrategy - The basecall quality evaluation strategy that is applied to decoded base calls.
        dataTypes - Which data types to read
    • Method Detail

      • getOutputReadStructure

        public ReadStructure getOutputReadStructure()
        Sometimes (in the case of skipped reads) the logical read structure of the output cluster data is different from the input readStructure
        Returns:
        The ReadStructure describing the output cluster data
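How skipped reads change the output structure can be illustrated on the string form of a read structure. This is a rough sketch that assumes the usual `<length><operator>` notation with `S` marking a skipped segment; `dropSkips` is a hypothetical helper that works on strings and is not part of ReadStructure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutputReadStructureSketch {
    // One segment: a length followed by a single operator letter, e.g. "25T" or "8B".
    private static final Pattern SEGMENT = Pattern.compile("(\\d+)([A-Z])");

    // Hypothetical helper: returns the read structure string with all skip (S) segments removed.
    static String dropSkips(String readStructure) {
        StringBuilder out = new StringBuilder();
        Matcher m = SEGMENT.matcher(readStructure);
        while (m.find()) {
            if (!m.group(2).equals("S")) {
                out.append(m.group(1)).append(m.group(2));
            }
        }
        return out.toString();
    }
}
```

Under these assumptions an input of `25T8B25S` would yield `25T8B`, while a structure with no skips comes back unchanged.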
      • getAvailableTiles

        public List<Integer> getAvailableTiles()
        Return the list of tiles available for this flowcell and lane. These are in ascending numerical order.
        Returns:
        List of all tiles available for this flowcell and lane.
      • setApplyEamssFiltering

        public void setApplyEamssFiltering​(boolean applyEamssFiltering)
        Sets whether or not EAMSS filtering will be applied if parsing BCL files for bases and quality scores.
      • makeDataProvider

        public BaseIlluminaDataProvider makeDataProvider​(List<Integer> requestedTiles)
        Call this method to create a ClusterData iterator over the specified tiles.
        Returns:
        An iterator for reading the Illumina basecall output for the lane specified in the constructor.
      • findUnmatchedTypes

        public static Set<IlluminaDataType> findUnmatchedTypes​(Set<IlluminaDataType> requestedDataTypes,
                                                               Map<IlluminaFileUtil.SupportedIlluminaFormat,​Set<IlluminaDataType>> formatToMatchedTypes)
        Given a map of formats to the data types they provide, find any requested data types that have no format associated with them and return them.
        Parameters:
        requestedDataTypes - Data types that need to be provided
        formatToMatchedTypes - A map of file formats to the data types they will provide
        Returns:
        The data types that go unsupported by the formats found in formatToMatchedTypes
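The behavior described for findUnmatchedTypes amounts to a set difference: the requested types minus the union of everything the formats provide. A minimal sketch, assuming that reading; the nested `DataType` and `Format` enums are hypothetical stand-ins for IlluminaDataType and IlluminaFileUtil.SupportedIlluminaFormat, and this is not the library's implementation.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class UnmatchedTypesSketch {
    // Hypothetical stand-ins for IlluminaDataType and SupportedIlluminaFormat.
    enum DataType { BaseCalls, QualityScores, Position, PF, Barcodes }
    enum Format { Bcl, Locs, Filter, Barcode }

    // Returns the requested data types that no format in the map provides.
    static Set<DataType> findUnmatched(Set<DataType> requested,
                                       Map<Format, Set<DataType>> formatToMatchedTypes) {
        Set<DataType> unmatched = EnumSet.noneOf(DataType.class);
        unmatched.addAll(requested);
        for (Set<DataType> provided : formatToMatchedTypes.values()) {
            unmatched.removeAll(provided);
        }
        return unmatched;
    }
}
```

An empty result means every requested type is covered by some format. A non-empty result names the types that no format provides.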
      • determineFormats

        public static Map<IlluminaFileUtil.SupportedIlluminaFormat,​Set<IlluminaDataType>> determineFormats​(Set<IlluminaDataType> requestedDataTypes,
                                                                                                                 IlluminaFileUtil fileUtil)
        For the given requestedDataTypes, return a map of file format to the set of data types that format provides. The map covers as many requestedDataTypes as possible and uses the most preferred formats available.
        Parameters:
        requestedDataTypes - Data types to be provided
        fileUtil - A file util for the lane/directory we wish to provide data for
        Returns:
        A map of file format to the set of requested data types that format will provide
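One way to picture the selection described for determineFormats is to assign each requested data type to the first available format, in preference order, that can provide it. The sketch below rests on that assumption. The enums, the capability map, the explicit `available` set (standing in for what an IlluminaFileUtil would report), and the preference order are all hypothetical, and the real method may choose differently.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class DetermineFormatsSketch {
    // Hypothetical stand-ins for IlluminaDataType and SupportedIlluminaFormat.
    enum DataType { BaseCalls, QualityScores, Position, PF, Barcodes }
    enum Format { Bcl, Qseq, Locs, Filter }

    // capabilities: formats in preference order (most preferred first) mapped to the types they can provide.
    // available: formats whose files were actually found for the lane.
    static Map<Format, Set<DataType>> determine(Set<DataType> requested,
                                                LinkedHashMap<Format, Set<DataType>> capabilities,
                                                Set<Format> available) {
        Map<Format, Set<DataType>> chosen = new LinkedHashMap<>();
        for (DataType dataType : requested) {
            for (Map.Entry<Format, Set<DataType>> entry : capabilities.entrySet()) {
                if (available.contains(entry.getKey()) && entry.getValue().contains(dataType)) {
                    chosen.computeIfAbsent(entry.getKey(), k -> EnumSet.noneOf(DataType.class)).add(dataType);
                    break; // the first match in preference order wins
                }
            }
        }
        return chosen; // types with no available format are simply absent
    }
}
```

Any requested type that ends up absent from the returned map is exactly what findUnmatchedTypes would report.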
      • findPreferredFormat

        public static IlluminaFileUtil.SupportedIlluminaFormat findPreferredFormat​(IlluminaDataType dt,
                                                                                   IlluminaFileUtil fileUtil)
        Given a data type, find the most preferred file format, even if files are not available.
        Parameters:
        dt - Type of desired data
        fileUtil - Util for the lane/directory in which we will find data
        Returns:
        The file format that is "most preferred" (i.e. fastest to parse/smallest in memory)
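Unlike determineFormats, this lookup ignores whether any files exist. A minimal sketch of that idea, assuming preference can be expressed as a fixed ordering over formats: the enums below are hypothetical stand-ins, and the declaration order used as the preference order is illustrative only, not the library's actual ranking.

```java
class PreferredFormatSketch {
    // Hypothetical stand-in for IlluminaDataType.
    enum DataType { Position, PF, Barcodes }

    // Hypothetical stand-in for SupportedIlluminaFormat; declaration order doubles as preference order.
    enum Format {
        Locs(DataType.Position),
        Clocs(DataType.Position),
        Pos(DataType.Position),
        Filter(DataType.PF);

        final DataType provides;

        Format(DataType provides) {
            this.provides = provides;
        }
    }

    // Returns the first format, in preference order, that provides the data type,
    // without checking whether any such files exist; null if no format provides it.
    static Format findPreferred(DataType dataType) {
        for (Format format : Format.values()) {
            if (format.provides == dataType) {
                return format;
            }
        }
        return null;
    }
}
```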