(C) PLOS One

(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .

Identifying topologically associating domains using differential kernels [1]

['Luka Maisuradze', 'Department Of Molecular Biophysics', 'Biochemistry', 'Yale University', 'New Haven', 'Connecticut', 'United States Of America', 'Megan C. King', 'Department Of Cell Biology', 'Yale School Of Medicine']

Date: 2024-08

Abstract Chromatin is a polymer complex of DNA and proteins that regulates gene expression. The three-dimensional (3D) structure and organization of chromatin controls DNA transcription and replication. High-throughput chromatin conformation capture techniques generate Hi-C maps that can provide insight into the 3D structure of chromatin. Hi-C maps can be represented as a symmetric matrix , where each element represents the average contact probability or number of contacts between chromatin loci i and j. Previous studies have detected topologically associating domains (TADs), or self-interacting regions in within which the contact probability is greater than that outside the region. Many algorithms have been developed to identify TADs within Hi-C maps. However, most TAD identification algorithms are unable to identify nested or overlapping TADs and for a given Hi-C map there is significant variation in the location and number of TADs identified by different methods. We develop a novel method to identify TADs, KerTAD, using a kernel-based technique from computer vision and image processing that is able to accurately identify nested and overlapping TADs. We benchmark this method against state-of-the-art TAD identification methods on both synthetic and experimental data sets. We find that the new method consistently has higher true positive rates (TPR) and lower false discovery rates (FDR) than all tested methods for both synthetic and manually annotated experimental Hi-C maps. The TPR for KerTAD is also largely insensitive to increasing noise and sparsity, in contrast to the other methods. We also find that KerTAD is consistent in the number and size of TADs identified across replicate experimental Hi-C maps for several organisms. Thus, KerTAD will improve automated TAD identification and enable researchers to better correlate changes in TADs to biological phenomena, such as enhancer-promoter interactions and disease states.

Author summary Chromatin, which encodes the genetic information for cells, must fold into the cell nucleus that is many times smaller in size. The folded 3D structure of chromatin in the nucleus enables gene expression and proper cell function. With the advent of advanced chromatin conformation capture techniques, we can identify topologically associating domains (TADs), which are regions of the genome that prefer to interact within themselves rather than with neighboring regions. Numerous methods have been developed to automatically detect TADs in Hi-C maps, however, they frequently disagree on the location and number of TADs. We develop a new algorithm, KerTAD, to identify TADs using techniques from image processing and computer vision. We find that our method is more accurate on both synthetic and manually-annotated experimental Hi-C maps than all tested methods. Our method also performs well in the presence of noise and sparsity, which are frequently encountered in experimental Hi-C maps. KerTAD will enable future studies to elucidate the role of TADs in gene regulation and disease formation.

Citation: Maisuradze L, King MC, Surovtsev IV, Mochrie SGJ, Shattuck MD, O’Hern CS (2024) Identifying topologically associating domains using differential kernels. PLoS Comput Biol 20(7): e1012221. https://doi.org/10.1371/journal.pcbi.1012221 Editor: Roland L. Dunbrack Jr., Fox Chase Cancer Center, UNITED STATES OF AMERICA Received: November 17, 2023; Accepted: June 3, 2024; Published: July 15, 2024 Copyright: © 2024 Maisuradze et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All code used in the manuscript, including the KerTAD algorithm and analysis code is available at https://github.com/lmaisuradze/KerTAD/. All Hi-C maps used in the manuscript are either publicly available or made available at https://figshare.com/s/c92bd17f5bd0882fa3e0. Funding: The authors acknowledge support from National Science Foundation, Grant No. 1830904 (to L.M., M.C.K., S.G.J.M., I.S, and C.S.O.) and National Science Foundation, Grant No. 2124558 (to L.M. and C.S.O.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Chromatin is a polymer complex of DNA and proteins that forms chromosomes. Chromatin must undergo a highly organized compaction process to fit into the μm-sized nucleus. During this compaction process, chromatin forms hierarchical structures, such as loops, A/B compartments, and territories, across a range of length scales [1–4]. The spatial organization of chromatin is essential for many nuclear processes, such as DNA replication and transcription. For example, during transcription, enhancer and promoter DNA regions that are separated on the chromatin fiber must come into close proximity through the formation of loops to increase the transcription of target genes [1, 5]. Disruptions in chromatin loop formation can alter gene expression by preventing enhancer-promoter interactions [6, 7]. To better understand the structural organization of chromatin, chromosome conformation capture and proximity ligation derivative techniques (in particular Hi-C) have been developed to elucidate genome-wide spatial interactions and structures [8, 9]. Hi-C generates an interaction matrix, , where each element represents the frequency with which two loci i and j on chromatin are close in space, averaged over a cell population [8]. Hi-C maps reveal significant interactions off the diagonal that are not expected for an extended polymer. In particular, Hi-C maps display topologically associating domains (TADs), or regions of increased self-interaction (with decreased interactions outside the region), typically presenting as a square of higher frequency centered on the diagonal [10, 11]. TADs often indicate the formation, elongation, and dissolution of loops. Loops enable enhancer-promoter interactions and TAD boundaries are frequently enriched for insulator proteins and transcription marks, which explains why enhancer-promoter interactions occur mostly within TADs [10, 12–16]. Several features of experimentally determined Hi-C maps, such as noise, sparsity, and low resolution, make TAD identification difficult. Further, TAD features are heterogeneous, e.g. while some TADs possess strong corner points and weak intensity in the interior of the TAD, others possess uniform intensity in the interior with weak borders. TADs are also often difficult to differentiate from the background power-law decay in the interaction frequency away from the diagonal that arises from expected distance-dependent polymer interactions [17]. The convention for TAD identification, or TAD calling, is to specify the starting and ending loci of each TAD in the interaction matrix . However, TADs do not directly report on static chromatin structure, instead they provide a statistical description of dynamic chromatin organization that is influenced by the experimental methods used to construct the Hi-C maps [12, 18, 19]. Currently, there is no ground-truth definition for TADs in Hi-C maps, and TAD definitions are scale- and resolution-dependent [12, 18, 20]. To illustrate this point, in Fig 1A and 1B, we show the same segment (from 9 to 13 Mb) of mouse chromosome 17 Hi-C map using both linear and logarithmic (base e) intensity scales, respectively. On the linear scale, TADs are not visible, whereas on the logarithmic scale, numerous overlapping and nested TADs appear. (See the Benchmarks subsection in the Materials and Methods for definitions of overlapping and nested TADs.) In Fig 1C we show the same segment of mouse chromosome 17 on a logarithmic scale, but from a different biological replicate, showing a much sparser Hi-C map and replicate to replicate fluctuations. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 1. Challenges of TAD identification in Hi-C maps. A: The choice of scale and normalization of Hi-C maps impacts the visibility of TADs. Mouse Hi-C map of chromosome 17 (from 9 to 13 Mb) without preprocessing on a linear scale and normalized so that yields faint TADs. B: We show the same Hi-C map as in (A), but plotted on a natural logarithmic scale. The blue square indicates the corner of a clear TAD and the dotted lines in the upper triangular matrix denote its boundaries. The green circle shows a region of noise near a TAD boundary, and the white triangles (and associated dashed lines) indicate “borderline” TADs (structures for which it is unclear whether they would count as TADs) that were not visible in the left image. C: The same region of mouse chromosome 17, but from a different biological replicate with many more low intensity values off the diagonal. D: A synthetic Hi-C map generated from negative binomial distribution sampling with TADs identified (shown in the upper right triangle) using three state-of-the-art TAD calling algorithms: SpectralTAD (open circles), deDoc (crosses), and Armatus (open rectangles). Ground truth TADs (blue circles) are shown in the lower triangular matrix. https://doi.org/10.1371/journal.pcbi.1012221.g001 Because there is currently no clear ground-truth definition of TADs in Hi-C maps, it is challenging to determine the accuracy of TAD calling algorithms on experimental data. However, TAD calling algorithms can be tested on synthetic data that mimics experimental Hi-C maps. The advantage of synthetic data is that it has a well-defined ground-truth and the noise and sparsity of the data can be tuned. To generate a possible ground truth for experimental Hi-C maps, a consensus manual annotation from multiple experts can be obtained. We can then benchmark TAD calling algorithms on their accuracy compared to the manually annotated experimental data [21]. Many algorithms have been developed to identify TADs using graph-theoretic, clustering, machine-learning, and image transform techniques [10, 22–29]. In Fig 1D we compare three state-of-the-art TAD calling algorithms on synthetic data generated by sampling from a negative binomial distribution (see the Complex Synthetic Hi-C Maps subsection in the Materials and Methods for further details) meant to mimic experimental mouse Hi-C maps. These TAD callers identify different numbers of TADs and in different locations, as expected from previous TAD identification algorithm comparison studies [21, 30–33]. Previous studies have found that on manually annotated GM12878 and hESC Hi-C maps at 50 kb resolution, current TAD calling algorithms rarely exceed a positive predictive value of 40% [21]. On synthetic data for overlapping and nested TADs, these methods mostly obtain a true positive rate (TPR) (see Metrics subsection in the Materials and methods) of ≲ 0.6 [31, 33]. In addition, most current TAD-calling algorithms impose strong restrictions that limit their ability to call overlapping, nested, and gapped TADs. [21, 30–33]. In this article, we develop a novel TAD-calling algorithm, KerTAD, that applies gradient and other image operators on Hi-C maps to accentuate and extract their off-diagonal features. We show that KerTAD is more accurate than the current state-of-the-art methods as determined by previous studies [30–33] across three categories of Hi-C maps: synthetic maps generated via molecular dynamics simulations of block copolymers; synthetic maps with overlapping and nested TADs sampled from a binomial distribution of intensities; and manually annotated GM12878 maps at 50kb resolution. On all three datasets, KerTAD is the most accurate in terms of TPR while having a negligible false discovery rate (FDR). On synthetic data, our method has an average TPR of ≈ 0.98 and ≈ 0.99 on non-nested and nested maps, respectively, and a median TPR of ≈ 0.75 on manually annotated Hi-C maps. In addition, KerTAD is highly resistant to noise and sparsity, achieving a higher TPR at the highest level of noise tested than other methods with no noise. Because KerTAD outperforms every tested method on both manually annotated experimental and synthetic data, KerTAD is likely able to capture the underlying features in experimental Hi-C maps. This article is organized as follows. In the Materials and methods section, we first describe the preprocessing of the input Hi-C maps and the generation of masks to identify key features of TADs in Hi-C maps. We also define the metrics for sensitivity and false discovery rate for comparing the predictions of KerTAD to ground truth for the synthetic and manually annotated Hi-C maps. We then define the techniques used for generating noise and sparsity in synthetic data. In the results section, we summarize the performance of KerTAD (as well as six other methods) in TAD identification on synthetic and manually annotated Hi-C maps. We also analyze replicate Hi-C maps across four organisms and compare the variation in number and mean size of TADs identified by three TAD identification algorithms. Finally, we discuss how the improved accuracy in TAD identification will enable more robust inferences between the identified TADs and chromatin organization.

Materials and methods The description of the Materials and methods is organized into two sections. In the first section, we explain the new TAD identification algorithm, KerTAD, including the preprocessing steps and the application of masks to identify key features of TADs. In the second section, we discuss the implementation of six other state-of-the-art methods to identify TADs, metrics that we use to quantify the accuracy of the TAD identification methods, and techniques to generate sparse and noisy synthetic data. We describe the motivation and process of manually annotating experimental Hi-C maps, as well as the methods for comparing the accuracy of TAD identification methods on manually annotated experimental data. We finally describe in detail our analysis of the performance of several TAD identification algorithms on replicate non-annotated experimental Hi-C maps across several organisms. KerTAD KerTAD takes as input a symmetric N × N matrix, , which gives the frequency of contacts between bins i and j and returns an M × 2 matrix, where each row gives the corner location of one of the M TADs in . The preprocessing step normalizes such that for all i, j and reduces fluctuations in while preserving edge features. The method then feeds the preprocessed Hi-C map into two separate pipelines, each of which generates a mask. One pipeline seeks to extract small-scale diffuse point features in the Hi-C map, while the other favors larger scale regions near corner points. The final TADs are given by the intersection of the two masks. Preprocessing. There is no standard format or normalization scheme for Hi-C maps [34–40]. Because normalization is known to significantly affect TAD-calling performance [34], we first preprocess to satisfy the requirements below. First, we ensure that the diagonal elements of are the maxima in their respective rows, i.e. . If a given , we then set . This condition is reasonable in the sense that we should expect that local regions of chromatin interact with themselves more than any other region. We then locally row-normalize by re-setting to , where σ i is the standard deviation of the ith row of . This normalization reduces global fluctuations and also perturbs the original less than other normalization schemes like requiring to be both row- and column-normalized [39]. We then filter with a Gaussian kernel with standard deviation σ = 3Γ/2 and filter size 2⌈(2σ)⌉ + 1, where ΓN2 is the number of zero elements in and ⌈⋅⌉ is the ceiling function. This Gaussian filtering is performed since extremely sparse Hi-C maps can cause division by zero errors in the KerTAD masks. Additionally, normalization and applying a Gaussian kernel to reduces the total variation of , which is defined as: (1) where , , and the outside bins of are given by , , , and . While spatial variation is a hallmark of TADs, excessive total variation outside of TAD boundaries (such as speckle noise) can obscure the signal and make TAD identification challenging. It is important however to regulate standard smoothing techniques, like Gaussian blurring, since while they can reduce the total variation, they can also remove stark edge features that are essential for identifying TADs. Finally, is automatically segmented (if necessary) by finding outliers on the diagonal where the ratio of zero elements to nonzero elements of the 5 elements around the diagonal (either to the left or to the right of the diagonal depending on the location of the diagonal) for each row is greater than 0.8. Further outliers are found using the Grubbs method if necessary and then all the adjacent non-outliers are segmented into separate maps to process [41]. Mask for corner point features. The mask for corner point features is designed to identify locations near the diagonal where there are strong changes in intensity, since these often indicate transitions between TADs, and then to generate a mask of possible corner point combinations in . We first calculate the discrete partial derivative of . We then feed the row vectors of the partial derivative map into a non-linear function that produces a similarity matrix. The similarity matrix is then filtered by applying a local maximum operator and global threshold, which identifies locations on the diagonal of where there are sharp local changes. We then use the identified locations on the diagonal to generate a binary mask of every TAD corner point combination, with each diagonal location representing one index of a possible TAD corner point. Differential operators in image processing are often represented as convolutions of an image with a kernel that is separable into at least one smoothing filter. Smoothing can reduce noise, but excessive smoothing removes edge features, making it difficult to determine TAD locations. Thus, we implement a low-order partial derivative map with no smoothing filter, , with symmetric boundary conditions. Next, we construct a list of row vectors , where is the ith row of . We then construct a similarity matrix, , (2) and and return the maximum and minimum components of , respectively. Finally, we define the N × N binary mask of point features, , as follows: for every i, j such that i < j, if and only if and are both local maxima in their respective 3 × 3 local neighborhoods and , where Ω is the global threshold determined using the triangle algorithm [42] on . Fig 2 illustrates the several intermediate steps and maps to transform an input Hi-C map, , into . PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 2. Illustration of the four steps in constructing the point feature binary mask. A: We start with a Hi-C map (with bins i and j labelled from 1 to 125). B: We first calculate the discrete partial derivative, Δ y A ij . C: We then construct from a nonlinear function of the pairs of row vectors of . D: The binary mask is obtained by combining a local maximum filter with binary thresholding of . If (black squares), and are both local maxima in their 3 × 3 windows and above the threshold set by the triangle method on . https://doi.org/10.1371/journal.pcbi.1012221.g002 Mask for corner regions. While the previous mask captured point features of TADs spread throughout the Hi-C map, we also need a mask to identify the specific corner regions near the diagonal in . As before, we calculate an image derivative, this time , using periodic boundary conditions. For i < j, if then is set to 0 and for i > j if then is set to 0. We then calculate (3) has several important features. First, TAD corners and edges are maxima of in their local neighborhood as shown in Fig 3C. The diagonal elements of that correspond to TAD corner points (i.e. if is the corner point of a TAD, the corresponding points in are and ) are strongly negative minima in their neighborhood. Taking advantage of both of these facts, we construct the final binary mask : (4) where Ω is the threshold determined by the triangle method on the matrix, . PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 3. Illustration of the steps used to construct the mask Illustration of the steps used to construct the maskfor identifying corner regions in Hi-C maps. A: We start with the same input Hi-C map as in Fig 2. B: We first calculate the discrete partial derivatives, Δ x A ij . C: We then calculate from Δ x A ij . D: We obtain the final binary mask after applying a global threshold on - . https://doi.org/10.1371/journal.pcbi.1012221.g003 Final mask and parameters. After constructing both masks, we take the element-wise product of and to obtain the final binary mask, . Each nonzero element of represents a predicted TAD corner point. For the final output, KerTAD converts to a 2 column list where each row represents the start and end index of a TAD corner point. We illustrate the full algorithm applied to chromosome 12 of a GM12878 cell line in Fig 4. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 4. Graphical representation of KerTAD algorithm. The progression through the KerTAD algorithm is depicted above: the initial input Hi-C map undergoes preprocessing, followed by segmentation of the preprocessed map, application of masks 1 and 2 (see Figs 2 and 3) for each segment, calculating their element-wise product (shown overlaid with the input segment), followed by recombination of each binary output for each segment into the final TAD predictions. https://doi.org/10.1371/journal.pcbi.1012221.g004 KerTAD does not require any user-provided parameters, taking only a Hi-C matrix as input. However, KerTAD has two optional hard-coded parameters that can be manually overridden for expert users to have more flexibility. First, the parameter κ is the maximum number of TADs that can be identified per row. Starting from the diagonal and moving outward for each row, every nth TAD where n > κ is discarded. By default, κ = 3, since greater than 3 TADs per row is unlikely. Throughout the manuscript we use κ = 3 with the exception of simple Hi-C maps where we set κ = 1 (to mimic simple TAD callers). The other optional parameter is γ, which controls how many times the initial automatic segmentation of is broken into smaller maps for additional segmentation. If a segmentation, , of spans from to (i.e. ), then if γ = n, is divided into n further segments, w 1 , …, w n , where w i spans to , , and ⌈.⌉ is the ceiling function. By default, γ = 2. Further splitting Hi-C maps is useful for large and heterogeneous Hi-C maps where different regions have varying coverage and local intensity. Generally, TAD predictions scale with γ (i.e. increasing γ will increase the number of TADs predicted). γ can be tuned based on user preference in either direction. (Increasing γ will likely increase TPR, but it will also increase FDR.) For robust Hi-C maps, γ can be set to 1 (i.e. no further segmentations are calculated after the initial automatic segmentation). For this manuscript, we use γ = 1 for synthetic Hi-C maps and γ = 2 for all other Hi-C maps. Benchmarks When determining the accuracy of TAD identification methods, we first categorize the Hi-C maps into two types: synthetic and experimental Hi-C maps. For synthetic Hi-C maps, we also distinguish between “simple” and “complex” Hi-C maps. For simple Hi-C maps, each element on the diagonal of must belong to one and only one TAD. This condition implies that i) has no nested or overlapping TADs and ii) has no gaps between TADs. Thus, in a simple Hi-C map, if a TAD is identified over a set of diagonal elements, e.g. from A ii to A jj , there are no other TADs within that set and the next TAD must start at A (j+1)(j+1) . Complex Hi-C maps are defined as any Hi-C map that is not simple, i.e. has either nested, overlapping, or gapped TADs. A nested TAD is a TAD with its corner point located at (where j > i) while there exists another TAD corner at (where l > k), where k ≤ i and j ≤ l. An overlapping TAD has a corner at (where j > i) and another TAD corner at (where l > k), where k < i and i < l < j or i < k < j and j < l. A Hi-C map possesses a gapped TAD if there exists an element on the diagonal, , that does not belong to any TAD. In S1 Fig, we show a graphical illustration and examples from the GM12878 chromosome 6 Hi-C map of nested, overlapping, and gapped TADs. We analyze the performance of TAD identification algorithms on simple and complex synthetic Hi-C maps separately. Many TAD identification algorithms assume that the input Hi-C maps are simple. This additional information provides constraints on the locations of TADs, which can lead to enhanced accuracy for these algorithms. However, the additional constraints do not improve TAD prediction in manually annotated experimental Hi-C maps, as most experimental Hi-C maps are not simple. In previous work comparing the performance of TAD identification algorithms, the top performers on simple and complex synthetic maps were different [31, 33]. In the Results section, we show that KerTAD is highly accurate in identifying TADs in both simple and complex Hi-C maps, while not presupposing that a given Hi-C map is simple or complex. Simple synthetic Hi-C maps. To compare the performance of different TAD identification algorithms for simple, synthetic Hi-C maps, we consider 100 Hi-C maps generated by molecular dynamics (MD) simulations of block copolymers from previous studies [43]. In these MD simulations, chromatin is modeled as a bead-spring polymer with non-bonded, purely repulsive interactions to prevent bead overlaps, non-specific short-ranged attractive interactions between bead pairs to induce compaction, and specific short-ranged attractive interactions between bead pairs to mimic TADs that occur in specific epigenomic profiles. From previous studies [21, 30–33] we select the top performing TAD identification algorithms for simple, synthetic maps. Namely, we compare KerTAD with TopDom [27], HICSeg [28], and CHDF [29]. We perform TAD identification on the set of 100 simple, synthetic Hi-C maps discussed above. (Note that TopDom, HICSeg, and CHDF do not identify nested or overlapping TADs.) For TopDom we count the “domain” predictions and set the window size to 5 as done in previous work [31, 33] for the same synthetic Hi-C maps. Again following previous work [21, 30, 31, 33], we set the max TAD size parameter for CHDF to 50 and for HICSeg we use the “G” distribution. When comparing TAD predictions from KerTAD to those for the other algorithms on the simple, synthetic Hi-C maps, we impose a further restriction on our identified TADs. Since KerTAD can identify nested and overlapping TADs, it has more chances to identify correct TADs compared to methods that are unable to call nested and overlapping TADs. Thus, we set κ = 1, considering only the innermost TAD corners with the smallest distance from the diagonal. Complex synthetic Hi-C maps. For generating complex, synthetic Hi-C maps, we use a variation of a previously developed procedure [30, 44] that mimics mouse embryonic stem cells by sampling from a negative binomial distribution of Bernoulli trials, where successful trials represent contacts between chromatin loci. The distribution is characterized by a location-dependent variance (with dispersion factor r = 0.01) and mean . The location-dependent mean is defined by (5) where δ ij is the Kronecker-delta, K d gives , K t and c are parameters that control the power-law decay of away from the diagonal. (K d = 35, K t = 28, and c = −0.69 were selected to match in chromosome five in IMR90 replicate B.) θ ij = 1 when is inside of a TAD (excluding diagonal elements) and 0 otherwise [17, 30]. TAD boundary lengths are selected randomly from a uniform distribution with widths from 5 to 20 bins (where each bin represents 40 kb). We then remove randomly selected TADs from this list and fill in the gaps with larger overlapping and nested TADs. The deletion process involves randomly selecting 25% of TADs in the lower layer, removing them, and then adding a new TAD block that spans the length of the two TADs between any deleted TADs, thus creating nested and overlapping TADs. More specifically, if two TADs around a deleted TAD have corner points located at (x 1 , y 1 ) and (x 2 , y 2 ), the new TAD will have a corner point located at (x 1 , y 2 ). is a random variable that mimics weak and non-specific ligation events by sampling (with replacement) a fraction of randomly selected elements of and adding a constant, K noise . (We set K noise = 5.) The likelihood that an element of receives a noise impulse scales with (i − j + 1)c. We generate 100 complex, synthetic Hi-C maps using this protocol with , where each Hi-C map has on average 150 TADs. From previous studies [21, 31–33] we select the top performing TAD callers on similar datasets of complex, synthetic Hi-C maps. We compare KerTAD with deDoc [24], Armatus [22], and SpectralTAD [25]. As before, we follow the default or recommended parameters for each algorithm. For Armatus we set g = 0.05 and s = 0.05 [30], for SpectralTAD we use levels = 2, and for deDoc we use both the dedoc(M) and dedoc(E) predictions, removing duplicates. The accuracy of TAD identification was determined for these three methods, along with KerTAD, for each complex, synthetic Hi-C map. Noise and sparsity. To test the robustness of the TAD identification algorithms, we compare TAD predictions for two sets of new complex, synthetic Hi-C maps with varying levels of added noise and sparsity. In the first set, we generate 10 complex Hi-C maps with (as previously described) and for each, construct an additional 20 Hi-C maps, with varying levels of noise (totalling 210 Hi-C maps). Because many TAD identification algorithms only accept integer counts, we do not use additive Gaussian noise. Instead, we randomly sample (with replacement) and add a constant additive impulse, K noise = 5, as described previously for . The noise is parameterized by χ, which represents the number of added impulses divided by the number of elements of . To generate the noisy maps, we increase χ in increments of 0.05 starting from 0 to 1. For the second set, we perform the same procedure but instead add sparsity to by setting random elements of equal to 0. Sparsity is parameterized by ξ, which is the fraction of elements of that are set to zero compared to the total number of elements. We generate 200 sparse maps by increasing ξ in increments of 0.05 starting from 0 to 0.95 (ξ = 1 would mean a map of only 0s). For KerTAD, we turn off outlier detection for highly sparse Hi-C maps to avoid runtime errors. Experimental maps. To obtain ground truth for experimental Hi-C maps, we follow the previous manual annotations performed on Hi-C maps for the GM12878 cell line at 50 kb resolution for the 40–45 Mb regions of 10 different chromosomes (chromosomes 2, 3, 4, 5, 6, 7, 12, 18, 20, and 22) [21]. In the original annotations, “any identifiable TAD structure” was annotated and the positive predictive value (PPV) of the identified TADs was calculated for seven TAD identification algorithms [21]. However, calculating PPV does not penalize TAD callers that miss “obvious” TADs and even TPR may be inappropriate for gauging TAD prediction accuracy if the annotations are too lenient. In addition, likely due to differences in the pipeline or visualization, we found that many of the original annotations were displaced or pointed at no features or structures. Thus, using the original annotations as a guide, we keep the most “obvious” TADs and then calculate TPR to capture the accuracy of the TAD identification methods. Because the annotations are not meant to be exhaustive, we do not calculate FDR. Because the experimental Hi-C maps are complex, we use deDoc, Armatus, and SpectralTAD, as well as KerTAD, to identify TADs in the manually annotated GM12878 Hi-C maps. For the input maps to each TAD caller, we used the cutout sections of the genome except for Armatus, which returned no TADs with the smaller map (a previously described bug) and for which we used the full intrachromosomal map as input. For experimental Hi-C maps without manual annotations, we evaluate in situ Hi-C maps for four organisms: fruit fly S2 cells [45] (4DN accession code: 4DNESFOADERB), zebrafish embryos [46] (4DN accession code: 4DNESV5PGOUC), mouse CH12.LX cells [17] (4DN accession code: 4DNESK95HVFB), and human HCT-116 cells [47] (4DN accession code: 4DNES3QAGOZZ). All Hi-C maps were obtained from the 4DN data portal [48] and the .pairs files for each biological and technical replicate were converted to .cool files and then intrachromosomal Hi-C maps at 50 kb resolution were extracted using Cooler [49]. For zebrafish Hi-C maps, we analyzed three biological replicates with one technical replicate for each biological replicate. For fruit fly Hi-C maps, we also analyzed three biological replicates with one technical replicate each. For mouse Hi-C maps, we used three biological replicates with 11, 2, and 2 technical replicates. For human Hi-C maps, we analyzed six biological replicates with 3, 4, 2, 3, 2, and 2 technical replicates. For each Hi-C map, we perform TAD identification using KerTAD and the top performers in TPR for the simple and complex Hi-C map categories: TopDom and deDoc. For TopDom we used a window size of 10 following the recommendation for 50kb resolution from previous work [21]. Because TopDom generated an error message for chromosome Y of biological replicate 2 for fruit fly, we do not include that Hi-C map in our analysis for TopDom. We calculate the total number of identified TADs by summing the number of predicted TADs for each intrachromosomal map for each replicate. We also calculate the mean size of the identified TADs for each intrachromosomal map. We characterize the distribution of the number of TADs and mean sizes of TADs over replicates for each organism by calculating the median, maximum, and minimum values. Accuracy metrics. We apply each TAD identification algorithm to each synthetic or manually annotated experimental Hi-C map and compare the lists of identified TADs to ground truth. For a predicted TAD corner point located at , we denote it as a “true positive” if and only if there is a ground truth TAD with the same corner point coordinates. We calculate two metrics for each synthetic and experimental Hi-C map for every algorithm: and , where p is the number of true positives, is the total number of ground truth TADs, and is the total number of TADs predicted. In manually annotated experimental Hi-C maps, since the TAD corners are often difficult to define, a “true positive” is counted as long as the ground truth coordinate is one of the coordinates in the 3 × 3 square centered around the predicted TAD corner point.

Discussion In this article, we developed a novel algorithm, KerTAD, to identify TADs in Hi-C maps. Most previous TAD calling algorithms assume simple Hi-C maps, i.e. each diagonal element of must belong to one and only one TAD. For simple Hi-C maps, when a TAD is identified at element i and j, the next TAD must have a starting index of j + 1 and there can be no additional TADs between i and j. In contrast, our method does not assume that Hi-C maps are simple and can accurately identify nested, overlapping, and gapped TADs. Among the few algorithms that can identify TADs in complex Hi-C maps, which is necessary for accurate TAD identification in experimental Hi-C maps, there is a large discrepancy in the number and size of TADs called, even among replicate Hi-C maps from the same experiment. Here, we present a novel algorithm that consistently outperforms other TAD identification algorithms on synthetic and manually annotated Hi-C maps, while being robust to noise and sparsity. KerTAD uses two kernel-based techniques that detect complementary features of Hi-C maps. The method focuses on regions of Hi-C maps near the diagonal where there are large changes in intensity and strong corner points. We show that KerTAD outperforms six state-of-the-art TAD identification algorithms on both synthetic and manually annotated experimental Hi-C maps. In particular, we calculate the TPR and FDR by comparing the results for the predicted TADs for each algorithm to ground truth for the synthetic and experimental manually annotated Hi-C maps. We also test the performance of the TAD identification algorithms on complex, synthetic Hi-C maps with increasing levels of impulse noise and sparsity. For all of the Hi-C maps with ground truth that we tested (i.e. simple and complex synthetic, noisy and sparse, and manually annotated, experimental), our method has the highest TPR and negligible FDR. We also find that our method has low variance in the median number and size of TADs across replicates for the experimental Hi-C maps without ground truth. In previous work [21, 31] that evaluated TAD identification algorithms, algorithms that can identify nested and overlapping TADs predict more TADs and possess higher variance in the number of identified TADs over replicates. This result is consistent with the fact that simple TAD identification algorithms can only call at most N TADs for a Hi-C map with N × N elements, whereas algorithms for complex Hi-C maps can identify at most N2 TADs. Our results also show that algorithms for complex Hi-C maps identify more TADs than those for simple Hi-C maps, e.g. deDoc identifies significantly more TADs and with higher variance among replicates than TopDom. However, unlike deDoc, our method, which can identify TADs in complex Hi-C maps, shows significantly lower variation among replicates, with maximum and minimum values for the numbers and sizes of TADs comparable to those for TopDom. The fact that our method generates results for the numbers and sizes of TADs with small variations among replicates suggests that our method identifies the most important features of Hi-C maps that are insensitive to noise and sparsity. While KerTAD outperforms other current TAD identification algorithms on synthetic Hi-C maps, it can be improved. For Hi-C maps where there are high-intensity regions compared to the local neighborhood, we find that despite TVR reducing the variation, our method still tends to identify TADs in the regions of high intensity, rather than in regions of low intensity. Since TADs are usually defined locally, using global techniques that threshold across the whole Hi-C map will invariably suffer from this problem. Unfortunately, this results in a well-known dilemma: if one does not normalize weaker intensity regions, the algorithm will miss TADs, but normalizing weak intensity regions will bring out noise causing false positive TADs. This can be controlled to some degree by separating large maps into smaller ones (i.e. setting γ = 1), but risks “cutting off” TAD boundaries. In future work, we will develop new techniques to reduce noise, while maintaining the ability to identify TADs in weak intensity regions. Because our method possesses the highest accuracy on synthetic and manually annotated experimental Hi-C maps, we hypothesize that our method will be accurate in capturing the true number and size of TADs in experimental Hi-C maps. However, it is worth reiterating that there is currently no ground truth definition of TADs in experimental Hi-C maps, which means that TPR and FDR on synthetic and manually annotated data, while useful, are only proxies for the accuracy of TAD identification algorithms on experimental Hi-C maps. Previous research groups [21, 30, 31, 33] have benchmarked their TAD identification algorithms using different metrics. For example, several studies have searched for correlations between predicted TAD boundaries and CTCF enrichment as a measure of TAD identification accuracy. However, this benchmark may not be related to benchmarks that rely on visual identification of TADs in experimental Hi-C maps. Currently, there can be large variations in the experimentally determined Hi-C maps from one experiment to the next. As chromatin conformation capture experiments continue to improve, it will be possible to determine well-defined, relatively noise-free, and experimentally reproducible Hi-C maps. It is also important to understand how Hi-C maps depend on the phase of the cell cycle, cell type, cell-to-cell fluctuations, and tissue type in each organism. After such experimental studies are carried out and well-defined Hi-C maps are obtained, computational studies can be carried out to determine in an unsupervised way the important features that distinguish one Hi-C map from another. After identifying these key features, further studies can be carried out to understand the spatiotemporal dynamics of chromatin that give rise to each of the key features in Hi-C maps.

Supporting information S1 Fig. Graphical description and examples of different types of TADs. We illustrate graphically (from left to right) nested, overlapping, and gapped TADs in the top row. In the bottom row, below each type of TAD, we show an example of that particular type of TAD (outlined using dotted lines) in an experimental Hi-C map of chromosome 6 from the human GM12878 cell line. https://doi.org/10.1371/journal.pcbi.1012221.s001 (TIF)

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012221

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/