(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
Neural network and kinetic modelling of human genome replication reveal replication origin locations and strengths [1]
['Jean-Michel Arbona', 'Laboratoire De Biologie Et Modélisation De La Cellule', 'Ens De Lyon', 'Lyon', 'Hadi Kabalane', 'Cnrs', 'Laboratoire De Physique', 'Jeremy Barbier', 'Arach Goldar', 'Université Paris-Saclay']
Date: 2023-07
In human and other metazoans, the determinants of replication origin location and strength are still elusive. Origins are licensed in G1 phase and fired in S phase of the cell cycle, respectively. It is debated which of these two temporally separate steps determines origin efficiency. Experiments can independently profile mean replication timing (MRT) and replication fork directionality (RFD) genome-wide. Such profiles contain information on multiple origins’ properties and on fork speed. Due to possible origin inactivation by passive replication, however, observed and intrinsic origin efficiencies can markedly differ. Thus, there is a need for methods to infer intrinsic from observed origin efficiency, which is context-dependent. Here, we show that MRT and RFD data are highly consistent with each other but contain information at different spatial scales. Using neural networks, we infer an origin licensing landscape that, when inserted in an appropriate simulation framework, jointly predicts MRT and RFD data with unprecedented precision and underlies the importance of dispersive origin firing. We furthermore uncover an analytical formula that predicts intrinsic from observed origin efficiency combined with MRT data. Comparison of inferred intrinsic origin efficiencies with experimental profiles of licensed origins (ORC, MCM) and actual initiation events (Bubble-seq, SNS-seq, OK-seq, ORM) show that intrinsic origin efficiency is not solely determined by licensing efficiency. Thus, human replication origin efficiency is set at both the origin licensing and firing steps.
DNA replication is a vital process that produces two identical replicas of DNA from one DNA molecule, ensuring the faithful transmission of genetic information from mother to daughter cells. The synthesis of new DNA strands initiates at multiple sites, termed replication origins, propagates bidirectionally, and terminates by merging of converging strands. Replication initiation continues in unreplicated DNA but is blocked in replicated DNA. Experiments have only given partial information about origin usage. In this work we reveal the exact propensity of any site to initiate replication along human chromosomes. First, we simulate the DNA replication process using approximate origin information, predict the direction and time of replication at each point of the genome, and train a neural network to precisely recover from the predictions the starting origin information. Second, we apply this network to real replication time and direction data, extracting the replication initiation propensity landscape that exactly predicts them. We compare this landscape to independent origin usage data, benchmarking them, and to landscapes of protein factors that mark potential origins. We find that the local abundance of such factors is insufficient to predict replication initiation and we infer to which extent other chromosomal cues locally influence potential origin usage.
Funding: This work was supported by the Agence Nationale de la Recherche (ANR-18-CE45-0002 and ANR-19-CE12-0028) and the Cancéropôle Ile- de-France and the INCa (PL-BIO16-302). OH was also supported by the Ligue Nationale Contre le Cancer (Comité de Paris; RS19/75-75), the Association pour la Recherche sur le Cancer (PJA 20171206387) and the Fondation pour la Recherche Médicale (FRM EQU202203014910). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Hadi Kabalane and Jeremy Barbier received a salary from the ANR.
Copyright: © 2023 Arbona et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
We then train a neural network on simulated MRT and RFD profiles to infer an IPLS that jointly predicts experimental MRT and RFD almost exactly, surpassing IPLSs based on DNase I HSS, ORC, MCM, Bubble-seq, SNS-seq or ORM profiles. In our model, each potential origin has the same intrinsic probability of activation per unit time. The optimised IPLS, which reflects intrinsic origin efficiencies, can be directly compared with ORC and MCM profiles. To compare the IPLS to actual initiation events as monitored by SNS-seq, bubble-seq, OK-seq or ORM, we establish novel mathematical expressions that relate observed and intrinsic origin efficiencies to MRT and therefore allow us to take origin passivation effects into account. The results show that the optimised IPLS is not fully consistent with ORC abundance profiles, and largely inconsistent with MCM abundance profiles, and that the firing probability of MCM DHs cannot be uniform in time and space. Our results therefore support an origin affinity model and provide a basis to investigate the distinct genetic and epigenetic determinants of origin licensing and firing.
Our aim being to extract the IPLS that best predicts available MRT and RFD data, we first compare their information contents. We show a remarkable conformity of MRT and RFD data to a simple mathematical equation that links both profiles. Extending the work by Gindin et al. [ 42 ], we then ask whether the correlation of DNase I HSS with origin activation seen at MRT resolution (50–100 kb) still holds true at RFD resolution (< 5 kb). We demonstrate that MRT and RFD data provide distinct information at different scales.
Given an IPLS possibly derived from a specific genomic feature (e.g. a DNase I HSS profile), a fixed number of localized potential origins is drawn (red circles). Limiting firing factors (blue rectangles) activate origins in a probabilistic manner and engage with each pair of newly created forks, which propagate at velocity v. Engaged factors can no longer activate origins. Unfired origins are passivated when they are reached by a passing fork. Merging of two forks emanating from adjacent origins results in a replication termination event and the release of one firing factor which becomes available again for origin activation. MRT and RFD are then computed from the average of 200 simulations. See Materials and methods .
In the present work, we harness our previous kinetic model of DNA replication [ 7 ] to predict MRT and RFD profiles. Discrete, localized potential origins (MCM DHs), randomly chosen from an arbitrary IPLS before S-phase entry, are activated in a stochastic manner by interaction with limiting firing factors that engage with forks and are released at termination ( Fig 1 ). As each MCM DH is given the same probability to fire, the non-uniformity of the obtained replication profiles only comes from the non-uniformity of the IPLS (origin density model).
Current experimental evidence has not yet clearly distinguished between the origin affinity and origin density models. ORC and MCM abundance profiles, which presumably reflect potential origin density, are well correlated with DNase I HSS and early MRT [ 9 – 11 ]. Furthermore, ORC- or MCM-based IPLSs produced realistic MRT profiles in Gindin-like simulations [ 10 , 11 ], which supports the origin density model. However, our comparison of ORC, MCM and RFD profiles of the Raji cell line showed that at constant MRT and transcription level, ORC and MCM densities are similar in initiation, random replication and termination zones [ 9 ]. Furthermore, a recent landscape of MCM DH footprints revealed that MCM DH abundance increases from early to late replication domains, which contradicts the origin density model [ 12 ]. The two latter studies suggest that potential origins may be more widespread than initiation sites but have different firing efficiencies, perhaps due to specific MCM or histone modifications affecting their affinities for firing factors, in line with the origin affinity model.
Modelling studies indicate that the reproducible spatial structure of genomic replication profiles can emerge from stochastic firing of individual origins [ 41 , 42 ]. Gindin et al [ 42 ] built a kinetic model in which the time-dependent probability of initiation at a yet unreplicated site was the product of the time-dependent availability of a limiting factor by the time-independent, local value of a genomic “initiation probability landscape” (IPLS). Of the various genomic and epigenomic profiles used as estimates of IPLS, DNase I HSS profiles produced the best match with experimental MRT profiles (Pearson correlation between simulated and experimental MRT of 0.865). Importantly, the same IPLSs did not produce realistic MRT profiles in models that did not include competition for limiting fork-associating factors [ 42 ]. Since this model did not explicitly separate origin licensing and firing, however, it remained unclear whether the IPLS reflected an inhomogeneity in potential origin density, or affinity for firing factors, or both.
What mechanisms could regulate origin firing? Modelling studies showed that a probabilistic interaction of potential origins with rate-limiting firing factors, engaged with forks and recycled at termination events, can predict the time-dependent profiles of origin firing rate and fork density universally observed in eukaryotes [ 7 , 32 , 33 ]. Experimental studies indeed suggested that rate-limiting activators regulate replication kinetics in yeast [ 34 , 35 ] and metazoans [ 36 , 37 ]. Thus, a simple model for replication regulation is that potential origins fire at different mean times because of their different affinities for limiting factors [ 38 ]. Alternatively, potential origins may all have the same affinity for firing factors but their variable density along the genome may determine MRT [ 11 , 39 , 40 ]. We refer to these two distinct models as the origin affinity model and the origin density model, respectively.
IZs can be shared between cell types or specific to a cell type, suggesting epigenetic regulation. They are enriched in DNAse I hypersensitive sites (HSSs) and histone modifications or variants such as H3K4me1, H3K27ac and H2A.Z, that usually mark active transcriptional regulatory elements [ 21 , 22 , 29 ]. H2A.Z was proposed to facilitate origin licensing and firing by recruiting SUV420H1, which promotes H4K20me2 deposition, in turn facilitating ORC binding [ 30 ]. Furthermore, binding sites for the firing factor MTBP were found to colocalize with H3K4me1, H3K27ac, H2A.Z, and other active chromatin marks [ 31 ].
Several experimental techniques allow to monitor origin licensing and firing as well as replication progression during S phase. Origin licensing can be monitored by experimental detection of ORC and MCM proteins, whose profiles are highly though not perfectly concordant [ 9 – 12 ]. In contrast to these potential origin profiles, actual initiation events can be monitored by sequencing purified replication initiation intermediates, such as short nascent DNA strands (SNS-Seq; [ 13 ]) or bubble-containing restriction fragments (Bubble-Seq; [ 14 ]). These two methods are only weakly concordant [ 15 , 16 ]. Other methods monitor replication progression along the genome. Mean replication timing (MRT) profiles have been computed by sequencing newly replicated DNA from sorted cells at different stages of S phase (Repli-seq; [ 17 – 19 ]) or by determining DNA copy number from proliferating cells [ 20 ]. Peaks of early MRT must contain origins, but low resolution (50–100 kb; [ 17 , 18 ]) has long precluded precise origin mapping from human MRT profiles. Replication fork directionality (RFD) profiles, obtained by strand-oriented sequencing of purified Okazaki fragments (OK-seq) were more resolutive (< 5 kb) [ 21 , 22 ]. Okazaki fragments mapping to the Watson and Crick strands are generated by leftward-(L) and rightward-(R) moving forks, respectively. OK-seq therefore reveals the proportions of R and L forks at any locus in a cell population. RFD is defined as the difference between the local proportions of R and L forks. RFD varies from -1 to 1, with these extreme values indicating that 100% of the forks are moving leftward or rightward, respectively. RFD profiles revealed that: (i) each cell line contains 5,000—10,000 broad (10–100 kb) initiation zones (IZs), characterised by a left-to-right upward shift in RFD; (ii) IZs often but not always flank active genes; (iii) termination events occur in broad zones (TZs), characterised by a downward RFD shift; (iv) TZs can be directly adjacent to IZs or separated from them by extended regions of unidirectional replication; (v) large randomly replicating regions, characterised by extended segments of null RFD, are observed in silent heterochromatin. OK-seq IZs were confirmed genome-wide by EdUseq-HU [ 23 ], high-resolution Repli-Seq [ 19 ], Optical Replication Mapping (ORM) [ 24 ] and by replicative DNA polymerase usage mapping (Pu-seq) [ 25 ]). Importantly, initiation events may additionally occur outside IZs, but in a too dispersed manner to be directly detected in cell population profiles. Recent single-molecule and OK-seq analyses of the yeast genome [ 26 , 27 ], of two model chicken loci [ 28 ], and ORM analysis of the human genome [ 24 ] provided direct evidence for dispersed initiation between efficient IZs in these systems.
In eukaryotes, chromosome replication starts at multiple sites referred to as replication origins [ 1 ]. Origins are licensed for replication during the G1 phase of the cell cycle, when the origin recognition complex (ORC) loads the MCM2–7 replicative helicase in an inactive, double hexameric ring form (MCM DH), around origin DNA [ 2 – 5 ]. This symmetric configuration prepares the helicases to initiate bidirectional replication upon activation. Origin activation (or firing) can take place at different times through S phase, by binding of multiple firing factors that trigger origin DNA unwinding, convert the inactive MCM DH into two active Cdc45/MCM/GINS helicases that each encircles and translocates 3’-to-5’ along a single DNA strand, and recruit DNA polymerases and accessory factors for processive replication [ 6 ]. Only a fraction of MCM DHs lead to productive initiation events, while the rest is inactivated by passing replication forks originating from other origins. This origin passivation mechanism [ 7 ] cooperates with MCM2–7 loading restriction to G1 phase to prevent rereplication in a single cell cycle [ 8 ].
2 Results
Note: a nomenclature of all the symbols used in the text is given in the supplementary information (S1 Text).
Dependence of replication kinetics on model parameters We first analyzed K562 S-phase duration T S using the median of the times to replicate 95%, 99% and 100% of the genome, T95, T99 and T100 respectively. As expected, each of these three times decreased with the density of firing factors ρ F and the fork speed v (Fig 7C and 7L) as we are in a regime of strong affinity between firing factors and potential origins (large k on ) so that [7]. The model predicted much larger differences between T100 and T99, than between T99 and T95, consistent with the latest replicated regions being the most devoid of potential origins. Indeed, for a genome-averaged distance d PO = 20 kb, the predicted distance between potential origins increased from a short 2 kb value in MRT < 0.15 regions to 380 kb in MRT > 0.85 regions (Fig I in S1 Text). This observation also explained that (i) the cell-to-cell variability of T95 or T99 (∼ 10 min) was much smaller than that of T100 (hours) (Fig J in S1 Text); (ii) increasing d PO increased T100 to a much greater extent than T95 or T99 (Fig 7F); and (iii) adding random initiation decreased T100 to a much greater extent than T99 or T95 (Fig 7I). The latter effect decreased with increasing r, consistent with the latest replicated regions being fully devoid of origins when r = 0 (Fig 7I). Consistenly, many late replicating regions showed flat MRT and null RFD profiles characteristic of random initiation [21], but the even later-replicating, common chromosomal fragile sites (CFSs), showed an origin paucity that explains their fragility [52]. Experimentally reported S phase lengths were closer to T95 than T100. Using the reference set of parameters (ρ F = ; r = r* = 0%, d PO = 20 kb, v = 1.5 kb.min−1), the predicted T95 was 8.6 h for HeLa (experimental estimate 8.8 h; [53]), 13 h in K562 (experimental estimate 12 h; [47]) 13.3 h in Raji, and, assuming a fork speed of v = 2.0 kb.min−1 as found in the closely related JEFF cell line [49], 10.7 h for GM06990 (experimental estimate 10 h; [44]). One probable explanation is that experimental detection of S phase markers misses the earliest and latest S phase cells, when the replication rate is the lowest. Indeed, very late replication of specific sequences was reported to linger during what is termed the G2 phase of the cell cycle [54], and S phase length variations around the mean ranged from minutes to hours depending on cell lines and detection methods [55, 56]. Within the parameter range we explored, T95 variations (∼ 10 min) were smaller than for T100 (hours) (Fig J in S1 Text). Experiments may therefore have underestimated the exact duration of S phase. Another possibility is that we underestimated r or v, since increasing either parameter above its reference value efficiently reduced T100 without much compromising the correlations of simulated and experimental MRT and RFD (Fig 7G, 7H and 7I, Fig D and Fig E in S1 Text). In our simulations, the faster replication of HeLa cells was explained by a larger number of firing factors than in GM06990 (2.0-fold), K562 (1.9-fold) and Raji (1.9-fold). Thus, the optimisation procedure selected a density of firing factors ρ F that gave relative S phase durations consistent with experimental measurements using sensible values for d PO and v. The origin firing rate per length of unreplicated DNA, I(t), was previously reported to follow a universal bell-shaped curve in eukaryotes [7, 33]. As expected from the choice of the k on value, the simulations produced the expected shape for the four cell lines with a maximum of I(t), I max between 0.01 and 0.02 Mb−1.min−1, in reasonable agreement with a crude estimate of 0.03–0.3 Mb−1.min−1 from low-resolution MRT data (Fig K in S1 Text) [33]. Finally, we measured in K562 the dispersion of replication times (RT) of each locus as a function of its MRT. A recent high-resolution Repli-Seq study [19] reported that RT variability, estimated as the width between the first and the third quartiles of RT distribution, increased from early to mid S phase and decreased thereafter. For most regions the RT variability was in the 1.25–2.5 h range. This study used the replicated genome fraction, as estimated by FACS analysis, as a proxy for RT, and assumed a constant S-phase length of 10 h. When using the replicated genome fraction as a proxy for RT as in [19], our simulations in K562 produced a similar bell-shaped of computed RT variability (Fig L in S1 Text, left). When using the true simulated time, however, the computed RT variability increased throughout S phase, as expected for stochastic origin firing (Fig L in S1 Text, right). These results imply that the bell shape previously reported was an artefact of the non-linear relationship of RT with replicated fraction: the rate of replication is not constant and therefore the replicated fraction is not a suitable proxy for RT to analyse RT dispersion. In summary, the kinetic parameters of S phase predicted by our stochastic model were (i) consistent with the reported time-dependencies of the firing rate I(t); (ii) consistent with the reported variability of replicated genome fraction at constant MRT; and (iii) predictive of relative S phase durations. The variablity of true RT increased through S phase, as predicted for stochastic origin firing.
[END]
---
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011138
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/