(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .



Generating information-dense promoter sequences with optimal string packing [1]

['Virgile Andreani', 'Biomedical Engineering Department', 'Boston University', 'Boston', 'Massachusetts', 'United States Of America', 'Biological Design Center', 'Eric J. South', 'Molecular Biology', 'Cell Biology']

Date: 2024-08

Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs sets of 20–100 binding sites into dense nucleotide arrays of 50–300 base pairs in 0.05–10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts.

The way protein binding sites are arranged on DNA can influence the regulation and transcription of downstream genes. Areas with a high concentration of binding sites can enable complex interplay between transcription factors, a feature that is exploited by natural promoters. However, designing synthetic promoters that contain dense arrangements of binding sites is a challenge. The task involves overlapping many binding sites, each typically about 10 nucleotides long, within a constrained sequence area, which becomes increasingly difficult as sequence length decreases and binding site variety increases. We introduce an approach to design nucleotide sequences with optimally packed protein binding sites, which we call the nucleotide String Packing Problem (SPP). We show that the SPP can be solved efficiently using integer linear programming to identify the densest arrangements of binding sites for a specified sequence length. We show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The presented approach enables the rapid design and study of nucleotide sequences with complex, dense binding site architectures.

Data Availability: The solver is implemented in an open-source library available at: https://gitlab.com/dunloplab/dense-arrays . The code used to generate the data and make the figures from this paper is available on the “paper” branch of this repository. All dense array sequences generated in the manuscript are available in the repository. The /benchmarks folder provides detailed information for each dense array.

After proving that the SPP is NP-hard, we reduce it to the Orienteering Problem with integer distances, which is an optimization problem related to the Traveling Salesman Problem [ 44 ]. This reduction allows us to formulate nucleotide string packing as an integer linear programming problem, which can then be solved efficiently with a variety of open-source and commercial solvers. This formulation not only underscores the computational complexity of the design task but also lays the groundwork for generating sequences that accommodate a maximal number of protein binding sites. The expressivity of integer linear programming also allows for a variety of design objectives and constraints, which can be used to tailor the specifics of the output nucleotide sequence. We demonstrate how a modification to the model makes it possible to adjust the order of solutions returned to favor a more uniform representation of binding sites across sequences. We also demonstrate that by adding additional constraints to the SPP, like the inclusion of sequence elements with fixed positions, we can effectively interweave dense regions of binding sites around -35 and -10 sigma factor recognition sites. This approach enables the design of complex bacterial promoters that interact with multiple sigma factors. The integer linear programming model and all its extensions have been made available to the community as part of an open-source Python library available at https://gitlab.com/dunloplab/dense-arrays . Equally applicable to large-scale libraries and more focused studies, this computational approach serves as a resource for designing nucleotide sequences with complex DNA-protein binding architectures.

Here we present a novel computational method for the design of nucleotide sequences with densely packed DNA-protein binding sites, which we name the nucleotide String Packing Problem (SPP), related to the classical Shortest Common Superstring problem in theoretical computer science [ 43 ]. We deliberately use the terminology “string” instead of “sequence” in the problem name to avoid confusion with the concept from computer science of a “subsequence,” which is not necessarily contiguous. This distinction is crucial for distinguishing the SPP, where an individual binding site must not be split, from a potential “Sequence Packing Problem” where the sequence AGGA of length 4 would fit the three elements AA, GG, AGA (among others), due to the non-contiguous nature of subsequences.

Designing nucleotide sequences with overlapping binding sites becomes challenging when the total length of the desired binding sites, each typically about 10 nucleotides long [ 27 ], surpasses the fixed length of the intended output sequence. The overall problem increases in complexity as the pool of binding sites expands, causing an exponential increase in potential binding site configurations. Studies that have tackled the design of overlapping sequences, and their effects on transcriptional regulatory logic, often rely on ad hoc methods or generate a limited set of short sequences [ 26 , 28 – 30 ], thus restricting the generation of large libraries. Meanwhile, generative AI techniques are starting to show promise in emulating the complexity of context-dependent promoters [ 31 – 37 ]. Many of these models are trained on large datasets of natural sequences [ 38 – 40 ], leading to synthetic promoters that mimic these natural examples. However, these models may struggle to generate sequences with cis-regulatory logic that is not present in natural genomes, as they tend to produce sequences within their training distribution. The choice of training data significantly shapes the model, mirroring assumptions about the sequence distributions to explore [ 41 ]. Consequently, there is growing interest in using synthetic DNA to generate training data to facilitate the discovery of novel expression responses [ 42 ].

As the complexity of cis-regulatory regions becomes more evident, researchers have begun to employ forward engineering approaches to systematically map promoter sequences to transcriptional readouts. One of the most prominent methods in this regard is the use of massively parallel reporter assays (MPRAs) [ 15 – 17 ]. These assays link the expression of a reporter gene to a specific cis-regulatory variant, which is often situated upstream on either an episomal or genomically integrated locus [ 18 ]. Traditionally, the design of promoter variant libraries for MPRAs has followed one of two different strategies: either diffuse nucleotide diversification, achieved through methods like error-prone PCR or random mutagenesis [ 19 , 20 ], or hybrid engineering approaches that fuse core promoter subregions with other discrete binding site elements [ 21 , 22 ]. In hybrid engineering approaches, the design of sequence variants for MPRAs has typically focused on adjusting binding site spacing and consensus sequences, often placed in adjacent or proximal positions, while overlooking the potential for overlapping binding sites [ 15 , 23 – 26 ]. While these studies offer insights into how binding site positioning and biophysical constraints inform promoter strength, their general omission of overlapping binding sites has limited the characterization of nonlinear, emergent properties that can arise in natural densely arranged cis-regulatory regions.

The layout of binding sites within nucleotide sequences can define both biomolecular interactions and subsequent biological processes. For example, the architecture of cis-regulatory regions—stretches of non-coding DNA containing binding sites for transcription factors and other regulatory proteins—plays a key role in controlling the expression of downstream genes. These regions serve to recruit regulators and convert cellular signals into transcriptional outputs [ 1 – 4 ]. Regions with few binding sites are thought to enable transcription factors to bind and then permit higher overall transcription levels, even when these proteins are in limited quantities [ 5 ]. Conversely, dense clusters of binding sites can elicit complex phenomena such as cooperative binding effects [ 6 – 9 ], steric hindrance [ 10 , 11 ], and transcription factor sharing [ 5 ]. These emergent properties can lead to nonlinear transcriptional regulatory logic, wide dynamic ranges, and context-sensitive gene expression [ 12 – 14 ].

Results

Complexity of the String Packing Problem (SPP) The problem of packing DNA-protein binding sites into a DNA sequence of a fixed length can be formulated more generally as an optimization problem on strings: consider a finite alphabet Σ, a collection R of strings from Σ*, and a natural integer L. The goal is to find a string w of length L that maximizes the number of different strings of R contained in w, that is, the number of strings x ∈ R such that there exists w 0 ∈ Σ* and w 1 ∈ Σ* such that w = w 0 ×w 1 (Fig 1A). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 1. Formulation of the nucleotide String Packing Problem (SPP) as an Orienteering Problem (OP). (A) The SPP consists of fitting as many different strings from a binding site collection R as possible inside a string of a given length L. bp, base pairs. (B) We define an asymmetrical metric between strings i and j, d ij , as the least number of shifts needed for the prefix of the second string to match the suffix of the first string. (C) After adding Start and End vertices, we construct the complete graph between binding site strings with the associated shift, d ij , between strings shown. The OP is solved on this graph. For the double-stranded problem, we also include the reverse-complements of all the binding sites, omitted here for clarity. (D) Representative solutions for two different values of L. https://doi.org/10.1371/journal.pcbi.1012276.g001 To determine the complexity of this problem, let us first consider the associated decision problem (SPP-DECISION): given the alphabet Σ, the collection of strings R, two natural integers L and N, is there a string of length L or shorter that contains N or more of the strings of R? We will show that this problem is NP-complete, that is, it belongs to the set of NP problems at least as hard as all other NP problems. To show that SPP-DECISION is NP-complete, we first need to show that it is an NP problem, and then that it is at least as hard as all other NP problems. (1) To show that it is an NP problem, we need to show that we can construct certificates of positive answers, with sizes that are polynomial in the size of the input, which can then be verified by a polynomial-time algorithm. (2) To show that SPP-DECISION is at least as hard as all other NP problems, we will proceed by reduction from another NP-complete problem: the decision problem corresponding to the Shortest Common Superstring problem (SCS-DECISION) [45] (as described in [43]). The proof follows: SPP-DECISION is in NP: A certificate of existence of a superstring can be a list of the strings included in it with their offsets from the beginning of the superstring. The size of this certificate is polynomial in the size of the problem input. Moreover, this certificate can be verified by a polynomial-time algorithm. SPP-DECISION is then in NP. SCS-DECISION is reducible to SPP-DECISION: The SCS-DECISION problem is closely related to the SPP-DECISION problem, in that given Σ, R, and L, the question is to determine if there exists a string of length L or shorter that contains all strings from R. Considering any SCS-DECISION problem described by (Σ, R, L), we can build an SPP-DECISION problem described by the same alphabet Σ, collection of strings R, and integer L, and where N = |R|. A positive (respectively negative) answer to this SPP-DECISION problem implies a positive (respectively negative) answer to the original SCS-DECISION problem, which shows that SCS-DECISION is reducible to SPP-DECISION, and thus that SPP-DECISION is NP-complete. Let us now consider the original optimization problem, SPP, which is not a decision problem. We will show that it is NP-hard, that is, every problem in NP is reducible to it. To show this, we will show that SPP-DECISION, which we just showed is NP-complete, is reducible to the SPP. Let us consider any SPP-DECISION problem, described by (Σ, R, L, N). We can construct the SPP described by (Σ, R, L). Let us call M the number of different substrings (i.e., binding sites) of R included in the optimal solution to this problem. It is enough to compare M to N to answer the SPP-DECISION problem, thus showing that SPP-DECISION is reducible to SPP and therefore that SPP is NP-hard.

Performance and scalability Brute force approaches to solving the SPP are not scalable, due to the NP-hardness of the task. No polynomial-time algorithms are known to solve it optimally, and the only deterministic algorithms known all eventually rely on an exponential number of steps in the worst case. But the choice of the approach greatly impacts the practical solve time. For example, for a binding site collection with |R| = 50 binding sites and a sequence of length L = 200 base pairs, one brute force approach could be to enumerate all possible 4200 ≈ 10120 sequences, which is computationally infeasible. Despite their computational challenges, many NP-hard problems hold significant practical value. This has led to the development of specialized solvers designed to efficiently manage these problems, although they may face difficulties with extremely large instances [46]. Many numerical solvers have been developed to tackle the NP-hard integer linear programming problem, which restricts linear programming to integer variables. We formulated the SPP as an integer linear programming problem by reducing it to a variant of the Travelling Salesman Problem (TSP) known as the Orienteering Problem (OP) [44,47,48] (see Methods for the integer linear model). This formulation abstracts away the notion of nucleotides and considers instead the interaction between binding sites through a shifting metric defined between every pair of binding sites (Fig 1B), leading to a dramatic decrease in the complexity of the problem. Indeed, a brute force approach to solve this formulation would now only need to enumerate all possible paths in the graph visiting each binding site at most once (Fig 1C and 1D). For the binding site collection considered above (|R| = 50 and L = 200), there are about 1065 such paths (as calculated using the approach in Ref. [49]), which is still astronomically large, but much less than 10120. Thus, the same problem can be formulated in different ways, and the amount of computational work needed to solve them can be very different. The algorithms in integer linear programming are also exponential but are designed to dramatically decrease the exponential constant. While they cannot cheat the exponential for very large tasks, they embed heuristics that help them to quickly find an optimal solution in many cases. Here, thanks to their internal algorithms and heuristics, the integer linear programming solvers do not have to explore all 1065 paths of the OP, and thus are able to find a solution for a problem of size |R| = 50 and L = 200 highly efficiently, returning solutions within seconds. The strength of our approach relies on two factors: the formulation of the initial SPP as a graph problem abstracts the notion of nucleotides and eliminates the need to represent them individually, which drastically cuts down the search space, and the use of an optimized solver further reduces the complexity by tackling this problem using state of the art algorithms. Solve times for the SPP are expected to be dependent on the parameters of the problem (size of the binding site collection, |R|, and length of the sequence, L). To investigate the influence of these parameters, we considered collection sizes, |R|, between 10–100 distinct binding sites, each with a random length between 5 and 15 base pairs, where these lengths are based on known protein-DNA interactions [27]. We also considered typical lengths of cis-regulatory regions between 20–300 base pairs [50] to use as our sequence length, L. Using the highly efficient Gurobi solver as a backend for the integer linear programming, we found that the solve times were rapid (Fig 2A). For example, the collection with |R| = 50 binding sites and a sequence length of L = 200 base pairs can be solved in 2 seconds on a laptop. We found that solve time scales proportionally with the number of variables of the model, that is, with the square of the binding site collection size. The solve time is also largely independent of the length of the sequence (Fig 2B), which may be explained by the fact that this parameter does not change the number of variables or the structure of the model. It is also possible that increasing the length of the sequence may increase the number of optimal solutions, making it easier for the solver to find one of them, despite their increasing complexity. The solve times range from fractions of seconds for collections of size |R| = 20, to 10 seconds for collections of size 100, making this is an efficient and accessible method for practical nucleotide design tasks. PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 2. Performance of the SPP. Solve time as a function of (A) binding site collection size, |R|, and (B) sequence length, L. All experiments were run 10 times, with the Gurobi v10.0.1 backend. The shaded regions represent the bootstrapped 95% confidence interval around the mean. All replicates use a different, randomly sampled binding site collection, with random binding sites of uniformly random length between 5 and 15 base pairs. The binding site collection size and sequence length are specified on the figure. Double-strand optimization is performed here. For all these experiments, Gurobi was given access to an Intel Xeon E5-2650 v2 CPU (16 logical threads) and its memory usage was under 3 GB. A timeout of 600 seconds was set, reached only 8 times out of 1500, which we excluded from these data. https://doi.org/10.1371/journal.pcbi.1012276.g002

Heuristic approaches to solve the problem In some cases, it is not necessary to reach the optimal solution to an NP-hard problem if a suboptimal but sufficiently good solution is acceptable. It can be much easier to find suboptimal solutions, and this can often be achieved by “approximation algorithms,” many of which run in polynomial time. For instance, in the case of the TSP, the nearest neighbor algorithm—always visiting the closest city not yet visited—operates in time proportional to the square of the number of cities, yet it may yield solutions significantly worse than the optimal. For metric TSPs (i.e., when the distance respects the triangle inequality), the Christofides–Serdyukov algorithm is a polynomial-time algorithm that is guaranteed to return a solution no longer than 1.5 times the optimal solution [51–53]. Approximation algorithms also exist for the SCS problem [54]. The greedy algorithm (repetitively merging the two strings with the best overlap, until there is only one remaining) will return a superstring at most 3.425 times longer than the optimal string. A recent approximation algorithm allows this factor to be lowered to 2.475 [55]. Thus, if adapted to our problem, a similar strategy would create an array in the worst case less than half as dense as it could be. To test if an approximation method could produce results comparable to the optimal results returned by our integer linear programming approach, we employed a greedy algorithm. This algorithm involves repetitively merging the two binding sites with the best overlap until only one is left (in the worst case this is 3.425 times longer than the optimal solution). If this resulting sequence is shorter than the targeted sequence length, while fitting all binding sites in the collection, it is an optimal solution and is thus returned. If the sequence extends beyond the target length, we apply a sliding window corresponding to our target length, and then locate the sequence segment containing the highest number of binding sites; this segment is subsequently returned. In a numerical study using randomly generated sequences to represent binding sites, we compared the number of binding sites in the solutions returned by this greedy algorithm to the optimal number computed with our integer linear programming formulation. We found that the approximation algorithm often returned fewer binding sites than optimal (60% of the optimal in the worst cases), with large binding site collection sizes (|R|) and compact sequence lengths (L) posing challenges that only the integer linear programming formulation was able to tackle (S1 Fig). Meta-heuristics—general strategies applicable to a range of problems—are an alternative approach and have shown promise in identifying quality suboptimal solutions. Genetic algorithms, for instance, have shown efficacy in similar nucleotide overlapping problems [28,29]. However, the practical application of these algorithms is limited by difficulties in tuning them to yield good solutions quickly.

Diversity of the generated solutions Next, we returned to our integer linear programming approach and quantified the diversity of sequence solutions returned by the algorithm. Some applications might require the generation of multiple sequences, composed of binding sites from the same collection, for the purposes of library curation or biological screening. For example, generating diverse DNA sequences from the same binding site collection can enable nuanced studies in functional genomics, or facilitate comparative analysis during either drug screens or synthetic biology circuit optimizations. We found that maximizing the number of sites in a sequence introduces a bias for smaller binding sites (S2 Fig). Interestingly, we found that this bias is not exclusively tied to size variation, as it manifests even among equally sized binding sites. To demonstrate this, we created random binding site collections of |R| = 10 binding sites of 10 base pairs each. We enumerated the best solutions for a final sequence of length L = 50 base pairs: in all cases, six binding sites were fit in this sequence at best (score = 6) (Fig 3A). For the three random collections we tested, the number of different sequences realizing this best score were 1854, 2654, and 10,922, where the higher numbers correspond to a case where there were more ways to create sequences composed of 6 binding sites, for example due to similar sequences in the starting binding site collection. If the frequency of binding site inclusion was perfectly uniform, we would expect each binding site to occur with a frequency of 0.1 (1/|R|). However, we found that every binding site was not equally likely to be part of these best solutions (Fig 3B). To rule out the chance that these findings were simply due to sampling variations, we used a control experiment where we randomly selected six binding sites from the collection as many times as there were solutions, removing the constraint that these sites needed to fit into a sequence. To compare the sampling bias between these two scenarios, we sorted binding sites by decreasing frequency (Fig 3B and 3C). The flatter the line, the more uniform the distribution is. Conversely, a steeper slope indicates a less diverse distribution, tending to favor certain binding sites over others. We found that the creation of dense arrays inherently tends to favor specific binding sites, shaped by the unique interplay among sites within each specific binding site collection (Fig 3D). We next investigated the reasons for discrepancies in the representation of binding sites that were all the same length, focusing on results across the top scoring dense arrays. We compared each binding sites’ average shift metric d ij to and from other binding sites, representing their average overlap with other binding sites, with their rate of appearance in top-scoring solutions. As expected, we found that the binding sites with the highest average overlap with others were more represented in the top-scoring solutions (S3 Fig). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 3. Assessing binding site representation across dense arrays. (A) For the |R| = 10 randomly generated binding sites of Collection 1 we found 1854 optimal sequences (each fitting 6 binding sites within L = 50 bp). We then tallied up the binding sites involved in these sequences. (B) The frequency of binding site distribution is represented as a histogram. Three random binding site collections of |R| = 10 binding sites, each of 10 base pairs in length, were used to create sequences of length L = 50 bp. The bar plots show the frequency at which each of the 10 binding sites was present in these sequences. Optimal solutions using the dense array approach with the SPP (blue) and using random samples of 6 binding sites (gray) are shown. (C) Sorting the histogram gives an idea of the heterogeneity of the distribution, where larger differences between the frequency of binding site usage correspond to larger sampling biases. (D) The histograms from (B) were first sorted by decreasing frequency, then assembled on this plot for comparison. The slope of the curve is an indicator of the diversity (or lack thereof) in the corresponding distributions. https://doi.org/10.1371/journal.pcbi.1012276.g003 Bias in binding site selection due to size (S2 Fig) or sequence (Fig 3D) can potentially be transiently amplified by solver-specific tendencies: when faced with solutions of equal score and no additional constraints, the solver is free to generate outcomes in an order that reflects its intrinsic algorithms, often more akin to local exploration than direct sampling. Indeed, different solvers return the same solutions of equal score in different orders (S4 Fig). This implies that the distribution of the binding sites involved in the first few solutions of equal score (for instance, those containing 6 binding sites), may not accurately represent the distribution of the entire set. To create a sample of the best solutions with a more balanced distribution of binding sites, one could ask the solver for a much larger number of solutions and filter this set to homogenize the representation of binding sites. However, this is computationally wasteful, as many of the generated solutions are discarded.

Generation of more representative solution sets We address the issue of binding site representation bias by making a minor adjustment to our integer linear programming model. By maintaining a count of the binding sites included in previously generated solutions, we adjust the objective of the integer linear program to favor those binding sites that are underrepresented compared to the average. Concretely, we modify the objective to where c ij is set to 1 for j corresponding to binding sites more represented than the average so far, and to 1+ϵ for binding sites less represented than the average. The value of ϵ is irrelevant if it is sufficiently small to never allow the score of a solution with fewer binding sites to overcome the score of a solution with more binding sites (any strictly positive value that is smaller than 1/n meets this criterion). We apply this change after every generated solution. This change allows the solver to break ties between solutions in a different way each time. We refer to the original solution method as “solver order,” because the specific solver implementation determines how ties are broken. We refer to our modified version of this approach that directs the algorithm towards the use of underrepresented sites as “diversity-driven order,” because the addition of the weighting terms controls how the solver breaks ties when selecting which binding site to include (Fig 4A). PPT PowerPoint slide

PNG larger image

TIFF original image Download: Fig 4. A minor modification to the model fosters a diverse representation of binding sites as solutions are returned. (A) Schematic representing binding site selection frequencies using solver order (top) or diversity-driven order (bottom). The color intensity of each vertex represents the absolute counts with which it was included in a solution. (B) Solver order, using the Gurobi solver, and (C) diversity-driven order approaches generate different frequencies of binding site representation. Each color represents a different binding site (|R| = 10 binding sites). The three columns represent the three binding site collections presented in Fig 3, with respective number of top-scoring solutions of 1854, 2654, and 10,922. (D) The distribution entropy of the solver order and diversity-driven order strategies. In all cases, the entropy of the diversity-driven order is the same or higher than the entropy of the solver order. https://doi.org/10.1371/journal.pcbi.1012276.g004 Whereas the solver order strategy leads to a pool of solutions that never represents every binding site equally (Fig 4B), we found that the diversity-driven strategy is effective at generating solutions in an order that transiently improves inclusivity in binding site representation (Fig 4C). In this example, |R| = 10 so equal representation of binding sites would have their frequency of appearance at 0.1. Both approaches ultimately result in the same binding site distribution because in the limit where they generate the maximum number of top-scoring solutions, the two strategies produce the same set of solutions. Critically, what is different is the order in which these solutions are returned before this maximum is reached. To summarize the results across the different solution methods, we calculated the entropy of the solver order and diversity-driven order strategies (Fig 4D). Higher entropy values correspond to more diverse solutions, where the theoretical maximum is log(|R|), reached only if all binding sites have been uniformly involved (with frequency 1/|R|) in the solutions. The diversity-driven order solutions exhibited a rapid increase in entropy that was maintained across many rounds of binding site selection (Fig 4D), demonstrating its efficacy in generating solutions with broad representation. The diversity-driven order also produced higher entropy distributions than the solver order for collections with size differences among binding sites (S5 Fig).

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012276

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/