## Abstract

Gene regulatory networks (GRNs) are quite large and complex. To better understand and analyse GRNs, mathematical models are being employed. Different types of models, such as logical, continuous and stochastic models, can be used to describe GRNs. In this paper, we present a new approach to identify continuous models, because they are more suitable for large number of genes and quantitative analysis. One of the most promising techniques for identifying continuous models of GRNs is based on Hill functions and the generalized profiling method (GPM). The advantage of this approach is low computational cost and insensitivity to initial conditions. In the GPM, a constrained nonlinear optimization problem has to be solved that is usually underdetermined. In this paper, we propose a new optimization approach in which we reformulate the optimization problem such that constraints are embedded implicitly in the cost function. Moreover, we propose to split the unknown parameter in two sets based on the structure of Hill functions. These two sets are estimated separately to resolve the issue of the underdetermined problem. As a case study, we apply the proposed technique on the SOS response in *Escherichia coli* and compare the results with the existing literature.

## 1. Introduction

Gene expression is a process that transcribes information of genes and translates it into functional gene products. These products are proteins that play a central role in performing numerous cellular functions. Spatio-temporal expression of these products is controlled by close interaction of different genes with each other, which is commonly known as gene regulatory network (GRN). The large and complex nature of GRNs limits our ability to understand them intuitively [1,2]. Therefore, mathematical models are constructed to get better insight into and understanding of these complex phenomena.

Mathematical models that describe GRNs are of three major types: logical models, stochastic models and continuous models [1]. Logical models [3], e.g. Petri nets, Boolean and Bayesian networks, can only describe qualitative behaviour of GRNs. Stochastic and continuous models, represented by ordinary differential equations (ODEs), can describe the dynamic behaviour of GRNs quantitatively [1,2]. GRNs like any natural system involve randomness and this randomness becomes significant especially when the numbers of molecules are small [4,5]. Single-molecule level models [6] and stochastic differential equations [7] are applied to describe quantitative behaviour with randomness. Stochastic models are complex, computationally expensive and suitable only for a small number of molecules [8]. ODEs, on the other hand, are relatively simple, well studied and can easily simulate the quantitative behaviour of GRNs. These attributes make ODEs preferable whenever randomness is negligible. ODEs are mainly of two types, i.e. linear and nonlinear. Analytical solution of linear ODEs is found easily, but they can be used only in establishing the qualitative behaviour of GRNs [1,9]. GRNs, like most of the physical phenomena, are nonlinear. Therefore, nonlinear ODEs are used in modelling and quantitative analysis of GRNs. Moreover, nonlinear ODEs can easily simulate regulation and feedback effects of genes.

The mathematical model has unknown parameters whose value has to be obtained through experimental data of the biological system. The dynamics of a biological process are captured using time-series experiments that sample the process for measuring the concentration of gene products at different times. These dynamic time-series data can be used for parameter estimation of the ODE model [10]. Data for gene expressions are usually sparse [1,10], i.e. measurements are taken at only a few time points because of limitations of experimental set-up and high expense associated with these experiments. Sparseness of the data forms an underdetermined problem, i.e. the number of equations generated from the experimental data are small compared with the number of unknown parameters of ODEs. Besides sparseness, these measurements suffer from noise that may cause inaccuracies in parameter estimation. Methods for estimating ODE parameters from time-series data can be classified into three broader categories: classical, discretization and collocation methods. Classical techniques are based on first-order Taylor series expansion. First-order expansion for replacing nonlinear structures is only suitable for small durations and mild nonlinearities [11]. Discretization methods, e.g. [12], employ direct numerical solution of ODEs for data fitting. Therefore, ODE parameters are estimated along with initial conditions, which results in an increased number of unknowns. As discretization methods use numerical solutions they are computationally intensive. Moreover, an inherent disadvantage of these methods is that they can result in inaccurate parameter estimation because of high sensitivity with initial conditions [13,14].

Collocation methods [11,15] use basis systems such as polynomials, splines and Fourier basis, etc. Instead of fitting the experimental data directly to the numerical solution of ODEs, data are fitted to a function of basis systems. This function includes coefficients that are determined first and then parameters of ODEs are estimated through these coefficients. Thus, these methods avoid computationally expensive numerical solutions. Another important advantage of these methods is the estimation of initial conditions as a side product of the procedure. On the other hand, solution of nonlinear ODEs is very sensitive to initial conditions and can lead to inaccurate estimates of parameters. The most common variant of collocation methods is the generalized profiling method (GPM) [11]. The GPM employs cascaded optimization. The outer optimization is for parameters of ODEs. The inner optimization is for coefficients of the function of the basis system. The GPM produces better estimates than some of the other variants of collocation methods [15,16]. Detailed discussion of the GPM is given in §2.2

As stated earlier, the optimization problem for GRNs is usually underdetermined because of sparse data [10] and large number of unknowns involved in modelling with ODEs [2]. Moreover, it is a constrained optimization problem due to different reasons such as protein concentration cannot be negative. Nonlinear optimization in collocation methods can be solved by different algorithms. Global optimization algorithms, e.g. particle swarm [17], are very attractive for their convergence to a global solution. However, these algorithms use sampling approach, i.e. they need a large number of function evaluations at each optimization iteration. Function evaluation is computationally expensive in collocation methods because of the cascaded optimization structure. Therefore, algorithms based on a sampling approach are not suitable. Moreover, global optimization algorithms are suitable for a small number of unknowns. Researchers are focusing to improve these two aspects of global optimization algorithms, (e.g. [18,19]). The trust-region-reflective algorithm [20] and the Levenberg–Marquardt algorithm [21] are two other popular techniques of nonlinear optimization. A limitation of the trust-region-reflective algorithm is its inability to deal with underdetermined problems, which is usually the case in GRNs. Levenberg–Marquardt cannot deal with constrained optimization [21].

In this paper, we propose a new optimization approach to solve an underdetermined optimization problem with constraints. This approach is based on Hill functions. Hill functions are considered suitable for building GRN models with ODEs [1,2]. They can quantify activation and inhibition effects of genes. Hill functions are mainly composed of two types of parameters: threshold and cooperativity. Based on the structure of Hill functions, we propose separation of parameters of ODEs into two sets, i.e. threshold and cooperativity parameters. Splitting unknowns into two sets helps in avoiding underdeterminedness. These two sets of parameters are estimated separately with a suitable solver. The two-step estimation is iterated until change in parameters becomes insignificant. Details are given in §3. We take the SOS response in *Escherichia coli* as a case study for the proposed technique and compare the results with work reported in the literature. The case study is discussed in §4.

## 2. Background

The complete process of parameter estimation of a Hill function-based ODE model from experimental genome data is shown as a block diagram in figure 1. Firstly, gene network structure is obtained either from well-established literature on the subject, experimental observations or by applying reverse engineering techniques (e.g. BANJO [22], ARACNE [23] and TSNI [9]). Network structure is used to construct a Hill function-based ODE model of the GRN with unknown parameters. The parameters of the ODE model are estimated by using the experimental data. Our work is focused on parameter estimation based on the generalized profiling method (GPM) [11]. In the GPM, an optimization problem has to be solved to calculate the most optimal parameters. Lastly, the identified mathematical model can be used for dynamic simulations of the GRN. In the following subsections, we provide details of the Hill function-based modelling and the GPM.

### 2.1 Hill function-based ordinary differential equation model of gene regulatory networks

ODEs belong to the category of continuous mathematical models. Concentrations of gene products are considered as the state variables. The rate of change of these concentrations is expressed as a function of state variables. The standard form is as follows:
*x*_{i} denotes the concentration of the product of gene *i* and *N* is the total number of genes in a GRN. The rate of change of a state *f*_{i}(.) of all the states. The concentration of any gene product *x*_{i} is non-negative. Hence the model should be such that *x*_{i}≥0 for *i*=1,…,*N* for all practically possible initial conditions [24].

For modelling of GRNs with ODEs, Hill or sigmoidal functions are employed in the literature [25,26]. Hill curves are preferred because they can adopt sigmoidal shape [2,27] with suitable parameters. Mendes [27] has suggested the following function:
*x*_{i} denotes concentration of the *i*th gene product, *N*} that denotes the set of all inhibiting gene products for the *i*th gene, *i*th gene, *P*_{i} is the synthesis rate constant, *S*_{i} is the degradation rate constant, *h*^{−} is the inhibiting Hill function, *h*^{+} is the activating Hill function, *Q*_{i,j} and *Q*_{i,k} denote the threshold parameters of Hill functions, and the exponents *R*_{i,j} and *R*_{i,k} denote the cooperative parameters. Threshold parameter is the value of the concentration after which significant effect of inhibitor or activator is observed. Cooperative parameter controls how sharply the level transition occurs across the threshold. Activator and inhibitor Hill functions are depicted graphically for different values of the cooperative parameter in figure 2.

If the GRN composed of *N* genes has *M* interconnections, then the complete Mendes model of the form (2.2) requires 2*(*M*+*N*) parameters to fully describe the GRN, which includes *N* synthesis rate constants, *N* degradation rate constants, *M* threshold parameters and *M* cooperative parameters . We can write the model (2.2) in compact form as shown below:
*X*=[*x*_{1}…*x*_{N}]^{T} denotes state variables and *Θ*=[*P*_{1}…*P*_{N} *Q*_{1,2}…*Q*_{N,N−1} *R*_{1,2}… *R*_{N,N−1}*S*_{1}…*S*_{N}]^{T} is the vector of all the parameters. The parameters *Θ* have to be estimated from the experimental data.

It may be noted that all the parameters *Θ* should be positive. In a real GRN it is not possible to have negative synthesis rate constants *P*_{i} or degradation rate constants *S*_{i}. Similarly, the concentration thresholds *Q*_{i,j} or the exponents *R*_{i,j} of the Hill functions cannot be negative. The necessity of positive parameters should be incorporated in the estimation problem.

### 2.2 Generalized profiling method for estimation of parameters

Estimation is a challenging task because of non-availability of an analytical solution of (2.2) and numerical methods are expensive. Collocation methods avoid these difficulties by using polynomial regression. One of the efficient collocation methods is the generalized profiling method (GPM), which provides accurate estimation with low computational load [11]. In the GPM, β-splines are used as polynomial regression of states and their derivatives. The β-splines are preferred for interpolation as they are differentiable. Further details on β-splines can be found in [28]. The states and their derivatives are defined in terms of β-splines as follows:
*c* is column vector of coefficients, while *ϕ*(*t*) denotes β-spline basis systems. Estimation is achieved in two cascaded optimization steps. The inner optimization determines the coefficient vector *θ* such that the following objective function is minimized:
*y*_{i}(*t*) is the experimentally observed concentration of the product of the *i*th gene, *t*_{0} is the starting time of the experiment, *t*_{f} is the ending time of the experiment, _{i} is the weighting parameter. The first term of (2.6) minimizes the sum of squared residuals between the data *y*_{i}(*t*) and the state of the model *f*_{i}(.) in (2.5) and their estimates from β-splines. Weighting factor λ_{i} is used as a smoothening parameter.

The outer optimization determines the estimate of ODEs parameters *c* such that the objective function in (2.6) is minimized. The above-stated objective function is an implicit function of ODE parameters *Θ*. Further details on the GPM can be found in [29].

As mentioned in §2.1, the parameters *Θ* should be positive. Therefore, the above optimization problems are to be solved with the following constraint:

## 3. Proposed method

The nonlinear optimizations in the generalized profiling method in §2.2 have to be solved by numerical optimization algorithms. Having the constraint (2.8) in the optimization problem makes it more complex to solve. Some solvers, such as Lavenberg–Marquardt [21], also do not handle constraints. We propose a reformulation of the mathematical model that allows us to avoid the constraints. Let *P*_{i}=*e*^{pi}, *Q*_{i,j}=*e*^{qi,j}, *R*_{i,j}=*e*^{ri,j} and *S*_{i}=*e*^{si}. The model (2.2) can be rewritten as follows:
*θ*=[*p*_{1}…*p*_{N} *q*_{1,2}…*q*_{N,N−1} *r*_{1,2}… *r*_{N,N−1} *s*_{1}…*s*_{N}]^{T}. The benefit of this change of variables is that we do not have to place any constraints on *θ*. Whether the elements of *θ* are negative or positive, the elements of *Θ* will always be positive because of the exponential function.

As mentioned earlier, the number of time samples in experimental data are usually fewer than the number of unknown parameters, i.e. 2(*M*+*N*), in the ODE model (2.2) or (3.1). Therefore, the nonlinear optimizations in the generalized profiling method in §2.2 are underdetermined. This could result in poor estimates of the parameters. Some of the optimization techniques, such as the trust-region-reflective method [20], are more prone to inaccuracy in case of underdeterminedness.

To solve the issue of underdeterminedness, we propose a new approach. The idea is based on the structure of Hill functions. The cooperative parameters *R*_{i,j}=*e*^{ri,j} in the Hill functions control the sharpness of the switching action around the threshold parameters. Therefore, cooperative parameters only have prominent impact near the switching of Hill functions. We propose to split the parameters *θ* to be estimated into two sets *θ*_{r}=[*r*_{1,2}…*r*_{N,N−1}]^{T}, which includes only the cooperative parameters *r*_{i,j}, and *θ*_{p}=[*p*_{1}…*p*_{N} *q*_{1,2}…*q*_{N.N−1} *s*_{1}…*s*_{N}]^{T}, which includes all the rest of the unknown parameters. We propose to initially assume a value of *θ*_{r}, e.g. 2 for all *r*_{i,j}, and estimate only the parameters *θ*_{p}. As the number of unknown parameters are reduced, it helps to improve the issue of underdeterminedness. However, as *θ*_{r} was initially guessed, we then fix the value of *θ*_{p} and solve the optimization problem to estimate *θ*_{r}. We repeat this process until the estimates of all the parameters converge to a value.

The complete proposed scheme is illustrated with a flow chart in figure 3. Initially the estimation problem is reformulated with the change of variables to avoid constraints. An initial value of *θ*_{r} is assumed. The GPM method is applied to only estimate the parameters *θ*_{p}, which has *M*+2*N* elements. Using the recent optimal values of *θ*_{p}, the GPM method is applied to only estimate the parameters *θ*_{r}, which has *M* elements. If the change in all the parameters *θ* is less than some threshold, then the estimation process is complete; otherwise repeat the process.

## 4. Case study: SOS response in *Escherichia* *coli*

SOS response in *Escherichia coli* is an inducible DNA repair system that enables bacteria to survive under severe DNA damage. In the SOS response, nearly 40 genes are directly regulated by *recA* and *lexA*, while tens of other genes are indirectly controlled [30]. Nine genes named *lexA*, *recA*, *recF*, *rpoD*, *rpoS*, *dinI*, *umuDC*, *rpoH* and *ssB* are considered to be directly participating in the SOS response. Established network connections among these nine genes are known in the literature [9,30,31]. Network structure of these nine genes is shown as in figure 4.

### 4.1. Hill function-based model of the SOS response

The ODE model of the SOS response in *Escherichia coli* is constructed based on the network structure given in figure 4 [9,30]. The network has nine genes and 43 interconnections, i.e. *N*=9 and *M*=43. The Hill function-based ODEs for the nine gene product concentrations are given below:

A total of 2*(43+9)=104 parameters are needed to fully describe the above model. We can write the above model in compact form as shown below:
*X*:=[*x*_{1}…*x*_{9}]^{T}= [*lexA* *recA* *recF* *rpoS* *rpoD* *umuD* *dinI* *ssB* *rpoH*]^{T} denotes state variables, and *Θ*=[*P*_{1}…*P*_{9} *Q*_{1}…*Q*_{43} *R*_{1}…*R*_{43} *S*_{1}…*S*_{9}]^{T} is the set of unknown parameters. The value of *Θ* has to be estimated from the experimental genome data.

### 4.2. Results of the proposed method

In this section, the results of estimation by the proposed method are presented and compared with the results reported in the literature. SOS is a well-studied response, and a lot of literature and experimental datasets are available. Many Microbe Microarrays Database M3D [32] provides datasets for the SOS response under different perturbations, like ultraviolet light or some antibiotic. M3D database provides datasets in two formats, i.e. time course and steady-state measurements. Both forms of data are employed in this work. Time course measurements have been used for estimation of the model parameters. Experimental measurements of steady-state values have been used for validation.

Parameters are estimated with an initial guess of *r*_{i,j}=2. For the GPM, the weighting factor λ_{i} is chosen as 150 for *i*=1,…,*N*. The basis system is chosen as β-splines of order 4 with 100 knots, which is the number of joints in the spline function. The choice of λ_{i}, order and knots is subjective and chosen based on the required smoothness. The algorithm converged in four iterations of the loop in figure 3 with a total execution time of approximately 82 minutes on a PC with a Core i5-650 processor. The Matlab code is available at [33]. Estimated parameters are given in table 1.

The estimated model is simulated to generate time course evolution of gene product concentrations. The results are compared with experimental data and the model reported in [31]. Baralla *et al.* [31] also investigated the same subpart of the SOS response using the same dataset of M3D. They have applied the particle swarm algorithm for optimization and a direct numerical solution of the ODE model for data fitting. It can be seen in figure 5 that dynamic simulation with the model estimated by the proposed method is much more accurate than that of [31], especially for *recF*, *rpoD*, *ssB* and *rpoH*.

For the purpose of validation, the model estimated by the proposed method has also been used to calculate the steady-state values of concentrations. The steady-state values are given in table 2 along with the experimentally observed values and the values obtained by the estimated model in [31]. The table also gives the sum of squared error between the experimental data and values obtained from estimated models. It can be seen that the total error in the values obtained by the model estimated by the proposed method is much smaller than that of [31].

## 5. Conclusion

Obtaining a mathematical model of GRNs requires estimation of model parameters from the experimental data. To estimate the optimal values of parameters an optimization problem has to be solved. Usually the experimental data have much fewer measurements than the number of parameters which could result in an underdetermined optimization problem. Moreover, depending on how the optimization problem is posed, constraints have to be incorporated to find practically feasible values of parameters, which could result in numerical issues. In this paper, we have proposed a new approach to reformulate the estimation problem such that constraints are not required. Based on the structure of Hill functions, we also suggest to split the number of unknowns into two sets, which are estimated one at a time. This helps to alleviate the issue of underdeterminedness.

The proposed technique is applied to estimate the model of the SOS response in *Escherichia coli*. The results are compared with the existing literature and experimental data. Both dynamic and steady-state results show that the proposed technique provides more accurate estimates of the unknown model parameters.

## Data accessibility

The Matlab code of the proposed algorithm is available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.ht047 [33].

## Authors' contributions

A.H. conceived the research idea. F.E.E. did the coding and simulations. Both co-authors analysed and interpreted the results. Both co-authors drafted the manuscript and gave their final approval for publication.

## Competing interests

We declare we have no competing interests.

## Funding

We received no funding for this study.

- Received August 28, 2017.
- Accepted January 25, 2018.

- © 2018 The Authors.

Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.