## Abstract

Diabetes is a growing concern for the developed nations worldwide. New genomic, metagenomic and gene-technologic approaches may yield considerable results in the next several years in its early diagnosis, or in advances in therapy and management. In this work, we highlight some human proteins that may serve as new targets in the early diagnosis and therapy. With the help of a very successful mathematical tool for network analysis that formed the basis of the early successes of Google^{TM}, Inc., we analyse the human protein–protein interaction network gained from the IntAct database with a mathematical algorithm. The novelty of our approach is that the new protein targets suggested do not have many interacting partners (so, they are not hubs or super-hubs), so their inhibition or promotion probably will not have serious side effects. We have identified numerous possible protein targets for diabetes therapy and/or management; some of these have been well known for a long time (these validate our method), some of them appeared in the literature in the last 12 months (these show the cutting edge of the algorithm), and the remainder are still unknown to be connected with diabetes, witnessing completely new hits of the method.

## 2. Introduction

Every day dozens of scientific publications appear in the world in each important area of biology and medicine [1]. Lots of relevant results get published in every field, and most of them go unnoticed by the majority of researchers even if they are working on the very same area. Electronically maintained databases of the literature help scientists to follow the new developments of each area.

Knowledge related to the structure, function and interactions of proteins is represented, organized, cleaned, filtered, converted and published regularly in numerous databases. These plentiful information sources could transform health sciences in the coming years if appropriate tools for automatic processing of the databases become widely available.

One goal of scientists specializing in human diseases is to find new possible protein targets of intervention for a given disease. It is well known that the human genome contains around 21 000 genes that encode more than 100 000 proteins [2]. From this set of proteins, one source [3] lists 364 protein and nucleic acid targets for the 1540 approved drugs, while another paper [4] mentions only 133 targets for FDA approved drugs. Thus, very few proteins are targeted by the drugs approved today, and clearly the number of targeted proteins needs to be increased dramatically in the coming years.

For this goal, concentrated efforts are needed: from hundreds or thousands of possible targets, smaller, more focused sets need to be chosen for these concentrated efforts.

In this paper, we describe a new method that is based on the PageRank computation of the Google web-search engine, which is capable of identifying new, relevant protein targets for possible effects on diabetes. These proteins may serve as the targets of more focused efforts in the future.

### 2.1 The protein interaction graph and PageRank

The overall protein concentration of the cell is very high; about 20–30% of the cytosol [5] consists of proteins, and it is proven that a great proportion of proteins work in interaction with other proteins in the cell. The usual way of describing these interactions is the interaction graph: the vertices of the graph are the proteins, and two vertices are connected by an edge if the corresponding proteins interact. In the case of directed interactions (e.g. in metabolic processes or signalling networks), the graph edges can also be directed.

It is a natural idea to assign importance to those vertices that are connected to many other vertices; these vertices are called ‘hubs’. It has turned out, however, that this choice is, in numerous respects, too simplistic a solution; it is not robust against errors in the data. From the alternative definitions of importance that appeared in 1998 [6,7], the PageRank algorithm [7] proved to be more applicable, robust [8], and useful; the search engine of Google also used this method for finding the most important hits in a web search (a short review of the definition of PageRank is given in §5.5). In biological networks, we can especially make use of the robustness of PageRank [8], as protein interaction data may contain numerous false positives and false negatives, and if small changes in the data implied large changes in PageRank, then the whole concept of PageRank computation would be useless. We remark that the otherwise very appealing method of Kleinberg [6] is not robust [9]; this was one of the reasons for the success of the PageRank algorithm.

We have proposed previously to use PageRank for the analysis of protein interaction networks [10]. In that work, the PageRank algorithm was applied to finding relevant nodes in directed graphs, corresponding to metabolic networks of different organisms and also for undirected graphs with personalization.

It is well known that high-degree nodes in a network (sometimes called hubs, hub-nodes, commodity-nodes) have large PageRanks [11,12]; and, on the other hand, large-degree nodes are usually of vital importance in numerous biochemical processes [13]. Inhibiting or promoting the activity or the production of these proteins with a large number of interactions may have numerous unwanted side-effects, as their modifications (inhibition or promotion) would influence many other processes and proteins in the cell. Therefore, these large-degree nodes usually are not viable candidates for protein targets, so we intended to discover low-degree nodes in the network having importance in the molecular mechanisms of diabetes.

In Bánky *et al.* [12], based on Grolmusz [14], we introduced a measure of importance for the nodes of directed graphs that is essentially computed by dividing the PageRank of the node by its degree (see §5.5 for details). In this way, the small-degree nodes are compensated, and one can find nodes that are both small-degree and relevant.

We apply a somewhat similar method in this work: we divided the personalized PageRanks of the nodes of the human interactome (an undirected graph) by their degrees; and reviewed the vertices with the highest PageRank/degree values.

## 3. Results

Our results are represented in tables 1 and 2 and the electronic supplementary material, table S3.

Table 1 shows that even the personalized PageRank would identify very large-degree nodes (hubs) with the highest scores that have too many interacting partners and, therefore, are virtually useless for possible interventions.

Table 2 lists the best hits found in this work, and we review them one-by-one in the next section. The whole set of results is given in the electronic supplementary material, table S3.

## 4. Discussion

In this section, we review the protein hits with the highest score given in table 2. We will find that some of them have well-known connections with diabetes, so these proteins should have been included in the list of diabetes-related proteins constructed from the UniProt database (electronic supplementary material, table S2); the others are clearly interesting hits with at most one or two references that connect them to diabetes, or without such connections. The full list of hits is given in electronic supplementary material, table S3.

P47211 Galanin receptor type 1: connection with diabetes is well known (e.g. [15–17]).

O43603 Galanin receptor type 2: connection with diabetes is well known.

O75325 Leucine-rich repeat neuronal protein 2: we have not found any direct connections with diabetes; however, it is over-amplified in malignant gliomas as shown by the UniProt source. Malignant gliomas and diabetes have numerous references (e.g. [18]).

P37288 Vasopressin V1a receptor: connection with diabetes is well known.

Q8IWW8 Alcohol dehydrogenase iron-containing protein 1: we have found only one, but a very interesting reference [19]. The authors selected thoroughbred horses with exceptional racing performance, and performed genetic mapping; they found that the gene *ADHFE1* of protein Q8IWW8 might be involved in increased insulin sensitivity of the horses.

Q9BZL3 Small integral membrane protein 3: we have not found any data concerning the link with diabetes.

P00736 Complement C1r subcomponent, C1R, a serine protease: may be involved in metabolic changes in pregnancy [20].

P18505 GABA(A) receptor subunit beta-1: very recently, one study [21] has found the gene for this protein to be related to a higher incidence of type-2 diabetes in an extended UAE Arab family.

P09871 C1 esterase; its connection with diabetes is well known (e.g. [22]).

P16118 6-Phosphofructo-2-kinase, PFKFB1: Atsumi *et al*. [23] show it may be related to obesity, and Garcia-Herrero *et al*. [24] show it interacts with glucokinase (GCK) that, in turn, acts as a glucose sensor in the pancreatic beta cell and regulates insulin secretion.

P55317 Hepatocyte nuclear factor 3-alpha, FOXA1: Garcia-Herrero *et al*. [24] show the deficiency of transcription factor Neurogenin3 leads to diabetes in humans; and FOXA1 can amplify the auto-regulation of Neurogenin3.

P62341 Selenoprotein T, SELT: we have not found any data concerning the link with diabetes.

Q9NZ43 Vesicle transport protein, USE1: we have not found any data concerning the link with diabetes.

Q96HH6 Transmembrane protein 19: we have not found any data concerning the link with diabetes.

Q9Y2Y9 Krueppel-like factor 13, NSLP1, BTEB3: we have not found any data concerning the link with diabetes.

P43694 Transcription factor GATA-4: it is mostly mentioned in relation to neonatal heart disease, but very recently its connections with early pancreas development have been described [25]. It may have a role in regenerative therapy in type-1 diabetes.

Q9UGH3 Sodium-dependent vitamin C transporter 2, SVCT2: numerous sources show the pivotal role of this protein in diabetes, e.g. [26].

P40199 Carcinoembryonic antigen-related cell adhesion molecule 6, CD66c: we have not found any data concerning the link with diabetes.

P14778 Interleukin-1 receptor type 1, IL1R1: its connection with diabetes is well known (e.g. [27]).

## 5. Material and methods

The methodology of this work comprises the following parts:

— downloading, filtering and pre-processing the human interactions from the IntAct database [28];

— downloading and pre-processing the set of diabetes-related proteins from the UniProt database [29];

— computing the PageRank for the nodes in the human protein interaction graph, personalized to the diabetes-related vertices [10], by the Perl-script at http://uratim.com/pp.zip; and

— post-processing and evaluating the results.

### 5.1 Constructing the human interaction graph

Some tools (e.g. [30,31]) and databases [32] on the World Wide Web produce or contain predicted interactions. Other databases contain only protein interactions inferred from laboratory experiments (e.g. MINT [33], HPRD [34], DIP [35] and IntAct [28]). We have chosen for our present study the rich, laboratory-data based, constantly updated IntAct protein interaction database of the European Bioinformatics Institute.

The binary interactions from the IntAct database were downloaded with the ‘Homo sapiens’ organism filter in MI-TAB 2.5 format [28] on 13 October 2013.

The downloaded data still contained proteins from non-human species in some interactions; these interactions were deleted. We also removed those interactions where at least one of the interacting proteins was not denoted by its UniProt accession number [29]. Next, the isoforms of proteins (e.g. P02545-2) in the table were substituted by their base UniProt accession numbers (e.g. P02545).

The resulting table contained numerous multiplicities, that is multiple appearances of the same interaction.

We intended to build a graph edge-list from the list of interactions. Physical protein–protein interactions are symmetric relations, in other words the graph that corresponds to the dataset is undirected. Our script that computes the personalized PageRank needs the graph-edges to be given as directed edges, therefore we followed the process below:

— For each pair (*a*, *b*), describing the interaction between proteins identified by their UniProt accession numbers *a* and *b*, we added also the pair (*b*, *a*), even if *a* = *b* was true.

— Then we removed all multiplicities, that is, if (*a*, *b*) was in the list considered above, and *a* ≠ *b*, then the final table contains (*a*, *b*) and (*b*, *a*) (each once), and if *a* = *b* then it contains (*a*, *a*) only once.

We worked in the resulting undirected graph, with edge-list corresponding to the table constructed: each non-directed edge corresponds to a symmetric pair of directed edges. The edge-list is given in the electronic supplementary material, table S1, with 120 175 directed edges on 11 766 vertices.
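The symmetrization and de-duplication steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the processing code used for table S1: the function names and the toy identifiers are hypothetical, and the filtering of non-human and non-UniProt entries is assumed to have been done beforehand.

```python
def strip_isoform(accession):
    """Reduce an isoform identifier such as 'P02545-2' to 'P02545'."""
    return accession.split("-", 1)[0]


def build_directed_edge_list(interactions):
    """Symmetrize interaction pairs and drop repeated edges."""
    edges = set()
    for a, b in interactions:
        a, b = strip_isoform(a), strip_isoform(b)
        edges.add((a, b))
        edges.add((b, a))  # identical to (a, b) when a == b, so kept once
    return sorted(edges)


# Toy input with placeholder identifiers (not real proteins):
# a repeated interaction, an isoform suffix and a self-interaction.
toy_interactions = [("A", "B-2"), ("B", "A"), ("A", "A")]
toy_edge_list = build_directed_edge_list(toy_interactions)
```

Storing the edges in a set makes the removal of multiplicities automatic, and a self-interaction (*a*, *a*) is naturally kept only once.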

### 5.2 The initial list of proteins with a role in diabetes

The UniProt database [29] contains extensive annotations of the proteins deposited. These annotations are based on literature evidence and are very useful tools for fast and reliable retrieval of protein subsets related to some disease or syndrome.

We performed the following search of the UniProt database (on 13 October 2013): ‘(diabetes AND organism: ‘Homo sapiens [9606]’) AND reviewed:yes’, meaning that we were looking for proteins, related to diabetes (both type-1 and type-2); the proteins needed to be human ones, and needed to be the elements of the SwissProt (i.e. the manually reviewed) subset of UniProt. We remark that SwissProt is the strongly controlled subset of UniProt, containing the sequences of proteins that have evidence of their existence at the protein level (i.e. they are not just predicted from their gene sequences). Using this latter restriction is important to assure the quality of the data used in this work.

The query above returned 195 proteins; their annotated list is given in the electronic supplementary material, table S2.

### 5.3 Performing the personalized PageRank computation

The Perl script, downloadable from http://uratim.com/pp.zip [10], was applied to the human protein interaction graph taken from the IntAct database (electronic supplementary material, table S1), with personalization to the proteins marked as ‘related to diabetes’ in UniProt (electronic supplementary material, table S2). The Perl script scales the PageRank values by multiplying all numbers such that the smallest PageRank is set to be 1 (without scaling the numbers would be uncomfortably small: their sum for the 11 766 vertices would have been 1). The parameters of the computation were the following ones in the Perl script: damping_factor=0; personalization_damping_factor=0.15. The PageRank computation is very fast, taking less than 10 s even on a low-end laptop computer.
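For readers who prefer not to read Perl, the computation can be approximated by a short power iteration in Python. This is a sketch under stated assumptions and not a port of the script at http://uratim.com/pp.zip: the function names are hypothetical, the graph is assumed to be connected and given as a symmetric directed edge list, and the parameter `teleport` is assumed to play the role of `personalization_damping_factor` (the probability of jumping back to the personalization set).

```python
def personalized_pagerank(edges, seeds, teleport=0.15, iterations=200):
    """Power iteration for PageRank with teleportation restricted to `seeds`.

    `edges` is a list of directed (source, target) pairs in which every node
    has at least one outgoing edge, as in a symmetric edge list.
    """
    nodes = sorted({v for edge in edges for v in edge})
    successors = {v: [] for v in nodes}
    for a, b in edges:
        successors[a].append(b)
    seeds = [s for s in seeds if s in successors]  # ignore seeds absent from the graph
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            # Walking step: spread the non-teleporting mass evenly over the out-edges.
            share = (1.0 - teleport) * rank[v] / len(successors[v])
            for w in successors[v]:
                nxt[w] += share
        for s in seeds:
            # Teleporting step: the restart mass is split evenly among the seeds.
            nxt[s] += teleport / len(seeds)
        rank = nxt
    return rank


def scale_to_min_one(rank):
    """Rescale all scores so that the smallest one equals 1."""
    smallest = min(rank.values())
    return {v: r / smallest for v, r in rank.items()}


# Toy path graph a - b - c, stored as symmetric directed edges; seed set {a}.
toy_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
raw_rank = personalized_pagerank(toy_edges, seeds=["a"])
scaled_rank = scale_to_min_one(raw_rank)
```

Note that the rescaling is only meaningful if every vertex can be reached from the personalization set; vertices in a component without any seed would receive a score close to zero and distort the scale.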

### 5.4 Post-processing and evaluating the results

From the nature of the PageRank computation with personalization, the nodes that we personalized to have high scores in this measure; so we were looking for proteins that

— were not the elements of the diabetes-related vertex-set from UniProt that we personalized to (electronic supplementary material, table S2); and

— have high value of the PageRank/degree quotient.

The first requirement assures us that we identify either new proteins that are important in diabetes or at least proteins that were not listed as ‘diabetes-related’ in the UniProt database and our electronic supplementary material, table S2; the second requirement assures us that we identify important proteins possibly with only few interacting partners.
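The two selection criteria can be illustrated with the following Python sketch. The function name, the `top` parameter and the toy scores are hypothetical and serve only to show the filtering and ranking; the degree of a node is assumed here to be its number of interacting partners.

```python
def rank_candidates(pagerank, degree, seeds, top=20):
    """Return non-seed nodes sorted by decreasing PageRank/degree quotient."""
    seed_set = set(seeds)
    scored = [
        (pagerank[v] / degree[v], v)
        for v in pagerank
        if v not in seed_set and degree[v] > 0
    ]
    # Highest quotient first; ties are broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(v, q) for q, v in scored[:top]]


# Made-up scores: a hub with a large PageRank, a seed, and two low-degree nodes.
toy_pagerank = {"hub": 50.0, "seed": 30.0, "leaf": 4.0, "mid": 6.0}
toy_degree = {"hub": 100, "seed": 3, "leaf": 1, "mid": 3}
toy_hits = rank_candidates(toy_pagerank, toy_degree, seeds=["seed"])
```

In this toy example the hub has by far the largest PageRank among the non-seed nodes, but it drops to the bottom of the list once the score is divided by the degree.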

The nodes with the highest PageRank are given in table 1, the nodes with the highest PageRank/degree quotient are given in table 2 and the list of all hits is given in the electronic supplementary material, table S3.

### 5.5 Mathematical remarks

In this section, we review some of the mathematical details of the new method of finding important nodes in undirected graphs for those readers with mathematical interest.

The PageRank algorithm [7] can be described by a random walk on the graph. First let us review the simplest case of a random walk: we are given a non-bipartite, connected, undirected graph, *G*, and a player is wandering on the graph according to the rules:

(a) if the player is at a node *v* with *d* connected edges, then she will choose randomly each adjacent edge with the same 1/*d* probability for the next move.

One can show very easily [36] that, after a long time, the probability of being at a given node will be independent of the identity of the starting node and also of the number of steps taken, and will be proportional to the degree of that given node.
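The stationarity part of this claim can be checked in one line. As a sketch, assume the following notation, which is not used elsewhere in this paper: $A$ is the symmetric adjacency matrix of *G*, $d_v=\sum_u A_{uv}$ is the degree of *v*, $D=\sum_u d_u$, and $P_{uv}=A_{uv}/d_u$ is the probability of stepping from *u* to *v* under rule (a). Then the candidate distribution $\pi_v=d_v/D$ is left unchanged by one step of the walk:

$$
(\pi P)_v \;=\; \sum_{u} \frac{d_u}{D}\cdot\frac{A_{uv}}{d_u} \;=\; \frac{1}{D}\sum_{u} A_{uv} \;=\; \frac{d_v}{D} \;=\; \pi_v .
$$

Connectedness and non-bipartiteness are the conditions that ensure the walk actually converges to this distribution from any starting node [36].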

For directed graphs, no such statement is true. Page & Brin [7], in the process of designing the Google search engine, suggested a slightly different random walk on the graph that also works well for directed graphs:

— The player at each step will choose with 80% probability step *A* and with 20% probability step *B*:

(*A*) if the player is at a node *v* with *d* out-going edges, then she will choose randomly each outgoing edge with the same 1/*d* probability for the next move; and

(*B*) the player teleports to any vertex of the graph, with the same probability (i.e. if the graph has *n* vertices, then the player jumps to any vertex with the same 1/*n* probability).

The probability of being at vertex *v* after a large number of steps is the PageRank of vertex *v* [7].

In the early versions of the Google search engine, PageRank was used for assigning a measure of importance to web pages, and the web pages of higher relevance were shown to the user first, and the less important ones later on in the hit-list of a web search [7].

The personalized PageRank [37] differs from the original PageRank in the teleporting step *B*: let vertex set *V*′ ⊂ *V*, and let the size of set *V*′ be equal to *m* ≤ *n*, then step *B*′ needs to be substituted for *B*:

(*B*′) the player teleports to any vertex of the set *V*′ ⊂ *V*, with the same probability 1/*m*.

The probability of being at vertex *v* after a large number of steps is the personalized PageRank of vertex *v* [37].
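Both variants can be written as a fixed-point equation. The notation is introduced here only for illustration and is not taken from [7] or [37]: $P$ is the transition matrix of step *A*, $c$ is the teleport probability (20% in the description above), $\mathbf{1}$ is the all-ones row vector and $\mathbf{1}_{V'}$ is the indicator row vector of the set *V*′. With these assumptions, the PageRank vector $\pi$ and the personalized PageRank vector $\pi'$ satisfy

$$
\pi \;=\; (1-c)\,\pi P \;+\; \frac{c}{n}\,\mathbf{1},
\qquad
\pi' \;=\; (1-c)\,\pi' P \;+\; \frac{c}{m}\,\mathbf{1}_{V'} .
$$

The two equations differ only in the teleport term, which spreads the restart mass over all *n* vertices in the first case and over the *m* vertices of *V*′ in the second.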

Originally [37], this definition described the importance of web pages related to some personal interest of the user, represented by the set of subjectively interesting web pages *V*′ ⊂ *V*; because of this historical reason it is called ‘personalized’ PageRank.

Certainly, in the process, the vertices we have personalized to have high PageRank; we, of course, should not consider those vertices in the search for new, relevant ones.

In Ivan & Grolmusz [10], we have applied the personalized PageRank to a biomedical problem in the following way: we considered the human undirected protein interaction graph, and personalized the PageRank to a proteomics dataset for human melanoma. We received high PageRank vertices, relative to this personalization, of both known and unknown functions; it is very appealing that most of the proteins of the high PageRank vertices with known functions were related to cancers; therefore, we may assume that the high PageRank vertices with unknown functions are also related to cancers in general and melanoma in particular. Since low-concentration proteins cannot be identified in proteomics experiments, this method may also be capable of finding those hidden proteins in a disease.

In [12], we introduced a method for directed graphs that is capable of identifying important nodes of low degree. The method has the following background.

— Let us recall that the limit distribution of a random walk on a connected, non-bipartite, undirected graph converges to the degree distribution of the nodes [36].

— We have shown in [14] that if, in the case of undirected graphs, we compute the PageRank personalized to the degree distribution, then the limit distribution is the degree distribution itself. Therefore, if one computes the PageRank/degree quotient, then we get the same constant for all nodes of the undirected graph.

— If we compute the PageRank/degree quotient in the case of directed graphs, then typically, the results will not be a constant: the larger values will correspond in some sense to the relevance of the node, inherited from the directed structure of the graph. Note that this measure is independent of the degree of the nodes in the undirected graph, so it is believed to compensate small degree nodes against large degree nodes.
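The second point above can be verified directly. As a sketch under assumed notation (not that of [14]): let $P$ be the transition matrix of the simple random walk on the undirected graph, $\delta$ the degree distribution with $\delta_v = d_v/\sum_u d_u$, and $c$ the teleport probability. Personalizing to $\delta$ means that the PageRank vector $\pi$ solves $\pi = (1-c)\,\pi P + c\,\delta$. Since the degree distribution is stationary for the simple random walk [36], that is $\delta P = \delta$, substituting $\pi = \delta$ gives

$$
(1-c)\,\delta P + c\,\delta \;=\; (1-c)\,\delta + c\,\delta \;=\; \delta ,
$$

so $\delta$ is a solution (and the only one for $c>0$), and the quotient $\pi_v/d_v = 1/\sum_u d_u$ is the same for every node.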

In Bánky *et al.* [12], we identified and proposed numerous new protein targets in the directed metabolic networks of pathogenic organisms.

In this work, based on the results of Grolmusz [14], we suggest a unification of the methods of Ivan & Grolmusz [10] and Bánky *et al.* [12] as follows:

— we consider the undirected human interaction graph from IntAct;

— we compute the PageRank, personalized to the set of diabetes-related proteins, taken from UniProt; and

— we have found that the large-degree nodes have very large personalized PageRanks even in this setting (cf. table 1); so we compute the PageRank/degree quotient for each node and review those that were not personalized to, but have a high PageRank/degree quotient (cf. table 2).

## 6. Conclusion

We have applied a refined mathematical algorithm for assigning scores of interest, related to diabetes, in the human interactome, downloaded from the IntAct database. After removing the diabetes-related proteins labelled by UniProt (electronic supplementary material, table S2), we have found proteins with

— well-known relations to diabetes (e.g. galanin receptor type 1 and 2, vasopressin V1a receptor or C1 esterase);

— some very interesting hits with known marginal connections to diabetes (leucine-rich repeat neuronal protein 2, alcohol dehydrogenase iron-containing protein 1, complement C1r subcomponent, 6-phosphofructo-2-kinase, hepatocyte nuclear factor 3-alpha, transcription factor GATA-4); and

— some hits with no known connection to diabetes (small integral membrane protein 3, selenoprotein T, vesicle transport protein, transmembrane protein 19, Krueppel-like factor 13, carcinoembryonic antigen-related cell adhesion molecule 6).

The hits in the first group validate our method. The hits in the second group, with marginal evidence mostly from the last two years, show that our method is capable of finding very recently discovered important proteins in the disease.

The members of the third group show that completely new, still unknown relations between diabetes and these proteins or genes can be searched for with probable success, since other proteins with high score by our method have clear connections.

## Data accessibility

Tables S1, S2 and S3 are available as the electronic supplementary material of this article. The Perl script, created by Gábor Iván and first applied and referred to in Ivan & Grolmusz [10] at http://uratim.com/pp.zip, was used here for the PageRank computation and is also available in the electronic supplementary material.

## Funding statement

No funding was received for this work.

## Conflict of interests

The author is a professor of mathematics at Eötvös University, and also the CEO of Uratim Ltd., a commercial organization. Uratim Ltd. has no roles in design, application or exploitation of this research. Uratim Ltd. has not paid the author for performing this research.

## Acknowledgements

The author does not acknowledge any particular support.

- Received August 21, 2014.
- Accepted April 2, 2015.

© 2015 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.