Systems Biology

Part I: Expression networks

Prof. Patrick E. Meyer

Motivation

Synthetic Biology

Solutions for a better future: Mutate one gene and create a perfect cell factory!?

  • Results: the mutant dies in a few hours.
  • Reason: genes have multiple roles, changing one alters many functions.
  • The cell is a circuit of interconnected elements (proteins/genes/metabolites/RNA/DNA/…)

Understanding circuitry

Understanding the circuitry helps to find:

  • the gene least involved in other functions,
  • the environmental conditions suited for the mutant,
  • another mutation that compensates for collateral losses.

Changing paradigm

  • Mercator map

    • 16th century (the most precise map for one century)
    • Two human generations to produce
  • Satellite imagery

    • Highly precise and informative
    • A few seconds (through sensors and computers)

Changes in Biology

  • Traditional genetic map

    • Informative connections, but few of them
    • Decades of work from a whole team
  • Large-scale gene network

    • Thousands of genes
    • A few weeks of work by one scientist

Satellites in Biology

  • Microscope and manipulations (PCR, fluorescence,…)

  • High-throughput sequencers (transcriptomics)

Meta-networks

  • Systems biology is the computational modeling of complex biological systems.
  • It is focused on networks (graphs representing a system of interconnected elements).
  • It is a partner field to synthetic biology.

By 2050 ?

  • Given a seed with

    • the DNA sequence,
    • the composition of the medium/organelles,
    • environmental parameters,
  • Computers will be able to

    • grow the organism numerically… at accelerated speed.
    • predict the impact of environmental and molecular changes.
  • Numerical experiments (faster and less costly) will be a precursor to lab experiments.

Inferring Networks (part I)

Transcriptional Network

⇒ Each cell has an encoded network (circuit) in DNA.

  • \(gene \rightarrow RNA \rightarrow protein\)
  • Some proteins, called transcription factors (TFs), can modify the RNA production of their target genes (TGs)

  • Each node is a gene.
  • An arc connects a regulator gene (TF) to a regulated one (TG).

Expression data

  • microarray
  • RNA‑seq

Data matrix (illustrative)

\[\begin{array}{|c|c|c|c|c|} \hline \text{DATA} & X_{1} & X_{2} & \dots & X_{n} \\ \hline s_{1} & 0.1 & 0.9 & \dots & 0.5 \\ \hline \vdots & \vdots & \vdots & \ddots & \vdots \\ \hline s_{m} & 0.2 & 0.3 & \dots & 0.8 \\ \hline \end{array}\]

Spearman’s correlation

i.e., Pearson’s correlation computed on the ranks of the data, which better detects monotonic (not only linear) dependencies
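As a minimal sketch of this rank idea (assuming NumPy; the data are illustrative), Spearman’s correlation can be obtained by ranking both variables and applying Pearson’s formula to the ranks:

```python
import numpy as np

def rank(v):
    # simple ranking, assuming no ties in v
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks
    return np.corrcoef(rank(x), rank(y))[0, 1]

x = np.linspace(-2, 2, 50)
y = x**3                         # monotonic but strongly non-linear
print(spearman(x, y))            # ≈ 1: perfect monotonic dependence
print(np.corrcoef(x, y)[0, 1])   # < 1: Pearson underestimates it
```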

Co‑expression networks

Compute squared Spearman correlation for all pairs

\[\begin{array}{|c|c|c|c|c|} \hline \hat{T} & X_{1} & X_{2} & \dots & X_{n} \\ \hline X_{1} & - & 0.2 & \dots & 0.9 \\ \hline X_{2} & 0.2 & - & \dots & 0.1 \\ \hline \vdots & \vdots & \vdots & \ddots & \vdots \\ \hline X_{n} & 0.9 & 0.1 & \dots & - \\ \hline \end{array}\]
  • why squaring correlations?
  • why is the matrix symmetric? diagonal irrelevant?
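The construction of \(\hat T\) can be sketched as follows (a toy example assuming NumPy; the data are simulated). Squaring makes the sign of the correlation irrelevant, and the matrix is symmetric because correlation is:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                    # 30 samples x 4 genes
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=30)  # genes 1 and 2 co-expressed

# rank-transform each column; squared Pearson on ranks = squared Spearman
R = np.argsort(np.argsort(X, axis=0), axis=0)
T = np.corrcoef(R, rowvar=False) ** 2
np.fill_diagonal(T, 0)   # self-correlation carries no information
print(np.round(T, 2))    # symmetric: T[i, j] == T[j, i]
```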

Improving Co‑expression Networks

  • Compute co-expression matrix
  • Apply an arc-elimination algorithm on the matrix, e.g.
    • RelNet (thresholding) 
    • ARACNE
    • CLR
    • MRNET
  • But direct thresholding (RelNet) usually fails. Why?
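RelNet itself is trivial to sketch (toy similarity matrix and a hypothetical threshold, assuming NumPy). The sketch also hints at the answer: a strong but indirect correlation survives any reasonable global threshold.

```python
import numpy as np

# toy squared-correlation matrix for 3 genes
T = np.array([[0.0, 0.8, 0.6],
              [0.8, 0.0, 0.7],
              [0.6, 0.7, 0.0]])

# RelNet: keep every edge whose similarity exceeds a global threshold
net = (T > 0.5).astype(int)
print(net)
# if genes 1 and 3 interact only through gene 2, their indirect
# correlation (0.6) still passes the threshold -> false-positive edge
```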

ARACNE

Algorithm for the Reconstruction of Accurate Cellular NEtworks

Three possible indirect‑interaction motifs (three variables):

  • \(X_j \rightarrow X_k \rightarrow X_i\)

  • \(X_j \leftarrow X_k \rightarrow X_i\)

  • \(X_j \rightarrow X_k \leftarrow X_i\)

  • By the data‑processing inequality, if \(X_j\) and \(X_i\) interact only through \(X_k\), then

    \[ I(X_j;X_i) \le I(X_j;X_k) \quad\text{and}\quad I(X_j;X_i) \le I(X_k;X_i) \]

  • For every triangle, eliminate the weakest link.
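The pruning step above can be sketched directly (assuming NumPy; the mutual-information values are illustrative, not ARACNE’s full implementation, which also uses a tolerance parameter):

```python
import numpy as np

def aracne_dpi(M):
    """For every fully connected triple, mark the weakest edge for
    removal, as suggested by the data-processing inequality."""
    keep = M > 0
    n = M.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k == i or k == j:
                    continue
                if M[i, k] > 0 and M[k, j] > 0 and \
                   M[i, j] < min(M[i, k], M[k, j]):
                    keep[i, j] = keep[j, i] = False
    return np.where(keep, M, 0.0)

# chain X1 -> X2 -> X3: the indirect X1-X3 link is the weakest and is cut
M = np.array([[0.0, 0.8, 0.3],
              [0.8, 0.0, 0.7],
              [0.3, 0.7, 0.0]])
print(aracne_dpi(M))
```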

MRNET

Minimum Redundancy Network (Meyer et al., 2007)

For each target variable \(X_j\), select predictors using MRMR (minimum redundancy – maximum relevance):

\[ X_i^{\text{MRMR}} = \arg\max_{X_i \in X\setminus S}\bigl(u_i - z_i\bigr) \]

  • Relevance \(u_i = I(X_i;X_j)\)
  • Redundancy \(z_i = \frac{1}{|S|}\sum_{X_k \in S} I(X_i;X_k)\)

Iteratively build the network.
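The greedy MRMR selection can be sketched as follows (assuming NumPy and a precomputed mutual-information matrix `I`; the values are illustrative):

```python
import numpy as np

def mrmr(I, target, k):
    """Greedily pick k predictors of `target` from the mutual-information
    matrix I, maximizing relevance minus mean redundancy with the set S."""
    candidates = [i for i in range(I.shape[0]) if i != target]
    S = []
    for _ in range(k):
        def score(i):
            u = I[i, target]                                    # relevance
            z = np.mean([I[i, j] for j in S]) if S else 0.0     # redundancy
            return u - z
        best = max(candidates, key=score)
        S.append(best)
        candidates.remove(best)
    return S

# genes 1 and 2 are both relevant to gene 0 but redundant with each other;
# gene 3 is less relevant but brings new information
I = np.array([[0.0, 0.9, 0.85, 0.5],
              [0.9, 0.0, 0.8,  0.1],
              [0.85, 0.8, 0.0, 0.1],
              [0.5, 0.1, 0.1,  0.0]])
print(mrmr(I, target=0, k=2))   # picks gene 1, then gene 3 (not 2)
```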

Example MRMR (1)

Example MRMR (2)

Example MRMR (3)

Example MRMR (4)

CLR – Context Likelihood of Relatedness

\[ w_{ij}= \sqrt{z_{i}^{2}+z_{j}^{2}} \]

where \(z_i\) and \(z_j\) are z-scores

\[ z_{i}= \max\!\left(0,\, \frac{I(X_{i};X_{j})-\mu_{i}}{\sigma_{i}}\right) \]

with \(\mu_{i}\) and \(\sigma_{i}\) the mean and standard deviation of the background distribution of mutual informations \(I(X_{i};X_{k})\), \(k=1,\dots,n\).

CLR (2)

Principle: each gene chooses its best friends.

(Hayete et al., 2007)
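A minimal sketch of the CLR scoring (assuming NumPy; for simplicity the background statistics here include the zero diagonal, which the real implementation handles more carefully):

```python
import numpy as np

def clr(M):
    """CLR scores from a mutual-information matrix M: z-score each
    I(Xi;Xj) against gene i's own background, clip negatives at zero,
    then combine the two genes' views symmetrically."""
    mu = M.mean(axis=1, keepdims=True)
    sd = M.std(axis=1, keepdims=True)
    Z = np.maximum(0.0, (M - mu) / sd)
    return np.sqrt(Z**2 + Z.T**2)

M = np.array([[0.0, 0.8, 0.3],
              [0.8, 0.0, 0.7],
              [0.3, 0.7, 0.0]])
W = clr(M)
print(np.round(W, 2))   # symmetric by construction
```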

Validating Networks (part I)

Problem Formalization

Network Inference Objective
> Given a dataset \(D_{m}\), infer a network \(\hat T\) that has the highest similarity \(S\) with the known (or partially known) true network \(T\). \[ \hat{T}^{\max}= \arg\max_{\hat{T}} S(\hat{T},T) \]

  • Confusion matrix

    \[\begin{array}{|c|c|c|} \hline \text{edge} & \text{actual positive} & \text{actual negative} \\ \hline \text{inferred positive} & TP & FP \\ \hline \text{inferred negative} & FN & TN \\ \hline \end{array}\]

Weighted Network

Inferred networks are weighted:

\[\begin{array}{|c|c|c|} \hline \text{TF} & \text{TG} & w \\ \hline X_{1} & X_{2} & 0.1 \\ \hline X_{1} & X_{3} & 0.3 \\ \hline X_{i} & X_{k} & 0 \\ \hline X_{\#tf} & X_{\#tg} & 0.83 \\ \hline \end{array}\]

→ An infinity of networks can be derived (one per weight threshold).
Which confusion matrix should we use?

ROC Curves

\[ TPR = \frac{TP}{TP+FN}=p(TP)\] \[ FPR = \frac{FP}{FP+TN}=p(FP) \]

Precision–Recall


Precision: fraction of correct predictions

\[PRE = \dfrac{TP}{TP+FP}\]

Recall (= TPR): fraction of known edges recovered

\[REC = \dfrac{TP}{TP+FN}\]
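The threshold dependence can be made concrete with a toy sketch (hypothetical edge weights and ground truth, assuming NumPy): each threshold yields its own confusion matrix, hence its own (precision, recall) point on the curve.

```python
import numpy as np

# hypothetical edge weights (inferred) and ground-truth labels (1 = true edge)
w    = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
true = np.array([1,   1,   0,   1,   0,   0])

# one confusion matrix per threshold -> one (precision, recall) point each
for t in [0.85, 0.5, 0.05]:
    pred = w >= t
    tp = int(np.sum(pred & (true == 1)))
    fp = int(np.sum(pred & (true == 0)))
    fn = int(np.sum(~pred & (true == 1)))
    print(t, tp / (tp + fp), tp / (tp + fn))
```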

One network - One score

  • Correlation with partial truth

  • AUROC

  • AUPR10 and AUPR20

  • F‑Score (weighted harmonic mean of precision and recall)

    \[ 0 \le F = \frac{2\; pre \cdot rec}{pre+rec} \le 1 \]
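A quick numeric illustration of why the harmonic mean is used: F is dominated by the weaker of the two scores.

```python
def f_score(pre, rec):
    # weighted harmonic mean: dominated by the smaller of the two
    return 2 * pre * rec / (pre + rec)

print(f_score(0.5, 0.5))   # 0.5
print(f_score(0.9, 0.1))   # 0.18: one weak side drags F down
```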

MiNET Bioconductor Package

Citation: Meyer et al. 2008

  • Four network‑inference methods (rel, clr, aracne, mrnet)
  • Three validation tools (fscores, ROC, PR)
library(minet)
res <- validate(my.net, true.net)          # compare inferred vs true network
show.pr(res, col = "green", type = "b")    # precision–recall curve
auc.pr(res)                                # area under the PR curve
show.roc(res, col = "green", type = "b")   # ROC curve
auc.roc(res)                               # area under the ROC curve

Experimental Framework

Results

Wins and losses are statistically significant across settings with

  • from 100 to 1000 variables
  • from 100 to 1000 microarrays (samples)
  • from 0 % to 30 % Gaussian noise
Match scores (wins – losses):

  • MRNET / RELNET: 9 – 2
  • MRNET / ARACNE: 8 – 3
  • MRNET / CLR: 12 – 8

Exploiting Networks (part I)

Graphs for Guilt-by-association

A graph G is a mathematical object that allows us to represent networks (interconnected elements).

  • Formally, \(G = (V, E)\)

    • V is a set of vertices, E is a set of edges.
    • An edge is a pair of vertices (e.g. {1,2}).
  • Guilt-by-association:

    • Social clubs
    • EEG (brain areas)
    • Functional gene sets (predictions on unknown genes)!
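Guilt-by-association can be sketched as a majority vote over a gene’s neighbours (hypothetical gene names and annotations, for illustration only):

```python
from collections import Counter

# an unannotated gene inherits the majority function of its neighbours
neighbours = {"geneX": ["geneA", "geneB", "geneC"]}
function   = {"geneA": "ribosome", "geneB": "ribosome", "geneC": "kinase"}

votes = Counter(function[g] for g in neighbours["geneX"])
prediction = votes.most_common(1)[0][0]
print(prediction)   # "ribosome"
```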

The power of using graphs

Examples:

  • Bio: How many genes are targeted by a TF?
    CS: What is the degree of a vertex?
  • Bio: Minimum number of steps needed to convert pyruvate into lactate?
    CS: What is the shortest path between two vertices?
  • Bio: The maximal set of interacting proteins?
    CS: What is the maximally connected subgraph of G?

R package igraph

There are R packages for dealing with graphs, like igraph.

> library(igraph)
> g <- graph.formula(A-B, A-C, B-C, B-D, B-E, D-E)
> V(g)
Vertex sequence:
[1] "A" "B" "C" "D" "E"
> E(g)
Edge sequence:
[1] B -- A
[2] C -- A
[3] C -- B
[4] D -- B
[5] E -- B
[6] D -- E
> plot(g)

Types of Graphs

> g <- graph.formula(A-+B, A+-C, B-+C, B+-D, B+-E, D-+E)
> plot(g)
> is.weighted(g)
[1] FALSE
> E(g)$weight <- runif(ecount(g))
> is.weighted(g)
[1] TRUE

  • undirected (previous slide): an edge is an unordered pair, {1,2} = {2,1}
  • directed (this slide): an arc is an ordered pair, (1,2) ≠ (2,1)

Coding Adjacencies

Adjacency matrix:
  A B C D E
A . 1 . . .
B . . 1 . .
C 1 . . . .
D . 1 . . 1
E . 1 . . .
Adjacency list:
A -> B
B -> C
C -> A
D -> B E
E -> B
  • lists are less memory-intensive (especially for sparse graphs)
  • but edge lookups are faster with matrices
  • both can be used to create graphs:
    • graph_from_edgelist(el,...)
    • graph_from_adjacency_matrix(mat,...)
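The two encodings are interchangeable; a minimal sketch converting the list form into the matrix form for the directed toy graph of the earlier slide (A->B, B->C, C->A, D->B, D->E, E->B):

```python
nodes = ["A", "B", "C", "D", "E"]
adj_list = {"A": ["B"], "B": ["C"], "C": ["A"],
            "D": ["B", "E"], "E": ["B"]}

# derive the adjacency matrix from the adjacency list
idx = {v: i for i, v in enumerate(nodes)}
mat = [[0] * len(nodes) for _ in nodes]
for u, targets in adj_list.items():
    for v in targets:
        mat[idx[u]][idx[v]] = 1

for v, row in zip(nodes, mat):
    print(v, row)
```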

Graph dimensions

> g <- graph.formula(A-B,A-C,
        B-C,B-D,B-E,D-E)
> gorder(g)
[1] 5
> gsize(g)
[1] 6
> edge_density(g, loops = FALSE)
[1] 0.6

> shortest.paths(g,v="A",to="D")
    D
  A 2
> get.shortest.paths(g,from="A",to="D")
  $vpath
  $vpath[[1]]
  [1] 1 2 4
> diameter(g)
  [1] 2

Degree

> degree(g)
  A B C D E 
  2 4 2 2 2
> # in/out degrees below refer to the directed graph of the
> # "Types of Graphs" slide:
> degree(g, mode="in")
  A B C D E 
  1 3 1 0 1 
> degree(g, mode="out")
  A B C D E 
  1 1 1 2 1 

  • The degree of a vertex is the number of incident edges
  • Out-degree: out of a vertex
  • In-degree: toward a vertex

Clusters and Communities in a graph

Clusters (disconnected parts):

> clusters(g)
$membership
[1] 1 1 1 1 1

$csize
[1] 5

$no
[1] 1
  • Graph partitioning (minimize distances within communities, maximize distances between them):

    > fastgreedy.community(g)
    Graph community structure calculated with the fast greedy algorithm
    Number of communities (best split): 2
    Modularity (best split): 0.1111111
    Membership vector:
    A B C D E 
    1 1 1 2 2
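`clusters(g)` reports connected components; the underlying idea is a graph traversal from each unvisited vertex, sketched here in Python on the same toy graph:

```python
def components(adj):
    """Connected components of an undirected graph (adjacency dict),
    found by a depth-first traversal from each unvisited vertex."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

adj = {"A": ["B", "C"], "B": ["A", "C", "D", "E"],
       "C": ["A", "B"], "D": ["B", "E"], "E": ["B", "D"]}
print(components(adj))   # one component of size 5, as clusters(g) reports
```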

Drawing a graph

> g <- graph.formula(A-+B,A-+C,A-+D,
       B-+F,B-+G,C-+H,C-+I)
> par(mfrow=c(1,3))
> igraph.options(vertex.size=25, edge.arrow.size=0.5)
> plot(g,layout=layout.circle)
> plot(g,layout=layout.reingold.tilford)
> plot(g,layout=layout.reingold.tilford(g,circular=T))