Systems Biology

Part I: Expression networks

Prof. Patrick E. Meyer

Motivation

Synthetic Biology

Solutions for a better future: Mutate one gene and create a perfect cell factory!?

  • Results: the mutant dies in a few hours.
  • Reason: genes have multiple roles, changing one alters many functions.
  • The cell is a circuit of interconnected elements (proteins/genes/metabolites/RNA/DNA/…)

Understanding circuitry

Understanding the circuitry helps to find:

  • the gene least involved in other functions,
  • the environmental conditions suited for the mutant,
  • another mutation that compensates for collateral losses.

Changing paradigm

  • Mercator map

    • 16th century (the most precise map for one century)
    • Two human generations to produce
  • Satellite imagery

    • Highly precise and informative
    • A few seconds (through sensors and computers)

Changes in Biology

  • Traditional genetic map

    • Informative connections, but few of them
    • Decades of work from a whole team
  • Large-scale gene network

    • Thousands of genes
    • A few weeks of work by one scientist

Satellites in Biology

  • Microscope and manipulations (PCR, fluorescence,…)

  • High-throughput sequencers (transcriptomics)

Meta-networks

  • Systems biology is the computational modeling of complex biological systems.
  • It is focused on networks (graphs representing a system of interconnected elements).
  • It is a partner field to synthetic biology.

By 2050 ?

  • Given a seed with

    • the DNA sequence,
    • the composition of the medium/organelles,
    • environmental parameters,
  • Computers will be able to

    • grow the organism numerically… at accelerated speed.
    • predict the impact of environmental and molecular changes.
  • Numerical experiments (faster and less costly) will be a precursor to lab experiments.

Inferring Networks (part I)

Transcriptional Network

⇒ Each cell has an encoded network (circuit) in DNA.

  • \(gene \rightarrow RNA \rightarrow protein\)
  • Some proteins, called transcription factors (TFs), can modify the RNA production of their target genes (TGs)

  • Each node is a gene.
  • An arc connects a regulator gene (TF) to a regulated one (TG).

Expression data

  • microarray
  • RNA‑seq

Data matrix (illustrative)

\[\begin{array}{|c|c|c|c|c|} \hline \text{DATA} & X_{1} & X_{2} & \dots & X_{n} \\ \hline s_{1} & 0.1 & 0.9 & \dots & 0.5 \\ \hline \vdots & \vdots & \vdots & \ddots & \vdots \\ \hline s_{m} & 0.2 & 0.3 & \dots & 0.8 \\ \hline \end{array}\]

Spearman’s correlation

i.e., Pearson’s correlation computed on the ranks of the data, which better detects monotonic (not only linear) dependencies
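As a minimal sketch of this rank idea (assuming NumPy; the data are illustrative), Spearman’s correlation can be obtained by ranking both variables and applying Pearson’s formula to the ranks:

```python
import numpy as np

def rank(v):
    # simple ranking, assuming no ties in v
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks
    return np.corrcoef(rank(x), rank(y))[0, 1]

x = np.linspace(-2, 2, 50)
y = x**3                         # monotonic but strongly non-linear
print(spearman(x, y))            # ≈ 1: perfect monotonic dependence
print(np.corrcoef(x, y)[0, 1])   # < 1: Pearson underestimates it
```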

Co‑expression networks

Compute squared Spearman correlation for all pairs

\[\begin{array}{|c|c|c|c|c|} \hline \hat{T} & X_{1} & X_{2} & \dots & X_{n} \\ \hline X_{1} & - & 0.2 & \dots & 0.9 \\ \hline X_{2} & 0.2 & - & \dots & 0.1 \\ \hline \vdots & \vdots & \vdots & \ddots & \vdots \\ \hline X_{n} & 0.9 & 0.1 & \dots & - \\ \hline \end{array}\]
  • why squaring correlations?
  • why is the matrix symmetric? diagonal irrelevant?
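The construction of \(\hat T\) can be sketched as follows (a toy example assuming NumPy; the data are simulated). Squaring makes the sign of the correlation irrelevant, and the matrix is symmetric because correlation is:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                    # 30 samples x 4 genes
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=30)  # genes 1 and 2 co-expressed

# rank-transform each column; squared Pearson on ranks = squared Spearman
R = np.argsort(np.argsort(X, axis=0), axis=0)
T = np.corrcoef(R, rowvar=False) ** 2
np.fill_diagonal(T, 0)   # self-correlation carries no information
print(np.round(T, 2))    # symmetric: T[i, j] == T[j, i]
```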

Improving Co‑expression Networks

  • Compute co-expression matrix
  • Apply an arc-elimination algorithm on the matrix, e.g.
    • RelNet (thresholding) 
    • ARACNE
    • CLR
    • MRNET
  • But direct thresholding (RelNet) usually fails. Why?
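RelNet itself is trivial to sketch (toy similarity matrix and a hypothetical threshold, assuming NumPy). The sketch also hints at the answer: a strong but indirect correlation survives any reasonable global threshold.

```python
import numpy as np

# toy squared-correlation matrix for 3 genes
T = np.array([[0.0, 0.8, 0.6],
              [0.8, 0.0, 0.7],
              [0.6, 0.7, 0.0]])

# RelNet: keep every edge whose similarity exceeds a global threshold
net = (T > 0.5).astype(int)
print(net)
# if genes 1 and 3 interact only through gene 2, their indirect
# correlation (0.6) still passes the threshold -> false-positive edge
```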

ARACNE

Algorithm for the Reconstruction of Accurate Cellular NEtworks

Three possible indirect‑interaction motifs (three variables):

  • \(X_j \rightarrow X_k \rightarrow X_i\)

  • \(X_j \leftarrow X_k \rightarrow X_i\)

  • \(X_j \rightarrow X_k \leftarrow X_i\)

  • By the data‑processing inequality, if \(X_j\) and \(X_i\) interact only through \(X_k\), then

    \[ I(X_j;X_i) \le I(X_j;X_k) \quad\text{and}\quad I(X_j;X_i) \le I(X_k;X_i) \]

  • For every triangle, eliminate the weakest link.
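The pruning step above can be sketched directly (assuming NumPy; the mutual-information values are illustrative, not ARACNE’s full implementation, which also uses a tolerance parameter):

```python
import numpy as np

def aracne_dpi(M):
    """For every fully connected triple, mark the weakest edge for
    removal, as suggested by the data-processing inequality."""
    keep = M > 0
    n = M.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k == i or k == j:
                    continue
                if M[i, k] > 0 and M[k, j] > 0 and \
                   M[i, j] < min(M[i, k], M[k, j]):
                    keep[i, j] = keep[j, i] = False
    return np.where(keep, M, 0.0)

# chain X1 -> X2 -> X3: the indirect X1-X3 link is the weakest and is cut
M = np.array([[0.0, 0.8, 0.3],
              [0.8, 0.0, 0.7],
              [0.3, 0.7, 0.0]])
print(aracne_dpi(M))
```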

MRNET

Minimum Redundancy Network (Meyer et al., 2007)

For each target variable \(X_j\), select predictors using MRMR (minimum redundancy – maximum relevance):

\[ X_i^{\text{MRMR}} = \arg\max_{X_i \in X\setminus S}\bigl(u_i - z_i\bigr) \]

  • Relevance \(u_i = I(X_i;X_j)\)
  • Redundancy \(z_i = \frac{1}{|S|}\sum_{X_k \in S} I(X_i;X_k)\)

Iteratively build the network.
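The greedy MRMR selection can be sketched as follows (assuming NumPy and a precomputed mutual-information matrix `I`; the values are illustrative):

```python
import numpy as np

def mrmr(I, target, k):
    """Greedily pick k predictors of `target` from the mutual-information
    matrix I, maximizing relevance minus mean redundancy with the set S."""
    candidates = [i for i in range(I.shape[0]) if i != target]
    S = []
    for _ in range(k):
        def score(i):
            u = I[i, target]                                    # relevance
            z = np.mean([I[i, j] for j in S]) if S else 0.0     # redundancy
            return u - z
        best = max(candidates, key=score)
        S.append(best)
        candidates.remove(best)
    return S

# genes 1 and 2 are both relevant to gene 0 but redundant with each other;
# gene 3 is less relevant but brings new information
I = np.array([[0.0, 0.9, 0.85, 0.5],
              [0.9, 0.0, 0.8,  0.1],
              [0.85, 0.8, 0.0, 0.1],
              [0.5, 0.1, 0.1,  0.0]])
print(mrmr(I, target=0, k=2))   # picks gene 1, then gene 3 (not 2)
```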

Example MRMR (1)

Example MRMR (2)

Example MRMR (3)

Example MRMR (4)

CLR – Context Likelihood of Relatedness

\[ w_{ij}= \sqrt{z_{i}^{2}+z_{j}^{2}} \]

where \(z_i\) and \(z_j\) are z-scores

\[ z_{i}= \max\!\left(0,\, \frac{I(X_{i};X_{j})-\mu_{i}}{\sigma_{i}}\right) \]

with \(\mu_{i}\) and \(\sigma_{i}\) the mean and standard deviation of the background distribution of mutual informations \(I(X_{i};X_{k})\), \(k=1,\dots,n\).

CLR (2)

Principle: each gene chooses its best friends.

(Hayete et al., 2007)
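A minimal sketch of the CLR scoring (assuming NumPy; for simplicity the background statistics here include the zero diagonal, which the real implementation handles more carefully):

```python
import numpy as np

def clr(M):
    """CLR scores from a mutual-information matrix M: z-score each
    I(Xi;Xj) against gene i's own background, clip negatives at zero,
    then combine the two genes' views symmetrically."""
    mu = M.mean(axis=1, keepdims=True)
    sd = M.std(axis=1, keepdims=True)
    Z = np.maximum(0.0, (M - mu) / sd)
    return np.sqrt(Z**2 + Z.T**2)

M = np.array([[0.0, 0.8, 0.3],
              [0.8, 0.0, 0.7],
              [0.3, 0.7, 0.0]])
W = clr(M)
print(np.round(W, 2))   # symmetric by construction
```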

Validating Networks (part I)

Problem Formalization

Network Inference Objective
> Given a dataset \(D_{m}\), infer a network \(\hat T\) that has the highest similarity \(S\) with the known (or partially known) true network \(T\). \[ \hat{T}^{\max}= \arg\max_{\hat{T}} S(\hat{T},T) \]

  • Confusion matrix

    \[\begin{array}{|c|c|c|} \hline \text{edge} & \text{actual positive} & \text{actual negative} \\ \hline \text{inferred positive} & TP & FP \\ \hline \text{inferred negative} & FN & TN \\ \hline \end{array}\]

Weighted Network

Inferred networks are weighted:

\[\begin{array}{|c|c|c|} \hline \text{TF} & \text{TG} & w \\ \hline X_{1} & X_{2} & 0.1 \\ \hline X_{1} & X_{3} & 0.3 \\ \hline X_{i} & X_{k} & 0 \\ \hline X_{\#tf} & X_{\#tg} & 0.83 \\ \hline \end{array}\]

→ An infinity of networks can be derived (one per weight threshold).
Which confusion matrix should we use?

ROC Curves

\[ TPR = \frac{TP}{TP+FN}=p(TP)\] \[ FPR = \frac{FP}{FP+TN}=p(FP) \]

Precision–Recall


Precision: fraction of correct predictions

\[PRE = \dfrac{TP}{TP+FP}\]

Recall (= TPR): fraction of known edges recovered

\[REC = \dfrac{TP}{TP+FN}\]
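The threshold dependence can be made concrete with a toy sketch (hypothetical edge weights and ground truth, assuming NumPy): each threshold yields its own confusion matrix, hence its own (precision, recall) point on the curve.

```python
import numpy as np

# hypothetical edge weights (inferred) and ground-truth labels (1 = true edge)
w    = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
true = np.array([1,   1,   0,   1,   0,   0])

# one confusion matrix per threshold -> one (precision, recall) point each
for t in [0.85, 0.5, 0.05]:
    pred = w >= t
    tp = int(np.sum(pred & (true == 1)))
    fp = int(np.sum(pred & (true == 0)))
    fn = int(np.sum(~pred & (true == 1)))
    print(t, tp / (tp + fp), tp / (tp + fn))
```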

One network - One score

  • Correlation with partial truth

  • AUROC

  • AUPR10 and AUPR20

  • F‑Score (weighted harmonic mean of precision and recall)

    \[ 0 \le F = \frac{2\; pre \cdot rec}{pre+rec} \le 1 \]
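A quick numeric illustration of why the harmonic mean is used: F is dominated by the weaker of the two scores.

```python
def f_score(pre, rec):
    # weighted harmonic mean: dominated by the smaller of the two
    return 2 * pre * rec / (pre + rec)

print(f_score(0.5, 0.5))   # 0.5
print(f_score(0.9, 0.1))   # 0.18: one weak side drags F down
```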

MiNET Bioconductor Package

Citation: Meyer et al. 2008

  • Four network‑inference methods (rel, clr, aracne, mrnet)
  • Three validation tools (fscores, ROC, PR)
library(minet)
res <- validate(my.net, true.net)          # compare inferred vs true network
show.pr(res, col = "green", type = "b")    # precision–recall curve
auc.pr(res)                                # area under the PR curve
show.roc(res, col = "green", type = "b")   # ROC curve
auc.roc(res)                               # area under the ROC curve

Experimental Framework

Results

Wins and losses are statistically significant across settings with

  • from 100 to 1000 variables
  • from 100 to 1000 microarrays (samples)
  • from 0 % to 30 % Gaussian noise
Match scores (wins – losses):

  • MRNET / RELNET: 9 – 2
  • MRNET / ARACNE: 8 – 3
  • MRNET / CLR: 12 – 8

Exploiting Networks (part I)

Graphs for Guilt-by-association

A graph G is a mathematical object that allows us to represent networks (interconnected elements).

  • Formally, \(G = (V, E)\)

    • V is a set of vertices, E is a set of edges.
    • An edge is a pair of vertices (e.g. {1,2}).
  • Guilt-by-association:

    • Social clubs
    • EEG (brain areas)
    • Functional gene sets (predictions on unknown genes)!
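Guilt-by-association can be sketched as a majority vote over a gene’s neighbours (hypothetical gene names and annotations, for illustration only):

```python
from collections import Counter

# an unannotated gene inherits the majority function of its neighbours
neighbours = {"geneX": ["geneA", "geneB", "geneC"]}
function   = {"geneA": "ribosome", "geneB": "ribosome", "geneC": "kinase"}

votes = Counter(function[g] for g in neighbours["geneX"])
prediction = votes.most_common(1)[0][0]
print(prediction)   # "ribosome"
```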

The power of using graphs

Examples:

  • Bio: How many genes are targeted by a TF?
    CS: What is the degree of a vertex?
  • Bio: Minimum number of steps needed to convert pyruvate into lactate?
    CS: What is the shortest path between two vertices?
  • Bio: The maximal set of interacting proteins?
    CS: What is the maximally connected subgraph of G?

R package igraph

There are R packages for dealing with graphs, like igraph.

> library(igraph)
> g <- graph.formula(A-B, A-C, B-C, B-D, B-E, D-E)
> V(g)
Vertex sequence:
[1] "A" "B" "C" "D" "E"
> E(g)
Edge sequence:
[1] B -- A
[2] C -- A
[3] C -- B
[4] D -- B
[5] E -- B
[6] D -- E
> plot(g)

Types of Graphs

> g <- graph.formula(A-+B, A+-C, B-+C, B+-D, B+-E, D-+E)
> plot(g)
> is.weighted(g)
[1] FALSE
> E(g)$weight <- runif(ecount(g))
> is.weighted(g)
[1] TRUE

  • undirected (previous slide): an edge is an unordered pair, {1,2} = {2,1}
  • directed (this slide): an arc is an ordered pair, (1,2) ≠ (2,1)

Coding Adjacencies

Adjacency matrix:
  A B C D E
A . 1 . . .
B . . 1 . .
C 1 . . . .
D . 1 . . 1
E . 1 . . .
Adjacency list:
A -> B
B -> C
C -> A
D -> B E
E -> B
  • lists are less memory-intensive (especially for sparse graphs)
  • but edge lookups are faster with matrices
  • both can be used to create graphs:
    • graph_from_edgelist(el,...)
    • graph_from_adjacency_matrix(mat,...)
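The two encodings are interchangeable; a minimal sketch converting the list form into the matrix form for the directed toy graph of the earlier slide (A->B, B->C, C->A, D->B, D->E, E->B):

```python
nodes = ["A", "B", "C", "D", "E"]
adj_list = {"A": ["B"], "B": ["C"], "C": ["A"],
            "D": ["B", "E"], "E": ["B"]}

# derive the adjacency matrix from the adjacency list
idx = {v: i for i, v in enumerate(nodes)}
mat = [[0] * len(nodes) for _ in nodes]
for u, targets in adj_list.items():
    for v in targets:
        mat[idx[u]][idx[v]] = 1

for v, row in zip(nodes, mat):
    print(v, row)
```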

Graph dimensions

> g <- graph.formula(A-B,A-C,
        B-C,B-D,B-E,D-E)
> gorder(g)
[1] 5
> gsize(g)
[1] 6
> edge_density(g, loops = FALSE)
[1] 0.6

> shortest.paths(g,v="A",to="D")
    D
  A 2
> get.shortest.paths(g,from="A",to="D")
  $vpath
  $vpath[[1]]
  [1] 1 2 4
> diameter(g)
  [1] 2

Degree

> degree(g)
  A B C D E 
  2 4 2 2 2
> # in/out degrees below refer to the directed graph of the
> # "Types of Graphs" slide:
> degree(g, mode="in")
  A B C D E 
  1 3 1 0 1 
> degree(g, mode="out")
  A B C D E 
  1 1 1 2 1 

  • The degree of a vertex is the number of incident edges
  • Out-degree: out of a vertex
  • In-degree: toward a vertex

Clusters and Communities in a graph

Clusters (disconnected parts):

> clusters(g)
$membership
[1] 1 1 1 1 1

$csize
[1] 5

$no
[1] 1
  • Graph partitioning (minimize distances within communities, maximize distances between them):

    > fastgreedy.community(g)
    Graph community structure calculated with the fast greedy algorithm
    Number of communities (best split): 2
    Modularity (best split): 0.1111111
    Membership vector:
    A B C D E 
    1 1 1 2 2
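`clusters(g)` reports connected components; the underlying idea is a graph traversal from each unvisited vertex, sketched here in Python on the same toy graph:

```python
def components(adj):
    """Connected components of an undirected graph (adjacency dict),
    found by a depth-first traversal from each unvisited vertex."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

adj = {"A": ["B", "C"], "B": ["A", "C", "D", "E"],
       "C": ["A", "B"], "D": ["B", "E"], "E": ["B", "D"]}
print(components(adj))   # one component of size 5, as clusters(g) reports
```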

Drawing a graph

> g <- graph.formula(A-+B,A-+C,A-+D,
       B-+F,B-+G,C-+H,C-+I)
> par(mfrow=c(1,3))
> igraph.options(vertex.size=25, edge.arrow.size=0.5)
> plot(g,layout=layout.circle)
> plot(g,layout=layout.reingold.tilford)
> plot(g,layout=layout.reingold.tilford(g,circular=T))