Part I: Expression networks
Solutions for a better future: Mutate one gene and create a perfect cell factory!?
Mercator map
16th century (most precise for one century)
Two human generations
Satellite
Highly precise and informative
Few seconds (through sensors and computers)
Traditional genetic map
Informative connections but small number
Decades of work from a whole team
Large-scale gene network
Thousands of genes
A few weeks by one scientist
Microscope and manipulations (PCR, fluorescence,…)
High-throughput sequencers (transcriptomics)
Given a seed with
Computers will be able to
Numerical experiments (faster and less costly) will be a precursor to lab experiments.
⇒ Each cell has an encoded network (circuit) in DNA.
Data matrix (illustrative)
\[\begin{array}{|c|c|c|c|c|} \hline \text{DATA} & X_{1} & X_{2} & \dots & X_{n} \\ \hline s_{1} & 0.1 & 0.9 & \dots & 0.5 \\ \hline \vdots & \vdots & \vdots & \ddots & \vdots \\ \hline s_{m} & 0.2 & 0.3 & \dots & 0.8 \\ \hline \end{array}\]i.e. a Pearson’s correlation but on ranks, to better detect monotonic dependencies
Compute squared Spearman correlation for all pairs
\[\begin{array}{|c|c|c|c|c|} \hline \hat{T} & X_{1} & X_{2} & \dots & X_{n} \\ \hline X_{1} & - & 0.2 & \dots & 0.9 \\ \hline X_{2} & 0.2 & - & \dots & 0.1 \\ \hline \vdots & \vdots & \vdots & \ddots & \vdots \\ \hline X_{n} & 0.9 & 0.1 & \dots & - \\ \hline \end{array}\]Let \(I(X,Y)\) be an information measure.
If \(X_i \rightarrow X_k \rightarrow X_j\) (or \(X_i \leftarrow X_k \rightarrow X_j\)),
then \(I(X_i,X_k)\) and \(I(X_k,X_j)\) are high
and \(I(X_i,X_j)\) is likely high as well, suggesting a false link between \(X_i\) and \(X_j\).
\[ \begin{array}{rcl} & fire\\ \swarrow & ? & \searrow\\ firemen \quad \quad& \longleftrightarrow &\quad victims\end{array} \]
Algorithm for the Recon. of Accurate Cellular NEtwork
Three possible indirect‑interaction motifs (three variables):
\(X_j \rightarrow X_k \rightarrow X_i\)
\(X_j \leftarrow X_k \rightarrow X_i\)
\(X_j \rightarrow X_k \leftarrow X_i\)
By the data‑processing inequality
\[ I(X_j;X_i) <= I(X_j;X_k) \quad\text{and}\quad I(X_j;X_i) <= I(X_k;X_i) \]
For every triangle, eliminate the weakest link.
Minimum Redundancy Network (Meyer et al., 2007)
For each target variable \(X_j\), select predictors using MRMR (minimum redundancy – maximum relevance):
\[ X_i^{\text{MRMR}} = \arg\max_{X_i \in X\setminus S}\bigl(u_i - z_i\bigr) \]
Iteratively build the network.
\[ w_{ij}= \sqrt{z_{i}^{2}+z_{j}^{2}} \]
where \(z_i\) and \(z_j\) are z-scores
\[ z_{i}= \max\!\left(0,\, \frac{I(X_{i};X_{j})-\mu_{i}}{\sigma_{i}}\right) \]
with \(\mu_{i}\) and \(\sigma_{i}\) the mean and standard deviation of the distribution of informations \(I(X_{i},X_{k})\), \(k=1,\dots,n\).
principle: Each gene is choosing his bests friends
{Hayete et al. 2007}
Network Inference Objective
> Given a dataset \(D_{m}\), infer a network \(\hat T\) that has the highest similarity \(S\) with the known (or partially known) true network \(T\). \[
\hat{T}^{\max}= \arg\max_{\hat{T}} S(\hat{T},T)
\]
Confusion matrix
| edge | actual positive | actual negative |
|---|---|---|
| inferred positive | TP | FP |
| inferred negative | FN | TN |
Inferred networks are weighted:
| tf | tg | \(w\) |
|---|---|---|
| \(X_{1}\) | \(X_{2}\) | 0.1 |
| \(X_{1}\) | \(X_{3}\) | 0.3 |
| \(X_{i}\) | \(X_{k}\) | 0 |
| … | … | … |
| \(X_{\#tf}\) | \(X_{\#tg}\) | 0.83 |
→ Infinity of networks (depending on the threshold).
Which confusion matrix do we use?
\[ TPR = \frac{TP}{TP+FN}=p(TP)\] \[ FPR = \frac{FP}{FP+TN}=p(FP) \]
Precision: fraction of correct predictions
\[PRE = \dfrac{TP}{TP+FP}\] Recall: fraction of knowledge recovered
\[REC= \dfrac{TP}{TP+FN}\]
Correlation with partial truth
AUROC
AUPR10 and AUPR20
F‑Score (weighted harmonic mean of precision and recall)
\[ 0 \le F = \frac{2\;pre*rec}{pre+rec} \le 1 \]
Citation: Meyer et al. 2008
Wins/losses statistically significant on
| Match | score |
|---|---|
| MRNET / RELNET | 9 – 2 |
| MRNET / ARACNE | 8 – 3 |
| MRNET / CLR | 12 – 8 |
A graph G is a mathematical object that allows us to represent networks (interconnected elements).
Formally, (G = (V, E))
Guilt-by-association:
Examples:
There are R packages for dealing with graphs, like igraph.
> g <- graph.formula(A-+B, A+-C, B-+C, B+-D, B+-E, D-+E)
plot(g)
is.weighted(g)
# [1] FALSE
E(g)$weight <- runif(ecount(g))
is.weighted(g)
# [1] TRUE
undirected (i.e. previous slide)
directed (i.e. {1,2} ≠ {2,1})
graph_from_edgelist(el,...)graph_from_adjacency_matrix(mat,...)Graph partitioning (Min dist in - max dist out):
> g <- graph.formula(A-+B,A-+C,A-+D,
B-+F,B-+G,C-+H,C-+I)
> par(mfrow=c(1,3))
> igraph.options(vertex.size=25, edge.arrow.size=0.5)
> plot(g,layout=layout.circle)
> plot(g,layout=layout.reingold.tilford)
> plot(g,layout=layout.reingold.tilford(g,circular=T))Systems Biology – Part I – Prof. Patrick E. Meyer