Biological Data Analysis

II - Language and datasets (version 3.0)

Prof. Patrick E. Meyer

S: the first statistical language

S was developed in 1976 by J. Chambers et al. at Bell Laboratories, with the following objectives:
- interactive
- statistics with little programming
- user friendly
1988 Version 3
1998 Version 4
2008 licenses sold 25M

R history

1991 R (S clone) is born
1995 GNU License (freedom to use, study, share, copy, and modify the software)
2000 Version 1
2013 Version 3
- 4000+ packages
- Easy graphics

Bioconductor

Specific to molecular biology and medical data

Reproducibility
2001 Version 1 (15 packages)
2013 Version 3 (750 packages)

RStudio Desktop - Server (2011)

numeric

a <- 5 #assigning
5:10 #sequence from 5 to 10
myvec <- c(1,2,4, 5:7) #concatenate
myvec[3] #accessing
myvec[3:5] #multiple accessing
myvec[c(2,6,4)] #multiple accessing
myvec2 <- c("hello", "world") #character vector
myvec2[-1] #omitting the first element

logical

  a <- T #assigning
  b <- F #assigning

Comparisons: a == b, a != b, a >= b, a <= b, a > b, a < b
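As a small illustration of the comparison operators, here is a minimal sketch on two logical values. It relies on the fact that R treats TRUE as 1 and FALSE as 0 when ordering them.

```r
a <- TRUE
b <- FALSE

eq <- a == b   # FALSE: the two values differ
ne <- a != b   # TRUE
gt <- a > b    # TRUE: TRUE counts as 1, FALSE as 0
le <- a <= b   # FALSE
```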

numeric with logical condition

myvec <- c(1,2,4, 5:7) #concatenate
myvec[c(T,F,F,T,T,F)]
myvec>2
myvec<=6
myvec>3 & myvec<=6
myvec[myvec>3 & myvec<=6]

functions

myvec*3 #operators such as (+ - * / ^2)
?log #help function
log(myvec)
sort(myvec)
max(myvec)
sum(myvec)
which(myvec==4)

objects in R

numeric
- numbers
- logical #T,F
- character #type “hello”
factor #categories
list #mixed types
matrix #numeric in 2D
data.frame #list of numeric and factor
function
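
The object types listed above can be created and inspected as follows. This is only a sketch: all variable names and values are hypothetical and chosen for illustration.

```r
# Hypothetical example values, one per object type
num <- c(1.5, 2, 3)                          # numeric vector
lgl <- c(TRUE, FALSE)                        # logical vector
chr <- c("hello", "world")                   # character vector
fac <- factor(c("ctrl", "treated", "ctrl"))  # factor (categories)
lst <- list(1, "a", TRUE)                    # list (mixed types)
mat <- matrix(1:6, nrow = 2)                 # matrix (2D)
df  <- data.frame(x = 1:3, group = fac)      # data.frame
sq  <- function(x) x^2                       # function

# class() reports the type of each object
class(num); class(lgl); class(chr); class(fac)
class(lst); class(df); class(sq)

levels(fac)   # the categories stored in the factor
```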

Matrices

 m1<-cbind(A=1:4, B=5:8, C=9:12)
 m2<-rbind(A=1:4, B=5:8, C=9:12)
 dim(m1)
 x <- matrix(1:12, nrow=3, byrow=TRUE) #3 rows, 4 columns, filled row by row
 rownames(x) <- LETTERS[1:3]
 colnames(x) <- letters[1:4] #one name per column (4 columns)
 x[2,3]
 x[2,]
 x[,3]
 x[1,-1]
 x[c(1,2), -c(2,3)]

Lists

For objects of different types

 list1 <- list(34,21,mot1="hi",vec=c(2,1),TRUE)
 list1[[1]]
 list1$mot1
 list1$vec[2]

Note that an .rdata file can (like a list) contain several different objects from the environment. It loads automatically when double-clicked in RStudio, and can otherwise be loaded “manually” with the function load("file.rdata").
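
A minimal sketch of the save/load round trip for an .rdata file. The object names are hypothetical, and a temporary file stands in for "file.rdata".

```r
# Hypothetical objects to store
genes  <- c("g1", "g2", "g3")
counts <- c(10, 20, 30)

# Write both objects to a temporary .rdata file (the path is arbitrary)
path <- tempfile(fileext = ".rdata")
save(genes, counts, file = path)

# Remove them from the environment, then restore them from the file
rm(genes, counts)
loaded <- load(path)   # load() returns the names of the restored objects
loaded
counts
```

Unlike a list, the objects come back under their original names, so load() can silently overwrite objects of the same name in the current environment.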

data.frame

list of vectors and factors (categories) of the same length
rows are the experiments
columns are the variables
Note: in an expression dataset, what does a correlation between columns correspond to, compared with a correlation between rows?
The function data() loads a built-in dataset (i.e. a dataset that ships with the software), e.g. data(airquality).
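
A short sketch of both ideas: building a small data.frame by hand, then loading the built-in airquality dataset. The contents of df are hypothetical values chosen for illustration.

```r
# Hypothetical data: three experiments (rows), two variables (columns)
df <- data.frame(
  expression = c(2.1, 3.4, 1.8),                     # numeric vector
  condition  = factor(c("ctrl", "treated", "ctrl"))  # factor
)
dim(df)         # number of rows and columns
df$expression   # one column, as a vector
df[2, ]         # one row, i.e. one experiment

# Built-in dataset shipped with R (datasets package)
data(airquality)
head(airquality)   # first rows
str(airquality)    # type of each column
```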

Notation summary

single brackets (e.g. randomName[1]) #numeric
brackets with a comma (e.g. randomName[2,4]) #element of a matrix or data.frame
double brackets (e.g. randomName[[2]]) #list or column of data.frame
dollar (e.g. randomName$name[4]) #list or column of data.frame
parentheses (e.g. randomName(10,"a",3,...)) #function
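
Each notation above can be tried on a small object of the matching type. The objects v, m, l and sq below are hypothetical and exist only for this sketch.

```r
v  <- c(10, 20, 30)                               # vector
m  <- matrix(1:12, nrow = 3)                      # matrix, filled column by column
l  <- list(name = c("a", "b", "c", "d"), n = 4)   # list
sq <- function(x) x^2                             # function

v[1]        # single brackets: first element of the vector
m[2, 4]     # brackets with a comma: row 2, column 4
l[[2]]      # double brackets: second element of the list
l$name[4]   # dollar: element "name" of the list, then its 4th value
sq(10)      # parentheses: function call
```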

Expression data and the correlation function

correl <- cor(X[,1],X[,2]) #two vectors
correl2 <- cor(X[,1],X[,-1]) # what does it mean?
correl3 <- cor(X) #???

How to interpret:

Which gene is the most correlated with another?
Which gene is the most anti-correlated with another?
Which gene has the largest squared correlation?
Which gene has the smallest squared correlation?
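
One possible way to explore these questions is sketched below on simulated data. It assumes that X holds experiments in rows and genes in columns, in line with the data.frame convention above. The gene names g1 to g4 and the way the data are generated are hypothetical.

```r
set.seed(1)  # reproducible simulated data
# Hypothetical expression matrix: 20 experiments (rows) x 4 genes (columns)
X <- matrix(rnorm(80), nrow = 20,
            dimnames = list(NULL, c("g1", "g2", "g3", "g4")))
X[, "g2"] <-  X[, "g1"] + rnorm(20, sd = 0.1)  # g2 built to follow g1
X[, "g3"] <- -X[, "g1"] + rnorm(20, sd = 0.1)  # g3 built to oppose g1

C <- cor(X)     # gene-by-gene correlation matrix (columns vs columns)
diag(C) <- NA   # ignore the correlation of each gene with itself

r <- C["g1", ]          # correlations of g1 with every other gene
names(which.max(r))     # gene most correlated with g1
names(which.min(r))     # gene most anti-correlated with g1
names(which.max(r^2))   # largest squared correlation with g1
names(which.min(r^2))   # smallest squared correlation with g1

dim(cor(t(X)))  # correlating rows instead compares experiments
```

A large squared correlation flags a strong linear relationship whatever its sign, while a squared correlation near zero suggests no linear relationship between the two genes.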