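Part 1: GSEA principles
Goal: determine whether the members of a predefined gene set S are randomly distributed throughout the ranked gene list
1. Expression profiles: samples fall into two classes, labeled 1 or 2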
GSEA considers experiments with genomewide expression profiles from samples belonging to two classes, labeled
1 or 2.
2. Genes are ranked by the correlation between their expression and the class distinction
Genes are ranked based on the correlation between their expression and the class distinction by using any suitable metric.
3. Calculate the enrichment score (ES)
Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
Step 1: Calculation of an Enrichment Score.
We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.
The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S.
The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic.
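The running-sum walk described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the GSEA software's own code: the function name, the argument names and the weight exponent p (p=1 weights each hit by its correlation, p=0 gives the classic unweighted statistic) are assumptions made for this example.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Return (ES, running_sum_path) for gene_set along a ranked gene list.

    ranked_genes: gene names, already sorted by correlation with the phenotype.
    correlations: ranking metric of each gene, in the same order.
    p: weight exponent (illustrative parameter; p=0 is the unweighted case).
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    # Total weight of the genes in the set, used to normalise each increment.
    hit_weight = sum(abs(r) ** p for r, h in zip(correlations, hits) if h)
    if hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, of the list")

    running, path = 0.0, []
    for r, h in zip(correlations, hits):
        if h:
            running += abs(r) ** p / hit_weight  # step up on a gene in S
        else:
            running -= 1.0 / n_miss              # step down on a gene not in S
        path.append(running)
    # ES is the maximum deviation from zero, keeping its sign.
    return max(path, key=abs), path


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [3, 2, 1, -1, -2, -3]
es, path = enrichment_score(genes, corr, {"g1", "g2"})
print(es, path)
```

In this toy list both members of the set sit at the very top, so the walk climbs to its peak immediately and then decays back to zero at the end of the list.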
4. Assess the significance of the ES (p-value)
Permutation is used; the number of permutations can be set to 1000, 500, and so on.
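As a rough illustration of the permutation step, the sketch below shuffles gene-set membership along the ranked list (the gene_set style of permutation, chosen here only because it needs no expression matrix) and compares the observed ES with the same-sign part of the null distribution. All function and parameter names are assumptions for this example, and the GSEA software's own implementation may differ in detail.

```python
import random


def enrichment_score(scores, in_set, p=1.0):
    """Signed maximum deviation of the weighted running sum."""
    hit_weight = sum(abs(s) ** p for s, h in zip(scores, in_set) if h)
    n_miss = sum(1 for h in in_set if not h)
    if hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, of the list")
    running = best = 0.0
    for s, h in zip(scores, in_set):
        running += abs(s) ** p / hit_weight if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(scores, in_set, n_perm=1000, seed=0):
    """Return (observed ES, nominal p-value) from n_perm membership shuffles."""
    rng = random.Random(seed)
    observed = enrichment_score(scores, in_set)
    flags = list(in_set)
    null = []
    for _ in range(n_perm):
        rng.shuffle(flags)  # a random gene set of the same size
        null.append(enrichment_score(scores, flags))
    # Use only the null values with the same sign as the observed ES.
    if observed >= 0:
        same_sign = [x for x in null if x >= 0]
        extreme = sum(1 for x in same_sign if x >= observed)
    else:
        same_sign = [x for x in null if x < 0]
        extreme = sum(1 for x in same_sign if x <= observed)
    # Guard against an empty same-sign list (the p-value is then reported as 0).
    return observed, extreme / max(len(same_sign), 1)


scores = [float(x) for x in range(10, -11, -1)]  # 21 genes, already ranked
in_set = [i < 4 for i in range(len(scores))]     # the set is the top 4 genes
print(permutation_p_value(scores, in_set, n_perm=1000))
```

More permutations give a finer-grained p-value at the cost of run time, which is why values such as 500 or 1000 are typical choices.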
5. Multiple-testing correction (FDR)
ref:
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
http://www.pnas.org/content/102/43/15545
https://blog.csdn.net/qq_29300341/article/details/52956052
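Part 2: Running the software
Download link: http://software.broadinstitute.org/gsea/downloads.jsp
Java must be installed beforehand, because the software runs on Java.
1. Software interface
2. File preparation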
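2.1 Expression dataset file (res, gct, pcl, or txt): the sample expression file
It is usually saved as a tab-delimited .txt file; then change the .txt extension to .gct.
# The second column of the table (the description) is mandatory. It can be filled with na, but it must be present. I did not have this column at first and spent a long time getting errors without knowing where the problem was.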
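For readers who prefer to generate the file with a script, here is a minimal Python sketch that writes a GCT file, including the mandatory Description column. The function name and the argument layout are made up for this example; the file layout (a "#1.2" version line, a line with the number of genes and samples, then a header row starting with NAME and Description) follows the GCT format as I understand it.

```python
def write_gct(path, sample_names, rows):
    """Write a GCT expression file.

    rows: list of (gene_name, description, [one value per sample]).
    An empty description is written as "na" so the column is never missing.
    """
    with open(path, "w", newline="\n") as fh:
        fh.write("#1.2\n")                                  # format version line
        fh.write(f"{len(rows)}\t{len(sample_names)}\n")     # genes, samples
        fh.write("\t".join(["NAME", "Description"] + list(sample_names)) + "\n")
        for name, desc, values in rows:
            if len(values) != len(sample_names):
                raise ValueError(f"{name}: expected {len(sample_names)} values")
            fields = [name, desc or "na"] + [str(v) for v in values]
            fh.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    # Hypothetical toy data: two genes, four samples.
    write_gct(
        "example.gct",
        ["ctrl_1", "ctrl_2", "treat_1", "treat_2"],
        [("TP53", "na", [5.1, 4.8, 9.3, 9.9]),
         ("EGFR", "", [7.0, 7.2, 3.1, 2.8])],
    )
```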
2.2 Phenotype labels file (cls): the sample phenotype class file
Write it as a text file with the .cls extension; it is also tab-separated.
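A small Python sketch for producing the .cls file from a list of per-sample labels is given below. The function name and the sep parameter are assumptions for this example. The three-line layout (counts, class names after a "#", then one label per sample in the same order as the expression file) follows the categorical CLS format as I understand it, which accepts either spaces or tabs as the separator.

```python
def write_cls(path, labels, sep=" "):
    """Write a categorical CLS file for the given per-sample labels.

    labels: one class label per sample, in the same order as the columns
    of the expression file. Pass sep="\t" for a tab-separated file.
    """
    classes = list(dict.fromkeys(labels))  # unique labels, first-seen order
    with open(path, "w", newline="\n") as fh:
        # Line 1: number of samples, number of classes, and the constant 1.
        fh.write(sep.join([str(len(labels)), str(len(classes)), "1"]) + "\n")
        # Line 2: class names, introduced by "# ".
        fh.write("# " + sep.join(classes) + "\n")
        # Line 3: the label of every sample.
        fh.write(sep.join(labels) + "\n")


if __name__ == "__main__":
    # Hypothetical labels for four samples.
    write_cls("example.cls", ["ctrl", "ctrl", "treat", "treat"])
```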
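2.3 Gene sets file (gmx or gmt): predefined gene sets (optional)
You can generate this file yourself following the format above; for example, the earlier KEGG localization step can produce such a file.
You can also choose one of the databases defined in the software.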
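If you build your own gene sets, a GMT file can be written with a few lines of Python. This sketch assumes the usual GMT layout of one gene set per line (set name, a description, then the member genes, all tab-separated); the function name and the example pathway contents are hypothetical.

```python
def write_gmt(path, gene_sets):
    """Write gene sets to a GMT file.

    gene_sets: dict mapping set name -> (description, list of gene symbols).
    An empty description is written as "na".
    """
    with open(path, "w", newline="\n") as fh:
        for name, (description, genes) in gene_sets.items():
            if not genes:
                raise ValueError(f"{name}: gene set is empty")
            fields = [name, description or "na"] + list(genes)
            fh.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    # Hypothetical gene sets, for illustration only.
    write_gmt("custom.gmt", {
        "MY_PATHWAY_A": ("na", ["TP53", "MDM2", "CDKN1A"]),
        "MY_PATHWAY_B": ("toy example", ["EGFR", "KRAS"]),
    })
```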
2.4 Chip (array) annotation file (chip): the chip annotation file (optional)
It can be selected in the software.
3. Run
3.1 Load data: load the files prepared above
3.2 Choose parameters
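1) collapse dataset to gene symbols
   true: microarray data
   false: gene expression matrix from sequencing
2) Chip platform
   Can be left unset for non-microarray data
   For microarray data, choose according to the chip type
3) permutation type
   phenotype: recommended; requires at least 7 samples per group
   gene_set: suitable when there are few samples
4) Significance threshold
   With phenotype, the FDR cutoff can be set to 0.25
   With gene_set, the FDR should be below 0.05
5) metric for ranking genes
   log2_Ratio_of_classes, i.e. logFC, is the usual choice
   Other metrics can be chosen as needed
6) gene set database
   Choose from those provided in the software, such as KEGG, GO, and the GO sub-ontologies cc, bp, mf, and so on
   A user-defined gmt file can also be used
7) You can also choose the folder where the results are saved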
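The rules of thumb in the list above can be written down as a small helper, which is handy when many datasets are processed in a row. This is only a restatement of the guidance in this post; the function name and the dictionary keys are made up for this example and are not GSEA parameter names.

```python
def suggest_gsea_settings(n_class1, n_class2, is_microarray):
    """Suggest settings from the rules of thumb listed above."""
    enough_samples = min(n_class1, n_class2) >= 7
    return {
        # Collapse probes to gene symbols only for microarray data.
        "collapse_to_gene_symbols": bool(is_microarray),
        # phenotype permutation needs at least 7 samples per group.
        "permutation_type": "phenotype" if enough_samples else "gene_set",
        # A looser FDR cutoff is acceptable with phenotype permutation.
        "fdr_cutoff": 0.25 if enough_samples else 0.05,
        # log2 ratio of class means, i.e. logFC.
        "ranking_metric": "log2_Ratio_of_Classes",
    }


print(suggest_gsea_settings(3, 3, is_microarray=False))
```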
4. Click the Run button at the bottom
5. Interpreting the results
Part 3: Common errors and solutions
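1. First error: Java heap space, OutOfMemoryError
This is the most troublesome error I have run into so far, and it took me a long time to sort out.
It means that GSEA hit an OutOfMemoryError while running, i.e. the memory available to the program was insufficient.
In the lower-right corner of the screenshot you can see the memory available to the program: here it is 84M, of which 43M is in use.
So the fix is to increase the memory available to Java. My own clumsy workaround was to download Eclipse: https://www.eclipse.org/downloads/
Then I changed the setting by following the tutorials below, after which the analysis ran. When you run GSEA again, you will see that the 84M figure has become much larger.
https://jingyan.baidu.com/article/5d6edee2f5efff99ebdeec63.html
https://blog.csdn.net/tomorrow13210073213/article/details/53031818
The value can be set fairly large.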
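An alternative to the Eclipse route, assuming GSEA was downloaded as a .jar file, is to start it with a larger maximum heap using the standard JVM option -Xmx. The Python sketch below only builds the command line; the jar file name is a placeholder, and the function name is made up for this example.

```python
import re
import subprocess


def gsea_java_command(jar_path, max_heap="4g"):
    """Build a java command that starts a jar with a larger maximum heap.

    max_heap: heap size as accepted by -Xmx, e.g. "2048m" or "4g".
    """
    if not re.fullmatch(r"\d+[kKmMgG]?", max_heap):
        raise ValueError(f"invalid heap size: {max_heap!r}")
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path]


cmd = gsea_java_command("gsea.jar", "4g")  # "gsea.jar" is a placeholder name
print(" ".join(cmd))
# To actually launch the program, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Choose a heap size that fits within the physical memory of the machine.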
Explanation of the metrics for ranking genes
Metrics for Ranking Genes
For categorical phenotypes, GSEA determines a gene’s mean expression value for each phenotype and then uses one of the following metrics to calculate the gene’s differential expression with respect to the two phenotypes. To use median rather than mean expression values, set the Median for class metrics parameter to True, as described above.
● Signal2Noise (default) uses the difference of means scaled by the standard deviation. Note: You must have at least three samples for each phenotype to use this metric.
Signal2Noise = (μA - μB) / (σA + σB)
where μ is the mean and σ is the standard deviation of the gene in phenotype A or B; σ has a minimum value of 0.2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
● tTest uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric.
tTest = (μA - μB) / sqrt(σA^2/nA + σB^2/nB)
where μ is the mean, n is the number of samples, and σ is the standard deviation of the gene in phenotype A or B; σ has a minimum value of 0.2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
● Ratio_of_Classes (also referred to as fold change) uses the ratio of class means to calculate fold change for natural scale data:
Ratio_of_Classes = μA / μB
where μ is the mean of the gene in phenotype A or B. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
● Diff_of_Classes uses the difference of class means to calculate fold change for log scale data:
Diff_of_Classes = μA - μB
where μ is the mean of the gene in phenotype A or B. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
● log2_Ratio_of_Classes uses the log2 ratio of class means to calculate fold change for natural scale data:
log2_Ratio_of_Classes = log2(μA / μB)
where μ is the mean of the gene in phenotype A or B. This is the recommended statistic for calculating fold change for natural scale data.
Source: 丁香園 (DXY), user 夏木1220