QUESTION

Jan 08, 2019

Functional enrichment analysis, clustering and classification of microarray gene expression profiles.

Our second homework assignment covers 1) functional enrichment analysis of differentially expressed genes and 2) clustering and classification of gene expression profiles. We will use the data from "Gene expression profiling predicts clinical outcome of breast cancer", Nature 415, 530 - 536 (2002), as well as the data from Wang et al. We refer to the first dataset as the nature data and to the second as the wang data.

The Nature dataset consists of the following files:
1. training_data.txt : 24,481 gene expressions of 78 tumor samples
2. training_labels.txt : label information of the 78 tumor samples (1: metastasis-free, 2: metastasis-developed)
3. Gene_Name_24481.txt : the probe's systematic name + known gene name (some probes do not have a gene name)
4. test_data.txt : 24,481 gene expressions of the 19 tumor samples for testing
5. test_labels.txt : label information of the 19 tumor samples (1: metastasis-free, 2: metastasis-developed)

Part a) Functional enrichment analysis

In our first homework assignment, we performed a t-test to select differentially expressed genes. To further validate the selected genes, we will perform functional enrichment analysis on them. Download the list of the gene names, and matlab data & script for functional enrichment analysis from here. The genes in the list were selected by t-test from the wang data.

1. (30 points)
1) Perform functional enrichment analysis of the first 200 genes in the provided gene list. What are the top 10 enriched functions of the first 200 genes? Describe the top 10 enriched functions and find evidence of how these functions are related to breast cancer metastasis (check gene ontology annotations).
2) Use the Nature dataset, including training_data.txt and test_data.txt, and perform a t-test. Note that you should remove probes that do not have a gene name. Select the top 200 genes by p-value and perform the same functional enrichment analysis. What are the top 10 enriched functions of the selected genes? Do they overlap with the top 10 enriched functions of the genes selected from the wang dataset? Analyze the results of your comparison.

Part b) Unsupervised learning (K-means clustering)

2. (30 points): Cluster analysis is used to identify genes with similar expression patterns across experimental conditions, or genes with possibly related functions. We will first perform k-means clustering and then check the functions of the genes in the same clusters. Use the wang data and the gene names provided in part a for this problem.
1) Select the first 1000 genes and perform k-means clustering with k = 10, 20, 50 on the expression profiles of the 1000 genes. For each case, plot the histogram of the cluster sizes.
2) Select one cluster from the k-means clustering with k = 20. Perform functional enrichment analysis of the genes in the cluster. Describe the enriched functions of the genes in the cluster. You should report the list of genes in the cluster and the enriched functions.

Part c) Supervised learning (KNN and SVM)

3. (40 points): In this problem, you will perform classification to predict the clinical outcome of patients with KNN and SVM in Matlab. You do not need to tune any parameters for the SVM; just use Matlab's default SVM settings. Use the nature data for this problem.
1) Run the KNN classifier with k = 1, 3, 5 using the samples in training_data.txt as training data and the samples in test_data.txt as test data. Report your classification accuracy for the three cases.
2) Select the top 1,000 genes ranked by their t-test values on the training data of the nature dataset. Repeat the classification using only the top 1,000 genes.
3) Repeat 2) using an SVM with a linear kernel on the top 1,000 genes.

*Extra credit: Forward Feature selection

4. (10 points): Feature selection is often used to select a subset of features from high-dimensional data. In this problem, we apply the Forward Feature Selection method described in class to the nature data. Note that you can use either KNN or SVM as your classifier. Randomly select 70% of the training data for training and keep the rest of the training data as test data. Rank the genes by their individual classification performance. Run "Forward Feature Selection" with a greedy approach to find the best subset of discriminative genes. Repeat this three times. Report the list of selected genes in each trial and their classification performance. How many genes overlap across all three trials? Do the trials show similar classification performance?
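
The greedy procedure in problem 4 can be sketched roughly as follows. This is only an illustrative sketch in Python (the assignment itself expects Matlab), and it assumes one common variant of forward selection: rank genes by single-gene KNN accuracy, then walk down the ranking and keep a gene only if it improves accuracy. The exact variant taught in class is not given here. All function names (knn_predict, split_70_30, forward_selection) are hypothetical, and the data are synthetic, not the nature data.

```python
import random


def knn_predict(train_X, train_y, x, features, k=3):
    """Majority vote among the k nearest training samples, using squared
    Euclidean distance restricted to the gene indices in `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = {}
    for _, label in dists[:k]:
        votes[label] = votes.get(label, 0) + 1
    # Break vote ties in favor of the smaller label to stay deterministic.
    return min(votes, key=lambda lab: (-votes[lab], lab))


def accuracy(train_X, train_y, test_X, test_y, features, k=3):
    hits = sum(
        knn_predict(train_X, train_y, x, features, k) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


def split_70_30(X, y, seed):
    """Randomly keep 70% of the samples for training, the rest for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(0.7 * len(idx)))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])


def forward_selection(train_X, train_y, test_X, test_y, max_features=5, k=3):
    """Greedy forward selection (assumed variant, see the text above)."""
    def score(feats):
        return accuracy(train_X, train_y, test_X, test_y, feats, k)

    # Step 1: rank genes by single-gene accuracy (ties: lower index first).
    n_genes = len(train_X[0])
    ranked = sorted(range(n_genes), key=lambda g: (-score([g]), g))

    # Step 2: add genes in ranked order only when accuracy strictly improves.
    selected, best = [], 0.0
    for g in ranked:
        if len(selected) == max_features:
            break
        acc = score(selected + [g])
        if acc > best:
            selected.append(g)
            best = acc
    return selected, best


# Synthetic stand-in data (NOT the nature data): 40 samples, 6 "genes".
# Gene 0 separates the two labels by construction; genes 1-5 are noise.
rng = random.Random(0)
labels = [1, 2] * 20  # 1: metastasis-free, 2: metastasis-developed
data = [
    [10.0 * lab + rng.uniform(-1, 1)] + [rng.uniform(-1, 1) for _ in range(5)]
    for lab in labels
]

# Three trials, each with a different random 70/30 split.
trials = []
for seed in (1, 2, 3):
    tr_X, tr_y, te_X, te_y = split_70_30(data, labels, seed)
    trials.append(forward_selection(tr_X, tr_y, te_X, te_y))

# Genes selected in every trial.
common = set.intersection(*(set(sel) for sel, _ in trials))
for sel, acc in trials:
    print("selected genes:", sel, "accuracy:", acc)
print("genes selected in all three trials:", sorted(common))
```

On real expression data the selected subsets will usually differ between trials, which is exactly what the overlap question is probing. Note also that scoring candidate genes on the same held-out 30% that is later used to report performance gives an optimistic estimate.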