# Statistical bioinformatics

**Question 1 (10 pts)**

a) What is the probability of observing the sequence 61325 when rolling a fair die?

*Probabilities for fair dice*: P(1)=P(2)=P(3)=P(4)=P(5)=P(6)=1/6

b) What is the probability of observing the sequence 61325 when rolling a loaded die?

*Probabilities for loaded dice*: P(1)=P(2)=P(3)=P(4)=P(5)=0.1 and P(6)=0.5

c) What is the probability of observing A in a random DNA sequence?

d) What is the probability of observing ATGC in a random DNA sequence?

e) What is the probability of observing ATGC in a genome described by a simple model where P(G)=P(C)=0.33 and P(A)=P(T)=0.67, i.e. P(ATGC | simple model)?
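Parts a) and b) rest on the independence of successive rolls: the probability of one specific ordered sequence is the product of the per-roll probabilities. A minimal Python sketch of that product, using the fair and loaded distributions given above (the helper name `sequence_probability` is hypothetical, chosen for illustration):

```python
from math import prod


def sequence_probability(sequence, probs):
    """Probability of an ordered sequence of independent outcomes.

    `probs` maps each symbol to its per-draw probability.
    """
    return prod(probs[symbol] for symbol in sequence)


# Distributions exactly as stated in parts a) and b).
fair = {face: 1 / 6 for face in "123456"}
loaded = {**{face: 0.1 for face in "12345"}, "6": 0.5}

p_fair = sequence_probability("61325", fair)      # (1/6)**5
p_loaded = sequence_probability("61325", loaded)  # 0.5 * 0.1**4
print(p_fair, p_loaded)
```

The same function should work for nucleotide sequences in parts c) to e) if it is given a dictionary of nucleotide probabilities instead of die faces.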

**Question 2 (50 pts)**

On a hypothetical island, a virus outbreak threatens to become a future pandemic. Researchers have narrowed the cause of the outbreak down to two viruses (virus 1 and virus 2). The DNA sequencing lab receives a sample for further analysis. Unfortunately, the sample was contaminated, and removing the foreign DNA leaves the lab with only a short DNA fragment: AGCAGCTTCCAG. Given all the available information (provided below), how can the lab determine which virus caused the outbreak?

Nucleotide probabilities of virus 1: P(A)=P(T)=0.35, P(G)=P(C)=0.15

Nucleotide probabilities of virus 2: P(A)=P(T)=P(G)=P(C)=0.25

Assume:

- Virus 1 and virus 2 are equally likely to occur in nature.
- Nucleotides are independent and identically distributed.
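With equal priors and independent, identically distributed nucleotides, one possible approach is to compute the likelihood of the fragment under each virus model and normalize with Bayes' rule. The sketch below is an illustration under those stated assumptions, not a prescribed solution. It works in log space, which is not needed for 12 bases but avoids underflow on longer sequences. The function names `log_likelihood` and `posteriors` are hypothetical.

```python
from math import exp, log

FRAGMENT = "AGCAGCTTCCAG"

# Nucleotide probabilities as given in the question.
VIRUS_1 = {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15}
VIRUS_2 = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}


def log_likelihood(sequence, probs):
    """log P(sequence | model) for i.i.d. nucleotides."""
    return sum(log(probs[base]) for base in sequence)


def posteriors(sequence, models, priors):
    """Posterior probability of each model given the sequence (Bayes' rule)."""
    log_joint = [
        log_likelihood(sequence, model) + log(prior)
        for model, prior in zip(models, priors)
    ]
    top = max(log_joint)  # subtract the maximum before exponentiating
    weights = [exp(value - top) for value in log_joint]
    total = sum(weights)
    return [weight / total for weight in weights]


post_1, post_2 = posteriors(FRAGMENT, [VIRUS_1, VIRUS_2], [0.5, 0.5])
print(f"P(virus 1 | fragment) = {post_1:.3f}")
print(f"P(virus 2 | fragment) = {post_2:.3f}")
```

Because the priors are equal, comparing the two posteriors is equivalent to comparing the two likelihoods directly; the priors are kept explicit so that the sketch still applies if that assumption is changed.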

**Question 3 (40 pts)**

Align the two sequences shown below using the Needleman-Wunsch algorithm. Use a match score of 3, a mismatch score of -3, and a gap penalty of 2 (note that the gap penalty is subtracted from the score).

Show:

a) the dynamic programming matrix with scores (as shown in Durbin’s Figure 2.5)

b) the traceback pointers

c) the alignment score

Sequences:

Sequence 1: AGAGCTCACAA

Sequence 2: AGTAGCTTCCAAA
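A hand-built matrix can be checked against a short program. The following is a minimal sketch of global alignment with a linear gap penalty, using the scores stated in the question. It assumes that ties are broken in the order diagonal, up, left, so it reports only one of possibly several optimal alignments. The function name `needleman_wunsch` and the pointer labels are illustrative choices.

```python
MATCH, MISMATCH, GAP = 3, -3, 2  # the gap penalty is subtracted

SEQ1 = "AGAGCTCACAA"
SEQ2 = "AGTAGCTTCCAAA"


def needleman_wunsch(seq1, seq2):
    n, m = len(seq1), len(seq2)
    # score[i][j]: best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    pointer = [[None] * (m + 1) for _ in range(n + 1)]

    # First column and first row: prefixes aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0], pointer[i][0] = -GAP * i, "up"
    for j in range(1, m + 1):
        score[0][j], pointer[0][j] = -GAP * j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if seq1[i - 1] == seq2[j - 1] else MISMATCH
            candidates = [
                (score[i - 1][j - 1] + sub, "diag"),
                (score[i - 1][j] - GAP, "up"),    # gap in seq2
                (score[i][j - 1] - GAP, "left"),  # gap in seq1
            ]
            # max() keeps the first maximal candidate, so ties prefer "diag".
            score[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right cell to the origin.
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score, pointer, "".join(reversed(aligned1)), "".join(reversed(aligned2))


score, pointer, aligned1, aligned2 = needleman_wunsch(SEQ1, SEQ2)
for row in score:
    print(" ".join(f"{value:4d}" for value in row))
print(aligned1)
print(aligned2)
print("alignment score:", score[-1][-1])
```

Rows of the printed matrix correspond to prefixes of sequence 1 and columns to prefixes of sequence 2; the `pointer` table holds the traceback direction for each cell.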