Computational and Systems Biology

BHS043-2

AY21/22

Introduction

The techniques that we will be using in this practical will be new to most of you, so to develop confidence in using them, and to get an understanding of the information they provide, we will go through some worked examples. In this workshop we shall look at how to retrieve sequences from the Los Alamos HIV database; these will then be aligned using MUSCLE, which will produce both an alignment and a phylogenetic tree. Finally, we will determine a more rigorous phylogenetic tree using MegaX.

Figure 1 Genome Structure of HIV

Part 1: HIV Databases

Task 1 Finding the database

The data source we will use is held at the Los Alamos National Laboratory in the US. Those with a sense of science history will remember that Los Alamos was where atomic weapons development took place during WW2. The easiest way to find the database is to Google it. Try something like Los Alamos HIV database.

With a bit of luck the Pathogen Database is your top hit. Follow that to:

Choose HIV databases, which takes you to the HIV front page

Click sequence database and on the next page follow Search Interface under Programs and Tools. This will be your main port of call in the future so you might want to bookmark it.

You can interrogate the database in many ways. HIV is unambiguously categorised (subtypes, gene names, geographical origin etc.), so it's possible to be quite precise in your search terms. In this example we will do something simple: find all the subtype A protease genes.

I have selected A in the Subtype box and protease in the Genomic region field. I have chosen the protease because it is small, which makes it easier to look at the alignments later. Click the search box for your results.

Your results should look like this. I have found almost 800 sequences. What's the length of this gene? How many amino acids does it code for?

I have started choosing sequences (above) using the boxes on the left-hand side of each record. I would like you to choose the first sequences from Côte d'Ivoire, Lebanon, Uganda, Spain, and USA.

Click Download Sequences at the top of the page. This will bring up another box containing the download options. Click Align off and then hit Download.

This is the next page

The hiv-db.fasta link is your file. Click to download and then find it in your downloads folder. The downloadable file always has the same name so you need to start renaming using meaningful file names. I would suggest that you also set up a working folder rather than leaving things in downloads; also note down names in your notebook.

You should be able to open the file using the basic text editor on a PC (Notepad, I think) or TextEdit on a Mac.

The file should look something like this: 4 sequences in FASTA format. This is a standard sequence format. Each record starts with a header line beginning with >, followed by the sequence itself on one or more lines.
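If you would like to check the contents of a FASTA file without reading it by eye, the format is simple enough to parse with a few lines of Python. The following is a minimal sketch, not part of the practical itself: the function name parse_fasta and the two toy records are made up for illustration, and the sequences are not real HIV data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy records for illustration only (not real HIV sequences)
example = """>seq_one toy record
ACGTAC
GTA
>seq_two toy record
ACGTTT
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Printing the name and length of each record is a quick way to confirm that you downloaded the number of sequences you expected.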

Hopefully you can see how you will be able to access any set of sequences you want. If you are using different search terms then each set of results can be downloaded separately and then the results files can simply be pasted together.
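Pasting files together in a text editor works well for a handful of downloads. If you end up with many result files, the same step can be scripted. This is a sketch under stated assumptions: the function merge_fasta and all file names are hypothetical, and the demonstration writes throwaway files with made-up sequences so that it runs on its own.

```python
import tempfile
from pathlib import Path


def merge_fasta(input_paths, output_path):
    """Concatenate FASTA files into one master file.

    Blank lines are dropped and every line is written with a newline,
    so a file that lacks a final newline cannot run into the next header.
    Returns the number of records (header lines) written.
    """
    count = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            for line in Path(path).read_text().splitlines():
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    count += 1
                out.write(line + "\n")
    return count


# Demonstration with throwaway files; names and sequences are made up.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "subtype_A.fasta"
    second = Path(tmp) / "subtype_B.fasta"
    first.write_text(">toy_A1\nACGTAC")  # note: no trailing newline
    second.write_text(">toy_B1\nACGTTT\n")
    merged = Path(tmp) / "master.fasta"
    written = merge_fasta([first, second], merged)
    print(written, "records written")
    print(merged.read_text())
```

In practice you would replace the temporary files with the renamed downloads in your working folder.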

As an exercise, set up a file containing two protease sequences each from HIV-1, HIV-2, SIV, and SHIV. Save this file as we will be using it later.

Part 2: Multiple Sequence Alignment (MSA)

The files containing FASTA-formatted sequences will now be used to obtain the best multiple sequence alignment. We could do lots of pairwise alignments (and the programme does, unseen), but each of those only tells us about two sequences. What if we want to understand the arrangement across the entire group, which in a research situation might include tens of thousands of sequences?

The software that we will use for MSA is MUSCLE. The calculations are carried out on a server at the European Bioinformatics Institute near Cambridge.

Do a google search for muscle EBI. Hopefully the right hit is at the top.

Follow this to the MUSCLE page at EBI.

Set-up is very straightforward. This software has been optimised for nucleotide sequences, in this case DNA (although the HIV genome is RNA, most of the sequencing will be done using cDNAs). Paste your sequence file into the sequences box. Another reason for using basic text editors is that they don't contain lots of hidden control characters, which can be the case with word processors. Finally, tick the box to be informed by email. For small examples this isn't necessary, but when you are using more or bigger sequences the job can take longer. The email will also give you a link to access your results. Click on Submit.

The results page should look like this

Here we see the 4 sequences lined up. Where a nucleotide is conserved across all the sequences, there is an asterisk *. This protein is well conserved but there are mutation sites, mainly in the Spanish and American sequences; the Ivoirian and Lebanese sequences show greater identity. Next, analyse the inferred protein sequences. Work along the first line, 3 nucleotides at a time, using a codon table to see if the mutations are silent or if there are amino acid changes.
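Working through the codons by hand is the point of the exercise, but you can check your answers with a short script. The sketch below builds the standard genetic code, reproduces the asterisk line for a set of aligned sequences, and labels each differing codon as silent or as an amino acid change. It assumes the two sequences are already aligned, contain no gaps, and start at the first position of a codon; an alignment that begins mid-codon would need trimming first. The function names and the two short sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of the codon table
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def conservation_line(seqs):
    """Return '*' for columns where all sequences agree, ' ' otherwise."""
    return "".join("*" if len(set(col)) == 1 else " " for col in zip(*seqs))


def classify_codon_changes(ref, other):
    """Compare two aligned, gap-free, in-frame coding sequences codon by codon.

    Returns a list of (codon_number, ref_codon, other_codon,
    ref_amino_acid, other_amino_acid, kind) for codons that differ.
    """
    changes = []
    usable = min(len(ref), len(other)) // 3 * 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        c1, c2 = ref[i:i + 3].upper(), other[i:i + 3].upper()
        if c1 == c2:
            continue
        aa1 = CODON_TABLE.get(c1, "X")  # "X" marks a codon not in the table
        aa2 = CODON_TABLE.get(c2, "X")
        if "X" in (aa1, aa2):
            kind = "unknown"
        elif aa1 == aa2:
            kind = "silent"
        else:
            kind = "amino acid change"
        changes.append((i // 3 + 1, c1, c2, aa1, aa2, kind))
    return changes


# Short made-up sequences for illustration only
reference = "GCTGAAATT"
variant = "GCCGAAGTT"

print(reference)
print(variant)
print(conservation_line([reference, variant]))
for change in classify_codon_changes(reference, variant):
    print(change)
```

Here the first codon differs at its third position but still codes for alanine, so the change is silent, while the third codon changes isoleucine to valine.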

Under the Results Summary tab you can download the various files produced. Click on Phylogenetic Tree and you will see the relationship between the sequences.

This gives no time data but it shows three branches, or clades. The top one contains both the CI and LB sequences, while the other two sequences are their own lineages.

Repeat this exercise using the HIV/SIV data from the previous section.

Determining Phylogenetic Trees using MegaX

Introduction

In this section we will find out how to create phylogenetic trees using the MegaX software. The trees produced here are more rigorous than those from Clustal or MUSCLE, and have the advantage of having a mutational time axis, as well as showing the similarity-based relationships between the sequences.

Part 1: Phylogenetic trees

Hopefully you have all downloaded the Mega software. It can be used to carry out alignments as well as create phylogenetic trees, but the MUSCLE interface is much easier to use for the former. Because of the way it calculates trees, it can be a bit particular about input. It likes all the sequences to be the same length.
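Because the software prefers sequences of equal length, it is worth checking your file before loading it. The following is a small sketch of such a check. It assumes you have the sequences available as a dictionary of name to sequence; the function names and the toy data are made up for illustration.

```python
def lengths_by_group(records):
    """Group sequence names by sequence length.

    records: dict mapping a sequence name to its sequence string.
    Returns a dict mapping each length to the list of names with that length.
    """
    groups = {}
    for name, seq in records.items():
        groups.setdefault(len(seq), []).append(name)
    return groups


def all_same_length(records):
    """True if every sequence has the same length (or there are none)."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy data for illustration only
toy = {
    "seq_one": "ACGTACGTA",
    "seq_two": "ACGTACGTT",
    "seq_three": "ACGTAC",
}

print(all_same_length(toy))
for length, names in sorted(lengths_by_group(toy).items()):
    print(length, names)
```

Any sequence that appears in a group of its own is a candidate for trimming or replacing before you build the tree.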

When you load MEGA, this is what you see.

It has a lot of different options but we will confine ourselves to phylogeny.

When you click the Phylogeny button you will be asked for the tree type. We will use the Maximum Likelihood (ML) model, which computes the tree based on the probability of change. The distance between each pair of sequences is calculated by counting the number of nucleotide differences between them.
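To get a feel for what counting nucleotide differences means, the sketch below counts mismatches for every pair in a small set of aligned sequences. This is only the simple difference count described above, not the full Maximum Likelihood calculation that the software performs. Skipping positions where either sequence has a gap is an assumption of this sketch, and the function names and toy sequences are made up for illustration.

```python
from itertools import combinations


def count_differences(a, b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y and x != "-" and y != "-")


def pairwise_differences(records):
    """Return {(name1, name2): number of differences} for every pair."""
    return {
        (n1, n2): count_differences(records[n1], records[n2])
        for n1, n2 in combinations(records, 2)
    }


# Toy aligned sequences for illustration only
toy = {
    "s1": "ACGTAC",
    "s2": "ACGTTC",
    "s3": "AC-TTA",
}

for pair, diff in pairwise_differences(toy).items():
    print(pair, diff)
```

With n sequences there are n(n-1)/2 pairs, which is why this step is left to software for anything beyond a few sequences.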

You will then be asked for your input file. This can be a collection of sequences in FASTA format. In this example I have pulled a set of protease sequences from different subtypes of HIV-1. The only criterion was that they covered that gene only, not the whole genome.

You then click through a number of windows, accepting the defaults. And that's it!

The final tree is given here. The horizontal axis represents the time from a branch point.

I chose these sequences at random but the results still throw up an interesting feature. Remember, the virus subtype is the first letter. At first sight the tree looks arranged by subtype, but look more closely at the A.TZ and the two B.JP sequences. B seems to be a subset of A, which it isn't; alternatively, the other A-class proteases might show recombination. This looks like an interesting example where this might not be apparent in the alignment of the full genome but individual genes show a different tree pattern.

Assessment Task

Collect a set of sequences (10-20) for a single gene (but not the protease) across a range of subtypes of HIV-1, and generate a phylogenetic tree. Sequences from each subtype will have to be downloaded separately and then pasted into a master file. Produce a multiple sequence alignment and a phylogenetic tree using Muscle and MegaX. Compare the two different types of trees and discuss the relationships between the genes. What does this tell us about the transmission and evolution of the virus?

Submission Deadline

Before 10am on: 19/04/2024

Marks and Feedback

20 working days after deadline (L4, 5 and 7)

15 working days after deadline (L6)

10 working days after deadline (block delivery)

Unit title & code

BHS043-6 Computational and Systems Biology

Assignment number and title

Assessment 1 (resit) Practical Portfolio

Assignment type

CW-PO

Weighting of assignment

100%

Size or length of assessment

3200 words

Unit learning outcomes

Demonstrate a depth of knowledge in the application of computational techniques to the modelling of structural and functional aspects of biological systems. Such an understanding will cover both Molecular and Systems levels of modelling.

Show a critical awareness of the limitations of the techniques such as sequence alignment, homology modelling and systems modelling and apply these methods to model systems.

What am I required to do in this assignment?

Part 1

You will produce laboratory reports for the two practicals carried out in Semester 1.

In the first you will carry out a model sequence alignment, which demonstrates the application of the dynamic programming technique. You will then use the NCBI server to align two related sequences. Finally, you will map the sequence alignment data onto the three-dimensional structure of one of the proteins.

In the second session you will use a number of different sequence analysis techniques (workshopped in the practical session) to produce a phylogenetic tree for a selection of HIV genes. Conclusions about the relationships between the systems will be drawn from these trees.

The reports should detail your findings across both sessions. Each report should be no more than 1000 words. Excessive word counts will negatively affect grading. The lab report should be structured like a research article in a scientific journal (see details below).

Part 2

This will be a timed essay carried out under exam conditions (1200 words maximum). The essay will answer a question in the subject areas of Proteomics and Transcriptomics. Students can bring one A4 sheet of notes into the session. This will be submitted with the written work.

What do I need to do to pass? (Threshold Expectations from UIF)

Provide a critical analysis of the development of a topic within Computational Biology.

Demonstrate the ability to carry out modelling tasks including sequence alignment and the use of molecular graphics.

Produce a written report in the form of a scientific paper that discusses the content of the practical sessions and shows evidence of analysis and discussion.

How do I produce high quality work that merits a good grade?

Outline Of Report Structure

Part 1

Your lab report should contain five sections: Introduction, Method, Results, Discussion and References. Each section should be clearly labelled. Clarity of English language and presentation is essential throughout.

Introduction: This section should typically represent 20-30% of your report. It should summarise the published background literature relevant to this study. It must explain what your experimental study is about, and place it in context of the previously published literature. It should state the scientific aim of the study.

Method: This section should typically represent 10% of your report. It should briefly summarise how the practical work was carried out. It should be written in the past tense and in paragraphs. It should contain sufficient detail to allow someone else to reproduce your experiment, but avoid unnecessary detail.

Results and discussion: This section should typically represent the remaining 60-70% of your report. Data may be presented in tables, graphs, diagrams, or photographs as appropriate for your particular experimental study. Figures and tables should be separately numbered and clearly labelled. You should include written text to explain what your findings are and what is shown in the figures and tables. Results should describe your findings/observations, but not interpret their meaning. In the discussion you should interpret your results, explaining what they indicate. You should evaluate the quality of your data. You should identify any problems with the technique or data (if any exist) and suggest possible solutions. You should compare your findings to previously published findings or your expected findings, and should place your results in the context of published scientific literature. Your discussion should also include a reflection on your performance within the practical. What insights did you gain from the practical sessions? What problems did you have to overcome?

References: You should include at least three peer-reviewed scientific journal articles or textbooks as sources. These should be listed in correct UoB/Harvard format in a single reference list. (see https://lrweb.beds.ac.uk/__data/assets/pdf_file/0009/557568/UoBHarvard17_18.pdf)

The reference list should only contain sources that have been cited appropriately, e.g. (Smith, 2010), in the main text of your report.

Part 2

You are required to write an essay on one selected topic in Computational Biology. In the resit the essay question will be in the area of Proteomics and Transcriptomics. The question will be released on the day.

How does this assignment relate to what we are doing in scheduled sessions?

The assignment tasks are based upon the direct application of techniques that have been discussed in lectures. In addition, you will be expected to gain insight into the biological problem.

How will my assignment be marked?

Your assignment will be marked according to the threshold expectations and the criteria on the following page.

You can use them to evaluate your own work and consider your grade before you submit.

Pass 40-49%

Pass 50-59%

Commendation 60-69%

Distinction 70%+

Quality of understanding and analysis of scientific principles and knowledge base. (30%)

Pass 40-49%: Understanding of scientific principles at a basic threshold level. Some evidence of a literature review.

Pass 50-59%: Acceptable level of understanding of relevant scientific principles and knowledge base. Adequate review of relevant literature, though some omissions or tangents. A reasonable attempt to relate study to broader context and explain aim and approach.

Commendation 60-69%: A good understanding of the scientific principles and knowledge base. Literature review should be more critical. Context requires a more detailed approach.

Distinction 70%+: A comprehensive understanding of the scientific principles and knowledge base. Detailed and focused review of previously published literature. Broader context of study clearly described. Experimental aim and approach well defined.

Data handling and presentation. (15%)

Pass 40-49%: Data analysis is present but incomplete. Presentation could be improved. Explanations of data superficial.

Pass 50-59%: Data analysis is mostly correct with few errors or omissions. Presentation is generally clear and appropriate. Some attempt is given to explain what is being presented.

Commendation 60-69%: Data analysis is accurate, but not complete. Presentation is clear and appropriate. Clear explanation of what is presented is given. Good understanding of data analysis shown.

Distinction 70%+: Data analysis is accurate, thorough and complete. Presentation is clear and appropriate. Clear explanation of what is presented is given. Excellent understanding of data analysis shown.

Critical evaluation and discussion. (25%)

Pass 40-49%: Evidence of reflection and evaluation of scientific problem and approach. More critical evaluation of cited literature is required. Demonstrates some ability to make evaluative links between the current scientific thought and the work in hand, but the evaluation is often rather superficial.

Pass 50-59%: Satisfactory evidence of reflection and evaluation of scientific problem and approach. Some critical evaluation of cited literature, though at times a little shallow. Demonstrates some ability to make evaluative links between the current scientific thought and the work in hand, but the evaluation is sometimes rather superficial.

Commendation 60-69%: Demonstrates some ability to evaluate scientific problems and to make clear evaluative links between the current scientific thought and the work in hand. Shows good critical evaluation of cited literature.

Distinction 70%+: Demonstrates a well-developed ability to evaluate scientific problems and to make clear evaluative links between the current scientific thought and the work in hand, which are capable of contributing to the advance of scientific knowledge. Shows excellent, deep critical evaluation of cited literature.

Written expression and structure. (20%)

Pass 40-49%: Written expression is not always clear and arguments can sometimes be confused. The structure of the work is satisfactory but planning could have been more thorough in parts.

Pass 50-59%: Written expression is clear and arguments can be followed without undue difficulty. The structure of the work is satisfactory but planning could have been more thorough in parts.

Commendation 60-69%: Written expression is clear and concise. Arguments are put forward succinctly but the structure of the piece needs further planning to enhance its readability.

Distinction 70%+: Written expression is clear and concise. Arguments are put forward succinctly and the structure of the piece is well planned, well-thought out and logical, enhancing its readability.

Use of literature and referencing. (10%)

Pass 40-49%: A limited range of literature cited, with considerable reliance on secondary sources. Incorrect use of UoB Harvard referencing format or lack of appropriate citations within text of report.

Pass 50-59%: A range of primary sources is accessed. Possible errors in the use of the UoB Harvard formatting of citations and reference list.

Commendation 60-69%: A broad range of primary sources is accessed. Possible errors in the UoB Harvard referencing format.

Distinction 70%+: A broad range of primary sources is accessed. Correct Journal of Cell Science formatting of citations and reference list used throughout.
