PH700A Principles of Programming for Public Health Project Report

Subject Code: PH700A

For this project, you will search for associations between diabetes and a variety of biochemistry metrics and mock genetic mutations. Submit your scripts for this project through our GitHub Classroom using this invite link: https://classroom.github.com/a/78VAxKeQ.

The diabetes and biochemistry lab metrics were collected by the National Health and Nutrition Examination Survey (NHANES). The CSV files and codebooks are provided on Canvas here. More information on this dataset can be found at www.cdc.gov/nchs/nhanes/.

The genetic mutation data is also provided on Canvas here as a zipped folder containing VCF (variant call format) files. These mutations are mock data, but VCF is a common plain-text format for storing genetic data. Part of the challenge in this project will be figuring out how to parse these files to retrieve the data you need, using what you’ve learned about working with strings and files.

Part 1

Write a Python script named finalpart1.py that will read the Diabetes survey results table (DIQ_J.csv) and the Standard Biochemistry Profile table (BIOPRO_J.csv). Each column beyond the first in BIOPRO_J.csv is a metric to test. For each available metric, your script will perform a two-tailed Student’s t-test of whether subjects with diabetes have significantly different values than subjects without diabetes, and will write the results of each t-test to a file. For each metric, your script will also create a boxplot comparing the diabetes and no-diabetes groups.
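As a minimal sketch of the test for a single metric, the snippet below assumes SciPy is available and uses made-up values in place of real NHANES measurements. `scipy.stats.ttest_ind` is two-tailed by default, and `equal_var=True` selects the pooled-variance (Student's) form rather than Welch's.

```python
from scipy import stats

# Made-up metric values for illustration only (not NHANES data).
diabetes_values = [5.1, 6.3, 5.8, 7.0, 6.1]
no_diabetes_values = [4.2, 4.9, 5.0, 4.4, 4.7]

# equal_var=True requests Student's t-test (pooled variance).
# The returned p-value is two-tailed by default.
t_stat, p_value = stats.ttest_ind(
    diabetes_values, no_diabetes_values, equal_var=True
)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

In the full script, the same call would sit inside a loop over the metric columns, with the two lists rebuilt for each metric.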

Write all of the t-test results to a single file, and save each boxplot to its own file. Have your script create a directory (you can use the os module for this) and write these files inside that directory.
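One possible way to organize the output is sketched below using only the standard library. The directory name, file names, metric names, and result values are all hypothetical, and the sketch works inside a temporary folder so it can be run anywhere without leaving files behind in your project.

```python
import csv
import os
import tempfile

# Hypothetical results: (metric name, t statistic, p-value). Values are made up.
results = [("METRIC_A", 4.11, 0.0034), ("METRIC_B", -0.52, 0.6170)]

# A temporary parent folder keeps this sketch out of the working directory.
parent = tempfile.mkdtemp()
out_dir = os.path.join(parent, "part1_output")  # hypothetical directory name
os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists

# One file holds every t-test result.
results_path = os.path.join(out_dir, "ttest_results.csv")
with open(results_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Metric", "t_statistic", "p_value"])
    writer.writerows(results)

# Each boxplot gets its own file path inside the same directory.
plot_paths = [
    os.path.join(out_dir, f"{metric}_boxplot.png") for metric, _, _ in results
]

print(os.listdir(out_dir))
print(plot_paths)
```

The paths in `plot_paths` are only built here, not written; the plotting code would save one figure to each path.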

The first column in both tables is the subject ID number. Use DIQ_J.csv to determine which subjects have diabetes. Exclude subjects that are not in both tables. Assume subjects who answered 1 (Yes) in column DIQ010 (Doctor told you have diabetes) have diabetes and subjects who answered 2 (No) do not have diabetes. Exclude subjects who do not have a 1 or 2 in DIQ010. For each metric in BIOPRO_J.csv, exclude subjects who have missing data for that column.
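The exclusion rules above can be sketched with the csv module on two tiny in-memory tables. The ID header, the metric name METRIC_A, and every value are made up, and the sketch assumes answers are stored as the text "1" and "2" and that missing data appears as an empty cell; check the real files and codebooks, since they may encode these differently.

```python
import csv
import io

# Tiny made-up stand-ins for DIQ_J.csv and BIOPRO_J.csv (first column = subject ID).
diq_text = "ID,DIQ010\n1,1\n2,2\n3,9\n4,2\n5,1\n"
biopro_text = "ID,METRIC_A\n1,5.1\n2,4.2\n3,6.0\n4,\n6,4.8\n"

# Map subject ID -> True (diabetes) or False (no diabetes).
# Subjects whose answer is neither 1 nor 2 are left out entirely.
status = {}
for row in csv.DictReader(io.StringIO(diq_text)):
    answer = row["DIQ010"].strip()
    if answer in ("1", "2"):
        status[row["ID"]] = (answer == "1")

# Collect the metric values per group, skipping subjects that are
# missing from the diabetes table or have no value for this metric.
groups = {True: [], False: []}
for row in csv.DictReader(io.StringIO(biopro_text)):
    subject = row["ID"]
    value = row["METRIC_A"].strip()
    if subject not in status or value == "":
        continue
    groups[status[subject]].append(float(value))

print(groups)
```

Here subject 3 is dropped for answering 9, subject 4 for a missing value, and subjects 5 and 6 for appearing in only one table.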

Part 2

Write a Python script named finalpart2.py that will read the Diabetes survey results table (DIQ_J.csv) and the VCF files in the folder subject_vcfs (you can unzip the folder yourself first). You can use the os module to get the list of all files in a folder, then use a for loop to open each file in that list and extract the desired data. For each VCF file, your script will use the file’s name to determine the subject ID, and then use DIQ_J.csv to determine whether that subject has diabetes (a 1 in column DIQ010 means yes and a 2 means no). The files have a consistent naming scheme that includes the subject ID, so you can use the split function on the file names to extract it.

Your script will parse each VCF file to determine which genes the subject has a mutation in. These files contain a lot of information, but the only information you need for this project is the gene name of each variant. In class we will go over where this information is located in the VCF file. Each line is a mutation, and the name of the gene containing the mutation is nested in the INFO column. An example of a mutation in a VCF file is provided below, with the gene name in bold.

17 753174 . C A . PASS ADP=164;WT=0;HET=0;HOM=1;NC=0;CSQ=A|missense_variant|MODERATE|vapC6|Rv0656c|Transcript|CCP43399|protein_coding|1/1||||194|194|65|W/L|tGg/tTg|||-1||1|SNV|ENA_GENE|
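The file-handling steps described above might look like the following sketch. Several things here are assumptions rather than facts from the assignment: the file naming scheme `subject_<ID>.vcf` is hypothetical, the columns are assumed to be tab-separated, and the gene name is assumed to be the fourth pipe-delimited field of the CSQ entry (vapC6 in the example line). Confirm the real naming scheme and field position against the provided files and what is covered in class.

```python
import os
import tempfile


def gene_from_line(line):
    """Return the gene name from one VCF data line, or None if there is none."""
    if line.startswith("#"):  # header and metadata lines carry no variant
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        return None
    info = columns[7]  # INFO is the eighth column
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            fields = entry[len("CSQ="):].split("|")
            # Assumption: the gene name is the fourth pipe-delimited field.
            if len(fields) > 3 and fields[3]:
                return fields[3]
    return None


# The example line from the text, rebuilt with tab separators.
example = "\t".join([
    "17", "753174", ".", "C", "A", ".", "PASS",
    "ADP=164;WT=0;HET=0;HOM=1;NC=0;CSQ=A|missense_variant|MODERATE|vapC6|"
    "Rv0656c|Transcript|CCP43399|protein_coding|1/1||||194|194|65|W/L|"
    "tGg/tTg|||-1||1|SNV|ENA_GENE|",
])

# Build a throwaway folder of mock files so the sketch runs anywhere.
folder = tempfile.mkdtemp()
for name in ("subject_101.vcf", "subject_102.vcf", "notes.txt"):
    with open(os.path.join(folder, name), "w") as handle:
        handle.write("##fileformat=VCFv4.2\n" + example + "\n")

genes_by_subject = {}
for file_name in sorted(os.listdir(folder)):
    if not file_name.endswith(".vcf"):
        continue
    # "subject_101.vcf" -> "subject_101" -> "101" (hypothetical scheme)
    subject_id = file_name.split(".")[0].split("_")[1]
    with open(os.path.join(folder, file_name)) as handle:
        genes = {gene_from_line(line) for line in handle}
    genes.discard(None)  # drop header lines and lines without a gene
    genes_by_subject[subject_id] = genes

print(genes_by_subject)
```

Storing the genes for each subject as a set means a subject with several mutations in the same gene is recorded only once for that gene.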

As your script determines which genes have mutations in each subject, have it update a data structure storing counts for each gene encountered. These counts are the number of patients with and without diabetes who do and do not have mutations in the gene (everything you need for a contingency table). Once your script has read all the VCF files and updated all of the counts, have it calculate an odds ratio and a Fisher’s exact test for each gene’s association with diabetes. Write these counts and results to a CSV file, one row per gene, with the columns Gene, Mutant_Diabetes, Wildtype_Diabetes, Mutant_NoDiabetes, Wildtype_NoDiabetes, Odds_Ratio, Fishers.
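The counting and testing steps could be sketched as below. The subjects and gene names are made up. To stay within the standard library, the two-sided Fisher's exact p-value is computed directly from the hypergeometric distribution; `scipy.stats.fisher_exact` is a library alternative. Returning infinity or NaN for the odds ratio when a cell is zero is one possible convention, not a requirement of the assignment.

```python
import csv
import io
import math


def odds_ratio(md, wd, mnd, wnd):
    """(mutant/wildtype odds with diabetes) / (same odds without diabetes)."""
    if wd * mnd == 0:
        return float("inf") if md * wnd > 0 else float("nan")
    return (md * wnd) / (wd * mnd)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= observed * (1 + 1e-7))


# Made-up input: subject ID -> (has diabetes, set of mutated genes).
subjects = {
    "101": (True, {"geneA", "geneB"}), "102": (True, {"geneA"}),
    "103": (True, {"geneA"}), "104": (False, {"geneB"}),
    "105": (False, set()), "106": (False, {"geneB"}),
}

mutant = {}      # gene -> [mutant subjects with diabetes, without diabetes]
totals = [0, 0]  # [subjects with diabetes, subjects without diabetes]
for has_diabetes, genes in subjects.values():
    idx = 0 if has_diabetes else 1
    totals[idx] += 1
    for gene in genes:
        mutant.setdefault(gene, [0, 0])[idx] += 1

rows = []
for gene in sorted(mutant):
    md, mnd = mutant[gene]
    wd, wnd = totals[0] - md, totals[1] - mnd  # wildtype = group total - mutant
    rows.append((gene, md, wd, mnd, wnd,
                 odds_ratio(md, wd, mnd, wnd),
                 fisher_two_sided(md, wd, mnd, wnd)))

# Written to an in-memory buffer here; the script would open a real CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Gene", "Mutant_Diabetes", "Wildtype_Diabetes",
                 "Mutant_NoDiabetes", "Wildtype_NoDiabetes",
                 "Odds_Ratio", "Fishers"])
writer.writerows(rows)
print(buffer.getvalue())
```

Only mutant counts are updated while reading files; the wildtype counts are derived afterwards from the group totals, which avoids having to revisit every gene for every subject.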

Uploaded By : Katthy Wills
Posted on : December 14th, 2022