COMP0235 Engineering Data Analysis

Coursework task
In this coursework you are required to build a distributed pipeline across your cloud machines that runs the 4 steps in `pipeline_script.py`. It should accomplish this in a distributed fashion across your mini-cluster of 6 machines (one host and 5 clients). You are free to accomplish this as you see fit. Your solution should include the following features:
1. Should use an appropriate configuration system. We have covered Ansible and Salt but others are available.
2. Make use of an appropriate datastore for the complete human proteome contained in the file `uniprotkb_proteome_UP000005640_2023_10_04.fasta.gz`. This should be able to return appropriate records for a list of arbitrary protein IDs
3. Make use of appropriate monitoring and logging of your mini-cluster and your data analysis pipeline
4. Should collate the results calculated on the client machines and make them available to the researchers.
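As a sketch of requirement 2, the standard-library `sqlite3` module is one workable datastore. The UniProt-style header parsing below (taking the accession as the second pipe-delimited field of `>sp|P12345|NAME_HUMAN ...`) and the table layout are assumptions for illustration, not part of the coursework spec:

```python
import sqlite3

def parse_fasta(lines):
    """Yield (accession, sequence) pairs from UniProt-style FASTA lines.

    Assumes headers of the form '>sp|P12345|NAME_HUMAN ...', where the
    accession is the second pipe-delimited field."""
    acc, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(seq)
            acc, seq = line[1:].split("|")[1], []
        elif line:
            seq.append(line)
    if acc is not None:
        yield acc, "".join(seq)

def build_db(conn, records):
    """Load (accession, sequence) records into a simple key/value table."""
    conn.execute("CREATE TABLE IF NOT EXISTS proteome (id TEXT PRIMARY KEY, seq TEXT)")
    conn.executemany("INSERT OR REPLACE INTO proteome VALUES (?, ?)", records)
    conn.commit()

def fetch(conn, ids):
    """Return {protein_id: sequence} for an arbitrary list of IDs."""
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(f"SELECT id, seq FROM proteome WHERE id IN ({placeholders})", ids)
    return dict(rows)
```

For the real proteome file, the lines would come from `gzip.open(..., "rt")` rather than a list, and the database would live on disk so clients can query it repeatedly.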
You need to collate the following information from step 4 for the researchers, preserving these file formats:
1. A CSV file that contains a list of proteins and the identity of the best hit calculated by HHsearch. You can find an example of such a file in `coursework_example_output/example_hits_output.csv`
2. A file containing the mean standard deviation and mean geometric mean across all 6,000 HHsearch runs you calculate (i.e. capture the STD and Gmean values for each pipeline run and take the average across the 6,000 runs). You can find an example of such a file in `coursework_example_output/example_profile_output.csv`
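The second output reduces to averaging the per-run STD and Gmean values once the clients' results are gathered. A minimal sketch, assuming each run contributes one (std, gmean) pair; the column names are illustrative and should follow the example file:

```python
import csv
import statistics

def collate_profiles(pairs):
    """Given one (std, gmean) pair per pipeline run, return the
    mean STD and the mean Gmean across all runs."""
    stds, gmeans = zip(*((float(s), float(g)) for s, g in pairs))
    return statistics.mean(stds), statistics.mean(gmeans)

def write_profile_csv(path, mean_std, mean_gmean):
    """Write the two averages as a small CSV: one header row, one data
    row (the real column names should match the example output file)."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["mean_std", "mean_gmean"])
        writer.writerow([mean_std, mean_gmean])
```

In the full pipeline, the pairs would be parsed out of each client's per-run result files before being fed to `collate_profiles`.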
Challenges/hints
1. On your 6 AWS machines we estimate it should take about one to two days to run all the calculations.
2. The host instance is too small to run the calculations
3. EC2 client instances have 4 CPUs
4. You need to understand how to install and run the s4pred and hhsearch programs
5. You need to understand how to fetch the required datasets for s4pred and hhsearch
6. You will need to understand the FASTA data format
7. You should ensure you can successfully run `pipeline_script.py`, either on your own machine or on one of the cloud machines you have access to. In the directory `pipeline_example` you can find an example input sequence, `test.fa`. If you run the script successfully you should produce a number of intermediary files (examples of these can be found in the directory) and a final output file, `hhr_parse.out`. The file you produce should be equivalent (though some figures may have minor differences)
8. At runtime, the Load Average for a client machine should not exceed 3
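Hint 8 can be honoured on the client side by checking the 1-minute load average before dispatching each pipeline run; `os.getloadavg()` is the relevant standard-library call on Linux. The threshold default matches the hint, but the poll interval is an arbitrary choice:

```python
import os
import time

def wait_for_load(threshold=3.0, interval=30.0):
    """Block until the 1-minute load average drops below `threshold`.

    Called before launching each pipeline run, this keeps a 4-CPU
    client from sitting above the load-average ceiling for long."""
    while os.getloadavg()[0] >= threshold:
        time.sleep(interval)
```

A dispatcher loop would call `wait_for_load()` before starting each new `pipeline_script.py` subprocess on a client.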
