Data management

The analyses behind my publications were carried out and summarised in R, Python and Bash. Data for these analyses were drawn from clinical trials databases (PostgreSQL), an in-house data server (MySQL) and public databases. The sequence analysis pipelines I have used were built and called through Python. I have updated my previous analyses to incorporate newer techniques, including approaches used by TCGA and by DREAM project leaders. I have also been translating previous R-based pipelines and scripts to Python to make it easier to scale machine learning and deep learning approaches.

My LIMS experience includes using Progeny LIMS for sample processing and biobanking. I have also used LIMS together with liquid-handling automation, including Hamilton (NGS STAR) and Qiagen (QIAgility), in conjunction with high-throughput assays.

I am experienced with dynamic documents, version control, containerization and workflow engines, and with working in Jupyter Notebook environments. In my last position I worked on three informatics platforms: a desktop with 64 GB of RAM running RStudio and Anaconda; a local Ubuntu Linux server, which I maintained for testing scripts and pipelines and for hosting Shiny apps; and the ICR Scientific Computing Infrastructure, which included a 2,000-core High Performance Computing system with 2 PB of scratch space, 6.5 PB of storage and a 4x NVIDIA Tesla P100 GPU array. I am experienced at using the command line and running scripts and apps in Linux, and I am familiar with translating these workflows and pipelines to container and hosted environments such as Docker (on Windows via PowerShell and on Linux), Google Colaboratory, Azure and Kaggle.

I have extensive experience in analysing public genomic, transcriptomic and proteomic datasets, including:

- NCBI GEO Datasets and EMBL-EBI ArrayExpress, through which I have regularly accessed Affymetrix U133+2 RNA array data (.CEL files) and SNP data for eQTL analysis, prognostic signature work, and hypothesis-generating boxplots, subset distribution plots and prognostic curves (typically split by quantile or tertile). This work reinforced molecular biology experiments within the lab by comparing paired preclinical expression profiles (knockdown/upregulated or drug-treated pairs) using the limma and DESeq2 packages.

- NCBI GEO Datasets, where I also accessed Cancer Cell Line Encyclopedia (CCLE) data, which I used to examine characteristics of drug sensitivity (by IC50) and to show patterns similar to those we observed in primary experiments.

- European Genome-phenome Archive (EGA) for methylation data from the BLUEPRINT consortium, which I used for replication and annotation.

- WashU Epigenome Browser for Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) data, and the ENCODE Roadmap browser, again for annotation, e.g. of lincRNA.

- MMRF CoMMpass study (International Myeloma consortium), which is also part of TCGA and can be accessed through the GDC Data Portal. To evaluate the effect of neutral cancer status on overall and progression-free survival, I utilised whole-exome data, correcting variant allele frequencies (VAF) with copy number generated from low-pass whole-genome sequencing. In a recent DNA methylation biomarker project, I utilised paired-end RNA sequencing data (FPKM) in conjunction with FISH-seq data to carry out quartile-based Cox proportional hazards analysis, adjusted for age, sex and subgroup, to replicate DNA methylation findings.
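
As an illustration of the VAF correction mentioned in the last item, the sketch below shows one commonly used way of adjusting a variant allele frequency for local copy number and tumour purity to estimate the fraction of cancer cells carrying a variant. It is a minimal sketch under stated assumptions, not necessarily the method used in the work described: the function name, the parameter names and the default values (a diploid normal genome, one mutated copy per cell) are all hypothetical.

```python
def cancer_cell_fraction(vaf, purity, tumour_cn, normal_cn=2, multiplicity=1):
    """Estimate the fraction of cancer cells carrying a variant.

    Assumed formulation (not taken from the original analysis):
        CCF = VAF * (purity * tumour_cn + (1 - purity) * normal_cn)
              / (purity * multiplicity)

    vaf          -- observed variant allele frequency (0 to 1)
    purity       -- fraction of sampled cells that are tumour cells (0 to 1]
    tumour_cn    -- total copy number at the locus in tumour cells
    normal_cn    -- total copy number in normal cells (assumed diploid)
    multiplicity -- number of mutated copies per tumour cell (assumed 1)
    """
    if not 0 <= vaf <= 1:
        raise ValueError("vaf must be between 0 and 1")
    if not 0 < purity <= 1:
        raise ValueError("purity must be in the interval (0, 1]")
    if tumour_cn <= 0 or multiplicity < 1:
        raise ValueError("tumour_cn must be positive and multiplicity at least 1")

    # Average number of copies of the locus per sampled cell,
    # weighting tumour and normal cells by their proportions.
    average_copies = purity * tumour_cn + (1 - purity) * normal_cn

    # Rescale the observed VAF by the copies contributed by mutated tumour alleles.
    return vaf * average_copies / (purity * multiplicity)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(cancer_cell_fraction(vaf=0.25, purity=0.5, tumour_cn=2))
```

Under these assumptions, a heterozygous clonal variant in a pure diploid sample (VAF 0.5) gives an estimate of 1.0, and the same variant at a locus amplified to four copies would be expected at a VAF of 0.25. Estimates above 1 can arise from sampling noise or a wrong multiplicity assumption and would need handling in a real pipeline.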

Additional data analysis skill sets include:

◦ Interpretation and presentation of multi-centre phase III clinical trial data including adaptive and maintenance arms.

◦ Meta-analysis of trial-based clinical endpoint data and combining data from multi-country real-world datasets or cancer registries.

◦ Integration of translational data with clinical endpoints.

◦ Statistical analyses in SPSS, R and Python.

◦ Generation of publication-ready clinical trial analyses for JCO, Nature Communications and Leukemia.

◦ Application of relevant statistical techniques, including sample size/power calculation, Cox regression, multivariable analysis, ROC modelling, landmark analysis, C-statistics and Bayesian methods.

◦ Collection of clinical trial data, data cleaning, HTA.

◦ Development of trial protocols and REC approvals.

◦ HTML and PowerShell were utilised to create this Portfolio.

◦ Strong Microsoft Office skills (Word, Excel, Access, PowerPoint and Visio).

◦ Image editing tools and processes utilising GIMP2 and Photoshop.

◦ Project tracking and management with MS Project, Podio and Freedcamp.

◦ Registered Azure and AWS user.

◦ Collaborative working through Slack, Asana and Git/Jupyter notebooks.

◦ Keen Twitter follower of health care luminaries and advocates.

◦ Utilise bioRxiv to follow emerging research and technologies.

◦ Routinely use message board resources such as: https://www.rna-seqblog.com, https://www.reddit.com/r/bioinformatics/, https://www.feedspot.com/infiniterss.php?q=https%3A%2F%2Fwww.nature.com%2Fsubjects%2Fcomputational-biology-and-bioinformatics.rss