A decade leading public health programmes — disease surveillance, outbreak response, health systems research, and USAID-funded data capacity building. Now applying formal data science training to the problems I've spent my career working on.
Managed a USAID-funded initiative to strengthen COVID-19 data management capacity. Coordinated cross-functional teams spanning IT, data science, public health, and disaster risk reduction across seven local government councils.
Co-authored an evaluation of Patient Advice and Liaison Services across ten NHS Trusts, assessing the usability, accessibility, and clarity of patient-facing information on each Trust's website. Conducted qualitative coding of patient and public feedback and contributed written sections that were included verbatim in the final published report.
Contributed to a national-scale patient safety study. Built Excel-based automation tools that generated summaries for 600 patient safety indicators, reducing manual reporting burden. Co-authored a peer-reviewed journal article on the findings.
Oversaw data collection from paper-based patient records at one of the five tertiary hospitals and led statistical analysis across 684 patient records using logistic and multivariate regression in Stata. Findings were submitted to the Department of Health as a co-authored research report and presented at the 2nd National Antimicrobial Resistance Summit.
Led all preventive health programmes, disease surveillance, and outbreak response for an underserved rural municipality. Coordinated community health workers, nurses, and midwives across ward-level service delivery.
Managed full-cycle implementation of a two-year national postgraduate programme for approximately 100 rural physicians. Coordinated government officials, faculty, and administrative processes. Identified and resolved a six-month contract payment backlog that had stalled programme operations.
In early 2021, the Philippines began COVID-19 vaccination with a limited supply and 1,480 municipalities to cover. The Department of Health needed a transparent, data-driven method to decide which communities should receive vaccines first. Five government datasets existed but had never been combined into a single analytical framework for this purpose. This project built that framework.
What I built
I collected, cleaned, and merged all five datasets at the municipality level, resolving inconsistent naming conventions across agencies and handling 94 municipalities with no recorded health facility bed data. I then ran exploratory analysis, tested for spatial autocorrelation using Moran's I and LISA, and applied a hierarchical k-means clustering approach to group municipalities into four priority tiers. The analysis is deployed as a working Streamlit application where decision-makers can view choropleth maps of any risk variable, explore spatial autocorrelation hot spots and outliers, view the four-cluster priority map at national or provincial level, and update individual municipality data to rerun the model.
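To make the spatial autocorrelation step concrete, here is a minimal sketch of global Moran's I in plain Python. It is an illustration of the statistic only, not the project's code: the four-unit neighbour matrix and the values are made up, and a real analysis would normally use a spatial statistics library with weights derived from municipality boundaries.

```python
def morans_i(values, weights):
    """Global Moran's I for n values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # S0: the sum of all spatial weights
    s0 = sum(sum(row) for row in weights)
    # Cross-products of deviations, counted only between neighbouring units
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Hypothetical layout: four areas in a row, each adjacent to the next (0-1-2-3)
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [10.0, 9.0, 2.0, 1.0]     # similar values sit next to each other
alternating = [10.0, 1.0, 10.0, 1.0]  # neighbours are as dissimilar as possible

print(morans_i(clustered, weights))    # positive: spatial clustering
print(morans_i(alternating, weights))  # negative: checkerboard-like pattern
```

A positive value indicates that neighbouring areas tend to have similar values and a negative value indicates that they tend to differ. LISA decomposes the same idea into one local statistic per area.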
What it found
27 municipalities had no health facility beds at all, including several with active COVID-19 cases at the time of analysis. These represent the sharpest equity gaps in the dataset. The highest-priority cluster captured Metro Manila and major cities: highest case burden, highest density, relatively better infrastructure but overwhelming absolute need. The lowest-priority cluster by immediate risk was also the lowest by healthcare access, which is the equity argument for parallel allocation even during early rollout. High case numbers and high poverty do not always coincide. The analysis separates these two dimensions rather than collapsing them into a single score.
Limitations: Data is cross-sectional as of March 2021. The model does not account for cold chain capacity, local government implementation capacity, or vaccine hesitancy. This is not an official DOH tool and was not used in actual policy decisions.
NHS A&E departments face persistent pressure on waiting times, admission rates, and resource allocation. Understanding which patients stay longest, when demand peaks, and what drives length of stay is foundational to operational planning. This project analyses four years of A&E admission data from a single NHS provider to identify the patient, clinical, and temporal factors most associated with extended A&E time and high investigation burden.
The dataset
The data source is the NHS England A&E Synthetic Dataset, a realistic but privacy-safe dataset generated as part of an NHS England pilot to enable data sharing without exposing real patient records. The full dataset is approximately 4.29GB and contains over 65 million rows covering A&E attendances across 200+ NHS providers from 2014 to 2018. At that size, the dataset could not be loaded into memory using standard pandas on a consumer laptop. I used Dask to process the full file in chunks, filter to admitted patients, identify the provider with the highest admission volume, and extract its 326,444 records for analysis.
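The out-of-memory workflow described above can be sketched with the standard library alone. This is not the Dask code used in the project. It is a plain-Python, two-pass streaming version of the same logic (filter to admitted patients, find the busiest provider, extract its rows), and the column names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def busiest_provider(rows, provider_col="provider", admitted_col="admitted"):
    """First pass: count admitted attendances per provider, one row at a time."""
    counts = Counter()
    for row in rows:
        if row[admitted_col] == "1":
            counts[row[provider_col]] += 1
    return counts.most_common(1)[0][0]


def extract_provider(rows, provider, provider_col="provider", admitted_col="admitted"):
    """Second pass: keep only admitted rows for the chosen provider."""
    return [row for row in rows
            if row[admitted_col] == "1" and row[provider_col] == provider]


# Made-up stand-in for the real file; in practice each pass would open the CSV on disk
SAMPLE = """provider,admitted,minutes_in_ae
A,1,180
A,0,60
B,1,240
B,1,300
C,1,120
"""

top = busiest_provider(csv.DictReader(io.StringIO(SAMPLE)))
records = extract_provider(csv.DictReader(io.StringIO(SAMPLE)), top)
print(top, len(records))
```

Because each pass reads one row at a time, memory use depends on the number of providers and the size of the extracted subset rather than on the size of the full file. Dask applies the same principle across partitions and can run them in parallel.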
What it found
The strongest predictor of extended A&E time is investigation burden. Patients undergoing eight or more investigations have a median stay exceeding 240 minutes. But the more revealing finding is about acuity. Patients with high healthcare resource needs spend less time in A&E than those with low resource needs, likely because high-acuity cases move through faster clinical pathways. The combination driving the longest stays is low healthcare resource need, older age, ambulance arrival, and night admission. Operational pressure in this dataset is not driven primarily by the most complex patients. It is driven by a volume of lower-acuity older patients who require extensive workup before a clinical decision can be made. That distinction matters for how A&E capacity is planned and staffed.
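A comparison of this kind comes down to grouping attendances by investigation count and taking the median time in A&E for each group. The sketch below shows that calculation under stated assumptions. The records are invented for illustration and are not results from the dataset, and apart from the "8+" group the band boundaries are arbitrary.

```python
from collections import defaultdict
from statistics import median


def band(n_investigations):
    """Assign an investigation count to a band (only '8+' comes from the text)."""
    if n_investigations >= 8:
        return "8+"
    if n_investigations >= 4:
        return "4-7"
    return "0-3"


def median_stay_by_band(records):
    """records: iterable of (investigation_count, minutes_in_ae) pairs."""
    groups = defaultdict(list)
    for n_inv, minutes in records:
        groups[band(n_inv)].append(minutes)
    return {b: median(stays) for b, stays in groups.items()}


# Hypothetical records, not values from the NHS synthetic dataset
records = [(1, 95), (2, 130), (3, 110), (5, 180),
           (6, 210), (8, 250), (10, 290), (9, 265)]

result = median_stay_by_band(records)
print(result)
```

The median is used rather than the mean because a few very long stays would otherwise dominate the comparison.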
Other findings: admissions increased year-on-year from 2014 to 2018. Demand peaks in March and is lowest in August. Saturday is the busiest day of the week for admissions. The majority of patients in this dataset came from the most deprived areas by IMD decile.
Limitations: The dataset is synthetic and cannot be used for clinical or policy decisions. Analysis is limited to a single unidentified NHS provider and may not generalise across the system. The NHS England hosting page for this dataset is no longer active.
Life expectancy is one of the most direct measures of whether a health system is working for its population. This project examines how life expectancy changed between 2000 and 2015 across countries grouped by income level, and which health and socioeconomic indicators track most closely with those changes.
What I built
I combined WHO life expectancy data with UN country income classifications, cleaned and prepared the data in Python, and built an interactive Tableau dashboard exploring trends across income groups and nine associated indicators.
What it found
Globally, average life expectancy increased from 66 years in 2000 to 71 years in 2015. The largest gains were in low-income countries, where life expectancy rose by 8 years from a base of 54. Lower middle-income countries gained 5 years; upper middle and high-income countries each gained 4.
Indicators associated with lower life expectancy, including adult deaths, under-five deaths, infant deaths, HIV/AIDS deaths, and adolescent thinness, declined across most income groups over the period. Indicators associated with higher life expectancy, including polio vaccination coverage, health expenditure, years of schooling, and income composition, increased across most groups.
Two findings cut against the general trend. In low-income countries, adolescent thinness did not decline, pointing to persistent food insecurity. In high-income countries, HIV/AIDS-related infant deaths increased over the period.
By 2015, countries with the lowest life expectancy and the poorest indicators across all nine measures were concentrated in Africa. The equity gap is not closing uniformly. Low-income African countries made the biggest relative gains but remained furthest from the global average.
Limitations: Analysis covers 2000 to 2015 only. Data is from a Kaggle WHO dataset and a UN income classification document. Correlations between indicators and life expectancy are descriptive, not causal.
Chronic disease prevalence is not evenly distributed. Who gets diabetes, and how their health is affected, depends heavily on demographic characteristics, body weight, and modifiable risk factors. This project builds an interactive Tableau dashboard to let users explore those relationships across a US population health dataset.
What I built
Using the Diabetes Health Indicators dataset from Kaggle, I cleaned and prepared the data and built a three-panel interactive Tableau dashboard. Users can filter by demographic characteristics, disease status, and risk factors, then explore how those selections change the distribution of weight categories, disease prevalence, perceived general health, and days of physical and mental ill health per month. The middle panel runs a Pareto analysis to identify which population subgroups account for 80% of diabetes cases in the selected view.
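The Pareto logic in the middle panel can be expressed in a few lines outside Tableau. The following is a sketch of the general technique, not the dashboard's calculation: subgroups are ranked by case count and added until their cumulative share reaches the threshold. The subgroup labels and counts are hypothetical.

```python
def pareto_subgroups(case_counts, threshold=0.8):
    """Return the largest subgroups whose cumulative share of cases reaches the threshold."""
    total = sum(case_counts.values())
    ranked = sorted(case_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0
    for name, count in ranked:
        if running / total >= threshold:
            break  # threshold already reached by the subgroups selected so far
        selected.append(name)
        running += count
    return selected


# Hypothetical case counts per subgroup (not taken from the Kaggle dataset)
cases = {
    "subgroup_a": 45,
    "subgroup_b": 25,
    "subgroup_c": 15,
    "subgroup_d": 10,
    "subgroup_e": 5,
}

print(pareto_subgroups(cases))
```

With these invented numbers the first three subgroups account for 85% of cases, so the remaining two fall outside the 80% cut.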
What it shows
The dashboard makes visible what aggregate statistics obscure: that risk factors and disease burden concentrate in specific demographic and weight-category subgroups, and that the same risk factor can have very different associations with health outcomes depending on which group you are looking at. The Pareto view identifies where a small number of subgroups account for a disproportionate share of cases.
Limitations: US population health survey data only. Findings are descriptive and exploratory. No causal claims can be made from this analysis.
Poverty alleviation requires knowing not just where poverty is highest but which factors are most closely associated with it and whether those patterns are consistent across geographies. This project uses US Census data to explore county-level poverty patterns across the contiguous United States in 2015.
What I built
Using a US Census demographic dataset from Kaggle, I selected six variables: poverty count, poverty percentage, median income, employment rate, percentage of minorities, and percentage of public sector workers. I cleaned and prepared the data in Tableau and built a multi-panel dashboard covering bar charts, scatter plots, and an interactive county-level map.
What it found
The headline finding is that the states with the largest number of people living below the poverty line are not the same as the states with the highest poverty percentage. These are two different problems requiring different responses. A state with a large poor population but a moderate poverty rate needs different resourcing than a state with a small population and an extremely high poverty rate.
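A toy example shows why the two rankings diverge. The state names and figures below are invented purely for illustration and are not Census values. The point is only that sorting by count and sorting by rate can order the same states differently.

```python
# Hypothetical states; figures are illustrative, not Census data
states = {
    "State A": {"population": 10_000_000, "below_poverty": 1_300_000},
    "State B": {"population": 1_000_000, "below_poverty": 220_000},
    "State C": {"population": 5_000_000, "below_poverty": 600_000},
}


def rank_by_count(data):
    """Order states by the absolute number of people below the poverty line."""
    return sorted(data, key=lambda s: data[s]["below_poverty"], reverse=True)


def rank_by_rate(data):
    """Order states by the share of their population below the poverty line."""
    return sorted(
        data,
        key=lambda s: data[s]["below_poverty"] / data[s]["population"],
        reverse=True,
    )


print(rank_by_count(states))  # largest poor population first
print(rank_by_rate(states))   # highest poverty percentage first
```

In this made-up case the largest state has the most people in poverty, while the smallest state has the highest poverty rate.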
At county level, the variation within states is as striking as the variation between them. The gap between the highest and lowest poverty counties within a single state can be larger than the gap between states.
On associated factors: higher unemployment, higher minority population percentage, and higher public sector employment all correlate positively with poverty percentage. Higher median income correlates negatively. Geographically, the counties with the highest poverty percentages cluster in the southeastern and southwestern United States and in parts of South Dakota.
Limitations: Single-year cross-sectional data from 2015 only. Correlations are descriptive, not causal. Data is at county level and may mask variation within counties.
Women's health outcomes & PCOS prevalence — UK data analysis.
Public health physician with a decade of experience in disease surveillance, health systems research, and programme management — including USAID-funded data initiatives. Now bringing formal data science training to public health analytics in the UK.