Senior Site Reliability Engineer / Technical Lead - Microsoft Azure
Technical Lead in Azure’s Safe Change Infrastructure SRE team, responsible for chaos engineering, resiliency validation, and release infrastructure.
- Harmonisation: Led the modernisation of Azure's release infrastructure, migrating 60+ repositories and 600+ pipelines and increasing deployment reliability and speed across multiple critical customer-facing services, including Azure Cosmos DB, Log Analytics, Web Apps, and Function Apps.
- Platform Engineering & DevOps Expertise: Developed platform tooling improvements to streamline engineering workflows and improve developer experience. Led shift-left initiatives, integrating early validation mechanisms to catch issues earlier in the development lifecycle.
- Chaos Engineering & Resilience Validation: Designed and implemented chaos engineering experiments to validate system failure hypotheses, covering 80% of high-impact critical customer scenarios, and to improve resilience strategies. Built synthetic monitoring and business validation testing to proactively identify and mitigate reliability risks.
- Organised multiple internal learning sessions and developed a 9-part self-guided onboarding tutorial as part of the SRE Academy, enabling new engineers to onboard to the new release system 75% faster.
- Leadership and Team Management: Technical lead and Scrum Master for a team of 5 engineers, responsible for setting technical direction, mentoring, and defining strategies and goals for the team and the broader department. Also served as technical lead for a newly formed team within the Safe Change Infrastructure SRE organisation and supported multiple program managers and teams across three other SRE organisations in bootstrapping new SRE engagements.
- Cross-Org Collaboration and Stakeholder Engagement: Partnered with 10+ service teams across Azure to help them migrate to the new release system, contributing high-quality pull requests to their repositories as best-practice examples and driving down change-related outages by 20%.
- SRE Best Practices and Knowledge Sharing: Core Contributor & Committee Member for the Azure SRE Playbook, authoring a new SRE pattern with 3 sub-patterns and overseeing the review and integration of 3 additional major patterns.
- Technical Evangelism & Internal Training: Speaker at Azure SRE Tech Talks, delivering sessions on reliability, deployment strategies, and platform engineering.
- Maintained and expanded the Azure SRE Wiki, working across all SRE organisations to standardise and document operational excellence.
- Recognition & Awards: Azure Reliability Quality Star – Leadership Excellence Award for sustained high-quality contributions to Azure’s engineering culture and reliability improvements.