I specialize in building and maintaining large-scale distributed systems that serve hundreds of millions of users, with deep expertise in chaos engineering, observability, and resilience testing. My passion lies in making systems more reliable and teams more effective, and in turning incidents into learning opportunities.
With 13 years of experience in Software Engineering and Site Reliability Engineering, I've designed and operated distributed systems at global scale (across millions of servers), led critical infrastructure migrations for major cloud platforms, and built reliability practices that have become organizational standards at organizations such as Microsoft Azure.
