Job Description:
Key Responsibilities:
Production Monitoring & Observability
• Oversee production health and performance monitoring using Prometheus, AWS CloudWatch, log aggregation, and proprietary monitoring tools.
• Define and enforce observability standards, including alerting, metrics collection, and dashboards across services and regions.
• Drive proactive detection and resolution of production issues.
CI/CD & Infrastructure Automation
• Lead development and maintenance of CI/CD pipelines for over 100 services across 10 AWS regions.
• Ensure reliable, secure, and efficient deployment processes through automation and standardization.
• Promote Infrastructure-as-Code (IaC) practices to maintain consistency across environments.
SRE & Production Stability
• Manage a Site Reliability Engineering (SRE) team focused on rapid issue resolution and root cause analysis.
• Drive incident management, on-call practices, and post-incident reviews to improve system resilience and uptime.
• Collaborate with R&D and Product teams to design for reliability and operational excellence.
Leadership & Strategy
• Lead, mentor, and grow a cross-regional DevOps team.
• Define long-term operational strategies aligned with business goals for performance, cost efficiency, and scalability.
• Partner with Security and Engineering teams to ensure compliance, best practices, and secure operations.
Job Qualifications:
• Proven management experience leading DevOps, SRE, or infrastructure teams.
• Strong background in AWS, monitoring technologies (Prometheus, CloudWatch, etc.), and CI/CD development.
• Experience supporting large-scale distributed systems (multi-region, multi-service).
• Excellent communication skills and the ability to collaborate with cross-functional teams.
• Hands-on technical understanding of production operations, automation, and reliability engineering.