Concord

Site Reliability Engineer (SRE) Manager

Job Location

Nuevo León, Mexico

Job Description

Location: Hybrid in Monterrey, MX, with 8 days a month on-site. A travel or relocation stipend may be available.
Type of Employment: Contract to hire. A 1-3 month remote contract, followed by full-time employment.
Requirement: Must be legally authorized to work for any Mexican employer without sponsorship, now or in the future.

About Us

Concord isn't your typical consulting firm; we're an execution-focused company passionate about delivering results. Our mission is to help clients enhance customer experiences, optimize operations, and revolutionize product offerings through seamless integration, optimization, and activation of technology and data. Our services and solutions include Digital Experience (Salesforce, Headless Commerce, UI/UX), Data and Analytics (Snowflake, Databricks, Martech Analytics), and Engineering and Application Services (Application Modernization, Greenfield Apps, Portal Buildout, etc.).

About the Role

We are seeking a strategic, technically adept, and hands-on SRE Manager to lead the reliability, scalability, and operational excellence of our production systems. This role is ideal for a leader who thrives in high-pressure environments, excels at debugging complex production issues, and is passionate about building and mentoring high-performing teams. The SRE Manager will be responsible for hiring and managing a team of SREs, driving incident response and postmortem processes, and collaborating with multiple product teams to build and maintain robust CI/CD pipelines and deployment practices. This role demands a strong sense of ownership, a deep understanding of cloud-native infrastructure, and the ability to lead by example.

Business Alignment

The SRE Manager will partner with business stakeholders to ensure reliability goals support customer experience, compliance, and growth targets. This includes aligning SRE initiatives with broader business objectives such as revenue protection, innovation, and regulatory adherence.

Key Responsibilities

Build and lead a high-performing Site Reliability Engineering team.
Create individualized development plans for SREs, encourage participation in industry conferences, and support certification programs.
Debug and resolve complex production issues, ensuring minimal downtime and rapid recovery.
Own the incident lifecycle, including coordination, communication, and creation of detailed postmortem documentation.
Implement blameless postmortems and maintain a library of runbooks for common incident types.
Follow up with product teams to ensure resolution and implementation of long-term fixes.
Partner with internal product and engineering teams to understand infrastructure needs and deliver scalable, secure, and reliable solutions.
Drive the design, implementation, and automation of cloud infrastructure using Azure, Terraform, and Kubernetes (AKS).
Lead the adoption and management of tools such as Argo CD, Argo Workflows, Azure DevOps, and Octopus Deploy.
Architect and manage API Gateways, WAFs, Service Mesh, and multi-cloud networking (VNets, private networks).
Establish and enforce deployment best practices, including documentation, versioning, rollback strategies, and environment management.
Collaborate with product teams to build and maintain CI/CD pipelines, ensuring reliable and repeatable deployments.
Foster a culture of ownership, accountability, and continuous improvement across the team.
Define and track key performance indicators (KPIs) for system reliability and team effectiveness.
Define and manage Service Level Objectives (SLOs) and error budgets for all critical services.
Lead the adoption of advanced observability tools for proactive reliability management.
Collaborate with security, compliance, and architecture teams through joint reviews, shared dashboards, and audits to ensure infrastructure meets enterprise standards.

Required Qualifications

10 years of experience in infrastructure, DevOps, or SRE roles, with 3 years in a technical leadership or management capacity.
Proven experience debugging and resolving production issues in large-scale systems.
Experience building and scaling cloud-native infrastructure on Azure.
Deep expertise in Kubernetes (AKS), CI/CD pipelines, and Infrastructure as Code (Terraform).
Strong understanding of networking, VNets, private cloud connectivity, and multi-cloud architectures.
Hands-on experience with Argo CD, Argo Workflows, and Azure DevOps.
Demonstrated ability to hire, mentor, and lead engineering teams.
Excellent communication and stakeholder management skills.
Strong problem-solving mindset with a bias for action and ownership.
Ability to create and maintain detailed deployment documentation and lead by example in operational excellence.
Advanced English proficiency (C1 or C2) with proven success collaborating in global, English-speaking environments.

Preferred Qualifications

Experience supporting internal product teams or platform engineering organizations.
Familiarity with FinOps, cost optimization, and cloud governance.
Exposure to compliance frameworks (SOC2, ISO, HIPAA).
Experience with service mesh technologies (Istio, Linkerd).
Knowledge of emerging technologies such as AI/ML ops, edge computing, and sustainability practices.

What Success Looks Like

A high-performing SRE team that operates with autonomy and accountability.
Internal customers view the SRE team as a trusted partner in delivering reliable, scalable systems.
Infrastructure is automated, observable, and resilient by design.
Incidents are rare, well-managed, and always lead to learning and improvement.
CI/CD pipelines are robust, well-documented, and consistently deliver high-quality deployments.

Location: Nuevo León, Mexico, MX

Posted Date: 10/27/2025

Contact Information

Contact: Human Resources Concord