Senior Manager, Site Reliability Engineering
Catalyst Brands · Dallas, TX, United States
Dallas, TX, United States · Exp: 5+ yrs · 103,500-172,500 USD/yearly · Hybrid
Remuneration
103,500-172,500 USD/yearly
Location
Dallas, TX, United States
Visa sponsorship
Not specified
Job summary
The Senior Manager, Site Reliability Engineering oversees daily operations and delivery of SRE teams, driving productivity and ensuring the health, performance, resilience, and stability of Catalyst’s eCommerce and CRM platforms. This hybrid leadership role combines technical expertise with people management, contributing to the technical direction, automation strategy, telemetry and observability practices, solution delivery, and incident management.
Qualifications
- 10+ years of experience in global organizations with effective communication across all levels.
- 5+ years of hands-on Site Reliability Engineering (SRE) experience, including platform automation, telemetry, observability, and self-healing systems.
- Demonstrated leadership and collaboration in high-availability, mission-critical digital environments.
- Strong support knowledge and understanding of retail eCommerce flows and of web and mobile technologies.
- Experience working with software engineers across scrum teams and performance engineering to ensure systems meet reliability and performance standards.
- Hands-on experience with debugging, optimizing code, and automation.
- Ability to identify opportunities to adopt innovative technologies and drive continuous improvement (automation, shift-left, self-healing).
- Extensive experience supporting and administering digital retail and eCommerce platforms with AWS, Azure, or Google Cloud.
- Demonstrated experience in application design, software development, testing, and production support of Java/J2EE-based eCommerce applications.
- Practical experience monitoring and maintaining streaming platform technologies such as Apache Kafka.
- Deep understanding of cloud-native architectures and platform operations.
- Proficient with modern monitoring, logging, and telemetry tools including New Relic, Splunk, ELK, Datadog, Dynatrace, Catchpoint, and AWS CloudWatch.
- Hands-on experience designing and implementing automated health checks, observability pipelines, and self-healing solutions.
- Strong experience with automation tools and frameworks such as Jenkins, Chef, Ansible, and Terraform.
- Expertise in scripting languages for platform automation and diagnostics: PowerShell, Python, Ruby, AWK, SED.
- Advanced experience with public cloud platforms: Microsoft Azure and Amazon Web Services (AWS).
- Solid understanding of networking fundamentals: TCP/IP, DNS, DHCP, WINS.
- Advanced experience with Content Delivery Networks (CDNs) such as Akamai and Cloudflare.
- Experience using ITSM and collaboration platforms: Jira, BMC Remedy, ServiceNow.
- Strong understanding of IT operations frameworks (e.g., ITIL, MOF).
Responsibilities
- Oversee daily operations and delivery of Site Reliability Engineering teams.
- Drive team productivity and ensure health, performance, resilience, and stability of Catalyst’s eCommerce and CRM platforms.
- Contribute to the technical direction of the team.
- Shape automation strategy.
- Guide telemetry and observability practices.
- Lead solution delivery.
- Manage incidents and problems affecting platform reliability.
- Contribute to short- and long-term planning initiatives, including systems architecture, team development, and organizational strategy.
- Provide technical and people leadership to SRE teams through one-on-one meetings, team syncs, and performance reviews.
- Manage project execution by organizing cross-functional teams, assigning responsibilities, and tracking progress.
- Assist in budgeting, workforce planning, hiring, and third-party contract negotiations.
- Drive continuous improvements in platform reliability, stability, and performance by overseeing deployment of automated telemetry, observability, and AI-driven monitoring solutions.
- Lead development and enhancement of intelligent alerting and automated incident response systems.
- Collaborate with administrators and platform engineers on implementation decisions for reliable infrastructure, systems, and integrations.
- Document all changes in accordance with change control policies and documentation standards; identify risks and recommend corrective actions.
- Provide advanced Incident Management and Problem Management support by analyzing telemetry data and system logs.
- Participate in on-call escalation support rotations.
- Act as Escalation Manager/Critical Incident Manager during major incidents.
- Communicate timely updates and incident reports to senior leadership.
- Lead conversations and provide business and engineering support for internal and external stakeholders.
Skills
Akamai, Ansible, AWS, Azure, Chef, Cloudflare, CloudWatch, Datadog, Dynatrace, GCP, Java, Jenkins, Jira, Kafka, New Relic, PowerShell, Python, Ruby, ServiceNow, Splunk, Terraform
Certifications
Azure/AWS, Microsoft, ITIL
Degrees
Bachelor’s degree in computer science or related technical field
Languages
PowerShell, Python, Ruby, AWK, SED
Work schedule
On-call escalation support rotations
Relocation
No