Senior Site Reliability Engineer

Mastercard · Salt Lake City, UT, United States

Salt Lake City, UT, United States · 96,000-163,000 USD/yearly · Remote

Remuneration

96,000-163,000 USD/yearly

Location

Salt Lake City, UT, United States

Visa sponsorship

Not specified

Job summary

Mastercard is seeking a Senior Site Reliability Engineer for Commerce Media to lead the reliability, scalability, and production operations of a greenfield application. This individual contributor role involves end-to-end ownership of system reliability, partnering with engineering and platform teams to ensure resilient, observable, and production-ready services.

Benefits

Medical insurance
Prescription drug insurance
Dental insurance
Vision insurance
Disability insurance
Life insurance
Flexible spending account
Health savings account
Paid new parent leave (16 weeks)
Paid bereavement leave (up to 20 days)
Paid Sick and Safe Time (80 hours)
Vacation time (25 days)
Personal days (5 days)
Paid U.S. observed holidays (10 annually)
401k with company match
Deferred compensation for eligible roles
Fitness reimbursement
On-site fitness facilities
Tuition reimbursement

Qualifications

Professional experience operating distributed systems at scale in production
Strong expertise in Kubernetes and containerized environments
Strong expertise in observability (metrics, logging, tracing)
Strong expertise in Spring Boot and/or Golang ecosystems
Hands-on experience across application, infrastructure, and release pipelines
Demonstrated ownership of service reliability, incident response, and operational strategy
Ability to influence system design through technical leadership and data-driven decisions
Pragmatic mindset, balancing automation, trade-offs, and system evolution
Experience navigating enterprise environments while maintaining delivery velocity
Ability to leverage AI tools (e.g., Copilot, ChatGPT, Claude) to accelerate design, coding, and testing
Ability to leverage AI tools to improve code quality and operational outcomes
Experience integrating AI into workflows for architecture reviews, code generation, testing, and documentation
Strong judgment in production-critical, low-latency environments

Responsibilities

Lead reliability, scalability, and production operations for a greenfield application
Partner with engineering and platform teams to ensure resilient, observable, and production-ready services
Drive reliability-focused design
Lead architecture and launch readiness reviews, including capacity planning and failure-mode/risk analysis
Define and enforce non-functional requirements (availability, latency, resilience)
Own production reliability and service health
Act as incident commander, leading triage, mitigation, and communication
Lead blameless post-mortems with actionable follow-ups
Proactively identify and reduce operational risk across the system
Define and manage SLIs, SLOs, and error budgets
Design and operate monitoring and alerting using Prometheus, Grafana, OpenSearch/Elasticsearch, and Opsgenie
Build dashboards aligned to user impact and system health
Drive automation-first operations to scale systems sustainably
Enhance CI/CD pipelines (GitHub Actions) with deployment gating and validation
Identify and resolve performance and reliability bottlenecks
Improve developer experience through operational tooling and best practices

Skills

AWS
Docker
Elasticsearch
GitHub
GitHub Actions
Go
Grafana
Kubernetes
OpenSearch
Opsgenie
Prometheus
Windows
