Senior Site Reliability Engineer

Mastercard · Salt Lake City, UT, United States
Salt Lake City, UT, United States · 96,000-163,000 USD/yearly · Remote
Remuneration
96,000-163,000 USD/yearly
Location
Salt Lake City, UT, United States
Visa sponsorship
Not specified

Job summary

Mastercard is seeking a Senior Site Reliability Engineer for Commerce Media to lead the reliability, scalability, and production operations of a greenfield application. This individual contributor role involves end-to-end ownership of system reliability, partnering with engineering and platform teams to ensure resilient, observable, and production-ready services.

Benefits

  • Medical insurance
  • Prescription drug insurance
  • Dental insurance
  • Vision insurance
  • Disability insurance
  • Life insurance
  • Flexible spending account
  • Health savings account
  • Paid new parent leave (16 weeks)
  • Paid bereavement leave (up to 20 days)
  • Paid Sick and Safe Time (80 hours)
  • Vacation time (25 days)
  • Personal days (5 days)
  • Paid U.S. observed holidays (10 annually)
  • 401k with company match
  • Deferred compensation for eligible roles
  • Fitness reimbursement
  • On-site fitness facilities
  • Tuition reimbursement

Qualifications

  • Professional experience operating distributed systems at scale in production
  • Strong expertise in Kubernetes and containerized environments
  • Strong expertise in observability (metrics, logging, tracing)
  • Strong expertise in Spring Boot and/or Golang ecosystems
  • Hands-on experience across application, infrastructure, and release pipelines
  • Demonstrated ownership of service reliability, incident response, and operational strategy
  • Ability to influence system design through technical leadership and data-driven decisions
  • Pragmatic mindset, balancing automation, trade-offs, and system evolution
  • Experience navigating enterprise environments while maintaining delivery velocity
  • Ability to leverage AI tools (e.g., Copilot, ChatGPT, Claude) to accelerate design, coding, and testing
  • Ability to leverage AI tools to improve code quality and operational outcomes
  • Ability to integrate AI into workflows for architecture reviews, code generation, testing, and documentation
  • Strong judgment in production-critical, low-latency environments

Responsibilities

  • Lead reliability, scalability, and production operations for a greenfield application
  • Partner with engineering and platform teams to ensure resilient, observable, and production-ready services
  • Drive reliability-focused design
  • Lead architecture and launch readiness reviews, including capacity planning and failure-mode/risk analysis
  • Define and enforce non-functional requirements (availability, latency, resilience)
  • Own production reliability and service health
  • Act as incident commander, leading triage, mitigation, and communication
  • Lead blameless post-mortems with actionable follow-ups
  • Proactively identify and reduce operational risk across the system
  • Define and manage SLIs, SLOs, and error budgets
  • Design and operate monitoring and alerting using Prometheus, Grafana, OpenSearch/Elasticsearch, and Opsgenie
  • Build dashboards aligned to user impact and system health
  • Drive automation-first operations to scale systems sustainably
  • Enhance CI/CD pipelines (GitHub Actions) with deployment gating and validation
  • Identify and resolve performance and reliability bottlenecks
  • Improve developer experience through operational tooling and best practices

Skills

  • AWS
  • Docker
  • Elasticsearch
  • GitHub
  • GitHub Actions
  • Go
  • Grafana
  • Kubernetes
  • OpenSearch
  • Opsgenie
  • Prometheus
  • Windows

Relocation

No