Senior Site Reliability Engineer - Product Reliability

Matador · Laval, QC, Canada
Laval, QC, Canada · Exp: 5+ yrs · 130,000-150,000 CAD/yearly · Remote
Remuneration
130,000-150,000 CAD/yearly
Location
Laval, QC, Canada
Visa sponsorship
Not specified

Job summary

Matador is seeking a Senior Site Reliability Engineer – Product Reliability to scale, operate, and improve the reliability of its AI-powered communication platform. This role involves ensuring the stability, scalability, and performance of systems powering real-time interactions across distributed architectures. The engineer will also investigate production incidents, identify root causes, and collaborate with engineering teams to enhance observability and long-term reliability.

Benefits

  • Competitive compensation
  • Opportunities for advancement

Qualifications

  • 5+ years of experience in Site Reliability Engineering, Production Engineering, Backend Engineering, or similar roles
  • Strong hands-on experience with Node.js and TypeScript in production environments
  • Proven experience operating and troubleshooting distributed systems and microservices architectures
  • Experience managing production workloads on AWS, including ECS, Lambda, SQS, and API Gateway
  • Hands-on experience with Kafka, AWS SQS, or other messaging/event-streaming systems
  • Strong understanding of observability, monitoring, alerting, and incident response best practices
  • Experience debugging complex production issues across application, infrastructure, and networking layers
  • Deep understanding of system reliability concepts including concurrency, asynchronous workflows, resiliency, fault tolerance, and eventual consistency
  • Experience with MongoDB and Redis in high-scale production environments
  • Ability to analyze logs, traces, metrics, and system behavior to identify root causes efficiently
  • Strong communication skills and ability to collaborate across engineering, product, and support teams
  • Experience mentoring engineers and contributing to operational excellence initiatives
  • Experience with Kubernetes and container orchestration in production (Nice to have)
  • Broader AWS infrastructure experience (networking, infrastructure-as-code, observability, cost optimization) (Nice to have)
  • Experience with relational databases such as PostgreSQL (Nice to have)
  • Experience developing load tests, resilience tests, and chaos engineering exercises (Nice to have)
  • Prior customer support experience or direct work with customers to understand business impact (Nice to have)

Responsibilities

  • Serve as the first line of technical investigation for production incidents, product failures, and performance issues
  • Analyze logs, traces, metrics, and system behavior to identify root causes and implement solutions
  • Collaborate with backend engineering and DevOps teams to diagnose issues impacting stability, latency, and reliability
  • Design and implement observability improvements, including monitoring, alerting, and structured logging across distributed systems
  • Establish and improve incident response processes, including escalation procedures, post-mortem analysis, and prevention of recurring incidents
  • Participate in architectural design of backend services, event-driven systems, and asynchronous messaging pipelines to ensure reliability and disaster recovery
  • Optimize performance and resilience of systems operating under high load, powering thousands of real-time interactions
  • Develop and maintain operational documentation, runbooks, and dashboards to support production operations
  • Collaborate with product and customer support teams to understand business impact and prioritization
  • Mentor junior engineers on reliability best practices and resilient design principles

Skills

AWS, ECS, Kafka, Kubernetes, AWS Lambda, MongoDB, Node.js, PostgreSQL, Redis, SQS, TypeScript

Work schedule

Flexible hours

Industry

Automotive retail, AI

Relocation

No