Senior Site Reliability Engineer - Product Reliability

Matador · Laval, QC, Canada

Laval, QC, Canada · Exp: 5+ yrs · 130,000-150,000 CAD/yearly · Remote

Remuneration

130,000-150,000 CAD/yearly

Location

Laval, QC, Canada

Visa sponsorship

Not specified

Job summary

Matador is seeking a Senior Site Reliability Engineer – Product Reliability to scale, operate, and improve the reliability of its AI-powered communication platform. This role involves ensuring the stability, scalability, and performance of systems powering real-time interactions across distributed architectures. The engineer will also investigate production incidents, identify root causes, and collaborate with engineering teams to enhance observability and long-term reliability.

Benefits

Competitive compensation
Opportunities for advancement

Qualifications

5+ years of experience in Site Reliability Engineering, Production Engineering, Backend Engineering, or similar roles
Strong hands-on experience with Node.js and TypeScript in production environments
Proven experience operating and troubleshooting distributed systems and microservices architectures
Experience managing production workloads on AWS, including ECS, Lambda, SQS, and API Gateway
Hands-on experience with Kafka, AWS SQS, or other messaging/event-streaming systems
Strong understanding of observability, monitoring, alerting, and incident response best practices
Experience debugging complex production issues across application, infrastructure, and networking layers
Deep understanding of system reliability concepts including concurrency, asynchronous workflows, resiliency, fault tolerance, and eventual consistency
Experience with MongoDB and Redis in high-scale production environments
Ability to analyze logs, traces, metrics, and system behavior to identify root causes efficiently
Strong communication skills and ability to collaborate across engineering, product, and support teams
Experience mentoring engineers and contributing to operational excellence initiatives
Experience with Kubernetes and container orchestration in production (Nice to have)
Broader AWS infrastructure experience (networking, infrastructure-as-code, observability, cost optimization) (Nice to have)
Experience with relational databases such as PostgreSQL (Nice to have)
Experience developing load tests, resilience tests, and chaos engineering exercises (Nice to have)
Prior customer support experience or direct work with customers to understand business impact (Nice to have)

Responsibilities

Serve as the first line of technical investigation for production incidents, product failures, and performance issues
Analyze logs, traces, metrics, and system behavior to identify root causes and implement solutions
Collaborate with backend engineering and DevOps teams to diagnose issues impacting stability, latency, and reliability
Design and implement observability improvements, including monitoring, alerting, and structured logging across distributed systems
Establish and improve incident response processes, including escalation procedures, post-mortem analysis, and prevention of recurring incidents
Participate in architectural design of backend services, event-driven systems, and asynchronous messaging pipelines to ensure reliability and disaster recovery
Optimize performance and resilience of systems operating under high load, powering thousands of real-time interactions
Develop and maintain operational documentation, runbooks, and dashboards to support production operations
Collaborate with product and customer support teams to understand business impact and prioritization
Mentor junior engineers on reliability best practices and resilient design principles

Skills

AWS
ECS
Kafka
Kubernetes
AWS Lambda
MongoDB
Node.js
PostgreSQL
Redis
SQS
TypeScript

Work schedule

Flexible hours

Industry

Automotive retail
AI
