Senior Site Reliability Engineer - Product Reliability
Matador · Laval, QC, Canada
Laval, QC, Canada · Exp: 5+ yrs · 130,000-150,000 CAD/yearly · Remote
Remuneration
130,000-150,000 CAD/yearly
Location
Laval, QC, Canada
Visa sponsorship
Not specified
Job summary
Matador is seeking a Senior Site Reliability Engineer – Product Reliability to scale, operate, and improve the reliability of its AI-powered communication platform. This role involves ensuring the stability, scalability, and performance of systems powering real-time interactions across distributed architectures. The engineer will also investigate production incidents, identify root causes, and collaborate with engineering teams to enhance observability and long-term reliability.
Benefits
- Competitive compensation
- Opportunities for advancement
Qualifications
- 5+ years of experience in Site Reliability Engineering, Production Engineering, Backend Engineering, or similar roles
- Strong hands-on experience with Node.js and TypeScript in production environments
- Proven experience operating and troubleshooting distributed systems and microservices architectures
- Experience managing production workloads on AWS, including ECS, Lambda, SQS, and API Gateway
- Hands-on experience with Kafka, AWS SQS, or other messaging/event-streaming systems
- Strong understanding of observability, monitoring, alerting, and incident response best practices
- Experience debugging complex production issues across application, infrastructure, and networking layers
- Deep understanding of system reliability concepts including concurrency, asynchronous workflows, resiliency, fault tolerance, and eventual consistency
- Experience with MongoDB and Redis in high-scale production environments
- Ability to analyze logs, traces, metrics, and system behavior to identify root causes efficiently
- Strong communication skills and ability to collaborate across engineering, product, and support teams
- Experience mentoring engineers and contributing to operational excellence initiatives
- Experience with Kubernetes and container orchestration in production (Nice to have)
- Broader AWS infrastructure experience (networking, infrastructure-as-code, observability, cost optimization) (Nice to have)
- Experience with relational databases such as PostgreSQL (Nice to have)
- Experience developing load tests, resilience tests, and chaos engineering exercises (Nice to have)
- Prior customer support experience or direct work with customers to understand business impact (Nice to have)
Responsibilities
- Serve as the first line of technical investigation for production incidents, product failures, and performance issues
- Analyze logs, traces, metrics, and system behavior to identify root causes and implement solutions
- Collaborate with backend engineering and DevOps teams to diagnose issues impacting stability, latency, and reliability
- Design and implement observability improvements, including monitoring, alerting, and structured logging across distributed systems
- Establish and improve incident response processes, including escalation procedures, post-mortem analysis, and prevention of recurring incidents
- Participate in architectural design of backend services, event-driven systems, and asynchronous messaging pipelines to ensure reliability and disaster recovery
- Optimize performance and resilience of systems operating under high load, powering thousands of real-time interactions
- Develop and maintain operational documentation, runbooks, and dashboards to support production operations
- Collaborate with product and customer support teams to understand business impact and inform prioritization
- Mentor junior engineers on reliability best practices and resilient design principles
Skills
AWS, ECS, Kafka, Kubernetes, AWS Lambda, MongoDB, Node.js, PostgreSQL, Redis, SQS, TypeScript
Work schedule
Flexible hours
Industry
Automotive retail, AI
Relocation
No