Senior Lead Site Reliability Engineer
JPMorganChase · Palo Alto, CA, United States
Palo Alto, CA, United States · Exp: 10+ yrs · 171,000-260,000 USD/yearly · Remote
Remuneration
171,000-260,000 USD/yearly
Location
Palo Alto, CA, United States
Visa sponsorship
Not specified
Job summary
As a Senior Lead Site Reliability Engineer at JPMorgan Chase, you will define non-functional requirements and availability targets for services, ensuring they are accounted for in the product design and test phases. This role combines SRE excellence with AI/ML capabilities to build autonomous systems for infrastructure operations at enterprise scale.
Qualifications
- Formal training or certification in software engineering concepts with 10+ years of applied experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Advanced knowledge of site reliability culture and principles, with demonstrated ability to implement site reliability within an application or platform.
- Advanced knowledge of and experience in observability, including white-box and black-box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Expert-level proficiency in Java, Go (Golang), Python, and Terraform for building enterprise-grade applications, high-performance systems, automation, and infrastructure as code.
- Advanced knowledge of software applications and technical processes with considerable depth in multiple technical disciplines, including distributed systems, microservices architecture, and cloud-native technologies.
- Hands-on experience building AI Agents and autonomous systems with proficiency in AI frameworks (LangChain, LangGraph, AutoGen, CrewAI) and leveraging AI development tools (GitHub Copilot, Claude).
- Expertise in designing and implementing logging pipelines (Fluentd, Logstash, Vector) and systems for metrics collection, analysis, and distributed tracing.
- Strong experience building production-grade RESTful APIs and designing message queue architectures (Kafka, RabbitMQ, SQS) for event-driven systems.
- Experience with graph databases (Neo4j, TigerGraph), vector databases (Pinecone, Weaviate, Chroma), and integrating multiple data stores for AI-powered systems.
- Proficiency with containerization (Docker, Kubernetes), CI/CD pipelines, and GitOps workflows.
- Ability to communicate data-based solutions with complex reporting and visualization methods.
- Recognized as an active contributor to the engineering community.
- Ability to expand your professional network and lead evaluation sessions with vendors to align their offerings with firm strategy.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Experience with MCP (Model Context Protocol) Servers or similar agent frameworks for building autonomous systems.
- Understanding of LLM integration, prompt engineering, and RAG (Retrieval-Augmented Generation).
- Familiarity with AI/ML model building, deployment, and lifecycle management using frameworks like TensorFlow, PyTorch, or scikit-learn.
- Experience with big data technologies (Hadoop, Spark, Flink), analytical databases, NoSQL databases (MongoDB, Cassandra, DynamoDB), and time-series databases (InfluxDB, TimescaleDB).
- Knowledge of security best practices and compliance requirements in highly regulated industries.
- Experience with chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos) and GameDay exercises.
Responsibilities
- Create high-quality designs, roadmaps, and program charters for AI-powered automation systems, intelligent monitoring solutions, and next-generation reliability platforms.
- Provide advice and mentoring to other engineers, acting as a key resource for technical and business-related issues, especially in SRE and AI/ML technologies.
- Demonstrate and champion site reliability principles and practices within the team.
- Collaborate to create and implement robust, stable observability and reliability designs for complex systems, including logging pipelines and systems for metrics and traces across distributed systems.
- Design and build AI Agents and MCP Servers for autonomous operations, including incident detection, root cause analysis, and auto-remediation.
- Architect solutions that integrate multiple data stores, including graph, vector, transactional, analytical, and big data platforms.
- Develop automation scripts and infrastructure-as-code using Java, Go, Python, and Terraform to improve operational efficiency.
- Build and maintain RESTful services, APIs, and message queue architectures for event-driven systems and platform automation.
- Contribute significantly to JPMorgan Chase's site reliability community through internal forums, communities of practice, guilds, and conferences.
Skills
AWS, Azure, Cassandra, Datadog, Docker, DynamoDB, Dynatrace, Fluentd, GCP, GitHub, Go, Grafana, Hadoop, InfluxDB, Java, Kafka, Kubernetes, Logstash, MongoDB, Prometheus, Python, RabbitMQ, REST, Spark, Splunk, SQS, Terraform, TimescaleDB
Relocation
No