Company Overview
Chubb is a leading global insurance provider, dedicated to delivering superior products and services to our clients.
Our Platforms Team is at the forefront of innovation, creating technology solutions that empower multiple business lines across the organization.
We are seeking a hands-on Regional SRE Engineer for our LATAM/North America region.
Job Summary
We are looking for a highly skilled, hands-on Site Reliability Engineer (SRE) to join our LATAM/North America team.
In this role, you will be directly responsible for troubleshooting, maintaining, and improving the reliability and performance of critical applications and infrastructure.
You will work closely with other SREs, developers, and business teams to ensure our systems are robust, observable, and scalable.
While this is primarily a technical engineering role, you will also provide mentorship and occasional guidance to junior team members and collaborate with regional teams to share best practices.
Key Responsibilities
Hands-On Engineering & Troubleshooting
- Diagnose and resolve complex application and server issues, including root cause analysis (RCA), memory debugging, and performance optimization.
- Perform application and server troubleshooting, including IIS administration and environment setup/deployment on Windows.
- Write and maintain automation scripts (PowerShell, Python) to streamline deployments, monitoring, and operational tasks.
- Develop and debug .NET applications as needed to support reliability and performance goals.
Monitoring, Observability & Analysis
- Set up and maintain observability tools and dashboards (e.g., AppDynamics, New Relic, OpenTelemetry) to monitor application health and user experience.
- Aggregate and analyze logs using tools like Splunk or ELK Stack to identify and resolve performance bottlenecks.
- Define and track Critical User Journeys (CUJs) and ensure relevant telemetry is in place.
Infrastructure & Deployment
- Set up, configure, and maintain Windows environments and server configurations.
- Deploy and manage applications in both on-premises and cloud (Azure) environments.
- Manage identity and access (Active Directory, Azure AD) and leverage Platform-as-a-Service (PaaS) offerings as needed.
Automation & Toil Reduction
- Identify repetitive manual tasks and automate them to improve operational efficiency and reduce toil.
- Advocate for and implement process improvements and tooling enhancements.
Incident Response & Reliability
- Apply SRE principles, including SLAs, SLOs, error budgets, and reliability-focused system design.
- Lead incident management and postmortems, and implement preventative measures.
- Support chaos engineering and resilience testing to ensure robust recovery mechanisms.
Collaboration & Mentorship
- Mentor engineers and champion SRE best practices, embedding a reliability-first culture and ensuring technical excellence across engineering teams.
- Collaborate with cross-functional teams (developers, infrastructure, business) to drive reliability improvements.
- Share knowledge and best practices with junior SREs and regional teams.
Skills & Experience
Required:
- 8+ years of hands-on SRE or application support engineering experience, with some exposure to team or project leadership.
- Strong troubleshooting skills for applications and servers, including RCA, memory debugging, and performance tuning.
- Proficiency in PowerShell scripting, Python programming, and .NET development/debugging with a focus on automation, tooling, and system integration.
- Experience with Windows internals, IIS administration, and environment setup/deployment.
- Deep familiarity with observability and monitoring tools (AppDynamics, Dynatrace, ScienceLogic, Azure Insights, Splunk, ELK Stack, OpenTelemetry).
- Analytical thinking and advanced troubleshooting skills.
- Experience with Azure cloud infrastructure and services.
- Strong communication skills and ability to work under pressure.
Nice to Have:
- Experience with both application and infrastructure SRE roles.
- Background in regulated or high-compliance industries.
- Experience with chaos engineering, performance optimization, or fault injection.
- Knowledge of PaaS and identity management (Active Directory, Azure AD).
Soft Skills:
- Proactive, detail-oriented, and able to handle production-critical issues.
- Collaborative mindset and willingness to mentor others.
What Success Looks Like
- Rapid, effective resolution of incidents and performance issues.
- High uptime and reliability for all critical applications.
- Continuous improvement in automation and operational efficiency.
- A culture of reliability and technical excellence within the team.
Join Us
If you are a hands-on SRE who is passionate about technology and reliability and eager to make a direct impact, we would love to hear from you.
Apply now to join the Chubb Platforms Team and help us innovate for the future.