Overview 
Site Reliability Engineer (Middle) ID38916 — AgileEngine 
AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and startups across 17+ industries, with a people-first culture and multiple Best Place to Work awards.
Why join: AgileEngine is a place to grow, make an impact, and work with caring teams.
Responsibilities 
- Shift: Monday – Thursday, 8AM – 7PM PST (11AM – 10PM EST), with rotating on-call.
- On-call shifts every 6 weeks: one week as primary responder and the next week as secondary.
- Manage alerts daily, check systems, and escalate issues as needed.
- Provide 24×7 on-call support for critical SaaS events as part of a team.
- Be available in emergencies when team members are unavailable or need help.
- Document issues and remediation steps.
- Proactively create appropriate monitors in the EKS/K8s ecosystem.
- Deploy to EKS/Kubernetes clusters using Terraform and Helm.
- Learn and maintain existing infrastructure running under Docker Swarm.
- Improve infrastructure health by implementing checks and scripts to address known issues.
- Maintain and develop deployment code; automate manual tasks.
- Implement and integrate new technologies in our cloud infrastructure.
- Collaborate with Support, Customer Success, Migration, and Professional Services teams to provide a high-quality SaaS service.
- Apply a customer-focused approach when planning deployments and updates, considering customer impact before making changes.
- Work closely with teams to provide best-in-class SaaS service; perform root cause analysis (RCA) and take corrective actions to prevent recurrence.
- Create and assign alert-related actions to the appropriate team after investigations.
- Handle environment-specific support requests; identify automation requirements to improve RCA.
Must Haves 
- 2+ years of professional experience.
- Experience working with Datadog.
- Hands-on experience as an AWS Cloud Engineer.
- Working knowledge of EKS, Terraform, and Helm.
- Working experience with Docker and Docker Swarm.
- Understanding of AWS IAM roles and policies.
- Experience logging and monitoring AWS resources using CloudWatch Logs.
- Experience working in a Linux environment.
- Proficiency in Bash and/or Python scripting.
- Strong understanding of web technologies such as REST APIs.
- Experience with monitoring solutions such as Grafana and Prometheus.
- Excellent oral and written communication skills, including customer-facing communication to explain issues and RCAs.
- Experience in Product/Application Support for SaaS-based products.
- Understanding of APIs, databases, systems architecture, and design.
- Experience designing, implementing, and operating in a DevSecOps environment.
- Ability to work independently as well as in a team; technical aptitude and willingness to learn evolving technologies.
- Upper-Intermediate English level.
Nice to Haves 
- Experience with GCP or Azure.
- Certifications: AWS Certified DevOps Engineer – Professional or AWS Certified Advanced Networking – Specialty.
Perks and Benefits 
- Professional growth: Mentorship, TechTalks, and personalized growth roadmaps.
- Competitive compensation: USD-based compensation with budgets for education, fitness, and team activities.
- A selection of exciting projects: Modern solutions development for top-tier clients, including Fortune 500 enterprises.
- Flextime: A flexible schedule with remote or office work options to suit your productivity.
Industries 
- IT Services and IT Consulting 