Site Reliability Engineer (Middle) ID38916
AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and startups across 17+ industries.
We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us Best Place to Work awards.
WHY JOIN US
If you're looking for a place to grow, make an impact, and work with people who care, we’d love to meet you!
WHAT YOU WILL DO
- Shift: Monday – Thursday 8AM – 7PM PST (11AM – 10PM EST) with rotating on-call
- On-call rotation: every 6 weeks, one week as primary responder followed by one week as secondary
- Manage alerts daily, check systems, and escalate issues as needed
- Provide 24×7 on-call support for critical SaaS events
- Be available in emergencies when team members are not available or need help
- Document issues and remediation steps
- Proactively create appropriate monitors in the EKS/K8s ecosystem
- Deploy to EKS/K8s clusters using Terraform and Helm
- Learn and maintain existing infrastructure running under Docker Swarm
- Improve infrastructure health by implementing checks and scripts to correct known issues
- Maintain and develop deployment code
- Automate manual tasks
- Implement/integrate new technologies in Cloud Infrastructure
- Collaborate with Support, Customer Success, Migration, and Professional Services to deliver high-quality SaaS service
- Apply a customer-focused approach when planning deployments/updates
- Work with solutions teams to provide best-in-class service to customers
- Perform RCA and take corrective actions to prevent recurrence
- Create and assign alert-related actions after investigations
- Handle environment-specific support requests
- Identify automation opportunities to improve RCA
MUST HAVES
- 2+ years of professional experience
- Experience working with Datadog
- Hands-on experience as an AWS Cloud Engineer
- Working knowledge of EKS/Terraform/Helm
- Experience with Docker and Docker Swarm
- Understanding of AWS IAM roles and policies
- Experience logging and monitoring AWS resources with CloudWatch
- Experience in a Linux environment
- Proficiency in Bash and/or Python scripting
- Strong understanding of REST APIs
- Experience with monitoring solutions such as Grafana and Prometheus
- Excellent oral and written communication skills
- Customer-facing communication skills to explain issues and RCAs
- Experience in Product/Application Support for SaaS products
- Understanding of APIs, Databases, Systems Architecture, and Design
- Experience designing, implementing, and operating in a DevSecOps environment
- Ability to work independently and in a team
- Technical aptitude and willingness to learn new technologies
- Upper-Intermediate English level
NICE TO HAVE
- Experience with GCP or Azure
- Certifications: AWS Certified DevOps Engineer – Professional or AWS Certified Advanced Networking – Specialty
PERKS AND BENEFITS
- Professional growth: Mentorship, TechTalks, and personalized growth roadmaps
- Competitive compensation: USD-based compensation with budgets for education, fitness, and team activities
- Exciting projects: Modern solutions development for Fortune 500 enterprises and leading brands
- Flextime: Flexible schedule with options to work from home or office
Job function
- IT Services and IT Consulting