EPAM is a leading global provider of digital platform engineering and development services.
We are committed to having a positive impact on our customers, our employees, and our communities.
We embrace a dynamic and inclusive culture.
Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow.
No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Lead Operational Intelligence Developer
We are looking for a highly experienced and dynamic Lead Operational Intelligence Developer to join our team.
In this role, you will lead the development, maintenance, and enhancement of our Elastic & Observability Platform deployed across GCP and Elastic Cloud.
You will drive strategic initiatives, guide a high-performing technical team, and ensure platform reliability while fostering innovation and enabling self-service capabilities for platform consumers.
This position also involves participating in an on-call rotation to oversee platform health and functionality.
Responsibilities
- Oversee the availability, functionality, performance, and security of observability and search platforms to meet or exceed business SLAs
- Provide technical leadership during complex incidents and escalate issues promptly during on-call periods
- Develop and maintain comprehensive platform documentation, standard operating procedures, and knowledge-sharing resources
- Collaborate with cross-functional teams, stakeholders, and vendors to oversee operational requirements, drive strategic initiatives, and manage installations, troubleshooting, and upgrades
- Lead the enhancement of platform features and self-service capabilities, including advanced Elastic Synthetics and chargeback automation
- Architect and implement proof-of-concepts for platform innovation, such as AI-driven observability, advanced data processing models, or Kubernetes-based platform migration
- Supervise the building, deployment, and maintenance of Elastic clusters using Infrastructure-as-Code (IaC) tools like Terraform and Ansible, while mentoring team members on best practices
- Oversee platform lifecycle management activities, including component upgrades, capacity planning, cost optimization, and evolving compliance requirements
- Continuously assess and fine-tune ELK stack performance, including ingestion, indexing, and query optimization for large-scale environments
- Establish and enhance comprehensive alerting and incident management workflows, integrating sophisticated monitoring tools such as Kibana Rules, Watchers, and PagerDuty
- Supervise the ingestion, enrichment, backup, and restoration of large-scale platform data while optimizing data workflows
- Lead and plan critical operational events such as SSL certificate rotations, cluster migrations, or scalability optimization projects
Requirements
- 5+ years of experience in Operational Intelligence, with a proven track record of leadership and technical expertise in managing large-scale observability platforms
- Demonstrated ability to architect and manage Elastic clusters in complex, multi-cloud environments
- In-depth knowledge of Elastic Stack components, including advanced configurations of Elasticsearch, Kibana, and Logstash
- Advanced proficiency in Infrastructure-as-Code (IaC) tools like Terraform and Ansible, with demonstrated flexibility in adapting to other tools like Jenkins CI or GitOps frameworks
- Advanced Python scripting skills for automation, data processing, and extending platform interoperability
- Deep understanding of incident management frameworks and workflows with tools like PagerDuty, Uptrends, and other enterprise monitoring solutions
- Proven expertise in troubleshooting and resolving complex platform challenges under tight SLAs
- Strong capability in managing and scaling fault-tolerant platforms while ensuring performance, security, and compliance across large distributed systems
- Demonstrated ability to mentor and grow team members, manage priorities, and act as a bridge between technical and non-technical teams
- Excellent command of English (B2+ level), both written and spoken, with a strong emphasis on technical communication skills
Nice to have
- Expertise in scripting with Groovy or experience in advanced Linux administration to optimize platform processes
- Track record of optimizing observability workflows with additional integrations or customizations in tools like Uptrends, PagerDuty, or Elastic features
- Hands-on experience with advanced Elastic Synthetics setups for robust monitoring and custom synthetic testing frameworks
- Experience driving strategic initiatives such as modernization through AI tooling, cloud-native transitions, or cost-saving observability optimizations
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Job function
- Information Technology, Engineering, and Business Development
Industries
- Software Development, IT Services and IT Consulting, and Venture Capital and Private Equity Principals