Monitoring and Observability
HCLTech Bogota, D.C., Capital District, Colombia
Responsibilities
- Design and implement comprehensive observability strategies and architectures for AWS cloud environments, including metrics, logs, and distributed tracing.
- Configure and maintain observability tools and platforms, ensuring their proper integration with our systems and applications (cloud native and monolithic).
- Develop custom dashboards and alerts to monitor key performance indicators (KPIs) and overall system health.
- Automate the deployment and management of observability infrastructure using Infrastructure as Code (IaC) tools.
- Work closely with development, operations, and engineering teams to understand their observability needs and provide effective solutions.
- Participate in incident resolution, providing observability data and analysis to identify root causes and facilitate recovery.
- Implement and manage observability solutions for containerized environments orchestrated with Amazon Elastic Kubernetes Service (EKS).
- Evaluate and recommend new observability tools and technologies to enhance our capabilities.
- Document observability configurations, processes, and best practices.
- Train and support other teams in the use of observability tools and techniques.
- Stay up to date on the latest trends and best practices in observability and cloud technologies.
Requirements
- Cloud Knowledge and Experience (AWS): Proven experience (minimum 5 years) working with the Amazon Web Services (AWS) cloud platform.
- In-depth knowledge of AWS services relevant to observability, such as CloudWatch (Logs, Metrics, Alarms), X-Ray, and potentially other AWS observability services.
- Understanding of the architecture and design principles of applications in the AWS cloud.
- Infrastructure as Code (IaC): Practical experience in deploying and managing infrastructure using IaC tools such as Terraform or similar.
- Ability to write, maintain, and improve IaC code to automate the creation and configuration of observability infrastructure.
- Significant experience in the deployment, management, and observability of containerized applications using Amazon EKS.
- Deep understanding of Kubernetes concepts and its interaction with AWS.
- Hands-on experience configuring observability tools for Kubernetes environments within EKS, such as Prometheus, Grafana, the ELK Stack (Elasticsearch, Logstash, Kibana), and Jaeger.
- Solid understanding of observability principles and best practices (metrics, logs, distributed tracing).
- Experience with various observability and monitoring tools.
- Ability to develop effective dashboards and alerts based on observability data.
- Capacity to analyze observability data to identify performance and availability issues.
- Ability to develop scripts and automate tasks using languages such as Python or Bash.
- Knowledge of Linux operating systems.
- Familiarity with Agile and DevOps methodologies.
- Strong problem-solving skills and the ability to analyze complex data.
- Excellent communication and collaboration skills.
- Ability to work independently and as part of a team.
Nice to Have
- Relevant AWS certifications (e.g., AWS Certified DevOps Engineer – Professional).
- Experience with other container orchestration platforms (e.g., vanilla Kubernetes).
- Knowledge of Site Reliability Engineering (SRE) principles.
- Experience in implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs).