Senior Site Reliability Engineer at Svitla Systems, Inc.

Job Overview

Company:
Svitla Systems, Inc.

Location:
Colombia

Job Description

The Company

Svitla Systems is a global digital solutions company headquartered in California, with business and development offices throughout the US, Latin America, Europe, and Asia.

Svitla is an outspoken advocate of workplace flexibility, best known for its well-established remote culture, its individual approach to teammates' professional and personal growth, and its trustworthy environment.

Since 2003, Svitla has served a wide range of clients, from innovative start-ups in California to large corporations such as Ingenico, Amplience, InvoiceASAP, and Global Citizen.

At Svitla, developers work with clients' teams directly, building lasting and successful partnerships as a result of seamless integration with on-site processes.

Svitla Systems' global mission is to build a business that contributes to the well-being of our partners, personnel, and their families, improves our communities, and makes a lasting difference in the world.

Join us

The Opportunity

Svitla Systems, Inc. is looking for a Senior Site Reliability Engineer for a full-time position (40 hours per week) in Colombia.

Our client is an American cybersecurity company that specializes in data center and cloud security, focusing on preventing the lateral movement of cyber threats within IT environments through micro-segmentation technology.

The flagship product provides visibility into workload communication across diverse compute environments, automatically generates optimal segmentation policies, and enforces firewall rules at the host level.

It helps organizations reduce cyber risk by containing threats and stopping ransomware before it spreads.

The company pioneered the concept of breach containment, combining real-time threat detection and automated response across hybrid and multi-cloud environments.

Its platform aligns with Zero Trust security principles that treat all network traffic as untrusted until verified.

The company emphasizes AI-driven breach containment, threat intelligence integration, and policy automation to simplify and strengthen security operations. It protects more than 15 of the Fortune 500, 6 of the 10 largest global banks, and 3 of the 5 largest SaaS enterprise companies.

You'll join a 25-person Cloud Operations Group that runs a multi-cloud security SaaS platform serving enterprise customers.

This senior role owns production reliability during the 8:00 AM – 5:00 PM Pacific Time window, leads incidents, sharpens SLOs and runbooks, and mentors a small cohort of SREs. You will partner with Product Service Owners, Platform SRE, and Infra to drive uptime, safe releases, observability quality, and steady reduction of toil across a mixed estate that is moving from legacy to modern patterns.

Technologies:

Cloud: AWS and Azure;
Kubernetes: EKS and AKS;
IaC & CD: Terraform (modules, state hygiene, drift management), Argo CD (sync health, rollbacks);
CI/VCS: Jenkins (light touch), GitHub, or Bitbucket;
Config (legacy): Chef, Puppet, some Ansible;
Ops Tooling: PagerDuty, Observe, Slack, status page platform.

Requirements:

7+ years of experience in Site Reliability Engineering/Production Operations for high-availability SaaS or distributed systems.
Deep knowledge of Kubernetes operations in production (cluster lifecycle, networking, storage, workload debugging, node health, autoscaling).
Strong understanding of AWS and/or Azure, including IAM, VPC/networking, load balancing, and managed K8s services.
Experience with Terraform (authoring and refactoring modules, providers, workspaces, drift remediation, plan/apply safety).
Solid experience with Argo CD and GitOps workflows.
Proven expertise in incident management, crisp written and verbal communication, and stakeholder management during outages.
Solid fundamentals in Linux and networking, plus scripting in Python or similar for automation and diagnostics.
Familiarity with compliance-aware operations (SOC 2 or ISO), evidence capture, and change tracking.

Nice to have:

Experience with multi-cloud service design patterns and customer data sovereignty considerations.
Experience migrating legacy config management to an IaC-first model.
Advanced knowledge of observability design using Observe, Datadog, Prometheus, and Grafana.

Responsibilities:

Serve as incident commander for Sev-1/Sev-2 events, establish comms cadence, publish status-page updates, coordinate SMEs, drive rapid mitigation, and own post-incident reviews with clear corrective actions.
Define and refine service SLOs, error budgets, and alert policies. Reduce MTTA/MTTR through better detection, runbook quality, and automated diagnostics.
Guide safe deploys and rollbacks with Argo CD. Improve pre-deploy checks, health gates, and progressive delivery patterns. Partner with service owners on release readiness.
Raise the signal-to-noise ratio using Observe (or equivalent). Standardize logs, metrics, and traces. Eliminate flapping alerts and codify golden signals.
Forecast capacity, validate autoscaling, enforce redundancy patterns across EKS and AKS, and remove single points of failure, championing chaos and game-day exercises.
Triage and stabilize legacy Chef/Puppet areas while guiding consolidation toward Terraform-managed, GitOps-driven infrastructure.
Author and improve SOPs and runbooks. Raise the bar on change management, deployment checklists, and audit-ready evidence for SOC 2 and ISO.
Coach junior and mid-level SREs on Kubernetes production skills, incident handling, and rigorous root cause analysis.

We Offer:

US and EU projects based on advanced technologies.
Competitive compensation based on skills and experience.
Annual performance appraisals.
Remote-friendly culture and no micromanagement.
Comprehensive private medical insurance.
Bonuses for article writing, public talks, and other activities.
15 vacation days, 10 holidays, and sick leave.
Personalized learning program tailored to your interests and skill development.
Free webinars, meetups, and conferences organized by Svitla.
Fun corporate celebrations and activities.
Awesome team and a friendly, supportive community.

Job Details:
https://co.expertini.com/jobs/job/senior-site-reliability-engineer-colombia-svitla-systems-inc-4796-3322/
