Principal SRE | San Francisco Bay Area
Step into a principal-level, hands-on SRE opportunity to tackle high-impact reliability challenges at massive scale using modern cloud and observability tech.
Principal Site Reliability Engineer - Operations
Location/Time zone requirements: Must be based in the San Francisco Bay Area, with weekly visits to the client’s headquarters.
About Virtasant
Virtasant is a fast-growing global consultancy transforming how technology services are delivered. We are a diverse team of cloud experts, builders, and operators. Since 2006, we’ve helped large enterprises thrive in the public cloud — optimizing cost, scaling infrastructure, migrating legacy systems, and building cloud-native products.
We take an AI-first mindset and are big on FinOps, software engineering, product development, and technology operations. Our outcome-driven model helps enterprises solve complexity in the cloud, build efficient systems, and unlock real business value.
About the Role
We are looking for a Principal-Level Site Reliability Engineer (Operations) to provide hands-on, day-to-day operational support for one of our largest global clients. This role is not a leadership or people-management position — it is a senior individual contributor SRE role focused on incident response, system diagnostics, dashboard monitoring, operational maintenance, and ensuring platform reliability.
You will be directly responsible for keeping critical systems healthy, resolving incidents, improving operational workflows, and working with engineering teams to maintain high reliability across large-scale distributed systems.
If you’re a senior SRE who prefers hands-on problem solving in production systems to managing teams or driving strategy, this is the right role.
What You Will Do
Operational SRE Responsibilities
- Monitor dashboards, alerts, and system health in real time.
- Respond to incidents quickly and decisively, driving issues to resolution.
- Perform root-cause analysis and contribute to post-incident reviews.
- Troubleshoot complex system and infrastructure issues across distributed environments.
- Maintain and improve runbooks, playbooks, and operational documentation.
- Support and enhance the observability tooling used for metrics, logs, and alerting.
- Work cross-functionally with engineering teams to escalate system-level issues when required.
Systems Reliability & Maintenance
- Run routine operational checks to ensure platform stability.
- Tune alerts, update dashboards, and ensure monitoring accuracy.
- Identify recurring operational issues and recommend improvements.
- Implement small automation and scripting solutions to improve operational workflows.
- Keep services running smoothly through proactive maintenance.
Collaboration & Communication
- Partner with Engineering, SRE, and Product teams to ensure transparent communication during incidents.
- Provide clear, concise updates and documentation for operational work.
- Participate in shift patterns or rotational incident coverage depending on client needs.
What You Bring
Essential Experience
- 5–10+ years in SRE, Production Operations, or Infrastructure Engineering roles.
- Strong hands-on experience troubleshooting distributed systems in production.
- Proficiency in Linux fundamentals, including process management, networking, storage, and diagnostics.
- Solid understanding of cloud-native architectures, containers, and modern infrastructure tooling.
- Experience with:
  - Monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog)
  - Incident management workflows
  - Root-cause analysis / postmortems
  - CI/CD operational processes
Technical Skills
- Strong Linux debugging and performance troubleshooting skills.
- Familiarity with Kubernetes, containers, or cloud-native runtime environments.
- Ability to write or modify scripts (Python, Bash, or similar) for operational automation.
- Hands-on experience with logs, metrics, traces, and alert lifecycle management.
Soft Skills
- Calm, structured decision-making under pressure.
- Excellent communication — clear, concise, and reliable.
- Strong attention to detail and consistency in documentation.
- A proactive, ownership-driven mindset for reliability and operations.
Why Join Virtasant
- Build and lead a new SRE-focused customer success function from day one.
- Work at the intersection of reliability engineering, customer engagement, and cloud transformation.
- Partner with global enterprises on cutting-edge cloud and DevOps programs.
- Join a global, remote-first consultancy with 4,000+ experts across 130 countries.
- Thrive in a culture that values autonomy, agility, and innovation.
Our team: Virtasant - Consulting
Locations: HQ
Remote status: Hybrid