Senior Site Reliability Engineer with RudderStack

Remote (US)

$200K - $240K a year

At RudderStack, we are redefining enterprise-scale data collection and routing. We are building a customer data platform (CDP) on the customer's own data warehouse. Our open-source, developer-first approach is the first of its kind. We understand the outsized impact customer data has on businesses, and we understand the challenges and pain points. We are looking to solve the customer data management problem in enterprises, once and for all, in a secure, compliant and cost-effective way.

RudderStack collects data from 30+ sources, transforms events on the fly, and routes them to 150 different marketing, sales, product, and analytics applications, all with one snippet of code.

We're backed by Insight Partners, Kleiner Perkins, and S28 and have raised a total of $82 million in funding. Our customers include Stripe, Crate & Barrel, Acorns, Hinge, and Priceline. We process critical customer data for some of the top companies around the world, and we are looking for ambitious individuals to join our team and help shape the future of our product.

About the Team:

We are a high-performance team of data, security, and marketing experts, who have spent a lot of time working with large-scale data at enterprises. We are looking to solve the customer data management problem once and for all in a secure, compliant, and cost-effective way.

About the Role:

Our roles are remote-first and can be based anywhere in the US.

Here are examples of things we've worked on:

  • Built and maintained a Kubernetes platform to deploy all our applications with high availability
  • Built a Kubernetes operator to automate hundreds of deployments
  • Managed hundreds of PostgreSQL instances with HA for our deployments
  • Provisioned and managed air-gapped on-premises deployments in diverse environments
  • Managed a multi-region, multi-cluster environment with hundreds of customer deployments in single-tenant and multi-tenant models
  • Implemented complete infrastructure as code, enforced using a GitOps model
  • Wrote Python scripts to migrate complex, highly available services
  • Worked on compliance (e.g., SOC 2 Type 2, HIPAA), security, scalability, and more to deliver top-class, secure software

How we achieve results:

  • Empathy for the problems encountered by our customers
  • Collaboration with engineering teams to achieve results
  • Deep care for the quality of your own and the team's code
  • Curiosity and understanding for investigating causes and finding effective solutions
  • An output-driven mindset that provides value to our customers in a significant, measurable, and positive way
  • Focus on writing testable, performant, bug-free code that provides the right solutions to the problems

What you'll do:

  • Monitor and continually improve the capacity of our production environment
  • Gain a deeper understanding of RudderStack infrastructure and help debug incidents
  • Proactively build software to help operations and support teams
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Troubleshoot support escalations
  • Optimize the on-call rotation and process
  • Document knowledge and build runbooks
  • Conduct post-incident reviews

Examples of desirable skills, knowledge and experience:

  • 6+ years of experience as an SRE
  • A Bachelor's or Master's degree in Computer Science, or equivalent experience, is required
  • Demonstrated Linux experience
  • Excellent debugging skills
  • Scripting and infrastructure automation
  • Familiarity with distributed systems design patterns using tools such as Kubernetes
  • Familiarity with AWS, Azure, or Google Cloud
  • Excellent verbal and written communication skills