Site Reliability Engineer with ZPE Systems

Fremont (California)

$160K a year

ZPE Systems solves the networking problems of large enterprises, including 6 of the top 10 global tech giants, to meet increasing demands for infrastructure availability, security, and scalability.

ZPE Systems develops and manufactures secure remote in-band and out-of-band management solutions that let enterprises access, control, manage, and automate critical IT infrastructure from the data center to the edge.

Companies that maintain or operate many data centers, colos, campuses, and branch locations, such as those in healthcare, supply chain, education, government and finance, trust ZPE’s Intel-based serial consoles, services routers, and cloud management software to eliminate human error, close security gaps, and resolve interoperability issues.

About the Position

ZPE Systems is looking for a Site Reliability Engineer (DevOps) to join our team in Fremont, California. You must be proactive, a team player, and passionate about technology. You will participate in a variety of projects with multicultural teams distributed throughout the world. You’ll work directly with developers to help with day-to-day builds, automation, and management of their infrastructure. If you’re looking for an opportunity to work and grow, this might be the right place for you!

Responsibilities:

  • Build and maintain application platforms that are reliable, scalable, and performant
  • Support application development teams in the design and development of new applications, ensuring that the designs are reliable, efficient, and optimized to meet the performance needs of the business
  • Facilitate capacity planning
  • Build and maintain application development systems and processes to facilitate effective change management
  • Automate and standardize repeatable tasks
  • Develop and execute monitoring strategies to analyze performance trends and ensure rapid issue response
  • Respond to performance and availability issues for application platforms, and resolve issues in response to reported incidents
  • Investigate and analyze the root causes of defects and document them in postmortems
  • Provide on-call coverage for supported applications to ensure performance and availability within service levels

Minimum Qualifications:

  • 2+ years of experience developing and operating distributed systems
  • 2+ years of Linux server administration
  • Knowledge of networking principles and how they relate to the architecture and performance of distributed systems
  • Fluent in at least one programming language (Python and Golang preferred)
  • Experience building and maintaining container infrastructure (Docker, Rancher, Kubernetes, etc.)
  • Experience working with tools like Terraform and Ansible
  • Experience administering, monitoring, and performance tuning web application platform technologies
  • Follows best practices
  • Experienced and comfortable working in Git
  • Knowledge of Scrum & Agile methodologies
  • Strong troubleshooting abilities
  • Strong customer service mindset
  • Attention to detail
  • Self-motivated and diligent
  • Eligible to work in the United States
  • Fluent English, written and spoken; excellent communication skills

Preferred Qualifications:

  • Strong Terraform and Ansible experience
  • Strong experience with Grafana, Prometheus, and Loki
  • Strong experience with Kubernetes
  • Experience with AWS, GCP, and Linode
  • Experience building and maintaining CI/CD pipelines
  • Experience with Cassandra, PostgreSQL, MongoDB, Django, and Golang