Citi’s Operations & Technology organization (O&T) is driving an innovative Cloud First strategy that works to optimize the IT environment, reduce complexity, and implement high degrees of automation to enable more agile application delivery. We aim to give Citi businesses a competitive edge by leveraging cloud-scale architectures and enabling new infrastructure economics. EIO&T operates as a technology company focused on implementing scalable and innovative next-generation technology solutions that will shape the future of global banking.
At Citi, we know how important reliability is for our customers. Our Site Reliability Engineers bring drive and determination every day to make sure our customers get the best possible experience interacting with our technology services. In this role as a Site Reliability Engineer in our Public Cloud group, you will work on complex technical problems, solving for scale, performance, and availability. The ideal candidate has experience gained in a software development environment and a deep appreciation of best practices for designing and deploying fault-tolerant solutions on cloud platforms. You will have worked either closely with or as part of a technology operations team, and you will understand the demands and challenges of that domain.
Responsibilities:
- Engage with systems engineering and application development teams at all stages of the technology life-cycle.
- Be ready and able to express opinions and ideas related to reliability, fault tolerance and operational toil.
- Devise innovative ideas for solving difficult technical problems involving distributed systems, scale, and security, and translate these ideas into designs and implementations.
- Implement best practices for availability, scalability, operational excellence, and efficiency, using data-driven analysis techniques when appropriate.
- Identify, triage, and automate systems.
- Evolve systems by pushing for changes that improve reliability and developer velocity.
- Help develop robust organizational practices around monitoring, alerting, testing, deployment, and incident response.
- Help identify key uptime and performance metrics for production systems and implement metrics-based practices and processes.
- Suggest methods and new technologies for increasing the effectiveness of changes and of general production support improvements.
Basic Qualifications:
- Undergraduate degree in related field or equivalent experience.
- Hands-on experience developing and engineering software using technologies such as Java, Python, C++, or Ruby.
- Experience with modern SDLC tools and the ability to develop and enforce CI/CD practices.
- Expertise with monitoring and observability technologies like Prometheus and Grafana.
- Familiarity with Domain-Driven Design and Event-Driven Architecture.
- Experience working with complex data platforms (relational/NoSQL databases).
- Experience working in a distributed, cloud-based environment using Azure/AWS/GCP (Docker/Kubernetes).
- Experience with Service-Oriented Architecture applications and cloud-based services, preferably AWS.
- Experience working with Linux/UNIX and Docker.
Preferred Qualifications:
- Experience as an AWS Solutions Architect, Cloud Security Certification, and/or OpenStack Administrator Certification a plus. (Other cloud-related certification also a plus.)
- Experience with TDD and automated UI testing frameworks.
- Experience working with any design frameworks.
- Experience with mobile web development.
- Expert-level operations proficiency: in-depth Linux troubleshooting and analysis, TCP/IP networking, load balancers, and DNS.