Site Reliability Developer 3
About the Job
Oracle Cloud Infrastructure (OCI) is leading the transformation to cloud-native infrastructure in our hyper-scale, multi-tenant cloud, deployed in more than 30 regions worldwide. OCI is committed to providing the best in cloud services that meet the needs of our customers, who are tackling some of the world's biggest challenges.
We are interested in a Site Reliability Engineer with expertise and passion in DevOps techniques and who cares deeply about ensuring customers have the capacity they need to scale. At Oracle, you can help shape, design, and build innovative new platforms and tools from the ground up. These are exciting times in our space - we are growing fast, still at an early stage of our SRE journey, and working on ambitious new initiatives for automation.
As an SRE, you should have technical depth with the ability to dive deep into any part of the capacity management process from host ingestion to end customer availability, including hypervisor placement optimization, bin packing, maintenance buffering, customer reboots, pool management optimization, limit automation, evacuation and decommissioning, alarming, customer troubleshooting of failed launches, and customer launch success KPI management & improvement. Furthermore, the team will be tasked with maintaining and supporting existing systems and documentation. You should value simplicity and scale, work comfortably in a collaborative, agile environment, and be excited to learn.
Compute capacity management is chartered with ensuring customers receive the right hardware, in the right quantity, at the right time, in the right region, with minimal service disruption and with optimization & automation at scale. We are tasked with ensuring there is always product to sell. The scope of work encompasses good integration not just with engineering development but also with product management and local site support.
Demonstrate expertise in cloud SRE and DevOps
Optimize and automate capacity pool management
Automate host evacuations & decommissions with the goal of quickly shipping hardware between regions as customer demand changes
Manage systems that automatically establish limits for new hardware and regions
Design, manage and optimize maintenance buffer algorithms and systems
Manage compute capacity alarming systems and reporting, ensuring problems are identified and resolved quickly
Troubleshoot customer launch failures with efficient resolution and follow-up root cause analysis with actions to improve cloud reliability
3 years of experience in software development / site reliability engineering / DevOps
BS in Computer Science or a related technical field or equivalent practical experience
2+ years of cloud experience
Sound understanding of technology and programming languages
Experience at an organization with a strong operational/DevOps culture
Possess an automate-everything mindset and care deeply about reliability
Excellent written and verbal communication skills with the ability to present complex information in a clear, concise manner to all audiences
Results driven; thrives in a development environment that is agile, collaborative, and in start-up mode, even when faced with ambiguity
Innovation starts with inclusion at Oracle. We are committed to creating a workplace where all kinds of people can be themselves and do their best work. It's when everyone's voice is heard and valued that we are inspired to go beyond what's been done before. That's why we need people with diverse backgrounds, beliefs, and abilities to help us create the future, and are proud to be an affirmative-action equal opportunity employer.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, age, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.
Work with the Site Reliability Engineering (SRE) team on the shared full stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture. Articulate technical characteristics of services and technology areas and guide development teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate a clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs). Utilize a deep understanding of service topologies and their dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Bring professional curiosity and a desire to develop a deep understanding of services and technologies.
A BS or MS in Computer Science, or equivalent. Knowledge of server hardware and software configuration, networking, standard internet services, scripting languages, cloud computing patterns, technology security and compliance. Experience running large-scale, customer-facing web services. Understanding of load balancing technologies and experience with development in programming languages, databases and big data stores, and container technologies. Work involves defining and documenting the technical architecture of complex and highly scalable products. A minimum of 5 years of experience running large-scale, customer-facing web services.