The Site Reliability Engineer (SRE) mentors the development team in writing code optimized for scale, automated operations, and graceful degradation. The SRE is responsible for operational support of the product, while making sure that operational work takes up no more than 50% of their time, and uses the remaining time to work with the team on implementing new features. The SRE assists the team in implementing the DevOps strategy. This role is an expert in all systems used for monitoring and troubleshooting and assists the team in troubleshooting.
- Responsible for developing the initial on-call playbook and keeping it updated as the product evolves and as features are added, removed, or changed
- Responsible for change management on the team; develops actionable plans to minimize outage risks tied to changes by focusing on:
o Progressive rollouts
o Quick and accurate troubleshooting
o Efficient and reliable rollback of changes when required
- Forecast demand and plan for adequate (optimal) capacity to satisfy natural product usage cycles and any surges in demand, such as marketing or promotional campaigns, without exceeding the computational budget agreed for the team.
- Ensure load-shifting is completed as required to address usage variations and scheduled maintenance windows.
1. Define and standardize the non-functional requirements (NFRs) of the business, including precise NFR definitions for the microservices and APIs.
2. Establish a process for production incident reports and post-mortem reports, and conduct retrospective sessions. Guide ProdOps in resolving critical incidents.
3. Define critical performance KPIs, set alert rules, and roll out monitoring dashboards for production, with timely reporting to stakeholders.
4. Review the performance certification report before application go-live and ensure the performance recommendations are part of the change request process.
5. Create coding best practices for application development from a performance perspective. Actively participate in design and code reviews.
6. Establish application performance benchmarks for the given infrastructure specification and derive the tunable parameters.
7. Engage with the Infra/ProdOps team to forecast capacity requirements.
8. Propose new ideas to create sustainable efficiency
a. Action items with appropriate timelines (short term vs. long term)
Skills and Qualifications:
At least 10 years of hands-on banking-domain experience in application development, DevOps, and ProdOps processes, with end-to-end visibility of the system and 4 years of hands-on experience as an SRE.
Rvin James Murillo Andalan EA License No. 02C3423 Personnel Registration No. R1331697