Site Reliability Engineer - Web and Cloud Technologies (Long Term, Remote Possible)
Job Description:
Key Responsibilities and Duties
- Follow and implement SRE best practices, and troubleshoot application and infrastructure issues.
- Improve the availability, performance, and stability of applications and platforms.
- Build end-to-end monitoring by working closely with application and infrastructure partners.
- Establish SLIs, SLOs, error budgets, and other SRE metrics to measure and improve reliability.
- Maintain an effective knowledge base and runbooks to speed up resolution of production issues.
- Apply an automation-first mindset to identify gaps and manual processes, and automate toil.
- Communicate clearly with stakeholders, both in writing and verbally.
- Continually update personal technical and business knowledge and skills, and mentor others to raise the knowledge and skills of the team.
- Bring a customer-first mindset to resolving issues and building new products.
Required Skills:
- 3+ years of hands-on experience in an application and technical support role in a live production environment, following Development, DevOps, and SRE best practices.
Preferred Skills:
- Bachelor's degree in computer science or equivalent combination of education and experience.
- Financial services experience.
- 4+ years of hands-on experience configuring and monitoring with tools such as Splunk, Dynatrace, ELK, ServiceNow, and JavaScript frameworks.
- Strong automation and coding skills (preferably Python, Java, JavaScript, React).
- Experience using machine learning features to improve and innovate operational processes.