Senior Site Reliability Engineer

London, England, GB
Company: Elsevier
Category: Architecture and Engineering Occupations
Published on 2021-06-15 06:03:56

Elsevier has a complex portfolio of products and services. We have a cloud-native, microservices and container platform strategy that seeks to improve how we deliver quality software to customers at a higher pace. The experience and expertise of our embedded and consulting SREs can assist other engineering groups within the company, as well as Elsevier customers running our software.

The SRE model is an implementation of DevOps with a more specific focus on innovation, application reliability and performance at scale. Fundamental to SRE is a software engineering approach to IT operations. SRE teams use software to manage systems, solve problems, and automate operations tasks.

As an SRE you will

  • Partner with Product Owners to agree availability targets that our customers value the most.

  • Document every action so your findings turn into repeatable actions, and then into automation.

  • Make monitoring and alerting fire on symptoms and not on outages.

  • Debug production issues across services and all levels of the stack including real user experience issues.

  • Be on a standby rotation to respond to availability incidents and provide support for Product teams with customer incidents.

  • Learn from your time on-call to prevent incidents from ever happening.

  • Run our Infrastructure Platform with Terraform and Kubernetes. 

  • Use the Infrastructure Platform to run your product as a first resort and make suggestions to improve the platform as much as possible.

  • Improve the deployment process to make it as boring as possible.

  • Plan the growth of your product's infrastructure.

  • Design, build and maintain paved road modules that allow products to scale.

You may be a fit for this role if you have some of these inclinations

  • Share our operating principles and work in accordance with those principles.

  • You see challenges from a business and customer perspective.

  • You view problems as an opportunity to improve.

  • You are a systems thinker, with a deep understanding of the whole and the parts, how products interconnect, and feedback loops.

  • You make decisions based on data.

  • Have an urge to document all the things so you don't need to learn the same thing twice.

  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.

  • Have an urge for delivering quickly and iterating fast.

  • Have an urge to collaborate and communicate asynchronously.

  • Have strong programming skills - Python and/or Java.

  • Know your way around Linux and the Unix Shell.

  • Know what config management systems like Puppet and Ansible are used for.

  • Have experience with AWS, Jenkins, Docker, Kubernetes, Terraform, or similar technologies.

  • Have experience with branching and merging strategies and GitHub.

Projects you could work on

  • Coding infrastructure automation with Terraform, Jenkins and similar tools.

  • Improving New Relic monitoring or building new metrics.

  • Consulting with product teams (e.g. Data Platform, ID+) on improving observability and reliability.

  • Helping the Infrastructure Platform team deploy and fix new components.

  • Plan, prepare for, and execute the migration of products onto the Infrastructure Platform.

  • Develop relationships with your Products group. Define Critical User Journeys, SLIs and SLOs that help define customer facing SLAs.

  • Implement Chaos Engineering across your product.

  • Promote and maintain Infrastructure Platform components.

Areas of expertise/contribution for levelling

    Technical:

  • Implement "Infrastructure as Code" using Terraform and CI/CD for automation

  • Load balancing the application including WAFs and CDN

  • Kubernetes and containerising our products

  • Administer high-availability database clusters

  • Monitoring, alerting and automated actions using New Relic, OpsGenie and Status Page

  • Creation of actionable dashboards and reports from multiple data sources

  • Logging infrastructure

  • Backend storage and database management and scaling

  • Data Optimisation (index and storage optimisation, search, event streaming)

  • Disaster Recovery and High Availability strategies

  • Contributing to code, including innersourcing via GitHub

    Execution:

  • Team organisation and planning

  • Issue, Epic, OKR leadership and completion

  • Technical Project Reviews (TPR)

  • Operational Support and Release Reviews

    Collaboration and Communication:

  • Creating blog posts, newsletters and other training material

  • Completing Blameless Postmortems and Root Cause Analysis (RCA) investigations

  • Contributions to handbook, runbooks, general documentation

  • Leading and contributing to designs for issues, epics, OKRs

  • Improving team practices in handoffs of work and incidents

  • Hosting collaborative workshops on SRE practices and encouraging feedback

    Influence and Maturity:

  • Involvement in the hiring process: reviewing questionnaires, taking part in interviews, qualifying candidates

  • Knowledge sharing, mentoring

  • Accountability, self-awareness, handling conflict in the team and receiving feedback

  • Maintaining good relationships with other engineering teams in Elsevier that help improve their products
