Elsevier has a complex portfolio of products and services. We have a cloud-native, microservices and container platform strategy that aims to deliver quality software to customers at a higher pace. The experience and expertise of our embedded and consulting SREs can assist other engineering groups within the company, as well as Elsevier customers running our software.
The SRE model is a specific implementation of DevOps with a sharper focus on innovation, application reliability and performance at scale. Fundamental to SRE is a software engineering approach to IT operations. SRE teams use software to manage systems, solve problems, and automate operations tasks.
As an SRE you will
Partner with Product Owners to agree availability targets that reflect what our customers value the most.
Document every action so your findings turn into repeatable actions, and then into automation.
Make monitoring and alerting fire on symptoms and not on outages.
Debug production issues across services and all levels of the stack including real user experience issues.
Be on a standby rotation to respond to availability incidents and provide support for Product teams with customer incidents.
Learn from your time on-call to prevent incidents from ever happening.
Run our Infrastructure Platform with Terraform and Kubernetes.
Use the Infrastructure Platform to run your product as a first resort and make suggestions to improve the platform as much as possible.
Improve the deployment process to make it as boring as possible.
Plan the growth of your product's infrastructure.
Design, build and maintain paved road modules that allow products to scale.
You may be a fit for this role if you have some of these inclinations
Share our operating principles and work in accordance with them.
You see challenges from a business and customer perspective.
You view problems as an opportunity to improve.
You are a systems thinker with a deep understanding of the whole and the parts, how products interconnect, and feedback loops.
You make decisions based on data.
Have an urge to document all the things so you don't need to learn the same thing twice.
Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
Have an urge for delivering quickly and iterating fast.
Have an urge to collaborate and communicate asynchronously.
Have strong programming skills - Python and/or Java.
Know your way around Linux and the Unix shell.
Know what config management systems like Puppet and Ansible are used for.
Have experience with AWS, Jenkins, Docker, Kubernetes, Terraform, or similar technologies.
Have experience with branching and merging strategies and GitHub.
Projects you could work on
Coding infrastructure automation with Terraform, Jenkins and similar tools.
Improving New Relic monitoring or building new metrics.
Consulting with product teams (e.g. Data Platform, ID+) on improving observability and reliability.
Helping the Infrastructure Platform team deploy and fix new components.
Plan, prepare for, and execute the migration of products onto the Infrastructure Platform.
Develop relationships with your Products group. Define Critical User Journeys, SLIs and SLOs that inform customer-facing SLAs.
Implement Chaos Engineering across your product.
Promote and maintain Infrastructure Platform components.
Areas of expertise/contribution for levelling
Implement "Infrastructure as Code" using Terraform and CI/CD for automation
Load balancing the application, including WAFs and CDNs
Kubernetes and containerising our products
Administer high-availability database clusters
Monitoring, alerting and automated actions using New Relic, OpsGenie and Status Page
Creation of actionable dashboards and reports from multiple data sources
Backend storage and database management and scaling
Data Optimisation (index and storage optimisation, search, event streaming)
Disaster Recovery and High Availability strategies
Contributing to code, including innersourcing via GitHub
Team organisation and planning
Issue, Epic, OKR leadership and completion
Technical Project Reviews (TPR)
Operational Support and Release Reviews
Collaboration and Communication:
Creating blog posts, newsletters and other training material
Completing Blameless Postmortems and Root Cause Analysis (RCA) investigations
Contributions to handbook, runbooks, general documentation
Leading and contributing to designs for issues, epics, OKRs
Improving team practices in handoffs of work and incidents
Hosting collaborative workshops on SRE practices and encouraging feedback
Influence and Maturity:
Involvement in the hiring process: reviewing questionnaires, taking part in interviews, qualifying candidates
Knowledge sharing, mentoring
Accountability, self-awareness, handling conflict in the team and receiving feedback
Maintaining good relationships with other engineering teams in Elsevier to help improve their products