Your New Role...
The Direct to Consumer (DTC) Group is Discovery Network's technology arm. We
are building a global streaming video platform (OTT) and a suite of
applications to support all of our networks' brands globally. We are
building modern container-based micro-services operated on AWS. Our platform
covers everything from search, catalogue, video transcoding and personalization
to global subscriptions, and much more. We build user experiences ranging from
classic lean-back viewing to interactive learning applications. We build for
connected TVs, web, mobile phones, tablets, and consoles for a large footprint
of WBD-owned networks including Max, Eurosport and discovery+. This is a
growing, global engineering group crucial to WBD’s future.
We are hiring a Senior Software Engineer within the DTC Operational
Engineering group with a focus on Cluster Engineering. The team is dedicated to
providing a secure, efficient and easy-to-use container runtime platform. We
build automation for expanding to new markets, tenants and geographical
regions, bootstrapping a fully featured container runtime ready for back-end
development teams to use.
We team up with other platform teams, manage the lifecycle of critical platform
components and work closely with SRE to improve the reliability and efficiency
of the platform. We provide strategies and processes for lifecycle management
of the container runtime environment. We own critical components, assess
when we should introduce new ones and set the bar for quality. We focus on
reducing security incidents while ensuring cost-effectiveness and supporting
business needs.
Your Role Accountabilities...
Operating Kubernetes clusters (version upgrades and managing critical
Kubernetes components like Karpenter)
Security vulnerability mitigations (rolling out security patches to
Kubernetes nodes) and cost optimizations (scaling of our thousands of
servers running hundreds of in-house built services, supporting millions
of customers across the globe).
As a Senior Software Engineer on the team, you will handle
Kubernetes cluster operational tasks such as upgrading the Kubernetes
version. This includes performing an analysis of what is changing in the
new version and how it affects the workload running on each cluster, e.g.
deprecated k8s APIs and controller compatibility. You’ll use tools to scan
clusters for issues that need remediation, define the remediation actions
and execute them. As part of rolling out upgrades to hundreds of
clusters, some ranging up to a thousand nodes, you will identify
automation opportunities to reduce the amount of toil needed.
To keep the runtime infrastructure secure you will roll out security
patches to Kubernetes nodes (EC2 AMI updates) on a regular basis (PCI
compliance demands new patches to be installed at least every 30 days)
and help improve the automation tooling so that patching eventually runs
on an automated schedule.
In order to keep our runtime cost low, you will take part in cost
optimization efforts, like tweaking the Karpenter node scaling strategies,
increasing the bin-packing efficiency of pods on nodes, ensuring the right
node family type is used (m, c, r, spot and Graviton/ARM) and identifying
over-scaled infrastructure.
To enable faster time to new markets and onboarding of new tenants to
our streaming platform, you’ll work on automating the creation of new
Kubernetes clusters together with bootstrapping of critical platform
capabilities like service mesh, deployment systems and the observability
stack.
Qualifications and Experience...
At least 1-2 years of Kubernetes experience.
At least 1-2 years of AWS experience.
At least 5 years of software development, infrastructure management or
operations experience.
Ability to write code for automation in Python, Bash or Golang.
Experience designing infrastructure CI/CD pipelines, e.g. Jenkins or
GitHub Actions.
Experience with infrastructure as code (IaC), preferably Terraform.
Familiarity with Helm templating for k8s manifests.
Understanding of how GitOps tooling like ArgoCD or Flux works.
Experience rolling out infrastructure changes to production by following
a change management workflow.
Knowledge of which metrics to monitor during a change rollout to identify
problems.
Strong ownership mentality during rollouts: stop or roll back and fix if
problems occur, or escalate to management/on-call if you get stuck.
Strong sense of security, always using least-privilege access and firewall
configurations when needed for maintenance.
Understanding of how running workloads on the Kubernetes clusters may
be affected by cluster changes or node rotations.
Willingness to talk to service development teams and understand their
challenges when they report problems during maintenance windows.
Ability to define and measure KPIs and honor SLAs for infrastructure
maintenance.
Experience with Git and GitHub PR workflows.
Experience working with Agile practices: sprints, epics/stories, Jira.
How We Get Things Done...
This last bit is probably the most important! Here at WBD, our guiding
principles are the core values by which we operate and are central to how we
get things done. You can find them at www.wbd.com/guiding-principles/ along
with some insights from the team on what they mean and how they show up in
their day-to-day. We hope they resonate with you and look forward to discussing
them during your interview.
Championing Inclusion at WBD
Warner Bros. Discovery embraces the opportunity to build a workforce that
reflects the diversity of our society and the world around us. Being an equal
opportunity employer means that we take seriously our responsibility to
consider qualified candidates on the basis of merit, regardless of sex, gender
identity, ethnicity, age, sexual orientation, religion or belief, marital status,
pregnancy, parenthood, disability or any other category protected by law.
If you’re a qualified candidate with a disability and you require adjustments or
accommodations during the job application and/or recruitment process, please
visit our accessibility page for instructions on submitting your request.