ASOS Tech Podcast

Episode 3.5 – Journey To Kubernetes @ ASOS Tech

Lewis talks to Charlie, Tom, Shay and Phil about our journey from Azure Cloud Services to running on Azure Kubernetes Service (AKS), including all the challenges and learnings along the way.

Nov 28, 2023

You may have shopped on ASOS, now meet the people behind the tech.

In this episode of the ASOS Tech Podcast, Lewis talks to Charlie, Tom, Shay and Phil about our journey from Azure Cloud Services to running our services on Azure Kubernetes Service (AKS), including all the challenges and learnings along the way.

Featuring...

  • Charlie Hoyland (he/him) - Lead Software Engineer
  • Tom Scott (he/him) - Principal Software Engineer
  • Shay Emambaccus (he/him) - Senior Software Engineer
  • Phil Clarke (he/him) - Senior Platform Engineer
  • Lewis Holmes (he/him) - Principal Software Engineer

Show Notes

Credits

  • Producer: Si Jobling
  • Editor: Lewis Holmes
  • Reviewers: Si Jobling & Paul Turner

Check out our open roles in ASOS Tech at https://link.asos.com/tech-pod-jobs and find more content about the work we do on our Tech Blog: http://asos.tech

Transcript
Speaker A:

Perfect.

Speaker B:

Welcome to the ASOS Tech Podcast, where we're going to be sharing what it's like to work at an online destination for fashion-loving twenty-somethings around the world. You may have bought some clothes from us, but have you ever wondered what happens behind the screens? Hi, I'm Lewis (he/him), and I'm a principal software engineer at ASOS. In this episode, we're talking all about how we have migrated many of our services over to running on Kubernetes over the past three or four years. When we're talking about Kubernetes, we're actually talking about Microsoft Azure's managed Kubernetes offering, known as Azure Kubernetes Service, or AKS, which is widely used across ASOS Tech. For this episode, I've got some fantastic guests from ASOS Tech. They are...

Speaker C:

Hi.

Speaker D:

I'm Shazad, also known as Shay (he/him), and I'm a senior software engineer on the Browse platform at ASOS.

Speaker E:

Hi, I'm Tom Scott. I'm a principal software engineer for the Commerce digital tech domain here at ASOS. I've been at ASOS for eight years, and I'm also heavily involved in a lot of the AKS work here.

Speaker C:

Hi, I'm Phil (he/him), and I'm a senior platform engineer in the AKS maintainers team at ASOS, and I've been at ASOS for a decade now.

Speaker A:

Hi, I'm Charlie Hoyland (he/him). I'm a lead software engineer over in the Order Management platform. I've been at ASOS for just over five years now.

Speaker B:

Great, thanks everyone. Great to have you here today. So, as anyone who's heard the podcast before will know, we always like to do an icebreaker just to get everyone feeling relaxed. So today's one is going to be: what is one gadget or tool that you couldn't live without? Now, I'll go first, just to kick it off. So for me, I'm a big coffee lover, and there's a thing called an AeroPress, which is just under 30 pounds. A little gadget, a little plastic plunger that makes espressos. And it's one of the best things I've ever bought in my life. I take it everywhere, it's so small. I take it on holiday with me, buy some local beans or ground coffee, and that's my coffee fix in the mornings. So who wants to go next?

Speaker D:

I'll go next. I cannot live without my air fryer. I know it sounds weird, but I love my cooking. I recently got a new one for my birthday, so I've been upgrading from a five-liter one to an eleven-liter one. I've got a lot more things to cook in that now, so yeah, eleven liters.

Speaker B:

Yeah. I'm trying to imagine the size of that. It sounds enormous.

Speaker D:

It is huge. I've got two baskets in this one.

Speaker B:

Now, you can't put, like, a giant whole chicken in it or something?

Speaker D:

I can, actually. Yeah, you can. Two whole turkeys, actually.

Speaker B:

Wow, that sounds insane. I love that. Yeah, I got an air fryer recently. They are actually amazing, a game changer.

Speaker E:

I think my one is very similar to yours, Lewis. And that's a coffee grinder, actually. It's not very portable, though, but it's the base for any good coffee.

Speaker B:

Yeah.

Speaker E:

And I think it's something ridiculous, like there are, I think, 235 different settings on it. So no matter what your blends or your beans, you can always customize it exactly how you want it.

Speaker A:

Fantastic.

Speaker C:

I took the tool-gadget thing very literally and went for a Leatherman multi-tool. It can take most things apart. It has a pair of pliers, multiple screwdriver tips, and it can also put things back together again afterwards, which is always a benefit.

Speaker B:

Nice. Very practical. Yeah.

Speaker A:

My answer to this four months ago would have been very different to what it is now, but a white noise machine for my four-month-old daughter, assisting her sleeping, is something that I couldn't live without at the moment. I was also going to say coffee machine; that's seen an uptick in use since her birth.

Speaker B:

So seeing a theme of coffee and caffeine here. Nice one. All right, thanks everyone. That was good.

Speaker A:

Some nice, interesting answers.

Speaker B:

Okay, so let's kick off. We're talking all about Azure Kubernetes Service, AKS, and I guess if we take it back to where this all started for ASOS Tech, it was probably, imagine, about three or four years ago, maybe a little bit longer. ASOS Tech is a heavy user of Cloud Services in Azure, and I think around 2021 we were notified that Cloud Services Classic was going to be retired in August 2024, so next year. And that was one of the big things that was happening at the time, where we were looking for new compute options. So maybe we could just kick off there and talk about where this all came from, why we decided to look at Kubernetes as a compute option.

Speaker E:

Yeah, I think if we take it back further in terms of the wider ASOS landscape, we had a migration from an on-premise data center into Azure and, as Lewis said, at that time a lot of migrations to Azure Cloud Services. That was good at the time, but over time it also presented a lot of issues. Things like low compute utilization: teams would often have their services scaled quite highly, maybe only utilizing 20 or 30% of the CPU, so it actually turned out to be relatively expensive. Along with all the other issues that we ran into over time, such as slow scaling and running Windows VMs. And of course, as we moved on, it didn't support things like containerization, which is now standard across a lot of compute options.

Speaker B:

Yes, I think at that time a lot of different things came together. Right, so .NET Core came out, which was cross-platform and supported running on Linux. So really, then we had the option to run our .NET applications within containers on Linux, which is something we realized very early on that we wanted to do.

Speaker A:

Right, yeah. From an engineering perspective, I think for us, all of our compute was Cloud Services, and as engineers in sort of 2019, for everything greenfield we really, really wanted to be using .NET Core, which was completely at loggerheads with Cloud Services. To get .NET Core running in Cloud Services, we had to do a huge amount of bootstrapping. It just felt pretty dirty. So we were really, really looking for another option at that time. And thankfully for us, it kind of strategically aligned with the way that ASOS was going and investigating the use of AKS.

Speaker E:

I think as well, our strategy from a compute perspective was to get ourselves to that new containerized state so that we were actually a bit abstracted away from the actual compute, whether that was AKS or any other kind of containerization.

Speaker B:

Yeah, great.

Speaker E:

One of the other alternatives we did explore was Service Fabric. And although this wasn't containerized, it initially offered some solution to the problems we were having with Cloud Services, such as low compute utilization: now we could run multiple services on a single node, whereas before on Cloud Services we couldn't. But again, it wasn't the best solution. Reliability was an issue with both Cloud Services and Service Fabric, and that didn't really change until we got on to Kubernetes. Also, internally, we didn't have a great story around how we provisioned our Service Fabric clusters, and it wasn't until the AKS provisioning pipeline, which we'll talk about later, that we actually gained that stability.

Speaker B:

Yeah. So we were kind of thinking about where we were going to go moving forward. When we started to think about AKS and that became available, what was our thought process? How did we take those learnings as we started to embark on AKS?

Speaker E:

It's a really good question, Lewis, and I think actually the answer is that it was a really steep learning curve for the first teams who did try to migrate a service to AKS. They ran into several issues that they worked with Microsoft and the product teams to address and overcome, and these solutions actually made it into GA for all AKS consumers. I remember there were issues at the start around things like DNS and disk utilization as well. Those were the two big areas that caused issues that maybe didn't manifest themselves in an obvious way.

Speaker B:

Yeah, and I guess maybe we can just talk a bit about that. Phil, you were there, I don't know if you can take this, or maybe Tom could add in. So that whole thing about how we created this central pipeline: what the goals were, what we were trying to do better than what we'd done before, maybe with the failures in Service Fabric, I guess.

Speaker C:

I guess the question would be: why do we need a core pipeline? You can create a cluster by pressing a few buttons in the Azure Portal, so what is the value in having a sort of engineered tool set to create your clusters for you? Take even the most basic thing, naming conventions: if you've got a few dozen teams potentially using Kubernetes, you may end up with a lot of clusters called "my cluster one" across your estate.

Speaker B:

Engineers love naming. Or "Dave's cluster", or "test cluster two".

Speaker C:

"Test cluster", another classic name. So standardization goes from sort of basic naming right across to networking and security setup. And besides the cluster itself, there are also other supporting resources that are generally needed alongside a cluster. So you've got things like networking. Most clusters will require an ingress controller. If you've got an ingress, you generally need a front-end certificate. However, a certificate is going to get rotated. So there's a whole wealth of areas of work, besides actually provisioning the cluster itself, that benefit from having a standardized pipeline and framework to spin up these clusters.

Speaker E:

Yeah, it's much more than just creating your cluster. It's all the things around that cluster. It's the software that goes on: security scanning, internal certificate management. Yeah, like you said, those front-end TLS certificates and how they get rotated. That would all have been impossible if we didn't have a standardized pipeline. And that's one of the key differences between our earlier iterations around Service Fabric and AKS.

Speaker B:

One thing I remember at the time is, I don't think we had really good ownership around that Service Fabric provisioning. And I think one of the great things we did in this situation with AKS is we had a small team that actually owned this pipeline, treating it more like a product, really pushing it forward and dedicating their time to making it better and supporting it as well. If anyone listening has used Kubernetes, you'll have realized that one of the great benefits of Kubernetes is there are so many different things you can configure on a cluster, which is really, really great, but can also be quite daunting to understand how to set up. Our core pipeline that configures the cluster does a lot of that heavy lifting for you, takes some of the best things from the community out there, some of the tools that we use, and sets them up for you on the cluster, which is just great.

Speaker E:

And it's also about the scale of it as well, right? I think right now we've got over 100 clusters running, split between production and non-production. Trying to manage those and to have some uniformity would have been impossible if we hadn't had this pipeline.

Speaker B:

Yes. So we've talked about this central, what we call core provisioning pipeline for AKS. So could you just explain a little bit about what that's like to use as an engineer?

Speaker C:

The pipeline is a bit of tooling constructed out of various languages and tool sets, but for a user, essentially, you run the pipeline against a YAML configuration file and it will build a cluster. We provide templates for the environment config; a team wanting to build a cluster will fill in the bits that they're interested in. So, for example, selecting the ingress controller they want to use, the VM size they want to use for their node pools, that kind of thing, as well as the basics, like which subscription they want to build the cluster in. Then it's a matter of running the pipeline. The pipeline runs in a container, and that will go and build a cluster. So one of the benefits of configuring clusters via YAML is that your cluster configuration is source-controlled.
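
For illustration, a config file for a pipeline like this might look something like the sketch below. It's a minimal sketch only; the field names and values are hypothetical assumptions, not ASOS's actual schema.

```yaml
# Hypothetical cluster definition consumed by a core provisioning pipeline.
# All keys and values here are illustrative, not ASOS's real schema.
cluster:
  name: browse-prod-weu-01        # standardized naming convention
  subscription: commerce-prod     # which Azure subscription to build in
  region: westeurope
nodePools:
  - name: default
    vmSize: Standard_D4s_v5       # VM size chosen by the team
    minCount: 3                   # cluster autoscaler bounds
    maxCount: 10
ingress:
  controller: nginx               # selected ingress controller
  certificateKeyVault: frontend-certs   # front-end TLS cert source, rotated centrally
```

Because the whole definition lives in a file like this, the cluster configuration can be reviewed, diffed and source-controlled like any other code.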

Speaker B:

It can also deprovision clusters as well?

Speaker C:

Yes, yes. So deprovisioning is another important aspect. Clusters come and go, and as we've seen across our Azure subscriptions, they are a little bit litter-prone. So another feature of the AKS pipeline is that it ensures it cleans up after itself: it doesn't just delete the resources, it undoes any bits of plumbing around Azure Active Directory it may have put into place.

Speaker D:

Yeah, from the developer's point of view, the core pipeline is really useful. It's as simple as just copying a template and turning on what you need. So it's basically just like a Build-A-Bear for our clusters: we put in what we want and we get what we want, really.

Speaker B:

Out of that Build-A-Bear.

Speaker D:

Love it.

Speaker A:

From a kind of less technical perspective, I guess, it's not the same in every sort of team or platform at ASOS, but in the platform I was in at the time, we didn't have a platform engineer, and we obviously had a huge ambition to move to AKS. But the idea of spinning up a cluster, let alone knowing how to configure it, was quite a daunting one for us. So the abstraction of that behind a centrally owned pipeline was particularly appealing and allowed us to focus on a lot of the other challenges that we had, such as deployment pipelines and migrations to .NET Core, et cetera. Things that we felt more confident we could tackle as engineers.

Speaker B:

I think one other thing at the time as well that we haven't talked about yet was training, actually. There was a lot of effort internally to provide some training around not just Kubernetes, actually, but, as we talked about, .NET Core and containerization. This was what we called at the time our next-gen compute: we were moving away from VMs and Windows .NET Framework to, like we said, .NET Core containers running on Linux, on AKS. So there was definitely training available at the time, and I think it was very strongly encouraged that if a team was going to move over to AKS, they went on all this training, really to make sure they understood the Kubernetes landscape a bit better, because it is a whole ecosystem to learn, and that would really help them get going and save them a lot of time.

Speaker A:

Yeah, I think particularly the initial versions of that training. I know we worked very closely with Microsoft to put it together, didn't we? And I think it was important, not necessarily from a hands-on perspective, but conceptually, as you say, Lewis, it was particularly important so people could understand what was going on on their cluster. Coming from the world of Cloud Services was a completely different ballgame, and just having that understanding of the core concepts of Kubernetes, and AKS to an extent, was absolutely vital.

Speaker E:

I think the speed of adoption has also increased because of the training as well, and that allowed teams to gain experience in migrating and running their services over time. Once you've done a couple, you kind of know the pattern. And actually teams have really sped up, where we have some teams who are completely on AKS, who maybe have 40-plus applications on their clusters, through to other teams who are still in the migration process.

Speaker B:

So we've talked about how we were moving over to AKS and had this core pipeline set up. So when we were moving our applications over and building new services, we were now using .NET Core, we were using containers, probably for the first time. What was that like?

Speaker D:

It was quite challenging, actually, taking the first application from being deployed on Service Fabric to AKS. It's understanding how to containerize an application, understanding how we deploy it, and how we test the application as well. And it's also: how do we switch from deployment on one environment, such as Service Fabric, onto AKS? So there were a lot of learnings in there.

Speaker E:

And I think the journey actually starts before that as well, because if you're coming from the .NET Framework world, you need to get all the dependencies in your project over to, say, .NET Standard so that they can run on .NET Core. And Charlie alluded to it earlier, about bootstrapping in Cloud Services. So in some cases, in some of the migrations that I've done with my teams, we've actually run ASP.NET Core compiled to .NET Framework on Cloud Services. That really eased the process as much as possible, because effectively you've got the same code base, with a few minor differences, that can be compiled for both the Cloud Service and for AKS, for example. And that means that your migration path is a lot simpler, because you can maintain both whilst you're doing this. And you can also verify that the version on Cloud Services, for example, is the same in terms of inputs and outputs as it is on Kubernetes.

Speaker B:

That's interesting, actually. For existing services, were there any approaches that you found worked best to migrate your consumers over? Were you doing this internally with some traffic load balancing, or were you spinning up a separate version and asking consumers to just start pointing at the new version when they were ready?

Speaker E:

Yeah, really interesting question. We took the view that our consumers shouldn't really know anything about our move to AKS, because if they do, then they are tied to a particular part of that implementation that they shouldn't be. So at ASOS, we use Akamai for our CDN, and that usually would route to a Traffic Manager. So our strategy was to update Traffic Manager to send a proportion of the traffic to AKS and the rest to the previous Cloud Service. And over time, in some cases over a couple of days, in other cases just over a day, we would migrate traffic and send more and more to AKS until we were satisfied that the migration was successful.
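
As a sketch of what that weighted shift could look like (the exact mechanism isn't described in the episode), Azure Traffic Manager supports weighted routing on its endpoints, which a pipeline step could adjust with the Azure CLI. The profile, endpoint and resource group names here are hypothetical:

```yaml
# Hypothetical Azure Pipelines step nudging 10% of traffic onto AKS.
# All resource names are made up for illustration.
steps:
  - script: |
      az network traffic-manager endpoint update \
        --resource-group my-api-rg --profile-name my-api-tm \
        --name cloud-service-endpoint --type azureEndpoints --weight 90
      az network traffic-manager endpoint update \
        --resource-group my-api-rg --profile-name my-api-tm \
        --name aks-endpoint --type azureEndpoints --weight 10
    displayName: Shift 10% of traffic to AKS
```

Re-running the step with different weights ratchets traffic over until the old endpoint can be retired.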

Speaker A:

Yeah, it's an approach that we adopted as well, utilizing essentially the canary aspects of Traffic Manager to migrate traffic over. And I guess the key for us with all of that was visibility, particularly around which instance of our application was serving a request, so that we could very, very quickly determine the health of the application in situ and roll back or revert the canary if we needed to.

Speaker E:

I was also going to mention, it's come up a few times, front-end certs as well. That was always something that would go wrong in Cloud Services and Service Fabric, because ultimately it was a lot of manual change, and through AKS we've been able to automate that completely. So we have a Key Vault where those certificates are stored, they are synced to all the clusters, and they're used as part of the ingress. So whenever a certificate needs to be rotated, teams actually don't need to do anything. And the last couple of times, when maybe somebody has asked, "did your certificate rotate?", they weren't even aware that it had. Which is how it should be.
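
One common way to wire Key Vault certificates into clusters like this is the Secrets Store CSI driver's Azure Key Vault provider; whether ASOS uses exactly this isn't stated, so treat the following as an illustrative sketch with placeholder names:

```yaml
# Syncs a TLS cert from Azure Key Vault into a Kubernetes secret that an
# ingress can reference. Vault, object and tenant values are placeholders.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: frontend-tls
spec:
  provider: azure
  secretObjects:                     # mirror the cert into a k8s TLS secret
    - secretName: frontend-tls
      type: kubernetes.io/tls
      data:
        - objectName: frontend-cert
          key: tls.key
        - objectName: frontend-cert
          key: tls.crt
  parameters:
    keyvaultName: frontend-certs
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: frontend-cert
          objectType: secret         # certs with private keys fetch as secrets
```

With secret rotation enabled on the driver, the synced secret refreshes after the certificate rotates in Key Vault, matching the "teams don't need to do anything" experience described above.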

Speaker A:

Right.

Speaker C:

We do tag the clusters to make sure that we can see the cert has rotated, so we can verify it rather than just doing it entirely on trust.

Speaker E:

Absolutely.

Speaker B:

But yeah, I remember speaking to some engineers about this and they were like, that's just amazing. Having that automated was just a really good selling point, I think, for AKS as a kind of product within ASOS as well. This is another benefit you get from using this technology: you don't have to worry about managing your TLS certs anymore.

Speaker C:

Check out the ASOS Tech Blog for more content from our ASOS Tech talent and a lot more insights into what goes on behind the screens at ASOS Tech. Search Medium for the ASOS Tech Blog or go to asos.tech for more.

Speaker B:

So we started to move some of our services to AKS. Now what were some of the initial benefits we found and things that we enjoyed using?

Speaker D:

I think the first benefit that I noticed was scaling. Tom Scott, I don't know if you remember, but we used to have a service that runs around 2:00 a.m. And when it was on Service Fabric, we used to have to scale up two hours beforehand and then scale down two hours after, once it had completed. Now with AKS, as soon as the service hits a CPU threshold limit, it just automatically scales up within seconds, and we don't have to worry about additional cost or scaling up not working. It just works, great.
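
That CPU-threshold behavior is standard Kubernetes: a HorizontalPodAutoscaler watching CPU utilization. A minimal sketch, with hypothetical names and numbers:

```yaml
# Scales the deployment whenever average CPU crosses 70% of requested CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: overnight-batch-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: overnight-batch-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up when pods average above this
```

No pre-emptive scale-up two hours in advance; the replica count simply follows the load.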

Speaker B:

That's a great example. Yeah, I hear that a lot from people at ASOS: when they're new to AKS, they just say things just work and it's just so stable.

Speaker C:

Yes, I was just going to add, about how AKS seems to just work, I mean, part of that comes down to the community that we have within ASOS around how we use AKS. So we have a Teams channel which works as a means for the core maintainers team to reach out, to explain new features and releases, and to provide support. But it also works both ways, so issues that teams have get shared. So if there's any change in general advice about good practice, or tweaks to the way we set stuff up, that information reaches all the relevant people, and we can adapt and improve the way we do things to try to ensure that stuff will just work.

Speaker B:

Yeah, and it's been great to see the feedback from the community. And I think, really, as part of that core AKS group, it has been a really successful community at ASOS. Any other things that we like from using Kubernetes?

Speaker A:

Yeah, I think a really obvious one is cost, the cost savings that each platform can leverage. There's the really simple view: if you look at how much a cloud service cost compared to your Kubernetes cluster, it's incomparably cheaper. I did some math: at the scale we were generally running our cloud services, one cloud service in one region was costing us around 400 pounds per month. Our entire AKS cluster in one region costs 350 pounds a month now, and we're running 36 independent applications on there. So it's just incomparably cheaper, an unbelievable cost saving. But I guess that's just the headline figure side of it. If you look at resource utilization as well, you can fine-tune your AKS cluster and trust the scaling a lot more, to be able to get more bang for your buck, I suppose.

Speaker B:

I think as well, it's a combination of things that's really making that happen. Right, so running on Linux rather than Windows, which is cheaper in the first place; using containers, which are quicker to scale and faster to start, with less of a footprint as well. There are many different technologies that we've talked about for this next-gen compute that really made a big leap forward in terms of cost savings alone.

Speaker E:

So another benefit recently, as we've been using Prometheus to monitor our applications, is that we can now use some of those metrics coming out of Prometheus to scale our workloads better. There was a really good usage of this in data science, where they're running really, really large data science models, maybe needing 30GB of memory. So generally, you had to have large VMs in your AKS cluster, and by using Prometheus metrics to scale better, they were able to save thousands of pounds a month in unnecessary scaling of their service, because they had better data about what was happening in the service.
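
The episode doesn't say which mechanism was used to act on those Prometheus metrics, but one common pattern is a KEDA ScaledObject with a Prometheus trigger. A sketch with a hypothetical metric and names:

```yaml
# Scales a data science worker on a Prometheus query instead of raw CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-runner
spec:
  scaleTargetRef:
    name: model-runner                     # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8                       # caps spend on big-memory nodes
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(model_jobs_inflight)    # hypothetical application metric
        threshold: "5"                     # target jobs per replica
```

Scaling on a signal that reflects the actual work queue, rather than CPU, is what avoids paying for idle 30GB pods.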

Speaker B:

Yeah, it's great to be able to utilize some of these tools to improve what we do. And I guess, as we've gone on this journey, have there been any nice tools that we've used and things that we can recommend to people listening?

Speaker E:

So I guess one of the most widely used tools, maybe for engineers who don't always like to use a CLI, would be OpenLens. So OpenLens essentially runs on top of the CLI, but it provides you with a graphical interface, really, so you can visualize what your pods are doing. It can be a lot simpler to just see what's going on in the cluster without looking at a load of kubectl YAML output, but most engineers would actually mix between the two.

Speaker B:

Yeah, definitely. And I think it's probably worth talking a little bit about Helm. That's something that we heavily utilize for application deployment and packaging.

Speaker E:

Yeah, I mean, Helm was actually there right at the start of our journey, and it's a very similar story, in that initially everyone was finding their own way. More recently, we've pulled that all together and implemented what we call our core Helm chart: a really nice templated structure that covers the vast majority of our use cases. It just takes away the maintenance overhead of upgrading Helm, or upgrading Kubernetes APIs when they're deprecated, for example, but it's about reuse as well. If you speak to most teams, you start off by pretty much copying and pasting a Helm chart throughout your applications, and the actual Helm charts that you're using are very similar for a lot of things. So now we've abstracted that away from the actual apps themselves. Generally speaking, each app will only have a configuration file that tells the core Helm chart how to render an object, whether that's a deployment or an ingress or a config map. And we're seeing a lot more uniformity, more reuse, and easier maintenance.
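
To make that concrete, a per-app configuration file for a shared chart like this might look as follows. The key names are hypothetical, not ASOS's actual core chart schema:

```yaml
# Hypothetical values.yaml rendered by a shared "core" Helm chart.
app:
  name: product-api
  image:
    repository: myregistry.azurecr.io/product-api
    tag: "1.42.0"
deployment:
  replicas: 3
ingress:
  enabled: true                       # chart renders the Ingress object
  host: product-api.internal.example
configMap:                            # chart renders a ConfigMap from these
  APP_ENVIRONMENT: production
```

The app repository carries only this file; the deployment, ingress and config map templates, and any Kubernetes API version bumps, live once in the core chart.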

Speaker A:

Yeah, I completely agree with Tom. And from an engineer's perspective, the core Helm charts were really welcome.

Speaker E:

Yeah, there was another really good example recently, when Workload Identity came to AKS, in that we were able to implement it in the core charts. In your configuration file, all you had to do was enable it for your workload and provide a Managed Identity client ID, and through that, your pod would start up with those Workload Identity artifacts on there.
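
Under the hood, Azure Workload Identity keys off a service account annotation and a pod label, which is exactly the kind of boilerplate a core chart can render from one flag plus a client ID. A sketch with placeholder names:

```yaml
# What a chart might render when workload identity is enabled for an app.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: product-api
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-api
spec:
  selector:
    matchLabels: { app: product-api }
  template:
    metadata:
      labels:
        app: product-api
        azure.workload.identity/use: "true"   # injects federated token artifacts
    spec:
      serviceAccountName: product-api
      containers:
        - name: product-api
          image: myregistry.azurecr.io/product-api:1.42.0
```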

Speaker A:

Yeah, massively helpful. Again, I think that idea of Managed Identity, particularly Workload Identity, was something that people had wanted for a very, very long time, but the implementation of it in AKS was perhaps a little bit scary. Again, the lovely abstraction of that behind the core Helm charts is fantastic and just removes any blockers to teams using it.

Speaker B:

Yeah, I remember we'd been waiting so long for that, and when it finally became available and we added it to the core Helm charts, there were teams that had it working within a few days. So, we've talked about all the great benefits from Kubernetes, AKS and all these technologies we're using. What were some of the earlier pain points we found at the time, when we started to migrate our services over?

Speaker C:

I mean, I guess monitoring is always a big one. So early on, with a new technology, everyone's having to find the best way to monitor, diagnose and troubleshoot, to find the relevant information to allow them to investigate issues. Now, however, we have a large set of dashboards that we automatically provision alongside clusters. So we've got a dashboard showing an overview of our entire cluster estate, and we've got individual per-cluster ones, also drilling down into different details, for example on nodes, memory requests and utilization, and the same for CPU. You can see the pod autoscaler logs and the node autoscaler logs as well.

Speaker B:

And we had alerts in there as well, right? There's been a sort of standard alert set-up as well that was quite useful for teams.

Speaker C:

Yes, we set up a set of standard alerts when we provision the clusters. They're all opt-in, opt-out, tunable, but it's a good basic set of alerts straight out of the box that provides the essential monitoring.
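
For a flavor of what an out-of-the-box alert like that can look like (the actual ASOS alert set isn't detailed in the episode), here's a Prometheus Operator rule with an illustrative expression:

```yaml
# Hypothetical standard alert provisioned alongside a cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: standard-cluster-alerts
spec:
  groups:
    - name: nodes
      rules:
        - alert: NodeMemoryPressure
          # Illustrative query: node working-set memory vs allocatable.
          expr: |
            sum by (node) (container_memory_working_set_bytes{id="/"})
              / sum by (node) (kube_node_status_allocatable{resource="memory"}) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Node memory utilization above 90% for 10 minutes
```

Shipping defaults like this with the cluster, while leaving them opt-out and tunable, keeps them useful without being noisy.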

Speaker B:

Yeah, great.

Speaker A:

I was just going to add that I guess the challenge around it all was not only getting the visibility and getting the alerts, which the core pipeline massively helped us with; it was about the learning for the people who are responsible for doing out-of-hours support. How do I actually go and scale my cluster? How do I go and respond to this particular alert? And it all comes down to the learning curve that we've talked so much about: really making sure that everyone has the knowledge at 3:00 a.m., when something goes wrong, to be able to react to the things they're seeing on the dashboards.

Speaker B:

Yeah, definitely. So, as we've talked about, over the last three or four years teams have been moving their services from Cloud Services or Service Fabric in Azure to AKS. Where are we now?

Speaker E:

Yeah, so I guess there have been an awful lot of migrations now, and I think the majority of services are now running on AKS. There's still a minority that are in the process of migrating. So now that we've got most services running on AKS, we've actually pulled together all of our learnings from all of that experience and distilled that down into service templates. And when we say service templates, we mean much more than just the actual application itself. It's the pipeline for how you deploy that application, it's the configuration, it's the monitoring and alerting. And we're now at the point where we can spin up a new project and you can have it deployed to your cluster in under an hour.

Speaker A:

In terms of where we are now, I alluded to it earlier: we've migrated pretty much everything over to our AKS clusters. We've got 36 applications on there at the moment. But from where we started, we would only use our AKS clusters for web APIs, really. We realized, as we gained a lot more confidence in the process and the maintenance of our clusters and the applications on them, that we could utilize this compute for everything, really. So we've got Azure Functions on there, we've got worker services. There are obviously the cost benefits that we talked about earlier, but the uniformity of compute is massive for us. There's only one kind of deployment pipeline, or nature of deployment pipeline, that people need to manage. And in terms of diagnosing issues with your applications, from an infrastructure perspective it's all the same, which is hugely beneficial.

Speaker E:

Yeah, it's a really good point there, Charlie, about even Functions. Right, so one of my teams that I work with, they moved everything in terms of APIs and back end, say, like message processors, into AKS. And what they were left with was their Azure Functions that were outside the cluster. So they looked at how to bring those onto their cluster, and containerized the Azure Functions and deployed them to AKS as well.

Speaker B:

I think we were using KEDA, was it, as the technology to support that as well?

Speaker E:

Yes. So KEDA is available for all of our apps. KEDA, if you're not familiar, allows you to scale on external metrics. So, for example, in the context of a function, if you had a Service Bus trigger on a function, you could scale through KEDA based on the number of messages in a Service Bus queue.
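
A sketch of that Service Bus pattern with KEDA, using hypothetical names:

```yaml
# Scales a containerized function/worker on Service Bus queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor        # the Deployment running the function host
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 30
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "100"      # target messages per replica
        connectionFromEnv: SERVICEBUS_CONNECTION
```

Scale-to-zero on an empty queue is one of the attractions of moving Functions onto the cluster.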

Speaker B:

Yeah, another thing that I saw teams do more recently is manage their Kubernetes upgrade process, so actually updating the Kubernetes version and some of the other bits around the cluster as well. So maybe you could talk a little bit about what we did there.

Speaker A:

Well, my platform certainly were heavily involved in engineering their own upgrade approach. Yeah, I think it was possibly something that we should have tapped into when we were talking about the challenges with Kubernetes, really. There's that idea of upgrades, and there doesn't seem to be too much guidance from an Azure or Microsoft perspective on the best way to do it, other than perhaps in-place upgrades: test it in your non-prod environment, check nothing breaks, and then just do it in prod. For us, that didn't really sit too well. So we introduced what was coined the Phoenix approach. A number of platforms worked together on it, but it's essentially adopting the same principles as a blue-green application deployment for your clusters. So we would spin up new clusters, use region- and cluster-specific traffic managers to ensure traffic is going to the live cluster, but we would have our migration traffic managers to allow us to test this newly spun-up cluster. We would use Helm backups and the restore functionality to put all of our applications on there. We would run test pipelines against that cluster to make sure that everything is healthy, and then, again, we would use Traffic Manager to move traffic over to the new, upgraded version of the cluster. All of this has been a very iterative process, with new pipelines to automate previously manual processes, and it's pretty gold-plated now, but it certainly wasn't at the start.

Speaker E:

And, you know, like Charlie was saying, there's no right or wrong to this. I think you need to have the Phoenix approach, because even if you don't use it for upgrades, and you go the in-place upgrade route, which I think is the most widely used at ASOS, there will still be times when either you want to make a breaking change to your cluster that can only be done at cluster creation time, or there's some other big change you want to make where an in-place upgrade maybe wouldn't be the best fit. So I think the answer is actually both, really.

Speaker B:

Yeah. And I think one of the things I personally liked about that Phoenix approach is that you're getting the confidence that you can provision a cluster using your pipelines successfully, and you're running that quite regularly, which has many benefits. When pipelines don't run for a while, they get stale and things start to break. So being able to run things regularly and get the feedback is always good, in testing that they're working. And at the same time, it's really good for a disaster recovery scenario as well, right? Knowing that, if you had to spin something up quickly, you've probably got a lot of what you need already tried and tested.

Speaker E:

Yeah, absolutely. In the AKS maintainers group, we do that to some of our test clusters.

Speaker C:

That's right. So in order to try to ensure our pipeline is in a good working state, or, if it isn't, that we hopefully catch it before anyone else does, overnight we run the pipeline against two of our own clusters. One of them, as Tom says, we tear down each night and build afresh, and the other is a long-living one where we just run the pipeline against it, so it picks up any rolling changes as an in-place thing. So we're hopefully catching both types of teams out there: the ones who do the regular rebuilds for their updates, as well as the ones who do in-place updates. Just to add to the upgrade conversation, one of the challenges of in-place upgrades of Kubernetes clusters across minor version releases is when you get removals of deprecated APIs. Obviously, if you're building a cluster afresh and you try to deploy something where the API is no longer there, you won't be able to deploy it. If it's an in-place upgrade, you can get into a bit of a sticky situation. So we've integrated an additional bit of tooling into the pipeline to check for deployed releases using now-deprecated APIs, and it will just block the upgrade and flag: this resource is using an old version, and that's why we're pausing the upgrade. And that's been quite useful in allowing teams to sort of dry-run the upgrade and know what they need to address or may have missed in their applications.
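
ASOS's exact tooling isn't named in the episode, but the open-source pluto tool from Fairwinds does this kind of check against deployed Helm releases, and a pipeline gate around it might look like this sketch:

```yaml
# Hypothetical pipeline stage that blocks a cluster upgrade if any deployed
# release uses an API version removed in the target Kubernetes version.
steps:
  - script: |
      # pluto exits non-zero when it finds deprecated/removed API versions,
      # which fails this step and pauses the upgrade with a report.
      pluto detect-helm --target-versions k8s=v1.25.0 -o wide
    displayName: Check deployed releases for removed Kubernetes APIs
```

Teams then get a dry-run style report of exactly which resources need updating before the real upgrade proceeds.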

Speaker B:

Yeah, that's really good to hear, that we're doing all that testing around the core pipeline as well, supporting both approaches and trying to catch those issues and find them early on, rather than teams having to deal with them. That's really good. I guess, just to sum up a bit now, really: if there are people listening that are thinking of moving onto Kubernetes, and potentially AKS in Azure as well, what are some of your learnings that you would share with them?

Speaker D:

I would say just start. There's a big fear around starting to deploy containerized applications on Kubernetes. I remember starting the journey myself, and it was like, there's so much to learn, so much to do at the start. But once you start getting your teeth into it, you realize, oh yeah, this is how it works. Everything just starts snowballing, one thing into another, and the learning process becomes easier, deploying applications becomes a lot easier, and your knowledge grows a lot faster as well.

Speaker A:

I'd say ensure that you build a bit of an AKS community within your own team, and ensure that documentation is up to date, so support is an easy thing to do. Like Shay said, one of the biggest barriers is the fear of the unknown, really, and of managing your own infrastructure. And I find that if you've got all the tools in place, all the documentation in place, and all the visualization and monitoring in place, that de-risks a lot of things and makes it a lot less scary.

Speaker E:

Yeah, I 100% agree with what Charlie said there. If I had to pick one specific technical area that you should get your head around and understand, it's resource requests and limits, and how they interact with things like the scheduler. Because if you can understand that, it will prevent some fairly nasty issues potentially coming up on your cluster.
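
For anyone new to this, the requests and limits Tom means live on each container spec. A minimal sketch with hypothetical numbers:

```yaml
# Requests are what the scheduler reserves when placing the pod (and what
# HPA utilization percentages are measured against); limits are hard caps.
spec:
  containers:
    - name: product-api
      image: myregistry.azurecr.io/product-api:1.42.0
      resources:
        requests:
          cpu: 250m          # scheduler packs nodes based on this
          memory: 256Mi
        limits:
          memory: 512Mi      # exceeding this gets the container OOM-killed
```

Get requests badly wrong and you either waste nodes (too high) or overcommit them and invite evictions and throttling (too low), which is the class of nasty issue being hinted at here.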

Speaker B:

Yeah, some great tips. And I think we have to say as well: get a standardized way of building and provisioning your clusters. It will save you a lot of time and headache as your estate grows and as you adopt Kubernetes further across your organization.

Speaker C:

I guess I'd say, Tom mentioned the Kubernetes learning curve, and my learning is that the learning curve never seems to stop. You get your head around one set of concepts that felt fairly advanced at one point in time, and then you realize it's a can of worms; there's still more beyond that. But it keeps it fun.

Speaker B:

That's right, I think. Working within this area in engineering, we're always learning. What are some things that you're excited about in this area in the next few months and the coming year?

Speaker D:

For me, it's Istio, and scaling as well, because right now we use something called Application Gateway within Azure. And whilst we're scaling our applications for high traffic volumes, we also have to scale Application Gateway independently, and there are loads of problems with doing that; at the same time, the Application Gateway might get stuck and whatnot. If we do implement Istio, it will scale alongside our pods as well.

Speaker B:

Yeah, maybe you could just explain a bit about Istio, what that is, for people that aren't familiar.

Speaker E:

Yeah, it's a different type of ingress, but it's also a more advanced form as well. So I think some people might see it more as a service mesh type of application. For us, Istio is really our next gen of ingress, so it will take care of things like mTLS inside the cluster, and how you can call an internal service in the cluster, for example. But for engineers, there are also some really interesting concepts, like canary deployments, where we can integrate it with other applications such as Flagger as well. And that might, for example, read a Prometheus metric to understand whether your application is healthy while you're rolling it out. And the combination of the two can give you, for example, automated rollbacks if a Prometheus metric goes outside a threshold. Those are all really powerful things that offer so much, not just for engineering teams, but also for change and release management. For example, if you can explain to them that our pipelines will now automatically roll back when there is an issue, that just takes so much risk out of a release as well.
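
For a flavor of that Flagger integration, a canary definition might look like this sketch (names are hypothetical; the built-in request-success-rate metric is Prometheus-backed):

```yaml
# Hypothetical Flagger canary: shift traffic in 10% steps, roll back
# automatically if the success rate drops below the threshold.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: product-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-api
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5               # failed checks before automated rollback
    maxWeight: 50
    stepWeight: 10             # traffic shifted per interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99              # % of requests that must succeed
        interval: 1m
```

The release pipeline just deploys the new version; promotion or rollback is then driven by the metrics, which is the risk reduction described above.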

Speaker B:

Yeah, Istio is a really interesting piece of technology that I'm excited about myself as well. Thanks, everyone, for your time today. It was really interesting to hear about our journey from Cloud Services, and Service Fabric and other compute, onto Kubernetes and Azure Kubernetes Service, AKS. Hopefully people listening have got some useful tips and learnings, and not just that, but also around using .NET Core now, moving from the old .NET Framework, and utilizing containers as well. So thanks, everyone, for your time. Take care.

Speaker C:

Cheers.

Speaker E:

Thanks everyone.

Speaker C:

Thanks everyone.

Speaker B:

Join us next time for more stories and insights from behind the screens at ASOS Tech.

Behind the screens at ASOS Tech