I had the opportunity to attend re:Invent for the second year running, courtesy of my employer StepStone. I work for them as their Group AWS Programme Manager, which can be succinctly described as making sure we Do AWS Right! For a bit more information on that you could look at my LinkedIn profile, or check out my previous post titled “How StepStone achieved a Cloud Center of Excellence”.

re:Invent is a complete whirlwind due to there being so many options for things to do over the week. This post highlights the breakout sessions that I took in (and a few I had to catch up on afterwards, due to ad hoc meeting conflicts).

Let’s crack on with it!

Reliability and Scalability

DAT404-R: Aurora Multi Master – Scaling out database write performance

This session detailed how to scale out database write performance by following the multi-master paradigm, a feature that is now offered by Amazon’s Aurora product.

To make this work well, your applications need changes so that they detect issues and fail over quickly. The cluster endpoint is not load balanced either, and will always default to one (working) instance, so you would need to roll load balancing yourself. Some thought in application design is also needed to reduce the number of deadlocks that will be encountered – e.g. can you effectively shard your database?
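To make the application-side changes a little more concrete, here is a minimal Python sketch of rolling your own load balancing with fail-over and retries. It is not Aurora's API and not anything shown in the session: the class `MultiWriterClient`, the exceptions `InstanceDown` and `WriteConflict`, and the idea of passing each instance in as a plain callable are all hypothetical stand-ins for a real database driver.

```python
import itertools
import time


class InstanceDown(Exception):
    """Hypothetical: the connection to a writer instance failed."""


class WriteConflict(Exception):
    """Hypothetical: the database reported a deadlock / write conflict."""


class MultiWriterClient:
    """Round-robins writes across instances, failing over and retrying on errors."""

    def __init__(self, instances, max_attempts=4, backoff_seconds=0.0):
        # instances: mapping of name -> callable(statement) that performs the write.
        self._instances = dict(instances)
        self._rotation = itertools.cycle(list(self._instances))
        self._max_attempts = max_attempts
        self._backoff = backoff_seconds

    def execute(self, statement):
        last_error = None
        for attempt in range(self._max_attempts):
            name = next(self._rotation)  # client-side load balancing
            try:
                return name, self._instances[name](statement)
            except InstanceDown as err:
                # Fail over immediately: the next loop picks the next instance.
                last_error = err
            except WriteConflict as err:
                # Back off exponentially, then retry on the next instance.
                last_error = err
                time.sleep(self._backoff * (2 ** attempt))
        raise RuntimeError(
            f"write failed after {self._max_attempts} attempts"
        ) from last_error
```

In a real application each callable would wrap a connection to one writer instance, and sharding (for example by customer) would decide which instance a given write prefers, so that conflicting writes rarely land on different instances.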

CMY302: Scaling Hotstar.com for 25 million concurrent viewers

Hotstar: Big in India for streaming, and particularly so for cricket. Scaling up to 25.3 million concurrent streams at peak, in fact. For comparison, the Super Bowl hit 3.1 million.

This session concentrated on how they handle that, including:

  • Load generation tools (using 3,000 EC2 machines)
  • ML traffic patterns.
  • Methods used to scale up and down.
  • Chaos engineering approaches.
  • How best to panic!

One thing to note is that they do NOT use Amazon’s Auto Scaling Groups or CloudFront products. ASGs were considered unsuitable for reasons covered in the presentation, and they use Akamai for CDN.

ARC335: Designing for failure: Architecting resilient systems on AWS

Resilient system design is one of my personal crusades in my current role, so I was naturally drawn to this presentation.

This covers the building blocks of resilience within AWS and the general strategies that can be employed. This includes consideration of data replication and networking.

There is also a great real world example courtesy of Snapchat! Clearly a good company to use as an example, considering they have 200+ million daily active users and 3+ billion ‘snaps’ per day.

Finally, the concept of continuous resilience is explained, which is the target architecture for any system wishing to really do this properly.

ENT233-S: Atlassian’s journey to cloud-native architecture

Everyone knows who Atlassian are. Pretty much everyone in the room put their hands up when asked if they were a customer in some way!

This was an interesting session, detailing Atlassian’s move away from having everything single-tenanted in AWS for every customer instance. This meant having 100k instances! Although this was automated to an extent, it was unwieldy.

They moved to stateless apps running on AWS, with shared microservices. They called this Project Vertigo.

One key decision they made was to fork the codebase, which means different codebases for their self-hosted and cloud products now. That permitted them greater speed of execution.

100k Jira / Confluence instances were migrated over 10 months. The remainder (usual ‘last 10%’ problem) over the next year.

1,500 developers were involved, and code was re-architected as it was migrated (so no ‘lift and shift’ here!).

Key points:

  • Have clear goals for your microservices. Easier with Atlassian as the products already have very obvious shared areas (such as look and feel, shared logins).
  • Invest in tooling and be consistent (Do scaling right now to avoid chaos later).
  • Defined Tech Stacks (libraries, design patterns, logging framework) – This also helped with any ownership changes further down the line.
  • The organisational structure had to change to accommodate the project. Not just dev teams, but leadership and R&D too!
  • Cultural shift: “You build it, you run it”.
  • DO THINGS RIGHT: No Hacks or Workarounds (avoiding ‘snowflakes’).
  • Manage the last 10%: Went with 1-1 engineer assignments to Get It Done.

DOP208: Amazon’s approach to failing successfully

Key point: Never waste a failure.

(But see your failures before your customers do!)

Other takeaways:

  • Work backwards from customer impact.
  • Root cause via the ‘Five Whys’ approach.
  • “What could have been done to reduce the blast radius?”
  • “How could we have responded quicker?”
  • Automatic rollbacks triggered based on alarms.
  • Remember to report on WINS as well!
  • Metrics: Make sure they are genuinely useful! Prioritise Health over Diagnostic.
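The ‘automatic rollbacks triggered based on alarms’ takeaway can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Amazon's tooling: the `Deployment` class, its `release` method and the `bake_checks` parameter are hypothetical, and each alarm is modelled as a simple callable that returns True when it is firing.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Hypothetical deployment record that rolls back when an alarm fires."""

    current_version: str
    history: list = field(default_factory=list)

    def release(self, new_version, alarm_checks, bake_checks=3):
        """Deploy new_version, poll the alarms, and roll back if any fires.

        alarm_checks maps an alarm name to a callable returning True when
        that alarm is firing. Returns True if the release was kept.
        """
        previous = self.current_version
        self.current_version = new_version
        self.history.append(("deploy", new_version))

        for _ in range(bake_checks):
            fired = [name for name, in_alarm in alarm_checks.items() if in_alarm()]
            if fired:
                # No human in the loop: restore the last known good version.
                self.current_version = previous
                self.history.append(("rollback", previous, fired))
                return False
        return True
```

In keeping with the ‘Health over Diagnostic’ point, the alarms passed in would ideally watch customer-facing health metrics (error rate, latency) rather than low-level diagnostics.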

Machine Learning

AIM212: ML in retail: Solutions that add intelligence to your business

Although this highlights the retail sector, the principles described could be applied to other sectors as well. My company develops products for the recruitment sector, for example, and I could see the benefits.

Key stats:

  • 63% of users expect personalisation as standard.
  • 74% of users will be frustrated if it is not offered.

This session dives into details of how Machine Learning can help with that, including live demos of AWS’ Forecast and Personalize products. These products abstract away a lot of the lower level ML aspects. Essentially, it is a ‘feed data and point and click’ approach.

GAM302: How CAPCOM Builds Fun Games Fast With Containers, Data and ML

CAPCOM are pretty big in the games industry!

They moved from on-prem (2008) to private cloud (2012), and finally into AWS from 2015.

The first part of this presentation covers the architecture of Monster Hunter Explore, handling 20 million queries / minute.

The second part covers ML techniques in level design. This part was fascinating, as it involved training a model to play the ‘gems’ style game, and determining whether any levels were too easy / hard based on that. This section is complete with video examples of the model in use, including when it was overfitting and getting things wrong!


ARC210: Microservice decomposition for SaaS environments

Unfortunately, the video for this session is not yet available.

I hope the video does become available soon, as this was a very strong presentation on how to break up a monolith into lovely Microservices!

It did this with consideration of various things:

  • Single v multi tenant approaches
  • Domain modelling
  • Tenant isolation
  • Bulk operations
  • Fault isolation
  • Data partitioning
  • Tenant tiering & Metrics

The notes I’ve taken from this will definitely be used on any work I do in this area in the future.
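As the video is not available, here is a small Python sketch of how tenant tiering, tenant isolation and data partitioning can fit together. It is my own illustration rather than material from the session: the tier name `premium`, the `POOLED_SHARDS` count and both function names are assumptions.

```python
import hashlib

POOLED_SHARDS = 4  # assumption: number of shared partitions for pooled tenants


def storage_location(tenant_id, tier):
    """Pick where a tenant's data lives, based on its tier.

    Premium tenants get a dedicated ("silo") store for stronger isolation.
    Everyone else shares a pooled shard chosen by a stable hash of the
    tenant ID, so the same tenant always maps to the same shard.
    """
    if tier == "premium":
        return f"silo-{tenant_id}"
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return f"pool-{int(digest, 16) % POOLED_SHARDS}"


def scoped_key(tenant_id, item_id):
    """Prefix every key with the tenant ID so pooled data stays separable."""
    return f"{tenant_id}#{item_id}"
```

For example, `storage_location("acme", "premium")` returns `silo-acme`, while a lower-tier tenant lands on one of the `pool-0` to `pool-3` shards and relies on the tenant-scoped key for isolation.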

PNU301: Smart meters, data lakes, and serverless for Utilities

Background material: Smart meters explained.

Heavy on the detail, this covers:

  • Data lake creation (Starting with Amazon S3).
  • Use of IoT and how to ingest data from the smart meters.
  • Insights on the data collected, including Machine Learning.

Financial Management

ENT204: Managing your cloud financials as you scale on AWS

Managing cloud financials correctly is a particular interest of mine. In fact, it needed to be the primary focus when I started my role, as that was the largest concern at Board level. This is not uncommon elsewhere!

“Why is our AWS bill higher than expected?” – That was the question that got the largest show of hands in this talk.

Reports have shown that 35% of cloud spend is wasted.

This is therefore a useful watch, covering:

  • Measurement & Accountability
  • Cost Optimisation
  • Planning & Forecasting
  • Cloud Financial Operations


STP15: The Awkward Teenager: Learnings From One UK Unicorn’s Story

The unicorn in question is Deliveroo. They are massive in the UK, and their speed of growth means they have gone down the typical ‘monolith to microservices’ route. Their journey also involved doubling the size of their technical organisation.

I considered this more of a talk about culture and how that links with the tech. The key takeaways (see what I did there?) for me were:

  • Made mass onboardings easier by having dedicated ‘onboarding weeks’. New hires, where possible, came in during those weeks and were set up with a full onboarding schedule that all levels of the business were involved in.
  • Designing for scale uses the ‘old school’ Technical Design document approach, but there is a strong feedback loop as these documents are widely circulated and commented upon. There is a community that helps new people get started with writing these documents.
  • Game Days run to help people feel confident with what they are doing.
  • Incidents have templates as well. Post-mortems held weekly (blameless) and everyone is welcome to attend.
  • They have a custom-written release management and service catalog tool, so that every app has documented links to Slack, playbooks, GitHub etc.
  • For talent development, there are Levels and an Expectation framework. This makes it very clear what is needed to gain promotion, and includes real-life examples.
  • FinOps seemed a bit immature compared to everything else: More a case of ‘keeping an eye’ rather than any tools or methodologies that were as strong as the other areas mentioned above.

DOP310: Amazon’s approach to security during development

It goes without saying that Amazon need to be secure in everything that they do. In fact, they rank it at the top, like this:

  1. Security
  2. Durability
  3. Availability
  4. Speed

Key takeaways:

  • PROMOTE SECURITY: At all levels. The CEO is informed of any security issue.
  • Amazon’s Leadership Principles apply to development.
  • Three pillars: Policy, Processes and Tools.
  • Security reviews start early so there are no surprises at the end (Security teams can technically veto a release).
  • Full penetration testing.
  • New services get TLS/SSL, SIGv4, AWS Identity etc ‘out of the box’ as part of a common framework.
  • Have the best staff to help with reacting to security threats. They have various serious white paper authors on staff.
  • ‘Heartbleed’ (2014) was a memory disclosure vulnerability in OpenSSL. Although ultimately a 2 line fix, it required millions of deployments within AWS. It was solved within 24 hours.
  • Resulted in s2n: Amazon’s own implementation of TLS/SSL.