Going Serverless: Story of a Service Migration @ DevOps Perth

Have you ever heard stories of going serverless in the cloud and wondered how to migrate your existing services? Maybe you're building a new cloud service and are paralysed by the choices available? If these sound familiar then this is the talk for you.

In this session I tell the story of how, with no prior experience, I took one of our company's existing public services and moved it from on-prem hosting and storage to completely cloud based using AWS and the Serverless framework. I'll walk through the iterative design-and-build process I used to evolve the system to the MVP state, and reflect on the pros and cons of decisions I made throughout the journey.

As well as hearing where things really are as simple as described, you'll learn where some Amazon services are not a good choice for the situation, and I'll point out pitfalls to avoid before you fall foul of them.

Carl Scarlett

June 27, 2018

Transcript

  1. NOTE: This version of the slides does not have animation

    and can be used in static slide sharing sites etc. For the animated version, you’ll need to see it live. My name is Carl Scarlett. I work at Bankwest as an IT Specialist. I design things and I build things. Here are my contact details; my best point of contact is Twitter. This is a story about migrating a public web service (a data source for a single page application on a public website) from on-prem to AWS using the Serverless framework. I had no prior experience with the service, AWS, or Serverless when I started. This is a none-to-done kind of story. 1
  2. { SECTION 1 OF 13 } ### CLICK ### Today

    I am representing myself and not my employer. The opinions expressed are my own and should in no way be taken as the opinions of Bankwest. All the information we will cover is public data; this work was publicised in the media. Nothing here is covered by non-disclosure agreements. All the work I present was done by me unless otherwise indicated (including the slides). This is a personal story and should be taken in that context: it’s something I did while working at Bankwest ### CLICK ### There are so many fine details, I can’t tell you everything in the time we have, so I will be moving fast to get through the content ### CLICK ### 2
  3. I will shortcut a few things in the interest of

    getting learnings across. If I don't cover something and you don't get to ask during question time, please find me afterwards; I’m always happy to talk about all this or anything that I do. 2
  4. { SECTION 2 OF 13 } ### CLICK ### Find

    us app for customers to self-serve • Like most find-us apps: • Where to find ATMs, Stores and Business Centres, and put them on an interactive map • Provide details about the services and opening times at each location • Unlike other such apps: • Find Home Lending Specialists for their suburb • Some suburbs don't have any specialists of their own, but there may be some in surrounding suburbs ### CLICK ### The app is served from the Bankwest public website. It targets desktop and mobile browsers. There used to be a separate m-dot site for mobile browsers; we refactored to a single responsive web app during the Digital Shopfront project ### CLICK ### 3
  5. The Bankwest iPhone and Android apps baked their location file

    into the app at this point • There were concerns about the app staying responsive given the latency etc • Likely to switch to consuming the new locator service in the future 3
  6. In the top section you search for an address, suburb

    or postcode. Google provides some suggestions based on what you type. The location will be pinned on the map. If no location is specified, it will default to the last location (from a cookie), or to Perth WA. There are two tabs to select different views: are you after a location on a map, or do you want to talk to a lending specialist? Some filtering toggles for location types are next, followed by a dropdown for more detailed filtering by the facilities provided at different location types. 5
  7. The searched location is shown on the map with a

    pin (here the default of Perth is shown). The usual zoom controls apply, and the map can be scrolled around, revealing locations in other areas. 6
  8. Locations are selectable, both in the list or on the

    map. The list gets a selection indicator, and an information box for the location appears at the map location. 7
  9. On the Lending Specialists view we show contact details of

    people servicing the selected postcode or suburb. Here, we show all the Home Loan Specialists (HLS) for the Mount Lawley postcode. There is only one HLS in this postcode (Justin Long) and his contact details are shown on the right. HLS for the surrounding suburbs are also shown: here Jessica Duong may be contacted, even though she does not operate in Mount Lawley. The avatar images are stretched because someone messed up the content at the time I took the screenshot. 8
  10. If you need a business lending specialist, you choose

    the Small Business Specialists (SBS) option. Contact details for the central Business Sales Centre are shown 9
  11. { SECTION 3 OF 13 } Why move the service?

    The Service is built on a burning platform. The servers it uses are officially out of support, so the Bank pays exorbitant fees to keep them under support. If they didn’t, the bank would be failing to comply with Australian banking regulations. The business has been trying to get rid of this platform for years, but it’s not a trivial exercise. There are many, many systems dependent on these servers. The main one is the FatWire Content Management System (CMS), which used to serve the Bankwest Public Website. It also acted as a central content store for a number of peripheral sites and apps. This CMS had been customised to the point it couldn’t be upgraded, and it was not supported on server versions after Windows Server 2003. 10
  12. With Server 2003 out of official support, it was easier

    to move to a new CMS and get rid of this financial drain. We had to move the Locator Service off these servers before the bank finally pulled the support budget and turned off the servers. 10
  13. As well as needing to be moved off the burning

    platform, the business was trialling various cloud service providers. The bank is building a Cloud Native Platform to allow developers to build solutions in the cloud in a safe/secure/regulated way. ### CLICK ### The Website Platform team where I worked was already comfortable with AWS from moving the CMS (Digital Shopfront project). It made sense that we migrate a service we owned to AWS as part of this trial. ### CLICK ### The Cloud Native Platform (CNP) team would be building a platform at the same time as us. When both parties were ready we’d bring them together to form a productionised system 11
  14. { SECTION 4 OF 13 } Let’s look at requests

    coming through the Locator Service. 12
  15. The app sends a request from the browser over the

    public internet. As we all know, the internet is “a series of tubes”. ### REFERENCE ### “It’s a series of tubes” - Senator Ted Stevens talking about the Net Neutrality Bill in 2006 14
  16. The request hits a server which routes to API endpoints.

    The Locator service has 3 endpoints. 15
  17. The easiest to understand is the Bounded Locations endpoint. Some

    code runs, and the response includes the information about locations to be drawn on the map. 16
  18. The request passes in the latitude and longitude of: •

    NE + SW of the bounding box to return locations from • Pin/current location (to calculate distances to each location on the map) 17
  19. The Specialists endpoint provides all the Home Lending Specialist information

    for the postcode or suburb of the search location 18
  20. The postcode and suburb (if provided) of the current location

    are passed in, and the HLS info for this area and surrounding areas is provided. If we search for a postcode, we get the details of the entire postcode. There are multiple suburbs for a postcode, so the data is different. 19
  21. The Location Facilities endpoint provides the filter data the app

    can pass into a Bounded Locations search. It is called once when the app loads to get this filter data. 20
  22. The left column shows facilities for ATM locations The right

    column shows facilities for Store locations (stores = branches) Currently there are no facilities to show for Business Centres, but they would appear in a third column The UI doesn’t line up as the app was being restyled when I took this screenshot. A UX project was doing this outside the scope of my Locator Migration project. 21
  23. Finally, the data serving the API existed somewhere on prem,

    but wasn’t a consideration when this effort kicked off. We have a simple API, so this project should be simple, right? 22
  24. Here is what the Locator Service was known to

    look like at the start of the project. ### CLICK ### …and here is where the burning platform is. The API Gateway is IIS running an ASP.NET Web API; it is hosted on the same server as FatWire 23
  25. Primary goal is to get the service off the burning

    platform. My strategy to migrate the service to AWS was simple. ### CLICK ### I would keep the API contract and the data format the same. This meant I didn’t have to change the app at all. ### CLICK ### This lets us get off the burning platform as quickly as possible. Minimise the scope. Keep it simple. ### CLICK ### Once we’ve switched to the new solution the danger will be over. We can go back and make the API and app work better together in a less critical timeframe. 24
  26. { LAST SLIDE IN SECTION } and we swap the

    APIs out by switching the endpoints the app points to 27
  27. { SECTION 5 OF 13 } One of the concerns

    we had is network latency due to the physical distances involved. Here’s a gentle reminder of how big Australia is, using comparisons with other well known land masses. #### CLICK #### Australia is a big land mass; it’s the world’s largest island, and about 4100 km across. #### CLICK #### Here’s China, another big land mass, for comparison. China is a bit wider and taller than Australia, but definitely comparable in scale. #### CLICK #### Here’s the part of the USA that forms the greater part of North America. Another large land mass, again comparable in scale. 28
  28. #### CLICK #### As you can see, these land masses

    are comparable in size and remind us just how big Australia is. #### Actual sizes for reference #### Australia 52x50mm 4078x3922km 2534x2437mi China 78x53mm 6118x4157km 3802x2583mi USA 72x38mm 5647x2980km 3509x1852mi 28
  29. { LAST SLIDE IN SECTION } I had concerns about

    latency because of the distances our requests would need to travel to interact with the Locator Service. Here’s Australia again. #### CLICK #### And here’s the location of our API gateways, current and future. #### CLICK #### Most of our customers live in WA, and most of the app usage is from here in Perth. #### CLICK #### When the app interacts with the on-prem service the distances are small, so latency is minimal. A request comes into our data centre, is handled there, and a response goes back to the browser. #### CLICK #### When we start using AWS in Sydney, the distances become pretty huge. I was concerned our app would run slower following migration. Under the design I was given, data would be coming from wherever it lived on-prem, so a single request would need to traverse the width of the country at least 4 times. 29
  30. #### CLICK #### Here’s a request to our API on

    AWS. #### CLICK #### The API goes to the data store for data #### CLICK #### The response goes back to the endpoint for processing #### CLICK #### The final response then heads back to the browser. Considering each request to the API (such as scrolling the map and needing new Bounded Locations) requires at least two round trips across the country, network latency was a very real concern. With CORS preflight checks there may be even more round trips. NOTE: there was no Amazon CloudFront in Perth at this point in the project. A slower service would be a fail in my book. I would be keeping a close eye on performance as I designed and built the new service. 29
  31. { SECTION 6 OF 13 } Let’s look at the

    API again. To iteratively build towards this, I decided I would • choose the easiest endpoint first (lowest hanging fruit) • take it all the way to completion on AWS (at least in a DEV stack) • repeat until all endpoints are done By starting with the simplest first, we would • Get some experience with all this new tech • Establish a developer flow (local build and test, deploy and test) with source control • Possibly even have a CICD pipeline set up (depending on the CNP Team) As with any iterative design strategy, the benefits are • You get all the hardest/longest learning hurdles out of the way and in place as soon as possible • For the more complicated iterations that are coming, you minimise outside distractions 30
  32. Here’s some serverless code to configure this endpoint. It’s YAML.

    I define a locator service on AWS using Node.js 6.10 in Sydney. Given the skillsets of the team I chose to go with a Node.js solution using pure JS (not TS). We chose the version of Node AWS supported that was closest to what our team currently used (6.10). Then a Lambda function with handler code is specified; this is just a file in one of our serverless solution folders. When an HTTP GET hits our API on the LocationFacilities endpoint this code is triggered to construct the response. With CORS (Cross-origin resource sharing) enabled, this code also manages the CORS response header. 32
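A minimal sketch of the kind of serverless.yml being described (the service name, handler path and endpoint path here are illustrative, not the actual configuration):

```yaml
# Illustrative sketch only: a Serverless Framework service on AWS,
# Node.js 6.10, in the Sydney region, exposing one HTTP GET endpoint with CORS.
service: locator-service            # placeholder name

provider:
  name: aws
  runtime: nodejs6.10
  region: ap-southeast-2            # Sydney

functions:
  locationFacilities:
    handler: handlers/locationFacilities.handler   # a file in the solution folder
    events:
      - http:
          path: locationfacilities
          method: get
          cors: true                # Serverless wires up the CORS preflight response
```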
  33. The API routes the Location Facilities request to our Node.js

    code running on AWS Lambda. This is all plumbing. It’s time to investigate what data to return. 34
  34. I examined the current API response at this endpoint and

    found a JSON response like this. I’ve formatted it for easy reading. Each type of location (ATM, STO, BBC) has a display name (Title) and an array of Facilities. Each facility has a display name, and some other information we don’t care about. In the spirit of keeping the data contract the same when we swap the Location Services, we’ll preserve this other info. We hardcoded this directly into our handler and were able to test a working endpoint with Postman. 35
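A sketch of that hardcoded handler, assuming Node.js 6.10 callback style; the field values and facility names are placeholders, only the overall shape (a Title plus a Facilities array per location type) follows the slide:

```javascript
'use strict';

// Hardcoded Location Facilities response (placeholder values; the real
// content carries extra fields that are preserved but not used by the app).
const FACILITIES = {
  ATM: { Title: 'ATMs', Facilities: [{ Title: 'Deposit ATM' }] },
  STO: { Title: 'Stores', Facilities: [{ Title: 'Business banking' }] },
  BBC: { Title: 'Business Centres', Facilities: [] }
};

module.exports.handler = (event, context, callback) => {
  callback(null, {
    statusCode: 200,
    headers: { 'Access-Control-Allow-Origin': '*' }, // CORS response header
    body: JSON.stringify(FACILITIES)
  });
};
```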
  35. When I asked the business people supporting this data, they

    told me this information hadn’t changed in years. Given how easy it was to change and deploy code with Serverless, there was no need to connect this endpoint to a backend data store. #### CLICK ### This endpoint was done! 36
  36. { LAST SLIDE IN SECTION } ### CLICK ### How

    to Deploy to AWS with Serverless • Hand-cranked using serverless deploy on the command line • CNP team wasn’t ready for us yet ### CLICK ### Built a developer flow • Writing Node.js locally and testing with Mocha • Used the VS Code task runner to run Mocha tests and deploy commands • Working with Node.js on Lambda in the cloud with online editors • Working with log files in the cloud (CloudWatch) • Working with security on AWS (IAM roles, 2FA/MFA) ### CLICK ### Gained experience with Amazon, AWS, and Serverless technology • CloudFormation, stacks, Amazon consoles • Serverless YAML 37
  37. ### CLICK ### How easy it is to configure an

    API endpoint with routing and code behind it • We were very excited with the speed of progress • Our CIO got excited too and announced to the media that we would be done by the end of the financial year! (weeks away) • I had doubts • we had known unknowns • probably some unknown unknowns • I admitted it was feasible, but unlikely 37
  38. { SECTION 7 OF 13 } Let’s look at the

    next endpoint. The first question I asked was “Can I keep the existing data source?” PRO: All the business processes that updated that data would stay as is PRO: The work to support the API would be minimal CON: I’d have to solve integrating AWS with on-prem infrastructure so the API could surface the data, CON: Potential latency 38
  39. I examined the code for the existing endpoint and guess

    where I found the data? #### CLICK #### On SQL Server on the same server where the API lived. On the same server where Fatwire lived. And it actually got worse: #### CLICK #### The SQL Server was the actual database for Fatwire! Additional tables had been added. Additional columns had been added. API data used primary and foreign keys from Fatwire tables. The API data was coupled so tightly into the Fatwire data it was part of referential integrity. The API data was spread like a cancer on top of, inside of, and throughout the Fatwire database. 39
  40. So the impact of the burning platform is actually this.

    We now know where the data store is, and we need to get rid of it. This would mean changing the business processes to add and update data in a new data store. It would mean we weren’t just migrating the API; we were migrating the entire service. Given we had to get off the burning platform, we had no choice but to make this part of the MVP. This simple project just became not so simple. 40
  41. { SECTION 8 OF 13 } So how does the

    data get there? I talked to the business to find out what their process was. It turns out there are two CSV files used to populate the data for the Specialists endpoint. The first is a postcodes file that is provided by Australia Post on a monthly basis. It lists all the suburbs, postcodes and surrounding suburbs in Australia. The second is provided by the business themselves, and has all the information about Home Lending Specialists (name, contact details, etc) and which suburbs in Australia they service. 41
  42. In the spirit of choosing the simplest thing first, I

    focussed on the postcodes file for now. The process to update postcodes seemed relatively simple. The business receives a postcode file every month 42
  43. A request was made to a development team to upload

    the data The developers manually ran a script to upload the data into a holding table in SQL Server, 43
  44. Then they manually ran another script to turn the data

    into a cancer and spread it where it needed to be for the API to read. 44
  45. Then they checked the data was OK before handing off

    to a testing team. Through manual and automated tests the testing team would ensure the API still worked (no regression issues). Testers and the business would also do some verification tests at this point to ensure known changes were present once the deployment had happened 45
  46. The developers would then repeat this process in a UAT/STAGE

    environment and PROD if everything went OK. All in all a pretty straightforward process. AUDIENCE PARTICIPATION TIME Does anyone want to guess how long it takes a monthly csv file to be received by the development team, go through this process, and end up in PROD? Just shout it out! 46
  47. That’s right, that’s not a typo. When I looked at

    this process I found it was taking 2.5 years for a file we receive monthly to be exposed by the Locator Service in production. Not a very good customer experience is it? Fortunately postcode data doesn’t change very often, but this is still a simple process and shouldn’t be very difficult. So why does it take so long? 47
  48. The first reason is the

    cancerous nature of the data. Because it’s coupled tightly with the structure and foreign keys of the Fatwire database, deployment is not as simple as it should be and the risks are high. Though it is a scripted process, a lot of manual checking and confirmation has to be done. The change management and governance is heavy handed as a result, and a lot of scrutiny is put on the process to ensure the public website isn’t damaged by the deployment of the data. The lesson here is to have single-purpose data stores for all your things; never share a data store. While in this case there were legacy reasons to share data sources, in this day and age there is no excuse to do so. Shared data stores create complexity that will only hurt you in the future. Allow your infrastructure to bear the responsibility for distributed system complexity, not the code. 48
  49. The second reason is this. The DEV, TEST, STAGE and

    PROD environments where Fatwire resided were highly unstable. STAGE was particularly fussy, and would go down if you even thought about putting load on the CPU. Deployments were done with a lot of hoping and praying that the inevitable wouldn’t happen: the deploy would fail, or only partially work, or crash the servers, etc. Each failure scenario meant heavy involvement from multiple operations and security teams, and a lot of people, to get operational again. 49
  50. { LAST SLIDE IN SECTION } It is this segregation

    of duties that hurts you the most. In his paper “What I talk about when I talk about platforms”, Evan Bottcher from Thoughtworks describes his study of hundreds of tasks at an Australian telecommunications company, which found tasks that couldn’t be done by a single team took 10-12 times longer than tasks that could be done within a team. What’s more, the effect is cumulative: the more teams in the pipeline, the slower it gets. All the time the dev/test team spent dealing with the chaos and mess, and chasing up work on other people’s backlogs, reduced their capacity. In our case the pain of getting a simple CSV file into PROD was so great the work kept getting deprioritised. Not a very good customer experience. Fortunately the postcode data changed so infrequently we could get away with not doing deployments for some time. 50
  51. Specialist information was a bit more fluid however, and the

    pain was felt. Learnings: • Avoid having overlapping data domains in a single database • Avoid burning platforms like the plague • If you have a burning platform, get out as soon as you can 50
  52. { SECTION 9 OF 13 } I was determined to

    fix this situation in the new design, and get developers completely out of the loop. It made sense to move the data store up to the cloud and automate the processing. This approach had the added benefit of minimising network latency from cross-country traffic. Let’s go back to the start and show the steps I went through. I wanted to have the data store in AWS, and I had the postcodes.csv file to process. The resulting data would be stored, and then the specialists CSV file would be processed later, mixing the data together to make the endpoint’s data source. 51
  53. First I experimented locally to get a Node.js function to

    pull apart the postcode csv file. 52
  54. Then I moved the Lambda to the cloud. I manually

    uploaded the file to s3 from my desktop using the console. 53
  55. I started looking at storage solutions at this point. I

    wanted to keep things cheap, so expensive relational databases were canned from the start. Document databases looked to be the cheapest data store at this point, and asking Google I found lots of advice pointing me to DynamoDB. 54
  56. I realised responses were deterministic between data uploads. ###

    CLICK ### The same request would always produce the same response between uploads. ### CLICK ### Rather than processing on every request, I could process responses at upload time ### CLICK ### The code on the endpoint could just fetch ### CLICK ### This would produce a very fast response, but could I store all the things? 55
  57. DynamoDB is essentially a key-value store. I decided on

    a “postcode_suburb” composite key (suburb was optional). Using my on-prem processor code I performed some analysis; here are some metrics. It looked doable. 56
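To illustrate the key scheme, items might look like this; the values are hypothetical, not the real table contents, with the pre-built response stored against either a postcode key or a postcode-plus-suburb key:

```javascript
// Hypothetical items under the "postcode_suburb" composite key scheme.
const exampleItems = [
  { id: '6050',              response: '{ /* pre-built JSON for the whole postcode */ }' },
  { id: '6050_MOUNT LAWLEY', response: '{ /* pre-built JSON for that suburb */ }' }
];
```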
  58. I configured the lambda to build the responses and store

    them in DynamoDB. I dropped the postcodes file into S3 using the online console and observed the lambda process the file. It processed fine, but when I checked the data in the DynamoDB table… 57
  59. { Video of squirrel giving an evil eye with a

    “DUNN DUNN DUNNNNN” sound effect } So two thirds of the data was missing. Why was that? 58
  60. DynamoDB write capacity is configured using Write Capacity Units (WCU). This

    tells Amazon how much write capacity to provision for the table. ### CLICK ### I started at 1 WCU, which allows for 1KB of writes per second. ### CLICK ### The lambda was writing at 3132.38 items per second, averaging 1KB per item => 3132.38KB per second. ### CLICK ### I was trying to write ~3000 times faster than I had configured DynamoDB for. ### CLICK ### DynamoDB does absorb some of the excess and tries to accommodate, but the difference here is too great and it starts throwing write requests away. 59
  61. I needed to change the scaling of DynamoDB to accommodate

    the writing of data. ### CLICK ### I could have configured DynamoDB to autoscale; however (at least from a documentation standpoint) it loses data before it scales up. ### CLICK ### I could set and forget the WCU in the range [1..10000]; however, the higher the WCU number the more it costs. 10000 WCU (10000KB per sec) cost ~$5500 AUD at the time. ### CLICK ### My writes were bursty: a period of intense write activity, then no activity for a month (in the case of postcodes). Using the SDK I wanted to configure the WCU as part of the upload process. 60
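A sketch of configuring the WCU from code with the AWS SDK for Node.js (v2); the table name and capacity numbers are placeholders:

```javascript
'use strict';

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'ap-southeast-2' });

// Set the table's provisioned write capacity. updateTable requires both
// read and write units; the read value here is just left at a small number.
function setWriteCapacity(tableName, writeCapacityUnits) {
  return dynamodb.updateTable({
    TableName: tableName,
    ProvisionedThroughput: {
      ReadCapacityUnits: 5,
      WriteCapacityUnits: writeCapacityUnits
    }
  }).promise();
}

// e.g. scale up before the bulk write, scale back down afterwards:
// setWriteCapacity('postcodes', 2000) ... setWriteCapacity('postcodes', 1)
```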
  62. The solution is to use Step Functions. This lets you

    define a state machine to control parts of your system. I made a generic state machine that was configured by a kickstart lambda. It would scale up a table, invoke a file processor lambda, and scale down the table. I wrote generic functions to handle the scaling and invoking. 61
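The linear flow described here could be expressed in Amazon States Language roughly like this; a sketch only, with placeholder ARNs and state names:

```json
{
  "Comment": "Scale up the table, process the file, scale the table back down",
  "StartAt": "ScaleUpTable",
  "States": {
    "ScaleUpTable": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-2:111111111111:function:scaleTable",
      "Next": "ProcessFile"
    },
    "ProcessFile": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-2:111111111111:function:processFile",
      "Next": "ScaleDownTable"
    },
    "ScaleDownTable": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-southeast-2:111111111111:function:scaleTable",
      "End": true
    }
  }
}
```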
  63. The state machine has a linear flow. The very first

    time you scale up it takes about 15 minutes (presumably as Amazon moves the table to different virtual infrastructure behind the scenes) After that the scaling up and down is instantaneous… 62
  64. …almost. DynamoDB configuration, like its data, is eventually consistent. Initially

    I put a 5 second pause in to allow for the small delay. Later I had another developer build a polling step as an exercise. I found I could get away with a WCU of ~1200, and set it to 2000 for good measure. I’m only charged for a small amount of high WCU when I write in monthly bursts. 63
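The polling step amounts to waiting until DynamoDB reports the table as ACTIVE again after the capacity change, rather than relying on a fixed pause; a sketch (the table name is a placeholder):

```javascript
'use strict';

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'ap-southeast-2' });

// Poll describeTable until the capacity change has settled (status ACTIVE).
function waitForActive(tableName) {
  return dynamodb.describeTable({ TableName: tableName }).promise()
    .then((result) => {
      if (result.Table.TableStatus === 'ACTIVE') {
        return result;
      }
      return new Promise((resolve) => setTimeout(resolve, 1000))
        .then(() => waitForActive(tableName));
    });
}
```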
  65. At this point the Cloud Native Platform team had a

    working platform. They wrapped the Serverless CLI with their own. It managed security as well as deployment, and compliance as code. They split the serverless.yml file in two. On one side you defined the infrastructure using the AWS Service Catalog: these were pre-approved, secure service configurations you would compose to build safe solutions. Security roles were also defined here. On the other side you defined your code and your storage. Because of this split you couldn’t define Lambdas to consume S3 anymore. The solution was to use SNS (Simple Notification Service). Lambdas in the second file could be configured to consume topics on SNS defined in the first file. 64
  66. All that remained was to get the file to S3.

    I created a SharePoint location and gave access to the Business Team responsible for the postcode file. Then I created a Team City task that would check the location for changes periodically and push the data to S3. 65
  67. The system was now fully automated for the postcode file.

    A business person could put the file onto Sharepoint every month, and it would automatically update the data used by the service. Remember how it used to take 2.5 years for this to happen? This process was a little faster… 66
  68. This is conservative. Processing is so fast the reporting of

    progress can’t keep up; it’s actually only about a minute end to end. The time saving in person hours alone pays for the entire cost of my development time. Factor in • All the on-prem infrastructure we no longer use / pay for • All the people from all the teams no longer trying to keep the burning platform operational • All the regained capacity/productivity from developers no longer micromanaging the process This is a huge result. 67
  69. Now we repeat the process for the specialists file. We

    put the file on the same data channel. The kickstart was updated to generate input for the state machine to work with a different table and a new file processor lambda. In this case data from the file is mixed with the data from the postcodes table and stored in its own DynamoDB table. I used this piece of work to onboard another developer with no prior experience with Serverless, AWS, or Node.js. After a handover from me it took him less than two hours to add the specialists file to the channel. It did take longer for him to code the file processor, mainly due to another limit in DynamoDB where we exceeded the maximum size allowed for storage against a single item (400KB). We discussed several solutions for this, but the easiest for him to implement was to • remove some of the data which wasn’t used by the Find Us app from the response • move some of the static data to be hardcoded in the Lambda serving the endpoint. 68
  70. So there it is; the complete Specialists endpoint data pump.

    With the addition of a simple lambda to serve incoming API requests and marshal data from a DynamoDB table, this is the process. 69
  71. So what did I learn building this endpoint? ### CLICK

    ### Pay more attention to the limits of Amazon services • When you’re in the weeds trying to learn this stuff, pay as much attention to these details as you can • The online help on the limits is good here, though I did find a limit that wasn’t documented and it burned me later • I didn’t cover all the DynamoDB limits here; there are more to be aware of ### CLICK ### Pay less attention to guidance from Amazon / Google • The general recommendations for using certain services are skewed towards larger solutions • My solution was relatively small; the guidance I found led me into a suboptimal solution architecture • There was a better way to do this; I’ll cover it in the final wrap-up ### CLICK ### 70
  72. Lots of capabilities of AWS • I got hands-on experience

    with the services I used • Through experimentation and making mistakes I got to know the Amazon services I used very well ### CLICK ### Lots about system monitoring • Each Amazon service has its own web console • Each console has lots of information to help build and run the service • DynamoDB is particularly complicated but has a great metrics console to help you diagnose issues #### REFERENCE DETAILS #### • Size of the node_modules folder affects online editing of a lambda: https://stackoverflow.com/questions/49060915/online-edit-amazon-lambda-function-with-alexa-sdk • DynamoDB only lets you scale down a small number of times before throttling you to 1 change every 4 hours. This hurts when you’re doing iterative development. The solution is to destroy your entire stack and redeploy with Serverless; a fresh table resets this counter! 70
  73. What would I improve? ### CLICK ### Blue/Green deployment •

    Deploy to pre-live table, test and switch • No down time during updates (even though this is a very short period, there’s still downtime) • Easy to switch back if a problem is found ### CLICK ### Remove DynamoDB + Step functions • Complex to use + paying more than required • S3 looks like it can handle the load • Likely free given amount of storage and requests 71
  74. { LAST SLIDE IN SECTION } This is what an

    S3 solution would look like. There are far fewer boxes and much less complexity. • Removed DynamoDB • Removed Step Functions / state machine + generic lambdas • Simpler flow using only the S3 data store • Blue/Green still possible, just using folders in the bucket Totally doable – a components-and-glue architecture pattern 72
  75. { SECTION 10 OF 13 } The final stage of

    MVP is to deliver the Bounded Locations endpoint. ### CLICK ### This endpoint serves requests for ATM, Store, and Business Centre locations within a bounded rectangle of latitudes and longitudes. It calculates distances from the provided location, and does some filtering. 73
  76. In this case we were changing where the data was

    coming from. Instead of recreating the existing SQL data source in the cloud, we were going to use a public API from CBA, our parent group. This source had location details of all the Bankwest ATMs. It didn’t have the fine-grained filter details for ATMs. Also, it didn’t have the Store or Business Centre location details. I created a workbook with this information, populated with the current data, to fill in the data the API couldn’t provide. 74
  77. It was simple to set up a lambda that would

    pull data from the CBA API on a schedule. The response was stored in S3. This fired an “all locations” lambda that would process the data for storing in an appropriate format. The best storage format would be driven by my choice of data store. 75
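A sketch of the scheduled fetch-and-store lambda; the URL, bucket and key below are placeholders (the real CBA endpoint and bucket aren't named in the slides), and the schedule itself would be attached as an event in serverless.yml:

```javascript
'use strict';

const https = require('https');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Fetch a URL and resolve with the raw response body.
function fetchBody(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Triggered on a schedule: pull the locations feed and drop it into S3,
// which in turn fires the "all locations" processing lambda.
module.exports.handler = (event, context, callback) => {
  fetchBody('https://example.com/bankwest-locations')   // placeholder URL
    .then((body) => s3.putObject({
      Bucket: 'locator-data',                            // placeholder bucket
      Key: 'cba-locations.json',
      Body: body
    }).promise())
    .then(() => callback(null, 'stored'))
    .catch(callback);
};
```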
  78. I had to choose a data store appropriate for finding

    the locations within a given bounded box. I started by looking for Amazon Services that supported geospatial functions and search. I found a couple of Amazon services with geospatial features, but none of them nicely supported what I was trying to do. Some had limited API support, some had serious cost implications and were too big for the problem I was solving. I was stuck with several options that I didn’t like, and no good ideas on how to proceed. 76
  79. I put the problem and my findings onto our internal

    public chat service to seek advice from colleagues. The suggestion quickly came back to do the logic myself using a brute force algorithm. I only had 17,000-odd locations in a 3MB JSON file, so it would probably be very quick to process this data in Node.js. 77
  80. NOTE: This is sample lambda code filtering the endpoint data,

    not the code for “all locations”. It was very fast on my dev machine. I came up with a JSON structure that would support doing this very fast. I experimented briefly with different key sizes and data hierarchies in the JSON, but only marginal gains were possible so I left it human readable. Here is a sample from the code for the brute force. 78
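For a sense of what a human-readable "all locations" structure could look like, here is a hypothetical sample; the field names and values are illustrative only, not the actual file:

```json
{
  "locations": [
    {
      "id": "ATM-0001",
      "type": "ATM",
      "name": "Perth CBD",
      "lat": -31.9523,
      "lng": 115.8613,
      "facilities": ["Deposit ATM"]
    }
  ]
}
```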
  81. I updated the file processor to transform and store the

    data. S3 is pictured twice here, but it is the same bucket; the duplication is only to simplify the diagram. Processing the JSON file on AWS Lambda was faster than locally, as you would probably expect. However, the performance of Lambda does swing wildly. I presume this is due to the virtualised infrastructure and multi-tenancy loads on AWS. It was still more than fast enough for my needs. 79
  82. Next I added the workbook to my file deployment channel

    and pushed it up to S3. I created a new file processing lambda to convert the xls file into JSON to be consumed by the “all locations” lambda. However, when I added a node module to handle the workbook I hit a problem. 80
  83. When I added the node module I used to process the

    workbook to the serverless.yaml file, the size of the node_modules folder crossed an undocumented threshold on Lambda. ### CLICK ### While the system continued to function correctly, the online code editor in Lambda failed to load, with an error saying that the solution size was too big. ### CLICK ### The size of the zipped-up packages was only 3.4MB, so this was disappointing. This issue impacted my ability to iterate with lambdas; I would be forced to do a full serverless deploy every time I wanted to tinker with them. Here’s a link on Stack Overflow where I talked about this issue. This issue wasn’t one I could live with, so I came up with an alternate solution architecture. 81
  84. I moved the code to convert the workbook into JSON

    back on-prem to avoid the issue. This restored my ability to edit lambdas online. I simply added a step to Team City to run some Node.js to do the conversion and ship the resulting JSON up to AWS. While I’m not happy that I had to change my solution architecture to manage this issue, the modular nature of the architecture made this a very easy thing to do and it was more of a blip than a blocker. 82
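The on-prem conversion step boils down to reading the workbook and writing JSON. A sketch, assuming the SheetJS xlsx package (the slides don't name the actual module) and placeholder file names:

```javascript
'use strict';

const fs = require('fs');
const XLSX = require('xlsx');   // assumption: SheetJS, not necessarily the module used

// Read the supplemental workbook and flatten the first sheet into JSON rows.
const workbook = XLSX.readFile('supplemental-locations.xlsx');   // placeholder file
const sheet = workbook.Sheets[workbook.SheetNames[0]];
const rows = XLSX.utils.sheet_to_json(sheet);

fs.writeFileSync('supplemental-locations.json', JSON.stringify(rows, null, 2));
// A following Team City step ships the resulting JSON up to S3.
```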
  85. Finally I modified the lambda that created the “all locations”

    JSON file to also be fired when the supplemental data was updated in S3. This meant it would merge data from CBA with the supplemental data whenever one of the sources changed. 83
  86. { SECTION 11 OF 13 } Now that I had

    all the data for the endpoint in s3 (and the data pump to update it), I needed to add the logic to the servicing lambda. 84
  87. Here’s almost the complete code for the brute force algorithm.

    First I needed to load the “all locations” file from S3 and filter out all the locations that weren’t in the bounding box defined by the coordinates passed in to the endpoint as part of the request. Here’s a sample of the code to do that. It’s a simple brute force algorithm that loops through each location and keeps it only if it falls within the bounding rectangle. As mentioned previously, this works fast enough on the amount of data we have (17,000 locations). 85
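A sketch of that filter step; the shape of the location objects and the request parameters are assumed, following the earlier hypothetical sample:

```javascript
// Keep only the locations inside the bounding box defined by its
// north-east (ne) and south-west (sw) corners from the request.
function locationsInBounds(allLocations, ne, sw) {
  return allLocations.filter((loc) =>
    loc.lat <= ne.lat && loc.lat >= sw.lat &&
    loc.lng <= ne.lng && loc.lng >= sw.lng
  );
}
```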
  88. Next I needed to calculate the distance of each location

    from the location passed as part of the request. 86
  89. Initially I reached for a node module to do geospatial

    calculation, but I didn’t want to add another node module because of that hidden lambda limit. Instead I used a simple algebraic equation to calculate the distance between two points on a sphere with the radius of the Earth. While this isn’t strictly accurate (it doesn’t account for the heights of the locations above sea level) it’s good enough for our use case. 87
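The slides don't show the exact equation used; a haversine great-circle distance is one common way to do this kind of spherical calculation, sketched here:

```javascript
// Great-circle (haversine) distance between two {lat, lng} points in degrees,
// treating the Earth as a sphere and ignoring altitude.
const EARTH_RADIUS_KM = 6371;

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

function distanceKm(a, b) {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) *
    Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}
```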
  90. All that was required now was to add the filtering

    of locations based on the user selections passed through from the UI in the request. I left this as an exercise for my understudy, but it was a trivial addition to the brute force algorithm. 88
  91. With the solution complete I was finally ready to do

    performance comparisons with the existing endpoint. I was aiming to perform at least as fast as the existing service. ### CLICK ### And it did. From a cold start, average query time came down from ~4.1s on the old to ~0.6s on the new. ### CLICK ### From a warm start, average query time came down from ~1s to ~0.25ms. ### CLICK ### Once connected to the app itself, the app came alive and felt a lot more responsive than before. It was fantastic to see all my efforts come together in such a positive outcome. 89
  92. { LAST SLIDE IN SECTION } So what did I

    learn building this endpoint? S3 + Lambda is more powerful than you may think. • S3 has very high limits that make it viable for document database scenarios • Lambda runs very fast once you’ve loaded data from S3 • Use the characteristics of warm Lambdas to only load data once per Lambda lifetime (5 mins) and minimise overheads (see the sketch below) Pay less attention to guidance from Amazon / Google • As mentioned before, you are guided away from S3 into data stores like DynamoDB and Redis • Such guidance is for use cases they don’t define well, and I suspect it is for larger systems than the one I’ve built here • Try to be mindful of this; your solution may not require as much compute and storage as you think AWS Documentation isn’t 100% Accurate 90
  93. • While it does a good job, only through experimentation

    do you understand the impacts of your decisions • Keep changes small and purposeful to protect yourself from hidden dangers Socialise your challenges • Try hard to recognise when you’re stuck • Even when you’re responsible, you don’t have to make decisions alone 90
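A sketch of the warm-Lambda caching idea mentioned above: keep the parsed "all locations" file in module scope so a warm container skips the S3 fetch (the bucket and key names are placeholders):

```javascript
'use strict';

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Module-scope cache: survives between invocations while the container is warm.
let cachedLocations = null;

function loadAllLocations() {
  if (cachedLocations) {
    return Promise.resolve(cachedLocations);
  }
  return s3.getObject({ Bucket: 'locator-data', Key: 'all-locations.json' })  // placeholders
    .promise()
    .then((obj) => {
      cachedLocations = JSON.parse(obj.Body.toString('utf8'));
      return cachedLocations;
    });
}
```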
  94. { SECTION 12 OF 13 } With that the solution

    was functionally complete. All that remained was to apply a few security measures: • Wrap the solution inside a VPC. • Put CloudFront in front of the API for protection against DDoS. 91
  95. ### CLICK ### Breaking AWS stacks and components when tweaking

    through the console • When you’re tweaking stacks and components using AWS consoles you can break your ability to redeploy over the top • Fortunately you can manually delete everything through the consoles and redeployment will work • With experience you learn what tweaks will break and what tweaks won’t ### CLICK ### Hidden limits screwing up your architecture and workflow • When you’re learning there is a mountain of information to take in • It’s hard enough learning enough to get started without having undocumented limits knocking you off your feet ### CLICK ### Online documentation pushing you in the wrong direction • I have the feeling a lot of the documentation and guidance on which services to use is biased towards systems bigger than this one • I’ve not found a yardstick to help decide whether your solution is large or small, 93
  96. and subsequently what services you should use. • Unfortunately that

    leaves you having to experiment yourself • Always socialise your solutions; with so much to concentrate on it’s easy to miss simple solutions ### CLICK ### Even with the greatest care, you’re going to make mistakes • Unless you’ve got a photographic memory, you’re going to forget important facts when making decisions • At some point it’s inevitable you will make a bad choice and waste some time • Be confident: the modular architecture will allow you to easily correct these mistakes later on 93
  97. ### CLICK ### Modular systems result in simple code that

    is easy to maintain • Systems with a modular architecture like AWS make solutions at this scale very easy to build and maintain • Let the infrastructure bear the responsibility for the solution architecture, not the code • This lets your code almost purely address solving the problem, not defining the solution architecture as well ### CLICK ### I admit to having a bit of a crush on modular systems ### CLICK ### Serverless Framework hides a lot of complexity • Greatly simplifies defining infrastructure • Its Domain Specific Language (DSL) hides the complexity of CloudFormation templates • Where it doesn’t provide a DSL, it still allows you to define CloudFormation • Very extensible, and great community extensions exist • Makes packaging and deploying a breeze with a single command 94
  98. ### CLICK ### Embedding infrastructure decisions within the development team

    • Typically, on-prem infrastructure choices are defined before development starts • This limits choices and creates a culture of stagnating infrastructure, with heavily coupled code, with high-cost support • Serverless systems allow iterative approaches to infrastructure/development workflow • No up-front infrastructure is required, it can be discovered through experimentation just-in-time • This means developers can choose appropriate infrastructure as the solution evolves • No reliance on other teams for infrastructure means no delays to development • Infrastructure as code gives tremendous flexibility to developers and allows supremely fast development cycles 94
  99. Cost savings from moving on-prem to AWS • All the

    usual things: • No paying for physical hardware • No paying for electricity, or cooling in data centres, or data centres altogether • No paying for teams of engineers to keep hardware working, patching etc ### CLICK ### Automating the development team out of the data management process • Acquiring developer time is no longer a blocker to the process • Liberating for everyone involved (business, developers & other IT support staff) • Business can get value to the customer sooner • Developers and other IT support staff can use more of their capacity to provide value elsewhere ### CLICK ### Taking a process from 2.5 years down to 2 minutes • Clearly a better experience for customers • One hell of a metric; captures everyone’s attention and imagination 95
  100. ### CLICK ### Pay more attention to documented limits •

    This is very hard to achieve when you’re starting out; there are so many things to learn that some details get deprioritised • Now that I know how important they are, I am better equipped to avoid infrastructure choices that lead to poor solution architecture S3 instead of DynamoDB • Using DynamoDB and Step Functions to control scale was a mistake in this use case • In future I will be more inclined to use S3 for key-value style data • S3 will be my first thought for storage solutions 96
  101. { LAST SLIDE IN SECTION } ### CLICK ### Putting

    the power to create infrastructure in the hands of developers is absolutely the right thing to do ### CLICK ### Only at the developer level is the impact of architectural choices revealed, and insight gained into how to make solutions work ### CLICK ### Especially in cloud solutions, the iteration between architecture and development efforts needs to be as tight as possible. It makes no sense for it to be handled by multiple teams in different contexts ### CLICK ### Serverless is a totally viable and enjoyable way to create distributed solutions using iterative design 97
  102. { LAST SECTION: CREDITS } Thank you for coming tonight

    I hope you enjoyed hearing my story and learned some things I’m available for questions anytime 98