A fast-growing number of companies are migrating applications to the cloud for a variety of reasons, such as improved agility, availability, and security. Cost savings play a role too, but those come mostly from secondary factors, like outsourcing the compliance burden.
At Poplatek we’ve worked in the cloud for about 7 years, leveraging automated deployments with declarative configurations from the beginning. Our first large-scale cloud migration was moving the payment processing of Veikkaus (the Finnish national betting agency) to AWS. It was a large migration in a mission-critical environment with many pressure points. At the same time, we successfully verified that Poplatek’s PCI DSS compliant hosting environment is generic and modular enough to scale to multiple customers.
I acted as the development lead on Poplatek’s side of the Veikkaus migration project, working closely with both our expert team and the team at Veikkaus.
The Veikkaus project was unique and challenging for many reasons, but maybe two of the biggest ones were the real-time transactional nature of the data and the different levels of security requirements that come with the mandatory PCI DSS compliance. (We have to fulfill a list of 250 requirements since we’re processing credit card transactions.)
Since the Veikkaus games continuously generate new transactions, minimizing downtime was very important: any outage directly affects business operations and causes operational losses.
Understanding the System
It goes without saying that it’s critical to understand the system requirements, architecture, integration points, and data models, and to have proper domain knowledge of the compliance requirements. You also need to understand the key players and, for example, whether software components need to be updated or possibly re-implemented in a cloud-native way.
The Veikkaus migration project involved tens of integration points across multiple on-prem data centers and points of contact. It turned out to be quite challenging, as the firewalls and routers had many different settings along the connection path (asymmetric routing with load balancers, spoofing protection, etc.). It was crucial to conduct automated connection testing to and from AWS whenever possible, and to keep the tests up to date across all environments (testing, staging, and production).
Our testing environment had a VPN instance acting as the gateway from on-prem to the cloud. On this instance we could easily capture packets with tcpdump when needed. This turned out to be useful both for developing the access controls in the cloud and for building the automated testing suite, by confirming whether traffic did or did not reach the cloud.
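A minimal version of such a connection test can be sketched in Python. The endpoint names and ports below are hypothetical, and real tests covered both directions and many more cases, but the shape is the same: a list of endpoints that must be reachable, and a list the firewall rules must block.

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Per-environment expectations (hypothetical examples):
# endpoints that must be reachable, and endpoints that must be blocked.
EXPECTED_OPEN = [("payments.example.internal", 443)]
EXPECTED_CLOSED = [("payments.example.internal", 23)]

def run_checks() -> list:
    """Collect human-readable failures instead of stopping at the first one."""
    failures = []
    for host, port in EXPECTED_OPEN:
        if not check_tcp(host, port):
            failures.append(f"expected {host}:{port} to be reachable")
    for host, port in EXPECTED_CLOSED:
        if check_tcp(host, port):
            failures.append(f"expected {host}:{port} to be blocked")
    return failures
```

Running the same script from both sides of the VPN, per environment, gives a quick and repeatable picture of the connection paths.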
The same tests were used in staging and production environments, but with corresponding environment specific settings (e.g. IP addresses). We used pystache (Mustache for Python) to template the tests with the environment data, as we were already familiar with it.
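The idea is simple: keep one template per test and render it once per environment with that environment’s values. The tiny Mustache-style renderer below is a stand-in to show the mechanism (in practice we used pystache itself; the template and values are made up):

```python
import re

def render(template: str, env: dict) -> str:
    """Substitute {{name}} placeholders from env, Mustache-style.
    A minimal stand-in for pystache.render to illustrate the idea."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(env[m.group(1)]), template)

# One template, rendered with environment-specific data (illustrative values).
test_template = "nc -z {{db_host}} {{db_port}}"
staging = {"db_host": "10.1.0.5", "db_port": 5432}
production = {"db_host": "10.2.0.5", "db_port": 5432}

print(render(test_template, staging))     # nc -z 10.1.0.5 5432
print(render(test_template, production))  # nc -z 10.2.0.5 5432
```

Keeping the environment data separate from the test logic is what makes it practical to run identical tests in testing, staging, and production.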
The infamous TCP MTU setting (and TCP MSS advertisement) is easily forgotten, especially, but not solely, when VPNs are involved. So test with large packets, too.
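One way to exercise the large-packet path is to push a payload well above a single MTU through a connection and verify it arrives intact. A sketch, assuming a simple echo service on the far side (a hypothetical setup, not our actual test suite):

```python
import socket

def echo_large_payload(host: str, port: int, size: int = 64 * 1024) -> bool:
    """Send a payload far larger than one MTU to an echo service and verify
    it comes back intact. Paths with broken MTU/MSS handling tend to pass
    small-packet tests but stall or drop data on payloads like this."""
    payload = b"x" * size
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(payload)
        s.shutdown(socket.SHUT_WR)  # signal EOF so the echo server replies
        received = b""
        while True:
            chunk = s.recv(65536)
            if not chunk:
                break
            received += chunk
    return received == payload
```

Small ping-style checks will happily pass over a path that silently drops full-size segments; a test like this catches the difference.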
Known Platform Tooling
Have familiar and battle-tested IaC tools on hand, and all configurations in a version control system.
We’ve been doing DevSecOps for many years and have sorted out the required interfaces for containers, how AWS works, and what limitations it has. It’s critical to understand the cloud.
Poplatek jumped on the Docker bandwagon quite early on, adapting our hosting environment to fully integrate containers: we took proper care of log collection, secret management, and health checking for them, for example. Our brilliant AWS engineering team integrated container management into our IaC platform, together with immutable images and repeatable builds at all levels.
Proper Staffing and Planning
Have coding cloud architects and domain experts on the team. Use testing, staging, and production environments in the cloud as well. Have proper project management that brings all parties into the loop on time and explains what changes are coming, so they can prepare and arrange, for example, connection testing and problem-solving sessions.
Regarding migration projects, it is crucial that the customer is fully engaged with the transition.
Moving to the cloud often means that, for example, the TLS endpoints change, and with them the ciphersuites and TLS protocol versions are updated. In some cases integrations therefore need to be updated to allow these upgrades. IP addresses change as well; better yet, the move to cloud computing creates the need to use DNS-based endpoint addresses.
Divide the architecture into natural service components (if it isn’t already), based on their criticality and on how easily each can be switched between the on-prem and cloud setups (data migration).
Move less critical components to AWS first, possibly with a load balancer that can act as a switch between on-prem and the cloud.
Moving some pieces to the cloud meant living with an architecture where some components were in the cloud and some were not. This adds considerable overhead in calendar time and actual work, but on the other hand it improves confidence when migrating a large system.
Scalability, HA, Testing and Rollback
Do performance and scalability testing on mission critical infrastructure components (e.g. real-time transactional systems).
Design for HA and include the possibility for seamless rolling upgrades, and test them.
We appreciate simplicity and practical approaches. The load profile of transaction processing didn’t vary enough to require automatic performance scaling for the backend systems. We did, however, design for high availability, for which the cloud provides good tools: the system was built so that one availability zone could go down and it would still be operational.
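That single-AZ-failure property can even be checked mechanically from instance placements. A small sketch of the idea (service names and zones are illustrative, not our real topology):

```python
def at_risk_services(placements: dict) -> set:
    """placements maps service name -> list of availability zones its
    instances run in. A service survives the loss of any single AZ only
    if its instances span at least two AZs; return the services that don't."""
    return {svc for svc, azs in placements.items() if len(set(azs)) < 2}

# Illustrative placements only.
placements = {
    "tx-processor": ["eu-west-1a", "eu-west-1b"],
    "report-batch": ["eu-west-1a"],
}
print(at_risk_services(placements))  # {'report-batch'}
```

A check like this is cheap to run in CI against the declarative configuration, so single-AZ assumptions don’t creep back in unnoticed.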
We also made sure there were well-tested rollback plans.
We tested different options for migrating data to the cloud, but settled on database-level replication with accurate timestamps for rollbacks. Running continuous replication against the live system, and then syncing to a stopped database, let us perform the actual switch to the cloud within a minimal time window. Of course, we had to verify that the continuous replication would not harm the processing itself. We also measured how long it takes in practice to get the data into the cloud.
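In essence, the cutover decision reduces to a replication-lag check against the agreed downtime window. A simplified sketch (the timestamps would come from the source and replica databases; the names and the window are illustrative):

```python
from datetime import datetime, timezone

def replication_lag_seconds(source_latest: datetime,
                            replica_latest: datetime) -> float:
    """Lag between the newest transaction on the source and the newest
    transaction visible on the replica, in seconds."""
    return (source_latest - replica_latest).total_seconds()

def safe_to_cut_over(lag: float, window_seconds: float = 60.0) -> bool:
    """Only stop the source and do the final sync once the replica is
    within the agreed window, keeping the switchover downtime minimal."""
    return 0.0 <= lag <= window_seconds

src = datetime(2018, 5, 1, 12, 0, 30, tzinfo=timezone.utc)
rep = datetime(2018, 5, 1, 12, 0, 5, tzinfo=timezone.utc)
print(replication_lag_seconds(src, rep))  # 25.0
```

Measuring this continuously before the switch is what made the short downtime window a plan rather than a hope.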
And yes, we tested the rollback too. We also discussed and designed steps that would need to be taken to migrate data back to on-prem in case that would become necessary later on post-migration.
Success and Smooth Sailing
Enjoy your new cloud infrastructure, but don’t forget to clean up any on-prem data in a proper manner — e.g. PCI classified cardholder data 😉
As a whole the Veikkaus migration was a success, thanks to careful planning and execution. For example, the most critical migration phase of the transaction processing system took only 3 hours (less than the planned downtime). This included all communication between teams and affected parties.
The migration stayed smooth throughout the project (despite some time pressure), and the final switchover was virtually invisible to customers. To quote the ICT director of Veikkaus: “It was amazing that no one called to report any problems after the switchover”. We definitely aced it.
Poplatek is a proud member of the AWS Partner Network. We are leading professionals of AWS maintenance and innovative cloud service development. And we’re hiring!