Less than a year ago, I started as a Security Engineer/Architect @Cimpress.
I had to face a great challenge: create security services on AWS, and I knew almost nothing about it. I had just a theoretical understanding of cloud and cloud security but never had the chance to play with it.
As soon as I started, I felt a compelling need to automate everything (as I always do: automate everything). I then started a journey into the DevOps world.
I had heard from several friends about CloudFormation, Terraform, Packer, Docker, Jenkins, etc. but had never dug into them too much. Now I had to start working with them, so I braced myself and started studying.
I decided to go with Terraform as my IaC tool of choice for several reasons that I’m not going to explain here.
After almost a year, having built several different versions of our services’ infrastructures, and having created, destroyed and migrated them, I want to share some tips about Terraform I wish I’d known before I started, more or less like this inspirational blog post I found while looking for AWS tips: https://wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started/
I’m not an expert, I’m not a guru, and my journey into SecDevOps has just started, but I feel that sharing this experience can save you a lot of time and pain, especially if you are thinking of starting this adventure or have already started but are at an early stage.
As a side note: I already knew some of the concepts/best practices below when I started, but I decided to ignore them or put them on hold until I reached the point where I had to implement/use them. The problem was: when it happened (because it happens), it was very painful to redesign everything, and I wish I had followed them from the beginning.
Finally, I’d suggest reading the following article, as it gives some insights for baking security into AWS through Terraform: https://blog.threatstack.com/incorporating-aws-security-best-practices-into-terraform-design
Let’s start then:
Use remote state
By default, Terraform creates a local terraform.tfstate file where it saves the state of the resources. While this is OK for testing purposes, when you go live you don’t want to have such an important file on your local hard disk. So many bad things may happen, and you may lose that file forever and have to reverse engineer your infrastructure (which is often impractical).
Remote state saves you from pain and fatality. I’m using an S3 bucket as a remote backend: every time I spin up a new infrastructure, Terraform stores the tfstate file in an AWS S3 bucket, so I don’t have to worry about it, since it’s in a safe place. For a list of supported backends take a look here: https://www.terraform.io/docs/backends/types/.
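A minimal sketch of what an S3 remote backend block can look like. The bucket name, key and region below are placeholders, not the values used for the services described here, and the bucket is assumed to exist already.

```hcl
# Hypothetical values: replace bucket, key and region with your own.
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"                      # assumed, pre-existing bucket
    key     = "service_1/customer_a/prod/terraform.tfstate"  # path of the state object inside the bucket
    region  = "eu-west-1"                                    # region of the bucket, not necessarily of your resources
    encrypt = true                                           # ask S3 to encrypt the state object at rest
  }
}
```

After adding a block like this, running terraform init is what configures the backend; if a local state file is present, Terraform offers to copy it to the remote location.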
Why do you need to know this in advance?
Switching from a local state file to a remote one is now painless: after adding the configuration lines to your template file, Terraform will ask you if you want to move your local state straight to the remote one.
That’s OK if you manage 2-3 infrastructures. It’s a different story if you manage 20 of them. Starting early will save you a lot of time.
Do not pass AWS credentials/regions explicitly (i.e. via variables)
I started by passing the access key, secret key and region via the -var switch. It was horrible. Then I moved to the tfvars file. Better. Then I had to rotate my keys: it was a nightmare. I had to go through all my infrastructures (at the time, thank God, just a handful of them) and change the keys manually.
Shortly after, I realised I could have leveraged Terraform’s smartness: when you don’t pass credentials explicitly, Terraform looks for the AWS credentials in ~/.aws/credentials and for the region in ~/.aws/config, the files created by the AWS CLI. You can easily change the default profile or just tell Terraform to look for a different profile name using shell variables (or by editing the above-mentioned files).
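As a rough illustration, a provider block that relies on the AWS CLI files can be nearly empty. The profile name in the comment is hypothetical, and whether the region is picked up from ~/.aws/config can depend on the provider version, so treat that part as an assumption and fall back to the AWS_DEFAULT_REGION shell variable if needed.

```hcl
# No access_key, secret_key or region here on purpose:
# nothing sensitive lives in the Terraform files, so key rotation
# only touches ~/.aws/credentials.
provider "aws" {
  # The profile is selected outside Terraform, e.g. by exporting
  # AWS_PROFILE=security-tools (hypothetical profile name) in the shell.
  # Alternatively, uncomment the next line to pin a named profile:
  # profile = "security-tools"
}
```

Pinning the profile in the file is a trade-off: it makes the configuration explicit, but everyone who runs it then needs a profile with that exact name.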
Why do you need to know this in advance?
Rotating AWS keys is a must in every organisation (if it’s not in yours, well, you are doing cloud security wrong), so you will soon hit this painful issue. I’ve seen tons of examples in which the authors pass the keys via -var or inside files. I think it’s totally unnecessary, since you can leverage the AWS CLI’s configuration files. If you don’t have the AWS CLI, well…you are doing SecDevOps wrong.
Use template_file to autogenerate bash scripts or configuration files
It’s almost impossible to keep a userdata script static, i.e. without passing any parameter to it, and the same goes for other scripts you may want to upload/pass to your instance. At the beginning, I created a bash script that, thanks to a null_resource and sed, replaced some placeholders with actual data (e.g. the database’s IP, a bucket name, etc.). It was clunky and a mess, but it worked.
Then a friend of mine told me about template_file. It was amazing. I was able to generate files based on variables without the need of an external script. More info here: https://www.terraform.io/docs/providers/template/d/file.html
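A small sketch of the template_file data source, written in Terraform 0.12+ syntax. The variable names and the userdata.sh.tpl file are hypothetical; inside that template file the placeholders would be written as ${db_ip} and ${bucket_name}. Newer Terraform versions also provide a built-in templatefile function that covers the same use case.

```hcl
# Hypothetical inputs that the rendered script needs.
variable "db_ip" {
  description = "Private IP of the database"
  type        = string
}

variable "bucket_name" {
  description = "Name of the S3 bucket the instance reads from"
  type        = string
}

# Renders userdata.sh.tpl, replacing ${db_ip} and ${bucket_name}
# with the values passed in vars. No sed, no null_resource.
data "template_file" "userdata" {
  template = file("${path.module}/userdata.sh.tpl")

  vars = {
    db_ip       = var.db_ip
    bucket_name = var.bucket_name
  }
}

# The rendered text can be passed to an instance's user_data;
# it is exposed as an output here only to keep the sketch short.
output "rendered_userdata" {
  value = data.template_file.userdata.rendered
}
```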
Why do you need to know this in advance?
My natural approach was to create a bash script to generate files. I think it’s a pretty common “instinct” to do things the way we are used to. In this case, Terraform had already thought about it and created the template_file data source, which makes everything easier and cleaner. You cannot imagine the hours I spent writing the bash scripts and ensuring that everything worked as expected, only to discover template_file and have to toss all the work I had done. A disgrace.
Use modules. Always.
Code reuse is one of the main concepts of IaC: you create a piece of infrastructure, and you use it multiple times. Think about a service you deploy for different customers: same infrastructure, with just a few parameters changing (e.g. a name, an ID code).
If you are starting this journey, you may want to create the infrastructure in different files together in the same folder (e.g. a file for the S3 bucket resource, another file for IAM roles, etc.), and I understand why: it’s easy, it’s clean, it’s logical. What happens when you create a service and need to spin it up for a different customer? You copy and paste the files and make the relevant changes to fit the customer’s needs. Now, you find out your infrastructure has a bug and you need to fix it. After spotting the bug, you have to fix it for every customer you have spun up that service for, i.e. editing the same file N times (if the bug resides in only one file…).
With modules, this is not going to happen: you fix the bug just once and the new files are shipped to every customer (thanks to git, more later on).
Creating modules is no different than creating programming languages’ classes: you have to picture in your mind a stub, a driver, and then create the module. Calling the module from your main file should be easy and straightforward.
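To make the idea concrete, here is a sketch of a main file that calls one module for two customers. The module path and its input names (customer_name, customer_id) are hypothetical; in practice each customer would likely live in its own directory with its own state, and they are shown side by side here only to highlight the reuse.

```hcl
# Same module, different parameters: the infrastructure code exists once.
module "service_customer_a" {
  source = "../modules/service" # hypothetical local module path

  customer_name = "customer-a"  # hypothetical module inputs
  customer_id   = "A001"
}

module "service_customer_b" {
  source = "../modules/service"

  customer_name = "customer-b"
  customer_id   = "B002"
}
```

A bug fixed inside the module directory is picked up by both calls on the next plan/apply, which is the whole point of the tip.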
Why do you need to know this in advance?
Moving from “flat files everywhere” to modules is a big change for Terraform, and there is no easy way to migrate your previous infrastructure to a new one (if you don’t have many resources you can try the terraform state mv command). Refactoring your infrastructure to use modules often means destroying and recreating it. So be sure to start early and stick with modules.
Use a VCS
Version Control Systems are wonderful allies when it comes to IaC: you can track changes, revert them, compare them, and you can distribute your code easily. When you use modules, you can specify a VCS as the source to pull the code down from. In the example above, when you find a bug in your infrastructure, fixing it is easy: just push the fixed code to the VCS and, from your service’s directory, tell Terraform to download/update the module. Voilà, you have seamlessly patched your service for all your customers (to make it effective you still have to run apply, but be sure to run a plan first).
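A sketch of a module pulled straight from a git repository. The repository URL, the tag and the input names are made up for illustration; the git:: prefix and the ?ref= query string are the standard way to point a module source at a specific branch, tag or commit.

```hcl
module "service" {
  # Hypothetical repository; ?ref pins the module to a tag,
  # so a fix is shipped by pushing a new tag and bumping this value.
  source = "git::https://example.com/infra/terraform-service.git?ref=v1.2.0"

  customer_name = "customer-a" # hypothetical module inputs
  customer_id   = "A001"
}
```

Pinning a ref is a design choice: without it every update pulls the latest code, which patches everyone quickly but also means an untested change can reach production on the next apply.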
Start deploying your AWS VPC with Terraform. Thank me later.
Usually, soon after your first access to the AWS Console, you start creating your environment: at least one VPC, a couple of subnets, NACLs, route tables, etc. This is great. I created (in only one region) two VPCs, 12 subnets (6 subnets across 2 availability zones), some NACLs, standard security groups, a couple of DNS names, route tables, NAT gateways, and so on. Everything in the Console.
Then we had a need: to spin up the same infrastructure in a different region. An exact copy. This turned out to be a nightmare (and it’s still ongoing): you have to ‘reverse engineer’ your infrastructure and translate it into Terraform.
This is not even the worst thing that can happen. What if you need to change a NACL? Doing it manually is fine, but if something breaks, or you simply want to keep track of changes, there is no way you can do it unless the environment is defined in Terraform. It starts becoming a hell of a day.
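As an illustration of how little code the basics take, here is a sketch with one VPC, one subnet and one NACL. All CIDR ranges, names and rules are invented for the example and are not the layout described above. The availability zone is looked up from whatever region the provider is using, so the same code can be applied to a different region without edits.

```hcl
# Availability zones of the region the provider is currently pointed at.
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # example range

  tags = {
    Name = "main"
  }
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = data.aws_availability_zones.available.names[0]
}

# Every NACL change now goes through plan/apply and version control.
resource "aws_network_acl" "private" {
  vpc_id     = aws_vpc.main.id
  subnet_ids = [aws_subnet.private_a.id]

  ingress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "10.0.0.0/16" # HTTPS from inside the VPC only
    from_port  = 443
    to_port    = 443
  }

  egress {
    rule_no    = 100
    protocol   = "-1" # all protocols
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 0
    to_port    = 0
  }
}
```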
Do not create and manage your EIPs with Terraform. Something can go terribly wrong.
In an early version of one of our services, I created and managed EIPs inside Terraform so I didn’t have to create them manually. One day, I had to migrate one of those services due to an infrastructure refactoring (as mentioned multiple times above). It was in production and the customer had hardcoded the EIP into their scripts for whatever reason, so I could not change the EIP. I carefully planned everything, every step, to minimise the downtime and avoid losing data. I spun up the new infrastructure, checked that everything was working fine, then manually switched the EIP from the old infrastructure to the new one. Sweet, everything was working perfectly with less than 1 minute of downtime (just the EIP switch).
Happy enough, I decided to destroy the old infrastructure. Terrible mistake. I realised it too late: since I created and managed the EIP inside Terraform, the destroy command released the EIP. Everything stopped working. The customer reached out to me with “I see nothing is working and our scripts are failing…is everything ok?”.
Now, you may well imagine that when you release an EIP, it’s gone. Gone forever. That’s because it can be reassigned to someone else around the globe that requests a new EIP, and new EIPs are requested every minute.
But I’m a man of strong faith. So I decided to call Amazon Support immediately. The support guy was amazing and within minutes I had the EIP back reassigned to my AWS account.
What’s the point of the whole story? For some critical resources, I found it safer to create them manually (just some examples: ACM, EIP, DNS).
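One possible way to combine a manually created EIP with Terraform is to look the address up with a data source and manage only the association. This is a sketch under assumptions, not the setup used in the story above: the IP is a documentation address and instance_id is a hypothetical input.

```hcl
variable "instance_id" {
  description = "Hypothetical input: the instance that should receive the EIP"
  type        = string
}

# The EIP was allocated by hand; Terraform only reads it.
data "aws_eip" "service" {
  public_ip = "203.0.113.10" # placeholder address
}

# Terraform owns the association, not the address itself.
resource "aws_eip_association" "service" {
  instance_id   = var.instance_id
  allocation_id = data.aws_eip.service.id
}
```

Because data sources are read-only, a destroy removes the association but leaves the address allocated to the account.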
Use ‘env’ or a structured folder layout to separate your prod and dev environments
If you follow some of the tips outlined here, you will soon face another issue: how do you organise your files? I started by having both production and testing environments in the same directory as the Terraform files, with different state files (when I didn’t use the remote backend). When you switch to a remote backend and start modularising your infrastructure, it becomes a nightmare to maintain.
I found myself very well organised by having the following directory tree structure:
service_1/
|
|-> customer_a/
|   |
|   |-> dev/
|   |
|   |-> prod/
|
|-> customer_b/
|   |
|   |-> dev/
|   |
|   |-> prod/
Your code should look like a main.tf file, an output.tf and maybe a variables.tf and terraform.tfvars files. That’s all.
Keeping your code organised is important if you don’t want to go crazy when handling multiple customers and multiple services.
Terraform introduced the env command (renamed workspace in later releases). I’m quite sure I could clean up my directory structure by using it, but I haven’t tested it yet. It sounds promising, so check it out.
Filter AMIs by tags. Do not hardcode AMI IDs (same goes for dynamic resources)
I started by hardcoding AMI IDs into the variables.tf file. It seemed reasonable at the time: I had my nice, rock solid, well tested AMI and I wanted to use precisely that for building my service. Then I started thinking about patching: how to handle AMI patching? I started looking into Packer and found it awesome, so I created a script to automatically patch certain AMIs (based on tags). But how to update Terraform’s infrastructures with the new AMI ID? I needed a pipeline. I ended up using Jenkins + Packer + Terraform, writing Packer’s output AMI ID into Terraform’s files. But hardcoding the AMI ID into the variables.tf file still seemed shitty to me, even though I had introduced an automated way to do so.
I then decided to tag my AMIs with a ‘Service’ tag and filter the ID based on it, as specified here: https://www.terraform.io/docs/providers/aws/d/ami_ids.html. This way I don’t have to touch my Terraform files: all I need to do is run Packer, wait for it, and then run an apply. Terraform will automatically detect that a new AMI is available, create a new launch configuration and attach it to the autoscaling group (then you have to manually shut down the instances to apply the new launch configuration, but that’s another story…)
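A sketch of the tag-based lookup. Note that it uses the aws_ami data source with most_recent rather than the aws_ami_ids data source linked above, because it returns a single image directly; the tag value, name prefix and instance type are placeholders.

```hcl
# Newest AMI owned by this account that carries the Service tag.
data "aws_ami" "service" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:Service"
    values = ["my-service"] # hypothetical tag value
  }
}

resource "aws_launch_configuration" "service" {
  name_prefix   = "my-service-"
  image_id      = data.aws_ami.service.id # changes whenever Packer publishes a new AMI
  instance_type = "t3.micro"              # placeholder size

  # Launch configurations are immutable: build the replacement
  # before the old one is removed.
  lifecycle {
    create_before_destroy = true
  }
}
```

With a lookup like this, no AMI ID appears anywhere in the Terraform files, so a new image only requires a fresh plan and apply.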
Do not use remote-exec
If you’re using the remote-exec provisioner to do stuff, you’re doing something wrong. I had this mongodb replica cluster, not under an autoscaling group (ASG): it was just 3 servers. I had to configure the master and the slaves, and I thought: this is a one-shot configuration, they are ‘static’ servers (i.e. not terminated and created as in an ASG). And then: I’m going to configure the servers with remote-exec after they have been spun up. Jumping onto the bastion host and then onto the machines, running the configuration scripts, win.
There are so many wrong things with that approach:
- servers may fail due to AZ failures, or something may go crazy and you have to nuke the machine. It didn’t happen to me, but it’s still a possibility
- it feels like cheating: since you can use it as if you were in an SSH session on the machine, you’ll start using it as a replacement for SSH, and your automation suffers
- it adds complexity: you have to configure the connection to your bastion host and then to the machine itself (even more: what if your bastion host authenticates SSH with 2FA? Impossible to use remote-exec)
I fixed it by moving the remote-exec script into a userdata script attached to the mongodb servers (with different parameters depending on whether it’s running on a master or a slave). They are still not under an ASG, but this way I don’t have to explicitly connect to any box over SSH.
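A rough sketch of the userdata approach for three static servers. It assumes Terraform 0.12+ (for the built-in templatefile function) and a hypothetical mongodb-init.sh.tpl script that branches on a ${role} placeholder; the variables and the instance type are also placeholders.

```hcl
variable "ami_id" {
  description = "Hypothetical input: AMI used for the mongodb servers"
  type        = string
}

variable "subnet_id" {
  description = "Hypothetical input: subnet the servers are placed in"
  type        = string
}

resource "aws_instance" "mongodb" {
  count         = 3
  ami           = var.ami_id
  instance_type = "t3.medium" # placeholder size
  subnet_id     = var.subnet_id

  # The first server is configured as master, the others as slaves.
  # The script runs on first boot: no SSH connection, no bastion hop.
  user_data = templatefile("${path.module}/mongodb-init.sh.tpl", {
    role = count.index == 0 ? "master" : "slave"
  })

  tags = {
    Name = "mongodb-${count.index}"
  }
}
```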
Do not trust plan
Before running an apply, I always run a plan. It gives me some degree of confidence that my Terraform infrastructure is going to work. Unfortunately, it only shows you which resources it’s going to create, not in which order, nor whether there is malformed JSON somewhere or whether you put underscores in names where AWS accepts only hyphens. validate is not of great help either. Take plan with a grain of salt and prepare yourself for an infinite journey of pain.
Those are just some points I feel are important when you start working with Terraform, and they are by no means exhaustive.
I’m more than happy to hear from you about the tips you wish you’d known before you started using Terraform.