Craig Glennie

Retrospective on infrastructure and database technologies used at Pyrra Tech

Sadly my time as Lead Engineer at Pyrra Tech is coming to an end, which makes it a good time to write a retrospective of the technology choices I made while I worked at Pyrra.

This section is focused on infrastructure and database technologies. Another post will follow soon about application and framework choices.

Background

Pyrra’s two main products are a web application for media monitoring and alerting, and a data API that provides an interface to our underlying data store (Elasticsearch). I designed and built most of both products, and both were greenfield projects. So I got to make a lot of technology choices, both for the software and the underlying infrastructure.

Infrastructure

While building the web-app I also wanted to get our infrastructure in a good place. We were using AWS but it was configured entirely by click-ops. I knew I needed Infrastructure as Code, but the last time I did any IaC work was with Chef in 2012. I didn’t like Chef back then, so I definitely wasn’t going to start a project with it in 2022. I talked to a good friend who’s forgotten more about DevOps than I’ll ever know, and he strongly recommended using Pulumi to manage the infrastructure, based on his good experience with it. I took his advice.

Pulumi

TL;DR: Overall 👍. Slight initial learning curve but would use again. Would be cautious about using it with EKS.

Pulumi is an IaC platform that lets you configure infrastructure using your programming language of preference, provided that preference is one of TypeScript/JavaScript, Python, Go, C#, Java, or YAML (who chooses YAML?). I went with TypeScript because the rest of the application code was going to be written entirely in TypeScript, and because I liked TypeScript’s type system more than Python’s at the time.

I used Pulumi to provision everything the products needed. I figured we’d get a dedicated DevOps person one day, and that they would want to use k8s, so to avoid them having to rework things I chose to start with k8s and used Pulumi to configure both the cluster and the container deployments.

In general Pulumi worked really well. It took some time to get my head around Pulumi’s not-a-Promise Outputs but I loved the way I could write my infrastructure like regular TypeScript (compiler type-checking for AWS infra config! functions!) and Pulumi would work out the dependency chain for me and deploy everything in the right order. Provided the code compiled there was a pretty good chance it would deploy correctly. That saved some time, which is good because waiting for infra to fail due to a misconfiguration is a huge drag on iteration speed when you’re trying to set stuff up. I also liked the Crosswalk system that “use(s) automatic well-architected best practices to make common infrastructure-as-code tasks in AWS easier and more secure.”
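
To give a concrete flavour, here’s a hedged sketch (not our actual code, and assuming recent versions of @pulumi/awsx and @pulumi/eks) of a Crosswalk VPC feeding its Outputs into an EKS cluster:

    import * as awsx from "@pulumi/awsx";
    import * as eks from "@pulumi/eks";

    // Crosswalk VPC: subnets, route tables, and NAT gateways get sensible defaults.
    const vpc = new awsx.ec2.Vpc("app-vpc", { numberOfAvailabilityZones: 2 });

    // The cluster consumes the VPC's Outputs directly; Pulumi works out that the
    // VPC has to exist first, so there's no manual dependency wiring.
    const cluster = new eks.Cluster("app-cluster", {
        vpcId: vpc.vpcId,
        privateSubnetIds: vpc.privateSubnetIds,
        publicSubnetIds: vpc.publicSubnetIds,
        desiredCapacity: 2,
        minSize: 1,
        maxSize: 3,
    });

    // Exported Outputs show up in `pulumi stack output` once the deploy finishes.
    export const kubeconfig = cluster.kubeconfig;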

I did run into a few hassles with Pulumi, though.

Multiple Projects

If you spend more than a little time with Pulumi you’ll realise that you need multiple projects for different portions of your infrastructure. Having all infra in one project means that if you tear down your application you’ll be tearing down your database, VPC, IAM… everything. This would be a really bad time! You want to be able to work on a portion of your infra without worrying about modifying other parts.

This is fully supported by Pulumi: you can create a project that sets up your networking, and then you can export the VPC IDs from that project and import them into another one. Pulumi keeps track of everything and it basically just works. But the actual functionality of importing config leaves something to be desired: you import variables by name, and if you import a value that wasn’t exported by the source project then it’ll return undefined at runtime and you’ll get an error somewhere down the line. Likewise, everything’s a string, even if the type of the exported value was something else, like a number or an array.

I worked around this by creating a shared import helper that threw an error if there wasn’t an exported config value matching the provided name. This allowed the code to fail more quickly, usually before Pulumi had tried to apply any changes. It also took a Zod schema and applied that to the config value, so that code that expected an array or number could act as if it got the expected type back (otherwise an error was thrown), which allowed for more strongly-typed code.
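
Here’s a minimal sketch of that kind of helper, assuming config is read via pulumi.Config and validated with Zod (the names are illustrative, not our real code):

    import * as pulumi from "@pulumi/pulumi";
    import { z } from "zod";

    // Read a config value that another project exported, fail fast if it's missing,
    // and run it through a Zod schema so callers get back a properly-typed value.
    export function requireSharedConfig<T>(name: string, schema: z.ZodType<T>): T {
        const raw = new pulumi.Config().get(name);
        if (raw === undefined) {
            throw new Error(`Missing config value "${name}": was it exported by the source project?`);
        }
        // Everything arrives as a string, so parse JSON for arrays/objects before validating.
        const parsed = raw.startsWith("[") || raw.startsWith("{") ? JSON.parse(raw) : raw;
        return schema.parse(parsed);
    }

    // Fails during preview if the value is missing or the wrong shape.
    const privateSubnetIds = requireSharedConfig("privateSubnetIds", z.array(z.string()));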

Editing the State File

Pulumi tracks the current state of everything in a state file. This is the dependency tree, the underlying IDs (AWS ARNs, for example), etc…

For some reason it seems inevitable that you’ll end up editing the state file by hand sometimes, particularly if you’re making lots of changes to the infra in quick succession and deploying each change to see if it works correctly. Pulumi provides some tooling for this (export the state file, import it, or modify individual values via the CLI), and I get that this exists so that you’re never completely screwed. But it feels like a bug every time you need to think about or edit the state file. I don’t know what was happening under the hood to cause Pulumi to seemingly lose track of things sometimes, and I could usually fix it (worst case: blow away all the infra and deploy fresh), but it did undermine my confidence somewhat. To be clear, this was mostly a problem during infra development when I’d be testing many changes and running pulumi update frequently; once the changes were settled and it was time to apply them to other environments it usually just worked. Which brings me to EKS.

EKS

I had real problems working with EKS (AWS’s managed k8s service). I was trying to prototype a k8s upgrade process that consisted of replacing the older cluster with a brand new one, partly inspired by Why Kubernetes Needs an LTS. Pulumi got extremely confused here, repeatedly losing track of the state of the cluster: it would often try to apply changes to a cluster that it had torn down. There’s an environment variable PULUMI_K8S_DELETE_UNREACHABLE that would sometimes fix the error, but then it would recur. I’d often see this error message:

error: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "https://<redacted>.us-east-1.eks.amazonaws.com/openapi/v2?timeout=32s": dial tcp: lookup <redacted>.us-east-1.eks.amazonaws.com on 127.0.0.53:53: no such host
    If the cluster has been deleted, you can edit the pulumi state to remove this resource

pulumi refresh did not work here, which was annoying as it’s exactly the kind of problem refresh is supposed to solve.

Editing the state file here sucked: there were so many dependencies hanging off the cluster that it was very hard to update things without making a mistake and creating an invalid state. And isn’t manually editing dependency files the kind of thing I’m using Pulumi to avoid? I don’t know why EKS caused problems, but there are a number of GitHub issues where people have run into this kind of error. This was back in late 2023 so it might have been addressed by now.

Kubernetes

TL;DR 😐. Overkill for my team.

I configured an EKS cluster for the web application and API containers (just three containers!) because we thought the Data Science team would also need somewhere to deploy their ML services. Fargate doesn’t support GPUs so k8s seemed like the logical option. I also figured an eventual DevOps person would want to use k8s so we may as well start there. In the end the DS team split between GCP (free credits) and AWS, and set up their own k8s. We never hired a DevOps person. This was a lesson in YAGNI for me.

To be fair, once it was configured it worked really well, except for the handful of times where it would complain during a release that there were insufficient resources available to schedule containers, but then it would resolve and work anyway. I loved that a deployment would roll out with no downtime, and though I think that resource constraints are more subtle than they seem at first glance (and there’s conflicting advice about how to allocate CPU and memory) I got them to work.
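
For reference, the requests and limits live on each container in the Deployment spec, and the rolling update strategy is what gives you the zero-downtime deploys. A hedged sketch via the Pulumi Kubernetes provider (the numbers are made up, not what we actually ran):

    import * as k8s from "@pulumi/kubernetes";

    const appLabels = { app: "web" };

    // Requests drive scheduling decisions; limits cap actual usage.
    new k8s.apps.v1.Deployment("web", {
        spec: {
            replicas: 2,
            selector: { matchLabels: appLabels },
            strategy: { type: "RollingUpdate", rollingUpdate: { maxUnavailable: 0, maxSurge: 1 } },
            template: {
                metadata: { labels: appLabels },
                spec: {
                    containers: [{
                        name: "web",
                        image: "example/web:latest", // placeholder image
                        resources: {
                            requests: { cpu: "250m", memory: "512Mi" },
                            limits: { memory: "512Mi" }, // CPU limit deliberately left off
                        },
                    }],
                },
            },
        },
    });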

One of the primary features of k8s is the ability to pack containers into nodes to maximise hardware utilisation. It’s a core idea in how k8s is designed. That’s a very I-own-my-own-hardware problem. In AWS there are hundreds of instance types, and there’s almost certainly one that’s very close to the right size for any given container workload. In the end I decided it was much simpler to have instance-per-container than many containers per instance. Then I didn’t have to think much about allocating resources or contention on an instance.

EKS is still sitting there, and I expect it to continue to work. We had a much simpler ECS/Fargate replacement in the wings but it hasn’t been deployed - even worse than making a major change on Friday is making one just before you leave the company!

Kubernetes is obviously working for a lot of people; that’s why it’s become so popular. We never hired a DevOps team, though, and it was a lot of infrastructure for just a web app. What ultimately turned me off k8s for my team were two things:

So. Much. Configuration.

At least compared to ECS/Fargate. And it took me so long to get the networking right. I had to learn about EC2 Managed Node Groups and Launch Templates though I didn’t really want to think about either. Configuring the ALB meant first realising that magic annotations are the way, and then finding those magic annotations. Realising that a lot of stuff works via annotations and you just have to deploy it to see if you got the annotation right was a real disappointment, especially as I’d been rejoicing about all the type safety I was getting from Pulumi.
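
To give a flavour of the annotation-driven config, here’s roughly what an ALB-backed Ingress ends up looking like with the AWS Load Balancer Controller (a sketch; the hostname and service name are placeholders):

    import * as k8s from "@pulumi/kubernetes";

    // The ALB's behaviour is driven almost entirely by annotations; a typo only
    // surfaces when the controller tries (and fails) to reconcile the Ingress.
    new k8s.networking.v1.Ingress("web", {
        metadata: {
            annotations: {
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
                "alb.ingress.kubernetes.io/target-type": "ip",
                "alb.ingress.kubernetes.io/listen-ports": '[{"HTTPS": 443}]',
            },
        },
        spec: {
            ingressClassName: "alb",
            rules: [{
                host: "app.example.com", // placeholder hostname
                http: {
                    paths: [{
                        path: "/",
                        pathType: "Prefix",
                        backend: { service: { name: "web", port: { number: 80 } } },
                    }],
                },
            }],
        },
    });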

Cluster upgrades are a pain in the ass

Matt Duggan wrote a good post about this: Why Kubernetes Needs an LTS

As I mentioned above, I think the best way to upgrade in the cloud is to replace an old cluster with a new one. So much easier than trying to go through an upgrade process, and it also functions as a test of your ability to provision from scratch. Now, if you’re doing click-ops and AWS hasn’t identified any compatibility issues then yeah you could YOLO that upgrade button, but what if the upgrade fails? What if AWS didn’t identify all the possible issues? What’s your plan for testing compatibility with the new version first? Basically you need to deploy a new cluster in order to test compatibility anyway, so you may as well make “deploy a new cluster” the actual upgrade process. No chance of getting stuck halfway, that way (for very simple setups like ours).

Hosted Elasticsearch

TL;DR 🎉. Great service, but you still have to know about managing Elasticsearch. Would use again.

If you want to do full-text search with aggregations, and need to be able to scale beyond a single machine, then Elasticsearch is still the go-to choice. Maybe really the only choice. I did give OpenSearch a look but I was worried about adopting a less-used fork with an unclear future - I know AWS doesn’t tend to drop support for things, but they could effectively kill OpenSearch and it wouldn’t hurt them at all. Whereas Elastic is committed to Elasticsearch.

And as we were going to pay for hosting anyway I’d rather give money to Elastic, who did almost all the work, than to AWS who’s still mostly free-riding (and we’re giving plenty of money to AWS already!). Even if Elastic did just jump on the AI bandwagon and rename their product to Elastic Search AI Platform …

Overall hosted Elasticsearch was great. Scaling worked, backups were on by default and restore worked perfectly. I was nervous the first time I clicked the “upgrade” button, despite having tested it on a backup (did I mention how easy it is to restore backups for this kind of thing? It’s great). But it upgraded without issue or downtime, every time.

Hosted UI is confusing

There’s a lot of UI provided by Elastic if all you want is Elasticsearch. Some of it is UI for managing Elasticsearch (which runs on your Elasticsearch instance, and will therefore go down if, hypothetically, you run out of scroll contexts!). Some of it is for the various services and products that build on Elasticsearch, many of which you may not be using. Bits of the UI look quite different and it can be unclear where you should be looking. I ended up bookmarking my most-used admin pages, which I rarely have to do.

You still have to know a lot about Elasticsearch

The hosted service is just hosting the instances and managing backups, scaling, and upgrading. Which, don’t get me wrong, is still a huge burden taken away. But you still have to think about sharding, indexes, what node types you need, ILM policies, etc… If you don’t know how to set up Elasticsearch you can still run into problems with the hosted service (most likely related to sharding) that will prevent scaling.
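
For instance, even on the hosted service you’re the one choosing shard counts and lifecycle policies when an index is created. A sketch using the @elastic/elasticsearch client (v8-style API; the index name, numbers, and policy name are illustrative):

    import { Client } from "@elastic/elasticsearch";

    const client = new Client({
        node: "https://example.es.us-east-1.aws.found.io", // placeholder endpoint
        auth: { apiKey: process.env.ELASTIC_API_KEY! },
    });

    // Shard count is fixed at creation time; get it wrong and you're reindexing later.
    await client.indices.create({
        index: "posts-000001",
        settings: {
            number_of_shards: 3,
            number_of_replicas: 1,
            "index.lifecycle.name": "posts-policy", // the ILM policy still has to exist
        },
        mappings: {
            properties: {
                published_at: { type: "date" },
                body: { type: "text" },
            },
        },
    });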

When someone invents a service that actually abstracts Elasticsearch away to an endpoint for ingestion and endpoints for search, that’ll really be something (probably an expensive something). I feel like if Warpstream can build Kafka on S3 then maybe (maybe with S3 Express?) we could see something like that for search.

API key management is annoying

Only the user who created an API key can update it. Doesn’t matter if you’re an admin, account owner, whatever: if you need to add a new index to an API key created by someone else then you just can’t. If that user is on holiday then I guess you’ll be creating a new API key and rolling that out to your application. It’s pretty annoying.
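
The “create a new key and roll it out” workaround looks something like this with the @elastic/elasticsearch client (a sketch; the key name and index patterns are made up):

    import { Client } from "@elastic/elasticsearch";

    const client = new Client({
        node: "https://example.es.us-east-1.aws.found.io", // placeholder endpoint
        auth: { apiKey: process.env.ELASTIC_ADMIN_KEY! },
    });

    // Create a replacement key scoped to the indices the app actually needs,
    // then roll the new credentials out to the application.
    const key = await client.security.createApiKey({
        name: "web-app-replacement", // placeholder name
        role_descriptors: {
            app_read: {
                indices: [{ names: ["posts-*", "alerts-*"], privileges: ["read"] }],
            },
        },
    });
    console.log(key.id, key.encoded); // `encoded` is what goes into the app config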

Likewise, there’s no easy way to find which key needs updating. You’ll just get an error pointing you to the user ID that created the key, but if that user created multiple keys you’ll have to hope that your services are using the keys you expect, because there’s no way to check.

dotenv vault

TL;DR 🙂. Worked great, but I always found the tooling confusing.

It’s clear that a lot of thought has gone into dotenv-vault. I really liked having secrets encrypted in the vault file and just needing to supply a single environment variable (the vault key) to the application. Changing application config (env vars in a non-vault world) meant updating the vault, committing it, then rebuilding and deploying the app. In our workflow that was much easier than editing environment variables. It worked perfectly, but something about the command line tooling just never quite clicked with me. I was always having to check the docs and occasionally getting confused about which commands did what. Perhaps it’s just because we didn’t change env vars all that often.
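
The runtime side was pleasantly small. As I remember it (and assuming the standard dotenv package’s .env.vault support, which may have changed since), the app just needs DOTENV_KEY set and an ordinary config() call:

    import * as dotenv from "dotenv";

    // With DOTENV_KEY set in the environment, dotenv decrypts .env.vault and loads
    // the values into process.env; without it, it falls back to a plain .env file.
    dotenv.config();

    const elasticApiKey = process.env.ELASTIC_API_KEY; // placeholder variable name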

We are paying customers, and at one point it looked like dotenv-vault was being sunset in favour of a new system called dotenvx. As of today I’m not sure what the future is, or which system would be the one to get started with.

Twingate

TL;DR 🎉. Worked great with minimum fuss. Easy to get started. Would definitely consider using again.

When I set up our AWS infra with Pulumi I wanted to make sure that nothing was publicly accessible unless it needed to be. I also took a pass over the click-ops infra that had been set up before I started, and restricted access to that where possible. So now we needed a way for engineers to connect to the infra. Enter Twingate, “a central Zero Trust orchestration layer”.

I found it very easy to deploy tiny Twingate EC2 connectors into our VPCs, or Twingate containers into k8s. There’s a Pulumi integration, too, which worked. My only tiny niggle was that it felt like I was constantly being emailed to upgrade because a patch or minor version had been released. To their credit, I could go a long time without updating and I don’t think it ever caused a problem, but it was hard to tell whether any particular release mattered to me, because there were so many of them.
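
The Pulumi side was a few resources per VPC. A very hedged sketch (written from memory, so treat the package and resource names as approximate):

    import * as twingate from "@twingate-labs/pulumi-twingate"; // package name from memory, may differ

    // One remote network per VPC, plus a connector whose tokens get passed to the
    // EC2 instance (or k8s Deployment) running the connector image.
    const network = new twingate.TwingateRemoteNetwork("app-vpc", { name: "app-vpc" });
    const connector = new twingate.TwingateConnector("app-vpc-connector", {
        remoteNetworkId: network.id,
    });
    const tokens = new twingate.TwingateConnectorTokens("app-vpc-connector-tokens", {
        connectorId: connector.id,
    });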

The Mac app worked like a charm: try to connect to Postgres and it would pop a notification asking me to authenticate. If I got through the Google SSO process fast enough the database client might even connect before it timed out. The app lowers the barrier for anyone wanting to connect to the infrastructure. No arcane command line incantations, no VPN messing up the rest of your internet (I know it doesn’t have to be that way). It worked and then it stayed out of the way.