During the last couple of months, I’ve been encouraged to think a lot about the new crop of AI, the LLMs and the GPTs, how they are being integrated into our day-to-day, but even more specifically, how they might affect my (and in general, software engineers’) work. This (perhaps rambling) piece of writing is the product of all of that.
This is, for the most part, a recounting of my personal experience with the newest wave of AI. This is not a thesis meant to reflect unbiased research into a topic. For some of the topics discussed below, I fully admit that I am starting from my gut and working backwards to try to find the reasoning (with all the faults that implies).
Stepping back a bit, earlier in the AI craze, before I felt pressured to have opinions and most of my experience was a short back and forth with the free version of ChatGPT, a lot of people would counter criticism of LLMs by saying that the free models are just not good enough. The stupid mistakes we were all seeing and making fun of, that’s all because of the limitations of the free models, the claim goes. Giving the paid models a chance was the way to become a believer.
I am willing to take that as a given. Let’s ignore the free stuff, ignore the blatantly obvious falsehoods that get generated by the popular free services, the ones that become running jokes and make the rounds on the social web, and focus on “professional” offerings.
The exploration started small. I decided the first step would be to set aside my hesitation about clicking the AI buttons that have started showing up on every website and application I interact with.
First, I started clicking on AI summaries for linked Jira tickets in Slack whenever I had the opportunity. Most of the time, generating the summary took longer than clicking through and reading the full description and some of the discussion on the actual ticket. I gave up on the “Summarise Ticket” button after a few days.
“Summarise this” buttons have since been showing up almost everywhere. I click them, from time to time, but almost never trust them enough to not have a look at the original text. I find that reading an LLM-generated summary often takes longer than skimming the real text, which gives me the same information with the same depth and granularity, assuming the original text is well structured. I suppose “skimming as a service” is a useful product, so I don’t dismiss it outright. However, in many cases, the full text is the full text for a reason. Broader context, small asides, and the writer’s personal ways of expression are usually smoothed out of existence when summarised by an LLM, and more often than not, they’re just as important as the bottom line. I also noticed, though I can’t prove, that poorly structured text tends to produce much worse summaries. This is especially obvious when the writer is a non-native speaker of the language. It often requires knowing the person, how they speak and communicate in that language more generally, to really understand what they’re trying to convey.
In early April I started a free 30-day trial of GitHub Copilot Pro. Compared to the sparkly button presses, this eventually turned out much better (though the bar was pretty much on the floor to begin with).
The first thing that caught my eye was the “Explain error” button on failed GitHub actions. When this worked, if it worked, which was about half the time, it pretty much just translated the error output to natural language, with no insight into the cause. Any attempt it made to suggest ways to fix the error boiled down to “change the code to not do this”. There is, I suppose, some value in this. Test log output is very often too verbose. In most cases, test failure output is interleaved with the output of every other, successful test, making it difficult to pinpoint the exact line that explains what failed and how. The proper solution would be to improve the system that generates test logs, but in the absence of that, a huge neural network acting as a text filter is maybe worth something.
GitHub Copilot’s pull request reviews are the first thing that genuinely and pleasantly surprised me. I used it a few times on a personal solo project with consistently good results. It caught a few stupid mistakes, the kind that many of us make when we’re not fully paying attention to our work and that any reviewer would catch in a first pass of a patch. There’s definitely value here for single-person projects or small overworked teams. Much like we have linters and formatting tools that let us focus on the more important things, an LLM’s review can save time by automating the first or second pass where the obvious things get caught. Beyond the stupid mistakes though, it also managed to catch more serious bugs, things that would require some understanding of what the patch intended to accomplish and how the code deviated from that. These parts of the review were sometimes hidden behind the “low confidence” fold, which is perhaps expected, since they aren’t as obvious or blatant.
As much as I found the Copilot review feature useful and interesting, I never used it on a project where other people are involved. It didn’t feel right to ask an LLM to do my review work for me. When someone spends time and effort writing a patch, I feel I owe them the review, instead of asking someone or something else to do it for me. Perhaps some people wouldn’t mind, but without knowing how another person feels about Copilot posting reviews on their patches, I will always default to avoiding it.
Going off a bit on a tangent, there has been a very noticeable trend of people apologising for something they produced (emails, presentations, code) not being as good because they used an LLM to make it. It relates directly to the reason I avoided having Copilot review other people’s work. When it doesn’t work, no one feels accountable for the decision, not even the person that delegated work to it. If my code, or review, or email is bad, that’s on me and the tool and methodology choices I made along the way.
Stepping up the game a bit, I decided to try some real code generation. I didn’t want to stress test the idea at first, didn’t want it to write something that required understanding the whole project. So I asked it to write the Go structs and code for an osbuild stage, something that has many examples in the project to learn from. I fed the JSON schema for the osbuild stage into the Copilot chat and asked it to write Go code to generate the stage with its options and also do it in a way that’s compatible with the rest of the stages in the project.
The result was okay. It needed a bit of work: some parts were clearly wrong, though not way off the mark, and many others needed cleaning up for readability. I struggle to say how much time this saved me, if any. Perhaps the task was too small to draw any meaningful conclusions. I’ve been led to believe that newer, code-specific tools that can ingest whole projects are much better at generating code that matches a project’s style and conventions. I haven’t had any experience with those yet.
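For context, this is roughly the shape of what I was asking for. The sketch below is illustrative only: the stage name, the option fields, and the wrapper types are made up and heavily simplified, not the actual osbuild Go API, but it captures the pattern of a schema-derived options struct plus a constructor that follows the same convention as the other stages.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the project's stage wrapper; the real Go types
// are richer, this only mirrors the overall shape.
type StageOptions interface{ isStageOptions() }

type Stage struct {
	Type    string       `json:"type"`
	Options StageOptions `json:"options,omitempty"`
}

// ExampleStageOptions mirrors the options object of a hypothetical
// org.osbuild.example stage schema.
type ExampleStageOptions struct {
	Filename string `json:"filename"`
	Compress bool   `json:"compress,omitempty"`
}

func (ExampleStageOptions) isStageOptions() {}

// NewExampleStage follows the constructor convention used by the other stages.
func NewExampleStage(options *ExampleStageOptions) *Stage {
	return &Stage{Type: "org.osbuild.example", Options: options}
}

func main() {
	stage := NewExampleStage(&ExampleStageOptions{Filename: "disk.img", Compress: true})
	out, _ := json.MarshalIndent(stage, "", "  ")
	fmt.Println(string(out))
}
```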
Writing tests is often one of the least interesting parts of software development. It can be fun and interesting sometimes, when it involves finding smart ways to check for edge cases, or when simplifying and improving parts of the test infrastructure, but most of the time, tests are repetitive and require little to no creativity. Therefore it’s of little surprise that an LLM is actually quite good at writing unit tests. I have to assume that the nature of unit testing—repeating the same function call with small variations, building an input-to-expected-output mapping and comparing expected with actual outputs—is as straightforward for an LLM to write as it is tedious for a human.
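To make that concrete, this is the kind of test I mean: a table-driven Go test where the same call is repeated over a small table mapping inputs to expected outputs. The function and the cases here are hypothetical, chosen only to show the pattern.

```go
package pad

import "testing"

// leftPad is a stand-in function, just so the test has something to call.
func leftPad(s string, width int) string {
	for len(s) < width {
		s = " " + s
	}
	return s
}

// TestLeftPad is the shape of test an LLM churns out easily: one call,
// repeated over a table of inputs and expected outputs.
func TestLeftPad(t *testing.T) {
	cases := []struct {
		name  string
		in    string
		width int
		want  string
	}{
		{"shorter than width", "42", 5, "   42"},
		{"already at width", "hello", 5, "hello"},
		{"longer than width", "toolong", 5, "toolong"},
		{"empty string", "", 3, "   "},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := leftPad(tc.in, tc.width); got != tc.want {
				t.Errorf("leftPad(%q, %d) = %q, want %q", tc.in, tc.width, got, tc.want)
			}
		})
	}
}
```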
I didn’t keep track of every unit test I wrote with the help of an LLM. I did it at least a handful of times, but every time, I had to fix up some details, like making the testing setup and teardown match other tests in the same project more closely, or adding an extra edge case or two. In a couple of cases, the test was written to pass on buggy behaviour because the function itself had a bug. In all such cases, a review of the testing code caught the buggy behaviour in the function. Perhaps the bug would have been caught by Copilot reviewing the code and its tests, but I never tried them in combination that way.
A long time ago, but not long enough that we’ve forgotten about it, AI was a dream for a future where fully autonomous machines or programs would give us information, do research, and control systems using only natural, conversational language. Super-intelligence was part of that dream, sometimes, but the less flashy, more down-to-earth version of it is human-level AI, passively existing in the world around us—in the office, in the kitchen, on the bridge of our own personal starship—that takes care of the boring stuff. This is also the promise we’re hearing about today’s LLMs and AI agents. But instead of getting excited every time the ✨ sparkles or the word AI appears on a website or in an app, I can only respond with indifference, in the best case, and dread in the worst.
Every app and website now has an AI helper. 10 years ago it was useless support chatbots, so you might say this isn’t new. What’s new is that, while the old chatbots were glorified search engines for the service’s help and support documentation, the new chatbot is instead a glorified text-auto-complete with a propensity for making stuff up. The chatbot button on the support page used to make me think “I’m going to have to navigate through the chat version of a phone tree to get some real help, aren’t I?”. The modern AI button instead makes me think “I’m going to try this twice and then develop a new type of banner blindness for it, aren’t I?”.
A few days ago my task manager grew a new button that asks users to “Speak your thoughts, we’ll organize them”. It does nothing of the sort. I’ve tried all sorts of ways to talk to it and every time, all I get out is a bullet point list of tasks created in my Inbox, roughly one task per sentence I spoke. It doesn’t “organise” anything (for example, into different projects or subtasks), nor does it pick up on any mention of when something is due. It just drops some bare tasks into the inbox for me to organise. The same task manager has a natural language quick-add feature, which supports markup for quickly adding a task to a project, tagging it, and setting its priority, and it picks up pretty much any description of a date or recurrence (tomorrow, end of week, every second Wednesday, etc.) and sets up the task accordingly. This has existed since day one and is immensely more useful than a poor attempt to take every sentence I say, express it in imperative form, and make a bulleted list out of it. I don’t mind that there’s a new button. I’m very good at not clicking buttons I don’t find useful. But I do have to wonder how much time, energy, and effort went into building this feature, what it’s trained on, and whether it’s understood well enough to be meaningfully improved.
Which brings me to a broad concern about the current state of this technology. Entire businesses are being built that rely, sometimes completely, on a commercial service as their primary infrastructure. Most of the big computing platforms these days are closed and proprietary (mobile phone OSes and app stores, the big social media sites), fully owned and controlled by a handful of companies. This isn’t a recent phenomenon, but lately it seems a lot of people are realising that the foundations on which so much is being built can quickly become entirely hostile to them. Similarly, LLM companies are asking us to build infrastructure on services that were built at the cost of the GDP of a small country, while they charge us $20-200 / month, have yet to make a profit, and expect us to trust that these services will neither disappear nor become unaffordable. I wouldn’t trust something like this for a dogfood subscription lest I become dependent on the convenience, let alone build work and productivity habits, or worse, actual products on it.
Speaking of productivity, on a recent episode of a podcast I follow, the hosts were addressing a listener question asking how to tell if and when the investment into a new productivity tool or method is worth the time and effort. They wanted to know if and how one could measure or evaluate the cost of switching to a new tool or way of doing things vs the potential productivity increase in the long run. Ignoring the obviously correct answer from Randall Munroe, the response from one of the hosts, one that resonated with me strongly, is that it doesn’t matter. Sometimes, most of the time, the productivity optimisation is part of the hobby. It’s fun. It’s interesting. It’s more than an exploration into alternative ways to accomplish a task, it’s a way of examining new ways to think about a problem. I get to peek into the mental model that other people have of the problem and its solutions and sometimes, if I’m lucky, I find the model that completely aligns with my own and I adopt the same solutions, I make them my own. Thinking about this, I realised that faffing with an LLM is completely uninteresting. There’s no methodology there. There’s nothing to learn, nothing to understand. It’s an exercise in delegation (something I will admit I struggle with more generally), but one where the thing to which the task is being delegated will learn nothing from the process and produce things no one can explain.
I want to make one thing very clear: I enjoy what I do and I enjoy writing. I enjoy writing prose, though I don’t know if I’m particularly good at it. I enjoy writing code, which I’m probably a bit better at. I love thinking through a problem. I draw genuine pleasure out of programming and engineering solutions, even to simple, trite problems. I’ve come to understand that this isn’t universal.
I’m also becoming more aware that a lot of people, perhaps myself included sometimes, don’t fully understand what they’re making. We’ve all been there; we write some code, maybe call into a third-party API, it doesn’t do exactly what we expected, so we start fiddling until the expected thing happens. For some, it goes as far as copy-pasting sample code, often without reading it, running it to see what happens and tweaking until it’s just right. This never sat well with me. Even when I engaged in the fiddling-until-it’s-right practice myself, I would feel so uncomfortable with how I got to the result, that I would spend an inordinate amount of time reverse engineering my own code to fully understand what it was doing. I won’t claim that this makes me better at my job, in fact, if anything, it likely makes me slower. I’ve come to understand that this isn’t universal.
We all joke about blindly copy-pasting from Stack Overflow so much that they made a novelty keyboard about it (originally an April Fools’ joke, but later a real product). The left-pad incident showed us how many people would rather plop an import statement into their code than think through a problem as simple as string padding. Entire careers are spent rewriting the same god-damned login page. I know that very little of what I do is novel, though I have a habit of making just about anything interesting for myself. But this is all to say that we all sort of knew that half of us were on autopilot all the time and the rest of us half the time. We call what we do “engineering”, as if we were building bridges meant to last centuries, when instead we were cobbling together flimsy Lego bricks into even flimsier infrastructure. We tolerated it. And now we’re getting what we deserve. The Lego bricks are assembling themselves. Someone built a cement mixer where bricks tumble around until they accidentally fit together well enough to resemble the design specification, if you squint hard enough. We had it coming. Vibe coding is just the logical conclusion of things that have been building up for the last decade or two. We’re done even pretending that anyone knows what they’re doing and we’re content with letting a technical debt generator run the show.
Aside: A bonus personal problem is that my eyes hurt from rolling every time I hear the phrase “vibe coding”, which is happening multiple times a day now. I should maybe get that looked at.
I can understand how most people are fine with not knowing what an LLM is doing when they interact with one. No one understands every piece of technology they interact with. Every single one of us, every single day, uses things we don’t fully understand. It’s practically impossible to live any other way. But every time I saw some software do something impressive, I could always ask “How does it do that?” and get a satisfying answer. There was always enough to satisfy my curiosity about how the thing works. In fact, in every case, there was more than enough information: I could always dig deeper, and I could decide how deep I wanted to go. I don’t do this with everything. In fact, for the longest time, I didn’t even realise, had not truly internalised, how knowable most things are, especially in the world of computers and technology.
I think we can all agree that most people don’t care how their computer or the software running on it works. So when the LLM comes along and does genuinely impressive things, generates whole essays worth of text on a particular subject, it’s normal to be amazed, call it magic, and keep going. But when an LLM starts losing the thread, when it starts writing sentences that are completely and obviously untrue, or code that doesn’t work, or images that don’t make sense, there is no satisfying answer to the question “How or why did it just do that?”. The only answer is a shrug and “Well, that’s just where the RNG landed this time”. This is hardly a satisfying way to live. It’s definitely not a good way to work if your work involves engineering. Do I know every part of how my editor, terminal, desktop environment, or operating system works? No, not even close. But I know that for each of them, for each component, for each module and function, there’s someone out there that understands it. More importantly, there are plenty of people that can understand it if needed. I also like to think that, with enough time and effort, even I can understand it. I just don’t understand how one can look at that and say they’d rather have unknowable software and more of it, please.
I’ve been procrastinating on writing this for about 2 months now. As time went on, I started getting frustrated at the requirement that I do something with AI. I wasn’t asked to do much. In fact, I was told that I could have recorded a short video where I have an LLM generate some unit tests and call it a day. That somehow made it worse. Ticking boxes and moving on. I really wanted to do something useful. I wanted to actually learn something. I decided a better use of my time would be to examine that frustration and figure out where it comes from.
Let me be clear here, though. The LLM craze is not entirely uninteresting. The technology is genuinely impressive and worth researching. What I find uninteresting is the constant drive to find problems for this particular solution. I am not at all interested in adapting my daily life and work to fit a tool that has not proven its value and makes no promises about its long-term viability and availability. I am even less interested in that tool if its inner workings are completely unknowable. To put it another way, I would never build a workflow around a tool (say, a code editor) that cannot guarantee (to a reasonable degree) that it will still be around and affordable to me in 5 or 10 years and, on top of that, cannot be fully understood, debugged, or even modified in a predictable way.