Hiring Software Engineers in a ChatGPT World

Hiring good software engineers has always been a challenge. Large language models (LLMs) have made many routine tasks easier for humans. But by making it easier to apply for jobs, LLMs have also made hiring much harder.

Hiring is Hard Enough Already

At Hop Labs, we hire and work remotely. We take hiring seriously, because we want to find the most talented, motivated, and interesting people to work with. This takes dedicated effort to do well, and as a small company, hiring takes a significant proportion of our time. The availability of LLMs like ChatGPT has made this more difficult by increasing the number of applications that look great on paper but don't hold up to scrutiny.

The "scrutiny" is what we want to discuss here. How can we discern the authentic candidates from the AI-generated ones when the effort required to spoof an application has become so low? Importantly, how do we do this without spending all of our time on hiring?

Baby With the Bathwater

One option is to ratchet up the difficulty for candidates. We could design an elaborate gauntlet of tests that hopefully stays ahead of the machines. This is probably a losing battle over time, but more immediately it is a grueling experience that would turn away many good-faith candidates. The candidate we eventually hire is a real human we hope to work with, and the interview process sets the tone for that future relationship. It would be nice if we could avoid treating them like a petty thief.

Another approach is to be unfairly selective about whom we consider as candidates. We could, for example, only hire people we already know, or only people with prestigious pedigrees. This fails in a number of ways. Mainly, even if we could verify a candidate's background cheaply and quickly, doing so would reinforce existing biases in tech-industry hiring. Candidates from top schools, well-regarded open-source contributors, and the people in our extended network tend to come from a narrow range of backgrounds. The last thing this industry needs is more homogeneity in its workforce; there are great people from diverse backgrounds whom we would be worse off for excluding.

One thing to note before moving on is that we don't need a perfect system. LLMs make submitting job applications easier across the board, but some applications take more work to cheat on than others. Our goal is to structure the application process so that applying isn't worth the effort for unqualified applicants, without imposing undue difficulty on good-faith candidates. That alone filters out many low-effort attempts and keeps them from clogging up later parts of our hiring funnel.

You might say, "Why do we care if candidates are using LLMs?" If they can "trick" our hiring process, couldn't they just do the job itself? Probably not. Hiring assessments are designed to be self-contained simulations of the work an engineer does daily, simplified enough that they aren't unduly burdensome for the candidate or the assessor. The day-to-day reality of the role is messier, with lots of backstory, layered context, and personal and organizational dynamics. The hiring assessment gives LLMs an unfair advantage because all of the necessary context generally has to be made available up front. (This is often but not always true; we'll talk more about technical assessments later.)

Challenges

How exactly can LLMs be used to "cheat" the hiring process? At the risk of adding fuel to the fire, we'll describe a few ways in more detail, then share mitigation strategies below.

The first thing we've noticed is that it is significantly easier to apply for jobs now. LLMs can generate short responses to any question. They can even regurgitate whatever language the company has in its website's "About Us" section. It's hard to tell whether any given application is from an individual simply trying to send their own resume out to more companies with less effort, or from a service that offers to do this for a fee. In either case, these candidates are less likely to be seriously interested in our particular job, and their resumes are more likely to be distorted or embellished to appear more attractive to us.

Another major category of spam applications consists of what appear to be entirely fabricated candidates. These candidates have stellar resumes from top-tier companies that are filled with techno-jargon, and they average around three-year stints at each company. This problem is surely not new in the hiring space, but LLMs have made it much easier to produce more convincing fake resumes than ever before.

This goes beyond the scope of this article, but it's worth mentioning – foreign-state actors may also use LLMs to impersonate legitimate candidates. For example, KnowBe4 published a story about hiring a candidate who turned out to be working for the government of North Korea.

It Can’t Be Hopeless

We still need to hire, and we still need to screen an increasing volume of candidates, both genuine and dubious. The trick is to proceed in a manner that preserves the humanity of the (human) candidates while not burning everyone's time. Below we outline some strategies that we've used in our hiring pipeline to mitigate these types of abuse and strike a good balance between rigorous and onerous.

Screening Resumes

The application form is the first chance for candidates to use LLMs to generate or augment their submission. Fundamentally, we don't have a problem with candidates using tools to build resumes, but we obviously have a problem with fabricated job histories. Because the tooling will likely evolve over time, the best advice here is to look for common patterns. We've found many too-good-to-be-true resumes that share similar formatting, use of company icons, (too) consistent three-year stints at previous companies, and so on. None of these attributes alone is enough to conclude anything, but together they are enough to warrant further scrutiny. It could be worth a little time looking into the currently available automated resume and job-application tools and services to get a sense of what to look for.
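
To make the "(too) consistent stints" signal concrete, here is a toy heuristic in Python. The function name and thresholds are our own inventions for illustration, and extracting stint lengths from a resume in the first place is left aside.

    def stints_look_uniform(stints_in_months, target=36, tolerance=6):
        """True if every stint is within `tolerance` months of `target`.

        Real work histories usually show natural variation in tenure; a
        run of near-identical three-year stints is one weak signal.
        """
        if len(stints_in_months) < 3:
            return False  # too little history to judge
        return all(abs(s - target) <= tolerance for s in stints_in_months)

    print(stints_look_uniform([35, 37, 36, 34]))  # True: suspiciously uniform
    print(stints_look_uniform([14, 52, 36, 29]))  # False: natural variation

As with the other attributes above, a positive result here should only prompt a closer human look, never an automatic rejection.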

We find that short-answer application questions help filter the very low-effort candidates out of the process. Of course, LLMs are great at answering questions, but these answers generally don't hold up to scrutiny (at least, for now). For one, they don't show as much variation as one would expect from a national hiring pool. It's helpful to keep a running document of similar-sounding responses for quick comparison later. For example, our "tell us about a system you built" question received a surprisingly high proportion of responses about preventative maintenance in industrial settings, many with very similar phrasing. Again, a candidate using a model to help them write a compelling answer about their own past experience can be fine, but using it to invent an answer wholesale is not.
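
As a rough sketch of that comparison step, the snippet below flags pairs of answers with unusually high textual overlap. It assumes scikit-learn is available; the function name and threshold are hypothetical and would need tuning on real data.

    from itertools import combinations

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def flag_similar_answers(answers, threshold=0.6):
        """Return (id, id, score) triples for suspiciously similar answers.

        `answers` maps a candidate ID to their short-answer text.
        """
        ids = list(answers)
        matrix = TfidfVectorizer(stop_words="english").fit_transform(
            [answers[i] for i in ids]
        )
        sims = cosine_similarity(matrix)
        return [
            (ids[a], ids[b], round(float(sims[a, b]), 2))
            for a, b in combinations(range(len(ids)), 2)
            if sims[a, b] >= threshold
        ]

A flagged pair is an invitation to read both answers side by side, not a verdict on either candidate.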

Phone Screens

Our next step is a 15-minute voice conversation. The overall goal for this step is to quickly check alignment between the candidate and the role. For our current discussion, the relevant part is making sure that the candidate actually has the experience they claim on their resume. This has always been part of our screening process, but now we have to contend with real-time LLM usage. There are two main strategies we have found useful here.

The first is to ask questions at a high enough conceptual level that they are difficult to relay to an LLM, or at least slow to type out as a prompt. Entering all of the relevant context would take the person too long, so the model tends to return a generic answer. For example, ask, "Why did you use this framework over that one for this particular project?" If the candidate just lists attributes of the framework without mentioning any context from the project, that's not a good sign (with or without LLM assistance).

The second is to pay attention to any latency in the responses, which may indicate that the candidate is waiting for a model to return an answer. You might even ask them to rephrase a response and see whether that is a more difficult request than it should be. We have noticed some candidates claiming poor connections or phone issues in order to stall for time. This is a tricky call, because connection issues can be genuine, so you'll need to weigh it against the rest of their application.

While we don’t believe it has come to this yet, soon we will have to contend with generated voice and video streams. This will require major rethinking of the process and is worth its own article.

Technical Assessments

For engineering positions, we need to make sure that the candidate is competent at writing and maintaining software. The best method for this is often a live pairing session where the candidate designs and/or builds something alongside the interviewer. As a preliminary screen, we often give our candidates simpler coding challenges that they can complete on their own before the live pairing interview.

For us, these coding challenges have typically been small scripts that can be written in less than an hour. Since LLMs have gotten very good at writing exactly these types of scripts, we have had to change our approach. Instead of focusing on solutions to puzzle-like problems, we frame our challenges around building and maintaining software. We try to embed a simple problem in a larger context – for example, starting with some existing boilerplate code and asking for an additional feature or improved tests.
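
As a toy illustration of that framing, a candidate might receive boilerplate like the following along with an ask such as "add filtering by status." The Ticket model and function here are invented for this example; real challenges would be larger and closer to our domain.

    from dataclasses import dataclass

    @dataclass
    class Ticket:
        id: int
        status: str  # e.g. "open" or "closed", but unvalidated upstream
        title: str

    def list_tickets(tickets):
        # Existing behavior: newest first. The requested filter has a
        # natural home here; the unstated ambiguity (what happens to
        # unknown statuses?) is part of what we hope candidates notice.
        return sorted(tickets, key=lambda t: t.id, reverse=True)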

The crucial difference, though, is that we assess not only the direct ask ("Did they correctly add the new feature?") but also how they interact with the provided code context. Do they integrate with existing code in a natural way? How well do they document their changes? How do they handle ambiguity in the framing of the problem, or do they even notice it at all? These questions differentiate candidates even if they all use LLMs to reach a baseline level of coding ability. Given a complicated scenario, an LLM might produce a complex, contorted solution, while a good candidate might refactor the problem setup slightly to arrive at a cleaner one.

There are some downsides to this approach, of course. The main one is that it makes evaluation more subjective and energy-intensive. We still believe it reduces the number of candidates who make it to the live interview phase (itself even more subjective and energy-intensive). This approach works better for mid- to senior-level candidates; we haven't built assessments for junior candidates, who would probably be better served by live interviews.

Arms Race

As machine learning models improve, this problem will evolve and become more challenging. We don't expect these strategies to work forever, but hopefully they provide some insight into how to approach the problem. Importantly, we don't want to forget that there are real humans in this process, and they deserve respect and fair consideration. These are people we want to work with, learn from, and grow with as we build software together.

— Mark Flanagan, ML Engineer @ Hop