Assessing the maturity of AI tools

ChatGPT is one of the hottest AI topics of recent times. Launched in November 2022, it’s a chatbot that can use text prompts to create a variety of different types of content, such as SEO briefs, blog posts, and job applications. The possibilities are enormous.

As always with AI, the question with ChatGPT is: how will this technology change the way we work?

At NearForm, we’re always seeking to experiment with leading-edge technologies. We saw the rise of ChatGPT as an opportunity to assess the maturity of AI tools available in the market and consider what changes they may bring to our ways of working.

To do so, we decided to apply this new technology to our internal process for organizing technical interviews. When done manually, this process involves several parties, such as our recruitment and engineering teams, and considerable communication in order to find the right time and interviewer for each candidate’s application.

Our approach was to implement the engine for this solution by following three paths, so that we could then compare the outcomes:

  • Traditional coding
  • AI assisted code generation
  • AI natural language solution

This blog post covers our experience in the AI natural language solution path. Read on to discover what we learned from our experiment and find out what the impact might be.

The problem we sought to solve with our experiment

Finding the best interviewer for a candidate is an essential, but often time-consuming process.

Matching the most suitable people together gives the best chance of the right questions being asked and of both parties knowing whether there’s a good fit between candidate and company.

However, the downside is that a lot of work can go into this, taking time away from our recruitment and engineering teams that could be spent on more valuable activities.

This inspired us to see if we could give our people some time back by using ChatGPT to automate this process.

In order to choose an interviewer from our pool, we have a set of rules which apply filters — to exclude interviewers unfit or unavailable for the interview — and sorting, to offer the best match first. A high-level list of rules is as follows:

  • Date and time availability match with the candidate
  • Match based on technical skills required for the interview
  • Weekly interviewer availability to run interviews

The process takes interview data (such as the skills required and possible time slots) and a list of interviewers’ data (including known skills and availability), and returns a list of all the interviewers who can conduct the interview, sorted by their suitability for it.
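For context, the traditional-coding path reduces to an ordinary filter-and-sort routine. The sketch below is a minimal illustration of that shape; the field names and the scoring heuristic are our assumptions, not the actual implementation:

```python
# Minimal sketch of the rules engine as plain code. The interviewer
# fields and the tie-breaking heuristic are illustrative assumptions.

def match_interviewers(interview, interviewers):
    """Return interviewers able to run the interview, best match first."""
    candidates = [
        i for i in interviewers
        # Rule: the interviewer must cover every required skill.
        if set(interview["skills"]) <= set(i["skills"])
        # Rule: at least one proposed time slot must overlap.
        and set(interview["slots"]) & set(i["slots"])
        # Rule: the interviewer must have weekly capacity left.
        and i["weekly_capacity"] > 0
    ]
    # Offer the closest skill match (fewest unrelated skills) first.
    return sorted(
        candidates,
        key=lambda i: len(set(i["skills"]) - set(interview["skills"])),
    )

interview = {"skills": ["JavaScript"], "slots": ["Mon 16:00"]}
pool = [
    {"name": "Jim", "skills": ["JavaScript"], "slots": ["Mon 16:00"], "weekly_capacity": 2},
    {"name": "Jack", "skills": ["Java"], "slots": ["Mon 16:00"], "weekly_capacity": 2},
]
print([i["name"] for i in match_interviewers(interview, pool)])  # → ['Jim']
```

The same rules drive all three approaches; only the engine executing them differs.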

How we conducted our experiment

ChatGPT was the inspiration for our investigation, as we realized that it can not only answer general questions, but also, to some extent, reason about novel problems it is asked to solve.

At the time of writing, ChatGPT is in public beta. This means the product doesn’t yet have a public API that can be used. It also often runs out of capacity at peak hours, and its conversational approach makes it unfit for a rules engine. Instead, we used GPT-3, the model behind ChatGPT, which can be used programmatically through an API.

Stage 1: Building the rules engine

The idea was simple: go through the three approaches described earlier with a limited number of rules and evaluate the findings to identify the best approach.

We believed that traditional coding would be the way to go, as we felt GPT-3 wouldn’t be mature enough to deliver the best results.

GPT-3 takes a written input (known as a prompt) and responds with a suggested continuation for that prompt. The prompt can be sent to GPT-3 through its API or a client library. OpenAI also provides a Playground, a graphical interface for interacting with the model quickly during experimentation.
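For instance, a completion request could look like the sketch below, using the openai Python client as it existed at the time; the prompt text itself is an illustrative assumption:

```python
import os

def build_prompt(data, question):
    """Join the input data and the question into a single prompt string."""
    return f"{data}\n\n{question}"

prompt = build_prompt(
    "John knows JavaScript and React. Jack knows Java.",
    "Which interviewers know JavaScript?",
)

# Only call the API when a key is configured in the environment.
if os.environ.get("OPENAI_API_KEY"):
    import openai
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=64,
        temperature=0,  # favour deterministic continuations
    )
    print(response.choices[0].text.strip())
```

Setting the temperature to 0 asks the model for its most likely continuation, which is as close to deterministic as it gets.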

Our first attempt consisted of using a single prompt. This prompt contained all the input data and the set of rules to produce the expected final, ordered list of recommended interviewers. It didn’t produce the right outcome. The order was wrong and the best interviewers weren’t selected. This was because using a single prompt asked GPT-3 to think too much on its own to deliver the right results.

We now had to try different techniques to get the desired results.

Stage 2: Getting GPT-3 to return the correct answers

GPT-3 generates the most probable continuation for a given prompt by predicting the next most likely token, one at a time, until the full continuation is produced. In other words, it guesses the answer based on how detailed your questions and requests are. The accuracy of its responses is therefore tied to the amount of context available: longer, more detailed prompts and outputs give the model more context for accurate predictions.

In short, submitting short prompts for complex problems usually results in GPT-3 giving incorrect outputs.

One of the best techniques for getting GPT-3 to deliver a correct solution is to have the model respond with not only the answer, but also the process taken to arrive at that conclusion. Although custom prompts usually deliver better outputs, the sentence “let’s think step by step” has been found to be very effective as a generic approach. This method and other techniques can be found in the OpenAI Cookbook, which presents studies with broader testing samples.

This generic sentence can work for simple prompts, but it won’t produce the desired output for more complex ones, where the process needed to arrive at the conclusion is too long for the model to infer on its own. For those cases, you have to spell out the steps GPT-3 should follow rather than leaving it to work out the process by itself.

To summarize, the options from simplest to most complex (which usually leads to better results) in a single prompt are:

1. Write a prompt with just the data and the question


2. Add a generic sentence to the prompt so that GPT-3 responds with the thought process it used to come to the solution


3. Write custom steps that GPT should follow to answer the prompt

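Applied to a toy version of our problem, the three variants look roughly like this; the prompt wording is our own illustration, not the exact prompts we used:

```python
# Illustrative prompt-building for the three variants described above.
data = "John knows JavaScript and React. Jack knows Java."
question = "Which interviewers can run a JavaScript interview?"

# 1. Just the data and the question.
basic = f"{data}\n{question}"

# 2. Append a generic reasoning trigger.
generic = f"{basic}\nLet's think step by step."

# 3. Spell out the exact steps the model should follow.
custom = (
    f"{basic}\n"
    "Step 1: List each interviewer and their skills.\n"
    "Step 2: Keep only those whose skills include JavaScript.\n"
    "Step 3: Answer with the remaining names."
)

print(custom)
```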

Despite all these attempts, the issue we still faced was that this didn’t scale well.

The problem with the latter approach is that it greatly increases the number of words used for the prompt and the output. The most advanced GPT-3 model, text-davinci-003, accepts a maximum of 4,000 tokens (~3,000 words) shared between prompt and solution. In our case, increasing the number of interviewers or the number of rules in the ruleset quickly exceeded that capacity. Moreover, increasing the number of rules also increases the complexity of the prompt, which results in lower-quality outputs.

We were able to solve both problems in one go by dividing the task into subtasks and feeding each output in as the input of the next subtask. In this way, we could provide just the necessary information at each step.

For the previous example, this results in three prompts. The first filters by a technical skill, such as JavaScript.

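The original prompts aren’t reproduced here, but a hypothetical reconstruction of the first one, based on the names and skills mentioned in this section, could be:

```text
John knows JavaScript and React. Jane knows JavaScript and Angular.
Jim knows JavaScript and React. Jack knows Java.

Which of these interviewers know JavaScript?
```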

This will return John, Jane and Jim, so any information pertaining to Jack is no longer relevant for the second prompt, which filters by the second technology, for example, React.

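The second prompt, again an illustrative reconstruction, only needs the three remaining interviewers:

```text
John knows JavaScript and React. Jane knows JavaScript and Angular.
Jim knows JavaScript and React.

Which of these interviewers know React?
```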

This will result in John and Jim, leaving Jane out of the final prompt, which sorts the remaining interviewers by their preference for backend or frontend interviews.

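The final prompt, also reconstructed for illustration (we assume here that the interview in question is a frontend one), sorts the remaining two:

```text
John prefers backend interviews. Jim prefers frontend interviews.

Sort these interviewers so that those who prefer frontend
interviews come first.
```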

This iterative approach will return the final answer we expect — Jim first and John second.

Stage 3: Getting the workflow right

Although our original idea was to have a full AI solution, at the end of the day the program still needed to receive an input, map it to a prompt, query the API and post-process the output so that the returned value could be used by the rest of the workflow.

In our case, we needed to write custom code to parse the dates of the inputs, allowing for multiple time zones, into a uniform 24-hour format for GPT-3 to understand.
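A sketch of that kind of preprocessing, assuming a simple 12-hour input format rather than our actual schema:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc_24h(date_str, tz_name, fmt="%Y-%m-%d %I:%M %p"):
    """Parse a local 12-hour timestamp and normalize it to 24-hour UTC."""
    local = datetime.strptime(date_str, fmt).replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d %H:%M")

# The same wall-clock time in two time zones normalizes differently.
print(to_utc_24h("2023-01-16 02:30 PM", "Europe/Dublin"))    # → 2023-01-16 14:30
print(to_utc_24h("2023-01-16 02:30 PM", "America/New_York")) # → 2023-01-16 19:30
```

Normalizing everything to one unambiguous 24-hour representation before building the prompt removes a whole class of misinterpretations by the model.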

Stage 4: Reaching the solution

The road there had its ups and downs, but we finally arrived at a working solution. We had a common suite of unit tests in place which we ran against each of the three approaches we followed, and they all passed in the end.

After satisfying a fairly simple set of requirements, we decided to stress test our solution with more exotic inputs, such as unique IDs instead of names to identify interviewers, and skills containing non-ASCII characters.

While the traditional and AI-assisted coding approaches withstood the tests, or needed only small tweaks, the GPT-based approach started failing, and for some cases we weren’t able to achieve a working solution.

What we discovered

When writing code we are used to programs with a deterministic output. With AI this becomes difficult, to say the least.

AI functions as a black box, and even though GPT-3 has parameters for making the output as predictable as possible (such as setting the temperature to 0), we still found cases that didn’t live up to this promise.

Process matters, and one of the most surprising discoveries was that, even with the exact same input, the model may respond with different answers depending on previous interactions. This might not seem like a big deal, but when your workflow is composed of many steps, having earlier ones condition later ones can affect the results in unpredictable ways.

Furthermore, the model doesn’t apply logic the way humans do, as the following examples demonstrate.

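The exact exchanges aren’t reproduced here; an illustrative reconstruction of the first, correct case could be:

```text
Prompt:
Jim is available from 16:00 to 20:00.
The interview runs from 16:00 to 18:00.
Which interviewers are available for the interview?

Response:
Jim is available for the interview.
```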

Having this correct solution might mislead you into thinking that GPT-3 will solve the problem for different inputs, but in practice that’s not the case (notice the different end time of the interview compared to the previous example).

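And a reconstruction of the second case, where the model wrongly leaves Jim out despite the shorter requested slot:

```text
Prompt:
Jim is available from 16:00 to 20:00.
The interview runs from 16:00 to 17:00.
Which interviewers are available for the interview?

Response:
No interviewers are available for the interview.
```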

As can be seen, Jim is still available — his 16:00 to 20:00 window contains the requested range — yet he was excluded from the second set of results. It is all the more surprising because Jim was correctly included when the requested slot was 16:00 to 18:00, a longer period that contains 16:00 to 17:00. Solving problems inconsistently across such slight input variations means we simply cannot rely on GPT-3 to deliver consistently accurate results.

What we learned from our experiment

We adopted the term ‘prompt engineering’ to describe the work required to generate the prompts to feed into an AI system such as GPT-3, which understands natural language instead of code.

These are the key learnings from our GPT-3 experiment:

  • You still need to write code if you want to have reliable outcomes.
  • GPT-3 may be faster to start with, but it will become more difficult to maintain. While adding new rules in the other approaches turned out to be easy, with AI, a full prompt engineering cycle was needed for each new rule with its preprocessing and postprocessing included.
  • If GPT-3 works for some inputs, it doesn’t mean that it will work for similar ones.
  • Prompt engineering might be as complex as coding: it feels quite unnatural, and you have to learn the hidden rules of the model, which, not being clearly stated, can be harder to grasp than code written in a traditional programming language.
  • Token limitations prevent the model from solving large problems.
  • For complex problems, you need to walk the model through the reasoning process. This means it will not solve anything you don’t already know how to solve.

At the moment, GPT-3 might be good for brainstorming, coming up with ideas and one-time responses, but when it comes to providing consistent, production-ready answers to problems with only one correct answer, we found it ultimately unreliable.

If you are a software engineer, you can relax for the time being, safe in the knowledge that AI isn’t going to take your job.
