Seeking Integrity with the First Massively Parallel Multi-Language Benchmark for Code Generation

Michael Greenberg’s project MultiPL-E is now integrated into open-source large language model BigCode

Generative artificial intelligence (AI) — or AI tools that can do or create something in response to a user’s request — became a global phenomenon virtually overnight with the launch of OpenAI’s ChatGPT in November 2022.

Capable of generating content as far-ranging as sonnets, songs, digital paintings and programming code, generative AI can be unnerving, exciting — and also wildly inaccurate, sometimes inventing spurious historical or anatomical “facts” or even failing at basic math. Generative AI chatbots have even made headlines for producing content that is racist, misogynist or otherwise biased in nature as if it were objective fact.

Underlying all generative AIs are deep neural networks called large language models. These models are built by algorithms that have been trained on massive amounts of data — millions or billions of lines of text, code or other content that trains the AI to recognize and process patterns from which to spontaneously generate new content.

The integrity of an AI’s results, however, is directly tied to the integrity of the model built from those initial data sets. Inaccurate or insufficient content going in can mean problematic content coming out.

Code generation models — which are large language models of programming code specifically — are subject to the same vulnerabilities as models that deliver results in natural (human) language, said Michael Greenberg, an assistant professor in the Stevens Institute of Technology Department of Computer Science.

“Sometimes these models do a good job. Sometimes it's incomprehensible,” he explained. “If you generate code [using one of these models] without looking at the results closely, it's probably going to be trash. Very plausible trash. Syntactically correct trash. But trash.”

To help improve code generation and model quality, Greenberg collaborated with 12 researchers from Northeastern University, Oberlin College, Wellesley College, Microsoft Research and Hanover High School (New Hampshire) to develop MultiPL-E.

MultiPL-E — which stands for Multiple Programming Languages Evaluation — is a framework for translating a suite of benchmark tests from the popular programming language Python into 18 other programming languages in order to evaluate the performance quality of a code generation model. Described by the team as “the first massively parallel multi-language benchmark for code generation,” the MultiPL-E evaluator was integrated into the open-source, collaborative large language model BigCode in May 2023.

The team’s research, titled “MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation,” was published in the Institute of Electrical and Electronics Engineers’ IEEE Transactions on Software Engineering in April 2023.

Translating Python oranges into multi-language apples

A primary problem of code-generating AIs, said Greenberg, is the absence of proper “apples to apples” evaluations of their models’ performance in a variety of programming languages.

While several code generation models are trained on multiple programming languages, they are usually only evaluated for accuracy on a single programming language — typically, Python. Thus, such models lack verification of whether they perform as well on non-Python languages as they do on Python specifically.

To address this shortcoming, Greenberg and the team developed a collection of 18 special compilers (translators) to simultaneously convert two popular Python code generation benchmark suites, HumanEval and Mostly Basic Python Problems (MBPP), into 18 different target languages.

Unlike full-fledged compilers, MultiPL-E’s “mini compilers” (as Greenberg called them) translate only four particular components from Python: the name and arguments of a function (the instructions that explain how to complete a task); simple unit tests (tests of individual units of code); a comment describing the expected function behavior; and, when required, type annotations, which describe what kind of input and output are expected (such as, for example, that arithmetic division would call for two numbers going in and one number coming out).
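As a rough illustration of those four components, the sketch below takes a hypothetical Python benchmark problem and emits an equivalent TypeScript prompt and unit tests. The `Problem` class, the `TYPE_MAP` table, the helper functions and the `divide` problem are all assumptions made for this example. It is not MultiPL-E's actual code, and it handles only a few simple types.

```python
import json
from dataclasses import dataclass

# Hypothetical mapping from Python type annotations to TypeScript types.
TYPE_MAP = {"int": "number", "float": "number", "str": "string", "bool": "boolean"}


@dataclass
class Problem:
    """The four pieces a mini compiler needs from a Python benchmark problem."""
    name: str                          # function name
    args: list[tuple[str, str]]        # (argument name, Python type annotation)
    returns: str                       # Python return type annotation
    comment: str                       # description of the expected behavior
    tests: list[tuple[list, object]]   # (argument values, expected result)


def to_ts_literal(value) -> str:
    """Render a simple Python value as a TypeScript literal."""
    if isinstance(value, bool):        # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, str):
        return json.dumps(value)       # double-quoted, escaped string
    return repr(value)                 # ints and floats


def to_typescript(p: Problem) -> tuple[str, str]:
    """Return (prompt, tests) as TypeScript source text."""
    params = ", ".join(f"{n}: {TYPE_MAP[t]}" for n, t in p.args)
    # The prompt stops at the opening brace; the model completes the body.
    prompt = (
        f"// {p.comment}\n"
        f"function {p.name}({params}): {TYPE_MAP[p.returns]} {{\n"
    )
    checks = []
    for values, expected in p.tests:
        call = f"{p.name}({', '.join(to_ts_literal(v) for v in values)})"
        checks.append(f"console.assert({call} === {to_ts_literal(expected)});")
    return prompt, "\n".join(checks)


problem = Problem(
    name="divide",
    args=[("a", "float"), ("b", "float")],
    returns="float",
    comment="Return a divided by b.",
    tests=[([6.0, 3.0], 2.0), ([1.0, 4.0], 0.25)],
)
prompt, tests = to_typescript(problem)
print(prompt)
print(tests)
```

Because only the signature, comment, types and tests are translated, the function body never has to be converted. That is what would keep a translator of this kind small.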

Despite its focus on only these four elements, MultiPL-E is able to leverage the full range of HumanEval and MBPP benchmarks to automatically, fully and consistently evaluate how an AI model performs in any of MultiPL-E’s 18 target languages. This limited scope also makes MultiPL-E easy to scale to other programming languages in the future.

MultiPL-E’s 18 target languages include common programming languages like C++, Java and JavaScript, as well as more niche languages that have never been used to measure natural language-to-code performance before, including newer languages Julia and Swift and older scripting languages Bash and Perl. (Greenberg is himself an expert in these older languages.)

Greenberg says to think of MultiPL-E like the dozens of language translators at the United Nations, all translating what one particular person says into dozens of different languages simultaneously. Only in this case, the words being spoken are computer code, and the source language being translated is Python.

“We are the translator division,” he said.

Better benchmarks make better models

In their initial research, Greenberg’s team used MultiPL-E to evaluate the multi-language performance of three popular code generation models: Codex (which is also owned by OpenAI), InCoder and CodeGen.

MultiPL-E’s evaluation results are not a binary pass-fail: rather, the evaluator delivers a percentage pass rate for the model’s overall performance on a particular language.

A pass-at-one test, for example, evaluates how frequently a large language model passes a particular benchmark test on the first try.

“Some of the top models, you’ll see they’re passing [the pass-at-one test] 50% of the time — which, considering how loopy-loo some of ChatGPT’s text answers are, is pretty remarkable for code,” said Greenberg. “It means that 50% of the time, the model is really hitting the mark.”
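To make the pass rate concrete, the sketch below estimates pass-at-k from repeated samples using the unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k)/C(n, k), where n completions are sampled for a problem and c of them pass the unit tests. For k = 1 this reduces to c/n. The function names and the sample counts are made up for illustration and do not come from the MultiPL-E paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of n sampled completions passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k completions failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_rate(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each item is (n sampled, c passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for four problems, 20 completions sampled for each.
results = [(20, 20), (20, 10), (20, 0), (20, 10)]
rate = benchmark_pass_rate(results)
print(f"pass@1 = {rate:.0%}")
```

Averaging per-problem estimates is why the overall score is a percentage rather than a single pass or fail.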

Data models behind generative AI, however, are not static entities: they must be updated (or, more accurately, rebuilt) regularly so that their data does not become outdated or stale.

So MultiPL-E’s evaluation score not only shows how well a particular model currently performs on a particular language; it can also show how that model improves or regresses on the language over time.

“For example, take a researcher who is using a large language model to do some fancy type prediction in [the programming language] TypeScript. They use [model] Version A when they know that the TypeScript pass-at-one rate is 43%,” explained Greenberg. “Then a bunch of improvements are made [to the model], and now the pass-at-one rate is 52%. So the researcher should rerun their experiments, because results are going to be better because this language model now knows a lot more TypeScript.”
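The comparison Greenberg describes can be sketched as a per-language diff of pass-at-one rates between two model versions. Only the TypeScript figures (43% and 52%) come from his example. The Bash figures, the version labels, the tolerance and the `compare_versions` helper are hypothetical.

```python
def compare_versions(old: dict[str, float], new: dict[str, float],
                     tolerance: float = 0.005) -> dict[str, str]:
    """Label each shared language as improved, regressed or unchanged."""
    report = {}
    for language in sorted(old.keys() & new.keys()):
        delta = new[language] - old[language]
        if delta > tolerance:
            report[language] = f"improved by {delta * 100:.0f} points"
        elif delta < -tolerance:
            report[language] = f"regressed by {-delta * 100:.0f} points"
        else:
            report[language] = "unchanged"
    return report


version_a = {"TypeScript": 0.43, "Bash": 0.30}  # Bash rate is made up
version_b = {"TypeScript": 0.52, "Bash": 0.27}  # Bash rate is made up

for language, verdict in compare_versions(version_a, version_b).items():
    print(f"{language}: {verdict}")
```

A researcher relying on a given language could use a check like this to decide whether to rerun experiments after a model update.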

MultiPL-E’s evaluations should also contribute to improved large language models over the long term by setting a new minimum standard for each model going forward.

The TypeScript developer in Greenberg’s hypothetical scenario, for example, would be reluctant to use a model whose performance on their chosen language gets worse over time. By integrating MultiPL-E into a model’s evaluation, Greenberg said, “the bar for what counts as a good model goes higher.”

This advantage is part of why MultiPL-E’s integration into the open-source BigCode model is so noteworthy — and why Greenberg is so excited by it.

“By using MultiPL-E, BigCode is now basically using our form of evaluation as a benchmark for their model,” he said. “So when they release a new model, people will want to see that these ratings have gone up, which means that our ratings are getting taken as the standard. Everyone who's using BigCode is benefiting from MultiPL-E’s evaluation, and MultiPL-E is getting seen as a robust way to evaluate language models across a bunch of languages simultaneously.”

Better models make better breakthroughs — and better science

By helping to improve code generation models, MultiPL-E supports both software developers and model-makers in generating more and better code and evaluative tests in a variety of programming languages. More and better code and tests mean faster progress and innovation, ultimately accelerating the development of technology and scientific discovery.

But to Greenberg, perhaps the most important aspect and benefit of MultiPL-E — and another reason why its integration into BigCode is so significant — is its open-source nature.

A “side effect” of the conclusions drawn by MultiPL-E’s initial research, Greenberg said, was discovering that “the open-source models are not nearly as good as the private models.”

“The large language model that performs the absolute best in our paper — OpenAI’s Codex — is closed-source. Nobody knows what its training set is. Nobody has any control over what's going on, the same way that ChatGPT is totally secret. It's a bummer in a lot of ways.”

For one, closed-source data models raise serious legal and privacy concerns, whereby personal data and copyrighted, proprietary content may be inappropriately included in their data sets — and, consequently, may find their way back out again in generated results.

Additionally, the lack of access to a data model means researchers have no way of inspecting the quality of the data being used or the methods by which conclusions are made. Codex’s closed model is, in essence, a black box of mysteries that cannot be examined, assessed or reproduced.

But perhaps most significantly of all, closed models limit access to data, the most fundamental resource in scientific research. In doing so, they restrict scientific innovation and discovery to the select few with enough funding or clout to afford them, and to the populations or communities that such researchers serve. Access to such models can also be revoked at any time.

“We did our experiments when it was free to use Codex. If we did them now, that would cost about $40,000,” explained Greenberg. “That's not a position of strength. [As a researcher] you want to know what you have and be able to work with it anywhere, not be beholden to any particular corporation for access.”

“That Codex is closed is sort of anti-science,” he added.

In contrast, BigCode was deliberately developed to be a free, open, collaborative and equitable resource that incorporates only data that has been ethically and legally sourced and makes its model fully accessible and examinable by all.

“All of this makes it so that it’s more of a level playing field for academia,” said Greenberg. “It's nice to have [BigCode] be a bunch of people who are trying to do the right thing.”

Ultimately, by helping to improve the integrity of code generation models in multiple programming languages, MultiPL-E has the potential to support not only developers, model-makers, researchers and technological progress in general, but also the integrity of science as a whole.

“A key question in all of the machine learning work right now, the literally million-dollar question for AI researchers, is, how do you do work if you are not a multimillion-dollar corporation?” Greenberg said. “Because the GPUs [graphics processing units] you need to run on are very expensive. The data sets are huge, and you don't have them. Both of these combine to make research super challenging. So being part of this big milestone community project means that we’re helping science continue to be open and independent.”
