One day after release, Code Llama's coding ability has leapt forward: a fine-tuned version scores higher than GPT-4 on HumanEval

Me yesterday (August 25): an open-source LLM will beat GPT-4 at code generation within a few months. Me now: actually, it happened today.

Yesterday, Meta open-sourced Code Llama, a foundation model specialized for code generation that is free for both research and commercial use.

The Code Llama series comes in three sizes, with 7B, 13B, and 34B parameters, and supports multiple programming languages, including Python, C++, Java, PHP, TypeScript, C#, and Bash.

Code Llama versions provided by Meta include:

  • Code Llama, the base code model;
  • Code Llama-Python, a version fine-tuned for Python;
  • Code Llama-Instruct, a version fine-tuned to follow natural-language instructions.

In terms of performance, the various Code Llama versions achieve a generation pass rate (pass@1) on the HumanEval and MBPP benchmarks that surpasses GPT-3.5.

In addition, the pass@1 of the "Unnatural Code Llama" 34B version on HumanEval comes close to GPT-4 (62.2% vs. 67.0%). Meta did not release this version; it achieved the significant performance boost by training on a small amount of high-quality coding data.

Just a day later, researchers from Phind (an organization aiming to build an AI search engine for developers) challenged GPT-4 and beat it on HumanEval with a fine-tuned Code Llama-34B.

Phind co-founder Michael Royzen said: "This is just an early experiment aimed at reproducing (and surpassing) the 'Unnatural Code Llama' results from the Meta paper. In the future, we will have a portfolio of expert CodeLlama models that I think will be competitive in real-world workflows."

Both models are open-sourced:

The researchers have published both models on Hugging Face, where anyone can check them out (a minimal loading sketch follows the list below):

  • Phind-CodeLlama-34B-v1:
  • Phind-CodeLlama-34B-Python-v1:
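Here is a minimal loading sketch using the Hugging Face transformers library. It assumes the repositories are published under the Phind organization and follow the standard causal-LM layout; in practice a 34B model needs several large GPUs or quantization.

```python
# Minimal loading sketch with Hugging Face transformers.
# The repository id below is an assumption; adjust it to the actual Hugging Face repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Phind/Phind-CodeLlama-34B-v1"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Write a Python function that checks whether a string is a palindrome.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```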

Next, let's see how this research was implemented.

**Fine-tuning Code Llama-34B to beat GPT-4**

Let's look at the results first. The study fine-tuned Code Llama-34B and Code Llama-34B-Python on Phind's internal dataset, producing two models: Phind-CodeLlama-34B-v1 and Phind-CodeLlama-34B-Python-v1.

The two new models achieved 67.6% and 69.5% pass@1, respectively, on HumanEval.

For comparison, CodeLlama-34B pass@1 is 48.8%; CodeLlama-34B-Python pass@1 is 53.7%.

GPT-4's pass@1 on HumanEval is 67% (the figure OpenAI reported in the "GPT-4 Technical Report" published this March).
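For reference, pass@1 here is the pass@k metric from OpenAI's Codex paper with k = 1: the model generates n samples per problem, c of them pass the unit tests, and an unbiased estimator is averaged over all problems. A minimal sketch of that estimator (not Phind's evaluation code) looks like this:

```python
# Unbiased pass@k estimator from the Codex paper (Chen et al., 2021);
# n = samples generated per problem, c = samples that pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for a single problem; the benchmark score averages this
    over all problems."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 5 samples for one problem, 2 of which pass; the pass@1 estimate is 0.4.
print(pass_at_k(n=5, c=2, k=1))
```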

When it comes to fine-tuning, a dataset is naturally indispensable. The study fine-tuned Code Llama-34B and Code Llama-34B-Python on a proprietary dataset of about 80,000 high-quality programming problems and solutions.

Instead of code-completion examples, this dataset uses instruction-answer pairs, which differs from the structure of HumanEval data. The Phind models were then trained for two epochs, for a total of about 160,000 examples. The researchers said that LoRA was not used; instead, the models were natively (fully) fine-tuned.
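To make the format concrete, a single instruction-answer pair might look roughly like the following (the field names and content are illustrative assumptions, not Phind's actual schema):

```python
# Illustrative instruction-answer training example (hypothetical schema,
# not Phind's actual data).
example = {
    "instruction": "Write a Python function that returns the n-th Fibonacci number.",
    "answer": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a"
    ),
}
```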

In addition, the researchers used DeepSpeed ZeRO-3 and FlashAttention 2. Training these models took three hours on 32 A100-80GB GPUs, with a sequence length of 4,096 tokens.
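A rough sketch of what such a full-parameter fine-tuning setup could look like with the Hugging Face Trainer, DeepSpeed ZeRO-3, and FlashAttention 2 is shown below. The hyperparameters, paths, and dataset handling are assumptions for illustration; Phind has not published its actual training script.

```python
# Hypothetical sketch of full-parameter fine-tuning (no LoRA) with the Hugging Face
# Trainer, a DeepSpeed ZeRO-3 config, and FlashAttention 2. Hyperparameters, paths,
# and dataset handling are illustrative assumptions, not Phind's actual script.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

base_id = "codellama/CodeLlama-34b-hf"  # base model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype="auto",
)

# Placeholder: the ~80,000 proprietary instruction-answer pairs, tokenized to a
# sequence length of 4,096 tokens, are not public.
train_dataset = ...

args = TrainingArguments(
    output_dir="phind-codellama-ft",  # illustrative output path
    num_train_epochs=2,               # two epochs, ~160k examples seen in total
    per_device_train_batch_size=1,    # assumed; not stated by Phind
    bf16=True,
    deepspeed="ds_zero3.json",        # DeepSpeed ZeRO-3 config file (assumed name)
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```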

The study also applied OpenAI's decontamination methodology to the dataset to make the results more credible.

As is well known, even the very powerful GPT-4 faces the problem of data contamination; in layman's terms, the model may have been trained on the evaluation data.

This problem is tricky for LLMs. To produce a scientifically credible evaluation, researchers must check whether the evaluation problems appear in the model's training data. If they do, the model may simply have memorized those questions and will obviously perform better on them during evaluation.

It is as if a person already knew the exam questions before sitting the exam.

To address this problem, OpenAI disclosed in the public "GPT-4 Technical Report" how it quantifies and assesses data contamination for GPT-4.

Specifically, OpenAI uses substring matching to measure cross-contamination between the evaluation dataset and the pre-training data. Both evaluation and training data are processed by removing all spaces and symbols, leaving only characters (including numbers).

For each evaluation example, OpenAI randomly selects three 50-character substrings (if the example is shorter than 50 characters, the entire example is used). A match is declared if any of the three sampled evaluation substrings is a substring of a processed training example.

This produces a list of contaminated examples, which OpenAI discards before rerunning the evaluation to obtain an uncontaminated score. The filtering method has some limitations: substring matching can lead to false negatives (when there are only small differences between the evaluation and training data) as well as false positives. OpenAI therefore uses only part of the information in each evaluation example, using just the question, context, or equivalent data while ignoring answers, responses, or equivalent data; in some cases multiple-choice options were also excluded. These exclusions may lead to an increase in false positives.
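Based on that description, the check can be sketched roughly as follows (a simplified illustration of the substring-matching idea, not OpenAI's actual implementation):

```python
# Simplified sketch of the substring-matching decontamination check described in the
# GPT-4 Technical Report; an illustration of the idea, not OpenAI's actual code.
import random
import re

def normalize(text: str) -> str:
    """Remove spaces and symbols, keeping only letters and digits."""
    return re.sub(r"[^0-9A-Za-z]", "", text)

def is_contaminated(eval_example: str, training_docs: list[str],
                    n_substrings: int = 3, length: int = 50) -> bool:
    """Return True if any sampled substring of the (normalized) evaluation example
    appears verbatim in any normalized training document."""
    sample = normalize(eval_example)
    docs = [normalize(doc) for doc in training_docs]
    if len(sample) <= length:
        substrings = [sample]  # use the whole example when it is short
    else:
        max_start = len(sample) - length + 1
        starts = random.sample(range(max_start), k=min(n_substrings, max_start))
        substrings = [sample[s:s + length] for s in starts]
    return any(sub in doc for sub in substrings for doc in docs)

# Example: a toy "training corpus" and an evaluation prompt that appears in it.
corpus = ["def add(a, b): return a + b  # simple addition helper"]
print(is_contaminated("def add(a, b): return a + b", corpus))  # True
```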

For this part, interested readers can refer to the paper for more information.

However, there is some controversy over the HumanEval score Phind used when benchmarking against GPT-4. Some say GPT-4's latest test score has reached 85%. Phind replied that the study producing that number did not include a contamination analysis, so it is impossible to determine whether GPT-4 had seen HumanEval's test data in that newer round of testing. Given recent research suggesting that "GPT-4 is getting worse," it is safer to use the figure from the original technical report.

That said, given the complexity of evaluating large models, whether these results reflect the models' true capabilities remains debatable. You can download the models and try them yourself.
