Instruction Tuning with GPT-4

Large Language Models (LLMs) have shown impressive generalization capabilities such as in-context learning and chain-of-thought reasoning. To enable LLMs to follow natural language instructions and complete real-world tasks, researchers have been exploring methods of instruction-tuning LLMs. To advance the state of the art of instruction-tuning for LLMs, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning.

GPT-4 Data

We release the following data assets:


How Good is the Data?

Evaluating the performance of self-instruct-tuned models on tasks unseen during training remains difficult. Our objective is to assess their ability to understand and follow instructions across a variety of tasks. To accomplish this, we use the following three types of evaluation. Our empirical investigation confirms that GPT-4-generated data is a more efficient and effective source for LLM instruction-tuning than other machine-generated data.

Human evaluation was performed on model generation results using Amazon Mechanical Turk, following the Helpfulness, Honesty, and Harmlessness criteria from Anthropic. The results are summarized as follows:

LLaMA-GPT4 vs Alpaca (i.e., LLaMA-GPT3)

LLaMA-GPT4 vs GPT-4

Inspired by Vicuna, we used GPT-4 to evaluate the quality of responses generated by different chatbot models on 80 unseen questions. We collected the responses from LLaMA-GPT-4 (7B) and GPT-4, and obtained the released answers of the other models from a previous study. GPT-4 was asked to rate the quality of responses between two models on a scale of 1 to 10, and the results were compared against strong competing models (ChatGPT and GPT-4).
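As a minimal sketch of this pairwise GPT-4-as-judge scheme, the snippet below builds a review prompt, parses the two scores the judge returns, and aggregates them into a relative score. The prompt wording, score format, and function names here are illustrative assumptions, not the exact protocol used in the paper.

```python
# Sketch of pairwise GPT-4-as-judge scoring. The prompt text and the
# "two numbers on the first line" reply format are assumptions for
# illustration; the actual review template may differ.

def build_review_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model to score two answers, each from 1 to 10."""
    return (
        "Rate the helpfulness and accuracy of the two answers below, "
        "each on a scale of 1 to 10. Reply with two numbers first, "
        "e.g. '8 9', then a short justification.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
    )

def parse_scores(review: str) -> tuple[float, float]:
    """Extract the two numeric scores from the judge's first line."""
    a, b = review.splitlines()[0].split()[:2]
    return float(a), float(b)

def relative_score(reviews: list[str]) -> float:
    """Total score of model A divided by total score of model B,
    summed over all judged questions."""
    total_a = total_b = 0.0
    for review in reviews:
        a, b = parse_scores(review)
        total_a += a
        total_b += b
    return total_a / total_b
```

Aggregating totals before dividing (rather than averaging per-question ratios) keeps a single low-scored question from dominating the comparison.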

Evaluations Scores from GPT-4

ROUGE-L on Unnatural Instructions.
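ROUGE-L scores a model output against a reference by the length of their longest common subsequence (LCS). A minimal sketch of the metric, assuming simple whitespace tokenization (the exact tokenization used for the Unnatural Instructions evaluation may differ):

```python
# Minimal ROUGE-L F-score via longest common subsequence (LCS).
# Whitespace tokenization and beta=1.2 are illustrative assumptions.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists
    (standard dynamic-programming table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score: LCS-based precision/recall combined with
    a recall-weighted harmonic mean."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)
```

An exact match scores 1.0, and the score degrades as the shared subsequence shrinks relative to either string.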

Sample Responses Comparison

Citation

If the paper inspires you or the data is used in your research, please cite us:

@article{peng2023instruction,
  title={Instruction Tuning with GPT-4},
  author={Peng, Baolin and Li, Chunyuan and He, Pengcheng and Galley, Michel and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2304.03277},
  year={2023}
}

Release and License

The data is intended solely for research and non-commercial purposes. Its use is subject to the Terms of Use for data generated by OpenAI. If you discover any potential violations, please contact us. Additionally, the code is governed by the Apache License 2.0.

Acknowledgement

We thank Guoyin Wang, Haotian Liu and Hao Cheng for valuable discussions and insightful experience sharing on instruction-tuning language models. We thank the LLaMA team for giving us access to their models.