InterCode (original) (raw)

News

10.19.2023 InterCode (v1.0.2) released, new IC-SWE!

10.12.2023 Lemur sets new highs on IC-[Bash, CTF, SQL]!

09.22.2023 InterCode accepted to 2023 NeurIPS Datasets & Benchmark track!

08.15.2023 InterCode (v1.0.1) released, new IC-CTF, IC-Python!

07.01.2023 InterCode now available on PyPI

06.27.2023 InterCode (v1.0.0) available on GitHub

What is InterCode?

InterCode is a benchmark for evaluating language models on the interactive coding task. Given a natural language request, an agent is asked to interact with a software system (e.g., database, terminal) with code to resolve the issue.

InterCode currently features 5 different code environments: IC-Bash, IC-CTF, IC-Python, IC-SQL, IC-SWE. You can learn more about each on the Environments page!

Question & Contributing

If you have any questions or would like to contribute to InterCode, you can post an issue on the InterCode GitHub issues page. Also, please feel free to contactJohn Yang directly.

Acknowledgements

We would like to thank thePrinceton NLP group for their support towards building InterCode. In particularly, we'd like to thankCarlos E. Jimenez and Yuhan Liu for testing InterCode and providing valuable feedback. In addition, our thanks toProf. Pranav Rajpurkar for giving us permission to use the SQuAD template for this website.

Citing

If you found InterCode helpful for your work, please cite us!

@misc{yang2023intercode, title={InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback}, author={John Yang and Akshara Prabhakar and Karthik Narasimhan and Shunyu Yao}, year={2023}, eprint={2306.14898}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Leaderboard

The Success Rate metric refers to the percentage of tasks that were resolved by the model (received a score of 1.0).

0 - Denotes zero shot evaluation (no interaction)

Model	Date	Owner	Success Rate
🥇 GPT-4	06.27.2023	OpenAI	48.5
🥈 GPT-3.5-Turbo	06.27.2023	OpenAI	46.5
🥉 CodeLlama-34B-INST	10.12.2023	Meta	36.0
Lemur-70B-Chat	10.12.2023	Salesforce	34.5
GPT-3.5-Turbo0	06.27.2023	OpenAI	34.5
GPT-40	06.27.2023	OpenAI	34.0
Llama-2-70B-Chat	06.27.2023	Meta	31.5
Vicuna-13B	06.27.2023	Open Source	24.5
StarChat-16B	06.27.2023	Open Source	23.7
text-bison-001	06.27.2023	Google	22.5
chat-bison-001	06.27.2023	Google	19.2
chat-bison-0010	06.27.2023	Google	17.7
StarChat-16B0	06.27.2023	Open Source	17.7
text-bison-0010	06.27.2023	Google	17.0
Vicuna-13B0	06.27.2023	Open Source	16.0

Model	Date	Owner	Success Rate
🥇 GPT-4	10.12.2023	OpenAI	37.0
🥈 Lemur-70B-Chat	10.12.2023	Salesforce	22.0
🥉 CodeLlama-34B-INST	10.12.2023	Meta	16.0
GPT-3.5-Turbo	10.12.2023	OpenAI	11.0
Llama-2-70B-Chat	10.12.2023	Meta	9.0

Model	Date	Owner	Success Rate
Be the first!

Model	Date	Owner	Success Rate
🥇 GPT-4	06.27.2023	OpenAI	84.4
🥈 Lemur-70B-Chat	10.12.2023	Salesforce	73.39
🥉 GPT-3.5-Turbo	06.27.2023	OpenAI	72.82
Llama-2-70B-Chat	10.12.2023	Meta	67.89
CodeLlama-34B-INST	10.12.2023	Meta	67.79
text-bison-001	06.27.2023	Google	12.9
text-bison-0010	06.27.2023	Google	11.5
GPT-3.5-Turbo0	06.27.2023	OpenAI	10.5
chat-bison-001	06.27.2023	Google	9.9
StarChat-16B	06.27.2023	Open Source	9.7
GPT-40	06.27.2023	OpenAI	9.1
StarChat-16B0	06.27.2023	Open Source	8.9
chat-bison-0010	06.27.2023	Google	7.9
Vicuna-13B	06.27.2023	Open Source	6.3
Vicuna-13B0	06.27.2023	Open Source	2.6

Model	Date	Owner	Success Rate
Be the first!