InterCode

News

10.19.2023 InterCode (v1.0.2) released, new IC-SWE!

10.12.2023 Lemur sets new highs on IC-[Bash, CTF, SQL]!

09.22.2023 InterCode accepted to the NeurIPS 2023 Datasets and Benchmarks Track!

08.15.2023 InterCode (v1.0.1) released, new IC-CTF, IC-Python!

07.01.2023 InterCode now available on PyPI

06.27.2023 InterCode (v1.0.0) available on GitHub

What is InterCode?

InterCode is a benchmark for evaluating language models on interactive coding tasks. Given a natural language request, an agent interacts with a software system (e.g., a database or terminal) through code to resolve the request.
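To make the interaction concrete, here is a minimal toy sketch of such a loop. It does not use the InterCode package: `run_episode` and `scripted_policy` are hypothetical names, the in-memory SQLite database stands in for a real environment, the scripted policy stands in for a language model, and exact-match scoring against a gold result is an assumption made for illustration.

```python
import sqlite3


def run_episode(conn, policy, gold_rows, max_turns=5):
    """Let a policy issue SQL actions until its output matches gold or turns run out."""
    observation = None
    for turn in range(1, max_turns + 1):
        action = policy(observation, turn)
        try:
            observation = conn.execute(action).fetchall()
        except sqlite3.Error as exc:
            # Execution feedback: the error text becomes the next observation.
            observation = f"Error: {exc}"
            continue
        if observation == gold_rows:
            return 1.0, turn  # task resolved
    return 0.0, max_turns


def scripted_policy(observation, turn):
    # Stand-in for a language model: the first attempt uses a wrong table name,
    # the second attempt corrects it.
    attempts = [
        "SELECT name FROM user WHERE age > 30",
        "SELECT name FROM users WHERE age > 30 ORDER BY name",
    ]
    return attempts[min(turn, len(attempts)) - 1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [("Ada", 36), ("Bob", 25), ("Cy", 41)]
)
conn.commit()

score, turns = run_episode(conn, scripted_policy, gold_rows=[("Ada",), ("Cy",)])
print(score, turns)
```

In this sketch the first query fails, the error message is fed back as the observation, and the second query resolves the request on turn 2.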

InterCode currently features 5 different code environments: IC-Bash, IC-CTF, IC-Python, IC-SQL, IC-SWE. You can learn more about each on the Environments page!

Questions & Contributing

If you have any questions or would like to contribute to InterCode, you can post an issue on the InterCode GitHub issues page. Also, please feel free to contact John Yang directly.

Acknowledgements

We would like to thank the Princeton NLP group for their support in building InterCode. In particular, we'd like to thank Carlos E. Jimenez and Yuhan Liu for testing InterCode and providing valuable feedback. In addition, our thanks go to Prof. Pranav Rajpurkar for giving us permission to use the SQuAD template for this website.

Citing

If you found InterCode helpful for your work, please cite us!

@misc{yang2023intercode,
    title={InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback},
    author={John Yang and Akshara Prabhakar and Karthik Narasimhan and Shunyu Yao},
    year={2023},
    eprint={2306.14898},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Leaderboard

The Success Rate metric refers to the percentage of tasks that were resolved by the model (received a score of 1.0).
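As a small illustration of this definition, the sketch below computes a success rate from a list of per-task scores. The function name `success_rate` and the sample scores are hypothetical and are not taken from the InterCode codebase or results.

```python
def success_rate(scores):
    """Return the percentage of tasks whose score is exactly 1.0."""
    if not scores:
        raise ValueError("scores must not be empty")
    resolved = sum(1 for s in scores if s == 1.0)
    return 100.0 * resolved / len(scores)


# Made-up scores for four tasks: two are fully resolved, two are not.
example_scores = [1.0, 0.5, 1.0, 0.0]
print(success_rate(example_scores))  # 50.0
```

Partial scores such as 0.5 do not count toward the metric; only tasks scored 1.0 are treated as resolved.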

⁰ denotes zero-shot evaluation (no interaction).

IC-Bash

| Model | Date | Owner | Success Rate (%) |
|---|---|---|---|
| 🥇 GPT-4 | 06.27.2023 | OpenAI | 48.5 |
| 🥈 GPT-3.5-Turbo | 06.27.2023 | OpenAI | 46.5 |
| 🥉 CodeLlama-34B-INST | 10.12.2023 | Meta | 36.0 |
| Lemur-70B-Chat | 10.12.2023 | Salesforce | 34.5 |
| GPT-3.5-Turbo⁰ | 06.27.2023 | OpenAI | 34.5 |
| GPT-4⁰ | 06.27.2023 | OpenAI | 34.0 |
| Llama-2-70B-Chat | 06.27.2023 | Meta | 31.5 |
| Vicuna-13B | 06.27.2023 | Open Source | 24.5 |
| StarChat-16B | 06.27.2023 | Open Source | 23.7 |
| text-bison-001 | 06.27.2023 | Google | 22.5 |
| chat-bison-001 | 06.27.2023 | Google | 19.2 |
| chat-bison-001⁰ | 06.27.2023 | Google | 17.7 |
| StarChat-16B⁰ | 06.27.2023 | Open Source | 17.7 |
| text-bison-001⁰ | 06.27.2023 | Google | 17.0 |
| Vicuna-13B⁰ | 06.27.2023 | Open Source | 16.0 |

IC-CTF

| Model | Date | Owner | Success Rate (%) |
|---|---|---|---|
| 🥇 GPT-4 | 10.12.2023 | OpenAI | 37.0 |
| 🥈 Lemur-70B-Chat | 10.12.2023 | Salesforce | 22.0 |
| 🥉 CodeLlama-34B-INST | 10.12.2023 | Meta | 16.0 |
| GPT-3.5-Turbo | 10.12.2023 | OpenAI | 11.0 |
| Llama-2-70B-Chat | 10.12.2023 | Meta | 9.0 |

IC-Python

No submissions yet. Be the first!

IC-SQL

| Model | Date | Owner | Success Rate (%) |
|---|---|---|---|
| 🥇 GPT-4 | 06.27.2023 | OpenAI | 84.4 |
| 🥈 Lemur-70B-Chat | 10.12.2023 | Salesforce | 73.39 |
| 🥉 GPT-3.5-Turbo | 06.27.2023 | OpenAI | 72.82 |
| Llama-2-70B-Chat | 10.12.2023 | Meta | 67.89 |
| CodeLlama-34B-INST | 10.12.2023 | Meta | 67.79 |
| text-bison-001 | 06.27.2023 | Google | 12.9 |
| text-bison-001⁰ | 06.27.2023 | Google | 11.5 |
| GPT-3.5-Turbo⁰ | 06.27.2023 | OpenAI | 10.5 |
| chat-bison-001 | 06.27.2023 | Google | 9.9 |
| StarChat-16B | 06.27.2023 | Open Source | 9.7 |
| GPT-4⁰ | 06.27.2023 | OpenAI | 9.1 |
| StarChat-16B⁰ | 06.27.2023 | Open Source | 8.9 |
| chat-bison-001⁰ | 06.27.2023 | Google | 7.9 |
| Vicuna-13B | 06.27.2023 | Open Source | 6.3 |
| Vicuna-13B⁰ | 06.27.2023 | Open Source | 2.6 |

IC-SWE

No submissions yet. Be the first!