GitHub - isHuangXin/ISSTA22-CodeStudy: This repo illustrates how to evaluate the artifacts in the paper An Extensive Study on Pre-trained Models for Program Understanding and Generation published in ISSTA'22.

General Introduction

This repo illustrates how to evaluate the artifacts of the paper An Extensive Study on Pre-trained Models for Program Understanding and Generation, published at ISSTA'22. Specifically, we extensively investigate the effectiveness and limitations of NL-PL pre-trained models on program understanding and generation tasks. First, we find that the performance of different pre-trained models fluctuates across tasks and datasets, which indicates that it is challenging to propose a single pre-trained model that is best across task types, and that reliable experiments are essential for demonstrating the superiority of a newly proposed model. Second, we validate the superiority of pre-trained models over conventional/previous SOTA methods on different downstream tasks. Finally, we perform the first study of NL-PL pre-trained model robustness via adversarial attacks and find that existing pre-trained models are rather vulnerable, e.g., they can be easily attacked by a simple random attack approach, and that current strategies for improving the robustness of pre-trained code models have limited effectiveness. Researchers should therefore put more effort into schemes that integrate additional information with pre-training.

Because neural network training is stochastic, users may obtain slightly different results when retraining the models. Such differences are usually tolerable, i.e., they mostly do not conflict with the conclusions of the paper.
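One common way to narrow run-to-run variation is to fix the random seeds before training. The sketch below is an illustration only: the `set_seed` helper and the default seed value are assumptions, not part of this repo, and GPU kernels can still introduce small nondeterministic differences even with fixed seeds.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed every random number generator available in this environment.

    Hypothetical helper: the name and the default seed are assumptions.
    """
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # numpy is not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        # Safe to call without CUDA; it is silently ignored in that case.
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch is not installed; nothing to seed
```

Calling `set_seed(42)` once at the top of a training script, before any model or data loader is created, is the usual placement.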

Environment Preparation

Hardware: 8 × 2080Ti GPUs; Intel Core i7 CPU; 128GB RAM; 50GB free disk space, or above.
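A quick pre-flight check of the machine could look like the following sketch. The `check_hardware` function is hypothetical; its default thresholds simply mirror the GPU count and free disk space listed above, and the GPU count is only queried when `torch` is already installed.

```python
import shutil


def check_hardware(path: str = ".", min_free_gb: float = 50.0, min_gpus: int = 8) -> dict:
    """Report free disk space and visible GPU count against the given thresholds.

    Hypothetical helper; thresholds default to the hardware listed in this README.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    try:
        import torch

        gpu_count = torch.cuda.device_count()
    except ImportError:
        gpu_count = None  # torch is not installed, so GPUs cannot be counted here
    return {
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
        "gpu_count": gpu_count,
        "gpus_ok": gpu_count is not None and gpu_count >= min_gpus,
    }


if __name__ == "__main__":
    print(check_hardware())
```

Fewer GPUs may still work for individual tasks; the check only flags a difference from the listed setup.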

Python 3.8+

argparse               1.4.0
numpy                  1.20.1
pandas                 1.2.3
matplotlib             3.4.1
sklearn                0.0
tqdm                   4.59.0
torch                  1.8.2
torchaudio             0.8.2
torchvision            0.9.2
tensorboardX           2.5.1
transformers           4.19.2
tokenizers             0.12.1
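To compare an existing environment against the versions listed above, a small script along these lines may help. It is a sketch: `PINNED`, `matches`, and `compare_versions` are names introduced here for illustration, and the comparison ignores local build suffixes such as `+cu111` that some wheels append to their version string.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions copied from the list above.
PINNED = {
    "argparse": "1.4.0",
    "numpy": "1.20.1",
    "pandas": "1.2.3",
    "matplotlib": "3.4.1",
    "sklearn": "0.0",
    "tqdm": "4.59.0",
    "torch": "1.8.2",
    "torchaudio": "0.8.2",
    "torchvision": "0.9.2",
    "tensorboardX": "2.5.1",
    "transformers": "4.19.2",
    "tokenizers": "0.12.1",
}


def matches(expected: str, installed: Optional[str]) -> bool:
    """True when the installed version equals the expected one, ignoring a '+local' suffix."""
    return installed is not None and installed.split("+")[0] == expected


def compare_versions(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package to (expected, installed); installed is None if the package is missing."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in compare_versions(PINNED).items():
        status = "ok" if matches(expected, installed) else "differs"
        print(f"{name:14s} expected {expected:8s} installed {installed} [{status}]")
```

A "differs" line does not necessarily mean the artifacts will fail; it only marks a deviation from the listed environment.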

Ubuntu 18.04
CUDA Toolkit

Code Structure

The program directories and files used for artifact evaluation are listed below.

./DGMS: Non-pre-trained SOTA approach for code search.
./FCDetector: Non-pre-trained SOTA approach for clone detection.
./PLBART: Source code of PLBART.
./rencos: Non-pre-trained SOTA approach for code summarization.
./ReVeal: Non-pre-trained SOTA approach for defect detection.
./Task: Directory containing the 7 program understanding and generation tasks and the corresponding pre-trained models.
- Clone-Detection-BigCloneBench
- Clone-Detection-FCDetector
- Clone-Detection-POJ-104
- Code-Generation
- Code-Refinement
- Code-Search
- Code-Search-FB-Java
- Code-Summarization
- Code-Summarization-Rencos
- Code-Translation
- Defect-Detection
- Defect-Detection-Reveal
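After cloning, the layout listed above can be checked with a short script. This is a sketch under the assumption that each listed task entry is a subdirectory of ./Task; `EXPECTED_DIRS` and `missing_dirs` are names introduced here for illustration.

```python
from pathlib import Path
from typing import List, Sequence

# Directory names copied from the listing above; task entries are assumed to live under Task/.
EXPECTED_DIRS = [
    "DGMS",
    "FCDetector",
    "PLBART",
    "rencos",
    "ReVeal",
    "Task/Clone-Detection-BigCloneBench",
    "Task/Clone-Detection-FCDetector",
    "Task/Clone-Detection-POJ-104",
    "Task/Code-Generation",
    "Task/Code-Refinement",
    "Task/Code-Search",
    "Task/Code-Search-FB-Java",
    "Task/Code-Summarization",
    "Task/Code-Summarization-Rencos",
    "Task/Code-Translation",
    "Task/Defect-Detection",
    "Task/Defect-Detection-Reveal",
]


def missing_dirs(repo_root: str, expected: Sequence[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    print("All listed directories found." if not missing else f"Missing: {missing}")
```

Run it from the repository root; an empty result means every listed directory is present.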

Adversarial Attack

The adversarial attack artifact for pre-trained code models can be found in CodeAttack.

Acknowledgement
