This page contains links to download the Project CodeNet dataset hosted as part of Data Asset eXchange (DAX).
Mini Project CodeNet is a scaled-down version of the full Project CodeNet dataset. Its purpose is to give people who cannot or do not want to download the full dataset a sense of the structure and contents of the full dataset. Mini Project CodeNet is only meant to be informative and is not intended to be used for any experiments.
Within the Mini_Project_CodeNet directory, the data sub-directory contains submissions for 8 problems in 6 programming languages (C, C++, Java, Python, Go, and Ruby). The metadata sub-directory contains CSV files with information on these problems and submissions. The data and metadata sub-directories have the same structure as the original Project CodeNet dataset but contain much less data. Mini_Project_CodeNet does not contain the derived or problem_descriptions sub-directories, which are present in the full Project CodeNet dataset.
This dataset contains several benchmark tar files, which are useful for code classification and code similarity tasks. The benchmarks consist of code samples extracted from Project CodeNet and filtered in the following ways:
The file Project_CodeNet_C++1000.tar.gz expands to a directory Project_CodeNet_C++1000 that contains 1000 directories. Each directory corresponds to a problem in Project CodeNet, and the problem ID in this benchmark matches the ID of the same problem in the Project CodeNet dataset. Each directory contains 500 C++ files, for a total of 500,000 C++ code samples.
The file Project_CodeNet_C++1400.tar.gz expands to a directory Project_CodeNet_C++1400 that contains 1400 directories. Each directory corresponds to a problem in Project CodeNet, and the problem ID in this benchmark matches the ID of the same problem in the Project CodeNet dataset. Each directory contains 300 C++ files, for a total of 420,000 C++ code samples. For the two C++ benchmarks suffixed with "preprocessed", the preprocessor directives in the C++ code samples have been resolved.
The file Project_CodeNet_Python800.tar.gz expands to a directory Project_CodeNet_Python800 that contains 800 directories. Each directory corresponds to a problem in Project CodeNet, and the problem ID in this benchmark matches the ID of the same problem in the Project CodeNet dataset. Each directory contains 300 Python files, for a total of 240,000 Python code samples.
The file Project_CodeNet_Java250.tar.gz expands to a directory Project_CodeNet_Java250 that contains 250 directories. Each directory corresponds to a problem in Project CodeNet, and the problem ID in this benchmark matches the ID of the same problem in the Project CodeNet dataset. Each directory contains 300 Java files, for a total of 75,000 Java code samples.
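The layout of the four source-code benchmarks described above can be checked after extraction. The following is a minimal Python sketch, not part of the dataset tooling: the function names count_samples and check_benchmark are hypothetical, and the sketch assumes that each problem directory holds its code files directly (no nested sub-directories) and that the extracted directory keeps the name given above.

```python
from pathlib import Path

# Counts taken from the benchmark descriptions above:
# benchmark directory name -> (problem directories, files per directory)
EXPECTED = {
    "Project_CodeNet_C++1000": (1000, 500),
    "Project_CodeNet_C++1400": (1400, 300),
    "Project_CodeNet_Python800": (800, 300),
    "Project_CodeNet_Java250": (250, 300),
}


def count_samples(benchmark_dir):
    """Return {problem_id: number_of_files} for one extracted benchmark.

    Assumption: code files sit directly inside each problem directory.
    """
    root = Path(benchmark_dir)
    return {
        problem.name: sum(1 for f in problem.iterdir() if f.is_file())
        for problem in sorted(root.iterdir())
        if problem.is_dir()
    }


def check_benchmark(benchmark_dir, expected=EXPECTED):
    """Return True if the directory matches the documented counts."""
    root = Path(benchmark_dir)
    n_problems, per_problem = expected[root.name]
    counts = count_samples(root)
    return len(counts) == n_problems and all(
        n == per_problem for n in counts.values()
    )
```

For example, check_benchmark("Project_CodeNet_Java250") should return True for a complete extraction if the assumptions above hold; a False result points to a partial download or a different directory layout.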
The file Project_CodeNet_C++1000_spts.tar.gz expands to a directory Project_CodeNet_C++1000_spts that contains 8 CSV files, which are the simplified parse tree (SPT) inputs to the graph neural network (GNN) modeling for the Project_CodeNet_C++1000 benchmark.
The file Project_CodeNet_C++1400_spts.tar.gz expands to a directory Project_CodeNet_C++1400_spts that contains 8 CSV files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_C++1400 benchmark.
The file Project_CodeNet_Python800_spts.tar.gz expands to a directory Project_CodeNet_Python800_spts that contains 8 CSV files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_Python800 benchmark.
The file Project_CodeNet_Java250_spts.tar.gz expands to a directory Project_CodeNet_Java250_spts that contains 8 CSV files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_Java250 benchmark.
The CSV files for each benchmark are:
edge.csv
node-feat.csv
node_dfs_order.csv
num-edge-list.csv
graph-label.csv
node_depth.csv
node_is_attributed.csv
num-node-list.csv
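Because every SPT archive is described as containing the same eight files, a quick completeness check is possible after extraction. The sketch below is illustrative only: the function name missing_spt_files is hypothetical, and it assumes the CSV files sit directly inside the extracted _spts directory rather than in a nested folder.

```python
from pathlib import Path

# The eight file names listed above.
SPT_FILES = (
    "edge.csv",
    "node-feat.csv",
    "node_dfs_order.csv",
    "num-edge-list.csv",
    "graph-label.csv",
    "node_depth.csv",
    "node_is_attributed.csv",
    "num-node-list.csv",
)


def missing_spt_files(spts_dir):
    """Return the expected CSV file names that are absent from spts_dir.

    Assumption: the files live at the top level of the directory.
    """
    root = Path(spts_dir)
    return [name for name in SPT_FILES if not (root / name).is_file()]
```

An empty list from missing_spt_files("Project_CodeNet_Java250_spts") would indicate that all eight files are present.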
The following archives contain the context-aware semantic structure (CASS) files needed to run the MISIM tool on the Project CodeNet benchmarks.
The file Project_CodeNet_Java250_cass.tar.gz expands to a directory Project_CodeNet_Java250 that has a cass directory containing CASS files and a split file for the Project_CodeNet_Java250 benchmark.
The file Project_CodeNet_Python800_cass.tar.gz expands to a directory Project_CodeNet_Python800 that has a cass directory containing CASS files and a split file for the Project_CodeNet_Python800 benchmark.
The file Project_CodeNet_C++1000_cass.tar.gz expands to a directory Project_CodeNet_C++1000 that has a cass directory containing CASS files and a split file for the Project_CodeNet_C++1000 benchmark.
The file Project_CodeNet_C++1400_cass.tar.gz expands to a directory Project_CodeNet_C++1400 that has a cass directory containing CASS files and a split file for the Project_CodeNet_C++1400 benchmark.
The experimentation validation dataset, Project_CodeNet_contest_experimentation_dataset.tar.gz, contains pairs of C++ code samples that were curated from the original Project CodeNet dataset for use in the CodeNet Challenge. The dataset consists of 50% similar pairs (i.e., both samples solve the same problem) and 50% dissimilar pairs (i.e., the samples solve different problems). It is in the same format that the 'Dev' and 'Final' phase test datasets will use, with a couple of exceptions.
The dataset contains a directory called 'data', which contains a subdirectory for each problem number, e.g., 'p03023'. The individual code solution files reside within the appropriate problem number directories. The problem number directories will not be present in the 'Dev' and 'Final' phase test sets.
There is a file called 'pairs.csv', which has three columns: 'pair-id', 'file1', and 'file2'. The file paths for the code submissions are relative to the 'data' directory. Finally, there is the file 'ground_truth.csv', which will not be present in the 'Dev' and 'Final' phases. The 'pair-id' column in 'ground_truth.csv' corresponds to the 'pair-id' column in 'pairs.csv'. The 'similar' column contains a '1' when the two solutions solve the same problem and a '0' when the solutions solve different problems.
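To illustrate how 'pairs.csv' and 'ground_truth.csv' relate, the following Python sketch joins the two files on 'pair-id'. It is a hedged example rather than official challenge code: the function name load_labeled_pairs is hypothetical, and the sketch assumes that both CSV files are comma-separated, start with a header row using the column names given above, and sit at the top level of the extracted directory next to 'data'.

```python
import csv
from pathlib import Path


def load_labeled_pairs(dataset_dir):
    """Return a list of (file1_path, file2_path, similar) tuples.

    Assumptions: pairs.csv and ground_truth.csv are at the top level of
    dataset_dir and each has a header row with the documented columns.
    """
    root = Path(dataset_dir)

    # Map each pair-id to its 0/1 similarity label.
    with open(root / "ground_truth.csv", newline="") as f:
        labels = {row["pair-id"]: int(row["similar"]) for row in csv.DictReader(f)}

    labeled = []
    with open(root / "pairs.csv", newline="") as f:
        for row in csv.DictReader(f):
            labeled.append(
                (
                    # File paths in pairs.csv are relative to 'data'.
                    root / "data" / row["file1"],
                    root / "data" / row["file2"],
                    labels[row["pair-id"]],
                )
            )
    return labeled
```

With the stated 50/50 split, the mean of the third element over all returned tuples should be 0.5. In the 'Dev' and 'Final' phases, where 'ground_truth.csv' is absent, only the 'pairs.csv' part of this sketch would apply.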