IBM Data Asset eXchange - Project CodeNet

This page contains links to download the Project CodeNet dataset hosted as part of Data Asset eXchange (DAX).

Full Dataset:

Project_CodeNet.tar.gz - 7.8GB

Mini Project CodeNet

Mini Project CodeNet is a scaled-down version of the full Project CodeNet dataset. Its purpose is to give people who cannot or do not want to download the full dataset a sense of the structure and contents of the full dataset. Mini Project CodeNet is only meant to be informative and is not intended to be used for any experiments.

Within the Mini_Project_CodeNet directory, the data sub-directory contains submissions for 8 problems in 6 programming languages (C, C++, Java, Python, Go, and Ruby). The metadata sub-directory contains CSV files with information on these problems and submissions. The data and metadata sub-directories have the same structure as the original Project CodeNet dataset but contain much less data. Mini_Project_CodeNet does not contain the derived or problem_descriptions sub-directories, which are present in the full Project CodeNet dataset.

Other Samples:

Sample for language classification: Project_CodeNet_LangClass.tar.gz - 397KB
Sample for masked language models: Project_CodeNet_MLM.tar.gz - 51KB

Language Specific Benchmarks:

About the Benchmarks

This dataset contains several benchmark tar files. These benchmarks are useful for code classification and code similarity tasks. They consist of code samples extracted from CodeNet and filtered in the following ways:

Each code sample is "unique" in the sense that it is not a near-duplicate of any other sample. Specifically, code samples are clustered based on Jaccard distance, and only one representative from each cluster is included in the benchmark.
Each problem is compared with the rest and identical problems are not included in the benchmark. However, we cannot guarantee that there are no identical problems, since the tests we performed are not exhaustive.
Code samples with a large fraction of "dead" code are not included. Again, we cannot guarantee that the remaining code samples contain no dead code.
A token sequence can be successfully constructed using the tokenizer in the CodeNet repo.
A simple parse tree (SPT) can be successfully constructed using the SPT-generator in the CodeNet repo.
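The near-duplicate filter described above can be illustrated with a minimal sketch. It assumes each sample is represented by a list of tokens, that Jaccard distance is computed over token sets, and that clustering is a simple greedy pass with a hypothetical threshold of 0.1. None of these details are stated on this page, so treat them as illustrative choices rather than the procedure actually used to build the benchmarks.

```python
def jaccard_distance(tokens_a, tokens_b):
    """Jaccard distance between two token collections, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty samples are considered identical
    return 1.0 - len(set_a & set_b) / len(union)


def pick_representatives(samples, threshold=0.1):
    """Greedy clustering sketch: keep a sample only if it is farther than
    `threshold` (an assumed value) from every representative kept so far."""
    representatives = []
    for name, tokens in samples:
        if all(jaccard_distance(tokens, kept) > threshold
               for _, kept in representatives):
            representatives.append((name, tokens))
    return [name for name, _ in representatives]


samples = [
    ("s1", ["int", "x", "=", "1", ";"]),
    ("s2", ["int", "x", "=", "1", ";"]),   # duplicate of s1, dropped
    ("s3", ["print", "(", "hello", ")"]),  # unrelated, kept
]
print(pick_representatives(samples))
```

In this sketch the first sample seen in each cluster acts as the representative. Any other selection rule would fit the description equally well.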

The file Project_CodeNet_C++1000.tar.gz expands to a directory Project_CodeNet_C++1000 that contains 1000 directories. Each directory corresponds to a problem in Project CodeNet, where the problem id in this benchmark corresponds to the same problem in the Project CodeNet dataset. Each directory contains 500 C++ files, so there is a total of 500000 C++ code samples.

The file Project_CodeNet_C++1400.tar.gz expands to a directory Project_CodeNet_C++1400 that contains 1400 directories. Each directory corresponds to a problem in Project CodeNet, where the problem id in this benchmark corresponds to the same problem in the Project CodeNet dataset. Each directory contains 300 C++ files, so there is a total of 420000 C++ code samples. For the two C++ benchmarks suffixed with "preprocessed", the preprocessor directives in the C++ code samples have been resolved.

The file Project_CodeNet_Python800.tar.gz expands to a directory Project_CodeNet_Python800 that contains 800 directories. Each directory corresponds to a problem in Project CodeNet, where the problem id in this benchmark corresponds to the same problem in the Project CodeNet dataset. Each directory contains 300 Python files, so there is a total of 240000 Python code samples.

The file Project_CodeNet_Java250.tar.gz expands to a directory Project_CodeNet_Java250 that contains 250 directories. Each directory corresponds to a problem in Project CodeNet, where the problem id in this benchmark corresponds to the same problem in the Project CodeNet dataset. Each directory contains 300 Java files, so there is a total of 75000 Java code samples.
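Because every benchmark uses one directory per problem, the directory name can serve as the class label for code classification. The sketch below walks an extracted benchmark directory and returns (file path, problem id) pairs. The function names are hypothetical, and the sketch assumes the problem directories sit directly under the extracted top-level directory with the source files directly inside them.

```python
from collections import Counter
from pathlib import Path


def list_samples(benchmark_dir):
    """Return (file path, problem id) pairs, one per code sample.
    The problem id is taken from the name of the enclosing directory."""
    samples = []
    for problem_dir in sorted(Path(benchmark_dir).iterdir()):
        if not problem_dir.is_dir():
            continue
        for source_file in sorted(problem_dir.iterdir()):
            if source_file.is_file():
                samples.append((source_file, problem_dir.name))
    return samples


def count_per_problem(samples):
    """Number of code samples per problem id."""
    return dict(Counter(problem_id for _, problem_id in samples))
```

For example, calling count_per_problem(list_samples("Project_CodeNet_Java250")) on the extracted Java benchmark should report 300 files for each of the 250 problem directories if the layout matches the description above.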

The file Project_CodeNet_C++1000_spts.tar.gz expands to a directory Project_CodeNet_C++1000_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_C++1000 benchmark.

The file Project_CodeNet_C++1400_spts.tar.gz expands to a directory Project_CodeNet_C++1400_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_C++1400 benchmark.

The file Project_CodeNet_Python800_spts.tar.gz expands to a directory Project_CodeNet_Python800_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_Python800 benchmark.

The file Project_CodeNet_Java250_spts.tar.gz expands to a directory Project_CodeNet_Java250_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_Java250 benchmark.

Description of csv files for each benchmark:

edge.csv:
Each line is a pair of (source node ID, target node ID)
node-feat.csv:
Each line is a feature vector of a node. Each node contains 4 features: (i) is it a token (ii) token type (iii) parsing rule type (iv) is it a reserved word
node_dfs_order.csv:
Each line is the DFS ID of a node.
num-edge-list.csv:
Each line is the number of edges in each SPT.
graph-label.csv:
Each line is the class of an SPT.
node_depth.csv:
Each line is the distance between the corresponding node and its SPT root.
node_is_attributed.csv:
Each line indicates if the node is a leaf node or an internal node.
num-node-list.csv:
Each line is the number of nodes in an SPT.
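Since edge.csv holds the edges of all SPTs in one file and num-edge-list.csv gives the edge count per SPT, the two can be combined to recover the edge list of each individual tree. The following is a minimal sketch under assumptions not confirmed on this page: the files are uncompressed, comma-separated, have no header row, and the rows of edge.csv are stored tree by tree in the same order as num-edge-list.csv. The helper names are hypothetical.

```python
import csv
from itertools import islice


def read_int_rows(path):
    """Read a headerless CSV file into a list of integer rows."""
    with open(path, newline="") as handle:
        return [[int(value) for value in row]
                for row in csv.reader(handle) if row]


def split_edges_by_graph(edges, edges_per_graph):
    """Slice the flat edge list into one list per SPT, assuming edges are
    stored consecutively in the order given by the per-SPT counts."""
    edge_iter = iter(edges)
    return [list(islice(edge_iter, count)) for count in edges_per_graph]


# Toy data: two SPTs with 2 and 1 edges respectively.
edges = [[0, 1], [0, 2], [0, 1]]
edges_per_graph = [2, 1]
print(split_edges_by_graph(edges, edges_per_graph))
```

With the real files, edges would come from read_int_rows on edge.csv, and edges_per_graph from the first column of read_int_rows on num-edge-list.csv. The same slicing idea applies to the per-node files using the counts in num-node-list.csv.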

CASS files

Project_CodeNet_Python800_cass.tar.gz - 46MB
Project_CodeNet_Java250_cass.tar.gz - 20MB
Project_CodeNet_C++1000_cass.tar.gz - 164MB
Project_CodeNet_C++1400_cass.tar.gz - 176MB

About the CASS files

These archives contain the context-aware semantic structure (CASS) files necessary to run the MISIM tool on the Project CodeNet benchmarks.

The file Project_CodeNet_Java250_cass.tar.gz expands to a directory Project_CodeNet_Java250 that has a cass directory containing CASS files and a split file for the Project_CodeNet_Java250 benchmark.

The file Project_CodeNet_Python800_cass.tar.gz expands to a directory Project_CodeNet_Python800 that has a cass directory containing CASS files and a split file for the Project_CodeNet_Python800 benchmark.

The file Project_CodeNet_C++1000_cass.tar.gz expands to a directory Project_CodeNet_C++1000 that has a cass directory containing CASS files and a split file for the Project_CodeNet_C++1000 benchmark.

The file Project_CodeNet_C++1400_cass.tar.gz expands to a directory Project_CodeNet_C++1400 that has a cass directory containing CASS files and a split file for the Project_CodeNet_C++1400 benchmark.

Experimentation Validation Data Set

The experimentation validation data set, Project_CodeNet_contest_experimentation_dataset.tar.gz, contains pairs of C++ code samples that were curated from the original Project CodeNet dataset for use in the CodeNet Challenge. This dataset consists of 50% similar pairs (i.e., both samples solve the same problem) and 50% dissimilar pairs (i.e., the samples solve different problems). It is in the same format that the 'Dev' and 'Final' phase test data sets will be in, with two exceptions noted below. This dataset contains a directory called 'data', which contains a subdirectory for each problem number, e.g., 'p03023'. The individual code solution files reside within the appropriate problem number directories. The problem number directories will not be present in the 'Dev' and 'Final' phase test sets. There is a file called 'pairs.csv', which has three columns: 'pair-id', 'file1', and 'file2'. The file paths for the code submissions are relative to the 'data' directory. Finally, there is the 'ground_truth.csv' file, which will not be present in the 'Dev' and 'Final' phases. The 'pair-id' column in 'ground_truth.csv' corresponds to the 'pair-id' column in 'pairs.csv'. The 'similar' column contains a '1' when the two solutions solve the same problem and a '0' when the solutions solve different problems.
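The relationship between 'pairs.csv' and 'ground_truth.csv' can be sketched as a join on 'pair-id'. The example below assumes both files sit at the top level of the extracted dataset next to the 'data' directory and that each starts with a header row carrying the column names given above. Neither detail is confirmed on this page, and the function names are hypothetical.

```python
import csv
from pathlib import Path


def load_labeled_pairs(dataset_dir):
    """Join pairs.csv with ground_truth.csv on 'pair-id' and return
    (path to file1, path to file2, similar) tuples."""
    root = Path(dataset_dir)
    with open(root / "ground_truth.csv", newline="") as handle:
        labels = {row["pair-id"]: int(row["similar"])
                  for row in csv.DictReader(handle)}
    with open(root / "pairs.csv", newline="") as handle:
        # file1 and file2 are relative to the 'data' directory.
        return [(root / "data" / row["file1"],
                 root / "data" / row["file2"],
                 labels[row["pair-id"]])
                for row in csv.DictReader(handle)]


def accuracy(predictions, labeled_pairs):
    """Fraction of pairs whose predicted label (1 or 0) matches 'similar'."""
    correct = sum(int(pred == label)
                  for pred, (_, _, label) in zip(predictions, labeled_pairs))
    return correct / len(labeled_pairs)
```

Because the data set is balanced, a model that always predicts 1 would score about 0.5 under this accuracy measure, which gives a simple baseline to compare against.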