IBM Data Asset eXchange - Project CodeNet

This page contains links to download the Project CodeNet dataset hosted as part of Data Asset eXchange (DAX).

Full Dataset:

Mini Project CodeNet

  • Mini_Project_CodeNet.tar.gz - 1.4MB
  • Mini Project CodeNet is a scaled-down version of the full Project CodeNet dataset. Its purpose is to give people who cannot or do not want to download the full dataset a sense of the structure and contents of the full dataset. Mini Project CodeNet is only meant to be informative and is not intended to be used for any experiments.

    Within the Mini_Project_CodeNet directory, the data sub-directory contains submissions for 8 problems in 6 programming languages (C, C++, Java, Python, Go, and Ruby). The metadata sub-directory contains CSV files with information on these problems and submissions. The data and metadata sub-directories have the same structure as the original Project CodeNet dataset but contain much less data. Mini_Project_CodeNet does not contain the derived or problem_descriptions sub-directories, which are present in the full Project CodeNet dataset.
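
    A quick way to get a feel for the extracted archive is to count the files it contains. The following is a minimal sketch, not an official tool: the function name summarize_mini_codenet is hypothetical, and it assumes only what is stated above, namely that the extracted directory has a data sub-directory with submissions and a metadata sub-directory with CSV files. It makes no assumption about how data is nested internally.

```python
from collections import Counter
from pathlib import Path


def summarize_mini_codenet(root):
    """Summarize an extracted Mini_Project_CodeNet directory.

    Returns a Counter of file suffixes found anywhere under `root/data`
    and a sorted list of the CSV file names directly under `root/metadata`.
    """
    root = Path(root)
    suffix_counts = Counter(
        path.suffix for path in (root / "data").rglob("*") if path.is_file()
    )
    metadata_files = sorted(path.name for path in (root / "metadata").glob("*.csv"))
    return suffix_counts, metadata_files


if __name__ == "__main__":
    # Assumes the archive was extracted into the current directory.
    counts, metadata = summarize_mini_codenet("Mini_Project_CodeNet")
    print(counts.most_common())
    print(metadata)
```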

    Other Samples:

    Language Specific Benchmarks:

    About the Benchmarks

    This dataset contains several benchmark tar files. The benchmarks are useful for code classification and code similarity tasks. They consist of code samples extracted from Project CodeNet and filtered in the following ways:

    The file Project_CodeNet_C++1000.tar.gz expands to a directory Project_CodeNet_C++1000 that contains 1000 directories. Each directory corresponds to a problem in Project CodeNet, and each problem ID in this benchmark refers to the same problem in the Project CodeNet dataset. Each directory contains 500 C++ files, for a total of 500,000 C++ code samples.

    The file Project_CodeNet_C++1400.tar.gz expands to a directory Project_CodeNet_C++1400 that contains 1400 directories. Each directory corresponds to a problem in Project CodeNet, and each problem ID in this benchmark refers to the same problem in the Project CodeNet dataset. Each directory contains 300 C++ files, for a total of 420,000 C++ code samples. For the two C++ benchmarks suffixed with "preprocessed", the preprocessor directives in the C++ code samples have been resolved.

    The file Project_CodeNet_Python800.tar.gz expands to a directory Project_CodeNet_Python800 that contains 800 directories. Each directory corresponds to a problem in Project CodeNet, and each problem ID in this benchmark refers to the same problem in the Project CodeNet dataset. Each directory contains 300 Python files, for a total of 240,000 Python code samples.

    The file Project_CodeNet_Java250.tar.gz expands to a directory Project_CodeNet_Java250 that contains 250 directories. Each directory corresponds to a problem in Project CodeNet, and each problem ID in this benchmark refers to the same problem in the Project CodeNet dataset. Each directory contains 300 Java files, for a total of 75,000 Java code samples.
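
    Because every benchmark uses the same layout (one directory per problem, each holding that problem's code samples), the directory name can serve as the class label. The sketch below shows one possible way to index an extracted benchmark and check it against the counts stated above. The function names index_benchmark and check_counts are hypothetical, and the code assumes the problem directories contain nothing but code sample files.

```python
from pathlib import Path


def index_benchmark(root):
    """Return a list of (file_path, problem_id) pairs for an extracted benchmark.

    Assumes `root` holds one sub-directory per problem and that each
    sub-directory holds only that problem's code samples.
    """
    samples = []
    for problem_dir in sorted(Path(root).iterdir()):
        if not problem_dir.is_dir():
            continue
        for source_file in sorted(problem_dir.iterdir()):
            if source_file.is_file():
                samples.append((source_file, problem_dir.name))
    return samples


def check_counts(samples, expected_problems, expected_per_problem):
    """Return True if the samples match the stated benchmark dimensions."""
    per_problem = {}
    for _, problem_id in samples:
        per_problem[problem_id] = per_problem.get(problem_id, 0) + 1
    return len(per_problem) == expected_problems and all(
        count == expected_per_problem for count in per_problem.values()
    )


if __name__ == "__main__":
    # Java250 is described as 250 problems with 300 files each (75,000 total).
    java_samples = index_benchmark("Project_CodeNet_Java250")
    print(len(java_samples), check_counts(java_samples, 250, 300))
```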

    The file Project_CodeNet_C++1000_spts.tar.gz expands to a directory Project_CodeNet_C++1000_spts that contains 8 csv files, which are the simplified parse tree (SPT) inputs to the graph neural network (GNN) modeling for the Project_CodeNet_C++1000 benchmark.

    The file Project_CodeNet_C++1400_spts.tar.gz expands to a directory Project_CodeNet_C++1400_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_C++1400 benchmark.

    The file Project_CodeNet_Python800_spts.tar.gz expands to a directory Project_CodeNet_Python800_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_Python800 benchmark.

    The file Project_CodeNet_Java250_spts.tar.gz expands to a directory Project_CodeNet_Java250_spts that contains 8 csv files, which are the SPT inputs to the GNN modeling for the Project_CodeNet_Java250 benchmark.

    Description of csv files for each benchmark:

    1. edge.csv:
      Each line is a pair of (source node ID, target node ID).
    2. node-feat.csv:
      Each line is the feature vector of a node. Each node has 4 features: (i) whether it is a token, (ii) token type, (iii) parsing rule type, and (iv) whether it is a reserved word.
    3. node_dfs_order.csv:
      Each line is the DFS ID of a node.
    4. num-edge-list.csv:
      Each line is the number of edges in an SPT.
    5. graph-label.csv:
      Each line is the class of an SPT.
    6. node_depth.csv:
      Each line is the distance between the corresponding node and its SPT root.
    7. node_is_attributed.csv:
      Each line indicates whether the node is a leaf node or an internal node.
    8. num-node-list.csv:
      Each line is the number of nodes in an SPT.
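
    The per-SPT counts in num-node-list.csv and num-edge-list.csv suggest that the node and edge files concatenate the rows of all SPTs, so the counts can be used to cut them back into individual graphs. The sketch below illustrates that idea under several assumptions that are not confirmed by this page: the files are comma-separated, have no header row, hold integer values, and list the SPTs in the same order in every file. The helper names (read_int_rows, split_by_counts, load_spts) are hypothetical. Whether the node IDs in edge.csv are local to each SPT or global is also not stated here, so the sketch leaves them untouched.

```python
import csv
from pathlib import Path


def read_int_rows(path):
    """Read a headerless CSV file into a list of integer rows."""
    with open(path, newline="") as handle:
        return [[int(value) for value in row] for row in csv.reader(handle) if row]


def split_by_counts(rows, counts):
    """Cut a flat list of rows into consecutive chunks of the given sizes."""
    if sum(counts) != len(rows):
        raise ValueError("counts do not add up to the number of rows")
    chunks, start = [], 0
    for count in counts:
        chunks.append(rows[start:start + count])
        start += count
    return chunks


def load_spts(directory):
    """Group edges and node features per SPT and attach each graph label."""
    directory = Path(directory)
    num_nodes = [row[0] for row in read_int_rows(directory / "num-node-list.csv")]
    num_edges = [row[0] for row in read_int_rows(directory / "num-edge-list.csv")]
    labels = [row[0] for row in read_int_rows(directory / "graph-label.csv")]
    if not (len(num_nodes) == len(num_edges) == len(labels)):
        raise ValueError("per-SPT files disagree on the number of SPTs")
    edges = split_by_counts(read_int_rows(directory / "edge.csv"), num_edges)
    feats = split_by_counts(read_int_rows(directory / "node-feat.csv"), num_nodes)
    return [
        {"label": label, "edges": graph_edges, "node_feat": graph_feats}
        for label, graph_edges, graph_feats in zip(labels, edges, feats)
    ]
```

    The same split_by_counts call with the node counts would apply to node_dfs_order.csv, node_depth.csv, and node_is_attributed.csv, since each of those has one line per node. This version reads everything into memory, which may not be practical for the larger benchmarks.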

    CASS files

    About the CASS files

    The following archives contain the context-aware semantic structure (CASS) files needed to run the MISIM tool on the Project CodeNet benchmarks.

    The file Project_CodeNet_Java250_cass.tar.gz expands to a directory Project_CodeNet_Java250 that has a cass directory containing CASS files and a split file for the Project_CodeNet_Java250 benchmark.

    The file Project_CodeNet_Python800_cass.tar.gz expands to a directory Project_CodeNet_Python800 that has a cass directory containing CASS files and a split file for the Project_CodeNet_Python800 benchmark.

    The file Project_CodeNet_C++1000_cass.tar.gz expands to a directory Project_CodeNet_C++1000 that has a cass directory containing CASS files and a split file for the Project_CodeNet_C++1000 benchmark.

    The file Project_CodeNet_C++1400_cass.tar.gz expands to a directory Project_CodeNet_C++1400 that has a cass directory containing CASS files and a split file for the Project_CodeNet_C++1400 benchmark.

    Experimentation Validation Data Set

    The experimentation validation data set, Project_CodeNet_contest_experimentation_dataset.tar.gz, contains pairs of C++ code samples that were curated from the original Project CodeNet dataset for use in the CodeNet Challenge. The dataset consists of 50% similar pairs (i.e., both samples solve the same problem) and 50% dissimilar pairs (i.e., the samples solve different problems). It is in the same format that the 'Dev' and 'Final' phase test data sets will be in, with a couple of exceptions.

    The dataset contains a directory called 'data', which contains a subdirectory for each problem number, e.g., 'p03023'. The individual code solution files reside within the appropriate problem number directories. The problem number directories will not be present in the 'Dev' and 'Final' phase test sets.

    There is a file called 'pairs.csv', which has three columns: 'pair-id', 'file1', and 'file2'. The file paths for the code submissions are relative to the 'data' directory.

    Finally, there is the 'ground_truth.csv' file, which will not be present in the 'Dev' and 'Final' phases. The 'pair-id' column in 'ground_truth.csv' corresponds to the 'pair-id' column in 'pairs.csv'. The 'similar' column contains a '1' when the two solutions solve the same problem and a '0' when the solutions solve different problems.
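
    The relationship between 'pairs.csv' and 'ground_truth.csv' can be illustrated with a short scoring sketch. It is not the challenge's official evaluation code: the function names are hypothetical, accuracy is used only as an example metric, and the code assumes that both CSV files start with a header row naming the columns described above and that both files sit next to the 'data' directory.

```python
import csv
from pathlib import Path


def load_pairs(dataset_dir):
    """Map each pair-id to its two file paths, resolved against 'data'."""
    dataset_dir = Path(dataset_dir)
    data_dir = dataset_dir / "data"
    with open(dataset_dir / "pairs.csv", newline="") as handle:
        return {
            row["pair-id"]: (data_dir / row["file1"], data_dir / row["file2"])
            for row in csv.DictReader(handle)
        }


def load_ground_truth(dataset_dir):
    """Map each pair-id to 1 (same problem) or 0 (different problems)."""
    with open(Path(dataset_dir) / "ground_truth.csv", newline="") as handle:
        return {row["pair-id"]: int(row["similar"]) for row in csv.DictReader(handle)}


def accuracy(predictions, truth):
    """Fraction of labelled pairs whose predicted value matches the truth.

    `predictions` maps pair-id to 0 or 1; a missing pair counts as wrong.
    """
    if not truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        1 for pair_id, label in truth.items() if predictions.get(pair_id) == label
    )
    return correct / len(truth)
```

    A trivial baseline that predicts '1' for every pair would be expected to score about 0.5 on this dataset, given the stated 50/50 split between similar and dissimilar pairs.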