IBM Data Asset eXchange - Project CodeNet

This page contains links to download the Project CodeNet dataset hosted as part of Data Asset eXchange (DAX).

Full Dataset:

Mini Project CodeNet

  • Mini_Project_CodeNet.tar.gz - 1.4MB
  • Mini Project CodeNet is a scaled-down version of the full Project CodeNet dataset. Its purpose is to give people who cannot, or do not want to, download the full dataset a sense of its structure and contents. Mini Project CodeNet is meant to be informative only and is not intended for use in experiments.

    Within the Mini_Project_CodeNet directory, the data sub-directory contains submissions for 8 problems in 6 programming languages (C, C++, Java, Python, Go, and Ruby). The metadata sub-directory contains CSV files with information on these problems and submissions. The data and metadata sub-directories have the same structure as the original Project CodeNet dataset but contain much less data. Mini_Project_CodeNet does not contain the derived or problem_descriptions sub-directories, which are present in the full Project CodeNet dataset.
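
    As a quick sanity check after downloading, the sketch below extracts the archive and walks the data and metadata sub-directories. This is a minimal example: it assumes the nesting data/<problem_id>/<language>/<submission files>, and the download location and paths are illustrative.

    ```python
    import csv
    import tarfile
    from pathlib import Path

    # Extract the archive into the current directory (download location is an assumption).
    with tarfile.open("Mini_Project_CodeNet.tar.gz", "r:gz") as tar:
        tar.extractall(".")

    root = Path("Mini_Project_CodeNet")

    # Count submissions per problem and per language under data/
    # (assumed layout: data/<problem_id>/<language>/<submission files>).
    for problem_dir in sorted((root / "data").iterdir()):
        per_language = {lang.name: sum(1 for _ in lang.iterdir())
                        for lang in sorted(problem_dir.iterdir())}
        print(problem_dir.name, per_language)

    # List the metadata CSV files and show the header of the first one.
    meta_files = sorted((root / "metadata").glob("*.csv"))
    with open(meta_files[0], newline="") as f:
        print(meta_files[0].name, "columns:", next(csv.reader(f)))
    ```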

    Other Samples:

    Language-Specific Benchmarks:

    About the Benchmarks

    This dataset includes several benchmark tar files that are useful for code classification and code similarity tasks. Each benchmark consists of code samples extracted from Project CodeNet and filtered down to a fixed number of problems, with a fixed number of samples per problem, as described below:

    The file Project_CodeNet_C++1000.tar.gz expands to a directory Project_CodeNet_C++1000 that contains 1000 directories. Each directory corresponds to a problem in Project CodeNet, and its problem ID matches the ID of the same problem in the full Project CodeNet dataset. Each directory contains 500 C++ files, for a total of 500,000 C++ code samples.

    The file Project_CodeNet_C++1400.tar.gz expands to a directory Project_CodeNet_C++1400 that contains 1400 directories. Each directory corresponds to a problem in Project CodeNet, and its problem ID matches the ID of the same problem in the full Project CodeNet dataset. Each directory contains 300 C++ files, for a total of 420,000 C++ code samples. In the two C++ benchmarks whose names carry the "preprocessed" suffix, the preprocessor directives in the C++ code samples have been resolved.

    The file Project_CodeNet_Python800.tar.gz expands to a directory Project_CodeNet_Python800 that contains 800 directories. Each directory corresponds to a problem in Project CodeNet, and its problem ID matches the ID of the same problem in the full Project CodeNet dataset. Each directory contains 300 Python files, for a total of 240,000 Python code samples.

    The file Project_CodeNet_Java250.tar.gz expands to a directory Project_CodeNet_Java250 that contains 250 directories. Each directory corresponds to a problem in Project CodeNet, and its problem ID matches the ID of the same problem in the full Project CodeNet dataset. Each directory contains 300 Java files, for a total of 75,000 Java code samples.
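
    Because each benchmark directory holds one sub-directory per problem, the benchmarks map naturally onto a code-classification setup with the problem ID as the class label. The snippet below is a minimal loading sketch under that assumption; Project_CodeNet_Java250 is used purely as an example and is assumed to be extracted into the working directory.

    ```python
    from collections import Counter
    from pathlib import Path

    def load_benchmark(root_dir):
        """Yield (source_code, problem_id) pairs from an extracted benchmark.

        Assumes the layout described above: one sub-directory per problem,
        each containing the individual code samples as plain-text files.
        """
        for problem_dir in sorted(p for p in Path(root_dir).iterdir() if p.is_dir()):
            for sample in sorted(problem_dir.iterdir()):
                yield sample.read_text(errors="replace"), problem_dir.name

    # Example: count samples per class in the (assumed extracted) Java250 benchmark.
    labels = Counter(label for _, label in load_benchmark("Project_CodeNet_Java250"))
    print(f"{len(labels)} classes, {sum(labels.values())} samples")
    ```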

    Each of the files Project_CodeNet_C++1000_spts.tar.gz, Project_CodeNet_C++1400_spts.tar.gz, Project_CodeNet_Python800_spts.tar.gz, and Project_CodeNet_Java250_spts.tar.gz expands to a directory of the same name (without the .tar.gz extension) that contains 8 CSV files. These files are the simplified parse tree (SPT) inputs used for graph neural network (GNN) modeling on the corresponding benchmark (Project_CodeNet_C++1000, Project_CodeNet_C++1400, Project_CodeNet_Python800, and Project_CodeNet_Java250, respectively).

    Description of the CSV files in each SPT benchmark (a loading sketch follows this list):

    1. edge.csv:
      Each line is a pair of (source node ID, target node ID) describing one SPT edge.
    2. node-feat.csv:
      Each line is the feature vector of a node. Each node has 4 features: (i) whether it is a token, (ii) the token type, (iii) the parsing rule type, and (iv) whether it is a reserved word.
    3. node_dfs_order.csv:
      Each line is the depth-first-search (DFS) ID of a node.
    4. num-edge-list.csv:
      Each line is the number of edges in an SPT.
    5. graph-label.csv:
      Each line is the class label of an SPT.
    6. node_depth.csv:
      Each line is the distance between the corresponding node and its SPT root.
    7. node_is_attributed.csv:
      Each line indicates whether the node is a leaf node or an internal node.
    8. num-node-list.csv:
      Each line is the number of nodes in an SPT.
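
    The sketch below shows one way to reassemble individual SPTs from these flat CSV files. It is a minimal example that assumes rows in edge.csv and node-feat.csv are concatenated across graphs in the same order as num-edge-list.csv and num-node-list.csv, that node IDs in edge.csv are local to each graph, and that all values are integer-encoded; these conventions are suggested by the file names but should be checked against the actual data.

    ```python
    import csv
    from pathlib import Path

    def read_rows(path):
        """Read a CSV file into a list of integer lists (values assumed integer-encoded)."""
        with open(path, newline="") as f:
            return [[int(x) for x in row] for row in csv.reader(f)]

    def load_spts(spt_dir):
        """Rebuild per-graph SPTs from the flat, concatenated CSV files."""
        d = Path(spt_dir)
        edges     = read_rows(d / "edge.csv")          # (source node ID, target node ID) pairs
        node_feat = read_rows(d / "node-feat.csv")     # 4 features per node
        num_edges = [n for (n,) in read_rows(d / "num-edge-list.csv")]
        num_nodes = [n for (n,) in read_rows(d / "num-node-list.csv")]
        labels    = [c for (c,) in read_rows(d / "graph-label.csv")]

        graphs, e_off, n_off = [], 0, 0
        for ne, nn, label in zip(num_edges, num_nodes, labels):
            graphs.append({
                "edges": edges[e_off:e_off + ne],          # node IDs assumed graph-local
                "node_feat": node_feat[n_off:n_off + nn],
                "label": label,
            })
            e_off += ne
            n_off += nn
        return graphs

    # Example usage with the Java250 SPT archive, assumed already extracted.
    graphs = load_spts("Project_CodeNet_Java250_spts")
    print(len(graphs), "SPTs; first SPT has", len(graphs[0]["node_feat"]), "nodes")
    ```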