Posted October 17, 2019
A Code Knowledge Graph for Planning Data Science Experiments

The dynamic languages commonly used in data science, such as Python, R, and JavaScript, are not easily amenable to current cognitive assistance tools for programming. Assisting programmers in these languages often requires a deeper semantic understanding of software libraries. Building on our RPI Whyis knowledge graph framework, we will create a set of text and code cross-embeddings for a popular machine learning toolkit (e.g., scikit-learn), thereby making the semantics explicit. We will create automated tags for machine learning methods using available content (the Data Science Ontology, Wikipedia pages, usage documentation, code documentation, and actual code in public repositories). We will evaluate these embeddings on the accuracy of the generated tags, the ability to group related methods, and the ability to match methods from articles with the corresponding analysis code. This can enable automated tagging of the methods used in data science code. Having these tags will allow us to explore the use of AI planning for the effective composition of data science methods.
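To make the tagging idea concrete, here is a minimal sketch of assigning a tag to a method by comparing its documentation against tag descriptions in a shared vector space. It is an illustration only, not the project's implementation: the bag-of-words `embed` function is a toy stand-in for learned text and code cross-embeddings, and the `TAGS` dictionary, its descriptions, and the `tag_method` helper are hypothetical names invented for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical tag descriptions (assumed for illustration; a real system
# might draw these from an ontology or from encyclopedia pages).
TAGS = {
    "clustering": "group unlabeled samples into clusters by similarity",
    "classification": "predict a discrete class label for each sample",
    "regression": "predict a continuous target value for each sample",
}


def tag_method(docstring, tags=TAGS):
    """Return the tag whose description is most similar to the docstring."""
    doc_vec = embed(docstring)
    return max(tags, key=lambda name: cosine(doc_vec, embed(tags[name])))


if __name__ == "__main__":
    print(tag_method("K-Means clustering: partition samples into k clusters"))
```

The same nearest-description comparison could, in principle, be reused to group related methods or to match a method named in an article with analysis code, provided both sides are embedded in the same space.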

Key Findings

Knowledge representation of "turtle" code analysis is fully covered by extending PROV-O with a minimal set of classes.
We are working on developing whole-graph embeddings (as opposed to embeddings of individual nodes).
These embeddings will also be comparable to whole-text embeddings.

