cj2001/odsc_east_kg_2021

Open Data Science East 2021 Workshop

Going from Text to Knowledge Graphs: Putting Natural Language Processing and Graph Databases to Work

Presenter: Dr. Clair J. Sullivan, Graph Data Science Advocate, Neo4j

Twitter: @CJLovesData1

This repository contains all of the software you will need to follow along with this workshop (see abstract below). Most of it is packaged in a Docker container to ensure reproducibility across a variety of operating systems and environments. The slides are also available in the top-level directory of this repository.

Updates since workshop!

  • This repository was last updated on 2021-05-10. Be sure to pull the latest version!

  • Should you be interested in using the data from the workshop without having to query the Google Knowledge Graph or Wikipedia, you will find the raw data files in /data.

  • Watch for a Twitch stream (live) and YouTube video (live and stored) coming soon!

Downloadable prerequisites

In addition to cloning this repository, participants in this workshop will need to download the following:

  • All data will be accessed through public APIs and open source packages. These packages will be provided via Docker and/or venv.

  • All software will be provided and configured through the Docker container. This includes Jupyter lab, a variety of Python packages (particularly spacy, wikipedia, and the Python neo4j driver), and the Neo4j browser.

  • We will be working with the Google Knowledge Graph API. The free tier allows 100,000 calls per day, but every call requires an API key. A link on how to create this API key is below. Once the key is created, it is recommended that you store it in a file named .api_key at the root level of this repo.
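As a sketch of how the key file can be used, the snippet below reads .api_key (the filename recommended above) and queries the public Knowledge Graph Search endpoint. The helper name and the query string are illustrative, not part of the workshop code:

```python
import json
import os
import urllib.parse
import urllib.request

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(query, api_key, limit=5):
    """Build a Google Knowledge Graph Search API request URL."""
    params = urllib.parse.urlencode({"query": query, "key": api_key, "limit": limit})
    return f"{KG_ENDPOINT}?{params}"

# Only hit the network if the key file described above actually exists.
if os.path.exists(".api_key"):
    with open(".api_key") as f:
        key = f.read().strip()
    with urllib.request.urlopen(build_kg_url("Neo4j", key)) as resp:
        results = json.load(resp)
    for item in results.get("itemListElement", []):
        print(item["result"].get("name"), item["result"].get("@type"))
```

Each result in the response carries a name, types, and a description that can be used to enrich graph nodes.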

Useful links

Abstract

In order to turn data into action, we must know the context of that data. Traditionally, humans were required to provide that context; increasingly, however, it can be derived through data science approaches. This is achieved by converting text into entities (nouns such as people and places) and the verbs that describe their actions. In this way, we obtain nodes representing those nouns and edges, or relationships, representing the verbs that connect them. We can further augment these nodes and edges by identifying words such as adjectives that further describe the nouns, or word co-occurrences that add additional relationships between nodes. This approach, built on named entity recognition, can then be applied to a variety of problems, such as building better search engines or recommender systems.

In this workshop, we will start with an open source data set of text and convert it to a knowledge graph. We will use standard natural language processing (NLP) packages and approaches in Python to clean that text and create a knowledge graph data model within a graph database that can be queried and turned into data insights. This data model will include the nodes, edges, and attributes identified through the NLP process, which can be used to create the necessary ontologies for the graph. We will encounter the problems associated with generating such knowledge graphs, such as entity disambiguation and the lack of sufficient training data (zero-shot learning). Attendees of this workshop will create and put to use a complete pipeline for knowledge graph generation and analysis.
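The last step of such a pipeline, loading triples into the graph database, can be sketched as generating parameterised Cypher MERGE statements. The Entity label, property names, and function name here are illustrative choices, not fixed by the workshop materials:

```python
def triples_to_cypher(triples):
    """Generate (query, params) pairs of Cypher MERGE statements
    for a list of (subject, verb, object) triples."""
    statements = []
    for subj, verb, obj in triples:
        # Derive a relationship type from the verb, e.g. "founded" -> FOUNDED.
        rel_type = verb.upper().replace(" ", "_")
        query = (
            "MERGE (a:Entity {name: $subj}) "
            "MERGE (b:Entity {name: $obj}) "
            f"MERGE (a)-[:{rel_type}]->(b)"
        )
        statements.append((query, {"subj": subj, "obj": obj}))
    return statements

stmts = triples_to_cypher([("Alice", "founded", "Acme")])
print(stmts[0][0])
```

With the Python neo4j driver included in the Docker container, each pair could then be executed inside a session, along the lines of session.run(query, **params); MERGE keeps the load idempotent, so re-running the pipeline does not duplicate nodes or relationships.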
