entity2vec computes vector representations of Knowledge Graph entities that preserve semantic similarities and are suitable for classification tasks. It generates a set of property-specific entity embeddings by running node2vec on property-specific subgraphs, i.e. K(p) = {(s, p, o)}, the set of triples whose predicate is a given property p. The repository includes:
- A reimplementation of node2vec that makes the preprocessing of the transition probabilities optional. Skipping the preprocessing reduces memory usage but slows down the computation.
- entity2vec, which generates a set of entity embeddings from Knowledge Graphs, each corresponding to a different property. entity2vec can work with a set of pre-downloaded dumps or download them from a SPARQL endpoint.
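As a rough illustration of the property-specific subgraphs K(p) described above, the sketch below groups a list of (s, p, o) triples into one edge list per property. The function name `split_by_property` and the toy triples are hypothetical and are not taken from the repository; this is a minimal sketch of the idea, not the project's implementation.

```python
from collections import defaultdict


def split_by_property(triples):
    """Group (s, p, o) triples into one (s, o) edge list per property p."""
    subgraphs = defaultdict(list)
    for s, p, o in triples:
        subgraphs[p].append((s, o))
    return dict(subgraphs)


# Toy triples, invented for illustration only.
triples = [
    ("film:1", "dbo:director", "person:a"),
    ("film:2", "dbo:director", "person:a"),
    ("film:1", "dbo:starring", "person:b"),
]

subgraphs = split_by_property(triples)
for prop, edges in sorted(subgraphs.items()):
    print(prop, edges)
```

Each edge list could then be embedded separately, which is what yields one embedding per entity and per property.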
- Python 2.7 or above
- numpy
- gensim
- networkx
- pandas
- SPARQLWrapper
If you are using pip:
pip install -r requirements.txt
The set of properties can be defined in the configuration file config/properties.json. Otherwise, the software will run on every file located in datasets/your_dataset/graphs or, if a SPARQL endpoint is provided, it will download the graphs for all properties into datasets/your_dataset/graphs.
python src/entity2vec.py --dataset dataset --config_file config_file --entities entities --sparql sparql --default_graph default_graph
Alternatively, entity2vec can be loaded as a module and used as follows:
from entity2vec.entity2vec import Entity2Vec
e2v = Entity2Vec(False, False, False, 1, 1, 10, 5,
128, 10, 8, 5, 'path/to/properties.json', False,
'dataset_name', 'all', False, False,
'path/to/feedback.edgelist')
option | default | description |
---|---|---|
dataset | null (Required) | name of the dataset. It is used to create folders and to retrieve properties from the config file |
config_file | config/properties.json | path of the configuration file |
entities | all | list of entities for which the embeddings have to be computed. By default, all entities are used |
sparql | null | endpoint from which property-specific graphs are obtained. If not provided, the graphs are assumed to be already stored in datasets/your_dataset/graphs |
default_graph | null | whether to use a default graph in the SPARQL endpoint |
num_walks | 500 | number of random walks per entity |
feedback_file | null | path to a DAT file that contains all the user-item pairs. If not defined, it is assumed to be the file datasets/<my_dataset>/graphs/feedback.edgelist |
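To make the mapping between the table above and the command line concrete, here is a minimal `argparse` sketch that mirrors the listed options and defaults. It is an assumption-laden illustration, not the parser used in src/entity2vec.py: the function name `build_parser`, the argument types, and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options table; not the repository's code."""
    parser = argparse.ArgumentParser(description="entity2vec options (illustrative sketch)")
    parser.add_argument("--dataset", required=True, help="name of the dataset")
    parser.add_argument("--config_file", default="config/properties.json",
                        help="path of the configuration file")
    parser.add_argument("--entities", default="all",
                        help="entities for which embeddings are computed")
    parser.add_argument("--sparql", default=None, help="SPARQL endpoint, if any")
    parser.add_argument("--default_graph", default=None,
                        help="default graph in the SPARQL endpoint")
    # Assumed to be an integer count, as the description suggests.
    parser.add_argument("--num_walks", type=int, default=500,
                        help="number of random walks per entity")
    parser.add_argument("--feedback_file", default=None,
                        help="file with the user-item pairs")
    return parser


# Only --dataset is required; everything else falls back to the table defaults.
args = build_parser().parse_args(["--dataset", "aifb"])
print(args.dataset, args.config_file, args.num_walks)
```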
Generate a single vector representation for each entity, without considering the role of semantic properties, for use in classification tasks.
- Create an empty directory called emb
- Run node2vec on the whole graph to create a single global embedding of each entity
python src/node2vec.py --input datasets/aifb/aifb.edgelist --output emb/aifb_p1_q4.emd --p 1 --q 4
- Obtain scores, e.g.:
cd ml
python rdf_predict.py --dataset aifb --emb ../emb/aifb_p1_q4.emd --dimension 500
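The `--p` and `--q` flags used above are node2vec's return and in-out parameters, which bias each step of a random walk. The sketch below shows one way to compute the step weights on the fly, without a preprocessed transition table, which is the memory-for-speed trade-off mentioned earlier. It assumes an unweighted, undirected graph stored as a dictionary of neighbour sets and Python 3.6+ (for `random.choices`); the names `step_weights`, `next_step` and `biased_walk` are hypothetical and do not come from src/node2vec.py.

```python
import random


def step_weights(adj, prev, cur, p, q):
    """Unnormalised node2vec weights for leaving `cur` after arriving from `prev`."""
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:          # returning to the previous node
            weights[nxt] = 1.0 / p
        elif nxt in adj[prev]:   # neighbour shared with the previous node
            weights[nxt] = 1.0
        else:                    # moving further away from the previous node
            weights[nxt] = 1.0 / q
    return weights


def next_step(adj, prev, cur, p, q, rng=random):
    """Sample the next node, computing the weights at walk time."""
    weights = step_weights(adj, prev, cur, p, q)
    nodes = sorted(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


def biased_walk(adj, start, length, p, q, rng=random):
    """Generate one walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        if not adj[cur]:
            break  # dead end: no neighbours to move to
        if len(walk) == 1:
            walk.append(rng.choice(sorted(adj[cur])))  # first step is uniform
        else:
            walk.append(next_step(adj, walk[-2], cur, p, q, rng))
    return walk


# Toy graph, invented for illustration only.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
weights = step_weights(adj, "a", "b", p=1, q=4)
walk = biased_walk(adj, "a", 5, p=1, q=4, rng=random.Random(0))
print(weights, walk)
```

With p = 1 and q = 4, moving away from the previous node is four times less likely than returning to it or staying within its neighbourhood, so walks tend to stay local.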