Usage

Get the DBpedia datasets

Download the DBpedia files containing documents to index (e.g., DBpedia 2016 Core). You can check which files are indexed by DBpedia Lookup here: https://github.com/dbpedia/lookup#get-the-following-dbpedia-datasets.

Setup the environment

Create the .env file with the following variables:

SPARK_APPLICATION_NAME=<name of the app>  # default: elasticpedia
SPARK_MASTER_URL=<local|spark-master-host|yarn>  # default: local
ELASTIC_NODES=<comma-separated-list-of-elastic-nodes>  # default: localhost
ELASTIC_INDEX_NAME=<index_name>
ELASTIC_WAN_ONLY=<True|False>  # Default: False
ALL_DATA_DIR=/path/to/all_data/*.ttl
REDIRECTS_DIR=/path/to/redirect/links/*.ttl

Also, the elasticsearch-spark library must be available in Spark’s classpath. If it is not the case, download the ES-Hadoop library (choose the version compatible with your Elasticsearch installation) and extract the elasticsearch-spark connector from the dist directory.

Run your app

Use Elasticpedia in your project (e.g., index.py):

import os

from dotenv import load_dotenv

from elasticpedia import elasticpedia
from elasticpedia.config.elastic_conf import ElasticConfig

load_dotenv()
es_config = ElasticConfig(os.getenv("ELASTIC_INDEX_NAME"), mapping_id=ElasticConfig.Fields.URI.value)
elasticpedia.DBpediaIndexer(es_config).index(os.getenv("ALL_DATA_DIR"), os.getenv("REDIRECTS_DIR"))

Run your project:

$ spark-submit --jars ./elasticsearch-spark-20_2.11-7.5.1.jar index.py

Usage with Docker

We built a Docker image for running Elasticpedia in a dockerized environment. The images is based on the BDE Docker Spark image. Here is an example of docker-compose file with ElasticSearch 7.5.1 and Spark 2.4.4:

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - cluster.initial_master_nodes=es01
      - bootstrap.memory_lock=true
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es01data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
  dbpedia-indexer:
    image: vcutrona/dbpedia-indexer:1.0.0
    container_name: dbpedia-indexer
    environment:
      - ENABLE_INIT_DAEMON=false
      - ELASTIC_NODES=es01
      - ELASTIC_INDEX_NAME=dbpedia-df
      - ELASTIC_WAN_ONLY=true
    volumes:
      - /path/to/redirect/links/:/redirects
      - /path/to/all_data/:/all_data
  spark-master:
    image: bde2020/spark-master:2.4.4-hadoop2.7
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark
    volumes:
      - /path/to/redirect/links/:/redirects
      - /path/to/all_data/:/all_data
  spark-worker:
    image: bde2020/spark-worker:2.4.4-hadoop2.7
    depends_on:
      - spark-master
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
    volumes:
      - /path/to/redirect/links/:/redirects
      - /path/to/all_data/:/all_data
volumes:
  es01data:
    driver: local
    name: es01data

Run the following command to run the application with 1 master and 3 workers:

$ docker-compose up -d --scale spark-worker=3