.. highlight:: shell ===== Usage ===== Get the DBpedia datasets ------------------------ Download the DBpedia files containing documents to index (e.g., `DBpedia 2016 Core`_). You can check which files are indexed by DBpedia Lookup here: https://github.com/dbpedia/lookup#get-the-following-dbpedia-datasets. .. _DBpedia 2016 Core: http://downloads.dbpedia.org/2016-10/core/ Setup the environment --------------------- Create the .env file with the following variables:: SPARK_APPLICATION_NAME= # default: elasticpedia SPARK_MASTER_URL= # default: local ELASTIC_NODES= # default: localhost ELASTIC_INDEX_NAME= ELASTIC_WAN_ONLY= # Default: False ALL_DATA_DIR=/path/to/all_data/*.ttl REDIRECTS_DIR=/path/to/redirect/links/*.ttl Also, the elasticsearch-spark library must be available in Spark’s classpath. If it is not the case, download the `ES-Hadoop`_ library (choose the version compatible with your Elasticsearch installation) and extract the elasticsearch-spark connector from the dist directory. .. _ES-Hadoop: https://www.elastic.co/downloads/hadoop Run your app ------------ Use Elasticpedia in your project (e.g., index.py):: import os from dotenv import load_dotenv from elasticpedia import elasticpedia from elasticpedia.config.elastic_conf import ElasticConfig load_dotenv() es_config = ElasticConfig(os.getenv("ELASTIC_INDEX_NAME"), mapping_id=ElasticConfig.Fields.URI.value) elasticpedia.DBpediaIndexer(es_config).index(os.getenv("ALL_DATA_DIR"), os.getenv("REDIRECTS_DIR")) Run your project: .. code-block:: console $ spark-submit --jars ./elasticsearch-spark-20_2.11-7.5.1.jar index.py Usage with Docker ----------------- We built a Docker image for running Elasticpedia in a dockerized environment. The images is based on the `BDE Docker Spark image`_. Here is an example of docker-compose file with ElasticSearch 7.5.1 and Spark 2.4.4:: version: '2.2' services: es01: image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1 container_name: es01 environment: - node.name=es01 - cluster.name=es-docker-cluster - cluster.initial_master_nodes=es01 - bootstrap.memory_lock=true - ES_JAVA_OPTS=-Xms512m -Xmx512m ulimits: memlock: soft: -1 hard: -1 volumes: - es01data:/usr/share/elasticsearch/data ports: - 9200:9200 dbpedia-indexer: image: vcutrona/dbpedia-indexer:1.0.0 container_name: dbpedia-indexer environment: - ENABLE_INIT_DAEMON=false - ELASTIC_NODES=es01 - ELASTIC_INDEX_NAME=dbpedia-df - ELASTIC_WAN_ONLY=true volumes: - /path/to/redirect/links/:/redirects - /path/to/all_data/:/all_data spark-master: image: bde2020/spark-master:2.4.4-hadoop2.7 container_name: spark-master ports: - "8080:8080" - "7077:7077" environment: - INIT_DAEMON_STEP=setup_spark volumes: - /path/to/redirect/links/:/redirects - /path/to/all_data/:/all_data spark-worker: image: bde2020/spark-worker:2.4.4-hadoop2.7 depends_on: - spark-master environment: - "SPARK_MASTER=spark://spark-master:7077" volumes: - /path/to/redirect/links/:/redirects - /path/to/all_data/:/all_data volumes: es01data: driver: local name: es01data Run the following command to run the application with 1 master and 3 workers: .. code-block:: console $ docker-compose up -d --scale spark-worker=3 .. _BDE Docker Spark image: https://github.com/big-data-europe/docker-spark