Welcome to the kibior package introduction vignette!

1 General notions

Making datasets findable, accessible, interoperable and reusable (the FAIR principles) is one of the hot topics in science: it brings openness and versioning, and unlocks reproducibility. To support that, great projects such as the biomaRt R package enable fast consumption and easy handling of massive validated data through a small R interface.

Even though major entities such as Ensembl or NCBI make massive amounts of data available, they do not provide a way to store data elsewhere, and so delegate data handling to research teams. During data analysis, this can be an issue, since researchers often need to send intermediary subsets of analyzed data to collaborators. Moreover, it is now common for a new database or dataset to be released together with a web platform and an API, allowing easier exploration and querying.

Multiplying the number of life-science research teams worldwide by the ever-growing number of databases and datasets published in widely varying subfields results in an even greater number of ways to query heterogeneous life-science data.

Here, we present an easy way to manipulate and share datasets through decentralization. kibior aims to provide a search engine and a distributed database system for sharing data easily, using Elasticsearch (ES) and Elasticsearch-based architectures such as Kibio.

It is a way to handle large datasets and unlock the possibility to:

pull/download datasets from a local or remote instance of Elasticsearch,
filter, query and search in large amounts of data,
push/store datasets to local or remote instance of Elasticsearch,
share datasets with collaborators around the world,
perform joins between R in-memory and ES-based datasets,
import and export datasets from and to files,
validate safe-state datasets during pipeline execution,
comply with FAIR-sharing requirements by allowing REST requests on data and metadata through the Elasticsearch API.

1.1 Goal of this vignette

The following sections explain basic and advanced technical usage of kibior. A second vignette applies these features to biological applications.

1.2 Vocabulary

We will use both Elasticsearch and R vocabulary, which have similar notions:

R	Elasticsearch
data(set), tibble, df, etc.	index
columns, variables	fields
lines, observations	documents

kibior uses tibbles as its main data representation.
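To make the correspondence concrete, here is a minimal base R sketch. It does not call kibior or Elasticsearch, and the toy `storms_small` data frame is invented for this illustration only. It shows which part of a data frame plays the role of an index, of fields and of documents.

```r
# Toy dataset, invented for this illustration only
storms_small <- data.frame(
  name = c("Amy", "Bess", "Doris"),
  year = c(1975L, 1978L, 1975L),
  stringsAsFactors = FALSE
)

# R dataset   <->  ES index: the whole table, identified by a name
index_name <- "storms_small"

# R columns   <->  ES fields
fields <- names(storms_small)

# R lines     <->  ES documents: one named list per row
documents <- lapply(seq_len(nrow(storms_small)), function(i) {
  as.list(storms_small[i, , drop = FALSE])
})

# Inspect the first "document"
str(documents[[1]])
```

Each element of `documents` holds one value per field, which is roughly the shape a row takes once it is stored as a document.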

1.3 Public instances

The public Kibio instance is available at kibio.compbio.ulaval.ca port 80. You can simply connect to it via the get_kibio_instance() method of kibior.

1.4 Demonstration datasets

Before moving on to the second, separate vignette, which shows examples on biological datasets, we strongly advise reading the basic and advanced usage sections. In these sections, our examples use datasets taken from other well-known packages: dplyr::starwars, dplyr::storms, datasets::iris and ggplot2::diamonds.

2 Deploying an Elasticsearch instance

Before starting, you should know that this step will start an Elasticsearch service and store all data on your machine.

So, you should weigh the quantity of data you will handle in your code against the remaining space on your computer.

2.1 Installation with Docker and docker-compose

To use this feature, you will need Docker and docker-compose installed on your system.

To install Docker, simply follow the steps detailed on its website.

If you are on a Linux / Unix-based system, you should also check the post-installation steps, especially the Manage Docker as a non-root user step.

To install docker-compose, simply follow the steps detailed in its installation documentation.

2.2 Run your own Elasticsearch instance

We want something easy to use, so we rely on docker-compose. You can also use plain docker and pass all parameters inline, but this is verbose.

The files described below can be found in the inst/docker_conf folder of the kibior package.

2.2.1 Elasticsearch configuration file

Copy-paste these lines in a new elasticsearch.yml file.

cluster.name: "docker-cluster"
network.host: 0.0.0.0

# minimum_master_nodes need to be explicitly set when bound on a public IP
# set to 1 to allow single node clusters
# Details: https://github.com/elastic/elasticsearch/pull/17288
discovery.zen.minimum_master_nodes: 1

# Uncomment and tweak the following lines if you need to connect to remote instances
# such as Kibio's or if you want to configure several disjoint local instances. 
# This also allows to use KibioR `$copy()` and `$move()` methods with remote instances.
# reindex.remote.whitelist: [
#   "first_instance:9200", 
#   "second_instance:9200",
# ]

2.2.2 DNS configuration file

Copy-paste these lines in a new resolv.conf file if you need to connect to ES named services on the web.

nameserver 8.8.8.8      # Google's DNS resolver: use DNS if searching online ES named services
nameserver 127.0.0.11   # local, might change
options ndots:0
options rotate
options timeout:1

2.2.3 Docker-compose configuration file

Copy-paste these lines inside a single-es.yml file.

version: '2.4'
services:

##  --------------------------
##  If you need rstudio
##  --------------------------

  # rstudio4:
  #   container_name: rstudio4
  #   image: roncar/kibior-rstudio:4.0.3    # pre-configured RStudio with kibior installed
  #   environment:
  #   - PASSWORD=myrstudio
  #   - USERID=1000
  #   #
  #   volumes:
  #   - type: bind
  #     source: <path_for_RStudio_data_folder_on_your_computer>
  #     target: /work/rstudio/data    # we create a folder inside the container
  #     read_only: false
  #   #
  #   ports:
  #   - 8787:8787
  #   networks:
  #   - kibiornet
  #   # cpu and ram constraints
  #   cpu_count: 1
  #   cpu_percent: 75
  #   cpus: 0.75
  #   memswap_limit: 0
  #   mem_reservation: 256m
  #   mem_limit: 6g

##  --------------------------
##  If you need a bash cli + R cli
##  See https://hub.docker.com/u/rocker for more versions 
##  with preinstalled material (e.g. tidyverse)
##  --------------------------

  # r4:
  #   container_name: r4
  #   image: roncar/kibior-env:4.0.3        # pre-configured R version 4.0.3 with Kibior installed
  #   stdin_open: true
  #   tty: true
  #   entrypoint: "/bin/bash"
  #   #
  #   volumes:
  #   - type: bind
  #     source: <path_for_R_data_folder_on_your_computer>
  #     target: /work/r/data    # we create a folder inside the container
  #     read_only: false
  #   - type: bind
  #     source: ./resolv.conf
  #     target: /etc/resolv.conf
  #     read_only: false
  #   #
  #   networks:
  #   - kibiornet
  #   # cpu and ram constraints
  #   cpu_count: 1
  #   cpu_percent: 75
  #   cpus: 0.75
  #   memswap_limit: 0
  #   mem_reservation: 256m
  #   mem_limit: 6g

##  --------------------------
##  Elasticsearch container
##  --------------------------

  elasticsearch:
    # this configuration will run a service called "elasticsearch"
    container_name: elasticsearch
    # the elasticsearch image used will be version 7
    # but you can use another version, such as 6.8.6
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
    # defines env var
    # last line tells us java will use 512MB
    # if you need more, change it for 2GB, for instance
    # "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    environment:
    - discovery.type=single-node
    - bootstrap.memory_lock=true
    - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    # strict limit to 1GB of RAM
    mem_limit: 1g
    memswap_limit: 0
    # lock memory
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    # bind files and folders of your system with those inside of the container 
    volumes:
    # ES data folder
    - type: bind
      source: <path_for_es_data_folder_on_your_computer>
      target: /usr/share/elasticsearch/data
      read_only: false
    # ES configurations
    - type: bind
      source: ./elasticsearch.yml
      target: /usr/share/elasticsearch/config/elasticsearch.yml
      read_only: true
    # export port to access Elasticsearch service from outside docker
    ports: 
    - 9200:9200
    # networks managed by docker 
    networks:
    - kibiornet

# network declaration
networks:
  kibiornet:

2.2.4 Run the services

Now, run the configuration to launch the service(s) with:

# run services (daemonized)
➜ docker-compose -f single-es.yml up -d
Starting elasticsearch ... done

#  see the current docker processes
➜ docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED          STATUS         PORTS                              NAMES
40814036d980   docker.elastic.co/elasticsearch/elasticsearch:7.10.2   "/tini -- /usr/local…"   30 minutes ago   Up 5 seconds   0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch

# curl
➜ curl -X GET localhost:9200
{
  "name" : "40814036d980",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "InZqVTNiTK6idAWrEweWDg",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

The Elasticsearch service is now accessible. You can also query it from any browser at http://localhost:9200.

2.3 R session

2.3.1 I have R already installed on my computer

If you have R installed on your computer, simply use it with a kibior instance pointing at localhost:9200. Since it is the default configuration, you will only need this to work:

# In your R session
kc <- Kibior$new()

# or explicitly
kc <- Kibior$new(host="localhost", port=9200)

2.3.2 I do not already have R installed on my computer

If you do not have R installed on your computer, you can:

Install it, or
Use Docker and docker-compose.

The following sections guide you to use the R cli or the RStudio container. Both have kibior and its dependencies installed, but you can choose to use a clean R environment instead (i.e. rocker containers).

2.3.2.1 R command-line interface (R cli)

Steps:

Uncomment the R cli section (i.e. section “If you need a bash cli + R cli”) in the single-es.yml file.
Put a volume path if you need to work on specific files.
Use the same command to launch the service.
Use the R command-line interface inside the container.

# run services (daemonized)
➜ docker-compose -f single-es.yml up -d
elasticsearch is up-to-date
Creating r4 ... done

#  see the current docker processes
➜ docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED          STATUS          PORTS                              NAMES
0f1afd07f58a   roncar/kibior-env:4.0.3                                "/bin/bash"              4 minutes ago    Up 4 minutes                                       r4
40814036d980   docker.elastic.co/elasticsearch/elasticsearch:7.10.2   "/tini -- /usr/local…"   4 minutes ago    Up 4 minutes    0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch

# open an interactive bash inside the R container (see previous command container ID)
➜ docker exec -it 0f1afd07f58a bash

# inside the R container, query the ES container (with its container name)
root@0f1afd07f58a:/$ curl -X GET "http://elasticsearch:9200"
{
  "name" : "20f2383b909a",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "InZqVTNiTK6idAWrEweWDg",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

# inside the R container, run R cli
root@0f1afd07f58a:/$ R --vanilla

R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(kibior)
# Here you can directly load kibior as it is pre-installed inside the container.

This container comes with R version 4.0.3 and the kibior package and its dependencies pre-installed. If you need a clean container with only R, you can use the rocker/r-ver:4.0.3 image instead.

2.3.2.2 RStudio

Steps:

Uncomment the RStudio section (i.e. section “If you need rstudio”) in the single-es.yml file.
Put a volume path if you need to work on specific files.
Use the same command to launch the service.
Use RStudio interface inside your web browser with login/password you configure in the RStudio section.

# run services (daemonized)
➜ docker-compose -f single-es.yml up -d
elasticsearch is up-to-date
Creating rstudio4 ... done

#  see the current docker processes
➜ docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED         STATUS         PORTS                              NAMES
62344a365b70   roncar/kibior-rstudio:4.0.3                            "/init"                  7 seconds ago   Up 5 seconds   0.0.0.0:8787->8787/tcp             rstudio4
111ebcf0d5c4   docker.elastic.co/elasticsearch/elasticsearch:7.10.2   "/tini -- /usr/local…"   7 seconds ago   Up 5 seconds   0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch

Connect with your web browser at localhost:8787 with the login/password that were configured in the single-es.yml file.

This container comes with RStudio, R version 4.0.3, and the kibior package and its dependencies pre-installed. If you need a clean container with only RStudio, you can use the rocker/rstudio:4.0.3 image instead.

2.4 Initialization

You can use several types of initialization:

#> Initiate a remote connection
kc_remote <- Kibior$new(host = "something-far", user = "foo", pwd = "bar")

#> Create a new local instance bound to your local Elasticsearch
#> By default, kibior uses the localhost instance on port 9200
kc_local <- Kibior$new()

#> You may need to authenticate if your Elasticsearch instance uses an auth system
#> The default login/password is "elastic"/"changeme", so
kc_local <- Kibior$new(user = "elastic", pwd = "changeme")
#> You can now use `kc_local` as your own private instance.

2.5 Stop the Elasticsearch service

To stop the service, simply enter the command:

# stop all services
➜ docker-compose -f single-es.yml down
Stopping r4            ... done
Stopping elasticsearch ... done
Removing r4            ... done
Removing elasticsearch ... done
Removing network docker_kibior_test_kibiornet

4 Basic usage

Here, we will see the main methods (push(), pull(), list(), columns(), keys(), has(), match(), export(), import(), move(), copy()) and public attributes (verbosity) of the kibior class. kibior uses elastic (Chamberlain 2020) to perform its base functions.

4.1 Verbosity attributes

By default, kibior comes with three public attributes, $verbose, $quiet_progress and $quiet_results, all initialized to FALSE.

$verbose toggles the printing of additional information, which can be useful to follow all processing steps.
$quiet_progress toggles the printing of progress bars. Silencing them can be useful in scripts.
$quiet_results toggles the printing of method results. You may want to silence it when you do not need interactive feedback.

To quickly show them, simply print the instance you are using:

kc

## KibioR client: 
##   - host: elasticsearch 
##   - port: 9200 
##   - verbose: no 
##   - print result: yes 
##   - print progressbar: yes

Use kc$<attribute-name> <- TRUE/FALSE to toggle verbosity mode on these three attributes.
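For non-interactive scripts, the two quiet attributes can be switched on together. The sketch below assumes `kc` is an existing kibior instance with reference semantics (as created by `Kibior$new()`). `silence()` is a hypothetical helper written for this illustration, not a kibior method.

```r
# Hypothetical helper (not part of kibior): turn off progress bars and
# result printing on an existing instance, e.g. before a batch script.
# Assumption: the instance is modified in place, so no reassignment is needed.
silence <- function(kc) {
  kc$quiet_progress <- TRUE   # no progress bars
  kc$quiet_results <- TRUE    # no printing of method results
  invisible(kc)
}
```

After calling `silence(kc)`, printing `kc` should report both printing options as disabled.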

By default, a new kibior instance behaves interactively: progress bars and results are printed immediately, but no additional information is shown.

See Attribute access in Advanced usage section for all attribute descriptions.

4.2 `$push()`: Store a dataset to Elasticsearch

To store data using kc connection:

kc$push(dplyr::storms, "storms")

## [1] "storms"

# or magrittr style
dplyr::starwars %>% kc$push("starwars")

## [1] "starwars"

If not already taken, the given index name will be created automatically before receiving data. If already taken, an error is raised.
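Because pushing to an index name that is already taken raises an error, a script may want to catch that error rather than stop. The following is a minimal sketch, assuming `kc` is a connected kibior instance. `try_push()` is a hypothetical helper, not part of kibior.

```r
# Hypothetical helper (not part of kibior): push a dataset, but return NULL
# instead of stopping the script when $push() raises an error
# (for example because the index name is already taken).
try_push <- function(kc, data, index_name) {
  tryCatch(
    kc$push(data, index_name),
    error = function(e) {
      message("push to '", index_name, "' failed: ", conditionMessage(e))
      NULL
    }
  )
}
```

With the indices created above, `try_push(kc, dplyr::storms, "storms")` would be expected to print a message and return NULL, since "storms" already exists.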

Important points:

$push() automatically sends data to the Elasticsearch server, which needs unique IDs. You can define your own IDs with the id_col parameter, which requires the name of a column containing unique elements.

If id_col is not defined, kibior adds a kid counter column to serve as unique IDs (default).

$push() expects well-formatted data, mainly in a data.frame or derivative structure such as tibble.
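The uniqueness requirement on id_col can be checked before pushing. In the sketch below, `has_unique_ids()` and `push_with_ids()` are hypothetical helpers, not part of kibior. Only the `id_col` parameter of `$push()` comes from the description above, and `kc` is assumed to be a connected kibior instance.

```r
# Hypothetical helper (not part of kibior): TRUE when column `col` exists
# in `data` and contains no duplicated values.
has_unique_ids <- function(data, col) {
  col %in% names(data) && !anyDuplicated(data[[col]])
}

# Hypothetical wrapper: refuse to push when `col` cannot serve as unique IDs.
push_with_ids <- function(kc, data, index_name, col) {
  if (!has_unique_ids(data, col)) {
    stop("column '", col, "' is missing or has duplicated values")
  }
  kc$push(data, index_name, id_col = col)
}
```

For example, `push_with_ids(kc, dplyr::starwars, "starwars_by_name", "name")` would use character names as IDs, while a column such as `species` would be rejected because its values repeat.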

See Push modes in Advanced usage section for more information.

4.3 `$pull()`: Download a dataset from Elasticsearch

The $pull() method downloads datasets. It can retrieve all or parts of datasets.

s <- kc$pull("storms")
s %>% names()

## [1] "storms"

Results are stored in a list of tibbles.

s$storms

## # A tibble: 10,010 x 14
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>  <chr>    <int>    <int>
##  1 Ike    2008     9     7    18  21   -74   hurri… 3          105      946
##  2 Ike    2008     9     8     0  21.1 -75.2 hurri… 4          115      945
##  3 Ike    2008     9     8     2  21.1 -75.7 hurri… 4          115      945
##  4 Ike    2008     9     8     6  21.1 -76.5 hurri… 3          100      950
##  5 Ike    2008     9     8    12  21.1 -77.8 hurri… 2           85      960
##  6 Ike    2008     9     8    18  21.2 -79.1 hurri… 1           75      964
##  7 Ike    2008     9     9     0  21.5 -80.3 hurri… 1           70      965
##  8 Ike    2008     9     9     6  22   -81.4 hurri… 1           70      965
##  9 Ike    2008     9     9    12  22.4 -82.4 hurri… 1           70      965
## 10 Ike    2008     9     9    14  22.6 -82.9 hurri… 1           70      965
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>, kid <int>

Returning a list allows search patterns to retrieve multiple indices at once.

See Pattern search in Advanced usage section for more information.

4.4 `$list()`: List all Elasticsearch indices

#> list all indices
kc$list()

## [1] "storms"      "starwars"

4.5 `$columns()`: List all columns of an Elasticsearch index

#> list all columns
kc$columns("storms")

## $storms
##  [1] "category"    "day"         "hour"        "hu_diameter" "kid" 
##  [6] "lat"         "long"        "month"       "name"        "pressure"
## [11] "status"      "ts_diameter" "wind"        "year"

4.6 `$count()`: Count the number of elements

#> count all lines
kc$count("storms")

## $storms
## [1] 10010

#> count all columns
kc$count("storms", type = "variables")

## $storms
## [1] 14

#> count all indices lines via a pattern
kc$count("s*")

## $starwars
## [1] 87
## 
## $storms
## [1] 10010

Like $search() and $pull(), this method accepts a query parameter to count the number of hits matching a query in your dataset. See Querying in Advanced usage section for more information.
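As a minimal sketch of counting with a query, the hypothetical helper below (not part of kibior) forwards a query to `$count()` and flattens the named list it returns into a named vector. `kc` is assumed to be a connected kibior instance.

```r
# Hypothetical helper (not part of kibior): count the hits of a query and
# flatten the named list returned by $count() into a named vector.
count_hits <- function(kc, index_name, query) {
  unlist(kc$count(index_name, query = query))
}
```

For instance, `count_hits(kc, "s*", "name:anita")` would give one count per index matching `s*`. The `field:value` query form used here is the same as in the `$stats()` example of this vignette.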

4.7 `$keys()`: List all unique keys of an Elasticsearch index column

#> list all keys on integer column 
kc$keys("storms", "year")

##  [1] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## [24] 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

#> list all keys on string column
kc$keys("storms", "status")

## [1] "hurricane"           "tropical depression" "tropical storm"

You should not use this on columns that represent a continuous range, such as temperatures or coordinates. It aggregates all possible values, which can take a large amount of time if your dataset is big enough.

4.8 `$has()`: Test if an Elasticsearch index exists

#> test presence of an index
kc$has("storms")

## $storms
## [1] TRUE

kc$has("abcde")

## $abcde
## [1] FALSE

#> test presence of all indices
c("storms", "abcde") %>% kc$has()

## $storms
## [1] TRUE
## 
## $abcde
## [1] FALSE

4.9 `$match()`: Select matching Elasticsearch indices

#> get exact matching indices 
kc$match("storms")

## [1] "storms"

kc$match("abcde")

## NULL

#> get matching pattern indices
kc$match("s*")

## [1] "starwars" "storms"

#> get list of mixed pattern and non pattern matching indices
c("s*", "abcde") %>% kc$match()

## [1] "starwars" "storms"

$match() and $has() differ on some points:

$has() returns TRUE or FALSE for any string passed.
$has() does not accept patterns and only looks if the given strings are in $list().
$match() only returns something if some indices match the given strings.
$match() accepts patterns and unpacks all possible indices matching given strings.
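Since `$has()` answers for every name it is given, its result can be used to filter a vector of candidate names. The sketch below relies only on the named TRUE/FALSE list shown in the `$has()` outputs above. `existing_indices()` is a hypothetical helper, not part of kibior, and `kc` is assumed to be a connected kibior instance.

```r
# Hypothetical helper (not part of kibior): keep only the names that exist
# as indices, using the named TRUE/FALSE list returned by $has().
existing_indices <- function(kc, index_names) {
  present <- unlist(kc$has(index_names))
  names(present)[present]
}
```

With the indices used so far, `existing_indices(kc, c("storms", "abcde"))` is expected to keep only "storms", whereas `kc$match("s*")` expands the pattern itself.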

4.10 `$export()`: Extract Elasticsearch index content to a file

The $export() method creates a file and exports an in-memory dataset or an Elasticsearch index to it.

#> Create temp files with data
storms_memory_tmp <- tempfile(fileext=".csv")
storms_elastic_tmp <- tempfile(fileext=".csv")

#> export an in-memory dataset to a file
dplyr::storms %>% kc$export(data = ., filepath = storms_memory_tmp)

## [1] "/tmp/RtmpVAwsWi/file243436451ae3.csv"

kc$import(storms_memory_tmp) %>% tibble::as_tibble()

## # A tibble: 10,010 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropi…       -1    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropi…       -1    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropi…       -1    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropi…       -1    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropi…       -1    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropi…       -1    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropi…       -1    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropi…       -1    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropi…        0    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropi…        0    40     1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>

#> export an Elasticsearch index to a file
"storms" %>% kc$export(data = ., filepath = storms_elastic_tmp)

## [1] "/tmp/RtmpVAwsWi/file24343220815.csv"

kc$import(storms_elastic_tmp) %>% tibble::as_tibble()

## # A tibble: 10,010 x 14
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Ike    2008     9     7    18  21   -74   hurri…        3   105      946
##  2 Ike    2008     9     8     0  21.1 -75.2 hurri…        4   115      945
##  3 Ike    2008     9     8     2  21.1 -75.7 hurri…        4   115      945
##  4 Ike    2008     9     8     6  21.1 -76.5 hurri…        3   100      950
##  5 Ike    2008     9     8    12  21.1 -77.8 hurri…        2    85      960
##  6 Ike    2008     9     8    18  21.2 -79.1 hurri…        1    75      964
##  7 Ike    2008     9     9     0  21.5 -80.3 hurri…        1    70      965
##  8 Ike    2008     9     9     6  22   -81.4 hurri…        1    70      965
##  9 Ike    2008     9     9    12  22.4 -82.4 hurri…        1    70      965
## 10 Ike    2008     9     9    14  22.6 -82.9 hurri…        1    70      965
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>, kid <int>

This method can also zip the exported file automatically when the .zip file extension is added.

#> file with zip extension
storms_memory_zip <- tempfile(fileext=".csv.zip")
#> export it
dplyr::storms %>% kc$export(storms_memory_zip)

## [1] "/tmp/RtmpVAwsWi/file243412667717.csv.zip"

Note: kibior uses rio (Chan et al. 2018), which can export many more formats. See the rio documentation and the rio::install_formats() function.

4.11 `$import()`: Get a file content to a new Elasticsearch index

The $import() method can load a dataset from a file into an in-memory variable, a new Elasticsearch index, or both.

#> import data from file
kc$import(filepath = storms_memory_tmp)

## # A tibble: 10,010 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropi…       -1    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropi…       -1    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropi…       -1    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropi…       -1    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropi…       -1    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropi…       -1    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropi…       -1    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropi…       -1    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropi…        0    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropi…        0    40     1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>

#> import data from file and send it to a new 
#> Elasticsearch index, with default configuration
kc$import(filepath = storms_memory_tmp, 
        push_index = "storms_file",
        push_mode = "recreate")

## # A tibble: 10,010 x 14
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Sean   2011    11    10    18  30.5 -70   tropi…        0    55      983
##  2 Sean   2011    11    11     0  31   -69   tropi…        0    55      984
##  3 Sean   2011    11    11     6  32.2 -67.2 tropi…        0    50      987
##  4 Sean   2011    11    11    12  33.4 -65.3 tropi…        0    45      991
##  5 Sean   2011    11    11    18  34.8 -62.6 tropi…        0    40      995
##  6 Albe…  2012     5    19     6  32.8 -77.1 tropi…       -1    30     1008
##  7 Albe…  2012     5    19    12  32.5 -77.3 tropi…        0    40     1005
##  8 Albe…  2012     5    19    18  32.3 -77.6 tropi…        0    45      997
##  9 Albe…  2012     5    20     0  32.1 -78.1 tropi…        0    50      995
## 10 Albe…  2012     5    20     6  31.9 -78.7 tropi…        0    45      998
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>, kid <int>

kc$list()

## [1] "starwars" "storms"     "storms_file"

Like $export(), it can also read directly from zipped files.

#> import data from a zipped file into memory
kc$import(storms_memory_zip)

## # A tibble: 10,010 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>     <int> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropi…       -1    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropi…       -1    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropi…       -1    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropi…       -1    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropi…       -1    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropi…       -1    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropi…       -1    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropi…       -1    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropi…        0    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropi…        0    40     1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>

Note: kibior uses rio (Chan et al. 2018), which can import many more formats. See the rio documentation and the rio::install_formats() function.

The $import() method can natively manage sequence, alignment and feature formats (e.g. fasta, bam, gtf, gff, bed, etc.) since it also wraps Bioconductor library methods such as rtracklayer::import() (Lawrence, Gentleman, and Carey 2019), Biostrings::read*StringSet() (Pagès et al. 2020) and Rsamtools::scanBam() (Morgan et al. 2020).

The generic $import() method tries to open the right format according to the file extension. If the format cannot be guessed, you can call the dedicated methods directly: $import_sequences(), $import_alignments(), $import_features(), $import_tabular() and $import_json().

4.12 `$move()`: Rename an index

The $move() method renames an index. The $copy() method is equivalent to $move(copy = TRUE).

#> move an existing dataset to another index
m <- kc$move("storms_file", "storms_file_moved")
kc$list()

## [1] "starwars" "storms"     "storms_file_moved"

4.13 `$copy()`: Copy an index

The $copy() method copies an index to another name. It is a wrapper around $move(copy = TRUE).

#> copy index
m <- kc$copy("storms_file_moved", "storms_file")
kc$list()

## [1] "starwars"   "storms"    "storms_file"       "storms_file_moved"

4.14 `$delete()`: Delete an Elasticsearch index

The $delete() method deletes one or more indices.

#> delete one or multiple indices
c("storms_file", "storms_file_moved") %>% kc$delete()

## $storms_file
## [1] TRUE
## 
## $storms_file_moved
## [1] TRUE

It can also delete following a pattern.

#> push some subsets with the same prefix
push_storm <- function(storm_name, index_name){
    dplyr::storms %>% 
        dplyr::filter(name == storm_name) %>% 
        kc$push(index_name)
}
push_storm("Amy", "storms_amy")

## [1] "storms_amy"

push_storm("Doris", "storms_doris")

## [1] "storms_doris"

push_storm("Bess", "storms_bess")

## [1] "storms_bess"


#> list
kc$list()

## [1] "starwars"     "storms_bess"  "storms"       "storms_doris" "storms_amy"

#> delete following a pattern
kc$delete("storms_*")

## $storms_amy
## [1] TRUE
## 
## $storms_doris
## [1] TRUE
## 
## $storms_bess
## [1] TRUE

kc$list()

## [1] "starwars"   "storms"

4.15 `$search()`: Search everything

Elasticsearch is, as its tagline says, "You Know, for Search": as a search engine, searching is its main feature.

Using the $search() method, you can search all or part of the data indexed by Elasticsearch. If no restriction is given, all data will be searched, which means every index, every column and every keyword.

#> here, we search the exact string "something" everywhere
#> but will find nothing
kc$search(query = "something")

## $starwars
## list()
## 
## $storms
## list()

#> we search for the exact string "anita" in "storms" dataset
kc$search("storms", query = "anita")[["storms"]]

## # A tibble: 5 x 14
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>  <chr>    <int>    <int>
## 1 Anita  1977     8    29    12  26.9 -88.4 tropi… -1          20     1012
## 2 Anita  1977     8    29    18  27   -88.9 tropi… -1          25     1010
## 3 Anita  1977     8    30     0  26.9 -89.4 tropi… -1          30     1009
## 4 Anita  1977     8    30     6  26.8 -89.8 tropi… 0           40     1006
## 5 Anita  1977     8    30    12  26.7 -90.3 tropi… 0           50     1003
## # … with 3 more variables: ts_diameter <lgl>, hu_diameter <lgl>, kid <int>

#> we search for text containing the substring "am" in "storms" dataset
kc$pull("storms", query = "*am*")[["storms"]]$name %>% unique

## [1] "Tammy"  "Gamma"  "Amy"    "Amelia"

By default, $search() runs in head mode, which returns only a small subset of the complete result (5 by default) to allow quick inspection of the data. With $verbose <- TRUE, this is reported in the output as “Head mode: on”. To change the head size, modify the $head_search_size attribute.

To get the full result, use $search(head = FALSE) or, more simply, $pull().

See Querying in the Advanced usage section for more information.

4.16 `$stats()`: Basic statistics of columns

Alongside the data handling methods, kibior provides descriptive statistics methods. You already know $count(); here are some others.

The $stats() method is a shortcut to ask for: count, min, max, avg, sum, sum_of_squares, variance, std_deviation, std_deviation_upper (bound), std_deviation_lower (bound).

#> multi-indices, index pattern and multicolumns
kc$stats(c("starwars", "s*"), c("height", "mass"))

## $starwars
## # A tibble: 2 x 11
##   column count   min   max   avg    sum sum_of_squares variance std_deviation std_deviation_bounds_… std_deviation_bounds…
##   <fct>  <int> <dbl> <dbl> <dbl>  <dbl>          <dbl>    <dbl>         <dbl>                  <dbl>                 <dbl>
## 1 height    81    66   264 174.  14123        2559177     1194.          34.6                   243.                  105.
## 2 mass      59    15  1358  97.3  5741.       2224219.   28229.         168.                    433.                 -239.
## 
## $storms
## list()

#> also works with a query, and with sigma to set the width of the standard deviation bounds
kc$stats("starwars", c("height", "mass"), sigma = 2.5, query = "homeworld:naboo")

## $starwars
## # A tibble: 2 x 11
##   column count   min   max   avg   sum sum_of_squares variance std_deviation std_deviation_bounds_… std_deviation_bounds_…
##   <fct>  <int> <dbl> <dbl> <dbl> <dbl>          <dbl>    <dbl>         <dbl>                  <dbl>                  <dbl>
## 1 height    11    96   224 175.   1930         349446     984.          31.4                   254.                   97.1
## 2 mass       6    32    85  64.2   385          26979     379.          19.5                   113.                   15.5

Some important warnings here:

Counts are approximate

Standard deviation and bounds assume normally distributed data
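
To make the columns of the $stats() output easier to read, the sketch below recomputes them in base R from the count, sum and sum_of_squares reported for the height column of starwars in the first example. It assumes that the variance is the population variance (divided by count) and that the bounds are avg plus or minus sigma times std_deviation, with sigma = 2 by default. Both assumptions are consistent with the figures shown above, but the Elasticsearch documentation remains the authoritative definition.

```r
# Values copied from the "height" row of the first $stats() output above
count <- 81
total <- 14123
sum_of_squares <- 2559177

# Assumed number of standard deviations used for the bounds
sigma <- 2

avg <- total / count
# Population variance: mean of the squares minus the squared mean
variance <- sum_of_squares / count - avg^2
std_deviation <- sqrt(variance)

upper_bound <- avg + sigma * std_deviation
lower_bound <- avg - sigma * std_deviation

print(round(c(avg = avg, variance = variance, std_deviation = std_deviation,
              upper = upper_bound, lower = lower_bound), 1))
```

Changing sigma only widens or narrows the two bounds; the other columns are unaffected.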

In addition to $count() and $stats(), many other methods are available for descriptive analysis: avg, mean, min, max, sum, q1, q2, median, q3 and summary.

4.17 `$describe_index()` and `$describe_columns()`: Get the description of index and columns

You can ask for the description of datasets with these methods.

Important: this feature only works if the user who pushed the data manually added the metadata with $add_description().

5 Advanced usage

5.1 Pattern search

Some methods, such as $search() and $pull(), accept the wildcard character "*".

#> consider these two datasets
dplyr::starwars %>% kc$push("starwars", mode = "recreate")

## [1] "starwars"

dplyr::storms %>% kc$push("storms", mode = "recreate")

## [1] "storms"


#> We want to search all indices starting with an "s"
#> We search for words in the "name" field that start with a "d"
#> Both "starwars" and "storms" indices have a "name" field
s <- kc$search("s*", query = "name:d*", head = FALSE)
s %>% names()

## [1] "starwars" "storms"

s$starwars

## # A tibble: 11 x 14
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <int> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  2 Dart…    202   136 none       white      yellow          41.9 male  
##  3 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  4 Bigg…    183    84 black      light      brown           24   male  
##  5 Jabb…    175  1358 <NA>       green-tan… orange         600   herma…
##  6 Dart…    175    80 none       red        yellow          54   male  
##  7 Dud …     94    45 none       blue, grey yellow          NA   male  
##  8 Dormé    165    NA brown      light      brown           NA   female
##  9 Dooku    193    80 white      fair       brown          102   male  
## 10 Dext…    198   102 none       brown      yellow          NA   male  
## 11 Poe …     NA    NA brown      light      brown           NA   male  
## # … with 6 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <chr>, starships <chr>, kid <int>

s$storms

## # A tibble: 722 x 14
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <int> <int> <int> <int> <dbl> <dbl> <chr>  <chr>    <int>    <int>
##  1 Debby  1988     9     6     6  21.5 -107. tropi… -1          30     1005
##  2 Debby  1988     9     6    12  22   -107. tropi… -1          25     1005
##  3 Debby  1988     9     6    18  22.5 -108. tropi… -1          25     1006
##  4 Debby  1988     9     7     0  23   -108  tropi… -1          25     1006
##  5 Debby  1988     9     7     6  23.5 -108. tropi… -1          25     1007
##  6 Debby  1988     9     7    12  23.9 -108. tropi… -1          25     1007
##  7 Debby  1988     9     7    18  24.2 -109. tropi… -1          25     1007
##  8 Debby  1988     9     8     0  24.4 -109. tropi… -1          25     1008
##  9 Debby  1988     9     8     6  24.3 -109. tropi… -1          20     1008
## 10 Debby  1988     9     8    12  24.2 -109. tropi… -1          20     1008
## # … with 712 more rows, and 3 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>, kid <int>

5.2 Attributes access

Since kibior instances are objects, their attributes can be accessed and, for some of them, updated.

Attribute name	Read-only	Default	Description
$host		“localhost”	the Elasticsearch host
$port		9200	the Elasticsearch port
$user	x	NULL	the Elasticsearch user
$pwd	x	NULL	the Elasticsearch password
$connection	x	NULL	the Elasticsearch connection object
$head_search_size		5	the head size default value
$cluster_name	x	When connected	the cluster name if and only if already connected
$cluster_status	x	When connected	the cluster status if and only if already connected
$nb_documents	x	When connected	the current cluster total number of documents if already connected
$version	x	When connected	the Elasticsearch version if and only if already connected
$elastic_wait		2	the Elasticsearch wait time for update commands if already connected (in seconds)
$valid_joins	x	A vector	the valid joins available in kibior
$valid_count_types	x	A vector	the valid count types available (mainly observations = rows, variables = columns)
$valid_elastic_metadata_types	x	A vector	the valid Elasticsearch metadata types available
$valid_push_modes	x	A vector	the valid push modes available
$shard_number		1	the number of allocated primary shards when creating an Elasticsearch index
$shard_replicas_number		1	the number of allocated replicas in an Elasticsearch index
$default_id_col		“kid”	the ID column name used when sending data to Elasticsearch if not provided by user
$verbose		FALSE	the verbose mode
$quiet_progress		FALSE	the progress bar printing mode
$quiet_results		FALSE	the method results printing mode

#> access the current host for the "kc" instance
kc$host

## [1] "elasticsearch"

#> modify the head_search threshold
kc$head_search_size <- 10L

Some attributes cannot be modified.

#> error when trying to modify read-only attributes
kc$user <- "nope"

5.3 Organizing data for searches

Working alone directly on a massive cluster of servers is an unlikely situation. Moreover, handling large datasets on your own computer or storing all data in your local Elasticsearch repository is generally a bad idea. We tend to handle only what we can afford to, and organize pipelines and software accordingly.

There are multiple strategies to organize data, and our main objective here is to use servers for what they have been built for: doing the CPU- and memory-intensive work. In comparison, our personal computers or laptops then carry only a light load. Putting kibior into this equation helps further, as it is backed by a database and search engine.

As a rule of thumb, subsetting and querying is a good strategy, e.g. splitting on categorical variables.

#> push storms dataset
dplyr::storms %>% 
    kc$push("storms", mode = "recreate")

## [1] "storms"

#> select the first 6 storm names (the default of head()) and push them
#> in different indices, each name prefixed with "storms_"
dplyr::storms %>% 
    split(dplyr::storms$name) %>% 
    head() %>% 
    purrr::imap(function(data, index_name){ 
        index_name %>% 
            tolower() %>% 
            paste0("storms_", .) %>%
            kc$push(data, .) 
    })

## $AL011993
## [1] "storms_al011993"
## 
## $AL012000
## [1] "storms_al012000"
## 
## $AL021992
## [1] "storms_al021992"
## 
## $AL021994
## [1] "storms_al021994"
## 
## $AL021999
## [1] "storms_al021999"
## 
## $AL022000
## [1] "storms_al022000"

kc$list()

## [1] "starwars"        "storms"          "storms_al011993" "storms_al012000"
## [5] "storms_al021992" "storms_al021994" "storms_al021999" "storms_al022000"

We can then search all indices whose names start with the prefix “storms_”.

#> within them, we search for records above minimum wind and pressure values
#> results come back already split by storm name
kc$search("storms_*", 
        query = "wind:>25 && pressure:>30", 
        columns = c("name", "year", "month", "lat", "long", "status"), 
        head = FALSE)

## $storms_al011993
## # A tibble: 4 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     6  1993 AL011993  25.4 -77.5 tropical depression
## 2     6  1993 AL011993  26.1 -75.8 tropical depression
## 3     6  1993 AL011993  26.7 -74   tropical depression
## 4     6  1993 AL011993  27.8 -71.8 tropical depression
## 
## $storms_al021992
## # A tibble: 4 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     6  1992 AL021992  25.7 -85.5 tropical depression
## 2     6  1992 AL021992  27   -84.5 tropical depression
## 3     6  1992 AL021992  27.6 -84   tropical depression
## 4     6  1992 AL021992  28.5 -82.9 tropical depression
## 
## $storms_al022000
## # A tibble: 10 x 6
##    month  year name       lat  long status             
##    <int> <int> <chr>    <dbl> <dbl> <chr>              
##  1     6  2000 AL022000   9.6 -21   tropical depression
##  2     6  2000 AL022000   9.9 -22.6 tropical depression
##  3     6  2000 AL022000  10.2 -24.5 tropical depression
##  4     6  2000 AL022000  10.1 -26.2 tropical depression
##  5     6  2000 AL022000   9.9 -27.8 tropical depression
##  6     6  2000 AL022000   9.9 -29.3 tropical depression
##  7     6  2000 AL022000  10.1 -30.1 tropical depression
##  8     6  2000 AL022000  10.1 -32.6 tropical depression
##  9     6  2000 AL022000  10   -34.2 tropical depression
## 10     6  2000 AL022000   9.8 -36.2 tropical depression
## 
## $storms_al021994
## # A tibble: 2 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     7  1994 AL021994  33   -79.1 tropical depression
## 2     7  1994 AL021994  33.2 -79.2 tropical depression
## 
## $storms_al021999
## # A tibble: 3 x 6
##   month  year name       lat  long status             
##   <int> <int> <chr>    <dbl> <dbl> <chr>              
## 1     7  1999 AL021999  20.2 -95   tropical depression
## 2     7  1999 AL021999  20.6 -96.3 tropical depression
## 3     7  1999 AL021999  20.5 -97   tropical depression
## 
## $storms_al012000
## list()

As shown above, we did not push all the data but only some subsets of interest. By selecting and pushing only what we need, datasets can be searched and shared immediately afterwards.

If you work in sync with multiple remote collaborators on the same Elasticsearch cluster, that can be a great strategy. For instance, one of your collaborators can add a new dataset that will not change the request, but will enrich the result.

#> added from remote kibior instance 
#> using `tail()` to simulate other data
dplyr::storms %>% 
    split(dplyr::storms$name) %>% 
    tail(2) %>% 
    purrr::imap(function(data, index_name){ 
        index_name %>% 
            tolower() %>% 
            paste0("storms_", .) %>%
            kc$push(data, .) 
    })

## $Wilma
## [1] "storms_wilma"
## 
## $Zeta
## [1] "storms_zeta"

We can apply the same request and find new results.

#> search all, same request as before
s <- kc$search("storms_*", 
            query = "wind:>25 && pressure:>30", 
            columns = c("name", "year", "month", "lat", "long", "status"), 
            head = FALSE)
#> assemble results if needed
do.call(rbind, s)

## # A tibble: 96 x 6
##    month  year name       lat  long status             
##  * <int> <int> <chr>    <dbl> <dbl> <chr>              
##  1     6  1993 AL011993  25.4 -77.5 tropical depression
##  2     6  1993 AL011993  26.1 -75.8 tropical depression
##  3     6  1993 AL011993  26.7 -74   tropical depression
##  4     6  1993 AL011993  27.8 -71.8 tropical depression
##  5    12  2005 Zeta      23.9 -35.6 tropical depression
##  6    12  2005 Zeta      24.2 -36.1 tropical storm     
##  7    12  2005 Zeta      24.7 -36.6 tropical storm     
##  8    12  2005 Zeta      25.2 -37   tropical storm     
##  9    12  2005 Zeta      25.6 -37.3 tropical storm     
## 10    12  2005 Zeta      25.7 -37.6 tropical storm     
## # … with 86 more rows
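
When some indices return nothing, their entry in the result is an empty list, as seen earlier with "storms_al012000". The following base R sketch shows one possible way to drop such entries explicitly and to remember which index each row came from. The list s is a small hand-made stand-in for a $search() result, not real output, and the index names in it are invented for illustration.

```r
# Illustrative stand-in for the list returned by kc$search(..., head = FALSE):
# one element per index, either a data frame or an empty list
s <- list(
  storms_a = data.frame(name = c("A", "A"), wind = c(30L, 35L)),
  storms_b = list(),
  storms_c = data.frame(name = "C", wind = 40L)
)

# Keep only the indices that returned at least one row
non_empty <- Filter(function(x) is.data.frame(x) && nrow(x) > 0, s)

# Stack the results and record the index each row came from
combined <- do.call(rbind, non_empty)
combined$index <- rep(names(non_empty), vapply(non_empty, nrow, integer(1)))
rownames(combined) <- NULL

print(combined)
```

Keeping the index name as a column preserves the information carried by the list names once the results are merged into a single table.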

5.4 Querying

One of the main features of kibior is the ability to search inside vast amounts of data thanks to Elasticsearch. You can use the search feature with the eponymous $search() method, but also with $pull(), through the query parameter.

5.4.1 Querying notation

To query specific data, the query parameter of methods such as $count() or $search() requires one string following the Elasticsearch Query String Syntax.

To sum them up, you can search for:

terms,

kc$search("starwars", query = "orange")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    196    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Plo …    188    80 none       orange     black             22 male  mascu… Dorin     Kel Dor <chr… ""       "Jedi st…
## 3 Jabb…    175  1358 NA         green-tan… orange           600 herm… mascu… Nal Hutta Hutt    <chr… ""       ""       
## 4 Ackb…    180    83 none       brown mot… orange            41 male  mascu… Mon Cala  Mon Ca… <chr… ""       ""       
## 5 Roos…    224    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>

or phrases, with double-quotes.

kc$search("starwars", query = '"Luke Skywalker"')$starwars

## # A tibble: 1 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Luke…    172    77 blond      fair       blue              19 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## # … with 1 more variable: kid <int>

In addition, you can apply several operators:

boolean operators:
- AND (or “&&”, double-ampersand),
- OR (or “||”, double-pipe),
- NOT (or “!”, exclamation point),
- + (plus) the term MUST be present,
- - (minus) the term MUST NOT be present.
grouping: organize boolean operators, ex: “(quick OR brown) AND fox”.
field selecting: target a specific column.
- Phrases can be searched.

#> rows that have "name" == "Luke Skywalker" 
kc$search("starwars", query = 'name:"Luke Skywalker"')$starwars

## # A tibble: 1 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Luke…    172    77 blond      fair       blue              19 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## # … with 1 more variable: kid <int>

Boolean operators can be used.

#> rows that have blue or green eyes
kc$search("starwars", query = 'eye_color:(blue OR green)')$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Grie…    216   159 none       brown, wh… green, y…       NA   male  mascu… Kalee     Kaleesh <chr… <chr [1… <chr [1]>
## 2 Luke…    172    77 blond      fair       blue            19   male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## 3 Owen…    178   120 brown, gr… light      blue            52   male  mascu… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 4 Beru…    165    75 brown      light      blue            47   fema… femin… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 5 Anak…    188    84 blond      fair       blue            41.9 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [3]>
## # … with 1 more variable: kid <int>

range notation: using [min TO max] for inclusive or {min TO max} for exclusive.
- Can be used as a simple comparison expression when one side is unbounded:
  - n:>=10 is equivalent to n:[10 TO *].
  - n:<=10 is equivalent to n:[* TO 10].
  - n:>10 is equivalent to n:{10 TO *}.
  - n:<10 is equivalent to n:{* TO 10}.
- Inclusive threshold.

#> include 160 and 180 values
kc$search("starwars", query = "height:[160 TO 180]")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Luke…    172    77 blond      fair       blue              19 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## 2 C-3PO    167    75 NA         gold       yellow           112 none  mascu… Tatooine  Droid   <chr… <chr [1… <chr [1]>
## 3 Owen…    178   120 brown, gr… light      blue              52 male  mascu… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 4 Beru…    165    75 brown      light      blue              47 fema… femin… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 5 Wilh…    180    NA auburn, g… fair       blue              64 male  mascu… Eriadu    Human   <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>

Exclusive threshold.

#> exclude 160 and 180 values
kc$search("starwars", query = "height:{160 TO 180}")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Luke…    172    77 blond      fair       blue              19 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## 2 C-3PO    167    75 NA         gold       yellow           112 none  mascu… Tatooine  Droid   <chr… <chr [1… <chr [1]>
## 3 Owen…    178   120 brown, gr… light      blue              52 male  mascu… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 4 Beru…    165    75 brown      light      blue              47 fema… femin… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 5 Gree…    173    74 NA         green      black             44 male  mascu… Rodia     Rodian  <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>

Mixing inclusive and exclusive.

#> exclude 160 but include 180
kc$search("starwars", query = "height:{160 TO 180]")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Luke…    172    77 blond      fair       blue              19 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## 2 C-3PO    167    75 NA         gold       yellow           112 none  mascu… Tatooine  Droid   <chr… <chr [1… <chr [1]>
## 3 Owen…    178   120 brown, gr… light      blue              52 male  mascu… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 4 Beru…    165    75 brown      light      blue              47 fema… femin… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 5 Wilh…    180    NA auburn, g… fair       blue              64 male  mascu… Eriadu    Human   <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>
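
Range expressions are plain strings, so they can be generated rather than typed by hand. The helper below is a hypothetical convenience function written in base R for this vignette; es_range() is not part of kibior or Elasticsearch. It only assembles the bracket notation described above, and the resulting string would be passed to the query parameter.

```r
# Hypothetical helper: build a range expression in query string syntax.
# "*" stands for an unbounded side.
es_range <- function(field, min = "*", max = "*",
                     include_min = TRUE, include_max = TRUE) {
  left <- if (include_min) "[" else "{"
  right <- if (include_max) "]" else "}"
  paste0(field, ":", left, min, " TO ", max, right)
}

# Inclusive on both sides
print(es_range("height", 160, 180))
# Exclude 160 but include 180
print(es_range("height", 160, 180, include_min = FALSE))
# One side unbounded, same meaning as n:>=10
print(es_range("n", min = 10))
```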

fuzziness and proximity: append “~” to a term to run an approximate search.
- Default fuzzy factor is 2, meaning “quikc~” and “quikc~2” are identical.
- It can be applied to phrases, ex: “"fox quick"~5”.

#> fuzzy search for blue/black/brown/... eyes
#> useful when we do not know exactly the content
kc$search("starwars", query = "eye_color:bla~3")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>   <list>   
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu… Tatooine  Human   <chr… <chr [2… <chr [2]>
## 2 Owen…    178   120 brown, gr… light      blue            52   male  mascu… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 3 Beru…    165    75 brown      light      blue            47   fema… femin… Tatooine  Human   <chr… <chr [1… <chr [1]>
## 4 Anak…    188    84 blond      fair       blue            41.9 male  mascu… Tatooine  Human   <chr… <chr [2… <chr [3]>
## 5 Wilh…    180    NA auburn, g… fair       blue            64   male  mascu… Eriadu    Human   <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>

boosting: use “^” to weight some expressions over others.
- Value:
  - 0 to 1: decrease boosting.
  - Superior to 1: increase boosting.
- Boost type:
  - terms, ex: quick^2 fox, quick is boosted.
  - phrases, ex: "foo bar"^2.
  - groups, ex: (foo bar)^4.

#> boost the black eye search but get the blue too
kc$search("starwars", query = "eye_color:(black^2 OR blue)")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Gree…    173    74 NA         green      black             44 male  mascu… Rodia     Rodian  <chr… ""       ""       
## 2 Nien…    160    68 none       grey       black             NA male  mascu… Sullust   Sullus… <chr… ""       "Millenn…
## 3 Gasg…    122    NA none       white, bl… black             NA male  mascu… Troiken   Xexto   <chr… ""       ""       
## 4 Kit …    196    87 none       green      black             NA male  mascu… Glee Ans… Nautol… <chr… ""       ""       
## 5 Plo …    188    80 none       orange     black             22 male  mascu… Dorin     Kel Dor <chr… ""       "Jedi st…
## # … with 1 more variable: kid <int>

Now we can easily build a more complex search query:

#> consider this dataset
ggplot2::diamonds %>% kc$push("diamonds")

## [1] "diamonds"

#> searching premium or ideal quality of diamonds, 
#> with a price below $10k, a carat above 1.4,
#> a z between 2.2 and 5.4 inclusive, and excluding colors E and H. 
#> we only want some columns.
kc$search("diamonds", 
        query = "cut:(premium || ideal) 
            && price:<10000 
            && carat:>1.4 
            && z:[2.2 TO 5.4] 
            && -color:(E || H)", 
        columns = c("carat", "color", "depth", "clarity", "price", "z"), 
        head = FALSE)

## $diamonds
## # A tibble: 765 x 6
##    depth color clarity price carat     z
##    <dbl> <chr> <chr>   <int> <dbl> <dbl>
##  1  62.4 J     SI1      8176  1.59  4.66
##  2  62.7 I     SI1      8193  1.51  4.59
##  3  61.5 J     VS2      8203  1.51  4.54
##  4  62   J     VS1      8207  1.54  4.62
##  5  62   J     VS2      8217  1.51  4.54
##  6  62.2 I     SI2      8220  1.62  4.69
##  7  62.4 J     SI1      8221  1.57  4.65
##  8  60.3 J     VVS2     8227  1.59  4.59
##  9  62.6 I     SI2      8228  1.57  4.63
## 10  62   I     SI2      8254  1.54  4.56
## # … with 755 more rows
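
A long query can also be assembled from separate clauses, which keeps each condition on its own line without embedding line breaks in the string. This is a minimal base R sketch; the variable names are ours, and the resulting string is meant to be equivalent to the query used above once passed to the query parameter.

```r
# One clause per condition, in query string syntax
clauses <- c(
  "cut:(premium || ideal)",
  "price:<10000",
  "carat:>1.4",
  "z:[2.2 TO 5.4]",
  "-color:(E || H)"
)

# Join all clauses with the AND operator
query <- paste(clauses, collapse = " && ")

print(query)
```

Building the query this way also makes it easy to add or remove a condition conditionally before searching.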

5.4.2 `$search()` behavior

#> consider this dataset
dplyr::storms %>% kc$push("storms", mode = "recreate")

## [1] "storms"

dplyr::starwars %>% kc$push("starwars", mode = "recreate")

## [1] "starwars"

Although Elasticsearch is very powerful as a document-oriented database, it is above all a full-text search engine: a bare term is matched against whole words, not substrings.

#> searching for exact word "dar" but nothing found
kc$search(query = "dar")

## $diamonds
## list()
## 
## $starwars
## list()
## 
## $storms
## list()

With a wildcard, keeping only the results from a single index:

#> The search is case-insensitive meaning:
#> Dar == dAr == daR == DAr == ...etc.
kc$search(query = "*Dar*")$starwars

## # A tibble: 5 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Dart…    202   136 none       white      yellow          41.9 male  mascu… Tatooine  Human   <chr… ""       "TIE Adv…
## 2 Bigg…    183    84 black      light      brown           24   male  mascu… Tatooine  Human   <chr… ""       "X-wing" 
## 3 Land…    177    79 black      dark       brown           31   male  mascu… Socorro   Human   <chr… ""       "Millenn…
## 4 Watto    137    NA black      blue, grey yellow          NA   male  mascu… Toydaria  Toydar… <chr… ""       ""       
## 5 Quar…    183    NA black      dark       brown           62   NA    NA     Naboo     NA      <chr… ""       ""       
## # … with 1 more variable: kid <int>

Column selection:

#> searching every word in name that starts with "d"
kc$search("*", 
        query = "name:d*", 
        columns = c("name", "status"))

## $diamonds
## list()
## 
## $starwars
## # A tibble: 5 x 1
##   name                 
##   <chr>                
## 1 R2-D2                
## 2 Darth Vader          
## 3 R5-D4                
## 4 Biggs Darklighter    
## 5 Jabba Desilijic Tiure
## 
## $storms
## # A tibble: 5 x 2
##   name  status             
##   <chr> <chr>              
## 1 Debby tropical depression
## 2 Debby tropical depression
## 3 Debby tropical depression
## 4 Debby tropical depression
## 5 Debby tropical depression

As the last request shows, a column that does not exist in an index (here "status" in starwars) is simply not returned for that index.

Now a more complex search, directly done by pulling data:

#> We can search premium or ideal quality of diamonds, 
#> with a price below $10k, a carat above 1.4,
#> a z between 2.2 and 5.4 inclusive, excluding colors E and H,
#> and excluding any clarity starting with the string "VS"
#> we only want some columns.
kc$pull("diamonds", 
        query = "cut:(premium || ideal) 
            && price:<10000 
            && carat:>1.4 
            && z:[2.2 TO 5.4] 
            && -color:(E || H)
            && -clarity:VS*", 
        columns = c("carat", "color", "depth", "clarity", "price", "z"))

## $diamonds
## # A tibble: 552 x 6
##    depth color clarity price carat     z
##    <dbl> <chr> <chr>   <int> <dbl> <dbl>
##  1  62.8 I     SI1      8574  1.51  4.58
##  2  61.4 G     SI2      8580  1.5   4.52
##  3  62   G     SI2      8580  1.5   4.52
##  4  62.8 G     SI1      8599  1.43  4.49
##  5  62.2 J     SI1      8610  1.65  4.7 
##  6  62.7 D     SI2      8631  1.52  4.59
##  7  60.8 G     SI2      8637  1.51  4.51
##  8  61.9 I     SI2      8637  1.53  4.56
##  9  62   G     SI2      8643  1.57  4.62
## 10  62.1 I     SI1      8685  1.5   4.56
## # … with 542 more rows

This was executed on a small dataset of 54k observations and 10 variables. We will see it on a bigger one in the biological example vignette.

5.4.3 `text` and `keyword` querying

Lastly, we need to see the difference between a keyword and a text field.

Elasticsearch can index text values as two different types: text and keyword. The difference between those two is that:

text columns such as “name” or “skin_color” are broken up into words during indexing, allowing searches on one or more words,

#> search every document that has at least
#> one word in the "name" column starting with "L"
kc$pull("starwars", 
        query = "name:L*", 
        columns = "name")$starwars

## # A tibble: 10 x 1
##    name              
##    <chr>             
##  1 Luke Skywalker    
##  2 Leia Organa       
##  3 Owen Lars         
##  4 Beru Whitesun lars
##  5 Lando Calrissian  
##  6 Lobot             
##  7 Cliegg Lars       
##  8 Poggle the Lesser 
##  9 Luminara Unduli   
## 10 Lama Su

keyword columns (always added when pushing data with kibior) keep the full text as one string.

#> search every document whose "name"
#> field starts with "L"
kc$pull("starwars", 
        query = "name.keyword:L*", 
        columns = "name")$starwars

## # A tibble: 6 x 1
##   name            
##   <chr>           
## 1 Luke Skywalker  
## 2 Leia Organa     
## 3 Lando Calrissian
## 4 Lobot           
## 5 Luminara Unduli 
## 6 Lama Su

kibior indexes all text values as both text and keyword, so we can use whole-text search (with the .keyword suffix) as well as word-specific search (without it).

Searching for words starting with a specific prefix in pure R is more verbose:

dplyr::starwars[["name"]] %>%                    #> take the name column data
    lapply(function(x){                          #> for each name
        stringr::str_split(x, " ") %>%           #> split name by space
        unlist(use.names = FALSE) %>%            #> align
        grepl("^L", ., ignore.case = TRUE) %>%   #> search pattern for words starting with "L", ignore case to search also for "^l"
        any()                                    #> TRUE if at least one word match
    }) %>%                                       #> list of logicals
    unlist(use.names = FALSE) %>%                #> flatten it to logical vector to match starwars observations number
    dplyr::starwars[.,] %>%                      #> apply logical filter only on lines that were found
    dplyr::select(name)                          #> select only "name" var

## # A tibble: 10 x 1
##    name              
##    <chr>             
##  1 Luke Skywalker    
##  2 Leia Organa       
##  3 Owen Lars         
##  4 Beru Whitesun lars
##  5 Lando Calrissian  
##  6 Lobot             
##  7 Cliegg Lars       
##  8 Poggle the Lesser 
##  9 Luminara Unduli   
## 10 Lama Su
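
The difference between the two field types can also be approximated locally with regular expressions. The sketch below is a rough base R illustration, not a description of how Elasticsearch analyzes text: it assumes that a text match means "some word starts with the prefix, ignoring case" and that a keyword match means "the whole string starts with the prefix, case-sensitively". The names are typed by hand from dplyr::starwars to keep the example self-contained.

```r
# A few names copied by hand from dplyr::starwars
names_vec <- c("Luke Skywalker", "Owen Lars", "Beru Whitesun lars",
               "Lobot", "Darth Vader")

# text-like match: some word starts with "l", whatever the case
text_like <- grepl("\\bl", names_vec, ignore.case = TRUE, perl = TRUE)

# keyword-like match: the whole string starts with "L" (case-sensitive)
keyword_like <- startsWith(names_vec, "L")

print(names_vec[text_like])
print(names_vec[keyword_like])
```

"Owen Lars" and "Beru Whitesun lars" only appear in the first result, which mirrors the difference between the name and name.keyword queries above.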

5.4.4 Reserved Elasticsearch characters

Elasticsearch has some reserved characters: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

You should remove them from your data before pushing it into Elasticsearch. If that is not possible, or if you want to retrieve data pushed by someone else that contains reserved characters, try querying a keyword field.
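
When a search term itself contains reserved characters, another option is to escape them with a backslash before building the query. The helper below is a hypothetical base R sketch, not a kibior function. It assumes the usual query string rules: reserved characters can be escaped with a leading backslash, except < and >, which cannot be escaped and are therefore removed here.

```r
# Hypothetical helper: escape reserved characters in a single search term
escape_es_term <- function(term) {
  # "<" and ">" cannot be escaped, so they are dropped
  term <- gsub("[<>]", "", term)
  # Prefix every other reserved character with a backslash
  gsub("([-+=&|!(){}\\[\\]^\"~*?:\\\\/])", "\\\\\\1", term, perl = TRUE)
}

# writeLines() shows the actual characters, without R's own string escaping
writeLines(escape_es_term("R2-D2"))
writeLines(escape_es_term("(1+1):2"))
writeLines(escape_es_term("a<b>c"))
```

The escaped term can then be pasted into a larger query, for instance after a field name and a colon.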

5.5 `$push()` details

5.5.1 Define a unique ID column

When pushing data with default parameters, kibior will define unique IDs for each record (each line of a table) and add them as metadata. You can retrieve them by using $pull(keep_metadata = TRUE).

#> With the storms index
kc$pull("storms", keep_metadata = TRUE)$storms

## # A tibble: 10,010 x 21
##    `_index` `_type` `_id` `_version` `_seq_no` `_primary_term` found `_source.name` `_source.year` `_source.month`
##    <chr>    <chr>   <chr>      <int>     <int>           <int> <lgl> <chr>                   <int>           <int>
##  1 storms   _doc    10001          1     10000               1 TRUE  Kate                     2015              11
##  2 storms   _doc    10002          1     10001               1 TRUE  Kate                     2015              11
##  3 storms   _doc    10003          1     10002               1 TRUE  Kate                     2015              11
##  4 storms   _doc    10004          1     10003               1 TRUE  Kate                     2015              11
##  5 storms   _doc    10005          1     10004               1 TRUE  Kate                     2015              11
##  6 storms   _doc    10006          1     10005               1 TRUE  Kate                     2015              11
##  7 storms   _doc    10007          1     10006               1 TRUE  Kate                     2015              11
##  8 storms   _doc    10008          1     10007               1 TRUE  Kate                     2015              11
##  9 storms   _doc    10009          1     10008               1 TRUE  Kate                     2015              11
## 10 storms   _doc    10010          1     10009               1 TRUE  Kate                     2015              11
## # … with 10,000 more rows, and 11 more variables: `_source.day` <int>, `_source.hour` <int>, `_source.lat` <dbl>,
## #   `_source.long` <dbl>, `_source.status` <chr>, `_source.category` <chr>, `_source.wind` <int>,
## #   `_source.pressure` <int>, `_source.ts_diameter` <dbl>, `_source.hu_diameter` <dbl>, `_source.kid` <int>

Metadata columns are mainly prefixed by an underscore. The actual record is embedded into the _source field. Since data have been pushed without specifying an ID column, the _id field that defines Elasticsearch unique IDs reflects the one automatically added by kibior in the data (kid by default). To change the default ID column added by kibior, change the $default_id_col attribute value.

Letting kibior assign IDs guarantees uniqueness, but the generated IDs may not be the most meaningful or practical ones when you later update records.

To change that behavior, you can define your own ID field when calling $push() by using the id_col parameter.

#> Again, pushing storms, but with our own IDs, for instance
#> by adding "aaa" at the beginning of each row number and using it as the ID.
data <- dplyr::storms
ids <- seq_len(nrow(data)) %>% paste("aaa", ., sep="")
data <- cbind(a_new_unique_id = ids, data)
#> the column "a_new_unique_id" will be used as our unique ID
kc$push(data, "storm_with_our_id", id_col = "a_new_unique_id")

## [1] "storm_with_our_id"

#> and see 
s <- kc$pull("storm_with_our_id", 
             columns = "a_new_unique_id",
             keep_metadata = TRUE)$storm_with_our_id
s %>% dplyr::select(c("_id", "_source.a_new_unique_id"))

## # A tibble: 10,010 x 2
##    `_id`   `_source.a_new_unique_id`
##    <chr>   <chr>                    
##  1 aaa8991 aaa8991                  
##  2 aaa8992 aaa8992                  
##  3 aaa8993 aaa8993                  
##  4 aaa8994 aaa8994                  
##  5 aaa8995 aaa8995                  
##  6 aaa8996 aaa8996                  
##  7 aaa8997 aaa8997                  
##  8 aaa8998 aaa8998                  
##  9 aaa8999 aaa8999                  
## 10 aaa9000 aaa9000                  
## # … with 10,000 more rows
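
Before choosing your own ID column, it can help to check that it really is unique. The sketch below uses only base R; is_unique_id_col() is a hypothetical helper, not a kibior function, and kibior may run its own checks when pushing.

```r
#> hypothetical pre-check (not part of kibior): TRUE if the column
#> has no missing values and no duplicates
is_unique_id_col <- function(data, col) {
    stopifnot(is.data.frame(data), col %in% names(data))
    values <- data[[col]]
    !anyNA(values) && anyDuplicated(values) == 0
}

#> toy data made up for the example
df <- data.frame(id = paste0("aaa", 1:3), grp = c("x", "x", "y"))

is_unique_id_col(df, "id")
## [1] TRUE
is_unique_id_col(df, "grp")
## [1] FALSE
```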

Caution here: the columns parameter does not apply to metadata.

#> columns match nothing except actual pushed data columns
kc$pull("storms", keep_metadata = TRUE, columns = c("_id", "_version"))$storms

## # A tibble: 10,010 x 7
##    `_index` `_type` `_id` `_version` `_seq_no` `_primary_term` found
##    <chr>    <chr>   <chr>      <int>     <int>           <int> <lgl>
##  1 storms   _doc    10001          1     10000               1 TRUE 
##  2 storms   _doc    10002          1     10001               1 TRUE 
##  3 storms   _doc    10003          1     10002               1 TRUE 
##  4 storms   _doc    10004          1     10003               1 TRUE 
##  5 storms   _doc    10005          1     10004               1 TRUE 
##  6 storms   _doc    10006          1     10005               1 TRUE 
##  7 storms   _doc    10007          1     10006               1 TRUE 
##  8 storms   _doc    10008          1     10007               1 TRUE 
##  9 storms   _doc    10009          1     10008               1 TRUE 
## 10 storms   _doc    10010          1     10009               1 TRUE 
## # … with 10,000 more rows

5.5.2 Push modes

When pushing data, if the index name you give to $push() already exists, an error is thrown. This is due to the default mode = "check" parameter, which checks whether an index with that name already exists. The mode can be changed to "recreate" or "update":

"recreate" will erase the index and write to a fresh one with the same name. Be cautious with this option as you will erase previously written data from that index name.

#> recreate one index, whether it already exists or not
dplyr::starwars %>% kc$push("starwars", mode = "recreate")

## [1] "starwars"

"update" will push and update indexed data with corresponding IDs. For this option, you must know which field is the unique ID and send the updated documents with it. You do not need to send all the data again, only the subset that changed: re-sending everything is error-prone and can take a long time if your dataset is big. Knowing which field holds the unique ID also helps prevent errors.

#> we will change the height of orange-eyed inhabitants of "Naboo"
#> homeworld to 300 and update that subset to the main one.
s <- kc$pull("starwars", query = "eye_color:orange && homeworld:naboo")$starwars
s

## # A tibble: 3 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    196    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Roos…    224    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## 3 Rugo…    206    NA none       green      orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>

#> change the height of those selected to 300
s$height <- 300
s

## # A tibble: 3 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <dbl> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    300    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Roos…    300    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## 3 Rugo…    300    NA none       green      orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>

#> and update the main dataset. Since it is a subset of that dataset, 
#> IDs are the same, which is default "kid" column.
ns <- kc$push(s, "starwars", mode = "update", id_col = "kid")
#> see the result
ns <- kc$pull("starwars", 
              query = "eye_color:orange && homeworld:naboo")$starwars
ns

## # A tibble: 3 x 15
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles starships
##   <chr>  <int> <int> <chr>      <chr>      <chr>          <int> <chr> <chr>  <chr>     <chr>   <lis> <chr>    <chr>    
## 1 Jar …    300    66 none       orange     orange            52 male  mascu… Naboo     Gungan  <chr… ""       ""       
## 2 Roos…    300    82 none       grey       orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## 3 Rugo…    300    NA none       green      orange            NA male  mascu… Naboo     Gungan  <chr… ""       ""       
## # … with 1 more variable: kid <int>
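
To make the update semantics concrete, here is an in-memory analogy in base R: rows of a subset are written back to the full table by matching on the ID column. This only mimics the idea on a toy data frame (the values are made up); it is not how kibior or Elasticsearch perform the update.

```r
#> toy "index" with a unique ID column named kid
main <- data.frame(
    kid = 1:5,
    height = c(196, 224, 206, 150, 170)
)

#> take a subset and modify it, as done above with the pulled data
s <- main[main$height > 190, ]
s$height <- 300

#> write the subset back by matching IDs, leaving other rows untouched
rows <- match(s$kid, main$kid)
main[rows, ] <- s

main$height
## [1] 300 300 300 150 170
```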

5.6 Comparison with `dplyr` functions

The dplyr package offers two simple and effective functions, filter and select, to quickly reduce the scope of interest. In the same fashion, kibior uses the Elasticsearch query string syntax, which is very similar to the dplyr syntax (see the Querying section). Elasticsearch multiplies the search possibilities by allowing the same usage across multiple indices, or datasets, on multiple remote servers.

Moreover, using $count(), $search() or $pull(), one can use their analogous features:

dplyr::select() with columns parameter,
and dplyr::filter() with query parameter.

Using both of them results in much more powerful search capabilities and much more readable code.

The following sections give some examples of analogous requests.

5.6.1 Similarities

Select some columns:

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::select(name, height, homeworld)

#> kibior
s <- kc$pull("starwars", 
             columns = c("name", "height", "homeworld"))

Filter on strict thresholds:

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::filter(height > 180)

#> kibior
s <- kc$pull("starwars", 
             query = "height:>180")

Filter on inclusive (non-strict) thresholds:

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::filter(height >= 180)

#> kibior
s <- kc$pull("starwars", 
             query = "height:>=180")
#> or with range notation
s <- kc$pull("starwars", 
             query = "height:[180 TO *]")

Filter on ranges:

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::filter(height >= 180 & height < 300)

#> kibior
s <- kc$pull("starwars", 
             query = "height:[180 TO 300}")

Filter on exact string match for one field:

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld == "Naboo")

#> kibior
s <- kc$pull("starwars", 
             query = "homeworld:Naboo")

Filter on exact string match with multiple choices on one field:

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld == "Naboo" | homeworld == "Tatooine")
#> or
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld %in% c("Naboo", "Tatooine"))

#> kibior (several ways to do it)
s <- kc$pull("starwars", 
             query = "homeworld:(Naboo || Tatooine)")

Filter on partial string matching:

#> dplyr, we have to use `str_detect`
s <- dplyr::starwars %>% 
        dplyr::filter(stringr::str_detect(name, "Luk|Dar"))

#> kibior, nothing else required
s <- kc$pull("starwars", 
             query = "name:(*Luk* || *Dar*)")

Filter over a composition of multiple filters (multiple columns):

#> dplyr 
s <- dplyr::starwars %>% 
        dplyr::filter(homeworld == "Naboo" & height > 180)

#> kibior
s <- kc$pull("starwars", 
             query = "homeworld:Naboo && height:>180")

5.6.2 Differences

Even if there are lots of similarities regarding the syntax, Elasticsearch is a powerful search engine, so requests on billions of records are less expensive to run with it. Elasticsearch is also accessible through its API, and many people can access it at the same time. This means a collaborator can push data and you can use them immediately after. Moreover, using wildcards, we can search on multiple indices at once.

What Elasticsearch makes very easy is searching everywhere: in every index, in every column, and in every word. Lastly, full-text search is its main strength. See Text and Keyword querying for more details.

5.7 Change tibble column type

kibior returns base types in tibble structures (integer, character, logical, and list) to represent data. If you want to change the type of some columns, use readr::type_convert() after retrieving the dataset.

#> changing the "status" column from string to factor
kc$pull("storms")$storms %>%
    readr::type_convert(
        col_types = readr::cols(
            status = readr::col_factor()))

5.8 Compare two instances

If you manage multiple instances, you can easily compare their host:port pairs with the == and != operators.

#> is kc instance equal to kc_two instance?
(kc == kc_two)
#> are kc and kc_two instances different?
(kc != kc_two)

5.9 Attach one instance to global environment

If you use only one instance of kibior, you might want to attach it to the global environment. This removes the need to prefix each method call with the instance name (kc$... in our examples).

Though it can be practical in local developments with a single instance, we strongly discourage that practice if you intend to share your code. It can induce wrong behaviors during execution in environments with different configurations or multiple instances.

5.10 Joins

kibior integrates the dplyr package joins: full, left, right, inner, anti, and semi joins.

With kibior joins, you can apply these joins to in-memory datasets and Elasticsearch-based indices. kibior supports the query parameter when joining to reduce data retrieval time, but cannot join on list columns.

#> pushing a subset of data
dplyr::starwars %>% 
    dplyr::filter(homeworld == "Naboo") %>%
    kc$push("starwars_naboo", mode = "recreate")

kc$pull("starwars_naboo")

#> perform an inner join  between the in-memory full dataset
#> and the remote subset we have just sent
columns <- c("name", "height", "mass", "gender", "homeworld")
kc$inner_join(dplyr::starwars, "starwars_naboo",
            left_columns = columns,
            right_columns = columns,
            by = c("name", "height", "mass"))

In the result, kibior adds the suffixes left and right to the data columns.

5.11 Moving and copying data from another instance

Apart from moving and copying indices within the same cluster of Elasticsearch instances, the $move() and $copy() methods can do the same with REMOTE instances. The remote Elasticsearch endpoint has to be declared inside your elasticsearch.yml configuration file.

By adding one line to the elasticsearch.yml configuration file that defines a server whitelist, Elasticsearch servers can talk to each other and transfer data between them in a much faster and more secure way.

#> config/elasticsearch.yml

...
reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
...

Full description can be found on Elasticsearch documentation.

After that, kibior will be able to use the from_instance parameter of $move() and $copy().

#> init two ES bindings
#> kc_local must be configured
#> we make the assumption that both kc are accessible
kc_local <- kibior$new("es_local")
kc_remote <- kibior$new("es_remote", port = 9205)

#> copy data from kc_remote to kc_local
kc_local$copy(from_index = "remote_index", 
              to_index = "new_copy_of_remote_index_in_local",
              from_instance = kc_remote)

This method allows massive data copying in a much faster way since all data are structured the same.

6 Known limits

As with any implementation, there are some limits:

Elasticsearch cannot store uppercase field names, thus all column names are forced to lowercase when submitted by default.
Elasticsearch interprets dots in field names as nested values (e.g. “aaa.bbb” is understood as a field “aaa” containing a field “bbb”), which is error-prone with R since variable names can contain dots. To avoid errors when pushing data to Elasticsearch, dots in column names are replaced by underscores.

#> iris column names
datasets::iris %>% names()

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

#> example with iris dataset
datasets::iris %>% kc$push("iris")

## [1] "iris"

#> get columns of index iris
kc$columns("iris")

## $iris
## [1] "kid"          "petal_length" "petal_width"  "sepal_length" "sepal_width"  "species"
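
The renaming shown above can be approximated in base R as follows. normalize_names() is a hypothetical helper that mirrors the two documented rules (lowercase, dots to underscores); kibior's actual implementation may differ, and the real index also gains the kid column.

```r
#> hypothetical helper (not a kibior function): preview the column
#> names an index would get after lowercasing and dot replacement
normalize_names <- function(x) {
    #> fixed = TRUE so that "." is a literal dot, not a regex wildcard
    tolower(gsub(".", "_", x, fixed = TRUE))
}

normalize_names(names(datasets::iris))
## [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
```

Running this on your own column names before pushing is a cheap way to spot two columns that would end up with the same name.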

Elasticsearch has a default limit of 1000 columns per index, which can be raised, so pushing a dataset with more than 1000 variables will generate an error. Two solutions: try to transpose the dataset, or define a higher limit in the Elasticsearch configuration (the index.mapping.total_fields.limit index setting).
Elasticsearch handles each document (each line of a table) with a unique ID: a specific "_id" metadata field. What can be confusing here is that metadata are not on the same level as data in Elasticsearch. To make updates easier by accurately targeting document IDs, kibior adds a new unique field (kid by default) when pushing data to Elasticsearch and defines it as the unique "_id" field. If you know that one of your columns is unique and can be used as an ID column, you can use the id_col parameter of the $push() method to define it as the main ID.
The columns parameter does not handle metadata columns.
Elasticsearch is really good at textual and keyword search, but for that the text needs common delimiters so it can be split into words. Passing a single, billions-long, uninterrupted biomolecular sequence does not suit Elasticsearch and may result in an indexing failure.
$move() and $copy() for remote instances are very sensitive to authentication and security configurations. Some tasks will not be possible due to each organization's security measures. Check with your system administrator.
Joins are not executed server-side (on ES), which means the Elasticsearch data must be downloaded before executing the actual join. Querying and selecting columns with the join parameters left_columns, right_columns, left_query and right_query is therefore important to lower the data transfer payload and speed up the execution.
Elasticsearch limits returned results to 10,000 elements per bulk. If you try to set bulk_size > 10000, kibior will downsize it to match the maximum allowed.
The query parameter is a powerful string-based mechanism. Users need to understand that the query parameter sends a single query, in one request, to an Elasticsearch instance. If the request is generated from a list of elements such as c("id1", "id2", "id3", ...) %>% paste0(collapse = " || ") %>% kc$search("*", query = .), it can produce a very long string that cannot be passed in full to Elasticsearch. One way to counter this issue is to split the element vector into subsets and do multiple calls. It will be fully automated in future versions.
kibior applies some modifications to datasets before sending them to Elasticsearch: it turns all dataset names to lowercase, replaces dots in names with underscores, adds the kid column, etc. All these transformations can affect the behavior of the $*_join() methods.
The $keys() method limits the number of unique keys returned to 1000 by default, since it aggregates a potentially unlimited number of keys, which can happen when calling it on integer or floating point values. If you want more, change the max_size method parameter.
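
The splitting workaround mentioned above for long query strings can be sketched in base R. chunk_queries() and its chunk_size argument are hypothetical names, not part of kibior, and the right chunk size depends on your Elasticsearch setup.

```r
#> hypothetical helper (not a kibior function): split a vector of terms
#> into several shorter OR-joined query strings
chunk_queries <- function(terms, chunk_size = 100) {
    stopifnot(chunk_size >= 1)
    if (length(terms) == 0) return(character(0))
    #> group index for each term: 1, 1, ..., 2, 2, ...
    groups <- ceiling(seq_along(terms) / chunk_size)
    chunks <- split(terms, groups)
    vapply(chunks, paste0, character(1), collapse = " || ", USE.NAMES = FALSE)
}

ids <- paste0("id", 1:5)
queries <- chunk_queries(ids, chunk_size = 2)
queries
## [1] "id1 || id2" "id3 || id4" "id5"
```

Each resulting string can then be passed as the query parameter of a separate $search() or $pull() call, and the results combined afterwards.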

7 Tested with

kibior has been tested with these configurations:

Software	Version
`Elasticsearch`	`6.8`, `7.5`, `7.8`, `7.9`, `7.10`
`R`	`3.6.1`, `4.0.2`, `4.0.3`
`RStudio`	`1.2.5001, build 93, 7b3fe265`, `1.4.1103, build "Wax Begonia", 458706c3`

This vignette has been built using the following session:

Session info

```r
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kibior_0.1.1   magrittr_2.0.1 readr_1.4.0    stringr_1.4.0  dplyr_1.0.3   
## [6] ggplot2_3.3.3  knitr_1.30    
## 
## loaded via a namespace (and not attached):
##  [1] zip_2.1.1         Rcpp_1.0.6        cellranger_1.1.0  pillar_1.4.7     
##  [5] compiler_4.0.3    forcats_0.5.0     elastic_1.1.0     tools_4.0.3      
##  [9] digest_0.6.27     jsonlite_1.7.2    evaluate_0.14     lifecycle_0.2.0  
## [13] tibble_3.0.5      gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10     
## [17] openxlsx_4.2.3    crul_1.0.0        curl_4.3          yaml_2.2.1       
## [21] haven_2.3.1       xfun_0.20         rio_0.5.16        withr_2.4.1      
## [25] generics_0.1.0    vctrs_0.3.6       hms_1.0.0         grid_4.0.3       
## [29] tidyselect_1.1.0  glue_1.4.2        httpcode_0.3.0    data.table_1.13.6
## [33] R6_2.5.0          readxl_1.3.1      foreign_0.8-80    rmarkdown_2.6    
## [37] tidyr_1.1.2       purrr_0.3.4       scales_1.1.1      ellipsis_0.3.1   
## [41] htmltools_0.5.1.1 colorspace_2.0-0  stringi_1.5.3     munsell_0.5.0    
## [45] crayon_1.3.4
```

References

Chamberlain, Scott. 2020. “Elastic: General Purpose Interface to ‘Elasticsearch’.” Bioinformatics. https://CRAN.R-project.org/package=elastic.

Chan, Chung-hong, Geoffrey CH Chan, Thomas J. Leeper, and Jason Becker. 2018. “Rio: A Swiss-Army Knife for Data File I/O.” https://CRAN.R-project.org/package=rio.

Lawrence, Michael, Robert Gentleman, and Vincent Carey. 2019. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25: 1841–2. https://doi.org/10.1093/bioinformatics/btp328.

Morgan, Martin, Hervé Pagès, Valerie Obenchain, and Nathaniel Hayden. 2020. “Rsamtools: Binary Alignment (BAM), FASTA, Variant Call (BCF), and Tabix File Import.” https://doi.org/10.18129/B9.bioc.Rsamtools.

Pagès, H., P. Aboyoun, R. Gentleman, and S. DebRoy. 2020. “Biostrings: Efficient Manipulation of Biological Strings.” https://doi.org/10.18129/B9.bioc.Biostrings.

KibioR - Introduction

Régis Ongaro-Carcy

2021-01-28