View on GitHub

JODA - JSON On Demand Analysis

Efficient data wrangling for semi-structured JSON documents

Quick Navigation

Syntax

The basic syntax of queries is:

LOAD     (<COLLECTION>)   (<SOURCES>)
(JOIN    <JOIN EXPRESSION>)
(CHOOSE  <PRED>)
(AS      <PROJ>)
(AGG     <AGG>)
(STORE   <COLLECTION>  | AS FILE "<FILE>" | AS FILES "<DIR>")

The execution order is LOAD->CHOOSE->AS->AGG->STORE->DELETE.

Load

LOAD <COLLECTION> loads an previously stored result, for further usage in the following query, from variable <COLLECTION>.

LOAD <COLLECTION> <SOURCES> imports data from the <SOURCES> into <COLLECTION> and continues the query with it. If <COLLECTION> already exists, the contents of <SOURCES> will be appended to it.

If <COLLECTION> is omitted, the data is only parsed but not stored for later reuse. It will then be only available in the current query.

<SOURCES> can be one or multiple single <SOURCE>, combined with ,.

Instead of <SOURCES>, data can also be imported by an ongoing group by with the FROM GROUPED <EXP> command. (For more informations see Group-By)

Sources

Currently the following sources are implemented:

FROM FILE "<FILE>" parses a single file in the path <FILE>.

FROM FILES "<DIR>" parses all files in the directory <DIR>.

FROM URL "<URL>" parses documents returned by a GET call to <URL>.

FROM STREAM (See streaming)

Sampling

Sources may also be sampled, to decrease memory footprint and/or improve query times.

To sample sources SAMPLE X has to be appended to a <SOURCE> command. X has to be a value between (0,1).

Keep in mind that sampling only skips documents in the sources. All results will only represent these sources.

Currently, no further extrapolations are performed by the system to project aggregations like sums or similar functions.

JOIN (Optional)

The loaded dataset can be joined with another dataset, using the JOIN command.

After the JOIN keyword JODA expects either the name of a stored collection, or a sub-query surrounded by parentheses.

For example are the following two valid JOIN commands:

LOAD A JOIN B <condition>;
LOAD A JOIN (LOAD FROM FILES ...) <condition>;

Each matching document pair is joined into a single document, where the top-level is an object with the following structure:

{
    "inner": <Document of A>,
    "outer": <DOcument of B>
}

Currently two types of joins are implemented, equality joins and theta joins.

Equality Joins

Equality-joins join every document that has the same value for a given key. Either one key for both sides, or one key for each side can be used.

The equality join is performed using the LOAD A JOIN B ON (<key>[, <key2>]) expression.

For example are the following valid equality joins:

LOAD A JOIN B ON ('/user/id');
LOAD A JOIN B ON ('/name', SUBSTR('/user',10));

Theta joins

Theta-joins allow the user to join two datasets with an arbitrary predicate. For this to work, every combination of documents is first materialized as noted before with inner and outer attributes in one document. These can then be accessed in arbitrary predicates following the WHERE keyword:

LOAD users JOIN messages WHERE IN('/outer/text','/inner/name');

Note: Materializing every combination of documents may be feasible for small document sets, but can become very expensive for larger datasets. If possible, use the equality join.

The smaller dataset should be the first one in the join for performance reasons. In future versions the optimizer may automatically switch the join partners.

CHOOSE (Optional)

CHOOSE <PRED> selects JSON documents according to the given predicates. If this command is not supplied, all documents will be retrieved.

The predicate can consist of the following:

Variables

Variables can be any of the following:

Comparison

Comparisons can be any of the following:

Each does what is expected. For comparisions the following rules apply:

Pointer evaluate to the needed type. If the given pointer does not exist or the evaluated type is not compatible, then false is returned.

Every comparison may be applied to two strings, but they may not be compared to anything else.

Every comparison may be applied to two numbers, but they may not be compared to anything else.

Booleans can only be checked for equality or unequality and may not be compared to anything else.

AS (Optional)

AS <PROJ> projects the chosen documents into the with <PROJ> given form. If this command is not supplied, the documents will stay as they are.

The projection command has the following syntax:

<PROJ> :    ('<TOPOINTER>' : <SOURCE>), ...

<TOPOINTER>: JSON-Pointer, see section JSON-Pointer, in which the supplied source will be stored

Source

Source can be any of the following:

Set Functions

Set functions can provide multiple values, which results in duplication of the given document with different values in the specified attributes.

Currently the following set functions are implemented:

Examples

With the following JSON input:

{
    "sub":
        {
            "array":
                [
                    1,
                    "test",
                    3
                ]
        }
}

AS ('/id':ID()),('/file':FILENAME()),('/arr_elt':FLATTEN('/sub/array')) could result in this JSON output:

{
    "id" :      12,
    "file" :    "/home/testuser/testjson.json",
    "arr_elt" : 1
}

{
    "id" :      12,
    "file" :    "/home/testuser/testjson.json",
    "arr_elt" : "test"
}

{
    "id" :      12,
    "file" :    "/home/testuser/testjson.json",
    "arr_elt" : 3
}

AGG (Optional)

AGG <AGG> aggregates the results of the projection. If no aggregation is supplied, the results will be returned as is.

The aggregation command has the following syntax:

<AGG> :     ('<TOPOINTER>' : <SOURCE>), ...

Keywords

Source can be any of the following functions:

GROUP BY

Aggregations can also be grouped by a specified value.

<AGG>           : ('<TOPOINTER>' : GROUP <SOURCE> <AS> BY <VALUE>)
<AS> [optional] : AS <IDENT>
<IDENT>         : Any variable name string.

**Note: ** Only atomic data types (string, number, boolean) can be used to group. Arrays and objects are ignored.

WINDOW

Aggregations can also performed using a tumbling window, using the AGG WINDOW (<#documents>) <AGG expressions> expression.

For every set number of documents the aggregation result is computed, returned and the aggregation reset for the next set.

For example would the following query return multiple documents consiting only of the number 10, except maybe for the last one:

LOAD A AGG WINDOW(10) ('':COUNT(''));

Examples

With the following JSON input:

{"num": 0, "dyn" : 1}
{"num": 1, "dyn" : 1}
{"num": 2, "dyn" : "test"}
{"num": 3, "dyn" : {"sub":1}}
{"num": 4 }

Query

LOAD data
AGG   ('/sum':SUM('/num')),
      ('/count':COUNT('/dyn')),
      ('/distinct':DISTINCT('/dyn')),
      ('/grouped' : GROUP COUNT('/num') AS count BY '/dyn')

Result

{
    "sum" : 10,
    "count" : 4,
    "distinct" : [1, "test"],
    "grouped" : [
      {
        "group" : 1,
        "count" : 2
      },
      {
        "group" : "test",
        "count" : 2
      }
    ]
}

STORE (Optional)

STORE <VAR> stores the result of the query in variable <VAR>.

STORE AS FILE "<FILE>" writes the result of the query to <FILE>.

STORE AS STREAM writes the result into the output stream (See streaming).

STORE GROUPED BY <SOURCE> groupes the dataset. (See Group-By)

Group By

Joda has the ability to group multiple documents together, depending on a value.

This is achieved by storing a dataset with the GROUPED BY clause.

STORE GROUPED BY <SOURCE> groups all the documents, with the same value for <SOURCE>, together. To retrieve these grouped results a new query is necessary.

LOAD <VAR> FROM GROUPED <SOURCE> will retrieve previously grouped results.

Multiple datasets can be grouped together, by repeatedly executing a STORE command with the same <SOURCE>.

After the LOAD statement, the grouping is complete and can no longer be referenced.

Source

Source can be any of the following:

Examples

Lets assume we have multiple messages replying to each other:

{
    "id" : 2,
    "reply_to" : 1,
    "msg" : "Hello World!",
},
{
    "id" : 3,
    "reply_to" : 1,
    "msg" : "Hello Joda!",
}

Now we might want to get all replies to the same message. This can be achieved by combining these two queries:

 LOAD msgs
 STORE GROUPED BY '/reply_to';

 LOAD replies FROM GROUPED '/reply_to';

This will return:

 {
     "reply_to" : 1,
     "joins" :
        [
            {
                "id" : 2,
                "reply_to" : 1,
                "msg" : "Hello World!",
            },
             {
                 "id" : 3,
                 "reply_to" : 1,
                 "msg" : "Hello Joda!",
             }
        ]
 }

Misc

Streaming

JODA also supports continuous streaming of query results. If JODA is invoked with pre-set queries (e.g.: using the -q or -f CLI parameters) and without user interaction (started as a pipe, with --noninteractive, or from a non-interactive shell) the input will be streamed and the query evaluated continuously.

In streaming mode additional import (LOAD command) and export (STORE command) sources are available for reading from or writing to the connected stream.

For example is it possible to use the following query to filter and modify each JSON document passed to JODA:

LOAD FROM STREAM 
CHOOSE '/num' > 10 
AS ('':'/num')
STORE AS STREAM;

The FROM URL, FROM FILE, AS FILE sources can also be used to connect to (web-)streams:

LOAD FROM URL https://api.stream.test
...
STORE AS FILE "/dev/null"

(Not that this specific example would do anything useful)

Aggregation

By default, JODA aggregation aggregates all incoming documents. If an aggregation function is used, it will block until the input stream is closed (if it is closed at all).

To make aggregations usable in a streaming environment we added the WINDOW functionality explained in the AGG chapter.

Functions

JODA supports many functions. These can be used everywhere in queries where JSON values are expected, e.g., in CHOOSE, AGG, and AS clauses. Functions can also be nested into each other.

An exhausive list of functions can be found here:

Parameters

The parameters are defined as:

Each parameter can be replaced by a JSON-Pointer. If the attributed does not exist or is of wrong type, the function will return false (or another default value).

JSON-Pointer

JSON-Pointers are pointers to specific parts of a JSON document. The implementation conforms to the RFC 6901.

In this document:

{
    "test" :
        [
            1,
            "test",
            {
                "sub" : "Hello World!"
            }
        ]
}

The pointer '/test/2/sub' would point to the string “Hello World!”.

The empty pointer '' points to the root object.

Example Queries

LOAD A FROM FILES "/home/user/jsondirectory"
CHOOSE '/message/sender' == "Peter"
AS ('/filename' : FILENAME())
AGG ('/files' : DISTINCT('/filename'))
STORE AS FILE "/home/user/peters_files.json"
DELETE A

Loads all files in directory /home/user/jsondirectory and selects only those with the value “Peter” for the attribute /message/sender. From these selected documents the filenames, in which they where contained, is stored in /filename. The resulting documents are then aggregated by choosing all distinct filenames and storing them as array in the attribute /files.

The result is a single document, which is then written to the file /home/user/peters_files.json. Lastly the variable A is removed, so all documents are purged from the system.