JODA
0.13.1 (59b41972)
JSON On-Demand Analysis
|
JODA is an in-memory vertically scalable data processor for semi-structured data, with an initial emphasis on JSON datasets. It can be used to efficiently filter, transform, and aggregate a large set of JSON documents. Please visit the project website to learn more about what JODA is and how it can be used.
Detailed installation instructions can be found on the project website.
These are options that may be supplied to the program during invokation.
These commands can be used any time within the program. The commands have to be entered followed by a ;
.
quit
: Quits the programcache
: Toggles usage of the cachesources
: Lists all named data sources (and the # of documents stored in them)results
: Lists all temporary data sources (and the # of documents stored in them)The basic syntax of queries is:
The execution order is LOAD
->CHOOSE
->AS
->AGG
->STORE
->DELETE
.
LOAD <COLLECTION>
loads an previously stored result, for further usage in the following query, from variable <COLLECTION>
.
LOAD <COLLECTION> <SOURCES>
imports data from the <SOURCES>
into <COLLECTION>
and continues the query with it. If <COLLECTION>
already exists, the contents of <SOURCES>
will be appended to it.
<SOURCES>
can be one or multiple single <SOURCE>
, combined with ,
.
Instead of <SOURCES>
, data can also be imported by an ongoing group by with the FROM GROUPED <EXP>
command. (For more informations see Group-By)
Currently the following sources are implemented:
FROM FILE "<FILE>"
parses a single file in the path <FILE>
. (Supports sampling)
FROM FILES "<DIR>"
parses all files in the directory <DIR>
. (Supports sampling)
FROM URL "<URL>"
parses documents returned by a GET call to <URL>
.
Some sources may be sampled, to decrease memory footprint and/or improve query times.
To sample sources SAMPLE X
has to be appended to a <SOURCE>
command. X
has to be a value between (0,1).
Keep in mind that sampling only skips documents in the sources. All results will only represent these sources.
Currently, no further extrapolations are performed by the system to project aggregations like sums or similar functions.
CHOOSE <PRED>
selects JSON documents according to the given predicates. If this command is not supplied, all documents will be retrieved.
The predicate can consist of the following:
<PRED> && <PRED>
: AND (Both predicates have to be fulfilled)<PRED> || <PRED>
: OR (One of the predicates have to be fulfilled)!<PRED>
: NEGATE (Returns the opposite of the given predicate)(<PRED>)
: BRACKET (Predicates can be bracketed)<VAR> <CMP> <VAR>
: COMPARE (Compares the given variables)<VAR>
: Single variable/functionVariables can be any of the following:
1
, +3.14
, -284674
)"Hello World!"
)true
, false
)Comparisons can be any of the following:
==
!=
>=
>
<=
<
Each does what is expected. For comparisions the following rules apply:
Pointer evaluate to the needed type. If the given pointer does not exist or the evaluated type is not compatible, then false is returned.
Every comparison may be applied to two strings, but they may not be compared to anything else.
Every comparison may be applied to two numbers, but they may not be compared to anything else.
Booleans can only be checked for equality or unequality and may not be compared to anything else.
AS <PROJ>
projects the chosen documents into the with <PROJ>
given form. If this command is not supplied, the documents will stay as they are.
The projection command has the following syntax:
<TOPOINTER>
: JSON-Pointer, see section JSON-Pointer, in which the supplied source will be stored
Source can be any of the following:
<FROMPOINTER>
: (Any)JSON-Pointer, see section JSON-PointerSet functions can provide multiple values, which results in duplication of the given document with different values in the specified attributes.
Currently the following set functions are implemented:
FLATTEN(<POINTER>)
: (Any) If <POINTER>
is of type array, each element contained within is distibuted over multiple documents.With the following JSON input:
‘AS (’/id':ID()),('/file':FILENAME()),('/arr_elt':FLATTEN('/sub/array')) ` could result in this JSON output:
AGG <AGG>
aggregates the results of the projection. If no aggregation is supplied, the results will be returned as is.
The aggregation command has the following syntax:
Source can be any of the following functions:
Aggregations can also be grouped by a specified value.
**Note: ** Only atomic data types (string, number, boolean) can be used to group. Arrays and objects are ignored.
With the following JSON input:
Query
Result
STORE <VAR>
stores the result of the query in variable <VAR>
.
STORE AS FILE "<FILE>"
writes the result of the query to <FILE>
.
STORE GROUPED BY <SOURCE>
groupes the dataset. (See Group-By)
DELETE <VAR>
removes the result stored in <VAR>
.
Joda has the ability to group multiple documents together, depending on a value.
This is achieved by storing a dataset with the GROUPED BY
clause.
STORE GROUPED BY <SOURCE>
groups all the documents, with the same value for <SOURCE>
, together. To retrieve these grouped results a new query is necessary.
LOAD <VAR> FROM GROUPED <SOURCE>
will retrieve previously grouped results.
Multiple datasets can be grouped together, by repeatedly executing a STORE
command with the same <SOURCE>
.
After the LOAD
statement, the grouping is complete and can no longer be referenced.
Source can be any of the following:
<FROMPOINTER>
: (Any)JSON-Pointer, see section JSON-PointerLets assume we have multiple messages replying to each other:
Now we might want to get all replies to the same message. This can be achieved by combining these two queries:
This will return:
FILENAME()
: Returns the filename (with full path), or "\[PROJECTION]" if the document was projected.FILEPOSSTART()
: Returns the starting position of the document, within it's file.FILEPOSEND()
: Returns the end position of the document, within it's file.ID()
: Returns the internal ID of the document.TYPE(Any)
: Returns the type of the given attribute ("OBJECT","ARRAY","NUMBER","STRING","BOOL","NULL");EXISTS(Any)
: Checks if the given attribute exists.ISX(Any)
: Checks if the given attribute has type X:ISNULL
ISBOOL
ISNUMBER
ISSTRING
ISOBJECT
ISARRAY
INT(Any)
: Returns the integer representation of the value. Floats are truncated, Strings are parsed as numbers if possible, and Boolean is converted to {1,0}.FLOAT(Any)
: Returns the float representation of the value. Strings are parsed as numbers if possible, and Boolean is converted to {1,0}.STRING(Any)
: Returns the string representation of the value. Only works for non-object/array values.LEN(String)
: Returns the length of the string.LOWER(String)
: Returns the string in lowercase.UPPER(String)
: Returns the string in uppercase.LTRIM(String)
: Trims spaces to the left of the string.RTRIM(String)
: Trims spaces to the right of the string.CONCAT(String,String)
: Concatenates two strings.SCONTAINS(String,String)
: Returns true if first string contains the second.STARTSWITH(String,String)
: Returns true if first string starts with the second.FINDSTR(String,String)
: Returns the position of the second string in the first string, or -1 if it is not contained.SUBSTR(String,Number,Number)
: Returns a substring of the given string. The first number is the starting position of the substring, the second (optional) number is the length of the substring. If the second is not given, everything until the end is returned.REGEX(String,String)
: Returns true if first string matches the regular expression defined in the second.REGEX_EXTRACT(String,String)
: Returns a list of strings within the first parameter, matched by the regular expression in the second.REGEX_REPLACE(String,String,String)
: Replaces all matches, within the first string, of the regular expression, defined in the second parameter, with the third string.SIZE(Array)
: Returns the size of the array.IN(Any,Array)
: Checks if parameter 1 is contained in the array given by parameter 2.MEMCOUNT(Object)
: Returns the number of members.LISTATTRIBUTES(Object)
: Returns all members in the given object.SUM(Number, Number)
: Calculates the sum of all arguments.SUB(Number, Number)
: Subtracts the arguments.PROD(Number, Number)
: Calculates the product of all arguments.DIV(Number, Number)
: Divides the arguments.MOD(Number, Number)
: Calculates the modulus.POW(Number, Number)
: Calculates the power.ABS(Number)
: Calculates the absolute.CEIL(Number)
: Calculates the ceiling.FLOOR(Number)
: Calculates the floor.TRUNC(Number)
: Truncates the value.ROUND(Number)
: Round the value.SQRT(Number)
: Calculates the square rootDEGREES(Number)
: Radians to degrees.RADIANS(Number)
: Degrees to radians.ACOS(Number)
: inverse cosineASIN(Number)
: inverse sineATAN(Number)
: inverse tangentATAN2(Number, Number)
: inverse tangent of y/xCOS(Number)
: cosineSIN(Number)
: sineTAN(Number)
: tangentPI()
: Returns PINOW()
: Returns the UNIX timestamp in milliseconds.SEQNUM()
: Returns a sequential number between 0 and the number of documents.SEQNUM()
calls within one query each return a number in the range independently from each other. (e.g.: ‘... AS (’/seq1':SEQNUM()),('/seq2':SEQNUM())could result in
{"seq1":1,"seq2":3}.
(Number)
HASH(Any)`: Returns the hash of the value.The parameters are defined as:
<NUMBER>
: (e.g.: 1
, +3.14
, -284674
)<STRING>
: (e.g.: "Hello World!"
)<BOOL>
: (true
, false
)Each parameter can be replaced by a JSON-Pointer. If the attributed does not exist or is of wrong type, the function will return false (or another default value).
JSON-Pointers are pointers to specific parts of a JSON document. The implementation conforms to the RFC 6901.
In this document:
The pointer ‘’/test/2/sub'` would point to the string "Hello World!".
The empty pointer ‘’'` points to the root object.
Loads all files in directory /home/user/jsondirectory
and selects only those with the value "Peter" for the attribute /message/sender
. From these selected documents the filenames, in which they where contained, is stored in /filename
. The resulting documents are then aggregated by choosing all distinct filenames and storing them as array in the attribute /files
.
The result is a single document, which is then written to the file /home/user/peters_files.json
. Lastly the variable A is removed, so all documents are purged from the system.
If you want to cite this project in your research, please use our ICDE 2020 demo paper.