
NXCALS-1139 Hide special characters encoding and decoding process from users

This task is related to the encoding of special characters, but it is independent of the encoding implementation itself.

So basically, regardless of which type of encoding we use, the user ends up fetching the dataset with the converted values. This is due to a missing conversion when we generate the schema of the datasets. There is a problem though: if we convert the schema of the original datasets to use the undecoded names, everything passes until we actually need to collect the data. This means that if we only do transformations and manipulations and fetch their result, it works fine (for instance, we can count the number of rows), BUT if we actually try to fetch a value from a column that needs to be decoded, it fails. The reason is unknown to me, as I have tried to debug this for the past 2 days without success (learned a lot about Spark though :p). The query executed by Spark is correct: it selects the original column name and then exposes it with the decoded name.

This is an example of the query extracted by Spark when the column exists in the time window:

SELECT (columnName AS String) AS decodedName

And outside the time window:

SELECT (null AS String) AS decodedName

The query is successful, BUT when we then try to fetch the data, the outcome depends on the schema of the dataset: if the name of the column there is the original name, it passes; if it is the undecoded name, it fails...

This might be hard to explain without context, so feel free to stop by and ask me to explain and show it; I can reproduce this 100% of the time, very easily, just by modifying one line of code.

In the middle of all this, and before I actually crumbled under the desperation of not understanding why this fails, I found an elegant, simple and good solution for the problem: LET'S TRICK THE USER.

As you might know, we use an implementation of BaseRelation to provide the user with our own data source and format. This means that we control an intermediary step between the user's request and the actual request, where we are free to make transformations on the data.

This means that the user does not actually operate on the real dataset; instead, they operate on our base relation, which in our case is backed by a dataset (without the user knowing it, as this is transparent to them).

Here I do 2 things. First, I trick the user: whenever the user asks for the schema, I convert the converted column names back to their original names. The user then sees the schema as if no transformation had been done, and makes queries using these original field names.

Second, whenever the user makes a query, I encode the illegal values again before submitting the actual action.
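
To make the two steps above easier to review, here is a minimal sketch of the idea in plain Python, without Spark. Everything in it is an assumption made for illustration: the `__x<hex>__` encoding, the `encode_name` / `decode_name` helpers and the `DecodingRelation` class are hypothetical stand-ins, not the actual encoding or the BaseRelation implementation in this merge request.

```python
import re

# Hypothetical encoding (NOT the real one): every character outside
# [A-Za-z0-9_] becomes "__x<hex code point>__".
# Limitation of this toy scheme: a name that already contains such a
# marker literally would not round-trip.
_ILLEGAL = re.compile(r"[^A-Za-z0-9_]")
_ENCODED = re.compile(r"__x([0-9a-f]+)__")


def encode_name(name):
    """Replace illegal characters with their encoded marker."""
    return _ILLEGAL.sub(lambda m: "__x%x__" % ord(m.group(0)), name)


def decode_name(name):
    """Turn encoded markers back into the original characters."""
    return _ENCODED.sub(lambda m: chr(int(m.group(1), 16)), name)


class DecodingRelation:
    """Stand-in for the intermediary layer between the user and the data.

    The backing data (a list of dict rows) only knows encoded names;
    the user only ever sees and uses the original names.
    """

    def __init__(self, encoded_columns, rows):
        self._encoded_columns = list(encoded_columns)
        self._rows = rows

    def schema(self):
        # Step 1: expose the schema with the original (decoded) names.
        return [decode_name(c) for c in self._encoded_columns]

    def select(self, requested):
        # Step 2: encode the requested names again before touching the data.
        encoded = [encode_name(c) for c in requested]
        missing = [c for c in encoded if c not in self._encoded_columns]
        if missing:
            raise KeyError("unknown columns: %s" % missing)
        # Results are keyed by the names the user asked for.
        return [
            {user: row.get(enc) for user, enc in zip(requested, encoded)}
            for row in self._rows
        ]


relation = DecodingRelation(
    ["timestamp", "device__x2e__value"],
    [{"timestamp": 1, "device__x2e__value": 42.0}],
)
user_schema = relation.schema()
user_rows = relation.select(["device.value"])
```

With this layering the encoded names never leave the relation: the schema the user reads and the column names the user writes in queries are both the original ones.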

Please take a look and comment!!

Edited by Tiago Martins Ribeiro
