Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Wednesday, May 04, 2016

ODXL - A generic Data Export Layer for SAP/HANA based on OData

I'm very pleased to be able to announce the immediate availability of the Open Data Export Layer (ODXL) for SAP/HANA!

Executive summary

ODXL is a framework that provides generic data export capabilities for the SAP/HANA platform. ODXL is implemented as a xsjs Web service that understands OData web requests, and delivers a response by means of a pluggable data output handler. Developers can use ODXL as a back-end component, or even as a global instance-wide service to provide clean, performant and extensible data export capabilities for their SAP/HANA applications.

Currently, ODXL provides output handlers for comma-separated values (csv) as well as Microsoft Excel output. However, ODXL is designed so that developers can write their own response handlers and extend ODXL to export data to other output formats according to their requirements.

ODXL is provided by Just BI to the SAP/HANA developer community as open source software under the terms of the Apache 2.0 License. This means you are free to use, modify and distribute ODXL. For the exact terms and conditions, please refer to the license text.

The source code is available on github. Developers are encouraged to check out the source code and to contribute to the project. You can contribute in many ways: we value any feedback, suggestions for new features, filing bug reports, or code enhancements.

If you require professional support for ODXL, please contact Just-BI for details.

What exactly is ODXL?

ODXL started as an in-house project at the Just-BI department of custom development. It was borne from the observation that the SAP/HANA web applications that we develop for our customers often require some form of data export, typically to Microsoft Excel. Rather than creating this type of functionality again for each project, we decided to invest some time and effort to design and develop this solution in such a way that it can easily be deployed as a reusable component. And preferably, in a way that feels natural to SAP/HANA xs platform application developers.

What we came up with, is a xsjs web service that understands requests that look and feel like standard OData GET requests, but which returns the data in some custom output format. ODXL was designed to make it easily extensible so that developers can build their own modules that create and deliver the data in whatever output format suits their requirements.

This is illustrated in the high-level overview below:



For customers of Just-BI, there is an immediate requirement to get Microsoft Excel output. So, we went ahead and implemented output handlers for .xlsx and .csv formats, and we included those in the project. This means that ODXL supports data export to the .xlsx and .csv formats right out of the box.

However, support for any particular output format is entirely optional and can be controlled by configuration and/or extension:
  • Developers can develop their own output handlers to supply data export to whatever output format they like.
  • SAP/HANA Admins and/or application developers can choose to install only those output handlers they require, and configure how Content-Type headers and OData $format values map to output handlers.

So ODXL is OData? Doesn't SAP/HANA suppport OData already?

The SAP/HANA platform provides data access via the OData standard. This facility is very convenient for object-level read- and write access to database data for typical modern web applications. In this scenario, the web application would typically use asynchronous XML Http requests, and data would be exchanged in either Atom (a XML dialect) or JSON format.

ODXL's primary goal is to provide web applications with a way to export datasets in the form of documents. Data export tasks typically deal with data sets that are quite a bit larger than the ones accessed from within a web application. In addition, an data export document might very well compromise multiple parts - in other words, it may contain multiple datasets. The typical example is exporting multiple lists of different items from a web application to a workbook containaing multiple spreadsheets with data. In fact, the concrete use case from whence ODXL originated was the requirement to export multiple datasets to Microsoft Excel .xlsx workbooks.

So, ODXL is not OData. Rather, ODXL is complementary to SAP/HANA OData services. That said, the design of ODXL does borrow elements from standard OData.

OData Features, Extensions and omissions

ODXL GET requests follow the syntax and features of OData standard GET requests. Here's a simple example to illustrate the ODXL GET request:
GET "RBOUMAN"/"PRODUCTS"?$select=PRODUCTCODE, PRODUCTNAME& $filter=PRODUCTVENDOR eq 'Classic Metal Creations' and QUANTITYINSTOCK gt 1&$orderby=BUYPRICE desc&$skip=0&$top=5
This request is build up like so:
  • "RBOUMAN"/"PRODUCTS": get data from the "PRODUCTS" table in the database schema called "RBOUMAN".
  • $select=PRODUCTCODE, PRODUCTNAME: Only get values for the columns PRODUCTCODE and PRODUCTNAME.
  • $filter=PRODUCTVENDOR eq 'Classic Metal Creations' and QUANTITYINSTOCK ge 1: Only get in-stock products from the vendor 'Classic Metal Creations'.
  • $orderby=BUYPRICE desc: Order the data from highest price to lowest.
  • $skip=0&$top=5: Only get the first five results.
For more detailed information about invoking the odxl service, check out the section about the sample application. The sample application offers a very easy way to use ODXL for any table, view, or calculation view you can access and allows you to familiarize yourself in detail with the URL format.

In addition, ODXL supports the OData $batch POST request to support export of multiple datasets into a single response document.

The reasons to follow OData in these respects are quite simple:
  • OData is simple and powerful. It is easy to use, and it gets the job done. There is no need to reinvent the wheel here.
  • ODXL's target audience, that is to say, SAP/HANA application developers, are already familiar with OData. They can integrate and use ODXL into their applications with minimal effort, and maybe even reuse the code they use to build their OData queries to target ODXL.
ODXL does not follow the OData standard with respect to the format of the response. This is a feature: OData only specifies Atom (an XML dialect) and JSON output, whereas ODXL can supply any output format. ODXL can support any output format because it allows developers to plug-in their own modules called output handlers that create and deliver the output.

Currently ODXL provides two output handlers: one for comma-separated values (.csv), and one for Microsoft Excel (.xlsx). If that is all you need, you're set. And if you need some special output format, you can use the code of these output handlers to see how it is done and then write your own output handler.

ODXL does respect the OData standard with regard to how the client can specify what type of response they would like to receive. Clients can specify the MIME-type of the desired output format in a standard HTTP Accept: request header:
  • Accept: text/csv specifies that the response should be returned in comma separated values format.
  • Accept: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet specifies that the response should be returned in open office xml workbook format (Excel .xlsx format).
Alternatively, they can specify a $format=<format> query option, where <format> identifies the output format:
  • $format=csv for csv format
  • $format=xlsx for .xlsx format
Note that a format specified by the $format query option will override any format specified in an Accept:-header, as per OData specification.

ODXL admins can configure which MIME-types will be supported by a particular ODXL service instance, and how these map to pluggable output handlers. In addition, they can configure how values for passed for the $format query option map to MIME-types. ODXL comes with a standard configuration with mappings for the predefined output handlers for .csv and .xlsx output.

On the request side of things, most of OData's features are implemented by ODXL:
  • The $select query option to specify which fields are to be returned
  • The $filter query option allows complex conditions restricting the returned data. OData standard functions are implemented too.
  • The $skip and $top query options to export only a portion of the data
  • The $orderby query option to specify how the data should be sorted
ODXL currently does not offer support for the following OData features: The features that are currently not supported may be implemented in the future. For now, we feel the effort the implement them and adequately map their semantics to ODXL may not be worth the trouble. However, an implementation can surely be provided should there be sufficient interest from the community.

Installation

Use ODXL presumes you already have a SAP/HANA installation with a properly working xs engine. You also need HANA Studio, or Eclipse with the SAP HANA Tools plugin installed. The steps are a little bit different, depending on whether you just want to use ODXL, or whether you want to actively develop the ODXL project.

Here are the steps if you just want to use ODXL, and have no need to actively develop the project:
  1. In HANA Studio/Eclipse, create a new HANA xs project. Alternatively, find an existing HANA xs project.
  2. Find the ODXL repository on github, and download the project as a zipped folder. (Select a particular branch if you desire so; typically you'll want to get the master branch)
  3. Extract the project from the zip. This will yield a folder. Copy its contents, and place them into your xs project directory (or one of its sub directories)
  4. Activate the new content.
After taking these steps, you should now have a working ODXL service, as well as a sample application. The service itself is in the service subdirectory, and you'll find the sample application inside the app subdirectory.

The service and the application are both self-contained xs applications, and should be completely independent in terms of resources. The service does not require the application to be present, but obviously, the application does rely on being able to call upon the service.

If you only need the service, for example, because you want to call it directly from your own sample application, then you don't need the sample application. You can safely copy only the contents of the service directory and put those right inside your project directory (or one of its subdirectories) in that case. But even then, you might still want to hang on to the sample application, because you can use that to generate the web service calls that you might want to do from within your application.

If you want to hack on ODXL then you might want to fork or clone the ODXL github repository. If you do this inside a SAP/HANA xs project, or if you create a project pointing to that location, you can then deploy that to SAP/HANA and use that to send pull requests in case you want to contribute your changes back into the project.

Getting started with the sample application

To get up and running quickly, we included a sample web application in the ODXL project. The purpose of this sample application is to provide an easy way to evaluate and test ODXL.

The sample application lets you browse the available database schemas and queryable objects: tables and views, including calculation views (or at least, their SQL queryable runtime representation). After making the selection, it will build up a form showing the available columns. You can then use the form to select or deselect columns, apply filter conditions, and/or specify any sorting order. If the selected object is a calculation view that defines input parameters, then a form will be shown where you can enter values for those too.

In the mean while, as you're entering options into the form, a textarea will show the URL that should be used to invoke the ODXL service. If you like, you can manually tweak this URL as well. Finally, you can use one of the download links to immediately download the result corresponding to the current URL in either .csv or .xlsx format.

Alternatively, you can hit a button to add the URL to a batch request. When you're done adding items to the batch, you can hit the download workbook button to download as single .xlsx workbook, containing one worksheet for each dataset in the batch.

What versions of SAP/HANA are supported?

We initially built and tested ODXL on SPS9. The initial implementation used the $.hdb database interface, as well as the $.util.Zip builtin.

We then built abstraction layers for both database access and zip support to allow automtic fallback to the $.db database interface, and to use a pure javascript implementation of the zip algorithm based on Stuart Knightley's JSZip library. We tested this on SPS8, and everyting seems to work fine there.

We have not actively tested earlier SAP/HANA versions, but as far as we know, ODXL should work on any earlier version. If you find that it doesn't, then please let us know - we will gladly look into the issue and see if we can provide a solution.

Why Open Source? What's the Business Model? What's the catch?

For Just BI, Open Source software is not a business model, but a development model. While some companies build a successful business model around selling custom code code, this is currently not Just-BI's primary goal. Rather, Just-BI is a consulting company that focuses mainly on Business Intelligence solutions around the SAP ecosystem. Our areas of expertise include Business Objects, SAP BW, SAP HANA, as well as custom BI (web) Helping customers by providing solutions for their business problems is Just-BI's primary concern - not selling code.

However, we do acknowledge that sometimes, custom code plays an essential role in building a business solution for our customers. In these cases, we will gladly help our customers to design, build and deploy such solutions. But even in these cases we will try to look for standard component toolkits, like SAP UI5, or frameworks like Angular as a basis for our work.

The urge to standardize on familiar, well known toolkits and libraries hardly needs justification. In the end, customers don't have the end goal of acquiring and owning too many custom coded solutions, because today's hot new custom solution is tomorrow's legacy. The more a customer relies on custom code, the harder it will become to maintain and to move forward.

Sometimes, a particular building block that we need for applications may not be publicly available already. If such a building block is sufficiently generic (i.e., not bound to any particular customer) then we have every reason to want that to become a standard. For a generic and reusable component like ODXL, we believe that an open source model is the right way to do that.

We think that an open source development model will help maintain and advance ODXL. By using an open source release and development model, we have potentially more eyes to scrutinize our code, find bugs, suggest features, etc. In addition we hope our customers will feel more confident to embrace an open source solution, since they need not be locked into only our company for support and ongoing development.

How to Contribute

If you want to, there are many different ways to contribute to ODXL.
  1. If you want to suggest a new feature, or report a defect, then please use the github issue tracker.
  2. If you want to contribute code for a bugfix, or for a new feature, then please send a pull request. If you are considering to contribute code then we do urge you to first create an issue to open up discussion with fellow ODXL developers on how to best scratch your itch
  3. If you are using ODXL and if you like it, then consider to spread the word - tell your co-workers about it, write a blog, or a tweet, or a facebook post.
Thank you in advance for your contributions!

Finally

I hope you enjoyed this post! I hope ODXL will be useful to you. If so, I look forward to getting your feedback on how it works for you and how we might improve it. Thanks for your Time!

Sunday, March 20, 2016

Installing the Open Source Xavier XML/A client on the Jedox Premium OLAP Suite

Jedox is a software vendor that specializes in OLAP services and solutions. The company has been around quite a while and is probably best known for their PALO MOLAP engine and the matching add-in for Microsoft Excel.

Jedox' flagship product, Jedox Premium comprises the Palo MOLAP engine, API's, a REST server, and ETL server, and client tools. It also comes with a MDX interpreter and a XML for Analysis server. An interesting tidbit is that the MDX layer is not considered native, and Jedox' own clients use a lower level API, or address it via the REST service.

In this blog post I will explain how to install and configure the Open Source browser-based ad-hoc query and analysis tool Xavier to use it with Jedox. A video of the process is embedded below:


Here's a written list of instructions to get up and running with Xavier and Jedox:
  1. Download Jedox Premium. Run the downloaded installer to actually install the product. By default, it will be installed in C:\Program Files (x86)\Jedox\Jedox Suite. In the remainder of this post, I will refer to this directory as "the Jedox Suite directory".
  2. Download xavier.zip. Unpack the zip. A xavier directory will be extracted.
  3. Stop the JedoxSuiteHttpdService. If you don't know about windows services, then look here.
  4. Copy the xavier directory that you extracted from xavier.zip into the Jedox Suite\httpd\app\docroot directory.
  5. Open the Jedox Suite/httpd/conf/httpd.conf file in a text editor. You should probably make a backup copy of the httpd.conf file before editing it so you can always revert your changes.
  6. Add a line to load the HTTP proxy module. To do that, search the httpd.conf file for a bunch of lines that start with LoadModule. Look for a line that reads:

    LoadModule proxy_http_module modules/mod_proxy_http.so

    In my installation, the line is already present, like this:
    <IfDefine JDX_DEV>
    LoadModule log_config_module modules/mod_log_config.so
    LoadModule proxy_http_module modules/mod_proxy_http.so
    LoadModule setenvif_module modules/mod_setenvif.so
    </IfDefine>
        
    Now, what you'll want to do is to cut this line out of the <IfDefine JDX_DEV> block, and put it outside that block, for example, right before it, like this:
    LoadModule proxy_http_module modules/mod_proxy_http.so
    <IfDefine JDX_DEV>
    LoadModule log_config_module modules/mod_log_config.so
    LoadModule setenvif_module modules/mod_setenvif.so
    </IfDefine>
        
  7. Add a proxy configuration so that web applications deployed on the Apache HTTP server can access the Jedox XML/A service as if it lives in the same domain as the web application. To do that, add a Location directive at the end of the httpd.conf file, like this:
    <Location /xavier/Xmla>
      ProxyPass http://localhost:4242/xmla/
      ProxyPassReverse http://localhost:4242/xmla/
      SetEnv proxy-chain-auth
    </Location>
        
    This allows a web application on the Apache HTTP server to access the XML/A service via the URL /xavier/Xmla. By default, the place where the Jedox XML/A service lives is http://localhost:4242/xmla. You can verify this by crosscecking this with the configation in Jedox Suite\odbo\config.ini: the values for the MDXAddress and MDXPort should match the server and port in the URLs configured for ProxyPass and ProxyPassReverse
  8. Save the changes to your httpd.conf file, and start the JedoxSuiteHttpdService. If the service starts, you should be good to go. If it doesn't, check the Jedox Suite/log/apache_error.log file and see if you can find some information there that can help you troubleshoot your problem.
If all went well, you should now be able to navigate to http://localhost/xavier/resources/html/index.html and you should see the xavier welcome screen. Note that this assumes the Jedox HTTP server is running on its default port (80). If you chose another port for the HTTP server when installing Jedox, the URL for xavier would have to be amended respectively. For example, I chose port 8181, and hence my URL would be http://localhost/xavier/resources/html/index.html instead.

If you're in doubt what port you chose for your Jedox HTTP server, you can look it up in the Jedox Suite/httpd/conf/httpd.conf file. Look for a line that starts with Define JDX_PORT_HTTP. The port is specified right after that, enclosed in double quotes.

Saturday, March 19, 2016

Installing the Open Source Xavier XML/A client on icCube OLAP suite

IcCube is a relatively young Swiss company specialized in creating OLAP software products. Their flagship product is the icCube Suite, providing an OLAP server, an XML for Analysis service, and a bunch of browser-based client tools for modeling cubes and dimensions, and for querying, reporting and visualizing OLAP data.

When you download the icCube trial edition, you'll get a java web server with an already deployed instance of the icCube java web application. This set-up makes it super-easy to install any browser-based third party XML/A clients for icCube.

In this blog post I want to provide you with the instructions to deploy the ad-hoc OLAP query tool called Xavier directly on your icCube server. This allows you to do ad-hoc, drag 'n drop style reporting on any data exposed by the icCube server.

If you like, you can watch the video below to see how it's done. Detailed, step-by-step instructions appear in writing right below the video.



  • Go to the icCube download site. Find the appropriate version for your operating system, and click it to download the installer.
  • After downloading the icCube installer, run it. This will install icCube for you.
  • Download xavier.zip.
  • Unpack xavier.zip. Move the extracted xavier directory to web directory below the icCube home directory
  • Start icCube.
If all goes well you should now be able to navigate to xavier. If you installed icCube locally and kept the default ports, this would be the link: http://localhost:8282/xavier/resources/html/index.html

I hope this was useful for you!

Need a Mondrian .WAR? Check out XMondrian.

To whom it may concern, this is a quick note to bring the xmondrian project to your attention.

Introduction: Open Source OLAP, Mondrian, Pentaho, and JasperSoft

Mondrian is the open source OLAP engine. Mondrian provides:
  • a multi-dimensional view of a relational database (ROLAP)
  • a MDX query engine
  • Clever, advanced caching layers to speed up OLAP query performance (making it a MOLAP/ROLAN hybrid i.e., HOLAP)
  • Standards compliant OLAP data access by providing XML for Analysis (XML/A) and OLAP4J access APIs
Mondrian was designed and invented by Julian Hyde, who acted as technical and architectural lead of the Mondrian project for many years.

Mondrian was adopted by Pentaho, and is included in the Pentaho BI Stack as Pentaho Analysis Services. Mondrian is also the OLAP engine that ships with the Tibco/JasperSoft Reporting server, and with Meteorite BI's Saiku product.

Running Mondrian Standalone

While Pentaho, Jaspersoft and Meteorite all do a good job of integrating Mondrian inside their respective BI servers, some people would like to run only Mondrian directly in their java servers. The Mondrian project used to make that quite easy, since it shipped a .WAR (web-archive) file containing Mondrian itself, documentation, sample cubes, and the JPivot mondrian client.

Unfortunately, the Mondrian project stopped supporting the .WAR and sample content. This happpened a while ago already, but there are still people that are finding out about it only now. This might have to do with the fact that the Mondrian documentation has not been very well maintained and still refers to the .WAR as if it is part of the Mondrian project.

Introducing XMondrian

I felt the need to have a Mondrian .WAR myself. Main reason is that I created a couple of OLAP client tools myself, and I want to provide potential users with a quick and easy path to check them out. So, I decided to pack them all in a .WAR, together with Mondrian, the Foodmart Sample cube, and an embedded dataset.

The result is called xmondrian which you can find on github.

Getting started with XMondrian

Getting started with XMondrian is easy:
  • Download the .WAR file
  • Deploy to your java server. In theory, the process to do that will be dependent upon which webserver you are running. I tried with Apache Tomcat, Jetty, and Tiny Java Web Server, and for all these products you can simply copy the .WAR to the webapps directory
  • Find the XMondrian homepage by navigating your browser to the xmondrian webapp. For example, suppose you installed Tomcat or Jetty locally, using the default port of 8080, then http://localhost:8080/xmondrian will bring you there.

What's inside XMondrian

Once you're on the XMondrian homepage, you can find more information about what's inside, but I'll summarize below:
  • Mondrian 3.12
  • A web.xml to instantiate and hook up the MondrianXmlaServlet. After installation of xmondrian, your webserver can receive XML/A requests via /xmondrian/xmla
  • HSQLDB embedded database engine
  • Sample Datasets and Schemas Both the Foodmart and Steelwheels datasets are included as embedded hsqldb database in a .jar file. There are predefined Mondrian Schema files for each dataset as well, which specify how these databases are mapped to cubes, measures, dimensions, etc. Finally, there are datasource files that tell mondrian to connect to the sample database and use the respective schema file
  • xmla4js - A javascript XML/A client library. You can use this in browser-based web applications to communicate with Mondrian via the XML/A protocol. Xmla4js ships with code samples as well as API documentation
  • Client Applications
    • XMLash - XML/A Shell: an interactive MDX command line interface for inspecting Mondrian schema objects, an for creating and running MDX queries. (See a demonstration )
    • Xavier - XML/A Visualizer: an interactive OLAP ad-hoc reporting and charting tool with a graphical user interface

The XML/A Shell Application

:

The XML/A Visualizer Application

:

Finally

I hope this post was useful to you. Please let me know how you get along with the xmondrian .WAR. I'm open to suggestions and I would love to collaborate to make xmondrian better. Please use the github issue tracker to provide your feedback. Thanks for your time and interest.

Tuesday, December 29, 2015

jjsml: a Module Loader for the Nashorn JavaScript Shell

jjs is a JavaScript shell that ships with Oracle Java 1.8. I recently found myself in a situation where it seemed worth while to check it out, so I did. I do not want to use this post to elaborate too much on why I started looking at jjs, but I intend to write about that shortly. For now I just want to share a few observations, as well as a solution to a particular obstacle I encountered.

What is jjs?

Java 1.8 (both SDK and JRE) ships a new executable called jjs. This executable implements a command-line JavaScript shell - that is, a program that runs from the command line and which "speaks" JavaScript. It is based on Nashorn - a JavaScript engine written in Java and first released with Java 1.8.

Nashorn can be used to offer embedded JavaScript functionality in Java applications; jjs simply is such an application that happens to be desigend as shell.

Why do I think jjs is cool?

The concept of a JavaScript shell is not at all new. See this helpful page from Mozilla for a comprehensive list. The concept of a JavaScript engine written in Java is also not new: The Rhino engine has been shipping with Java since version 1.6.

Even so, I think jjs has got a couple of things going for it that makes it uniquely useful:
jjs is an executable shell
Even though Java shipped a javascript engine since Java 1.6, (the Rhino engine), you'd still have to get a shell program externally.
Ships with Java
This means it will be available literally everywhere. By "available everywhere" I don't just mean that it will be supported on many different operating systems - I mean that it will in fact be installed on many actual systems, and in many cases it will even be in the path so that you can simply run it with no extra configuration or setup. This is not the case for any other javascript shell.
Java Integration
jjs comes with extensions that make it really quite simple to instantiate and work with Java classes and objects (but of course, you get to do it using JavaScript syntax). This may not mean much if you're not familiar with the Java platform. But if you are it means you get to use tons and tons of high-level functionality basically for free, and you get to use it through an interface you are already familiar with. I'm sure other JavaScript shells allow you to create your own extensions so you can make external functionality scriptable, but then you'd still need to create such a module, and you'd need to invent the way how to bind that external functionality to shell built-ins. Because Java and JavaScript both support an objected-oriented language paradigm the integration is virtually seamless.
It's JavaScript!
To some, this may sound hardly as an argument in favor, but to me, it is. While Java is great in that it offers so much functionality, I just can't comfortably use it for quick and dirty programs. There are many reasons why this is so - very verbose and explicit syntax, compilation, static typing etc. I'm not saying these language features are categorically bad - just that they get in the way when you just want to make a small program real quick. I really think JavaScript helps to take a lot of that pain away.
Scripting extensions
In addition to offering JavaScript with Java language integration, jjs throws in a couple of features that are neither Java nor JavaScript: String Interpolation and Here document support. Together, these features make it very easy to use jjs for templating and text output generation. Again such features themselves are very far from unique. But in my opinion, they complement the Java/JavaScript language environment in a pretty significant way and greatly increase its ease of use.
So, in short, I'm not saying any particular feature is unique about jss. But I do think the particular combination of features make it a very interesting tool, especially for those that are used to Java and/or JavaScript and that have the need for ubiquitous and reliable shell scripting.

Quick example

The Nashorn Manual already provides pretty complete (but concise) documentation with regard to both the Java language integration and the typical shell-scripting features, so I won't elaborate too much on that. Instead, I just want to provide a really simple example to immediately give you an idea of how jjs works, and what it would be like to write scripts for it.

Here's my example script called "fibonacci.js":
//get the number of arguments passed at the command-line
var n = $ARG.length;
if (n === 0) {
  //if no arguments were passed, we can't calculate anything, so exit.
  print("Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.");
  exit();
}

//a function to calculate the n-th number in the fibonacci sequence
function fibonacci(n){
  var i, fibPrev = 0, fibNext = 1;
  if (n === 1) return fibPrev;
  if (n === 2) return fibNext;
  for (i = 2; i < n; i++, fibNext = fibPrev + (fibPrev = fibNext));
  return fibNext;
}

var i, arg;

//process the arguments passed at the command line:
for (i = 0; i < n; i++) {
  arg = $ARG[i];

  //validate the argument. Is it a positive integer?
  if (!/^[1-9]\d*$/g.test(arg)) {

    //Not valid, skip this one and report.
    print(<<EOD
Skipping "${arg}": not a positive integer.
    EOD);
    continue;
  }

  //Calculate and print the fibonacci number.
  print(<<EOD
fibonacci(${arg}): ${fibonacci(parseInt(arg, 10))}
  EOD);
}
As you might have guessed the script calculates the fibonacci number corresponding to the sequence number(s) entered by the user on the command-line. While this may not be a very useful task, this example does illustrate the most important elements that are key to typical shell scripts:
  • Collect and validate command-line arguments passed by the user
  • Process the arguments - use them to do some work deemed useful
  • Print out information and results
Assuming that jjs is in the path, and fibonacci.js is in the current working directory, then we can execute it with this command:
$ jjs -scripting fibonacci.js
There are three parts to this command: jjs, the actual executable program, the -scripting option, and finally fibonacci.js, which is the path to our script file.

I think the executable and the script file parts are self-explanatory. The -scripting option may need some explanation: without this, jjs only supports a plain, standard javascript environment without any extensions that make it behave like a proper shell-scripting environment. Personally I think it's a bit annoying we have to explicitly pass it, but such is life. Just make sure you always pass it.

On *nix based systems, you can even make the script itself executable by prepending the code with a shebang, like so:
#!/usr/bin/jjs
When we run it, the output is:
Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.
Which is not surprising, since we didn't actually provide any input arguments. So let's fix that:
$ jjs -scripting fibonacci.js -- 1 2 3 4 5 6
fibonacci(1): 0
fibonacci(2): 1
fibonacci(3): 1
fibonacci(4): 2
fibonacci(5): 3
fibonacci(6): 5
Note the -- delimiter in the command line. This is a special token that tells the jjs executable to treat anything that appears after the -- token as command line arguments targeted at the script. We will see in a bit how you can access those arguments programmatically from within your JavasSript code.

Now, it's not necessary to analyze every aspect of this script. Instead, let's just highlight the features that are specific to the jjs environment itself. In the code sample above, I marked up those elements that are specific to Nashorn/jjs in lovely fuchsia
  • We can use the built-in global $ARG property to refer to the arguments passed at the command line. $ARG is essentially an array that will contain any token appearing after the -- delimiter on the jjs command-line is an element. Since $ARG can be treated as an array, we can use its length property to see how many arguments we have, and we can use the array access operator (square braces) to get an element at a particular index.
  • We can use the built-in print() function to write to standard out.
  • We can call the built-in function exit() to terminate the jjs program. This will exit the shell, and return you to whatever program spawned jjs.
  • Here documents are strings delimited by <<EOD and EOD delimiters. At runtime, Here documents are simply javascript strings, but their syntaxis allows the string to span multiple lines, and to contain unescaped regular string literal delimiters (single and double quotes, " and '). This makes Here documents extremely useful for generating largish pieces of text.
  • Literal strings and here documents can contain ${<expression>} placeholders, and jjs will automatically substitute these with the value of the expression. Expressions can be simple variable names, but also more complex expressions, like function calls, or operations.

Next step: reusing existing functionality with load()

We already mentioned that the program does three distinct things (collect input, process input to create output, and print output). Obviously, these different tasks are related or else they wouldn't have been in the same script. But at the same time, we can acknowledge that the fibonacci() function that does the actual work of calculating numbers from the Finonacci sequence might be usable by other scripts, whereas collecting input and printing output (let's call it the user-interface) are quite specific for exactly this program.

The desire to separate these concerns into distinct scripts is only natural and will become stronger as programs grow bigger and more complex. So let's cut it up.

For now let's assume we only want to be able to reuse the fibonacci() function in other scripts. In order to do that we at least have to separate that in put it in a new script. Let's call that fibonacci.js, and let's store it in some central directory where we store all of our reusable scripts - say, scripts/fibonacci.js. Handling of both input and output can stay together in the same script for now - let's just call that fibonacci-ui.js.

So, in order to actually make this happen we can rename the orignal file to fibonacci-ui.js and create a new and empty fibonacci.js which we store in the scripts directory. We can then cut the code for the entire fiboncci() from fibonacci-ui.js and paste it into fibonacci.js.

Of course, the code in fibonacci-ui.js is now missing its definition of the fibonacci() function. But since it is still calling the function, we somehow need to import the function definition from scripts/fibonacci.js. To do that, the Nashorn scripting environment provides a built-in function called load().

The load() function takes a single string argument, which represents a file path or a URL that identifies a script. Load will evaluate the script immediately, and it returns the evaluation result. (It may be bit strange to think of the evaluation result of script, but this is simply the value of the last expression. It need not concern us right now since the fibonacci() function is global anyway - we can call it regardless of whether we have the evaluation result of scripts/fibonacci.js or not.)

So, putting it all together, we now have scripts/fibonacci.js:
//a function to calculate the n-th number in the fibonacci sequence
function fibonacci(n){
  var i, fibPrev = 0, fibNext = 1;
  if (n === 1) return fibPrev;
  if (n === 2) return fibNext;
  for (i = 2; i < n; i++, fibNext = fibPrev + (fibPrev = fibNext));
  return fibNext;
}
and fibonacci-ui.js:
//get the number of arguments passed at the command-line
var n = $ARG.length;
if (n === 0) {
  //if no arguments were passed, we can't calculate anything, so exit.
  print("Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.");
  exit();
}

//acquire the fibonacci function
load("scripts/fibonacci.js");

var i, arg;

//process the arguments passed at the command line:
for (i = 0; i < n; i++) {
  arg = $ARG[i];

  //validate the argument. Is it a positive integer?
  if (!/^[1-9]\d*$/g.test(arg)) {

    //Not valid, skip this one and report.
    print(<<EOD
Skipping "${arg}": not a positive integer.
    EOD);
    continue;
  }

  //Calculate and print the fibonacci number.
  print(<<EOD
fibonacci(${arg}): ${fibonacci(parseInt(arg, 10))}
  EOD);
}
Assuming that jjs is in the path, and the directory where we stored fibonacci-ui.js is also the current working directory, we can now run it like this:
$ jjs -scripting fibonacci-ui.js -- 1 2 3 4 5 6
If all went well, we would get the same output as we got before.

The problem with load()

While it is cool that we can load external scripts with the built-in load() function, not all is well. We just ran our fibonacci-ui.js script while our current working directory was also the directory where our script resides. Suppose that, for some reason, our current working directory is the scripts directory - i.e. in a subdirectory of the directory where fibonacci-ui.js resides.

Obviously, we need to modify our command line to point to the right location of fibonacii-ui.js, like so:
$ jjs -scripting ../fibonacci-ui.js -- 1 2 3 4 5 6
But if we run this, we get an error message:
../fibonacci-ui.js:7 TypeError: Cannot load script from scripts/fibonacci.js
This tells us is that in our entry point script, ../fibonacci-ui.js, an error occurred at line 7. This is our call to load() and the complaint is that it cannot seem to find the external script file as it was passed to load, scripts/fibonacci.js.

So, it looks like load() resolves relative paths against the current working directory. This is a bit of a bummer since it poses a serious challenge to creating reusable, portable scripts. What I would find more intuitive is if load() would resolve relative paths against the directory of the current script (that is, the directory where the script that calls out to load() resides).

Alas, it does not so we have to find a solution.

UPDATE: @sundararajan_a kindly pointed me to a page in the Open JDK wiki which explains that there actually does exist a solution. Nashorn/jjs provides the built-ins __DIR__ and __FILE__ which hold the directory and file of the current script. There's also a __LINE__ built-in which holds the current linenumber in the script. So we can write load(__DIR__ + relative_path) to resolve a relative path against the directory of the current script.


Without any extra measures, I think there are only two possible "solutions":
  • Abstain from using relative paths - only pass absolute paths to load().
  • Write your entry point scripts with one fixed working directory in mind, and ensure that is the current working directory before running your script with jss.
Neither of these "solutions" should make you happy: requiring a fixed directory for all reusable scripts means our reusable scripts aren't portable. We can't just demand that the location of our reusable scripts are the same, regardless of platform. And even if we could, we would be unable to have multiple copies (say, one copy for active development, one for production purposes) of our reusable scripts on one system. So clearly this is solution is not without its own share of problems. Requiring a fixed working directory for our top-level entry point scripts is maybe a little bit better, but would require some external solution to ensure the correct working directory for our script.

Despite the possibility of using these workarounds, there is one particular scenario that I think just can't be solved - at least, not without a way to resolve relative paths against the location of the "current" script. Suppose you would create a script that is intended for reuse. But now suppose that this script itself relies on functionality from other reusable scripts. Either this script chooses to use absolute paths to acquire the scripts it depends upon, in which case the entire solution becomes unportable. Or this script would require whichever script that needs to use it to first set the current working directory so that correct loading of its dependencies is ensured.

No matter which way I look at it, load() puts us between a rock and a hard place.

A better load()?

Let's imagine for a moment we can write "a better load()" - that is, a function that works just like the built-in load(), but which resolves any relative paths against "the current script". This function would also need to have some sort of starting point - that is, it needs to know the current working directory. For now let's call this function a_better_load().

As it turns out, all the elements that we need to actually build a_better_load() are present in the jjs/Nashorn scripting environment. In pseudo code it would look something like this:
//only create the function if it doesn't already exist
if (!a_better_load) {

  //create a stack to keep track of the current working directory
  //$PWD is provided by jjs and contains the current working directory
  cwd = [$PWD];

  function a_better_load(path){

    //separate path in a directory and a file
    dir = getDir(path);
    file = getFile(path);

    if (path is absolute) {

      //path is absolute - push it unto the cwd stack so that any load requests 
      //that might occur inside the script we are now loading get resolved against 
      //the absolute path of this current script.
      cwd.push(dir)

    }
    else {
  
      //path is relative - take the path that is currently on top of the cwd stack,
      //and append the relative dir path
      //This should be the dir of the current script.
      cwd.peek() += dir

    }
 
    //use the built-in load to actually acquire the external script;
    ret = load(cwd.peek() + file);

    //after loading, we have to restore the state of the cwd stack

    if (path is absolute) {

      //the absolute path of the current script is on top of the stack.
      //remove it.
      cwd.pop()

    }
    else {
  
      //path is relative - take the path on top of the stack and remove
      //the part of the path that was added by the relative dir of the 
      //current script
      cwd.peek() -= dir

    }

    //return the evaluated script, just like load() would do.
    return ret;
  }

}

jjsml: A module loader for jjs

I actually implemented something like a_better_load() and then I started to think about the problem a bit more. Sure, a_better_load() sovles the script portability problem. But it adds a new global, and it would require all scripts to use this instead of plain, built-in load(). Alternatively, I could modify the script for a_better_load() a bit more and actually overwrite the built-in load(). But this would actually be worse in a way, since I would then need to distinguish between those scripts that know about this modified behavior of load() and those that, for some reason or other rely on the default behavior of load() (which would basically be any third party script).

I ended up creating a solution familiar from the browser world - a module loader like RequireJS. The solution is called jjsml and you can find it here on github: https://github.com/rpbouman/jjsutils/blob/master/src/jjsml/jjsml.js.

You might wonder what a module loader is, and why you'd need it - indeed, it may seem we already solved the problem by creating a better load(), why would we need to introduce a new concept?

I just identified a couple of problems by introducing a_better_load(). It seems self-evident to me that no matter what solution we end up with, new scripts would need to use it and become dependent upon it to actually benefit from it. This would even be the case if we'd overwrite the built-in load() function, which seems like a good argument to me to not do such a thing at all, ever, since a visible dependency is much better than a magical, invisible one.

So, if we're going to need to accept that we'd have to write our scripts to explicitly use this solution, then we'd better make sure the solution is attractive as possible, and offers some extra functionality that makes sense in the context of external script and dependency loading. I'm not sure if I succeeded, but check out the next section and let me know in the comments what you think.

Using jjsml

jjsml.js provides a single new global function called define(). The name and design was borrowed directly from RequireJS and it is called like that because it's purpose is to define a module. A module is simply some bag of related functionalities and it may take the form of an object, or a function. Instead of being a referenceable value itself, the module may even manifest itself simply by exposing new globals. Defining a module simply means creating it, and doing everything that presumes creating it, such as loading any dependencies and running any initialization code.

The define() function has the following signature:
<module> define([<scriptPath1> ,..., <scriptPathN>], <moduleConstructor>) 
  • <module> is a module - typically an object or a function that provides some functionality.
  • <scriptPath> is either a relative or an absolute path that points to a script. Ideally loading this script would itself return a module. Note that the square braces in the signature indicate that the argument is optional, and can be multiple - In other words, you can pass as many (or as little) of these as you like. Currently, define() does not handle an array of dependencies (but I should probably modify it so that it does. At least, RequireJS does it that way too.)
  • <moduleConstructor> is a module constructor - some thing that actually creates the module, and typically runs the code necessary to initialize the module.
While this describes it in a very generalized way, there are a few more things to it to get the most out of this design:
  • An important point to make is that the various <scriptPath>'s are guaranteed to be loaded prior to evaluating the <moduleConstructor>. The idea is that each <scriptPath> represents a dependency for the module that is being defined.
  • In many cases, the <moduleConstructor> is a callback function. If that is the case, then this callback will be called after all dependencies are loaded, and those dependencies would be passed to the callback function as arguments, in the same order that the dependencies were passed. Of course, to actually use the dependencies in this way, evaluating the <scriptPath> should result in a referenceable value. So this works best if the <scriptPath> are themselves proper modules. The immediate advantage of this design is that no module ever really needs to add any new globals: any functionality provided by a module is, in principle, managed in isolation from any functionalities provided by any other modules.
  • Proper modules are loaded only once, and are then cached. Subsequent references (typically, because another module tries to load it as a dependency) to an already loaded module will be served from the cache. This ensures that no time is wasted loading and running modules, or worse, to mess up module initialization.

Modular fibobacci example

This may sound a bit abstract so here's a concrete example, based on our earlier Fibonacci example. Let's first turn scripts/fibonacci.js into a proper module:
(function(){

  //a function to calculate the n-th number in the fibonacci sequence
  function fibonacci(n){
    var i, fibPrev = 0, fibNext = 1;
    if (n === 1) return fibPrev;
    if (n === 2) return fibNext;
    for (i = 2; i < n; i++, fibNext = fibPrev + (fibPrev = fibNext));
    return fibNext;
  }

  return fibonacci;

})();
Note that the script itself is an anonymous function, that is called immediately. The purpose of this construct is to establish a scope wherein we can create any private variables and functions - things only our script can see but which are not visible from the outside. The only interaction this script has with its surroundings is via its return value. In this case, the return value is simply the fibonacci() function itself. Also note that this script does not actually run any code, apart from defining the function. That's because this module does not require any initialization, and does not rely on any private data. It simply provides a pure function, and that's that.

To modify our fibonacci-ui.js example accordingly, we could simply change this line:
//acquire the fibonacci function
load("scripts/fibonacci.js");
to this:
//acquire the fibonacci function
var fibonacci = load("scripts/fibonacci.js");
This is a small but crucial difference: in the setup we had earlier, scripts/fibonacci.js would create a new fibonacci() function as a global. Since scripts/fibonacci.js is now a proper module, it returns the function rather than adding a new global itself. So in order to use the modularized fibonacci() function, we capture the result of our call to load(), and store it in a local fibonacci variable.

However, this change only has to do with the modular design of our modified scripts/finbonacci.js script. It still uses the old, built-in load() function, and is this not portable.

To actually benefit from define(), we should slightly rewrite the fibonacci-ui.js script in this way:
(function(){

  define(
    "scripts/fibonacci.js",
    function(fibonacci){
      //get the number of arguments passed at the command-line
      var n = $ARG.length;
      if (n === 0) {
        //if no arguments were passed, we can't calculate anything, so exit.
        print("Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.");
        exit();
      }

      var i, arg;

      //process the arguments passed at the command line:
      for (i = 0; i < n; i++) {
        arg = $ARG[i];

        //validate the argument. Is it a positive integer?
        if (!/^[1-9]\d*$/g.test(arg)) {

          //Not valid, skip this one and report.
          print(<<EOD
Skipping "${arg}": not a positive integer.
          EOD);
          continue;
        }

        //Calculate and print the fibonacci number.
        print(<<EOD
fibonacci(${arg}): ${fibonacci(parseInt(arg, 10))}
        EOD);
      }
    }
  );

})();
Just like the modified scripts/fibonacci.js script, we wrapped the original code in an anonyumous function that is immediately called, thus keeping any variable and function definitions completely private and isolated from the global space. Inside that anonymous function, we have a single call to define(), passing the relative path to our reusable and modularized scripts/fibonacci.js script.

The last argument to define() is the module constructor. In this case, it is a callback function that will get called after the scripts/fibonacci.js dependency is loaded, and which is responsible for creating the actual program. The callback has a single argument, that directly corresponds to the dependency - when the callback is called, the fibonacci() function that was returned by the scripts/fibonacci.js script will be passed via this argument, and will this become available to the module constructor code.

Running scripts with jjsml

Suppose we acquired the jjsml.js script and stored it in the same directory as fibonacci-ui.js, then we can run the program using the following command line:
$ jjs -scripting -Djjsml.main.module=fibonacci-ui.js jjsml.js -- 1 2 3 4 5 6
You'll notice the same command line elements as we used before, plus one extra: -Djjsml.main.module=fibonacci-ui.js.

As you can see, in the command line, the actual script that jjs gets to run is jjsml.js, and not fibonacci-ui.js. Instead, fibonacci-ui.js is passed via the jjsml.main.module property of the java.lang.System object. You may recognize the -D prefix from other java programs: this is what you use to set a so-called system property, and this is what the jjsml.js script looks out for after attaching the define() function to the global object. If specified, the jjsml.js will attempt to load that as the initial script, i.e. the entry point of the shell scripted program.

Now, at this point, you may wonder - was all this really worth it? I would say it was, for the following reasons:
  • Modularization improved the quality of our scripts. Since they run in complete isolation there is no chance of undesired side effects
  • The fibonacci() function is now truly reusable, and any other script that may need it can order it as a dependency via a call to define(), and no matter how often it will be pulled in, it will be loaded exactly once
  • Last but not least, any relative paths used to identify dependencies will be resolved against the location of the script that pulls in the dependency, thus making all solutions completely portable
Arguably, our example was so trivially simple that we may not notice these benefits, but as you start to rely more on reusable scripts, and those scripts themselves start to depend on other scripts, you most surely will be happy that these things are managed for you jjsml.js.

Finally

I created jjsml.js not beacause of some theorethical principle, but because I really do need to write scripts that have dependencies, and I cannot afford to assume an absolute, fixed location for my scripts. You may have noticed that jjsml.js itself is part of a larger project called jjsutils. Inside this project are already a few reusable components (for example, for JDBC database access) as well as some top-level utilities. I plan to write more about jjsutils in the future.

In the mean while - let me know what you think! I really appreciate your comments, and you're free to check out and use both jjsml.js as well as the entire jjsutils project. There's API documentation, a few examples, and since the entire project is on github you can file issues or send me pull requests. Be my guest!

Sunday, November 01, 2015

MySQL: a few observations on the JSON type

MySQL 5.7 comes with built-in JSON support, comprising two major features: Despite being added rather recently (in MySQL 5.7.8 to be precise - one point release number before the 5.7.9 GA version), I feel the JSON support so far looks rather useful. Improvements are certainly possible, but compared to for example XML support (added in 5.1 and 5.5), the JSON feature set added to 5.7.8 is reasonably complete, coherent and standards-compliant.

(We can of course also phrase this more pessimistically and say that XML support falls short on these accounts, but that's not what this post is about :-)

There is potentially a lot to write and explain about the JSON support, and I can't hope to completely cover the subject in one blog post. Rather, I will highlight a few things I observed and I hope that this will be helpful for others that want to get started with JSON in MySQL 5.7.

Creating JSON values

There are a number of ways to create values of the JSON type:

Using CAST(... AS JSON)

CAST a value of any non-character string type AS JSON to obtain a JSON representation of that value. Example:
mysql> SELECT CAST(1 AS JSON), CAST(1.1 AS JSON), CAST(NOW() AS JSON);
+-----------------+-------------------+------------------------------+
| CAST(1 AS JSON) | CAST(1.1 AS JSON) | CAST(NOW() AS JSON)          |
+-----------------+-------------------+------------------------------+
| 1               | 1.1               | "2015-10-31 23:01:56.000000" |
+-----------------+-------------------+------------------------------+
1 row in set (0.00 sec)
Even though it may not be immediately clear from the result, the CAST operation actually turned these values into JSON equivalents. More about this in the next section.

If the value you're casting is of a character string type, then its value should be parseable as either a JSON object or a JSON array (i.e., JSON documents), as a JSON keyword indicating a built-in value, like null, true, false, or as a properly quoted JSON string value:
mysql> SELECT CAST('{}' AS JSON) object
    -> ,      CAST('[]' AS JSON) array
    -> ,      CAST('null' AS JSON) "null"
    -> ,      CAST('true' AS JSON) "true"
    -> ,      CAST('false' AS JSON) "false"
    -> ,      CAST('"string"' AS JSON) string
    -> ;
+--------+-------+------+------+-------+----------+
| object | array | null | true | false | string   |
+--------+-------+------+------+-------+----------+
| {}     | []    | null | true | false | "string" |
+--------+-------+------+------+-------+----------+
1 row in set (0.00 sec)
If the string is not parseable as JSON, you'll get a runtime error:
mysql> SELECT CAST('' AS JSON);
ERROR 3141 (22032): Invalid JSON text in argument 1 to function cast_as_json: "The document is empty." at position 0 in ''.
mysql> SELECT CAST('{]' AS JSON);
ERROR 3141 (22032): Invalid JSON text in argument 1 to function cast_as_json: "Missing a name for object member." at position 1 in '{]'.
Note that many keywords that might be valid in other environments, like NaN, Infinity, javascript built-in constructor fields like Number.EPSILON, and even undefined are *not* valid in this context. Remember - this is JSON, not javascript.

To get the JSON presentation of a plain, unquoted string value, you can use the JSON_QUOTE() function:
mysql> SELECT JSON_QUOTE(''), JSON_QUOTE('{]');
+----------------+------------------+
| JSON_QUOTE('') | JSON_QUOTE('{]') |
+----------------+------------------+
| ""             | "{]"             |
+----------------+------------------+
1 row in set (0.00 sec)

SELECT-ing from a JSON column

Table columns can be defined to be of JSON data type, and SELECT-ing from such a column will create JSON values at runtime. Of course, such a column would first need to be populated before it yields JSON values, and this can be done simply with an INSERT statement. When INSERT-ing non-JSON type values into a column of the JSON type, MySQL will behave as if it first converts these values to JSON-type, just as if it would apply CAST(value AS JSON) to those values.
UPDATE: Giuseppe Maxia kindly pointed out that there is an issue when INSERT-ing the result of LOAD_FILE() into a JSON column. You first have to CONVERT() the -binary- result to the utf8 character set before it can be successfully accepted as JSON. This was reported as bug #79066.

Calling functions that return JSON values

The JSON_QUOTE() function mentioned above is one example of built-in functions returning a JSON value. To create new JSON documents from scratch, JSON_OBJECT() and JSON_ARRAY() are probably most useful:
mysql> SELECT JSON_ARRAY(1, 2, 3) array, JSON_OBJECT('name1', 'value1', 'name2', 'value2') object;
+-----------+----------------------------------------+
| array     | object                                 |
+-----------+----------------------------------------+
| [1, 2, 3] | {"name1": "value1", "name2": "value2"} |
+-----------+----------------------------------------+
1 row in set (0.00 sec)
Note that we could have achieved the previous result also by CASTing literal string representations of these JSON documents AS JSON:
mysql> SELECT CAST('[1, 2, 3]' AS JSON) array, CAST('{"name1": "value1", "name2": "value2"}' AS JSON) object;
+-----------+----------------------------------------+
| array     | object                                 |
+-----------+----------------------------------------+
| [1, 2, 3] | {"name1": "value1", "name2": "value2"} |
+-----------+----------------------------------------+
1 row in set (0.00 sec)
However, as we shall see later on, this approach is not entirely equivalent to constructing these documents through JSON_ARRAY and JSON_OBJECT.

There are many more built-in JSON functions that return a value of the JSON data type. Unlike JSON_QUOTE(), JSON_ARRAY() and JSON_OBJECT(), most of these also require a JSON document as their first argument. In these cases, the return value represents a modified instance of the document passed as argument.

Operating on JSON documents: Extraction and Modification

While the JSON document may be a convenient unit for storing and transporting related items of data, any meaningful processing of such documents will always involve some operation to transform or modify such a document: for example, extracting some item stored inside the document, or adding or removing properties or array elements.

Manipulation of JSON documents always involves at least two distinct items:
  • The JSON document to operate on. This can be an explicit or implicitly obtained JSON document, constructed in any of the ways described earlier in this post. In general, functions that manipulate JSON documents accept the document that is being operated on as their first argument.
  • A path. The path is an expression that identifies which part of the document to operate on. In general, the second argument of functions that manipulate JSON documents is a path expression. Depending on which function exactly, other arguments may or may not accept path expressions as well.
It is important to point out that none of the functions that modify JSON documents actually change the argument document inline: JSON functions are pure functions that don't have side effects. The modified document is always returned from the function as a new document.

JSON path expressions in MySQL

While the path is passed as a string value, it's actually an expression consisting of alternating identifiers and access operators that as a whole identifies a particular piece within the JSON document:
Identifiers
There are 4 types of identifiers that can appear in a path:
  • $ (dollar sign) is a special identifier, which is essentially a placeholder for the current document being operated on. It can only appear at the start of the path expression
  • Property names are optionally double quoted names that identify properties ("fields") in a JSON object. Double quoted property names are required whenever the property name contains meta characters. For example, if the property name contains any interpunction or space characters, you need to double quote the name. A property name can appear immediately after a dot-access operator.
  • Array indices are integers that identify array elements in a JSON array. Array indices can appear only within an array-access operator (which is denoted by a pair of square braces)
  • * (asterisk) is also a special identifier. It indicates a wildcard that represents any property name or array index. So, the asterisk can appear after a dot-operator, in which case it denotes any property name, or it may appear between square braces, in which case it represents all existing indices of the array.

    The asterisk essentially "forks" the path and may thus match multiple values in a JSON document. The MySQL JSON functions that grab data or meta data usually have a way to handle multiple matched values, but JSON functions that modify the document usually do not support this.
Access operators
Paths can contain only 2 types of access operators:
  • dot-operator, denoted by a .-character. The dot-operator can appear in between any partial path expression and an identifier (including the special wildcard identifier *). It has the effect of extracting the value identified by the identifier from the value identified by the path expression that precedes the dot.

    This may sound more complicated than it really is: for example, the path $.myproperty has the effect of extracting whatever value is associated with the top-level property called myproperty; the path $.myobject.myproperty has the effect of extracting the value associated with the property called myproperty from the nested object stored in the myobject property of the top-level document.
  • array access-operator, denoted by a matching pair of square braces: [...]. The braces should contain either an integer, indicating the position of an array element, or the * (wildcard identifier) indicating all array element indices.

    The array-access operator can appear after any path expression, and can be followed by either a dot-operator (followed by its associated property identifier), or another array access operator (to access nested array elements).

    Currently, the braces can be used only to extract array elements. In javascript, braces can also contain a quoted property name to extract the value of the named property (equivalent to the dot-operator) but this is currently not supported in MySQL path expressions. (I believe this is a - minor - bug, but it's really no biggie since you can and probably should be using the dot-operator for properties anyway.)
Below is the syntax in a sort of EBNF notation in case you prefer that:
  mysql-json-path         ::= Document-placeholder path-expression?
  Document-placeholder    ::= '$'
  path-expression         ::= path-component path-expression*
  path-component          ::= property-accessor | array-accessor
  property-accessor       ::= '.' property-identifier
  property-identifier     ::= Simple-property-name | quoted-property-name | wildcard-identifier
  Simple-property-name    ::= <Please refer to JavaScript, The Definitive Guide, 2.7. Identifiers>
  quoted-property-name    ::= '"' string-content* '"'
  string-content          ::= Non-quote-character | Escaped-quote-character
  Non-quote-character     ::= <Any character except " (double quote)>
  Escaped-quote-character ::= '\"'
  wildcard-identifier     ::= '*'
  array-accessor          ::= '[' element-identifier ']'
  element-identifier      ::= [0-9]+ | wildcard-identifier

Grabbing data from JSON documents

json JSON_EXTRACT(json, path+)
This functions gets the value at the specified path. Multiple path arguments may be passed, in which case any values matching the paths are returned as a JSON array.
json json-column->path
If you have a table with a column of the JSON type, then you can use the -> operator inside SQL statements as a shorthand for JSON_EXTRACT(). Note that this operator only works inside SQL statements, and only if the left-hand operand is a column name; it does not work for arbitrary expressions of the JSON type. (Pity! I would love this to work for any expression of the JSON type, and in any context - not just SQL statements)

Grabbing metadata from JSON documents

bool JSON_CONTAINS(json, value, path?)
Checks whether the specified value appears in the specified document. If the path is specified, the function returns TRUE only if the value appears at the specified path. If the path argument is omitted, the function looks *anywhere* in the document and returns TRUE if it finds the value (either as property value or as array element).
bool JSON_CONTAINS_PATH(json, 'one'|'all', path+)
Checks whether the specified JSON document contains one or all of the specified paths. Personally I think there are some issues with this function
int JSON_DEPTH(json)
Number of levels present in the document
json-array JSON_KEYS(json-object, path?)
Returns the property names of the specified object as a JSON-array. If path is specified, the properties of the object identified by the path are returned instead.
int JSON_LENGTH(json, path?)
Returns the number of keys (when the json document is an object) or the number of elements (in case the json document is an array). If a path is specified, the function is applied to the value identified by the path rather than the document itself. Ommitting the path is equivalent to passing $ as path.
string JSON_SEARCH(json, 'one'|'all', pattern, escape?, path*)
Searches for string values that match the specified pattern, and returns the path or paths where the properties that match the pattern are located. The second argument indicates when the search should stop - in case it's 'one', search will stop as soon as a matching path was found, and the path is returned. In case of 'all', search will continue until all matching properties are found. If this results in multiple paths, then a JSON array of paths will be returned. The pattern can contain % and _ wildcard characters to match any number of characters or a single character (just as with the standard SQL LIKE-operator). The escape argument can optionally define which character should be used to escape literal % and _ characters. By default this is the backslash (\). Finally, you can optionally limit which parts of the document will be searched by passing one or more json paths. Technically it is possible to pass several paths that include the same locations, but only unique paths will be returned. That is, if multiple paths are found, the array of paths that is returned will never contain the same path more than once.

Unfortunately, MySQL currently does not provide any function that allows you to search for property names. I think it would be very useful so I made a feature request.
string JSON_TYPE(json)
Returns the name of the type of the argument value. It's interesting to note that the set of type values returned by this function are not equivalent to the types that are distinguished by the JSON specification. Values returned by this function are all uppercase string values. Some of these indicate items that belong to the JSON type system, like: "OBJECT", "ARRAY", "STRING", "BOOLEAN" and "NULL" (this is the uppercase string - not to be confused with the keyword for the SQL literal NULL-value). But some refer to native MySQL data types: "INTEGER", "DOUBLE", and "DECIMAL"; "DATE", "TIME", and "DATETIME", and "OPAQUE".
bool JSON_VALID(string)
Returns whether the passed value could be parsed as a JSON value. This is not limited to just JSON objects and arrays, but will also parse JSON built-in special value keywords, like null, true, false.

Manipulating JSON documents

json JSON_INSERT(json, [path, value]+)
Takes the argument json document, and adds (but does not overwrite) properties or array elements. Returns the resulting document.
json JSON_MERGE(json, json+)
Folds multiple documents and returns the resulting document.
json JSON_REMOVE(json, path+)
Remove one or more items specified by the path arguments from the document specified by the JSON argument, and returns the document after removing the specified paths.
json JSON_REPLACE(json, [path, value]+)
Takes the argument document and overwrites (but does not add) items specified by path arguments, and returns the resulting document.
json JSON_SET(json, [path, value]+)
Takes the argument document and adds or overwrites items specified by the path arguments, then returns the resulting document.

Functions to manipulate JSON arrays

json JSON_ARRAY_APPEND(json, [path, value]+)
If the path exists and identifies an array, it appends the value to the array. If the path exists but identifies a value that is not an array, it wraps the value into a new array, and appends the value. If the path does not identify a value at all, the document remains unchanged for that path.
json JSON_ARRAY_INSERT(json, [array-element-path, value]+)
This function inserts elements into existing arrays. The path must end with an array accessor - it must end with a pair of square braces containing an exact array index (not a wildcard). If the partial path up to the terminal array accessor identies an existing array, and the specified index is less than the array length, the value is inserted at the specified position. Any array elements at and beyond the specified position are shifted down one position to make room for the new element. If the specified index is equal to or exceeds the array length, the new value is appended to the array.
int JSON_LENGTH(json, path?)
I already described this one as a function that grabs metadata, but I found this function to be useful particularly when applied arrays.
Removing array elements
Note that there is no dedicated function for removing elements from an array. It is simply done using JSON_REMOVE. Just make sure the path argument denotes an array accessor to identify the element to remove.

To remove multiple elements from an array, you can specify multiple path arguments. In this case, the removal operation is performed sequentially, evaluating all passed path arguments from left to right. So, you have to be very careful which path to pass, since a preceding path may have changed the array you're working on. For example, if you want to remove the first two elements of an array, you should pass a path like '$[0]' twice. Passing '$[0]' and '$[1]' will end up removing elements 0 and 2 of the original array, since after removing the initial element at '$[0]', the element that used to sit at position 1 has been shifted left to position 0. The element that then sits at position 1 is the element that used to sit at position 2:
mysql> select json_remove('[1,2,3,4,5]', '$[0]', '$[0]') "remove elements 0 and 1"
    -> ,      json_remove('[1,2,3,4,5]', '$[0]', '$[1]') "remove elements 0 and 2"
    -> ;
+-------------------------+-------------------------+
| remove elements 0 and 1 | remove elements 0 and 2 |
+-------------------------+-------------------------+
| [3, 4, 5]               | [2, 4, 5]               |
+-------------------------+-------------------------+
1 row in set (0.00 sec)
Concatenating arrays
There is no function dedicated to concatenating arrays. However, you can use JSON_MERGE to do so:
mysql> SELECT JSON_MERGE('[0,1]', '[2,3]');
+------------------------------+
| JSON_MERGE('[0,1]', '[2,3]') |
+------------------------------+
| [0, 1, 2, 3]                 |
+------------------------------+
1 row in set (0.00 sec)
Slicing arrays
There is no dedicated function or syntax to take a slice of an array. If you don't need to slice arrays, then good - you're lucky. If you do need it, I'm afraid you're up for a challenge: I don't think there is a convenient way to do it. I filed a feature request and I hope this will be followed up.

JSON Schema Validation

Currently, the JSON functions provide a JSON_VALID() function, but this can only check if a string conforms to the JSON syntax. It does not verify whether the document conforms to predefined structures (a schema).

I anticipate that it might be useful to be able to ascertain schema conformance of JSON documents within MySQL. The exact context is out of scope for this post, but I would already like to let you know that I am working on a JSON schema validator. It can be found on github here: mysql-json-schema-validator.

Stay tuned - I will do a writeup on that as soon as I complete a few more features that I believe are essential.

MySQL JSON is actually a bit like BSON

MySQL's JSON type is not just a blob with a fancy name, and it is not entirely the same as standard JSON. MySQL's JSON type is more like MongoDB's BSON: it preserves native type information. The most straightforward way to make this clear is by creating different sorts of JSON values using CAST( ... AS JSON) and then reporting the type of the result using JSON_TYPE:
mysql> SELECT  JSON_TYPE(CAST('{}' AS JSON)) as "object"
    -> ,       JSON_TYPE(CAST('[]' AS JSON)) as "array"
    -> ,       JSON_TYPE(CAST('""' AS JSON)) as "string"
    -> ,       JSON_TYPE(CAST('true' AS JSON)) as "boolean"
    -> ,       JSON_TYPE(CAST('null' AS JSON)) as "null"
    -> ,       JSON_TYPE(CAST(1 AS JSON)) as "integer"
    -> ,       JSON_TYPE(CAST(1.1 AS JSON)) as "decimal"
    -> ,       JSON_TYPE(CAST(PI() AS JSON)) as "double"
    -> ,       JSON_TYPE(CAST(CURRENT_DATE AS JSON)) as "date"
    -> ,       JSON_TYPE(CAST(CURRENT_TIME AS JSON)) as "time"
    -> ,       JSON_TYPE(CAST(CURRENT_TIMESTAMP AS JSON)) as "datetime"
    -> ,       JSON_TYPE(CAST(CAST('""' AS BINARY) AS JSON)) as "blob"
    -> \G
*************************** 1. row ***************************
  object: OBJECT
   array: ARRAY
  string: STRING
 boolean: BOOLEAN
    null: NULL
 integer: INTEGER
 decimal: DECIMAL
  double: DOUBLE
    date: DATE
    time: TIME
datetime: DATETIME
    blob: BLOB
1 row in set (0.00 sec)
What this query shows is that internally, values of the JSON type preserve native type information. Personally, I think that is a good thing. JSON's standard type system is rather limited. I would love to see standard JSON support for proper decimal and datetime types.

Comparing JSON objects to JSON objects

The MySQL JSON type system is not just cosmetic - the attached internal type information affects how the values work in calculations and comparisons. Consider this comparison of two JSON objects:
mysql> SELECT CAST('{"num": 1.1}' AS JSON) = CAST('{"num": 1.1}' AS JSON);
+-------------------------------------------------------------+
| CAST('{"num": 1.1}' AS JSON) = CAST('{"num": 1.1}' AS JSON) |
+-------------------------------------------------------------+
|                                                           1 |
+-------------------------------------------------------------+
1 row in set (0.00 sec)
This is already quite nice - you can't compare two objects like that in javascript. Or actually, you can, but the result will be false since you'd be comparing two distinct objects that simply happen to have the same properties and property values. But usually, with JSON, we're just interested in the data. Since the objects that are compared here are totally equivalent with regard to composition and content, I consider the ability to directly compare objects as a bonus.

It gets even nicer:
mysql> SELECT CAST('{"num": 1.1, "date": "2015-11-01"}' AS JSON) = CAST('{"date": "2015-11-01", "num": 1.1}' AS JSON);
+---------------------------------------------------------------------------------------------------------+
| CAST('{"num": 1.1, "date": "2015-11-01"}' AS JSON) = CAST('{"date": "2015-11-01", "num": 1.1}' AS JSON) |
+---------------------------------------------------------------------------------------------------------+
|                                                                                                       1 |
+---------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Again, the result is true, indicating that these objects are equivalent. But you'll notice that the property names appear in different order between these two objects. But the direct comparison ignores the property order - it only takes into account whether a property exists at a particular path, and whether the property values are the same. One can argue about whether the property order should be deemed significant in a comparsison. The JSON spec doesn't specify so. But I'm inclined to say that MySQL's behavior here is a nice feature.

Now let's try something a bit like that first comparison, but in a slightly different way:
mysql> SELECT  JSON_OBJECT('bla', current_date)
    -> ,       JSON_OBJECT('bla', current_date) = JSON_OBJECT('bla', current_date)
    -> ,       JSON_OBJECT('bla', current_date) = CAST('{"bla": "2015-11-01"}' AS JSON)
    -> \G
*************************** 1. row ***************************
                                        JSON_OBJECT('bla', current_date): {"bla": "2015-11-01"}
     JSON_OBJECT('bla', current_date) = JSON_OBJECT('bla', current_date): 1
JSON_OBJECT('bla', current_date) = CAST('{"bla": "2015-11-01"}' AS JSON): 0
1 row in set (0.00 sec)
The difference here is of course creating the object using JSON_OBJECT as opposed to using CAST(... AS JSON). While the string representation of the result of JSON_OBJECT('bla', current_date) looks exactly the same like that of CAST('{"bla": "2015-11-01"}' AS JSON), they are not equivalent: in the case of JSON_OBJECT, MySQL internally attached native type information to the property which is of the type DATE (a type that does not exist in standard JSON), whereas in the case of the CAST(... AS JSON) operation, MySQL did not have any additional type information for the value of the property, leaving it no other choice than to assume a STRING type. The following query proves the point:
mysql> SELECT  JSON_TYPE(JSON_EXTRACT(JSON_OBJECT('bla', current_date), '$.bla'))
    -> ,       JSON_TYPE(JSON_EXTRACT(CAST('{"bla": "2015-11-01"}' AS JSON), '$.bla'))
    -> \G
*************************** 1. row ***************************
     JSON_TYPE(JSON_EXTRACT(JSON_OBJECT('bla', current_date), '$.bla')): DATE
JSON_TYPE(JSON_EXTRACT(CAST('{"bla": "2015-11-01"}' AS JSON), '$.bla')): STRING
1 row in set (0.00 sec)

Comparing JSON values to non-JSON values

Fortunately, comparison of JSON values to MySQL non-JSON values is pretty consistent, without requiring explicit CAST operations. This may sound obvious, but it's really not. The following query might explain better what I mean. Consider a JSON object with a property called "myProp" that has a string value of "value1":
mysql> SELECT JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp');
+-----------------------------------------------------------+
| JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp') |
+-----------------------------------------------------------+
| "value1"                                                  |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
Note the double quotes around the value - when we extract the value of the myProp property, the result is a JSON string - not a native MySQL character type. And when that result is rendered by the client, its MySQL string representation includes the double quotes. To get a proper MySQL string, we can apply JSON_UNQUOTE(), like this:
mysql> SELECT JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp'));
+-------------------------------------------------------------------------+
| JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp')) |
+-------------------------------------------------------------------------+
| value1                                                                  |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)
But fortunately, we don't really need to apply JSON_UNQUOTE() for most operations. For example, to compare the extracted value with a regular MySQL string value, we can simply do the comparison without explicitly casting the MySQL string to a JSON type, or explicitly unquoting the JSON string value to a MySQL string value:
mysql> SELECT JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp') = 'value1';
+----------------------------------------------------------------------+
| JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp') = 'value1' |
+----------------------------------------------------------------------+
|                                                                    1 |
+----------------------------------------------------------------------+
1 row in set (0.00 sec)
Again, I think this is very good news!

Still, there definitely are some gotcha's. The following example might explain what I mean:
mysql> SELECT  CURRENT_DATE
    -> ,       CURRENT_DATE = '2015-11-01'
    -> ,       JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')
    -> ,       JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = '2015-11-01'
    -> ,       JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = CURRENT_DATE
    -> ,       JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')) = '2015-11-01'
    -> \G
*************************** 1. row ***************************
                                                                              CURRENT_DATE: 2015-11-01
                                                               CURRENT_DATE = '2015-11-01': 1
                             JSON_EXTRACT(JSON_OBJECT('myProp', current_date), '$.myProp'): "2015-11-01"
              JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = '2015-11-01': 0
              JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = CURRENT_DATE: 1
JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')) = '2015-11-01': 1
1 row in set (0.00 sec)
Note that this is the type of thing that one might easily get wrong. The comparison CURRENT_DATE = '2015-11-01' suggests the MySQL date value is equal to its MySQL string representation, and the comparison JSON_EXTRACT(JSON_OBJECT('myProp', current_date), '$.myProp') = CURRENT_DATE suggests the value extracted from the JSON document is also equal to the date value.

From these two results one might expect that JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') would be equal to '2015-11-01' as well, but the query clearly shows this is not the case. Only when we explicitly apply JSON_UNQUOTE does the date value extracted from the JSON document become a real MySQL string, which we then can compare with the string value '2015-11-01' successfully.

When you think about a minute what really happens, it does make sense (at least, I think it does):
  • A MySQL date is equivalent to the MySQL string representation of that date
  • A MySQL date is equivalent to it's JSON date representation
  • A JSON date is not equal to the MySQL string representation of that date
  • A MySQL string representation of a JSON date is equal to the MySQL string representation of that date
That said, you might still find it can catch you when off guard.

Table columns of the JSON type

The JSON type is not just a runtime type - it is also available as a storage data type for table columns. A problem though is that there is no direct support for indexing JSON columns, which is sure to become a problem in case you plan to query the table based on the contents of the JSON document. Any WHERE, JOIN...ON, GROUP BY or ORDER BY-clause that relies on extracting a value from the JSON column is sure to result in a full table scan.

There is a workaround though: Once you know the paths for those parts of the document that will be used to filter, order and aggregate the data, you can create generated columns to have these values extracted from the document, and then put an index on those generated columns. This practice is recommended for MySQL by the manual page for CREATE TABLE. A complete example is given in the section called Secondary Indexes and Virtual Generated Columns.

Obviously, this approach is not without issues:
  • You need to rewrite your queries accordingly to use those generated columns rather than the raw extraction operations on the document. Or at least, you will have to if you want to benefit from your indexes.
  • Having to create separate columns in advance seems at odds with schema flexibility, which I assume is a highly-valued feature for those that find they need JSON columns.
  • The generated columns will require additional storage.
Of these concerns, I feel that the need to rewrite the queries is probably the biggest problem.
UPDATE: Roy Lyseng kindly pointed out to me that I missed an important feature. MySQL is actually smart enough to use indexed generated columns on the json document. Just look at this query:
mysql> explain select doc from posts where json_extract(doc, '$.Id') = 1;
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | posts | NULL       | const | id            | id   | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
Note how the query does not directly reference the generated column id. MySQL magically understands that json_extract(doc, '$.Id') was used as expression for a generated column, and this is enough to automatically include it in the plan evaluation. Thanks Roy! This is marvelous :)
The additional storage seems to be the smallest issue, assuming the number of items that you need to index is small as compared to the entire document. (Although I can imagine the extra storage would start to count when you want to extract large text columns for full-text indexing). That said, if I understand correctly, if you create the index on VIRTUAL generated columns, only the index will require extra storage - there won't also be storage required for the columns themselves. (Note that creating an index will always require extra storage - that's just how it works, both in MySQL, as well as in specialized document databases like MongoDB.)

As far as I can see now, any indexing scheme that requires us to elect the items within the documents that we want to index in advance suffers from the same drawback: If the schema evolves in such a way that fields that used to be important enough to be deemed fit for indexing get moved or renamed often, then this practice will affect all aspects of any application that works on the document store. My gut feeling is that despite the theoretical possibility of schema flexibility, this will cause enough inertia in the schema evolution (at least, with respect to those items that we based our indexes on) to be well in time to come up with other solutions. To be fair though, having to set up generated columns would probably add a some extra inertia as compared to a pure document database (like MongoDB).

But my main point still stands: if you choose to keep changing the schema all the time, especially if it involves those items that you need to filter, sort, or aggregate the data, then the changes will affect almost every other layer of your application - not just your database. Apparently, that's what you bargained for and in the light of all other changes that would be needed to support this practice of a dynamic schema evolution, it seems that setting up a few extra columns should not be that big a deal.

JSON Columns and Indexing Example

Just to illustrate how it would work out, let's try and setup a table to store JSON documents. For this example, I'm looking at the Stackexchange datasets. There are many such datasets for various topic, and I'm looking at the one for math.stackexchange.com because it has a decent size - 873MB. Each of these archives comprises 8 xml files, and I'm using the Posts.xml file. One post document might look like this:
<row 
  Id="1" 
  PostTypeId="1" 
  AcceptedAnswerId="9"
  CreationDate="2010-07-20T19:09:27.200" 
  Score="85" 
  ViewCount="4121" 
  Body="&lt;p&gt;Can someone explain to me how there can be different kinds of infinities?&lt;/p&gt;" 
  OwnerUserId="10" 
  LastEditorUserId="206259" 
  LastEditorDisplayName="user126" 
  LastEditDate="2015-02-18T03:10:12.210" 
  LastActivityDate="2015-02-18T03:10:12.210" 
  Title="Different kinds of infinities?" 
  Tags="&lt;set-theory&gt;&lt;intuition&gt;&lt;faq&gt;" 
  AnswerCount="10" 
  CommentCount="1" 
  FavoriteCount="28"
/>
I'm using Pentaho Data Integration to read these files and to convert them into JSON documents. These JSON documents look like this:
{
  "Id": 1,
  "Body": "<p>Can someone explain to me how there can be different kinds of infinities?<\/p>",
  "Tags": "<set-theory><intuition><faq>",
  "Score": 85,
  "Title": "Different kinds of infinities?",
  "PostTypeId": 1,
  "AnswerCount": 10,
  "OwnerUserId": 10,
  "CommentCount": 1,
  "CreationDate": "2010-07-20 19:09:27",
  "LastEditDate": "2015-02-18 03:10:12",
  "AcceptedAnswerId": 9,
  "LastActivityDate": "2015-02-18 03:10:12",
  "LastEditorUserId": 206259
}
Initially, let's just start with a simple table called posts with a single JSON column called doc:
CREATE TABLE posts (
  doc JSON
);
After loading, I got a little over a million post documents in my table:
mysql> select count(*) from posts;
+----------+
| count(*) |
+----------+
|  1082988 |
+----------+
1 row in set (0.66 sec)
(There are actually some 5% more posts in the stackexchange data dump, but my quick and dirty transformation to turn the XML into JSON led to a bunch of invalid JSON documents, and I didn't bother to perfect the transformation enough to get them all. A million is more than enough to illustrate the approach though.)

Now, let's find the post with Id equal to 1:
mysql> select doc from posts where json_extract(doc, '$.Id') = 1
    -> \G
*************************** 1. row ***************************
doc: {"Id": 1, "Body": ">p<Can someone explain to me how there can be different kinds of infinities?</p>", "Tags": "<set-theory><intuition><faq>", "Score": 85, "Title": "Different kinds of infinities?", "PostTypeId": 1, "AnswerCount": 10, "OwnerUserId": 10, "CommentCount": 1, "CreationDate": "2010-07-20 19:09:27", "LastEditDate": "2015-02-18 03:10:12", "AcceptedAnswerId": 9, "LastActivityDate": "2015-02-18 03:10:12", "LastEditorUserId": 206259}
1 row in set (1.45 sec)
Obviously, the query plan requires a full table scan:
mysql> explain select doc from posts where json_extract(doc, '$.Id') = 1;
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | posts | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1100132 |   100.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
First, let's try and add a generated column for the Id. The Id is, as its name implies, unique, and it seems sensible to create a PRIMARY KEY for that as well:
mysql> ALTER TABLE posts
    -> ADD id INTEGER UNSIGNED
    -> GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.Id'))
    -> STORED
    -> NOT NULL PRIMARY KEY;
Query OK, 1082988 rows affected (36.23 sec)
Records: 1082988  Duplicates: 0  Warnings: 0
You might notice that in this case, the generated column is STORED rather than VIRTUAL. This is the case because MySQL won't let you create a PRIMARY KEY on a VIRTUAL generated column. If you try it anyway, you'll get:
mysql> ALTER TABLE posts
    -> ADD id INTEGER UNSIGNED
    -> GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.Id')) NOT NULL
    -> VIRTUAL
    -> PRIMARY KEY;
ERROR 3106 (HY000): 'Defining a virtual generated column as primary key' is not supported for generated columns.
Now, let's try our -modified- query again:
mysql> explain select doc from posts where id = 1;
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | posts | NULL       | const | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
If you actually try to run the query you'll notice it returns instantly - as is to be expected, since we can now access the document directly via the PRIMARY KEY.

Now, let's try this again but using a VIRTUAL column and a UNIQUE index:
mysql> ALTER TABLE posts
    -> DROP COLUMN id
    -> ;
Query OK, 1082988 rows affected (35.44 sec)
Records: 1082988  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE posts
    -> ADD id INTEGER UNSIGNED
    -> GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.Id'))
    -> VIRTUAL
    -> NOT NULL UNIQUE;
Query OK, 1082988 rows affected (36.61 sec)
Records: 1082988  Duplicates: 0  Warnings: 0
Now the plan is:
mysql> explain select doc from posts where id = 1;
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | posts | NULL       | const | id            | id   | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
The plan is almost the same, except of course that now access if via the UNIQUE key rather than the PRIMARY KEY. The query again returns almost instantly, although it will be slightly slower.

That said, this example is not so much about making a benchmark or measuring performance, it's more about showing how to achieve some form of indexing when storing JSON documents in a MySQL table. I truly hope someone else will try and conduct a serious benchmark so that we can get an idea just how performance of the MySQL JSON type compares to alternative solutions (like the PostgreSQL JSON type, and MongoDB). I feel I lack both the expertise and the tools to do so myself so I'd rather leave that to experts.

Daniƫl van Eeden kindly pointed out that query results maybe different depending in the presence of an index. Please read bug 76834 to learn how this may affect you.

In Conclusion

  • MySQL JSON support looks pretty complete.
  • Integration of JSON type system and MySQL native type system is, in my opinion, pretty good, but there are definitely gotcha's.
  • Achieving indexing for JSON columns relies on a few specific workarounds, which may or may not be compatible with your requirements.
I hope this post was useful to you. I sure learned a lot by investigating the feature, and it gave me a few ideas of how I could use the JSON features in the future.