Locations of visitors to this page Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Monday, July 27, 2015

Using REST services to work with the Pentaho BI Server repository

A couple of months ago, I wrote about how to use Pentaho's REST services to perform various user and role management tasks.

In that post, I tried to provide a few general guidelines to help developers interested in using the REST services, and, by way of example, I showed in detail how to develop a sample application that uses one particular service (namely, UserRoleDaoResource) from within the PHP language to perform basic user management tasks.

Recently, I noticed this tweet from Rafael Valenzuela (@sowe):


I suggested to use a Pentaho REST service call for that and I offered to share a little javascript wrapper I developed some time ago to help him get started.

Then, Jose Hidalgo (@josehidrom) chimed in an expressed some interest so I decided to go ahead and lift this piece of software and develop it into a proper, stand-alone javascript module for everyone to reuse.

Introducing Phile

Phile stands for Pentaho File access and is pronounced simply as "file". It is a stand-alone, cross-browser pure javascript module that allows you to build javascript applications that can work with the Pentaho respository.

The main use case for Phile is Pentaho BI Server plugin applications. That said, it should be possible to use Phile from within a server-side javascript environment like node js.

Phile ships as a single javascript resource. The unminified version, Phile.js, includes YUIDoc comments and weighs 32k. For production use, there's also a minified version available, Phile-compiled.js, which weighs 5.7k.

You can simply include Phile with a <script> tag, or you can load it with a module loader, like require(). Phile supports the AMD module convention as well as the common js module convention.

Phile has its YUIDoc api documentation included in the project.

Phile is released under the terms and conditions of the Apache 2.0 software license, but I'm happy to provide you with a different license if that does not, for some reason, suit your needs. The project is on github and I'd appreciate your feedback and/or contributions there. Github is also the right place to report any issues. I will happily accept your pull requests so please feel free to contribute!

Finally, the Phile project includes a Pentaho user console plugin. This plugin provides a sample application that demonstrates pretty much all of the available methods provided by Phile. Installing and tracing this sample application is a great way to get started with Phile.

Getting started with Phile

In the remainder of this post, I will discuss various methods provided by Phile, and explain some details of the underlying REST services. For quick reference, here's a table of contents for the following section:

Installing the sample application

The recommended way to get started with Phile is to install the sample application unto your Pentaho 5.x server. This basic application provides a minimal but functional way to manipulate files and directories using Phile. Think of it as something like the Browse files perspective that is built into the Pentaho user console, but implemented as a BI server plugin that builds on top of Phile.

To install the sample application, first download phile.zip, and extract it into pentaho-solutions/system. This should result in a new pentaho-solutions/system/phile subdirectory, which contains all of the resources for the plugin. After extraction, you'll have to restart your pentaho BI server so the plugin gets picked up. (Should you for whatever reason want to update the plugin, you can simply overwrite it, and refresh the pentaho user console in your browser. The plugin contains no server side components, except for the registration of the menu item.)

Once restarted, you should now have a "Pentaho files" menu item in the "Tools" menu:



Note: Currently neither Phile nor the sample application are available through the Pentaho marketplace. That's because the sample application is only a demonstration of Phile, and phile itself is just a library. In my opinion it does not have a place in the marketplace - developers should simply download the Phile script files and include them in their own projects.

Running the sample application

As you probably guessed, the sample application is started by activating the "Tools" > "Pentaho Files" menu. This will open a new "Pentaho Files" tab in the user console:


The user interface of the sample application features the following elements:
At the very top: Toolbar with push buttons for basic functionality.
The label on the buttons pretty much describes their function:
New File
Create a new file in the currently selected directory
New Directory
Create a new directory in the currently selected directory
Rename
Rename the currently selected file or directory
Delete
Move the currently selected file or directory to the trash, or permanently remove the currently selected item from the trash.
Restore
Restore the current item from the trash
API Docs
Opens the Phile API documentation in a new tab inside the Pentaho user console.
Left top: Treeview.
The main purpose of this treeview is to offer the user a way to navigate through the repository. The user can open and close folders by clicking the triangular toggle button right in front of a folder. Clicking the label of a file or folder will change the currently selected item to operate on. In response to a change of the currently selected item, the appropriate buttons in the toolbar are enabled or disabled as appropriate for the selected item. Initially, when the application is started, the tree will load the directory tree from the root of the repository down to the current user's home directory, and automatically select the current user's home directory. (In the screenshot, note the light blue highlighting of the /home/admin folder, which is the home directory of the admin user.)
Left bottom: trash
This is a view on the contents of the current user's trash bin. Unlike the treeview, this is a flat list of discarded items. Items in the trash can be selected just like items in the treeview. After selecting an item from the trash the appropriate toolbar buttons will be enabled and disabled to manipulate that item.
Right top: properties
Items stored in the repository carry internal metadata, as well as localization information. The properties pane shows the properties of the currently selected item in JSON format. JSON was chosen rather than some fancy, user-friendly form in order to let developers quickly see exactly what kinds of objects they will be dealing with when programming against the pentaho repository from within a javascript application.
Right bottom: contents and download link
If the currently selected item is a file, its contents will be shown in this pane. No formatting or pretty printing is applied to allow raw inspection of the file resource. In the title of the pane, a download link will become available to allow download of the currently selected resource for inspection on a local file system.

An overview of the Phile API

To get an idea of what the Phile API provides, click the "API Docs" button in the toolbar to open the API documentation. A new tab will open in the Pentaho user console showing the YUIDoc documentation. Inside the doc page, click the "Phile" class in the APIs tab (the only available class) and open the "Index" tab in the class overview:


Click on any link in the index to read the API documentation for that item. Most method names should give you a pretty good clue as to the function of the method. For example, createDirectory() will create a directory, discard() will remove a file or directory, and so on and so forth. We'll discuss particular methods in detail later in this post.

An closer look at the Phile API

Now we'll take a closer look to what actually makes the sample application work. For now, you need only be concerned with the index.html resource, which forms the entry point of the sample application.

Loading the script

The first thing of note is the <script> tag that includes the Phile module into the page:
    <script src="../js/Phile.js" type="text/javascript"></script>
Remember, you can either use the unminified source Phile.js, which is useful for debugging and educational uses, or the minified Phile-compiled.js for production use. The functionality of both scripts is identical, but the minified version will probably load a bit faster. The sample application uses the unminified script to make it easier to step through the code in case you're interested in its implementation, or in case you want to debug an issue.

The location of the script is up to you - you should choose a location that makes sense for your application and architecture. The sample application simply chose to put these scripts in a js directory next to the location of index.html, but if you want to use Phile in more than one application, it might make more sense to deploy it in a more central spot next to say the common ui scripts of Pentaho itself.

As noted before you can also dynamically load Phile with a javascript module loader like require(), which is pretty much the standard way that Pentaho's own applications use to load javascript resources.

Calling the Phile constructor

The next item of note in index.html is the call to the Phile() constructor:
    var phile = new Phile();
This instantiates a new Phile object using the default options, and assigns it to the phile (note the lower case) variable. The remainder of the sample application will use the Phile instance stored in the phile variable for manipulating Pentaho repository files.

(Normally the default options should be fine. For advanced uses, you might want to pass a configuration object to the Phile() constructor. Refer to the API documentation if you want to tweak these options.)

General pattern for using a Phile instance

Virtually all methods of Phile result in a HTTP request to an underlying REST service provided by the pentaho platform. Phile always uses asynchronous communication for its HTTP requests. This means the caller has to provide callback methods to handle the response of the REST service. Any Phile method that calls upon the backing REST service follows a uniform pattern that can best be explained by illustrating the generic request() method:
    phile.request({
      success: function(
        request, //the configuration object that was passed to the request method and represents the actual request
        xhr,     //The XMLHttpRequest object that was used to send the request to the server. Useful for low-level processing of the response
        response //An object that represents the response - often a javascript representation of the response document 
      ){
        //handle the repsonse
        
      },
      failure: function(request, xhr, exception){
        //handle the exception
        //Arguments are the same as passed to the success callback, 
        //but the exception argument represents the error that occurred rather than the response document
      },
      scope: ..., //optional: an object that will be used as the "this" object for the success and failure callbacks
      headers: {
        //any HTTP request headers.
        Accept: "application/json"
      },
      params: {
        //any name/value pairs for the URL query string.
      },
      ...any method specific properties...
    });
In most cases, developers need not call the request() method themselves directly. Most Phile methods provide far more specific functionality the generic request, and are implemented as a wrapper around to the generic request() method. These more specific methods fill in as many of the specific details about the request (such as headers and query parameters) as possible for that specific functionality, and should thus be far easier and more reliable to invoke than the generic request().

However, in virtually all cases the caller needs to provide the success() and a failure() callback methods to adequately handle server action. For each of the specific methods provided by Phile, the YUIDoc API documentation lists any specific options that apply to the configuration object passed to that particular method. You will notice that some of those more specific properties occur more often than others across different Phile methods. But the callback methods are truly generic and recur for every distinct Phile method that calls upon an underlying REST service.

Using Phile to work with repository directories

The request() method demonstrates the generic pattern for calling Phile methods, but the actual call will look different for various concrete requests. We'll take a look at a few of them as they are demonstrated by the sample application.

Getting the user's home directory with getUserHomeDir()

The sample application first obtains the user's home directory by calling the getUserHomeDir() method of the Phile object. The getUserHomeDir() method is implemented by making a HTTP GET request to services in the /api/session API.

The sample application uses the home directory to determine how to fill the repository treeview:
    //get the user home directory
    phile.getUserHomeDir({
      success: function(options, xhr, data){
        var dir = data.substr(0, data.indexOf("/workspace"));
        createFileTree(dir);
      },
      failure: failure
    });
The data argument returned to the success callback is simply a string that represents the path of the user's home directory.

For some reason, paths returned by these calls always end in /workspace, so for example, for the admin user we get back /home/admin/workspace. But it seems the actual user home directory should really be /home/admin, so we strip off the string "/workspace". Then, we pass the remaining path to the createFileTree() method of the sample application to fill the treeview.

As we shall see, when getting the repository structure for the treeview, the user home directory is used to decide how far down we should fetch the repository structure, and which directory to select initially.

Specifying a username to get a particular user's working directory

In the example shown above, the argument object passed in the call the getUserHomeDir() method of the Phile object did not include a user option. In this case, the home directory of the current user is retrieved with a GET request to /api/session/userWorkspaceDir. But getUserHomeDir() can also be used to retrieve the home directory of a specific user by passing the username of a specific user in the name option to the argument object passed to getUserHomeDir(). If the user option is present, a GET request is done instead to /api/session/workspaceDirForUser, which returns the home directory for that particular user.

Reading repository structure with the getTree() method

The createFileTree() function of the sample application obtains the repository structure to populate the treeview by doing a call to the method of the Phile object. The getTree() in Phile is implemented by doing a GET HTTP request to the /api/repo/files/.../tree service.
    function createFileTree(dir) {
      var dom = document.getElementById("tree-nodes");
      dom.innerHTML = "";
      var pathComponents = dir.split(Phile.separator);
      phile.getTree({
        path: dir[0],
        depth: dir.length,
        success: function(options, xhr, data) {
          createTreeBranch(data, pathComponents);
        },
        failure: failure
      });
    }
In the argument passed to getTree(), we see two options that we haven't seen before:
  • path specifies the path from where to search the repository.
  • depth is an integer that specifies how many levels (directories) the repository should be traversed downward from the path specified by the path option.
Many more methods of the Phile object support a path option to identify the file object that is being operated on. For flexibility, in each case where a path option is supported, it may be specified in either of the following ways:
  • As a string, separating path components with a forward slash. In code, you can use the static Phile.separator property to refer to the separator.
  • As an array of path component strings. (Actually - any object that has a join method is accepted and assumed to behave just like the join method of the javascript Array object)
Remember that the sample application initially populates the treeview by passing the user's home directory to createFileTree(). In order to pass the correct values to path and depth, the user home directory string is split into individual path components using the static Phile.separator property (which is equal to "/", the path separator). Whatever is the first component of that path must be the root of the repository, and so we use pathComponents[0] for path (i.e., get us the tree, starting at the root of the repository). The depth is specified as the total number of path components toward the user's home directory, ensuring that the tree we retrieve is at least so deep that it will contain the user's home directory.

The data that is returned to the success callback is an object that represents the repository tree structure. This object has just two properties, file and (optionally) children:
file
An object that represents the file object at this level. This object conveys only information (metadata) of the current file object.
children
An array that represents the children of the current file. The elements in the array are again objects with a file and (optionally) a children property, recursively repeating the initial structure.

The structure of file objects

Object like the one held by the file property described above are generally used to represent files by the Pentaho REST services, and other Phile API calls that expect file objects typically receive them in this format. They are basically the common currency exchanged in the Pentaho repository APIs.

Official documentation for file objects can be found here: repositoryFileTreeDto.

From the javascript side of things, the file object looks like this:
aclNode
String "true" or "false" to flag if this is an ACL node or not.
createdDate
A string that can be parsed as an integer to get the timestamp indicating the date/time this node was created.
fileSize
A string that can be parsed as an integer to get the size (in bytes) of this node in case this node represents a file. If a filesize is not applicable for this node, it is "-1".
folder
String "true" or "false" to flag if this is node represents a folder or not.
hidden
String "true" or "false" to flag if this is node is hidden for the end user or not.
id
A GUID identifiying this node.
locale
The current locale used for localized properties like title.
localeMapEntries
This is an array of localized properties for this file. The array items have these properties:
locale
The name of the locale for this map of localized properties. There is also a special "default" locale indicating the current locale.
properties
This is a bag of name/value pairs, representing localized properties.
key
The key for this property.
value
The value for this property.
locked
String "true" or "false" to flag if this is node is locked or not.
name
The name of this node.
ownerType
A string that can be parsed as an integer indicating the owner type.
path
A string containing the forward slash separated path components.
title
The title for presenting this node to the user.
versioned
String "true" or "false" to flag if this is node is versioned or not.
versionId
If the file is versioned, the versionId property is present and its value is a String that represents the version number.
Note that the list of properties above is just a sample: Depending on context, some properties might not be present, or extra properties may be present that are not listed here. But in general, key properties like id, path, folder, dateCreated, name and title will in practice always be present.

There are a few things that should be noted about the file object. What may strike you by surprise is that in the raw JSON response, all scalar properties are represented as strings. For example, properties a like folder and hidden and so on look like they should be booleans, but their value is either "true" or "false", and not, as one might expect, true and false. Likewise, integer fields like fileSize, and timestamp fields like dateCreated are also represented as strings - not as integers.

There is not really any good reason why the Pentaho REST service uses string representations in its response, since there are proper ways to express these different types directly in JSON. However, things are the way they are and there's not much Phile can (or should) do to change it. For convenience it might be a good idea to add a conversion function to Phile which turns these strings into values of a more appropriate datatype. Remember, contributions and pull requests are welcome!

For convenience, most of the Phile methods that require a path option can also pass a file option instead. If a file option is passed, it is expected to adhere to the structure of the file object described above. In those cases, Phile will use the path property of the passed file object as the path option. This allows application developers to use the file objects as they are received from the Pentaho server directly as input for new calls to the Phile API. (That said, the sample application currently does not use or demonstrate this feature and passes an explicit path instead.

Browsing folders with getChildren()

We just discussed how the getTree() method could retrieve the structure of the repository from one particular path up to (or rather, down to) a specific level or depth. There is a very similar method for obtaining the contents of only a single directory: getChildren(). This is implemented by calling the /api/repo/files/.../children service. In the sample application, this is used to populate the (hitherto unpopulated and collapsed) folder nodes in the treeview:
    function getChildrenForPath(path){
      phile.getChildren({
        path: path,
        success: function(options, xhr, data) {
          if (!data) {
            return;
          }
          var files = data.repositoryFileDto, i, n = files.length, file;
          sortFiles(files);
          for (i = 0; i < n; i++){
            file = files[i];r
            createTreeNode(file);
          }
        },
        failure: failure
      });
    }
Just like in the call to getTree() a path option must be specified. The getChildren() call does not support a depth option, which makes sense since it only returns the contents of the specified directory. (In other words, unlike getTree(), getChildren is not recursive, so it makes no sense to specify or require a "depth".)

Because getChildren() only returns the contents of the specified directory, the structure of the response data passed to the success() callback differs slightly from what is returned by getTree(): in the case of getChildren(), the response data is a javascript object that has a single repositoryFileDto property, which is an array of file objects:
{
  "repositoryFileDto": [
    ...many file objects...
  ]
}

Browsing the trash bin

For each user, the repository has a special trash folder that is used as temporary storage for discarded files (and directories). The trash folder cannot (should not) be approached directly with a call to getChildren() or getTree(). Instead, there is a special method available called getTrash() to list the contents of the trash folder. This method is implemented by doing a GET request to the /api/repo/files/deleted service.

The sample application reads the contents of the trash folder to populate the list in the left bottom of the screen with the loadTrash() function:
    function loadTrash() {
      phile.getTrash({
        success: function(options, xhr, data){
          var thrashList = document.getElementById("trash-nodes");
          createFileList(data, thrashList);
        },
        failure: failure
      });
    }
As you can see, loadTrash() simply calls the getTrash() method on the Phile instance, and uses the data passed back to the success() callback to build the list.

The data that getTrash() passes back to the callback has essentially the same structure as what is passed back by the getChildren() method: an object with a single repositoryFileDto property, which holds an array of file objects. However, since the file objects are in the trash, they have a couple of properties that are specific to discarded items:
deletedDate
A string that can be parsed as an integer to get the timestamp indicating the date/time this node was deleted.
originalParentFolderPath
A string that holds the original path of this file object before it was discarded to the trash folder. This is essential information in case you want to restore an item from the trash, since its actual own path property will be something like /home/admin/.trash/pho:4cce1a1b-95e2-4c2e-83a2-b19f6d446a0d/filename and refers to its location in the trash folder, which most likely means nothing to an end user.

Creating new directories

The sample application handles a click event on the "New Directory" toolbar button by calling the newDirectory() function. This function calls the createDirectory() method of the Phile object to create a new directory inside the currently selected directory:
    function newDirectory(){
      var path = getFileNodePath(selected);
      var name = prompt("Please enter a name for your new directory.", "new directory");
      if (name === null) {
        alert("You did not enter a name. Action will be canceled.");
        return;
      }
      var newPath = path + Phile.separator + name;
      phile.createDirectory({
        path: newPath,
        success: function(xhr, options, data) {
          selected.lastChild.innerHTML = "";
          selected.setAttribute("data-state", "expanded");
          getChildrenForPath(path);
        },
        failure: failure
      });
    }
As you can see, the user is presented with a prompt to get a name for a new directory, and this is simply appended to the path of the currently selected directory. This new path is then passed in the path property of the argument to the createDirectory method of the Phile object. (Please note that a serious application would provide some checks to validate the user input, but in order to demonstrate only the principles of using Phile, the sanmple application takes a few shortcuts here and there.)

The createDirectory() method is implemented by doing a PUT request to the /api/repo/dirs service. One might expect to be returned a file object that represents the newly created directory, but alas this is not the case: the request does not return any data. The sample application refreshes all of the children of the parent directory instead. (Please note that this is just a quick way to ensure the treeview reflects the new state of the directory accurately. A serious application should probably change the gui to add only the new directory, and otherwise retain the state of the tree. Again this is a shortcut just to keep the sample application nice and simple.)

Sorting file objects

When working with methods like getTree(), getChildren() and getTrash(), it may be needed to sort an array of file objects. For example, in the sample application, the treeview first presents all folders, and then the files, and within these types of file objects, the items are sorted alphabetically. The trash pane takes a different approach, and sorts all files based on their original path.

In the sample application this is achieved simply by calling the native sort() function on the array in which the file objects are received. But to achieve a particular type of sort, a comparator function is passed into the sort function.

Built-in file comparators

Phile offers a number of static general purpose file comparators:
compareFilesByPathCS()
Compares files by path in a case-sensitive (CS) manner.
compareFilesByOriginalPathAndName()
Compares files by orginal path, and then by name (in a case-sensitive manner).
compareFilesByTitleCS
Sorts folders before files, and then by title in a case-sensitive manner (CS).
compareFilesByTitleCI
Sorts folders before files, and then by title in a case-insensitive manner (CS).
Each of these methods can be of use in certain contexts. For example, the comparison implemented by compareFilesByPathCS() will by many people be regarded as the "natural" order of file objects, and this might be a useful sort order for things like autocomplete listboxes and such. The compareFilesByTitleCS() and compareFilesByTitleCI() methods on the other hand implement an order that is most suitable when presenting the files in a single directory in a GUI. And compareFilesByOriginalPathAndName() may prove to be useful when sorting items from the trash, since it takes the original name and location into account rather than the actual, current name and location.

The sample application uses the sortFiles() function to sort files in the treeview:
    function sortFiles(files){
      files.sort(Phile.compareFilesByTitleCI);
    }
As you can see, it's simply a matter of calling sort() on the array of files and passing the appropriate comparator. Since the comparators are static properties of the Phile constructor itself, it's qualified by prepending Phile. to its name.

Creating custom comparators

Phile offers a useful utility method to generate new comparators called createFileComparator(). The createFileComparator() is static an attached as a property directly to the Phile constructor itself, so in order to call it, it must be qualified, like so: Phile.createFileComparator().

The createFileComparator() takes a single argument, which represents a sort specification. It returns the comparator function that can be passed to the native sort() method of the javascript Array object.

The sort specification passed to createFileComparator() should be an object. Each property of this object indicates the name of a property in the file objects that are to be compared. The value of the property in the sort specification should be an object that contains extra information on how to treat the respective file in the comparison.

In the simplest case, the properties of the sort specification are all assigned null. In this case the fields will be compared as-is, which in practice means the field values of the file objects are compared in case-sensitive alphanumerical order. This is how the built-in comparator compareFilesByPathCS() is created:
    Phile.compareFilesByPathCS = Phile.createFileComparator({
      path: null
    });
This sort specification simply states that file objects must be compared by comparing the value of their respective path property. The built-in comparator compareFilesByOriginalPathAndName() is constructed similarly, but specifies that both originalPath and name fields are to be compared (in order):
 
    Phile.compareFilesByOriginalPathAndName = Phile.createFileComparator({
      originalParentFolderPath: null,
      name: null
    });

Specifying sort order

You can exert more control on how fields are compared by assigning an actual object to the respective property in the sort specification instead of null. When an object is assigned, you can set a direction property with a value of -1 to indicate reverse sort order.

This is put to good use by the built-in comparators compareFilesByTitleCS and compareFilesByTitleCI to ensure that folders are sorted before regular files:
    Phile.compareFilesByTitleCS = Phile.createFileComparator({
      folder: {direction: -1},
      title: null
    });
By first sorting on the folder property and after that on the title property, we achieve the desired presentation that most user will be accustomed to. However, the folder property will have a string value of either "true" or "false". Since we want all folders to be sorted before all files, we need to reverse the order (since "true" is larger than "false"). The sort specification {direction: -1} does exactly that.

Specifying case-insensitivy

Sometimes, it can be useful to exert control on the actual values that will be compared. To this end, you can specify a custom converter function in the field sort specification via the convert property. The specified function will then be applied to the raw field value, and the return value of the converter function will be used in the comparison rather than the raw value.

The built-in compareFilesByTitleCI comparator uses the convert property in the sort specification to implement a case-insensitive comparison:
    Phile.compareFilesByTitleCI = Phile.createFileComparator({
      folder: {direction: -1},
      title: {convert: function(value){return value.toUpperCase();}}
    });
In this sort specification, a convert function is specified for the title field, which accepts the original, raw title, and returns its upper case value by applying the built-in toUpperCase() method of the javascript String object. Since the upper case values rather than the raw values will be compared, this ensures the comparison is now case-insensitive.

More advanced comparators

Besides implementing case-insensitive comparison, specifying a convert function can be of use in other cases as well. For instance, if you want to sort on file creation date, you could specify a convert function that parses the createdDate property of the file object and returns its integer value to as to achieve a chronological order.

Reading and writing files and file properties

Now that we know how to use Phile to work with directories, let's take a look at working with files.

Reading file properties

Whenever the user of the sample application clicks the label of an item in either the treeview (left top) or the trash file list (left bottom), the current selection changes, and the properties pane (right top) will refresh and show the properties of the currently selected file. This provides a view of the corresponding file object as it is known in the repository.

The selection of an item is handled by the labelClick() function of the sample application. This method handles quite a bit of logic that is specific to the sample application, and we won't discuss it in full. Instead, we focus on the bit that retrieves the properties of the selected file, which is done with a call to the getProperties() method of the Phile object:
    phile.getProperties({
      path: getFileNodePath(node),
      success: function(options, xhr, data){
        var propertiesText = document.getElementById("properties-text");
        propertiesText.innerHTML = "";
        displayProperties(data, propertiesText);
      },
      failure: failure
    });
As we have seen in many other Phile calls, the path property of the object passed to the method provides the actual input. In this particular sample code, the value for path is extracted from the DOM element that represents the selected item in gui (this is held by the node variable), and the getFileNodePath() function simply extracts the path. (The details of that are not really relevant to using Phile and analysis of that code is left as an exercise to the reader.)

The data argument passed to the success() callback passed to the getProperties() method returns the properties of the file identified by the path property as a file object. The sample application simply clears the properties pane and then fills it with a JSON string representation of the object (by calling the displayProperties() function).

The actual make up of the returned file object will depend on whether the path identifies a regular file, a folder, or an discarded item in the trash folder. But the getProperties() method can be used in any of these cases to retrieve the properties of these items from the repository.

Reading file contents

In the sample application, changing the selection also affects the contents pane (right bottom). Changing the selection always clears whatever is in there, but if the newly selected item is a file (and not a directory or an item in the trash folder), its contents will also be displayed there.

Loading the content pane is handled by the labelClick() function, which is also responsible for loading the properties pane. The contents of the file are obtained with a call to the getContents() method of the Phile object:
    phile.getContents({
      path: getFileNodePath(node),
      headers: {
        Accept: "text/plain"
      },
      success: function(options, xhr, data) {
        displayContents(options.path, xhr.responseText);
      },
      failure: failure
    });
As usual, the path option is used to tell the method of which file the contents should be retrieved.

There are two features in this call to getContents() that we have not witnessed in any other Phile method call, and which are unique to this method:
  • A HTTP Accept header is specified to ensure the contents are retrieved as plain text (as indicated by the text/plain mime type value).
  • Rather than using the data argument passed to the success() callback, the responseText property of the actual XMLHttpRequest that was used to do the request to the REST service. That's because in this case, we only want to display the literal contents of the file. The exact type of the data passed to the success() callback may vary depending on the Content-Type response header, which is not what we want right now since we're interested in displaying only the raw file contents. This is exactly why the success() callback (as well as the failure() for that matter) is passed the actual XMLHttpRequest object - to access any lower level properties that might require custom handling.

Offering a download link to the user

When a file is selected, the title of the content pane (right bottom) presents a link to the end user that allows them to download the contents of the file. The url that may be used to download a particular file can be generated using the getUrlForDownload() method of the Phile object.

In the sample application, the download link is created in the displayContents() function, which is called after selecting a file item:
      function displayContents(path, contents){
        var a = document.getElementById("contents-download");
        a.textContent = path;
        a.href = phile.getUrlForDownload(path);

        var contentsText = document.getElementById("contents-text");
        contentsText.innerHTML = escapeHtml(contents);
      }
The getUrlForDownload() method takes a single path argument, and returns a string that represents a url that can be used to download the file. As usual, the path argument may be either a string, or an array of path components.

Note that generating the download link only involves string manipulation, and does not entail calling a backend REST service. Rather, when the generated url is used as the href attribute for a html-<A> element, clicking that link will access the REST service and initiate a download from the server. Therefore, this method does not require or accept callbacks, since generating the download link is not an asynchronous process.

Creating and writing files

In the previous section, we discussed how you can use the createDirectory() method of the Phile object to create a new directory. The Phile object also features a saveFile() to create and write regular files.

The sample application offers a "New File" button that allows the user to create a new file in the current directory. Clicking the button will invoke the newFile() function, which invokes the saveFile() method on the Phile object. The relevant snippet is shown below:
    phile.saveFile({
      path: newPath,
      data: contents,
      success: function(xhr, options, data) {
        selected.lastChild.innerHTML = "";
        selected.setAttribute("data-state", "expanded");
        getChildrenForPath(path);
      },
      failure: failure
    });
The call to saveFile() is quite similar to the one made to createDirectory(): the file that is to be created is conveyed by means of the path property in the argument to saveFile(), and the contents of the file are passed in via the data property. Just like in the case of createDirectory(), the callback does not receive any data; it would have been useful to receive an object that represents the newly created file but alas.

The saveFile() method can also be used to overwrite the contents of an existing file. You can test this in the sample application by entering the name of an existing file. Please note that neither the saveFile() method itself, nor the sample application warn against overwriting an existing file, but a serious application should probably check and prompt the user in such a case.

The saveFile() method is implemented by making a HTTP PUT request to the /api/repo/files service.

Discarding, restoring and renaming files

Phile also offers methods for discarding, restoring and renaming files.

Discarding files

Discarding a file means it will be removed from the user's point of view. This can either mean it is moved to the trash folder, or permanently deleted from the repository. Phile supports both operations using a single discard() method.

In the sample application, both modes of the discard() method are demonstrated in the deleteSelected() function:
    function deleteSelected(){
      if (!selected) {
        return;
      }
      var request = {
        success: function(options, xhr, data) {
          selected.parentNode.removeChild(selected);
          loadTrash();
        },
        failure: failure
      };
      var message, permanent;
      var properties = getFileNodeProperties(selected);
      var path = properties.path;
      if (selected.parentNode.id === "trash-nodes") {
        message = "Are you sure you want to permanently remove ";
        request.permanent = true;
        request.id = properties.id;
      }
      else {
        request.path = path;
        message = "Are you sure you want to discard ";
      }
      if (!confirm(message + path + "?")) {
        return;
      }
      phile.discard(request);
    }
The function builds a single request object which is passed to the discard() method of the Phile object in the last line of the function. Depending on whether the currently selected item is in the trash or a regular file or directory, different properties are set on the request object:
  • If the item is in the trash, a permanent property is set to true to indicate that the item should be permanently removed from the repository. Note that the permanent property can always be specified, even if the item is not in the trash. It's just that the sample application was designed to only permantly remove items from the trash. In addition, the id property is set on the request object to which the value of the id property of the file object is assigned.
  • If the item is not in the trash (and is thus either a directory or a regular file), only a path is set on the request object.
The success() callback is also set on the request, which removes the item that corresponds to the removed file from the gui and refreshes the view of the trash. The callback does not receive any data. This makes sense in case the object was permanently removed, but in case of moving the item to the crash it would have been nice to receive the file object that represents the moved file.

The discard() method has a slightly more complex implementation than any other method in the Phile object.

Ultimately, calling the discard() method results in a HTTP PUT request to either the delete or the deletepermanent service of the /api/repo/files API. The choice for delete or deletepermanent is controlled by the value of the permanent property on the argument passed to discard():
  • If permanent is true, deletepermanent will be used and the item will be permanently removed from the repository.
  • If permanent is absent or at least, not true, delete will be used and the item will be moved to the current user's trash folder.
However, unlike most (all?) other services that manipulate files and directories, delete and deletepermanent require that the file or directory to operate on is specified by it's id. As you might recall from our description of the file object, this is a GUID that uniquely identifies any item within the repository. So, to make the discard() function behave more like the other methods of the Phile object, measures have been taken to allow the caller to specify the file either as a path, or with a id: if an id is specified, that will always be used. But if no id is specified, Phile will see if a path was specified, and use that to make a call to getProperties() in order to retrieve the corresponding file object to extract its id, and then make another call to discard() using that id.

We mentioned earlier that as a convenience, you can specify a file object via the file property instead of a path, in which case the path property would be taken from that file object. The discard() method can also accept a file object as a specification for the file to be discarded, but in that case discard() will directly use the id property of that file object.

Restoring from the trash

You can restory items from the trash with the restore() method. The sample application demonstrates its usage in the restoreSelected() function:
    function restoreSelected(){
      if (!selected) {
        return;
      }
      var properties = getFileNodeProperties(selected);
      phile.restore({
        file: properties,
        success: function(options, xhr, data){
          var path = properties.originalParentFolderPath + Phile.separator + properties.name;
          createFileTree(path);
          loadTrash();
        },
        failure: failure
      });
    }
Currently you can specify the item to be restored from the trash using either an id or a file option. The sample application uses the latter. Currently, it is not possible to specify the file to be restored by its path (but you're welcome to implement it and send me a pull request). However, I currently feel that is not really that big of a problem, and might actually be a little bit confusing since items in the trash have both a path and an originalParentFolderPath.

The actual implementation of the restore() method relies on doing a HTTP PUT request to the /api/repo/files/restore service.

Renaming files

The rename() method can be used to rename files or directories. The sample application demonstrates its use in the renameSelected function:
    function renameSelected(){
      if (!selected) {
        return;
      }
      var newName = prompt("Please enter a new name for this file.", "new name");
      if (newName === null) {
        alert("You did not enter a name. Action will be canceled.");
        return;
      }
      var path = getFileNodePath(selected);
      phile.rename({
        path: path,
        newName: newName,
        success: function(options, xhr, data){
          debugger;
        },
        failure: failure
      });
    }
As usual, the file to operate on can be specified with a path (or file) property. The new name for the item should be a string and can be specified via the newName property.

The actual implementation of the rename() method relies on doing a HTTP PUT request to the /api/repo/files/rename service. However, I'm experiencing a problem in that this always results in a HTTP 500 status (Internal server error). However, the actual rename action does succeed. I filed a bug to report this issue here: BISERVER-12695.

Finally...

I hope this post was useful to you. Feel free to leave a comment on this blog. If you have a specific question about the project, then please report an issue on the github issue tracker. Remember, your feedback is very welcome, and I will gladly consider your requests to improve Phile or fix bugs. And I'd be even happier to receive your pull requests.

Friday, April 24, 2015

MDX: Grouping on non-unique levels

Diethard Steiner, allround open source BI consultant recently tempted me to a MDX challenge:
I’ve got a question for you. It’s actually a simple question, but it doesn’t seem that simple to solve - unless I am missing something.

In the SteelWheelsSale Cube you’ll find hierarchy called Product, with levels Line - Vendor - Product.

The hierarchy is setup in such a way, that you can use e.g. Vendor on its own, but you will see duplicated values, because the key keeps the context to Line.

So imagine we cannot change this Schema. Our task is to show a unique list of Vendors (on their own, without any other hierarchy levels and without All). I googled a bit for such a solution, but there isn’t much showing up. One article focused on DISTINCT(), but this I guess doesn’t work, because our Vendor Level still keeps the context to Line (I added ORDER() to make it easier to spot the duplicates):
SELECT
NON EMPTY Measures.Sales ON COLUMNS,
NON EMPTY
  ORDER(
    Product.Vendor.Members
  , Product.CurrentMember.Properties("MEMBER_CAPTION")
  , BASC
  ) ON ROWS
FROM SteelWheelsSales
Is there some kind of function to break the Vendor out of the hierarchy context? I’d be interested in hearing your thoughts on this.

The data

If we take a moment to analyze Diethards query, we notice that it gets the Sales (which represents money transferred in a sales transaction) for each member in the Vendor level out of the Product hierarchy. In addition to selecting the Vendor members, Diethard's query uses the Order() function to sort the Vendor members by caption, ensuring the sales results for the same vendor appear subsequently in the result.

If we run the query (I'm using Pash for that) we get a result that looks like this:
Product Sales
Autoart Studio Design 153268.09
Autoart Studio Design 196781.21999999997
Autoart Studio Design 66023.59999999999
Autoart Studio Design 67592.24999999999
Autoart Studio Design 131108.81999999998
Autoart Studio Design 184868.24000000002
Carousel DieCast Legends 200123.57999999993
Carousel DieCast Legends 208583.22

...many more rows...

Welly Diecast Productions 136692.72
Welly Diecast Productions 145128.12
As you can see, thanks to the Order() function we can easily notice lots of results for what appears to be duplicate vendors. That's because of the structure of the product dimension in the SteelWheels sample data. The Product hierarchy has the levels Product Line, Vendor, and Product.

Here's how that looks in the Pentaho Analysis Editor (Phase):



From a MDX point of view, the vendors aren't really duplicates though, which would become clear if we would change the query to include the ancestors of the product vendors. Alternatively, in Pash we can print the result of the previous query using member names instead of member captions. Pash lets you do that by entering the following SET command:
MDX> SET MEMBER_PROPERTY NAME;
This command tells pash to use the MEMBER_NAME property rather than MEMBER_CAPTION to render the headers of the dataset. So when we re-execute the query, we get a result that looks like this:
Product [Measures].[Sales]
[Product].[Classic Cars].[Autoart Studio Design] 153268.09
[Product].[Motorcycles].[Autoart Studio Design] 196781.21999999997
[Product].[Planes].[Autoart Studio Design] 66023.59999999999
[Product].[Ships].[Autoart Studio Design] 67592.24999999999
[Product].[Trucks and Buses].[Autoart Studio Design] 131108.81999999998
[Product].[Vintage Cars].[Autoart Studio Design] 184868.24000000002
[Product].[Classic Cars].[Carousel DieCast Legends] 200123.57999999993
[Product].[Ships].[Carousel DieCast Legends] 208583.22

...many more rows...

[Product].[Trucks and Buses].[Welly Diecast Productions] 136692.72
[Product].[Vintage Cars].[Welly Diecast Productions] 145128.12

Cheating - modifying the cube and creating a Vendor dimension

Before digging into any solutions for Diethard's challenge, it is useful to point out that the entire problem would not have existed in the first place if the cube would have provided an alternate hierarchy for the vendor, with the levels Vendor, Product Line, Product.

If you think about it a little more, you could even question whether a vendor level makes sense at all in a product hierarchy. In some businesses, vendors deliver unique products, but in case of the SteelWheels sample this is not the case. The Vendor in this case is more like a shop, and clearly, many shops sell the same products.

So, what we really need is a separate Vendor dimension. I think this makes sense, since a Vendor is really a distinct kind of thing as compared to product. In fact, the concept of a Vendor is completely orthogonal to a Product and I think many business users would agree.

I don't know why the SteelWheels example was setup with a Vendor level midway the Product hierarchy. But it illustrates nicely why and what to refactor.

I don't know if there is a proper term for a hierarchy like our Product hierarchy, that mixes and mingles levels that deal with more than one entirely different concept within the same hierarchy. By lack of better terms, I will henceforth call this a bastard-hierarchy. By extension, the Vendor level is a bastard-level. I think the terms are appropriate, since the Vendor level appears in a line of ancestry where it really doesn't fit. (Plus, I find it relieving to cuss at situations I don't like.)

Fortunately with Phase we can really, obscenely quickly refactor this hierarchy without even messing up our original SteelWheels cube. Phase has a nifty clone-button which allows you to make a deep copy of just about any schema or schema element. We can use this to clone the SteelWheels schema, and within the cloned schema, clone the Product dimension. We can then rename it to "Vendor" and modify its hierarchy, removing the Product Line and Product levels and leaving only the Vendor level:
  1. In the treeview, click the "SteelWheels" schema to select it, and hit the clone button. That's the first button on the toolbar above the schema form. You now have a new schema called "SteelWheels1".
  2. Expand the new SteelWheels1" schema and expand the "SteelWheelsSales" cube to find the Product dimension.
  3. Click the "Product" dimension to select it, and hit the clone button again. You now have a new dimension called "Product1", which is also automatically selected
  4. In the form, change the name of the dimension from "Product1" to "Vendor".
  5. Remove the levels "Line" and "Product". To do that, select the level and click the button with the red X - the delete button.
  6. Hit the save button to save the new cloned and modified schema.
You should now have something like this:



We can now immediately try out the new schema and Vendor dimension design in Pash:
MDX> USE SteelWheels1;
Current catalog set to "SteelWheels1".
MDX> SELECT Measures.Sales ON COLUMNS,
   2        Vendor.Vendor.Members ON ROWS
   3 FROM   SteelWheelsSales;
And we'll get a result like this:
Vendor Sales
Autoart Studio Design 799642.2199999999
Carousel DieCast Legends 749795.7799999999
Classic Metal Creations 1023667.4800000001
Exoto Designs 879854.2200000001
Gearbox Collectibles 912923.6599999999
Highway 66 Mini Classics 747959.1799999999
Min Lin Diecast 764228.96
Motor City Art Classics 809277.5399999999
Red Start Diecast 730593.4400000001
Second Gear Diecast 857851.2500000001
Studio M Art Models 567335.9299999999
Unimax Art Galleries 971571.68
Welly Diecast Productions 831247.8400000001
If you take a moment to go back to the result of Diethard's initial query and manually calculate the sum of sales for all products sold by Vendor "Autoart Studio Design" then you'll notice that this query delivers the correct result.

Now that we have seen that this approach works we could consider making it permanent. We could overwrite the old SteelWheels schema, and we could optimize the Vendor dimension a little bit by marking the Vendor level as having unique members. Finally, after clearing it with the report authors we could clean up the original Product dimensions and remove the Vendor level from that hierarchy altogether. This is something that could even be done gradually - you could create a new Product hierarchy by cloning the old one, and removing the Vendor level only there, and then, once all reports are modified, remove the old Product hierarchy. All these options are open and up to you.

A first attempt: named sets and Aggregate()

The brief intermezzo that concludes the previous section is just to inform you that you should always at least consider whether any trouble you have retrieving the results you require are maybe due to the design of the schema. In this particular case I feel it a very clear cut case that we actually should change the cube design. Especially since the changes do not require any new database structures or ETL - all we need to do is add a logical definition to our cube, and we can do so without taking away the user's ability to navigate the data using the old Product hierarchy.

You do not always have the ability or authority to change the schema, but if you have, and you can make the business case for it, then you should in my opinion always take that route. The remainder of this blog however is what you can in case you're not in such a position. So lets get on with that.

I googled a bit and bumped into this question on stackoverflow by Travis: "How can I merge two members into one in a query?".

The answer provided by user findango is modeled after a typical "sales per country" example, and shows how to combine the sales of a group of selected countries and compare that as a whole to the group of all other countries. This seems quite appropriate, since what I want to do is merge all members at the Vendor level that happen to have the same "local" vendor name, regardless of their ancestry into one member that represents the vendor.

I adapted that idea to fit Diethard's challenge and came up with this solution:
WITH
SET [Set of Autoart Studio Design] AS {
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
}
MEMBER [Product].[Autoart Studio Design] AS Aggregate([Set of Autoart Studio Design])
SET [Set of Carousel DieCast Legends] AS {
  [Product].[Classic Cars].[Carousel DieCast Legends],
  [Product].[Ships].[Carousel DieCast Legends],
  [Product].[Trains].[Carousel DieCast Legends],
  [Product].[Trucks and Buses].[Carousel DieCast Legends],
  [Product].[Vintage Cars].[Carousel DieCast Legends]
}
MEMBER [Product].[Carousel DieCast Legends] AS Aggregate([Set of Carousel DieCast Legends])

...more SET and MEMBER clauses for the other vendors...
SET [Set of Welly Diecast Productions] AS { [Product].[Classic Cars].[Welly Diecast Productions], [Product].[Motorcycles].[Welly Diecast Productions], [Product].[Ships].[Welly Diecast Productions], [Product].[Trucks and Buses].[Welly Diecast Productions], [Product].[Vintage Cars].[Welly Diecast Productions] } MEMBER [Product].[Welly Diecast Productions] AS Aggregate([Set of Welly Diecast Productions]) SELECT [Measures].[Sales] ON COLUMNS, {[Product].[Autoart Studio Design] ,[Product].[Carousel DieCast Legends] ...names of other calculated members go here... ,[Product].[Welly Diecast Productions]} ON ROWS FROM SteelWheelsSales
This solution relies on two structural elements:
  1. A query-scoped named set for each Vendor grouping we'd like to see in our result. These named sets are constructed in the WITH-clause using the SET keyword. In the previous query, the definition of the sets themselves consist of a simple enumeration of member literals that we'd like to treat as a single group.
  2. For each of the named sets created in #1, a query-scoped calculated member that folds the members of each named set into a single new member. This is achieved by applying the Aggregate() function to the set. The Aggregate() function is passed the name of the set as first argument and then the calculated member acts as a new member that represents the set as a whole.
You might notice I marked up the Vendor name for Autoart Studio Design in bold in the previous query. I hope it helps you to reveal how this achieves a grouping of members that belong to the same Vendor. The same process applies to all other Vendors.

With these things in place, we can now select our measure on the COLUMNS axis, and put all of our caclulated members in a new set on the ROWS axis to get the required result, which looks something like this:

      
Product Sales
Autoart Studio Design 799642.2199999999
Carousel DieCast Legends 749795.7799999999
...more vendor sales results...
Unimax Art Galleries 971571.68
Welly Diecast Productions 831247.8400000001
You can crosscheck this result with the result we got from querying the sales over our Vendor dimension and you'll notice that they are identical (well, except for the caption, since we're still working with a Product hierarchy here, and not with a Vendor hierarchy). So, this certainly looks like we're on the right track.

Now, if you take a moment to analyze this query you might notice that we didn't really need to explicitly create a named set for each distinct vendor. The only really essential element is the calculated member based on the Aggregate() function, and instead of first creating a named set and then the calculated member that applies the Aggregate() function to it, we could've passed the definition of the set immediately as first argument to Aggregate().

For example, the calculated member [Product].[Autoart Studio Design] could just as well have been defined as
MEMBER [Product].[Autoart Studio Design] AS Aggregate({
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
})
That said, the explicitly named sets do help to clarify how the solution works by separating the grouping of the members from the actual aggregation of the measure.

Drawbacks

The obvious drawback to this approach is that it is not dynamic, and thus not flexible. There are at least two glaring sources of inflexibility:
  1. An explicit definition for each group. We only knew which named sets to create because we ran Diethard's original query and looked at the result. We had to manually de-deplicate the Vendor list and create an explicit named set for each of them.
  2. The enumeration of members for each group. Again we had to look at the query result to determine the composition of each named set.
If you're a little bit familiar with MDX, you might've noticed right away that the explicit enumeration of members for each Vendor set could've been written a lot smarter. Once we know the caption of each distinct Vendor, we can construct the named sets dynamically using the Filter() function.

The Filter() function takes a set as first argument, and a condition (a logical expression) as second argument. The condition is applied to each member in the set and the function returns a subset, containing only those members for which the condition holds true. So instead of:
WITH
SET [Set of Autoart Studio Design] AS {
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
}
We could have written:
WITH
SET [Set of Autoart Studio Design] AS 
    Filter(
      Product.Vendor.Members
    , Product.CurrentMember.Properties("MEMBER_CAPTION") = "Autoart Studio Design"
    )
So, instead of enumerating all the individual "Autoart Studio Design" members at the Vendor level, we write Product.Vendor.Members to take the entire set of members at the Vendor level, and then apply the condition Product.CurrentMember.Properties("MEMBER_CAPTION") = "Autoart Studio Design" to single out those members that belong to the particular vendor called "Autoart Studio Design".

This solution is surely better than what we had before: it is much, much less verbose, and more importantly, we now only need to have a list of unique vendors to construct our query, regardless of what members might or might not exist for that vendor. Constructing the set by explictly enumerating individual members is risky because we might accidentally leave out a member, or mix up members of different vendors in a single vendor group. More importantly: if the data changes in the future, and members are added for a particular vendor, our query will not be correct anymore. All these problems are solved by using a Filter() expression.

While Filter() allows us to solve one source of inflexibility, we are still stuck with regard to having to create a separate calculated member for each individual Vendor. The most important objection to enumerating all individual members that make up a set for one particular Vendor remains: by requiring advance knowledge of the list of unique vendors, our query is vulnerable to future data changes. Any vendors that might be added to the product dimension in the future will not be taken into account automatically by our query, and hence we run the risk of delivering incomplete (and thus, incorrect) results.

A more dynamic solution

I googled a bit more and ran into a fairly recent article on Richard Lees' blog, MDX - Aggregating by member_caption. In this article Richard explains how to aggregate over all cities with the same name in a geography dimension that has a country, state and a city level. So, quite similar to our Vendor problem!

Dynamically retrieving a unique list of Vendors

Unfortunately, Richard's code is way above my head. But I did manage to pick up one really neat idea from it: If we have a set of members ordered by vendor, then we can apply a Filter() such that members are retained only if their caption is not equal to that of the member that precedes it. In other words, we can filter the ordered set such that we keep only every first occurrence of a particular vendor.

This query does exactly that:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members
        , Product.CurrentMember.Properties("MEMBER_CAPTION")
        , BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <> 
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
SELECT  Measures.Sales ON COLUMNS
,       UniqueVendors ON ROWS
FROM    SteelWheelsSales
And the result:
Product [Measures].[Sales]
[Product].[Classic Cars].[Autoart Studio Design] 153268.09
[Product].[Classic Cars].[Carousel DieCast Legends] 200123.57999999993
[Product].[Classic Cars].[Classic Metal Creations] 742694.2000000002
[Product].[Classic Cars].[Exoto Designs] 265792.4400000001
[Product].[Classic Cars].[Gearbox Collectibles] 585119.6699999999
[Product].[Classic Cars].[Highway 66 Mini Classics] 190488.55000000002
[Product].[Classic Cars].[Min Lin Diecast] 335771.35000000003
[Product].[Classic Cars].[Motor City Art Classics] 120339.81000000003
[Product].[Classic Cars].[Red Start Diecast] 110501.80000000002
[Product].[Classic Cars].[Second Gear Diecast] 506032.90000000014
[Product].[Classic Cars].[Studio M Art Models] 128409.65999999996
[Product].[Classic Cars].[Unimax Art Galleries] 351828.50000000006
[Product].[Classic Cars].[Welly Diecast Productions] 401049.32000000007
If you analyze the results and compare them with Diethard's original query, you will notice that it does indeed report Sales for only the first occurrence of each Vendor (which coincidentally happen all to be children of the "Classic Cars" product line).

So, this is still only a partial solution, since we aren't currently getting the correct Sales figures. But it's an important step nonetheless, since this does give us a unique list of vendors, and it does so in a dynamic way. In other words, this might be the key to getting rid of our requirement to explicitly write code for each disctinct vendor.

This partial solution hinges on two elements:
  1. A set of members at the Vendor level that uses the Order() function to order the members according to their caption. We've seen this already in Diethard's original query. The only difference now is that we put this in a named set called OrderedVendors, instead of putting the Order() expression immediately on a query axis.
  2. A Filter() expression which uses an expression like OrderedVendors.Item(OrderedVendors.CurrentOrdinal) to compare the caption of the currently evaluated member from the OrderedVendors set with that of the previously evaluated member, which is captured using a similar but slightly different expression OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1)
The Item() function can be applied to a set to retrieve a particular tuple by ordinal position. The ordinal position is specified as argument to the Item() function. To retrieve the current tuple we apply CurrentOrdinal to the set. So OrderedVendors.Item(OrderedVendors.CurrentOrdinal) simply means: get us the current tuple from the OrderedVendors set. Since its tuples contain only one member, we can immediately apply Properties("MEMBER_CAPTION") to retrieve its caption.

Similarly, OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION") gets the caption of the previous member in the OrderedVendors set, because substracting 1 from the current ordinal means we are looking at the previous tuple. So, the entire expression:
Filter(
  OrderedVendors
, OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <> 
  OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
)
simply means: keep those members in the OrderedVendors set which happen to have a different caption than the previous member. And since the set was ordered by caption, this must mean we end up with only one member for each unique Vendor caption. (To be precise, we end up with only the first member for each distinct Vendor.)

Calculating totals for each unique Vendor

In order to calculate the correct totals for the Vendors we got in our previous result, we just have to add a calculated measure that takes the current value of the Vendor into account:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members,
          Product.CurrentMember.Properties("MEMBER_CAPTION"),
          BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <>
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
MEMBER  Measures.S
AS      SUM(
          Filter(
            OrderedVendors
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
SELECT  Measures.S ON COLUMNS
,       UniqueVendors ON ROWS
FROM    SteelWheelsSales
The intention of the Calculated measure is to get the SUM() of Measures.Sales, but only for those members at the Vendor level which happen to have a caption that is equal to the current member of our UniqueVendors set. If you look at the calculated measure, it looks quite logical: We filter the OrderedVendors set, which is a set of members at the Vendor level of the Product hierarchy. The expression Product.CurrentMember.Properties("MEMBER_CAPTION") is meant to refer to the current member of the OrderedVendors set in this Filter expression, and UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") is meant to refer to whatever member is current in the unique vendor set, and since we compare by caption, this should give us the set of all Vendor members with the same caption, for each unique caption.

The results of the query are:
Product [Measures].[S]
[Product].[Classic Cars].[Autoart Studio Design] 799642.2199999999
[Product].[Classic Cars].[Carousel DieCast Legends] 749795.7799999999
[Product].[Classic Cars].[Classic Metal Creations] 1023667.4800000001
[Product].[Classic Cars].[Exoto Designs] 879854.2200000001
[Product].[Classic Cars].[Gearbox Collectibles] 912923.6599999999
[Product].[Classic Cars].[Highway 66 Mini Classics] 747959.1799999999
[Product].[Classic Cars].[Min Lin Diecast] 764228.96
[Product].[Classic Cars].[Motor City Art Classics] 809277.5399999999
[Product].[Classic Cars].[Red Start Diecast] 730593.4400000001
[Product].[Classic Cars].[Second Gear Diecast] 857851.2500000001
[Product].[Classic Cars].[Studio M Art Models] 567335.9299999999
[Product].[Classic Cars].[Unimax Art Galleries] 971571.68
[Product].[Classic Cars].[Welly Diecast Productions] 831247.8400000001
If you compare it to our previous results you'll notice that this is indeed the correct result.

Now, even though it seems to work, and even though it does satisfy Diethard's challenge, there are a couple of things about this solution that aren't quite to my liking.
  1. For starters, the result is strange because the row headers are clearly a single member whereas the result is definitely not that of a single member. If all we care about is the label on the ROWS axis, then it doesn't matter, and indeed we wouldn't notice if we'd print the member captions instead of the full member name. But our initial solution based on Aggregate() was more pure in this respect, since that actually allowed us to explicitly create a new member to represent our group of Vendor members. (But of course, the drawback there was that we were unable to generate those groups dynamically)
  2. The solution with Aggregate() had another significant advantage: it knew automagically how to calculate the measure. Apparently Aggregate() is aware of the underlying aggregator that is used to define the measure, and without instruction it calculated the right result, whereas my last solution requires me to explicitly define SUM() to aggregate the values of the measure across the vendors. This might not seem a problem, but what if our measure was supposed to be aggregated by taking the average? In short, we have to have external knowledge about the nature of the measure and choose an appropriate aggregate function ourselves. I'd much prefer it if that could be avoided.
  3. Finally, what worries me about this solution is that, despite the explanation I gave of how it works, I don't really fully understand it.
If my explanation is correct, then I should be able to write:
MEMBER  Measures.S
AS      SUM(
          Filter(
            Product.Vendor.Members
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
instead of:
MEMBER  Measures.S
AS      SUM(
          Filter(
            OrderedVendors
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
After all, what difference does it make whether we filter Vendor level members out of an ordered or an ordered set?

Well, it turns out that it does make a difference: If I use Product.Vendor.Members instead of OrderedMembers then the measure just keeps repeating the value 799642.219999999, which is the total for the vendor "Autoart Studio Design".

Frankly I have zero clue why. I put up a question about this on Stackoverflow, so far, with not many encouraging answers. Please, chime on if you can shed some light on this. I would really appreciate any insights on this matter.

Aggregate(): Redux

So, first we had one problem, and no solutions. Now we have 2 solutions, and at least 3 problems, just different problems. Whether this can be called progess, I'd rather not decide. There is this proverb:
when you're in a hole, stop digging.
Blogs like these would not be written if I'd heed such sensible advice. Instead, I came up with yet another solution that, sort of, combines the elements of my current two solutions into something that is so terrific, we can truly call it the best worst of two worlds.

First, lets consider this summary of my two solutions:

Good Bad
1st solution (using Aggregate())
  • Proper grouping into explicit new members
  • Implicit calculation of measures
  • Completely static
  • Requires knowledge of vendors and vendor members in advance
2nd solution (using UniqueVendors)
  • Completely dynamic
  • Does not require knowledge in advance of vendors and its members
  • Strange grouping to first occurrence of Vendor
  • Explicit calculation of measures


So obviously we'd like to have a solution that has all of the good and none of the bad. The problem we have to overcome is the dynamic generation of calculated members to represent the custom grouping of our members to represent individual vendors. It turns out, there is a way. Just not in one query.

(Please, somebody, anybody, prove me wrong on this!)

What we can do though, is use a query that uses the essential elements of my second solution that generates output that is exactly like my first query. This result can then be run to obtain the desired result. In other words, we enter the domain of dynamic MDX or, with a really posh term, higher order MDX.

So, here goes:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members
        , Product.CurrentMember.Properties("MEMBER_CAPTION"),
          BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <>
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
MEMBER  Measures.S
AS      "WITH"||Chr(10)||
        Generate(
          UniqueVendors
        , "MEMBER Product.["||Product.CurrentMember.Properties("MEMBER_CAPTION")||"]"||Chr(10)||
          "AS Aggregate("||
            "Filter("||
            "  Product.Vendor.Members"||
            ", Product.CurrentMember.Properties('MEMBER_CAPTION') = "||
            "'"||Product.CurrentMember.Properties("MEMBER_CAPTION")||"'"||
            ")"||
          ")"
        , Chr(10)
        )
        ||Chr(10)||"SELECT Measures.Sales ON COLUMNS,"
        ||Chr(10)||"{"||
        Generate(
          UniqueVendors
        , "Product.["||Product.CurrentMember.Properties("MEMBER_CAPTION")||"]"
        , Chr(10)||","
        )
        ||"} ON ROWS"
        ||Chr(10)||"FROM SteelWheelsSales"
SELECT  Measures.S ON COLUMNS
FROM    SteelWheelsSales
The query might be easier to analyze if you see its exact result:
WITH
MEMBER Product.[Autoart Studio Design]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Autoart Studio Design'))
MEMBER Product.[Carousel DieCast Legends]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Carousel DieCast Legends'))
MEMBER Product.[Classic Metal Creations]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Classic Metal Creations'))
MEMBER Product.[Exoto Designs]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Exoto Designs'))
MEMBER Product.[Gearbox Collectibles]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Gearbox Collectibles'))
MEMBER Product.[Highway 66 Mini Classics]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Highway 66 Mini Classics'))
MEMBER Product.[Min Lin Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Min Lin Diecast'))
MEMBER Product.[Motor City Art Classics]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Motor City Art Classics'))
MEMBER Product.[Red Start Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Red Start Diecast'))
MEMBER Product.[Second Gear Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Second Gear Diecast'))
MEMBER Product.[Studio M Art Models]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Studio M Art Models'))
MEMBER Product.[Unimax Art Galleries]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Unimax Art Galleries'))
MEMBER Product.[Welly Diecast Productions]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Welly Diecast Productions'))
SELECT Measures.Sales ON COLUMNS,
{Product.[Autoart Studio Design]
,Product.[Carousel DieCast Legends]
,Product.[Classic Metal Creations]
,Product.[Exoto Designs]
,Product.[Gearbox Collectibles]
,Product.[Highway 66 Mini Classics]
,Product.[Min Lin Diecast]
,Product.[Motor City Art Classics]
,Product.[Red Start Diecast]
,Product.[Second Gear Diecast]
,Product.[Studio M Art Models]
,Product.[Unimax Art Galleries]
,Product.[Welly Diecast Productions]} ON ROWS
FROM SteelWheelsSales
The generating query relies on the Generate() function, which I discussed in my previous blog post, MDX: retrieving the entire hierarchy path with Ancestors(). The Generate() function is used twice, both over the UniqueVendors set. The first Generate() function is used to create the code that defines the calculated members, which serve to group the Vendor members based on caption. The second Generate() function generates the code that defines the set which references these calculated members and which appears on the ROWS axis of the generated query.

There's a few elements in this query that might or might not be familiar:
  • String constants, which are denoted using either double or single quotes
  • The String concatenation operator ||. Note that this is Mondrian specific syntax - The MDX standard defines the plus sign (+) as string concatenation operator. (If someone could point out the proper way to discover which operator is used for string concatenation, I'd be really grateful!) You can easily rewrite this query to standard MDX by replacing each occurrence of || with +.
  • The Chr() function, which MDX inherits from VBA generates a character that corresponds to the character code passed as argument. The generator uses Chr(10) to create newlines in the generated query.
You might recall that I discussed a number of variants of my first query. The version generated here is the most compact one. It doesn't bother to generate separate named sets for the Vendors, and it uses the Filter() expression to define the contents of the named set instead of enumerating all the individual members for the Vendors.

It is quite easy to rewrite the query so that it generates one of the other forms of the query. In particular, it might be useful to generate a query that enumerates the members rather than using the Filter() function. You might recall I initially argued against that on the basis that it is error prone and not resilient to future data changes, but since we can now generate a correct and up to date version of the query anytime, these objections are lifted.

To obtain the query such that it enumerates all members explicitly, one might use the SetToStr() function, which I also discussed briefly in my previous blog. Such a solution would something like this to generate the Aggregate() expressions for the calculated members:
"AS Aggregate("||
  SetToStr(
    Filter(
      Product.Vendor.Members
    , Product.CurrentMember.Properties("MEMBER_CAPTION") = 
      UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION")
    )
  )||")"

Conclusion

If you have a requirement like Diethard's, you really should first consider if you can achieve it by refactoring the schema. Remember, this is not only about making life easy for the MDX author; If there really is a need to make top-level groupings on lower levels of a hierarchy, you might be dealing with a bastard-hierarchy, and the general state of things will be much improved if you put it in a different hierarchy.

There may be other cases where there is a need to make groupings on members of non-uniqe levels. An example that comes to mind is querying Sales quarters against years. This is different from our Vendor example for two reasons. Quarters and Years clearly could very well belong to the same hierarchy (i.e., we're not dealing with a bastard-hierarchy in this case). But if we accept that, we then have to figure out how to put members from the same hierarchy on two different axes of a MDX query. This is a completely different kettle of fish.

However, it is still good to know that this situation too, could, in principle, be solved by creating two separate hierarchies: one with Year as top level and one with the Quarter as top level. (And this is a design that I have observed.)

If it is not possible or appropriate to change or add the hierarchy, you're going to have to work your way around it. I offered three different solutions that can help you out. None of them is perfect, and I did my best to point out the advantages and disadvantages of each method. I hope you find this information useful and I hope it will help you decide how to meet your particular requirements.

Finally, as always, I happily welcome any comments, critique and suggestions. I'm still quite new to MDX so it is entirely possible that I missed a solution or that my solutions could be simplified. I would be very grateful if you could point it out so I can learn from your insights.