
Tuesday, March 20, 2012

A Generic Normalizer step for Kettle

Abstract

Kettle (a.k.a. Pentaho Data Integration) offers the standard Row Normalizer step to "unpivot" columns to rows. However, this step requires manual configuration and presumes that the structure of its input stream is known in advance. In this post, I explain how to construct a simple User Defined Java Class step that implements a generic Row Normalizer, capable of unpivoting an arbitrary input stream without manual configuration.

The Row Normalizer step


Kettle (a.k.a. Pentaho Data Integration) offers a standard step to "unpivot" columns to rows. This step is called the Row Normalizer. You can see it in action in the screenshot below:

In the screenshot, the input is a table of columns id, first name, and last name. The output is a table of columns id, fieldname, and value. The id column is preserved, but for each row coming from the input stream, two rows are created in the output stream: 1 for the first name field, and 1 for the last name field.

Essentially, the Row Normalizer step in this example is configured to treat the first name and last name fields as a repeating group. The repeating group is untangled by dumping all values for either field into the value column. The fieldname column is used to mark the kind of value: some values are of the "first name field" kind (in case they came from the original first name input field), and some are of the "last name field" kind (when they derive from the last name input field). A made-up single-row example is shown below.
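Assuming a single input row with id 1, first name "John", and last name "Doe" (the data in the actual screenshot may differ), the step turns this input:

+----+------------+-----------+
| id | first name | last name |
+----+------------+-----------+
|  1 | John       | Doe       |
+----+------------+-----------+

into this output:

+----+------------------+-------+
| id | fieldname        | value |
+----+------------------+-------+
|  1 | first name field | John  |
|  1 | last name field  | Doe   |
+----+------------------+-------+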

There are several use cases for the operation performed by the Row normalizer step. It could be used to break down a genuine repeating group in order to create a more normalized dataset. Or you might need to convert a relational dataset into a graph consisting of subject-predicate-object tuples for loading a triple store. Or maybe you want to turn a table into a fine-grained stream of changes for auditing.

The problem

The Row Normalizer step works great for streams whose structure is known in advance. The structure needs to be known in advance because the step configuration requires you to explicitly specify the fields that make up the repeating group, so they can be broken out into separate kinds.

Sometimes, you don't know the structure of the input stream in advance, or it is just too inconvenient to specify it manually. In these cases, you'd really wish you could somehow unpivot any field that happens to be part of the input stream. In other words, you'd need a generic Row Normalizer step.

The Solution

In Kettle, there's always a solution, and often more than one. Here, I'd like to present a solution to dynamically unpivot an arbitrary input stream using a User Defined Java Class step.

Below is a screenshot of the step configuration:

This configuration allows the step to take an arbitrary input stream and normalize it into a collection of triples consisting of:
  1. A rownum column. This column delivers generated integer values, and each distinct value uniquely identifies a single input row.
  2. A fieldnum column. This is a generated integer value that uniquely identifies a field within each input row.
  3. A value column. This is a string column that contains the value that appears in the field identified by the fieldnum column, within the row identified by the rownum value.

The Row Normalizer versus the UDJC generic Normalizer

For the input data set mentioned in the initial example, the output generated by this UDJC step is shown below:
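(Using the same made-up "John Doe" input row as before, the output would look something like this:)

+--------+----------+-------+
| rownum | fieldnum | value |
+--------+----------+-------+
|      1 |        1 | 1     |
|      1 |        2 | John  |
|      1 |        3 | Doe   |
+--------+----------+-------+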
There are a few differences with regard to the output of Kettle's Row Normalizer step:
  1. One obvious difference is that the Row Normalizer step has the ability to attach names to the values, whereas the UDJC step only delivers a generated field position. On the one hand, it's really nice to have field names. On the other hand, this is also one of the weaknesses of the Row Normalizer step, because providing the names must be done manually.
  2. Another difference is that the UDJC step delivers 3 output rows for each input row, instead of the 2 rows delivered by the Row Normalizer step. The "extra" row is due to the id column. Because the id column is the key of the original input data, the Row Normalizer step was configured to only unpivot the first name and last name fields, keeping the id field unscathed: this allows any downstream steps to see which fields belong to which row. The UDJC step however does not know which field or fields form the key of the input stream. Instead, it generates its own key, the rownum field, and the id field is simply treated like any other field and unpivoted, just like the first name and last name fields. So the main difference is that the downstream steps need to use the generated rownum field to see which fields belong to which row.

The Code

The code and comments are pretty straightforward:
static long rownum = 0;
static RowMetaInterface inputRowMeta;
static long numFields;
static String[] fieldNames;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    // get the current row
    Object[] r = getRow();

    // If the row object is null, we are done processing.
    if (r == null) {
        setOutputDone();
        return false;
    }

    // If this is the first row, cache some metadata.
    // We will reuse this metadata in processing the rest of the rows.
    if (first) {
        first = false; // make sure we only initialize once
        inputRowMeta = getInputRowMeta();
        fieldNames = inputRowMeta.getFieldNames();
        numFields = fieldNames.length;
    }

    // Generate a new id number for the current row.
    rownum += 1;

    // Generate one output row for each field in the input stream.
    int fieldnum;
    for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
        Object[] outputRow = new Object[3];
        outputRow[0] = rownum;
        // Assign the field id. Note that we need to cast to long to match Kettle's type system.
        outputRow[1] = (long)fieldnum + 1;
        outputRow[2] = inputRowMeta.getString(r, fieldnum);
        putRow(data.outputRowMeta, outputRow);
    }

    return true;
}

Getting Field information

So the UDJC step only generates a number to identify the field. For some purposes, it may be useful to pull in other information about the fields, like their name, data type, or data format. While we could also do this directly in the UDJC step by writing more Java code, it is easier and more flexible to use some of Kettle's built-in steps:
  1. The Get Metadata Structure step. This step takes an input stream, and generates one row for each distinct field. Each of these rows has a number of columns that describe the field from the input stream. One of these columns is a Position field, which uniquely identifies each field from the input stream using a generated integer, just like the fieldnum field from our UDJC step does.
  2. The Stream Lookup step. This step allows us to combine the output stream of our UDJC step with the output of the Get Metadata Structure step. By matching the Position field of the Get Metadata Structure step with the fieldnum field of the UDJC step, we can look up any metadata fields that we happen to find useful.


Below is a screenshot that shows how all these steps work together:
And here endeth the lesson.

Running Pentaho BI Server behind a proxy

To whom it may concern - a quick hands-on guide for running the Pentaho BI server behind a proxy

Prerequisites


This post assumes you're running Ubuntu Linux (or at least a Debian-based distribution) and that you have both the Apache HTTP server and the Pentaho BI server installed.

Apache HTTP Server


If you haven't got apache installed, this is your line:
$ sudo apt-get install apache2

You can then control the Apache HTTP server using the apache2ctl script. For instance, to start it, do:
$ sudo apache2ctl start

Once it's started you can navigate to its homepage to verify that it is running:
http://localhost/

You can stop it by running
$ sudo apache2ctl stop

If you change Apache's configuration, you need to restart it for the changes to take effect:
$ sudo apache2ctl restart

Java


Pentaho relies on Java. If not installed already you can get it like this:
$ sudo apt-get install openjdk-6-jdk

Pentaho BI Server


If you haven't got the Pentaho BI Server, download the latest version from sourceforge, and unpack the archive in some location you find convenient. (For development purposes I simply keep and run it in a subdirectory of my home directory)

You can start the Pentaho BI Server by cd-ing into the biserver-ce directory and then running:
$ ./start-pentaho.sh

You can then navigate to its homepage:
http://localhost:8080/pentaho/Home

(Simply navigating to http://localhost:8080 will automatically redirect you there too).

It can be useful to monitor the log while it's running:
$ tail -f tomcat/logs/catalina.out

If you want to change something in Pentaho's configuration, you need to stop the server and then restart it. This is done by running:
$ ./stop-pentaho.sh

Configuring Proxy support for Apache


Boris Kuzmanovic wrote an excellent post about setting up proxy support for Apache. My summary (and adjustment) follows below.

First, change the apache configuration to load the required proxy modules:
$ sudo a2enmod proxy
$ sudo a2enmod proxy_http

Then, edit any site definitions to use the proxy. I just modified the default site definition:
$ sudo geany /etc/apache2/sites-enabled/000-default


Inside the <VirtualHost> section, I added these snippets immediately above the </VirtualHost> that ends the section:

<Location /pentaho/>
ProxyPass http://localhost:8080/pentaho/
ProxyPassReverse http://localhost:8080/pentaho/
SetEnv proxy-chain-auth
</Location>

<Location /pentaho-style/>
ProxyPass http://localhost:8080/pentaho-style/
ProxyPassReverse http://localhost:8080/pentaho-style/
SetEnv proxy-chain-auth
</Location>


After making these changes, we need to restart apache:
$ sudo apache2ctl restart

Requests for these two locations are now effectively tunneled to the respective locations on the Pentaho BI Server, and the responses are passed back to the original client.
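You can quickly verify that the proxying works by requesting a Pentaho page through Apache on port 80 (assuming you have curl installed; a regular browser works just as well):
$ curl -I http://localhost/pentaho/Home

The response headers you get back should be the ones produced by the Pentaho BI Server, even though the request went to Apache.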

Using mod_proxy_ajp instead of proxy_http


While the regular HTTP proxy simply works, there is a better, more tightly integrated solution. The regular HTTP proxy handles HTTP requests received by the Apache HTTP server by sending a new, equivalent HTTP request through to the Tomcat server. Likewise, Tomcat's HTTP response is then sent back as a new, equivalent HTTP response to the source of the original request.

So, that's twice a transport over HTTP.

Things can be improved by routing the incoming HTTP request to the Tomcat server using a binary protocol called AJP (the Apache JServ Protocol). (For a detailed comparison, see this excellent comparison between HTTP/HTTPS and AJP.)

Fortunately, the steps to setup an AJP proxy are almost identical to those for setting up a regular HTTP proxy. First, enable the ajp proxy module:
$ sudo a2enmod proxy
$ sudo a2enmod proxy_ajp

(Note that the proxy module was already enabled as part of setting up the regular HTTP proxy; the line is repeated here for completeness, but it is not necessary if you already completed those steps. You can enable either or both of the proxy_http and proxy_ajp modules, and both require the proxy module.)

Then, we edit the site configuration again to use the proxy. Since the locations /pentaho/ and /pentaho-style/ were already used, we first comment those out:

#<Location /pentaho/>
# ProxyPass http://localhost:8080/pentaho/
# ProxyPassReverse http://localhost:8080/pentaho/
# SetEnv proxy-chain-auth
#</Location>

#<Location /pentaho-style/>
# ProxyPass http://localhost:8080/pentaho-style/
# ProxyPassReverse http://localhost:8080/pentaho-style/
# SetEnv proxy-chain-auth
#</Location>


Then we add equivalent lines going via the AJP proxy:

ProxyPass /pentaho ajp://localhost:8009/pentaho
ProxyPassReverse /pentaho ajp://localhost:8009/pentaho

ProxyPass /pentaho-style ajp://localhost:8009/pentaho-style
ProxyPassReverse /pentaho-style ajp://localhost:8009/pentaho-style

(The bit that goes ajp://localhost:8009 refers to the AJP service that Tomcat runs on port 8009 by default.)
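If you want to double-check that the AJP connector is actually enabled on the Tomcat side, look in tomcat/conf/server.xml for a connector definition like the following (this is what the default configuration typically looks like; the exact attributes may vary per version):

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />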

Again we have to restart the apache service for the changes to take effect:
$ sudo apache2ctl restart

Acknowledgements


Thanks to Paul Stöllberger, Pedro Alves and Tom Barber for valuable feedback and background information regarding AJP.

Thursday, December 01, 2011

Common Schema: dependencies routines

Are you a MySQL DBA? Check out the common_schema project by Oracle ACE Shlomi Noach.

common_schema is an open source MySQL schema that packs a number of utility views, functions, and stored procedures. You can use these utilities to simplify MySQL database administration and development. Shlomi just released revision 178, and I'm happy and proud to be working together with him on this project.

Among the many cool features created by Shlomi, such as foreach, repeat_exec and exec_file, there are a few %_dependencies procedures I contributed:

  • get_event_dependencies(schema_name, event_name)

  • get_routine_dependencies(schema_name, routine_name)

  • get_sql_dependencies(sql, default_schema)

  • get_view_dependencies(schema_name, view_name)

All these procedures return a resultset that indicates which schema objects are used by the object identified by the input parameters. Here are a few examples that should give you an idea:

mysql> call common_schema.get_routine_dependencies('common_schema', 'get_routine_dependencies');
+---------------+----------------------+-------------+--------+
| schema_name   | object_name          | object_type | action |
+---------------+----------------------+-------------+--------+
| common_schema | get_sql_dependencies | procedure   | call   |
| mysql         | proc                 | table       | select |
+---------------+----------------------+-------------+--------+
2 rows in set (0.19 sec)

Query OK, 0 rows affected (0.19 sec)

mysql> call common_schema.get_routine_dependencies('common_schema', 'get_sql_dependencies');
+---------------+-------------------+-------------+--------+
| schema_name   | object_name       | object_type | action |
+---------------+-------------------+-------------+--------+
| common_schema | _get_sql_token    | procedure   | call   |
| common_schema | _sql_dependencies | table       | create |
| common_schema | _sql_dependencies | table       | drop   |
| common_schema | _sql_dependencies | table       | insert |
| common_schema | _sql_dependencies | table       | select |
+---------------+-------------------+-------------+--------+
5 rows in set (1.59 sec)

Of course, there's always a lot to be desired. The main shortcoming, as I see it now, is that the dependencies are listed only one level deep: that is, the dependencies are not recursively analyzed. Another problem is that there is currently nothing to calculate reverse dependencies (which would arguably be more useful).

The good news is, this is all open source, and your contributions are welcome! If you're interested in the source code of these routines, check out the common_schema project and look in the common_schema/routines/dependencies directory.

If you'd like to add recursive dependencies, or reverse dependencies, then don't hesitate to contribute. If you have a one-off contribution that relates directly to these dependencies routines, it's probably easiest if you email me directly, and I'll see what I can do to get it in. If you are interested in a more long-term contribution, it's probably best if you write Shlomi, as he is the owner of the common_schema project.

You can even contribute without implementing new features or fixing bugs: simply use the software, report bugs, and offer suggestions to improve it. If you found a bug, or have an idea for an improvement or an entirely new feature, please use the issue tracker.

For now, enjoy, and until next time.

Friday, October 21, 2011

MySQL Hacks: Preventing deletion of specific rows

Recently, someone emailed me:
I have a requirement in MYSQL as follows:
we have a table EMP and we have to restrict the users not delete employees with DEPT_ID = 10. If user executes a DELETE statement without giving any WHERE condition all the rows should be deleted except those with DEPT_ID = 10.

We are trying to write a BEFORE DELETE trigger but we are not able to get this functionality.

I have seen your blog where you explained about Using an UDF to Raise Errors from inside MySQL Procedures and/or Triggers. Will it helps me to get this functionality? Could you suggest if we have any other alternatives to do this as well?
Frankly, I usually refer people that write me these things to a public forum, but this time I felt like giving it a go. I figured it would be nice to share my solution, and I'm also curious whether others have found different solutions.

(Oh, I should point out that I haven't asked what the underlying reasons are for this somewhat extraordinary requirement. I normally would do that if I were confronted with such a requirement in a professional setting. In this case, I'm only interested in finding a crazy hack.)

Attempt 1: Re-insert deleted rows with a trigger

My first suggestion was:
Raising the error won't help you achieve your goal: as soon as you raise the error, the statement will either abort (in case of a non-transactional table) or roll back all row changes made up to raising the error (in case of a transactional table)

Although I find the requirement strange, here's a trick you could try:

write an AFTER DELETE FOR EACH ROW trigger that re-inserts the rows back into the table in case the condition DEPT_ID = 10 is met.

Hope this helps...

Alas, I should've actually tried it myself before replying, because it doesn't work. If you do try it, a DELETE results in this runtime error:
Can't update table 'emp' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.
This is also known as "the mutating table problem".
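For the record, the naive trigger I had in mind would be something along these lines, using the same toy table t (id, dept_id) that I'll define in the next attempt. It can be created just fine; the error only appears when a DELETE actually fires it (drop it again with DROP TRIGGER adr_t before moving on):

DELIMITER //

CREATE TRIGGER adr_t
AFTER DELETE ON t
FOR EACH ROW
IF old.dept_id = 10 THEN
  -- re-insert the deleted row into the very table that fired this trigger;
  -- this is exactly what the mutating table restriction forbids at runtime
  INSERT INTO t VALUES (old.id, old.dept_id);
END IF;
//

DELIMITER ;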

Attempt 2: Re-insert deleted rows into a FEDERATED table

As it turns out, there is a workaround that meets all of the original demands. The workaround relies on the FEDERATED storage engine, which we can use to trick MySQL into thinking we're manipulating a different table than the one that fires the trigger. My first attempt went something like this:

CREATE TABLE t (
  id INT AUTO_INCREMENT PRIMARY KEY,
  dept_id INT,
  INDEX(dept_id)
);

CREATE TABLE federated_t (
  id INT AUTO_INCREMENT PRIMARY KEY,
  dept_id INT,
  INDEX(dept_id)
)
ENGINE FEDERATED
CONNECTION = 'mysql://root@localhost:3306/test/t';

DELIMITER //

CREATE TRIGGER adr_t
AFTER DELETE ON t
FOR EACH ROW
IF old.dept_id = 10 THEN
  INSERT INTO federated_t
  VALUES (old.id, old.dept_id);
END IF;
//

DELIMITER ;
So the idea is to let the trigger re-insert the deleted rows back into the federated table, which in turn points to the original table that fired the trigger to fool MySQL into thinking it isn't touching the mutating table. Although this does prevent one from deleting any rows that satisfy the DEPT_ID = 10 condition, it does not work as intended:

mysql> INSERT INTO t VALUES (1,10), (2,20), (3,30);
Query OK, 3 rows affected (0.11 sec)

mysql> DELETE FROM t;
ERROR 1159 (08S01): Got timeout reading communication packets

mysql> SELECT * FROM t;
+----+---------+
| id | dept_id |
+----+---------+
|  1 |      10 |
|  2 |      20 |
|  3 |      30 |
+----+---------+
3 rows in set (0.00 sec)
At this point, I can only make an educated guess about the actual underlying reason for this failure. It could be that the deletion is locking the rows or even the table, thereby blocking the insert into the federated table until we get a timeout. Or maybe MySQL enters into an infinite loop of deletions and insertions until we hit a timeout. I didn't investigate, so I don't know, but it seems clear this naive solution doesn't solve the problem.

Attempt 3: Deleting from the FEDERATED table and re-inserting into the underlying table

It turns out that we can solve it with a FEDERATED table by turning the problem around: Instead of manipulating the original table, we can INSERT and DELETE from the FEDERATED table, and have an AFTER DELETE trigger on the FEDERATED table re-insert the deleted rows back into the original table:

DROP TRIGGER adr_t;

DELIMITER //

CREATE TRIGGER adr_federated_t
AFTER DELETE ON federated_t
FOR EACH ROW
IF old.dept_id = 10 THEN
  INSERT INTO t
  VALUES (old.id, old.dept_id);
END IF;
//

DELIMITER ;
Now, the DELETE does work as intended:

mysql> DELETE FROM federated_t;
Query OK, 3 rows affected (0.14 sec)

mysql> SELECT * FROM federated_t;
+----+---------+
| id | dept_id |
+----+---------+
|  1 |      10 |
+----+---------+
1 row in set (0.00 sec)
Of course, to actually use this solution, one would grant applications access only to the federated table, and "hide" the underlying table so the trigger can't be bypassed by deleting rows from it directly. A sketch of such grants is shown below.
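For example, a privilege setup along these lines might do (the account name and password are of course hypothetical, and on MySQL 5.x a GRANT like this implicitly creates the account):

-- the application account gets to touch only the federated table...
GRANT SELECT, INSERT, UPDATE, DELETE
ON test.federated_t
TO 'app_user'@'%' IDENTIFIED BY 'secret';
-- ...and receives no privileges at all on the underlying table test.t,
-- so it cannot bypass the trigger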

Now, even though this solution does seem to fit the original requirements, I would not recommend it for several reasons:
  • It uses the FEDERATED storage engine, which hasn't been well supported. For that reason, it isn't enabled by default, and you need access to the MySQL configuration to enable it, limiting the applicability of this solution. Also, you could run into some nasty performance problems with the FEDERATED storage engine.
  • The solution relies on a trigger. In MySQL, triggers can really limit performance.
  • Perhaps the most important reason is that this solution performs "magic" by altering the behaviour of SQL statements. Arguably, this is not so much the fault of the solution as it is of the original requirement.

An Alternative without relying on magic: a foreign key constraint

If I were to encounter the original requirement in a professional situation, I would argue that we should not desire to alter the semantics of SQL commands. If we tell the RDBMS to delete all rows from a table, it should either succeed and result in all rows being deleted, or it should fail and fail completely, leaving the data unchanged.

So how would we go about implementing a solution for this changed requirement?

We certainly could try the approach that was suggested in the original request: create a trigger that raises an exception whenever we find the row should not be deleted. However, this would still rely on a trigger (which is slow). And if you're not on MySQL 5.5 (or higher), you would have to use one of the ugly hacks to raise an exception.

As it turns out, there is a very simple solution that does not rely on triggers. We can create a "guard table" that references the table we want to protect using a foreign key constraint:

mysql> CREATE TABLE t_guard (
-> dept_id INT PRIMARY KEY,
-> FOREIGN KEY (dept_id)
-> REFERENCES t(dept_id)
-> );
Query OK, 0 rows affected (0.11 sec)

mysql> INSERT INTO t_guard values (10);
Query OK, 1 row affected (0.08 sec)

mysql> DELETE FROM t;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t_guard`, CONSTRAINT `t_guard_ibfk_1` FOREIGN KEY (`dept_id`) REFERENCES `t` (`dept_id`))
mysql> DELETE FROM t WHERE dept_id != 10;
Query OK, 2 rows affected (0.05 sec)
(Like in the prior example with the federated table, the guard table would not be accessible to the application, and the "guard rows" would have to be inserted by a privileged user)

Finally: what a quirky foreign key constraint!

You might have noticed that there's something quite peculiar about the foreign key constraint: typically, foreign key constraints serve to relate "child" rows to their respective "parent" row. To do that, the foreign key would typically point to a column (or set of columns) that makes up either the primary key or a unique constraint in the parent table. But in this case, the referenced column dept_id in the t table is contained only in an index which is not unique. Strange as it may seem, this is allowed by MySQL (or rather, InnoDB). In this particular case, this flexibility (or is it a bug?) serves us quite well: it allows us to guard many rows in the t table with dept_id = 10 with just a single row in the guard table.
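One consequence worth spelling out: if a privileged user ever does need to delete the protected rows, the guard row has to be removed first. And because the guard row is itself a child of t, it can only be re-inserted once t again contains at least one row with dept_id = 10. A quick sketch, using the tables from the example:

-- as a privileged user: lift the guard, then delete everything
DELETE FROM t_guard WHERE dept_id = 10;
DELETE FROM t;
-- re-inserting the guard row at this point fails with ERROR 1452,
-- since no parent row with dept_id = 10 exists in t anymore
INSERT INTO t_guard VALUES (10);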

Friday, October 07, 2011

Fighting Spam: Word Verification

Hi All,

this is a quick note to let you know that from now on, commenters on this blog will need to complete a word verification (captcha) step.

Personally, I regret to have to take this measure. Let me explain why I'm doing it anyway.

For the past three months or so, moderating comments on this blog has become a real drag due to a surge in anonymous spam. While Blogger's spam detection is quite good, I still get notification mails prompting me to moderate. I feel this is consuming more of my time than it's worth.

Except for requiring word verification, other policies (or lack thereof) are still in effect: all comments are moderated, but anyone can comment, even anonymously. In practice, all real comments get published - even negative or derogatory ones (should I receive them).

Sorry for the inconvenience, but I hope you'll understand.

Wednesday, August 24, 2011

Re-implementing udf_init_error in MySQL 5.5 and up

To whom it may concern -

Today, I received an email from a user of the udf_init_error UDF (which resides in the lib_mysqludf_udf library). The purpose of this UDF is to generate an error condition, which can be used to abruptly terminate a trigger or stored procedure. As such it is a workaround for bug #11661. This is all described extensively in my now ancient article here.

The user wrote me because of a problem experienced in MySQL 5.5:
...calling
select udf_init_error('Transaction Cannot Be Done Because....');
will return user friendly error message:
Transaction Cannot Be Done Because....
. But in MySQL 5.5, it returns
Can't initialize function 'udf_init_error'; Transaction Cannot Be Done Because....
The Can't initialize function 'udf_init_error'; bit is so annoying! How can I get rid of that?
I explained that the UDF still works like it should; it's just that at some point during the 5.0 lifecycle, the format of the error message was changed. (I can't recall exactly which version that was, but I did file bug #38452 that describes this issue).

Anyway, I suggested moving away from the udf_init_error() UDF, and porting all dependent code to use the SIGNAL syntax instead, which was introduced in MySQL 5.5. (For a friendly introduction to using the SIGNAL syntax, please check out one of my prior articles.)

Unfortunately, for this particular user this would not be an easy task:
The use of SIGNAL did come to my mind, but the implementation is not easy. I have thousands of stored routines to modify. Besides, I'm already satisfied with what the UDF does.
On the one hand, it makes me happy to hear the udf_init_error() UDF served him so well that he wrote so many routines that rely on it; on the other hand, I feel bad that this is holding him back from upgrading to MySQL 5.5.

For everybody that is in this same position, I'd like to suggest the following solution: simply re-implement udf_init_error() as a stored SQL function that uses the SIGNAL functionality instead. The error message returned to the client will not be exactly the same as in the olden MySQL 5.0 days, but at least there will not be an annoying complaint about a UDF that cannot be initialized.

Here's a very simple example that illustrates how to do it:
DELIMITER //

CREATE FUNCTION udf_init_error(
  p_message VARCHAR(80)
)
RETURNS INTEGER
DETERMINISTIC
NO SQL
BEGIN
  DECLARE err CONDITION FOR SQLSTATE '45000';
  SIGNAL err SET MESSAGE_TEXT = p_message;
  RETURN 1;
END;
//

DELIMITER ;
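With this stored function in place, existing call sites like SELECT udf_init_error('...') don't have to change at all; the error simply arrives in the SIGNAL format. On MySQL 5.5, I'd expect it to look like this (1644 is the generic error number for user-defined SIGNAL conditions):

mysql> SELECT udf_init_error('Transaction Cannot Be Done Because....');
ERROR 1644 (45000): Transaction Cannot Be Done Because....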
I hope this helps.

Monday, August 15, 2011

Proposals for Codebits.EU

Codebits is an annual 3-day conference about software and, well, code. It's organized by SAPO and this year's edition is to be held on November 10 thru 12 at the Pavilhão Atlântico, Sala Tejo in Lisbon, Portugal.

I've never attended SAPO Codebits before, but I heard good things about it from Datacharmer Giuseppe Maxia. The interesting thing about the way this conference is organized is that all proposals are available to the public, which can also vote for the proposals. This year's proposals are looking very interesting already, with high quality proposals from Giuseppe about database replication with Tungsten replicator, Pentaho's chief of data integration Matt Casters about Kettle (aka Pentaho data integration), and Pedro Alves from webdetails who will be talking about "Big Data" analysis and dashboarding work he did for the Mozilla team.

There are many more interesting talks; you should simply check out the proposals for yourself and give a thumbs up or a thumbs down according to whether you'd like to see a particular proposal at the conference. I decided to send in a few proposals as well. So, if you like what you see here, take a minute to vote and shape this Codebits conference. I'm hoping to meet you there!

Friday, August 12, 2011

Regarding the MySQL Conference and Expo 2012

Last week, Baron Schwartz announced the Percona Live MySQL Conference and Expo 2012.

Percona has organized MySQL-related conferences and seminars before, and from what I've heard, with considerable success and to the satisfaction of its attendees; there's one coming up in London in October 2011. But arguably, last week's announcement is quite different from the prior Percona conferences. It's different because it seeks to replace the annual O'Reilly MySQL Conference and Expo.

Everyone that has read the announcement will have no trouble recognizing it as a replacement, since it reads:
We all know that the entire MySQL community has been waiting to see if there will be a MySQL conference next year in the traditional date and location. To the best of our knowledge, no one else was planning one, so we decided to keep the tradition alive.
If you're still in doubt:

  • The conference title contains the phrase: "MySQL Conference and Expo".

  • It's to be held in the Hyatt Regency Hotel in Santa Clara, which has been the venue for the O'Reilly MySQL Conference and Expo since at least 2005.

  • It's scheduled to take place mid-April, exactly like the O'Reilly conferences used to be.

  • The scope of the conference encompasses the entire "eco-system" - whatever that is: Developers and DBAs; tools and techniques; tutorials, talks and BOFs. It's about users, but also explicitly about companies and businesses.

Immediately following the announcement, bloggers from the MySQL community - all of whom I respect and consider friends of mine - started posting their opinions. (In this list, I tried to keep the affiliations of these bloggers as appropriate and relevant as possible. Please let me know if you feel I wrongly associated someone with a particular company or organizational body.)

Except for Henrik's post, all of these express a negative attitude towards Percona's announcement. The critique focuses on a few themes:

  • Giuseppe and Sheeri express similar thoughts. They recall how Baron Schwartz and Peter Zaitsev (both from Percona, and now initiating the 2012 conference) criticized the O'Reilly MySQL Conference and Expo 2008 edition for increasingly becoming an event focused on business and vendors, rather than on users (see here and here). There seems to be a hidden accusation that now, only a few years later, Baron and Peter are "guilty" of organizing a business-oriented conference themselves.

  • Both Sheeri's and Giuseppe's posts also express the concern that Oracle might not allow any of its MySQL engineers and architects to speak at the conference. This would arguably make it a less interesting conference, as Oracle is a major (if not the main) contributor to both MySQL and InnoDB.

  • All bloggers argue that organizing a conference of this scale should not be the effort of a single company. In particular Kaj Arnö and Monty Widenius allude to the possibility of O'Reilly organizing the conference again, just like the way things used to be. They both explicitly include a list of the major companies contributing to MySQL which they envision should help drive such a conference.

    The main concern here is that when a single company organizes this event, it will be their event. In other words, it will not be neutral. There is serious concern for unfair competition, as the organizing company gets to decide, or exert greater influence on, which talks are approved and how talks are scheduled against each other.

Although I understand the critique, I do not agree with it. I hope I'm not offending any of my friends, but I think none of these seemingly sensible arguments against Percona organizing the MySQL Conference have true merit.

But before I explain, I think it's interesting to observe that nobody seems to assume it as a given that O'Reilly would be organizing another MySQL Conference and Expo. Monty comes closest to saying something about it:
The reason for my state of mind is that although there have been rumors about discontinuance of the O'Reilly arranged conference there hasn't been any announcement about this.

In fact, I have been working with O'Reilly to try to setup next year's O'Reilly MySQL conference with the intention of having it 'exactly like before', even if Oracle would not participate.
So basically, because O'Reilly didn't say they weren't going to do one, it might be possible, right :) I tend to look at it differently: it means exactly nothing when someone, O'Reilly included, doesn't announce something. The way I see it, O'Reilly has nothing to gain by announcing that they will not be organizing another MySQL Conference. Similarly, they've got nothing to lose by not announcing it.

In the end, organizing conferences is one of O'Reilly's business activities. The mere fact that they've been organizing one during the previous years does not bestow any special responsibility upon them to inform potential attendees and sponsors that they are discontinuing such an activity.

It's interesting that Monty mentions he was working together with O'Reilly on it. I have no reason to doubt it, but I do suspect that whatever was in the works, it was probably not going to happen at the traditional location and in the traditional time window. Silicon Valley conference centers are busy places and need to be reserved well in advance - starting to work on it less than three quarters of a year in advance can probably not be considered "well in advance" for an event of this scale.

Now, here are my arguments as to why I do not share the opinions I mentioned above:

  • If you read Peter and Baron's posts from way back in 2008 (I included the links already, but here and here they are again), you will notice that they were not in fact criticizing the O'Reilly MySQL Conference and Expo at all. They simply recognized there was a gap and felt there should also be a community-driven conference. In fact, Baron initiated such an event, the Open SQL Camp. That turned out to be such a great success that others started organizing OpenSQL Camps too.

    Now, if you read Baron's announcement for the Percona MySQL Conference and Expo, you'll notice that precisely because the tables are turned, they now feel the need to maintain a business-driven MySQL conference. They simply recognized that now there is the risk of a gap as far as a business-driven MySQL conference is concerned. In other words, there is no question of should this be a business-driven event or a community-driven event. Both kinds of events are needed, and the business one wasn't being taken care of, neither by O'Reilly, nor by Oracle.

  • The concern that Oracle might not allow its engineers to attend a conference that is organized by a competitor seems reasonable. But it assumes that they would allow it if it was a vendor-neutral conference, or at least a conference that could be perceived as such. To those that have been involved to some extent in the organization of the 2010 and 2011 editions of the MySQL Conference, it should be no secret that Oracle's participation hasn't been exactly eager. Just listen to Tim O'Reilly's own talk at the MySQL 2010 conference. If that doesn't convince you, look at the sponsor list for the 2011 edition: no Oracle. And if that still doesn't convince you - Last year it was very unclear whether Oracle was sending any delegation at all. Only at a very late stage did we receive proposals from Oracle.

    To be clear, I am not blaming Oracle for not wishing to participate in a particular conference. They have their own strategy and they are entitled to execute that however they see fit, even if that includes not sponsoring or speaking at a major MySQL conference. I'm just arguing that whether or not such a conference is organized by a vendor neutral party does not seem to be a part of Oracle's consideration. There is in my opinion absolutely no guarantee that Oracle would participate if things really would be like they used to, and O'Reilly and not Percona would be organizing another edition.

  • The final matter I'd like to discuss is the idea of all major MySQL-contributing companies working together to organize a conference. Although I think that's a very sympathetic idea, it doesn't seem very realistic to me. Or at least, it doesn't seem realistic that this would lead to a Santa Clara conference in April 2012. So maybe this is something all involved parties should discuss for the years to come.

My final remarks are that in the end, I am mostly happy that at least one party is willing to take the up-front risk in securing the venue. Percona has announced that, just like O'Reilly, they want to set up a board of community members to drive the program. I think other companies have legitimate concerns over vendor neutrality, but I have no reason to doubt that Baron, and by extension, Percona are doing whatever they can to safeguard that.

What is left is Percona's company name in the conference title. While Percona is likely to use that to drive their own business, this is not really different from all pre-2010 MySQL conferences, where MySQL AB and then Sun used it to drive theirs. That does not mean it excludes competitors using the conference to their advantage to pursue their business interests. There has always been room for competing vendors, and arguably that is what made the event not just *a* MySQL conference, but *the* MySQL conference.

I understand that not everybody is happy about how things are going now, and it would be great if all companies that feel they have a stake here collaborate in the future. But for now, I'm really happy there will be a 2012 edition, and I thank Percona for organizing it. I will definitely send in proposals as soon as the call for papers is open, and I hope everybody that feels they have something to talk about or present will do the same. I really believe this can be a conference exactly like it was before, with the only difference that it's organized by Percona, and not O'Reilly.