Query Optimization Hints - thanks to Rick Osborne and Ben Nadel

Ben Nadel posted this blog early this morning with a bunch of query optimization hints from Rick Osborne. Thanks Ben and thanks Rick!

I don't have a DB2 database handy, which is part of the reason why there's not currently a DB2 sql-agent in DataFaucet. There is however no reason why anyone couldn't simply create a sql-agent for DB2. These comments about query optimization however lead to some interesting thoughts about potential improvements for the sql-agents in general in terms of making the agent perform more efficiently for its target platform.

As an example:

Rick says: "Yes, put as much of the filtering as you can in ON clauses. Not only does it put the conditions where they are most relevant, but in some engines you'll get orders of magnitude better performance. The DB2/400 optimizer is so dumb (how dumb is it?) that if you put the conditions in the WHERE instead of the ON it will do the joins first, no matter how big the tables, and then only apply the conditions at the end. For extremely large tables, this is a nightmare."

By default, DataFaucet's query-builder automatically puts all the join conditions in the ON clauses and when performing a left join it properly places any filters on the joined table inside the ON clause as well, so that you can filter on a left join without turning it into an inner join in disguise.

But what struck me about Rick's comment here was that it would be pretty easy to write the sql agent so that it places those filters first before the join condition to improve the query performance just on DB2. For that matter Rick mentions a number of potential optimizations that could be similarly handled inside the abstraction. DataFaucet doesn't currently handle anything in that kind of detail, for example, reordering tables based on size, etc. but in the future it could. It's at least an idea worth keeping in mind for now. :)

Something else Rick said: "And no, the dream of having one query work perfectly on multiple engines is really just a dream."

If you're talking about flat out queries, yes, that may be true. Part of the reason why I started working on DataFaucet in the first place however, way back in the CF5 days, was to produce platform agnostic SQL. So while it may not be possible within an individual query, it might certainly still be possible within the abstraction. ;)

There are two other comments from Rick that I'd like to highlight here.

Rick said: "In most modern DBMSes, you almost don't need to index as long as you have Primary *and* Foreign keys set up. Joins are where indexes really shine, so proper keying will get you 90% of the way there."

Wouldn't you know, I've been trying to convince people to use foreign key constraints for years. ;) DataFaucet makes really good use of them and also makes them really easy to build. If you're using the built-in DDL features that allow the objects to automatically install tables, making a foreign key constraint is as easy as declaring a join (or easier I think). Here's an example from a previous article.

<cfcomponent output="false" extends="datafaucet.system.activerecord">
<cfproperty name="productid" type="uuid" required="true" key="1" />
<cfproperty name="productname" type="string" required="true" length="100*" />
<cfproperty name="productdescription" type="string" required="false" length="long*" />
<cfproperty name="productprice" type="numeric" required="true" length="real" />
<!--- create a foreign key constraint to ensure this product is placed in a category --->
<cfproperty name="categoryid" type="uuid" required="true" references="tblProductCategory.CategoryID" />

<cfset setTable("tblProduct") />
</cfcomponent>

And lastly I'll just encourage you to go read the article on Ben's blog, because Rick made a really clear analogy that helps to explain "selectivity" of an index, which you may have also heard described as the "cardinality" of the data. It's a good help to understanding how indexes work to improve the performance of your queries.

Database Introspection and the MetaData Facade

Being a forerunner is often kind of exciting. There's a certain sense of pride that goes along with being the first person to achieve a particular goal. Y'know, ego stroking. ;) But there's another more challenging side to being a forerunner. You're the first person to get hit when things are launched in your direction.

Okay well I may not have been the first to get hit in this case but my challenges over the past couple days certainly stem from having been a forerunner.

As of today I'm running my ColdFusion 8 installation on my horribly under-powered, not to mention OLD notebook with the new CF Admin option "disable access to internal ColdFusion Java components".

The challenge in doing this was that I needed to find a way to get all the metadata I needed from the database without being able to get an actual JDBC metadata object. The JDBC object I need is actually *NOT* part of the ColdFusion server, oddly enough. The problem stems from the fact that it needs to be associated with an active Java SQL.connection object and the only way I can get that is by going through the undocumented "coldfusion.server.ServiceFactory".

Until Adobe released ColdFusion 8, this wasn't a problem. And Adobe actually tried to resolve the problem I have by also introducing the CFDBINFO tag in ColdFusion 8. (I'm pretty sure I submitted the enhancement request to Adobe for this tag and I've had several people thank me for it, saying it's one of their favorite new features, so I put that in the win category.) ;) Unfortunately although they do return some good and useful introspection data, they failed to solve my problems in particular. There are a handful of problems with the data returned by the cfdbinfo tag, like the fact that it won't tell you the names of any of the foreign key constraints (which, you MUST know if you ever plan to drop them programmatically). The data's there - it's part of the standard set of data returned by the JDBC metadata object even.

That wasn't such a huge thing, the bigger issue (and one I may still need to address more) is that it doesn't return any schema information for tables, views or foreign key constraints. Most of the time your applications will only work in one particular schema (on SQL server it's usually although not always DBO), but particularly if you work with Oracle for the enterprise at all there's a good chance that you may need to draw data from alternate schemas. So the DataFaucet introspection code was designed to account for that -- after all, it's been part of the SQL standard for many, many years! And without that information the foreign key constraint code in particular will fail. And so because of that I couldn't use cfdbinfo for all of the metadata I needed. Instead I had to go back to using information_schema for constraint information (although it's in the sqlagent.cfc which means you can customize it for different db's if you need to).

Don't get me wrong. I understand exactly what Adobe was trying to do when they implemented cfdbinfo. They were trying to take the complexities of JDBC introspection and boil them down into a package that would be easier for developers to digest. And so instead of requesting imported vs. exported foreign keys, you simply request columns and the column list says which columns are foreign keys. That much I understand and could even agree with. But they dropped vital information while they were at it. Hopefully they'll fix those oversights in a future release.

I have actually used information_schema before. Many years ago, I started doing database introspection when we were still limited to ColdFusion 5. That was before CFCs even, all I had to work with were custom tags. NOBODY was talking about writing db abstraction in ColdFusion at the time. In fact, pretty much everyone recommended to avoid even trying it. Except me. I'm not sure Mark was even around yet and Transfer certainly wasn't. Neither was Reactor. Of course now it's common, but not back then. And those tags I had to work with were full of an amalgam of information_schema and proprietary meta-data tools. Ugh! But that was the price of being a forerunner, working on things that weren't easy to pull off. It wasn't until around the time that 6.1 (Red Sky) was released that I converted everything to use JDBC, specifically because even between databases that support the information_schema standard, the data from JDBC is still more consistent and reliable.

So ultimately the existing code in DataFaucet that relies on the serviceFactory is still the most solid way of handling all the things that DataFaucet does. So it's going to try and use that by default, because that's a best case scenario. But we've been working on a project with Eric Jones the past week and their hosting provider disables access to the ColdFusion classes. That meant I had to work on an alternative solution. And I've found myself more or less in the same sort of situation I was in back when I was doing this with CF5, slogging through a maze of different and none-too-consistent standards for getting the information I need.

For that matter, I even asked both Mark Mandel and Sean Corfield if they had any suggestions on the Java front for an easier way that maybe I could get at a loaded metadata object without going through the serviceFactory since it's not a ColdFusion class (and so wouldn't be protected). Neither of them had any advice and for that matter, Mark doesn't even do database introspection generally speaking. So here I am again, the forerunner... bitten in the rear by my success. ;)

I can't say that the solution hasn't been interesting. I had to design a system that would attempt to get the serviceFactory and then on failure create a couple of new components. These new components are facades for the connection and metadata objects. They're just CFCs like everything else in the framework, but what they're doing is pretending to be the java objects the framework wanted. (And hooray for duck-typing in ColdFusion! Because this problem would have been a lot harder to solve in an explicitly typed language like Java.) So these objects have to return the data in the same format that's in use. Since 6.1 that's been the JDBC format... and guess what, neither cfdbinfo nor information_schema follow the same format or naming conventions. So I can't say that it hasn't been interesting (and certainly challenging), but I do wish it weren't necessary.

But then, I'm used to these kinds of tough challenges. I've never been the kind of guy to run away from a technical challenge. ;) And this does mean that if you've had problems using the ORM due to the limitations of your hosting there's now a solution. :) (Unless they disable the creation of Java objects all-together in your sandbox.)

Anyway I haven't released the code yet, but I have checked it in to the SVN repository. So if you have an immediate need, you can try the new code by getting the BER from SVN. I probably won't release a new downloadable archive until I create a small sample application and some documentation for the new Persistence Service that I've talked about in my last few blogs. And at that point because of the new service and because of the metadata facade I think I may roll it up from 1.0 to version 1.1.

DataFaucet PersistenceService Update - Tested Many-to-Many

I just committed some updates to the SVN repository that includes testing for many-to-many relationships with the PersistenceService. And yes there were a handful of issues that needed to be addressed to get these working. ;)

The object code for a reciprocal many-to-many relationshiup looks like this:

<cfcomponent displayname="department">
<cfproperty name="departmentID" type="uuid" required="true" key="1" />
<cfproperty name="departmentName" type="string" required="true" />
<cfproperty name="facultyArray" type="array" required="false" />

...
</cfcomponent>

<cfcomponent displayname="faculty">
<cfproperty name="facultyID" type="uuid" required="true" key="1" />
<cfproperty name="firstname" type="string" required="false" />
<cfproperty name="lastname" type="string" required="false" />
<cfproperty name="departmentArray" type="array" required="false" xref="tblDepartmentFaculty" />

...
</cfcomponent>

Once you've got these CFCs ready, then you would set up your PersistenceService with the appropriate IoC factory (or multiple factories if you prefer) and call the install method to install the tables to hold the data for these components. (The below code would normally be configured in your IoC factory - it's shown here for reference.) :)

<cfscript>
cacheFactory = CreateObject("component","datafaucet.system.classcachefactory").init();
cacheManager = CreateObject("component","datafaucet.system.classcachemanager").init(cacheFactory);
daoFactory = CreateObject("component","datafaucet.system.daofactory").init(accessorPattern="get*",mutatorPattern="set*");
factory = CreateObject("component","datafaucet.system.classfactory").init("datafaucet.demo");
service = CreateObject("component","datafaucet.system.persistenceservice").init(factory,daoFactory,cacheManager);

service.install("department");
service.install("faculty");
</cfscript>

And viola! You've got tables. Then all you need to do is get and save a few departments and faculty. As you create and save them it will both cache the objects and relationships and it will update the cross-reference table indicated in the xref attribute. Restart your ColdFusion server and fetch any department or faculty record via service.get("department",departmentid) and it will automatically load the associated objects.

As previously mentioned, this is a deceptively simple feature and it can be very easy to swamp your server with hundreds or even thousands of objects that aren't being used, simply because you declared a couple properties of type "array". So BE CAREFUL if you choose to use this. Make sure these are objects that really are frequently used and really should be cached in memory.

ORM PersistenceService BER

Okay so as of today, the bleading-edge release of DataFaucet includes a reasonably complete INITIAL PROTOTYPE of the PersistenceService. It now includes reciprocal one-to-many/many-to-one relationships and the purge operation has been given some preliminary testing. Aggregation (many-to-many relationships) have not been tested and cache purging needs some additional testing.

What does this mean?

For those of you who are advanced users and may have been interested in an "external ORM" where you could use your own business objects that are ignorant of the persistence engine, you can now have components like this:

<cfcomponent displayname="conference">
<cfproperty name="conferenceid" type="uuid" required="true" key="1" />
<cfproperty name="conferencename" type="string" required="true" />
<cfproperty name="forumArray" type="array" required="false" />

...
</cfcomponent>

<cfcomponent displayname="forum">
<cfproperty name="forumid" type="uuid" required="true" key="1" />
<cfproperty name="forumname" type="string" required="true" />
<cfproperty name="conference" type="conference" required="true" />

...
</cfcomponent>

Now where I put those elipses, you would have your getters and setters (however you want to write them) and in the PersistenceService you can set a pattern for them, so you can use either getValue/setValue the way I mostly use DataFaucet or you can use getConference/setConference. And then you can simply request your objects from the service in much the same way that Transfer works. In this above example when you request a conference object via PersistenceService.get("conference",conferenceid) it will automatically load that conference object with an array of its cached forums and similarly, each forum object will be loaded with a reference to the cached conference object. So if you've got one conference and 5 forums, you'll have 6 objects in the cache.

Can't wait?

For more information about how to configure the service, see the previous blog entry. There is NOT any documentation in the distribution yet. To get the code, export from the SVN repository (it's NOT in the public download yet and won't be until there's documentation).

In my mind there are specific use cases in which this is beneficial over using the existing active records and imo vice versa as well.

On the plus side:

Using the PersistenceService means that your individual cached objects won't contain all the data access methods as is the case with active record objects. That means they'll have a smaller memory footprint.

For some of you OO enthusiasts you'll be happy that the object is nearly independent of the ORM (aside from a few special attributes in the cfproperty tags).

Also there's just one DAO class, instead of having a separate DAO CFC for each object, woohoo! The DAO is loaded with some information about the class it's being used to persist, but generally speaking you shouldn't need to extend the DAO, it should just work as is to persist most business classes.

On the minus side:

Because of the seamless, recursive nature of the links between these objects, it could be really easy to code yourself into a situation where you've got dozens or even hundreds of objects loading when you only asked for ONE. As an example, given a product catalog of any size, you wouldn't want to have an array of Products in each Category obect. And that's really not the way online shopping is done anyway, so hopefully that won't be a problem for you. ;)

As of right now there are no solutions to lazy-loading property arrays. I have a few ideas about ways to handle that, but no code yet. Doing that in ColdFusion will probably mean subclassing your business objects to add hooks into the ORM for fetching those properties, although there will need to be a way of informing the DAO that those properties will be lazy-loaded.

Race conditions... race conditions... and more race conditions... To be honest there are race conditions with the active record implementation also, however, the addition of caching for any collection of objects increases the number and type of race conditions that will occur in your application. Unfortunately beyond "in your business objects", I can't tell you where you'll find them, I just know that you will. If any given method in your business objects sets or gets more than one instance variable, there's a potential race condition in it.

Use Cases

You might want to use the PersistenceService instead of active records if you have a behavior-rich application in which a number of objects need to remain resident in memory for frequent access. A good example of this would be application security / authentication. User objects, Role objects and Permission objects (if you have them) are apt to be accessed rather frequently and may include somewhat rich behaviors for determining which users have access to what features of your application. In this scenario it makes a lot of sense to retain these objects in memory to reduce the amount of database access and keep the individual objects readily accessible to each other to handle their exchange of information.

As always, I'm available to answer whatever questions you may have. ;)

New PersistenceService option in DataFaucet ORM

So exactly one week ago today on the 3rd of November, Joe Rinehart posted this blog entry in which he suggested that none of the existing ORM solutions for ColdFusuion are "true ORMs" because they function in ways that are a bit different from the way "traditional" ORM tools work in other languages... or maybe just in Java, I'm not sure.

Anyway as you might expect Joe's article has been fairly popular. MangoBlog doesn't show a view count, but there are 34 comments! In a follow-up he listed a handful of more specific problems he believes need to be solved in order to call a tool an "ORM".

Now I actually disagree with the notion that we should just avoid calling these tools ORMs unless they work in this specific way, in part because the name "object relational mapping" doesn't include any words that actually describe several of the problems listed. Caching as an example isn't described by the name although it's part of the problems outlined. I am however a fan of options.

Until recently the only caching in DataFaucet was for queries. It didn't have any caching utilities for objects. Part of the reason for an object cache being needed in Joe's description is because of the "identity" problem he wants an ORM tool to solve. The basic idea is that for instance X of a given class, there should only ever be one object in memory. So you couldn't have say Product #5103 and have Joe have one instance of 5103 and Ike have another instance of 5103 and they be different objects, because there's only ever supposed to be one Product 5103 object. This of course gives rise to a variety of new race conditions.

The active record objects in DataFaucet on the other hand are designed on a somewhat different concept of identity, one that handles race conditions related to referential integrity in a different way. Actually most of the potential race conditions that occur when everything is cached as is the case in Joe's description (and in Transfer), don't occur when you're using DataFaucet's active record objects, and so there are only a much smaller handful of race conditions to consider, those being largely related to deleted records and referential integrity.

Now I'm not saying that the active record model I developed in DataFaucet is better, I'm merely pointing out that it's different and so it has different strengths and weaknesses. If you're like me you may consider that there are a lot of applications for the web where the kind of very complicated caching done to solve the identity problem in Transfer isn't really necessary and so a somewhat simpler approach may work just fine.

On the other hand there certainly may be cases in which knowing that those objects are atomic may be very important to your business model, in which case you would need something that handles caching. In the past I just handled the cache on a case-by-case basis. The onTap framework's member plugin does this, having a manager object that maintains cache for the individual member objects and resolves the identity problem for just those objects. They actually are active record objects but in that case I felt it was important that the identity problem be solved the way Joe described in order to facilitate a more bulletproof security model. ;)

I realize I'm getting kind of long-winded here (as usual). What I'm getting at is that I've started working on a PersistenceService object in the DataFaucet core that provides a generic service/factory for working with objects in the way Joe described in his article. I have no intention for this to replace any of what already exists in DataFaucet. The existing features will stay right where they are, I'm merely extending the framework to include some new features in addition for those who prefer to work with their data in a manner more like Transfer.

One thing that you will find different from Transfer with the PersistenceService is that it's not going to create the individual objects itself. Instead it's going to defer to an IoC framework like ColdSpring or LightWire (or the simplified IoC factory in the onTap framework). And so you won't be creating any decorators like you would with Transfer because it's going to use your objects from the word go.

Also as Joe described in his article, this will make it possible to create your objects as "pure" objects, which have no internal knowledge of the ORM. Instead the ORM will analyze the object's properties and build the data schema from that - or you can provide a schema file for a given class if you need more control. I'm still working out the details of how to configure the PersistenceService and its related cache manager and DAO objects. The DAO objects and object-cache objects can also come from your IoC factory if you prefer to configure them that way (which if you're using IoC already, I rather expect you will prefer that).

I also made a particular point of designing the DAO objects to support not only the generic get/set methods that I prefer but to also support explicit getters and setters as well if you prefer that. Of course they default to getValue and setValue because the onTap framework's "duck" object has those method names, but if you've got existing objects with explicit getters and setters (from a snippet maybe?) then you can use those too.

Here's an example of a CFC I'm using for testing to show how you can use it with objects that are rather unlike the kind of objects I usually create:

<cfcomponent displayname="vendor">
   <cfproperty name="vendorid" type="uuid" required="true" key="1" />
   <cfproperty name="vendorname" type="string" required="true" />
   <cfproperty name="vendornotes" type="string" required="false" length="250*" />
   
   <cffunction name="init" access="public" output="false">
      <cfset variables.instance = newInstance() />
      <cfreturn this />
   </cffunction>
   
   <cffunction name="newInstance" access="private" output="false">
      <cfset var result = structNew() />
      <cfset result.vendorid = "" />
      <cfset result.vendorname = "" />
      <cfset result.vendornotes = "" />
      <cfreturn result />
   </cffunction>
   
   <cffunction name="getVendorID" access="public" output="false">
      <cfreturn instance.vendorID />
   </cffunction>
   
   <cffunction name="setVendorID" access="public" output="false">
      <cfargument name="vendorid" type="string" required="true" />
      <cfset instance.vendorid = arguments.vendorid />
   </cffunction>
   
   <cffunction name="getVendorName" access="public" output="false">
      <cfreturn instance.vendorName />
   </cffunction>
   
   <cffunction name="setVendorName" access="public" output="false">
      <cfargument name="VendorName" type="string" required="true" />
      <cfset instance.VendorName = arguments.VendorName />
   </cffunction>
   
   <cffunction name="getVendorNotes" access="public" output="false">
      <cfreturn instance.vendorNotes />
   </cffunction>
   
   <cffunction name="setVendorNotes" access="public" output="false">
      <cfargument name="VendorNotes" type="string" required="true" />
      <cfset instance.VendorNotes = arguments.VendorNotes />
   </cffunction>
   
</cfcomponent>

You might notice that, as Joe described, this CFC doesn't extend anything. It doesn't have to. The persistenceService is able to get all the information it needs either from the metadata of the object or from its configs, but the object itself doesn't need to know anything at all about the ORM. It doesn't even need to know you're using an ORM at all.

So having said that, the setup for my test page to configure the service looks like this:

<cfscript>
   cacheFactory = CreateObject("component","datafaucet.system.classcachefactory").init();
   cacheManager = CreateObject("component","datafaucet.system.classcachemanager").init(cacheFactory);
   daoFactory = CreateObject("component","datafaucet.system.daofactory").init(accessorPattern="get*",mutatorPattern="set*");
   factory = CreateObject("component","datafaucet.system.classfactory").init("datafaucet.demo");
   service = CreateObject("component","datafaucet.system.persistenceservice").init(factory,daoFactory,cacheManager);
   
   service.install("vendor");
   
   vendor = service.get("vendor");
   vendor.setVendorName("sony");
   service.save(vendor);
   
   vendor2 = service.get("vendor",vendor.getVendorID());
   vendor3 = service.get("vendor",vendor.getVendorID());
</cfscript>

If this were a working application I wouldn't have those createObject calls at the top. Instead I would have those objects configured in my IoC framework and then I would fetch the PersistenceService from the IoC factory and call service.get() as you see here. I would probably also use just the one IoC factory for all three types of objects (cache, dao and business objects). The init method of the service object even defaults the daoFactory to the business object factory.

The function call service.install("vendor") analyzes the object and configuration and creates the necessary tables to hold the data for the vendor object.

Then my get() call fetches a new vendor object from my IoC factory (in this case "datafaucet.system.classfactory" is just a naive tool that creates an object and spits it back, rather than an actual IoC factory - it's a stub really, existing only to support the model for testing - and the daoFactory here is the same).

The call to service.save() then checks to see if the object is new and performs either an insert or an update of the database depending on its status, just like Transfer.

Subsequent calls to service.get() where I've included the id of the record I want then return the already created object from the cache.

So far I've tested this on the single object. It creates one table and there are no composed objects yet. There is some code in SVN for handling the composition cache, but I haven't tested that part yet. There are a bunch of other things I expect don't work yet. Many of them are things that are already working with the active record objects. For example it won't handle autonumber ids yet or storing data for a single object in multiple tables. Heck, I haven't even tested the functions for purging the cache!

I wanted to post a note to let everyone know about roughly two days work on this. The code is in SVN and none of it is documented yet because of course it's not released yet. But if you're interested in playing around with it, you're more than welcome to get the BER from the SVN repository and check it out. :)

I will say that one particularly keen drawback of the PersistenceService is that it fails one of the primary objectives of the DataFaucet framework and that is making data access easy. I've said it on several occasions that it's called DataFaucet because I wanted to make getting data as easy as getting a drink of water from your kitchen sink! ;) In many ways I feel the gateway and active record objects really do live up to that ideal, particularly with features like and/or keyword searching. The PersistenceService not so much.

There's not a whole lot I think I can do about that because it needs an IoC framework to function and of course that introduces a new dependency that's beyond the scope of the ORM. But I'm certainly not opposed to the framework having "advanced techniques" for people who need or want them. As long as I can maintain that gentle learning curve, I'll be happy. :)

Query of Query - Did you know?

Did you know that you can put complex objects in a ColdFusion query? I'm honestly not sure if that's true of Java resultset objects. It may be, it may not. But you can with a ColdFusion query object. That means you can put arrays, structs and yes even CFC/Java objects in them.

What's even more important at this moment is that aside from the fact that you can inject complex objects into a query column, you can also execute query-of-query operations against them. ;) Now, in all likelyhood you would not be able to FILTER on the complex values... But that's not what I'm after. You can have a separate column with an id key, which allows you to filter the query-of-query on the simple value and that lets me do what I need.

What I'm driving at here is something that Joe Rinehart mentioned in a recent blog. Joe's contention is that there aren't currently any true ORM systems for ColdFusion. Okay, that's fine, but what do we call all these tools like Transfer and Reactor if we're not calling them ORMs? We could call them all "Data Access Layers" (DAL), which is how Steve Bryant bills DataMgr. But in the end I don't think that's a particularly helpful bone to pick. I would rather simply continue to call these ORM tools and discuss the differences in approach and feature sets on a case by case basis. I just don't see any ROI for getting all persnickety over the semantics of the label.

Having said that however, Joe's two posts did get me thinking about the way DataFaucet works and what I could provide as alternatives.

Let me start by saying that what is in DataFaucet now is not going anywhere. It works. I think it works well, I enjoy using it and I expect to continue to enjoy using it.

I do think however that I can offer some alternatives that will be more similar to what Joe described, external to the domain model objects, and at the same time will provide a record-caching mechanism that I've previously been hesitant to provide as a result of efficient cascade-deletes. The cascade-delete issue still troubles me, but I'll work on that. ;)

But at its core the engine I have in mind for handling the identity problem (1-record = 1-object), is based on the ability to place complex objects in queries and to get that data back using a query of query. Key in this strategy is the fact that objects stored in the query are returned by pointers (NOT deep-cloned or duplicated) when you execute the subsequent. If the query-of-query were to clone the object, then it would be useless because you would have multiple copies of the object in memory, making the caching mechanism moot. But because the query-of-query returns a reference, that means all instances in which that object are returned will be the same cached object. And as a result that makes a query an effective storage medium for these objects.

You might be asking "umm... hey Ike, why don't you just use a structure?"

That's a good question! And I could. However the structure poses two issues. One is that if I want to selectively cull the cache (based for example on idle time), then I would have to analyze each individual entry in the structure, whereas with the query I can simply execute a new query-of-query that excludes all entries meeting the criteria to cull -- it's a much faster operation.

Secondly I'm recalling something I read from Hal Helms in which he was talking about the Toys-R-Us debacle. In it he mentions how the developers (all very smart guys mind you) who worked on the Toys-R-Us site had made the decision to cache all the products in a big structure in memory... and the CF Server falls flat on its face.

I happen to know the reason for this problem with structures. A structure basically works on the hash-table principal. And hash tables are good for relatively small numbers of objects, however, because the hash-table ultimately is based on a "hash function", its results are somewhat imperfect. You tell it to store "foo" and it uses the hash-function to convert "foo" into a number and then attempts to store your value in that index in an array. Arrays are faster than string parsing, great!

But for the hash function to be faster than string parsing, that means it needs to actually not parse the string. Wha?! Yeah, I know, it sounds really strange, but it really doesn't. It basically performs a couple of complex math operations on the individual bytes of the string to come up with its number. And this means that, although ideally the hash function will produce few collisions, it *will* produce collisions.

A collision is when you tell it "store foo", it looks up the index for "foo", sees that there's already a value stored there, checks to see if that value is for "foo" and realizes that "bar" produces the same index using the hash-function. So since the hash function returns say 100 for both "foo" and "bar", it has to find some way to store both foo and bar in position 100 of the array. The first collision or two you won't notice much difference. The problem is that when you get a few million products in a database and you're storing them all in a big structure, you wind up with THOUSANDS of collisions in your hash table.

At that point the hash table actually becomes less efficient than doing the relatively more expensive string-parsing the hash-table was designed to avoid. Hence the server chokes... coughs up a lung and dies. And it will do that, irrespective of how much physical memory it has available, because the bottleneck is the hash table.

(To clarify, I don't actually know that's what happened at Toys-R-Us, I'm just extrapolating from what I've read about the event and about hash tables.)

Now I actually have some code currently that stores data in structures and at this point I'm considering revising some of it (though not all) to use queries instead for these two reasons I described. In at least one case the code that I have uses nested structures to avoid some of the issues related to the potential of hash table collisions. So instead of asking for struct["foo.bar.thing.#thingid#"] it actually converts that id into a foo struct, a bar struct, a thing struct and then the final data, so it's nested 3-structs deep. That allows me to store more in those structs without collisions, but because it's using structKeyExists several times in succession, that eats away some at the performance gain from caching. It's still better than not caching, but other maintenance tasks especially like culling old cache become rather complex, cumbersome and inefficient with that strategy. So I'm hoping the query-of-query approach will simplify those caching tools across the board. I wouldn't be surprised if this also finds its way into the CacheBox project at some point in the not too distant future as well.

In any event, here's some test code I was using to figure out if query-of-query could be used for this caching strategy. It seemed to work pretty perfectly actually. So I'm pleased. :)

<cfset qry = QueryNew("id,object") />
<cfset obj = CreateObject("component","datafaucet.system.duck") />

<cfloop index="x" from="1" to="5">
   <cfset QueryAddRow(qry,1) />
   <cfset qry.id[x] = x />
   <cfset qry.object[x] = duplicate(obj) />
   <cfset qry.object[x].setValue("id",x) />
</cfloop>

<cfquery name="qry2" dbtype="query">
   select * from qry where id = 3
</cfquery>

<cfset qry2.object[1].setValue("id","test") />

<cfoutput>#qry.object[3].getValue("id")#</cfoutput>

<cfdump var="#qry#" />

If you run this from a page that has a mapping for /datafaucet, you should get the output "test", followed by the dump of the 5 objects. What that means is that when it collected the data for qry2 in the query-of-query, it returned a pointer to the original duck object in record 3 of the original query. I updated the value of the ID, and then went back to the original query to display the ID value from there. If it had cloned or duplicated the object, it would have output "3" instead of "test" and that would have told me I couldn't use the query-of-query as a storage medium. But it gave me the result I wanted, so woohoo! Queries, here I come. :)

New DataFaucet ORM Build - added iterator CFC

Okay, so I don't usually publish new builds this close together. :) It turns out that I was just too hasty to add the reset() method to the read() method of the active record object and I realized that in addition to the previous problem this also interfered with some of my other existing code because it was clearing out my default values for record objects where I had defined a default value in the soft-constructor of the CFC.

But really the reset() method had come about originally partly in preparation for the iterator object, which just takes a query and a record and provides a way of looping over the query and loading the object with each record in the query. It gives you a middle-ground between the performance of a gateway and the encapsulation of a record object. I won't call it an "iterating business object" because the default iterator itself isn't really a business object per-se. It doesn't really have any business logic in it.

So I've added the iterator CFC plus documentation and I've removed the reset() from the read() method of the active record object, archived the new code (honestly not huge changes) and uploaded another new build. I don't expect to publish any new builds for a little while after this, I promise. :)

ORM Tools : Where FarCry Meets DataMgr

In the ColdFusion world these days we've got a variety of database abstraction tools to work with.

At one end we've got Steve Bryant's DataMgr which is a "Data Access Layer (DAL)" and is amazingly compact at under 50KB of code. And even with that little amount of code, DataMgr still does something most of the ORM tools don't and that's create tables. That feature is really useful for creating install scripts and/or upgrade scripts and the CFeMmy winning Savvy Content Manager is installed and upgraded using those features. But although it is somewhat robust, with less than 50KB of code, it certainly lacks features that other database tools have (and deliberately so). You will find no actual object-mapping (DAO or ActiveRecord) in DataMgr because it's not desiged to be an ORM tool. So if you want to create those things, you have to build them from scratch.

At the other end of the spectrum is one of the few other CF tools that will create database tables: the FarCry Framework. FarCry is... NOT compact. The zip archive of FarCry 5 is 11MB COMPRESSED and about 30MB once you've extracted it. But aside from the Godzilla-size footprint, FarCry also takes a very different approach to database management than the other tools. Where the most popular database tools for ColdFusion assume that you will create a database schema first and then use the tool to map objects to that schema, FarCry insists that you focus only on the objects and let it build the schema in FarCry's own way. According to FarCry if you're thinking about the database, then you're doing something wrong. Don't like UUID's? Too bad, 'cause the primary key on your table is going to be a UUID column called "objectid" (irrespective of the name of the table).

Each of these systems has advantages and drawbacks. What FarCry lacks in portability it makes up for in features. What DataMgr lacks in features it makes up for in portability. Though ultimately I think both DataMgr and FarCry represent the extreme ends of the spectrum with DataMgr seeking to be "light weight" and FarCry seeking to be "full featured". In between you have a handful of other ORM tools that are between 1MB and 2MB in terms of the footprint of the code. The most popular of them are Mark Mandel's Transfer ORM and Doug Hughes Reactor ORM. Yet neither of those tools offer any methods of installing database tables like DataMgr or FarCry. They both asume that you will have a pre-built database schema and then ask you to write some XML to tell them which tables to use and how they're related.

Which means that right now, DataFaucet is the only ORM tool for ColdFusion that will go both ways. It will allow you to use an existing database schema, or it will build the schema for you, it's your choice. It provides the object abstractions that DataMgr lacks while at the same time providing the flexible table-creation abilities that Transfer and Reactor lack, without the overhead (or restrictions) of FarCry. Yesterday I spent most of my day actually working on providing a bit more abstraction in DataFaucet, moving from using XML to define tables for installation to allowing you to use CFPROPERTY tags in much the same way they're used in FarCry.

Here's a sample from the DataFaucet documentation:

<cfcomponent output="false" extends="datafaucet.system.activerecord">
<cfproperty name="productid" type="uuid" required="true" key="1" />
<cfproperty name="productname" type="string" required="true" length="100*" />
<cfproperty name="productdescription" type="string" required="false" length="long*" />
<cfproperty name="productprice" type="numeric" required="true" length="real" />
<!--- create a foreign key constraint to ensure this product is placed in a category --->
<cfproperty name="categoryid" type="uuid" required="true" references="tblProductCategory.CategoryID" />

<cfset setTable("tblProduct") />
</cfcomponent>

Compare this to a sample from FarCry:

<cfcomponent name="superhero" extends="farcry.core.packages.types.types" output="false" displayname="Super Hero">

<cfproperty name="title" type="string" default="" hint="Super hero title." />
<cfproperty name="teaser" type="string" default="" hint="Mini intro for super hero biography." />
<cfproperty name="biography" type="longchar" default="" hint="Super hero biography." />
<cfproperty name="secretIdentity" type="string" default="" hint="Super hero secret identity." />
<cfproperty name="lAlias" type="string" default="" hint="Comma separated list of alternative titles for the super hero." />
<cfproperty name="sidekick" type="uuid" default="" hint="Super hero sidekick." />
<cfproperty name="catHero" type="string" default="" hint="Category of hero." />
<cfproperty name="apowers" type="array" default="" hint="Array of superhuman powers." />

</cfcomponent>

Unlike the FarCry example where CFPROPERTY tags are the only access to database schema, the same tags in the DataFaucet example are only syntax sugar built on top of existing XML-based table definitions. If the sugary abstraction isn't enough, if you need or want views or stored procedures, you can easily move from CFPROPERTY to creating the XML yourself. Also unlike FarCry you don't have to "deploy" this object from a "webtop" to install the tables -- all you have to do is instantiate one and the database tables will be created automatically.

The catch? You have to create the form yourself. :)

New DataFaucet ORM Build - Oct 13

I just pushed a new release of the DataFaucet core up to the RIAForge server. This release adds a columnMap CFC to resolve some previous issues with the new active-record features for stripping underscores from column names. In the long run I've come to think these features aren't really worth the amount of complexity and/or maintenance involved in adding them.

This release also includes the addition of a schemaExport CFC for exporting the table definitions from your existing database to the XML format used for creating new tables in another database. While this may be useful in copying tables from one database to another, the primary reason for adding this feature is to allow for the schema-diff tool which now powers the active record install method. What this means is that an active record object which could previously install its own database tables can now add new columns to existing tables using the same install() method. The install method exports the existing schema from the database and then uses XSLT to perform a diff between the installed schema in the database and the schema declared in the active record object. Any new tables or columns declared will be added while leaving the existing tables and columns in place. This will make writing software upgrades much easier. :)

Support for foreign keys is still limited but I expect it will be improved in a later version. Right now foreign keys can be added when adding a new column, but if you need to drop them or add them to existing columns, you'll have to perform those operations manually with a separate DDL packet.

The new schemaExport features also still need to be documented. I usually try and add some documentation whenever I add a new feature, but haven't gotten to doing that for this one yet.

Enhancement Updates

I just wrapped up a couple of enhancement requests... bounding filters for date and numeric columns in collectionFilters -- this allows you to specify [column]min and [column]max for numeric columns or start[column] and [end]column for date columns in a collectionFilter. This should be in keeping with the spirit of the collectionFilter handling the most common tasks for "advanced search" forms.

I also finished autodiscovery for server type and added the ability to remove underscores from ActiveRecord properties and gateway result columns.

Status Update

Hi all,

I haven't dropped off the earth. :) I still haven't started working on those 7 feature requests yet because I've been really busy trying to wrap up some major revisions on my other framework projects in the past couple weeks. And yesterday and the day before I worked a 14hr and 12hr day respectively at my 9-5 job.

In the meanwhile, the subject of caching has been on my mind a lot lately and I've come to realize that I wasn't aware of some of the new tools available for cache management in ColdFusion. Not native tools mind you, external tools say from newer versions of the JVM. Which has inspired me to start onto a project called CacheBox that will be pluggable so it can be used in both DataFaucet and the onTap framework and hopefully most other ColdFusion frameworks as well.

More Entries

BlogCFC was created by Raymond Camden. This blog is running version 5.5.006.