Query of Query - Did you know?

Did you know that you can put complex objects in a ColdFusion query? I'm honestly not sure if that's true of Java resultset objects. It may be, it may not. But you can with a ColdFusion query object. That means you can put arrays, structs and yes even CFC/Java objects in them.

What's even more important at this moment is that aside from the fact that you can inject complex objects into a query column, you can also execute query-of-query operations against them. ;) Now, in all likelyhood you would not be able to FILTER on the complex values... But that's not what I'm after. You can have a separate column with an id key, which allows you to filter the query-of-query on the simple value and that lets me do what I need.

What I'm driving at here is something that Joe Rinehart mentioned in a recent blog. Joe's contention is that there aren't currently any true ORM systems for ColdFusion. Okay, that's fine, but what do we call all these tools like Transfer and Reactor if we're not calling them ORMs? We could call them all "Data Access Layers" (DAL), which is how Steve Bryant bills DataMgr. But in the end I don't think that's a particularly helpful bone to pick. I would rather simply continue to call these ORM tools and discuss the differences in approach and feature sets on a case by case basis. I just don't see any ROI for getting all persnickety over the semantics of the label.

Having said that however, Joe's two posts did get me thinking about the way DataFaucet works and what I could provide as alternatives.

Let me start by saying that what is in DataFaucet now is not going anywhere. It works. I think it works well, I enjoy using it and I expect to continue to enjoy using it.

I do think however that I can offer some alternatives that will be more similar to what Joe described, external to the domain model objects, and at the same time will provide a record-caching mechanism that I've previously been hesitant to provide as a result of efficient cascade-deletes. The cascade-delete issue still troubles me, but I'll work on that. ;)

But at its core the engine I have in mind for handling the identity problem (1-record = 1-object), is based on the ability to place complex objects in queries and to get that data back using a query of query. Key in this strategy is the fact that objects stored in the query are returned by pointers (NOT deep-cloned or duplicated) when you execute the subsequent. If the query-of-query were to clone the object, then it would be useless because you would have multiple copies of the object in memory, making the caching mechanism moot. But because the query-of-query returns a reference, that means all instances in which that object are returned will be the same cached object. And as a result that makes a query an effective storage medium for these objects.

You might be asking "umm... hey Ike, why don't you just use a structure?"

That's a good question! And I could. However the structure poses two issues. One is that if I want to selectively cull the cache (based for example on idle time), then I would have to analyze each individual entry in the structure, whereas with the query I can simply execute a new query-of-query that excludes all entries meeting the criteria to cull -- it's a much faster operation.

Secondly I'm recalling something I read from Hal Helms in which he was talking about the Toys-R-Us debacle. In it he mentions how the developers (all very smart guys mind you) who worked on the Toys-R-Us site had made the decision to cache all the products in a big structure in memory... and the CF Server falls flat on its face.

I happen to know the reason for this problem with structures. A structure basically works on the hash-table principal. And hash tables are good for relatively small numbers of objects, however, because the hash-table ultimately is based on a "hash function", its results are somewhat imperfect. You tell it to store "foo" and it uses the hash-function to convert "foo" into a number and then attempts to store your value in that index in an array. Arrays are faster than string parsing, great!

But for the hash function to be faster than string parsing, that means it needs to actually not parse the string. Wha?! Yeah, I know, it sounds really strange, but it really doesn't. It basically performs a couple of complex math operations on the individual bytes of the string to come up with its number. And this means that, although ideally the hash function will produce few collisions, it *will* produce collisions.

A collision is when you tell it "store foo", it looks up the index for "foo", sees that there's already a value stored there, checks to see if that value is for "foo" and realizes that "bar" produces the same index using the hash-function. So since the hash function returns say 100 for both "foo" and "bar", it has to find some way to store both foo and bar in position 100 of the array. The first collision or two you won't notice much difference. The problem is that when you get a few million products in a database and you're storing them all in a big structure, you wind up with THOUSANDS of collisions in your hash table.

At that point the hash table actually becomes less efficient than doing the relatively more expensive string-parsing the hash-table was designed to avoid. Hence the server chokes... coughs up a lung and dies. And it will do that, irrespective of how much physical memory it has available, because the bottleneck is the hash table.

(To clarify, I don't actually know that's what happened at Toys-R-Us, I'm just extrapolating from what I've read about the event and about hash tables.)

Now I actually have some code currently that stores data in structures and at this point I'm considering revising some of it (though not all) to use queries instead for these two reasons I described. In at least one case the code that I have uses nested structures to avoid some of the issues related to the potential of hash table collisions. So instead of asking for struct["foo.bar.thing.#thingid#"] it actually converts that id into a foo struct, a bar struct, a thing struct and then the final data, so it's nested 3-structs deep. That allows me to store more in those structs without collisions, but because it's using structKeyExists several times in succession, that eats away some at the performance gain from caching. It's still better than not caching, but other maintenance tasks especially like culling old cache become rather complex, cumbersome and inefficient with that strategy. So I'm hoping the query-of-query approach will simplify those caching tools across the board. I wouldn't be surprised if this also finds its way into the CacheBox project at some point in the not too distant future as well.

In any event, here's some test code I was using to figure out if query-of-query could be used for this caching strategy. It seemed to work pretty perfectly actually. So I'm pleased. :)

<cfset qry = QueryNew("id,object") />
<cfset obj = CreateObject("component","datafaucet.system.duck") />

<cfloop index="x" from="1" to="5">
   <cfset QueryAddRow(qry,1) />
   <cfset qry.id[x] = x />
   <cfset qry.object[x] = duplicate(obj) />
   <cfset qry.object[x].setValue("id",x) />
</cfloop>

<cfquery name="qry2" dbtype="query">
   select * from qry where id = 3
</cfquery>

<cfset qry2.object[1].setValue("id","test") />

<cfoutput>#qry.object[3].getValue("id")#</cfoutput>

<cfdump var="#qry#" />

If you run this from a page that has a mapping for /datafaucet, you should get the output "test", followed by the dump of the 5 objects. What that means is that when it collected the data for qry2 in the query-of-query, it returned a pointer to the original duck object in record 3 of the original query. I updated the value of the ID, and then went back to the original query to display the ID value from there. If it had cloned or duplicated the object, it would have output "3" instead of "test" and that would have told me I couldn't use the query-of-query as a storage medium. But it gave me the result I wanted, so woohoo! Queries, here I come. :)

Comments
BlogCFC was created by Raymond Camden. This blog is running version 5.5.006. | Protected by Akismet | Blog with WordPress