Wednesday, May 30, 2007

ActiveRecord Validations Gotcha

Adding a validation to an ActiveRecord model over live data can cause a sudden, inexplicable, brain-shattering headache. Why?

If the new validation renders existing rows invalid, any attempt to update attributes on the object without correcting the offending attribute will fail. This will happen even if the HTML form that fronts the model gives your user no opportunity to modify the offending attribute. And, boy, can that disorient your users. Any part of your application touching that row becomes unwritable.

ActiveRecord's tight policy makes you think hard about what you're doing, which might be a good thing. It's hard to see alternatives that maintain ActiveRecord's simplicity. If your application never modifies rows outside of user-generated CRUD and if each model object corresponds to a single form, then you should be okay. But plenty of good application designs cannot meet both those criteria.

The best alternative behavior I could think of would be a "partial" validation: a validation check would run only on attributes that had been modified on the object (the "dirty" attributes, so to speak), letting other invalid attributes fly by. Eventually, though, an invalid row is going to kill you. After all, there is a reason it is invalid.

And while I prefer a noisy failure to a quiet one, I'd rather not have my application lock up and my users bear the burden of my oversight. I can even imagine a validations mechanism that strictly enforced partial validation but also performed full validation, alerting the developers in some loud, cantankerous way if full validation failed where partial validation did not.
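
A minimal sketch of what partial validation might look like, using a plain Ruby stand-in for an ActiveRecord model. The class, its attributes, and the dirty-tracking here are all invented for illustration; ActiveRecord offers nothing like this out of the box:

```ruby
# Invented sketch: validations run only on attributes modified since load
# ("dirty" attributes), letting pre-existing invalid data fly by.
class PartiallyValidatedRecord
  VALIDATIONS = {
    :email => lambda { |v| v.to_s.include?("@") },
    :name  => lambda { |v| !v.to_s.empty? }
  }

  def initialize(attrs)
    @attrs = attrs   # as loaded from the database
    @dirty = []      # attribute names changed since load
  end

  def write(attr, value)
    @dirty << attr
    @attrs[attr] = value
  end

  # Full validation, as ActiveRecord does it: every check must pass.
  def valid?
    VALIDATIONS.all? { |attr, check| check.call(@attrs[attr]) }
  end

  # Partial validation: only the dirty attributes are checked.
  def partially_valid?
    @dirty.all? do |attr|
      check = VALIDATIONS[attr]
      check.nil? || check.call(@attrs[attr])
    end
  end
end
```

A row loaded with a bad email would still accept a name change under `partially_valid?`, while `valid?` would lock it up.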

Depending on the size of your database, it might be a good idea to check database validity during a migration. You could iterate through all rows, calling #valid? on each, and raise an exception on a validation failure; or the migration could report the failure or enact some other policy. Consider adding a migration for each such validation. A validation that invalidates existing rows is really no different from a database migration. After all, in olden times, many of the constraints expressed in validations were expressed in the database schema itself.
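
The validity sweep could be as simple as the loop below. This is a sketch: in a real app the loop would live inside a migration's `up` method and iterate with something like `User.find_each`; the `DemoRecord` stand-in exists only to make the example self-contained.

```ruby
# Walk every row, call #valid?, and fail loudly on the first invalid one,
# halting the migration before the new validation can strand bad data.
def assert_all_valid!(records)
  records.each do |record|
    unless record.valid?
      raise "Row #{record.id} fails validation; repair it before migrating"
    end
  end
  true
end

# Stand-in record so the sketch runs outside Rails.
DemoRecord = Struct.new(:id, :email) do
  def valid?
    email.to_s.include?("@")
  end
end
```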

Monday, May 28, 2007

REST and Transactional Web Services

There's a long way to go before distributed programming becomes as safe and as simple as non-distributed programming. But if the Web is going to meet its potential of interoperable services locking together to create seamless, smart software, we need to get there. And here's something we need: distributed transactions.

One shortcoming we've found in Amazon Web Services is this: we can send something to S3, to SQS, and then to our database, but while we can roll back our database, we cannot roll back Amazon. There is a danger that an application operating over these distributed services can be left in an inconsistent state, and that's just no fun for users. EJB was designed to solve such problems with distributed transactions, but the cost of that was, well, EJB.

How does REST clarify this picture? REST gives us a uniform, simple way to describe web services. This makes it easy to describe one Web service call to another Web service. In fact, it makes it easy to describe any RESTful Web service call to a Web service. Yes, the meta-web service! We need only supply a verb, a URL, and maybe a few headers. Signature-based authentication makes it easy to authorize a service to make a call on another without releasing your credentials. With this, I can imagine Web service filtering or Web service tunneling: calling one service through another.
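
Describing a RESTful call to another service really is just a verb, a URL, and some headers. Here's a sketch of one call expressed as plain data and signed AWS-style with HMAC-SHA1, so an intermediary could replay it on our behalf without ever seeing the secret key. The keys, URL, and string-to-sign format are made up for illustration (the real S3 signing recipe has more parts):

```ruby
require "openssl"
require "base64"

# Invented credentials for the sketch.
access_key_id = "AKIDEXAMPLE"
secret_key    = "wJalrXUtnFEMI/K7MDENG"

# A web service call as data: verb, URL, headers.
call = {
  :verb    => "PUT",
  :url     => "https://s3.amazonaws.com/my-bucket/report.csv",
  :headers => { "Date" => "Wed, 30 May 2007 12:00:00 GMT" }
}

# Sign the request so a third party can submit it but cannot forge others.
string_to_sign = [call[:verb], call[:url], call[:headers]["Date"]].join("\n")
signature = Base64.encode64(
  OpenSSL::HMAC.digest(OpenSSL::Digest.new("sha1"), secret_key, string_to_sign)
).strip
call[:headers]["Authorization"] = "AWS #{access_key_id}:#{signature}"
```

Hand `call` to a tunneling or filtering service and it has everything it needs, minus your credentials.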

Here is the Web service I would love to see Amazon develop next: a transaction service. There are many, many ways to do this, but here's just one proposal. A client can create a transaction resource with Amazon, describing all the service calls comprising the transaction. Then, Amazon can act as a coordinator for a 2PC (or 3PC) transaction protocol. Any service that can work within such a transaction, of course, has to implement the protocol.
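
The coordinator role is easy to caricature in a few lines. A toy sketch follows; the cohort prepare/commit/rollback interface is invented for illustration, and real 2PC also needs durable logging and timeout handling:

```ruby
# Two-phase commit, phase 1 (commit-request): ask every cohort to prepare.
# Phase 2: commit if all voted yes, otherwise roll everyone back.
class Coordinator
  def initialize(cohorts)
    @cohorts = cohorts
  end

  def run
    if @cohorts.all? { |c| c.prepare }
      @cohorts.each { |c| c.commit }
      :committed
    else
      @cohorts.each { |c| c.rollback }
      :aborted
    end
  end
end

# Stub cohort that records what happened to it.
class StubCohort
  attr_reader :state

  def initialize(votes_yes)
    @votes_yes = votes_yes
    @state = :pending
  end

  def prepare;  @votes_yes;            end
  def commit;   @state = :committed;   end
  def rollback; @state = :rolled_back; end
end
```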

This has some disadvantages. For one, the client wouldn't be able to change state based on the responses of intermediary service calls. For another, it's hard to see how the client could commit the transaction synchronously (and without a synchronous response, much of the ease of use of the Web is lost).

For the first problem, I can imagine another way of working this without tunneling: splitting up the coordinator duties in 2PC. The transaction client can herself make calls to the cohorts during the commit-request phase, then let the transaction service take over as coordinator during the commit phase.

As for the second problem, creating a long-running task is pretty easy for human clients: just give them a callback for their browsers to keep polling. BackgrounDRb works like this. For the machine-readable Web, a callback URL to which transaction status changes are submitted is easy enough. Another alternative is for the service to expose a REST transaction resource which its consumers can poll.
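
The polling alternative is the simplest to sketch. In real use the block below would do an HTTP GET on something like a /transactions/42 resource; here the fetch is stubbed with a canned sequence of statuses, all names invented:

```ruby
# Poll a transaction resource until its status settles (anything other
# than "pending"), giving up after max_polls attempts.
def poll_until_settled(max_polls = 10)
  max_polls.times do
    status = yield
    return status unless status == "pending"
    # in real code: sleep before the next poll
  end
  :timed_out
end

# Stubbed status sequence standing in for successive GETs.
responses = ["pending", "pending", "committed"]
final = poll_until_settled { responses.shift }
```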

What more is needed?

(1) Some way to describe transactional status within REST resources.

(2) Some way for applications to implement transactional undo/redo across network latencies without locking too many resources. This part is difficult. Database transactions really aren't up for being kept open for long periods of time. We may need new tools to make this happen or at least a better understanding of the design and performance issues involved.

Saturday, May 26, 2007

The Web: Waiting for the Other Shoe to Drop

The philosophy behind REST design has always struck me as having this back-to-the-basics feel: Let's use HTTP verbs as they were originally intended. It's not the first time the community has looked across the convoluted territory of HTTP and felt a reformist impulse. After all, back in the late 90s, there was a lot of talk about how XHTML + CSS would help bring machine readability back to web pages by separating content from layout. The Semantic Web was supposed to further this by providing an ontology to this clean, machine-readable content. Even WS-* web services were supposed to bring about this magical world of seamlessly interoperable machines all plugging into each other.

REST brings new hope. We have a clean, uniform interface for clients to interact with services. Tools like ActiveResource help reinforce the idea that the interface to a distributed resource can be just the same as the interface to a local resource. But which clients should interact with which services, and why clients ought to interact with them, remain questions answered only by human intervention. What if we can find a layer of abstraction to help chip away at this problem? Actually "solving" the problem might be the realm of AI fantasyland. But I think we can begin to chip away at it. And that's progress.

We've already had something basic like this in the past. For example, back in the day, using the Web required finding a list of servers that implemented a particular Internet protocol: gopher, IRC, etc. Interoperability occurred at the level of the protocol. Can we make interoperability happen at the level of semantic description of a service with REST?

This is the thought that occurred to me reading Richardson and Ruby's excellent RESTful Web Services. As they point out:

Computers do not know what "modifyPlace" means or what data might be a good value for "latitude".

Well, why not? A computer can certainly know what a good value for an email address is. We take it for granted that a compiler knows the difference between an integer and a string. Why not latitude? We just need one more layer of abstraction above what we have today, before clients can consume services without having been programmed in advance.
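
A registry of machine-checkable semantic types is trivial to sketch; the hard part, of course, is agreeing on the ontology. The type names and checks below are invented for illustration:

```ruby
# A toy ontology: semantic type names mapped to validity checks, so a
# client that has never seen "latitude" before could at least validate one.
SEMANTIC_TYPES = {
  "latitude"  => lambda { |v| v.is_a?(Numeric) && v >= -90 && v <= 90 },
  "longitude" => lambda { |v| v.is_a?(Numeric) && v >= -180 && v <= 180 },
  "email"     => lambda { |v| v.to_s =~ /\A[^@\s]+@[^@\s]+\z/ ? true : false }
}

def semantically_valid?(type_name, value)
  check = SEMANTIC_TYPES[type_name]
  raise ArgumentError, "unknown semantic type: #{type_name}" unless check
  check.call(value)
end
```

Serve that registry over HTTP and a client really could "just look it up."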

It prompts the question: how distributed do we want programming to be? I can imagine the ontology itself being a Web service. If a client did not know what latitude was, it could just look it up. What would the service provide? I can imagine lots. An ontology, validations, algorithms, related services.

REST is real progress, because it simplifies things. Hopefully we'll be spending less time wiring together our Web service stacks and more time thinking about the big picture.

Tuesday, May 22, 2007

The Document Abstraction

There are two ways to go with an Office 2.0 application. You can use the familiar document abstraction, where the user makes changes to the document within the browser and finally commits them with a save. This is the familiar Office workflow, with the document kept in JavaScript memory. The alternative is to break the document abstraction somehow. Basecamp is a great example of this: you do not work with a single monolithic document. Instead, you make and commit small changes, one change at a time.

From the point of view of building a web application, the second way is the much friendlier way to go. Easier on the browser. Easier on the database. But it's not really feasible for applications competing against the word processor or the spreadsheet. I suspect that those using the document abstraction are eventually gonna run into problems with very large documents. After all, there are going to be costs to tunneling through HTTP and browsers. Those tunneling costs are likely to surface as performance issues. We all know how some browsers begin to choke on very large pages. Toss big pages together with lots of data kept in JavaScript and you have a recipe for trouble.

There is a technical opportunity here, I feel. The great thing about the browser is that it is a universal platform. The terrible thing about the browser is that it really wasn't designed to be an application platform. If a developer wants to build a document-centric Web application, she is still missing that sweet spot of tools and best practices to make things work out. Is Flex the answer? Is Java WebStart?

Tuesday, May 1, 2007

Unit Testing and Purity

Tim Lucas has a fine article on mocking and Rails testing which touches on some themes that I also hit on in an earlier post.

There is a tendency towards keeping functional tests pure in Rails.

Now, I find myself in a different position. I need to exercise my code as much as possible with the little time I have, so I like to get lots of testing bang for my buck. That means that the pragmatist in me wins over the purist who'd like to see each strata tidily tested in its own appropriate testing layer which does not so much as touch the code stink of another layer, let alone the putrescent code fart-bomb that is the database.

Frankly, I regard it as one of the strengths of Rails that I'm again close to my database. Unlike my magnificent Tapestry+Struts+Spring+Hibernate architectures of old, I'm again within earshot of something that actually has implications for my users, no longer in that level of coding hell where I was testing Data Transfer Objects and testing the XML configuration of my DTOs and testing my database schema declaration so that it would not be altered while I was busy testing all those other things that I had to be testing to, ya know, save a record in a database -- all of course in perfect TDD abstraction from my database, database connection, web controller, and views, and pretty much in abstraction from the 6,000 things that can and will go wrong. But at least I know my DTO code is impeccable! A bullet-proof POJO! My business logic is flawless, portable. Oh, wait. I didn't write any business logic; that's properly abstracted away into a business logic container framework. Whatever. Come break me!

Testing purists say the solution to this is just more tests, and they are certainly right. The problem is that a small startup simply doesn't have the resources. All this code does come at a cost. Can we do things less purely but more efficiently?

What is crucial to a smallish application is code coverage, not purity of testing style. This is particularly true for uncompiled languages, where there is no such thing as compilation to give you even a smoke test on your code.

Five layers of well-segregated tests are just great, but one layer of impure tests is far, far better than any number of layers of pure tests where some layer somewhere has gone uncovered.

But this alternative approach of a single round-trip from user to database and back has costs of its own. You pay a price in test fragility. Fragility is a symptom of concerns that are improperly coupled.

Lucas has hit on this problem: Rails controller tests basically change each time you change your validations, and repairing them involves duplicating validation concerns within the controller tests. Fixtures seemed originally intended to resolve this problem, but they don't lighten the load any; they just shift it off into another file, which can become a curse of its own.

He proposes using mocha to stub out ActiveRecord objects during controller testing. The problem is that views make calls on all kinds of properties, and stubbing each property just recreates the same coupling problem that stubs were supposed to get around.

One solution is to turn rendering off, but it's just very, very useful to have something exercising the rendering code as a basic sanity check.

So my suggestion? What if each model object kept a class method for creating a single valid instance of itself, something like Object.fixture? At least the responsibility then remains with the model object itself, close to the declaration of its validations, and the controller tests stay uncluttered. You can change properties on that single instance in your test itself, if a particular value is what you are testing. This way, the controller tests do not break, and if you add validations, changes only need to be made in one other method.
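
A minimal sketch of the idea in plain Ruby. The Post class and its attributes are invented; in Rails, valid? would come from ActiveRecord and the fixture method would sit right beside the validates_* declarations:

```ruby
# Each model owns one canonical valid instance, declared next to its
# validations, so controller tests never hard-code valid attribute sets.
class Post
  attr_accessor :title, :body

  def valid?
    !title.to_s.empty? && !body.to_s.empty?
  end

  # Controller tests call Post.fixture and tweak only the attribute under
  # test; add a validation and only this one method needs updating.
  def self.fixture
    post = new
    post.title = "A valid title"
    post.body  = "Some valid body text"
    post
  end
end
```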

A low-fi suggestion, to be sure.