This document describes the proposed feature direction of Expresso, for
discussion purposes. Some of the changes are incremental and some are sweeping.
It gives a high-level overview of the development work planned long term
for Expresso. By its very nature, this information will change as new
technologies and standards become available, so check back about quarterly
for updates.
Introduction
Expresso 4 had its controller API refactored as part of its integration
with Struts. Since then, performance has been way up and the new implementation
has proven itself flexible and fast. I propose that by and large we keep
the controller API as is.
Some other areas of Expresso are due for a refactoring to reflect technology
changes.
While 5.0 offers additional functionality and improvements to the data
access and security layers of Expresso, it is desirable to do some refactoring
that requires a major release. Here are a few examples of current limitations:
- Not easy to integrate with EJBs
- DBObjects have become one monolithic class that should be broken
down into smaller pieces. Maintaining this class has become difficult.
- Not easy to integrate with non-JDBC data sources.
- Limited LDAP support
- No JAAS support as of yet.
So, here's what is proposed:
Revised Security API
The basic goal here is to keep as much as possible of the idea that Shash started, and
to implement all roles and users as interfaces that can be easily extended. Here are some of the features that should be considered:
- JAAS Integration. The number one design goal will be to integrate with JAAS.
This will allow easy communication with any J2EE servers, as well as other
authorization containers.
- Database Independence. Currently there are still a lot of database dependencies
within the Expresso security system. We need to let the Security Module
manage security with or without a database. Does a person want to
plug in an XML-based security module where all the roles and permissions are
defined in an XML file? Fine. Expresso won't attempt to implement
all of these ideas itself, but it will provide a pluggable system that anybody can
extend as they wish.
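As a rough illustration of what "pluggable" could mean here, the following is a minimal sketch and not a committed API. The names SimpleRole, SecurityProvider and InMemorySecurityProvider are hypothetical. The only real JDK type used is java.security.Principal, which is also the type JAAS uses to represent identities.

```java
import java.security.Principal;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical: a role is just a named Principal, so it could live in a JAAS Subject.
final class SimpleRole implements Principal {
    private final String name;

    SimpleRole(String name) { this.name = name; }

    @Override public String getName() { return name; }

    @Override public boolean equals(Object o) {
        return o instanceof SimpleRole && ((SimpleRole) o).name.equals(name);
    }

    @Override public int hashCode() { return name.hashCode(); }
}

// Hypothetical plug-in point: the rest of the framework only sees this interface.
interface SecurityProvider {
    Set<Principal> rolesFor(String userName);

    default boolean isInRole(String userName, String roleName) {
        return rolesFor(userName).contains(new SimpleRole(roleName));
    }
}

// One possible implementation with no database at all. An XML-backed or
// LDAP-backed provider would implement the same interface.
final class InMemorySecurityProvider implements SecurityProvider {
    private final Map<String, Set<Principal>> roles = new HashMap<>();

    void grant(String userName, String roleName) {
        roles.computeIfAbsent(userName, k -> new HashSet<>()).add(new SimpleRole(roleName));
    }

    @Override public Set<Principal> rolesFor(String userName) {
        Set<Principal> granted = roles.get(userName);
        if (granted == null) {
            return Collections.<Principal>emptySet();
        }
        return Collections.unmodifiableSet(granted);
    }
}
```

Callers would depend only on SecurityProvider, so swapping the in-memory version for a database-backed one would not touch application code.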
Revised Data Access API
So here are the basic goals:
- Scalability. We need to be able to deal with systems that can scale
to multiple servers both on the web end and lower layers.
- Multi-Datasource Aware: We need to be able to mesh with JDBC datasources,
JDO systems, as well as EJB, and JMS datasources.
- High Performance: We don't want to sacrifice performance for the
single-server setup that is the vast majority of Expresso's current user base.
- Ease of APIs: Ideally, the system would offer a basic set of APIs
that lets a programmer get a working system with very few lines of code.
However, we also want another layer of APIs that lets a programmer
tune the system for their deployment environment to get high
performance. (It should be noted here that we want decent performance even
with the basic set of APIs.)
- Concurrency Control: Currently Expresso DBObjects have no capability
to deal with concurrent updates and the potential conflicts arising from them.
Although this may be no problem in heavily transacted systems, it becomes
more of a problem within widely distributed applications.
- Take the load off the databases: Commercial database systems are
expensive and can easily get bogged down. Adding more database hardware or
replicating databases may only be a partial solution, and can easily cost
millions of dollars for many companies. Being able to scale using
multiple, inexpensive machines is a highly desirable trait.
- Loosely coupled module design: For our goals to work, we must design a system whose modules are loosely coupled.
That is, we can replace chunks of the system without seriously affecting other
chunks.
DataObject Stack
Some of the most successful systems have been designed around the simple notion
of "stacked layers." Probably the most notable example is the TCP/IP
network stack. The premise is that each layer by and large only communicates
with the layers above and below it. Each layer can be reimplemented as long as the boundaries between the layers
are left undisturbed, which provides loose coupling. So here's what the Expresso
DataObject Stack would look like:

Here's a quick summary of what the job of each layer is:
- Data Access Object Layer: The Data Access Object Layer will be the layer that
each programmer communicates with. By and large it will be a "Dynabean"
with the setField() and getField() methods that we have all come to know and
love. It will have some additional fields that are maintained automatically to allow
for concurrency support, but by and large people should recognize it as a DBObject.
- Cache: Before, a DBObject would check a cache and go directly to the database
if the object wasn't in the cache. Under the new API, the caller tells the cache "I
want such and such object". If the cache has it, the cache returns it;
if not, the cache asks the next layer down to fetch it.
The goal is to have the Cache follow the JCache specification being
developed in a JSR.
- Messaging Middleware: This layer isn't necessary if you're dealing with
only a single webserver. The middleware's goal is to keep the caches on all the different webservers (or other systems) properly synchronized.
Unlike before, where the Cache was passive, every time another system updates
a database, the Middleware layer, which keeps track of who has what objects, sends
the new copy to the various caches that are interested. There can be more than
one Messaging Middleware layer on a network, and they all communicate with one
another. Usually the idea is that you'll have one Middleware for each 'subdivision'
of an enterprise.
- Transaction Layer: This can be JDBC transactions, entity EJBs, whatever
you wish. This is the layer that controls the final writing to the database.
- Conflict Manager: The job of this guy is to handle the problems that
arise when multiple updates take place at once. The conflict manager's algorithm
will be pluggable. Example options are: the last record always overwrites, or
records are merged if the modified fields are different fields. Either way, the Conflict Manager
will pass back to the system a "resolved" data object, or throw an
Exception. All layers can hook into the Conflict Manager because it's much more
efficient to detect and resolve a conflict as far up the stack as possible,
but not all conflicts will necessarily be discovered until the Transaction Layer
is reached.
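The "each layer only talks to its neighbour" idea can be sketched as follows. This is an illustration under assumptions, not the proposed API: the names DataLayer, BackingStoreLayer and CacheLayer are hypothetical, and a record is reduced to a plain map of field names to values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: every layer exposes the same narrow contract to the layer above it.
interface DataLayer {
    Map<String, String> fetch(String key);
}

// Stand-in for the Transaction Layer. A real one would talk to JDBC, EJBs, etc.
final class BackingStoreLayer implements DataLayer {
    final Map<String, Map<String, String>> rows = new HashMap<>();
    int reads = 0;   // counts how often the "database" was actually hit

    @Override public Map<String, String> fetch(String key) {
        reads++;
        return rows.get(key);
    }
}

// The cache only knows about "the next layer down", not what that layer is.
final class CacheLayer implements DataLayer {
    private final DataLayer next;
    private final Map<String, Map<String, String>> cached = new HashMap<>();

    CacheLayer(DataLayer next) { this.next = next; }

    @Override public Map<String, String> fetch(String key) {
        Map<String, String> hit = cached.get(key);
        if (hit == null) {
            hit = next.fetch(key);              // miss: ask the layer below
            if (hit != null) {
                cached.put(key, hit);
            }
        }
        return hit;
    }
}
```

Because CacheLayer depends only on the DataLayer interface, a middleware layer could be slotted in between the cache and the backing store without changing either of them.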
Stack Detail
DataObjects
As said earlier, the DataObject will in some ways look similar to the good
old DBObject we know and love. There are some key API differences, however:
- The DataObject will essentially be a "dyna-bean" with the setField("fieldName") and
getField("fieldName") capabilities that we've come to know and
love about DBObjects. The data objects will also have the basic operations
add, update and delete: anything that should be done on a single data object.
The DataObject will be an interface instead of a concrete class. This will
provide integration with the various data sources we're likely to encounter.
- All functions that operate on multiple data objects for querying
would be moved to a separate interface, for example setMaxRecords(), setOffset() and
searchAndRetrieve(). All of these could be moved into a DataQuery interface.
Some of them are very SQL specific, and will only be available in the JDBC
implementation of the DataQuery interface.
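To show the split between single-object and multi-object operations, here is a minimal sketch. The method names getField, setField, setMaxRecords and searchAndRetrieve come from the text above. Everything else (the exact signatures, MapDataObject, InMemoryQuery) is a hypothetical assumption, and add/update/delete are left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-object contract (persistence operations omitted here).
interface DataObject {
    String getField(String fieldName);
    void setField(String fieldName, String value);
}

// Hypothetical multi-object contract, split out of the data object itself.
interface DataQuery {
    void setMaxRecords(int max);
    List<DataObject> searchAndRetrieve(DataObject criteria);
}

final class MapDataObject implements DataObject {
    final Map<String, String> fields = new HashMap<>();

    @Override public String getField(String fieldName) { return fields.get(fieldName); }

    @Override public void setField(String fieldName, String value) { fields.put(fieldName, value); }
}

// Toy query over a list. A JDBC implementation would build SQL instead.
final class InMemoryQuery implements DataQuery {
    private final List<MapDataObject> source;
    private int maxRecords = Integer.MAX_VALUE;

    InMemoryQuery(List<MapDataObject> source) { this.source = source; }

    @Override public void setMaxRecords(int max) { this.maxRecords = max; }

    @Override public List<DataObject> searchAndRetrieve(DataObject criteria) {
        // This toy implementation only understands MapDataObject criteria.
        Map<String, String> wanted = ((MapDataObject) criteria).fields;
        List<DataObject> result = new ArrayList<>();
        for (MapDataObject candidate : source) {
            if (result.size() >= maxRecords) {
                break;
            }
            // A record matches when it contains every field/value pair set on the criteria.
            if (candidate.fields.entrySet().containsAll(wanted.entrySet())) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

The point of the split is that a data source with no query language at all can still implement DataObject, while only sources that can search need to provide a DataQuery.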
The absolute biggest difference will be the following:
All DataObject Operations other than set/getField() are Asynchronous
To make this design goal clearer, here are a few questions and answers about it:
- What benefits do you get from asynchronous data operations? Well,
several actually. From the data object integration standpoint, asynchronous
communication allows for easy integration with JMS-based transaction systems.
Secondly, there are many places where asynchronous communication can speed
up execution on the webserver side. Take for example a job that simply updates
a lot of records in the database. The job can issue a batch of update() calls, and
it's done. Any special circumstances can be dealt with either by using callbacks
or by simple measures like reporting any conflicts to the system administrator.
It's a contrived example, but why should a job server be bogged down for perhaps
hours waiting for each add() or update() to return? And finally, if you have
a reliable transport system, it doesn't matter if the database connection
dies altogether. All modifications will be resolved when the database comes
back online, and until then the Cache system will hold the updated copy.
- Won't asynchronous programming be much harder to work with? Not
necessarily. The short answer is that if we do our job and
provide the right constructs to work with, your job should be no more difficult
than synchronous programming. It truly is all a matter of design.
- Won't I have to redo my entire program design because of this? This
shouldn't be necessary. Again, this mainly depends on whether we do our design
job correctly. An example of a construct that could make your life easier
is a synchronous wrapper class to go around all data objects. This wrapper
class would wait for the result of the record modification and throw an
Exception if no result is obtained within X amount of time,
which is exactly how things work if a DBConnection dies while trying
to do a record update. This is an example of our goal to provide simple wrappers
and facades that let a novice web programmer get some results without
having to tackle tricky synchronization issues. Constructing a synchronous
data wrapper isn't tough; it would be much like calling
Collections.synchronizedMap(new HashMap()).
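The synchronous wrapper idea can be sketched like this. It is only an illustration: AsyncDataObject and SynchronousDataObject are hypothetical names, and the sketch leans on java.util.concurrent.CompletableFuture, a JDK class that is newer than the Java versions this proposal was written against.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical asynchronous contract: update() returns at once with a future
// that completes when the lower layers have processed the modification.
interface AsyncDataObject {
    CompletableFuture<Void> update();
}

// Hypothetical blocking facade: waits for the result, or fails after a timeout.
final class SynchronousDataObject {
    private final AsyncDataObject delegate;
    private final long timeoutMillis;

    SynchronousDataObject(AsyncDataObject delegate, long timeoutMillis) {
        this.delegate = delegate;
        this.timeoutMillis = timeoutMillis;
    }

    void update() throws TimeoutException {
        try {
            // Block the caller until the asynchronous update reports back.
            delegate.update().get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for update", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("update failed", e.getCause());
        }
    }
}
```

Code written against the wrapper looks like ordinary blocking code, so existing controllers could keep their structure while the layers underneath work asynchronously.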
Other under-the-hood changes to a DataObject would be:
- Before Modification Snapshot: Each DataObject will have a "pristine"
copy of itself that is maintained until a transaction is complete. This helps
us deal with concurrency, as per one of our goals. The conflict manager
can look at this copy and decide what has been modified. The pristine field value
copies would be constructed lazily, to save on memory.
- Date Last Updated: If you're working with multiple machines that
are time synchronized, this is a quick and CPU-efficient way of determining
which modification came first. Again, this supports concurrent updates.
- Last Modified By: This will contain the user credentials of the last
person to update a particular object. This allows the system to figure out
whom to notify if a person's modifications conflict with another concurrent
update.
- Aggregated Concrete Classes for transactional behavior. Somewhere
along the line, we often need to be able to get to low-level stuff. This reduces
system portability, but it is often necessary for the performance of particular
features for which the "lowest common denominator" approach to data source
systems is insufficient. Each data object knows which underlying datasource
it came from, and the programmer can get to that datasource and downcast to
the appropriate classes. This is much the same as Servlet vs. HttpServlet
and Controller vs. Servlet Controller. These classes will still provide hooks
into the rest of the system, so any special updates are still sent to the
cache, etc.
- Update Priority: The programmer of the system should be given the
opportunity to define what priority this object should have when it is updated.
For example, security updates probably deserve a higher priority to propagate
throughout an enterprise and get written to a database than, say, "Updated
Workforce Slogan of the Week".
- Routing Information: If multiple middle tiers are used, then the system
should have a quick way of figuring out which Middleware and which Transaction
layer to route to when doing an update. Needless to say, this information
would be null if we were dealing with only a single server. Again, we won't
waste memory if we don't need it for a particular configuration.
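The snapshot, timestamp and last-modifier bookkeeping listed above could look roughly like the following. This is a minimal sketch under assumptions: SnapshotDataObject and all of its method names are hypothetical, and fields are plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical data object that remembers original values only for fields that
// are actually changed, so untouched objects carry no extra copy.
final class SnapshotDataObject {
    private final Map<String, String> current = new HashMap<>();
    private Map<String, String> pristine;   // allocated on first modification only
    private long lastUpdatedMillis;
    private String lastModifiedBy;

    // Populate from the data source without marking anything as modified.
    void load(String field, String value) { current.put(field, value); }

    String getField(String field) { return current.get(field); }

    void setField(String field, String value, String user) {
        if (pristine == null) {
            pristine = new HashMap<>();
        }
        // Keep the value from before the first change, not intermediate ones.
        if (!pristine.containsKey(field)) {
            pristine.put(field, current.get(field));
        }
        current.put(field, value);
        lastUpdatedMillis = System.currentTimeMillis();
        lastModifiedBy = user;
    }

    // Fields whose current value differs from the pristine copy.
    Map<String, String> modifiedFields() {
        Map<String, String> changed = new HashMap<>();
        if (pristine != null) {
            for (Map.Entry<String, String> e : pristine.entrySet()) {
                String now = current.get(e.getKey());
                if (!Objects.equals(now, e.getValue())) {
                    changed.put(e.getKey(), now);
                }
            }
        }
        return changed;
    }

    String pristineValue(String field) {
        return pristine != null && pristine.containsKey(field) ? pristine.get(field) : current.get(field);
    }

    String lastModifiedBy() { return lastModifiedBy; }

    long lastUpdatedMillis() { return lastUpdatedMillis; }

    // Called once the transaction completes. The snapshot is no longer needed.
    void transactionComplete() { pristine = null; }
}
```

A conflict manager could compare modifiedFields() from two competing copies to see whether they touched the same fields.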
Of course, as we all think about the problem domain more, we'll come up
with other modifications that need to happen.
Cache
The cache layer will underlie the data object layers. Some of the details
of the Cache will be:
- Make the Cache compliant with the JCache standard being drafted. This is
already based upon a working Object Cache module that Oracle provides
for its developers.
- The Cache Module will have both memory and disk based caches. The sizes
for these will be configurable to tune system performance. The increased size
will GREATLY improve data access performance since the number of round-trips
to the database will be significantly reduced.
- The Cache Module is ACTIVE. This means that whenever it retrieves a copy
of a DataObject from the lower layers, the lower layers now know that this
Cache is interested in this data object. If another system updates the same
data object, the cache receives the updated copy as well. When an object leaves
the Cache, the Cache system notifies the lower layers that it is "unsubscribing,"
so to speak, and the lower layers will no longer notify this cache of updates.
This immediately allows for multiple webservers with a data middle tier, but
with the added advantage that data is still cached at each webserver.
- The Cache Module checks for potential concurrency problems before passing
updates to the lower tier and sends the offending records to the ConflictManager
for resolution.
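The "active" subscribe and unsubscribe behaviour described above can be sketched as follows. UpdateHub (standing in for the lower layers) and ActiveCache are hypothetical names, values are plain strings, and the sketch is single-threaded for brevity, so a real implementation would need proper synchronization.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical lower layer that remembers which caches hold which keys.
final class UpdateHub {
    private final Map<String, Set<ActiveCache>> subscribers = new HashMap<>();
    private final Map<String, String> store = new HashMap<>();

    // Fetching an object implicitly subscribes the cache to later updates.
    String fetchAndSubscribe(String key, ActiveCache cache) {
        subscribers.computeIfAbsent(key, k -> new HashSet<>()).add(cache);
        return store.get(key);
    }

    void unsubscribe(String key, ActiveCache cache) {
        Set<ActiveCache> interested = subscribers.get(key);
        if (interested != null) {
            interested.remove(cache);
        }
    }

    // An update from any system is pushed to every interested cache.
    void publish(String key, String value) {
        store.put(key, value);
        Set<ActiveCache> interested = subscribers.get(key);
        if (interested == null) {
            interested = Collections.<ActiveCache>emptySet();
        }
        for (ActiveCache cache : interested) {
            cache.onUpdate(key, value);
        }
    }
}

final class ActiveCache {
    private final UpdateHub hub;
    private final Map<String, String> entries = new HashMap<>();

    ActiveCache(UpdateHub hub) { this.hub = hub; }

    String get(String key) {
        if (!entries.containsKey(key)) {
            entries.put(key, hub.fetchAndSubscribe(key, this));
        }
        return entries.get(key);
    }

    // When an object leaves the cache, tell the lower layer to stop notifying us.
    void evict(String key) {
        entries.remove(key);
        hub.unsubscribe(key, this);
    }

    void onUpdate(String key, String value) { entries.put(key, value); }

    boolean holds(String key) { return entries.containsKey(key); }
}
```

With two caches attached to one hub, an update published through the hub reaches both, and a cache that has evicted the object is left alone.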
Messaging Middleware
This "layer" is responsible for queueing up requests to the back-end
database, as well as routing data object updates to other Middleware layers
if those are responsible for the particular data object. (See the
routing information described in the Data Object layer.) Priority queues
and multiple threads will be used for:
- Dispatching updates to the appropriate Caches and other middleware servers.
- Dispatching updates to the underlying data layer.
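A priority-ordered dispatch queue of the kind mentioned here might be sketched as below. PendingUpdate and UpdateDispatcher are hypothetical names, and the convention that a larger number means a more urgent update is an assumption of this sketch. java.util.concurrent.PriorityBlockingQueue is a real JDK class that worker threads can safely share.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical update message. A higher priority value is dispatched first.
final class PendingUpdate implements Comparable<PendingUpdate> {
    final String objectKey;
    final int priority;
    final long sequence;   // preserves arrival order within one priority level

    PendingUpdate(String objectKey, int priority, long sequence) {
        this.objectKey = objectKey;
        this.priority = priority;
        this.sequence = sequence;
    }

    @Override public int compareTo(PendingUpdate other) {
        int byPriority = Integer.compare(other.priority, priority);   // descending
        return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
    }
}

final class UpdateDispatcher {
    private final PriorityBlockingQueue<PendingUpdate> queue = new PriorityBlockingQueue<>();
    private long nextSequence = 0;

    synchronized void submit(String objectKey, int priority) {
        queue.add(new PendingUpdate(objectKey, priority, nextSequence++));
    }

    // Returns the most urgent waiting update, or null if nothing is waiting.
    // Dispatcher threads could call queue.take() instead to block until work arrives.
    PendingUpdate next() { return queue.poll(); }
}
```

With this ordering, a security update submitted after a low-priority one would still be sent out first.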
Other properties include:
- Concurrency conflicts. Sometimes the Cache layer won't be able to tell that
a data object modification is conflicting, due to routing time differences.
The Messaging tier will check for potential conflicts before dispatching
updates.
- Each Middleware tier is only responsible for figuring out concurrency conflicts
with its "own" data objects. This prevents multiple systems from
"rejecting" posts that should otherwise have taken place.
Transactional Layer
This layer will do the actual "dirty work." It will be responsible
for transactional integrity, and can be programmatically customized. Specifics
include:
- The actual custom behavior of data objects will be used here. Probably this
can be encapsulated in an inner class to better define the custom behaviors,
but I'm unclear how best to keep "extra classes clutter" out of
the situation and still define encapsulations.
- This layer can speak to entity EJBs, JDBC recordsets, or whatever you
need. The current DBConnectionPool would reside at this layer, for example.
- If JNDI datasources and JTA managers are used, this layer can interact with
the transaction architecture this way.
Conflict Manager
The conflict manager will have to choose the appropriate algorithm to apply to each particular data object and see whether it can resolve any conflicts
or has to deny a data object modification. Examples are:
- In a bank account situation, if two withdrawals happen at the same time,
then the conflict manager would apply BOTH withdrawals and perhaps add a penalty
to boot if the resulting bank balance is less than zero.
- In a normal personnel situation, the conflict manager could attempt to
merge any fields. If field A was modified by person A and field B was modified
by person B, then the manager would just merge the two changes, and we're
set. If both person A and person B modified the same field, then the
transaction would proceed with the first record, and the second record
that arrived would be rejected.
- No concurrent updates are acceptable: if any records conflict at all, reject
the changes.
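The field-merging example can be sketched as a pluggable strategy. This is an illustration only: ConflictStrategy, MergeFieldsStrategy and ConflictException are hypothetical names, records are reduced to maps of field names to values, and rejecting the second update is signalled here by throwing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical signal that a modification had to be denied.
final class ConflictException extends Exception {
    ConflictException(String message) { super(message); }
}

// Hypothetical pluggable algorithm: given the pristine record and two competing
// versions of it, return the resolved record or throw.
interface ConflictStrategy {
    Map<String, String> resolve(Map<String, String> pristine,
                                Map<String, String> first,
                                Map<String, String> second) throws ConflictException;
}

// Merges the two updates when they changed different fields. If both changed
// the same field to different values, the second update is rejected.
final class MergeFieldsStrategy implements ConflictStrategy {
    @Override
    public Map<String, String> resolve(Map<String, String> pristine,
                                       Map<String, String> first,
                                       Map<String, String> second) throws ConflictException {
        Map<String, String> resolved = new HashMap<>(first);
        for (Map.Entry<String, String> e : second.entrySet()) {
            String field = e.getKey();
            String original = pristine.get(field);
            boolean secondChanged = !Objects.equals(original, e.getValue());
            boolean firstChanged = !Objects.equals(original, first.get(field));
            if (!secondChanged) {
                continue;   // second writer left this field alone
            }
            if (firstChanged && !Objects.equals(first.get(field), e.getValue())) {
                throw new ConflictException("both updates modified field: " + field);
            }
            resolved.put(field, e.getValue());
        }
        return resolved;
    }
}
```

Other policies fit the same interface. For instance, "last record always overwrites" could be written as the lambda `(pristine, first, second) -> second`.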
Other Details
Boundaries
The border between each pair of adjacent layers on the data object stack is a boundary. The data object must somehow be transmitted to the next layer. The boundaries are pluggable as well. Here are some examples:
- In a single-webserver system, the boundaries would simply call the next
class in line and hand the data objects to the next layer without any
serialization taking place. This will result in a nice, high-speed system.
(Actually much faster than local entity beans, since those only bypass
the RMI layer.)
- In a multiple-tier system, JMS would be the ideal way to communicate all
the messages between layers. This is especially feasible since the data objects
themselves are asynchronous.
- For systems needing a more open approach, the data objects could
be serialized to an XML document or document fragment, so that the underlying
system (whatever the person desires to implement) can transport the data
in a portable way.
- For systems needing maximum open-endedness, the XML boundary could have an
XSLT Translet added to it to "massage" the data into a format that
the rest of the enterprise can speak.
Finally, the boundaries can be chosen individually. For example, you can have an
in-memory boundary between the DataObject and the Cache, a JMS boundary
between the Cache and the Middleware, and an XML boundary between the Middleware
and the Transaction Layer. This enables easy separation of work between
machines if scalability is required.
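A pluggable boundary could be sketched along these lines. Boundary, InMemoryBoundary and SerializingBoundary are hypothetical names. Standard Java serialization is used here purely as a stand-in for a real transport such as JMS or XML, which this sketch does not attempt to show.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

// Hypothetical boundary: carries a message to the next layer and returns its reply.
interface Boundary<M extends Serializable> {
    M send(M message, Function<M, M> nextLayer);
}

// Single-server case: a plain method call, with no copying or serialization.
final class InMemoryBoundary<M extends Serializable> implements Boundary<M> {
    @Override public M send(M message, Function<M, M> nextLayer) {
        return nextLayer.apply(message);
    }
}

// Stand-in for a remote boundary: the message is serialized and rebuilt before
// the next layer sees it, as it would be when crossing a network transport.
final class SerializingBoundary<M extends Serializable> implements Boundary<M> {
    @Override public M send(M message, Function<M, M> nextLayer) {
        return nextLayer.apply(copy(message));
    }

    @SuppressWarnings("unchecked")
    private M copy(M message) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(message);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (M) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not cross boundary", e);
        }
    }
}
```

Since both classes implement the same interface, which boundary sits between two layers becomes a configuration choice rather than a code change.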
Exception Propagation
All data passed between layers is simply an "Encapsulated" message.
Any Exceptions are just that, another Message that is bundled and passed
back up the stack to the Web Application for appropriate handling. The
system should be flexible enough so that Exceptions can be routed to a
special machine for handling as well.
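One way to picture "an exception is just another message" is an envelope that carries either a result or the failure from a lower layer. The class name Envelope and its methods are hypothetical and only meant as a sketch.

```java
// Hypothetical envelope: every reply travelling up the stack is a message that
// carries either a payload or the exception raised in a lower layer.
final class Envelope<T> {
    private final T payload;
    private final Exception error;

    private Envelope(T payload, Exception error) {
        this.payload = payload;
        this.error = error;
    }

    static <T> Envelope<T> ok(T payload) { return new Envelope<>(payload, null); }

    static <T> Envelope<T> failed(Exception error) { return new Envelope<>(null, error); }

    boolean isError() { return error != null; }

    // Called by the web application at the top of the stack: either hands back
    // the payload or rethrows the exception that was bundled further down.
    T unwrap() throws Exception {
        if (error != null) {
            throw error;
        }
        return payload;
    }
}
```

A router in between could inspect isError() and forward failed envelopes to a dedicated machine instead of passing them further up.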
General Implementation Methodology
Instead of continuing directly on Expresso 5, tearing it down and rebuilding
it while everybody waits for their bug fixes, we will
create a special Expresso 6 module that will eventually take Expresso 5's
spot when it's ready. Until then, maintenance work will continue on Expresso
5 so that everybody can continue developing their commercial products on
Expresso 5.
As each part becomes functional, we will release a public review release of that
particular functionality. Of course, at first, it will look nothing like a
full-blown Expresso application. It will initially be just some unit tests to prove
that what we created works and to provide a usage example. Once the API seems to have stabilized sufficiently, and we know that
the release date is on the horizon, we will cease work on Expresso 5 (except
for critical bugs), and flesh out Expresso 6 until it is ready
to be released.
Conclusion
This is of course only a high-level overview, and it will be fleshed out in
further design sessions via the community process. The 6.0 plans are, to
say the very least, a daring undertaking. The Expresso core team's goal
is to reduce risk by continuing to maintain the Expresso 5 series while we work
on the Expresso 6 API.
Given the current climate of requiring J2EE compatibility, as well as the
need for large enterprises to have a framework that will work well for
them, I believe such a refactoring will give Expresso a definite edge
for the Java developer. Please feel free to email the author or post to the
opensource mailing list with any comments, ideas, additions or suggestions about
this proposal.