Sunday, June 14, 2009

Google App Engine Datastore doubts



I have been looking into Google App Engine since the beginning. I played a little bit with it using Python and Django but never did some serious work. As GAE/J was released back in April I decided to have a serious look by building a real application.

I mean the GAE promise is everything I need: easy to build, easy to maintain, easy to scale and I can run it on Google's infrastructure. And when my application won't be the next Twitter, most likely it will cost me nothing to host it as Google offers enough free quotas for medium and small application. I should also mention that now Grails also runs on GAE (using the App Engine plugin) my life couldn't get easier, but...

The only thing GAE does not offer is a relational datastore. GAE's Datastore, based on BigTable, is a hierarchical, schema-less, non-relational datastore.

One of the nice things relational databases support are the aggregate functions like count, sum, avg etc. If you are using GAE you must "Shift processing from reads to writes" as mentioned in The Softer Side Of Schemas - Mapping Java Persistence Standards To the Google App Engine Datastore presented at Google I/O. In practice this means that when inserting or updating a record you must calculate and store the count, sum, avg, etc. values of the whole table. But what if you have many conditional groupings? Let's take an example:


@Entity
class Person {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
String id;

String name;
//Date birthdate;
Integer age;
String address;
String city;
String postalCode;
}


Now imagine I want to be able to (1) get the total number of persons and (2) the average age. Everytime I would create, update or delete a person I should calculate both and store them. Not a big deal, but in my opinion already to much hassle.

But then I also want these values per city... So for each city I should store these values. Next I want them also per postalCode or a specific range of the postalCode. You see where I'm going to? Every create, update or delete of a person would result in a lot of caluculation and storing values.

And then worse, my agile product owner comes in and want to add a boolean indicating a person is a male or not. And guess what, number of persons and average age should additionally be calculated also for this male/female indicator, and to make it worse also in combination with city and postalCode...

I truly believe GAE's Datastore is suitable for many applications don't needing features part of most relational databases. But if you need aggregate queries or common SQL functions (both most likely needed when you are trying to get some analytical data) you should really think twice.

I hope one day GAE's current Datastore will support aggregate functions or additionally a relational datastore will be provided. Then I think GAE would be my preferred Cloud platform.



20 comments:

bd_ said...

The fundamental, driving design decision behind GAE is to make it almost impossible to design your applications in an /un/scalable way. However, grouping functions like you describe are frequently not easily scalable - if you ask the database for the average of some arbitrarily-specified set of items, you're going to have to scan over all of them, and if you multiply that load by a few hundred hits per second and a dataset too large to fit in RAM, then you'll be in for a lot of pain.

That said, GAE does need some sort of offline processing for just this sort of thing - I believe they've mentioned they're looking for ways to integrate in things like mapreduce, which would give you the aggregate reports you want, but that's still a ways off.

Robert J Miller said...

Marcel, thank you for the comment on PardonTheInformation.com. Although this would take some work, what about creating a secondary table that holds all of your count/average/sum data and that is populated by a scheduled process? The data wouldn't be updated real time but it would be accessible for fast querying; similar to a materialized view in Oracle. I haven't tried it, but GAE supports cron jobs for scheduled tasks.

Marcel Overdijk said...

Thanks for your reply Robert.
Calculating the aggregate values in a scheduled process is also what I thought about already. But still it would be a lot of work to store all kind of aggregate combinations. E.g.
- avg age
- avg age of females
- avg age of persons in city Amsterdam (per city calculations need to be stored)
- avg age of females in Amsterdam
- avg age of persons in a specified postalCode (I would need to store calculations for each postlCode
- avg age for postalCodes in a certain range...
I could go like this forever ;-)

Graeme Rocher said...

If I'm not mistaken, GAE doesn't allow background processes no?

Marcel Overdijk said...

@Graeme
Background processes are currently not yet supported but are on the roadmap: http://code.google.com/intl/nl/appengine/docs/roadmap.html
As workaround the CRON-job support (http://code.google.com/intl/nl/appengine/docs/java/config/cron.html) can be used to trigger an url inside the GAE application periodically. The servlet or action linked to this this url could do the processing then. But these requests need to be fullfilled within a couple of seconds or otherwise will be cancelled by GAE. Even if this would be working I would need to do a terrible amount of calculations (as described in my orignal post).

Besides moving the processing of the aggregated values to a scheduled background process leads to new problems... As the GAE datastore does not support any count or avg function, this means I would need to query all records and then process each row to calculate the count and the avg. But the rows returned in a query is limited to 1000.

Kwhit said...

Not that much experience yet with GAE but I have serious doubts as to the advantages of using big table over a good old SQL RDBMS. Yes it's great not having to worry when developing the first version of the app but it's very worry having missing properties in some objects that would otherwise be not null.

The problems you raise add to that.

overtheline said...

This is fully terrible. If a query cant pull aggregate counts on data then what use is this datastore at all.

It is not possible to identify all the potential query abstractions that will be needed AFTER the data is stored.

Either this is just bad advice and the datastore can really query it or this persistence layer is nearly worthless.

Andrew Westberg said...

I've found that to port my app to BigTable, I just have to live without a few things in order to be scalable. If your app can be adjusted to make it happen. If not, you're better off sticking to your RDBMS of choice.

Sekhar Ravinutala said...

Not sure what the 2 second limit you're referring to is, Marcel... GAE/J limits requests to 30 seconds (http://code.google.com/appengine/docs/java/runtime.html) - is this what you had in mind?

Marcel Overdijk said...

@Sekhar Ravinutala
Yes this what I was referring, to and indeed it's 30 sec limit. But will this be enough bor background processing with CRON-jobs? I don't know. Maybe today it is, but maybe not tomorrow as I will need to process more records.

lborupj said...

I cant help but to think that perhaps Google made a mistake in wrapping their Datastore and ultimately their BigTable implementation in JPA and JDO abstractions. People automatically think they'll have the same options as in a RDBMS storage but they dont.
Also I cant imagine why you want to exec an aggregated query against any storage unless you know exactly how many hits you possible can get (intranet, extranet etc).
GAE is simply another way of doing thinks and while it may seem like extra work to do when inserting or updating data, most data is read many many more times than updated.
Well I for one have decided NOT to use the JPA/JDO wrappings, but use Datastore API directly

startguy said...

I too am using the App Engine for my development efforts. In the beginning I was disappointed that I did not have the type of aggregate functions that a Relational Database has. However after a little research I found that the differences can be compensated by changing my design. I could create a table that contains all of the averages that are needed, then every time an entry is added, calculate the average?

Marcel Overdijk said...

@startguy
Off course I can calculate all kind of aggregate values (like average) when a record is added, updated and deleted.

But It's not only 1 update. E.g. I want the aggregate values for:
- avg age
- avg age of females
- avg age of persons in city Amsterdam (per city calculations need to be stored)
- avg age of females in Amsterdam

Or maybe even the average age of all persons who's name is like "A%"

bd_ said...

Again, the reason you cannot do this is simple: Your operation is O(your entire database) at worst. All datastore operations, however, are O(returned dataset).

SQL makes it seem easy to just do arbitrary queries, but as soon as you have a large amount of data it will become _slow_. The restrictions on GAE are designed to force you to write scalable code - code that does not have complexity dependent on the size of your entire database. If you refuse to accept the tradeoffs needed for scalability, your code will not scale, wherever it may be hosted.

That said, GAE's coming out with background processing (low-latency task queues) soon - once that's out, you could implement aggregate queries with a table scan.

Graeme Rocher said...

Actually having a Grails plugin that produces the aggregate data values automatically for you would be a great addition ;-)

Rob Whelan said...

It's important to point out these limitations, but you need to balance that with a big dose of "YAGNI".

What's your real world application that involves crunching these stats for random subsets of users?

In most applications, each user will have their own set of data, and won't have access to everyone else's (just their contacts, for example).

The functions that really require the kinds of queries you're talking about usually fall into "reporting", not basic user functionality. In my experience, if you have a large app that requires scaling, you'll be forced to take a different approach on this data anyway... you simply cannot just run these massive queries on the transactional data (you usually end up with a separate reporting db server, refreshed via message queue, scheduled refresh, or something else).

I agree that it's frustrating that Google is forcing you to take the scalable approach, when perhaps your application will never need to scale that much... but that's the downside of a one-size-fits-all solution like GAE.

Marcel Overdijk said...

Yesterday I had a look at the pdf presentation of "Offline Processing on App Engine: a Look Ahead" presented at Google I/O (http://code.google.com/events/io/sessions/OfflineProcessingAppEngine.html).

On slide 45 it's mentioned that the rich features for aggregations are needed. So I think they know it them self.

I'm a little bit worried that people are implementing the calculation of the aggregate values etc. on write time (instead of read time) because this is the way to do it because of scalability (that's what's everybody is saying).... But that the datastore will support it out of the box somewhere next year. Isn't this also about maturity/feature set of GAE and the datastore? I mean relational databases also didn't had all features since day 1.

@bd_: the background processing feature will be soon supported for Python yes, after that Java will follow but when? Weeks? Months? Again this relates also the maturity of the GAE platform and the current feature set.

@Rob:
I agree that many of the queries fall in the "reporting" category.
With background processing support it becomes possible to solve this using message queues etc. instead of at every write updating a lot of calculations.


Maybe the best solution would be to be able to configure on which tables, columns (or combination of columns) you want aggregate values and that the datastore will compute them automatically for you.
@Graeme: is this also what you mean for the Grails plugin

JamesHr said...

I'm always amazed how many responses are made in regard to analytic performance without the mention that Transactional can NOT co-exist with Reporting data in an large-scale environment. Way to go Rob!

Apart from that the question here is data round trips that slow performance in writing your own aggregate functions versus rDBMS builtin support which still requires overhead (nothing of much use is magically stored ahead of time folks)...even the rDBMS has to scan all this data and derive the totals...

Ash said...

Thanks I really got some smart ideas out of this post as I needed to implement count all for a simple table.

Dittmar Steiner said...

(may be it is a little late, but i just discovered this interesting discussion.)
@Ash
At first I didn't like using it because it's outside JDO/JPA, but the datastore API provides a count method:
http://code.google.com/appengine/docs/java/javadoc/com/google/appengine/api/datastore/PreparedQuery.html#countEntities()