This sounds familiar.
I look after a system of similar design, but the key numbers (simultaneous users, total data size) are wildly different.

My first thought would be a half-dozen load-balanced webservers to get simultaneous access up at the front. I'd also implement some solid structures in the framework so that anything messy about having multiple web servers is abstracted away and solved once. This would also be a good place to do some intelligent short-term caching, possibly with memcached.

Next I'd be asking for more detail about the data access pattern. Significantly, what is the read/write ratio like? And how is the dataspace organised? If the ratio is heavily weighted towards reads, then you could get away with a master database on a big box with several slaves on other equally big boxes. DB abstraction layers can be taught to go to a random slave for reading and still send writes to the master. Transactions, if you need them, will have to be thought through.

If the read/write ratio is more like 1:1, then I would look at what LiveJournal have done: partition their data across multiple databases. That way some users hit one database most of the time and others hit another most of the time.

I think this approach could be extended to support a distributed application like you suggest. You'd end up with a lot of duplicated data between all the instances, though. Creating new entries will be tricky; one trick I used successfully in a previous job was to devote part of any uniqueID (like the high half) to an 'instance' number. That ensured there would be no conflicts when data was copied or moved. It does mean you might have to abandon any sequencing the database will do for you, but transactions and stored procedures will help.
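
Something like this, to illustrate (a rough sketch; the function names and the 16-bit split are mine, scale the halves to suit):

    <?php
    // Pack an 'instance' number into the high half of an ID. IDs
    // minted by different instances can then never collide, so rows
    // can be copied or moved between databases safely.
    function make_id($instance, $local_seq)
    {
        return ($instance << 16) | $local_seq;   // 16 bits each
    }

    function id_instance($id) { return $id >> 16; }
    function id_local($id)    { return $id & 0xFFFF; }
    ?>

The cost is that each instance has to mint $local_seq itself instead of leaning on the database's auto-increment.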

Wade.
"Don't give up!"
Re: This sounds familiar.
> My first thought would be a half-dozen load-balanced webservers to get simultaneous access up at the front. I'd also implement some solid structures in the framework so that anything messy about having multiple web servers is abstracted away and solved once. This would also be a good place to do some intelligent short-term caching, possibly with memcached.
The problem with shared-instance caching is that the data can be very volatile.

> Next I'd be asking for more detail about the data access pattern. Significantly, what is the read/write ratio like? And how is the dataspace organised? If the ratio is heavily weighted towards reads, then you could get away with a master database on a big box with several slaves on other equally big boxes. DB abstraction layers can be taught to go to a random slave for reading and still send writes to the master. Transactions, if you need them, will have to be thought through.
Assume equal reads and writes.

> If the read/write ratio is more like 1:1, then I would look at what LiveJournal have done: partition their data across multiple databases. That way some users hit one database most of the time and others hit another most of the time.
Instances can't be partitioned, since they are shared by all users. Instances could be wholly on separate partitions, however.

> I think this approach could be extended to support a distributed application like you suggest. You'd end up with a lot of duplicated data between all the instances, though. Creating new entries will be tricky; one trick I used successfully in a previous job was to devote part of any uniqueID (like the high half) to an 'instance' number. That ensured there would be no conflicts when data was copied or moved. It does mean you might have to abandon any sequencing the database will do for you, but transactions and stored procedures will help.
Again, the trick with duplicated data will be invalidating cached data or propagating changes.

Thanks for the input. I'm actually more interested in sparking discussion, and I thought this was a good example to use since it's different from things like forum software and the like.

Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
We're back to a scalability layer.
> The problem with shared-instance caching is that the data can be very volatile.


This is what usually happens when I say 'caching' in this context. :-)

The sort of caching I usually have in mind in this context is the kind that, assuming a typical PHP page, stops the page hitting the database multiple times for the same piece of data. Lifetime is typically seconds and validity is local to the page.
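
In PHP it can be as simple as this (a sketch; the helper name is mine, and getOne() here is the PEAR DB call, substitute whatever your layer uses):

    <?php
    // Page-local cache: the first call per key hits the database,
    // repeat calls within the same page request are free.
    function page_cached($db, $key, $sql)
    {
        static $cache = array();
        if (!isset($cache[$key])) {
            $cache[$key] = $db->getOne($sql);
        }
        return $cache[$key];
    }
    ?>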

Shared-instance caching of data is a different game. You don't cache volatile data with that; you cache essentially static data. The active user list is a good example, provided users don't get added or deleted all that often. Most apps have a range of data like that: a few to a few dozen entries that don't change for days. Even if it's not much data, I imagine putting it in memcached is cheaper than constantly fishing it out of the database.
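
For example (using the pecl Memcache extension; the key name, TTL and loader function are invented):

    <?php
    $mc = new Memcache();
    $mc->connect('localhost', 11211);

    $users = $mc->get('active_user_list');
    if ($users === false) {
        $users = load_active_users();                   // hypothetical DB helper
        $mc->set('active_user_list', $users, 0, 3600);  // keep for an hour
    }
    ?>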

On the other hand, putting PHP session information into memcached apparently works very well. I'm just about to try that.
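
If I'm reading the pecl memcache docs right, it's little more than:

    <?php
    // Point PHP's session handler at memcached instead of files.
    ini_set('session.save_handler', 'memcache');
    ini_set('session.save_path', 'tcp://localhost:11211');
    session_start();
    ?>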

> Assume equal reads and writes.


Ouch. That makes replication much less attractive.

> Instances can't be partitioned, since they are shared by all users. Instances could be wholly on separate partitions, however.


I think I need to be careful with my terminology. LJ horizontally partitioned their user database because they never joined one user's data with another user's data. Then they have a cluster database which indicates which cluster a particular user's data is on.
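
In code it amounts to something like this (a rough sketch; table, column and variable names are all invented):

    <?php
    // One small shared database maps user -> cluster; all of that
    // user's data then lives on the cluster it names.
    function conn_for_user($globaldb, $clusters, $userid)
    {
        $c = $globaldb->getOne(
            'SELECT clusterid FROM usermap WHERE userid = ' . (int)$userid);
        return $clusters[$c];   // one DB connection per cluster
    }
    ?>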

Is your data structure like that? The support system I work with could probably do that because individual tickets stand alone.


> Thanks for the input. I'm actually more interested in sparking discussion, and I thought this was a good example to use since it's different from things like forum software and the like.


Indeed. And I really welcome the opportunity.

I personally have been wrestling with the concept of a scalability/database-abstraction layer and what it means in practice. I posted about it elsewhere, but my main concern with the direction I'm heading is that it seems to be taking SQL away from the application layer; my developers aren't showing any signs of wanting to understand SQL anyway. Yet I can see benefits from this, because I could put different objects in different databases, or even not in a database at all, and the application code wouldn't need to know or care.

Wade.

Postscript: I have an opportunity to help optimize a quite different application that is thrashing its database. The owner of the instance has a replication setup but no knowledge of how to make the app talk to the slave for reads :-/. Unfortunately, the database handler is the ADODB one from PEAR. In other words, the *whole thing* assumes there is one database and only one database. Knobs. This is (one reason) why I don't like the PHP database library add-ons. And modifying it will be a non-trivial task. *sigh*

Unfortunately, the one I've written here that *does* know how to use multiple databases intelligently is technically owned by my employer...
"Don't give up!"
You can do that with PEAR
And still keep it away from the programmers. Create a class that extends DB.php. Wrap the connection method with your own, which checks whether you're reading or updating. If you're reading, do it from one of the slaves. The best way is to put a load balancer in front of the slaves so you can dynamically add/remove slaves without interrupting the app. Next best is round-robin or random selection within your class; see the sketch after the ADODB caveat below.

Wait, you said ADODB. I used DB, not DB_ado, but I assume the above still applies.
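
A rough sketch of the idea (untested, PHP 4-style; I'm delegating to PEAR DB rather than literally subclassing it, and the DSNs are placeholders):

    <?php
    require_once 'DB.php';

    // Reads go to a random slave, everything else goes to the master.
    class BalancedDB
    {
        var $master;
        var $slaves = array();

        function BalancedDB($master_dsn, $slave_dsns)
        {
            $this->master = DB::connect($master_dsn);
            foreach ($slave_dsns as $dsn) {
                $this->slaves[] = DB::connect($dsn);
            }
        }

        function query($sql)
        {
            // Crude read/write check: only SELECTs may hit a slave.
            if (preg_match('/^\s*SELECT/i', $sql) && count($this->slaves)) {
                $conn = $this->slaves[array_rand($this->slaves)];
            } else {
                $conn = $this->master;
            }
            return $conn->query($sql);
        }
    }

    // The app only ever sees one "database":
    // $db = new BalancedDB('mysql://app@master/mydb',
    //                      array('mysql://app@slave1/mydb',
    //                            'mysql://app@slave2/mydb'));
    ?>
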
===

Kip Hawley is still an idiot.

===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
ADODB has a reputation.
Mainly that it's a heavy layer for what it does.

You're right: deriving a new class and using that instead makes sense. That's probably what I'll do for that other project.

Wade.
"Don't give up!"
Re: We're back to a scalability layer.
> Is your data structure like that? The support system I work with could probably do that because individual tickets stand alone.
There may be some opportunities, but they are probably limited.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."