The default architectural approach for a dynamic web site, these days, is to put all the content into a relational database system, retrieve pieces of it with SQL queries, and format the content on the fly with some combination of code, style-sheets and style transformations. There are many good reasons for this, among them scalable performance, manageable storage, transactional updating and a full relational query system for use in subsetting and searching.
When I was writing the forum software for vF, though, my primary goals were statelessness and an absolute minimum of installation dependencies, and I was willing to assume relatively small scales, so instead of using a database I opted to store notes as simple XML files directly in the file system. In my scalability tests this actually worked better than I'd expected, so I used a variation on the same approach for the new architecture for the rest of the site when I recently rewrote it.
I did wonder, though, how much speed I was giving up in the interest of file-system transparency and simplicity. So today I got around to dumping the back issues of my music-review column into a MySQL database so I could do a couple of performance tests of the two approaches.
The first test was to build an alphabetized list of issue titles. The SQL version of this is very simple, as the titles are stored in a discrete field, and SQL lets you demand sorting, so a query like "SELECT title,id FROM thistable ORDER BY title" gets the right data in the right order, and some simple post-processing formats it into HTML. The XML version builds a file list by operating-system glob expansion (which is every bit as glamorous as it sounds), reads in each file, uses string-matching to find the content of the appropriate XML mark-up and stuff it into an array of hashes (again, very sexy), and then sorts the array by the title hash key using a custom comparison function. This test is slightly biased in favor of the SQL approach, as my XML code does some extra string-processing in between retrieval and sorting that I left out of the SQL version because there isn't a directly corresponding intermediate step. The XML version could be optimized in several obvious ways, but the SQL version would also be faster with an index.
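For the curious, the glob-read-match-sort pipeline can be sketched roughly like this (a minimal Python sketch, not my actual code; the `<title>` element and the per-issue file layout are assumptions for illustration):

```python
import glob
import re

def build_title_list(note_dir):
    """Collect (title, id) pairs from per-issue XML files, sorted by title."""
    entries = []
    # Operating-system glob expansion, in all its glamour.
    for path in glob.glob(note_dir + "/*.xml"):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # String-matching in place of real XML parsing: grab the first
        # <title> element's content. (Hypothetical markup.)
        m = re.search(r"<title>(.*?)</title>", text, re.DOTALL)
        if m:
            entries.append({"title": m.group(1), "id": path})
    # The "custom comparison function": case-insensitive alphabetical order.
    entries.sort(key=lambda e: e["title"].lower())
    return entries
```

The key cost here, compared to the SQL version, is that every file is opened and read on every request, where the database only touches the title column.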
The second test was to retrieve the back-issues that contain a particular text phrase. Again, in SQL this is very easy: "SELECT title,id,content FROM thistable WHERE content LIKE '%this phrase%'". The XML method is the same as in the first test, plus some arcane string-matching and some even more arcane string-unmatching to avoid matches that occur inside of HTML tags. This second test is more significantly biased in favor of the SQL approach, because my XML code does a whole XML-HTML transformation step that I didn't bother plugging into the SQL version, plus the SQL version would produce false-positives inside tags that in production use would have to be caught in post-processing. There aren't any trivial accelerations for either approach to this problem, and it's a much more processing-intensive example than the first one, so this is the more interesting of the two tests.
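The tag-avoiding string-matching is the interesting part, and a crude sketch of the idea (again a Python illustration, not my actual code) is to strip the markup before looking for the phrase, so that matches inside tags and attributes are ignored while matches that merely span a tag boundary are still found:

```python
import re

def phrase_in_content(html, phrase):
    """Test whether a phrase occurs in the visible text of some HTML,
    ignoring anything inside the tags themselves."""
    # Remove every tag, so attribute values can't produce false positives
    # and a phrase split across inline markup still matches.
    visible = re.sub(r"<[^>]*>", "", html)
    return phrase.lower() in visible.lower()
```

This is exactly the post-processing the plain SQL LIKE query skips, which is why the head-to-head comparison is biased toward the database.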
In the first test, the SQL approach reliably generated the title list in about 45ms, and the XML approach generally took about 89ms, for an SQL advantage of a factor of 2, more or less. This is actually much less of a difference than I anticipated, given the absurd brute-force nature of my current XML approach to this problem.
In the second test, even handicapped by post-processing, the XML method actually beat the SQL method every time, although not by a lot: SQL times ranged from 1.5s to 1.8s, while XML times hovered more consistently around 1.3s to 1.4s.
Neither of these tests was remotely scientific, and I have made no attempt to run them in any environment other than the one I really use, or at any scale other than the one I'm really dealing with. So it would be insane to conclude that a global revolution against databases is imminent. But maybe, at least, I'm not as crazy as I feared for trying to see how far I can get without them.