Over the weekend, a client of mine was hacked as part of what seems to have been a broad, rootkit-based attack that took out a number of sites, apparently as prep work for turning servers into zombies for spam delivery. I won't give out the details of this particular incident, as we're still trying to figure out the exact exploit, but it brought to light a few facets of emergency preparedness that deserve some thought when dealing with XML databases in particular.
The XML databases (eXist-based) were not, thankfully, touched, but in the wake of this we were left with some interesting dilemmas. For instance, we have nearly 5 GB worth of information in one XML database and 3.5 GB in another. Given that we were uncertain about the integrity of our ssh (which was the primary mechanism for communicating with the server) and of sudo (the Linux superuser command), the prospect of getting to that large an amount of data was somewhat daunting. Along the way we came to several realizations that should be considered by any XDBA (XML Database Administrator).
- XML databases should be backed up nightly. Both eXist and Mark Logic (and I believe most of the other XDBs) provide support for both zipping content and for performing diff operations, so that incremental backups can be kept. Once databases get large enough, a full backup can be an expensive and time-consuming operation. Thus, the best strategy here is to create a weekly full backup at the quietest time during the week, then do incremental backups every night thereafter until the next full backup. These should be cron'd - automatically scheduled - so that they don't get forgotten in the regular bustle of things.
- Back up off-machine. The advantage of zip files is that they can be retrieved via a web service. Write up pages that make the data available via a web interface (either via REST or WebDAV) and pull copies of the updated backups onto a different machine (and perhaps even a different IP block, just in case you get hit with a drive-by hack attack).
- Check the integrity of those files periodically. In an era of wireless connectivity, connections can get lost mid-stream, leading to XML files that are truncated and corrupt. Backing up over a wired connection is always preferable, of course, but writing a script that attempts to open the zip files and logs an error for any that fail can keep you from discovering too late that all of your carefully backed up data is gibberish.
- Archive, clean and purge. As with any other database, the larger the database, the worse the overall performance - and a lot of information in a database may ultimately end up being intermediate or bookkeeping files, or just no longer needed. Develop a strategy for archiving seldom-accessed content, and pare databases down when information is no longer needed on a day-to-day basis. (Again, the ability to read and write ZIP content within XML databases comes in handy, as such zip archives can be useful for storing and retrieving older content.)
- Don't forget the external files. XML document databases often include code and resources that live outside the database itself (this is definitely true of eXist). These should be included in any backup strategy as a general principle: if it becomes necessary to rebuild the database from scratch, the zip file containing the data content will not help you rebuild application logic that was stored externally (this is another case for storing application logic within the database, but that's a discussion for another day).
- Keep an archive of the build used for your database. This is more true for open source databases such as eXist, which may change dramatically, especially if you're using a development build. External XQuery modules especially may be deployed for a given build, and code written against the functions in those modules may well break if a newer build is used. The worst time to be rebuilding your application is when you're recovering from a hack, a fire, or a similar problem.
- Keep good documentation and redundant access for your systems. Most relational DBAs are unlikely to have a clue when it comes to managing an XML DB, so it's a good idea to have written documentation in place for how to start and stop your XML database (if it's not automatically restarted as a service, something that should be strongly considered as well). Having two people designated for working with a company's XML DB is also a sound strategy - people get sick or are away from the phone or Internet almost invariably just as disaster strikes. As XML DBs become more heavily integrated within your data strategy, an XML database that fails to start could cause major application problems, and often ones that many programmers would be unable to help with.
- Use physical media. Saving to tape backup or even a high-density (16 GB or above) USB key fob can also ensure that, if both you and your co-lo are hit by a power outage or natural disaster, you can set up a database quickly elsewhere.
- Run redundant systems. You can make use of web-based services to communicate between a production and a backup database, such that any transaction made to one is made to the other within a limited period of time. While this adds to your load somewhat, the backup system generally will have comparatively little demand on it besides the updates. The extra RAM costs are more than made up for by the integrity of the data and the ability to ensure near-seamless turnover if a primary system becomes corrupt or goes down.
- Lock down potential security holes. Most modern XML databases are much more like application servers than they are SQL databases, and that often includes access to file systems, mail systems, SQL access and the like. Never run any of these processes in a way that guest accounts could access them, and if you don't need the capabilities of a given module, don't enable it. Also, always, always ensure that your admin account is at a minimum password protected.
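The weekly-full, nightly-incremental rotation from the first item above can be sketched in a few lines of Python. This is only an illustration: the choice of Sunday as the quiet day and the function names `backup_kind` and `restore_chain` are assumptions made for the example, and the actual backup would still be triggered through whatever mechanism your XML database provides.

```python
from datetime import date, timedelta

# Assumption for this sketch: Sunday is the quietest day (Monday is 0).
FULL_BACKUP_WEEKDAY = 6


def backup_kind(day):
    """Return 'full' on the weekly full-backup day, 'incremental' otherwise."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"


def restore_chain(target):
    """List the (date, kind) backups needed to restore the state as of `target`:
    the most recent full backup, then every nightly incremental after it."""
    days_since_full = (target.weekday() - FULL_BACKUP_WEEKDAY) % 7
    full_day = target - timedelta(days=days_since_full)
    chain = []
    for offset in range(days_since_full + 1):
        day = full_day + timedelta(days=offset)
        chain.append((day, backup_kind(day)))
    return chain
```

A nightly cron job could call `backup_kind(date.today())` to decide which kind of backup to run, and `restore_chain` shows why a missing or corrupt incremental matters: everything after it in the chain becomes unusable until the next full backup.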
Most of these principles should be old hat to regular DBAs, but given that XML database users may not necessarily be database administrators, it's worth repeating them in an XML context.
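As an illustration of the integrity check recommended in the list above, here is a minimal Python sketch. It assumes the backups are ordinary zip archives collected in one directory and that members whose names end in `.xml` should be well-formed XML; the function names `check_backup` and `check_all` are made up for this example.

```python
import logging
import xml.etree.ElementTree as ET
import zipfile
import zlib
from pathlib import Path

log = logging.getLogger("backup-check")


def check_backup(path):
    """Return a list of problems found in one backup archive (empty if none)."""
    problems = []
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() re-reads every member and verifies its CRC.
            bad_member = archive.testzip()
            if bad_member is not None:
                return [f"{path}: CRC failure in member {bad_member}"]
            for name in archive.namelist():
                if not name.endswith(".xml"):
                    continue  # assumption: only .xml members are parsed
                try:
                    with archive.open(name) as member:
                        # Stream the document; a truncated file raises ParseError.
                        for _event, element in ET.iterparse(member):
                            element.clear()
                except ET.ParseError as exc:
                    problems.append(f"{path}: {name} is not well-formed ({exc})")
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError) as exc:
        problems.append(f"{path}: cannot read archive ({exc})")
    return problems


def check_all(backup_dir):
    """Check every .zip in backup_dir, log each problem, return the problem count."""
    count = 0
    for path in sorted(Path(backup_dir).glob("*.zip")):
        for problem in check_backup(path):
            log.error(problem)
            count += 1
    return count
```

Run from cron shortly after the off-machine copy completes, a non-zero result from `check_all` is the cue to re-pull the backup while the source is still available.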
- Kurt Cagle's blog


Re: XML Database Security and Recovery
In EMC xDB, we make a distinction between exports and backups. Like probably any proper database, we support incremental and complete backups that run off the proprietary binary XML format we use to store content in the database. That format is very space efficient, and the incremental backups are very fast as they run directly off the transaction logs. However, you end up with data in a binary format that you won't be able to read with any tool except xDB itself.
So the alternative is exports, where you actually export the literal XML to disk. The advantage is that this is just regular XML, so any tool of your liking can access it. The drawback is that exports are typically slower than binary backups, and it's not (easily) possible to create truly incremental exports. You could export all XML files that have been changed since X, but if you just changed the text of that one title element in a multi-gigabyte XML file, you would still waste a lot of space.
So xDB users generally create daily incremental backups, and weekly complete backups. Exports are - I think - much less used, at least not as the regular backup strategy.
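The "export everything changed since X" idea can be illustrated at the file level with a short Python sketch. It is not xDB's API: it assumes a hypothetical directory of already exported `.xml` files and uses file modification times as the change marker, with `changed_since` a name made up for this example.

```python
from pathlib import Path


def changed_since(export_dir, since_epoch):
    """Return (path, size_in_bytes) for every .xml file under export_dir
    whose modification time is later than since_epoch (seconds since epoch)."""
    changed = []
    for path in sorted(Path(export_dir).rglob("*.xml")):
        info = path.stat()
        if info.st_mtime > since_epoch:
            changed.append((path, info.st_size))
    return changed
```

Summing the sizes returned makes the space cost visible: a one-word edit to a very large document puts the whole file back into the next incremental export.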
Re: XML Database Security and Recovery
Kurt said, "we were uncertain about the integrity of our ssh".
Does this mean that it is suspected that someone was able to decode the ssh stream?
I think that, when running under Linux, the intrusion is mainly possible through an escalation of rights.
Re: XML Database Security and Recovery
There had been an escalation of rights, and ssh had been compromised as a consequence. It'd be pretty difficult to scan an ssh stream, but this attack put a trojan in front of ssh, which meant that anything that went over it was visible.
We did manage, finally, to root the damn thing out, but it was a painful, nerve-wracking exercise.
Re: XML Database Security and Recovery
As far as I know, you are running eXist on a Unix/Linux platform.
For my part, I run it under Windows 2008 Web Server.
I noticed that it is essential to run it as a service for security purposes (otherwise it runs under the installation account, which is a major hole for intrusion).
Do not forget to set up a dedicated account to run the service's process (the default, SYSTEM, is dangerous).
Be careful with the permissions on the eXist and database directories.