December 2000

TRAM: An Old Idea Forgotten

Dru Nelson dnelson@redwoodsoft.com

Today, if your power supply fails, your computer may take a long time to fsck. What if that computer is the one holding the main database for your website? What if that is a large database on a 200 gigabyte RAID? Clearly, this is a problem, and more people would probably pay to have it solved if the solution were inexpensive. However, the solutions have generally been expensive, so most people settle for the default 'fault tolerance' that unix provides. One alternative is journaled meta-data filesystems such as the ones in AIX, IRIX, and a few other systems. These filesystems are still missing from most free unixes, and are definitely not in a production state there. They require major architectural changes to an OS, and they still have serious write-throughput issues. In the FreeBSD case, there is also the 'soft updates' technology. It is interesting, and some interesting work is happening there, but it will not be ready soon, nor will it solve as many of the fault-tolerance problems as we would like. This has been the state of unix for most people for the last 30 years.

I believe that for the last 5 years this problem could have been solved fairly inexpensively with a little hardware and very minor software changes. I will call this technology TRAM, which stands for Transactional-RAM. A typical TRAM card would probably cost about $300. For example, a user would be able to plop a PCI card into a FreeBSD system running a standard UFS filesystem and a kernel with TRAM support. That user could power cycle the computer and it would come up cleanly in under a couple of minutes. With a little work, it could be guaranteed to come up in under 1 minute. As a free bonus, their databases would commit writes faster than any disk system could provide, without the need for an expensive RAID for write throughput. In the latter part of this article, I'll describe a few other interesting and exotic uses that could come about. So let's start by describing where unix currently stands, then describe TRAM, its history, and its future.

Current-Day Unix

In unix, when a program needs a persistent copy of some data, it stores the data in a file within a filesystem. That data resides at set locations on a hard disk. As of the year 2000, the differences in speed between CPU, memory, and disk are quite dramatic. CPU instruction cycles take a few nanoseconds, memory accesses are just getting under tens of nanoseconds, and disk accesses still take on the order of ten milliseconds. In order to keep things snappy, CPUs have to cache accesses to memory. In the same vein, OSes try to cache disk accesses in memory to avoid waiting a million instructions for a single read or write.

A long time ago, when the original unix researchers saw this great divide in performance approaching, they chose the policy of "cache as much as you can in memory, without making the recovery job too difficult". This policy has found many interpretations across the different unixes out there. In BSD unix, for example, the filesystem designed for common use was UFS. UFS and the virtual memory (VM) system work together on caching disk information in main memory. The system tries to keep as much information about a file in memory as it can. However, if a file is closed, or a timer expires, it will eventually write the inode and data to disk. It does this in a particular order to make the job of 'fsck' easier when restoring the state of the filesystem. For example, one rule is that UFS always writes an inode to disk before linking a name to it in a directory. This gives 'fsck' one rule that makes its life simpler: a directory should always point to a valid inode.

<Side Note: Here are the other interpretations: Linux ext2fs - cache everything dangerously. It gets speed, but requires a complex fsck and there is a higher chance of data loss. Solaris - very similar to bsd ufs. Windows NT - journaled meta-data, not known for speed. It has decent recovery, though.>

Ok, but upon recovery this only guarantees that the filesystem 'meta-data' is intact. It also means that the recovery program must scan the entire filesystem. Filesystem meta-data is anything related to managing the data in files, but not the data in the files themselves. Directories, inodes, block maps, and super-blocks are all filesystem meta-data structures. So why is it that only this much is guaranteed?

Core unix philosophy

Unix was designed to be simple. This meant that if something could be done as a library in user space, it didn't belong in the kernel. For example, VMS had the ability to treat certain files as arrays of fixed-length records. Research on file structures was happening all the time, and the code to add record handling to the unix system APIs would have been significant. The existing unix APIs already allowed people to read and write fixed-length records without any support from the OS. Therefore, let the user's code handle this abstraction on top of the unix kernel.
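To make that concrete, here is a minimal sketch of how a user program can layer fixed-length records on top of the plain unix file API with nothing but lseek() and read()/write(). The 1024-byte record size and the file name are made up for illustration.

    /* records.c -- fixed-length records layered on the plain unix file API.
     * The 1024-byte record size and the file name are arbitrary. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define RECLEN 1024

    /* Read record number 'recno' into 'buf' (buf must hold RECLEN bytes). */
    ssize_t read_record(int fd, long recno, char *buf)
    {
        if (lseek(fd, (off_t)recno * RECLEN, SEEK_SET) == (off_t)-1)
            return -1;
        return read(fd, buf, RECLEN);
    }

    /* Write record number 'recno' from 'buf'. */
    ssize_t write_record(int fd, long recno, const char *buf)
    {
        if (lseek(fd, (off_t)recno * RECLEN, SEEK_SET) == (off_t)-1)
            return -1;
        return write(fd, buf, RECLEN);
    }

    int main(void)
    {
        char rec[RECLEN];
        int  fd = open("records.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
            return 1;
        memset(rec, 'x', RECLEN);
        write_record(fd, 3, rec);   /* record 3 lives at byte offset 3072 */
        read_record(fd, 3, rec);
        close(fd);
        return 0;
    }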

Overall, this was a good choice: filesystem structures and file data structures have changed over time. However, just as a theoretical example, if unix knew more about the structure of the data it was writing, it could guarantee the consistency of that data. For example, if it knew a file held fixed-length records, it could order writes in a way that kept the data in the file consistent. The easiest way to explain this is to show how hard it is for unix to handle it today.

So let's say your records are 1024 bytes long and your disk handles writes in 512-byte sectors. If you write a 1024-byte record to a certain part of a file, the first 512 bytes may get written. Now let's say the computer fails immediately after that, before the next 512 bytes are written. When the computer comes back up, the file will contain a record in which only half the data is valid. This is a bad thing. In fact, what if that write was a 1 megabyte record? How do you keep track of all of these writes without knowing what you are writing? Unix doesn't know about the data, so it just does the best it can.
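Unix itself cannot fix this, but an application can at least detect it. One common application-level trick, sketched below under the same 1024-byte record and 512-byte sector assumptions, is to stamp the same sequence number at both ends of each record; after a crash, a mismatch means the record was only partially written. This is an illustration of a workaround, not something unix does for you.

    /* torn.c -- detecting a torn 1024-byte record after a crash.
     * The disk may persist only the first 512-byte sector of a write,
     * so each record carries the same sequence number at its head and
     * tail; if they differ after a crash, the record is torn. */
    #include <stdint.h>
    #include <string.h>

    #define RECLEN  1024
    #define PAYLOAD (RECLEN - 2 * sizeof(uint32_t))

    struct record {
        uint32_t seq_head;           /* lands in the first sector */
        char     payload[PAYLOAD];
        uint32_t seq_tail;           /* lands in the last sector  */
    };

    /* Stamp both ends of the record with the same sequence number. */
    void record_fill(struct record *r, uint32_t seq, const char *data, size_t len)
    {
        memset(r, 0, sizeof(*r));
        if (len > PAYLOAD)
            len = PAYLOAD;
        memcpy(r->payload, data, len);
        r->seq_head = seq;
        r->seq_tail = seq;
    }

    /* After a crash: a mismatch means only part of the record made it
     * to disk, so the record must be discarded or repaired. */
    int record_is_torn(const struct record *r)
    {
        return r->seq_head != r->seq_tail;
    }

Tricks like this are exactly the kind of per-application effort the next few paragraphs describe.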

This is a hard problem, so the answer is to punt, that is, deal with it in your own way. Here are a few techniques used:

One old trick with small files is to 'write a new file, then move it over the old file'. Once a file is closed, you can generally assume it is on disk. This only works for small files, and it is very bad for write-intensive workloads.
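Here is a minimal sketch of that trick, with an fsync() thrown in before the rename for good measure. The file names are made up; rename() within one filesystem is atomic, so a reader sees either the old file or the new one, never a half-written mixture.

    /* replace.c -- 'write a new file, then move it over the old one'. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int save_config(const char *data)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
            fsync(fd) != 0) {           /* make sure the new copy is on disk */
            close(fd);
            return -1;
        }
        close(fd);
        /* Atomic within a filesystem: the old 'config' is replaced in
         * one step, so there is never a half-written config file. */
        return rename("config.tmp", "config");
    }

    int main(void)
    {
        return save_config("color = blue\n") == 0 ? 0 : 1;
    }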

The next technique is to build a recovery program. For example, vi has its own recovery mode. This is an extra step that a lot of programmers don't want to deal with, but most text editing programs do.

Another technique is to build a consistency checker. This is probably the most common approach. The program will just tell you whether the data (a config file, say) is OK or not; you have to fix it yourself. These are easier to build than recovery systems, but they place the burden on the user.

Modern databases use another technique for handling recovery. They log their data to a file, and then force it past the unix cache with the fsync() system call. The fsync() call guarantees that all of the file's data is on disk by the time it returns. So the combination of a log file that records database 'transactions' in time-serial order and an fsync() after every transaction offers good recovery for a database. This technique is the most advanced of the recovery systems, and it is one of the other reasons people use databases for storing data: they don't have to write this code themselves. So why not use fsync() all the time? Well, if you do, you are back to square one, with really poor general performance for all writes. A database calls fsync() on only a few log files, which keeps it fast.
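A bare-bones sketch of that log-and-fsync pattern is below. The record format, file name, and function names are invented for illustration; a real database does considerably more (sequence numbers, checksums, checkpointing).

    /* wal.c -- bare-bones transaction logging with fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int logfd = -1;

    int log_open(const char *path)
    {
        logfd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        return logfd < 0 ? -1 : 0;
    }

    /* Append one transaction record and force it to disk. Only after
     * fsync() returns may the caller report the transaction as committed. */
    int log_commit(const char *txn)
    {
        char rec[512];
        int  len = snprintf(rec, sizeof(rec), "TXN %s\n", txn);

        if (len < 0 || len >= (int)sizeof(rec))
            return -1;
        if (write(logfd, rec, len) != len)
            return -1;
        return fsync(logfd);            /* the slow but safe part */
    }

    int main(void)
    {
        if (log_open("orders.log") != 0)
            return 1;
        log_commit("order 42: 1 widget");
        close(logfd);
        return 0;
    }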

To see why synchronous writes everywhere would hurt, consider access times. If you don't have the 'noatime' option on your filesystems (and most people probably don't), any time you access a file, the kernel records the access time in its inode. Imagine if every 'ls -al' forced the disk to synchronously write updated inodes for those files just to record the access times. Very slow unix.

So the main point is this: your data may not be on disk when you write it. Some applications cannot tolerate that kind of ambiguity, and I believe most users would like to avoid it too if they could. Programmers will come up with their own techniques, but building good recovery systems for large data sets usually takes a lot of programmer effort. Even when that is done, the data still sits on a unix filesystem that will require an fsck, which could take a long time.

Let's describe an example. Say you have a large multi-gigabyte database. Someone uses your website to enter an order for a widget into that database. The record gets updated in the database and some mail is sent out. However, after you authorized their credit card but before the record was written to disk, the power fails... from your UPS. So the user got an email saying they ordered something, and they don't check back for a while, until they decide to become acquainted with your customer service department. Note that your customer service department was already handling calls asking what 'mysql_connection timed out' on your web site means, while your computer was fscking its disk and you were running mysql-isamchk on the database for 4 hours (since mysql doesn't do recovery logging).

Before you start pointing at possible solutions (and faults) in this scenario, let's cover some basics. A modern database design will use recovery logging, so it can come up quickly. That still doesn't prevent it from losing a little data, but it sure gets the system back online sooner. These database systems are generally expensive. You will also need a good journaled filesystem so your fsck time is cut down to minutes. Note that no free unix today has a production-ready journaled filesystem, at least not to the point where people trust their main database on it. And if you do have a unix with a journaled filesystem, you will have to spend money designing an IO system to cope with the journaled filesystem's IO patterns. Remember, you may still lose data with all of these things in place. Surviving all data loss is 'outside the scope of this article.' :-)

The bottom line is this: it is really quite expensive to deal with failure given the current design of unix systems, free ones included. The big boys can afford it, but I don't believe that should be necessary. A common, inexpensive solution could exist.

TRAM

So, what is the answer? The answer is simple: place battery-backed, high-speed memory into the computer. Call this a TRAM module. Then add a little bit of code to your kernel and your filesystem recovery utilities. Now you can safely accelerate all writes into this memory and power-cycle the computer in easily under 3 minutes (BIOS permitting).

So, before we go into some of the details about TRAM and how it differs from everything out there, let me state a few things.

First, it is absolutely critical that the OS create some log or structure of filesystem operations on the TRAM. Basically, if the OS can mark the beginning and end of an operation and place that in this memory, you get a journaled meta-data filesystem without completely re-architecting the filesystem. This means you could upgrade an OS, stick a card in, and just mount your existing /usr with the mount option '-o tram'. That is not insignificant; the free unix camps have spent years just talking about getting journaled filesystems. This could essentially leapfrog the need for a journaled filesystem implementation.
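As a concrete (and entirely invented) illustration, a TRAM metadata log might be a sequence of records like the ones below: every filesystem operation writes a BEGIN record, one or more block updates, and an END record. None of these names exist anywhere today; this is only a sketch of the idea.

    /* A guess at a TRAM metadata log record. Everything here is
     * invented for illustration; no such interface exists today. */
    #include <stdint.h>

    #define TRAM_BEGIN 1        /* a filesystem operation has started     */
    #define TRAM_DATA  2        /* one block update within an operation   */
    #define TRAM_END   3        /* the operation completed                */

    struct tram_record {
        uint32_t magic;         /* marks a valid record                   */
        uint32_t type;          /* TRAM_BEGIN, TRAM_DATA, or TRAM_END     */
        uint64_t opid;          /* which operation this record belongs to */
        uint64_t blkno;         /* disk block a TRAM_DATA record targets  */
        uint32_t length;        /* bytes of block data that follow        */
        /* 'length' bytes of block data follow a TRAM_DATA record */
    };

The point of the two marks is that recovery can tell a completed operation from one that was cut off half-way.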

If you don't create those structures with the OS, you have just created a disk write cache that will still require an 'fsck'. For example, I am often asked: 'what if you just put the memory on a disk, or on the SCSI controller?' None of those devices understand what UFS or ext2fs is, so they cannot guarantee the consistency of the filesystem, and you would still need to run fsck. Also, since the OS needs to access the data, it is best to put it on the highest-bandwidth bus available to the CPU. PCI and AGP are the only connectors I know of that fit that category.

How much would this cost? Let's first consider how much data would need to be buffered at any one time. In my experience, a typical database for a busy web site will not generate more than 5 megs of data every 5 minutes. So, let's give the card 32 megs of memory. Given a PCI card, 32 megs of memory, error correction, a memory/PCI controller, a small lithium-ion battery, and initially low production runs... I doubt that such a card should cost more than $300. That is not a lot, yet it adds a tremendous amount of system availability. With larger production runs, I'm sure it could get under $100. And as more advanced designs come about, I'm sure the hardware necessary to support them will remain inexpensive.

How would the OS work with the TRAM for reads and writes? What I will describe is very basic, and this is where a lot of details crop up that are not worth getting into; the basic idea, though, is intact. Some interesting OS research could be done here. When the OS needs to write data, it writes the transaction to TRAM before flushing the block from the VM system. If it needs to read a block that isn't in the VM, it checks the TRAM area before doing a disk IO. Finally, a process should periodically sweep the TRAM area, coalescing redundant writes and flushing old entries to disk.
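To pin the idea down, here is a toy user-space model of those three paths: writes land in the TRAM log (coalescing repeated writes to the same block), reads check the TRAM before the disk, and a periodic flush pushes entries out to the disk. It is a simulation of the idea only, with made-up sizes, not kernel code.

    /* tram_model.c -- a toy user-space model of the TRAM write/read/flush
     * paths. Not kernel code; arrays stand in for the card and the disk. */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS   16
    #define BLOCKSIZE 64
    #define TRAMSLOTS  8

    static char disk[NBLOCKS][BLOCKSIZE];       /* the slow device */

    struct tram_entry {
        int  valid;
        int  blkno;
        char data[BLOCKSIZE];
    };
    static struct tram_entry tram[TRAMSLOTS];   /* battery-backed log */

    /* Write path: the block lands in TRAM; repeated writes to the same
     * block are coalesced into one slot. */
    int tram_write(int blkno, const char *data)
    {
        int i, slot = -1;

        for (i = 0; i < TRAMSLOTS; i++) {
            if (tram[i].valid && tram[i].blkno == blkno) {
                slot = i;                       /* coalesce: reuse this slot */
                break;
            }
            if (!tram[i].valid && slot < 0)
                slot = i;                       /* remember a free slot */
        }
        if (slot < 0)
            return -1;                          /* log full: flush first */
        tram[slot].valid = 1;
        tram[slot].blkno = blkno;
        memcpy(tram[slot].data, data, BLOCKSIZE);
        return 0;
    }

    /* Read path: check TRAM first, then fall back to the disk. */
    void tram_read(int blkno, char *out)
    {
        int i;

        for (i = 0; i < TRAMSLOTS; i++) {
            if (tram[i].valid && tram[i].blkno == blkno) {
                memcpy(out, tram[i].data, BLOCKSIZE);
                return;
            }
        }
        memcpy(out, disk[blkno], BLOCKSIZE);
    }

    /* Periodic flush: push entries out to the disk and free the slots. */
    void tram_flush(void)
    {
        int i;

        for (i = 0; i < TRAMSLOTS; i++) {
            if (tram[i].valid) {
                memcpy(disk[tram[i].blkno], tram[i].data, BLOCKSIZE);
                tram[i].valid = 0;
            }
        }
    }

    int main(void)
    {
        char buf[BLOCKSIZE] = "hello";

        tram_write(3, buf);
        tram_read(3, buf);      /* served from TRAM; the disk was never touched */
        tram_flush();           /* later: the block finally reaches the disk    */
        printf("%s\n", disk[3]);
        return 0;
    }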

If the computer reboots, what would recovery be like? The fsck_tram program would know to write to disk only the transactions that have both 'begin' and 'end' marks. In fact, it might not do a traditional fsck at all: it might just walk the TRAM, throw out the incomplete records, replay the rest, and mount the disk. Now that is fast!
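Continuing the same invented record layout from the earlier sketch, recovery could be a simple two-pass scan of the log: pass one notes which operations have both marks, pass two replays only those operations' blocks. Again, the names and sizes are made up; this is a toy, not fsck_tram.

    /* replay.c -- toy two-pass recovery over an in-memory TRAM log. */
    #include <stdint.h>
    #include <stdio.h>

    #define TRAM_BEGIN 1
    #define TRAM_DATA  2
    #define TRAM_END   3
    #define MAXOPS     64           /* toy assumption: small operation ids */

    struct tram_record {
        uint32_t type;              /* BEGIN, DATA, or END          */
        uint64_t opid;              /* which filesystem operation   */
        uint64_t blkno;             /* block a DATA record targets  */
        char     data[512];
    };

    /* Stand-in for writing a block back to the real disk. */
    static void disk_write(uint64_t blkno, const char *data)
    {
        (void)data;
        printf("replaying block %llu\n", (unsigned long long)blkno);
    }

    void tram_replay(const struct tram_record *log, int nrec)
    {
        int begun[MAXOPS] = {0}, ended[MAXOPS] = {0};
        int i;

        for (i = 0; i < nrec; i++) {            /* pass 1: collect marks */
            if (log[i].type == TRAM_BEGIN) begun[log[i].opid] = 1;
            if (log[i].type == TRAM_END)   ended[log[i].opid] = 1;
        }
        for (i = 0; i < nrec; i++) {            /* pass 2: replay        */
            if (log[i].type == TRAM_DATA &&
                begun[log[i].opid] && ended[log[i].opid])
                disk_write(log[i].blkno, log[i].data);
            /* records from incomplete operations are simply discarded */
        }
    }

    int main(void)
    {
        struct tram_record log[3] = {
            { TRAM_BEGIN, 7, 0,  "" },
            { TRAM_DATA,  7, 42, "new directory block" },
            { TRAM_END,   7, 0,  "" },
        };

        tram_replay(log, 3);    /* operation 7 is complete: block 42 is replayed */
        return 0;
    }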

Before you go there....
Based on some of the initial conversations I've had with people about this over the years, I need to state a few things.

THE INITIAL IMPLEMENTATIONS

So how will this really happen in the real world? I believe the first widely available TRAM device will be a simple PCI card with a new PCI device type, a high-speed error-correcting memory system, and a power management system. An operating system should recognize it much like a disk, and it should allow partitioning. That way, I could use part of it as a RAM disk (or whatever I want) and part of it as TRAM for the filesystem. A TRAM section or partition should also have a known format and a unique identifier. The unique identifier should match up with a filesystem's unique id, and the partition should track a state that matches the filesystem when mounted: in use, clean, and so on.
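To give a flavor of what that 'known format' might carry, here is a guess at a TRAM partition header: a magic number, the id of the filesystem it belongs to, and a state flag that tracks the mount. All of the names and fields are invented for illustration.

    /* A guess at a TRAM partition header. No such on-card format exists;
     * the names and fields are invented for illustration. */
    #include <stdint.h>

    #define TRAM_MAGIC        0x5452414dU  /* "TRAM"                         */
    #define TRAM_STATE_CLEAN  0            /* everything flushed to disk     */
    #define TRAM_STATE_INUSE  1            /* holds unflushed operations     */

    struct tram_header {
        uint32_t magic;                    /* identifies a TRAM partition    */
        uint32_t version;                  /* format revision                */
        uint8_t  fs_uuid[16];              /* must match the filesystem's id */
        uint32_t state;                    /* CLEAN or INUSE                 */
        uint64_t log_offset;               /* where the operation log starts */
        uint64_t log_length;               /* size of the log area           */
    };

At mount time the kernel would presumably compare fs_uuid against the filesystem it is mounting and refuse to pair a TRAM partition with the wrong disk.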

HISTORY

Was this ever done before? Most definitely, yes. That is why I call this the forgotten technology. Some of the early computers used core memory, which was by its nature persistent. I suspect there might have been clever uses of this property for persistent storage back in the 1960's.

The first time I saw mention of it was in documentation on the elusive Prestoserve. The Prestoserve was a card with battery-backed static RAM for Sun workstations, usually 2 or 4 megs. I also saw DEC using it in their workstations. It was designed primarily to deal with the problem of a Sun workstation acting as an NFS server. As you may already know, NFS requires that the nfs_write() call not return until the data is committed to disk, which meant NFS servers called fsync() a lot. Getting decent performance out of this forced unix vendors to come up with TRAM in the form of the Prestoserve product. I don't recall who the original manufacturer was, but eventually Legato was associated with the technology.

Today, Network Appliance, Sun, and a few others hide TRAM technology inside their core storage products. Network Appliance uses TRAM in a different way, though: they hold the NFS RPC requests in memory, not disk blocks. Their WAFL filesystem is journaled and can come up quickly on its own, so it doesn't need the TRAM for recovery; they use it to respond to requests quickly while still giving the client the semantics that the operation is 'safe', or committed to disk. To handle the journaled filesystem's IO, they use RAID striping. Sun uses TRAM in their StorEdge product line, where they have patches to Solaris to recognize the TRAM in their devices. They also still market a product called the Sun Fast Write Cache; it is exactly what TRAM is today. They recommend it for their StorEdge products, which already have NVRAM on their hardware RAID systems. This validates the belief that putting TRAM on the fastest possible memory is the best way to use it.

Why isn't it common today? I don't know, but that won't stop me from speculating. I believe the high-end vendors that use it have no intention of promoting the technology to lower-end, lower-margin markets. I also believe the PC industry and the relatively young free unix community don't have enough exposure to fault-tolerant technologies. Also, in the PC industry, if a larger player doesn't get behind a technology, it is an uphill battle to see it happen. Someone who decides 'what users really need' crossed fault tolerance off that list a long time ago. :-) In my humble opinion, it is probably a combination of all of the above.

FUTURE

Although these devices are not common in computers yet, I would still like to speculate on the future. What advances could happen? What exotic uses could result?

TRAM could be used for building routers out of PCs with the same uptime as a Cisco router (well, almost :-). You might just remove the disk and use the TRAM card for storage. One reason network devices don't fail as often is that they don't have hard drives; if you had to wait for a Cisco to fsck, I don't think Cisco would be in the router business for long. It might be neat to start building your own unix appliances for whatever applications people think of.

TRAM could be used for exotic in-memory databases. For example, you could store Network Address Translation state in there and synchronize it with a standby in-memory system. You could store ARP entries, routes, or a special DNS-like directory that wouldn't have to hit the central servers after a reboot. That might make rebooting a lot of computers at once a little easier. Some applications might skip the disk-like semantics entirely, treat it like normal memory, and store data structures in their native form.

How about this: if TRAM becomes popular enough, computer manufacturers may make it a standard item. If Apple or Dell calculated that it would cut the number of support calls by x% and make customers z% happier, then why not? Wouldn't an iMac be even more of an appliance if you could pull the plug on it at any time and not lose sleep? If Steve Jobs knew it would cut boot time, I'd bet he would push for it himself. How many servers would you 'not' install with this? What if the cost dropped to near the cost of plain RAM today?

Well, imagine this. Today, you can order a 512 meg DIMM. In 3 years the technology could double twice, and we may see a 2 gig DIMM. If that were the case, we might begin to see computers that don't store their data on disk at all. If that memory were the main memory of a computer, it might make sense to battery-back the whole memory system and treat it as a persistent store. That would raise some interesting questions in OS design and would be fertile ground for OS research. Intel would then become a TRAM manufacturer if it made its entire memory system persistent. Wait, isn't that what a laptop already does? :-)

SUMMARY and Final Messages

Overall, TRAM is an inexpensive technology that would dramatically reduce the time taken by recovery procedures after an OS panic or power failure. It also has the benefit of safely caching writes in high-speed memory, which increases write performance.

In this article we also covered the very basics of unix persistence and the possibilities of TRAM. There are still plenty of details and research topics here, but nothing that prevents the basic message of this article from becoming a reality.

I look forward to the day that TRAM is a standard item on computers.



Author maintains all copyrights on this article.