Thursday, September 4, 2008

The future of low level file access

I've been working on the Netflix Prize lately. It's quite an interesting programming challenge, spanning machine learning, optimization, parallelization of algorithms, databases, and several other interesting areas of computer science. I've learned a lot by participating. One of the most recent things I implemented for the first time is access to memory mapped files in Linux. In school, I once wrote a really dumb application for a class that accessed data in a memory mapped file using the (really overly complicated) Win32 library on Windows. Doing the same operation on a POSIX system is a bit simpler, but it really got me thinking about file access.

Somehow, I got to the Wikipedia page for fopen (really "C file input/output"). I'm not sure what pointed me there; I do know how to use fopen(), and how to use it well. Besides, Wikipedia isn't really the best programming reference out there (search for "man page" on Google if you're really that lost).

Anyway, Wikipedia pointed out that C and similar languages don't "have direct support for random-access data files." I see what they mean, but I don't think the phrase is worded very clearly. Maybe the C standard "doesn't include a single command for accessing data in a randomly ordered fashion" or something, but in most environments there is support for random access to "data files." One of these methods is memory mapped file access, in which each byte in a file is given a corresponding address in the memory space of the application, allowing the application to access data in the file in any order it chooses (including "random access"). This is a great help sometimes: it somewhat simplifies jumping around the file, and it lets you avoid the C library's file I/O functions.
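To make that concrete, here is a minimal sketch of memory mapped access on a POSIX system. The helper names map_file_readonly() and unmap_file() are made up for this post; the calls they wrap (open, fstat, mmap, munmap) are the standard POSIX ones. It assumes the whole file fits in the address space and that a read-only mapping is enough.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map an entire file read-only.
 * On success, returns a pointer to the first byte and stores the
 * file length in *len_out. Returns NULL on any failure. */
const unsigned char *map_file_readonly(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);              /* mmap rejects zero-length mappings */
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Hypothetical helper: release a mapping made by map_file_readonly(). */
void unmap_file(const unsigned char *data, size_t len)
{
    munmap((void *)data, len);
}
```

Once the file is mapped, "random access" is just array indexing: data[0], data[len - 1], or any offset in between, in whatever order you like, with no seek calls in sight.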

The C stdio library isn't terrible: it allows you to do pretty much whatever you want to do efficiently, but it's not the easiest thing to use. Other languages like PHP have greatly extended file I/O functionality by including more convenient functions such as file_get_contents() and file_put_contents(), which are great for simple file operations. They let the programmer focus more on what they are programming, and less on things like checking the file handle that fopen returns, or remembering to close the file handle when done. I'm sure they are a bit less efficient than directly using the low level fopen/fread/fwrite/fclose, but they sure are useful. If we wanted maximum speed, we could just use the normal fopen/fread/fwrite/fclose, since PHP is kind enough to include these as well. This is a good example of how PHP is a better language in specific areas: it has convenient functions to allow for rapid development, but is flexible enough to offer multiple ways to do file I/O. It also tries to follow conventions set by other languages, making it more usable for programmers familiar with those languages.
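For comparison, here is roughly what a file_get_contents() lookalike takes in plain C. This is only a sketch: read_whole_file() is a name I made up, not part of any standard library, and it assumes the caller is happy to free() the returned buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical C analogue of PHP's file_get_contents().
 * Reads the whole file into a malloc'd, NUL-terminated buffer.
 * Stores the byte count in *len_out if it is non-NULL.
 * Returns NULL on failure; the caller frees the result. */
char *read_whole_file(const char *path, size_t *len_out)
{
    assert(path != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap + 1);        /* +1 leaves room for the NUL */
    if (buf == NULL) {
        fclose(f);
        return NULL;
    }

    size_t n;
    while ((n = fread(buf + len, 1, cap - len, f)) > 0) {
        len += n;
        if (len == cap) {               /* buffer full: double it */
            cap *= 2;
            char *tmp = realloc(buf, cap + 1);
            if (tmp == NULL) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
        }
    }

    int failed = ferror(f);             /* distinguish error from EOF */
    fclose(f);
    if (failed) {
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

All the handle checking and closing that PHP hides is still there; it has just been pushed into one function so the rest of the program doesn't have to think about it.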

Somehow, I've managed to avoid the topic of this article. Access to files using the C standard library functions is indeed low level: it reflects operations at the hardware level for the hardware in use when C was designed, probably tape drives and stuff. I picture an old VHS VCR: you could watch data coming out of your TV, but only going in a forward direction, and you had to 'seek' (fast-forward, rewind) to get to other bits of data that were 'ahead' of or 'behind' your location in the file. It kind of makes sense. You had to get the thing playing (open the file, fopen) to get data out, and stop it when you were done (fclose). You used FF/RWD commands (fseek) to get to where you wanted to read or write data on the medium.
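The VCR routine looks something like this in code. It's a minimal sketch: read_byte_at() is a hypothetical helper of mine, built only on the standard fopen/fseek/fgetc/fclose calls.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: fetch the single byte at `offset` in a file.
 * Returns the byte value (0..255), or -1 if the file cannot be
 * opened, the seek fails, or the offset is past the end. */
int read_byte_at(const char *path, long offset)
{
    assert(path != NULL);

    FILE *f = fopen(path, "rb");            /* press play */
    if (f == NULL)
        return -1;

    int result = -1;
    if (fseek(f, offset, SEEK_SET) == 0) {  /* fast-forward or rewind */
        int c = fgetc(f);                   /* watch one byte go by */
        if (c != EOF)
            result = c;
    }

    fclose(f);                              /* press stop */
    return result;
}
```

That's four library calls and an error check to fetch one byte, and every read at a new position needs another fseek first.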

Devices today are a bit different. It's looking like the prevailing storage technology in the future (well, for a while anyway) will be the SSD. SSDs are pretty simple in operation: you feed them an address, and they fetch the data. You don't really search through them in a linear fashion. The address specifies where your data should be extracted from, and it's as easy (well, almost; ignore banks, block selects, etc. for now) to grab data from a block at a random point on the other side of the disk as it is to grab the block at the next address. If we're going to be accessing files on SSDs, let's just have direct access to data within a file via a pointer. It definitely isn't any less efficient, since the SSD uses an address (pointer) to get the data anyway.

Previously, this approach wouldn't have been usable for some types of files: most computers used an address space too small to access multi-gigabyte files in this manner. However, most computers now have 64-bit capable hardware, and a 64-bit address space is big enough, in principle, to address even the largest petabyte-sized files (thousands of petabytes, even!). Doesn't it make sense to move to a memory mapped file access model? Many languages have used memory addresses (or pointers, etc.) to access data in short term memory (RAM) since the inception of the computer, so many of the mechanisms are already in place to allow this to happen. Most operating systems support memory mapped access to files already; the only thing lacking is support from the C standards committee. They don't like to change the standard much, and it only happens every decade or so. Let's hope they add this in next time.
