Example: printing and forking

A useful way into this topic is the following question: how many times is “hello” printed?

#include <unistd.h>

int main(void) {
    if (fork() == 0) {
        /* write() is a system call: the bytes leave the process immediately */
        write(STDOUT_FILENO, "hello", 5);
    }
    if (fork() == 0) {
        write(STDOUT_FILENO, "hello", 5);
    }
    return 0;
}

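The answer is 3: the first child prints once, then both the parent and that child fork again, and each of the two new children prints once. However, if we were to use printf() from the C standard library instead: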

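#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (fork() == 0) {
        /* printf() only places the bytes in the stdout buffer for now */
        printf("hello");
    }
    if (fork() == 0) {
        printf("hello");
    }
    return 0;
}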

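The answer is now 4. When we fork, the child gets a copy of the parent’s address space, including the stdout buffer that printf() writes into and whatever that buffer currently holds. The first printf() only placed “hello” in the buffer, so after the second fork that unflushed “hello” exists in both the first child and its own child, and each of them eventually writes it out.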

In other words, data that is buffered but not flushed before the fork may be duplicated, which will also lead to duplicate output.

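Here, the buffer is flushed when the process exits normally, by returning from main() or calling exit(); this is part of the C standard library’s cleanup procedures. A process that terminates via _exit() skips this flush.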

Line buffering

It turns out that stdout is line buffered when it is connected to a terminal, meaning that the buffer is also flushed whenever a newline character ('\n') is written. Therefore, when run in a terminal,

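#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (fork() == 0) {
        /* the trailing '\n' triggers a flush on a line-buffered stream */
        printf("hello\n");
    }
    if (fork() == 0) {
        printf("hello\n");
    }
    return 0;
}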

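would also print “hello” only 3 times! If stdout is redirected to a file or a pipe, however, it is typically fully buffered instead, and the count can go back to 4.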

C <stdio.h> versus POSIX

Buffered writing and reading in the C standard library

By default, the <stdio.h> library uses buffering on top of the POSIX API. This behavior can be controlled with functions like fflush(3) and setbuf(3).

A buffer is allocated by the standard library inside the process’s address space. Once “enough” data has been written, i.e. the buffer is full, the buffer is flushed to the OS.

  • This avoids invoking the POSIX write() call too often; write() is what actually performs the flush
  • This in turn avoids going through the kernel to the filesystem too often, a relatively expensive operation
  • Writing to the buffer is faster because we’re only writing to memory

Reading works similarly: when we call fread(), the standard library may read more data than we requested and keep it in a buffer that it allocated.

  • The next time we read, the data is retrieved from the buffer in memory instead of the file on disk
  • Once again, we’re trying to avoid expensive FS operations
  • Once we have consumed the entire buffer, the next read fetches from the file again

POSIX directly writes to the filesystem, but not immediately to disk

The C standard library places a buffer between its I/O functions (not just the printing ones) and the POSIX write()/read() functions. Similarly, the operating system behind the POSIX API places an intermediary between its filesystem functions and the physical disk.

Filesystems are also abstractions on top of the physical disk, so the OS can maintain things like a block cache that write() writes to instead. These caches again exist in memory, not on disk. Recall the memory hierarchy: we want to avoid interacting with the lower (slower) levels of the hierarchy as much as possible.

POSIX includes functions to flush data to disk, like fsync() and fdatasync().

Cons of buffering

Buffered writes may suffer from reliability concerns because a call to fwrite() doesn’t necessarily write to the file right away.

  • If the computer suddenly loses power, the buffer in memory will also disappear—this results in a loss of data.
  • If you want durability (e.g. for logging), don’t buffer, or flush explicitly after each record.
  • Also, you might expect a successful fwrite() to mean the data is in the file, when in reality it may still be sitting in the buffer.

Buffering also has a cost of its own. We’re now copying data into a middleman buffer, an additional step that takes CPU cycles and memory bandwidth.

  • High-performance applications don’t like this (they prefer zero copy)
  • Buffering is faster when we’re performing many small writes, but the benefit shrinks as the writes get larger and as we write to fewer files
    • A single large write should simply skip the buffer, since copying the data there first no longer serves any purpose