138. Reading/writing text files efficiently

In Java, reading files efficiently is a matter of choosing the right approach. For a better understanding of the following example, let's assume that our platform's default charset is UTF-8. Programmatically, the platform's default charset can be obtained via Charset.defaultCharset().
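
As a quick check of the assumption above, the following minimal sketch prints the default charset of the running JVM. The class and method names (DefaultCharsetDemo, defaultCharsetName) are hypothetical and exist only for this illustration; the result depends on the platform and JVM settings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Returns the canonical name of the charset used when none is specified
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // The text assumes UTF-8; this tells us whether that holds on this machine
        System.out.println("Is UTF-8: "
            + Charset.defaultCharset().equals(StandardCharsets.UTF_8));
    }
}
```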

First, we need to distinguish between raw binary data and text files from a Java perspective. Dealing with raw binary data is the job of two abstract classes, that is, InputStream and OutputStream. For streaming files of raw binary data, we focus on the FileInputStream and FileOutputStream classes, whose basic read()/write() methods handle one byte (8 bits) at a time. For well-known types of binary data, we also have dedicated classes (for example, an audio file should be processed via AudioInputStream instead of FileInputStream).

While these classes do a spectacular job for raw binary data, they are not good for text files because they are slow and may produce wrong output. This becomes pretty clear if we consider that streaming a text file via these classes means that each byte is read from the text file and processed individually (the same tedious flow is needed for writing a byte). Moreover, if a character is encoded with more than 1 byte, then it is possible to see some weird characters. In other words, decoding and encoding 8 bits at a time, independent of the charset (for example, Latin, Chinese, and so on), may produce unexpected output.

For example, let's suppose that we have the following Chinese poem saved in UTF-16:

Path chineseFile = Paths.get("chinese.txt");




...

The following code will not display it as expected:

try (InputStream is = new FileInputStream(chineseFile.toString())) {

    int i;
    while ((i = is.read()) != -1) {
        System.out.print((char) i);
    }
}

So, in order to fix this, we should specify the proper charset. While InputStream doesn't have support for this, we can rely on InputStreamReader (or OutputStreamWriter, respectively). This class is a bridge from raw byte streams to character streams and allows us to specify the charset:

try (InputStreamReader isr = new InputStreamReader(
        new FileInputStream(chineseFile.toFile()),
        StandardCharsets.UTF_16)) {

    int i;
    while ((i = isr.read()) != -1) {
        System.out.print((char) i);
    }
}
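
To make the difference between the two approaches visible, here is a minimal, self-contained sketch. It assumes a hypothetical two-character sample (not the poem from the text) written to a temporary file via OutputStreamWriter, and then reads it back both byte by byte and through InputStreamReader. The class name CharsetBridgeDemo and its helper methods are illustrative only.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CharsetBridgeDemo {

    // Hypothetical sample (two Chinese characters), written as Unicode escapes
    static final String SAMPLE = "\u4F60\u597D";

    // Encodes characters into UTF-16 bytes while writing
    static void writeUtf16(Path file, String text) throws IOException {
        try (Writer osw = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_16)) {
            osw.write(text);
        }
    }

    // Naive approach: every single byte is treated as one character
    static String readByteByByte(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream is = new FileInputStream(file.toFile())) {
            int i;
            while ((i = is.read()) != -1) {
                sb.append((char) i);
            }
        }
        return sb.toString();
    }

    // Bridge approach: bytes are decoded into characters using UTF-16
    static String readDecoded(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader isr = new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.UTF_16)) {
            int i;
            while ((i = isr.read()) != -1) {
                sb.append((char) i);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".txt");
        try {
            writeUtf16(file, SAMPLE);
            System.out.println("Byte by byte: "
                + readByteByByte(file).length() + " chars");
            System.out.println("Decoded: "
                + readDecoded(file).length() + " chars");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The byte-by-byte version returns more "characters" than the original text contains, because each character occupies 2 bytes and Java's UTF-16 encoder also writes a byte-order mark at the start of the file.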

Things are back on track but are still slow! Now, the application reads more than one byte at a time (depending on the charset) and decodes them into characters using the specified charset. But reading just a few bytes per call is still slow.

InputStreamReader is a bridge between raw binary data streams and character streams. But Java provides the FileReader class as well. Its goal is to spare us from building this bridge by hand when the character stream comes from a character file.

For text files, we have a dedicated class known as the FileReader class (or FileWriter, respectively). This class reads 2 or 4 bytes (depending on the charset in use) at a time. However, before JDK 11, FileReader didn't support an explicit charset. It simply used the platform's default charset. This isn't good for us because the following code will not produce the expected output:

try (FileReader fr = new FileReader(chineseFile.toFile())) {

    int i;
    while ((i = fr.read()) != -1) {
        System.out.print((char) i);
    }
}

But starting with JDK 11, the FileReader class was enriched with two more constructors that support an explicit charset:

  • FileReader(File file, Charset charset)
  • FileReader(String fileName, Charset charset)

This time, we can rewrite the preceding snippet of code and obtain the expected output:

try (FileReader frch = new FileReader(
        chineseFile.toFile(), StandardCharsets.UTF_16)) {

    int i;
    while ((i = frch.read()) != -1) {
        System.out.print((char) i);
    }
}
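
The writing side works the same way, since JDK 11 also added charset-aware constructors to FileWriter. The following sketch (requires JDK 11 or later) writes a hypothetical sample string with FileWriter and reads it back with FileReader, both using UTF-16. The class name FileWriterCharsetDemo and its methods are assumptions made for this illustration.

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileWriterCharsetDemo {

    // Writes the text using an explicit charset (JDK 11+ constructor)
    static void write(Path file, String text) throws IOException {
        try (FileWriter fw = new FileWriter(
                file.toFile(), StandardCharsets.UTF_16)) {
            fw.write(text);
        }
    }

    // Reads the whole file, one char at a time, with the same charset
    static String read(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (FileReader fr = new FileReader(
                file.toFile(), StandardCharsets.UTF_16)) {
            int i;
            while ((i = fr.read()) != -1) {
                sb.append((char) i);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("writer", ".txt");
        try {
            String text = "\u4F60\u597D"; // hypothetical sample text
            write(file, text);
            System.out.println("Round trip ok: " + text.equals(read(file)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```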

Reading 2 or 4 bytes at a time is better than reading 1, but it's still slow. Moreover, notice that the preceding solutions use an int to store the retrieved char, and we need to explicitly cast it to char in order to display it. The read() method returns an int so that it can signal the end of the stream with -1; as a result, the char retrieved from the input file is converted into an int, and we convert it back into a char.

This is where buffering streams enter the scene. Think about what happens when we watch a video online. While we are watching the video, the browser is buffering the incoming bytes ahead of time. This way, we have a smooth experience because the player consumes bytes from the buffer and avoids the potential interruptions caused by consuming them directly during network transfer.

The same principle is used by classes such as BufferedInputStream and BufferedOutputStream for raw binary streams, and BufferedReader and BufferedWriter for character streams. The main idea is to buffer the data before processing. This time, FileReader supplies data to BufferedReader, which stores it in a buffer held in RAM, and readLine() returns the characters up to the end of the line (for example, \n or \r\n):

try (BufferedReader br = new BufferedReader(
        new FileReader(chineseFile.toFile(), StandardCharsets.UTF_16))) {

    String line;
    // keep buffering and print
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}

So, instead of reading 2 bytes at a time, we read a complete line, which is much faster. This is a really efficient way of reading text files.

For further optimization, we can set the size of the buffer via dedicated constructors.
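
For instance, BufferedReader has a constructor that takes the buffer size (in chars) as a second argument. The sketch below is a minimal illustration: the 64 KiB value is an arbitrary assumption, not a recommendation, and the class name BufferSizeDemo and its countLines() helper are hypothetical. Whether a larger buffer actually helps depends on the file and the storage, so it is worth measuring.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class BufferSizeDemo {

    // Arbitrary illustrative size, in chars
    static final int BUFFER_SIZE = 64 * 1024;

    // Counts the lines of a text file using a buffer of the given size
    static int countLines(Path file, Charset cs, int bufferSize)
            throws IOException {
        int lines = 0;
        try (BufferedReader br = new BufferedReader(
                new FileReader(file.toFile(), cs), bufferSize)) {
            while (br.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("buffer", ".txt");
        try {
            // Hypothetical content, just to have something to read
            Files.write(file, List.of("first", "second", "third"),
                StandardCharsets.UTF_16);
            System.out.println("Lines: "
                + countLines(file, StandardCharsets.UTF_16, BUFFER_SIZE));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```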

Notice that the BufferedReader class knows how to create and deal with the buffer in the context of the incoming data but is independent of the source of data. In our example, the source of data is FileReader, which is a file, but the same BufferedReader can buffer data from different sources (for example, network, file, console, printer, sensor, and so on). In the end, we read what we buffered.
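
To illustrate this independence from the source, the following sketch feeds a BufferedReader from an in-memory StringReader instead of a file. The readLines() helper accepts any Reader; the class name AnySourceDemo and the sample text are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class AnySourceDemo {

    // Buffers whatever Reader it receives and collects its lines
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The source here is a String in memory, not a file
        Reader source = new StringReader("first\nsecond\r\nthird");
        System.out.println(readLines(source));
    }
}
```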

The preceding examples represent the main approaches for reading text files in Java. The java.nio.file.Files class (part of NIO.2, available since JDK 7 and extended in JDK 8) provides a set of methods that make our life easier. In order to create a BufferedReader, we can rely on Files.newBufferedReader(Path path, Charset cs) as well:

try (BufferedReader br = Files.newBufferedReader(
        chineseFile, StandardCharsets.UTF_16)) {

    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}

For BufferedWriter, we have Files.newBufferedWriter(). The advantage of these methods is that they support Path directly.
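
A minimal sketch of the writing counterpart could look as follows. It writes a few hypothetical lines through Files.newBufferedWriter() and reads them back through Files.newBufferedReader(), both with UTF-16. The class name NioBufferedWriterDemo and its helper methods are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class NioBufferedWriterDemo {

    // Writes each line followed by the platform's line separator
    static void writeLines(Path file, List<String> lines)
            throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(
                file, StandardCharsets.UTF_16)) {
            for (String line : lines) {
                bw.write(line);
                bw.newLine();
            }
        }
    }

    // Reads the file back line by line with the same charset
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(
                file, StandardCharsets.UTF_16)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("nio", ".txt");
        try {
            writeLines(file, List.of("first line", "second line"));
            System.out.println(readLines(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

When no open options are passed, Files.newBufferedWriter() creates the file if it doesn't exist and truncates it otherwise, so running the sketch twice on the same path overwrites the previous content.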

For fetching a text file's content as a Stream<T>, take a look at the problem in the Streaming a file's content section.

Another valid solution that may cause eye strain is as follows:

try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(chineseFile.toFile()),
        StandardCharsets.UTF_16))) {

    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}

Now, it's time to talk about reading text files directly into memory.
