146. Tokenizing files

The content of a file is not always in a form that can be processed immediately; often, additional steps are needed to prepare it for processing. Typically, we need to tokenize the file and collect the extracted information into different data structures (arrays, lists, maps, and so on).

For example, let's consider a file, clothes.txt:

Path path = Paths.get("clothes.txt");

Its content is as follows:

Top|white;10/XXL&Swimsuit|black;5/L
Coat|red;11/M&Golden Jacket|yellow;12/XL
Denim|Blue;22/M

This file contains some clothing articles and their details separated by the & character. A single article is represented as follows:

article name | color ; no. available items / size

Here, we have several delimiters (&, |, ;, /) and a very specific format.

Now, let's take a look at several solutions for extracting and tokenizing the information from this file as a List. We'll collect this information in a utility class, FileTokenizer.

One solution for fetching the articles in a List relies on the String.split() method. Basically, we have to read the file line by line and apply String.split() to each line. The result of tokenizing each line is collected in a List via the List.addAll() method:

public static List<String> get(Path path, 
Charset cs, String delimiter) throws IOException {

String delimiterStr = Pattern.quote(delimiter);
List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
String[] values = line.split(delimiterStr);
content.addAll(Arrays.asList(values));
}
}

return content;
}

Calling this method with the & delimiter will produce the following output:

[Top|white;10/XXL, Swimsuit|black;5/L, Coat|red;11/M, Golden Jacket|yellow;12/XL, Denim|Blue;22/M]
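To see this solution end to end, here is a minimal, self-contained sketch. The class name and the temporary file are illustrative only; the tokenizing logic is the same `String.split()`-based method shown above, with the charset fixed to UTF-8 for brevity:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitTokenizerDemo {

    // Same logic as the String.split()-based solution above
    public static List<String> get(Path path, String delimiter)
            throws IOException {
        String delimiterStr = Pattern.quote(delimiter);
        List<String> content = new ArrayList<>();
        try (BufferedReader br
                = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                content.addAll(Arrays.asList(line.split(delimiterStr)));
            }
        }
        return content;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the sample clothes.txt content in a temporary file
        Path path = Files.createTempFile("clothes", ".txt");
        Files.write(path, List.of(
            "Top|white;10/XXL&Swimsuit|black;5/L",
            "Coat|red;11/M&Golden Jacket|yellow;12/XL",
            "Denim|Blue;22/M"));

        System.out.println(get(path, "&"));

        Files.deleteIfExists(path);
    }
}
```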

Another flavor of the preceding solution can rely on Collectors.toList() instead of Arrays.asList():

public static List<String> get(Path path, 
Charset cs, String delimiter) throws IOException {

String delimiterStr = Pattern.quote(delimiter);
List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
content.addAll(Stream.of(line.split(delimiterStr))
.collect(Collectors.toList()));
}
}

return content;
}

Alternatively, we can process the content in a lazy manner via Files.lines():

public static List<String> get(Path path, 
Charset cs, String delimiter) throws IOException {

try (Stream<String> lines = Files.lines(path, cs)) {

return lines.map(l -> l.split(Pattern.quote(delimiter)))
.flatMap(Arrays::stream)
.collect(Collectors.toList());
}
}

For relatively small files, we can load them entirely into memory and process them accordingly:

Files.readAllLines(path, cs).stream()
.map(l -> l.split(Pattern.quote(delimiter)))
.flatMap(Arrays::stream)
.collect(Collectors.toList());

Another solution can rely on the Pattern.splitAsStream() method introduced in JDK 8. This method creates a stream from the given input sequence. For the sake of variation, this time, let's collect the resulting tokens of each line via Collectors.joining(";"):

public static List<String> get(Path path, 
Charset cs, String delimiter) throws IOException {

Pattern pattern = Pattern.compile(Pattern.quote(delimiter));
List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
content.add(pattern.splitAsStream(line)
.collect(Collectors.joining(";")));
}
}
return content;
}

Let's call this method with the & delimiter:

List<String> tokens = FileTokenizer.get(
path, StandardCharsets.UTF_8, "&");

The result is as follows:

[Top|white;10/XXL;Swimsuit|black;5/L, Coat|red;11/M;Golden Jacket|yellow;12/XL, Denim|Blue;22/M]

So far, the presented solutions have obtained a list of articles by applying a single delimiter. But sometimes, we need to apply multiple delimiters. For example, let's assume that we want to obtain the following output (list):

[Top, white, 10, XXL, Swimsuit, black, 5, L, Coat, red, 11, M, Golden Jacket, yellow, 12, XL, Denim, Blue, 22, M]

In order to obtain this list, we have to apply several delimiters (&, |, ;, and /). This can be accomplished by using String.split() and passing it a regular expression based on the logical OR operator (x|y):

public static List<String> getWithMultipleDelimiters(
Path path, Charset cs, String...delimiters) throws IOException {

String[] escapedDelimiters = new String[delimiters.length];
Arrays.setAll(escapedDelimiters, t -> Pattern.quote(delimiters[t]));
String delimiterStr = String.join("|", escapedDelimiters);

List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
String[] values = line.split(delimiterStr);
content.addAll(Arrays.asList(values));
}
}

return content;
}

Let's call this method with our delimiters (&, |, ;, and /) to obtain the required result:

List<String> tokens = FileTokenizer.getWithMultipleDelimiters(
path, StandardCharsets.UTF_8,
new String[] {"&", "|", ";", "/"});
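It may help to see what the joined regular expression actually looks like. The following minimal sketch (the class name and sample article string are illustrative) builds the pattern exactly as getWithMultipleDelimiters() does and applies it to a single article:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MultiDelimiterRegexDemo {

    public static void main(String[] args) {
        String[] delimiters = {"&", "|", ";", "/"};

        // Quote each delimiter so that regex metacharacters
        // such as | are treated literally
        String[] escaped = new String[delimiters.length];
        Arrays.setAll(escaped, t -> Pattern.quote(delimiters[t]));

        // Join them with the regex OR operator, producing
        // \Q&\E|\Q|\E|\Q;\E|\Q/\E
        String delimiterStr = String.join("|", escaped);
        System.out.println(delimiterStr);

        // Splitting one article by this pattern yields its fields
        System.out.println(Arrays.toString(
            "Golden Jacket|yellow;12/XL".split(delimiterStr)));
    }
}
```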

OK, so far, so good! All of these solutions are based on String.split() and Pattern.splitAsStream(). Another set of solutions can rely on the StringTokenizer class (it doesn't excel at performance, so use it with care). This class can apply one or more delimiters to the given string and exposes two main methods for controlling the iteration, that is, hasMoreElements() and nextToken():

public static List<String> get(Path path,
Charset cs, String delimiter) throws IOException {

StringTokenizer st;
List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, delimiter);
while (st.hasMoreElements()) {
content.add(st.nextToken());
}
}
}

return content;
}

It can be used in conjunction with Collectors as well:

public static List<String> get(Path path, 
Charset cs, String delimiter) throws IOException {

List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
content.addAll(Collections.list(
new StringTokenizer(line, delimiter)).stream()
.map(t -> (String) t)
.collect(Collectors.toList()));
}
}

return content;
}

Multiple delimiters can be used if we join them into a single delimiter string (here, separated by //):

public static List<String> getWithMultipleDelimiters(
Path path, Charset cs, String...delimiters) throws IOException {

String delimiterStr = String.join("//", delimiters);
StringTokenizer st;
List<String> content = new ArrayList<>();

try (BufferedReader br = Files.newBufferedReader(path, cs)) {

String line;
while ((line = br.readLine()) != null) {
st = new StringTokenizer(line, delimiterStr);
while (st.hasMoreElements()) {
content.add(st.nextToken());
}
}
}

return content;
}
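Worth emphasizing: StringTokenizer treats each character of its delimiter string as a separate delimiter, so simple concatenation of single-character delimiters is enough (joining with // is harmless here only because / is itself one of our delimiters). The following minimal sketch (class name and sample string are illustrative) shows this behavior:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class StringTokenizerDelimitersDemo {

    public static void main(String[] args) {
        // StringTokenizer interprets EACH CHARACTER of the delimiter
        // string as a separate delimiter, so concatenation suffices
        String delimiterStr = String.join("", "&", "|", ";", "/");

        StringTokenizer st = new StringTokenizer(
            "Top|white;10/XXL&Swimsuit|black;5/L", delimiterStr);

        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }

        System.out.println(tokens);
    }
}
```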

For better performance and regular expression support (that is, greater flexibility), it is advisable to rely on String.split() instead of StringTokenizer. From the same category, consider the Working with Scanner section as well.
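As a closing sketch, a Scanner-based variant could look roughly like the following. This is an assumption-based illustration, not code from the Working with Scanner section: Scanner accepts a regex delimiter, so the same OR-based pattern used with String.split() applies, here written out literally for a single line of input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokenizerDemo {

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();

        // The delimiter is the same OR-based regex built earlier:
        // each delimiter is \Q...\E-quoted and joined with |
        try (Scanner scanner
                = new Scanner("Top|white;10/XXL&Swimsuit|black;5/L")
                    .useDelimiter("\\Q&\\E|\\Q|\\E|\\Q;\\E|\\Q/\\E")) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }

        System.out.println(tokens);
    }
}
```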