
Word Frequency

Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity's sake, you may assume:

  • words.txt contains only lowercase characters and space ' ' characters.
  • Each word must consist of lowercase characters only.
  • Words are separated by one or more whitespace characters.

Example:

Assume that words.txt has the following content:

the day is sunny the the
the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1

Note:

  • Don't worry about handling ties; it is guaranteed that each word's frequency count is unique.
  • Could you write it in one line using Unix pipes?

Solution Explanation for Word Frequency Problem

This problem requires counting the frequency of each word in a text file and printing the result sorted by descending frequency. The provided solution chains the standard Unix tools tr, sort, uniq, and awk into a single pipeline; the assembled one-liner appears after the breakdown. Let's break down the solution step by step:

1. cat words.txt: This command reads the content of the words.txt file.

2. tr -s ' ' '\n': This command uses tr (translate characters) to replace each space with a newline character (\n). The -s (squeeze) option collapses runs of repeated newlines in the output into a single one, so words separated by multiple spaces still end up exactly one per line.

3. sort: This command sorts the words alphabetically. This step is crucial for the next command, uniq.

4. uniq -c: This command counts the occurrences of each unique line (word). The -c option prefixes each line with the number of times it occurred, so the output looks like count word. Because uniq only collapses adjacent duplicate lines, the preceding sort is required for the counts to be correct.

5. sort -nr: This command sorts the lines numerically (-n) in reverse order (-r). This sorts the lines based on their frequency (the count) in descending order.

6. awk '{print $2, $1}': This is the final command that uses awk to rearrange the output. $1 refers to the first field (the count) and $2 refers to the second field (the word). The print $2, $1 statement swaps the order, displaying the word followed by its frequency.

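Putting the six steps together, the complete pipeline is:

    cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{print $2, $1}'

Saved as a script (with a #!/bin/bash first line), this single line also answers the one-liner question from the note above.
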
Example Walkthrough:

Let's consider the sample words.txt file:

the day is sunny the the
the sunny is is
  1. cat words.txt: Outputs the content of the file.
  2. tr -s ' ' '\n': Transforms the input into:
    the
    day
    is
    sunny
    the
    the
    the
    sunny
    is
    is
    
  3. sort: Sorts the words alphabetically:
    day
    is
    is
    is
    sunny
    sunny
    the
    the
    the
    the
    
  4. uniq -c: Counts the occurrences of each unique word:
    1 day
    3 is
    2 sunny
    4 the
    
  5. sort -nr: Sorts by count in descending order:
    4 the
    3 is
    2 sunny
    1 day
    
  6. awk '{print $2, $1}': Swaps the word and count:
    the 4
    is 3
    sunny 2
    day 1
    

This is the desired output.
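
To reproduce the walkthrough locally, you can recreate the sample words.txt (the file name comes from the problem statement; the printf line is simply a convenient way to build the test input) and run the pipeline:

    printf 'the day is sunny the the\nthe sunny is is\n' > words.txt
    cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{print $2, $1}'
    # prints:
    # the 4
    # is 3
    # sunny 2
    # day 1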

Time Complexity Analysis:

The time complexity of this solution is dominated by the sorting steps. sort has an average-case time complexity of O(N log N), where N is the number of words. All other operations (reading the file, tr, uniq, awk) have linear time complexity, O(N). Therefore, the overall time complexity of the solution is O(N log N).

Space Complexity Analysis:

The space complexity is determined by the intermediate data held during the pipeline. The frequency table produced by uniq -c only needs space proportional to the number of unique words, O(M), where M <= N. The first sort, however, must process all N words (holding them in memory or spilling to temporary files for very large inputs), so the pipeline's worst-case space usage is O(N). In practice, the number of unique words is typically much smaller than the total number of words, so everything after uniq -c is comparatively cheap.
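
For comparison, here is a rough alternative sketch (not the pipeline described above) that counts words in a single awk pass using an associative array, so only the M unique words reach the final sort:

    # count each word in one pass, then sort the unique words by count (descending)
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w, count[w] }' words.txt | sort -k2,2nr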