Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity's sake, you may assume:

words.txt contains only lowercase characters and space ' ' characters.

Example:

Assume that words.txt has the following content:

the day is sunny the the the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1
Note:
This problem requires counting the frequency of each word in a text file and outputting the result sorted by descending frequency. The solution uses a Unix pipeline that ends with awk for efficient processing. Let's break down the pipeline step by step:
1. cat words.txt: reads the content of the words.txt file.
2. tr -s ' ' '\n': uses tr (translate characters) to replace spaces with newline characters ('\n'); the -s option (squeeze repeats) collapses runs of spaces into a single newline. This puts each word on its own line.
3. sort: sorts the words alphabetically. This step is crucial for the next command, uniq, which only collapses adjacent duplicate lines.
4. uniq -c: counts the occurrences of each unique line (word). The -c option prefixes each line with its count, so the output looks like "count word".
5. sort -nr: sorts the lines numerically (-n) in reverse order (-r), i.e., by frequency (the count) in descending order.
6. awk '{print $2, $1}': the final command, which uses awk to rearrange the output. $1 refers to the first field (the count) and $2 to the second field (the word); the statement print $2, $1 swaps the order, displaying the word followed by its frequency.
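Putting the six steps together gives the complete pipeline. The sketch below also recreates the sample words.txt from the problem statement so it can be run as-is:

```shell
#!/bin/bash
# Recreate the sample input from the problem statement.
printf 'the day is sunny the the the sunny is is\n' > words.txt

# Split words onto lines, count duplicates, sort by count, swap fields.
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{print $2, $1}'
# Output:
# the 4
# is 3
# sunny 2
# day 1
```

The leading cat is not strictly necessary: tr reads standard input, so `tr -s ' ' '\n' < words.txt | sort | ...` produces the same result with one fewer process.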
Example Walkthrough:

Consider the sample words.txt file:

the day is sunny the the the sunny is is

cat words.txt outputs the content of the file.

tr -s ' ' '\n' transforms the input into:
the
day
is
sunny
the
the
the
sunny
is
is
sort then sorts the words alphabetically:
day
is
is
is
sunny
sunny
the
the
the
the
uniq -c counts the occurrences of each unique word:
1 day
3 is
2 sunny
4 the
sort -nr sorts by count in descending order:
4 the
3 is
2 sunny
1 day
awk '{print $2, $1}' swaps the word and count:
the 4
is 3
sunny 2
day 1
This is the desired output.
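One edge case worth noting (the original problem's guarantees make it moot, but real files are messier): a leading or trailing space in words.txt makes tr emit an empty line, which uniq -c would then count as a "word". A grep filter, an addition not part of the original pipeline, guards against this:

```shell
# A leading space in the file would otherwise become an empty first line.
printf ' the day is sunny the the the sunny is is\n' > words.txt

# grep -v '^$' drops empty lines before counting.
cat words.txt | tr -s ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -nr | awk '{print $2, $1}'
# Output is unchanged: the 4, is 3, sunny 2, day 1
```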
Time Complexity Analysis:

The time complexity of this solution is dominated by the sorting steps. sort has an average-case time complexity of O(N log N), where N is the number of words. All other stages (reading the file, tr, uniq, awk) run in linear time, O(N). Therefore, the overall time complexity of the solution is O(N log N).
Space Complexity Analysis:
The space complexity is determined by the intermediate data flowing through the pipeline. The first sort must buffer all N words, so the peak space usage is O(N) (implementations such as GNU sort spill to temporary files for very large inputs). After uniq -c, only the unique words remain, so the later stages use O(M) space, where M is the number of unique words (M <= N); in practice, M is typically much smaller than N.
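For comparison, and as a sketch rather than part of the original solution, the counting can also be done in a single awk pass over the input using an associative array, which holds only the M unique words in memory; only those M entries are then sorted (sort -k2,2nr orders by the numeric count in field 2, descending):

```shell
printf 'the day is sunny the the the sunny is is\n' > words.txt

# One pass over the input builds a count table keyed by word; the END
# block prints "word count" pairs, which are then sorted by count.
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print w, count[w] }' words.txt | sort -k2,2nr
# Output:
# the 4
# is 3
# sunny 2
# day 1
```

Note that words with equal counts come out in an unspecified order here (awk's for-in iteration order is not defined); the sample input has distinct counts, so this does not affect the example.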