telechargementz: Word count

Hello. Long time, no see.

Well, I had no script ideas and I've been busy. Anyway, here's a short script for people enjoying the challenge of NaNoWriMo.

There are, of course, a couple of ways to count words. If you're using some sort of office suite, it's probably built in, so no problem. If you're using LaTeX, like me (because I have a LaTeX fetish), you might have it in the tool you're using too, but it's less likely.

But you want to count words anyway, so what do you do?

Well, first of all, use the Linux wc command: wc -w counts the whitespace-separated words in its input, and it does that job without any fuss. To get rid of the LaTeX code in the file, you can use the untex tool, which has a ton of options for fine-tuning how TeX markup gets stripped from the text. You just read the .tex files and save the output somewhere...
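One detail about wc is worth seeing before it goes into a script: given a file argument, it prints the file name next to the count, while reading from standard input gives the bare number, which is easier to embed in other output. Here is a minimal sketch; the sample text and the temporary file are made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf 'one two three\n' > "$sample"

with_name=$(wc -w "$sample")     # count followed by the file name
count_only=$(wc -w < "$sample")  # just the count, read from stdin

echo "with file argument: $with_name"
echo "from stdin:         $count_only"
rm -f "$sample"
```

Feeding the file through standard input is the form that drops straight into an echo line.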

So, most of what I did was to put it all together, like so:

#!/bin/bash
output=raw.txt
rm -f "$output"
for file in $(ls chapter-*.tex)
do
    untex -e "$file" >> "$output"
done
echo -e "Word count: \n wc\t$(wc -w < "$output") \n awk\t$(./count.awk "$output")"

I set it up so that it only reads files whose names start with 'chapter-' and end with '.tex', because that is just the structure I use. Switching to any other convention is easy, though: change the pattern passed to the ls command on line 4.
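If you switch conventions often, the prefix could be turned into a parameter instead of being edited by hand. The following is only a sketch of that idea, not part of the original script: the function name list_tex_files and its prefix argument are hypothetical, and it uses a plain shell glob rather than ls.

```shell
#!/bin/bash
# list_tex_files [PREFIX]: print the matching .tex files, one per line.
# Hypothetical helper; PREFIX defaults to the 'chapter-' convention.
list_tex_files() {
    local prefix=${1:-chapter-}
    local f
    for f in "$prefix"*.tex; do
        # If nothing matches, the glob stays literal, so skip non-files.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_tex_files part- in a directory of part-*.tex files would list those instead, and the output can be fed to the same untex loop.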

Additionally, it produces a raw.txt file as a side effect. It contains the actual text whose words were counted, so if you want to verify untex or either of the counting mechanisms, you can do that easily.

Also, if you look closely, you will see that there's something extra on line 8: I call an AWK script to provide a second word count. Here's what the script looks like inside:

#!/usr/bin/awk -f
{
    for (i = 1; i <= NF; i++) {
        word = $i;
        # Insert punctuation here, between the square brackets.
        # Keep the hyphen first so it is taken literally, not as a range.
        n = split(word, a, /[-,.?!~`';:"|\/@#$%^&*_+={}\[\]<>()]+/);
        for (j = 1; j <= n; j++) {
            # Count only the non-empty pieces left after splitting.
            if (a[j] !~ /^[ \t\n]*$/) {
                words++;
            }
        }
    }
}

BEGIN {
    words = 0;
}

END {
    print words;
}

What it actually does is count the words, but unlike the wc command, it tries to recognize punctuation and splits words on it as well, so that the parts of a hyphenated word are counted separately. It also catches things like en dashes (LaTeX: '--'), which consist of nothing but punctuation, so they are no longer counted as words.

I don't know which one is more accurate, but between the two, I always have an optimistic and a pessimistic estimate of how many words I wrote.
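To see how the two counts can drift apart, here is a small self-contained comparison. It is a sketch under a few assumptions: the sample sentence is made up, count_split is a hypothetical inline stand-in for count.awk, and it splits on a reduced punctuation set to keep the shell quoting simple.

```shell
#!/bin/bash
# count_split: count words after also splitting on some punctuation.
# Hypothetical stand-in for count.awk with a reduced separator set.
count_split() {
    awk '{
        for (i = 1; i <= NF; i++) {
            n = split($i, parts, /[-,.?!;:()]+/)
            for (j = 1; j <= n; j++)
                if (parts[j] != "") words++
        }
    }
    END { print words + 0 }'
}

# Made-up sample with one hyphenated word and two standalone dashes.
sample='A well-known trick -- or so I hear -- saves time.'
wc_count=$(printf '%s\n' "$sample" | wc -w)
awk_count=$(printf '%s\n' "$sample" | count_split)
echo "wc:  $wc_count"
echo "awk: $awk_count"
```

For this sentence wc -w should report 11 and the punctuation-aware count 10: the hyphenated word adds one to the second count, but the two standalone dashes, which wc treats as words, drop out of it. So neither tool is always the higher one; it depends on how many hyphens and dashes you write.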

The code is also available at GitHub as awk/count.awk and bash/word_count.
