Concordancer

Concordancers are a sort of pet project for me - I'm often in the process of making one. They're simple enough to be fun, and complex enough to be interesting.

An additional perk is that nobody knows what concordancers are. If you want to know about them, you probably want to start here and dig on.

This particular instance is meant to be very simple, and leaves any pre- and post-processing to the user. For instance, the text is split into words on whitespace only, so punctuation sticks to the words and distorts the results. If you want that fixed, you have to do it before passing the text on to the concordancer.
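Since the punctuation clean-up is left to you, here is a minimal sketch of what such a pre-processing step could look like. It is not part of the script: the names `strip_punctuation` and `clean_line` are made up for illustration, and it assumes that trimming ASCII punctuation from both ends of each token is good enough for your text.

```python
import string

def strip_punctuation(token):
    """Trim ASCII punctuation from both ends of a token (hypothetical helper)."""
    return token.strip(string.punctuation)

def clean_line(line):
    """Split a line on whitespace, trim each token, and drop empty leftovers."""
    cleaned = (strip_punctuation(token) for token in line.split())
    return " ".join(token for token in cleaned if token)

# Punctuation inside a token (like the dashes below) is left alone.
print(clean_line('rest, perish either by the sword or by famine." XXXI.--They rise'))
```

To use this as a filter, you could loop over `sys.stdin`, print `clean_line(line)` for each line, and pipe the result into the concordancer.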

Ok, so here's a quick walkthrough.

Basically, the concordancer reads the text from stdin and the keyword list from the arguments. Here's an example of searching for the word Sword in the text file de-bello-gallico.txt:

cat de-bello-gallico.txt | ./concordancer.py Sword

That should output something like:

then much reduced by the *sword* of the enemy, and by
rest, perish either by the *sword* or by famine." XXXI.--They rise
rushes on briskly with his *sword* and carries on the combat
Therefore, having put to the *sword* the garrison of Noviodunum and
had advanced up the hill *sword* in hand, and had forced
labour, should put to the *sword* all the grown-up inhabitants, as
made a blow with his *sword* at his naked shoulder and
by the wound of a *sword* in the mouth; nor was

Actually, an even simpler invocation is available if you want to create concordances for all the words in the text. In that case, you needn't provide a list of keywords at all:

cat de-bello-gallico.txt | ./concordancer.py

... but I'm not sure how useful that'll be to you.

Typically, you'd probably want to find concordances for a word in all its forms. You can do that using aspell to generate the list of keywords from a given root:

cat de-bello-gallico.txt | ./concordancer.py `aspell dump master | grep sword`

And that'll produce output for all words containing the substring sword.
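The substring filtering that grep does there can also be sketched in Python, which makes it easier to see what ends up on the keyword list. This is only an illustration: `expand_keywords` and the sample word list are made up, and unlike plain grep the sketch ignores case, in line with the concordancer's own case-insensitive matching.

```python
def expand_keywords(root, dictionary_words):
    """Return every dictionary word that contains root, ignoring case."""
    root = root.lower()
    return [word for word in dictionary_words if root in word.lower()]

# A tiny stand-in for the aspell word list (invented for this example).
sample_words = ["sword", "swords", "swordsman", "broadsword", "word", "Swordfish"]
print(expand_keywords("sword", sample_words))
```

Note that a substring match also picks up compounds such as broadsword, which may or may not be what you want.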

Now, there are probably better ways to use aspell for that purpose, but honestly, I played around with it and this is the only one that got me the result I wanted...

You can play around with different formats too, by just converting them to text prior to creating the concordance. For instance, for PDFs, use pdftotext:

pdftotext de-bello-gallico.pdf - | ./concordancer.py `aspell dump master | grep sword`

Right. You can also play around with the output of the concordancer. By default it marks the keywords in concordances with asterisks, but you can change that, to e.g. some HTML tags, by going:

cat de-bello-gallico.txt | ./concordancer.py -p '<b>' -s '</b>' Sword

And that'll produce something like:

then much reduced by the <b>sword</b> of the enemy, and by
rest, perish either by the <b>sword</b> or by famine." XXXI.--They rise
...
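One thing to keep in mind: the prefix and suffix are pasted in verbatim, and the surrounding text is not escaped, so characters like & or < in the input would end up in your HTML as they are. If that matters, the marking could be done along these lines. This is a sketch under that assumption, not what the script does, and `mark_html` is a made-up name.

```python
import html

def mark_html(words, keyword, prefix="<b>", suffix="</b>"):
    """Render one concordance as HTML, escaping the text but not the tags."""
    rendered = []
    for word in words:
        safe = html.escape(word)
        # Same case-insensitive comparison the concordancer uses.
        if word.lower() == keyword.lower():
            safe = prefix + safe + suffix
        rendered.append(safe)
    return " ".join(rendered)

print(mark_html(["fire", "&", "sword", "<sic>"], "Sword"))
```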

Another thing you can do is change the size of the context, here to up to 3 words on each side:

cat de-bello-gallico.txt | ./concordancer.py -c 3 Sword

That will output:

reduced by the *sword* of the enemy,
either by the *sword* or by famine."
briskly with his *sword* and carries on
put to the *sword* the garrison of
up the hill *sword* in hand, and
put to the *sword* all the grown-up
blow with his *sword* at his naked
wound of a *sword* in the mouth;
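The "up to" matters: a keyword near the start or end of the text simply gets fewer neighbours. A minimal sketch of the windowing, mirroring the slicing in the script's `form_concordance` (the name `context_window` is only for this example):

```python
def context_window(words, index, context_size):
    """Return the word at index with up to context_size words on each side."""
    # Clamp the start so a keyword near the beginning does not wrap around.
    start = max(0, index - context_size)
    # Slicing past the end of a list is safe in Python, so no clamp is needed there.
    return words[start:index + context_size + 1]

words = "put to the sword the garrison of Noviodunum".split()
print(" ".join(context_window(words, 3, 3)))
print(" ".join(context_window(words, 0, 3)))
```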

Also, you can group the output by keywords:

cat de-bello-gallico.txt | ./concordancer.py -d group reserves declares

And that gives you something like this:

*reserves*:
neither could proper *reserves* be posted, nor
*declares*:
suddenly assaulted; he *declares* himself ready to
that council he *declares* Cingetorix, the leader
Hispania Baetica, _Carmone_; *declares* for Caesar, and
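Under the hood, the grouped display is just a map from each keyword to its list of concordances, printed with a header line per keyword and a tab in front of each concordance. A rough sketch of that layout follows; `format_grouped` is a hypothetical helper that returns the lines instead of writing them to stdout.

```python
def format_grouped(concordances, prefix="*", suffix="*"):
    """Lay out a keyword -> list-of-contexts map in the grouped style."""
    lines = []
    for keyword, contexts in concordances.items():
        # Header line: the marked keyword followed by a colon.
        lines.append(prefix + keyword + suffix + ":")
        for context in contexts:
            marked = [
                prefix + word + suffix if word.lower() == keyword.lower() else word
                for word in context
            ]
            # Each concordance is indented with a tab under its keyword.
            lines.append("\t" + " ".join(marked))
    return lines

sample = {"reserves": [["could", "proper", "reserves", "be", "posted,"]]}
print("\n".join(format_grouped(sample)))
```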

Enough rambling, here's the code:
#!/usr/bin/python
#
# Concordancer
#
# A script for finding concordances for given keywords in the
# specified text.
#
# A concordance is a keyword with its context (here, the closest
# n words), a combination used, for instance, in lexicography to
# deduce the meaning of the keyword based on the way it is used
# in text.
#
# Parameters:
#   c - the number of words that surround a keyword in context
#   p - the string that is stuck in front of keywords
#   s - the string that is stuck at the ends of keywords
#   d - formatting of the display,
#       'simple' - one concordance per line (default)
#       'group' - group concordances by keywords
#
# Example:
#   to find concordances for the words 'arguments' and 'options'
#   in the bash manual:
#       man bash | concordancer.py arguments options
#
# Author:
#   Konrad Siek <konrad.siek@gmail.com>
#
# License:
#
# Copyright 2008 Konrad Siek
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# Imports.
import getopt
import sys

# Option sigils - the characters associated with various options.
CONTEXT_SIZE = 'c'
PREFIX = 'p'
SUFFIX = 's'
DISPLAY = 'd'

# Option default values, represented as a map for convenience.
OPTIONS = {
    CONTEXT_SIZE: str(5),
    PREFIX: '*',
    SUFFIX: '*',
    DISPLAY: 'simple',
}

# Character constants, also for convenience.
EMPTY = ""
SPACE = " "
NEWLINE = "\n"
TAB = "\t"
COLON = ":"
SWITCH = "-"

def display_help(program_name):
    """Display usage information.

    @param program_name - the name of the script"""

    help_string = """Usage:
    %s [OPTION] ... [WORD] ...
Options:
    -%s    the number of words that surround a keyword in context
    -%s    the string that is stuck in front of keywords
    -%s    the string that is stuck at the ends of keywords
    -%s    formatting of the display,
        'simple' - one concordance per line (default)
        'group' - group concordances by keywords
Words:
    The list of words that concordances will be searched for. If
    no list is provided, a complete concordance is made - that is,
    one using all input words.""" \
    % (program_name, CONTEXT_SIZE, PREFIX, SUFFIX, DISPLAY)
    print(help_string)

def find_concordances(keywords, words, context_size):
    """Finds concordances for keywords in a list of input words.

    @param keywords - list of keywords,
    @param words - input text as a list of words
    @param context_size - number of words that should surround a keyword
    @return map of keywords to lists of concordances"""

    # Initialize the concordance map with empty lists, for each keyword.
    concordances = prep_concordance_map(keywords)

    # If any word in the text matches a keyword, create a concordance.
    for i in range(len(words)):
        for keyword in keywords:
            if matches(keyword, words[i]):
                concordance = form_concordance(words, i, context_size)
                concordances[keyword].append(concordance)

    return concordances

def find_all_concordances(words, context_size):
    """Make a complete concordance - assume all words match.

    @param words - input text as a list of words
    @param context_size - number of words that should surround a keyword
    @return map of keywords to lists of concordances"""

    concordances = {}

    for i in range(len(words)):
        word = words[i]
        if word not in concordances:
            concordances[word] = []
        concordance = form_concordance(words, i, context_size)
        concordances[word].append(concordance)

    return concordances

def print_concordances(concordances, simple, prefix, suffix):
    """Print the concordances to screen.

    @param concordances - map of keywords to lists of concordances
    @param simple - True: display only concordances, False: group by keywords
    @param prefix - prefix to keywords
    @param suffix - suffix to keywords"""

    # For each concordance, mark the keywords in the sentence and print it out.
    for keyword in concordances:
        if not simple:
            sys.stdout.write(prefix + keyword + suffix + COLON + NEWLINE)
        for words in concordances[keyword]:
            if not simple:
                sys.stdout.write(TAB)
            for i in range(len(words)):
                if matches(keyword, words[i]):
                    sys.stdout.write(prefix + words[i] + suffix)
                else:
                    sys.stdout.write(words[i])
                if i < len(words) - 1:
                    sys.stdout.write(SPACE)
                else:
                    sys.stdout.write(NEWLINE)

def prep_concordance_map(dict_words):
    """Prepare a map with keywords as keys and empty lists as values.

    @param dict_words - list of keywords"""

    # Put an empty list value for each keyword as key.
    concordances = {}
    for word in dict_words:
        concordances[word] = []

    return concordances

def matches(word_a, word_b):
    """Case insensitive string equivalence.

    @param word_a - first string
    @param word_b - second string (duh)
    @return True or False"""

    return word_a.lower() == word_b.lower()

def form_concordance(words, occurrence, context_size):
    """Creates a concordance.

    @param words - list of all input words
    @param occurrence - index of keyword in input list
    @param context_size - number of preceding and following words
    @return a sublist of the input words"""

    # Clamp the start at 0; slicing past the end of the list is safe.
    start = occurrence - context_size
    if start < 0:
        start = 0

    return words[start : occurrence + context_size + 1]

def read_stdin():
    """Read everything from standard input as a list.

    @return list of strings"""

    words = []
    for line in sys.stdin:
        # Add all whitespace-separated tokens of the line to words.
        words.extend(line.split())

    return words

def read_option(key, options, default):
    """Get an option from a list of (option, value) pairs, or use a default.

    @param key - option key
    @param options - list of (option, value) pairs as returned by getopt
    @param default - default value, used if the key is not among the options
    @return value from the options or default"""

    for option, value in options:
        if option == SWITCH + key:
            return value

    return default

def get_configuration(arguments):
    """Retrieve the entire configuration of the script.

    @param arguments - script runtime parameters
    @return map of options with defaults included
    @return list of arguments (keywords)
    @return list of words from standard input"""

    # All possible option sigils are concatenated into an option string.
    option_string = EMPTY.join([("%s" + COLON) % i for i in OPTIONS.keys()])
    # Read all the options.
    options, arguments = getopt.getopt(arguments, option_string)

    # Apply default values if no values were set.
    fixed_options = {}
    for key in OPTIONS.keys():
        fixed_options[key] = read_option(key, options, OPTIONS[key])

    # Read the list of words at standard input.
    words = read_stdin()

    return (fixed_options, arguments, words)

def process(options, arguments, words):
    """The main function.

    @param options - map of options with defaults included
    @param arguments - list of arguments (keywords)
    @param words - list of words from standard input"""

    # Extract some key option values.
    context_size = int(options[CONTEXT_SIZE])
    simple = options[DISPLAY] == OPTIONS[DISPLAY]

    # Conduct main processing - find the concordances.
    concordances = {}
    if not arguments:
        # If no arguments are specified, construct a concordance for all
        # possible keywords.
        concordances = find_all_concordances(words, context_size)
    else:
        # And if there are, make a concordance for only those words.
        concordances = find_concordances(arguments, words, context_size)

    # Display the results.
    print_concordances(concordances, simple, options[PREFIX], options[SUFFIX])

# The processing starts here.
if __name__ == '__main__':
    # Read all user-supplied information.
    options, arguments, words = get_configuration(sys.argv[1:])

    # The configuration is not full - display usage information.
    if not arguments and not words:
        display_help(sys.argv[0])
        sys.exit(1)

    # If everything is in order, start concordancing.
    process(options, arguments, words)


The code is also available at GitHub as python/concordancer.py.
