Multiple Word Replace in Text (Python)

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague

17 Years Ago

# simpler and faster without re ...
# contributed by Python Fan 'bvdet'
 
def multipleReplace(text, wordDict):
    """
    take a text and replace words that match the key in a dictionary
    with the associated value, return the changed text
    """
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text

13 Years Ago

@vegaseat: You are very wrong. Well, I haven't timed it, but theoretically it should be. Because, with multiple separate replaces you're running through the whole text string for every key in the dictionary, whereas the OP's regex method traverses 'text' just once. For short texts this may not make much difference but for longer texts it will. Too, if one of the to-be-replaced strings (keys) is a superset of another, usually you want the longer string to be replaced first (it's more specific; eg 'theater' before 'the'). Well, the regex method, because regexes try to match the maximum length, will do this naturally. With simply traversing a dict, you take your chances as to which key gets replaced first (eg 'the' could be replaced, even in the middle of 'theater'). Thirdly (though least important), searching separately key-by-key, each key is totally separate; in the regex version, if there's any overlap, the regex compiler will use that to find a slightly more efficient way to search through all keys /at once/, at any part of the text.

edit: oh, you're the same guy. Well, have you timed it? I claim that if you have a sufficiently long 'text' the re method will be faster (not to mention correctness in the case of overlap among your keys).
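
For reference, here is a minimal sketch of the single-pass regex idea, not the OP's exact code. The name multiword_replace_once and the sample dictionary are made up for illustration. One caveat: Python's re alternation takes the first alternative that matches at a given position, not the longest one, so the sketch sorts the keys by decreasing length to get 'theater' tried before 'the'.

```python
import re

def multiword_replace_once(text, word_dict):
    # Hypothetical helper: replace every key of word_dict in a single pass.
    if not word_dict:
        return text  # an empty pattern would match everywhere, so bail out
    # Longest keys first: '|' tries alternatives left to right and stops
    # at the first one that matches, it does not look for the longest.
    keys = sorted(word_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # re.sub scans text once and looks each match up in the dictionary.
    return pattern.sub(lambda m: word_dict[m.group(0)], text)

sample = {"the": "a", "theater": "cinema"}  # illustrative data only
result = multiword_replace_once("the theater", sample)
print(result)
```

Whether this beats repeated str.replace calls still depends on the text and the number of keys, so it is worth timing on your own data.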

Edited 13 Years Ago by CS guy because: noticed OP and vegaseat were the same guy.

Gribouillis 1,391 Programming Explorer Team Colleague

13 Years Ago

Actually, on this text, the second method is 6 times faster:

from timeit import Timer
NTIMES = 100000
testlist = [multiwordReplace, multipleReplace]
for func in testlist:
    # print() with a single argument works in both Python 2 and Python 3
    print("{n}(): {u:.2f} usecs".format(n=func.__name__, u=Timer(
        "{n}(str1, wordDic)".format(n=func.__name__),
        "from __main__ import {f}, str1, wordDic".format(f=",".join(x.__name__ for x in testlist))
        ).timeit(number=NTIMES) * 1.e6/NTIMES))

""" my output --->

multiwordReplace(): 64.09 usecs
multipleReplace(): 10.56 usecs

"""

It's more important to notice that the functions replace subwords and not words (for example, 'solidity' would be replaced by 'saltedity'), and also that they may give different results because of the multiple passes of the second function.
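
Both effects are easy to reproduce with the str.replace loop from the top of the thread. The two small dictionaries below are made up for the demonstration, and the chained case assumes Python 3.7+, where dicts keep insertion order.

```python
def multipleReplace(text, wordDict):
    # same loop as the str.replace version posted above
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text

# Subword replacement: 'solid' is replaced inside 'solidity'.
subword = multipleReplace("solidity", {"solid": "salted"})

# Multiple passes: the result of the first replacement ('dog')
# is itself replaced when the loop reaches the second key.
chained = multipleReplace("cat", {"cat": "dog", "dog": "bird"})

print(subword)
print(chained)
```

A single-pass regex substitution would leave the second case at 'dog', which is why the two functions can disagree.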

Edited 13 Years Ago by Gribouillis because: n/a

13 Years Ago

I have spent the last hour trying to figure out this section of code from above.

def translate(match):
    return wordDic[match.group(0)]

Can anyone PLEASE explain what this is doing and how it works?

TrustyTony 888 ex-Moderator Team Colleague Featured Poster

13 Years Ago

There should be a lot of tutorials out there, for example http://www.tutorialspoint.com/python/python_reg_expressions.htm.
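
In short, translate is a callback: when a function is passed to the regex sub() method, it gets called once per match and its return value becomes the replacement text. Here is a minimal sketch, assuming wordDic is a plain dict and the pattern is built by joining its keys with '|' (the sample data and the pattern construction are guesses for illustration, not the original snippet):

```python
import re

wordDic = {"cat": "dog", "red": "blue"}  # hypothetical sample data
# One pattern that matches any of the keys: 'cat|red'
pattern = re.compile("|".join(map(re.escape, wordDic)))

def translate(match):
    # match is a match object; group(0) is the exact text that matched,
    # which is one of the keys, so it can be looked up directly.
    return wordDic[match.group(0)]

# sub() calls translate for every match and splices in what it returns.
result = pattern.sub(translate, "a red cat")
print(result)
```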

13 Years Ago

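def wordReplace(sentList, wordDict):
    # for each key, collect the indexes where it occurs in sentList
    find = lambda searchList, elem: [[i for i, x in enumerate(searchList) if x == e] for e in elem]
    wordList = list(wordDict)
    wordInd = find(sentList, wordList)
    # pair each key with its own index list, so that one key's
    # replacement is not overwritten by another key's value
    for word, indexes in zip(wordList, wordInd):
        for j in indexes:
            sentList[j] = wordDict[word]
    return "".join(sentList)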

This is probably super inefficient but maybe one of y'all will be able to make it better. :D

Example:

>>> wordReplace(['Alex', ' ','is',' ','cool'],{'is':'was'})
'Alex was cool'

12 Years Ago

I wanted to get input from the user and then use it to translate the words from English to German.
For example, if a user types "a", the exact German meaning of "a" is "ein"; similarly, if the user types "an", the exact German meaning is "eine". So if the input is "a an", the desired output should be "ein eine", but what I am getting is "ein einn". Can anyone please help me out here? Thank you in advance.

12 Years Ago

To get a better answer to your problem, post your question in a new thread.

Edited 12 Years Ago by Lucaci Andrew

12 Years Ago

Both the original and vegaseat's code need to replace longer words first to get around the problem where a shorter word is part of a longer one.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster

12 Years Ago

Good thinking, paddy3118, but a little incomplete. How about the case of a shorter replacement word containing another word? We should use an extraction regexp, adding '\W' (non-word character) at the beginning and end of each word.

12 Years Ago

Hi pyTony, sorting on just the length should suffice, as long as you arrange to replace longer matches before shorter matches. The regexp alternation operator '|' matches the LHS before the RHS, so listing all words in order of decreasing length should do it.

If the text contains sequences of word characters that are not among the words to replace, then you are right: you need to replace on word boundaries (which is also missing from the original text).

P.S. I had to do this kind of thing when replacing all signal names by an HTML link to further data. This was in a Verilog source file that could have hundreds of signals.
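
Putting the two suggestions together (longest keys first, plus word boundaries) could look roughly like the sketch below. The function name replace_whole_words is made up, and it assumes every key consists of word characters only, since \b behaves differently next to punctuation. It also handles the "a an" case from the earlier question.

```python
import re

def replace_whole_words(text, word_dict):
    # Hypothetical helper: replace keys only where they stand as whole words.
    if not word_dict:
        return text
    # Decreasing length, so a longer key is tried before a shorter one.
    keys = sorted(word_dict, key=len, reverse=True)
    # \b anchors the alternation to word boundaries on both sides.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: word_dict[m.group(0)], text)

result = replace_whole_words("a an", {"a": "ein", "an": "eine"})
print(result)
```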

ZZucker 342 Practically a Master Poster

12 Years Ago

A slight modification of vegaseat's original code to take care of whole words:

''' re_sub_whole words.py
replace whole words in a text using Python module re
and a dictionary of key_word:replace_with pairs

tested with Python273  ZZ
'''

import re

def word_replace(text, replace_dict):
    '''
    Replace words in a text that match a key in replace_dict
    with the associated value, return the modified text.
    Only whole words are replaced.
    Note that replacement is case sensitive, but attached
    quotes and punctuation marks are neutral.
    '''
    rc = re.compile(r"[A-Za-z_]\w*")
    def translate(match):
        word = match.group(0)
        return replace_dict.get(word, word)
    return rc.sub(translate, text)

old_text = """\
In this text 'debug' will be changed but not 'debugger'.
Similarly red will be replaced but not reduce.
How about Fred, -red- and red?
Red, white and blue shipped a ship!"""

# create a dictionary of key_word:replace_with pairs
replace_dict = {
    "red": "redish",
    "debug": "fix",
    "ship": "boat",
}

new_text = word_replace(old_text, replace_dict)

print(old_text)
print('-'*60)
print(new_text)

''' my output -->
In this text 'debug' will be changed but not 'debugger'.
Similarly red will be replaced but not reduce.
How about Fred, -red- and red?
Red, white and blue shipped a ship!
------------------------------------------------------------
In this text 'fix' will be changed but not 'debugger'.
Similarly redish will be replaced but not reduce.
How about Fred, -redish- and redish?
Red, white and blue shipped a boat!
'''

10 Years Ago

I use the same commands daily, only the Date, etc. changes. I have a master template with the commands. I use the Python program to modify it and poop out a ready to use text document called Worksheet.txt . Then I just copy and paste my commands and text.

First Create a text File called MyMasterTemplate.txt with these five lines:
grep -i alert elfYYYYMMDD.fil (Command to find the word alert in the Error Log File (elf) with Date - YYYYMMDD)
EMAILS TO SEND
Alerts for MM/DD/YYYY (Email Subject Line)
There was an alert on MM/DD/YYYY. (Email Body)

Here is the program:

# read the template from disk into memory; 'with' closes the file afterwards
with open("MyMasterTemplate.txt") as f1:
    TextMemory = f1.read()

TextMemory = TextMemory.replace('YYYYMMDD', '20141225') #replace YYYYMMDD with date 20141225
TextMemory = TextMemory.replace('MM/DD/YYYY', '12/25/2014')

# write the filled-in text to Worksheet.txt
with open("Worksheet.txt", "w") as f2:
    f2.write(TextMemory)
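
Since only the date changes from day to day, the two date strings could also be generated instead of typed in. Here is a small sketch using the standard datetime module. The function name fill_template and the two-line sample are made up, and the date is fixed to 2014-12-25 only so the result is reproducible (date.today() would give the current day).

```python
from datetime import date

def fill_template(template, day):
    # strftime produces both spellings of the date used in the template
    text = template.replace('YYYYMMDD', day.strftime('%Y%m%d'))
    return text.replace('MM/DD/YYYY', day.strftime('%m/%d/%Y'))

# Illustrative two-line stand-in for MyMasterTemplate.txt
sample = "grep -i alert elfYYYYMMDD.fil\nAlerts for MM/DD/YYYY"
filled = fill_template(sample, date(2014, 12, 25))
print(filled)
```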

Edited 10 Years Ago by Gribouillis because: fixed code display

Gribouillis 1,391 Programming Explorer Team Colleague

10 Years Ago

@tim1234 This is a good way to do it. For more sophisticated work in the same direction, I would suggest trying a templating module such as mako.
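
mako is a third-party package, so as a stand-in here is the same idea with string.Template from the standard library. This is not mako's API, just a sketch of what a templating approach looks like. The placeholder names $ymd and $mdy are invented for the example.

```python
from string import Template

# Named placeholders replace the YYYYMMDD and MM/DD/YYYY markers.
template = Template("grep -i alert elf$ymd.fil\nAlerts for $mdy")

# substitute() raises KeyError if a placeholder is left unfilled,
# which catches a forgotten value early.
worksheet = template.substitute(ymd="20141225", mdy="12/25/2014")
print(worksheet)
```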

Edited 10 Years Ago by Gribouillis
