File: findem.py (original) (raw)

#!/usr/bin/env python3 """

findem.py - locate text that may truncate web pages in Android Opera Origin: January 2022, M. Lutz, https://learning-python.com License: provided freely but with no warranties of any kind

Use HTML parsing to find all

 or  blocks containing a "/*" character 
sequence, among all the HTML files in a website tree.  This sequence makes some 
Android Operas truncate some pages.  This is a browser bug that impacts 3 out of
11 pages with this pattern at the host website, in many Opera 5x and 6x versions.
To work around: search for occurrences here, and replace "/" with "/" HTML
escapes (these are converted to "/" in 
 and  by all browsers tested).
About the Opera bug: learning-python.com/about-this-site.html#androidbrowsernews.
Update: it's not just 
.  A "/*" in a  or  comment can truncate 
pages in Android Opera too.  This now handles ; edit findin's default below
or pass to walker().  Comments are ignored here, and the bug varies by page.
Subtlety: Python's HTML parser converts already-changed / charrefs to a "/"
unless convert_charrefs is False.  If you want to see all files that have or had
a "/", set this to True; the False preset shows only files still having  a "/". 
"""
import os, sys, html.parser
EDIT ME: defaults per parser
findit = '/*'       # find this text in ...
findin = 'pre'      # tag in which to search for findit: pre|code
dumpit = False      # verbose mode: show matched block's full text
EDIT ME: see note above
convert_charrefs = False     # False=show only remaining findits
class Parser(html.parser.HTMLParser):
def __init__(self, findit, findin, dumpit, **others):
    self.findit = findit
    self.findin = findin
    self.dumpit = dumpit
    self.found = {}                # count {files: matches}
    super().__init__(**others)     # pass others to superclass

def parse(self, text, path):
    self.path = path
    self.saving = False
    self.feed(text)

def handle_starttag(self, tag, attrs):
    if tag == self.findin:
        self.content = ''
        self.saving = True

def handle_data(self, data):
    if self.saving:
        self.content += data

def handle_endtag(self, tag):
    if tag == self.findin: 
        if self.findit in self.content:
            message = 'Found <%s> "%s" in: %s'
            print(message % (self.findin, self.findit, self.path))
            self.found[self.path] = self.found.get(self.path, 0) + 1
            if self.dumpit: 
                print('-'*80, self.content, '-'*80, sep='\n')
        self.saving = False
def walker(rootpath='.', findit=findit, findin=findin, dumpit=dumpit):
    numhtml = 0 
    parser = Parser(findit, findin, dumpit, convert_charrefs=convert_charrefs)
    for (dirhere, subs, files) in os.walk(rootpath):
        for file in files:
            if file.lower().endswith(('.html', '.htm')):
                path = os.path.join(dirhere, file)
                try:
                    text = open(path, encoding='utf8').read()
                except:
                    try:
                        text = open(path, encoding='latin1').read()
                    except:
                        print('decoding error skipped:', path)
                        continue
                numhtml += 1
                parser.parse(text, path)
    return numhtml, parser.found
if name == 'main':
    rootpath = sys.argv[1] if len(sys.argv) > 1 else os.getcwd()
    numhtml, found = walker(rootpath, findit, findin)
    print('HTML files parsed:', numhtml)
    print('Found %d matches in %d files' % (sum(found.values()), len(found)))
"""
EXAMPLE OUTPUT:
~/MY-STUFF/Websites$ python3 $C/_etc/findem.py UNION 
Found  "/" in: UNION/pyedit-linux-file-associations.html
Found  "/" in: UNION/errata-supplements.html
Found  "/" in: UNION/errata-supplements.html
Found  "/" in: UNION/extract_py.html
Found  "/" in: UNION/talk.html
Found  "/" in: UNION/extract-calendar_py.html
Found  "/" in: UNION/talkmore.html
Found  "/" in: UNION/talkmore.html
Found  "/" in: UNION/talkmore.html
Found  "/" in: UNION/newex.html
Found  "/" in: UNION/newex.html
Found  "/" in: UNION/newex.html
Found  "/" in: UNION/newex.html
Found  "/" in: UNION/pymailgui-products/unzipped/_README.html
Found  "/" in: UNION/ziptools/ziptools/_README.html
Found  "/" in: UNION/ziptools/ziptools/_README.html
Found  "/" in: UNION/ziptools/ziptools/_README.html
Found  "/" in: UNION/mergeall-android-scripts/_README.html
Found  "/" in: UNION/mergeall-products/unzipped/test/ziptools/_README.html
Found  "/" in: UNION/mergeall-products/unzipped/test/ziptools/_README.html
Found  "/*" in: UNION/mergeall-products/unzipped/test/ziptools/_README.html
HTML files parsed: 2213
Found 21 matches in 11 files
"""

File: findem.py (original) (raw)

#!/usr/bin/env python3 """

Subtlety: Python's HTML parser converts already-changed / charrefs to a "/" unless convert_charrefs is False. If you want to see all files that have or had a "/", set this to True; the False preset shows only files still having a "/".

EDIT ME: defaults per parser

EDIT ME: see note above

"""