sgrep - Example queries
Queries from sgrep home page
All lines with text "Hello World"
sgrep 'start or "\n" .. (end or "\n") containing "Hello World"'
Same query using sample macros
sgrep 'LINE containing "Hello World"'
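For readers who do not have sgrep at hand, the effect of this query can be approximated in a few lines of Python. This is only a sketch of the idea, not how sgrep works internally: the function name `lines_containing` is hypothetical, and the exact region boundaries differ, since sgrep delimits a line region by the newline characters themselves.

```python
def lines_containing(text, phrase):
    """Return every line of text (with its trailing newline, if any) that contains phrase."""
    # splitlines(keepends=True) plays the role of the LINE macro here:
    # each item runs from a line start up to and including its "\n".
    return [line for line in text.splitlines(keepends=True) if phrase in line]


sample = "first line\nsay Hello World now\nlast line"
print(lines_containing(sample, "Hello World"))
```

The point of the comparison is that `containing` filters a set of regions (here, lines) by whether another region (the phrase) occurs inside them.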
All From fields from mail messages
sgrep '"\nFrom: " .. "\n" extracting ("\n" in "\nFrom: ")'
Same query using sample macros
sgrep 'MAIL_FROM'
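The `extracting` step in the unexpanded query drops the leading newline from each matched region, so the result starts at "From: " and runs to the end of that line. A rough Python analogue, assuming a plain mbox-style text and using a hypothetical helper name `mail_from_fields`, could look like this:

```python
import re


def mail_from_fields(text):
    """Return each 'From: ...' header line, including its trailing newline."""
    # (?<=\n) requires a newline directly before "From: " without making it
    # part of the match, mirroring: extracting ("\n" in "\nFrom: ").
    return re.findall(r"(?<=\n)From: [^\n]*\n", text)


sample = "From alice Mon May  6\nFrom: alice@example.org\nSubject: hi\n\nbody\n"
print(mail_from_fields(sample))
```

Note that the mbox separator line "From alice ..." is not matched, because the pattern requires the colon, just as the sgrep phrase "\nFrom: " does.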
Give me the senders of all news articles with the word "sgrep" or "linux" in the subject field
Query using sample news macros
NEWS_FROM in (NEWS_HEADER containing (NEWS_SUBJ containing ("sgrep" or "linux")))
Same query with macros expanded
(("\nFrom: " in ( ( start or (("\n\nFrom ") extracting ("\n\n" in "\n\nFrom "))) .. ("\n" in "\n\n")) extracting ("\n" in "\nFrom: ") .. ( "\n" or end ))) in (( ( start or (("\n\nFrom ") extracting ("\n\n" in "\n\nFrom "))) .. ("\n" in "\n\n")) containing ((("\nSubject: " in ( ( start or (("\n\nFrom ") extracting ("\n\n" in "\n\nFrom "))) .. ("\n" in "\n\n")) extracting ("\n" in "\nSubject: ") .. ( "\n" or end ))) containing ("sgrep" or "linux")))
Now you see that macros are very useful :)
Give me titles and names of all HTML documents that contain links to www.cs.helsinki.fi
Query using sample macros
sgrep -o"%f:%r\n" '(HTML_TITLE in (start .. end containing (HTML_HREF containing "www.cs.helsinki.fi")))' *.html
Same query with macros expanded
((( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or "<TITLE\n") .. ">")) .. ( "</TITLE>" ) )) in (start .. end containing ((( (( " " or "\t" or "\n" or "\r") __ ">" in (inner(("<" not in ("</" or "<!" or "<?" )) .. ">" ) extracting (("<" not in ("</" or "<!" or "<?" )) __ (( " " or "\t" or "\n" or "\r") or ">" ) in inner(("<" not in ("</" or "<!" or "<?" )) .. ">" ) ))) containing "HREF=" ._ (( " " or "\t" or "\n" or "\r") or ">"))) containing "www.cs.helsinki.fi")))
Queries from the sgrep announcement
Locate only TITLE and H1 .. H9 elements from HTML documents
Simple version
sgrep '("<TITLE>" .. "</TITLE>") or ("<H1>" .. "</H1>")
or ("<H2>" .. "</H2>") or ("<H3>" .. "</H3>")
or ("<H4>" .. "</H4>") or ("<H5>" .. "</H5>")
or ("<H6>" .. "</H6>") or ("<H7>" .. "</H7>")
or ("<H8>" .. "</H8>") or ("<H9>" .. "</H9>")'
Same query using example macros. This query is more exact, since it uses the SGML macros which can handle tags which contain attributes.
sgrep 'HTML_TITLE or HTML_H1 or HTML_H3 or HTML_H4 or HTML_H5
or HTML_H6 or NAMED_ELEMS(H7) or NAMED_ELEMS(H8)
or NAMED_ELEMS(H9)'
Previous query with macros expanded
(( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or "<TITLE\n") .. ">")) .. ( "</TITLE>" ) )) or
(( ( "<H1>" or ( ("<H1 " or "<H1\t" or "<H1\n") .. ">")) .. ( "</H1>" ) )) or
(( ( "<H3>" or ( ("<H3 " or "<H3\t" or "<H3\n") .. ">")) .. ( "</H3>" ) )) or
(( ( "<H4>" or ( ("<H4 " or "<H4\t" or "<H4\n") .. ">")) .. ( "</H4>" ) )) or
(( ( "<H5>" or ( ("<H5 " or "<H5\t" or "<H5\n") .. ">")) .. ( "</H5>" ) )) or
(( ( "<H6>" or ( ("<H6 " or "<H6\t" or "<H6\n") .. ">")) .. ( "</H6>" ) )) or
( ( "<H7>" or ( ("<H7 " or "<H7\t" or "<H7\n") .. ">")) .. ( "</H7>" ) ) or
( ( "<H8>" or ( ("<H8 " or "<H8\t" or "<H8\n") .. ">")) .. ( "</H8>" ) ) or
( ( "<H9>" or ( ("<H9 " or "<H9\t" or "<H9\n") .. ">")) .. ( "</H9>" ) )
Remove all <FONT> tags from HTML document
sgrep -a -o" " 'NAMED_STAG(FONT) or "</FONT>"'
Same example with macros expanded
sgrep -a -o" " '( "<FONT>" or ( ("<FONT " or "<FONT\t" or
"<FONT\n") .. ">")) or "</FONT>"'
A different solution to the same problem
sgrep 'start .. end extracting (NAMED_STAG(FONT) or "</FONT>")'
Find out how many FIG elements there are under SUBPARA elements but not under PARA elements in your SGML file
sgrep -c '"<FIG>" .. "</FIG>" in ("<SUBPARA>".."</SUBPARA>")'
Same example using sample macros
sgrep -c 'NAMED_ELEMS(FIG) in NAMED_ELEMS(SUBPARA) not in NAMED_ELEMS(PARA)'
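The containment operators can be illustrated with a small region model in Python. This is a sketch under stated assumptions, not sgrep's implementation: elements are assumed to have no attributes and not to nest within themselves, the operators are assumed to group left to right, and the helper names `elements`, `inside` and `not_inside` are hypothetical.

```python
import re


def elements(text, name):
    """Return (start, end) offsets of non-nested <name>...</name> elements."""
    pattern = re.compile(r"<%s>.*?</%s>" % (name, name), re.DOTALL)
    return [m.span() for m in pattern.finditer(text)]


def inside(inner, outer):
    """Keep inner regions that lie properly within at least one outer region."""
    return [(s, e) for (s, e) in inner
            if any(o_s <= s and e <= o_e and (o_s, o_e) != (s, e)
                   for (o_s, o_e) in outer)]


def not_inside(inner, outer):
    """Keep inner regions that lie within no outer region."""
    contained = set(inside(inner, outer))
    return [region for region in inner if region not in contained]


doc = ("<PARA><SUBPARA><FIG>a</FIG></SUBPARA></PARA>"
       "<SUBPARA><FIG>b</FIG><FIG>c</FIG></SUBPARA>")
figs = elements(doc, "FIG")
# Read left to right: (FIG in SUBPARA) not in PARA
result = not_inside(inside(figs, elements(doc, "SUBPARA")), elements(doc, "PARA"))
print(len(result))
```

In the sample document all three FIG elements are under a SUBPARA, but the first one is also under a PARA, so only the last two are counted.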
Print out the TITLE elements from a set of HTML documents in which the word 'SGML' is mentioned more than 12 times, or which contain the word 'SGML' inside H1 or H2 elements.
sgrep 'HTML_TITLE in (start .. end containing (
join(12,"SGML") or (HTML_H1 or HTML_H2 containing "SGML") ) )' *.html
Find out the senders of mail messages from a set of mail files, where the messages contain the word 'SGML' in the subject line, do not contain 'HTML' in the body of the mail, were sent in the year 1996 and were not sent from the address flame@hot.com
sgrep 'MAIL_FROM in (MAIL_MESS containing
(MAIL_SUBJ containing "SGML")
not containing (MAIL_BODY containing "HTML")
containing (MAIL_DATE containing "1996")
not containing (MAIL_FROM containing "flame@hot.com") )'
Shell scripts
A shell script to convert all <, > and & characters to &lt;, &gt; and &amp; entities inside PRE elements
Note that this script bypasses all existing &lt;, &gt; and &amp; entities, so that it can be run multiple times over one HTML document. However, this presents one problem: in an sgrep query where you try to locate all &amp; entities with a phrase like
sgrep '"&amp;"'
the script does not convert the query to proper HTML, because the "&amp;" phrase looks like a correct entity and is bypassed. Instead you get HTML which, when rendered by a browser, looks like this
sgrep '"&"'
Yes, this did bite me. Thanks to Axel Boldt for pointing this out.
Here is the manually edited script anyway:
#!/bin/tcsh
sgrep -a -o"&lt;" '"<" in ("<PRE>"__"</PRE>")' |
sgrep -a -o"&gt;" '">" in ("<PRE>"__"</PRE>")' |
sgrep -a -o"&amp;" '"&" in ("<PRE>"__"</PRE>")
not in ("&gt;" or "&lt;" or "&amp;")'
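The behaviour of the script above, including the caveat about text that already looks like an entity, can be reproduced with a short Python sketch. It is an approximation written for illustration only: it assumes PRE start tags without attributes, and the function names `escape_pre_body` and `escape_pre` are hypothetical.

```python
import re

# The three entities that the script leaves untouched.
ENTITY = re.compile(r"&(?:lt|gt|amp);")


def escape_pre_body(body):
    """Escape <, > and & in body, bypassing existing &lt;, &gt; and &amp;."""
    # Splitting on a capturing group keeps the entities as separate pieces.
    pieces = re.split(r"(&(?:lt|gt|amp);)", body)
    out = []
    for piece in pieces:
        if ENTITY.fullmatch(piece):
            out.append(piece)  # already an entity: bypass it
        else:
            out.append(piece.replace("&", "&amp;")
                            .replace("<", "&lt;")
                            .replace(">", "&gt;"))
    return "".join(out)


def escape_pre(html):
    """Apply escape_pre_body to the contents of every <PRE>...</PRE> element."""
    return re.sub(r"(<PRE>)(.*?)(</PRE>)",
                  lambda m: m.group(1) + escape_pre_body(m.group(2)) + m.group(3),
                  html, flags=re.DOTALL)


once = escape_pre("<PRE>if (a < b && c > d)</PRE>")
print(once)
print(escape_pre(once) == once)   # running it again changes nothing
print(escape_pre("<PRE>sgrep '\"&amp;\"'</PRE>"))  # the caveat: left as-is
```

The last call shows the problem described earlier: a literal "&amp;" inside a PRE element is indistinguishable from an already converted entity, so it is not escaped a second time.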
Last modified: May 3, 1996
This document is maintained by Jani Jaakkola
at email address Jani.Jaakkola@helsinki.fi