HTML::Selector
Selects HTML elements using CSS 2 selectors.
The Selector class uses CSS selector expressions to match and select HTML elements.
For example:
selector = HTML::Selector.new "form.login[action=/login]"
creates a new selector that matches any form element with the class login and an attribute action with the value /login.
Matching Elements
Use the #match method to determine if an element matches the selector.
For simple selectors, the method returns an array with that element, or nil if the element does not match. For complex selectors (see below) the method returns an array with all matched elements, or nil if no match is found.
For example:
  if selector.match(element)
    puts "Element is a login form"
  end
Selecting Elements
Use the #select method to select all matching elements starting with one element and going through all children in depth-first order.
This method returns an array of all matching elements, or an empty array if no match is found.
For example:
  selector = HTML::Selector.new "input[type=text]"
  matches = selector.select(element)
  matches.each do |match|
    puts "Found text field with name #{match.attributes['name']}"
  end
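The depth-first order mentioned above can be illustrated without the library. The following is a minimal standalone sketch, not the library's implementation: Node and select_depth_first are hypothetical names, and the matching rule is passed in as a block instead of a selector.

```ruby
# Hypothetical stand-in for a parsed HTML element (not the library's node class).
Node = Struct.new(:name, :attributes, :children)

# Collect every node for which the block returns true. A node is visited
# before its children, and children are visited in document order.
def select_depth_first(root)
  matches = []
  stack = [root]
  while (node = stack.pop)
    matches << node if yield(node)
    # Push children reversed so that the first child is popped first.
    stack.concat(node.children.reverse)
  end
  matches
end

form = Node.new("form", {}, [
  Node.new("input", { "type" => "text", "name" => "user" }, []),
  Node.new("p", {}, [
    Node.new("input", { "type" => "text", "name" => "email" }, [])
  ]),
  Node.new("input", { "type" => "submit", "name" => "go" }, [])
])

fields = select_depth_first(form) do |node|
  node.name == "input" && node.attributes["type"] == "text"
end
puts fields.map { |node| node.attributes["name"] }.inspect
```

The nested text field is reported after the first one and before anything that follows its parent, which is the order a reader of the document would see.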
Expressions
Selectors can match elements using any of the following criteria:
name – Match an element based on its name (tag name). For example, p to match a paragraph. You can use * to match any element.
#id – Match an element based on its identifier (the id attribute). For example, #page.
.class – Match an element based on its class name, all class names if more than one specified.
[attr] – Match an element that has the specified attribute.
[attr=value] – Match an element that has the specified attribute and value. (More operators are supported; see below.)
:pseudo-class – Match an element based on a pseudo class, such as :nth-child and :empty.
:not(expr) – Match an element that does not match the negation expression.
When using a combination of the above, the element name comes first followed by identifier, class names, attributes, pseudo classes and negation in any order. Do not separate these parts with spaces! Space separation is used for descendant selectors.
For example:
selector = HTML::Selector.new "form.login[action=/login]"
The matched element must be of type form and have the class login. It may have other classes, but the class login is required to match. It must also have an attribute called action with the value /login.
This selector will match the following element:
  <form class="login form" method="post" action="/login">
but will not match the element:
  <form method="post" action="/logout">
Attribute Values
Several operators are supported for matching attributes:
name – The element must have an attribute with that name.
name=value – The element must have an attribute with that name and value.
name^=value – The attribute value must start with the specified value.
name$=value – The attribute value must end with the specified value.
name*=value – The attribute value must contain the specified value.
name~=word – The attribute value must contain the specified word (space separated).
name|=word – The attribute value must start with specified word.
For example, the following two selectors match the same element:
  #my_id
  [id=my_id]
and so do the following two selectors:
  .my_class
  [class~=my_class]
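One way to read the operators listed above is as regular-expression tests on the attribute value. The sketch below is a standalone illustration under that assumption; attribute_regexp is a hypothetical helper, and it follows the descriptions in this section (for example, |= is treated as "starts with the word", space separated).

```ruby
# Build a Regexp for one attribute operator. A sketch, not the library's code.
def attribute_regexp(operator, value)
  v = Regexp.escape(value.to_s)
  case operator
  when "="  then /\A#{v}\z/            # whole value
  when "^=" then /\A#{v}/              # prefix
  when "$=" then /#{v}\z/              # suffix
  when "*=" then /#{v}/                # substring
  when "~=" then /(\A|\s)#{v}(\s|\z)/  # one space-separated word
  when "|=" then /\A#{v}(\s|\z)/       # first space-separated word
  else raise ArgumentError, "unknown operator #{operator.inspect}"
  end
end

puts attribute_regexp("~=", "my_class").match?("box my_class wide")
puts attribute_regexp("=", "my_id").match?("my_id")
```

Seen this way, .my_class and [class~=my_class] are the same test, which is why the two selectors above match the same element.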
Alternatives, siblings, children
Complex selectors use a combination of expressions to match elements:
expr1 expr2 – Match any element against the second expression if it has some parent element that matches the first expression.
expr1 > expr2 – Match any element against the second expression if it is the child of an element that matches the first expression.
expr1 + expr2 – Match any element against the second expression if it immediately follows an element that matches the first expression.
expr1 ~ expr2 – Match any element against the second expression that comes after an element that matches the first expression.
expr1, expr2 – Match any element against the first expression, or against the second expression.
Since children and sibling selectors may match more than one element given the first element, the #match method may return more than one match.
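To make the difference between the four relationships concrete, here is a small standalone sketch over a hand-built tree. El and the four helper methods are hypothetical names chosen for illustration; they match on element name only and are not how the library represents or walks nodes.

```ruby
# Hypothetical minimal element type with a parent link.
class El
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, *children)
    @name = name
    @children = children
    children.each { |child| child.parent = self }
  end
end

# "a > b": direct children of el with the given name.
def children_named(el, name)
  el.children.select { |child| child.name == name }
end

# "a b": every element below el with the given name, in document order.
def descendants_named(el, name)
  el.children.flat_map do |child|
    (child.name == name ? [child] : []) + descendants_named(child, name)
  end
end

# Elements that come after el under the same parent.
def siblings_after(el)
  return [] unless el.parent
  siblings = el.parent.children
  siblings.drop(siblings.index { |s| s.equal?(el) } + 1)
end

# "a + b": the element right after el, only if it has the given name.
def adjacent_named(el, name)
  nxt = siblings_after(el).first
  nxt && nxt.name == name ? [nxt] : []
end

# "a ~ b": all later siblings of el with the given name.
def following_named(el, name)
  siblings_after(el).select { |s| s.name == name }
end

h1 = El.new("h1")
div = El.new("div", h1, El.new("p"), El.new("ul", El.new("li"), El.new("li")), El.new("p"))

puts children_named(div, "p").size      # div > p
puts descendants_named(div, "li").size  # div li
puts adjacent_named(h1, "p").size       # h1 + p
puts following_named(h1, "p").size      # h1 ~ p
```

Only the adjacent form is limited to a single result; the other three can return several elements for one starting element.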
Pseudo classes
Pseudo classes were introduced in CSS 3. They are most often used to select elements in a given position:
:root – Match the element only if it is the root element (no parent element).
:empty – Match the element only if it has no child elements, and no text content.
:content(string) – Match the element only if it has string as its text content (ignoring leading and trailing whitespace).
:only-child – Match the element if it is the only child (element) of its parent element.
:only-of-type – Match the element if it is the only child (element) of its parent element and its type.
:first-child – Match the element if it is the first child (element) of its parent element.
:first-of-type – Match the element if it is the first child (element) of its parent element of its type.
:last-child – Match the element if it is the last child (element) of its parent element.
:last-of-type – Match the element if it is the last child (element) of its parent element of its type.
:nth-child(b) – Match the element if it is the b-th child (element) of its parent element. The value b specifies its index, starting with 1.
:nth-child(an+b) – Match the element if it is the b-th child (element) in each group of a child elements of its parent element.
:nth-child(-an+b) – Match the element if it is the first child (element) in each group of a child elements, up to the first b child elements of its parent element.
:nth-child(odd) – Match element in the odd position (i.e. first, third). Same as :nth-child(2n+1).
:nth-child(even) – Match element in the even position (i.e. second, fourth). Same as :nth-child(2n+2).
:nth-of-type(..) – As above, but only counts elements of its type.
:nth-last-child(..) – As above, but counts from the last child.
:nth-last-of-type(..) – As above, but counts from the last child and only elements of its type.
:not(selector) – Match the element only if the element does not match the simple selector.
As you can see, the :nth-child pseudo class and its variants can get quite tricky, and the CSS specification doesn’t do a much better job of explaining them. But after reading the examples and trying a few combinations, it’s easy to figure out.
For example:
table tr:nth-child(odd)
Selects every second row in the table starting with the first one.
div p:nth-child(4)
Selects the fourth paragraph in the div, but not if the div contains other elements, since those are also counted.
div p:nth-of-type(4)
Selects the fourth paragraph in the div, counting only paragraphs, and ignoring all other elements.
div p:nth-of-type(-n+4)
Selects the first four paragraphs, ignoring all others.
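The an+b arithmetic behind these examples can be checked in isolation. The sketch below follows the CSS definition (a position matches when it equals a*n + b for some integer n >= 0, with positions starting at 1); nth_match? is a hypothetical helper and is not the library's nth_child method.

```ruby
# True if the 1-based position index is selected by the expression an+b.
def nth_match?(a, b, index)
  return index == b if a.zero?
  diff = index - b
  # index = a*n + b must hold for a whole number n that is not negative.
  (diff % a).zero? && diff / a >= 0
end

positions = (1..6).to_a
puts positions.select { |i| nth_match?(2, 1, i) }.inspect   # odd, i.e. 2n+1
puts positions.select { |i| nth_match?(2, 2, i) }.inspect   # even, i.e. 2n+2
puts positions.select { |i| nth_match?(0, 4, i) }.inspect   # plain index 4
puts positions.select { |i| nth_match?(-1, 4, i) }.inspect  # -n+4
```

For six siblings these select positions 1, 3, 5; then 2, 4, 6; then 4 alone; then 1 through 4, which lines up with the odd, even, fourth and first-four examples above.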
And you can always select an element that matches one set of rules but not another using :not. For example:
p:not(.post)
Matches all paragraphs that do not have the class .post.
Substitution Values
You can use substitution with identifiers, class names and element values. A substitution takes the form of a question mark (?) and uses the next value in the argument list following the CSS expression.
The substitution value may be a string or a regular expression. All other values are converted to strings.
For example:
selector = HTML::Selector.new "#?", /^\d+$/
matches any element whose identifier consists of one or more digits.
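The substitution rule can be sketched for the identifier case alone. This is a simplified standalone illustration, with id_matcher as a hypothetical helper: a question mark takes the next value from the argument list, regular expressions are used as given, and anything else is converted to a string and matched exactly.

```ruby
# Turn "#name" or "#?" into a Regexp to test against an id attribute.
# values is the argument list that follows the expression.
def id_matcher(expression, values)
  token = expression[/\A#(\?|[\w\-]+)\z/, 1]
  raise ArgumentError, "not an id selector: #{expression}" unless token
  value = token == "?" ? values.shift : token
  value.is_a?(Regexp) ? value : /\A#{Regexp.escape(value.to_s)}\z/
end

digits = id_matcher("#?", [/^\d+$/])
puts digits.match?("2024")
puts digits.match?("page")
puts id_matcher("#?", [42]).match?("42")
```

Because each question mark consumes one value, the order of the extra arguments has to follow the order of the placeholders in the expression.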
See www.w3.org/TR/css3-selectors/
Class Public methods
Selector.for_class(cls) => selector
Creates a new selector for the given class name.
Source:
  def for_class(cls)
    self.new([".?", cls])
  end
Selector.for_id(id) => selector
Creates a new selector for the given id.
Source:
  def for_id(id)
    self.new(["#?", id])
  end
Selector.new(string, [values ...]) => selector
Creates a new selector from a CSS 2 selector expression.
The first argument is the selector expression. All other arguments are used for value substitution.
Raises InvalidSelectorError if the selector expression is invalid.
Source:
  def initialize(selector, *values)
    raise ArgumentError, "CSS expression cannot be empty" if selector.empty?
    @source = ""
    values = values[0] if values.size == 1 && values[0].is_a?(Array)

    # Work on a copy, since parsing consumes the statement.
    statement = selector.strip.dup

    # Parse the leading simple selector (including negation).
    simple_selector(statement, values).each { |name, value| instance_variable_set("@#{name}", value) }

    @alternates = []
    @depends = nil

    # Alternative selector (expr1, expr2).
    if statement.sub!(/^\s*,\s*/, "")
      second = Selector.new(statement, values)
      @alternates << second
      if alternates = second.instance_variable_get(:@alternates)
        second.instance_variable_set(:@alternates, [])
        @alternates.concat alternates
      end
      @source << " , " << second.to_s
    # Adjacent sibling selector (expr1 + expr2).
    elsif statement.sub!(/^\s*\+\s*/, "")
      second = next_selector(statement, values)
      @depends = lambda do |element, first|
        if element = next_element(element)
          second.match(element, first)
        end
      end
      @source << " + " << second.to_s
    # General sibling selector (expr1 ~ expr2).
    elsif statement.sub!(/^\s*~\s*/, "")
      second = next_selector(statement, values)
      @depends = lambda do |element, first|
        matches = []
        while element = next_element(element)
          if subset = second.match(element, first)
            if first && !subset.empty?
              matches << subset.first
              break
            else
              matches.concat subset
            end
          end
        end
        matches.empty? ? nil : matches
      end
      @source << " ~ " << second.to_s
    # Child selector (expr1 > expr2).
    elsif statement.sub!(/^\s*>\s*/, "")
      second = next_selector(statement, values)
      @depends = lambda do |element, first|
        matches = []
        element.children.each do |child|
          if child.tag? && subset = second.match(child, first)
            if first && !subset.empty?
              matches << subset.first
              break
            else
              matches.concat subset
            end
          end
        end
        matches.empty? ? nil : matches
      end
      @source << " > " << second.to_s
    # Descendant selector (expr1 expr2).
    elsif statement =~ /^\s+\S+/ && statement != selector
      second = next_selector(statement, values)
      @depends = lambda do |element, first|
        matches = []
        stack = element.children.reverse
        while node = stack.pop
          next unless node.tag?
          if subset = second.match(node, first)
            if first && !subset.empty?
              matches << subset.first
              break
            else
              matches.concat subset
            end
          elsif children = node.children
            stack.concat children.reverse
          end
        end
        matches.empty? ? nil : matches
      end
      @source << " " << second.to_s
    else
      # Last selector: anything left over could not be parsed.
      unless statement.empty? || statement.strip.empty?
        raise ArgumentError, "Invalid selector: #{statement}"
      end
    end
  end
Instance Public methods
match(element, first?) => array or nil
Matches an element against the selector.
For a simple selector this method returns an array with the element if the element matches, nil otherwise.
For a complex selector (sibling and descendant) this method returns an array with all matching elements, nil if no match is found.
Use first_only=true if you are only interested in the first element.
For example:
  if selector.match(element)
    puts "Element is a login form"
  end
Source:
  def match(element, first_only = false)
    # Match the element name (if any) and then every attribute.
    if matched = (!@tag_name || @tag_name == element.name)
      for attr in @attributes
        if element.attributes[attr[0]] !~ attr[1]
          matched = false
          break
        end
      end
    end

    # Pseudo classes (nth-child, empty, etc).
    if matched
      for pseudo in @pseudo
        unless pseudo.call(element)
          matched = false
          break
        end
      end
    end

    # Negation: same checks as above, but a hit means failure.
    if matched && @negation
      for negation in @negation
        if negation[:tag_name] == element.name
          matched = false
        else
          for attr in negation[:attributes]
            if element.attributes[attr[0]] =~ attr[1]
              matched = false
              break
            end
          end
        end
        if matched
          for pseudo in negation[:pseudo]
            if pseudo.call(element)
              matched = false
              break
            end
          end
        end
        break unless matched
      end
    end

    # If there is a dependent selector (child, sibling, etc),
    # its matches replace the element itself.
    if matched && @depends
      matches = @depends.call(element, first_only)
    else
      matches = matched ? [element] : nil
    end

    # Try the alternative selectors of a group, unless the first
    # match is all we need and we already have it.
    if !first_only || !matches
      @alternates.each do |alternate|
        break if matches && first_only
        if subset = alternate.match(element, first_only)
          if matches
            matches.concat subset
          else
            matches = subset
          end
        end
      end
    end

    matches
  end
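The "array or nil" convention and the way alternatives are combined can be tried out on their own. This is a minimal standalone sketch, assuming each alternative is a callable that returns an array of matches or nil; match_any is a hypothetical helper and works on plain strings in place of elements.

```ruby
# Combine the results of several alternatives for one element.
# Returns nil when nothing matched, otherwise an array of matches.
def match_any(alternatives, element, first_only = false)
  matches = nil
  alternatives.each do |alternative|
    # With first_only, one successful alternative is enough.
    break if matches && first_only
    subset = alternative.call(element)
    next unless subset
    matches = matches ? matches + subset : subset
  end
  matches
end

starts_with_h = ->(name) { name.start_with?("h") ? [name] : nil }
has_digit     = ->(name) { name.match?(/\d/) ? [name] : nil }

p match_any([starts_with_h, has_digit], "h1")
p match_any([starts_with_h, has_digit], "h1", true)
p match_any([starts_with_h, has_digit], "p")
```

Without first_only the same element is reported once per alternative that accepts it, so a caller that wants unique results has to remove duplicates itself.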
next_element(element, name = nil)
Return the next element after this one. Skips sibling text nodes.
With the name argument, returns the next element with that name, skipping other sibling elements.
Source:
  def next_element(element, name = nil)
    if siblings = element.parent.children
      found = false
      siblings.each do |node|
        if node.equal?(element)
          found = true
        elsif found && node.tag?
          return node if (name.nil? || node.name == name)
        end
      end
    end
    nil
  end
select(root) => array
Selects and returns an array with all matching elements, beginning with one node and traversing through all children depth-first. Returns an empty array if no match is found.
The root node may be any element in the document, or the document itself.
For example:
  selector = HTML::Selector.new "input[type=text]"
  matches = selector.select(element)
  matches.each do |match|
    puts "Found text field with name #{match.attributes['name']}"
  end
Source:
  def select(root)
    matches = []
    stack = [root]
    while node = stack.pop
      if node.tag? && subset = match(node, false)
        subset.each do |match|
          matches << match unless matches.any? { |item| item.equal?(match) }
        end
      elsif children = node.children
        stack.concat children.reverse
      end
    end
    matches
  end
select_first(root)
Similar to #select but returns the first matching element. Returns nil if no element matches the selector.
Source:
  def select_first(root)
    stack = [root]
    while node = stack.pop
      if node.tag? && subset = match(node, true)
        return subset.first if !subset.empty?
      elsif children = node.children
        stack.concat children.reverse
      end
    end
    nil
  end
Instance Protected methods
attribute_match(equality, value)
Create a regular expression to match an attribute value based on the equality operator (=, ^=, |=, etc).
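Source:
  def attribute_match(equality, value)
    regexp = value.is_a?(Regexp) ? value : Regexp.escape(value.to_s)
    case equality
      when "=" then
        # Match the attribute value in full.
        Regexp.new("^#{regexp}$")
      when "~=" then
        # Match a space-separated word within the attribute value.
        Regexp.new("(^|\s)#{regexp}($|\s)")
      when "^="
        # Match the beginning of the attribute value.
        Regexp.new("^#{regexp}")
      when "$="
        # Match the end of the attribute value.
        Regexp.new("#{regexp}$")
      when "*="
        # Match a substring of the attribute value.
        regexp.is_a?(Regexp) ? regexp : Regexp.new(regexp)
      when "|=" then
        # Match the first space-separated item of the attribute value.
        Regexp.new("^#{regexp}($|\s)")
      else
        raise InvalidSelectorError, "Invalid operation/value" unless value.empty?
        # Match any attribute value (existence check).
        //
    end
  end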
next_selector(statement, values)
Called to create a dependent selector (sibling, descendant, etc). Passes the remainder of the statement, which will eventually be reduced to zero, and an array of substitution values.
This method is called from four places, so it helps to put it here for reuse. The only logic deals with the need to detect comma separators (alternate) and apply them to the selector group of the top selector.
Source:
  def next_selector(statement, values)
    second = Selector.new(statement, values)
    # Move any alternates found further down into this (top) selector.
    if alternates = second.instance_variable_get(:@alternates)
      second.instance_variable_set(:@alternates, [])
      @alternates.concat alternates
    end
    second
  end
nth_child(a, b, of_type, reverse)
Returns a lambda that can match an element against the nth-child pseudo class, given the following arguments:
a – Value of a part.
b – Value of b part.
of_type – True to test only elements of this type (of-type).
reverse – True to count in reverse order (last-).
Source:
def nth_child(a, b, of_type, reverse)
return lambda { |element| false } if a == 0 && b == 0
return lambda { |element| false } if a < 0 && b < 0
b = a + b + 1 if b < 0
b -= 1 unless b == 0
lambda do |element|
return false unless element.parent && element.parent.tag?
index = 0
siblings = element.parent.children
siblings = siblings.reverse if reverse
name = of_type ? element.name : nil
found = false
for child in siblings
if child.tag? && (name == nil || child.name == name)
if a == 0
if index == b
found = child.equal?(element)
break
end
elsif a < 0
break if index > b
if child.equal?(element)
found = (index % a) == 0
break
end
else
if child.equal?(element)
found = (index % a) == b
break
end
end
index += 1
end
end
found
end
end
only_child(of_type)
Creates an only-child lambda. Pass of_type as true to only look at elements of its type.
Source:
def only_child(of_type)
lambda do |element|
return false unless element.parent && element.parent.tag?
name = of_type ? element.name : nil
other = false
for child in element.parent.children
if child.tag? && (name == nil || child.name == name)
unless child.equal?(element)
other = true
break
end
end
end
!other
end
end
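The only-child test comes down to asking whether any other element sibling exists, optionally counting only siblings of the same type. Here is a minimal standalone sketch of that rule on a list of sibling names; only_child? and its arguments are hypothetical and stand in for the parent and children of a real element.

```ruby
# siblings: names of all element children of one parent, in order.
# index: position of the element under test within that list.
def only_child?(siblings, index, of_type: false)
  others = siblings.each_with_index.reject { |_, i| i == index }.map(&:first)
  # With of_type, siblings of a different type do not count.
  others = others.select { |name| name == siblings[index] } if of_type
  others.empty?
end

puts only_child?(["p"], 0)
puts only_child?(["h1", "p"], 1)
puts only_child?(["h1", "p"], 1, of_type: true)
```

A paragraph next to a heading is therefore not an only child, but it is the only element of its type.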
simple_selector(statement, values, can_negate = true)
Creates a simple selector given the statement and array of substitution values.
Returns a hash with the values tag_name, attributes, pseudo (classes) and negation.
Called the first time with can_negate true to allow negation. Called a second time with false since negation cannot be negated.
Source:
def simple_selector(statement, values, can_negate = true)
tag_name = nil
attributes = []
pseudo = []
negation = []
statement.sub!(/^(\*|[[:alpha:]][\w\-]*)/) do |match|
match.strip!
tag_name = match.downcase unless match == "*"
@source << match
""
end
while true
next if statement.sub!(/^#(\?|[\w\-]+)/) do |match|
id = $1
if id == "?"
id = values.shift
end
@source << "##{id}"
id = Regexp.new("^#{Regexp.escape(id.to_s)}$") unless id.is_a?(Regexp)
attributes << ["id", id]
""
end
next if statement.sub!(/^\.([\w\-]+)/) do |match|
class_name = $1
@source << ".#{class_name}"
class_name = Regexp.new("(^|\s)#{Regexp.escape(class_name)}($|\s)") unless class_name.is_a?(Regexp)
attributes << ["class", class_name]
""
end
next if statement.sub!(/^\[\s*([[:alpha:]][\w\-:]*)\s*((?:[~|^$*])?=)?\s*('[^']*'|"[^*]"|[^\]]*)\s*\]/) do |match|
name, equality, value = $1, $2, $3
if value == "?"
value = values.shift
else
value.strip!
if (value[0] == ?" || value[0] == ?') && value[0] == value[-1]
value = value[1..-2]
end
end
@source << "[#{name}#{equality}'#{value}']"
attributes << [name.downcase.strip, attribute_match(equality, value)]
""
end
next if statement.sub!(/^:root/) do |match|
pseudo << lambda do |element|
element.parent.nil? || !element.parent.tag?
end
@source << ":root"
""
end
next if statement.sub!(/^:nth-(last-)?(child|of-type)\((odd|even|(\d+|\?)|(-?\d*|\?)?n([+\-]\d+|\?)?)\)/) do |match|
reverse = $1 == "last-"
of_type = $2 == "of-type"
@source << ":nth-#{$1}#{$2}("
case $3
when "odd"
pseudo << nth_child(2, 1, of_type, reverse)
@source << "odd)"
when "even"
pseudo << nth_child(2, 2, of_type, reverse)
@source << "even)"
when /^(\d+|\?)$/
b = ($1 == "?" ? values.shift : $1).to_i
pseudo << nth_child(0, b, of_type, reverse)
@source << "#{b})"
when /^(-?\d*|\?)?n([+\-]\d+|\?)?$/
a = ($1 == "?" ? values.shift :
$1 == "" ? 1 : $1 == "-" ? -1 : $1).to_i
b = ($2 == "?" ? values.shift : $2).to_i
pseudo << nth_child(a, b, of_type, reverse)
@source << (b >= 0 ? "#{a}n+#{b})" : "#{a}n#{b})")
else
raise ArgumentError, "Invalid nth-child #{match}"
end
""
end
next if statement.sub!(/^:(first|last)-(child|of-type)/) do |match|
reverse = $1 == "last"
of_type = $2 == "of-type"
pseudo << nth_child(0, 1, of_type, reverse)
@source << ":#{$1}-#{$2}"
""
end
next if statement.sub!(/^:only-(child|of-type)/) do |match|
of_type = $1 == "of-type"
pseudo << only_child(of_type)
@source << ":only-#{$1}"
""
end
next if statement.sub!(/^:empty/) do |match|
pseudo << lambda do |element|
empty = true
for child in element.children
if child.tag? || !child.content.strip.empty?
empty = false
break
end
end
empty
end
@source << ":empty"
""
end
next if statement.sub!(/^:content\(\s*(\?|'[^']*'|"[^"]*"|[^)]*)\s*\)/) do |match|
content = $1
if content == "?"
content = values.shift
elsif (content[0] == ?" || content[0] == ?') && content[0] == content[-1]
content = content[1..-2]
end
@source << ":content('#{content}')"
content = Regexp.new("^#{Regexp.escape(content.to_s)}$") unless content.is_a?(Regexp)
pseudo << lambda do |element|
text = ""
for child in element.children
unless child.tag?
text << child.content
end
end
text.strip =~ content
end
""
end
if statement.sub!(/^:not\(\s*/, "")
raise ArgumentError, "Double negatives are not missing feature" unless can_negate
@source << ":not("
negation << simple_selector(statement, values, false)
raise ArgumentError, "Negation not closed" unless statement.sub!(/^\s*\)/, "")
@source << ")"
next
end
break
end
{:tag_name=>tag_name, :attributes=>attributes, :pseudo=>pseudo, :negation=>negation}
end
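The parsing technique used here, repeatedly stripping one recognised part from the front of the statement with sub! until nothing more matches, can be shown in a much smaller form. The sketch below is a simplified standalone illustration: parse_simple and the keys of the hash it returns are hypothetical, and it handles only a tag name, #id, .class and [attr=value], with no pseudo classes, negation or substitution.

```ruby
# Parse one compound selector such as "form.login[action=/login]".
def parse_simple(expression)
  rest = expression.dup
  parsed = { tag_name: nil, id: nil, classes: [], attributes: {} }

  # The element name, if present, must come first.
  rest.sub!(/\A(\*|[a-zA-Z][\w\-]*)/) do
    parsed[:tag_name] = $1.downcase unless $1 == "*"
    ""
  end

  # The remaining parts may appear in any order; each successful
  # sub! removes one part and restarts the loop.
  loop do
    next if rest.sub!(/\A#([\w\-]+)/) { parsed[:id] = $1; "" }
    next if rest.sub!(/\A\.([\w\-]+)/) { parsed[:classes] << $1; "" }
    next if rest.sub!(/\A\[\s*([\w\-]+)\s*=\s*([^\]]*?)\s*\]/) do
      parsed[:attributes][$1] = $2
      ""
    end
    break
  end

  # Anything left over is something this sketch does not understand.
  raise ArgumentError, "Invalid selector: #{rest}" unless rest.empty?
  parsed
end

p parse_simple("form.login[action=/login]")
```

Since every sub! is anchored at the start of the remaining text, a space or any unknown character stops the loop, which matches the rule that the parts of a simple selector are not separated by spaces.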