The DomCrawler Component (Symfony Docs)
Installation
Note
If you install this component outside of a Symfony application, you must require the vendor/autoload.php file in your code to enable the class autoloading mechanism provided by Composer. Read this article for more details.
Usage
See also
This article explains how to use the DomCrawler features as an independent component in any PHP application. Read the Symfony Functional Tests article to learn how to use it when creating Symfony tests.
The Crawler class provides methods to query and manipulate HTML and XML documents.
An instance of the Crawler represents a set of DOMElement objects, which are nodes that can be traversed as follows:
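For example, a crawler can be created from an HTML string and iterated directly (a minimal sketch; the markup is illustrative and assumes symfony/dom-crawler is installed via Composer):

```php
use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

// iterating a Crawler yields the underlying \DOMElement nodes
foreach ($crawler as $domElement) {
    var_dump($domElement->nodeName);
}
```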
Specialized Link, Image and Form classes are useful for interacting with HTML links, images and forms as you traverse through the HTML tree.
Note
The DomCrawler will attempt to automatically fix your HTML to match the official specification. For example, if you nest a <p> tag inside another <p> tag, it will be moved to be a sibling of the parent tag. This is expected and is part of the HTML5 spec. But if you're getting unexpected behavior, this could be a cause. And while the DomCrawler isn't meant to dump content, you can see the "fixed" version of your HTML by dumping it.
Node Filtering
Using XPath expressions, you can select specific nodes within the document:
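A sketch of XPath filtering on an existing crawler (the expression is illustrative):

```php
// select every <p> element that is a child of <body>
$crawler = $crawler->filterXPath('descendant-or-self::body/p');
```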
Tip
DOMXPath::query is used internally to actually perform an XPath query.
If you prefer CSS selectors over XPath, install The CssSelector Component. It allows you to use jQuery-like selectors:
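A sketch of the same kind of filtering with a CSS selector (assuming symfony/css-selector is installed):

```php
// select every <p> element that is a direct child of <body>
$crawler = $crawler->filter('body > p');
```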
An anonymous function can be used to filter with more complex criteria:
To remove a node, the anonymous function must return false.
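A sketch of filtering with the reduce() method (the keep-every-other-node criterion is illustrative):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = $crawler
    ->filter('body > p')
    ->reduce(function (Crawler $node, $i): bool {
        // returning false removes the node from the resulting Crawler
        return ($i % 2) === 0;
    });
```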
Note
All filter methods return a new Crawler instance with the filtered content. To check if the filter actually found something, use $crawler->count() > 0 on this new crawler.
Both the filterXPath() and filter() methods work with XML namespaces, which can be either automatically discovered or registered explicitly.
Consider the XML below:
This can be filtered with the Crawler, without needing to register namespace aliases, using filterXPath():
and filter():
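The following sketch shows both styles on a hypothetical XML document with a default namespace and a "media" namespace (the XML content and namespace URIs are illustrative, not the sample referenced above):

```php
use Symfony\Component\DomCrawler\Crawler;

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:media="http://search.yahoo.com/mrss/">
    <media:group>
        <media:title>Example title</media:title>
    </media:group>
</entry>
XML;

$crawler = new Crawler($xml);

// with filterXPath(): the default namespace is available under the "default" prefix
$title = $crawler->filterXPath('//default:entry/media:group/media:title');

// with filter(): CSS selectors use "|" as the namespace separator
$title = $crawler->filter('default|entry media|group media|title');
```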
Note
The default namespace is registered with the prefix "default". It can be changed with the setDefaultNamespacePrefix() method.

The default namespace is removed when loading the content if it's the only namespace in the document. This is done to simplify the XPath queries.
Namespaces can be explicitly registered with the registerNamespace() method:
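A sketch of explicit namespace registration (prefix and URI are illustrative):

```php
// map the "m" prefix to a namespace URI, then use it in XPath queries
$crawler->registerNamespace('m', 'http://search.yahoo.com/mrss/');
$crawler = $crawler->filterXPath('//m:group');
```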
Verify if the current node matches a selector:
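A sketch using the matches() method (the selector is illustrative):

```php
// returns true if the node matches the selector, false otherwise
$crawler->filter('body > p')->matches('p.message');
```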
Node Traversing
Access node by its position on the list:
Get the first or last node of the current selection:
Get the nodes of the same level as the current selection:
Get the same level nodes after or before the current selection:
Get all the child or ancestor nodes:
Get all the direct child nodes matching a CSS selector:
Get the first parent (heading toward the document root) of the element that matches the provided selector:
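The traversal methods above can be sketched as follows (selectors are illustrative; ancestors() and the CSS-selector argument to children() are available in recent DomCrawler versions):

```php
// access a node by its position, or get the first/last node
$crawler->filter('body > p')->eq(0);
$crawler->filter('body > p')->first();
$crawler->filter('body > p')->last();

// nodes on the same level as the current selection
$crawler->filter('body > p')->siblings();
$crawler->filter('body > p')->nextAll();
$crawler->filter('body > p')->previousAll();

// child and ancestor nodes
$crawler->filter('body')->children();
$crawler->filter('body > p')->ancestors();

// direct children matching a CSS selector (needs symfony/css-selector)
$crawler->filter('body')->children('p.message');

// first parent (toward the document root) matching the selector
$crawler->filter('p')->closest('div.content');
```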
Note
All the traversal methods return a new Crawler instance.
Accessing Node Values
Access the node name (HTML tag name) of the first node of the current selection (e.g. "p" or "div"):
Access the value of the first node of the current selection:
Access the attribute value of the first node of the current selection:
Tip
You can define the default value to use if the node or attribute is empty by using the second argument of the attr() method:
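A sketch of the value-access methods above (selectors and the "default-class" fallback are illustrative; the second argument of attr() is available in recent DomCrawler versions):

```php
// node name (HTML tag name) of the first node, e.g. "p"
$tag = $crawler->filter('body > p')->nodeName();

// text value of the first node
$text = $crawler->filter('body > p')->text();

// attribute value of the first node
$class = $crawler->filter('body > p')->attr('class');

// fallback value used when the attribute is empty or missing
$class = $crawler->filter('body > p')->attr('class', 'default-class');
```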
Extract attribute and/or node values from the list of nodes:
Note
The special attribute _text represents a node value, while _name represents the element name (the HTML tag name).
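A sketch of extract() (the attribute list is illustrative):

```php
// one entry per node: [element name, node text, class attribute]
$data = $crawler->filter('body > p')->extract(['_name', '_text', 'class']);
```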
Call an anonymous function on each node of the list:
The anonymous function receives the node (as a Crawler) and the position as arguments. The result is an array of values returned by the anonymous function calls.
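A sketch of each() (collecting node texts is illustrative):

```php
use Symfony\Component\DomCrawler\Crawler;

// returns an array with the text of every matched node
$texts = $crawler->filter('body > p')->each(function (Crawler $node, $i): string {
    return $node->text();
});
```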
When using nested crawlers, beware that filterXPath() is evaluated in the context of the crawler:
Adding the Content
The crawler supports multiple ways of adding the content, but they are mutually exclusive, so you can only use one of them to add content (e.g. if you pass the content to the Crawler constructor, you can't call addContent() later):
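The alternative ways of adding content can be sketched as follows (one fresh crawler per approach, since they are mutually exclusive; the markup is illustrative):

```php
use Symfony\Component\DomCrawler\Crawler;

// via the constructor...
$crawler = new Crawler('<html><body><p>Hello</p></body></html>');

// ...or via one of the add*() methods on an empty crawler
$crawler = new Crawler();
$crawler->addHtmlContent('<html><body><p>Hello</p></body></html>');

$crawler = new Crawler();
$crawler->addXmlContent('<root><node>Hello</node></root>');

$crawler = new Crawler();
$crawler->addContent('<html><body><p>Hello</p></body></html>');
```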
Note
The addHtmlContent() and addXmlContent() methods default to UTF-8 encoding but you can change this behavior with their second optional argument.

The addContent() method guesses the best charset according to the given contents and defaults to ISO-8859-1 in case no charset can be guessed.
As the Crawler's implementation is based on the DOM extension, it is also able to interact with native DOMDocument, DOMNodeList and DOMNode objects:
Expression Evaluation
The evaluate() method evaluates the given XPath expression. The return value depends on the XPath expression. If the expression evaluates to a scalar value (e.g. HTML attributes), an array of results will be returned. If the expression evaluates to a DOM document, a new Crawler instance will be returned.
This behavior is best illustrated with examples:
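For instance, a sketch under an illustrative document (the markup and expressions are assumptions):

```php
use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body>
    <span id="article-100">Article 100</span>
    <span id="article-101">Article 101</span>
</body></html>';

$crawler = (new Crawler($html))->filterXPath('//span[contains(@id, "article-")]');

// a string-valued expression yields an array (one entry per selected node)
$ids = $crawler->evaluate('substring-after(@id, "-")');

// a numeric expression also yields an array of scalars
$counts = $crawler->evaluate('count(@id)');

// an expression selecting nodes yields a new Crawler instance
$first = $crawler->evaluate('//span[1]');
```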
Links
Use the filter() method to find links by their id or class attributes and use the selectLink() method to find links by their content (it also finds clickable images with that content in their alt attribute).
Both methods return a Crawler instance with just the selected link. Use the link() method to get the Link object that represents the link:
The Link object has several useful methods to get more information about the selected link itself:
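A sketch of selecting a link and reading it back (the markup, link text and base URI are illustrative):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><a href="/blog">Read more</a></body></html>',
    'https://example.com/'
);

$link = $crawler->selectLink('Read more')->link();

$uri = $link->getUri();       // href resolved against the crawler's base URI
$method = $link->getMethod(); // "GET" for regular links
$node = $link->getNode();     // the underlying \DOMElement
```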
Note
The getUri() method is especially useful as it cleans the href value and transforms it into how it should really be processed. For example, for a link with href="#foo", this would return the full URI of the current page suffixed with #foo. The return value of getUri() is always a full URI that you can act on.
Images
To find an image by its alt attribute, use the selectImage() method on an existing crawler. This returns a Crawler instance with just the selected image(s). Calling image() gives you a special Image object:
The Image object has the same getUri() method as Link.
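A sketch of selecting an image (the markup, alt text and base URI are illustrative):

```php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><img src="/images/logo.png" alt="logo"/></body></html>',
    'https://example.com/'
);

$image = $crawler->selectImage('logo')->image();
$uri = $image->getUri(); // src resolved against the crawler's base URI
```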
Forms
Special treatment is also given to forms. A selectButton() method is available on the Crawler which returns another Crawler that matches <button>, <input type="submit"> or <input type="button"> elements (or an <img> element inside them). The string given as argument is looked for in the id, alt, name and value attributes and the text content of those elements.
This method is especially useful because you can use it to return a Form object that represents the form that the button lives in:
The Form object has lots of very useful methods for working with forms:
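A sketch of obtaining and inspecting a Form ("Sign up" is a hypothetical button label):

```php
// find the form via one of its buttons
$form = $crawler->selectButton('Sign up')->form();

$uri = $form->getUri();       // the resolved form action
$method = $form->getMethod(); // "GET" or "POST"
$values = $form->getValues(); // current field values
$files = $form->getFiles();   // current file-upload fields
```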
The getUri() method does more than just return the action attribute of the form. If the form method is GET, then it mimics the browser's behavior and returns the action attribute followed by a query string of all of the form's values.
Note
The optional formaction and formmethod button attributes are supported. The getUri() and getMethod() methods take into account those attributes to always return the right action and method depending on the button used to get the form.
You can virtually set and get values on the form:
To work with multi-dimensional fields:
Pass an array of values:
This is great, but it gets better! The Form object allows you to interact with your form like a browser, selecting radio values, ticking checkboxes, and uploading files:
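The steps above can be sketched as follows (all field names and values are hypothetical):

```php
// set and get a value via array access
$form['name'] = 'Fabien';
$value = $form['name']->getValue();

// multi-dimensional fields use the full field name
$form['registration[username]']->setValue('symfonyfan');

// pass an array of values
$form->setValues([
    'registration[username]' => 'symfonyfan',
    'registration[terms]' => 1,
]);

// interact with the form like a browser would
$form['country']->select('France');           // pick a <select> option
$form['newsletter']->tick();                  // tick a checkbox
$form['photo']->upload('/path/to/photo.jpg'); // attach a file to a file input
```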
Using the Form Data
What's the point of doing all of this? If you're testing internally, you can grab the information off of your form as if it had just been submitted by using the PHP values:
If you're using an external HTTP client, you can use the form to grab all of the information you need to create a POST request for the form:
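Both cases can be sketched as follows:

```php
// as PHP would see the values after submission (for internal testing)
$phpValues = $form->getPhpValues();
$phpFiles = $form->getPhpFiles();

// raw values and metadata for building an HTTP request yourself
$uri = $form->getUri();
$method = $form->getMethod();
$values = $form->getValues();
$files = $form->getFiles();
```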
One great example of an integrated system that uses all of this is the HttpBrowser provided by the BrowserKit component. It understands the Symfony Crawler object and can use it to submit forms directly:
Selecting Invalid Choice Values
By default, choice fields (select, radio) have internal validation activated to prevent you from setting invalid values. If you want to be able to set invalid values, you can use the disableValidation() method on either the whole form or specific field(s):
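A sketch of both options (the "country" field name and value are hypothetical):

```php
// disable validation on the whole form
$form->disableValidation();
$form['country']->select('Invalid value');

// or only on a specific field
$form['country']->disableValidation()->select('Invalid value');
```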
Resolving a URI
The UriResolver class takes a URI (relative, absolute, fragment, etc.) and turns it into an absolute URI against another given base URI:
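A sketch of URI resolution (the URIs are illustrative):

```php
use Symfony\Component\DomCrawler\UriResolver;

UriResolver::resolve('/foo', 'http://localhost/bar/foo'); // http://localhost/foo
UriResolver::resolve('?a=b', 'http://localhost/bar#foo'); // http://localhost/bar?a=b
UriResolver::resolve('../../', 'http://localhost/');      // http://localhost/
```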