If you have a paid subscription (Standard plan or above), you can create rules that determine which links our crawler follows and which links it ignores.
While previously a rule could only be based on the URL of a link (example: `Url ENDSWITH ".gif"`), it’s now also possible to include or exclude links based on where they are found in the HTML code. As an example, the following exclude rule makes our crawler ignore all `<a>` tags found inside HTML elements that have `class="footer"`:
HtmlElement = ".footer a"
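To make the effect of this rule concrete, here is a minimal Python sketch of what "ignore all `<a>` tags inside `.footer` elements" means. It is not how our crawler is implemented; the class name `FooterLinkFilter`, the `VOID_TAGS` set, and the sample markup are all hypothetical, and the sketch assumes reasonably well-formed HTML where every non-void element is closed.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FooterLinkFilter(HTMLParser):
    """Splits <a href> links by whether they sit inside a .footer element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # one flag per open element: has class "footer"?
        self.kept = []           # links the rule would leave alone
        self.ignored = []        # links matched by ".footer a"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Check ancestors only: an <a class="footer"> is not inside itself.
        inside_footer = any(self.open_elements)
        if tag == "a" and "href" in attrs:
            (self.ignored if inside_footer else self.kept).append(attrs["href"])
        if tag not in VOID_TAGS:
            self.open_elements.append("footer" in classes)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.open_elements:
            self.open_elements.pop()


# Hypothetical sample page.
sample_html = """
<div class="content"><a href="/products">Products</a></div>
<div class="site footer"><ul><li><a href="/imprint">Imprint</a></li></ul><br></div>
<a href="/contact">Contact</a>
"""

link_filter = FooterLinkFilter()
link_filter.feed(sample_html)
print(link_filter.kept)     # ['/products', '/contact']
print(link_filter.ignored)  # ['/imprint']
```

Note that the link inside the `<li>` is ignored even though it is several levels below the footer element: a space between two selectors matches descendants at any depth, not just direct children.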
If you have worked with CSS before, the syntax should already be familiar to you. The table below lists the supported ways of selecting HTML elements:
Selector | Description |
---|---|
`.class` | Selects all elements that have the specified class |
`#id` | Selects an element based on the value of its ID attribute |
`element` | Selects all elements that have the specified tag name |
`[attr]` | Selects all elements that have an attribute with the specified name |
`[attr=value]` | Selects all elements that have an attribute with the specified name and value |
`[attr~=value]` | Selects all elements that have an attribute with the specified name and a value containing the specified word (which is delimited by spaces) |
`[attr\|=value]` | Selects all elements that have an attribute with the specified name and a value equal to the specified string or prefixed with that string followed by a hyphen (-) |
`[attr^=value]` | Selects all elements that have an attribute with the specified name and a value beginning with the specified string |
`[attr$=value]` | Selects all elements that have an attribute with the specified name and a value ending with the specified string |
`[attr*=value]` | Selects all elements that have an attribute with the specified name and a value containing the specified string |
`*` | Selects all elements |
`A B` | Selects all elements selected by B that are inside elements selected by A |
`A > B` | Selects all elements selected by B where the parent is an element selected by A |
`A ~ B` | Selects all elements selected by B that follow an element selected by A (with the same parent) |
`A + B` | Selects all elements selected by B that immediately follow an element selected by A (with the same parent) |
`A, B` | Selects all elements selected by A or by B |
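The attribute operators are easy to mix up, so the following Python sketch spells out the comparison each one performs, following the standard CSS definitions listed in the table. The names `OPERATORS` and `attribute_matches` are made up for this illustration, and the comparison is assumed to be case-sensitive; the crawler's actual matching code may differ in such details.

```python
# Each operator maps to the string test it performs on the attribute value.
OPERATORS = {
    "=":  lambda actual, expected: actual == expected,
    # Whole-word match in a space-separated list (e.g. rel="nofollow noopener").
    "~=": lambda actual, expected: expected in actual.split(),
    # Exact match, or the value followed by a hyphen (e.g. "en" matches "en-us").
    "|=": lambda actual, expected: actual == expected
                                   or actual.startswith(expected + "-"),
    # In CSS, the substring operators never match an empty string.
    "^=": lambda actual, expected: expected != "" and actual.startswith(expected),
    "$=": lambda actual, expected: expected != "" and actual.endswith(expected),
    "*=": lambda actual, expected: expected != "" and expected in actual,
}


def attribute_matches(operator, actual, expected):
    """Return True if attribute value `actual` satisfies `[attr<operator>expected]`.

    `actual` is None when the element does not have the attribute at all.
    """
    if actual is None:
        return False
    return OPERATORS[operator](actual, expected)


print(attribute_matches("~=", "nofollow noopener", "nofollow"))  # True
print(attribute_matches("~=", "nofollowed", "nofollow"))         # False
print(attribute_matches("*=", "nofollowed", "nofollow"))         # True
print(attribute_matches("|=", "en-us", "en"))                    # True
print(attribute_matches("|=", "english", "en"))                  # False
print(attribute_matches("$=", "/img/logo.png", ".png"))          # True
```

The second and third calls show the practical difference between `~=` and `*=`: the former only matches complete space-delimited words, while the latter matches any substring.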
Equipped with this knowledge, it’s possible to construct quite powerful rules:
Example | Matched HTML elements |
---|---|
`HtmlElement = "#search > a, #filter > a"` | `<a>` tags directly under elements with the IDs `search` or `filter` |
`HtmlElement = "a[rel~=nofollow]"` | `<a>` tags whose `rel` attribute contains the word `nofollow` |
`HtmlElement = "img[src$=.png]"` | `<img>` tags with an `src` attribute value ending in `.png` |
`HtmlElement = ".comments *"` | Everything inside elements with a `comments` class |
`HtmlElement = "head > link[rel=alternate][hreflang=en-us]"` | All `<link rel="alternate" hreflang="en-us">` elements directly inside the `<head>` tag |
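The first example relies on the difference between `A > B` (direct children only) and `A B` (descendants at any depth). The sketch below illustrates that difference in Python. Since the standard library has no CSS selector engine, it uses the limited XPath support in `xml.etree.ElementTree` as a stand-in, where `/` roughly corresponds to `>` and `//` to a space. The sample markup is hypothetical and has to be well-formed XML for this approach to work; it says nothing about how our crawler parses pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
doc = ET.fromstring("""
<body>
  <div id="search">
    <a href="/direct">direct child</a>
    <p><a href="/nested">nested deeper</a></p>
  </div>
  <div id="other"><a href="/elsewhere">elsewhere</a></div>
</body>
""")

# Rough equivalent of "#search > a": <a> elements whose parent has id="search".
child_links = [a.get("href") for a in doc.findall(".//*[@id='search']/a")]

# Rough equivalent of "#search a": <a> elements anywhere below id="search".
descendant_links = [a.get("href") for a in doc.findall(".//*[@id='search']//a")]

print(child_links)       # ['/direct']
print(descendant_links)  # ['/direct', '/nested']
```

If a rule built with `>` matches fewer links than expected, the links are probably wrapped in an extra element, in which case the space combinator is the better fit.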
This feature has been in beta for a while and has proven to be quite a valuable addition. We hope you will find it as useful as we do. If you run into any problems or have a suggestion, please drop us a note.