HTML::Element
BASIC METHODS
new
attr
tag
parent
content_list
content
content_array_ref
content_refs_list
implicit
pos
all_attr
all_attr_names
all_external_attr
all_external_attr_names
id
idf
STRUCTURE-MODIFYING METHODS
push_content
unshift_content
splice_content
detach
detach_content
replace_with
preinsert
postinsert
replace_with_content
delete_content
delete
destroy
destroy_content
clone
clone_list
normalize_content
delete_ignorable_whitespace
insert_element
DUMPING METHODS
dump
as_HTML
as_text
as_trimmed_text
as_XML
as_Lisp_form
format
starttag
starttag_XML
endtag
endtag_XML
SECONDARY STRUCTURAL METHODS
is_inside
is_empty
pindex
left
right
address
depth
root
lineage
lineage_tag_names
descendants
descendents
find_by_tag_name
find
find_by_attribute
look_down
look_up
traverse
attr_get_i
tagname_map
extract_links
simplify_pres
same_as
new_from_lol
objectify_text
deobjectify_text
number_lists
has_insane_linkage
element_class
SYNOPSIS (typical usage)
use HTML::Element;
$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$a->push_content("The Perl Homepage");
$tag = $a->tag;
print "$tag starts out as:", $a->starttag, "\n";
print "$tag ends as:", $a->endtag, "\n";
print "$tag\'s href attribute is: ", $a->attr('href'), "\n";
$links_r = $a->extract_links();
print "Hey, I found ", scalar(@$links_r), " links.\n";
print "And that, as HTML, is: ", $a->as_HTML, "\n";
$a = $a->delete;
DESCRIPTION
(This class is part of the HTML::Tree dist.)
Objects of the HTML::Element class can be used to represent elements of HTML document trees. These objects have attributes, notably attributes that designate each element's parent and content. The content is an array of text segments and other HTML::Element objects. A tree with HTML::Element objects as nodes can represent the syntax tree for an HTML document.
HOW WE REPRESENT TREES (how the node tree of an HTML file is represented)
Consider this HTML document:
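<html lang='en-US'>
  <head>
    <title>Stuff</title>
    <meta name='author' content='Jojo'>
  </head>
  <body>
    <h1>I like potatoes!</h1>
  </body>
</html>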
Building a syntax tree out of it makes a tree-structure in memory that could be diagrammed as:
                   html (lang='en-US')
                    / \
                  /     \
                /         \
              head        body
             /\               \
           /    \               \
         /        \               \
       title     meta              h1
        |       (name='author',     |
     "Stuff"    content='Jojo')    "I like potatoes"
This is the traditional way to diagram a tree, with the "root" at the top, and it's this kind of diagram that people have in mind when they say, for example, that "the meta element is under the head element instead of under the body element". (The same is also said with "inside" instead of "under" -- the use of "inside" makes more sense when you're looking at the HTML source.)
Another way to represent the above tree is with indenting:
html (attributes: lang='en-US')
  head
    title
      "Stuff"
    meta (attributes: name='author' content='Jojo')
  body
    h1
      "I like potatoes"
Incidentally, diagramming with indenting works much better for very large trees, and is easier for a program to generate. The $tree->dump method uses indentation just that way.
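As a rough illustration, the example tree can be built by hand and printed with dump. This sketch assumes the new_from_lol constructor (listed among this class's methods) is used to build the tree from nested array references; the variable name $tree is only for illustration, and the exact layout that dump prints may vary between versions.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example tree from a "list of lists" description.
my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

# Print an indented outline of the tree to STDOUT (debugging format).
$tree->dump;
```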
However you diagram the tree, it's stored the same in memory -- it's a network of objects, each of which has attributes like so:
element #1: _tag: 'html'
            _parent: none
            _content: [element #2, element #5]
            lang: 'en-US'

element #2: _tag: 'head'
            _parent: element #1
            _content: [element #3, element #4]

element #3: _tag: 'title'
            _parent: element #2
            _content: [text segment "Stuff"]

element #4: _tag: 'meta'
            _parent: element #2
            _content: none
            name: author
            content: Jojo

element #5: _tag: 'body'
            _parent: element #1
            _content: [element #6]

element #6: _tag: 'h1'
            _parent: element #5
            _content: [text segment "I like potatoes"]
The "treeness" of the tree-structure that these elements comprise is not an aspect of any particular object, but is emergent from the relatedness attributes (_parent and _content) of these element-objects and from how you use them to get from element to element.
While you could access the content of a tree by writing code that says "access the 'src' attribute of the root's first child's seventh child's third child", you're more likely to have to scan the contents of a tree, looking for whatever nodes, or kinds of nodes, you want to do something with. The most straightforward way to look over a tree is to "traverse" it; an HTML::Element method ($h->traverse) is provided for this purpose; and several other HTML::Element methods are based on it.
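The difference between the two styles of access can be sketched as follows. This is only an illustration: the tree is built with new_from_lol, the scan uses look_down (one of the searching methods listed for this class) rather than traverse itself, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $tree = HTML::Element->new_from_lol(
    ['html',
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body', ['h1', 'I like potatoes!']],
    ]
);

# Hard-coded path: the root's first child's second child.
my $head    = ($tree->content_list)[0];
my $by_path = ($head->content_list)[1];

# Scanning: the first element anywhere at or under the root
# whose tag name is "meta".
my $by_search = $tree->look_down(_tag => 'meta');

print $by_path->attr('content'),   "\n";
print $by_search->attr('content'), "\n";
```

The hard-coded path breaks as soon as the document layout changes, whereas the scan keeps working wherever the meta element ends up.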
BASIC METHODS
new (constructs a new object)
$h = HTML::Element->new('tag', 'attrname' => 'value', ... );
This constructor method returns a new HTML::Element object. The tag name is a required argument; it will be forced to lowercase. Optionally, you can specify other initial attributes at object creation time.
attr (returns, or sets, the value of the given attribute of the element)
$value = $h->attr('attr');
$old_value = $h->attr('attr', $new_value);
Returns (optionally sets) the value of the given attribute of $h. The attribute name (but not the value, if provided) is forced to lowercase. If trying to read the value of an attribute not present for this element, the return value is undef. If setting a new value, the old value of that attribute is returned.
If methods are provided for accessing an attribute (like $h->tag for "_tag", $h->content_list, etc. below), use those instead of calling $h->attr, whether for reading or setting.
Note that setting an attribute to undef (as opposed to "", the empty string) actually deletes the attribute.
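A small sketch of the three behaviours described above (reading, setting, and deleting). The element and the variable names are invented for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# Reading: the attribute name is forced to lowercase, so 'HREF' finds 'href'.
my $href = $link->attr('HREF');

# Setting: the old value is returned.
my $old = $link->attr('href', 'http://www.perl.org/');

# A missing attribute reads as undef.
my $missing = $link->attr('title');

# Setting to undef deletes the attribute; setting to "" would keep it.
$link->attr('href', undef);
my $after_delete = $link->attr('href');
```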
tag (returns, or sets, the element's tag name)
$tagname = $h->tag();
$h->tag('tagname');
Returns (optionally sets) the tag name (also known as the generic identifier) for the element $h. In setting, the tag name is always converted to lower case.
There are four kinds of "pseudo-elements" that show up as HTML::Element objects:
Comment pseudo-elements
These are element objects with a $h->tag value of "~comment", and the content of the comment is stored in the "text" attribute ($h->attr("text")). For example, parsing this code with HTML::TreeBuilder...
<!-- I like Pie.
 Pie is good
 -->
produces an HTML::Element object with these attributes:
"_tag",
"~comment",
"text",
" I like Pie.\n Pie is good\n "
Declaration pseudo-elements
Declarations (rarely encountered) are represented as HTML::Element objects with a tag name of "~declaration", and content in the "text" attribute. For example, this:
<!DOCTYPE foo>
produces an element whose attributes include:
"_tag", "~declaration", "text", "DOCTYPE foo"
Processing instruction pseudo-elements
PIs (rarely encountered) are represented as HTML::Element objects with a tag name of "~pi", and content in the "text" attribute. For example, this:
<?stuff foo?>
produces an element whose attributes include:
"_tag", "~pi", "text", "stuff foo?"
(assuming a recent version of HTML::Parser)
~literal pseudo-elements
These objects are not currently produced by HTML::TreeBuilder, but can be used to represent a "super-literal" -- i.e., a literal you want to be immune from escaping. (Yes, I just made that term up.)
That is, this is useful if you want to insert code into a tree that you plan to dump out with as_HTML, where you want, for some reason, to suppress as_HTML's normal behavior of amp-quoting text segments.
For example, this:
my $literal = HTML::Element->new('~literal',
'text' => 'x < 4 & y > 7'
);
my $span = HTML::Element->new('span');
$span->push_content($literal);
print $span->as_HTML;
prints this:
<span>x < 4 & y > 7</span>
Whereas this:
my $span = HTML::Element->new('span');
$span->push_content('x < 4 & y > 7');
# normal text segment
print $span->as_HTML;
prints this:
<span>x &lt; 4 &amp; y &gt; 7</span>
Unless you're inserting lots of pre-cooked code into existing trees, and dumping them out again, it's not likely that you'll find ~literal pseudo-elements useful.
parent (returns the parent of the current node)
$parent = $h->parent();
$h->parent($new_parent);
Returns (optionally sets) the parent (aka "container") for this element. The parent should either be undef, or should be another element.
You should not use this to directly set the parent of an element. Instead use any of the other methods under "Structure-Modifying Methods", below.
Note that not($h->parent) is a simple test for whether $h is the root of its subtree.
content_list (returns a list of all child nodes of the current node)
@content = $h->content_list();
$num_children = $h->content_list();
Returns a list of the child nodes of this element -- i.e., what nodes (elements or text segments) are inside/under this element. (Note that this may be an empty list.)
In a scalar context, this returns the count of the items, as you may expect.
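For instance, a minimal sketch of the list and scalar uses (the element and its content are made up for the example):

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('Hello, ', HTML::Element->new('b'), ' world');

# List context: the child nodes themselves (strings and elements).
my @children = $p->content_list;

# Scalar context: just the number of child nodes.
my $count = $p->content_list;

# An element with no content yields an empty list, and a count of 0.
my $empty = HTML::Element->new('p')->content_list;

foreach my $child (@children) {
    print ref($child) ? 'element: ' . $child->tag : "text: '$child'", "\n";
}
```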
content (returns a reference to the array of child nodes of the current node)
$content_array_ref = $h->content(); # may return undef
This somewhat deprecated method returns the content of this element; but unlike content_list, this returns either undef (which you should understand to mean no content), or a reference to the array of content items, each of which is either a text segment (a string, i.e., a defined non-reference scalar value), or an HTML::Element object. Note that even if an arrayref is returned, it may be a reference to an empty array.
While older code should feel free to continue to use $h->content, new code should use $h->content_list in almost all conceivable cases. It is my experience that in most cases this leads to simpler code anyway, since it means one can say:
@children = $h->content_list;
instead of the inelegant:
@children = @{$h->content || []};
If you do use $h->content (or $h->content_array_ref), you should not use the reference returned by it (assuming it returned a reference, and not undef) to directly set or change the content of an element or text segment! Instead use content_refs_list or any of the other methods under "Structure-Modifying Methods", below.
content_array_ref (returns a reference to the array of child nodes; never undef)
$content_array_ref = $h->content_array_ref(); # never undef
This is like content (with all its caveats and deprecations) except that it is guaranteed to return an array reference. That is, if the given node has no _content attribute, the content method would return that undef, but content_array_ref would set the given node's _content value to [] (a reference to a new, empty array), and return that.
content_refs_list (returns a list of references to each child node of the current node)
@content_refs = $h->content_refs_list;
This returns a list of scalar references to each element of $h's content list. This is useful in case you want to in-place edit any large text segments without having to get a copy of the current value of that segment, modify that copy, then use splice_content to replace the old with the new. Instead, here you can in-place edit:
foreach my $item_r ($h->content_refs_list) {
next if ref $$item_r;
$$item_r =~ s/honour/honor/g;
}
You could currently achieve the same effect with:
foreach my $item (@{ $h->content_array_ref }) {
# deprecated!
next if ref $item;
$item =~ s/honour/honor/g;
}
...except that using the return value of $h->content or $h->content_array_ref to do that is deprecated, and just might stop working in the future.
implicit
$is_implicit = $h->implicit();
$h->implicit($make_implicit);
Returns (optionally sets) the "_implicit" attribute. This attribute is a flag that's used for indicating that the element was not originally present in the source, but was added to the parse tree (by HTML::TreeBuilder, for example) in order to conform to the rules of HTML structure.
pos (returns, or sets, the current-position pointer)
$pos = $h->pos();
$h->pos($element);
Returns (and optionally sets) the "_pos" (for "current position") pointer of $h. This attribute is a pointer used during some parsing operations, whose value is whatever HTML::Element element at or under $h is currently "open", where $h->insert_element(NEW) will actually insert a new element.
(This has nothing to do with the Perl function called pos, for controlling where regular expression matching starts.)
If you set $h->pos($element), be sure that $element is either $h, or an element under $h.
If you've been modifying the tree under $h and are no longer sure $h->pos is valid, you can enforce validity with:
$h->pos(undef) unless $h->pos->is_inside($h);
all_attr (returns all attributes and values of the current node as key-value pairs)
%attr = $h->all_attr();
Returns all this element's attributes and values, as key-value pairs. This will include any "internal" attributes (i.e., ones not present in the original element, and which will not be represented if/when you call $h->as_HTML). Internal attributes are distinguished by the fact that the first character of their key (not value! key!) is an underscore ("_").
Example output of $h->all_attr() : '_parent', [object_value] , '_tag', 'em', 'lang', 'en-US', '_content', [array-ref value].
all_attr_names (returns a list of the names of all attributes of the current node)
@names = $h->all_attr_names();
$num_attrs = $h->all_attr_names();
Like all_attr, but only returns the names of the attributes. In scalar context, returns the number of attributes.
Example output of $h->all_attr_names() : '_parent', '_tag', 'lang', '_content'.
all_external_attr
%attr = $h->all_external_attr();
Like all_attr, except that internal attributes are not present.
all_external_attr_names
@names = $h->all_external_attr_names();
$num_attrs = $h->all_external_attr_names();
Like all_attr_names, except that internal attributes' names are not present (or counted).
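The difference between the "all" and "external" variants can be sketched like this (the element is invented for the example; the order of the returned names is not guaranteed, so the sketch sorts them):

```perl
use strict;
use warnings;
use HTML::Element;

my $em = HTML::Element->new('em', lang => 'en-US');

# Every attribute, including internal ones whose names start with "_".
my %all = $em->all_attr;

# Only the attributes that would show up in the HTML output.
my %external       = $em->all_external_attr;
my @external_names = sort $em->all_external_attr_names;

print "all:      ", join(', ', sort keys %all), "\n";
print "external: ", join(', ', @external_names), "\n";
```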
id (returns, or sets, the "id" attribute of the current node)
$id = $h->id();
$h->id($string);
Returns (optionally sets to $string) the "id" attribute. $h->id(undef) deletes the "id" attribute.
$h->id(...) is basically equivalent to $h->attr('id', ...), except that when setting the attribute, this method returns the new value, not the old value.
idf
$id = $h->idf();
$h->idf($string);
Just like the id method, except that if you call $h->idf() and no "id" attribute is defined for this element, then it's set to a likely-to-be-unique value, and returned. (The "f" is for "force".)
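A short sketch contrasting id, idf, and attr('id', ...). The element is made up for the example, and the generated value is deliberately not shown, since its form is up to the implementation.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new('div');

my $before = $div->id;     # no "id" attribute yet, so undef
my $forced = $div->idf;    # generates an id, stores it, and returns it
my $again  = $div->idf;    # now simply returns the stored id

# id returns the NEW value when setting...
my $set = $div->id('main');

# ...whereas attr returns the OLD value.
my $old = $div->attr('id', 'sidebar');
```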
STRUCTURE-MODIFYING METHODS (methods that modify the tree structure)
These methods are provided for modifying the content of trees by adding or changing nodes as parents or children of other nodes.
push_content (appends items after the last child of the current node)
$h->push_content($element_or_text, ...);
Adds the specified items to the end of the content list of the element $h. The items of content to be added should each be either a text segment (a string), an HTML::Element object, or an arrayref. Arrayrefs are fed thru $h->new_from_lol(that_arrayref) to convert them into elements, before being added to the content list of $h. This means you can say concise things like:
$body->push_content(
['br'],
['ul',
map ['li', $_], qw(Peaches Apples Pears Mangos)
]
);
See the "new_from_lol" method's documentation, far below, for more explanation.
Returns $h (the element itself).
The push_content method will try to consolidate adjacent text segments while adding to the content list. That's to say, if $h's content_list is
('foo bar ', $some_node, 'baz!')
and you call
$h->push_content('quack?');
then the resulting content list will be this:
('foo bar ', $some_node, 'baz!quack?')
and not this:
('foo bar ', $some_node, 'baz!', 'quack?')
If that latter is what you want, you'll have to override the feature of consolidating text by using splice_content, as in:
$h->splice_content(scalar($h->content_list),0,'quack?');
Similarly, if you wanted to add 'Skronk' to the beginning of the content list, calling this:
$h->unshift_content('Skronk');
then the resulting content list will be this:
('Skronkfoo bar ', $some_node, 'baz!')
and not this:
('Skronk', 'foo bar ', $some_node, 'baz!')
What you'd do to get the latter is:
$h->splice_content(0,0,'Skronk');
unshift_content (inserts items before the first child of the current node)
$h->unshift_content($element_or_text, ...)
Just like push_content, but adds to the beginning of the $h element's content list.
The items of content to be added should each be either a text segment (a string), an HTML::Element object, or an arrayref (which is fed thru new_from_lol).
The unshift_content method will try to consolidate adjacent text segments while adding to the content list. See above for a discussion of this.
Returns $h (the element itself).
splice_content (removes and/or replaces items in the current node's content list, given an offset and a length)
@removed = $h->splice_content($offset, $length,
$element_or_text, ...);
Detaches the elements from $h's list of content-nodes, starting at $offset and continuing for $length items, replacing them with the elements of the following list, if any. Returns the elements (if any) removed from the content-list. If $offset is negative, then it starts that far from the end of the array, just like Perl's normal splice function. If $length and the following list are omitted, removes everything from $offset onward.
The items of content to be added (if any) should each be either a text segment (a string), an arrayref (which is fed thru "new_from_lol"), or an HTML::Element object that's not already a child of $h.
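As a sketch of how this behaves in practice, the following replaces the middle item of a three-item list. The list and its contents are invented for the example, and as_text is used only to show what each node holds.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two'], ['li', 'three']]
);

# Starting at offset 1, remove 1 item and put a new <li> in its place.
# The arrayref is converted to an element via new_from_lol.
my @removed = $ul->splice_content(1, 1, ['li', 'TWO']);

my @now = map { $_->as_text } $ul->content_list;
print "removed: ", $removed[0]->as_text, "\n";
print "now:     @now\n";
```

The removed element is detached, not destroyed, so it can be reused elsewhere or deleted explicitly.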
detach
$old_parent = $h->detach();
This unlinks $h from its parent, by setting its 'parent' attribute to undef, and by removing it from the content list of its parent (if it had one). The return value is the parent that was detached from (or undef, if $h had no parent to start with). Note that neither $h nor its parent are explicitly destroyed.
detach_content (detaches and returns all child nodes of the current node)
@old_content = $h->detach_content();
This unlinks all of $h's children from $h, and returns them. Note that these are not explicitly destroyed; for that, you can just use $h->delete_content.
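A minimal sketch of detach and detach_content together, using an invented two-paragraph div:

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(['div', ['p', 'a'], ['p', 'b']]);
my ($first) = $div->content_list;

# detach unlinks one node and returns the parent it was detached from.
my $old_parent = $first->detach;

# detach_content unlinks all remaining children and returns them.
my @rest      = $div->detach_content;
my $remaining = $div->content_list;    # scalar context: child count

print "detached from: ", $old_parent->tag, "\n";
print "children left: $remaining\n";
```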
replace_with (replaces the current node with the given nodes)
$h->replace_with( $element_or_text, ... )
This replaces $h in its parent's content list with the nodes specified. The element $h (which by then may have no parent) is returned. This causes a fatal error if $h has no parent. The list of nodes to insert may contain $h, but at most once. Aside from that possible exception, the nodes to insert should not already be children of $h's parent.
Also, note that this method does not destroy $h if weak references are turned off -- use $h->replace_with(...)->delete if you need that.
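For example, here is a sketch that swaps a b element for an em element. The paragraph and variable names are made up, and look_down is used only as a convenient way to find the node.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'Use ', ['b', 'bold'], ' text']
);
my $bold = $p->look_down(_tag => 'b');

my $em = HTML::Element->new('em');
$em->push_content('bold');

# Put $em where $bold was; the replaced element is returned.
my $replaced      = $bold->replace_with($em);
my $was_orphaned  = !defined $replaced->parent;
my $new_child_tag = ($p->content_list)[1]->tag;

# Explicitly destroy the replaced element once it is no longer needed.
$replaced->delete;
```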
preinsert (inserts the given nodes before the current node)
$h->preinsert($element_or_text...);
Inserts the given nodes right BEFORE $h in $h's parent's content list. This causes a fatal error if $h has no parent. None of the given nodes should be $h or other children of $h. Returns $h.
postinsert (inserts the given nodes after the current node)
$h->postinsert($element_or_text...)
Inserts the given nodes right AFTER $h in $h's parent's content list. This causes a fatal error if $h has no parent. None of the given nodes should be $h or other children of $h. Returns $h.
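A small sketch of preinsert and postinsert around one list item (the list is invented for the example):

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(['ul', ['li', 'b']]);
my ($li) = $ul->content_list;

# New siblings go into the PARENT's content list, around $li.
$li->preinsert(HTML::Element->new_from_lol(['li', 'a']));
$li->postinsert(HTML::Element->new_from_lol(['li', 'c']));

my @items = map { $_->as_text } $ul->content_list;
print "@items\n";    # prints "a b c"
```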
replace_with_content (replaces the current node with its own child nodes)
$h->replace_with_content();
This replaces $h in its parent's content list with its own content. The element $h (which by then has no parent or content of its own) is returned. This causes a fatal error if $h has no parent. Also, note that this does not destroy $h if weak references are turned off -- use $h->replace_with_content->delete if you need that.
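One common use is "unwrapping" an element while keeping its text. A sketch, with an invented paragraph and link:

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'See ',
        ['a', { href => 'http://www.perl.com/' }, 'the Perl Homepage'],
        '.',
    ]
);
my $link = $p->look_down(_tag => 'a');

# The link's text moves up into the paragraph; the <a> itself is removed.
my $unwrapped = $link->replace_with_content;

my $still_linked = $p->look_down(_tag => 'a');    # undef: no <a> left
print $p->as_text, "\n";

# $unwrapped now has neither parent nor content; destroy it explicitly.
$unwrapped->delete;
```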
delete_content (deletes the child nodes of the current node)
$h->delete_content();
$h->destroy_content(); # alias
Clears the content of $h, calling $h->delete for each content element. Compare with $h->detach_content.
Returns $h.
destroy_content is an alias for this method.
delete (deletes the current node and all its descendants)
$h->delete();
$h->destroy(); # alias
Detaches this element from its parent (if it has one) and explicitly destroys the element and all its descendants. The return value is the empty list (or undef in scalar context).
Before version 5.00 of HTML::Element, you had to call delete when you were finished with the tree, or your program would leak memory. This is no longer necessary if weak references are enabled, see "Weak References".
destroy (alias for delete)
An alias for "delete".
destroy_content
An alias for "delete_content".
clone (returns a copy of the current node)
$copy = $h->clone();
Returns a copy of the element (whose children are clones (recursively) of the original's children, if any).
The returned element is parentless. Any '_pos' attributes present in the source element/tree will be absent in the copy. For that and other reasons, the clone of an HTML::TreeBuilder object that's in mid-parse (i.e, the head of a tree that HTML::TreeBuilder is elaborating) cannot (currently) be used to continue the parse.
You are free to clone HTML::TreeBuilder trees, just as long as: 1) they're done being parsed, or 2) you don't expect to resume parsing into the clone. (You can continue parsing into the original; it is never affected.)
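A sketch showing that a clone is parentless and independent of the original (the tree is invented for the example):

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(['div', ['p', 'text']]);
my ($p) = $div->content_list;

my $copy = $p->clone;

# Changing the copy leaves the original untouched.
$copy->attr('class', 'copied');
$copy->push_content(' and more');

print "original: ", $p->as_text,    "\n";
print "copy:     ", $copy->as_text, "\n";
```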
clone_list (copies the given list of nodes)
@copies = HTML::Element->clone_list(...nodes...);
Returns a list consisting of a copy of each node given. Text segments are simply copied; elements are cloned by calling $it->clone on each of them.
Note that this must be called as a class method, not as an instance method. clone_list will croak if called as an instance method. You can also call it like so:
ref($h)->clone_list(...nodes...)
normalize_content
$h->normalize_content
Normalizes the content of $h -- i.e., concatenates any adjacent text nodes. (Any undefined text segments are turned into empty-strings.) Note that this does not recurse into $h's descendants.
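To see the effect, adjacent text segments first have to be created without consolidation. This sketch uses splice_content for that, since push_content would merge them on the way in; the element and strings are made up.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');

# splice_content does not consolidate text, so 'foo' and 'bar' stay separate.
$p->splice_content(0, 0, 'foo', 'bar', HTML::Element->new('br'), 'baz');
my $before = $p->content_list;    # scalar context: number of items

$p->normalize_content;
my $after = $p->content_list;
my $first = ($p->content_list)[0];

print "items before: $before, after: $after, first: '$first'\n";
```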
delete_ignorable_whitespace (deletes text segments that are ignorable whitespace)
$h->delete_ignorable_whitespace
This traverses under $h and deletes any text segments that are ignorable whitespace. You should not use this if $h is under a <pre> element.
insert_element (inserts an element at the current position)
$h->insert_element($element, $implicit);
Inserts (via push_content) a new element under the element at $h->pos(). Then updates $h->pos() to point to the inserted element, unless $element is a prototypically empty element like <br>, <hr>, <img>, etc. The new $h->pos() is returned. This method is useful only if your particular tree task involves setting $h->pos().
DUMPING METHODS (output methods)
dump (prints the structure of this node, for debugging)
$h->dump()
$h->dump(*FH) ; # or *FH{IO} or $fh_obj
Prints the element and all its children to STDOUT (or to a specified filehandle), in a format useful only for debugging. The structure of the document is shown by indentation (no end tags).
as_HTML (returns the current element as HTML)
$s = $h->as_HTML();
$s = $h->as_HTML($entities);
$s = $h->as_HTML($entities, $indent_char);
$s = $h->as_HTML($entities, $indent_char, \%optional_end_tags);
Returns a string representing in HTML the element and its descendants. The optional argument $entities specifies a string of the entities to encode. For compatibility with previous versions, specify '<>&' here. If omitted or undef, all unsafe characters are encoded as HTML entities. See HTML::Entities for details. If passed an empty string, no entities are encoded.
If $indent_char is specified and defined, the HTML to be output is indented, using the string you specify (which you probably should set to "\t", or some number of spaces, if you specify it).
If \%optional_end_tags is specified and defined, it should be a reference to a hash that holds a true value for every tag name whose end tag is optional. Defaults to \%HTML::Element::optionalEndTag, which is an alias to %HTML::Tagset::optionalEndTag, which, at time of writing, contains true values for p, li, dt, dd. A useful value to pass is an empty hashref, {}, which means that no end-tags are optional for this dump. Otherwise, possibly consider copying %HTML::Tagset::optionalEndTag to a hash of your own, adding or deleting values as you like, and passing a reference to that hash.
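The arguments can be combined roughly as follows. This is a sketch with an invented paragraph; the exact whitespace of the output is left unstated because it can differ between versions.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new('p');
$p->push_content('x < 4 & y > 7');

# Encode only <, > and & (the behaviour of previous versions).
my $compat = $p->as_HTML('<>&');

# Same entities, no indenting, and an empty hashref so that no end tag
# is treated as optional: the closing </p> is always written out.
my $strict = $p->as_HTML('<>&', undef, {});

# Same again, but indented with two spaces per level.
my $indented = $p->as_HTML('<>&', '  ', {});

print $strict, "\n";
```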
as_text (returns only the text of the current element, without tags or attributes)
$s = $h->as_text();
$s = $h->as_text(skip_dels => 1);
Returns a string consisting of only the text parts of the element's descendants. Any whitespace inside the element is included unchanged, but whitespace not in the tree is never added. But remember that whitespace may be ignored or compacted by HTML::TreeBuilder during parsing (depending on the value of the ignore_ignorable_whitespace and no_space_compacting attributes). Also, since whitespace is never added during parsing,
HTML::TreeBuilder->new_from_content("<p>a</p><p>b</p>")
->as_text;
returns "ab", not "a b" or "a\nb".
Text under