Friday, December 25, 2009

Charset of input string for DOMDocument loadHTML()

It's very important to tell DOMDocument in which charset the input string is encoded.

In many cases I know for sure that the string is in UTF-8, but unless I tell DOMDocument so when calling loadHTML(),
it will mangle my string: without a charset hint the HTML parser typically assumes ISO-8859-1, so multibyte characters come out garbled.

This is how to correctly load an HTML fragment: wrap it in a full document with a doctype and, most importantly, a meta tag carrying the charset info:

$sHtml = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body><div>'.$sHtml.'</div></body></html>';

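// $oDom is an existing DOMDocument instance
// Silence parser warnings about imperfect markup while loading
$ER = error_reporting(0);
$bLoaded = @$oDom->loadHTML($sHtml);
// Restore the previous level before any exception can be thrown
error_reporting($ER);
if(false === $bLoaded){
throw new LampcmsDevException('Error. Unable to load html string: '.$sHtml);
}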
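
Putting the two steps together, a minimal self-contained sketch might look like the following. The function name loadUtf8Fragment() is hypothetical, RuntimeException stands in for LampcmsDevException, and libxml_use_internal_errors() is swapped in for the error_reporting(0)/@ pair as another way to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not from the original code): wraps a UTF-8 fragment
// in a full document with a charset meta tag, then loads it.
function loadUtf8Fragment(string $sFragment): DOMDocument
{
    $sHtml = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
        . '"http://www.w3.org/TR/REC-html40/loose.dtd">'
        . '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body><div>' . $sFragment . '</div></body></html>';

    $oDom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them
    $bPrevious = libxml_use_internal_errors(true);
    $bLoaded = $oDom->loadHTML($sHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($bPrevious);

    if (false === $bLoaded) {
        throw new RuntimeException('Error. Unable to load html string');
    }

    return $oDom;
}

// Non-ASCII text should survive the round trip unchanged
$oDom = loadUtf8Fragment('<p>Привет, café</p>');
$sText = $oDom->getElementsByTagName('p')->item(0)->textContent;
```

Because the previous libxml error setting is restored before the exception check, a failed load does not leave warnings switched off for the rest of the script.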

Here I am also wrapping the whole string in a <div>
so that it is easy to get back just the contents of
the first div. When I need to call saveXML(), I can do this:

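$string = substr($this->saveXML($this->getElementsByTagName('div')->item(0)), 5, -6);

This works because I know that the content is wrapped in the div tag: I get the first div, serialize it, then strip the literal <div> (5 characters) from the start and </div> (6 characters) from the end of the string. Note that $this here refers to the DOMDocument object, because the line comes from inside a class; outside a class, use $oDom instead.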
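
The extraction step can be tried on its own with a small sketch like this one. The sample markup and the variable names are made up for illustration, and the loop over child nodes is an alternative I am adding here, not part of the original approach.

```php
<?php
$oDom = new DOMDocument();
$oDom->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body><div><p>Hello <b>world</b></p></div></body></html>'
);

// The wrapper div added before loading
$oDiv = $oDom->getElementsByTagName('div')->item(0);

// Serialize the div, then cut off the leading "<div>" (5 chars)
// and the trailing "</div>" (6 chars)
$sInner = substr($oDom->saveXML($oDiv), 5, -6);

// Alternative: serialize each child node and join the pieces,
// which does not depend on the exact length of the wrapper tags
$sInnerFromChildren = '';
foreach ($oDiv->childNodes as $oChild) {
    $sInnerFromChildren .= $oDom->saveXML($oChild);
}
```

One thing to watch with the substr() version: saveXML() writes an empty element as <div/>, so the fixed offsets only make sense when the wrapper actually has content.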

If you are loading XML instead of HTML, the meta tag is not needed; just make sure that the charset is declared in the XML declaration, like this:
<?xml version="1.0" encoding="utf-8" ?>

That is, the first line of the document must indicate the charset.
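
A short sketch of the XML case, with made-up sample documents. The second document is an extra illustration: it is encoded in ISO-8859-1 and declares that, which shows the parser using the declaration to convert the text to UTF-8.

```php
<?php
// UTF-8 document with the charset stated in the XML declaration
$sXml = '<?xml version="1.0" encoding="utf-8" ?>' . "\n"
    . '<note><title>Привет, café</title></note>';

$oDom = new DOMDocument();
if (false === $oDom->loadXML($sXml)) {
    throw new RuntimeException('Error. Unable to load xml string');
}
$sTitle = $oDom->getElementsByTagName('title')->item(0)->textContent;

// ISO-8859-1 document: "\xE9" is the single-byte encoding of the e-acute
$sLatinXml = '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "\n"
    . "<note><title>caf\xE9</title></note>";

$oLatinDom = new DOMDocument();
if (false === $oLatinDom->loadXML($sLatinXml)) {
    throw new RuntimeException('Error. Unable to load xml string');
}
// DOM strings are handed back as UTF-8 regardless of the input encoding
$sLatinTitle = $oLatinDom->getElementsByTagName('title')->item(0)->textContent;
```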
