Parsing XML with special characters


It is pretty common to come across XML that contains unescaped special characters, most often a bare & in element text or attribute values. With the fix shown here, < is the only character that remains illegal.
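In XML text, & and < are the characters that must be escaped; characters such as (, ) and $ are legal as they are. For contrast with the repair approach in this post, here is a minimal sketch of what well-formed output looks like when the producer does control the XML. It assumes LINQ to XML (System.Xml.Linq) is available; the class and method names are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class EscapingDemo
{
    // Builds the element through LINQ to XML so the library does the escaping.
    public static string Serialize(string text)
    {
        return new XElement("customer_name", text).ToString();
    }

    // Parses the serialized form back and returns the element's text content.
    public static string RoundTrip(string text)
    {
        return XElement.Parse(Serialize(text)).Value;
    }
}
```

Calling EscapingDemo.Serialize("A&B Company") yields <customer_name>A&amp;B Company</customer_name>, which parses back to the original text.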

Let’s take a look at one way of fixing it when one has no control over the XML being received:

string xmlString = "<?xml version=\'1.0\' encoding=\'UTF-8\' standalone=\'yes\'?>\n<rows xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:x=\"urn:row\">\n<xsd:schema targetNamespace=\"urn:row\">\n<xsd:element name=\"row\">\n<xsd:complexType>\n<xsd:sequence>\n<xsd:element name=\"customer_name\" type=\"xsd:string\" nillable=\"true\"/>\n</xsd:sequence>\n</xsd:complexType>\n</xsd:element>\n</xsd:schema>\n<x:row>\n<customer_name>A&B Company</customer_name>\n</x:row>\n</rows>";
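// xmlString.Dump(); // LINQPad
// var doc = XElement.Parse(xmlString); // Error! The bare & in "A&B Company" is not well-formed XML.
// Requires System.Text.RegularExpressions, System.Web (HttpUtility) and System.Xml.Linq.
// Match either element text (between > and <) or a double-quoted attribute value.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
// Decode first so that entities which are already escaped (e.g. &amp;) are not encoded twice, then re-encode.
string result = Regex.Replace(xmlString, pattern, m =>
    m.Groups["start"].Value +
    HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
    m.Groups["end"].Value);
// result.Dump(); // LINQPad
var doc = XElement.Parse(result); // Parses now that the & has been encoded as &amp;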

If the commented-out XElement.Parse(xmlString) call were executed on the original string instead, the following XmlException would be thrown:

XmlException when parsing invalid XML: ‘ ‘ is an unexpected token. The expected token is ‘;’. Line 13, position 19.
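The line and position quoted in that message are also exposed as properties on the exception, which helps when logging bad input. A minimal sketch, assuming a shortened stand-in document rather than the full one above; the class and method names are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class ParseDiagnostics
{
    // Returns an empty string when the input parses,
    // otherwise a short description of where parsing failed.
    public static string DescribeFailure(string xml)
    {
        try
        {
            XElement.Parse(xml);
            return string.Empty;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition are 1-based, as in the message text.
            return $"line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}";
        }
    }
}
```

For example, ParseDiagnostics.DescribeFailure("<rows>\n<customer_name>A&B Company</customer_name>\n</rows>") returns a description pointing at the second line, while the same input with &amp; in place of & returns an empty string.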

