Why are we still using XML?
As you may or may not know, my day job is at an online advertising company. The specific project I work on involves consuming XML-formatted search feeds. We currently parse around 8,000 of these XML feeds per SECOND.
We spent months investigating the fastest way to parse XML, and eventually settled on a C library that walks the XML like a tree rather than loading the whole DOM into memory the way most conventional XML parsers do.
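Our production parser is a C library, but the general idea of walking the document instead of building a DOM can be sketched in Ruby with Nokogiri's pull reader. This is just an illustration; the result/title element names are made up.

require 'nokogiri'

# Pull-parse the document: visit nodes one at a time instead of
# materialising the whole DOM tree in memory.
def stream_titles(xml)
  titles = []
  Nokogiri::XML::Reader(xml).each do |node|
    next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    next unless node.name == "title"
    titles << node.inner_xml
  end
  titles
end

stream_titles("<results><result><title>Hello</title></result></results>")
# => ["Hello"]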
That being said, XML is slowly dying across the internet. Most APIs are focusing on JSON interfaces instead of XML, but for some reason our industry is just not willing to make the switch.
JSON is a much better format than XML. It natively supports different data types like integers, strings, and booleans. For example:

JSON.parse('{"is_true": true}')["is_true"] == true

That returns true, whereas something like "true" == true does not, and a string is all you ever get back from an XML document, since everything in XML is a string.
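To make the contrast concrete, here's a quick sketch (the element names are just made up for illustration):

require 'json'
require 'nokogiri'

JSON.parse('{"count": 5, "is_true": true}')["count"]    # => 5 (an Integer)
JSON.parse('{"count": 5, "is_true": true}')["is_true"]  # => true (a boolean)

# Everything you pull out of XML comes back as a String and needs casting
Nokogiri::XML('<result><count>5</count></result>').at_xpath('//count').text
# => "5"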
However, the big difference is that JSON is MUCH easier to parse, since it's much better defined. I have put together a quick benchmark to show what I mean:
require 'nokogiri'
require 'json'
require 'benchmark'
# Read each fixture and strip newlines / extra whitespace so the regexes
# can treat the markup as a single line
small_xml_file  = open("datafiles/small.xml").read.gsub(/\n/, "").squeeze(" ")
large_xml_file  = open("datafiles/large.xml").read.gsub(/\n/, "").squeeze(" ")
small_json_file = open("datafiles/small.json").read.gsub(/\n/, "").squeeze(" ")
large_json_file = open("datafiles/large.json").read.gsub(/\n/, "").squeeze(" ")
def regex_xml(file)
  # Grab everything inside <results>, then split on each <result> element
  doc = /\<results\>(.*)\<\/results\>/.match(file)[1]
  doc = doc.split(/\<result\>(.*?)\<\/result\>/)
  doc.map do |result|
    next if result.strip == ""
    # Split on the tags themselves; the text contents land at the odd indexes
    result = result.split(/\<.*?\>/)
    {
      :title => result[1],
      :description => result[3],
      :url => result[5]
    }
  end.compact
end
def xpath_xml(file)
  # Parse the whole document into a DOM, then pull out each field with XPath
  doc = Nokogiri::XML::Document.parse file
  doc.xpath('//results/result').map do |node|
    {
      :title => node.xpath('title').text,
      :description => node.xpath('description').text,
      :url => node.xpath('url').text
    }
  end
end
def parse_json(file)
  # JSON.parse hands back a plain Ruby Hash/Array structure directly
  JSON.parse(file)
end
n = 100000
Benchmark.bmbm do |x|
  x.report("large json")      { n.times { parse_json(large_json_file) } }
  x.report("small json")      { n.times { parse_json(small_json_file) } }
  x.report("large xml xpath") { n.times { xpath_xml(large_xml_file) } }
  x.report("small xml xpath") { n.times { xpath_xml(small_xml_file) } }
  x.report("large xml regex") { n.times { regex_xml(large_xml_file) } }
  x.report("small xml regex") { n.times { regex_xml(small_xml_file) } }
end
puts
puts
puts "JSON Large File Size: #{large_json_file.size}"
puts "JSON Small File Size: #{small_json_file.size}"
puts "XML Large File Size: #{large_xml_file.size}"
puts "XML Small File Size: #{small_xml_file.size}"
Now, this is a pretty quick and dirty benchmark, but I tried to show two things. One, I parse the XML using Nokogiri, one of the most popular XML parsers for Ruby. Two, I also parse it using regexes and splits.
I could probably optimize the regex/splitting code quite a bit, but really, it's not even worth it. Even if I optimize the crap out of it, I don't think I will get performance much better than the JSON parser, and it is complex, confusing, and ugly.
Then I simply use the native JSON parser to parse a JSON file.
I also test with two files: one with 20 entries, the other with 2 entries.
Here are my benchmark results for 100,000 parses.
Rehearsal ---------------------------------------------------
large json 12.100000 0.000000 12.100000 ( 12.182863)
small json 1.640000 0.000000 1.640000 ( 1.649213)
large xml xpath 225.450000 0.170000 225.620000 (225.957241)
small xml xpath 28.250000 0.020000 28.270000 ( 28.307895)
large xml regex 37.880000 0.000000 37.880000 ( 37.936387)
small xml regex 4.230000 0.000000 4.230000 ( 4.235283)
---------------------------------------- total: 309.740000sec
user system total real
large json 12.160000 0.000000 12.160000 ( 12.176090)
small json 1.710000 0.000000 1.710000 ( 1.714853)
large xml xpath 223.850000 0.170000 224.020000 (224.371979)
small xml xpath 28.300000 0.040000 28.340000 ( 28.382887)
large xml regex 37.880000 0.000000 37.880000 ( 37.924334)
small xml regex 4.160000 0.000000 4.160000 ( 4.163718)
JSON Large File Size: 1956
JSON Small File Size: 210
XML Large File Size: 2597
XML Small File Size: 311
Now my conclusions are these:
- XML XPath processing with Nokogiri is slow.
- I have to walk over the XML and pull out the bits I want, which is extra work.
- Parsing XML with regexes and/or splits is ugly, and still slower.
- XML files are generally larger than JSON files, simply because of the syntax.
- Parsing JSON immediately gives me a data structure I can deal with natively in Ruby (and most other languages); see the sketch below.
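For example, assuming the JSON test files have the same shape as what the XML parsers extract (a results array of title/description/url objects, which is an assumption on my part), the parsed output is immediately usable:

require 'json'

json = '{"results": [{"title": "First", "description": "A result", "url": "http://example.com"}]}'

data = JSON.parse(json)
data["results"].first["title"]  # => "First" -- no tree-walking or hash-building step needed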
In my case, the XML files are roughly 30-50% larger than the JSON equivalents, which is a direct increase in bandwidth cost, and XML parsing with XPath takes nearly 20x longer, which is a lot of CPU resources used up.
100,000 parses is roughly 12.5 seconds of our current traffic. This benchmark shows that it would take 200+ seconds to handle that many parses in Ruby using XPath. Using regexes and splits gets closer, but a single process still couldn't keep up. JSON.parse would be able to handle it, with some time to spare. Obviously our production environment runs on faster servers, and more of them (and also runs under Erlang with a C library, not Ruby), but we still have a significant overhead caused simply by parsing XML.
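To put rough numbers on that, here is the arithmetic implied by the "real" times above, for a single Ruby process on this one machine:

parses = 100_000

times = {
  "json"      => 12.18,   # "real" seconds for "large json"
  "xml xpath" => 224.37,  # "real" seconds for "large xml xpath"
  "xml regex" => 37.92    # "real" seconds for "large xml regex"
}

times.each do |name, seconds|
  puts format("%-10s ~%d parses/sec", name, parses / seconds)
end
# json       ~8210 parses/sec  -- just above the ~8,000 feeds/sec of traffic
# xml xpath  ~445 parses/sec
# xml regex  ~2637 parses/sec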
Moral of the story: even if you have much less API load than we do, save yourself some CPU cycles and use JSON instead!
You can download the entire benchmark from https://github.com/tecnobrat/xml-vs-json