This talk outlines our team’s findings on the properties of XML documents and XPath expressions “in the wild”.
As part of an ongoing effort to develop XML processing hardware and software, our team has collected thousands of samples of various XML species from the web to analyze in our lab. Dissecting these critters with various statistical tools, we developed a characterization of “typical” XML documents in each of several familiar species, including RSS and XHTML. Belaboring the metaphor further, we also cloned these species, feeding their statistical characteristics to a custom-designed tool that generates XML documents matching statistical profiles.
The talk will also describe a related investigation into XPath, in which we extracted expressions from hundreds of open-source projects. We found some illuminating patterns in XPath usage in those projects.
Stewart Taylor is a software architect at Intel Corporation. In his many years at Intel, he has worked on numerous software projects in multimedia and information processing, most notably the Intel® Integrated Performance Primitives and the Intel® XML Software Suite. He is the author of Intel® Integrated Performance Primitives and Optimizing Applications for Multi-core Processors.
Adam has worked on XML usage model frameworks and computer system design. His work includes developing B2B XML content-level secured document sharing models, structural and statistical XML usage models, random XML document generation, embedded real-time data acquisition systems, database security, and scalable clustered database systems.
Adam holds an MS in Electrical Engineering from Stanford University, and a BS in Engineering and a BS in Economics, in Computer Science and Finance, from the School of Engineering and the Wharton School of the University of Pennsylvania. He is currently a Senior Member of Technical Staff in the Server Technology group at Oracle Corporation. Besides his engineering work, Adam enjoys music and is a vocalist and a composer of classical music.