The Difficulties of Dealing with Scripts and Wrappers in Web Data Extraction
That the World Wide Web is the largest public data source in the world is no surprise, although trying to conceive of what that actually means certainly can be. From its inception, the web was intended as an easy way for people to find information, and, from checking airline prices to learning how to make a proper béarnaise, that trend continues for the most part. This data explosion, and the potential revenue it can generate, fueled in part by "the Internet of Things," is leading companies to rethink how they access, ingest, store, and manage Big Data, as well as the information and content they harvest from the web.
Increasingly, however, it is computers rather than people that use the web to seek out information. In 2014, for instance, bots made up more than half of all web traffic. And because computers access and use information differently than people do, they hit more than a few snags along the way. For companies looking to capitalize on web data extraction, those snags lie at the heart of the challenge: how to properly gather and make use of the web's vast storehouses of information.
Web Data is Different
On the surface, extracting web data for use by computer applications looks simple. Data, after all, is everywhere, and its ubiquity implies that it should also be trackable, collectable, and exploitable. As the march toward big data-driven discovery continues unabated, expectations are high. People in industries as different as government and retail want to taste the fruits of big data: more information and better insight that yield better results.
But traditional data processing disciplines deal with data in relational or object-based models, while web data is unstructured. This lack of structure makes browsing websites enjoyable for people, but it poses significant challenges to organizations seeking to extract machine-usable content from around the web.
Scripts and Wrappers, and Trouble, Oh My!
When scripts and wrappers depend on HTML delimiters, as they often do in programmatic approaches to extraction, they break whenever a site's HTML changes. Once that break occurs, no automated process can repair them: a real, live programmer has to rewrite the code to get them up and running again.
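To see why, consider a minimal sketch of a delimiter-dependent wrapper. The markup, function names, and pages here are all hypothetical, but the failure mode is the real one: the wrapper is hard-coded to one HTML layout, so even a cosmetic redesign leaves it returning nothing.

```python
import re

# Hypothetical wrapper hard-coded to one layout: it assumes prices
# appear inside a <span class="price"> element.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d.]+)</span>')

def extract_price(html: str):
    """Return the price if the expected markup is present, else None."""
    match = PRICE_PATTERN.search(html)
    return float(match.group(1)) if match else None

# Works against the markup the wrapper was written for...
old_page = '<div><span class="price">$19.99</span></div>'
print(extract_price(old_page))   # 19.99

# ...but a cosmetic redesign silently breaks it.
new_page = '<div><em data-price="19.99">$19.99</em></div>'
print(extract_price(new_page))   # None: only a rewrite brings it back
```

The data is still right there on the page; only the delimiters moved. That gap between "the content survived" and "the wrapper died" is the whole problem.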
Because scripts and wrappers of this type are built on the assumptions of traditional data models, they spell nothing but t-r-o-u-b-l-e for web content extraction.
The Booyah! of Machine Learning
Website changes can throw a real wrench in the works of a strictly programmatic approach to data extraction. A non-programmatic approach, one that applies machine learning to the problem of broken scripts and wrappers, can change that.
Two separate teams of researchers, one from Rutgers and one from USC, applied machine learning in different ways to develop non-programmatic approaches to web content monitoring and extraction. One explored visual abstraction, while the other developed algorithms based on the statistical distribution of patterns in web pages. The results of each showed that machine learning could not only ease the creation of web content extraction applications but also improve their resilience to website changes.
Today, these two approaches have been combined to form a hybrid solution that enables web data extraction to continue unabated across changes to content and formatting. The result is a scalable and flexible solution that can solve the problems created by scripts and wrappers, while achieving the results customers want and need.
Need a web content extraction solution that's resilient to website changes? Request a consultation.