The Difficulties of Dealing with Scripts and Wrappers in Web Data Extraction
That the World Wide Web is the largest public data source in the world is no surprise, although trying to conceive of what that actually means certainly can be. From its inception, the web was intended as an easy way for people to find information, and, from checking airline prices to learning how to make a proper béarnaise, that trend continues for the most part. This data explosion, and the potential revenue it can generate, fueled in part by "the Internet of Things," is leading companies to rethink how they access, ingest, store, and manage Big Data, as well as the information and content they harvest from the web.
Increasingly, however, it is computers rather than people that use the web to seek out information. In 2014, for instance, bots made up more than half of all web traffic. And because computers access and use information differently than people do, they hit more than a few snags along the way. For companies looking to capitalize on web data extraction, those snags lie at the heart of the challenge: how to properly gather and make use of the web's vast storehouses of information.
Web Data is Different
On the surface, extracting web data for use by computer applications looks simple. Data, after all, is everywhere, and its ubiquity implies that it should also be trackable, collectable, and exploitable. As the march toward big data-driven discovery continues unabated, expectations are high. People in industries as different as government and retail want to taste the fruits of big data: more information and better insight that yield better results.
But traditional data processing disciplines deal with data in relational or object-based models, while web data is unstructured. This lack of structure makes browsing websites enjoyable for people, but it poses significant challenges to organizations seeking to extract machine-usable content from around the web.
Scripts and Wrappers, and Trouble, Oh My!
When scripts and wrappers depend on HTML delimiters, as they often do in programmatic approaches to extraction, they break whenever a site's HTML changes. Once that break occurs, no automated process can repair them: a real, live programmer has to rewrite the code to get them up and running again.
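To see why, consider a minimal sketch of a delimiter-dependent wrapper. The markup, function names, and pages here are all hypothetical, but the failure mode is the real one: the wrapper is hard-coded to one HTML layout, so even a cosmetic redesign leaves it returning nothing.

```python
import re

# Hypothetical wrapper hard-coded to one layout: it assumes prices
# appear inside a <span class="price"> element.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d.]+)</span>')

def extract_price(html: str):
    """Return the price if the expected markup is present, else None."""
    match = PRICE_PATTERN.search(html)
    return float(match.group(1)) if match else None

# Works against the markup the wrapper was written for...
old_page = '<div><span class="price">$19.99</span></div>'
print(extract_price(old_page))   # 19.99

# ...but a cosmetic redesign silently breaks it.
new_page = '<div><em data-price="19.99">$19.99</em></div>'
print(extract_price(new_page))   # None: only a rewrite brings it back
```

The data is still right there on the page; only the delimiters moved. That gap between "the content survived" and "the wrapper died" is the whole problem.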
Because scripts and wrappers of this type are built on the assumptions of traditional data models, they spell nothing but t-r-o-u-b-l-e for web content extraction.
The Booyah! of Machine Learning
Website changes can throw a real wrench in the works of a strictly programmatic approach to data extraction. A non-programmatic approach, one that applies machine learning to the problem of broken scripts and wrappers, can change that.
Two separate teams of researchers, one from Rutgers and one from USC, applied machine learning in different ways to develop non-programmatic approaches to web content monitoring and extraction. One explored visual abstraction, while the other developed algorithms based on the statistical distribution of patterns in web pages. The results of each showed that machine learning could not only ease the creation of web content extraction applications but also improve their resilience to website changes.
Today, these two approaches have been combined to form a hybrid solution that enables web data extraction to continue unabated across changes to content and formatting. The result is a scalable and flexible solution that can solve the problems created by scripts and wrappers, while achieving the results customers want and need.
Need a web content extraction solution that's resilient to website changes? Request a consultation.