
Feb 17

In any data access scenario, transforming your extracted data and applying business logic to it is critical for getting the data into the exact format you need.

In ETL (extract, transform, load) terminology, this is the “T”.

When accessing data from Web-based applications, this capability matters even more: the data is typically unstructured, and the more structure you can add, the higher its value.

Most Web Scraping and Screen Scraping tools on the market today lack adequate transformation capabilities.

Here are some examples of how the Kapow Web Data Server delivers full transformation:

Below is a small extract of a blog list from PW Forum:

[Image: ETL4Web]

On the first line, in blue, you see the timestamp “Today 15:15:13”. It denotes the date and time when this blog entry was posted, but because “Today” is relative to when the page is viewed, it must be transformed into a fixed timestamp such as “2010-02-14 3:15:13 pm Pacific time” before it can be compared with other blog entries across the Internet.
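As a rough illustration of this kind of transformation in plain code (not how the Kapow Web Data Server does it), here is a minimal Python sketch. The function name `normalize_timestamp`, the support for “Yesterday”, and the fixed UTC-8 offset for Pacific Standard Time are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the forum displays times in Pacific Standard Time (UTC-8).
PACIFIC_STANDARD = timezone(timedelta(hours=-8), "PST")


def normalize_timestamp(raw, scraped_at):
    """Turn a relative timestamp such as 'Today 15:15:13' into an absolute one.

    raw        -- the timestamp text as shown on the page
    scraped_at -- timezone-aware datetime of the extraction, used to resolve 'Today'
    """
    day_word, clock = raw.split()
    # Hypothetical vocabulary: only these two relative words are handled here.
    days_back = {"today": 0, "yesterday": 1}
    key = day_word.lower()
    if key not in days_back:
        raise ValueError(f"unsupported relative date: {raw!r}")
    day = scraped_at.date() - timedelta(days=days_back[key])
    time_of_day = datetime.strptime(clock, "%H:%M:%S").time()
    return datetime.combine(day, time_of_day, tzinfo=scraped_at.tzinfo)


# The extraction time is what anchors "Today" to a calendar date.
scraped_at = datetime(2010, 2, 14, 16, 0, 0, tzinfo=PACIFIC_STANDARD)
fixed = normalize_timestamp("Today 15:15:13", scraped_at)
print(fixed.isoformat())
```

The key design point is that the extraction time has to be recorded alongside the scraped text; without it, “Today” cannot be resolved later.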

Here’s another example, from eBay. When you bid on an item on eBay, the price is shown in red or green depending on whether you are the high bidder. This is important information “hidden” in the color, and you would want to capture it along with the price. Once transformed, the data tells you not only the price but also whether it is the highest bid. It’s a simple step to define the business logic as “if price is ‘green’ then set status to ‘high bidder’ otherwise set status to ‘not high bidder’.”
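Written out as code, the rule quoted above might look like the following Python sketch. It assumes the extraction step has already produced the price text and the name of its display color; the function name `transform_bid` and the `$1,250.00` price format are hypothetical.

```python
from decimal import Decimal


def transform_bid(price_text, price_color):
    """Apply the rule: green price means high bidder, anything else does not."""
    # Assumption: prices look like "$1,250.00"; strip the symbol and separators.
    amount = Decimal(price_text.replace("$", "").replace(",", "").strip())
    if price_color.strip().lower() == "green":
        status = "high bidder"
    else:
        status = "not high bidder"
    return {"price": amount, "status": status}


winning = transform_bid("$1,250.00", "green")
losing = transform_bid("$1,250.00", "red")
print(winning["status"], "/", losing["status"])
```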

The Kapow Web Data Server and its powerful visual programming IDE allow you to apply any business logic and data transformation you can think of, giving you the most powerful ETL-for-the-Web product on the market today. And it’s all done visually, with no need for any coding.

Try it out next time you need Web data for your BI or analytics tools.

By: Stefan Andreasen, CTO and Founder

Jul 13

“Scraping” comes from “Screen Scraping”, a term for a set of products that turn old “Green Screen” mainframe applications into web services by “wrapping” the screen protocol. Screen Scrapers connect to the fields of a 32×80 character terminal and read pixels, text and numbers, and fill in forms, which in turn wraps the application in a programmatic interface or web service. Examples of such products are IBM Rational HATS and Attachmate EXTRA.

Web Scraping is conceptually identical to Screen Scraping in that it “wraps” a human interface into a programmatic interface, but instead of “wrapping” a character-based mainframe protocol, it “wraps” a Web site or Web application and turns it into an API.

It sounds similar, but technically and in its use cases it’s quite different.

Web Scraping does not cover every approach to wrapping Web applications into APIs – it is limited to traditional methods that use scripting languages like Perl or Python to extract data from static HTML with regular expressions. This method of extracting data from web sites has been used for years, but it runs into two growing challenges: it is fragile when the underlying web application changes and, more importantly, it simply does not work with today’s dynamic AJAX-powered web sites.
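To make the fragility concrete, here is a minimal Python sketch of the traditional approach. The HTML snippets, the `title` class name, and the function `extract_titles` are invented for illustration; they do not come from any real site.

```python
import re

# The pattern is tied to the exact markup of the page as it looked
# when the script was written.
TITLE_PATTERN = re.compile(r'<td class="title">(.*?)</td>')

# Hypothetical page before and after a small redesign.
original_html = '<tr><td class="title">ETL for the Web</td></tr>'
redesigned_html = '<tr><td class="post-title">ETL for the Web</td></tr>'


def extract_titles(html):
    """Return every title the regular expression can find in static HTML."""
    return TITLE_PATTERN.findall(html)


print(extract_titles(original_html))    # the title is found
print(extract_titles(redesigned_html))  # nothing matches after the class rename
```

A one-word change in the markup silently turns the result into an empty list. And if the content is inserted by JavaScript after the page loads, the text never appears in the downloaded HTML, so there is nothing for the pattern to match.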

If you are a Perl programmer, I encourage you to build a simple “web scraper”. Go to Gmail.com and create a Perl script that can log in and read the content of your inbox. You will quickly find out that it is nearly impossible.

Let me introduce the Kapow Web Data Server – it takes over where fragile “Web Scraping” scripts fail, delivering a point-and-click interface that turns a website like gmail.com into a sharable REST or SOAP service, in the cloud or on premises, virtually in minutes. Web data access has never been easier or more resilient.

Web Scraping represents a business concept with growing value in today’s networked world. However, Web Data Serving has taken over, delivering a far more productive and robust alternative to traditional Web Scraping technologies.

I will be continuing with more posts on this topic, and as always, I’d love to hear your comments.

By: Stefan Andreasen

