
Jul 30

Often without realizing it, more and more companies rely on Web Data (any data you can see in a web browser) as a critical foundation for making business decisions.

Ron’s post on Web Data reminded me of this interesting blog post, “More data usually beats better algorithms”, written by Anand Rajaraman, co-founder of Kosmix and also Consulting Assistant Professor of Data Mining at Stanford University.

The blog post describes how Anand’s students competed for the $1 Million Netflix Prize, a competition open to the public.

Netflix provides a huge data set of customer movie ratings from the past, and the challenge is to use this data to create a better algorithm than Netflix already has to predict which movies people want to view in the future.

Anand’s students attacked this challenge, and in his post he highlights two very different approaches.  Team A focused on developing a sophisticated algorithm.  Team B used a simple algorithm and focused more on the data, pulling in additional movie data from IMDb (the Internet Movie Database).

Which team performed better?

Team B, who focused more on the data, got to the top of the Netflix Prize leaderboard.

Anand’s point?  “…adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I’m often surprised that many people in business, and even in academia, don’t realize this.”  Just adding one extra set of data can improve the quality of your decision making several times over.
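To make the idea concrete, here is a minimal sketch in Python. It is not the students’ method and it does not use Netflix data: the ratings, the movie names and the genre lookup are all made up for illustration. Both predictors use the same simple algorithm (an average of past ratings); the only difference is that the second one can draw on an extra, independent data set.

```python
from math import sqrt
from statistics import mean

# Toy, made-up ratings for one viewer (not Netflix data): movie -> stars.
train_ratings = {"A": 5, "B": 4, "C": 2, "D": 1}
held_out = {"E": 4, "F": 2}

# Hypothetical extra data set pulled from an outside source: movie -> genre.
genres = {"A": "action", "B": "action", "C": "drama",
          "D": "drama", "E": "action", "F": "drama"}


def rmse(predict, actual):
    """Root-mean-square error of predict(movie) against the actual ratings."""
    return sqrt(mean((predict(m) - r) ** 2 for m, r in actual.items()))


# Same simple algorithm both times: predict an average of past ratings.
global_mean = mean(train_ratings.values())


def predict_without_extra_data(movie):
    return global_mean


def predict_with_extra_data(movie):
    # The added data lets the average be taken over same-genre movies only.
    same_genre = [r for m, r in train_ratings.items()
                  if genres[m] == genres[movie]]
    return mean(same_genre) if same_genre else global_mean


baseline_error = rmse(predict_without_extra_data, held_out)
enriched_error = rmse(predict_with_extra_data, held_out)
print(baseline_error, enriched_error)
```

On this toy data the error drops once the genre data is joined in, even though the algorithm itself did not get any smarter. Real data will of course be far messier than six invented movies.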

The key is not choosing between a better algorithm and better data, but improving the outcome of your decision-making by adding more data, namely Web Data. Think about the impact on your business if you could add high-value Web Data to your Market Intelligence, Pricing Intelligence, Financial Intelligence or any other Business Intelligence product.

Many companies already have knowledge workers who cut-and-paste Web Data into their BI tools or use simple Web Scraping tools like Velocityscape, Connotate, QL2 or Mozenda (which are limited by their inability to handle dynamic web content like AJAX or JavaScript).  To get the most out of your Business Intelligence projects, you’ll want a full Web Data Services product like the Kapow Web Data Server.

Unleash the real power of Web Data to make better business decisions.

Check it out and let me hear your comments.

By:  Stefan Andreasen

Jul 26

When you hear the words “web data”, what comes to mind?  Is it any information you can find on Google, Yahoo!, Amazon, or simply the WWW?  For me as a consumer, the World Wide Web is certainly the de facto source of any data that I care about.

But when I walk into my office on Monday morning, something suddenly changes.  “Web data” becomes much more.  It’s campaign or lead information in Salesforce.com.  It’s HR and company information on our Wiki.  It’s essentially any data within my corporate firewall that I need to get my job done and data that is critical to the success of my company.  On the other hand, the WWW is generally a secondary source of information, e.g. when I’m doing some market research or keeping an eye on my competitors.

A common theme has emerged in my discussions with executives of global companies: an increasing need among all of them to tap into WWW information and make better business decisions.  But for the most part, these projects are departmental and functional – competitive pricing, voice of the customer, etc.  Here’s the problem.  If companies are trying to increase their competitive advantage based upon the use of public web data (i.e. WWW) then their competitors will also have access to that same data… thus their competitive advantage will certainly be short lived!  While creating departmental use cases for the use of public web data is a good first step, it’s not sufficient.  Every company’s ability to compete and win is based upon its unique operational capabilities.  Therefore, it’s essential that ALL sources of information be incorporated within the same enterprise information architecture.

Enter “Web Data Services”: the intersection of Web 2.0, Enterprise 2.0, and BI 2.0.  Web 2.0 provides modern tools for companies to marry traditional enterprise data (ERP, CRM, Supply Chain, etc.) with data from the Cloud and WWW, thus modernizing their information systems.  And the net effect is enabling executives to bring their dashboards, reporting and analytics to the next level for critical decision making.

“Web data” is effectively any data you can see in a browser… a website or web application that is inside or outside your firewall, in your partner’s intranet or in the Cloud.  But consider this:  It is much easier to mine or harvest public web data than it is to pull data from behind firewalls and then turn it into valuable assets that drive your business forward.  The need for automation tools, or Web Data Services, that access, enrich and serve all web data is critical to the success of your business.

Jul 23

Susan Bouchard, who leads the Web 2.0 and mobility program for Cisco and is author of the book Enterprise Web 2.0 Fundamentals, is an Enterprise Web 2.0 Expert and Blogger at Network World.  She recently wrote a great post titled Delivering Business Value with Mashups where she describes how “Mashups offer several key advantages and significant business value to the enterprise”.  She highlights how Cisco “uses Kapow robots to aggregate links to selling content from Cisco’s product and marketing business unit sites, eliminating the need for users to visit multiple sites and reducing search time” as a prime example of Cisco using Kapow Technologies to deliver business value.

Thanks Susan, great post!

Jul 21

A few weeks ago, I had a great chat with Jamie Thomson from EMC about Web Data Services.  I noticed Jamie recently wrote an interesting blog post titled “ETL for HTML”.  ETL is a well-known term for anyone working with Data Integration or Data Warehousing. It stands for Extract, Transform and Load, and describes a one-way process of extracting data from a source, transforming the data into a new format and then loading the data into a destination. Traditional ETL tools from vendors like Informatica are most effective when the source can be reached in traditional ways, through SQL, XML or program APIs. Many sources cannot, and this is where Web Data Services products like the Kapow Web Data Server come in as next-generation ETL tools. The Kapow Web Data Server lets users extract data from, and load data into, all of their data sources, including those that cannot be accessed in traditional ways. The only prerequisite is that users can access and see the data in a normal Web browser.
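For readers who have not worked with ETL, the three steps can be sketched in a few lines of Python. This is only an illustration of the Extract, Transform and Load pattern applied to HTML, using the standard library; it is not how the Kapow Web Data Server works internally. The page markup, the product names and the `prices` table are assumptions made up for the example.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a page you can see in a browser (hypothetical markup and data).
PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$1,200.50</td></tr>
  <tr><td>Gadget</td><td>$89.00</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


def extract(html):
    """Extract: pull the raw cell text out of the page."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows


def transform(rows):
    """Transform: turn '$1,200.50' style text into numbers."""
    return [(name.strip(), float(price.strip().lstrip("$").replace(",", "")))
            for name, price in rows]


def load(records, connection):
    """Load: write the cleaned records into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS prices (product TEXT, price REAL)")
    connection.executemany("INSERT INTO prices VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(PAGE)), connection)
stored = connection.execute(
    "SELECT product, price FROM prices ORDER BY product").fetchall()
print(stored)
```

Note that this sketch only copes with static HTML that is already in hand. Logging in, following links and handling pages built by JavaScript are exactly the parts a hand-written script struggles with.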

We live in a browser-centric world today, where “ETL for HTML” spans two extremes:  Web 2.0 (e.g. web scraping, mashups, etc.) and Enterprise Data Management (e.g. data extraction, data collection, data mining, data conversion, data integration, etc.).  “ETL for HTML” is a fitting universal term for working with all the data we see in our Web browsers. Working this way gives us fast and automated access to any data in applications like Salesforce or NetSuite, or in any of the millions of other web-based applications that exist inside our firewall, at our business partners, with the government, or just out on the public web.

Jamie is spot-on with the term “ETL for HTML” as a way to describe how most of us will access web data.  Although ETL traditionally describes a one-way process of moving data from point A to point B, Web Data Services provides two-way access to data. This means we can leave the data where it resides best (like in your HR or ERP applications) and get full programmatic access by using a product like the Kapow Web Data Server to “wrap” the applications into standard service APIs like REST, SOAP or .NET.
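Here is a rough sketch of what “wrapping” an application into a two-way, REST-style service could look like. It is a toy under stated assumptions, not the Kapow product’s API: the `HrApplication` class, the employee record and the `/employees/<id>` path are all hypothetical, and the class merely stands in for the screens a person would read and the forms they would fill in. The service is a plain WSGI callable from the Python standard library’s conventions.

```python
import io
import json


class HrApplication:
    """Hypothetical stand-in for a browser-only application; the data stays here."""

    def __init__(self):
        self._records = {"17": {"name": "Ada Lopez", "dept": "Finance"}}

    def has_employee(self, employee_id):
        return employee_id in self._records

    def view_employee(self, employee_id):
        # What a user would read off the employee page.
        return dict(self._records[employee_id])

    def submit_edit_form(self, employee_id, changes):
        # What a user would type into the edit form.
        self._records[employee_id].update(changes)


hr = HrApplication()


def respond(start_response, status, payload):
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(payload).encode("utf-8")]


def service(environ, start_response):
    """WSGI app: GET reads an employee, PUT writes changes back to the source."""
    parts = environ["PATH_INFO"].strip("/").split("/")
    if len(parts) != 2 or parts[0] != "employees" or not hr.has_employee(parts[1]):
        return respond(start_response, "404 Not Found", {"error": "not found"})
    method = environ["REQUEST_METHOD"]
    if method == "PUT":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        changes = json.loads(environ["wsgi.input"].read(length))
        hr.submit_edit_form(parts[1], changes)
    elif method != "GET":
        return respond(start_response, "405 Method Not Allowed",
                       {"error": "unsupported method"})
    return respond(start_response, "200 OK", hr.view_employee(parts[1]))


def call(method, path, payload=None):
    """Invoke the service in-process, without opening a network port."""
    raw = json.dumps(payload).encode("utf-8") if payload is not None else b""
    environ = {"REQUEST_METHOD": method, "PATH_INFO": path,
               "CONTENT_LENGTH": str(len(raw)), "wsgi.input": io.BytesIO(raw)}
    status = []
    chunks = service(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks))


print(call("GET", "/employees/17"))
print(call("PUT", "/employees/17", {"dept": "Treasury"}))
```

The point of the sketch is the direction of travel: the record is never copied into a separate repository. Reads and writes both go through the wrapper to the application that owns the data. To try it over HTTP, the same `service` callable could be handed to `wsgiref.simple_server.make_server`.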

Why is this so important? Well for two reasons.  First, with the data explosion around us it becomes impractical to move and synchronize data into one common data repository.  Second, the data we need to perform our analysis and drive business decisions will change more and more rapidly. We will need new data sources daily, or at least weekly, to react to the ever changing business needs of the future.

So what is a good replacement for the term “ETL for HTML”? I suggest something like “Access, Enrich and Serve Web data”. This is a superset of ETL that also covers the way we want to access data in the future.

What term do you think we should use?

By:  Stefan Andreasen

Jul 13

Scraping comes from “Screen Scraping”, a term for a set of products that turn old “Green Screen” mainframe applications into web services by “wrapping” the screen protocol.  Screen Scrapers connect to the fields of a 32×80 character terminal, read the text and numbers displayed there, fill in forms, and in turn wrap the application into a programmatic interface or web service.  Examples of such products are IBM Rational HATS and Attachmate EXTRA.
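The core trick of a screen scraper, reading fields at fixed positions on a character screen, can be sketched as follows. The screen contents and the field map are invented for the example, and a real product also has to speak the terminal protocol and send keystrokes, which this sketch leaves out entirely.

```python
# A captured terminal screen: fixed-width rows of characters (hypothetical layout).
WIDTH = 80
screen = [
    "CUSTOMER INQUIRY".ljust(WIDTH),
    "".ljust(WIDTH),
    "ACCOUNT: 00123456   NAME: J SMITH".ljust(WIDTH),
    "BALANCE:    1042.17".ljust(WIDTH),
]

# Field map: name -> (row, column, length), all zero-based.
FIELDS = {
    "account": (2, 9, 8),
    "name": (2, 26, 20),
    "balance": (3, 9, 10),
}


def read_field(screen, row, col, length):
    """Return the text found at a fixed position, without the padding."""
    return screen[row][col:col + length].strip()


def scrape(screen):
    """Turn the human-oriented screen into a record a program can use."""
    record = {name: read_field(screen, *pos) for name, pos in FIELDS.items()}
    record["balance"] = float(record["balance"])
    return record


print(scrape(screen))
```

Because the fields sit at known rows and columns, a scraper like this stays stable for as long as the screen layout does.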

Web Scraping is conceptually identical to Screen Scraping as it “wraps” a human interface into a programmatic interface, but instead of “wrapping” a character based mainframe protocol, it “wraps” a Web site or Web application and turns it into an API.

It sounds similar but technically, and in use cases, it’s quite different.

Web Scraping does not cover every approach to wrapping Web applications into APIs. The term is limited to traditional methods that use scripting languages like Perl or Python to extract data from static HTML with regular expressions. This method of extracting data from web sites has been used for years, but it runs into two growing challenges:  it is fragile when the underlying web application changes and, more importantly, it simply does not work with today’s dynamic AJAX-powered web sites.
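Both problems are easy to reproduce. The short Python sketch below uses a regular expression against three made-up snippets of HTML: the page the script was written for, the same page after a cosmetic redesign, and a page where the price is filled in by JavaScript after loading (the `loadPrice` function is hypothetical and never runs here).

```python
import re

# Pattern written against the original markup, as a traditional scraper would do.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d.]+)</span>')

original_page = '<div><span class="price">$19.99</span></div>'
# Same price after a cosmetic redesign: an extra class, an id and a nested tag.
redesigned_page = '<div><span class="price sale" id="p1"><b>$19.99</b></span></div>'
# AJAX-style page: the price is not in the HTML at all until a script runs.
ajax_page = '<div id="price"></div><script>loadPrice("p1")</script>'


def scrape_price(html):
    """Return the price as a float, or None when the pattern finds nothing."""
    match = PRICE_PATTERN.search(html)
    return float(match.group(1)) if match else None


print(scrape_price(original_page))    # 19.99
print(scrape_price(redesigned_page))  # None
print(scrape_price(ajax_page))        # None
```

The redesigned page could be rescued with a looser pattern, at least until the next redesign. The AJAX page cannot be rescued this way, because the data only exists after a browser has executed the script.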

If you are a Perl programmer, I encourage you to build a simple “web scraper”. Go to Gmail.com and create a Perl script that can log in and read the content of your inbox. You will quickly find out that it is nearly impossible.

Let me introduce the Kapow Web Data Server – it takes over where fragile “Web Scraping” scripts fail, delivering a point-and-click interface to turn a website like gmail.com into a sharable REST or SOAP service in the cloud or on-premise, virtually in minutes. Web data access has never been easier and more resilient.

Web Scraping represents a business concept with growing value in today’s networked world. However, Web Data Serving has taken over as a far more productive and robust alternative to traditional Web Scraping technologies.

I will continue with more posts on this topic, and as always, I’d love to hear your comments.

By:  Stefan Andreasen


The Kapow Katalyst Blog is a collection of insights, perspectives, and thought leadership around Application Integration.

Comments, Feedback, Contact Us:

blog at kapowsoftware.com
