user image

David Ziguras April 14th, 2009

A web mashup is a Web application that combines data from one or more sources into a single integrated tool.

The term Mashup implies easy, fast integration, frequently achieved by access to open APIs (application programming interface) and data sources to produce results that were not the original reason for producing the raw source data.

Web Mashups are an exciting variety of interactive Web applications that draw upon content retrieved from external data sources to create entirely new and innovative services. They are a product of Web applications commonly referred to as Web 2.0.

An example of a mashup is the use of cartographic data from Google Maps to add location information to real estate data, thereby creating a new and distinct Web service that was not originally provided by either source.

Content used in mashups is typically obtained from a third party source through a public interface or API. Other methods of obtaining content for mashups include Web feeds (e.g. RSS or Atom), and screen scraping. Many people are experimenting with mashups using Amazon, eBay, Flickr, Google, Microsoft, Pictometry, Yahoo, and YouTube APIs, which has led to the creation of mashup editors.

A web mashup application is architecturally comprised of three different participants that are logically and physically disjoint: API/content providers, the mashup site, and the client’s Web browser.

The API/content providers:  These are the providers of the content being mashed. To facilitate data retrieval, providers often expose their content through Web-protocols such as REST, Web Services, and RSS/Atom. However, many interesting potential data-sources do not <yet> conveniently expose APIs. Mashups that extract content from sites do so by a technique known as screen scraping. In this context, screen scraping connotes the process by which a tool attempts to extract information from the content provider by attempting to parse the provider’s Web pages, which were originally intended for human consumption.

The mashup site:  This is where the mashup is hosted. Interestingly enough, just because this is where the mashup logic resides, it is not necessarily where it is executed. On one hand, mashups can be implemented similarly to traditional Web applications using server-side dynamic content generation technologies like Java servlets, CGI, PHP or ASP. Alternatively, mashed content can be generated directly within the client’s browser through client-side scripting (that is, JavaScript) or applets. This client-side logic is often the combination of code directly embedded in the mashup’s Web pages as well as scripting API libraries or applets (furnished by the content providers) referenced by these Web pages. Mashups using this approach can be termed rich internet applications (RIAs), meaning that they are very oriented towards the interactive user-experience.  The benefits of client-side mashing include less overhead on behalf of the mashup server (data can be retrieved directly from the content provider) and a more seamless user-experience (pages can request updates for portions of their content without having to refresh the entire page). The Google Maps API is intended for access through browser-side JavaScript, and is an example of client-side technology. Often mashups use a combination of both server and client-side logic to achieve their data aggregation. Many mashup applications use data that is supplied directly to them by their user base, making (at least) one of the data sets local. Additionally, performing complex queries on multiple-sourced data requires computation that would be infeasible to perform within the client’s Web browser.

The client’s Web browser: This is where the application is rendered graphically and where user interaction takes place. As described above, mashups often use client-side logic to assemble and compose the mashed content.

Ajax

Ajax is a Web application model rather than a specific technology. It comprises several technologies focused around the asynchronous loading and presentation of content:

– XHTML and CSS for style presentation

– The Document Object Model (DOM) API exposed by the browser for dynamic display and interaction

– Asynchronous data exchange, typically of XML data

– Browser-side scripting, primarily JavaScript

When used together, the goal of these technologies is to create a smooth, cohesive Web experience for the user by exchanging small amounts of data with the content servers rather than reload and re-render the entire page after some user action. You can construct Ajax engines for mashups from various Ajax toolkits and libraries (such as Sajax and Zimbra), usually implemented in JavaScript.

Web protocols: SOAP and REST

Both SOAP and REST are platform neutral protocols for communicating with remote services. As part of the service-oriented architecture paradigm, clients can use SOAP and REST to interact with remote services without knowledge of their underlying platform implementation: the functionality of a service is completely conveyed by the description of the messages that it requests and responds with.

SOAP is a fundamental technology of the Web Services paradigm. Originally an acronym for Simple Object Access Protocol, SOAP has been re-termed Services-Oriented Access Protocol (or just SOAP) because its focus has shifted from object-based systems towards the interoperability of message exchange. There are two key components of the SOAP specification. The first is the use of an XML message format for platform-agnostic encoding, and the second is the message structure, which consists of a header and a body. The header is used to exchange contextual information that is not specific to the application payload (the body), such as authentication information. The SOAP message body encapsulates the application-specific payload. SOAP APIs for Web services are described by WSDL documents, which themselves describe what operations a service exposes, the format for the messages that it accepts (using XML Schema), and how to address it. SOAP messages are typically conveyed over HTTP transport, although other transports (such as JMS or e-mail) are equally viable.

REST is an acronym for Representational State Transfer, a technique of Web-based communication using just HTTP and XML. Its simplicity and lack of rigorous profiles set it apart from SOAP and lend to its attractiveness. Unlike the typical verb-based interfaces that you find in modern programming languages (which are composed of diverse methods such as getEmployee(), addEmployee(), listEmployees(), and more), REST fundamentally supports only a few operations (that is POST, GET, PUT, DELETE) that are applicable to all pieces of information. The emphasis in REST is on the pieces of information themselves, called resources. For example, a resource record for an employee is identified by a URI, retrieved through a GET operation, updated by a PUT operation, and so on. In this way, REST is similar to the document-literal style of SOAP services.

Screen scraping

A lack of APIs from content providers often force mashup developers to resort to screen scraping in order to retrieve the information they seek to mash. Scraping is the process of using software tools to parse and analyse content that was originally written for human consumption in order to extract semantic data structures representative of that information that can be used and manipulated programmatically. A handful of mashups use screen scraping technology for data acquisition, especially when pulling data from the public sectors.

Screen scraping is often considered an inelegant solution, and for good reasons. It has two primary inherent drawbacks. The first is that, unlike APIs with interfaces, scraping has no specific programmatic contract between content-provider and content-consumer. Scrapers must design their tools around a model of the source content and hope that the provider consistently adheres to this model of presentation. Web sites have a tendency to overhaul their look-and-feel periodically to remain fresh and stylish, which imparts severe maintenance headaches on behalf of the scrapers because their tools are likely to fail.

The second issue is the lack of sophisticated, re-usable screen-scraping toolkit software, colloquially known as scrAPIs.

Semantic Web and RDF

The inelegant aspects of screen scraping are directly traceable to the fact that content created for human consumption does not make good content for automated machine consumption. Enter the Semantic Web, which is the vision that the existing Web can be augmented to supplement the content designed for humans with equivalent machine-readable information. In the context of the Semantic Web, the term information is different from data; data becomes information when it conveys meaning (that is, it is understandable). The Semantic Web has the goal of creating Web infrastructure that augments data with metadata to give it meaning, thus making it suitable for automation, integration, reasoning, and re-use.

The W3C family of specifications collectively known as the Resource Description Framework (RDF) serves this purpose of providing methodologies to establish syntactic structures that describe data. XML in itself is not sufficient; it is too arbitrary in that you can code it in many ways to describe the same piece of data. RDF-Schema adds to RDF’s ability to encode concepts in a machine-readable way. Once data objects can be described in a data model, RDF provides for the construction of relationships between data objects through subject-predicate-object triples (“subject S has relationship R with object O”). The combination of data model and graph of relationships allows for the creation of ontologies, which are hierarchical structures of knowledge that can be searched and formally reasoned about. For example, you might define a model in which a “carnivore-type” as a subclass of “animal-type” with the constraint that it “eats” other “animal-type”, and create two instances of it: one populated with data concerning cheetahs and polar bears and their habitats, another concerning gazelles and penguins and their respective habitats. Inference engines might then “mash” these separate model instances and reason that cheetahs might prey on gazelles but not penguins.

RDF data is quickly finding adoption in a variety of domains, including social networking applications (such as FOAF — Friend of a Friend) and syndication. In addition, RDF software technology and components are beginning to reach a level of maturity, especially in the areas of RDF query languages (such as RDQL and SPARQL) and programmatic frameworks and inference engines (such as Jena and Redland).

RSS and ATOM

RSS is a family of XML-based syndication formats. In this context, syndication implies that a Web site that wants to distribute content creates an RSS document and registers the document with an RSS publisher. An RSS-enabled client can then check the publisher’s feed for new content and react to it in an appropriate manner. RSS has been adopted to syndicate a wide variety of content, ranging from news articles and headlines, changelogs for CVS checkins or wiki pages, project updates, and even audiovisual data such as radio programs. Version 1.0 is RDF-based, but the most recent, version 2.0, is not.

Atom is a newer, but similar, syndication protocol. It is a proposed standard at the Internet Engineering Task Force (IETF) and seeks to maintain better metadata than RSS, provide better and more rigorous documentation, and incorporates the notion of constructs for common data representation.

These syndication technologies are great for mashups that aggregate event-based or update-driven content, such as news and weblog aggregators.

Mashups and meshups are different from simple embedding of data from another site to form a compound page. A site that allows a user to embed a YouTube video for instance, is not a mashup site. A mashup or meshup site must access third-party data and process that data to add value for the site’s users.

There are many types of mashups, such as consumer mashups, data mashups, and Business Mashups. The most common mashup is the consumer mashup, which are aimed at the general public. A good example is Google Maps.

Data mashups combine similar types of media and information from multiple sources into a single representation.

Business mashups focus data into a single presentation and allow for collaborative action among businesses and developers.

Mashups can also be used with software provided as a service (SaaS). After several years of standards development, mainstream businesses are starting to adopt Service-oriented Architectures (SOA) to integrate disparate data by exposing this data as discrete Web services. Web services provide open, standardised protocols to provide a unified means of accessing information from a diverse set of platforms (operating systems, programming languages, applications). These Web services can be reused to provide completely new services and applications within and across organisations, providing business flexibility.

Our latest posts