Channel: big data – DATA TO DECISIONS

Cloudera Introduces RecordService For Security, Kudu For Streaming Data Analysis


Cloudera announces access control layer and columnar storage engine for Hadoop. News analysis from Strata + Hadoop World 2015.

It’s going to be a big week for big data as Strata + Hadoop World 2015 kicks off at New York’s Jacob Javits Center. Ahead of the show, Cloudera made two big announcements on Monday as it introduced RecordService and Kudu.

RecordService is an all-new security layer for Hadoop designed to ensure secure access to data across the analysis engines that run on top of Hadoop, including MapReduce, Apache Spark, and Cloudera Impala. Where Cloudera’s previously available Sentry security component supports the definition of access-control permissions, RecordService complements Sentry by enforcing access down to the column and row level. It also supports dynamic data masking, according to Cloudera, bringing uniform, granular access control to any framework or analysis engine that plugs into the RecordService API.
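To make the column- and row-level model concrete, here is a minimal sketch of the kind of Sentry-style policy RecordService is meant to enforce across engines. It assumes a Sentry-enabled Impala deployment reachable through the impyla client; the host, role, group and table names are hypothetical, and the exact grant syntax can vary by release.

```python
# Hedged sketch: define Sentry-style grants through Impala so a layer like
# RecordService can enforce them uniformly for Impala, Hive, Spark and MapReduce.
from impala.dbapi import connect  # pip install impyla

conn = connect(host="impala-coordinator.example.com", port=21050)  # hypothetical host
cur = conn.cursor()

# Create a role for analysts and expose only non-sensitive columns of a
# hypothetical customers table; SSN and card-number columns stay hidden.
cur.execute("CREATE ROLE analysts")
cur.execute("GRANT ROLE analysts TO GROUP analyst_group")
for col in ("customer_id", "region", "lifetime_value"):
    cur.execute(
        "GRANT SELECT({0}) ON TABLE sales.customers TO ROLE analysts".format(col))

# With RecordService in the read path, a Spark or MapReduce job run by a member
# of analyst_group would see only the granted columns (and, per Cloudera, masked
# or filtered rows), without any per-engine security code.
```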

#Hadoop, @Cloudera, #bigdata, #analytics, #infosecurity, #streamingdata, #realtime

Ahead of Strata + Hadoop World 2015 in New York, Cloudera introduced the RecordService data-access layer and the columnar Kudu storage engine for Hadoop. 

Kudu is a new storage engine that Cloudera has developed to complement the HDFS file system and HBase database. HDFS lets you add and scan high-volume data, but it doesn’t update data. The HBase database, meanwhile, supports random reading and writing, but it’s not a fit for fast scanning of aggregated data. Kudu brings a columnar data store to Hadoop that’s aimed at real-time analytical applications such as fraud detection, ad placement and distributed, Internet-of-Things-style apps.

RecordService and Kudu are being released as beta software downloads this week, and Cloudera says it intends to donate both open-source projects to the Apache Software Foundation (ASF). General availability will depend on feedback and revisions required, according to Cloudera, and it did not specify when it will contribute the technologies to ASF.

MyPOV on RecordService and Kudu

Despite the open-source billing, there’s little doubt that both of these technologies will be promoted predominantly by Cloudera and used mostly by its customers. Cloudera rival Hortonworks previously introduced the open-source Ranger project for data-access authorization, control and auditing within Hadoop. Which security project is likely to see wider support? Cloudera mentioned MapR, which currently supports Impala, as showing interest in RecordService. Hortonworks, meanwhile, is a member of the Open Data Platform initiative (ODPi), which counts IBM, Pivotal, SAS and Teradata as founding members. But that group has yet to name Ranger as one of the components of the ODPi Common Core, which currently includes HDFS, YARN, MapReduce and Apache Ambari.

The good news is that Cloudera customers can now look forward to granular data-access controls across data and analysis engines, something that’s crucial to the secure use of data in Hadoop in production environments. As for disunity on Hadoop, get used to Cloudera and Hortonworks going separate ways, much like Ford and Chevy or Red Hat and Ubuntu.

The bottom line for Cloudera is that it’s the leading Hadoop distributor and a deep-pocketed partner of Intel. It can be confident that its technology will see adoption even if there are questions about vendor lock-in and dependence on the commercial Cloudera Manager software.

Real-time and streaming applications are another front on which Cloudera and Hortonworks are headed in different directions. Kudu looks like a promising complement to the current Hadoop stack, and it’s aimed at simplifying the complicated lambda architectures that have emerged. Matt Brandwein, Cloudera’s director of product marketing, tells me Kudu will work with Apache tools including Kafka and Spark Streaming and will enable developers to remain within the Hadoop stack when building streaming applications.

That sounds like a jab at Hortonworks DataFlow, which is based on its recent Onyara acquisition and Apache NiFi. While both Kudu and DataFlow are aimed at streaming use cases, they are not directly comparable. Kudu is a storage engine, whereas DataFlow is a much more comprehensive platform aimed at managing streaming data flows from edge sensors and devices to core systems, tracking data provenance, and ensuring the secure delivery and flow of data across multiple systems despite bandwidth disruptions or offline conditions. It’s also separate from Hadoop and has yet to support YARN.

If Kudu and DataFlow have anything in common, it’s that it’s early days for both projects. A lot of development and integration work lies ahead before either technology will see significant real-world adoption. The good news for practitioners is that the Hadoop ecosystem is maturing quickly to support a range of big data and low-latency applications with all the security features needed for enterprise production deployments.



Strata + Hadoop World Report: Spark, Real-Time In the Spotlight


MapR, Pivotal and Teradata were among the who’s who of big data vendors making announcements at Strata + Hadoop World 2015. Here’s a quick analysis.
If Strata + Hadoop World 2014 seemed to be all about Hadoop, Hadoop, Hadoop, the 2015 buzzword was Spark, Spark, Spark. Initial impressions aside, there was more going on than adoption of that notable open-source, in-memory data processing and data analysis framework. Here’s a quick rundown of a few of the bigger announcements, along with my analysis of the developments.

@MapR, @Pivotal, @Cloudera, @SnapLogic, @Syncsort, @Talend, @Oracle, @Teradata, @AtScale, @DataTorrent

Advanced analytics, Spark support and real-time apps were among the big themes at Strata + Hadoop World 2015.


MapR Adds In-Hadoop Document Database

MapR announced here that it has added JSON-handling document-database features to the MapR DB component of its Hadoop distribution. MapR DB is the vendor’s version of HBase, which architecturally differs from the open-source, high-scale NoSQL database in order to deliver five- to seven-times faster performance, according to MapR.

Adding an in-Hadoop document database to MapR DB will save developers time, reduce redundancy in data and infrastructure, and eliminate the time and trouble of moving and copying data to handle both transactional and analytical needs, according to MapR.

MyPOV: This combination makes sense, and it will surely appeal to existing MapR customers who are looking to do as much as possible with their MapR deployments. Will it change the dynamics of the Hadoop or NoSQL database markets? I suspect not, as organizations and developers seeking a NoSQL database will not look to make a sweeping choice of the Hadoop platform at the same time. MapR points out that you can deploy MapR DB independently, but without co-location and sharing of the data in a Hadoop cluster, the advantages largely evaporate. Think of the document database feature on MapR DB as a nice add for existing customers and one more selling point for customers looking for a Hadoop distribution and support company.

Pivotal Takes HAWQ Open Source

Pivotal this week announced contributions of many of its data engines to open source. That move started with Pivotal’s GemFire in-memory database, which became Apache Geode in April. At Strata + Hadoop World, Pivotal announced that its HAWQ SQL-on-Hadoop tool is now Apache HAWQ (incubating) and the MADlib machine learning library is now Apache MADlib (incubating). Soon, the Pivotal Greenplum database and a query optimizer shared by Greenplum and HAWQ will also be contributed to open source.

HAWQ, which is based on Greenplum, was one of the earliest SQL-on-Hadoop options based on a relational database. MADlib, which began as an open-source project in 2002, is a collection of scale-out, parallel machine learning algorithms that run in HAWQ and Greenplum.

MyPOV: Being early to the market in 2013 didn’t appear to help HAWQ win a landslide of new customers. Several databases since ported to run on Hadoop – like Actian Vortex and HP Vertica – also offer extensive SQL compliance and fast query performance, yet they, too, haven’t taken the Hadoop market by storm.

Will an Apache open-source license make a big difference for HAWQ? I suspect the big data community will continue to associate HAWQ with Pivotal, even if it’s now billed by the company as a “Hadoop Native” product. Pivotal’s most compelling big data attractions are the breadth of its analysis options and its flexible, subscription-based approach, which lets you mix, match and switch between engines without cost implications.

Spark Gains Yet More Support

There were plenty of nods to Apache Spark at Strata + Hadoop World, starting with Cloudera’s “one platform” pledge to make Spark an enterprise-class data-processing and data-analysis choice on top of Hadoop.

There was also a spate of announcements around Spark as a data-transformation and processing engine within data integration products. SnapLogic, for example, announced Spark-based big-data capabilities through the Fall release of its SnapLogic Elastic Integration Platform. In the same vein, Syncsort and Talend have also announced Spark-based data-processing options. And in an analyst briefing held by Oracle on Monday, the vendor explained that it has been working with Spark developer Databricks for more than 18 months to take advantage of the framework’s data-processing and data-analysis capabilities. Expect related announcements at Oracle Open World.

MyPOV: “Spark inside” was a common claim at Strata + Hadoop World, and it’s clear this framework is seeing broad vendor support. This is a theme we’ve seen all year, though it does not mean that the Spark core and all its components can be described as mature or production-ready. Rather than take on the risk yourself, it’s best to work with certified vendors or Databricks itself if you hope to eventually take advantage of Spark’s fast, in-memory processing and analysis.

Teradata Embraces Python

Python is an increasingly popular language for big data analytics work. As evidence, an entire workshop track was dedicated to “PyData” at Strata + Hadoop World. Responding to this interest, Teradata this week introduced the Teradata Module for Python, which it’s pitching as a boon to DevOps-enabled applications.

The Teradata Module for Python makes it quicker and easier for developers to embed SQL queries that invoke Teradata sources in their applications. Operations types like DBAs, meanwhile, gain granular visibility into the Web and mobile apps, and the new versions of those apps, that query against Teradata.
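As a rough illustration, here is a minimal sketch based on the module’s documented UdaExec pattern; the system name, credentials and query are placeholders. The appName and version metadata attached to the connection is what gives operations teams the per-application visibility described above.

```python
# Hedged sketch: embedding a Teradata query with the Teradata Module for Python.
import teradata  # pip install teradata

# appName and version are logged with every query, so DBAs can see which
# application (and which release of it) is hitting the database.
udaExec = teradata.UdaExec(appName="WebStorefront", version="2.1", logConsole=False)

with udaExec.connect(method="odbc", system="tdprod",        # hypothetical system/DSN
                     username="app_user", password="****") as session:
    for row in session.execute(
            "SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region"):
        print(row)
```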

MyPOV: Developers were already embedding SQL queries into apps, and operations teams were already dealing with Web and mobile apps invoking, and sometimes impacting the performance of, Teradata. This module should make life easier for developers and DBAs. It’s easy to guess that developers will add R and other languages to the wish list.

Hot Startups Seen At Strata

After a dozen briefings at Strata, I’ve developed a short list for deeper research. I’ll close by noting two startups that caught my attention. AtScale has a focused mission: to help organizations use their existing BI systems and tools with Hadoop. It exposes the data inside Hadoop as flat, virtual tables to SQL-based tools and as virtual OLAP cubes to tools that use MDX.

Another startup that impressed was DataTorrent, which is working on a fast streaming and low-latency batch processing platform with plentiful connectors and an easy-to-use, drag-and-drop app-development interface. The company has contributed its platform to Apache as Project Apex, and it claims faster streaming performance than both Spark and Storm.

Related Post: Cloudera Introduces RecordService For Security, Kudu For Streaming Data Analysis


Couchbase Intros 4.0 Release, Highlights Enterprise Wins


Couchbase Live New York event puts new features, NoSQL adoption in the spotlight. Watch for basics and the best fit for NoSQL in the enterprise.

This week’s Couchbase Live New York marked the general availability of Couchbase Server 4.0 and a coming out party of sorts for enterprise customers including Marriott, GE, Cox Automotive and Gannett.

In beta since June, Couchbase Server 4.0 saw 40% higher download activity than the vendor’s previous beta release, largely due to two important new features. Multidimensional Scaling lets operations teams provision infrastructure for data, query and indexing services independently, saving money where resources are sufficient while letting you deploy faster storage, more memory or more processing power where needed.

@Couchbase, #cbliveny, #NoSQL, #bigdata

Couchbase Server 4.0 introduces N1QL, a “SQL-like” query language added to support ad-hoc querying and simplify app development.

The SQL-like N1QL (pronounced “nickel”) query language introduced in 4.0 will enable Couchbase developers to eliminate complicated query code and copies of data previously required within applications. Instead, N1QL will help to streamline app development and upkeep by supporting routine ad hoc querying at the data tier. It will also make it easier to use third-party SQL-based integration and BI tools.
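For a sense of what N1QL looks like in application code, here is a minimal sketch using the 2.x-era Couchbase Python SDK; the bucket name, document fields and connection string are hypothetical, and it assumes the appropriate index already exists on the bucket.

```python
# Hedged sketch: an ad hoc N1QL query replacing hand-rolled lookup code.
from couchbase.bucket import Bucket
from couchbase.n1ql import N1QLQuery  # Couchbase Python SDK 2.x

bucket = Bucket("couchbase://127.0.0.1/travel")  # hypothetical bucket

# SQL-like query over JSON documents; $type and $city are named parameters.
query = N1QLQuery(
    "SELECT name, address FROM `travel` WHERE type = $type AND city = $city",
    type="hotel", city="Paris")

for row in bucket.n1ql_query(query):
    print(row["name"], row["address"])
```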

A highlight at Couchbase Live NY was a presentation by Marriott solutions architect Thomas Vidnovick, who explained how and why the hotel chain is retiring a mainframe-based reservations app and replacing it with a new app running on Couchbase on distributed commodity hardware. Vidnovick said Marriott went with open source for its low software and support cost, NoSQL for its JSON-based, object-oriented development, and Couchbase specifically for its high transaction throughput and ability to quickly add nodes to scale out.

MyPOV On Couchbase

Couchbase impressed in New York with presentations from several large, household-name customers. But several presenters from these companies also reminded attendees that NoSQL databases are still playing catchup on some basics. For example, Vidnovick said he’s looking forward to using new Couchbase 4.0 features including LDAP support for ID management and auditing support – surprisingly basic capabilities to see just now being added to enterprise software.

It’s also clear that NoSQL is complementing, rather than entirely replacing conventional relational databases. Cox Automotive executive Tony Selke, for example, said that his firm continues to use Microsoft SQL Server for its transactional applications. It uses Couchbase to handle the fast-changing, ephemeral data behind some 6 million automotive sales listings that are added, deleted or updated within its systems each day.

Nearly all NoSQL database vendors are still working on basics including security and administrative features, but as Couchbase demonstrated in New York, large companies are increasingly embracing NoSQL for high-scale, next-generation applications.


AWS re:Invent: Five Takeaways On New Services


Amazon challenges ‘old-guard’ IBM, Microsoft and Oracle with QuickSight, Kinesis Firehose, Snowball, and new database services. Here’s a quick rundown with cautions for would-be adopters.

Make no mistake: Amazon Web Services (AWS) is doing everything it can to make it easy for enterprises large and small to bring all, or at least part, of their data into its cloud.

At the AWS re:Invent event in Las Vegas on Wednesday, AWS announced a battery of new services designed to lower the cost and simplify the tasks of analyzing data, streaming data, moving data, migrating databases and switching to different database management systems in the cloud. Here’s a quick rundown on five new services with my take on implications for would-be customers.

@AWS, #Reinvent

QuickSight is a data-analysis, visualization and dashboarding suite introduced at AWS re:Invent. It promises ease of use and low costs starting at $9 per user, per month.

QuickSight Offers BI in the Cloud

Fast to deploy, easy to use, low cost: these are the three promises of QuickSight, which will start at $9 per user, per month – “one tenth the cost of traditional BI options,” according to Amazon.

Currently in preview, QuickSight is described by Amazon as follows:

  • QuickSight automatically discovers all customer data available in AWS data stores – S3, RDS, Redshift, EMR, DynamoDB, etc. – and then “infers data types and relationships, issues optimal queries to extract relevant data [and] aggregates the results,” according to Amazon.
  • This data is then automatically formatted, “without complicated ETL,” and made accessible for analysis through a parallel, columnar, in-memory calculation engine called “Spice.”
  • Spice serves up analyses “within seconds… without writing code or using SQL.” Data can also be consumed through a “SQL-like” API by partner tools running on Amazon including Domo, Qlik, Tableau and Tibco-Jaspersoft.
  • A tool called “Autograph” automatically suggests best-fit visualizations, with options including pie charts, histograms, scatter plots, bar graphs, line charts and storyboards. You can also build live dashboards that will change in real-time along with the data. QuickSight will also be supported with iOS and Android mobile apps.
  • Sharing options will include “single-click” capture and bundling data and visualizations (not just screen shots) so collaborators can interact with, drill down on, and change visualizations. These can be shared on intranets, embedded in applications or embedded on public-facing Web sites.

MyPOV on QuickSight: Amazon is clearly taking on what it called “old-guard” BI solutions such as IBM Cognos, Oracle OBIEE and SAP BusinessObjects. We’ve already seen more modern, cloud-based alternatives from these vendors, including IBM Watson Analytics, Oracle BI Cloud Service and SAP Lumira Cloud. I expect to see yet more cloud BI and analytics options announced by these three vendors within the next few weeks. Amazon is smart to be moving into BI, analysis and visualization now as incumbents are starting to put all their chips on cloud-based BI and analysis offerings.

If the no-coding, automated visualization capabilities live up to their billing, QuickSight just may compete a bit with the likes of Qlik and Tableau as well as Cognos and BusinessObjects. We’ll also have to see just what you’ll get for $9 per user, per month and just how easy and broad the collaboration options will be. The pricing is in the freemium range where IBM Watson Analytics, Microsoft PowerBI and SAP Lumira Cloud have been establishing cloud beachheads.

If QuickSight has a weakness, it’s that it appears to be entirely geared to consuming data that’s in Amazon’s cloud. Amazon said nothing about connecting to on-premises data sources. In fact, the keynote went on to describe many new services (described below) designed to bring yet more data into Amazon’s cloud. In short, QuickSight is not about hybrid scenarios; it’s about analyzing data that lives in the cloud, which will be fine for many companies, but not all.

Kinesis Firehose Handles Streaming Scenarios

Amazon already offered Kinesis Streams, a real-time streaming data option, but that service is aimed at technical users. Kinesis Firehose, available immediately, is intended to make it easier to support streaming scenarios such as mobile apps, Web apps or the collection of telemetry or sensor data (read, IoT).

Instead of writing code, “builders” create an Amazon Kinesis Firehose Delivery Stream in the AWS Management Console. These streams can then be targeted at S3 buckets or Amazon Redshift tables (with more Amazon storage options to come). Users can specify refresh intervals, chunk sizes, and compression and encryption settings.
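Once a delivery stream exists, feeding it from an application is essentially a one-call affair. Here is a minimal sketch using boto3; the stream name, region and payload are hypothetical, and it assumes the delivery stream has already been created and pointed at an S3 bucket.

```python
# Hedged sketch: pushing telemetry records into an existing Kinesis Firehose
# delivery stream that buffers and delivers them to S3 (or Redshift).
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

reading = {"device_id": "pump-17", "temp_c": 81.4, "ts": "2015-10-07T14:03:00Z"}

firehose.put_record(
    DeliveryStreamName="telemetry-to-s3",  # hypothetical stream name
    Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")})

# Firehose applies the buffering interval, chunk size, compression and
# encryption settings configured on the delivery stream before writing out.
```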

MyPOV on Kinesis Firehose: This is a welcome refresh and addition to the Kinesis family. Kinesis Firehose seems to add some of the same visual, point-and-click/drag-and-drop, API-based approaches seen in recently introduced IoT suites, such as Microsoft’s Azure IoT suite. Ease-of-use and self-service seem to be the guiding principles for all data and analysis services these days, and Kinesis Firehose brings streaming capabilities closer to business users.

AWS Snowball: Chuck Bulk Data At AWS

Even if you have fat, dedicated pipes (at great cost), it takes a long time to move big data. Amazon has an existing Import/Export service that lets you ship data overnight to AWS on one-terabyte disks, but even that’s a time-consuming and error-prone approach when moving tens of terabytes or more.

@Amazon, @AWS, #Reinvent

AWS Snowball is a shippable appliance designed to move 50 TB at a time to the cloud.

Amazon Snowball, available immediately, is a PC-Tower-sized appliance in a shipping box that stores up to 50 terabytes at a time. A digital-ink (Kindle-based) shipping label displays your address on the outbound trip and then automatically resets to Amazon’s address once you load data and are ready to send it back via FedEx or UPS.

Data is encrypted as it’s stored in a Snowball, and you can run several Snowballs in parallel if you need to move hundreds of terabytes or more. The service also validates that all data that’s stored on the device is uploaded and readable on AWS storage. The cost per shipment is $200, and Amazon says it’s the fastest and most cost-effective option available for bulk data loading to the cloud.

MyPOV on Snowball: Not much to quibble with here. Amazon has thought through a tough problem and has come up with an easier way to ship bulk data with provisions for security, damage- and tamper-resistant shipping, and fool-proof labeling.

AWS Database Migration Service: The Last Mile

Amazon launched two new services promising “freedom from bad database relationships.” The “old guard has treated you poorly,” said Amazon executive Andy Jassy, who cited lock-in, software audits and fines. So it’s “only natural” that companies are “fleeing expensive, proprietary databases,” he said.

The challenge, admitted Jassy, is that migrations are fraught with risks. Do you try to keep services running and hope you can pull off a seamless transition? Or do you shut down one service and then start another, hoping that you can minimize downtime? Another challenge is migrating from a commercial database, like IBM DB2 or Oracle, to a lower-cost cloud alternative, like Amazon Aurora, PostgreSQL or (just added to AWS as a MySQL replacement) MariaDB.

The Amazon Database Migration Service will “easily migrate production databases with minimal downtime,” according to Amazon. It’s said to ensure continuous replication from source to target for any size database, with tracking and real-time monitoring. The tool takes 10 to 15 minutes to set up, says Amazon, and the cost of the service is about $3 per terabyte.

MyPOV on Database Migration: This is another well-intended service where any improvement offered will be welcome, but Amazon anticipated the major concerns (mine and yours) by also introducing the AWS Schema Conversion Tool described below. Amazon threw the term “open” around quite a bit, but keep in mind that Amazon RDS, Aurora and Redshift are no less proprietary than DB2 or Oracle. Lower cost, certainly, but lock-in is a factor to consider here, too.

AWS Schema Conversion Tool: The First Mile

Migrating an on-premises database to the cloud (say, MySQL on-premises to MySQL running in the cloud) is hard enough. Migrating from one database management system on-premises (say, Microsoft SQL Server or Oracle) to an alternative DBMS running in the cloud (say, MariaDB replacing SQL Server or Aurora replacing Oracle) is harder still. Amazon acknowledged as much by introducing the AWS Schema Conversion Tool.

@AWS, @Amazon, #Reinvent

New AWS Database Migration and Schema Conversion services are both intended to ease database moves, but schema conversion is the tougher task.

Migrating data from one database to another is really the easy last mile. The trickier part is all the preparation work you have to do before the big move. The Schema Conversion Tool is said to “reliably and easily” transform the schema from one database type to another, swapping in like-for-like tables, partitions, stored procedures and more, according to Amazon. “We think we’ll be able to address 80% of changes automatically,” said Jassy. “This will radically change the cost structure and speed of moving from the old world to the cloud world.”

MyPOV on Schema Conversion Tool. Good for Amazon for taking its assistance a level deeper than the Database Migration Service. But practitioners should keep in mind that there will still be plenty of testing and quality-assurance work to do even if the Schema Conversion Tool manages to handle 80% of changes automatically.

Here again we’re talking about a well-intended service wherein any improvement over the status quo will be welcome. Just don’t expect this optimistically named tool to automagically move or swap your DBMS without breaking things in the process. That will certainly be the case when switching from one DBMS to another. Apply the rule of thumb often used in home-improvement projects: Count on it taking twice as long and costing twice as much as your initial estimates.

MyPOV on the Big Picture From re:Invent

Amazon offered an impressive collection of announcements at re:Invent. It also didn’t hurt to have executives from the likes of CapitalOne and GE talking about how they’re consolidating data centers and moving huge chunks of their workloads to AWS – 60% in GE’s case. Keeping things real, it was refreshing to see Amazon acknowledge that many large enterprises (I believe most) will ultimately stick with hybrid cloud approaches. GE, for example, said it’s moving all non-differentiating workloads to AWS while anything related to “company crown jewels” will remain in company-owned data centers.

Where all these new data services are concerned, I think it’s also important for companies to think hybrid rather than putting everything in the cloud. QuickSight might satisfy many self-service data-analysis scenarios, but that won’t eliminate the need for carefully curated data and analyses for high-value decision-support scenarios and mission-critical business-monitoring tasks. As for putting 100% of your data in the cloud, apply the standard advice to investors and diversify. So maybe you go cloud, but you mix public and private cloud.

The point is to not put all of your eggs in one basket. The Weather Company, for one, has split its B2C and B2B clouds between the AWS and IBM clouds, respectively. Keep competition and the threat of lock-in in mind when crafting a new data-management strategy and making long-term plans for how and where you support BI and advanced analytics for routine versus “company crown jewel” needs.


Oracle Open World 2015: Three Important Cloud Services


Oracle Open World 2015 announcements included three data-related standouts: Oracle Cloud Platform For Big Data, Oracle Data Visualization Cloud Service, and Oracle Data Cloud. Here’s a deeper dive on what stands out.

Our cloud is integrated, it’s ready, and it’s bigger than and superior to any rival cloud. That was the big-picture message Oracle offered customers at Oracle Open World 2015. What’s more, the cloud is where most customers will soon be headed, said Oracle CEO Mark Hurd, predicting that “virtually all” enterprise data would be stored in the cloud by 2025.

It would be impossible to detail all two-dozen-plus announcements made last week, so I’ll narrow things to my Data-to-Decisions (D2D) research domain and focus on three announcements that caught my eye: The Oracle Cloud Platform For Big Data, the Data Visualization Cloud Service, and the Oracle Data Cloud. With all three, Oracle is trying to differentiate itself in the cloud.

Oracle Cloud Platform For Big Data

Oracle’s Hadoop-in-the-cloud Big Data Cloud Service was announced at Oracle Open World 2014. At Oracle Open World 2015 the company reannounced a family of supporting services newly packaged as the Oracle Cloud Platform for Big Data. The components include Oracle Big Data Preparation Cloud Service, Oracle GoldenGate Cloud Service, Oracle Big Data Discovery Cloud Service and Oracle NoSQL Database Cloud Service. The idea is to surround the Hadoop service with a breadth of data-prep, data-movement and data-analysis options.

@Oracle, #OOW15

The Oracle Cloud Platform for Big Data combines platforms (database, Hadoop, NoSQL) and analysis options.

I say “reannounce” because Oracle talked about all of these services back in June; the “Cloud Platform for Big Data” is simply a new brand. The Oracle Big Data Preparation service is aimed at preparing and enriching semi-structured and unstructured data, such as clickstreams and social data. Under the hood it makes use of Apache Spark machine learning, YAGO entity-resolution training sets and Apache UIMA natural language processing.

The GoldenGate Cloud service is based on Oracle’s well-known data-replication software. This service is designed to pump data in near real time into Oracle Database Cloud Service, Exadata Cloud Service, Hive, HBase, HDFS, Flume and Kafka (in this case, in the cloud). It’s a complement to low-latency, data-streaming applications, such as those in IoT scenarios.

The Oracle Big Data Discovery Cloud Service is a business-analyst-oriented tool for exploring, transforming and discovering data within Hadoop (again, in this case running in Oracle’s cloud). Data Discovery first samples, profiles and catalogs all available data. Machine-learning algorithms then surface interesting correlations and offer suggested visualizations for exploring attributes. Search and guided navigation features support data exploration.

Filling out Oracle’s Cloud Platform is the NoSQL Database Cloud service, which is Oracle’s high-scale key-value store database delivered as a service.

MyPOV: Not to be forgotten is the Oracle Big Data SQL Cloud Service, which does SQL querying across Oracle Database Cloud services, the Big Data Cloud Service and the NoSQL Database Cloud Service. Taken together, it’s a broad (if Oracle-centric) collection. IBM has a broader collection of IBM and open-source-based services on BlueMix, and Amazon Web Services has more customers using its cloud. But Oracle is building out an impressive portfolio, and the company’s dominant database position will surely feed cloud growth.

Oracle Data Visualization Cloud Service

Nearly every BI vendor has introduced a data-visualization module in recent years in response to the fast growth of Tableau Software. Oracle has evolved what it offers for the cloud with the Oracle Data Visualization Cloud Service. Set to become generally available in November, this new stand-alone service is based on capabilities first seen in the Oracle Business Intelligence Cloud Service introduced in April.

@Oracle, #OOW15

The Oracle Data Visualization Cloud Service offers 18 charting options and a palette of colors and shapes for depicting data.

The Oracle Data Visualization Cloud Service will enable you to link to on-premises and cloud data sources (both from Oracle and third parties) as well as your own spreadsheets. There are 18 different types of visualizations and a palette of colors, shapes and sizes for depicting data points.

Execs at Oracle Open World made a point of saying “all you need is a browser.” That’s because with Tableau’s cloud service, Tableau Online, users author charts and dashboards with the desktop client and then publish to the cloud for collaboration. Tableau is working on bringing full authoring capabilities to the cloud. And, indeed, Oracle is working on a desktop client for times when you need to work offline.

MyPOV: Oracle execs claimed at Oracle Open World that the new service is “more modern than Tableau.” That starts with full authoring capabilities in the cloud, but I’m not seeing some of the other differences claimed. The press release says Oracle’s service “eliminates the complexity typically associated with blending and correlating data sets,” but Tableau also automatically finds joins when mashing up data sets. Both products also select best-fit visualizations automatically based on the dimensions of data used in an analysis. This auto-charting capability has been around for a while, and it’s also present in SAS Visual Analytics and IBM Watson Analytics.

For a real head-to-head comparison with Tableau, I want to investigate Oracle’s performance characteristics and its connection capabilities (once this service is available). Tableau’s strengths include its in-memory engine and its live-data-connection capabilities with multiple databases, apps and cloud services, including multiple connection options with Amazon, Google, IBM, Microsoft, Oracle, Salesforce, SAP and others. Will Oracle match that? I also want to tour the “fit and finish” of the visualizations and “storytelling” capabilities. Some of the charts seen at Oracle Open World looked hard to read, but that may be due to the data-filtering and presentation inexperience of the demonstrators.

Yet-to-be-released pricing details from Oracle will also be key to any comparison, but to me, these visualization capabilities are most attractive when teamed with the Oracle BI Cloud Service. That’s because it’s not only a data-exploration and visualization service; you also get the database and reporting functionality. Here, too, the more Oracle-centric you are in the on-premises world, the more attractive the cloud options will be.

Oracle Data Cloud

Several new features of the Oracle Data Cloud were announced at Oracle Open World, but a larger context emerged last week with IBM’s announced intent to acquire The Weather Company. Thus, I was eager to learn more about Oracle as a data provider. Oracle Data Cloud is built on technology, data and analytics expertise picked up in the BlueKai and Datalogix acquisitions. Talking to execs from both companies now leading Oracle Data Cloud, I came away impressed.

@Oracle, #OOW15

Oracle Data Cloud offers data from more than 1,500 specialty retailers and 30 supermarket loyalty cards.

Oracle Data Cloud offers data from more than 1,500 CPG and specialty retailers across 110 million US households. With data-enrichment and predictive analytics options on top of this data, Oracle can find likely buyers by product and category.

MyPOV: Having data and being able to enrich that data and apply predictive analytics is the name of the game in marketing, and these initiatives are moving into the sales and service arenas as well. In the business-to-business arena, Oracle Data Cloud can enrich your data with Dun & Bradstreet information to find look-alikes of your best customers. A next step is bringing service data full circle back into your understanding of customers to drive efforts such as retention campaigns.

Many tech vendors are introducing libraries of third-party data that are integrated with their offerings. But big guns like IBM and Oracle are stepping up to become primary data providers. Expect to see data from outside of your organization becoming a bigger and bigger part of your future success.


IBM Analyst Forum: A Tale Of Two Titans


IBM displayed two sides of the company at its annual analyst forum last week: the mighty tech titan and the agile new design-driven Big Blue. Here’s my take on a company in transition.

IBM is a mighty tech titan that can make new markets by sheer force of will. It’s also a practical, agile and focused company that can dream up innovative products and services that give customers exactly what they need.

Both of these IBMs were heard from at last week’s annual IBM Analyst Forum in Stamford, Connecticut. In the mighty vein, analysts heard all about how cognitive computing, meaning IBM Watson, will set the company apart. We also heard executives talk about a new Big Blue that is moving beyond CIO/IT-oriented selling and delivering more actionable solutions with more hands-on technical guidance for line-of-business buyers.

A key component of the new Big Blue story is the embrace of design thinking, which The New York Times last week boiled down to “building what customers want, rather than building something and then trying to convince people that they want it.” After hearing about IBM’s many plans and initiatives last week, I’d say some plans and initiatives seemed design-thinking inspired while others sounded like big, market-making bets. Here, then, is my take on the two IBMs.

IBM Analytics on Spark

Analytics has long been a cornerstone for IBM, but it’s an area that’s understandably in transition due to larger market forces moving to open source, distributed computing options. IBM is adapting to increasingly popular open-source platforms including Hadoop and, more recently, Apache Spark. IBM embraced Spark in a big way in June, and it’s now porting plenty of IBM software to run on that open source framework.

IBM said last week that it’s embracing Spark for the same reasons it embraced Linux in 1999. I take that to mean it’s hoping to leapfrog competitors (like SAS and integrators like Accenture) by leading a wave of open source technology adoption. IBM even described Spark as “the analytics operating system,” and executives predicted Spark will have the same level of market impact as did the IBM 360 computer and Linux.

Spark has seen plenty of vendor adoption and support, but IBM insists it’s now at the head of the pack, poised to become the top contributor and committer to Spark. The company also pledged it will train more than one million data professionals on Spark. That will happen through the IBM-sponsored Big Data University community as well as through 11 DataPalooza events the company will hold around the globe over the next year. The three-day DataPalooza events show students how to develop data products using the new IBM Analytics for Apache Spark on IBM Bluemix. The stated goal is to “start a movement to create a belief that IBM is a trusted partner for data science success.”

MyPOV on IBM’s Spark Bet: Considering that it formally embraced Spark less than six months ago, IBM has managed to grab lots of attention. In fact, the buzz far outstrips the number of IBM-led Spark-related deployments out there in the real world. Far more numerous are customers using the company’s more mature analytics offerings around the IBM BigInsights Hadoop distribution, which include BigSQL, BigSheets and Big R.

IBM Analytics Portfolio

IBM’s analytics portfolio spans the technology platform, which is being widely adapted to run on Apache Spark, as well as cloud services, industry solutions, an IoT suite, cognitive computing and consulting services.

Spark is a long-term bet for IBM. For now it’s sounding like the mighty tech titan touting “the largest investment in Spark of any company in the world.” But for this Spark bet to pay off, we need the story line to evolve and start getting into the compelling services and capabilities the company will offer. IBM says it has at least 16 products and services that now run on Spark, including IBM IoT on Bluemix, IBM Dataworks on Bluemix, IBM Swift Object Storage, the IBM SPSS Analytic Server and the IBM SPSS Modeler.

IBM is clearly trying to wear a white hat and win new customers with its training options, but I’m more interested in hearing about how customers will use (and are using) these new, reborn and refactored products and services on Spark. In my mind, Spark is definitely a destination for data-science workloads, but I’m not sure big data types see it as an analytics “operating system” on which they want to run commercial tools.

IBM Insight Cloud Services

The new design-driven Big Blue came through in last week’s discussion of IBM Insight Cloud Services. When these services were discussed at last month’s IBM Insight event, I came away wondering how much custom app-development and consulting work would be required to put them into action. At the Analyst Forum I was pleased to learn that these are very purpose-built services designed to bring actionable insights into well-defined work environments.

A Social Merchandising Insight Cloud Service, for example, is aimed at common questions asked by retail merchandizers. These individuals have to gauge weather patterns, social comments, topical news, and internal supply chain forecasts. This cloud service analyzes data on all of the above to help merchandizers trigger targeted promotions, spot affinity opportunities, and optimize inventories in a timely way.

In another example IBM talked about a new Fan Insight Cloud Service developed for sports and entertainment organizations, and it described how the service is being used by the Ottawa Senators NHL hockey team, an early customer. The service helps marketers and ticketing teams make sense of fan attitudes and behavior trends so they can boost brand awareness and fan interaction and better target renewal, upsell and incentive offers.

IBM says it will deliver more than 20 of these cloud services. It’s using high-scale, high-speed infrastructure licensed from the soon-to-be-acquired Weather Company to deliver the data, and it’s also weaving that company’s weather data into many of the cloud services.

MyPOV On Insight Cloud Services: These services are far more targeted than I expected, down to user roles and integrations with tools typically used. That’s good news and in line with a line-of-business-oriented selling approach discussed by several executives last week.

IBM has long excelled in selling to the CIO and IT. Here’s where the mighty IBM – trust us, we have one of everything – style of selling has worked for years. But tech buying is shifting to LOB decision makers. These buyers are far more focused on quickly implementing products and services that address their specific problems, so IBM says it’s bringing more technical depth and architectural understanding into selling situations. This is a positive sign that IBM’s field sales approach is evolving.

IBM Insight Cloud Services

IBM Insight Cloud Services are focused on well-defined use cases and are designed to bring actionable insights to known roles and supporting tools.

MyPOV On The Two Faces of IBM

For years we’ve heard IBM boast of investing billions of dollars here and billions of dollars there to go after new businesses. The company conjures up stats about how many zillions of zettabytes are being created or how many millions of documents are being published, suggesting that only IBM has the heft to solve such problems. But instead of emphasizing how tough the problems are and how enormous the investments have been by IBM, why not focus on the clever, practical and approachable solutions IBM has dreamed up for real-world, well-defined problems?

The scenario of being overwhelmed by information is a solid selling point for cognitive computing. IBM is promoting Watson as an advisor to doctors who can’t possibly keep up with all the latest medical journals and clinical trials each year. It’s also developing Watson to serve as a financial advisor, assisting professionals who can’t possibly keep up with all the stocks, funds and dynamics in the market. Both of these problems are very real, but these advisor roles don’t exist today. IBM is trying to create new markets.

In other cases cognitive is being aimed at existing roles. Last week, for example, exec Mike Rhodin mentioned cognitive in passing as a tool for automating call centers to cut costs by as much as 40%. “That’s real money,” he said, almost apologizing for such a practical use of cognitive computing. IBM has announced pilot deployments in this call center role, and I’d love to know whether those early customers are ready to talk about proven cost savings. In this same vein, Wipro is training its Holmes cognitive computing technology to automate the manual data-collection-and-analysis steps required in anti-money-laundering “Know Your Customer” compliance processes. Sounds like Wipro is less shy about focusing on practical uses of cognitive computing.

Given IBM’s imperative to drive revenue growth, I would think the focus would be on low-hanging fruit where companies see near-term benefits. To my mind, the focused, agile IBM that’s applying design thinking to specific customer challenges is compelling. The mighty IBM that spends $1 billion here and $1 billion there to go after grand challenges and forge all-new markets seems out of step. I’m hoping the new Big Blue is the one that’s more in evidence in the coming year.


Informatica Takes Comprehensive Approach To Big Data Management


Informatica addresses big-data integration, governance and security through a subscription-based portfolio. But will big data remain a separate world?

One irrefutable trend in 2015 has been growing enterprise adoption of big data platforms including Hadoop and NoSQL databases. Yet as companies move beyond big data pilot projects and try to do more with their deployments, many struggle to achieve repeatability and productivity. Short on experienced talent, companies look for any way possible to avoid one-off coding and development work.

Enter Informatica, which last month introduced Informatica Big Data Management, a three-part offering aimed at big data integration, governance and security challenges. Big Data Management replaces Informatica’s PowerCenter Big Data Edition with a separate, subscription-based product line aimed exclusively at big data environments.

Informatica Big Data Crosses Chasm

Informatica addresses traditional use cases (lower-left) with PowerCenter and “next-gen” use cases with subscription-based Big Data Management.

The first component is Big Data Integration, which runs on Hadoop and promises to save companies time, trouble and, therefore, money over hand-coded data-integration, data-transformation and development work. This has been Informatica’s value proposition in the data warehousing arena for decades. Big Data Integration extends the promise to complex big data environments with intense volumes and varieties of batch and streaming data.


The other two components of Big Data Management are Informatica Big Data Quality and Governance, and Informatica Big Data Security. These may be of less immediate interest to companies that are just starting out with big data experiments. But Informatica argues these components will become increasingly important as the number and diversity of big data sources and projects grows.

The one question I have is how long enterprises will continue to view and treat traditional data warehouse environments and big data projects as separate worlds.

Big Data Integration

Data-integration products are all about connecting to data, so vendors in this space invariably tout their portfolios of pre-built connectors. In Informatica’s case, Big Data Integration offers more than 200 connectors that speed and simplify access to data (as compared with the hand-coding) and ingestion to Hadoop and NoSQL databases. The portfolio includes two-way integrations to modern big-data platforms, real-time sources and cloud-based apps and databases. Big Data Integration also offers 100-plus data-transformation and parsing routines, including options to handle variable and semi-structured data and big-data world formats such as JSON, Avro and Parquet.

Flexibility is another draw for Big Data Integration. Running on Hadoop, it supports not just MapReduce for batch processing and Apache Spark for fast, in-memory batch or streaming-data processing, but also Tez and Informatica’s own high-performance Blaze engine, which offers familiarity to veteran PowerCenter users while taking advantage of distributed processing power and Hadoop’s YARN management layer. With all these options at its disposal, Big Data Integration can intelligently and automatically execute each workload on the best-suited engine, according to Informatica. Here, too, the idea is to speed execution while taking manual work steps out of big data projects.

Adding Data Governance and Security

Where some big-data management offerings begin and end with ETL, Informatica Big Data Management also addresses data governance and security. Governance and data quality are always important, but their importance increases as the uses of big data multiply. That’s when the collaborative stewardship capabilities of Informatica Big Data Quality and Governance help ensure that all appropriate data stakeholders are involved in setting data definitions and standards. The various constituents get role-specific interfaces, and policy-based workflows, approvals and auditing features ensure compliance.

Big Data Quality and Governance is also about helping to find the value in big data. For example, a Live Data Map powered by Spark provides a universal metadata catalog and knowledge graph for enterprise data. This supports searching, matching and linking among transactional data, machine data and social data to illuminate behaviors and better understand customers, prospects and influencers.

Informatica started stepping up its security capabilities earlier this year when it released Secure@Source, which analyzes the metadata in Informatica PowerCenter repositories and spots sensitive data, such as payment card and personally identifiable and personal health information, as well as systems, groups and departments at risk.

Informatica Big Data Security brings Secure@Source capabilities to big data, uncovering sensitive data and spotting the business units and individuals that have access to that data. The software gives data and security professionals insight into how such data is used and whether it’s adequately protected. Visualizations pinpoint sensitive data by geography and function while risk analytics highlight vulnerabilities that demand immediate remediation. Once alerts are raised, policy-based protections and dynamic data masking can be used to secure or de-identify sensitive data.

MyPOV on Informatica Big Data Management

Informatica previously offered PowerCenter Big Data Edition, its traditional data-integration suite with added big data management capabilities and a selective ability to run on Hadoop. But that’s being replaced by Big Data Management, which is a separate, subscription-based product that’s a better fit with the times, according to Informatica.

Big data projects are graduating from exploratory experiments to separately budgeted initiatives with executive sponsorship and increasingly mission-critical expectations, according to Informatica. Thus, the time is right for a separate product, Informatica execs reason. The subscription-based approach (with software deployed on-premises but paid for annually) is in keeping with the way Hadoop subscriptions are typically handled. This keeps initial costs down and helps businesses scale up as data volumes grow.

It’s good to see that Informatica has not taken an all-or-nothing approach with Big Data Management, offering an Enterprise edition that skips some of the data-quality, data-profiling and data-masking features included in the Premium edition. What’s more, the Big Data Security offerings are entirely optional, which is a good thing as security professionals with budget and technology oversight may not yet be familiar with Informatica.

My one concern about Informatica’s separation of its traditional and big data product lines is that we will eventually see the pendulum swing the other way. As big data sources, integration needs and data-quality and data-governance concerns become more commonplace, they will become the prevailing data challenges in the enterprise. Indeed, Informatica depicts a “crossing the chasm” movement from traditional data-warehouse use cases to “next-gen” big data use cases (see diagram above).

When the transition is complete, big data will just be data. And when data-management professionals stop looking at big data as a separate world, they may well want and expect a single portfolio to address all data management needs, whether large or small. I suspect that day may come sooner than many expect.


Top 5 Data-to-Decisions News Trends of 2015


Apache Spark, real-time, cloud BI & analytics, IoT, and self-service were the trends to watch in 2015, and they’ll continue to make waves into 2016.

It’s that time of year again. Here’s a look back on the big stories of 2015 and a look forward on the trends that will carry on into 2016.

1. Spark Lights A Fire. In 2015 companies and vendors started to realize that the opportunity with big data isn’t just to scale up BI and the data warehouse. Thus the Apache Spark open source framework and other analytical options that go beyond SQL were hot in 2015. Spark was embraced by scores of vendors and hundreds of big companies in 2015. IBM was the most visible vendor advocate, but plenty of other data-integration and big data platform companies joined the bandwagon.


Spark 2015 Vision

It takes more than SQL to make sense of big data. Spark couples in-memory performance with SQL, streaming, machine learning and R-based data-analysis options.

A key driver of the interest in Spark’s ensemble of analytics – which encompasses SQL, R, graph analysis and machine learning — is the fact that in an increasingly digital world, companies are generating and need to analyze a variety of data types. As companies do more marketing and business online, for example, clickstreams, social data and mobile data become much more important. SQL is good for analyzing the transactional data behind those interactions, but graph, machine learning and other techniques shine with these new types of data. And companies want to look forward, not just back, so they can make the right moves to maximize sales and profits. That’s what is driving interest in predictive capabilities, such as those available in R.
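As a concrete, if simplified, illustration of that ensemble, here is a sketch in the PySpark API of the Spark 1.x era that mixes SQL and MLlib over the same in-memory clickstream data; the HDFS path and column names are hypothetical.

```python
# Hedged sketch: SQL and machine learning on the same Spark data set,
# with no hand-off between separate systems.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="clickstream-demo")
sqlContext = SQLContext(sc)

# Hypothetical clickstream events landed as JSON in HDFS.
clicks = sqlContext.read.json("hdfs:///data/clickstream/2015/*.json")
clicks.registerTempTable("clicks")

# SQL answers the backward-looking question...
top_pages = sqlContext.sql(
    "SELECT page, COUNT(*) AS views FROM clicks GROUP BY page "
    "ORDER BY views DESC LIMIT 10")
top_pages.show()

# ...while MLlib clusters visitor sessions for forward-looking segmentation.
features = clicks.select("session_seconds", "pages_viewed") \
                 .rdd.map(lambda r: [float(r[0]), float(r[1])])
model = KMeans.train(features, k=3, maxIterations=10)
print(model.clusterCenters)
```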

2. Real-time gets real. Streaming data analysis (a.k.a., real-time data analysis) was another hot category in 2015. Recent announcements by Amazon, Cloudera, Confluent (the Kafka support company), Microsoft, MapR, SAP and plenty of others point to the demand for low-latency data capture and analysis capabilities. Online advertising, marketing and retail scenarios have been a big driver, as companies seek to trigger ads, launch campaigns, and serve up cross-sell and up-sell offers while customers are still online. Real-time fraud detection, risk analysis, IoT (see below) and security threat detection are other scenarios where time is of the essence. Look for the wave of real-time announcements to continue into 2016.

3. Cloud-based analytics and business intelligence options take off. Some vendors (like Birst and BusinessObjects) were very early to cloud-based (software-as-a-service-style) business intelligence. But the first generation of options that emerged seven to ten years ago didn’t exactly set the world on fire. Early pioneers including LucidEra, Oco and PivotLink didn’t survive.

Tableau Online Data Growth Rates

This gauge of cloud-based data analysis versus on-premises-based data analysis, as measured by Tableau Online, explains why cloud-based BI and analytics services are finally taking off. Tableau says Amazon Redshift, Google Analytics, Google BigQuery and Salesforce are the top sources.

Times have changed. Now that huge volumes of data are originating and accumulating online (think Amazon Redshift, Google Analytics, Google BigQuery and Salesforce), cloud-based BI and analytics options are starting to take off. Vendors including IBM, GoodData, Microsoft, Oracle and SAP all upped their stakes in cloud-based BI in 2015, while upstarts Tableau and Qlik prepared to deliver deeper cloud services in 2016. Stay tuned for yet more announcements in 2016 and check out these six tips for success in cloud-based data analysis.

4. IoT Services: Will they bear fruit? The question isn’t which vendor did but which didn’t introduce an IoT suite or IoT-related services portfolio in 2015? The list of players announcing new or expanded IoT capabilities in 2015 included IBM, Microsoft, Salesforce and SAP. (Others deeply invested prior to 2015 included General Electric, Cisco and Intel, among others.)

In my view, IoT is often a new marketing spin on analytics against sensor-based data – something in use in manufacturing and industrial settings for quite some time. Add in ingredients including geospatial data, automotive telematics, smart mobile devices and, of course, Internet-based apps and services, and you can give it a more modern “of things” twist.

The question for 2016 is how quickly businesses will adopt and prove out real-world, IoT-based use cases. Our deepest thinker on this topic, Andy Mulholland, points out that there's a last-mile problem: the IT infrastructure and services are useless if you can't make the final connections and make sense of the data streaming from the sensors and devices on the front lines. In Andy's book, line-of-business people are crucial to IoT deployment success, and IT-centric suites and vendor portfolios won't succeed without business leadership of IoT initiatives.

5. Self-service options proliferate. The trend toward self-service reporting and data analysis emerged five to seven years ago. That trend is now moving into new areas, and it reached a fever pitch in self-service data prep in 2015, with announcements from data-integration vendors like Informatica, SnapLogic and Talend, and from BI vendors including Qlik (Smart Data Load) and Logi Analytics.

Alteryx

Self-service started in the BI realm with the likes of Qlik and Tableau. Vendors including Alteryx have extended the trend to self-service data-prep and self-service advanced analytics for data-analyst types.

Interest in self-service advanced analytics is also on the rise. One of my most important reports of 2015 was “The Era of Self-Service Analytics Emerges,” which looked at leading examples including IBM Watson Analytics, SAP Lumira and SAS Visual Analytics/Visual Statistics. Other vendors and products pursuing this self-service advanced analytical trend include Alteryx, Microsoft Power BI, Qlik and Tableau.

Hope you enjoy the read on these trends as we head into 2016 and have a happy and healthy New Year!



Salesforce Reboots Wave Analytics, Preps IoT Cloud


Salesforce Analyst Summit 2016 highlights Wave Analytics Cloud makeover and work in progress on the Internet of Things strategy and Salesforce Thunder.

Salesforce has successfully rebooted its Wave Analytics Cloud. It's also apparent that the company is trying to avoid the sorts of missteps that plagued Wave as it prepares Salesforce Thunder and its Internet of Things (IoT) strategy.

These are my two key takeaways from the January 5-7 Salesforce Analyst Summit in San Francisco, where executives discussed the company’s strategy and laid out product roadmaps for 2016. Execs also acknowledged some of the mistakes that were made in the launch of Wave, which was introduced 15 months ago at Dreamforce 2014.

The first iteration of Wave was basically too expensive, too enterprise focused and packaged too much like a traditional BI platform. Starting with a platform (rather than pre-built apps) was essential, execs here insisted, because partners and customers would ultimately want and need a way to build vertical-industry and custom apps. But the market balked at the cost and complex packaging of the first-generation offering.

The second generation of Wave, introduced in September at Dreamforce 2015, is greatly simplified. For starters, Salesforce ditched separate Builder and Explorer licenses (priced at $250 and $125 per user, per month, respectively) and settled on an all-purpose platform license priced at $150 per user, per month. The company also introduced Sales Wave, the first of several planned prebuilt apps, priced at $75 per user, per month. The apps are designed to speed and simplify deployment with user- and task-specific data flows, dashboards, and templates for customizable analyses and actions.

Sales Wave templates, for example, provide a head start on analyzing sales levels, team performance and pipeline health. Prebuilt historical analyses assess revenue by quarter, year-over-year rep productivity, and the length of sales cycles, among other measures. Administrators can set up triggers for recommended actions, such as resetting forecasts or prioritizing deals.

Detailing the Wave roadmap for 2016, Stephanie Buscemi, COO of the Analytics Cloud, said the previously announced Service Wave app will be available in April while a Marketing Wave app is in development. On the platform front she said Salesforce is working on a data-connector framework as well as scheduling capabilities and self-service data-prep options for Salesforce data.

MyPOV on Salesforce Wave

I believe Salesforce is finally on target with Wave's packaging, pricing and a tighter, clearer focus on offering what Buscemi called “the best analytics option for Salesforce.” Some of the original attractions of Wave, including its user interface and native mobile apps, still stand apart. And from what I hear about the appeal of prebuilt Wave apps (both from Salesforce and from customers), I expect the Salesforce Wave reboot to be a success.

Stay tuned on this front as Salesforce recently hired Microsoft veteran Bob Stutz to serve as Chief Analytics Officer. Stutz won't start until February, but I'd expect more tweaks to Wave as soon as April if adoption isn't building as quickly as Salesforce would like. Keep in mind that third-party vendors Birst and GoodData, among others, have been working on their versions of “the best analytics for Salesforce” for quite some time, but they would stress that they can provide insight beyond Salesforce.

Salesforce Thunder and the IoT Strategy

Salesforce announced its IoT Cloud Powered by Salesforce Thunder at Dreamforce 2015. The company even announced initial customers, but at that stage Thunder and the IoT Cloud were nowhere close to testing, let alone general availability. Last year I predicted we wouldn’t see Thunder until Dreamforce 2016, and based on IoT presentations and discussions at the Analyst Summit, I’m convinced that timing will hold.

What’s taking so long? Well, for starters, Thunder had its first customer pilot tests over the recent holidays, according to Adam Bosworth, Salesforce’s Chief Strategic Officer, who is spearheading the development of Thunder and the IoT Cloud. (Bosworth is a storied veteran of Microsoft and Google who’s “a Johnny Appleseed of sorts in the tech industry,” according to a recent profile in the New York Times.)

Bosworth stressed at the Summit that the company “has many months to go” before Thunder and the IoT cloud will be ready. For now he says he’s asking early customers “lots of dumb questions,” like how they intend to make money off of IoT. With so many firms “wallowing” with big data investments, he said Salesforce is intent on starting with practical, revenue-driving use cases.

Salesforce Thunder was described as a kind of enterprise service bus capable of handling high-scale batch data as well as data streaming at rates in excess of 50,000 events per second. Thunder is based on open-source components including Kafka, Cassandra and Spark, but the point is not to establish Salesforce as an IoT infrastructure player.
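Salesforce hasn't published Thunder's internals, so purely as an illustration of the kind of high-rate event ingest Kafka is built for, here's a minimal producer sketch using the kafka-python client. The topic name and payload are hypothetical and are not Salesforce code.

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

# Batching and compression are what let a producer sustain tens of
# thousands of small events per second against a Kafka cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    linger_ms=5,                 # wait briefly to fill larger batches
    compression_type="gzip",
)

# Hypothetical connected-device reading of the kind an IoT cloud would ingest.
event = {"device_id": "thermostat-42", "temp_c": 21.5, "ts": time.time()}
producer.send("device-events", value=event)
producer.flush()
```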

“When we work with industrial, automotive and connected-device companies, what they are lacking is a way to drive [IoT] adoption,” said Alex Dayon, president of products. “We have to connect IoT with the customer business processes. Our value proposition is to bridge the IoT world – the signals from the machines — with the experience of the customer.”

Talking to execs at the Analyst Summit, it was clear that there's still internal debate about just what Salesforce will deliver with its first-generation IoT offerings. There's a real danger with IoT offerings, said Bosworth, that customers will expect much more than what companies will be able to deliver. He cited the example of his wife's connected car, which needs to go to the shop much more frequently than his much older analog car of the same brand. What's more, the dealer never offers predictive insight into what's wrong with his wife's car even though the vehicle is loaded with so-called “smart” sensors.

MyPoV on Salesforce Thunder and the IoT Cloud

Meeting high customer expectations is one challenge. But Salesforce also has formidable internal technical obstacles to overcome. For example, overnight data latency is currently the standard where Wave insights are concerned, while cutting-edge deployments have reduced that data-update latency to about one hour. The trouble is that many IoT scenarios will demand near-real-time analytics, and that’s something Salesforce is still working on.

Dayon and others said the company's IoT play will initially be focused on CRM-centric use cases, but in my book, big-data scalability, streaming-data processing and related analytical capabilities all have to be there as Wave and IoT platform-level capabilities. It's another area where Salesforce will have to decide what it should offer itself, what it can leave to partners and where, in the future, it might have to rely on hyper-scale cloud partners such as Amazon or Microsoft Azure.

We’re venturing deeper, here, into questions that relate to the future of the entire company (and why there was talk of a Microsoft acquisition last year). Where our data-to-decisions research is concerned, suffice it to say that Salesforce has to do more than dabble with IoT and data-science capabilities.


IBM Adds Yet More Cloud Data Services To Its Portfolio


IBM adds cloud services for database deployment, graph analysis, predictive analytics and data-enrichment, hoping to attract developers and data scientists to IBM Bluemix.

IBM on February 4 added a handful of new data and analytics services to the list of more than 25 now available on the IBM Bluemix cloud platform. The list includes IBM Compose Enterprise, the IBM Graph database service, IBM Predictive Analytics and IBM Analytics Exchange.

@IBM, #analytics, #cloudcomputing

IBM added four new services to its Data and Analytics portfolio on February 4. The collection is just one subset of services in the IBM Bluemix cloud platform portfolio.

As is typical with press releases, the details of the announcement were cryptic and buzzword-laden, so I talked to Adam Kocoloski, CTO of IBM's Analytics Platform Services Marketplace, to get more detail. Here's what I learned along with my own analysis.

IBM Compose Enterprise: When IBM acquired Compose.io (now Compose, an IBM Company) last July, it accelerated its ability to help development teams deliver web-scale apps by enabling them to quickly containerize and deploy open-source databases on multiple public clouds (today that list includes AWS, DigitalOcean and IBM SoftLayer). Compose Enterprise puts the same platform in the hands of enterprise IT so they can deploy MongoDB, Elasticsearch, Redis, PostgreSQL, RabbitMQ and a few other distributed, open-source products in containerized fashion on their private clouds. The service eases the task of clustering, containerizing, upgrading and backing up database instances in a uniform way.

MyPOV: This is a great move, giving customers a private-cloud/IaaS containerized database deployment option. As a side note, at one point (the company says it was two years ago), Compose supported deployment on Microsoft Azure. I'm not sure why that option went away, but it would be nice to see Azure and Google on the public-cloud option list along with AWS and SoftLayer. The more portable the container, the better.

IBM Graph: This is a managed, cloud-based graph database service that employs Apache TinkerPop and the Gremlin graph traversal language as its interface. As Wikipedia describes it, Apache TinkerPop and Gremlin are to graph databases what JDBC and SQL are to relational databases. Developed in 2009, Gremlin is supported by notable OLTP graph databases including Neo4j, OrientDB and Titan as well as the OLAP graph processors Giraph and Spark. Graph approaches are well suited to use cases including IoT, real-time recommendations, fraud detection, and network analysis.

Why use IBM Graph as opposed to better-known graph databases (like Neo4j, Titan or GraphDB) delivered as a service? Kocoloski said the graph data store under the hood of IBM Graph is “an implementation detail” that's not exposed to the service user. “Because we've standardized on the TinkerPop API, we can experiment with different engines for different use cases. Many graph databases support TinkerPop, so it's possible to use Neo4j one day and swap it out for something else the next without rewriting the application.”
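For readers who haven't seen Gremlin, a traversal written against the TinkerPop API looks the same regardless of which engine sits underneath, which is exactly the swap-out flexibility Kocoloski describes. Here's a hedged sketch using a recent gremlinpython client against a hypothetical TinkerPop-compatible endpoint and a made-up person/knows graph:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Connect to any TinkerPop-enabled graph server (the endpoint is a placeholder).
remote = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(remote)

# "Friends of friends" style traversal: who do Alice's contacts know?
names = (g.V().has("person", "name", "alice")
          .out("knows").out("knows")
          .values("name").dedup().toList())
print(names)

remote.close()
```

Swapping the backing engine means changing the connection string, not rewriting the traversal.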

MyPOV: IBM had a big hand in pushing TinkerPop to join the Apache Software Foundation. I suspect it did so for the same reasons IBM put big support behind CouchDB, an open source, NoSQL rival to the likes of MongoDB and Couchbase. (It did so after acquiring Cloudant, a database-as-a-service company that based its product on CouchDB.) Yes, both CouchDB and TinkerPop are open source projects, but in both cases IBM can exert a lot of influence. While there’s plenty of competition for MongoDB and Neo4j services and support, IBM is the dominant support provider for CouchDB and the only support option for IBM Graph. In short, IBM stands to benefit more if customers use IBM Graph, a commercial offering with open source TinkerPop and Gremlin under the hood.

IBM Predictive Analytics: IBM says this new service allows developers to “easily self-build machine learning models from a broad library into applications.” As I detailed last year, IBM has put a big push on Apache Spark, redesigning or replatforming more than 15 analytics and commerce offerings to run on Spark. IBM says this “dramatically accelerates real-time processing capabilities.”

Predictive Analytics is aimed at putting predictive analytics in the hands of more users. “Not everyone is going to be comfortable dropping down into Spark and working with MLLib or writing their own Scala jobs,” said Kocoloski. “We’re trying to give people an introduction to the world of multivariate analysis and machine learning.”
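For context on what “dropping down into Spark” looks like, here's the sort of hand-rolled MLlib job the new service is meant to spare casual users from writing. This is my own illustrative sketch, not IBM's code; the training file and its (label, feature1, feature2) layout are hypothetical, with the label assumed to be 0 or 1.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="hand-rolled-mllib")

# Parse a hypothetical CSV of (label, feature1, feature2) into labeled vectors.
def parse(line):
    label, f1, f2 = (float(x) for x in line.split(","))
    return LabeledPoint(label, [f1, f2])

points = sc.textFile("training.csv").map(parse)
model = LogisticRegressionWithLBFGS.train(points)
print(model.weights)
```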

MyPOV: The key mode of democratization in this new service is an auto-modeling capability with roots in SPSS, one that is currently limited to models exposed in SPSS. Open source Spark is an enabler under the hood, but it's another service offering that leads back to an IBM commercial product.

IBM Analytics Exchange: This data exchange today offers a catalog of more than 150 public datasets along with a way to blend private data with public data to come up with fresh insights. Exchange is also a foundation for metadata and business data management in the cloud, said Kocoloski. The Exchange will ultimately offer more data sources and make it easier for customers to find relevant data sources. IBM also has a separate data-as-a-service business called Insight Cloud Services, which offers high-value data including Twitter data and data from The Weather Company (an acquisition finalized last week). The Exchange will also be “a foundational component” of Insight Cloud Services, said Kocoloski. In short, think of the Exchange as enabling plumbing for exposing and blending all sorts of data.

MyPOV: The ability to offer data in the cloud is of huge and growing importance, so the more the better. I look forward to seeing more data sets and a single, integrated catalog serving up everything from widely available public data to unique, high-value sources such as The Weather Company. I do have a minor quibble with the name “Analytics Exchange,” as it seems what we’re talking about here is a data exchange or data-as-a-service infrastructure, not a place where analytics are developed.

MyPOV Overall

The real power of these offerings is cumulative. It's not any one, tactical service that makes IBM's case; it's the combined breadth of offerings on Bluemix that creates a one-stop shop for the developer and data scientist. These data and analytics services are a fraction of the overall catalog, as there are also Watson cognitive services, mobile services, DevOps services, web and application services, business analytics services, storage and security services and so on. The broader and more coordinated the portfolio and platform becomes, the more compelling it becomes as a cloud-based platform for delivering next-gen, data-driven apps.


Spark Summit East Report: Enterprise Appeal Grows


Spark adopters including Bloomberg, Comcast, Capital One and EBay share compelling use cases. Data processing, streaming and analytics use-case scenarios multiply.

What’s the business case for Apache Spark? After the opening (general-session) day of Spark Summit East 2016 in New York, I was thinking that Spark promoter and Summit host Databricks needed to do a better job of telling that story. Then on the final day, executives from Capital One, Synchronoss and eBay offered compelling, business-use-case focused keynotes that knocked it out of the park.

Chris D’Agostino, VP of Technology at Capital One, a top-ten bank and cutting-edge user of data and analytics, outlined the bank’s use of Spark to generate tabular and graph representations of data at high scale. These views of data are used in low-latency fraud-detection, anti-money-laundering and know-your-customer analyses.

Suren Nathan, senior director of Big Data Platforms and Analytics Frameworks at Synchronoss, a cloud services company, outlined four eras of data pipeline maturation. It started with the “V1” traditional ETL/data warehouse era and then moved on to the V2 appliance-assisted era (think Exadata and Netezza, with comparatively high costs yet lacking support for multi-structured data). In the V3 early Hadoop era we suffered with slow batch processing in MapReduce. In today’s V4 era, you get scale on Hadoop but also superior performance with support for both batch and streaming workloads thanks to Spark.

Seshu Adunuthula of eBay described how “the other e-commerce giant” is making use of Spark within its evolved enterprise data platform. eBay’s move into selling more new items at fixed prices (as opposed to used items through auctions) has led it to take a catalog-oriented, rather than search-oriented, approach. That has necessitated a more structured yet dynamic data modelling approach. eBay remains one of the largest users of Teradata in the world, but Adunuthula said the company is moving many workloads into Hadoop. Spark was added to support fast batch analysis and streaming-data services. “Once we introduced Spark, we saw rapid adoption, and now we’re seeing more and more use cases as adoption grows,” Adunuthula concluded.

The opening-day keynotes were offered by vendor execs who had mixed success in painting a big picture. IBM’s Anjul Bhambhri, VP, Big Data, came close, explaining that “the beauty of Spark is that all components [Spark SQL, Spark Streaming, Spark R, MLLib, etc.] work together in a seamless way. You don’t need half a dozen products; you need just one, foundational platform.”

Matei Zaharia, CTO of Databricks, stuck to detailing Apache Spark accomplishments in 2015 and reviewing new capabilities coming in Spark 2.0, set for release in late April or early May. Highlights of the coming release include whole-stage code generation aimed at improving Spark performance and throughput by as much as 10X. Spark Streaming improvements in 2.0 will support the common scenario of mixed streaming and batch workloads, as when you want to track the state of a stream while also firing off SQL queries. Finally, Spark's DataFrame and Dataset APIs will be merged in 2.0, simplifying matters for developers by presenting fewer libraries and concepts to worry about.
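Here's a hedged sketch of the mixed streaming-and-SQL pattern described above, written in PySpark against the Structured Streaming API as it eventually shipped in the 2.x line (the built-in "rate" test source used here arrived in a later 2.x release; a Kafka or socket source would be the production choice):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-plus-sql").getOrCreate()

# A synthetic source that emits timestamped rows stands in for a real feed.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
stream.createOrReplaceTempView("events")

# The same SQL you would run against a batch table, applied continuously.
running_count = spark.sql("SELECT COUNT(*) AS events_so_far FROM events")

query = (running_count.writeStream
         .outputMode("complete")   # aggregations need complete/update mode
         .format("console")
         .start())
query.awaitTermination(30)          # run the demo for ~30 seconds
query.stop()
```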

Ali Ghodsi, co-founder and recently promoted to CEO of Databricks, confined his remarks to Databricks' own services, highlighting the appeal of using Spark through the company's cloud-based commercial offering. He also introduced a free (sandbox-scale), cloud-based Databricks Community Edition. This led to a demo by Databricks' Michael Armbrust, who touted Databricks' cloud offerings as a great way to avoid the pain of deploying Spark software.

MyPOV: With its cloud-based platform, Databricks competes, in a way, against the open-source software it helps to create and promote. Commercial companies behind open source software more typically support on-premises software deployments. Is it a coincidence that Spark customers and third parties including Cloudera and IBM seem to be carrying the load of making the case for enterprise Spark adoption? Plenty of Databricks execs offered technical presentations on Apache Spark software capabilities and roadmaps, but they were preaching to the choir. At Spark Summit East, Databricks missed a chance to be more of a champion of Apache Spark in the enterprise.

Thinking back, it took a partnership with O'Reilly to turn Hadoop World into Strata + Hadoop World, an event with a broader audience and a higher-level purpose. I'll grant that much of what Databricks does to promote and contribute to a healthy Apache Spark ecosystem goes on behind the scenes. And you can't argue with the project's success. But it seems to me that Spark is ready for a bigger stage.

PS: Interesting tech seen at Spark Summit included H2O.ai, which supports distributed machine learning on Hadoop or Spark, and DataRobot, which automatically generates predictive models (using leading open-source algorithms in R, Python and Spark) and tests, validates and selects the most accurate ones, speeding predictive work and easing the data-science talent shortage. Also interesting was SnappyData, a company recently spun out of Pivotal that has ported the open source Geode (formerly GemFire) in-memory, distributed database to run on Spark. It offers SQL querying in a persistent store that is part of (and runs on) Spark rather than requiring separate infrastructure.

The most talked about topic at Spark Summit NOT on the agenda or announced at the event was Apache Arrow, a new project which promises an in-memory, columnar data layer that can be shared by multiple open source projects, eliminating redundant infrastructure, cost and copies of data while enabling fast analytics across many workloads. The project launched with support from a whopping 13 open source projects, including Cassandra, Drill, Hadoop, HBase, Impala, Parquet, Pandas and Spark.
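Arrow's promise is easiest to see in code. Below is a minimal sketch using pyarrow, the Python bindings that shipped after the project launched, converting a pandas DataFrame into Arrow's shared columnar representation:

```python
import pandas as pd
import pyarrow as pa

# An ordinary pandas frame...
df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.0, 42.5, 7.3]})

# ...becomes an Arrow table: columnar buffers plus a language-neutral schema.
table = pa.Table.from_pandas(df)
print(table.schema)
print(table.num_rows)

# Round-trip back to pandas; the same buffers can be handed to Spark, Drill,
# Impala and others without another serialize/deserialize pass, which is the
# project's core promise.
print(table.to_pandas())
```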


Hortonworks Connected Data Platforms: More Than Sum of Parts


Hortonworks integrates Hortonworks Data Platform (Hadoop) and Hortonworks DataFlow (streaming data) platforms to offer a cohesive approach to analyzing data in motion and data at rest. Here’s how they fit together.

The “Connected Data Platforms” that Hortonworks introduced on March 1 are its well-known Hortonworks Data Platform (HDP) Hadoop distribution and its Hortonworks DataFlow (HDF) platform aimed at collecting, curating and routing real-time data from any source to any destination. HDP and HDF can be used independently, but here’s how they fit together to become a cohesive platform for managing and analyzing streaming and historical data.

Interest in streaming data analysis has been growing steadily in recent years, but the emergence of Internet of Things (IoT) opportunities has interest soaring. The thing is, streaming-data use cases such as connected cars, smart oil fields, smart utilities and precision medicine often require analysis of historical data, which brings context to the real-time insights. That's why HDF and HDP need to be connected.

Inside Hortonworks Connected Data Platforms

This week Hortonworks introduced HDP’s 2.4 release. Notable upgrades include support for and bundling of Apache Spark 1.6 software as well as improved system management and remote optimization capabilities through Apache Ambari 2.2 and SmartSense 2.2. Ambari, the open source management software, gained an Express Upgrade feature that lets you quickly stop jobs, update software and restart the cluster and running jobs all within one hour, even on large systems. SmartSense is a “phone home” capability that relays system-performance parameters to Hortonworks, which can diagnose problems and offer more than 250 recommendations on optimizing system performance and availability.

The biggest development with HDP 2.4 is a new distribution strategy with two separate release cadences. Core Apache Hadoop components including HDFS, MapReduce and YARN as well as Apache Zookeeper will be updated annually, in line with other members of the ODPi consortium. Hortonworks is expediting other, newer capabilities through new “Extended Services” releases, which will be offered as quickly as they can be made available. One example of an Extended Service is support for Spark 1.6. Other candidates for this release approach will include Hive, HBase, Ambari “and more,” says Hortonworks.

MyPOV on HDP 2.4: I like this two-pronged strategy with the stable, slower moving core complemented throughout the year by extended services. Hortonworks has lagged behind Cloudera in the past in adding certain new capabilities that customers have been anxious to use. This is a good approach to fast tracking capabilities that are in demand (although they presumably can’t require changes to Hadoop core components). The approach also simplifies matters for other distributors of ODPi-based distributions.

Hortonworks DataFlow 1.2

HDF is Hortonworks' streaming data platform based on Apache NiFi and adapted from last year's Onyara acquisition. Upgrades with the move to HDF 1.2, which will be available later this month, include the integration of the Apache Kafka messaging system and the Apache Storm stream-processing engine. The release also gains support for Kerberos for centralized authentication across applications. On the near-term roadmap is support for Spark Streaming, which should be available by early summer, according to Hortonworks.

MyPOV on HDF: There’s much to like in Hortonworks DataFlow, including a drag-and-drop approach for developing the routing, transformation and mediation within dataflows. It also offers built-in data-security and data-provenance capabilities. One exec described it as “a FedEx for streaming data,” providing the digital equivalent of a logistics system for routing streaming data and tracking sources and changes to digital information along the way. The ecosystem seems strong, with support for more than 130 processors for systems including Kafka, Couchbase, Microsoft Azure Event Hub and Splunk.

How HDP and HDF are Connected

Hortonworks wants to be a multi-product company, so it has stressed that HDP and HDF will be sold and can be used independently. HDF can route data to (and draw from) other Hadoop distributions, databases such as Cassandra and cloud-based sources, such as Amazon S3.

When use cases span data in motion and data at rest, HDP and HDF have commonalities that make them easier to use together. For example, HDP and HDF share more than 70 data processors, and both use Ambari for system deployment and management. What's more, Hortonworks is promising that SmartSense and the Ranger and Atlas security and governance projects will also support both platforms.

MyPOV on Connected Platforms: The need for the combination of streaming and historical data analysis is popping up in many quarters. It was touted as a benefit of Spark Streaming 2.0 at the recent Spark Summit East event, and MapR also has a strategy to address both forms of data in one platform.

Hype around streaming data opportunities is nothing new. More than a decade ago, complex event processing systems were touted as “ready to go mainstream.” At long last, I think we’re finally seeing signs that streaming data analysis is emerging. The mobile, social, cloud and big data trends set the stage and maybe, just maybe, the promise of IoT possibilities is pushing it over the top.

PS: Hortonworks also spotlighted two promising Spark related developments this week. First, it’s shipping a preview of Apache Zeppelin with HDP 2.4, providing a coding-free UI for visualization and a notebook-style approach to working on Spark. This is a usability improvement and democratization tool that Spark sorely needs. Second, in a partnership with HP Enterprise Labs, Hortonworks will bring to open source an optimized shuffle engine for Spark that HP Enterprise says will offer 5X to 15X performance improvements as well as optimized use of memory. This tech doesn’t have project status yet, let alone acceptance from the Spark community, but Hortonworks says it will ship the software with HDP later this year.

Related:
Spark Summit East Report: Enterprise Appeal Grows
Strata + Hadoop World Report: Spark, Real-Time In the Spotlight

 


Cloudera Takes to the Cloud, Highlights Industry Use Cases


Cloudera plans public cloud push as applications multiply in financial services, insurance, life sciences, retail and telecommunications. Hadoop may not be easy, but it is gaining mainstream adoption.

Hadoop is going mainstream, it’s increasingly moving into the cloud, and it’s delivering solid business value. These are three key themes that were highlighted this week at Cloudera’s third annual Analyst Day in San Francisco.

Cloudera shared strong evidence of broad adoption and business value through panels that dug into the details of real-world deployments. Here’s a short list of the types of applications seen across five industries:

Financial Services: Cloudera has more than 100 customers in this category, and use cases typically start with governance and security. Big banks, for example, have to retain transactional data for regulatory reasons, and many have embraced Hadoop for high-scale data retention and analyses including anti-money-laundering and stress testing. As the breadth of data in a data lake spreads across lines of business, financial services firms develop 360-degree views of customer preferences and behaviors (customer 360).

Insurance: Insurers use Hadoop for customer 360 and claims-fraud analysis applications. More mature adopters are moving into Internet of Things (IoT) applications such as usage-based pricing. In the automotive arena, for example, pay-as-you-drive and how-you-drive pricing will be ubiquitous within a few years, a Cloudera exec predicted. The platform is making high-scale analysis of telematics data practical and affordable.

Life Sciences: Whether it’s healthcare providers, pharmaceutical companies or crop sciences firms, these organizations are modernizing their data infrastructures to handle data at unprecedented scale. Cloudera customer Cerner, which analyzes electronic medical records on a Hadoop-based platform, has come up with an automated way to predict sepsis infections in hospital patients. The alerts have reportedly saved more than 3,000 lives to date.

Retail: It’s all about getting closer to the customer, differentiating products and services, and optimizing inventory to maximize sales and keep customers happy. That’s a journey that starts with resolving customer identities across channels and then better integrating data from across channels. These first two steps get you to the most valuable stage of understanding customer interactions, behaviors and value across all channels and over time.

Telcos: Telcos are big users of Hadoop, and they start with governance-oriented call-data-record remediation and customer-churn analysis. Operations groups use the platform for network troubleshooting, security and risk analysis. As use of the platform matures, front-end and back-end insights are integrated for proactive network optimization, customer service and anti-churn initiatives.

The four themes that cut across all industries are driving customer insights, improving products and services, reducing risk and modernizing IT infrastructures. On this last point, Cloudera said that only 15% of its 850-plus enterprise customers have deployed its software on public clouds, but that's where it's seeing the fastest growth. “Data that's born in the cloud wants to stay in the cloud,” observed Cloudera Chief Strategy Officer Mike Olson, and that trend will accelerate as IoT scenarios flourish, he said.

Cloudera plans to ramp up in this area with Cloudera Director, an automated cloud-deployment tool and abstraction layer that hides the complexities and differences among various clouds and deployment options including Amazon Web Services, Google, OpenStack and VMware. With Cloudera Director 2.0, released in January, Cloudera added a cluster-cloning feature and the ability to automatically grow and shrink clusters to save money.

@Cloudera, #ClouderaAD

Cloudera executives (left to right) Doug Cutting, chief architect, Mike Olson, chief strategy officer, and Tom Reilly, CEO, at Cloudera Analyst Day, March 22, 2016.

MyPOV on Cloudera Analyst Day

There was a bit of a disconnect between what Cloudera talked about in its market observations and strategy overviews and what it detailed in its product roadmap (which was largely under NDA). For example, there was no signal of new cloud deployment capabilities beyond Director 2.0, other than supporting Microsoft Azure as a deployment option. And despite all the talk of industry specific use cases, Cloudera executives only vaguely alluded to blueprints, templates, and frameworks — from Cloudera and from partners — that give customers a starting point on proven applications. It’s nice to hear about vertical use cases, but Cloudera has much more work to do on broad platform acceptance before it can go too far down the vertical-industry path.

At one point during the day Cloudera described its technology as being “fast and easy.”  When I questioned execs about the use of these terms, Chief Strategy Officer Mike Olson qualified that Hadoop is fast and easy as compared to relational database approaches when trying to solve high-scale data challenges. He also pointed to efforts Cloudera has made to simplify deployment with tools like Navigator Optimizer and Cloudera Director, which speed and ease analysis and optimization of SQL workloads and cloud deployment, respectively.

At other points during the day Cloudera execs talked about the time and money the company has to invest to help clients move from proof-of-concept projects to broad and fruitful production use. It also discussed how it's now employing extensive automated testing to ensure the quality of its software distribution, which now includes more than 25 open source components.

In short, “fast” and “easy” are not terms I would associate with Hadoop deployment.  But “proven,” “value driving” and even “industry standard” work for me and for the many companies that now rely on the platform.

 


Hadoop Hits 10 Years: Growing Up Fast


Hadoop security, data management, data governance and analysis options remain works in progress, but a rich ecosystem is emerging to fill gaps and democratize the platform.

Apache Hadoop marks its 10th anniversary as an open source project this year, a fitting milestone to review its (betwixt-and-between) state as an enterprise computing platform.

Inspired by a Google white paper, born at Yahoo and embraced in its early years almost exclusively by Internet giants, Apache Hadoop is today accepted as a de facto standard platform for any enterprise interested in taking advantage of big data. Over the last five years, the top three Hadoop software distributors, Cloudera, Hortonworks and MapR, have cracked all major vertical industry categories and have collectively gained more than 3,000 paying customers for their supported enterprise editions. Tens of thousands more firms are self-supporting free community distributions of Hadoop, though the largest share of these deployments are no doubt about experimentation rather than production use.

Hadoop 10 Years

Equally significant – and now the fastest-growing part of the Hadoop user community, by most accounts – are the thousands of organizations using cloud-based Hadoop services, such as Amazon Elastic MapReduce, Microsoft Azure HDInsight, Altiscale, Qubole and various managed Hadoop service offerings.

Looking beyond these sheer numbers, I heard plenty of fresh evidence of proven industry use cases at recent Cloudera and Hortonworks analyst events. Cloudera detailed an impressive list of vertical industry use cases at its event while Hortonworks cited unnamed customers at “55 out of the top 100 financial services firms, 75 out of the top 100 retailers, eight out of the top nine telecommunications companies in North America, and eight of the world’s top 20 automotive companies.”

So there’s plenty of reason for confidence in this platform, and we continue to see steady maturation. But Hadoop still has weaknesses and gaps, and plenty of experiments have failed. Even the hand-picked customers attending the Cloudera and Hortonworks events, who shared mostly success stories, had to admit to ongoing challenges:

  • “Better data governance is the number-one priority on our [Hadoop] wish list,” said a VP of platforms and data architecture attending the Hortonworks event. His employer, a digital marketing company, has been on an acquisitions tear, and segregating, securing and otherwise governing specific data sets has proven difficult as the company has consolidated separate Hadoop deployments.
  • “It was messy on the data-lineage end,” confessed a director of analytics attending the Cloudera event. This warehousing and logistics firm pulls data from dozens of legacy database applications to calculate how to jam more products into its distribution centers. But before it could begin the optimization work, the firm “spent months working out the details for data ingestion.”
  • “We have three people working with Hadoop, but we have more than 150 business users who need access to the data,” said a BI solutions architect at an aerospace firm that is using Hortonworks’ distribution. “I’d like to see better ease of use for business users,” he said, noting the wonky, coding-intensive nature of many Hadoop components and data-management tools.
  • “The sooner we can have an all-purpose [tool] for getting data into Hadoop, the better,” said an IT executive of a data services company using Cloudera. “We use a lot of RDF-linked data, but there’s not a lot of support for that in Cloudera.”

MyPOV on Hadoop Maturity

So is the glass half empty or half full? In my view you should be optimistic but realistic about this ten-year-old platform. I relate it to my experience as a parent. We never left my son home alone when he was 10 years old, but now that he’s 14, I trust that he’ll be safe and will even get his homework done if we get home late from work. In much the same spirit, an executive at a major e-retailer shared in a recent briefing that his firm isn’t ready to open up wide access to the firm’s Hadoop cluster until data-access, governance and security controls are more mature. Maybe if PCI data wasn’t involved he’d feel differently? Just as a parent has to know the child, you have to understand your data, your users and your risks. Maturity and trust will come.

Fortunately, we’re seeing a rich ecosystem emerging around Hadoop that will help make data access, data management, data governance and data analysis easier, less coding intensive, more repeatable and, in many cases, more accessible to business users. Some of these capabilities will undoubtedly be duplicated within open source tools. But we’ll also see data-management and governance capabilities that will extend beyond Hadoop, supporting data pipelines and data-driven applications that span multiple platforms.

Next week I’ll be discussing the possibilities and positive developments in the educational webinar, “Democratizing the Data Lake: The State of Big Data Management in the Enterprise.” Set for Tuesday, April 26 at 1pm ET/10am PT, this webinar will delve into data access, data cataloging and metadata management options for Hadoop as well as big data integration and data-prep options. We’ll also discuss Apache Spark and its role in data processing, stream processing and data analysis in the context of Hadoop. Click on the link above to register for the event.

RELATED READING:
Cloudera Takes to the Cloud, Highlights Industry Use Cases
Hortonworks Connected Data Platforms: More Than Sum of Parts

 

 

 

 


SAS Goes Cloud, But Will Customers Follow?


SAS Viya delivers cloud-ready architecture for multiple public and private deployment scenarios. But will it appeal to next-gen customers looking for new analytics options?

To attend a SAS Global Forum user conference is to be reminded of its huge customer base and influence. This week’s event in Las Vegas attracted more than 5,000 attendees, and to meet and talk to many of them is to understand that they’re not, for the most part, early adopters looking to pioneer new approaches. It’s an older and more well-established crowd than you see at your average tech event.

Their embrace of analytics notwithstanding, SAS customers tend to be cautious adopters, so I have to wonder how quickly they'll take to the two big announcements made at #SASGF. The biggest announcement was SAS Viya, which is SAS's next-generation architecture for running its software in private and public clouds. It's a big advance over SAS's previous cloud approach, which amounted to hosting and managed services. SAS Viya introduces microservices, REST APIs and even the option to use open languages, starting with Python and Lua and eventually adding Java and Scala.

Viya is set to become generally available in Q3, and it will launch with Cloud Foundry and VMware as deployment options on private clouds and Amazon Web Services as the public-cloud option. Microsoft Azure support will follow. Within the next month beta customers will be able to test Viya using programmatic interfaces.

SAS Viya Q3 Launch

The second announcement this week was Customer Intelligence 360, an all-new, software-as-a-service-based marketing analytics suite aimed at digital channels. Customer Intelligence 360 is available immediately with modules on AWS for Web and Mobile channels; email and social modules will follow later this year, says SAS.

Customer Intelligence 360 is not a cloud-based alternative to the vendor’s incumbent on-premises offering, SAS Customer Intelligence. Rather, the cloud app complements that product, focusing on digital marketing channels and execution. It links to on-premises Customer Intelligence deployments to gain the proverbial 360-degree view of customer behavior.

Will Customers Follow?

I’m not alone in sizing up SAS customers as mostly cautious adopters. A SAS data warehousing partner confided to me that many customers say “don’t mess with my code” when he suggests that they embrace in-database processing options introduced as early as 2007 and in-memory options introduced in 2012.

Of course, not every SAS customer is technologically conservative. I sat in on a nice presentation by Neil Chandler and Andy Wolfe of U.K.-based e-retailer Shop Direct, which stores its data in a Hadoop data lake. Apache Cassandra/DataStax is used to serve that data to a SAS stack running on Amazon Web Services. What’s more, the analytics are integrated with the retailer’s transactional environment so it can deliver contextually relevant, personalized content and offers to customers in real time as they shop online.

SAS Viya would have saved Shop Direct a lot of trouble, commented Andy Wolfe. But it wasn’t available when the company decided to run SAS on AWS two years ago. No doubt SAS customers who are fast followers will embrace Viya as a way to accelerate private- and public-cloud initiatives involving analytics. The architecture will insulate most of their users from having to deal with or think about third-party cloud tools.

SAS put a big emphasis on compatibility with existing SAS software when it designed Viya and when it talked about it at SAS Global Forum. The on-stage demos featured scenarios in which data and models were moved from on-premises SAS deployments up into Viya and results were returned back to the on-premises world.

So will the many cautious adopters be tempted to experiment with cloud compute scalability and the new options coming with Viya, such as the new machine learning suite (one of the bright spots in the announcement)? That partner I talked to said many SAS customers don't change their ways until new CIOs or CMOs show up and demand at least experimentation with cloud deployment, big data platforms like Hadoop and open source analytics. In short, I'm guessing Customer Intelligence 360, the much more straightforward SaaS offering, might see more robust adoption than SAS Viya over the next year or two.

MyPOV on SAS Viya, Customer Intelligence 360

SAS Viya is a positive, if overdue, step that SAS needed to take. What's not clear is the degree to which SAS will break from its traditional licensing and maintenance model. Analytics work tends to be spiky, so true cloud elasticity, where you can use and pay for the software only when you need it, is hugely attractive. A pay-as-you-go, analytics-as-a-service offering will be introduced along with SAS Viya, but from what I saw it will be a very limited offering – at least initially.

In my book, the as-a-service machine learning and analytics options being amassed by the likes of Amazon, Google and Microsoft (as well as those offered by partners on the same platforms) are the most significant threat to SAS in the long term. IBM, meanwhile, is porting its commercial analytics software to run on Apache Spark, offering cloud-delivery options and counting on an open source halo effect.

SAS Viya was clearly developed with existing SAS customers in mind, giving them an avenue to cloud scalability and performance. But I suspect that companies that are embracing new approaches will take a long, hard look at as-a-service options (including open source in the cloud) before making a significant, ongoing commitment to run SAS software in the cloud.

As for SAS Customer Intelligence 360, it’s another product that’s most attractive to existing SAS customers – namely those using SAS Customer Intelligence. In fact, the cloud app’s promised ability to “unite data from all channels” presumes that you have SAS Customer Intelligence on-premises (or hosted) capturing information on customer behavior on traditional (non-digital) marketing channels. So I suspect the vast majority of uptake will be from existing SAS customers rather than from new customers using CI360 in stand-alone fashion.

SAS, like IBM, Microsoft, Oracle, Teradata and other blue-chip incumbents, has the luxury of a great big customer base, and in SAS's case they're mostly very happy and loyal. The trick will be reinventing the business enough to attract new customers without breaking the model of its existing business.

RELATED READING:
The Era of Self-Service Analytics Emerges: Inside SAS Visual Analytics and Visual Statistics
SAS Factory Miner Industrializes Predictive Analytics
SAS Cloudifies Its Analytics with New Viya Architecture: What It Means

 



IBM Watson Team Marks Progress, Previews Next Steps


IBM Watson team introduces new services, details evolved capabilities, forecasts market opportunities. But we’re still hungry for real-world success stories.

IBM’s Watson unit keeps developing new offerings, growing its ecosystem and explaining the big opportunities ahead. If only it could share more of the customer case examples that remain under non-disclosure agreement (NDA) wraps.

What would-be customers really want to see, after all, is examples of Watson clobbering real-world problems the way it trounced its puny human competitors in “Jeopardy.” But to be fair, a lot of building, learning, refining and ecosystem development is required when starting a business from scratch.

At the first IBM Watson Analyst Day held this week in Cambridge, Mass., the conversation was mostly about how IBM is setting the stage — a very big stage — for cognitive business. In the two-plus years since it was established as its own commercial business unit within IBM, IBM Watson has gone from addressing three industries to 20 industries, from doing business in one country to 45 countries, from one language to eight languages, and from three partners to 550 partners.

IBM Watson Progress

Instead of talking about big, high-profile cognitive projects spearheaded solely by IBM, the company is increasingly talking about the portfolio of cloud-based cognitive services that can be exploited by the growing developer community. Where Watson began with one application programming interface and hundreds of mostly internal developers, today there are more than 30 APIs and tens of thousands of developers, IBM said this week.

IBM also detailed a number of developer-oriented announcements:

Sentiment and Insight services. In a collaboration with Twilio, IBM this week introduced IBM Watson Message Sentiment and IBM Watson Message Insights. Pre-integrated with Twilio's APIs, Message Sentiment detects sentiment in SMS streams at scale, while Message Insights distills the key meaning of messages.

Learning Korean. In an agreement with SK Holdings C&C, a Korean IT services company, IBM Watson will be trained to understand Korean so South Korea-based developers can use localized APIs and services to accelerate the creation and deployment of cognitive applications.

Machine vision and tone analysis. IBM merged its AlchemyVision and Visual Recognition services to power a unified Watson Visual Recognition API. Separately, Watson Tone Analyzer, for spotting snark, sadness, elation and other emotions in text, is now generally available.

In addition to adding new APIs and services, IBM is also tinkering with the packaging of existing solutions to make them easier to consume. For example, Watson Engagement Advisor (WEA), which debuted in 2013, has had a makeover. WEA is designed to improve the customer service experience by assisting service agents or interacting directly with customers in complex service scenarios. IBM said this week that Engagement Advisor has  been decomposed into more granular, developer-ready services that are pre-trained on common customer-service needs and intentions. An integration framework was also added to make it easier to connect to common applications and data sources.

Where Engagement Advisor deployments initially averaged six to nine months, time to production has been shortened considerably, says IBM. What's more, the new tooling makes it more of a self-service proposition than something requiring consulting support.

What’s Next

Among the big opportunities ahead for Watson, IBM execs detailed progress and plans in three domains:

Healthcare. Healthcare has been the top industry priority from the earliest days of Watson, so Watson Health was split out as a separate unit within IBM Watson. In fact, this week's event was held at the just-opened Watson Health headquarters. IBM sees a $200 billion opportunity for Watson within the $8 trillion worldwide healthcare industry, so it has invested heavily in data-driven cognitive healthcare solutions. This week we heard a bit about how the acquisitions of Truven, Explorys and Phytel have helped create one of the world's largest collections of health data. That data is fuel for the IBM Watson Health Cloud and its analytics, image analytics and knowledge platforms.

Cyber security. Earlier this month IBM announced Watson for Cyber Security. The average company faces some 200,000 security events per day, including everything from network hacks and viruses to routine logons and user-access requests. A central tenet of Watson for Cyber Security is learning to distinguish between ever-changing real threats and false positives so the technology can filter out the latter and free security professionals to focus on real problems. Toward this end, IBM is integrating Watson for Cyber Security with the IBM QRadar SaaS app and the IBM X-Force threat-intelligence database to give Watson rich troves of ever-evolving data to power cognitive security analyses.

Internet of Things. Not all IoT applications call for cognitive capabilities, but IBM sees a fit wherever there's human-to-machine interaction at scale. Cognitive capabilities are envisioned, for example, in route-planning services that will combine real-time and historic data with traffic, weather and social-event data, like concerts and ball games, to predict better routes. In another example, a steel plant customer of IBM's is interested in a wearables application in which foundry workers are fitted with sensors so Watson can detect and alert managers to early signs of heat stroke.

MyPOV on Watson’s Progress

The one key factoid missing from this week's event — one that Constellation Research advisory customers ask us about — is how many customers are successfully using IBM Watson in production. IBM doesn't share such statistics, but analysts were recently given an NDA peek at industry use cases and real-world customer examples. Nearly a dozen major companies were cited by name in this March briefing. I came away impressed, so I was a bit disappointed this week when customer citations and examples weren't on the agenda.

As for IBM’s focus on delivering more granular, digestible services and enabling an ecosystem, these are good moves that will spread the burden of building what is more than just a new business unit. As IBM has described it, cognitive is a whole new era for computing.

The good news is that IBM is no longer going it alone. Systems integrators, in particular, have been introducing cognitive capabilities over the last year. In some cases the offerings might not fit IBM’s definition of cognitive — being closer to combinations of automation, robotics or artificial intelligence techniques. But the development work and conversation has begun, and competition will be nothing but good for this nascent market.

Related Reading:
IBM Analyst Forum: A Tale Of Two Titans
IBM Insight 2015 Spotlights Cloud Services, Spark, Watson Analytics Upgrades
IBM Watson Gets Ready to Scale


Workday Deal For Platfora Leaves Questions


Workday says Platfora buy will bolster its petabyte-scale analysis capabilities. It’s good news for Workday customers, but Platfora customers should be prepared for change.

Human resources and financial applications vendor Workday announced July 21 that it plans to acquire big data analytics vendor Platfora. In an echo of an earlier initiative, the company says the deal will enable customers to “analyze petabyte-scale data in seconds,” but existing Platfora customers should be prepared for new priorities.

Workday’s first effort to support big data analysis emerged in late 2012, when it announced plans for the Hadoop-based “Workday Big Data Analytics Platform.” When the platform was delivered in September 2013, Workday acknowledged that customers were as interested in small-scale, Workday-internal data analysis as they were in speculative big data opportunities. And by 2015, Workday’s focus had shifted to focused, data-driven “Insights” apps built on the Big Data Analytics Platform, starting with a Talent Insights app. The emphasis was on delivering prescriptive recommendations in the context of applications, not just data visualizations.

The acquisition of Platfora, a five-year-old big data analytics vendor, seems in some ways like a reboot for the Workday Big Data Analytics Platform. Platfora works with Hadoop and Apache Spark, providing data-management and governance capabilities as well as its original OLAP-like Platfora Lenses and high-scale data-visualization capabilities.

@Platfora, @Workday

This screen shot from Platfora’s site captures the company’s focus on customer behavior analysis, with examples including path and funnel analysis, attribution and segmentation.

Workday says it will retain Platfora’s technology, its staff and even its San Mateo, CA, office. In a statement, Workday said the deal will “enhance our analytics capabilities—especially areas like managerial reporting and operational analytics where insights are gathered by collecting and connecting multiple data sources (Workday and non-Workday data).”

What Workday didn’t say is that Platfora is currently focused heavily on customer-behavior analysis, with examples including online path and funnel analysis, attribution, customer segmentation and customer journey analysis. Workday, by contrast, is focused on internal HR and financial analysis. Nonetheless, Workday said Platfora’s tech and expertise will help it support all its customers, including retail customers and insurers doing profitability analysis and consumer goods companies pursuing supply chain analysis and planning.

MyPOV On the Platfora Acquisition

Workday did not disclose the terms of the acquisition, which suggests it was a comparatively small deal. Platfora was an early mover on big data analysis, but in many ways it was ahead of its time. Last year there were signs of trouble at the company as it swept in a new CEO just as many new competitors, including Arcadia Data, AtScale and Kyvos, started to emerge.

Adding Platfora’s analytical and big data technology and expertise can only be a good thing for Workday, assuming it came at an attractive price. That said, the types of analyses Workday focuses on aren’t exactly in line with the types of “solutions” Platfora specialized in delivering. Shifting focus may take time, but leading software-as-a-service vendors including Workday and Salesforce seem intent on making analytics a part of their platforms. That’s partly a bow to customers, who like seamlessly embedded analytics that are accessible to business users. It’s also a self-interested move to capture more revenue from customers and avoid forcing them to turn to third-party vendors for analytical capabilities.

As for existing Platfora customers – a list that includes Citi, Disney, Sears, TiVo and Volkswagen Group – I would be wary. Workday’s statement only addressed advancing analysis on the company’s own, cloud-based platform. It said nothing about continuing to develop and offer Platfora’s software, which is currently deployed on-premises with multiple Hadoop distributions and on the Amazon, Google and Azure public clouds. That’s precisely the reassurance existing Platfora customers would expect to hear. A separate Platfora blog post on the deal was equally vague, saying only that the company would reach out to customers privately to describe what the future holds.

Workday execs were not available for comment. The company said plans will be detailed at the Workday Rising 2016 event in Chicago in late September, but I’m hoping to learn (and will share) more as early as next week.


Workday Plans for Platfora Revealed


Workday HR and financial analyses stand to benefit, but Platfora customers will lose the on-premises software option in the wake of the pending acquisition.

Workday customers stand to gain a deeper understanding of employees, recruits, customers, suppliers, business units and locations, but existing Platfora customers will soon lose the on-premises and public cloud deployment options for the software.

That’s the upshot of Workday’s planned acquisition of Platfora, a deal announced last week that left questions about the fit between the two companies and the future of the acquired big data analysis vendor. To answer some of the lingering questions, Workday executive Mike Frandsen, EVP of Products, Support & Delivery, and Platfora executive Peter Schlampp, VP of Products, briefed Constellation Research on what the deal will bring. Here are a few of the highlights:

Platfora Joins Workday

On Platfora’s future: Platfora’s current version will be the last to be offered as on-premises software, said Frandsen. Workday will continue to support Platfora’s existing software on all currently supported versions of Hadoop and Spark into 2018 and will honor all contracts. A future product based on Platfora’s technology will be a SaaS service on Workday’s cloud platform. That offering will not be marketed as a stand-alone big data analysis service; rather, it will enhance Workday’s current analytical capabilities, as detailed below.

On blending Workday and Platfora capabilities: Workday has more than 40 customers using its Hadoop-based Workday Big Data Analytics Platform, introduced in 2013 and based on Datameer technology. That service lets customers import and analyze third-party data in aggregate alongside Workday data, but the company says it does not let them drill down into or slice that data. By tapping Platfora’s in-memory Data Lenses, Workday says, “non-Workday data will be a first class citizen” and users will be able to “drill down and slice that data into different pivots or dimensions.”
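To make the “drill down and slice” idea concrete, here is a minimal pandas sketch, illustrative only and not Workday or Platfora code, that blends hypothetical Workday headcount data with a hypothetical non-Workday source and then pivots the result by dimension. All column names and values are made up.

```python
# Minimal illustration (pandas), not Workday or Platfora code: blend
# hypothetical Workday data with non-Workday data, then slice by dimension.
import pandas as pd

# Hypothetical Workday export: one row per employee
workday = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "org":         ["Sales", "Sales", "Engineering", "Engineering"],
    "location":    ["Chicago", "Pleasanton", "Chicago", "Dublin"],
    "salary":      [90_000, 85_000, 120_000, 110_000],
})

# Hypothetical non-Workday source: engagement survey scores
survey = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "engagement":  [0.62, 0.71, 0.55, 0.80],
})

# Join the two sources, then pivot: average engagement by org and location
blended = workday.merge(survey, on="employee_id")
pivot = blended.pivot_table(index="org", columns="location",
                            values="engagement", aggfunc="mean")
print(pivot)
```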

Platfora’s self-service data-prep and data-visualization capabilities will also figure in Workday’s enhanced offering, said Schlampp. But it won’t be a complete “reboot” of Workday Big Data Analytics, said Frandsen. Rather, Workday will rationalize functionality for different user personas, drawing from “an abundance of great tools” for preparing and analyzing data on Hadoop.

On supporting deeper Workday analyses: Platfora’s expertise in behavioral analysis is not limited to customer analysis. That has been a focus area for the company, but behavioral analysis is equally applicable to HR and financial analyses. In the HR arena, for example, it will help companies understand the behaviors, strengths and motivations of employees and prospective hires, as well as the likelihood of retaining them. Further, Workday said it’s contemplating cohort analysis of employees based on hire date, compensation, manager, organization, work location, or any combination of these attributes.
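As a rough illustration of the kind of employee cohort analysis described above, here is a minimal pandas sketch. The attributes and the retention flag are hypothetical and do not reflect Workday’s data model.

```python
# Illustrative cohort-analysis sketch (pandas), not Workday code: group
# employees into hire-year cohorts and compute simple per-cohort metrics.
import pandas as pd

employees = pd.DataFrame({
    "employee_id":    [1, 2, 3, 4, 5, 6],
    "hire_date":      pd.to_datetime(["2013-02-01", "2013-07-15", "2014-03-10",
                                      "2014-11-30", "2015-05-05", "2015-09-01"]),
    "still_employed": [True, False, True, True, False, True],
    "compensation":   [70_000, 65_000, 80_000, 72_000, 90_000, 85_000],
})

# Cohort key: hire year; manager, organization or location could be added
employees["cohort"] = employees["hire_date"].dt.year
summary = employees.groupby("cohort").agg(
    headcount=("employee_id", "count"),
    retention_rate=("still_employed", "mean"),  # share still employed
    avg_compensation=("compensation", "mean"),
)
print(summary)
```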

On the finance side, Platfora will help Workday better understand the profitability and value of customers and prospective customers, their satisfaction, their likelihood to remain customers and their likelihood to pay their bills on time. Current and potential partners and suppliers might be evaluated in the same way, and all of the above would sharpen insights on the profitability of business units and locations.

MyPOV on the Workday-Platfora Plans

I stand by my initial take on this deal in seeing it as a boon for Workday customers, but it’s now clear that change is ahead for existing Platfora customers. Schlampp said he expects customers will continue to use Platfora “for many years to come,” but I think that’s optimistic. I expect customers to look to competing products as their subscription-based licenses to use Platfora expire (whether it’s used on-premises or with cloud-based Hadoop deployments on AWS, Google or Azure).

About one third of Platfora customers (I don’t have an exact count, but it’s said to be “dozens”) are also Workday customers. I would expect few if any of these customers to abandon existing Hadoop deployments and move all their data and analyses onto Workday. That said, these joint customers may well use and appreciate Workday’s Platfora-enhanced HR and financial analyses. And if that’s the case, they would surely load more HR- and finance-relevant data and third-party data onto Workday’s platform as a result.

Related reading:
Workday Deal For Platfora Leaves Questions
Democratize the Data Lake: Make Big Data Accessible
Hadoop Summit 2016 Spotlights Enterprise Innovation, IoT Use Cases

 


SAP Reportedly Buying Altiscale to Power Big Data Services


SAP needs to address big data, streaming and IoT apps. Altiscale would accelerate efforts to deliver high-scale and high-performance cloud data services.

SAP is acquiring Altiscale, VentureBeat reported on August 25. SAP neither confirmed nor denied the report, offering a statement that it “does not comment on market rumors or speculation.” If the acquisition is real, it would, in my view, make sense.

Altiscale is a four-year-old startup that specializes in cloud-based Hadoop and Spark services. As such, it technically competes against mainstream Hadoop services such as Amazon Elastic MapReduce, Microsoft HDInsight and Google Cloud Dataproc. But Altiscale differentiates itself from these low-cost providers by combining high-performance Hadoop and Spark services with integrated ingestion, transformation and analytical capabilities via its Altiscale Insight Cloud offering. A small niche player, the company also stresses high-touch support and professional services, which puts it in more direct competition with Qubole.

@Altiscale

Altiscale combines Hadoop and Spark cloud services with integrated data ingest, prep and analytics as well as supporting professional services.

By acquiring Altiscale, SAP would quickly gain capacity and deep expertise for delivering Hadoop, Spark and high-scale data services. All of the above would complement SAP’s cloud and IoT strategies while reducing the need for customers to work with third-party vendors.

SAP HANA continues to be the cornerstone of the company’s data-management strategy, but at SAPPHIRE 2015 the company announced that it would also rely on Hadoop and Spark to support its big data and IoT strategies. SAP subsequently announced, and in March made generally available, SAP HANA Vora, which uses Spark to provide interactive analytics on high-scale data in Hadoop.

MyPOV: If this deal is real, I think it will be a boon to SAP. The company is, of course, partnered with all of the leading Hadoop software distributors and with Spark overseer Databricks, but to date the company has lacked its own Hadoop and Spark services. Altiscale would enable SAP to help customers with high-scale data capacity and data pipelines without relying on third-party vendors. What remains to be seen – again, if the deal is real – is how and whether Altiscale’s existing services would change in the wake of an acquisition. Altiscale could not be reached in time for comment.

Earlier this year I interviewed executives at Altiscale customer MarketShare for an in-depth case study, and they were quite happy with the company’s services. Although Altiscale’s rates were higher than those of Amazon EMR, the Hadoop service MarketShare previously used, the company reported that its Hadoop jobs completed in one-quarter to one-fifth the time and at 65 percent to 70 percent of the overall cost on Altiscale. It’s a compelling story, but one that’s hard for a small vendor to get heard when it’s up against the biggest public cloud providers in the industry.


Hewlett Packard Enterprise Powers Machine Learning Apps, Revs Vertica Database


HPE streamlines use of machine learning services with Haven OnDemand Combinations. Vertica release improves performance, adds Hadoop and Spark support.

Hewlett Packard Enterprise announced August 30 at its HPE Big Data Conference in Boston that it’s making its library of machine learning services easier for developers to build into smart, “cognitive” applications through Haven OnDemand Combinations. In a second announcement at the event, HPE unveiled Vertica 8.0, the next release of the company’s high-scale analytical database.

Haven OnDemand is in the white-hot category of machine learning services. (The category is sometimes pegged as cognitive or artificial intelligence technology, but Constellation Research views machine learning as a more accurate description of current capabilities.) It’s a domain that has seen dozens of acquisitions in recent years, driven by tech giants including Amazon, Google, IBM, Intel, Microsoft and Salesforce.

(Video: “Hewlett Packard Enterprise Powers Cognitive Apps,” from Constellation Research on Vimeo.)

How will HPE differentiate Haven OnDemand as the big public cloud companies deepen their portfolios? Haven OnDemand Combinations is an early effort to do just that by enabling developers to bring together multiple APIs in composite services that can be saved and reused for fast development.

HPE has introduced a few pre-built Combinations of its own, including call archiving and language-agnostic sentiment analysis. But these Combinations aren’t products so much as starting points meant to be adapted by customers. The essence of Combinations is giving developers the ability to tie together multiple APIs using a drag-and-drop design interface. From there, developers can invoke the Combinations through a single API call with minimal coding.
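The pattern is easier to see in code. Below is a generic Python sketch of chaining two REST calls behind a single function, the way a Combination wraps multiple services behind one API call. The base URL, endpoint paths and response fields are hypothetical placeholders, not documented Haven OnDemand APIs.

```python
# Generic sketch of the "combination" pattern using the requests library.
# Endpoint paths, field names and the base URL are hypothetical placeholders.
import requests

BASE_URL = "https://api.example.com/v1"   # placeholder base URL
API_KEY = "YOUR_API_KEY"                  # placeholder credential

def call(endpoint: str, payload: dict) -> dict:
    """POST to one service endpoint and return its JSON response."""
    resp = requests.post(f"{BASE_URL}/{endpoint}",
                         json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()

def transcribe_and_score(audio_url: str) -> dict:
    """Composite service: speech-to-text, then sentiment analysis."""
    transcript = call("speech-to-text", {"url": audio_url})
    sentiment = call("sentiment", {"text": transcript.get("text", "")})
    return {"transcript": transcript, "sentiment": sentiment}

if __name__ == "__main__":
    print(transcribe_and_score("https://example.com/support-call.wav"))
```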

Haven OnDemand runs on Microsoft Azure, but its REST-based APIs can be invoked from any services-enabled environment, including Amazon Web Services or hybrid and private clouds. Haven OnDemand services originated on HP Helion, but they were relaunched on Azure after the Helion public cloud was shuttered in 2015.

The availability of machine learning services on Amazon, Azure, Google and IBM clouds is clearly a threat to Haven OnDemand. But in an onstage interview at this week’s conference, Microsoft executive Mike Schutz said the first priority for Azure is providing quickly deployable and scalable infrastructure services and compute capacity. Schutz described Haven OnDemand as a “higher-level solution.”

HPE executives say Haven OnDemand has a head start on delivering machine learning services and stressed that they’re already in production use among “hundreds of thousands of users.” The Haven OnDemand community has some 18,000 developers and the APIs are getting “millions of calls” per month, they said. HPE also noted that Haven OnDemand covers essentials for developing secure applications such as role-based permissions for access to data.

@HPE_BigData, #SeizeTheData

Haven OnDemand Combinations gives developers the ability to chain together APIs and invoke them through a single API call.

MyPOV on Haven OnDemand: There’s no doubt that competition from the big public clouds will present perception and performance challenges to Haven OnDemand over the long haul. For starters there’s the one-stop-shop appeal of using development services and machine learning services from one and the same cloud. Even in the case of Azure, Haven OnDemand is a stand-alone site (HavenOnDemand.com) rather than a library of APIs that’s exposed within the Azure Cloud.

On performance, Haven OnDemand will be challenged when the high-scale data used in an app lives in one cloud, say Amazon Web Services, while the Haven OnDemand services run on Azure. HPE execs said the Haven OnDemand APIs typically work with indexes and result sets that are a fraction of the size of the original data, minimizing performance and storage penalties, but data movement at high scale can’t help but tax performance.

In short, HPE’s competitive position will hinge on just how aggressively and successfully it and its biggest rivals pursue machine learning services and support for building smart applications.

Vertica 8.0 Bolsters Hadoop, In-Database Analytics and Cloud Support

HPE’s Vertica analytic database is a popular choice for high-scale data mart deployments, data warehouse optimization scenarios and in embedded, OEM use by software and services companies, such as current customers Domo, GoodData and Looker.

The Vertica 8.0 release announced at HPE’s Big Data Conference is due out by the end of October, and it promises better performance as well as extended support for Hadoop, in-database analytics and cloud deployment. Vertica 8.0 is said to deliver faster data loading, simplified ingestion from Amazon S3, and comprehensive visual monitoring of Apache Kafka data streams.

HPE is bolstering Vertica’s compatibility with Hadoop with support for the Parquet file format, which is widely used in Cloudera deployments. Last year Vertica gained support for ORC, a file format often used with the Hortonworks distribution, so HPE is rounding out its ability to work with leading Hadoop distributions.

On the in-database front, Vertica 8.0 gains R-based machine learning algorithms that let data scientists model against vast data sets using Vertica’s massively parallel processing, avoiding the need to move data to analytic servers or fall back on sampling. The upgrade also adds a two-way connector for Apache Spark, so data scientists can tap machine learning algorithms in Spark or push high-scale queries that might exhaust Spark’s memory down to Vertica.
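To show the division of labor such a connector enables, here is a minimal PySpark sketch that uses Spark’s generic JDBC reader rather than HPE’s connector. The JDBC URL, port and driver class are assumptions about a typical Vertica setup, not taken from HPE documentation.

```python
# Illustrative PySpark sketch: push heavy aggregation down to Vertica over
# JDBC, transform in Spark, then write the result back. The URL, port and
# driver class below are assumptions, not HPE-documented settings.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("vertica-spark-sketch")
         .getOrCreate())

jdbc_url = "jdbc:vertica://vertica-host:5433/analytics"  # assumed URL format
props = {"user": "dbadmin", "password": "secret",
         "driver": "com.vertica.jdbc.Driver"}            # assumed driver class

# Read only a summarized result set; the GROUP BY runs inside Vertica
clicks_by_day = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT event_date, COUNT(*) AS clicks "
          "FROM clickstream GROUP BY event_date) AS t",
    properties=props)

# Do further work in Spark, then push scores back into Vertica
scored = clicks_by_day.withColumn("clicks_scaled",
                                  clicks_by_day.clicks / 1000.0)
scored.write.jdbc(url=jdbc_url, table="clicks_scored",
                  mode="overwrite", properties=props)
```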

Vertica was already certified to run on Amazon Web Services, but the 8.0 release adds support for deployment on Microsoft Azure. On either cloud it’s a bring-your-own-license (BYOL) approach, but you can spin up the Vertica Community Edition from the Microsoft Azure Marketplace.

MyPOV on Vertica

As with Hadoop, as-a-service offerings seem to be the hottest database deployment choice of late, but HPE insists that its customers prefer to manage their own deployments, BYOL style. The company took a stab at database as a service with Vertica OnDemand on Helion, but unlike competitors such as Teradata and Oracle, it has since eschewed providing managed services.

Certification on Azure is a good step and I won’t be surprised to see deeper ties with Microsoft and perhaps more cloud deployment options alongside Haven OnDemand services. HPE and Microsoft have many joint customers and partners, and it’s exactly those constituents it appealed to in a keynote slide with the simple headline, “Our platform, your vision.” It’s about putting Haven OnDemand and Vertica inside customer and partner apps, and that story only gets stronger when there are plenty of flexible deployment options.

