The AI Talent Gap: Locating Global Data Science Centers

Good AI talent is hard to find. The talent pool for anyone with deep expertise in modern artificial intelligence techniques is terribly thin. More and more companies are committing to data and artificial intelligence as their differentiator. The early adopters will quickly find difficulties in determining which data science expertise meets their needs. And the AI talent? If you are not Google, Facebook, Netflix, Amazon, or Apple, good luck.

With the popularity of AI, pockets of expertise are emerging around the world. For a firm that needs AI expertise to advance its digital strategy, finding these data science hubs becomes increasingly important. In this article we look at the initiatives different countries are pushing in the race to become AI leaders and we examine existing and potential data science centers.

It seems as though every country wants to become a global AI power. With the Chinese government pledging billions of dollars in AI funding, other countries don’t want to be left behind.

In Europe, France plans to invest €1.5 billion in AI research over the next 4 years while Germany has universities joining forces with corporations such as Porsche, Bosch, and Daimler to collaborate on AI research. Even Amazon, with a contribution of €1.25 million, is collaborating in the AI efforts in Germany’s Cyber Valley around the city of Stuttgart. Not one to be left behind, the UK pledged £300 million for AI research as well.

Other countries to commit money to AI are Singapore, which committed $150 million and Canada, which not only committed $125 million, but also has large data science hubs in Toronto and Montreal. Yoshua Bengio, one of the fathers of deep learning, is from Montreal, the city with the biggest group of AI researchers in the world. Toronto has a booming tech industry that naturally attracts AI money.

Data scientists worldwide.

Across a variety of sources, data science professionals appear to be spread across the regions where we would expect them. The graphic below shows the number of members of the site Data Science Central. Since the site is in English, we expect most of its members to come from English-speaking countries; however, it still gives us some insight into which countries have higher representation.



It becomes difficult then to determine AI hubs without classifying talent by levels. One example of this is India; despite its large number of data science professionals, many of them are employed in lower-skilled roles such as data labeling and processing.

So what would be considered a data science hub? The graphic below defines a hub by the number of advanced AI professionals in the country. The countries shown here have AI talent working in companies such as Google, Baidu, Apple and Amazon. However, this omits a large group of talent that is not hired by these types of companies.



Matching the previous graph with a study conducted by Element AI, we see some commonalities, but also see some new hubs emerge. The same talent centers remain, but more countries are highlighted on the map. Element AI’s approach consisted of analyzing LinkedIn profiles, factoring in participation in conferences and publications and weighting skills highly.



As you search for AI talent, we recommend basing your search on 4 factors: workforce availability, cost of labor, English proficiency, and skill level. Kaggle, one of the most popular data science websites, conducted a salary survey with respondents from 171 countries. The results can be seen below.


Salaries are as expected, but show high variability. By aggregating salary data and the talent pool map, you can decide which countries suit your goals better. The EF English Proficiency Index shows which countries have the highest proficiency in English and can further weed out those that may have a strong AI presence or low cost of labor, but low English proficiency.
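One way to combine the four factors is a simple weighted score. The sketch below is illustrative only: the country labels, the normalized 0-to-1 factor values, and the weights are made-up placeholders, not figures from the Kaggle survey, the Element AI study, or the EF index.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    workforce: float  # talent availability, normalized to 0..1 (higher is better)
    cost: float       # cost of labor, normalized to 0..1 (higher is more expensive)
    english: float    # English proficiency, normalized to 0..1
    skill: float      # share of advanced practitioners, normalized to 0..1


# Hypothetical weights that sum to 1; tune them to your own priorities.
WEIGHTS = {"workforce": 0.3, "cost": 0.2, "english": 0.2, "skill": 0.3}


def score(c: Candidate) -> float:
    """Weighted sum; cost is inverted so that cheaper locations score higher."""
    return (
        WEIGHTS["workforce"] * c.workforce
        + WEIGHTS["cost"] * (1.0 - c.cost)
        + WEIGHTS["english"] * c.english
        + WEIGHTS["skill"] * c.skill
    )


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates ordered from best to worst score."""
    return sorted(candidates, key=score, reverse=True)


# Placeholder inputs, not real survey data.
candidates = [
    Candidate("Country A", workforce=0.9, cost=0.9, english=1.0, skill=0.9),
    Candidate("Country B", workforce=0.8, cost=0.2, english=0.5, skill=0.4),
    Candidate("Country C", workforce=0.5, cost=0.4, english=0.8, skill=0.7),
]

for c in rank(candidates):
    print(f"{c.name}: {score(c):.2f}")
```

A weighted sum is only one possible choice. A hard cutoff on English proficiency before scoring would match the "weed out" step described above more literally.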

In the end, you want to hire professionals that understand the problems you are facing and can tailor their work to your specific needs. With a global mindset, companies can mitigate talent scarcity. If you are considering sourcing talent globally, we recommend hiring strong leadership locally, who act as AI product managers that can manage a team. Hire production managers located on-site with your global talent. They can oversee any data science or AI development and report back to the product manager. KUNGFU.AI will continue to study these global trends and help ensure companies are equipped with access to the best talent to meet their needs.

Originally Posted at: The AI Talent Gap: Locating Global Data Science Centers

When Worlds Collide: Blockchain and Master Data Management

Master Data Management (MDM) is an approach to the management of golden records that has been around for over a decade and is seeing a growth spurt lately as some organizations exceed their pain thresholds in the management of common data. Blockchain has a slightly shorter history, coming aboard with bitcoin, but it is also seeing its revolution these days as data gets distributed far and wide and trust has taken center stage in business relationships.

Volumes could be written about each on its own, and given that most organizations still have a way to go with each discipline, that might be appropriate. However, good ideas wait for no one and today’s idea is MDM on Blockchain.

Thinking back over our MDM implementations over the years, it is easy to see the data distribution network becoming wider. As a matter of fact, master data distribution is now usually the most time-intensive and unwieldy part of an MDM implementation. The blockchain removes overhead, costs and unreliability from authenticated peer-to-peer network partner transactions involving data exchange. It can support one of the big challenges of MDM with governed, bi-directional synchronization of master data between the blockchain and enterprise MDM.

Another core MDM challenge is arriving at the “single version of the truth”. It’s elusive even with MDM because everyone must tacitly agree to the process used to instantiate the data in the first place. While many MDM practitioners go to great lengths to utilize the data rules from a data governance process, it is still a process subject to criticism. The consensus that blockchain can achieve is a governance proxy for that elusive “single version of the truth” by achieving group consensus for trust as well as full lineage of data.

Blockchain enables the major components and tackles the major challenges in MDM.

Blockchain provides a distributed database, as opposed to a centralized hub, that can store certified data in perpetuity. By storing timestamped and linked blocks, the blockchain is unalterable and permanent. Though not yet suited to low-latency transactions, transactions involving master data, such as financial settlements, are ideal for blockchain and can be sped up by an order of magnitude since blockchain removes the friction in a normal process.
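To make "timestamped and linked blocks" concrete, here is a minimal sketch of a hash-linked chain holding master data records. It shows only why tampering is detectable; it has no network, consensus, or permissioning, and it is not how Hyperledger or any particular product is implemented. The record fields are hypothetical.

```python
import hashlib
import json
import time

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, timestamp, data, prev_hash):
    """SHA-256 over the block's contents, including the previous block's hash."""
    payload = json.dumps(
        {"index": index, "timestamp": timestamp, "data": data, "prev_hash": prev_hash},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, data):
    """Append a timestamped block that links to the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    timestamp = time.time()
    block = {
        "index": index,
        "timestamp": timestamp,
        "data": data,
        "prev_hash": prev_hash,
        "hash": block_hash(index, timestamp, data, prev_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check every link; any edit breaks the chain."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else GENESIS_HASH
        if block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(
            block["index"], block["timestamp"], block["data"], block["prev_hash"]
        )
        if block["hash"] != recomputed:
            return False
    return True


chain = []
append_block(chain, {"customer_id": "C-001", "name": "Acme Ltd"})
append_block(chain, {"customer_id": "C-001", "name": "Acme Limited"})
print(is_valid(chain))

chain[0]["data"]["name"] = "Tampered"  # rewrite history in the first block
print(is_valid(chain))
```

Because each block's hash covers the previous block's hash, changing an old record invalidates that block and every block that follows it. Both versions of the customer record also stay on the chain, which is the lineage property discussed above.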

Blockchain uses pre-defined rules that act as gatekeepers of data quality and govern the way in which data is utilized. Blockchains can be deployed publicly (like bitcoin) or internally (like an implementation of Hyperledger). There could be a blockchain per subject area (like customer or product) in the implementation. MDM will begin by utilizing these internal blockchain networks, also known as Distributed Ledger Technology, though utilization of public blockchains is inevitable.

A shared master data ledger beyond company boundaries can, for example, contain common and agreed master data including public company information and common contract clauses with only counterparties able to see the content and destination of the communication.

Hyperledger is quickly becoming the standard for open source blockchain. Hyperledger is hosted by The Linux Foundation. IBM, with the Hyperledger Fabric, is establishing the framework for blockchain in the enterprise. Supporting master data management with a programmable interface for confidential transactions over a permissioned network is becoming a key inflection point for blockchain and Hyperledger.

Data management is about the right data at the right time, and master data is fundamental to great data management, which is why centralized approaches like the discipline of master data management have taken center stage. MDM can utilize the blockchain for distribution and governance and blockchain can clearly utilize the great master data produced by MDM. Blockchain data needs data governance like any data. This data actually needs it more given its importance on the network.

MDM and blockchain are going to be intertwined from now on. Blockchain enables the key components of establishing and distributing the single version of the truth of data. It enables trusted, governed data. It integrates this data across broad networks. It prevents duplication and provides data lineage.

It will start in MDM in niches that demand these traits such as financial, insurance and government data. You can get to know the customer better with native fuzzy search and matching in the blockchain. You can track provenance, ownership, relationship and lineage of assets, do trade/channel finance and post-trade reconciliation/settlement.

Blockchain is now a disruption vector for MDM. MDM vendors need to be at least blockchain-aware today, creating the ability for blockchain integration in the near future, such as what IBM InfoSphere Master Data Management is doing this year. Others will lose ground.

Source: When Worlds Collide: Blockchain and Master Data Management by analyticsweek

The Paradigmatic Shift of Data Fabrics: Connecting to Big Data

The notion of a data fabric encompasses several fundamental aspects of contemporary data management. It provides a singular means of data modeling, a centralized method of enforcing data governance, and an inimitable propensity for facilitating data discovery.

Its overarching significance to the enterprise, however, supersedes these individual facets of managing data. At its core, the data fabric tenet effectively signifies a transition in the way data is accessed and, to a lesser extent, even deployed.

According to Denodo Chief Marketing Officer Ravi Shankar, in a big data driven world with increasing infrastructure costs and regulatory repercussions, it has become all but necessary: “to be able to connect to the data instead of collect the data.”

Incorporating various facets of data lakes, virtualization technologies, data preparation, and data ingestion tools, a unified data fabric enables organizations to access data from any variety of locations including on-premise, in the cloud, from remote devices, or from traditional ones. Moreover, it does so in a manner which reinforces traditional centralization benefits such as consistency, uniformity, and oversight vital to regulatory compliance and data governance.

“The world is going connected,” Shankar noted. “There’s connected cars, connected devices. All of these are generating a lot of different data that is all being used to analyze information.”

Analyzing that data in times commensurate with consumer expectations—and with the self-service reporting tools that business users have increasingly become accustomed to—is possible today with a data fabric. According to Cambridge Semantics Chief Technology Officer Sean Martin, when properly implemented a data fabric facilitates “exposing all of the processing and information assets of the business to some sort of portal that has a way of exchanging data between the different sources”, which effectively de-silos the enterprise in the process.

Heterogeneous Data, Single Repository
The quintessential driver for the emergence of the enterprise data fabric concept is the ubiquity of big data and its multiple manifestations. The amounts and diversity of the types of data ingested test the limits of traditional data warehousing methods, which were not explicitly designed to account for the challenges of big data. Instead, organizations began turning to the cloud more and more frequently, while options such as Hadoop (renowned for its cheap storage) became increasingly viable. Consequently, “companies have moved away from a single consuming model in the sense that it used to be standardized for [platforms such as] BusinessObjects,” Shankar explained. “Now with the oncoming of Tableau and QlikView, there are multiple different reporting solutions through the use of the cloud. IT now wants to provide an independence to use any reporting tool.” The freedom to select the tool of choice for data analysis largely hinges on the modeling benefits of a data fabric, which helps to “connect to all the different sources,” Shankar stated. “It could be data warehousing, which many people have. It could be a big data system, cloud systems, and also other on-premises systems. The data fabric stitches all of these things together into a virtual repository and makes it available to the consumers.”

Data Modeling
From a data modeling perspective, a data fabric helps to reconcile the individual semantics involved with proprietary tools accessed through the cloud. Virtually all platforms accessed through the cloud (and many on-premise ones) have respective semantics and taxonomies which can quickly lead to vendor lock-in. “QlikView, Tableau, BusinessObjects, Cognos, all of these have semantic models that cater to their applications,” Shankar said. “Now, if you want to report with all these different forms you have to create different semantic models.” The alternative is to use the virtualization capabilities of a data fabric for effectively “unifying the semantic models within the data fabric,” Shankar said.

One of the principal advantages of this approach is to do so with semantics tailored for an organization’s own business needs, as opposed to those of a particular application. What Shankar referred to as the “high level logical data model” of a data fabric provides a single place for definitions, terms, and mapping which is applicable across the enterprise’s data. Subsequently, the individual semantic models of application tools are used in conjunction with that overlying logical business model, which provides the basis for the interchange of tools, data types, and data sources. “When the data’s in a data store it’s usually in a pretty obscure form,” Martin commented. “If you want to use it universally to make it available across your enterprise you need to convert that into a form that makes it meaningful. Typically the way we do that is by mapping it to an ontology.”
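The idea of mapping source-specific fields onto one logical business model can be sketched in a few lines. The source names, field names, and business terms below are hypothetical and do not reflect how Denodo or Cambridge Semantics implement their products.

```python
# Hypothetical mapping from each source's field names to shared business terms.
FIELD_MAP = {
    "warehouse": {"cust_nm": "customer_name", "cust_no": "customer_id"},
    "crm_cloud": {"AccountName": "customer_name", "AccountId": "customer_id"},
}


def to_logical(source: str, record: dict) -> dict:
    """Rename source-specific fields to the logical model; drop unmapped fields."""
    mapping = FIELD_MAP[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


rows = [
    to_logical("warehouse", {"cust_nm": "Acme", "cust_no": 17, "load_ts": "n/a"}),
    to_logical("crm_cloud", {"AccountName": "Globex", "AccountId": 42}),
]
print(rows)
```

Any reporting tool that reads `rows` sees the same two terms, `customer_name` and `customer_id`, regardless of which system the record came from.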

Data Governance
The defining characteristic of a data fabric is the aforementioned virtual repository for all data sources, which is one of the ways in which it builds upon the data lake concept. In addition to the uniform modeling it enables, it also supplies a singular place in which to store the necessary metadata for all sources and data types. That metadata, in turn, is one of the main ways users can create intelligent action for data discovery or search. “Since this single virtual repository actually stores all of this metadata information, the data fabric has evolved to support other functions like data discovery and search because this is one place where you can see all the enterprise data,” Shankar observed. Another benefit is the enhanced governance and security facilitated by this centralized approach in which the metadata about the data and the action created from the data is stored.

“The data fabric stores what we call metadata information,” Shankar said. “It stores information about the data, where to go find it, what type of data, what type of association and so on. It contains a bridge of the data.” This information is invaluable for determining data lineage, which becomes pivotal for effecting regulatory compliance. It can also function as a means of implementing role-based access to data “at the data fabric layer,” Shankar commented. “Since you check the source systems directly, if it comes through the data fabric it will make sure it only gives you the data you have access to.” Mapping the data to requisite business glossaries helps to buttress the enterprise-wide definitions and usage of terminology which are hallmarks of effective governance.
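As a rough illustration of metadata-driven discovery and role-based access at the fabric layer, consider the toy catalog below. The class, the role names, and the `warehouse://` location string are assumptions made for the example, not any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    location: str   # where the data physically lives
    data_type: str  # what kind of data it is
    allowed_roles: set = field(default_factory=set)


class DataFabric:
    """Toy virtual repository: it holds metadata and fetchers, not the data itself."""

    def __init__(self):
        self._catalog = {}
        self._fetchers = {}

    def register(self, name, entry, fetcher):
        """Record metadata for a source plus a callable that reads it on demand."""
        self._catalog[name] = entry
        self._fetchers[name] = fetcher

    def search(self, term):
        """Data discovery: find sources whose name contains the search term."""
        return sorted(n for n in self._catalog if term.lower() in n.lower())

    def read(self, name, role):
        """Enforce role-based access before touching the source system."""
        entry = self._catalog[name]
        if role not in entry.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {name!r}")
        return self._fetchers[name]()


fabric = DataFabric()
fabric.register(
    "customer_orders",
    CatalogEntry("warehouse://sales/orders", "table", {"analyst", "finance"}),
    lambda: [{"order_id": 1, "total": 99.5}],  # stands in for a real query
)
print(fabric.search("customer"))
print(fabric.read("customer_orders", role="analyst"))
```

The access check lives in one place, so every consumer that goes through the fabric gets the same answer about what it is allowed to see.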

Data Preparation
The data fabric tenet is also a critical means of implementing data preparation quickly and relatively painlessly—particularly when compared to conventional ETL methods. According to Shankar: “Connecting to the data is much easier than collecting, since collecting requires moving the data, replicating it, and transforming it, all of which takes time.” Interestingly enough, those temporal benefits also translate into advantages for resources. Shankar estimated that for every four IT personnel required to enact ETL, only one person is needed to connect data with virtualization technologies. These temporal and resource advantages naturally translate to a greater sense of agility, which is critical for swiftly incorporating new data sources and satisfying customers in the age of real-time. In this regard, the business value of a data fabric directly relates to the abstraction capabilities of its virtualization technologies. According to Martin, “You start to build those abstractions that give you an agility with your data and your processing. Right now, how easy is it for you to move all your things from Amazon [Web Services] to Google? It’s a big effort. How about if the enterprise data fabric was pretty much doing that for you? That’s a value to you; you don’t get locked in with any particular infrastructure provider, so you get better pricing.”

Source: The Paradigmatic Shift of Data Fabrics: Connecting to Big Data by jelaniharper

The Future Of Big Data Looks Like Streaming

Big data is big news, but it’s still in its infancy. While most enterprises at least talk about launching Big Data projects, the reality is that very few do in any significant way. In fact, according to new survey data from Dimensional, while 91% of corporate data professionals have considered investment in Big Data, only 5% actually put any investment into a deployment, and only 11% even had a pilot in place.

Real Time Gets Real

ReadWrite: Hadoop has been all about batch processing, but the new world of streaming analytics is all about real time and involves a different stack of technologies.

Langseth: Yes, however I would not entangle the concepts of real-time and streaming. Real-time data is obviously best handled as a stream. But it’s possible to stream historical data as well, just as your DVR can stream Gone with the Wind or last week’s American Idol to your TV.

This distinction is important, as we at Zoomdata believe that analyzing data as a stream adds huge scalability and flexibility benefits, regardless of whether the data is real-time or historical.

RW: So what are the components of this new stack? And how is this new big data stack impacting enterprise plans?

JL: The new stack is in some ways an extension of the old stack, and in some ways really new.

Data has always started its life as a stream. A stream of transactions in a point of sale system. A stream of stocks being bought and sold. A stream of agricultural goods being traded for valuable metals in Mesopotamia.

Traditional ETL processes would batch that data up and kill its stream nature. They did so because the data could not be transported as a stream; it needed to be loaded onto removable disks and tapes to be transported from place to place.

But now it is possible to take streams from their sources, through any enrichment or transformation processes, through analytical systems, and into the data’s “final resting place”—all as a stream. There is no real need to batch up data given today’s modern architectures such as Kafka and Kinesis, modern data stores such as MongoDB, Cassandra, Hbase, and DynamoDB (which can accept and store data as a stream), and modern business intelligence tools like the ones we make at Zoomdata that are able to process and visualize these streams as well as historical data, in a very seamless way.

Just like your home DVR can play live TV, rewind a few minutes or hours, or play movies from last century, the same is possible with data analysis tools like Zoomdata that treat time as a fluid.

Throw That Batch In The Stream

Also we believe that those who have proposed a “Lambda Architecture,” effectively separating paths for real-time and batched data, are espousing an unnecessary trade-off, optimized for legacy tooling that simply wasn’t engineered to handle streams of data be they historical or real-time.

At Zoomdata we believe that it is not necessary to separate-track real-time and historical, as there is now end-to-end tooling that can handle both from sourcing, to transport, to storage, to analysis and visualization.
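The claim that one code path can serve both historical and real-time data can be illustrated with plain Python generators. This is a conceptual sketch, not Zoomdata, Kafka, or Kinesis code; `poll` is a hypothetical callable standing in for a real event source.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator


def historical(events: list[dict]) -> Iterator[dict]:
    """Replay stored events in order, as a stream."""
    yield from events


def live(poll: Callable[[], dict], max_events: int) -> Iterator[dict]:
    """Yield events as they arrive; `poll` returns the next event each time it is called."""
    for _ in range(max_events):
        yield poll()


def running_total(stream: Iterable[dict]) -> Iterator[float]:
    """The analysis step: identical for replayed and live events."""
    total = 0.0
    for event in stream:
        total += event["amount"]
        yield total


past = [{"amount": 10.0}, {"amount": 5.0}]
incoming = iter([2.5, 7.5])  # stands in for events arriving from a live source

# Replay history, then continue seamlessly into live events.
combined = chain(historical(past), live(lambda: {"amount": next(incoming)}, max_events=2))
print(list(running_total(combined)))
```

The `running_total` function never needs to know whether an event came from storage or from the wire, which is the point being made about avoiding separate real-time and batch paths.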

RW: So this shift toward streaming data is real, and not hype?

JL: It’s real. It’s affecting modern deployments right now, as architects realize that it isn’t necessary to ever batch up data, at all, if it can be handled as a stream end-to-end. This massively simplifies Big Data architectures if you don’t need to worry about batch windows, recovering from batch process failures, etc.

So again, even if you don’t need to analyze data from five seconds or even five minutes ago to make business decisions, it still may be simplest and easiest to handle the data as a stream. This is a radical departure from the way things in big data have been done before, as Hadoop encouraged batch thinking.

But it is much easier to just handle data as a stream, even if you don’t care at all—or perhaps not yet—about real-time analysis.

RW: So is streaming analytics what Big Data really means?

JL: Yes. Data is just like water, or electricity. You can put water in bottles, or electricity in batteries, and ship them around the world by planes trains and automobiles. For some liquids, such as Dom Perignon, this makes sense. For other liquids, and for electricity, it makes sense to deliver them as a stream through wires or pipes. It’s simply more efficient if you don’t need to worry about batching it up and dealing with it in batches.

Data is very similar. It’s easier to stream big data end-to-end than it is to bottle it up.

Source: The Future Of Big Data Looks Like Streaming

What Is Happening With Women Entrepreneurs? [Infographics]


On this International Women’s Day, it might be a wise idea to learn how women are shaping the entrepreneurial landscape. Not only is the impact impressive and growing, it is also building sustained growth. In some respects, the impact is equal to or better than that of their male counterparts.

Women entrepreneurs have been on the rise for some time. More specifically, we’ve grown twice as fast as men between 1997 and 2007, with 44% growth in women-owned businesses. If that is not a cool stat, I am not sure what is.

Here are a dozen interesting factoids about how women are shaping the business landscape:

  1. In 2005, there were 7 women CEOs in the Fortune 500. As of May 2011, there were 12 women CEOs of Fortune 500 companies: not many, but growing.
  2. Approximately 32% of women business owners believe that being a woman in a male-dominated industry is beneficial.
  3. The number of women-owned companies with 100 or more employees has increased at nearly twice the growth rate of all other companies.
  4. The vast majority (83%) of women business owners are personally involved in selecting and purchasing technology for their businesses.
  5. The workforces of women-owned firms show more gender equality. Women business owners overall employ a roughly balanced workforce (52% women, 48% men), while men business owners employ 38% women and 62% men, on average.
  6. 3% of all women-owned firms have revenues of $1 million or more compared with 6% of men-owned firms.
  7. Women business owners are nearly twice as likely as men business owners to intend to pass the business on to a daughter or daughters (37% vs. 19%).
  8. Between 1997 and 2002, women-owned firms increased their employment by 70,000, whereas firms owned by men lost 1 million employees.
  9. One in five firms with revenue of $1 million or more is woman-owned.
  10. Women owners of firms with $1 million or more in revenue are more likely to belong to formal business organizations, associations or networks than other women business owners (81% vs. 61%).
  11. Women-owned firms in the U.S. are more likely than all firms to offer flex-time, tuition reimbursement and, at a smaller size, profit sharing to their workers.
  12. 86% of women entrepreneurs say they use the same products and services at home that they do in their business, for familiarity and convenience.

The road is well traveled and, boy, have we covered some distance. Let us embrace it and keep breaking the glass ceiling. Finally, happy International Women’s Day to you all!

Infographic: Women in Business
Courtesy of: CreditDonkey


How to Architect, Engineer and Manage Performance (Part 1)

This is the first of a series of blogs on how to architect, engineer and manage performance.  In it, I’d like to attempt to demystify performance by defining it clearly as well as describing methods and techniques to achieve performance requirements. I’ll also cover how to make sure the requirements are potentially achievable.

Why is Performance Important?

To start, let’s look at a few real-world scenarios to illustrate why performance is critical in today’s business:

  • It is the January sales and you are in the queue at your favorite clothing store. There are ten people in front of you, all with 10-20 items to scan, and scanning is slow. Are you going to wait or give up? Are you happy? What would a 30% increase in scan rates bring to the business? Would it mean fewer tills and staff at quiet times? Money to save, more money to make and better customer satisfaction.
  • Let’s say you work at a large bank and you are struggling to meet the payment deadline for inter-bank transfers on peak days. If you fail to meet these, you get fined 1% of the value of all transfers waiting. You need to make the process faster, increase capacity, improve parallelism. Failing could lose you money or, worse, damage your reputation with other banks and customers.
  • You need a mortgage. XYZ Bank offers the best interest rate and can give you a decision within a month. ABC Bank costs 0.1% more, but guarantees a decision in 72 hours. The vendors of the house you want to buy need a decision within the week.
  • Traders in New York and London can make millions by simply being 1ms faster than others. This is one of the rare performance goals, where there is no limit to how fast, except the cost versus the return.

Why is performance important? Because it means happy customers, cost savings, avoiding lost sales opportunities, differentiating your services, protecting your reputation and much more.

The Outline for Better Performance Processes

Let’s start with performance testing. While this is part of the performance process, on its own it is usually a case of “too little, too late” and the costs of late change are often severe.

So, let me outline a better process for achieving performance and then I’ll talk a little about each part now, and in more detail in the next few blogs:

  1. Someone should be ultimately responsible for performance – the buck stops with someone, not some team.
  2. This performance owner leads the process – they help others achieve the goals.
  3. Performance goals need to be measurable and realistic, and potentially achievable – they are the subject of discussions and agreement between parties.
  4. Goals will be broken down amongst the team delivering the project – performance is a team game.
  5. Performance will be achieved in stages using tools and techniques – there will be no miracles, magic or silver bullets.
  6. Performance must be monitored, measured and managed – how do we deal with the problems which will occur?

Responsibility for Performance

Earlier, I stated that one person should be solely responsible for performance. Let me expand upon that. When you have more than one person in charge, you can no longer be sure how the responsibility is owned. Even with two people, there will be things which you haven’t considered and which do not fit neatly into one role or the other. If possible, do not combine roles like Application and Performance Architect (Leader) as this may lead to poor compromises. In my opinion, it is far better for each person to fight for one thing and have the PM or Chief Architect judge the arguments rather than one person trying to weigh the benefits on their own. Clearly, in small projects, this is not possible, and care is necessary to ensure anyone carrying multiple roles can balance them without bias. It is very easy to favour the role we know best or enjoy most.

I probably won’t return to this topic until towards the end of the series, as it will be easier to understand after looking at the other parts in more detail.

The Performance Architect – Performance Expert, Mentor, Leader and Manager

In an ideal scenario, a Performance Architect’s role is to guide others through the performance improvement process, not to do it all themselves. No one is enough of an expert in all aspects of performance to do this. Performance Architects should:

  • Manage and orchestrate the process on behalf of the project.
  • Lead by taking responsibility for the division of a requirement(s) and alignment of the performance requirements across the project.
  • Provide expertise in performance, giving more specialised roles ways to solve the challenges of performance – estimation, measurement, tuning, etc.

This needs a bit more explanation in a future blog in this series, but I will cover some of the other parts first.

Setting Measurable Goals – Clear Goals that Reflect Reality

Setting measurable performance goals requires more work than you first think. Consider this example, “95% of transactions must complete in < 5 secs”. On the face of it this seems ok, but consider:

  • Are all transactions equal – logon, account update, new account opening?
  • What do the five seconds mean – elapsed time, processing time, what about the thinking time of the user?
  • What if the transaction fails?
  • Data variations – all customers named “Peabody” versus “Mohammed” or “Smith”.

That one requirement will need to be expanded and broken down into much more detail.  The whole process of getting the performance requirements “right” is a major task and the requirements will continue to be refined during the project as requirements change.
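Once the requirement is broken down, it can be checked mechanically. The sketch below assumes some answers to the questions above purely for illustration: each transaction type gets its own target, the time measured is elapsed seconds, and failed transactions are excluded and reported separately. The targets and sample data are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical per-type targets: (fraction of transactions, limit in seconds).
GOALS = {"logon": (0.95, 2.0), "account_update": (0.95, 5.0)}


def percentile(values, fraction):
    """Nearest-rank percentile: the smallest sample with at least
    `fraction` of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def check_goals(samples):
    """samples: iterable of (transaction_type, elapsed_seconds, succeeded).
    Returns True/False per goal, or None when there is no data to judge it."""
    by_type = defaultdict(list)
    for tx_type, elapsed, succeeded in samples:
        if succeeded:  # assumption: failures are tracked as a separate metric
            by_type[tx_type].append(elapsed)
    report = {}
    for tx_type, (fraction, limit) in GOALS.items():
        times = by_type.get(tx_type)
        report[tx_type] = percentile(times, fraction) < limit if times else None
    return report


samples = [
    ("logon", 0.8, True),
    ("logon", 1.2, True),
    ("logon", 9.0, False),  # failed, so not counted against the timing goal
    ("account_update", 3.0, True),
    ("account_update", 6.5, True),
]
print(check_goals(samples))
```

Changing any one assumption, for example counting failed transactions or including user think time, changes the result, which is why these details need to be agreed before anyone starts measuring.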

It is worth stating that trying to tune something to go as fast as possible is not the aim of performance. It isn’t a measurable goal, you don’t know when you’ve achieved the goal, and it would be very expensive.

This area can be involved; it takes time, effort and practice. It is also the basis for the rest of the work, so it is a topic for more discussion in the next blog.

Breaking the Goals Down – Dividing the Cake

If we have five seconds to perform the login transaction, we need to divide that time between the components and systems that will deliver the transaction. The Performance Architect will do this with the teams involved, but I’ve found that they’ll get it wrong at the start. They don’t have all the facts, and that doesn’t matter as long as it is right at the finish. You are probably struggling to see where to start; don’t worry about that for now, as the next section will help.
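A first division of the five seconds can be as simple as proportional shares that are revised as facts arrive. The component names and shares below are hypothetical first guesses, not a recommended split.

```python
def split_budget(total_seconds, shares):
    """Split a response-time budget in proportion to each component's share."""
    weight_sum = sum(shares.values())
    return {name: total_seconds * w / weight_sum for name, w in shares.items()}


def over_budget(budget, measured):
    """Return the components whose measured time exceeds their allocation."""
    return sorted(name for name, t in measured.items() if t > budget[name])


# Hypothetical components and first-guess shares for the login transaction.
first_guess = {"browser": 1, "network": 1, "app_server": 2, "database": 1}
budget = split_budget(5.0, first_guess)
print(budget)

# Hypothetical early measurements show where the first guess was wrong.
measured = {"browser": 0.4, "network": 0.9, "app_server": 1.5, "database": 1.6}
print(over_budget(budget, measured))
```

Here the database exceeds its one-second share while other components have slack, so the shares can be renegotiated without changing the overall five-second goal.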

I’ll look at this in a future blog, but it is probably best discussed after looking at estimation techniques.
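As a rough illustration of dividing the cake, the sketch below records a first-pass split of the five-second login budget and reports what is left over. The component names and every number are hypothetical; a real split would come from the teams involved and would be revised as estimates improve.

```python
def remaining_budget(total_secs, allocations):
    """Return the unallocated time (negative if the parts exceed the whole)."""
    return total_secs - sum(allocations.values())


# Hypothetical first-pass split of a 5-second login budget, in seconds.
login_budget = {
    "browser rendering": 0.5,
    "network": 1.0,
    "web server": 0.5,
    "application logic": 1.5,
    "database": 1.0,
}
headroom = remaining_budget(5.0, login_budget)
```

Keeping some headroom unallocated at the start is one way to absorb the early mistakes mentioned above.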

Achieving Performance – The Tools and Techniques

When most people drop a ball, they know it falls downwards due to gravity. Not many of us are physicists, and four-year-old children haven’t studied any physics, yet they don’t seem to struggle to use that experience (knowledge).

How long will an ice cube take to melt? An immediate reaction goes something like this: “I don’t know, it depends on the temperature of the water/air around the ice cube”. So make some assumptions, provide an estimate, then start asking questions and doing research to confirm the assumptions or produce a new estimate.

How long will the process X take?  How long did it take last time? What is different this time? What is similar? What does the internet suggest? Could we do an experiment? Could we build a model? Has it been done before?

Start with an estimate (a guess, an educated guess, a rule of thumb – use the best method or methods available) and then, as the project progresses, we can use other techniques to model, measure and refine.

You might be saying, “But I still don’t know!” That’s correct, and you probably never will at the outset.

Statistically, it is almost certain you won’t die by being struck by lightning (Google estimates 24,000 or 6,000 people per year on Earth pass away this way – no one knows the real figure – think about that, we don’t know the real answer).  Most of us (or all, I hope) assume we won’t be sick tomorrow, but some people will be. Nothing is certain. You don’t know the answer to nearly anything in the future with accuracy, but that doesn’t stop you making reasonable assumptions and dealing with the surprises, both good and bad.

This is a huge topic, and I need to spend some time on it in the series to build your confidence in your skills by showing you just some of the options you have now and those you can learn and develop.

“Data Science” and “Statistics” are whole areas of academic study interested in prediction, so this is more than a topic. There are probably more ways to produce estimates than there are to solve the IT problem you are estimating.

Keeping on track – Monitoring, measuring and managing

Donald Rumsfeld made the point that there are things we know, things we know we don’t know, and things that surprise us (unknown unknowns). Actually, it is worse: people often think they know and are wrong, or assume they don’t know and then realise their estimate was better than the one the project used.

Risk management, just as the Project Manager uses it, is how we deal with the whole performance process. As we progress through the project, we will build up our knowledge and confidence in the estimates, reduce the risk, and use the information to focus our effort where the greatest risk is.

A Project Manager measures the likelihood of the risk occurring and the impact.  For performance, we measure the chance of us achieving the performance goal and how confident we are of the current estimate.

This will be easier to understand once we have looked at the other parts in more depth, so we’ll revisit it in a future blog.

In future blogs in the series I will cover:

  • Setting goals – Refining the performance requirements.
  • Tools and techniques – Where to start – Estimation, Research,
  • Monitoring, measuring and managing – risk and confidence.
  • Breaking the goals down across the team.
  • Tools and techniques – Next Steps – Volumetrics, Model and Statistics
  • Tools and techniques – Later Stages – Some testing (and monitoring) options.
  • Responsibility and Leadership

The Author

Chris first became interested in computers aged 9. His formal IT education was from ages 13 to 21; informally, it has never stopped. He joined the British Computer Society (BCS) at 17 and is a Chartered Engineer, Chartered IT Professional and Fellow of the BCS. He is proud and honoured to have held several senior positions in the BCS including Trustee, Chair of Council and Chair of Membership Committee, and remains committed to IT professionalism and the development of IT professionals.

He worked for two world-leading companies before joining his third, Talend, as a Customer Success Architect. He has over 30 years of professional experience with data and information systems. He is passionate about customer success, understanding the requirements and the engineering of IT solutions.

Our Team

The Talend Customer Success Architecture (CSA) team is a growing worldwide team of 20+ highly experienced technical information professionals. It is part of Talend’s Customer Success organisation dedicated to the mission of making all Talend’s customers successful and encompasses Customer Success Management, Professional Services, Education and Enablement, Support and the CSA teams. 


The post How to Architect, Engineer and Manage Performance (Part 1) appeared first on Talend Real-Time Open Source Data Integration Software.

Originally Posted at: How to Architect, Engineer and Manage Performance (Part 1)

In the Absence of Data, Everyone is Right

I wrote a post last week that compared two ways to make decisions/predictions: 1) opinion-driven and 2) data-driven. I am a big believer in using data to help make decisions/predictions. Many pundits/analysts made predictions about who would win the US presidential election. Now that the election is over, we can compare their success rates. Comparing the pundits with Nate Silver, Mr. Silver is clearly the winner, predicting the winner of the presidential election for each state perfectly (yes, 50 out of 50 states) and the winner of the popular vote.

Summary of polling results published on 11/6/2012, one day before the 2012 presidential election.

Let’s compare how each party made their predictions. While both used publicly available polling data, political pundits appeared to make their predictions based on the results from specific polls. Nate Silver, on the other hand, applied his algorithm to a large body of publicly available polling data at the state level. Because of sampling error, results varied from poll to poll. So, even though the aggregated results of all polls painted a highly probable Obama win, the pundits could still find particular poll results to support their beliefs. (Here is a good summary of pundits who had predicted Romney would win the Electoral College and the popular vote).
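A small worked example shows why single polls can disagree while the aggregate is clear. The sketch below uses the textbook margin of error for a proportion and assumes every poll is an independent simple random sample of the same population. The 51% share, the poll size and the number of polls are made up, and this is not a description of Nate Silver's actual model.

```python
import math


def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(share * (1 - share) / sample_size)


# Hypothetical: the leading candidate truly has 51% support.
true_share = 0.51
single_poll = margin_of_error(true_share, 1000)      # one poll of 1,000 people
pooled = margin_of_error(true_share, 20 * 1000)      # twenty such polls combined

# A single poll's interval straddles 50%, so some polls will show the other
# candidate ahead. The pooled interval sits entirely above 50%.
single_poll_is_ambiguous = true_share - single_poll < 0.5
pooled_is_clear = true_share - pooled > 0.5
```

Under these assumptions one poll has a margin of roughly three points, while twenty pooled polls have a margin of well under one point. Anyone who picks individual polls can therefore find support for either outcome.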

Next, I want to present a psychological phenomenon to help explain how the situation above unfolded. How could the pundits make decisions that were counter to the preponderance of evidence available to them? Can we learn how to improve decision making when it comes to improving the customer experience?

Confirmation Bias and Decision Making

Confirmation bias is a psychological phenomenon where people tend to favor information that confirms or supports their existing beliefs and ignore or discount information that contradicts those beliefs.

Here are three different forms of confirmation bias with some simple guidelines to help you minimize their impact on decision making. These guidelines are not meant to be comprehensive. Look at them as a starting point to help you think more critically about how you make decisions using customer data. If you have suggestions about how to minimize the impact of confirmation bias, please share what you know. I would love to hear your opinion.

  1. People tend to seek out information that supports their beliefs or hypotheses. In our example, the pundits hand-picked specific poll results to make/support their predictions. What can you do? Specifically look for data to refute your beliefs. If you believe product quality is more important than service quality in predicting customer loyalty, be sure to collect evidence about the relative impact of service quality (compared to product quality).
  2. People tend to remember information that supports their position and not remember information that does not support their position. Don’t rely on your memory. When making decisions based on any kind of data, cite the specific reports/studies in which those data appear. Referencing your information source can help other people verify the information and help them understand your decision and how you arrived at it. If they arrive at a different conclusion than you, understand the source of the difference (data quality? different metrics? different analysis?).
  3. People tend to interpret information in a way that supports their opinion. There are a few things you can do to minimize the impact of confirmation bias. First, use inferential statistics to separate real, systematic, meaningful variance in the data from random noise. Second, place a verbal description of the interpretation next to each graph. A clear description leaves little room for misinterpreting the graph. Also, let multiple people interpret the information contained in customer reports. People from different perspectives (e.g., IT vs. Marketing) might provide highly different (and revealing) interpretations of the same data.


My good friend and colleague, Stephen King (CEO of TCELab) put it well when describing the problem of not using data in decision-making: “In the absence of data, everyone is right.” We tend to seek out information that supports our beliefs and disregard information that does not. This confirmation bias negatively impacts decisions by limiting what data we seek out and ignore and how we use those data. To minimize the impact of confirmation bias, act like a scientist. Test competing theories, cite your evidence and apply statistical rigor to your data.

Using Big Data integration principles to organize your disparate business data is one way to improve the quality of decision-making. Data integration around your customers facilitates open dialogue across different departments and improves hypothesis testing using different customer metrics across disparate data sources (e.g., operational, constituency, attitudinal). Both improve how you make the decisions that ultimately determine whether you win or lose customers.


Are Top Box Scores a Better Predictor of Behavior?

What does 4.1 on a 5-point scale mean? Or 5.6 on a 7-point scale?

Interpreting rating scale data can be difficult in the absence of an external benchmark or historical norms.

A popular technique used often by marketers to interpret rating scale data is the so-called “top box” and “top-two box” scoring approach.

For example, on a 5-point scale, such as the one shown in Figure 1, respondents who selected the most favorable response (“strongly agree”) fall into the top box. (See how it looks like a box and is the “top” response to select?)

Strongly-Disagree Disagree Undecided Agree Strongly-Agree
1 2 3 4 5


Figure 1: Top box of a 5-point question.

Likewise, the top-two box counts responses in the two most favorable response options (4 and 5 in Figure 1). This approach is popular when the number of response options is between 7 and 11 points. For example, the 11-point Net Promoter Question (“How likely are you to recommend this product to a friend”) has the top-two boxes of 9 and 10 designated as “Promoters” (Figure 2).

The idea behind this practice is that you’re getting only those that have the strongest feelings toward a statement. This applies to standard Likert item options (Strongly Disagree to Strongly Agree) and to other response options (Definitely Will Not Purchase to Definitely Will Purchase). The Net Promoter Score not only uses the top-two box, but also the bottom-six box approach in computing the score, which captures both kinds of extreme responders (those likely to recommend and those likely to dissuade others).


Detractors: 0 1 2 3 4 5 6
Passive: 7 8
Promoters (Top 2 Box): 9 10
Scale anchors: 0 = Not at all Likely, 5 = Neutral, 10 = Extremely Likely


Figure 2: Top-two box for the 11-point Likelihood to Recommend (LTR) question used for the Net Promoter Score.
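The scoring rules described above can be sketched in a few lines. The function names and the sample responses are assumptions for illustration. The cutoffs follow the standard Net Promoter convention (9 and 10 are promoters, 0 through 6 are detractors).

```python
def top_box_share(responses, scale_max, boxes=1):
    """Share of responses in the top `boxes` points of a scale ending at scale_max."""
    cutoff = scale_max - boxes + 1
    return sum(1 for r in responses if r >= cutoff) / len(responses)


def net_promoter_score(responses):
    """Percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100 * (promoters - detractors) / len(responses)


# Hypothetical responses to a 5-point agreement item.
agreement = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]
top_box = top_box_share(agreement, scale_max=5)               # share of 5s
top_two_box = top_box_share(agreement, scale_max=5, boxes=2)  # share of 4s and 5s

# Hypothetical responses to the 0-10 Likelihood to Recommend item.
ltr = [10, 9, 8, 7, 6, 3, 9, 10, 5, 8]
nps = net_promoter_score(ltr)
```

Note that the two passives who answered 7 or 8 count toward the sample size but move the score in neither direction.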

Of course, shifting the analysis from using means to top box percentages may seem like it provides more meaning even though it doesn’t. For example, what does it mean when 56% of respondents select 4 or 5 on a 5-point scale or 63% select 6 or 7 on a 7-point scale? Do you really have more information than with the mean? Without an external benchmark, you still don’t know whether these are good, indifferent, or poor percentages.

Loss of Information

The major problem with the top box approach is that you lose information in the transformation from rating scale to proportion. Should a person who responds with 1 on a 5-point scale be treated the same (computationally) as those who provide a neutral (3) response? The issues seem even more concerning on the 11-point LTR item. Are 0s and 1s really the same as 5s and 6s when determining detractors?

For example, from an analysis of 87 software products, we found converting the 11 points into essentially a two-point scale lost 4% of the information.

The negative impact is:

  1. Wider margins of error (more uncertainty)
  2. Needing a larger sample size to detect differences
  3. Changes over time or differences from competitors become harder to detect with the same sample size (loss of precision)

This increase in the margin of error (and its effect on sample size) can be seen in the Likelihood to Recommend scores that 53 participants gave the brand Coca-Cola (Figure 3). Using the mean LTR response, the confidence interval width is 5.2% of the range (.57/11); for the NPS computation, the confidence interval width is 9.4% of the range (18.7/200).

Figure 3: Difference in width of confidence intervals using the mean (right panel) or the official NPS (left panel) from LTR toward Coca-Cola. The NPS margin of error is almost twice the width of the mean (twice the uncertainty).
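The comparison can be reproduced in outline with a normal-approximation interval. The sketch below is an assumption about method, since the article does not say how the intervals in Figure 3 were computed, and the twenty responses are invented rather than the Coca-Cola data. It recodes each response as +100, 0 or -100 so that the mean of the recoded values is the NPS, then expresses each interval width as a share of its scale (11 points for the mean, as in the text, and 200 for the NPS).

```python
import math
import statistics


def ci_width(values, z=1.96):
    """Full width of a normal-approximation 95% confidence interval for the mean."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return 2 * z * standard_error


def nps_codes(responses):
    """Recode 0-10 responses: promoters +100, passives 0, detractors -100."""
    return [100 if r >= 9 else -100 if r <= 6 else 0 for r in responses]


# Hypothetical Likelihood to Recommend responses (not the data behind Figure 3).
ltr = [10, 9, 9, 8, 8, 8, 7, 7, 6, 5, 10, 9, 8, 7, 6, 4, 10, 9, 3, 8]

nps = statistics.mean(nps_codes(ltr))               # mean of the codes is the NPS
mean_relative = ci_width(ltr) / 11                  # width as a share of 11 points
nps_relative = ci_width(nps_codes(ltr)) / 200       # width as a share of -100..100
```

With this sample the NPS interval covers a larger share of its scale than the interval around the mean, which is the same direction as the effect in Figure 3. The size of the gap will vary with the data.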

Moving the Mean or the Extremes?

The intent of using measures like customer satisfaction, likelihood to recommend, and perceived usability is of course not just an exercise in moving the mean from 4.5 to 5.1. It should be about using leading indicators that are generally easy to collect to predict behavior that is harder to measure.

This is the general idea behind models like the service profit chain: Increased customer satisfaction is expected to lead to greater customer retention. Improved customer retention leads to greater profitability.

Reichheld and others have argued though that, in fact, it’s not the mean companies should be concerned with, but rather the extreme responders, which have a better association with repurchasing (growth). In his 2003 HBR article, Reichheld says:

“‘Promoters,’ the customers with the highest rates of repurchase and referral, gave ratings of nine or ten to the [likelihood to recommend] question.”

Reichheld also talks about the impact of extremely low responses (detractors). But is there other evidence to support the connection between extreme attitudes and behavior that Reichheld found?

The Extremes of Attitudes

There is evidence that attitudes (at least in some situations) don’t follow a simple linear pattern and that, in fact, it’s the extremes in attitudes that are better predictors of behavior.

Oliver et al. (1997) suggest that moderate attitudes fall into a “zone of indifference” and only when attitudes become extremely positive or negative do they begin to map to behavior.

Anderson and Mittal (2000) also echo this non-linear relationship and asymmetry and note that often a decrease in satisfaction will have a greater impact on behavior than an equivalent increase. They describe two types of attributes:

  • Satisfaction-maintaining attributes are what customers expect and are more likely to exhibit “negative asymmetry.” For example, consumers have come to expect clear calls and good coverage from their wireless provider; when the clarity and coverage go down, consumers get angry. As such, performance changes in the middle of a satisfaction scale are more consequential than those at the upper extreme of satisfaction (i.e. 5 out of 5).
  • Satisfaction-enhancing attributes exhibit positive asymmetry. These are often called delighters and changes in the upper range have more consequence than the middle range. For example, having free Wi-Fi on an airplane may delight customers and lead to higher levels of repeat purchasing and recommending. In this case, changes in the upper extremes of satisfaction have a better link to behavior.

van Doorn et al. (2007) conducted two studies with Dutch consumers to understand the relationship between attitudes and behavior. In the first study, they surveyed 266 Dutch consumers using five 6-point rating scales on environmental consciousness. They found an exponential relationship between attitude to the environment and number of categories of organic products purchased (e.g. meat, eggs, fruit).

Figure 4: Relationship between the number of organic products purchased and environmental concerns from van Doorn et al. (2007) shows a non-linear pattern.

They found the relationship between environmental concern and the number of organic product categories purchased is negligible for environmental concern below 5, but for extremely high levels of environmental concern, the relation is much stronger than in the linear model (see Figure 4).

In a second study, they examined the relationship between the number of loyalty cards and attitudes toward privacy from 3,657 Dutch respondents in 2004. They used two 5-point items asking about privacy concerns. In this study though, they found weaker evidence for the non-linear relationship but still found that privacy scores below 2.5 didn’t have much impact on loyalty cards. For privacy scores above 2.5 (see Figure 5), the average number of customer cards decreased more rapidly (less linear).

Figure 5: Relationship between privacy concerns and number of loyalty cards from van Doorn et al. (2007) also shows a non-linear pattern.

van Doorn et al. (2007) concluded it makes sense to target only those consumers close to or at the extreme points of the attitudinal scale: bottom-two box and top-two box.

The authors argue that in some circumstances it makes more sense to pay attention to the extremes (echoing Anderson and Mittal). Customers with very low satisfaction (bottom box) may have a greater effect on things like churn. Likewise, high satisfaction (top-two box) customers are likely to drive customer retention, which means that efforts should be made to shift customers just beneath the top-two box to above the threshold.

This asymmetry was also seen by Mittal, Ross, and Baldasare (1998). Three studies in the healthcare and automotive industries found that overall satisfaction and repurchase intentions are affected asymmetrically: negative outcomes had a disproportionate impact on satisfaction.

But not all studies show this effect with extremes. Morgan and Rego (2006), in their analysis of U.S. companies, showed that top-two box scores are a good predictor of future business performance, but actually perform slightly worse than using average satisfaction (they used a Net Promoter type question in their analysis).

de Haan et al. (2015), using data from 93 Dutch services firms in 18 industries, found that top-two box customer satisfaction performed best for predicting customer retention among 1,375 customers in a two-year follow-up survey. They found the top-two box satisfaction and officially scored NPS (using its top-two minus bottom-six approach) were slightly better predictors of customer retention than the full-scale means (Sat Mean r = .15 vs. Sat Top 2 Box r = .18; NPS Mean r = .16 vs. NPS Scored r = .17). They suggested it’s useful to transform scores to focus on very positive (or very negative) groups when predicting customer metrics, including customer retention and tenure.

Extremes of UX Attitudes

I saw the same effect of extreme attitudes on behavior in an analysis I conducted in 2012 for a wireless carrier. I looked at how attitudes toward a phone handset’s usability (using SUS) and likelihood to recommend it (NPS) related to its return rate.

In running a linear regression on both SUS and NPS to predict return rates at a product (not individual) level, I was able to explain 8% and 14% of the variation in return rates, respectively. However, when I transformed the data into extremes (SUS > 80 = high and SUS < … = low; NPS > 30% = high and NPS < -25% = low), I was able to more than double the explanatory power of attitude predicting behavior, to 20% and 27% R-square respectively.

This can be seen in Figure 6 (the pictures are for illustration only). Handsets with the highest SUS scores had less than half the return rate of handsets that scored average or below. This illustrates the non-linear relationship: movement of SUS scores from horrible (in the 30s-40s) to below average (50s-60s) didn’t affect the return rate.

Figure 6: Non-linear relationship between SUS (usability) and 30-day return rates (phone images are for illustration purposes and don’t represent the actual phones in the analysis).
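To see how coding the extremes can raise R-square when the relationship is non-linear, consider the sketch below. The seven products and their return rates are invented to mimic the flat-then-dropping shape in Figure 6, and only the SUS > 80 cutoff comes from the analysis above. For a single predictor, R-square from a linear regression equals the squared Pearson correlation, which is what the helper computes.

```python
def r_squared(xs, ys):
    """R-square of a one-predictor linear regression (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical product-level data: SUS score and 30-day return rate (%).
sus = [35, 45, 55, 65, 75, 85, 90]
return_rate = [10, 11, 9, 10, 10, 4, 5]

# Code the extreme: 1 if the product is in the high-SUS group, else 0.
high_sus = [1 if score > 80 else 0 for score in sus]

linear_fit = r_squared(sus, return_rate)        # raw SUS score as the predictor
extreme_fit = r_squared(high_sus, return_rate)  # high-SUS indicator as the predictor
```

In this toy data the indicator explains much more of the variation than the raw score does, because return rates barely move until SUS is high. With a truly linear relationship the raw score would do at least as well, which is why context matters.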

Summary & Takeaways

This analysis of the literature and our own research found:

Using top box scores loses information and increases uncertainty around the mean. The actual loss will depend on the data, but we found it was around 4% in one study. The margin of error around the estimate will in many situations approximately double when going from mean to NPS. This leads to needing larger sample sizes to detect the same differences over time or against competitors.

Data lost using top or bottom box scoring might be worth shedding. Some published research and our own analysis have found that in some situations, more extreme responses are a better predictor of behavior. More research is needed to understand the limitations and extent of this relationship.

The relationship between attitudes and behavior may be non-linear (in some cases). In situations where there is non-linear behavior, top box and bottom box scoring may capture this non-linearity better than using the mean (or other transformations), lending credence to the NPS approach.

Context matters. Not all studies showed a non-linear relationship and superiority of the top box scoring approach. In some cases, the mean was a better predictor of behavior (albeit slightly) and using both as measures of behavior seems prudent.

Bottom box might be as important. While top box scoring tends to be more common, in many cases it’s not the top box, but the bottom box that matters more. There is some evidence that extreme negative attitudes (e.g. losing what’s expected) predict behavior better, especially in cases when customers expect an attribute in a product or service.

Thanks to Jim Lewis for commenting on an earlier draft of this article.

Originally Posted at: Are Top Box Scores a Better Predictor of Behavior?

Six Practices Critical to Creating Value from Data and Analytics [INFOGRAPHIC]

IBM Institute for Business Value surveyed 900 IT and business executives from 70 countries from June through August 2013. The 50+ survey questions were designed to help translate concepts relating to generating value from analytics into actions. They found that business leaders adopt specific strategies to create value from data and analytics. Leaders:

  1. are 166% more likely to make decisions based on data.
  2. are 2.2 times more likely to have a formal career path for analytics.
  3. cite growth as the key source of value.
  4. measure the impact of analytics investments.
  5. have predictive analytics capabilities.
  6. have some form of shared analytics resources.

Read my summary of IBM’s study here. Download the entire study here. And check out IBM’s infographic below.

IBM Institute for Business Value - 2013 Infographic


Source by bobehayes