Data Sources for Cool Data Science Projects: Part 1

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data.  That’s why our Fellows work on cool capstone projects that showcase those skills.  One of the biggest obstacles to a successful project is getting access to interesting data.  Here are a few cool public data sources you can use for your next project:

Economic Data:

  1. Publicly Traded Market Data: Quandl is an amazing source of finance data. Google Finance and Yahoo Finance are additional good sources of data.  Corporate filings with the SEC are available on EDGAR.
  2. Housing Price Data: You can use the Trulia API or the Zillow API.
  3. Lending Data: You can find student loan defaults by university, as well as the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
  4. Home Mortgage Data: There is data made available under the Home Mortgage Disclosure Act, and there’s a lot of data from the Federal Housing Finance Agency available here.

Content Data:

  1. Review Content: You can get reviews of restaurants and physical venues from Foursquare and Yelp (see geodata).  Amazon has a large repository of product reviews.  Beer reviews from Beer Advocate can be found here.  Rotten Tomatoes movie reviews are available from Kaggle.
  2. Web Content: Looking for web content?  Wikipedia provides dumps of its articles.  Common Crawl has a large corpus of the internet available.  ArXiv makes all of its data available via bulk download from AWS S3.  Want to know which URLs are malicious?  There’s a dataset for that.  Music data is available from the Million Song Dataset.  You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
  3. Media Data: There are openly annotated articles from the New York Times, the Reuters dataset, and the GDELT project (a consolidation of many different news sources).  Google Books has published n-grams for books going back to 1800.
  4. Communications Data: There’s access to public messages of the Apache Software Foundation and communications among former executives at Enron.

Government Data:

  1. Municipal Data: Crime data is available for the City of Chicago and Washington, DC.  Restaurant inspection data is available for Chicago and New York City.
  2. Transportation Data: NYC Taxi Trips in 2013 are available courtesy of the Freedom of Information Act.  There’s bikesharing data from NYC, Washington DC, and SF.  There’s also Flight Delay Data from the FAA.
  3. Census Data: Japanese census data.  US Census data from 2010, 2000, and 1990.  From census data, the government has also derived time-use data.  EU census data.  Check out popular male and female baby names going back to the 19th century from the Social Security Administration.
  4. World Bank: They have a lot of data available on their website.
  5. Election Data: Political contribution data for the last few US elections can be downloaded from the FEC here and here.  Polling data is available from Real Clear Politics.
  6. Food, Drugs, and Devices Data: The FDA provides a number of high value public datasets.


While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science.  And when you are ready, you can apply to be a Fellow!

Got any more data sources?  Let us know and we’ll add them to the list!

This article appeared in The Data Incubator on October 16, 2014. 

Originally Posted at: Data Sources for Cool Data Science Projects: Part 1

Jun 22, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..)





[ AnalyticsWeek BYTES]

>> Talent analytics in practice by analyticsweekpick

>> Four Use Cases for Healthcare Predictive Analytics, Big Data by anum

>> The 7 Best Data Science and Machine Learning Podcasts by analyticsweekpick



 Data science and analytic centre opened – The Hindu Under  Data Science

 Cyber security or cyber snooping? | Bangkok Post: opinion – Bangkok Post Under  cyber security

 CA Technologies claims its payment security … – CA Technologies Under  Risk Analytics



Hadoop Starter Kit


Hadoop learning made easy and fun. Learn HDFS, MapReduce and introduction to Pig and Hive with FREE cluster access…. more


The Misbehavior of Markets: A Fractal View of Financial Turbulence


Mathematical superstar and inventor of fractal geometry, Benoit Mandelbrot, has spent the past forty years studying the underlying mathematics of space and natural patterns. What many of his followers don’t realize is th… more


Strong business case could save your project
Like anything in corporate culture, a project is often about the business, not the technology. The same thinking applies to data analysis: it is not always about the technicalities but about the business implications. Data science project success criteria should therefore include project management success criteria as well. This ensures smooth adoption, easy buy-in, room for wins, and cooperative stakeholders. So a good data scientist should also possess some qualities of a good project manager.


Q: Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Does it depend on the context?
A: * “Premature optimization is the root of all evil.”
* At the beginning, a quick-and-dirty model is better.
* Optimize later.
Other answer:
– It depends on the context.
– Is error acceptable? Consider fraud detection or quality assurance, where errors are costly.
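The “quick-and-dirty first” advice is often made concrete by starting with a trivial baseline and only optimizing once it is beaten. A minimal sketch with invented data (the labels and the fraud-detection framing are hypothetical, just for illustration):

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Quick-and-dirty model: always predict the most common training label."""
    most_common, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common

def accuracy(model, examples):
    """Fraction of (features, label) pairs the model labels correctly."""
    return sum(model(f) == y for f, y in examples) / len(examples)

# Hypothetical fraud-detection data: 90% of transactions are legitimate.
train = ["legit"] * 90 + ["fraud"] * 10
test = [({"amount": i}, "legit") for i in range(9)] + [({"amount": 99}, "fraud")]

baseline = majority_class_baseline(train)
print(accuracy(baseline, test))  # 0.9 -- the bar any "real" model must clear
```

A baseline like this takes minutes to build; whether the remaining 10% is worth five more days is exactly the context question the answer raises.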



@AnalyticsWeek Panel Discussion: Marketing Analytics




If you can’t explain it simply, you don’t understand it well enough. – Albert Einstein


#BigData @AnalyticsWeek #FutureOfData #Podcast with Juan Gorricho, @disney





Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.

Sourced from: Analytics.CLUB #WEB Newsletter

Improving Big Data Governance with Semantics

By Dr. Jans Aasman, Ph.D., CEO of Franz Inc.

Effective data governance consists of protocols, practices, and the people necessary for implementation to ensure trustworthy, consistent data. Its yields include regulatory compliance, improved data quality, and data’s increased valuation as a monetary asset that organizations can bank on.

Nonetheless, these aspects of governance would be impossible without what is arguably its most important component: the common terminologies and definitions that are sustainable throughout an entire organization, and which comprise the foundation for the aforementioned policy and governance outcomes.

When intrinsically related to the technologies used to implement governance protocols, terminology systems (containing vocabularies and taxonomies) can unify terms and definitions at a granular level. The result is a greatly increased ability to tackle the most pervasive challenges associated with big data governance including recurring issues with unstructured and semi-structured data, integration efforts (such as mergers and acquisitions), and regulatory compliance.

A Realistic Approach
Designating the common terms and definitions that are the rudiments of governance varies according to organization, business units, and specific objectives for data management. Creating policy from them and embedding them in technology that can achieve governance goals is perhaps most expediently and sustainably facilitated by semantic technologies, which are playing an increasingly pivotal role in the overall implementation of data governance in the wake of big data’s emergence.

Once organizations adopt a glossary of terminology and definitions, they can then determine rules about terms based on their relationships to one another via taxonomies. Taxonomies are useful for disambiguation purposes and can clarify preferred labels—among any number of synonyms—for different terms in accordance with governance conventions. These definitions and taxonomies form the basis for automated terminology systems that label data according to governance standards via inputs and outputs. Ingested data adheres to terminology conventions and is stored according to preferred labels. Data captured prior to the implementation of such a system can still be queried according to the system’s standards.
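In code, the disambiguation step above amounts to mapping every synonym onto its preferred label before storage. A minimal sketch, with a vocabulary that is entirely hypothetical (not drawn from any real terminology standard):

```python
# Hypothetical governance vocabulary: each preferred label lists its synonyms.
VOCABULARY = {
    "myocardial infarction": ["heart attack", "MI", "cardiac infarction"],
    "hypertension": ["high blood pressure", "HTN"],
}

# Invert it into a synonym -> preferred-label lookup table.
PREFERRED = {
    syn.lower(): preferred
    for preferred, synonyms in VOCABULARY.items()
    for syn in [preferred] + synonyms
}

def normalize(term):
    """Store every ingested term under its preferred label."""
    return PREFERRED.get(term.lower(), term)  # unknown terms pass through

print(normalize("Heart Attack"))  # myocardial infarction
print(normalize("HTN"))           # hypertension
```

Ingested data then adheres to the convention automatically, and queries against the preferred labels also find records that arrived under a synonym.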

Linking Terminology Systems: Endless Possibilities
The possibilities that such terminology systems produce (especially for unstructured and semi-structured big data) are virtually limitless, particularly with the linking capabilities of semantic technologies. In the medical field, a handwritten note hastily scribbled by a doctor can be readily transcribed by the terminology system in accordance with governance policy with preferred terms, effectively giving structure to unstructured data. Moreover, it can be linked to billing coding systems per business functions. That structured data can then be stored in a knowledge repository and queried along with other data, adding to the comprehensive integration and accumulation of data that gives big data its value.

Focusing on common definitions and linking terminology systems enables organizations to leverage business intelligence and analytics on different databases across business units. This method is also critical for determining customer disambiguation, a frequently occurring problem across vertical industries. In finance, it is possible for institutions with numerous subsidiaries and acquisitions (such as Citigroup, Citibank, Citi Bike, etc.) to determine which subsidiary actually spent how much money with the parent company and additional internal, data-sensitive problems by using a common repository. Also, linking the different terminology repositories for these distinct yet related entities can achieve the same objective.

The primary way in which semantics addresses linking between terminology systems is by ensuring that those systems are utilizing the same words and definitions for the commonality of meaning required for successful linking. Vocabularies and taxonomies can provide such commonality of meaning, which can be implemented with ontologies to provide a standards-based approach to disparate systems and databases.

Subsequently, all systems that utilize those vocabularies and ontologies can be linked. In finance, the Financial Industry Business Ontology (FIBO) is being developed to grant “data harmonization and…the unambiguous sharing of meaning across different repositories.” The life sciences industry is similarly working on industry-wide standards so that numerous databases can be made available to all within the industry, while still restricting access to internal drug discovery processes by organization.

Regulatory Compliance and Ontologies
In terms of regulatory compliance, organizations are far more agile in accounting for new requirements when data throughout disparate systems and databases is linked and commonly shared, requiring just a single update as opposed to numerous time-consuming updates in multiple places. Issues of regulatory compliance are also assuaged in a semantic environment through the use of ontological models, which provide the schema for creating a model specifically in adherence to regulatory requirements.

Organizations can use ontologies to describe such requirements, then write rules for them that both restrict and permit access and usage according to regulations. Although ontological models can also be created for any other sort of requirements pertaining to governance (metadata, reference data, etc.) it is somewhat idealistic to attempt to account for all facets of governance implementation via such models. The more thorough approach is to do so with terminology systems and supplement them accordingly with ontological models.
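A sketch of the “describe requirements, then write rules” idea, using a toy class hierarchy in place of a real ontology language (every class and role name below is hypothetical):

```python
# Toy ontology: each data class points to its parent class.
ONTOLOGY = {
    "patient_record": "personal_data",
    "personal_data": "governed_data",
    "stock_tick": "market_data",
    "market_data": "governed_data",
}

# Regulatory rule: which roles may access which class (inherited by subclasses).
ACCESS_RULES = {
    "personal_data": {"compliance_officer", "physician"},
    "market_data": {"analyst", "compliance_officer"},
}

def ancestors(data_class):
    """Walk up the hierarchy, yielding the class and all of its superclasses."""
    while data_class is not None:
        yield data_class
        data_class = ONTOLOGY.get(data_class)

def may_access(role, data_class):
    """Permit access if any class on the path to the root allows this role."""
    return any(role in ACCESS_RULES.get(c, set()) for c in ancestors(data_class))

print(may_access("physician", "patient_record"))  # True
print(may_access("analyst", "patient_record"))    # False
```

Because rules attach to classes rather than to individual datasets, a new regulatory requirement is a single edit to `ACCESS_RULES` that propagates to every subclass, mirroring the single-update benefit described above.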

Terminologies First
The true value of a semantic approach to big data governance that focuses on terminology systems, their requisite taxonomies, and vocabularies is that this method is effective for governing unstructured data. Regardless of what particular schema (or lack thereof) is available, organizations can make their data adhere to governance protocols by focusing on terms, definitions, and the relationships between them. Conversely, ontological models have a demonstrated efficacy with structured data. Given that the majority of new data created is unstructured, the best means of wrapping effective governance policies and practices around it is to leverage the terminology systems and semantic approaches that consistently achieve governance outcomes.

About the Author: Dr. Jans Aasman, Ph.D., is the CEO of Franz Inc., an early innovator in Artificial Intelligence and a leading supplier of Semantic Graph Database technology. Dr. Aasman’s previous experience and educational background include:
• Experimental and cognitive psychology at the University of Groningen, specialization: Psychophysiology, Cognitive Psychology.
• Tenured Professor in Industrial Design at the Technical University of Delft. Title of the chair: Informational Ergonomics of Telematics and Intelligent Products
• KPN Research, the research lab of the major Dutch telecommunication company
• Carnegie Mellon University. Visiting Scientist at the Computer Science Department of Prof. Dr. Allan Newell

Originally Posted at: Improving Big Data Governance with Semantics

How the lack of the right data affects the promise of big data in India

Big data is the big buzzword these days. Big data refers to a collection of data sets or information too large and complex to be processed by standard tools. It is the art and science of combining enterprise data, social data and machine data to derive new insights that would otherwise be impossible to obtain. It is also about combining past data with real-time data to predict or suggest outcomes in a current or future context.


The digital footprint is progressively expanding world over, into fragmented mediums (blogs, tweets, reviews etc.) and technologies (mobile, web, cloud/SaaS etc.).

Digital landscape in India

India’s digital landscape may be evolving quickly, but overall penetration remains low: only 1 in 5 Indians was using the Internet as of July 2014.

In India, enterprises and businesses have access to a veritable wealth of information. And though some of the larger organisations have made a start in harnessing this information, most Indian companies are still learning how to collect and store big data.

Telecom providers, online travel agencies and online retail stores are some of the industries that are using big data analytics to engage customers in some way or another.

However, big data analytics is still in its infancy in India. Most companies are still learning to store the data collected. Also, there are several challenges when it comes to the collection of data sets themselves. Past and current data is required to make the application of big data analytics really useful, and there is a scarcity of this in public and private sectors in India. Some of the reasons for the lack of enough data are:

Yet to be fully computerised

Healthcare, economic, and statistical data, in both private and public sectors in India, is yet to be computerised. The main reason for this is the late adoption of IT in India. Unlike in the West, most industries in India made the transition from manual records to computerised information systems only during the last decade.

Over the years, the state and central ministries have made moves towards e-governance.  Efforts to deliver public services, and to make access to these services easier, are being made as well. This is still a work in progress; huge amounts of data across many government sectors are yet to be digitised.

Quality of data

In big data analytics, data sufficiency plays a critical role when samples are run across different dimensions. Sufficient data points are required to make informed analyses. Not only the quantity but also the quality of the data being crunched influences the quality of insights.  If the signal-to-noise ratio is low, the accuracy of results will suffer for less-than-optimal data samples. In a country like India, there is very little information about individuals, partly because Indians are not overly expressive, especially on public forums.

Public social media information that is available for most individuals from India lacks quality information about users themselves. Random facts and figures in individual profiles, sharing of spam content, and fake social media accounts that are created for bots are very common in India.


Social media sites are becoming increasingly vulnerable to spam attacks. Time spent by a captive audience on social media sites opens up windows of opportunities for online threats and spammers.

Again, social media spam lowers the signal-to-noise ratio that defines the quality of big data, which takes away from the accuracy of results.

Cultural and Social influences

In most western markets, insights generated through big data can be applied across the whole consumer base. However, given the extensive cultural and linguistic variation across India, any insight generated for a consumer based out of Chandigarh, for example, will not be directly applicable to a consumer based in Chennai. This problem is made worse by the fact that a lot of local data lives in regional publications, in different languages, and has very limited online visibility.

Unstructured data leads to mapping issues

Big data in India is not structured. Most transactional data in the healthcare and retail segments is stored purely for book-keeping purposes and carries very little of the contextual information that would help big data analytics map enterprise-generated transactional data to public information.

In the case of developed countries, user data is rich enough to provide demographic or group level markers that can be used to generate customized insights while maintaining individual privacy. Lack of these standard identifiers in Indian consumer data is one of the biggest bottlenecks while mapping various transactional and social records in India.
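Without standard identifiers, mapping a transactional record to a public profile falls back on fuzzy matching of whatever fields overlap. A crude sketch using stdlib string similarity (the records below are invented; real record linkage uses far more robust techniques):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records that lack any shared identifier.
transaction = {"name": "S. Sastri", "city": "Chennai"}
profiles = [
    {"name": "Srikant Sastri", "city": "Chennai"},
    {"name": "Sunil Shastri", "city": "Chandigarh"},
]

# Score each public profile against the transactional record.
best = max(
    profiles,
    key=lambda p: similarity(transaction["name"], p["name"])
    + similarity(transaction["city"], p["city"]),
)
print(best["name"])
```

The fragility of this approach (a nickname, a transliteration variant, or a missing city breaks the match) is exactly the bottleneck the paragraph describes.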

Handsets and internet connectivity

Even though smartphones are driving the new handset market in India, feature phones still dominate everyday usage. Most connections in India are pre-paid, and fewer than 10% of users have access to 3G networks. To add to this, internet connection speeds are among the lowest in Asia. As a result, consumer data, especially retail enterprise data, is limited.

As more people in India move to smartphones, and internet connectivity improves, there will be an increase in the amount of usable data generated. As big data analytics is in its infancy in India today, huge efforts will need to be made to improve the quality of data stored by organisations and enterprises. However, key contributors to the promise of big data analytics in India are steadily gaining ground. An increase in social media users, and efforts by enterprises, both public and private, toward optimum collection and storage of transactional enterprise data, will contribute to better quality data sets and the better application of big data analytics.

 About the Author: Srikant Sastri is the Co-founder of Crayon Data.

To read the original article on YourStory, click here.

Source by analyticsweekpick

Jun 15, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..)



[ AnalyticsWeek BYTES]

>> Talent analytics in practice by analyticsweekpick

>> Data center location – your DATA harbour by martin

>> The 10 Commandments for data driven leaders by v1shal



 Women and Men Now Grocery Shop Equally: Study … – Progressive Grocer Under  Prescriptive Analytics

 Fast-moving big data changes data preparation process for analytics – TechTarget Under  Big Data

 Scientists use Tweet ‘sentiment analysis’ to predict Hillary Clinton win – Daily News & Analysis Under  Sentiment Analysis



Applied Data Science: An Introduction


As the world’s data grow exponentially, organizations across all sectors, including government and not-for-profit, need to understand, manage and use big, complex data sets—known as big data…. more


Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython


Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. It is also a practical, modern introduction to scientific computing in Python, tailored f… more




Q: Do you think 50 small decision trees are better than a large one? Why?
A: * Yes!
* The ensemble is a more robust model (many weak learners combine into a strong learner).
* It is better to improve a model by taking many small steps than a few large steps.
* If one tree is erroneous, it can be corrected by the following trees.
* The ensemble is less prone to overfitting.
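The intuition that many weak trees beat one big one can be checked with a back-of-the-envelope calculation: if each of 51 learners is right 70% of the time and votes independently, the majority vote's accuracy is a binomial tail sum (independence is an idealization; real trees are correlated, which is why bagging and random feature selection exist):

```python
from math import comb

def majority_vote_accuracy(n_learners, p_correct):
    """P(majority of n independent learners is correct), for odd n."""
    majority = n_learners // 2 + 1
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(majority, n_learners + 1)
    )

single = 0.7
ensemble = majority_vote_accuracy(51, single)
print(round(ensemble, 4))  # well above the 0.7 of any single learner
```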



#FutureOfData Podcast: Peter Morgan, CEO, Deep Learning Partnership




For every two degrees the temperature goes up, check-ins at ice cream shops go up by 2%. – Andrew Hogue, Foursquare


#FutureOfData Podcast: Conversation With Sean Naismith, Enova Decisions





Estimates suggest that by better integrating big data, healthcare could save as much as $300 billion a year — that’s equal to reducing costs by $1000 a year for every man, woman, and child.


Big Data: Are you ready for blast-off?

As Technology of Business begins a month-long series of features on the theme of Big Data, we kick off with a Q&A backgrounder answering some of those basic questions you were too afraid to ask.

Good question. After all, we’ve always had large amounts of data haven’t we, from loyalty card schemes, till receipts, medical records, tax returns and so on?

As Laurie Miles, head of analytics for big data specialist SAS, says: “The term big data has been around for decades, and we’ve been doing analytics all this time. It’s not big, it’s just bigger.”

But it is the velocity, variety and volume of data that have merited the new term.

So what made it bigger?

Most traditional data was structured, or neatly organised in databases. Then the world went digital and the internet came along. Most of what we do could be translated into strings of ones and noughts capable of being recorded, stored, searched, and analysed.

There was a proliferation of so-called unstructured data generated by all our digital interactions, from email to online shopping, text messages to tweets, Facebook updates to YouTube videos.

As the number of mobile phones grows globally, so does the volume of data they generate from call metadata, texts, emails, social media updates, photos, videos, and location

And the number of gadgets recording and transmitting data, from smartphones to intelligent fridges, industrial sensors to CCTV cameras, has also proliferated globally, leading to an explosion in the volume of data.

These data sets are now so large and complex that we need new tools and approaches to make the most of them.

How much data is there?

Nobody really knows because the volume is growing so fast. Some say that about 90% of all the data in the world today has been created in the past few years.

According to computer giant IBM, 2.5 exabytes – that’s 2.5 billion gigabytes (GB) – of data was generated every day in 2012. That’s big by anyone’s standards. “About 75% of data is unstructured, coming from sources such as text, voice and video,” says Mr Miles.

And as mobile phone penetration is forecast to grow from about 61% of the global population in 2013 to nearly 70% by 2017, those figures can only grow. The US government’s open data project already offers more than 120,000 publicly available data sets.

Where is it all stored?

The first computers came with memories measured in kilobytes, but the latest smartphones can now store 32GB and many laptops now have one terabyte (1,000GB) hard drives as standard. Storage is not really an issue anymore.
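The storage units quoted in this section (gigabytes, terabytes, exabytes, yottabytes) differ only by powers of ten, which is easy to sanity-check in a few lines:

```python
# Decimal (SI) byte units, expressed in gigabytes.
GB_PER = {"GB": 1, "TB": 1e3, "PB": 1e6, "EB": 1e9, "ZB": 1e12, "YB": 1e15}

def to_gb(amount, unit):
    """Convert an amount in the given unit to gigabytes."""
    return amount * GB_PER[unit]

print(to_gb(2.5, "EB"))  # 2.5e9 GB -- IBM's "2.5 billion gigabytes" per day
print(to_gb(1, "YB"))    # 1e15 GB -- a thousand trillion gigabytes
```

Both figures cited in the article check out: 2.5 exabytes is indeed 2.5 billion GB, and a yottabyte is a thousand trillion GB.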

The US National Security Agency has built a huge data centre in Bluffdale, Utah – codenamed Bumblehive – capable of storing a yottabyte of data – that’s one thousand trillion gigabytes

For large businesses “the cost of data storage has plummeted,” says Andrew Carr, UK and Ireland chief executive of IT consultancy Bull. Businesses can either keep all their data on-site, in their own remote data centres, or farm it out to “cloud-based” data storage providers.

A number of open source platforms have grown up specifically to handle these vast amounts of data quickly and efficiently, including Hadoop and NoSQL databases such as MongoDB and Cassandra.

Why is it important?

Data is only as good as the intelligence we can glean from it, and that entails effective data analytics and a whole lot of computing power to cope with the exponential increase in volume.

But a recent Bain & Co report found that, of 400 large companies, those that had already adopted big data analytics “have gained a significant lead over the rest of the corporate world.”

“Big data is not just historic business intelligence,” says Mr Carr, “it’s the addition of real-time data and the ability to mash together several data sets that makes it so valuable.”

Practically anyone who makes, grows or sells anything can use big data analytics to make their manufacturing and production processes more efficient and their marketing more targeted and cost-effective.

It is throwing up interesting findings in the fields of healthcare, scientific research, agriculture, logistics, urban design, energy, retailing, crime reduction, and business operations – several of which we’ll be exploring over the coming weeks.

By analysing weather, soil, topography and GPS tractor data, farmers can increase crop yields

“It’s a big deal for corporations, for society and for each individual,” says Ralf Dreischmeier, head of The Boston Consulting Group’s information technology practice.

Can we handle all this data?

Big data needs new skills, but the business and academic worlds are playing catch up. “The job of data scientist didn’t exist five or 10 years ago,” says Duncan Ross, director of data science at Teradata. “But where are they? There’s a shortage.”

And many businesses are only just waking up to the realisation that data is a valuable asset that they need to protect and exploit. “Banks only use a third of their available data because it often sits in databases that are hard to access,” says Mr Dreischmeier.

“We need to find ways to make this data more easily accessible.”

Businesses, governments and public bodies also need to keep sensitive data safe from hackers, spies and natural disasters – an increasingly tall order in this mobile, networked world.

Who owns it all?

That’s the billion dollar question. A lot depends on the service provider hosting the data, the global jurisdiction it is stored in, and how it was generated. It is a legal minefield.

Facebook’s logo – created using photos of its global users – adorns the wall of a new data centre in Sweden – its first outside the US. But who has rights to all the data?

Does telephone call metadata – the location, time, and duration of calls rather than their conversational content – belong to the caller, the phone network or any government spying agency that happens to be listening in?

When our cars become networked, will it be the drivers, owners or manufacturers who own the data they generate?

Social media platforms will often say that their users own their own content, but then lay claim to how that content is used, reserving the right to share it with third parties. So when you tweet you effectively give up any control over how that tweet is used in future, even though Twitter terms and conditions say: “What’s yours is yours.”

Privacy and intellectual property laws have not kept up with the pace of technological change.

Originally posted via “Big Data: Are you ready for blast-off?”

Source: Big Data: Are you ready for blast-off? by anum

It’s Time to Tap into the Cloud Data Protection Market Opportunity

Until now, most businesses did not have the access or resources to implement more complete data protection, including advanced backup, disaster recovery, and secure file sync and share. In fact, a recent study from research firm IDC found that 70% of SMBs have insufficient disaster recovery protection today. At the same time, a recent Spiceworks survey reported that cloud backup and recovery is the top cloud service that IT Pros plan to start using in the next six months.

The good news is that companies today have more options for data protection than ever before. The cloud makes enterprise-grade backup and disaster recovery solutions accessible and affordable for SMBs–and this translates into a massive market opportunity for service providers.

At Acronis, we believe that service providers are uniquely positioned to tap into the cloud to bring best-in-class data protection services to their customers.

We all know that service providers are experts at providing IT services, including administration, maintenance and customer support. They’ve opened up the door to cloud computing for businesses of all sizes, especially for SMBs.

But service providers do much more than provide cloud solutions, servers and storage. For example, service providers are constantly improving the efficiency and cost-effectiveness of the solutions they deliver, including integrating different services into completely transparent and uniform offerings for their customers.

Service providers also look for opportunities to continuously enhance their offerings to provide end customers with the best possible solutions–now and in the future. Finally, service providers are the best cost managers in the business–they know how to scale solutions and make them easier to buy and deploy for end users. This relentless focus on cost-effectiveness benefits both their businesses with higher margins and their end customers with better value at a lower cost.

This is why Acronis delivers a complete set of cloud data protection solutions for service providers. We know service providers, and we know what it takes to make them successful. And there is a huge and unmet market need for easy, complete and affordable data protection for small and midsize businesses.

The bottom line: Now’s the ideal time to check out how you can grow your business with the latest solutions in cloud data protection, leveraging highly flexible go-to-market models and support for the broadest range of service provider workloads.

If you’d like to learn more about how Acronis can help you quickly tap into the growing market for cloud data protection services, you’ll find more information about our solutions here.

Originally Posted at: It’s Time to Tap into the Cloud Data Protection Market Opportunity by analyticsweekpick

Jun 08, 17: #AnalyticsClub #Newsletter (Events, Tips, News & more..)



[ AnalyticsWeek BYTES]

>> 8 Best Practices to Maximize ROI from Predictive Analytics by analyticsweekpick

>> Data Driven Innovation: A Primer by v1shal

>> Map of US Hospitals and their Patient Experience Ratings by bobehayes



 RFx (request for x) encompasses the entire formal request process and can include any of the following: – TechTarget Under  Sales Analytics

 SAP’s Leonardo points towards Applied Data Science as a Service – Diginomica Under  Data Science

 Four ways to create the ultimate personalized customer experience – TechTarget Under  Customer Experience



Pattern Discovery in Data Mining


Learn the general concepts of data mining along with basic methodologies and applications. Then dive into one subfield in data mining: pattern discovery. Learn in-depth concepts, methods, and applications of pattern disc… more


Superintelligence: Paths, Dangers, Strategies


The human brain has some capabilities that the brains of other animals lack. It is to these distinctive capabilities that our species owes its dominant position. Other animals have stronger muscles or sharper claws, but … more


Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric: the metric that matters most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.


Q: When you sample, what bias are you inflicting?
A: Selection bias:
– An online survey about computer use is likely to attract people more interested in technology than typical users

Undercoverage bias:
– Sampling too few observations from a segment of the population

Survivorship bias:
– Observations at the end of the study are a non-random set of those present at the beginning of the investigation
– In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist
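Survivorship bias in particular is easy to see in a toy simulation (an illustrative sketch, not data from the newsletter): generate fund returns, drop the "failed" funds from the sample, and compare the surviving funds' average return with the true average.

```python
import random

random.seed(0)

# Hypothetical illustration of survivorship bias: 1,000 funds each draw a
# yearly return from a normal distribution with a true mean of 0%. Funds
# that lose more than 5% are closed and vanish from the final sample.
funds = [random.gauss(0.0, 0.10) for _ in range(1000)]

survivors = [r for r in funds if r > -0.05]  # failed funds are excluded

true_mean = sum(funds) / len(funds)
observed_mean = sum(survivors) / len(survivors)

print(f"true mean return:   {true_mean:+.3f}")
print(f"survivor-only mean: {observed_mean:+.3f}")  # biased upward
```

Because only the lower tail is removed, the survivor-only average is always pulled upward, which is exactly the effect described in the finance example above.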



@AnalyticsWeek: Big Data Health Informatics for the 21st Century: Gil Alterovitz




Data beats emotions. – Sean Rad


#DataScience Approach to Reducing #Employee #Attrition





100 terabytes of data are uploaded to Facebook daily.

Sourced from: Analytics.CLUB #WEB Newsletter

Malaysia opens digital government lab for big data analytics

Malaysia today officially launched a government lab to analyse data from across agencies and to test new ways of using the data to improve public services.

The Digital Government Lab will “facilitate various ministries and agencies to have a greater level of analytics of the collected data, in strict adherence to the government’s data security, integrity and sovereignty guidelines”, said Datuk Abdul Wahab Abdullah, Chief Executive of MIMOS, the country’s national ICT research agency.

The lab was set up in January by MIMOS, the Modernisation and Management Planning Unit, and the Multimedia Development Corporation, as part of a wider national Big Data Analytics Innovation Network.

It is already testing ideas to analyse the public mood on a newly introduced tax and on flooding. With the Ministry of Finance, “we are extracting some data related to the newly implemented GST [Goods and Services Tax] by the Malaysian government, and we are looking at the sentiments of the public, extracting data from social network and blogs, so that this information will provide a better reading of the sentiments”, a MIMOS spokesperson told FutureGov.

Another project with the Department of Irrigation and Drainage is “looking at data from sensors and also feedback from the public on social media on flood issues, and others related to irrigation”, he said.

Other agencies testing their ideas at the lab are the Department of Islamic Development and National Hydraulic Research Institute.

The Minister of Science, Technology and Innovation, Datuk Dr Ewon Ebin, said that his ministry will work with the Ministry of Communications and Multimedia to ensure that the lab maintains data security and sovereignty.

The lab will eventually be opened up for use by the private sector, the MIMOS spokesperson said.

Originally posted via “Malaysia opens digital government lab for big data analytics”

Source by anum

Is the Importance of Customer Experience Overinflated?

Companies rely on customer experience management (CEM) programs to provide insight about how to manage customer relationships effectively to grow their business. CEM programs require measurement of primarily two types of variables, satisfaction with customer experience and customer loyalty. These metrics are used specifically to assess the importance of customer experience in improving customer loyalty. Determining the “importance” of different customer experience attributes needs to be precise as it plays a major role in helping companies: 1) prioritize improvement efforts, 2) estimate return on investment (ROI) of improvement efforts and 3) allocate company resources.

How We Determine Importance of Customer Experience Attributes

When we label a customer experience attribute as “important,” we typically are referring to the magnitude of the correlation between customer ratings on that attribute (e.g., product quality, account management, customer service) and a measure of customer loyalty (e.g., recommend, renew service contract). The magnitude of a correlation can vary from 0.0 to 1.0. Attributes that have a high correlation with customer loyalty (approaching 1.0) are considered more “important” than attributes that have a low correlation with customer loyalty (approaching 0.0).
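As a rough sketch of this approach (hypothetical ratings and invented attribute names, not the author's actual data), the correlations between attribute ratings and a loyalty rating can be computed directly:

```python
import random
from statistics import mean, pstdev

random.seed(7)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical survey: 200 customers rate two CX attributes (0-10). The
# simulated loyalty rating depends more on product quality than on customer
# service, so the correlations should rank accordingly.
n = 200
product_quality = [random.uniform(0, 10) for _ in range(n)]
customer_service = [random.uniform(0, 10) for _ in range(n)]
loyalty = [0.7 * q + 0.2 * c + random.gauss(0, 1.5)
           for q, c in zip(product_quality, customer_service)]

r_quality = pearson(product_quality, loyalty)
r_service = pearson(customer_service, loyalty)

print(f"product quality vs loyalty:  r = {r_quality:.2f}")
print(f"customer service vs loyalty: r = {r_service:.2f}")
```

In a real CEM program the ratings would come from survey responses rather than a simulator, but the ranking step is the same: the attribute with the larger correlation is treated as the more "important" driver of loyalty.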

Measuring Satisfaction with the Customer Experience and Customer Loyalty Via Surveys

Companies typically (almost always?) rely on customer surveys to measure both the satisfaction with the customer experience (CX) as well as the level of customer loyalty.  That is, customers are given a survey that includes questions about the customer experience and customer loyalty. The customers are asked to make ratings about their satisfaction with the customer experience and their level of customer loyalty (typically likelihood ratings).

As mentioned earlier, to identify the importance of customer experience attributes on customer loyalty, ratings of CX metrics and customer loyalty are correlated with each other.

The Problem of a Single Method of Measurement: Common Method Variance

The magnitude of the correlations between measures of satisfaction (with the customer experience) and measures of customer loyalty are made up of different components. On one hand, the correlation is due to the “true” relationship between satisfaction with the experience and customer loyalty.

On the other hand, because the two variables are measured using the same method (a survey with self-reported ratings), the magnitude of the correlation is partly due to how the data are collected. This effect, referred to as Common Method Variance (CMV), has been studied in the social sciences (see Campbell and Fiske, 1959), where surveys are a common method of data collection; the general finding is that the correlation between two different measures is driven partly by the true relationship between the constructs being measured and partly by the way they are measured.

The impact of CMV in customer experience management likely occurs when you use the same method of collecting data (e.g., survey questions) for both predictors (e.g., satisfaction with the customer experience) and outcomes (e.g., customer loyalty). That is, the size of the correlation between satisfaction and loyalty metrics is likely due to the fact that both variables are measured using a survey instrument.

Customer Loyalty Measures: Real Behaviors v. Expected Behaviors

The CMV problem is not really about how we measure satisfaction with the customer experience; a survey is a good way to measure the feelings/perceptions behind the customers’ experience. The problem lies with how we measure customer loyalty. Customer loyalty is about actual customer behavior. It is real customer behavior (e.g., number of recommendations, number of products purchased, whether a customer renewed their service contract) that drives company profits. Popular self-report measures ask for customers’ estimation of their likelihood of engaging in certain behaviors in the future (e.g., likely to recommend, likely to purchase, likely to renew).

Using self-report measures of satisfaction and loyalty, researchers have found high correlations between these two variables. For example, Bruce Temkin has found correlations between satisfaction with the customer experience and NPS to be around .70. Similarly, in my research, I have found comparably sized correlations (r ≈ .50) looking at the impact of the customer experience on advocacy loyalty (the recommend question is part of my advocacy metric). Are these correlations a good reflection of the importance of the customer experience in predicting loyalty (as measured by the recommend question)? Before I answer that question, let us first look at work (Sharma, Yetton and Crawford, 2009) that helps us classify different types of customer measurement and their impact on correlations.

Different Ways to Measure Customer Loyalty

Sharma et al. highlight four different types of measurement methods. I have slightly modified their four types to order customer loyalty measures from least susceptible to CMV (coded as 1) to most susceptible to CMV (coded as 4):

  1. System-captured metrics reflect objective metrics of customer loyalty: Data are obtained from historical records and other objective sources, including purchase records (captured in a CRM system). Example: Computer generated records of “time spent on the Web site” or “number of products/services purchased” or “whether a customer renewed their service contract.”
  2. Behavioral-continuous items reflect specific loyalty behaviors that respondents have carried out: Responses are typically captured on a continuous scale. Example item: How many friends did you tell about company XYZ in the past 12 months? None to 10, say.
  3. Behaviorally-anchored items reflect specific actions that respondents have carried out: Responses are typically captured on scales with behavioral anchors. Example item: How often have you shopped at store XYZ in the past month? Not at all to Very Often.
  4. Perceptually-anchored items reflect perceptions of loyalty behavior: Responses are typically on Likert scales, semantic differential or “agree/disagree scale”. Example: I shop at the store regularly. Agree to Disagree.

These researchers looked at 75 different studies examining the correlation between perceived usefulness (predictor) and usage of IT (criterion). While all studies used perceptually-anchored measures for perceived usefulness (perception/attitude), different studies used one of four different types of measures of usage (behavior). These researchers found that CMV accounted for 59% of the variance in the relationship between perceived usefulness and usage (r = .59 for perceptually-anchored items; r = .42 for behaviorally anchored items; r = .29 for behavioral continuous items; r = .16 for system-captured metrics). That is, the method with which researchers measure “usage” impacts the outcome of the results; as the usage measures become less susceptible to CMV (moving up the scale from 4 to 1 above), the magnitude of the correlation decreases between perceived usefulness and usage.
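The inflation described above can be reproduced in a small simulation (hypothetical numbers, not the data from Sharma et al.): give two self-reported ratings a shared per-respondent "method" component and compare the survey-vs-survey correlation with the survey-vs-behavior correlation.

```python
import random
from statistics import mean, pstdev

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

n = 2000
# Latent truth: loyalty behavior depends partly on satisfaction.
satisfaction = [random.gauss(0, 1) for _ in range(n)]
loyalty = [0.5 * s + random.gauss(0, 1) for s in satisfaction]

# A per-respondent "method" factor (response style, mood while taking the
# survey) bleeds into every self-reported rating on the same instrument.
method = [random.gauss(0, 1) for _ in range(n)]
rated_sat = [s + 0.8 * m for s, m in zip(satisfaction, method)]
rated_loy = [l + 0.8 * m for l, m in zip(loyalty, method)]

r_survey = pearson(rated_sat, rated_loy)  # both from the same survey
r_behavior = pearson(rated_sat, loyalty)  # survey vs system-captured behavior

print(f"survey rating vs survey rating: r = {r_survey:.2f}")
print(f"survey rating vs real behavior: r = {r_behavior:.2f}")
```

The survey-vs-survey correlation comes out noticeably higher even though the underlying satisfaction-loyalty relationship is identical in both comparisons, which mirrors the pattern in the four measurement types above: the more the loyalty measure shares the survey's method, the larger the observed correlation.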

Looking at research in the CEM space, we commonly see that customer loyalty is measured using questions that reflect perceptually-anchored questions (type 4 above), the type of measure most susceptible to CMV.

Table 1. Descriptive statistics and correlations of two types of recommend loyalty metrics (behavioral-continuous and perceptually-anchored) with customer experience ratings.

An Example

I have some survey data on the wireless service industry that examined the impact of customer satisfaction with customer touch points (e.g., product, coverage/reliability and customer service) on customer loyalty. This study included measures of satisfaction with the customer experience (perceptually-anchored) and two different measures of customer loyalty:

  1. self-reported number of people you recommended the company to in the past 12 months (behavioral-continuous);
  2. self-reported likelihood to recommend (perceptually-anchored).

The correlations among these measures are located in Table 1.

As you can see, the two recommend loyalty metrics are only moderately related to each other (r = .47), suggesting that they measure two different constructs. Additionally, and as expected by the CMV model, the behavioral-continuous measure of customer loyalty (number of friends/colleagues told) shows a significantly lower correlation (average r = .28) with customer experience ratings than the perceptually-anchored measure of customer loyalty (likelihood to recommend) does (average r = .52). These findings are strikingly similar to the above findings of Sharma et al. (2009).
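For readers who want to check whether a gap like r = .28 versus r = .52 is statistically meaningful, here is a rough sketch using the Fisher r-to-z transform. The sample size is an assumption (the article does not report n), and the test treats the two correlations as independent even though they come from the same respondents, so it is only an approximation; a dependent-correlations test such as Steiger's would be more precise.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher r-to-z transform, which puts correlations on a comparable scale."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Correlations reported in the text and an assumed sample size.
r1, r2, n = 0.28, 0.52, 300

# Two-sample z test on the transformed correlations, treating the two
# samples as independent (a simplification, as noted above).
se = math.sqrt(1 / (n - 3) + 1 / (n - 3))
z = (fisher_z(r2) - fisher_z(r1)) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests the gap is real
```

At an assumed n of 300 the difference is well beyond conventional significance thresholds, which is consistent with the author's description of the behavioral-continuous correlation as "significantly lower."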

Summary and Implications

The way in which we measure the customer experience and customer loyalty impacts the correlations we see between them. Because measures of both variables use perceptually-anchored questions on the same survey, the correlation between the two is likely overinflated. I contend that the true impact of customer experience on customer loyalty can only be determined when real customer loyalty behaviors are used in the statistical modeling process.

We may be overestimating the importance (e.g., impact) of customer experience on customer loyalty simply due to the fact that we measure both variables (experience and loyalty) using the same instrument, a survey with similar scale characteristics. Companies commonly use the correlations (or squared correlation) between a given attribute and customer loyalty as the basis for estimating the return on investment (ROI) when improving the customer experience. The use of overinflated correlations will likely result in an overestimate of the ROI of customer experience improvement efforts. As such, companies need to temper this estimation when perceptually-anchored customer loyalty metrics are used.

I argue elsewhere that we need to use more objective metrics of customer loyalty whenever they are available. Using Big Data principles, companies can link real loyalty behaviors with customer satisfaction ratings. Using a customer-centric approach to linkage analysis, our company, TCELab, helps companies integrate customer feedback data with their CRM data, where real customer loyalty data are housed (see CEM Linkage for a deeper discussion).

While measuring customer loyalty using real, objective metrics (system-captured) would be ideal, many companies do not have the resources to collect and link customer loyalty behaviors to customer ratings of their experience. Perhaps loyalty measures that are less susceptible to CMV could be developed and used to get a more realistic assessment of the importance of the customer experience on customer loyalty.  For example, self-reported metrics that are more easily verifiable by the company (e.g., “likelihood to renew service contract” is more easily verifiable by the company than “likelihood to recommend”) might encourage customers to provide realistic ratings about their expected behaviors, thus reflecting a truer measure of customer loyalty. At TCELab, our customer survey, the Customer Relationship Diagnostic (CRD), includes verifiable types of loyalty questions (e.g., likely to renew contract, likely to purchase additional/different products, likely to upgrade).

The impact of the Common Method Variance (CMV) in CEM research is likely strong in studies in which the data for customer satisfaction (the predictor) and customer loyalty (the criterion) are collected using surveys with similar item characteristics (perceptually-anchored). CEM professionals need to keep the problem of CMV in mind when interpreting customer survey results (any survey results, really) and estimating the impact of customer experience on customer loyalty and financial performance.

What kind of loyalty metrics do you use in your organization? How do you measure them?

Originally Posted at: Is the Importance of Customer Experience Overinflated?