Social Sentiment Company ZenCity Raises $13.5M for Expansion

The Israeli company ZenCity, which helps local governments assess public opinion by combining 311 requests, social media analysis and other open sources on the Internet, has announced $13.5 million in new funding — its largest funding round to date.

A news release today said the money will go toward improving ZenCity’s software, adding partnerships and growing the company’s footprint in the market. The funding round was led by the Israeli venture capital firm TLV Partners, with participation from Salesforce Ventures.

Founded in 2015, ZenCity makes software that collects data from public sources such as social media, local news channels and 311 requests. It then runs this data through an AI tool to identify specific topics, trends and sentiments, from which local government agencies can get an idea of the needs and priorities of their communities.

“Zencity is literally the only way I can get a true big-picture view of all discourse taking place, both on our city-owned channels and those that are not run by the city,” attested Belen Michelis, communications manager for the city of Meriden, Conn., in a case study on the company’s website. “The ability to parse through the chatter from one place is invaluable.”

The latest investments more than doubled ZenCity’s funding, according to Crunchbase, which shows that the company has amassed $21.2 million across three rounds in four years, each larger than the last: $1.7 million announced September 2017, $6 million in September 2018 and $13.5 million today. In May 2018, ZenCity also scored $1 million from Microsoft’s venture capital arm by winning the Innovate.AI competition for Israel’s region.

At the time of that competition, ZenCity counted about 20 customers in the U.S. and Israel. Today’s announcement said the company has over 150 local government customers in the U.S., ranging in size from the city of Los Angeles to the village of Lemont, Ill., with fewer than 20,000 residents.

ZenCity CEO Eyal Feder-Levy said in a statement that his company’s software has a role to play in this moment in history, when city governments are testing new responses to unfolding crises, such as COVID-19 mitigation measures or grants to help local businesses.

“Now more than ever, this investment is further proof of local governments’ acute need for real-time resident feedback,” he said. “The ability to provide municipal leaders with actionable data is a big step in further improving the efficiency and effectiveness of their work.”

Source: Social Sentiment Company ZenCity Raises $13.5M for Expansion

Aug 06, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Ethics

[ AnalyticsWeek BYTES]

>> Marketing Analytics – Success Through Analysis by analyticsweekpick

>> Copado Adds Government-Specific DevOps Tools to Salesforce by analyticsweekpick

>> Consider The Close Variants During Page Segmentation For A Better SEO by thomassujain


[ FEATURED COURSE]

Applied Data Science: An Introduction


As the world’s data grow exponentially, organizations across all sectors, including government and not-for-profit, need to understand, manage and use big, complex data sets—known as big data…. more

[ FEATURED READ]

The Industries of the Future


The New York Times bestseller, from leading innovation expert Alec Ross, a “fascinating vision” (Forbes) of what’s next for the world and how to navigate the changes the future will bring…. more

[ TIPS & TRICKS OF THE WEEK]

Fix the Culture, spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up to industry standards, and talent has long been the bottleneck to broader enterprise adoption. One of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up and create awareness within the organization. An aware organization goes a long way toward securing quick buy-ins and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.

[ DATA SCIENCE Q&A]

Q:How do you know if one algorithm is better than other?
A: * In terms of performance on a given data set?
* In terms of performance on several data sets?
* In terms of efficiency?
In terms of performance on several data sets:

– ‘Does learning algorithm A have a higher chance of producing a better predictor than learning algorithm B in the given context?”
– ‘Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets”, A. Lacoste and F. Laviolette
– ‘Statistical Comparisons of Classifiers over Multiple Data Sets”, Janez Demsar

In terms of performance on a given data set:
– One wants to choose between two learning algorithms
– Need to compare their performances and assess the statistical significance

One approach (Not preferred in the literature):
– Multiple k-fold cross validation: run CV multiple times and take the mean and sd
– You have: algorithm A (mean and sd) and algorithm B (mean and sd)
– Is the difference meaningful? (Paired t-test)

Sign-test (classification context):
Simply counts the number of times A has a better metric than B and assumes this count comes from a binomial distribution. From that, we can obtain a p-value for the null hypothesis H0: A and B are equal in terms of performance.

Wilcoxon signed rank test (classification context):
Like the sign test, but the wins (A is better than B) are weighted and assumed to come from a symmetric distribution around a common median. We then obtain a p-value for the same null hypothesis H0.

Other (without hypothesis testing):
– AUC
– F-Score
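
A minimal illustrative sketch of the paired comparison described above is shown below; the dataset, the two classifiers and the 10-fold setup are illustrative assumptions, not part of the original answer. Both algorithms are scored on the same folds, then a paired t-test and a Wilcoxon signed-rank test are applied to the per-fold scores.

# hedged sketch: paired comparison of two learning algorithms on one dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import ttest_rel, wilcoxon

# illustrative data and models (algorithm A and algorithm B)
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=cv)

# paired t-test on the per-fold scores (the "multiple CV" approach above)
print(ttest_rel(scores_a, scores_b))
# non-parametric alternative: Wilcoxon signed-rank test on the same pairs
print(wilcoxon(scores_a, scores_b))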

Source

[ VIDEO OF THE WEEK]

#FutureOfData with Rob(@telerob) / @ConnellyAgency on running innovation in agency

[ QUOTE OF THE WEEK]

I’m sure the highest-capacity storage device will not be enough to record all our stories; because every time with you is very valuable data.

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Joe DeCosmo, @Enova

[ FACT OF THE WEEK]

Within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data.

Sourced from: Analytics.CLUB #WEB Newsletter

UI Fraud Rises as Relief Funds Go to Bad Actors (Contributed)


The federal government moved quickly to stand up programs to blunt the economic impact of the COVID-19 pandemic. This was a tremendous undertaking that provided needed financial support to hundreds of millions of Americans. Still, it would be unrealistic to think that such a large-scale effort could be implemented so quickly without any glitches. Unfortunately, fraudsters, both domestically and worldwide, saw this as an opportunity, impacting not only CARES Act funds but Unemployment Insurance (UI) assets, as well. The scope of the issue is vast: Washington state reports up to $650 million in UI money stolen by fraudsters and Maryland has identified more than $500 million in UI fraud.

The Small Business Administration’s (SBA) Paycheck Protection Program (PPP) is intended to keep businesses open by providing forgivable loans to employers to keep workers on the job. However, SBA has been challenged by numerous applications from nonexistent small businesses — duly registered at the state level — claiming they have employees to keep on payroll. Also, some actual small businesses falsified their qualifications for PPP loans, misrepresenting the number of employees and expenses.

While states have no vested interest in PPP funds themselves, fraud has an impact down to the local level:

An employer applies for PPP funds but tells the employees to go on unemployment, causing them to unwittingly commit UI fraud;
A fake company uses stolen identities to apply for a PPP loan, when those individuals are actually employed elsewhere; or
A false, stolen or synthetic identity is used to apply for UI, connecting this fake persona to a real or fake company

Through these techniques and more, fraudsters can directly affect state and local resources and tax revenues, while delaying UI payments to legitimate applicants.

“Pay-and-chase” could potentially lead to the recovery of a portion of the lost funds, but historically, a large percentage of fraudulently obtained dollars is never recovered. Pay-and-chase also has its own costs: The original money is gone, and now you have to spend more — in time, resources and personnel — to try to recoup it. Of course, there is also the deterrent value of chasing down fraudsters, but with limited resources available to auditors, the likelihood is that the majority of that money is unrecoverable.

Stop Fraud at the Front Door

Businesses don’t commit fraud; the people who run those businesses — legitimate or otherwise — do. It’s essential to make sure the applicants are who they say they are, that their businesses are genuine, and that their employees actually exist and work for them.

True, banks are important parties in the loan application process, but the issue starts with state registration, where new businesses register with the Secretary of State and other offices at the city or county level. It’s relatively easy for fraudsters to use stolen identities and fabricated information to create a realistic business entity, complete with management personnel and officers. It’s up to agencies to determine if any or all of the information submitted is true. Historically, this has been tough to do: Research by LexisNexis Risk Solutions shows that only 50 percent of small businesses have a credit history, and half of those with a history only have thin files. Once the business is registered, that information can flow to federal agencies, including the SBA as it reviews PPP loan applications.

The same set of identity issues needs to be dealt with when processing UI applicants. Unfortunately, it’s very easy to create an identity that looks real. Online resources exist to falsify an ID or driver’s license, utility bills and pay stubs, so that an applicant can appear legitimate. Stolen or synthetic identities are also being used, which adds to the confusion, as some or all of the information being used about that person is real. The result is that, ultimately, stimulus and UI funds can end up in the wrong hands, leaving the government to recover them from someone who doesn’t exist. These identities may also be used to obtain assistance and benefits through additional state-run programs, as well as to apply for UI and assistance in other states altogether, creating the issue of dual participation.

The Answer Is Data

Preventing fraud requires a judicious, intelligent process that screens applications for business registrations and UI at the earliest possible stage. Most states have systems in place for this, not only for approving business licenses and UI, but also for disaster contingency programs, which require funds to flow quickly from the state or locality to people and businesses. The current environment, however, has made things more complicated, since offices may have limited resources and many applications are completely online to ensure social distancing. But whether in person or digitally, vetting the identities of applicants with confidence can only be done with a comprehensive set of accurate, up-to-date data sources.

Connecting a person’s physical identity — their address, birthdate, Social Security number, etc. — with their digital life — their online activity and where, when and how they interact online — is crucial for building risk scores, which support well-informed decisions on how best to apply limited resources to the issue of fraud. A system that provides identity intelligence and pattern recognition in near-real time would not slow down the application process; in fact, it would improve turnaround time, since less manual vetting is needed.

With the extension of the PPP program, the ability to apply for loan forgiveness and the likelihood of another round of stimulus on the way, the opportunities for continued PPP fraud may be growing, further straining citizens, public resources and the economy. A multi-layered physical and digital identity intelligence solution, powered by comprehensive government and commercial data sources, means approvers can more quickly and more accurately sort legitimate applicants from scammers by automating much of the vetting process. And that helps ensure that funds go where they are needed most to support hardworking people in every state. Hopefully these critical safeguards will be built into disaster-ready solutions so we are not chasing taxpayer money next time.

Andrew McClenahan, a solutions architect for LexisNexis Risk Solutions, leads the design and implementation of government agency solutions that uphold public program integrity and provides consultative services to systems integrators and government agencies on operations and data architecture issues to promote efficiency and economy in programs. McClenahan has spent much of his 25-year career in public service, including roles as the director of the Office of Public Benefits Integrity at the Florida Department of Children and Families, law enforcement captain with the Florida Department of Environmental Protection, and various sworn capacities with the Tallahassee Police Department. He is a Certified Public Manager, a Certified Welfare Fraud Investigator, and a Certified Inspector General Investigator.

Source by analyticsweekpick

Fat Zebras

When presenting data tables, we recommend stripping away backgrounds and grids and using alignment, formatting, and whitespace to guide the eye along the rows and columns of the data. 

But what if you have lots of rows with many columns that stretch across a screen or page? Or perhaps you have large gaps between fields because of varying text lengths. In both these cases, tracking across rows can be difficult.

To combat this, Stephen Few recommends zebra striping, where a light shaded colour is used on alternating rows to aid scanning across.

Problem solved, right?  

Maybe not. They’re almost always poorly done. If your fill colour is too dark, it overemphasizes rows that shouldn’t be emphasized and hampers vertical scrolling. Even with subtle shading, it’s like looking through a pair of those horrible shutter sunglasses from the eighties. Worst of all, they might not even help. A List Apart ran a study that suggests zebra stripes don’t actually increase accuracy or speed.

For both aesthetic and performance reasons, I suggest we find an alternative. I propose fat zebras.

Fat zebras have less visual noise because they switch back and forth less frequently. As for tracking, I find them even easier, simply following the top, middle, or bottom row of the filled section across. (We will have to wait for a follow-up study to see if they actually help – it seems people don’t prefer the look of them on smaller tables, but there is no indication on their performance.)

One potential drawback to consider when using fat zebras is ‘implied grouping’. Depending on your data set, people may think the fill colours indicate that data share some specific quality. For this reason, it is vitally important, as with regular stripes, that the background fill be as subtle as possible. I would also recommend avoiding it when there are only a few rows of data visible as it’s even more likely to appear as data grouping.
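
For readers who build such tables in code, a minimal sketch of fat zebra striping with the pandas Styler API is below. The three-row band width and the light grey fill are illustrative assumptions, and the helper name fat_zebra is hypothetical; the point is simply that alternating bands, not alternating rows, receive the subtle background.

# hedged sketch: shade alternating bands of rows ("fat zebras") in a pandas table
import numpy as np
import pandas as pd

def fat_zebra(df, band=3, colour="#f2f2f2"):
	# return a same-shaped frame of CSS strings; shade every other band of `band` rows
	styles = pd.DataFrame("", index=df.index, columns=df.columns)
	shaded = (np.arange(len(df)) // band) % 2 == 1
	styles.loc[shaded, :] = f"background-color: {colour}"
	return styles

df = pd.DataFrame(np.random.rand(12, 4), columns=list("ABCD"))
styled = df.style.apply(fat_zebra, axis=None)
# display `styled` in a notebook, or export HTML with styled.to_html() on recent pandas versions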

Edward Tufte claims that “good typography can always organize a table, no stripes needed.” When working with large tables, however, I prefer to show my stripes, even if they are fat.
 

Source: Fat Zebras by analyticsweek

Jul 30, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

Data security

[ AnalyticsWeek BYTES]

>> Data-As-A-Service to enable compliance reporting by v1shal

>> Logi Analytics Launches Industry First Out-of-the-Box Embedded Analytics Development Platform by analyticsweek

>> 3 Questions to Ask Your Embedded Analytics Support Team by analyticsweek


[ FEATURED COURSE]

CPSC 540 Machine Learning


Machine learning (ML) is one of the fastest growing areas of science. It is largely responsible for the rise of giant data companies such as Google, and it has been central to the development of lucrative products, such … more

[ FEATURED READ]

Superintelligence: Paths, Dangers, Strategies


The human brain has some capabilities that the brains of other animals lack. It is to these distinctive capabilities that our species owes its dominant position. Other animals have stronger muscles or sharper claws, but … more

[ TIPS & TRICKS OF THE WEEK]

Data aids, not replaces, judgement
Data is a tool and a means to help build consensus and facilitate human decision-making, not to replace it. Analysis converts data into information; information, via context, leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So data is just the start; context and intuition also play a role.

[ DATA SCIENCE Q&A]

Q:When you sample, what bias are you inflicting?
A: Selection bias:
– An online survey about computer use is likely to attract people more interested in technology than is typical of the population

Undercoverage bias:
– Sampling too few observations from a segment of the population

Survivorship bias:
– Observations at the end of the study are a non-random set of those present at the beginning of the investigation
– In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist

Source

[ VIDEO OF THE WEEK]

Data-As-A-Service (#DAAS) to enable compliance reporting

[ QUOTE OF THE WEEK]

You can have data without information, but you cannot have information without data. – Daniel Keys Moran

[ PODCAST OF THE WEEK]

@SidProbstein / @AIFoundry on Leading #DataDriven Technology Transformation #FutureOfData #Podcast

[ FACT OF THE WEEK]

Bad data or poor data quality costs US businesses $600 billion annually.

Sourced from: Analytics.CLUB #WEB Newsletter

An Unexpected Insider Threat: Senior Executives Who Ignore Cybersecurity Rules

Cybersecurity should be a top concern for any business, with strong policies and protocol put in place, and top executives leading by example. However, recent research has shown that more than half of senior managers disregard the rules, placing their organizations in jeopardy and making them an insider threat. This behavior is common in companies […]

The post An Unexpected Insider Threat: Senior Executives Who Ignore Cybersecurity Rules appeared first on TechSpective.

Source: An Unexpected Insider Threat: Senior Executives Who Ignore Cybersecurity Rules by administrator

Jul 23, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

SQL Database

[ AnalyticsWeek BYTES]

>> ROC Curves and Precision-Recall Curves for Imbalanced Classification by administrator

>> ‘Big data’ is solving the problem of $200 billion of wasted energy by analyticsweekpick

>> Does your organization have all the risk management tools you need? by administrator


[ FEATURED COURSE]

Probability & Statistics


This course introduces students to the basic concepts and logic of statistical reasoning and gives the students introductory-level practical ability to choose, generate, and properly interpret appropriate descriptive and… more

[ FEATURED READ]

Hypothesis Testing: A Visual Introduction To Statistical Significance


Statistical significance is a way of determining whether an outcome occurred by random chance or whether something caused that outcome to differ from the expected baseline. Statistical significance calculations find their … more

[ TIPS & TRICKS OF THE WEEK]

Data aids, not replaces, judgement
Data is a tool and a means to help build consensus and facilitate human decision-making, not to replace it. Analysis converts data into information; information, via context, leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So data is just the start; context and intuition also play a role.

[ DATA SCIENCE Q&A]

Q:What is random forest? Why is it good?
A: Random forest (intuition):
– Underlying principle: several weak learners combined provide a strong learner
– Builds several decision trees on bootstrapped training samples of the data
– On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
– Rule of thumb: at each split, m ≈ √p
– Predictions: by majority vote across the trees (averaging for regression)

Why is it good?
– Very good performance (the random predictor selection decorrelates the trees)
– Can model non-linear class boundaries
– Generalization error for free: no cross-validation needed; the out-of-bag samples give an unbiased estimate of the generalization error as the trees are built
– Generates variable importance
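
A minimal scikit-learn sketch of these points follows; the synthetic dataset and the choice of 200 trees are illustrative assumptions. Setting max_features="sqrt" follows the m ≈ √p rule of thumb, and oob_score=True exposes the out-of-bag estimate of the generalization error.

# hedged sketch: random forest with sqrt(p) split candidates and the OOB error estimate
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
model = RandomForestClassifier(n_estimators=200, max_features="sqrt", oob_score=True, random_state=1)
model.fit(X, y)
print(model.oob_score_)            # out-of-bag accuracy: no cross-validation needed
print(model.feature_importances_)  # variable importance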

Source

[ VIDEO OF THE WEEK]

#HumansOfSTEAM feat. Hussain Gadwal, Mechanical Designer via @SciThinkers #STEM #STEAM

[ QUOTE OF THE WEEK]

Torture the data, and it will confess to anything. – Ronald Coase

[ PODCAST OF THE WEEK]

#BigData @AnalyticsWeek #FutureOfData #Podcast with Eloy Sasot, News Corp

[ FACT OF THE WEEK]

Every person in the US tweeting three tweets per minute for 26,976 years.

Sourced from: Analytics.CLUB #WEB Newsletter

10 Clustering Algorithms With Python



Clustering or cluster analysis is an unsupervised learning problem.

It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior.

There are many clustering algorithms to choose from and no single best clustering algorithm for all cases. Instead, it is a good idea to explore a range of clustering algorithms and different configurations for each algorithm.

In this tutorial, you will discover how to fit and use top clustering algorithms in python.

After completing this tutorial, you will know:

  • Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
  • There are many different clustering algorithms and no single best method for all datasets.
  • How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

Let’s get started.

Clustering Algorithms With Python
Photo by Lars Plougmann, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Clustering
  2. Clustering Algorithms
  3. Examples of Clustering Algorithms
    1. Library Installation
    2. Clustering Dataset
    3. Affinity Propagation
    4. Agglomerative Clustering
    5. BIRCH
    6. DBSCAN
    7. K-Means
    8. Mini-Batch K-Means
    9. Mean Shift
    10. OPTICS
    11. Spectral Clustering
    12. Gaussian Mixture Model

Clustering

Cluster analysis, or clustering, is an unsupervised machine learning task.

It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.

Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups.

— Page 141, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to that cluster than to other clusters. The cluster may have a center (the centroid) that is a sample or a point in the feature space, and it may have a boundary or extent.

These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances.

— Pages 141-142, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Clustering can be helpful as a data analysis activity in order to learn more about the problem domain, so-called pattern discovery or knowledge discovery.

For example:

  • The phylogenetic tree could be considered the result of a manual clustering analysis.
  • Separating normal data from outliers or anomalies may be considered a clustering problem.
  • Separating clusters based on their natural behavior is a clustering problem, referred to as market segmentation.

Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped and labeled as belonging to one of the identified clusters in the data.
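
As a brief illustration of that feature-engineering use, the sketch below fits a clustering model and appends each row's cluster label as an extra input column. The choice of k-means and of two clusters is an illustrative assumption, mirroring the synthetic dataset used later in this tutorial.

# hedged sketch: using cluster membership as an engineered feature
from numpy import column_stack
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
model = KMeans(n_clusters=2)
model.fit(X)
# label each row (or any new example) with its cluster and append it as a feature
labels = model.predict(X)
X_augmented = column_stack((X, labels))
print(X_augmented.shape)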

Evaluation of identified clusters is subjective and may require a domain expert, although many clustering-specific quantitative measures do exist. Typically, clustering algorithms are compared academically on synthetic datasets with pre-defined clusters, which an algorithm is expected to discover.

Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method.

— Page 534, Machine Learning: A Probabilistic Perspective, 2012.
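
One widely used quantitative measure of that kind is the silhouette coefficient. The sketch below is a minimal illustration using scikit-learn's silhouette_score; the k-means model and the two-cluster setting are illustrative assumptions, not a recommendation for your data.

# hedged sketch: scoring a clustering without reference labels using the silhouette coefficient
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
labels = KMeans(n_clusters=2).fit_predict(X)
# ranges from -1 to 1; higher values suggest denser, better-separated clusters
print(silhouette_score(X, labels))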

Clustering Algorithms

There are many types of clustering algorithms.

Many algorithms use similarity or distance measures between examples in the feature space in an effort to discover dense regions of observations. As such, it is often good practice to scale data prior to using clustering algorithms.

Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it.

— Page 502, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
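
As a small illustration of the scaling advice above, features can be standardized before clustering so that no single column dominates the distance calculations. StandardScaler is one reasonable choice among several, and the toy array here is purely illustrative.

# hedged sketch: standardize features before distance-based clustering
from numpy import array
from sklearn.preprocessing import StandardScaler

X = array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])  # toy data with very different scales
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)  # each column now has zero mean and unit variance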

Some clustering algorithms require you to specify or guess at the number of clusters to discover in the data, whereas others require the specification of some minimum distance between observations in which examples may be considered “close” or “connected.”

As such, cluster analysis is an iterative process where subjective evaluation of the identified clusters is fed back into changes to algorithm configuration until a desired or appropriate result is achieved.

The scikit-learn library provides a suite of different clustering algorithms to choose from.

A list of 10 of the more popular algorithms is as follows:

  • Affinity Propagation
  • Agglomerative Clustering
  • BIRCH
  • DBSCAN
  • K-Means
  • Mini-Batch K-Means
  • Mean Shift
  • OPTICS
  • Spectral Clustering
  • Mixture of Gaussians

Each algorithm offers a different approach to the challenge of discovering natural groups in data.

There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

In this tutorial, we will review how to use each of these 10 popular clustering algorithms from the scikit-learn library.

The examples will provide a basis for you to copy-paste and test the methods on your own data.

We will not dive into the theory behind how the algorithms work or compare them directly; the papers quoted in each section are a good starting point on the theory.

Let’s dive in.

Examples of Clustering Algorithms

In this section, we will review how to use 10 popular clustering algorithms in scikit-learn.

This includes an example of fitting the model and an example of visualizing the result.

The examples are designed for you to copy-paste into your own project and apply the methods to your own data.

Library Installation

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see the scikit-learn installation documentation.

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

# check scikit-learn version
import sklearn
print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.22.1

Clustering Dataset

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with two input features and one cluster per class. The clusters are visually obvious in two dimensions so that we can plot the data with a scatter plot and color the points in the plot by the assigned cluster. This will help to see, at least on the test problem, how “well” the clusters were identified.

The clusters in this test problem are based on a multivariate Gaussian, and not all clustering algorithms will be effective at identifying these types of clusters. As such, the results in this tutorial should not be used as the basis for comparing the methods generally.

An example of creating and summarizing the synthetic clustering dataset is listed below.

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
for class_value in range(2):
	# get row indexes for samples with this class
	row_ix = where(y == class_value)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example creates the synthetic clustering dataset, then creates a scatter plot of the input data with points colored by class label (idealized clusters).

We can clearly see two distinct groups of data in two dimensions and the hope would be that an automatic clustering algorithm can detect these groupings.

Scatter Plot of Synthetic Clustering Dataset With Points Colored by Known Cluster

Next, we can start looking at examples of clustering algorithms applied to this dataset.

I have made some minimal attempts to tune each method to the dataset.

Can you get a better result for one of the algorithms?
Let me know in the comments below.

Affinity Propagation

Affinity Propagation involves finding a set of exemplars that best summarize the data.

We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges

— Clustering by Passing Messages Between Data Points, 2007.

The technique is described in the paper quoted above, “Clustering by Passing Messages Between Data Points” (2007).

It is implemented via the AffinityPropagation class, and the main configuration to tune is the “damping” hyperparameter, set between 0.5 and 1, and perhaps “preference.”

The complete example is listed below.

# affinity propagation clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AffinityPropagation(damping=0.9)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, I could not achieve a good result.

Scatter Plot of Dataset With Clusters Identified Using Affinity Propagation

Agglomerative Clustering

Agglomerative clustering involves merging examples until the desired number of clusters is achieved.

It is part of the broader class of hierarchical clustering methods.

It is implemented via the AgglomerativeClustering class, and the main configuration to tune is the “n_clusters” hyperparameter, an estimate of the number of clusters in the data, e.g. 2.

The complete example is listed below.

# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found.

Scatter Plot of Dataset With Clusters Identified Using Agglomerative Clustering

BIRCH

BIRCH Clustering (BIRCH is short for Balanced Iterative Reducing and Clustering using Hierarchies) involves constructing a tree structure from which cluster centroids are extracted.

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i. e., available memory and time constraints).

— BIRCH: An efficient data clustering method for large databases, 1996.

The technique is described in the paper quoted above, “BIRCH: An efficient data clustering method for large databases” (1996).

It is implemented via the Birch class and the main configuration to tune is the “threshold” and “n_clusters” hyperparameters, the latter of which provides an estimate of the number of clusters.

The complete example is listed below.

# birch clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = Birch(threshold=0.01, n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, an excellent grouping is found.

Scatter Plot of Dataset With Clusters Identified Using BIRCH Clustering

DBSCAN

DBSCAN Clustering (where DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise) involves finding high-density areas in the domain and expanding those areas of the feature space around them as clusters.

… we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it

— A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.

The technique is described in the paper quoted above, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” (1996).

It is implemented via the DBSCAN class and the main configuration to tune is the “eps” and “min_samples” hyperparameters.

The complete example is listed below.

# dbscan clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = DBSCAN(eps=0.30, min_samples=9)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found, although more tuning is required.

Scatter Plot of Dataset With Clusters Identified Using DBSCAN Clustering

K-Means

K-Means Clustering may be the most widely known clustering algorithm and involves assigning examples to clusters in an effort to minimize the variance within each cluster.

The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably efficient in the sense of within-class variance.

— Some methods for classification and analysis of multivariate observations, 1967.

The technique is described in the paper quoted above, “Some methods for classification and analysis of multivariate observations” (1967).

It is implemented via the KMeans class and the main configuration to tune is the “n_clusters” hyperparameter set to the estimated number of clusters in the data.

The complete example is listed below.

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found, although the unequal variance in each dimension makes the method less suited to this dataset.

Scatter Plot of Dataset With Clusters Identified Using K-Means Clustering

Mini-Batch K-Means

Mini-Batch K-Means is a modified version of k-means that makes updates to the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise.

… we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent.

— Web-Scale K-Means Clustering, 2010.

The technique is described in the paper quoted above, “Web-Scale K-Means Clustering” (2010).

It is implemented via the MiniBatchKMeans class and the main configuration to tune is the “n_clusters” hyperparameter set to the estimated number of clusters in the data.

The complete example is listed below.

# mini-batch k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = MiniBatchKMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a result equivalent to the standard k-means algorithm is found.

Scatter Plot of Dataset With Clusters Identified Using Mini-Batch K-Means Clustering

Mean Shift

Mean shift clustering involves finding and adapting centroids based on the density of examples in the feature space.

We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and thus its utility in detecting the modes of the density.

— Mean Shift: A robust approach toward feature space analysis, 2002.

The technique is described in the paper quoted above, “Mean Shift: A robust approach toward feature space analysis” (2002).

It is implemented via the MeanShift class and the main configuration to tune is the “bandwidth” hyperparameter.

The complete example is listed below.

# mean shift clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = MeanShift()
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable set of clusters are found in the data.

Scatter Plot of Dataset With Clusters Identified Using Mean Shift Clustering

OPTICS

OPTICS clustering (where OPTICS is short for Ordering Points To Identify the Clustering Structure) is a modified version of DBSCAN described above.

We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings.

— OPTICS: ordering points to identify the clustering structure, 1999.

The technique is described in the paper quoted above, “OPTICS: ordering points to identify the clustering structure” (1999).

It is implemented via the OPTICS class and the main configuration to tune is the “eps” and “min_samples” hyperparameters.

The complete example is listed below.

# optics clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import OPTICS
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = OPTICS(eps=0.8, min_samples=10)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, I could not achieve a reasonable result on this dataset.

Scatter Plot of Dataset With Clusters Identified Using OPTICS Clustering

Spectral Clustering

Spectral Clustering is a general class of clustering methods, drawn from linear algebra.

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. Here, one uses the top eigenvectors of a matrix derived from the distance between points.

— On Spectral Clustering: Analysis and an algorithm, 2002.

The technique is described in the paper quoted above, “On Spectral Clustering: Analysis and an algorithm” (2002).

It is implemented via the SpectralClustering class and the main configuration to tune is the “n_clusters” hyperparameter used to specify the estimated number of clusters in the data.

The complete example is listed below.

# spectral clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = SpectralClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, reasonable clusters were found.

Scatter Plot of Dataset With Clusters Identified Using Spectral Clustering

Gaussian Mixture Model

A Gaussian mixture model summarizes a multivariate probability density function with a mixture of Gaussian probability distributions as its name suggests.

For more on the model, see the scikit-learn documentation for Gaussian mixture models.

It is implemented via the GaussianMixture class and the main configuration to tune is the “n_components” hyperparameter used to specify the estimated number of clusters in the data.

The complete example is listed below.

# gaussian mixture clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = GaussianMixture(n_components=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, we can see that the clusters were identified perfectly. This is not surprising given that the dataset was generated as a mixture of Gaussians.

Scatter Plot of Dataset With Clusters Identified Using Gaussian Mixture Clustering


Summary

In this tutorial, you discovered how to fit and use top clustering algorithms in python.

Specifically, you learned:

  • Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
  • There are many different clustering algorithms, and no single best method for all datasets.
  • How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



The post 10 Clustering Algorithms With Python appeared first on Machine Learning Mastery.

Source: 10 Clustering Algorithms With Python

Put big data to work with Cortana Analytics


At its most recent Worldwide Partner Conference, held in July 2015, Microsoft announced several new cloud-based initiatives, the most interesting of which for enterprise businesses was the Cortana Analytics Suite. This is Microsoft’s big data and predictive analysis package, and it will compete with the likes of IBM for a share of this lucrative market.

Analyze this

The promise of big data, and the analysis tools that come with it, is that enterprises will be able to “mine” the huge amount of data they generate on customers, vendors, markets, and products for insights into unforeseen or untapped information—the kind of information that leads to more business, more revenue, and more profits.

In simple terms, a solid big data analytics infrastructure is an absolute necessity for any business enterprise with aspirations for success. Enterprises that do not take the time and spend the money to establish a big data infrastructure are going to operate at a disadvantage. That is not a good place to be.

The Suite

The Cortana Analytics Suite (Figure A) is Microsoft’s foray into this important and lucrative market. According to the announcement, Cortana Analytics will take advantage of machine learning, unlimited data storage, and perceptual intelligence to “transform data into intelligent action.” If only it were that simple.

Figure A: The Cortana Analytics Suite. (Image: Microsoft News)

No matter where you turn for your big data solutions, the software and infrastructure can only take you so far. All of the business intelligence and big data analytics in the world are not going to do you any good if your employees don’t use them.

This is where Microsoft is trying to carve out a niche. The Cortana Analytics Suite will be available for a simple monthly subscription, similar to the way enterprises pay for Office 365. It will have a familiar interface, and all of the analytical tools will be available in one product. There will be no need to mix and match tools from various sources.

The other “simplification” effort Microsoft hopes will establish its niche in the market is Cortana itself. Cortana, Microsoft’s voice-controlled personal assistant, is integrated into the analytics suite. The idea is that users can ask questions using Cortana and natural language without having to formulate old-fashioned database queries.

It’s a start

Microsoft’s Cortana Analytics Suite is a very ambitious project. I applaud Microsoft for recognizing that many current business intelligence and big data analytics solutions are overly complicated and unwieldy. Simplification is the right strategy—it’s the niche that can separate Microsoft from everyone else.

However, knowing the right strategy and executing the right strategy are two different things. From what I have seen of the Cortana Analytics Suite, it does indeed present a set of streamlined and simplified analytical tools. But those tools are more of a framework, and they aren’t likely to be exactly what your enterprise needs.

In other words, enterprises using the Cortana Analytics Suite will still have to build applications that meet their own particular needs. Cortana Analytics may make that development easier (that, at least, is Microsoft's plan), but we won't know whether it does until enterprises get their hands on it.

And even if Cortana Analytics does make development easier, enterprises will still have to hire people to build those apps and then train their employees to use them.

As I said before, big data analytics is vital to the success of modern business enterprises, but it requires a significant commitment of resources. So, even if Microsoft has indeed simplified big data with Cortana Analytics, it has not made it truly easy.

Note: This article originally appeared in TechRepublic.

Source by analyticsweekpick

Jul 16, 20: #AnalyticsClub #Newsletter (Events, Tips, News & more..)


[  COVER OF THE WEEK ]

Ethics  Source

[ FEATURED COURSE]

A Course in Machine Learning


Machine learning is the study of algorithms that learn from data and experience. It is applied in a vast variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need… more

[ FEATURED READ]

The Black Swan: The Impact of the Highly Improbable


A black swan is an event, positive or negative, that is deemed improbable yet causes massive consequences. In this groundbreaking and prophetic book, Taleb shows in a playful way that Black Swan events explain almost eve… more

[ TIPS & TRICKS OF THE WEEK]

Grow at the speed of collaboration
Research by Cornerstone On Demand pointed out the need for better collaboration within the workforce, and the data analytics domain is no different. A rapidly changing and growing field like data analytics is very difficult for an isolated workforce to keep up with. A good collaborative work environment facilitates a better flow of ideas, improved team dynamics, rapid learning, and a greater ability to cut through the noise. So, embrace collaborative team dynamics.

[ DATA SCIENCE Q&A]

Q: Do you know a few “rules of thumb” used in statistics or computer science? Or in business analytics?

A: Pareto rule:
– 80% of the effects come from 20% of the causes
– 80% of the sales come from 20% of the customers

Computer science: “simple and inexpensive beats complicated and expensive” – Rod Elder

Rule of 72 (finance):
– Estimates the time needed for an investment to double
– $100 at a rate of 9% doubles in roughly 72/9 = 8 years (a quick check appears after this list)

Rule of three (Economics):
– There are always three major competitors in a free market within one industry
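
A quick check of the rule of 72 (a small illustrative sketch, not part of the original answer): the 72/rate estimate closely tracks the exact doubling time ln(2) / ln(1 + rate) for typical interest rates.

# compare the rule-of-72 estimate with the exact doubling time (illustrative sketch)
from math import log

def rule_of_72(rate_percent):
	# quick mental estimate: years to double = 72 / annual rate in percent
	return 72.0 / rate_percent

def exact_doubling_time(rate_percent):
	# exact years to double with annual compounding
	return log(2) / log(1 + rate_percent / 100.0)

for rate in (3, 6, 9, 12):
	print(rate, round(rule_of_72(rate), 2), round(exact_doubling_time(rate), 2))
# at 9%, the rule of 72 gives 8.0 years versus an exact figure of about 8.04 years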

Source

[ VIDEO OF THE WEEK]

Venu Vasudevan @VenuV62 (@ProcterGamble) on creating a rockstar data science team #FutureOfData #Podcast


Subscribe on YouTube

[ QUOTE OF THE WEEK]

Data matures like wine, applications like fish. – James Governor

[ PODCAST OF THE WEEK]

@JohnNives on ways to demystify AI for enterprise #FutureOfData #Podcast


Subscribe: iTunes | Google Play

[ FACT OF THE WEEK]

IDC estimates that by 2020, business transactions on the Internet, both business-to-business and business-to-consumer, will reach 450 billion per day.

Sourced from: Analytics.CLUB #WEB Newsletter