Social Sentiment Company ZenCity Raises $13.5M for Expansion

The Israeli company ZenCity, which helps local governments assess public opinion by combining 311 data, social media analysis and other open sources on the Internet, has announced $13.5 million in new funding — its largest funding round to date.

A news release today said the money will go toward improving ZenCity’s software, adding partnerships and growing the company’s footprint in the market. The funding round was led by the Israeli venture capital firm TLV Partners, with participation from Salesforce Ventures.

Founded in 2015, ZenCity makes software that collects data from public sources such as social media, local news channels and 311 requests. It then runs this data through an AI tool to identify specific topics, trends and sentiments, from which local government agencies can get an idea of the needs and priorities of their communities.

“Zencity is literally the only way I can get a true big-picture view of all discourse taking place, both on our city-owned channels and those that are not run by the city,” attested Belen Michelis, communications manager for the city of Meriden, Conn., in a case study on the company’s website. “The ability to parse through the chatter from one place is invaluable.”

The latest investments more than doubled ZenCity’s funding, according to Crunchbase, which shows that the company has amassed $21.2 million across three rounds in four years, each larger than the last: $1.7 million announced September 2017, $6 million in September 2018 and $13.5 million today. In May 2018, ZenCity also scored $1 million from Microsoft’s venture capital arm by winning the Innovate.AI competition for Israel’s region.

At the time of that competition, ZenCity counted about 20 customers in the U.S. and Israel. Today’s announcement said the company has over 150 local government customers in the U.S., ranging in size from the city of Los Angeles to the village of Lemont, Ill., with fewer than 20,000 residents.

ZenCity CEO Eyal Feder-Levy said in a statement that his company’s software has a role to play in this moment in history, when city governments are testing new responses to unfolding crises, such as COVID-19 mitigation measures or grants to help local businesses.

“Now more than ever, this investment is further proof of local governments’ acute need for real-time resident feedback,” he said. “The ability to provide municipal leaders with actionable data is a big step in further improving the efficiency and effectiveness of their work.”

Source: Social Sentiment Company ZenCity Raises $13.5M for Expansion

UI Fraud Rises as Relief Funds Go to Bad Actors (Contributed)


The federal government moved quickly to stand up programs to blunt the economic impact of the COVID-19 pandemic. This was a tremendous undertaking that provided needed financial support to hundreds of millions of Americans. Still, it would be unrealistic to think that such a large-scale effort could be implemented so quickly without any glitches. Unfortunately, fraudsters, both domestically and worldwide, saw this as an opportunity, impacting not only CARES Act funds but Unemployment Insurance (UI) assets, as well. The scope of the issue is vast: Washington state reports up to $650 million in UI money stolen by fraudsters and Maryland has identified more than $500 million in UI fraud.

The Small Business Administration’s (SBA) Paycheck Protection Program (PPP) is intended to keep businesses open by providing forgivable loans to employers to keep workers on the job. However, SBA has been challenged by numerous applications from nonexistent small businesses — duly registered at the state level — claiming they have employees to keep on payroll. Also, some actual small businesses falsified their qualifications for PPP loans, misrepresenting the number of employees and expenses.

While states have no vested interest in PPP funds themselves, fraud has an impact down to the local level:

  • An employer applies for PPP funds but tells the employees to go on unemployment, causing them to unwittingly commit UI fraud;
  • A fake company uses stolen identities to apply for a PPP loan, when those individuals are actually employed elsewhere; or
  • A false, stolen or synthetic identity is used to apply for UI, connecting this fake persona to a real or fake company.

Through these techniques and more, fraudsters can directly affect state and local resources and tax revenues, while delaying UI payments to legitimate applicants.

“Pay-and-chase” could potentially lead to the recovery of a portion of the lost funds, but historically, a large percentage of fraudulently obtained dollars is never recovered. Pay-and-chase also has its own costs: The original money is gone, and now you have to spend more — in time, resources and personnel — to try to recoup it. Of course, there is also the deterrent value of chasing down fraudsters, but with limited resources available to auditors, the likelihood is that the majority of that money is unrecoverable.

Stop Fraud at the Front Door

Businesses don’t commit fraud; the people who run those businesses — legitimate or otherwise — do. It’s essential to make sure the applicants are who they say they are, that their businesses are genuine, and that their employees actually exist and work for them.

True, banks are important parties in the loan application process, but the issue starts with state registration, where new businesses register with the Secretary of State and other offices at the city or county level. It’s relatively easy for fraudsters to use stolen identities and fabricated information to create a realistic business entity, complete with management personnel and officers. It’s up to agencies to determine if any or all of the information submitted is true. Historically, this has been tough to do: Research by LexisNexis Risk Solutions shows that only 50 percent of small businesses have a credit history, and half of those with a history only have thin files. Once the business is registered, that information can flow to federal agencies, including the SBA as it reviews PPP loan applications.

The same set of identity issues needs to be addressed when processing UI applicants. Unfortunately, it’s very easy to create an identity that looks real. Online resources exist to falsify an ID or driver’s license, utility bills and pay stubs, so that an applicant can appear legitimate. Stolen or synthetic identities are also being used, which adds to the confusion, as some or all of the information being used about that person is real. The result is that, ultimately, stimulus and UI funds can end up in the wrong hands, leaving the government to recover them from someone who doesn’t exist. These identities may also be used to obtain assistance and benefits through additional state-run programs, as well as to apply for UI and assistance in other states altogether, creating the issue of dual participation.

The Answer Is Data

Preventing fraud requires a judicious, intelligent process that screens applications for business registrations and UI at the earliest possible stage. Most states have systems in place for this, not only for approving business licenses and UI, but also for disaster contingency programs, which require funds to flow quickly from the state or locality to people and businesses. The current environment, however, has made things more complicated, since offices may have limited resources and many applications are completely online to ensure social distancing. But whether in person or digitally, vetting the identities of applicants with confidence can only be done with a comprehensive set of accurate, up-to-date data sources.

Connecting a person’s physical identity — their address, birthdate, Social Security number, etc. — with their digital life — their online activity and where, when and how they interact online — is crucial for building risk scores, which support well-informed decisions on how best to apply limited resources to the issue of fraud. A system that provides identity intelligence and pattern recognition in near-real time would not slow down the application process; in fact, it would improve turnaround time, since less manual vetting is needed.
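To illustrate the general idea (and only the general idea), here is a toy sketch of a rules-based risk score that combines physical- and digital-identity signals. The check names, weights and threshold are hypothetical assumptions for illustration, not the scoring approach of any particular vendor.

# Illustrative only: a toy score combining physical- and digital-identity checks.
# Check names, weights and the review threshold are hypothetical.
PHYSICAL_CHECKS = {"ssn_matches_name": 25, "address_on_file": 15, "dob_consistent": 10}
DIGITAL_CHECKS = {"device_seen_before": 20, "email_age_over_1yr": 15, "ip_matches_region": 15}

def risk_score(applicant: dict) -> int:
    """Return a 0-100 score; higher means the identity looks more trustworthy."""
    score = 0
    for check, weight in {**PHYSICAL_CHECKS, **DIGITAL_CHECKS}.items():
        if applicant.get(check, False):
            score += weight
    return score

applicant = {"ssn_matches_name": True, "address_on_file": True,
             "device_seen_before": False, "email_age_over_1yr": True}
print(risk_score(applicant))  # 55 here; below a chosen threshold, route to manual review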

With the extension of the PPP program, the ability to apply for loan forgiveness and the likelihood of another round of stimulus on the way, the opportunities for continued PPP fraud may be growing, further straining citizens, public resources and the economy. A multi-layered physical and digital identity intelligence solution, powered by comprehensive government and commercial data sources, means approvers can automate the process and sort legitimate applicants from scammers more quickly and more accurately. And that helps ensure that funds go where they are needed most to support hardworking people in every state. Hopefully these critical safeguards will be implemented in disaster-ready solutions so we are not chasing taxpayer money next time.

Andrew McClenahan, a solutions architect for LexisNexis Risk Solutions, leads the design and implementation of government agency solutions that uphold public program integrity and provides consultative services to systems integrators and government agencies on operations and data architecture issues to promote efficiency and economy in programs. McClenahan has spent much of his 25-year career in public service, including roles as the director of the Office of Public Benefits Integrity at the Florida Department of Children and Families, law enforcement captain with the Florida Department of Environmental Protection, and various sworn capacities with the Tallahassee Police Department. He is a Certified Public Manager, a Certified Welfare Fraud Investigator, and a Certified Inspector General Investigator.

Source by analyticsweekpick

Fat Zebras

When presenting data tables, we recommend stripping away backgrounds and grids and using alignment, formatting, and whitespace to guide the eye along the rows and columns of the data. 

But what if you have lots of rows with many columns that stretch across a screen or page? Or perhaps you have large gaps between fields because of varying text lengths. In both these cases, tracking across rows can be difficult.

To combat this, Stephen Few recommends zebra striping, where a light shaded colour is used on alternating rows to aid scanning across.

Problem solved, right?  

Maybe not. They’re almost always poorly done. If your fill colour is too dark, it overemphasizes rows that shouldn’t be emphasized and hampers vertical scanning. Even with subtle shading, it’s like looking through a pair of those horrible shutter sunglasses from the eighties. Worst of all, they might not even help: A List Apart ran a study that suggests zebra stripes don’t actually increase accuracy or speed.

For both aesthetic and performance reasons, I suggest we find an alternative. I propose fat zebras.

Fat zebras have less visual noise because they switch back and forth less frequently. As for tracking, I find them even easier to follow, simply tracing the top, middle, or bottom row of each filled band across. (We will have to wait for a follow-up study to see if they actually help – it seems people don’t prefer the look of them on smaller tables, but there is no indication of their performance.)
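For those building tables in code, here is a minimal sketch of fat-zebra striping using pandas’ Styler. The band size of three rows and the fill colour are arbitrary assumptions; any subtle shade will do.

# A minimal sketch of "fat zebra" striping with pandas' Styler.
# The band size (3 rows) and the fill colour are arbitrary choices, not a standard.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12, 5), columns=list("ABCDE"))

def fat_zebra(row, band=3, colour="#f2f2f2"):
    # shade alternating bands of `band` consecutive rows with a subtle fill
    shaded = (row.name // band) % 2 == 1
    return [f"background-color: {colour}" if shaded else "" for _ in row]

styled = df.style.apply(fat_zebra, axis=1)
# In a notebook, `styled` renders the striped table; styled.to_html() gives raw HTML.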

One potential drawback to consider when using fat zebras is ‘implied grouping’. Depending on your data set, people may think the fill colours indicate that data share some specific quality. For this reason, it is vitally important, as with regular stripes, that the background fill be as subtle as possible. I would also recommend avoiding it when there are only a few rows of data visible as it’s even more likely to appear as data grouping.

Edward Tufte claims that “good typography can always organize a table, no stripes needed.” When working with large tables, however, I prefer to show my stripes, even if they are fat.
 

Source: Fat Zebras by analyticsweek

An Unexpected Insider Threat: Senior Executives Who Ignore Cybersecurity Rules

Cybersecurity should be a top concern for any business, with strong policies and protocol put in place, and top executives leading by example. However, recent research has shown that more than half of senior managers disregard the rules, placing their organizations in jeopardy and making them an insider threat. This behavior is common in companies […]

The post An Unexpected Insider Threat: Senior Executives Who Ignore Cybersecurity Rules appeared first on TechSpective.

Source: An Unexpected Insider Threat: Senior Executives Who Ignore Cybersecurity Rules by administrator

10 Clustering Algorithms With Python



Clustering or cluster analysis is an unsupervised learning problem.

It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior.

There are many clustering algorithms to choose from and no single best clustering algorithm for all cases. Instead, it is a good idea to explore a range of clustering algorithms and different configurations for each algorithm.

In this tutorial, you will discover how to fit and use top clustering algorithms in Python.

After completing this tutorial, you will know:

  • Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
  • There are many different clustering algorithms and no single best method for all datasets.
  • How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

Let’s get started.

Clustering Algorithms With Python. Photo by Lars Plougmann, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Clustering
  2. Clustering Algorithms
  3. Examples of Clustering Algorithms
    1. Library Installation
    2. Clustering Dataset
    3. Affinity Propagation
    4. Agglomerative Clustering
    5. BIRCH
    6. DBSCAN
    7. K-Means
    8. Mini-Batch K-Means
    9. Mean Shift
    10. OPTICS
    11. Spectral Clustering
    12. Gaussian Mixture Model

Clustering

Cluster analysis, or clustering, is an unsupervised machine learning task.

It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.

Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups.

— Page 141, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to the cluster than other clusters. The cluster may have a center (the centroid) that is a sample or a point in the feature space and may have a boundary or extent.

These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances.

— Pages 141-142, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Clustering can be helpful as a data analysis activity in order to learn more about the problem domain, so-called pattern discovery or knowledge discovery.

For example:

  • The phylogenetic tree could be considered the result of a manual clustering analysis.
  • Separating normal data from outliers or anomalies may be considered a clustering problem.
  • Separating customers into natural groups based on their behavior is a clustering problem, referred to as market segmentation.

Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped and labeled as belonging to one of the identified clusters in the data.
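As a minimal sketch of this feature-engineering use, a fitted model (here k-means, one of the algorithms covered below, on the same synthetic dataset used later in this tutorial) can map examples to a cluster id that is appended as an extra input feature. The use of random_state here is my addition for reproducibility.

# sketch: use the assigned cluster id as an extra feature
from numpy import hstack
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
model = KMeans(n_clusters=2, random_state=4)
model.fit(X)
# map examples (new or existing) to a cluster and append the label as a feature
labels = model.predict(X).reshape(-1, 1)
X_with_cluster = hstack((X, labels))
print(X_with_cluster.shape)  # (1000, 3)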

Evaluation of identified clusters is subjective and may require a domain expert, although many clustering-specific quantitative measures do exist. Typically, clustering algorithms are compared academically on synthetic datasets with pre-defined clusters, which an algorithm is expected to discover.
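As one example of such a clustering-specific quantitative measure, the silhouette coefficient scores a clustering between -1 and 1 (higher is better). The sketch below applies it to a k-means result on the tutorial’s synthetic dataset; the choice of k-means is illustrative.

# sketch: score a clustering with the silhouette coefficient
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
labels = KMeans(n_clusters=2, random_state=4).fit_predict(X)
print(silhouette_score(X, labels))  # closer to 1.0 means better-separated clusters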

Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method.

— Page 534, Machine Learning: A Probabilistic Perspective, 2012.

Clustering Algorithms

There are many types of clustering algorithms.

Many algorithms use similarity or distance measures between examples in the feature space in an effort to discover dense regions of observations. As such, it is often good practice to scale data prior to using clustering algorithms.
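As a minimal sketch of this scaling advice, a scikit-learn Pipeline can standardize the features before the clustering step. The pairing of StandardScaler with k-means here is an illustrative assumption; any scaler and algorithm follow the same pattern.

# sketch: scale features before clustering using a pipeline
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
pipeline = Pipeline([("scale", StandardScaler()), ("cluster", KMeans(n_clusters=2))])
labels = pipeline.fit_predict(X)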

Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it.

— Page 502, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

Some clustering algorithms require you to specify or guess at the number of clusters to discover in the data, whereas others require the specification of some minimum distance between observations in which examples may be considered “close” or “connected.”

As such, cluster analysis is an iterative process where subjective evaluation of the identified clusters is fed back into changes to algorithm configuration until a desired or appropriate result is achieved.

The scikit-learn library provides a suite of different clustering algorithms to choose from.

A list of 10 of the more popular algorithms is as follows:

  • Affinity Propagation
  • Agglomerative Clustering
  • BIRCH
  • DBSCAN
  • K-Means
  • Mini-Batch K-Means
  • Mean Shift
  • OPTICS
  • Spectral Clustering
  • Mixture of Gaussians

Each algorithm offers a different approach to the challenge of discovering natural groups in data.

There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

In this tutorial, we will review how to use each of these 10 popular clustering algorithms from the scikit-learn library.

The examples will provide the basis for you to copy-paste them into your own project and test the methods on your own data.

We will not dive into the theory behind how the algorithms work or compare them directly; the papers cited in each section below are a good starting point for the theory.

Let’s dive in.

Examples of Clustering Algorithms

In this section, we will review how to use 10 popular clustering algorithms in scikit-learn.

This includes an example of fitting the model and an example of visualizing the result.

The examples are designed for you to copy-paste into your own project and apply the methods to your own data.

Library Installation

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For installation instructions specific to your platform, see the scikit-learn installation documentation.

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

# check scikit-learn version
import sklearn
print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.22.1

Clustering Dataset

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with two input features and one cluster per class. The clusters are visually obvious in two dimensions so that we can plot the data with a scatter plot and color the points in the plot by the assigned cluster. This will help to see, at least on the test problem, how “well” the clusters were identified.

The clusters in this test problem are based on a multivariate Gaussian, and not all clustering algorithms will be effective at identifying these types of clusters. As such, the results in this tutorial should not be used as the basis for comparing the methods generally.

An example of creating and summarizing the synthetic clustering dataset is listed below.

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
for class_value in range(2):
	# get row indexes for samples with this class
	row_ix = where(y == class_value)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example creates the synthetic clustering dataset, then creates a scatter plot of the input data with points colored by class label (idealized clusters).

We can clearly see two distinct groups of data in two dimensions and the hope would be that an automatic clustering algorithm can detect these groupings.

Scatter Plot of Synthetic Clustering Dataset With Points Colored by Known Cluster

Next, we can start looking at examples of clustering algorithms applied to this dataset.

I have made some minimal attempts to tune each method to the dataset.

Can you get a better result for one of the algorithms?
Let me know in the comments below.

Affinity Propagation

Affinity Propagation involves finding a set of exemplars that best summarize the data.

We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges

— Clustering by Passing Messages Between Data Points, 2007.

The technique is described in the paper cited above.

It is implemented via the AffinityPropagation class and the main configuration to tune is the “damping” hyperparameter, set between 0.5 and 1, and perhaps the “preference” hyperparameter.

The complete example is listed below.

# affinity propagation clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AffinityPropagation(damping=0.9)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, I could not achieve a good result.

Scatter Plot of Dataset With Clusters Identified Using Affinity Propagation

Agglomerative Clustering

Agglomerative clustering involves merging examples until the desired number of clusters is achieved.

It is part of the broader class of hierarchical clustering methods.

It is implemented via the AgglomerativeClustering class and the main configuration to tune is the “n_clusters” hyperparameter, set to an estimate of the number of clusters in the data, e.g. 2.

The complete example is listed below.

# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found.

Scatter Plot of Dataset With Clusters Identified Using Agglomerative Clustering

BIRCH

BIRCH clustering (BIRCH is short for Balanced Iterative Reducing and Clustering using Hierarchies) involves constructing a tree structure from which cluster centroids are extracted.

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i. e., available memory and time constraints).

— BIRCH: An efficient data clustering method for large databases, 1996.

The technique is described in the paper cited above.

It is implemented via the Birch class and the main configurations to tune are the “threshold” and “n_clusters” hyperparameters, the latter of which provides an estimate of the number of clusters.

The complete example is listed below.

# birch clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = Birch(threshold=0.01, n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, an excellent grouping is found.

Scatter Plot of Dataset With Clusters Identified Using BIRCH Clustering

DBSCAN

DBSCAN Clustering (where DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise) involves finding high-density areas in the domain and expanding those areas of the feature space around them as clusters.

… we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it

— A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.

The technique is described in the paper cited above.

It is implemented via the DBSCAN class and the main configurations to tune are the “eps” and “min_samples” hyperparameters.

The complete example is listed below.

# dbscan clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = DBSCAN(eps=0.30, min_samples=9)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found, although more tuning is required.

Scatter Plot of Dataset With Clusters Identified Using DBSCAN Clustering

K-Means

K-Means Clustering may be the most widely known clustering algorithm and involves assigning examples to clusters in an effort to minimize the variance within each cluster.

The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably efficient in the sense of within-class variance.

— Some methods for classification and analysis of multivariate observations, 1967.

The technique is described in the paper cited above.

It is implemented via the KMeans class and the main configuration to tune is the “n_clusters” hyperparameter set to the estimated number of clusters in the data.

The complete example is listed below.

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found, although the unequal variance in each dimension makes the method less suited to this dataset.

Scatter Plot of Dataset With Clusters Identified Using K-Means Clustering

Mini-Batch K-Means

Mini-Batch K-Means is a modified version of k-means that makes updates to the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise.

… we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent.

— Web-Scale K-Means Clustering, 2010.

The technique is described in the paper cited above.

It is implemented via the MiniBatchKMeans class and the main configuration to tune is the “n_clusters” hyperparameter set to the estimated number of clusters in the data.

The complete example is listed below.

# mini-batch k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = MiniBatchKMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a result equivalent to the standard k-means algorithm is found.

Scatter Plot of Dataset With Clusters Identified Using Mini-Batch K-Means Clustering

Mean Shift

Mean shift clustering involves finding and adapting centroids based on the density of examples in the feature space.

We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and thus its utility in detecting the modes of the density.

— Mean Shift: A robust approach toward feature space analysis, 2002.

The technique is described in the paper cited above.

It is implemented via the MeanShift class and the main configuration to tune is the “bandwidth” hyperparameter.

The complete example is listed below.

# mean shift clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = MeanShift()
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable set of clusters is found in the data.

Scatter Plot of Dataset With Clusters Identified Using Mean Shift Clustering

OPTICS

OPTICS clustering (where OPTICS is short for Ordering Points To Identify the Clustering Structure) is a modified version of DBSCAN described above.

We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings.

— OPTICS: ordering points to identify the clustering structure, 1999.

The technique is described in the paper cited above.

It is implemented via the OPTICS class and the main configurations to tune are the “eps” and “min_samples” hyperparameters.

The complete example is listed below.

# optics clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import OPTICS
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = OPTICS(eps=0.8, min_samples=10)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, I could not achieve a reasonable result on this dataset.

Scatter Plot of Dataset With Clusters Identified Using OPTICS Clustering

Spectral Clustering

Spectral Clustering is a general class of clustering methods, drawn from linear algebra.

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. Here, one uses the top eigenvectors of a matrix derived from the distance between points.

— On Spectral Clustering: Analysis and an algorithm, 2002.

The technique is described in the paper cited above.

It is implemented via the SpectralClustering class and the main configuration to tune is the “n_clusters” hyperparameter used to specify the estimated number of clusters in the data.

The complete example is listed below.

# spectral clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = SpectralClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, reasonable clusters were found.

Scatter Plot of Dataset With Clusters Identified Using Spectral Clustering

Gaussian Mixture Model

As its name suggests, a Gaussian mixture model summarizes a multivariate probability density function with a mixture of Gaussian probability distributions.


It is implemented via the GaussianMixture class and the main configuration to tune is the “n_components” hyperparameter used to specify the estimated number of clusters in the data.

The complete example is listed below.

# gaussian mixture clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = GaussianMixture(n_components=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, we can see that the clusters were identified perfectly. This is not surprising given that the dataset was generated as a mixture of Gaussians.

Scatter Plot of Dataset With Clusters Identified Using Gaussian Mixture Clustering


Summary

In this tutorial, you discovered how to fit and use top clustering algorithms in Python.

Specifically, you learned:

  • Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
  • There are many different clustering algorithms, and no single best method for all datasets.
  • How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



The post 10 Clustering Algorithms With Python appeared first on Machine Learning Mastery.

Source: 10 Clustering Algorithms With Python

Put big data to work with Cortana Analytics


At its most recent Worldwide Partner Conference, held in July 2015, Microsoft announced several new cloud-based initiatives, the most interesting of which for enterprise businesses was the Cortana Analytics Suite. This is Microsoft’s big data and predictive analysis package, and it will compete with the likes of IBM for a share of this lucrative market.

Analyze this

The promise of big data, and the analysis tools that come with it, is that enterprises will be able to “mine” the huge amount of data they generate on customers, vendors, markets, and products for insights into unforeseen or untapped information—the kind of information that leads to more business, more revenue, and more profits.

In simple terms, a solid big data analytics infrastructure is an absolute necessity for any business enterprise with aspirations for success. Enterprises that do not take the time and spend the money to establish a big data infrastructure are going to operate at a disadvantage. That is not a good place to be.

The Suite

The Cortana Analytics Suite (Figure A) is Microsoft’s foray into this important and lucrative market. According to the announcement, Cortana Analytics will take advantage of machine learning, unlimited data storage, and perceptual intelligence to “transform data into intelligent action.” If only it were that simple.

Figure A: The Cortana Analytics Suite. (Image: Microsoft News)

No matter where you turn for your big data solutions, the software and infrastructure can only take you so far. All of the business intelligence and big data analytics tools in the world are not going to do you any good if your employees don’t use them.

This is where Microsoft is trying to carve out a niche. The Cortana Analytics Suite will be available for a simple monthly subscription, similar to the way enterprises pay for Office 365. It will have a familiar interface, and all of the analytical tools will be available in one product. There will be no need to mix and match tools from various sources.

The other “simplification” effort Microsoft hopes will establish its niche in the market is Cortana itself. Cortana, Microsoft’s voice-controlled personal assistant, is integrated into the analytics suite. The idea is that users can ask questions using Cortana and natural language without having to formulate old-fashioned database queries.

It’s a start

Microsoft’s Cortana Analytics Suite is a very ambitious project. I applaud Microsoft for recognizing that many current business intelligence and big data analytics solutions are overly complicated and unwieldy. Simplification is the right strategy—it’s the niche that can separate Microsoft from everyone else.

However, knowing the right strategy and executing the right strategy are two different things. From what I have seen of the Cortana Analytics Suite, it does indeed present a set of streamlined and simplified analytical tools. But those tools are more of a framework, and they aren’t likely to be exactly what your enterprise needs.

In other words, enterprises using the Cortana Analytics Suite are still going to have to build applications that meet the particular needs of their organizations. Cortana Analytics may make application development easier (that is certainly Microsoft’s plan), but we won’t know whether it does until enterprises get their hands on it.

And even if Cortana Analytics does make development easier, enterprises will still have to hire people to do the app development and then train their employees how to use those apps.

As I said before, big data analytics is vital to the success of modern business enterprises, but it requires a significant commitment of resources. So, even if Microsoft has indeed simplified big data with Cortana Analytics, it has not made it truly easy.

Note: This article originally appeared in TechRepublic.

Source by analyticsweekpick

How AI and Machine Learning Can Win Elections

Elections are the time when the people of a country are given the power to choose the next government that will govern their nation. The period leading up to an election is packed with massive campaigning by all political parties. Every voter has his or her own ideologies and expectations that they would like to see a candidate fulfill, and the main objective of political parties is to influence or sway voters to vote for their respective candidates. The general techniques politicians use to achieve this objective include meeting voters in person, mass media advertising, public rallies and social media campaigns. This has been the case with political elections throughout modern history.

In recent years, technology has changed the whole approach drastically. Politicians are now relying on technological advancements such as analysing Big Data to connect and engage better with voters. Former US President Barack Obama’s campaign team used Big Data Analytics to maximize the effectiveness of his email campaigns, which helped raise a whopping US$1 billion in campaign money.

Along with Big Data, the next technologies that are going to have a huge impact in election campaigns and political life are Artificial Intelligence and Machine Learning.

Engaging Voters Using AI

AI and machine learning can be used to engage voters in election campaigns and help them be more informed about important political issues in the country. Based on statistical techniques, machine learning algorithms can automatically identify patterns in data. By analyzing the online behaviour of voters, including their data consumption patterns, relationships and social media activity, unique psychographic and behavioural user profiles can be created. Targeted advertising campaigns can then be sent to each voter based on their individual psychology. This helps in persuading voters to vote for the party that meets their expectations.
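To make the pattern-finding step concrete, here is a toy sketch that clusters voters into behavioural segments with k-means on synthetic engagement features. The feature names, values and the choice of four segments are illustrative assumptions, not a description of any real campaign system.

# toy sketch: group voters into behavioural segments from synthetic features
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# columns: daily minutes on social media, political articles read per week, shares per week
voters = rng.random((500, 3)) * [120, 20, 10]
segments = KMeans(n_clusters=4, random_state=0).fit_predict(StandardScaler().fit_transform(voters))
# each segment could then receive messaging tailored to its behavioural profile
print(np.bincount(segments))  # number of voters assigned to each segment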

Read Also: How AI and ML can Transform Governance

Source: How AI and Machine Learning Can Win Elections by administrator

The Hidden Bias in Customer Metrics

Business leaders understand how their business is performing by monitoring different metrics. Metrics essentially summarize all the data (yes, even Big Data) into a score. Metrics include new customer growth rate, number of sales and employee satisfaction, to name a few. Your hope is that these scores tell you something useful.

There are a few ways to calculate a metric. Using the examples above, we see that we can use percentages, a simple count and descriptive statistics (e.g., mean) to calculate a metric. In the world of customer feedback, there are a few ways to calculate metrics from structured data. Take, for example, a company that has 10,000 responses to a recent customer survey in which they used a 0-10 point rating scale (e.g., 0 = extremely dissatisfied; 10 = extremely satisfied). They have a few options for calculating a summary metric:

  1. Mean Score:  This is the arithmetic average of the set of responses. The mean is calculated by summing all responses and dividing by the number of responses. Possible mean scores can range from 0 to 10.
  2. Top Box Score: The top box score represents the percentage of respondents who gave the best responses (either a 9 or 10 on a 0-10 scale). Possible percentage scores can range from 0 to 100.
  3. Bottom Box Score: The bottom box score represents the percentage of respondents who gave the worst responses (0 through 6 on a 0-10 scale). Possible percentage scores can range from 0 to 100.
  4. Net Score: The net score represents the difference between the Top Box Score and the Bottom Box Score. Net scores can range from -100 to 100. While the net score was made popular by the Net Promoter Score camp, others have used a net score to calculate a metric (see Net Value Score). While the details might differ, net scores take the same general approach in their calculations (percent of good responses minus percent of bad responses). For the remainder, I will focus on the Net Promoter Score methodology. A short sketch of all four calculations follows this list.
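Here is that minimal sketch of the four summary metrics on a 0-10 scale; the sample ratings are invented for illustration.

# A minimal sketch of the four summary metrics on a 0-10 rating scale.
# The sample ratings are made up for illustration.
ratings = [10, 9, 9, 8, 7, 7, 6, 5, 3, 10]

mean_score = sum(ratings) / len(ratings)
top_box = 100 * sum(r >= 9 for r in ratings) / len(ratings)      # % of 9s and 10s
bottom_box = 100 * sum(r <= 6 for r in ratings) / len(ratings)   # % of 0-6 ratings
net_score = top_box - bottom_box                                 # e.g. the Net Promoter Score

print(mean_score, top_box, bottom_box, net_score)  # 7.4 40.0 30.0 10.0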

Different Summary Metrics Tell You the Same Thing

Table 1. Correlations among different summary metrics of the same question (likelihood to recommend).

When I compared these summary metrics to each other, I found that they tell you pretty much the same thing about the data set. Across 48 different companies, these four common summary metrics are highly correlated with each other (See Table 1). Companies who receive high mean scores also receive a high NPS and top box scores. Likewise, companies who receive low mean scores will get low NPS and top box scores.

If these metrics are mathematically equivalent, does it matter which one we use?

How Are Metrics Interpreted by Users?

Even though different summary metrics are essentially the same, some metrics might be more beneficial to users due to their ease of interpretation. Are there differences between Mean Scores and Net Promoter Scores in helping users understand the data? Even though a mean of 7.0 is comparable to an NPS of 0.0, are there advantages to using one over the other?

Table 2. Net Promoter Scores and Predicted Values of Other Summary Metrics.

One way of answering that question is to determine how well customer experience (CX) professionals can describe the underlying distribution of ratings on which the Mean Score or Net Promoter Score is calculated.

Study

Study participants were invited via a blog post about the study; the post included a hyperlink to the Web-based data collection instrument. The post was shared through social media connections, professional online communities and the author’s email list.

For the current study, each CX professional ran through a series of exercises in which they estimated the size of different customer segments based on their knowledge of either a Mean Score or the Net Promoter Score. To ensure Mean Scores and Net Promoter Scores were comparable to each other, I created the study protocol using the data from the study above. Table 2 includes a list of six summary metrics with their corresponding values. NPS values range from -100 to 100 in increments of 10. The values of other metrics are based on the regression formulas that predicted a specific summary metric from different values of the NPS. An NPS of 0.0 corresponds to a Mean Score of 7.1.

First, each study participant was given five NPS values (-100, -50, 0, 50 and 100). For each NPS value, they were asked to provide their best guess of the size of four specific customer segments from which that NPS was calculated: 1) percent of respondents with ratings of 6 or greater (Satisfied); 2) percent of respondents who have ratings of 9 or 10 (Promoters); 3) percent of respondents with ratings between 0 and 6, inclusive (Detractors) and 4) percent of respondents with ratings of 7 or 8 (Passives).

Table 3. Sample Demographics

Next, these same CX professionals were given five comparable (to the NPS values above) mean values (4.0, 5.5, 7.0, 8.5 and 10.0). For each mean score, they were asked to provide their best guess of the percent of respondents in each of the same categories above (i.e., Satisfied, Promoters, Detractors and Passives).

Results

A total of 41 CX professionals participated in the study. Most CX professionals were from B2B companies (55%) or B2B/B2C companies (42%). Three quarters of them had formal CX roles, and most (77%) considered themselves either proficient or experts in their company’s CX program. See Table 3.

Figures 1 through 4 contain the results. Each figure illustrates the accuracy of the CX professionals’ estimates: the red dots represent the actual size of the specific customer segment for each value of the NPS, and the green bars represent the CX professionals’ estimates of the segment size along with their corresponding 95% confidence intervals.

Figure 1 focuses on the estimation of the number of Promoters. Results show that CX professionals underestimate the Top Box percentage (i.e., Promoters) when the Mean Score is high. For example, CX professionals estimated that a Mean Score of 8.5 was equivalent to 45% Top Box Score when the actual Top Box Score would really be 64%. We saw a smaller effect using the NPS. In general, CX professionals could more accurately guess the Top Box Scores when using Net Promoter Scores, except for the highest NPS value of 100 (Actual Top Box Score = 100; CX professional’s estimate = 89).

Figure 1. Estimating % of Promoters from NPS and Mean Values

In Figure 2, I looked at how well study participants could guess the size of the Bottom Box Scores (i.e., Detractors). Results show that CX professionals could accurately predict the percent of Detractors throughout the range of NPS values. On the other hand, CX professionals greatly underestimated the Bottom Box Scores when the Mean Score was extremely low (Mean = 4.0; Corresponding Bottom Box Score = 89; CX professionals’ estimate of Bottom Box Score= 64).

In Figure 3, I looked at how well study participants could guess the size of the Passives segment. Again, CX professionals were able to accurately estimate the percent of Passives across all values of the NPS. When using Mean Scores, however, study participants tended to overestimate the size of the Passives segment across all levels of the Mean Score.

Figure 2. Estimating % of Detractors from NPS and Mean Values

In Figure 4, I looked at how well CX professionals could estimate the size of the Satisfied segment (rating of 6 or greater). Unlike the other findings using the NPS, we see that study participants underestimated the size of this segment across all levels of the NPS. The effect was less pronounced and slightly different when CX professionals relied on the Mean Score. Under this condition, CX professionals underestimated the size of the Satisfied segment when Mean ratings were 5.5 or above but overestimated the size of the Satisfied segment when the Mean rating was 4.0.

Figure 3. Estimating % of Passives from NPS and Mean Values

Summary and Conclusions

The results of this study show that customer metrics possess inherent bias. People tend to make consistent errors when interpreting customer metrics, especially for extreme values.

When the Mean Score was used, estimations of segment sizes suffered on the extreme ends of the scale. When things are really good (high Mean Score), CX professionals underestimated the number of Promoters they really have. When things are really bad (Mean score of 4.0), they underestimated the number of Detractors they really have.

The use of the NPS leads to more accurate estimations about underlying customer segments that are a part of the NPS lexicon (i.e., Promoters, Detractors and Passives). Net scores force the user to think about their data in specific segments. When CX professionals were estimating the size of a segment unrelated to the NPS (i.e., estimating percent of 6 – 10 ratings), they greatly underestimated the size of the segment across the entire spectrum of the NPS.

Figure 4. Estimating % of Positives from NPS and Mean Values

Generally speaking, better decisions will be made when the interpretation of results matches reality. We saw that a mean of 8.5 really indicates that 64% of the customers are very satisfied (% of 9 and 10 ratings); yet, the CX professionals think that only 45% of the customers are very satisfied, painting a vastly different picture of how they interpret the data. Any misinterpretation of a performance metric could lead to sub-optimal decisions that are driven more by biases than by what the data really tell us, leading to unnecessary investments in areas where leaders are doing better than they think they are.

My advice is to consider using a few metrics to describe what’s happening with your data. First, Mean Scores and Net Scores are equivalent. So, for trending purposes, pick either and use it consistently. Second, report the size of specific customer segments (e.g., % Top Box) to ensure people understand the true meaning of the underlying data.

With the shortage of data scientists to help fill analytic roles in business, companies are looking for ways to train existing employees on how to analyze and interpret data. In addition to training the next analytics leaders, businesses need to focus on educating the consumers (e.g., executives, managers and individual contributors) about data and the use of analytics. The current sample used professionals who have a high degree of proficiency in the use of metrics as well as in the application of those metrics in a formal company program. Yet, these savvy users still misinterpreted metrics. For data novices, we would likely see greater bias. If you are a metric-rich company (and who isn’t?), consider offering a class on basic statistics to all employees.

Some Big Data vendors hope to build solutions to help bring data science to the masses. These solutions help users gain insight through easy analysis and visualization of results. For example, Statwing and Tableau provide good examples of solutions that allow you to present data in different ways (e.g., means, frequency distributions), helping you communicate what is really going on in the data.

Remember that metrics don’t exist in a vacuum. They are interpreted by people. We saw that people are biased in their understanding of the meaning of two commonly used customer metrics, the Mean Score and Net Score. Carefully consider how you communicate your results as well as your audiences’ potential biases.

http://www.slideshare.net/bobehayes/the-hidden-bias-in-customer-metrics

Source: The Hidden Bias in Customer Metrics by bobehayes

Friends of Juice: Jessica Walker

Juice wouldn’t be the successful company that it is today without our friends who have championed our mission and expanded our network. We have identified several of Juice’s closest friends and advocates and we want to introduce them to you! Meet Jessica Walker, the CEO of Care Sherpa!


For over 14 years, Jessica has been consulting with health care organizations and providers to support patient and employee engagement strategies with demonstrated financial and quality improvement impact. Across predictive analytics, marketing strategies, CRM, digital engagement, population health and patient portals, Jessica has supported multiple healthcare organizations on their patient acquisition and retention journeys.

Here’s what Jessica had to say when we discussed what she loved most about Juice Analytics and our team.

How did you hear or find out about Juice? 

I first became familiar with Juice after meeting Zach at the Nashville Analytics Summit and then later became familiar with Juice’s full capabilities when I was involved in an acquisition where the former company was utilizing some dashboards and insights tools that Juice had built. 

What do you love the most about Juice or what they do? 

I appreciate the thoughtfulness of starting with the “story” or what problem you are trying to solve. Juice quickly becomes a true partner aligned with your strategic goals, giving you the tools you need to get there. I also appreciate the team members I have had the pleasure to work with; they are very service-oriented and responsive.

Who would you recommend to Juice for all of their data product needs? 

I think Juice can be a critical tool in any organization’s toolbox. Specifically, I believe that anyone in a consultancy or change management role would quickly accelerate their business and stakeholder impact by working with Juice. Further, senior leaders or executives who are looking to demonstrate the impact of their product line, team or investments would be well served by a Juice partnership. Finally, early-stage organizations that need performance tracking would not only benefit from the expertise and best practices that Juice brings to the table, but would also find that a Juice investment can directly impact their business growth over time.

What has impressed you the most about Juice and its team? 

Beyond what was stated prior regarding service orientation, I was most impressed with the depth of expertise and their ability to bring examples of what others have done that may be similar to my needs. This knowledge not only accelerated our project timeline but also represented true cost savings by eliminating multiple revisions.   

We are so thankful for the partnership and friendship that we have with Jessica! Thank you for being such a champion for Juice!

Originally Posted at: Friends of Juice: Jessica Walker by analyticsweek

Amazon Redshift COPY command cheatsheet

Although it’s getting easier, ramping up on the COPY command to import tables into Redshift can become very tricky and error-prone. Following the accordion-like hyperlinked Redshift documentation to get a complete command isn’t always straightforward, either.

Treasure Data got in on the act (we always do!) with one short, straightforward guide that demystifies and distills all the COPY commands you could ever need.

Load tables into Redshift from S3, EMR, DynamoDB, over SSH, and more.

 

The guide includes example commands and shows how to use data sources, including the steps for setting up an SSH connection, using temporary and encrypted credentials, formatting, and much more.
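As a flavour of what the guide covers, here is a hedged sketch of a COPY from S3 issued through psycopg2. The cluster endpoint, database, table, bucket and IAM role below are hypothetical placeholders, and the exact COPY options you need will depend on your data format.

# A hedged sketch of loading a Redshift table from S3 with COPY via psycopg2.
# Endpoint, credentials, table, bucket and IAM role are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="loader", password="...")
copy_sql = """
    COPY public.events
    FROM 's3://example-bucket/events/2020/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1
    REGION 'us-east-1';
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # COPY runs inside the transaction and commits on success
conn.close()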

Get the guide here.

Originally Posted at: Amazon Redshift COPY command cheatsheet by john-hammink