Nov 29, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

[Image: Complex data (Source)]

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> Top 4 Instagram Analytics Tools That Digital Marketers Can Use for Business by thomassujain

>> Meet the Robot Reading Your Resume [infographics] by v1shal

>> Adopting a Multi-Cloud Strategy: Challenges vs. Benefits by analyticsweek

Wanna write? Click Here

[ NEWS BYTES]

>> 10 Best Social Media Management Tools for Marketers – TG Daily (blog) Under Social Analytics

>> Tableau cracks the business data code…be a data scientist now for just $19 – TNW Under Data Scientist

>> How efficient smart cities will be built on IoT sensors – TechRepublic Under IOT

More NEWS? Click Here

[ FEATURED COURSE]

Artificial Intelligence


This course includes interactive demonstrations which are intended to stimulate interest and to help students gain intuition about how artificial intelligence methods work under a variety of circumstances…. more

[ FEATURED READ]

The Signal and the Noise: Why So Many Predictions Fail–but Some Don’t


People love statistics. Statistics, however, do not always love them back. The Signal and the Noise, Nate Silver’s brilliant and elegant tour of the modern science-slash-art of forecasting, shows what happens when Big Da… more

[ TIPS & TRICKS OF THE WEEK]

Data aids, not replaces, judgement
Data is a tool and a means to help build consensus and facilitate human decision-making, not replace it. Analysis converts data into information; information, via context, leads to insight. Insights lead to decisions, which ultimately lead to outcomes that bring value. So, data is just the start; context and intuition play a role.

[ DATA SCIENCE Q&A]

Q: Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
A: Naive assumption: the features are assumed independent/uncorrelated, an assumption that is not realistic in many cases.
Improvement: decorrelate the features first, i.e., apply a whitening transform that turns the covariance matrix into the identity matrix.
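
As a rough illustration (a sketch of ours, not part of the original answer), scikit-learn's PCA with whiten=True performs exactly this kind of decorrelation ahead of a Gaussian naive Bayes classifier; the built-in dataset here is just a stand-in:

```python
# Illustrative sketch: decorrelate (whiten) features before naive Bayes.
# PCA(whiten=True) rotates and rescales the data so the transformed
# features are uncorrelated with unit variance (covariance = identity).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

plain = GaussianNB()
whitened = make_pipeline(PCA(whiten=True), GaussianNB())

for name, model in [("plain NB", plain), ("whitened NB", whitened)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```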

Source

[ VIDEO OF THE WEEK]

The History and Use of R

Subscribe on YouTube

[ QUOTE OF THE WEEK]

It’s easy to lie with statistics. It’s hard to tell the truth without statistics. – Andrejs Dunkels

[ PODCAST OF THE WEEK]

@TimothyChou on World of #IOT & Its #Future Part 1 #FutureOfData #Podcast

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

100 terabytes of data are uploaded to Facebook daily.

Sourced from: Analytics.CLUB #WEB Newsletter

Unraveling the Mystery of Big Data

Synopsis:
Curious about the Big Data hype? Want to find out just how big BIG is? Who’s using Big Data for what, and what can you use it for? How about the architecture underpinnings and technology stacks? Where might you fit in the stack? Maybe some gotchas to avoid? Lionel Silberman, a seasoned data architect, sheds some light on it. A good and wholesome refresher on Big Data and all it can do.
Our guest speaker:

Lionel Silberman,
Senior Data Architect, Compuware
Lionel Silberman has over thirty years of experience in big data product development. He has expert knowledge of relational databases, both internals and applications, as well as performance tuning, modeling, and programming. His product and development experience encompasses the major RDBMS vendors; object-oriented, time-series, OLAP, transaction-driven, MPP, distributed and federated database applications; data appliances; NoSQL systems such as Hadoop and Cassandra; as well as data parallel and mathematical algorithm development. He is currently employed at Compuware, integrating enterprise products at the data level. All are welcome to join us.


Source by v1shal

Data Science Programming: Python vs R

“Data Scientist – The Sexiest Job of 21st Century.”- Harvard Business Review

If you are already in a big data related career then you must be familiar with the set of big data skills that you need to master to grab the sexiest job of the 21st century. With every industry generating massive amounts of data, the need to crunch it calls for more powerful and sophisticated programming tools like Python and R. Python and R are among the popular programming languages that a data scientist must know to pursue a lucrative career in data science.

Python is popular as a general purpose web programming language, whereas R is popular for its great data visualization features, as it was developed specifically for statistical computing. At DeZyre, our career counsellors often get questions from prospective students about whether they should learn Python programming or R programming first. If you are unsure which programming language to learn first, you are on the right page.

Python and R top the list of basic tools for statistical computing in the set of data scientist skills. Data scientists often debate which is more valuable, R programming or Python programming; however, both languages have specialized key features that complement each other.

Data Science with Python Language

Data science consists of several interrelated but distinct activities, such as computing statistics, building predictive models, accessing and manipulating data, building explanatory models, creating data visualizations, and integrating models into production systems. Python provides data scientists with a set of libraries that help them perform all of these operations on data.

Python is a general purpose, multi-paradigm programming language for data science that has gained wide popularity because of its simple syntax and its operability across different ecosystems. Python lets programmers do anything they need with data: data munging, data wrangling, website scraping, web application building, data engineering and more. Python makes it easy for programmers to write maintainable, large-scale, robust code.

“Python programming has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python language, and we’re looking for more people with skills in this language.” – said Peter Norvig, Director at Google.

Unlike R, Python does not come with built-in statistical packages, but it has support for libraries like scikit-learn, NumPy, Pandas, SciPy and Seaborn that data scientists can use to perform useful statistical and machine learning tasks. Python reads almost like pseudocode and makes sense immediately, just like plain English. The expressions and characters used in the code can be mathematical; however, the logic is easy to follow from the code.
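
As a minimal sketch of that library stack in action (the CSV file and column names here are hypothetical, purely for illustration):

```python
# Hypothetical example: load data with Pandas, fit a scikit-learn model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")                    # data access (Pandas)
df = df.dropna(subset=["age", "income", "churned"])  # data munging

X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "income"]], df["churned"], test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)   # predictive model
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```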

What makes Python language the King of Data Science Programming Languages?

“In Python programming, everything is an object. It’s possible to write applications in Python language using several programming paradigms, but it does make for writing very clear and understandable object-oriented code.”- said Brian Curtin, member of Python Software Foundation

1) Broadness

The public package index for Python, popularly known as PyPI, has approximately 40K add-ons listed under 300 different categories. So, if a developer or a data scientist needs to do something with Python, there is a high probability that someone has already built it and they need not begin from scratch. Python is used extensively for various tasks ranging from CGI and web development, system testing and automation, and ETL to gaming.

2) Efficiency

Developers these days spend a lot of time defining and processing big data. With the increasing amount of data that needs to be processed, it becomes extremely important for programmers to manage in-memory usage efficiently. Python has generators, both as functions and as expressions, which support iterative processing, i.e., handling one item at a time. When a large number of processing steps must be applied to a set of data, generators are a great advantage: they grab the source data one item at a time and pass it through the entire processing chain.
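
A minimal sketch of such a generator chain (the file name and record format are hypothetical); each stage pulls one item at a time, so the full dataset never sits in memory:

```python
def read_records(path):
    """Stream lines from the source file lazily."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def parse(records):
    """Yield one parsed row at a time."""
    for rec in records:
        yield rec.split(",")

def keep_valid(rows):
    """Drop malformed rows as they pass through the chain."""
    for row in rows:
        if len(row) == 3:
            yield row

# Stages compose into a processing chain; nothing is read until the
# loop pulls items, one at a time, through the whole pipeline.
for row in keep_valid(parse(read_records("events.csv"))):
    print(row)
```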

The generator-based migration tool collective.transmogrifier helps make complex and interdependent updates to the data as it is read from the old site, and then allows programmers to create and store objects in constant memory at the new site. The transmogrifier plays a vital role in Python programming when dealing with larger data sets.

3) Can Be Easily Mastered Under Expert Guidance – Read It, Use It with Ease

Python has gained wide popularity because its syntax is clear and readable, which makes it easy to learn under expert guidance. Data scientists can gain expert knowledge of scientific computing with Python by taking industry-oriented Python programming courses. The readability of the syntax makes it easier for peer programmers to update already-written Python programs at a faster pace, and also helps them write new programs quickly.

Applications of Python Language

  • Python is used by Mozilla for exploring their broad code base. Mozilla releases several open source packages built using Python.
  • Dropbox, the popular file hosting service, was founded by Drew Houston because he kept forgetting his USB drive. The project was started to fulfill his personal needs, but it turned out so well that others started using it too. Dropbox is written almost entirely in Python and now has close to 150 million registered users.
  • Walt Disney uses Python to enhance the power of its creative processes.
  • Some other exceptional products written in Python are:

i. Cocos2d – a popular open source 2D gaming framework

ii. Mercurial – a popular cross-platform, distributed code revision control tool used by developers

iii. BitTorrent – file sharing software

iv. Reddit – an entertainment and social news website

Limitations of Python Programming

  • Python is an interpreted language and is thus often slower than compiled languages.
  • “A possible disadvantage of Python is its slow speed of execution. But many Python packages have been optimized over the years and execute at C speed.”- said Pierre Carbonnelle, a Python programmer who runs the PyPL language index.
  • Python, being a dynamically typed language, poses certain design restrictions. It requires rigorous testing because errors show up only at runtime.
  • Python has gained popularity on desktop and server platforms but is still weak on mobile computing platforms: relatively few mobile apps are developed with Python, and it is rarely found on the client side of web applications.

Click here to know more about our IBM Certified Hadoop Developer course

Data Science with R Language

Millions of data scientists and statisticians use R programming to tackle challenging problems in statistical computing and quantitative marketing. R has become an essential tool for finance and business analytics-driven organizations like LinkedIn, Twitter, Bank of America, Facebook and Google.

R is an open source programming language and environment for statistical computing and graphics, available on Linux, Windows and Mac. R has an innovative package system that allows developers to extend its functionality through cross-platform distribution and testing of data and code. With more than 5K publicly released packages available for download, it is a great programming language for exploratory data analysis. R can easily be integrated with other object-oriented programming languages like C, C++ and Java. R’s array-oriented syntax makes it easier for programmers to translate math to code, in particular for professionals with minimal programming background.

Why use R programming for data science?

1. R is one of the best tools for data scientists in the world of data visualization. It has virtually everything a data scientist needs: statistical models, data manipulation and visualization charts.

2. Data scientists can create unique and beautiful data visualizations with R that go far beyond outdated line plots and bar charts. With R, data scientists can draw meaningful insights from data in multiple dimensions using 3D surfaces and multi-panel charts. The Economist and The New York Times exploit the custom charting capabilities of R to create stunning infographics.

3. One great feature of R is reproducible research: the code and data can be given to an interested third party, which can trace them back to reproduce the same results. Data scientists need to write code that extracts the data, analyses it, and generates an HTML, PDF or PPT report. When a third party is interested, the original author can share the code and data so that the third party can reproduce similar results.

4. R is designed particularly for data analysis, with the flexibility to mix and match various statistical and predictive models for the best possible outcomes. R scripts can easily be automated to support production deployments and reproducible research.

5. R has a rich community of approximately 2 million users and thousands of developers, drawing on the talents of data scientists spread across the world. The community has packages across actuarial analysis, finance, machine learning, web technologies and pharmaceuticals that can help predict component failure times, analyse genomic sequences, and optimize portfolios. All these resources, created by experts in various domains, can be accessed easily online, for free.

Applications of R Language

  • Ford uses open source tools like R programming and Hadoop for data driven decision support and statistical data analysis.
  • The popular insurance giant Lloyd’s uses R language to create motion charts that provide analysis reports to investors.
  • Google uses R programming to analyse the effectiveness of online advertising campaigns, predict economic activities and measure the ROI of advertising campaigns.
  • Facebook uses R language to analyse the status updates and create the social network graph.
  • Zillow makes use of R programming to estimate housing prices.

Limitations of R Language

  • R programming has a steep learning curve for professionals who do not come from a programming background (professionals hailing from a GUI world like that of Microsoft Excel).
  • Working with R can at times be slow if the code is written poorly; however, there are solutions to this, such as FastR, pqR and Renjin.

Data Science with Python or R Programming- What to learn first?

There are certain strategies that will help professionals decide whether to begin learning data science with Python or with R:

  • If professionals know what kind of project they will be working on, they can decide which language to learn first. If the project requires working with jumbled or scraped data from files, websites or other sources, they should start their learning with Python. On the other hand, if the project requires working with clean data, they should focus first on the data analysis part, which means learning R first.
  • It is always better to be on par with your team, so find out which data science programming language they are using, R or Python. Collaboration and learning become much easier if you and your teammates are on the same language paradigm.
  • Trends in data scientist job postings will help you make a better decision on what to learn first, R or Python.
  • Last but not least, consider your personal preferences: what interests you more, and which is easier for you to grasp.

Having understood briefly about Python and R, the bottom line is that it is difficult to pick any one language, Python or R, to learn first in order to crack data scientist jobs at top big data companies. Each one has its own advantages and disadvantages depending on the scenarios and tasks to be performed. Thus, the best solution is to make a smart move based on the strategies listed above, decide which language you should learn first to fetch you a job with a big data scientist salary, and later add to your skill set by learning the other language.

To read the original article on DeZyre, click here.

Originally Posted at: Data Science Programming: Python vs R

Nov 22, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

[Image: Ethics (Source)]

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> How to #Leap into the #FutureOfWork by @HowardHYu #JobsOfFuture by v1shal

>> The Key to DevOps for Big Data Applications: Containers and Stateful Storage by jelaniharper

>> #OpenAnalyticsDay: A Day for Analytics by v1shal

Wanna write? Click Here

[ NEWS BYTES]

>> Social media analytic tools for business marketers – mtltimes.ca Under Social Analytics

>> How to Get More Bang from Your Big Data Clusters – CIO Under Big Data

>> Five9 Aims To Unlock Insight From Contact Center With Artificial Intelligence – Forbes Under Artificial Intelligence

More NEWS? Click Here

[ FEATURED COURSE]

The Analytics Edge


This is an Archived Course
EdX keeps courses open for enrollment after they end to allow learners to explore content and continue learning. All features and materials may not be available, and course content will not be… more

[ FEATURED READ]

Big Data: A Revolution That Will Transform How We Live, Work, and Think


“Illuminating and very timely . . . a fascinating — and sometimes alarming — survey of big data’s growing effect on just about everything: business, government, science and medicine, privacy, and even on the way we think… more

[ TIPS & TRICKS OF THE WEEK]

Grow at the speed of collaboration
Research by Cornerstone OnDemand pointed out the need for better collaboration within the workforce, and the data analytics domain is no different. A rapidly changing and growing industry like data analytics is very difficult for an isolated workforce to keep up with. A good collaborative work environment facilitates a better flow of ideas, improved team dynamics, rapid learning, and an increased ability to cut through the noise. So, embrace collaborative team dynamics.

[ DATA SCIENCE Q&A]

Q:When you sample, what bias are you inflicting?
A: Selection bias:
– An online survey about computer use is likely to attract people more interested in technology than the typical user

Undercoverage bias:
– Sampling too few observations from a segment of the population

Survivorship bias:
– Observations at the end of the study are a non-random set of those present at the beginning of the investigation
– In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist
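
A small simulation (ours, not part of the original answer) makes the selection-bias example concrete: if the probability of answering an online tech survey grows with interest in technology, the sample mean overestimates the population mean:

```python
import random

random.seed(42)
# True "interest in technology" scores for the whole population.
population = [random.gauss(50, 15) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Online survey: the more tech-interested you are, the likelier you respond.
respondents = [x for x in population
               if random.random() < max(0.0, min(1.0, x / 100))]
sample_mean = sum(respondents) / len(respondents)

print(f"population mean:    {true_mean:.1f}")
print(f"biased sample mean: {sample_mean:.1f}")  # noticeably higher
```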

Source

[ VIDEO OF THE WEEK]

#HumansOfSTEAM feat. Hussain Gadwal, Mechanical Designer via @SciThinkers #STEM #STEAM

Subscribe on YouTube

[ QUOTE OF THE WEEK]

Getting information off the Internet is like taking a drink from a firehose. – Mitchell Kapor

[ PODCAST OF THE WEEK]

Understanding #BigData #BigOpportunity in Big HR by @MarcRind #FutureOfData #Podcast

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

Three-quarters of decision-makers (76 per cent) surveyed anticipate significant impacts on storage systems as a result of the “Big Data” phenomenon.

Sourced from: Analytics.CLUB #WEB Newsletter

March 6, 2017 Health and Biotech analytics news roundup

Here’s the latest in health and biotech analytics:

Mathematical Analysis Reveals Prognostic Signature for Prostate Cancer: University of East Anglia researchers used an unsupervised technique to categorize cancers based on gene expression levels. Their method was better able than current supervised methods to identify patients with more harmful variants of the disease.

Assisting Pathologists in Detecting Cancer with Deep Learning: Scientists at Google have trained deep learning models to detect tumors in images of tissue samples. These models beat pathologists’ diagnoses by one metric.

Patient expectations for health data sharing exceed reality, study says: The Humana study shows that, among other beliefs, most patients think doctors share more information than they actually do. They also expect information from digital devices will be beneficial.

NHS accused of covering up huge data loss that put thousands at risk: The UK’s National Health Service failed to deliver half a million medically relevant documents between 2011 and 2016. Officials had previously briefed Parliament about the failure, but not about its scale.

Entire operating system written into DNA at 215 Pbytes/gram: Yaniv Erlich and Dina Zielinski (New York Genome Center) used a “fountain code” to translate a 2.1 MB archive into DNA. They were able to retrieve the data by sequencing the resulting fragments, a process that was robust to mutations and loss of sequences.

Source: March 6, 2017 Health and Biotech analytics news roundup

Improving Self-Service Business Intelligence and Data Science

The heterogeneous complexities of big data present the foremost challenge in delivering that data to the end users who need it most. Those complexities are characterized by:

  • Disparate data sources: The influx of big data multiplied the sheer number of data sources almost exponentially, both external and internal. Moreover, the quantity of sources required today is made more complex by…
  • Multiple technologies powering those sources: For almost every instance in which SQL is still deployed, there is seemingly another application, use case, or data source which involves an assortment of alternative technologies. Moreover, accounting for the plethora of technologies in use today is frequently aggravated by contemporary…
  • Architecture and infrastructure complications: With numerous advantages for deployments in the cloud, on-premise, and in hybrid manifestations of the two, contemporary enterprise architecture and infrastructure is increasingly ensnared in a process which protracts time to value for accessing data. The dilatory nature of this reality is only worsened in the wake of…
  • Heightened expectations for data: As data becomes ever entrenched in the personal lives of business users, the traditional lengthy periods of business intelligence and data insight are becoming less tolerable. According to Dremio Chief Marketing Officer Kelly Stirman, “In our personal lives, when we want to use data to answer questions, it’s just a few seconds away on Google…And then you get to work, and your experience is nothing like that. If you want to answer a question or want some piece of data, it’s a multi-week or multi-month process, and you have to ask IT for things. It’s frustrating as well.”

However, a number of recent developments have taken place within the ever-shifting data landscape to substantially accelerate self-service BI and certain aspects of data science. The end result is that despite the variegated factors characterizing today’s big data environments, “for a user, all of my data looks like it’s in a single high performance relational database,” Stirman revealed. “That’s exactly what every analytical tool was designed for. But behind the scenes, your data’s spread across hundreds of different systems and dozens of different technologies.”

Avoiding ETL

Conventional BI platforms were routinely hampered by the ETL process, a prerequisite for both integrating and loading data into tools with schema at variance with that of source systems. The ETL process was significant for three reasons. It was the traditional way of transforming data for application consumption. It was typically the part of the analytics process which absorbed a significant amount of time—and skill—because it required the manual writing of code. Furthermore, it resulted in multiple copies of data which could be extremely costly to organizations. Stirman observed that, “Each time you need a different set of transformations you’re making a different copy of the data. A big financial services institution that we spoke to recently said that on average they have eight copies of every piece of data, and that consumes about 40 percent of their entire IT budget which is over a billion dollars.” ETL is one of the facets of the data engineering process which monopolizes the time and resources of data scientists, who are frequently tasked with transforming data prior to leveraging them.

Modern self-service BI platforms eschew ETL with automated mechanisms that provide virtual (instead of physical) copies of data for transformation. Thus, each subsequent transformation is applied to the virtual replication of the data with swift in-memory technologies that not only accelerate the process, but eliminate the need to dedicate resources to physical copies. “We use a distributed process that can run on thousands of servers and take advantage of the aggregate RAM across thousands of servers,” Stirman said. “We can execute these transformations dynamically and give you a great high-performance experience on the data, even though we’re transforming it on the fly.” End users can drive this process visually, without writing scripts.
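
As a rough illustration of the virtual-copy idea (our sketch, not Dremio's actual implementation), each transformation can be recorded against a virtual dataset and applied on the fly when the data is read, so no physical copy is ever made:

```python
class VirtualDataset:
    """Records transformations instead of materializing copies."""

    def __init__(self, source_rows, pipeline=None):
        self.source_rows = source_rows   # the single physical copy
        self.pipeline = pipeline or []   # recorded transformations

    def transform(self, fn):
        # Returns a new *virtual* dataset; no data is copied here.
        return VirtualDataset(self.source_rows, self.pipeline + [fn])

    def __iter__(self):
        # Apply the recorded chain lazily, row by row, at read time.
        for row in self.source_rows:
            for fn in self.pipeline:
                row = fn(row)
            yield row

raw = VirtualDataset([{"amt": "10"}, {"amt": "25"}])
cleaned = raw.transform(lambda r: {"amt": int(r["amt"])})
doubled = cleaned.transform(lambda r: {"amt": r["amt"] * 2})
print(list(doubled))  # [{'amt': 20}, {'amt': 50}]
```
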
Reflections

Today’s self-service BI and data science platforms have also expedited time to insight by making data more available than traditional solutions did. Virtual replications of datasets are useful in this regard because they are stored in the underlying BI solution—instead of in the actual source of data. Thus, these platforms can access that data without retrieving them from the initial data source and incurring the intrinsic delays associated with architectural complexities or slow source systems. According to Stirman, the more of these “copies of the data in a highly optimized format” such a self-service BI or data science solution has, the faster it is at retrieving relevant data for a query. Stirman noted this approach is similar to one used by Google, in which there are not only copies of web pages available but also “all these different ways of structuring data about the data, so when you ask a question they can give you an answer very quickly.” Self-service analytics solutions which optimize their data copies in this manner produce the same effect.

Prioritizing SQL

Competitive platforms in this space are able to account for the multiplicity of technologies the enterprise has to contend with in a holistic fashion. Furthermore, they’re able to do so by continuing to prioritize SQL as the preferred query language which is rewritten into the language relevant to the source data’s technology—even when it isn’t SQL. By rewriting SQL into the query language of the host of non-relational technology options, users effectively have “a single, unified future-proof way to query any data source,” Stirman said. Thus, they can effectively query any data source without understanding its technology or its query language, because the self-service BI platform does. In those instances in which “those sources have something you can’t express in SQL, we augment those capabilities with our distributed execution engine,” Stirman remarked.
User Experience

The crux of self-service platforms for BI and data science is that by eschewing ETL for quicker versions of transformation, leveraging in-memory technologies to access virtual copies of data, and re-writing queries from non-relational technologies into familiar relational ones, users can rely on their tool of choice for analytics. Business end users can choose Tableau, Qlik, or any other preferred tool, while data scientists can use R, Python, or any other popular data science platform. The fact that these solutions are able to facilitate these advantages at scale and in cloud environments adds to their viability. Consequently, “You log in as a consumer of data and you can see the data, and you can shape it the way you want to yourself without being able to program, without knowing these low level IT skills, and you get the data the way you want it through a powerful self-service model instead of asking IT to do it for you,” Stirman said. “That’s a fundamentally very different approach from the traditional approach.”


Source by jelaniharper

Nov 15, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

[Image: Statistics (Source)]

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ NEWS BYTES]

>> Why Big Data and Machine Learning are Essential for Cyber Security – insideBIGDATA Under Big Data Analytics

>> Amazon’s Big Step Into IoT – Seeking Alpha Under IOT

>> Hitachi and Tencent team up on ‘internet of things’ – Nikkei Asian Review Under Internet Of Things

More NEWS? Click Here

[ FEATURED COURSE]

Learning from data: Machine learning course


This is an introductory course in machine learning (ML) that covers the basic theory, algorithms, and applications. ML is a key technology in Big Data, and in many financial, medical, commercial, and scientific applicati… more

[ FEATURED READ]

Rise of the Robots: Technology and the Threat of a Jobless Future


What are the jobs of the future? How many will there be? And who will have them? As technology continues to accelerate and machines begin taking care of themselves, fewer people will be necessary. Artificial intelligence… more

[ TIPS & TRICKS OF THE WEEK]

Fix the Culture, spread awareness to get awareness
Adoption of analytics tools and capabilities has not yet caught up to industry standards. Talent has always been the bottleneck to achieving comparable enterprise adoption. One of the primary reasons is a lack of understanding and knowledge among stakeholders. To facilitate wider adoption, data analytics leaders, users, and community members need to step up to create awareness within the organization. An aware organization goes a long way in helping get quick buy-ins and better funding, which ultimately leads to faster adoption. So be the voice that you want to hear from leadership.

[ DATA SCIENCE Q&A]

Q:Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
A: * In long-tailed distributions, a high-frequency population is followed by a low-frequency population which gradually tails off asymptotically
* Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution
* The least frequently occurring 80% of items still matter, because together they make up a large proportion of the total population
* Classic models: Zipf’s law, Pareto distribution, power laws

Examples:
1) Natural language
– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table
– The most frequent word occurs roughly twice as often as the second most frequent, three times as often as the third most frequent…
– “The” accounts for 7% of all word occurrences (70,000 per million)
– “of” accounts for 3.5%, followed by “and”…
– Only 135 vocabulary items are needed to account for half the English corpus!

2) Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people

3) File size distribution of Internet traffic

Additional examples: hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites

Importance in classification and regression problems:
– The distribution is skewed
– Which metrics to use? Beware the accuracy paradox (classification); prefer the F-score or AUC
– Issue when using models that assume linearity (e.g., linear regression): apply a monotone transformation to the data (logarithm, square root, sigmoid function…)
– Issue when sampling: your data becomes even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, N.V. Chawla) or an anomaly detection approach
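
A small numerical sketch (ours, not from the original answer) of a Zipf-like long tail and the monotone log transform mentioned above:

```python
import math

ranks = range(1, 10_001)
freqs = [1.0 / r for r in ranks]   # Zipf's law: frequency ~ 1 / rank
total = sum(freqs)

# The head dominates: the first 20% of ranks carry most of the mass.
head_share = sum(freqs[:2_000]) / total
print(f"mass in top 20% of ranks: {head_share:.0%}")   # roughly 84%

# On a log-log scale the relationship is linear with slope about -1,
# the kind of monotone transformation that linear models prefer.
slope = (math.log(freqs[-1]) - math.log(freqs[0])) / (
    math.log(10_000) - math.log(1))
print(f"log-log slope: {slope:.2f}")
```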

Source

[ VIDEO OF THE WEEK]

@JustinBorgman on Running a data science startup, one decision at a time #Futureofdata #Podcast

Subscribe on YouTube

[ QUOTE OF THE WEEK]

Everybody gets so much information all day long that they lose their common sense. – Gertrude Stein

[ PODCAST OF THE WEEK]

@JohnTLangton from @Wolters_Kluwer discussed his #AI Lead Startup Journey #FutureOfData #Podcast

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

Brands and organizations on Facebook receive 34,722 Likes every minute of the day.

Sourced from: Analytics.CLUB #WEB Newsletter

To Trust A Bot or Not? Ethical Issues in AI

Given that we see fake profiles, and chatbots that misfire and miscommunicate, we would like your thoughts on whether there should be some sort of government registry for robots so that consumers know whether or not they are legitimate. If we had a registry for trolls and/or chatbots, would that ensure that people could feel more comfortable that they are dealing with a legitimate business, or would know whether a profile, troll or bot is fake? Is it time for a Good Housekeeping seal of approval for AI?

These are all provocative questions, and they are so new and so undefined that I am not sure there is a single answer. What do you think? Who should create such standards? Perhaps we should start by categorizing the types of AI?

Source by tony

Nov 08, 18: #AnalyticsClub #Newsletter (Events, Tips, News & more..)

[  COVER OF THE WEEK ]

[Image: Data security (Source)]

[ LOCAL EVENTS & SESSIONS]

More WEB events? Click Here

[ AnalyticsWeek BYTES]

>> The Modern Data Warehouse – Enterprise Data Curation for the Artificial Intelligence Future by analyticsweek

>> Analyzing Big Data: A Customer-Centric Approach by bobehayes

>> Analytics Implementation in 12 Steps: An Exhaustive Guide (Tracking Plan Included!) by analyticsweek

Wanna write? Click Here

[ NEWS BYTES]

>> Why Cloud (By Default) Gives You Security You Couldn’t Afford Otherwise – Forbes Under Cloud Security


>> Make it so: Red River Mill Employees FCU rebrands as Engage – Credit Union Journal Under Talent Analytics

More NEWS? Click Here

[ FEATURED COURSE]

Baseball Data Wrangling with Vagrant, R, and Retrosheet


Analytics with the Chadwick tools, dplyr, and ggplot…. more

[ FEATURED READ]

The Signal and the Noise: Why So Many Predictions Fail–but Some Don’t


People love statistics. Statistics, however, do not always love them back. The Signal and the Noise, Nate Silver’s brilliant and elegant tour of the modern science-slash-art of forecasting, shows what happens when Big Da… more

[ TIPS & TRICKS OF THE WEEK]

Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle it can lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric. This is the metric that matters the most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.

[ DATA SCIENCE Q&A]

Q:Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
A: * “Premature optimization is the root of all evil” (Knuth)
* At the beginning, a quick-and-dirty model is better
* Optimize later
Other answer:
– It depends on the context
– Is the remaining error acceptable? In domains like fraud detection or quality assurance, it may not be
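
A minimal sketch (ours, not from the original answer) of the quick-and-dirty-first approach: fit a trivial baseline and a fast off-the-shelf model, and only invest further if the remaining error is unacceptable for the use case; the built-in dataset is a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = [
    ("day-0 baseline", DummyClassifier(strategy="most_frequent")),
    ("day-5 quick model", RandomForestClassifier(n_estimators=50,
                                                 random_state=0)),
]
for name, model in models:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.2%}")

# Whether chasing the last few points of accuracy is worth 5 more days
# depends on the context: fraud detection may demand it; a marketing
# model probably does not.
```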

Source

[ VIDEO OF THE WEEK]

Surviving Internet of Things

Subscribe on YouTube

[ QUOTE OF THE WEEK]

Data beats emotions. – Sean Rad, founder of Ad.ly

[ PODCAST OF THE WEEK]

#BigData #BigOpportunity in Big #HR by @MarcRind #JobsOfFuture #Podcast

Subscribe: iTunes | GooglePlay

[ FACT OF THE WEEK]

94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.

Sourced from: Analytics.CLUB #WEB Newsletter

Three final talent tips: how to hire data scientists

This last post focuses on less tangible aspects, related to curiosity, clarity about what kind of data scientist you need, and having appropriate expectations when you hire.

8. Look for people with curiosity and a desire to solve problems

Radhika Kulkarni, PhD in Operations Research, Cornell University, teaching calculus as a grad student.

As I blogged previously, Greta Roberts of Talent Analytics will tell you that the top traits to look for when hiring analytical talent are curiosity, creativity, and discipline, based on a study her organization did of data scientists. It is important to discover whether your candidates have these traits, because they are necessary for finding practical solutions, and they separate such candidates from those who may get lost in theory. My boss Radhika Kulkarni, the VP of Advanced Analytics R&D at SAS, self-identified this pattern when she arrived at Cornell to pursue a PhD in math. This realization prompted her to switch to operations research, which she felt would allow her to investigate practical solutions to problems, which she preferred to more theoretical research.

That passion continues today, as you can hear Radhika describe in this video on moving the world with advanced analytics. She says “We are not creating algorithms in an ivory tower and throwing it over the fence and expecting that somebody will use it someday. We actually want to build these methods, these new procedures and functionality to solve our customers’ problems.” This kind of practicality is another key trait to evaluate in your job candidates, in order to avoid the pitfall of hires who are obsessed with finding the “perfect” solution. Often, as Voltaire observed, “Perfect is the enemy of good.” Many leaders of analytical teams struggle with data scientists who haven’t yet learned this lesson. Beating a good model to death for that last bit of lift leads to diminishing returns, something few organizations can afford in an ever-more competitive environment. As an executive customer recently commented during the SAS Analytics Customer Advisory Board meeting, there is an “ongoing imperative to speed up that leads to a bias toward action over analysis. 80% is good enough.”

9. Think about what kind of data scientist you need

Ken Sanford, PhD in Economics, University of Kentucky, speaking about how economists make great data scientists at the 2014 National Association of Business Economists Annual Meeting. (Photo courtesy of NABE)

Ken Sanford describes himself as a talking geek, because he likes public speaking. And he’s good at it. But not all data scientists share his passion and talent for communication. This preference may or may not matter, depending on the requirements of the role. As this Harvard Business Review blog post points out, the output of some data scientists will be to other data scientists or to machines. If that is the case, you may not care if the data scientist you hire can speak well or explain technical concepts to business people. In a large organization or one with a deep specialization, you may just need a machine learning geek and not a talking one! But many organizations don’t have that luxury. They need their data scientists to be able to communicate their results to broader audiences. If this latter scenario sounds like your world, then look for someone with at least the interest and aptitude, if not yet fully developed, to explain technical concepts to non-technical audiences. Training and experience can work wonders to polish the skills of someone with the raw talent to communicate, but don’t assume that all your hires must have this skill.

10. Don’t expect your unicorns to grow their horns overnight

Annelies Tjetjep, M.Sc., Mathematical Statistics and Probability from the University of Sydney, eating frozen yogurt.

Annie Tjetjep relates development for data scientists to frozen yogurt, an analogy that illustrates how she shines as a quirky and creative thinker, in addition to working as an analytical consultant for SAS Australia. She regularly encounters customers looking for data scientists who have only chosen the title, without additional definition. She explains: “…potential employers who abide by the standard definitions of what a ‘data scientist’ is (basically equality on all dimensions) usually go into extended recruitment periods and almost always end up somewhat disappointed – whether immediately because they have to compromise on their vision or later on because they find the recruit to not be a good team player….We always talk in dimensions and checklists but has anyone thought of it as a cycle? Everyone enters the cycle at one dimension that they’re innately strongest or trained for and further develop skills of the other dimensions as they progress through the cycle – like frozen yoghurt swirling and building in a cup…. Maybe this story sounds familiar… An educated statistician who picks up the programming then creativity (which I call confidence), which improves modelling, then business that then improves modelling and creativity, then communication that then improves modelling, creativity, business and programming, but then chooses to focus on communication, business, programming and/or modelling – none of which can be done credibly in Analytics without having the other dimensions. The strengths in the dimensions were never equally strong at any given time except when they knew nothing or a bit of everything – neither option being very effective – who would want one layer of froyo? People evolve unequally and it takes time to develop all skills and even once you develop them you may choose not to actively retain all of them.”

So perhaps you hire someone with their first layer of froyo in place and expect them to add layers over time. In other words, don’t expect your data scientists to grow their unicorn horns overnight. You can build a great team if they have time to develop as Annie describes, but it is all about having appropriate expectations from the beginning.

To learn more, check out this series from SAS on data scientists, where you can read Patrick Hall’s post on the importance of keeping the science in data science, interviews with data scientists, and more.

And if you want to check out what a talking geek sounds like, Ken will be speaking at a National Association of Business Economists event next week in Boston – Big Data Analytics at Work: New Tools for Corporate and Industry Economics. He’ll share the stage with another talking geek, Patrick Hall, a SAS unicorn I wrote about in my first post.

To read the original article on SAS, click here.

Originally Posted at: Three final talent tips: how to hire data scientists by analyticsweekpick