[ COVER OF THE WEEK ]
Big Data knows everything
[ LOCAL EVENTS & SESSIONS]
- Jul 24, 2018 #WEB Dual Lean Six Sigma Yellow Belt & Green Belt 4-Days Class in Saint Paul
- Jul 15, 2018 #WEB Dubai Workshop – Develop a Successful Bitcoin Startup Company Today!
- Jul 23, 2018 #WEB SAP FICO Certification Online Training in Houston, TX
[ AnalyticsWeek BYTES]
[ NEWS BYTES]
[ FEATURED COURSE]
[ FEATURED READ]
Antifragile is a standalone book in Nassim Nicholas Taleb's landmark Incerto series, an investigation of opacity, luck, uncertainty, probability, human error, risk, and decision-making in a world we don't understand. The…
[ TIPS & TRICKS OF THE WEEK]
Analytics Strategy that is Startup Compliant
With the right tools, capturing data is easy, but not being able to handle that data could lead to chaos. One of the most reliable startup strategies for adopting data analytics is TUM, or The Ultimate Metric. This is the metric that matters most to your startup. Some advantages of TUM: it answers the most important business question, it cleans up your goals, it inspires innovation, and it helps you understand the entire quantified business.
[ DATA SCIENCE Q&A]
Q: Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
A: * In long-tailed distributions, a high-frequency population is followed by a low-frequency population, which gradually tails off asymptotically
* Rule of thumb: the majority of occurrences (more than half, and 80% when the Pareto principle applies) are accounted for by the first 20% of items in the distribution
* The least frequently occurring 80% of items are more important as a proportion of the total population
* Zipf's law, Pareto distribution, power laws
1. Natural language
– Given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table
– The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent, and so on
– ‘the’ accounts for 7% of all word occurrences (70,000 out of 1 million)
– ‘of’ accounts for 3.5%, followed by ‘and’
– Only 135 vocabulary items are needed to account for half the English corpus!
2. Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people
3. File size distribution of Internet Traffic
Additional: Hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
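The rank-frequency rule from the natural language example can be sketched in a few lines of Python. This is only an illustration under an idealized assumption: the share of a word is exactly the top share divided by its rank (a Zipf exponent of 1). The function name `zipf_shares` and its parameters are hypothetical, chosen for this sketch.

```python
def zipf_shares(top_share, n_ranks):
    """Share of occurrences for ranks 1..n_ranks under an idealized Zipf law.

    Assumption (not from the source): share(rank) = top_share / rank,
    i.e. an exponent of exactly 1.
    """
    return [top_share / rank for rank in range(1, n_ranks + 1)]


# 'the' is quoted above at 7% of all word occurrences.
shares = zipf_shares(top_share=0.07, n_ranks=135)

# Rank 2 gets half the top share, in line with the 3.5% quoted for 'of'.
second_share = shares[1]

# Idealized combined share of the 135 most frequent words.
top_135_coverage = sum(shares)

print(f"rank 2 share: {second_share:.3f}")
print(f"coverage of top 135 ranks: {top_135_coverage:.3f}")
```

Real corpora only follow this law approximately, so the idealized coverage figure need not match the empirical "135 words cover half the corpus" observation.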
Importance in classification and regression problems:
– Skewed distribution
– Which metrics to use? Accuracy paradox (classification), F-score, AUC
– Issue when using models that assume linearity (e.g. linear regression): you may need to apply a monotone transformation to the data (logarithm, square root, sigmoid function)
– Issue when sampling: your data becomes even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (‘Synthetic Minority Over-sampling Technique’, NV Chawla), or an anomaly detection approach
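Two of the points above, the accuracy paradox and the monotone transformation, can be illustrated with a small Python sketch. The data here is made up for illustration (990 negatives, 10 positives, and a handful of skewed values), and the helper names are hypothetical rather than taken from any library.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    # With no predicted or actual positives, report 0.0 instead of dividing by zero.
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Illustrative skewed labels: 1% positives.
y_true = [0] * 990 + [1] * 10
# A "classifier" that always predicts the majority class.
y_pred = [0] * 1000

majority_accuracy = accuracy(y_true, y_pred)  # high, despite finding no positives
majority_f1 = f1_score(y_true, y_pred)        # exposes the failure on the rare class

# A monotone transformation (log base 10) compresses a long right tail.
skewed_values = [1, 10, 100, 1000, 10000]
log_values = [math.log10(v) for v in skewed_values]

print(majority_accuracy, majority_f1, log_values)
```

The always-negative predictor looks excellent on accuracy while its F-score is zero, which is why accuracy alone is a poor metric on skewed classes. The log transform keeps the ordering of the values but spaces them evenly, which is friendlier to models that assume linearity.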
[ VIDEO OF THE WEEK]
[ QUOTE OF THE WEEK]
You can use all the quantitative data you can get, but you still have to distrust it and use your own intelligence and judgment. Alvin Tof
[ PODCAST OF THE WEEK]
[ FACT OF THE WEEK]
40% projected growth in global data generated per year vs. 5% growth in global IT spending.