Data Science for Business: What you need to know about data mining and data-analytic thinking

by Foster Provost and Tom Fawcett

http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

Useful resource on basic statistical models (generating, testing, visualising), applied in business contexts.

My notes up to page 112 are as follows:

Examples of classification and regression tasks.

What is the probability that a class will respond to a situation?

e.g. which group is least likely to renew their contract

What is the probability an individual is part of a class?

e.g. a user displaying characteristics x, y, z has a 70% chance of being a young male

Value estimation (regression)

e.g. how much will a customer use this service?

Similarity matching

e.g. find company profiles that are similar to our most valuable customers

Clustering – (unsupervised) group individuals together based on similarity

Co-occurrence grouping (aka market-basket analysis) – look for ‘interactions’

e.g. what items are commonly purchased together
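The notes above are from an R-oriented book, but co-occurrence grouping is easy to sketch in plain Python: count how often each pair of items lands in the same basket. The basket data below is made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data: each basket is a set of purchased items.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "bread"},
]

# Count every unordered pair of items that appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs are the candidate 'interactions'.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Real market-basket analysis then filters these counts by support and lift, but the raw pair counts are the starting point.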

Profiling – useful for establishing trends and identifying outliers

e.g. what is the typical cell phone usage of this customer segment?

e.g. a credit card is used for odd purchases in an odd location

Link prediction

e.g. you and Karen share 10 friends -> would you like to be friends with Karen?

e.g. movie prediction algorithms
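The simplest link-prediction score is just the number of mutual friends, as in the Karen example. A minimal Python sketch with a made-up friendship graph:

```python
# Hypothetical friendship graph: user -> set of friends.
friends = {
    "you":   {"alice", "bob", "carol", "dan"},
    "karen": {"alice", "bob", "carol", "erin"},
    "frank": {"erin"},
}

def mutual_friend_score(a: str, b: str) -> int:
    """Score a candidate link by the number of shared friends."""
    return len(friends[a] & friends[b])

# Recommend the non-friend with the most mutual friends.
candidates = [u for u in friends if u != "you" and u not in friends["you"]]
best = max(candidates, key=lambda u: mutual_friend_score("you", u))
print(best, mutual_friend_score("you", best))  # prints: karen 3
```

Movie recommenders work on the same idea, but measure similarity between users' rating histories rather than shared friends.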

Data reduction – simplify larger data sets into smaller data sets; complex models into simpler models

More:

Who are the most profitable customers?

Which customers are most likely to pay late?

Which customers are most likely to respond to a campaign or advertisement?

Is there a statistically significant difference between the profitable and average customer groups?

With how much certainty can we assign individuals to groups?

Will a particular new customer be profitable? How profitable?

More general notes

‘Tree induction’ is what statistics would call model generation; the response–predictor relationships can be plotted in R using:

plot([response variable] ~ ., data = [data set])
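The core of tree induction is picking the attribute whose split most reduces entropy (highest information gain), then recursing. A minimal single-split Python sketch with made-up churn data (not the book's example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on one categorical attribute."""
    parent = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return parent - children

# Hypothetical customer data: does the customer churn?
rows = [
    {"plan": "basic",   "region": "east"},
    {"plan": "basic",   "region": "west"},
    {"plan": "premium", "region": "east"},
    {"plan": "premium", "region": "west"},
]
labels = ["churn", "churn", "stay", "stay"]

# Tree induction greedily picks the most informative attribute first.
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print(best)  # prints: plan  (it perfectly separates churners here)
```

A full induction algorithm repeats this choice on each resulting subset; in R this is what packages like rpart do under the hood.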

“(For simplification in the following examples) we ignore the need to normalize numeric measurements to a common scale. *Attributes such as Age and Income have vastly different ranges and they are usually normalized to a common scale to help with model interpretability*, as well as other things (to be discussed later). ”

- Interesting – how are age and income usually normalized?
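Partly answering my own question: the two most common schemes are z-score standardization (subtract the mean, divide by the standard deviation) and min-max scaling to [0, 1]. A minimal Python sketch with made-up Age/Income values:

```python
import statistics

ages = [23, 35, 47, 59]
incomes = [20_000, 45_000, 80_000, 150_000]

def z_score(xs):
    """Standardize to mean 0, (sample) standard deviation 1."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

def min_max(xs):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# After either transform, Age and Income share a common scale.
print(min_max(ages))
print(z_score(incomes))
```

In R, `scale()` does the z-score version in one call.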

Side-note: I got side-tracked thinking about Principal Components Analysis, which is used when a number of your variables are correlated. The aim of PCA is to take high dimensional data that is correlated and reduce the number of dimensions i.e. model simplification.

princomp([data set], scores=TRUE, cor=TRUE)

scores – also compute each observation’s score on each component, i.e. the data in the transformed space

cor – use the correlation matrix instead of the covariance matrix

Output: the proportion of variance explained by each component – useful.

Can visualise this by:

plot(princomp([data set], scores=TRUE, cor=TRUE))

biplot() – how do the directions of the original variables align? Variables pointing in different directions probably each add something; highly aligned variables probably aren’t adding much extra explanatory value.

- Find out – what does the ‘transformed’ data look like after PCA? What form is it in?
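To see what the transformed data (the scores) looks like, here is a from-scratch two-variable PCA in Python; using `cor=TRUE` in R corresponds to standardizing the variables first. The data and variable names are made up. For two standardized variables, the first principal axis can be computed in closed form.

```python
import math
import statistics

# Hypothetical correlated data: age and income (in thousands) for five customers.
age =    [20, 30, 40, 50, 60]
income = [25, 32, 45, 51, 67]

def standardize(xs):
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

x, y = standardize(age), standardize(income)
n = len(x)

# Sample covariance of the standardized variables = their correlation.
r = sum(a * b for a, b in zip(x, y)) / (n - 1)

# First principal axis angle; in general theta = 0.5 * atan2(2*cov_xy, var_x - var_y),
# which for two standardized, positively correlated variables is 45 degrees.
theta = 0.5 * math.atan2(2 * r, 0.0)

# The 'transformed' data = the scores: each observation projected onto the
# two orthogonal component directions. Same rows, new uncorrelated columns.
pc1 = [a * math.cos(theta) + b * math.sin(theta) for a, b in zip(x, y)]
pc2 = [-a * math.sin(theta) + b * math.cos(theta) for a, b in zip(x, y)]

var1 = statistics.variance(pc1)
var2 = statistics.variance(pc2)
print("proportion of variance, PC1:", var1 / (var1 + var2))
```

So the answer to my question: the transformed data has the same number of rows as the original, but its columns are the component scores – uncorrelated linear combinations of the original variables, ordered by how much variance they capture. That is what `princomp(...)$scores` returns in R.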

Resource: https://www.youtube.com/watch?v=Heh7Nv4qimU