Data Science for Business: What you need to know about data mining and data-analytic thinking
by Foster Provost and Tom Fawcett
Useful resource on basic statistical models (generating, testing, visualising), applied in business contexts.
My notes up to page 112 are as follows:
Examples of classification and regression tasks.
What is the probability a class will respond to a situation?
e.g. which group is least likely to renew their contract
What is the probability an individual is part of a class?
e.g. user displays x, y, z characteristics = 70% chance male youth
Value estimation (regression)
e.g. how much will a customer use this service?
Similarity matching
e.g. find company profiles that are similar to our most valuable customers
Clustering – (unsupervised) group individuals together based on similarity
Co-occurrence grouping (aka market-basket analysis) – look for ‘interactions’
e.g. what items are commonly purchased together
Profiling – useful for establishing trends and identifying outliers
e.g. what is the typical cell phone usage of this customer segment?
e.g. a credit card is used for odd purchases in an odd location
Link prediction
e.g. you and Karen share 10 friends -> would you like to be friends with Karen?
e.g. movie prediction algorithms
Data reduction – simplify larger data sets into smaller data sets; complex models into simpler models
Who are the most profitable customers?
Which customers are most likely to pay late?
Which customers are most likely to respond to a campaign or advertisement?
Is there a statistically significant difference between profitable and average customer groups?
With how much certainty can we characterise individuals into groups?
Will a particular new customer be profitable? How profitable?
More general notes
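‘Tree induction’ is the same idea as model generation in science stats: a tree model is fitted to (induced from) the data. As a first look in R, the response can be plotted against every other variable using: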
plot([response variable] ~ .,data=[data set])
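To fit and draw an actual tree rather than pairwise plots, a minimal sketch might look like the following. It assumes the rpart package (normally bundled with R as a recommended package) as a stand-in for the book's tree induction, and uses the built-in iris data purely as a placeholder for [data set].

```r
# Assumption: rpart stands in for the book's tree induction;
# iris is only a placeholder data set.
library(rpart)

# Induce a classification tree: Species is the response,
# "." means "use all other columns as predictors".
fit <- rpart(Species ~ ., data = iris, method = "class")

# Draw the tree skeleton, then label the splits and leaves.
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)

# Predict a class for each training row and check how often it matches.
pred <- predict(fit, iris, type = "class")
accuracy <- mean(pred == iris$Species)
```

Accuracy measured on the training rows is optimistic, so treat it only as a sanity check that the tree has learned something.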
“(For simplification in the following examples) we ignore the need to normalize numeric measurements to a common scale. Attributes such as Age and Income have vastly different ranges and they are usually normalized to a common scale to help with model interpretability, as well as other things (to be discussed later). ”
- Interesting – how are age and income usually normalized?
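Two common options (not necessarily the ones the book goes on to use) are z-score standardisation and min-max scaling. A rough sketch in R, where the customers data frame and its values are made up for illustration:

```r
# Hypothetical customers; the values are invented for illustration only.
customers <- data.frame(
  age    = c(23, 35, 47, 59, 62),
  income = c(21000, 48000, 52000, 90000, 120000)
)

# z-score standardisation: subtract the column mean, divide by the column sd.
# Each column ends up with mean 0 and standard deviation 1.
z <- as.data.frame(scale(customers))

# Min-max scaling: map each column onto the range [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
mm <- as.data.frame(lapply(customers, min_max))
```

After either transformation, age and income are on comparable scales, so neither dominates a distance calculation simply because its raw numbers are larger.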
Side-note: I got side-tracked thinking about Principal Components Analysis, which is used when a number of your variables are correlated. The aim of PCA is to take correlated, high-dimensional data and re-express it in a smaller number of uncorrelated dimensions, i.e. model simplification.
princomp([data set], scores=TRUE, cor=TRUE)
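scores – also return the scores, i.e. the data projected onto the principal components
cor – use the correlation matrix instead of the covariance matrix
Output: calling summary() on the result gives the proportion of the variance explained by each component – useful.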
Can visualise this by:
plot(princomp([data set], scores=TRUE, cor=TRUE))
biplot() – how do the directions of the variables (the arrows) align? Variables pointing in different directions you probably want to keep. Closely aligned variables are strongly correlated and probably aren’t adding too much explanatory value.
- Find out – what ‘transformed’ data looks like after PCA? What form is it in?
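A quick way to answer this is to look at the scores that princomp() returns. The sketch below uses the built-in USArrests data as a stand-in for [data set]; the variable names pca, prop_var and scores are my own.

```r
# USArrests (built into R) is only a stand-in for [data set].
pca <- princomp(USArrests, scores = TRUE, cor = TRUE)

# Proportion of variance explained by each component,
# the same figures that summary(pca) prints.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# The transformed data: a numeric matrix with one row per observation
# and one column per component (Comp.1, Comp.2, ...).
scores <- pca$scores
dim(scores)
head(scores, 3)

# Unlike the original variables, the component scores are
# uncorrelated with one another (off-diagonals are ~0).
round(cor(scores), 10)
```

So the transformed data has the same shape as the input, but each column is a weighted combination of the original variables. Keeping only the first few columns is what reduces the number of dimensions.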