Lecture 2: Notation and Terminology
University of British Columbia Okanagan
All models are wrong but some are useful.
The following are the first two sentences of your textbook, An Introduction to Statistical Learning, 2nd edition, which I will henceforth refer to as ISLR2.
“Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised.”
Goal of prediction: to be able to predict what the responses will be for future values of the input variables.
Goal of inference:
The Wage data set comprises the wages of 3000 male workers from the Mid-Atlantic region of the United States.
There are 11 variables (type ?Wage for details): year, age, maritl, race, education, region, jobclass, health, health_ins, logwage, and wage.
When analyzing this data, we might have different types of questions in mind …
Visualization adapted from stackoverflow discussion
What is the age range in this data set? range(age) = 18, 80
How does wage differ across education status?
Supervised learning is characterized by the presence of “answers” in the data set which are utilized to supervise the algorithm.
Put another way, the data comprise both inputs \(X\) and outputs \(Y\).
Depending on the format of \(Y\) (i.e. categorical or numeric), supervised learning will perform one of the following tasks:
Unsupervised learning attempts to learn relationships and patterns from data that are not labeled in any way.
In other words, we have only inputs \(X\) and no \(Y\).
Source: https://vas3k.com/blog/machine_learning/
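The distinction can be sketched in a few lines of base R. This is only an illustration on simulated data: the variable names, the true coefficients (2 and 3), and the choice of lm() and kmeans() as representative methods are assumptions made for the example, not part of the lecture.

```r
set.seed(1)
n <- 100
# Simulated inputs forming two well-separated groups (hypothetical data)
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 6))

# Supervised: outputs y are observed alongside x and "supervise" the fit
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
coef(fit)  # estimates should land near the true values 2 and 3

# Unsupervised: only x is available; kmeans() looks for groups without any y
groups <- kmeans(x, centers = 2, nstart = 10)
table(groups$cluster)
```

Here y is numeric, so the supervised task is a regression; with a categorical y the analogous task would be classification.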
Notation is not standard across different disciplines, courses, or textbooks. We adopt the same notation used in ISLR2:
Example: Wage has \(n = 3000\) observations (i.e. male workers in the Mid-Atlantic region) and \(p = 11\) variables: year, age, maritl, race, education, region, jobclass, health, health_ins, logwage, wage.
Let \(\textbf{X}\) denote an \(n \times p\) matrix whose (\(i\), \(j\))th element is \(x_{ij}\).
\[\begin{array}{cc} & \begin{array}{cccc} \text{col }1 & \text{col } 2 & & \text{col } p \end{array} \\ \begin{array}{cccc} \text{row }1 \\ \text{row }2 \\ \\ \text{row }n\end{array} & \left( \begin{array}{cccc} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{array} \right)\end{array}\]
Some may find it helpful to think of \(\textbf{X}\) as a spreadsheet of numbers with \(n\) rows and \(p\) columns. N.B. the first index (\(i\)) of \(x_{ij}\) is the row and the second index (\(j\)) is the column.
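As a small illustration of the row-then-column indexing, here is a toy matrix in base R. The dimensions and entries are arbitrary choices for the example and have nothing to do with the Wage data.

```r
# A toy matrix with n = 4 rows (observations) and p = 3 columns (variables)
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

n <- nrow(X)  # number of rows
p <- ncol(X)  # number of columns

# x_{23}: first index is the row (i = 2), second index is the column (j = 3)
X[2, 3]
```

R follows the same convention as the notation above: `X[i, j]` returns the element in row \(i\) and column \(j\).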
In reference to matrix \(\mathbf{X}\) we will either be referencing a row vector, \(x_i\), or a column vector \(\textbf{x}_j\).
Notice the slight change in little x font here:
row vector \(x_i\) is curly and not bold
column vector \(\textbf{x}_j\) is bold and straight
We refer to the \(i\)th row of \(\textbf{X}\) using \(x_i\).
Hence, \(\textbf{X}\) is comprised of the \(n\) row vectors \(x_1, x_2, \dots, x_n\) where \(x_i\) is a vector of length \(p\).
Typically, \(x_i\) stores all the variable measurements for the \(i\)th observation.
Vectors are by default represented as columns:
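In the ISLR2 notation adopted here, the \(i\)th row of \(\textbf{X}\), written as a column vector, would presumably be displayed as:

\[x_i = \left( \begin{array}{c} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{array} \right)\]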
Visualization of row vector \(x_i\)
For example, for the Wage data, \(x_7\) is a vector of length 11, consisting of year, age, race, and other values for the 7th individual.
We refer to the \(j\)th column of \(\textbf{X}\) using \(\textbf{x}_j\).
Hence, \(\textbf{X}\) is comprised of the \(p\) column vectors \(\textbf{x}_1, \textbf{x}_2, \dots, \textbf{x}_p\) where \(\textbf{x}_j\) is a vector of length \(n\).
Typically, \(\textbf{x}_j\) stores the measurements of a variable for all of the \(n\) observations.
Visualization of column vector \(\textbf{x}_j\)
For example, for the Wage data, \(\textbf{x}_1\) contains the \(n = 3000\) values for year (the year in which wage information was recorded for each worker in our data set).
[1] 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 2007 2007 2005 2003
[15] 2009 2009 2003 2006 2007 2003 2003 2005 2009 2007 2006 2005 2006 2004
[29] 2008 2003 2004 2005 2008 2005 2006 2003 2006 2004 2006 2003 2004 2007
[43] 2004 2005 2004 2008 2009 2007 2006 2006 2005 2008 2004 2003 2009 2004
[57] 2006 2003 2004 2003 2005 2006 2005 2004 2003 2003 2009 2003 2004 2008
[71] 2004 2006 2006 2006 2009 2003 2003 2006 2003 2008 2005 2004 2009 2003
[85] 2007 2006 2004 2009 2009 2009 2004 2003 2003 2005 2006 2009 2003 2006
[99] 2009 2005 2009 2008 2004 2005 2008 2006 2007 2003 2007 2006 2009 2008
[113] 2004 2006 2003 2007 2009 2006 2008 2005 2006 2009 2006 2008 2009 2005
[127] 2004 2004 2009 2003 2007 2008 2007 2008 2005 2004 2009 2007 2004 2003
[141] 2004 2006 2003 2007 2005 2004 2005 2004 2007 2004 2007 2008 2003 2003
[155] 2003 2008 2005 2004 2003 2005 2007 2008 2008 2007 2003 2007 2003 2004
[169] 2009 2006 2005 2008 2007 2004 2009 2006 2003 2004 2003 2006 2006 2005
[183] 2006 2009 2003 2006 2004 2008 2006 2005 2004 2005 2004 2003 2006 2009
[197] 2008 2006 2005 2004 2006 2006 2006 2003 2007 2007 2007 2006 2005 2007
[211] 2005 2006 2008 2009 2008 2003 2008 2008 2009 2009 2008 2003 2006 2004
[225] 2005 2003 2009 2006 2007 2003 2003 2006 2006 2007 2004 2009 2004 2006
[239] 2007 2007 2009 2006 2006 2004 2005 2007 2004 2009 2006 2007 2003 2006
[253] 2005 2007 2008 2004 2004 2008 2008 2009 2007 2003 2009 2008 2004 2009
[267] 2007 2006 2004 2003 2007 2007 2003 2005 2007 2008 2004 2003 2006 2005
[281] 2003 2004 2008 2009 2009 2005 2004 2005 2005 2007 2004 2005 2009 2003
[295] 2006 2004 2007 2003 2005 2009 2003 2003 2005 2004 2005 2006 2007 2006
[309] 2004 2006 2007 2007 2009 2004 2007 2007 2007 2005 2006 2008 2004 2009
[323] 2005 2009 2008 2003 2003 2008 2003 2005 2004 2009 2009 2008 2007 2003
[337] 2005 2004 2008 2005 2004 2004 2005 2004 2007 2007 2008 2007 2003 2005
[351] 2004 2009 2005 2009 2005 2003 2004 2006 2009 2007 2007 2003 2004 2005
[365] 2006 2003 2003 2008 2009 2004 2007 2006 2004 2004 2003 2004 2008 2009
[379] 2008 2004 2004 2008 2007 2005 2004 2004 2003 2005 2004 2008 2008 2008
[393] 2005 2008 2003 2007 2009 2009 2003 2004 2006 2005 2004 2009 2004 2009
[407] 2004 2003 2006 2003 2003 2009 2003 2008 2006 2005 2009 2009 2004 2005
[421] 2005 2005 2009 2004 2006 2008 2003 2007 2005 2009 2006 2005 2009 2007
[435] 2003 2003 2007 2007 2005 2005 2009 2004 2004 2006 2005 2009 2007 2003
[449] 2003 2005 2009 2007 2009 2007 2007 2009 2007 2005 2008 2004 2005 2008
[463] 2006 2008 2009 2008 2005 2003 2003 2004 2006 2005 2003 2003 2006 2007
[477] 2005 2006 2009 2003 2007 2006 2005 2009 2004 2009 2009 2003 2007 2007
[491] 2008 2004 2007 2006 2005 2003 2007 2005 2009 2003 2003 2005 2003 2003
[505] 2003 2008 2005 2003 2004 2007 2003 2004 2004 2006 2004 2006 2006 2009
[519] 2004 2004 2003 2003 2007 2005 2005 2009 2003 2007 2003 2009 2003 2004
[533] 2009 2003 2007 2006 2005 2005 2004 2003 2005 2009 2006 2009 2008 2003
[547] 2007 2007 2005 2007 2007 2003 2009 2003 2008 2007 2003 2006 2007 2007
[561] 2003 2009 2007 2006 2008 2004 2004 2005 2004 2003 2004 2004 2007 2008
[575] 2003 2007 2006 2005 2006 2008 2004 2008 2007 2004 2005 2008 2003 2007
[589] 2007 2005 2008 2003 2004 2004 2006 2006 2004 2009 2009 2009 2005 2003
[603] 2008 2005 2006 2004 2006 2009 2006 2006 2005 2008 2008 2009 2006 2006
[617] 2006 2007 2003 2008 2008 2007 2005 2006 2008 2008 2009 2006 2007 2006
[631] 2004 2004 2005 2006 2007 2008 2008 2009 2004 2003 2008 2009 2007 2009
[645] 2008 2003 2005 2007 2005 2003 2009 2004 2007 2006 2008 2006 2003 2006
[659] 2008 2005 2003 2004 2006 2003 2009 2006 2008 2009 2003 2004 2007 2008
[673] 2004 2004 2007 2007 2008 2003 2003 2007 2003 2004 2004 2008 2007 2007
[687] 2009 2003 2009 2006 2007 2003 2003 2005 2003 2008 2005 2009 2005 2005
[701] 2003 2004 2008 2004 2008 2003 2006 2009 2003 2004 2008 2007 2007 2005
[715] 2007 2005 2005 2005 2003 2006 2008 2007 2008 2004 2007 2005 2005 2008
[729] 2004 2008 2009 2006 2004 2003 2007 2009 2004 2009 2005 2006 2008 2008
[743] 2008 2007 2007 2008 2003 2003 2007 2005 2004 2003 2008 2009 2004 2008
[757] 2006 2005 2005 2007 2006 2003 2005 2006 2004 2005 2005 2006 2008 2008
[771] 2004 2003 2008 2004 2006 2008 2003 2008 2003 2007 2004 2008 2008 2003
[785] 2008 2004 2005 2003 2003 2009 2007 2009 2004 2006 2008 2006 2003 2009
[799] 2009 2007 2003 2005 2006 2003 2007 2006 2005 2003 2008 2004 2003 2007
[813] 2006 2003 2007 2007 2007 2007 2009 2006 2004 2008 2003 2004 2009 2005
[827] 2005 2007 2009 2007 2007 2003 2006 2007 2008 2009 2003 2007 2007 2003
[841] 2004 2004 2004 2008 2009 2005 2003 2007 2004 2009 2005 2004 2009 2004
[855] 2005 2008 2007 2003 2004 2006 2009 2009 2005 2007 2004 2005 2005 2003
[869] 2005 2003 2007 2005 2009 2005 2003 2009 2006 2006 2005 2009 2006 2006
[883] 2005 2003 2008 2003 2006 2006 2004 2003 2004 2005 2005 2009 2004 2009
[897] 2004 2003 2004 2003 2004 2009 2009 2008 2006 2007 2007 2004 2006 2008
[911] 2003 2004 2003 2005 2003 2005 2009 2004 2004 2004 2005 2004 2005 2004
[925] 2008 2005 2005 2005 2008 2006 2009 2003 2009 2003 2003 2008 2004 2003
[939] 2003 2005 2008 2007 2007 2009 2005 2009 2003 2003 2004 2005 2006 2009
[953] 2009 2009 2007 2004 2003 2008 2008 2004 2006 2006 2005 2006 2007 2007
[967] 2006 2008 2007 2008 2006 2007 2008 2009 2006 2005 2003 2007 2006 2004
[981] 2003 2004 2009 2008 2009 2006 2007 2003 2003 2009 2005 2005 2008 2006
[995] 2004 2007 2006 2005 2008 2003 2007 2007 2004 2005 2005 2007 2006 2005
[1009] 2007 2004 2003 2009 2006 2007 2009 2007 2004 2006 2006 2005 2006 2003
[1023] 2007 2007 2003 2009 2007 2006 2007 2008 2004 2003 2006 2006 2004 2008
[1037] 2007 2006 2009 2004 2008 2008 2007 2006 2009 2009 2003 2005 2003 2009
[1051] 2007 2006 2007 2006 2005 2005 2004 2008 2008 2003 2004 2006 2008 2007
[1065] 2005 2007 2004 2008 2005 2007 2005 2006 2006 2006 2003 2004 2007 2004
[1079] 2009 2004 2009 2006 2006 2005 2005 2003 2004 2005 2006 2004 2008 2005
[1093] 2003 2006 2003 2009 2005 2008 2004 2003 2004 2005 2007 2005 2004 2004
[1107] 2004 2006 2003 2005 2008 2004 2006 2007 2007 2004 2007 2006 2003 2009
[1121] 2003 2007 2003 2003 2007 2005 2005 2007 2005 2008 2007 2009 2005 2008
[1135] 2007 2003 2006 2004 2008 2004 2006 2008 2003 2006 2008 2006 2004 2007
[1149] 2008 2004 2009 2003 2007 2009 2004 2004 2005 2004 2006 2006 2004 2008
[1163] 2005 2004 2006 2005 2007 2007 2004 2004 2008 2009 2004 2005 2003 2007
[1177] 2008 2009 2008 2005 2003 2008 2003 2009 2007 2009 2009 2006 2003 2006
[1191] 2003 2006 2006 2009 2007 2007 2009 2006 2004 2007 2005 2004 2005 2006
[1205] 2007 2007 2007 2004 2003 2003 2003 2003 2006 2009 2006 2003 2008 2005
[1219] 2003 2003 2007 2003 2006 2009 2003 2003 2009 2009 2005 2007 2006 2003
[1233] 2005 2006 2009 2004 2005 2007 2004 2008 2005 2003 2006 2008 2004 2003
[1247] 2009 2005 2003 2004 2004 2008 2003 2004 2005 2007 2004 2006 2005 2009
[1261] 2008 2003 2003 2005 2007 2007 2004 2005 2005 2004 2004 2003 2003 2008
[1275] 2009 2004 2003 2008 2006 2005 2009 2004 2003 2007 2008 2005 2003 2004
[1289] 2008 2005 2004 2004 2008 2009 2005 2008 2004 2005 2006 2009 2009 2009
[1303] 2006 2004 2006 2009 2003 2008 2009 2003 2005 2006 2005 2005 2007 2008
[1317] 2003 2004 2009 2003 2007 2005 2006 2005 2006 2004 2003 2004 2003 2004
[1331] 2008 2004 2005 2004 2009 2003 2008 2003 2008 2003 2006 2008 2003 2009
[1345] 2004 2009 2007 2005 2003 2003 2007 2006 2008 2007 2003 2008 2009 2004
[1359] 2005 2003 2006 2005 2007 2004 2004 2005 2006 2009 2009 2004 2003 2004
[1373] 2003 2004 2006 2008 2009 2003 2005 2008 2004 2009 2005 2007 2005 2005
[1387] 2004 2007 2003 2007 2008 2005 2004 2007 2003 2003 2008 2003 2006 2008
[1401] 2003 2008 2006 2009 2004 2008 2006 2003 2007 2004 2009 2008 2004 2005
[1415] 2003 2004 2007 2007 2009 2009 2004 2006 2009 2007 2005 2005 2009 2005
[1429] 2004 2008 2003 2005 2006 2006 2004 2005 2004 2004 2003 2004 2003 2005
[1443] 2003 2004 2003 2003 2006 2004 2007 2008 2005 2008 2005 2003 2006 2006
[1457] 2006 2008 2006 2004 2009 2003 2003 2007 2009 2004 2005 2003 2006 2004
[1471] 2008 2003 2006 2006 2007 2004 2003 2004 2004 2005 2009 2006 2005 2005
[1485] 2008 2007 2008 2003 2003 2007 2004 2008 2007 2008 2008 2003 2004 2008
[1499] 2008 2004 2003 2003 2004 2006 2009 2009 2006 2003 2003 2008 2008 2004
[1513] 2008 2007 2005 2003 2003 2006 2003 2006 2004 2003 2005 2006 2007 2007
[1527] 2009 2006 2006 2004 2004 2003 2004 2007 2003 2007 2005 2004 2007 2006
[1541] 2004 2008 2008 2007 2007 2008 2008 2008 2004 2005 2007 2005 2005 2003
[1555] 2003 2009 2006 2008 2003 2005 2008 2008 2004 2007 2008 2003 2008 2004
[1569] 2009 2004 2007 2004 2008 2004 2009 2005 2006 2003 2003 2006 2006 2009
[1583] 2007 2004 2004 2004 2004 2008 2003 2005 2005 2003 2009 2006 2003 2007
[1597] 2008 2007 2005 2005 2008 2009 2006 2008 2007 2007 2005 2009 2003 2009
[1611] 2009 2008 2006 2003 2008 2003 2006 2006 2004 2006 2008 2004 2004 2007
[1625] 2005 2007 2005 2006 2007 2004 2003 2007 2003 2009 2003 2008 2008 2009
[1639] 2009 2003 2003 2006 2005 2005 2009 2005 2008 2004 2005 2004 2005 2004
[1653] 2009 2005 2004 2007 2004 2004 2009 2005 2003 2005 2003 2009 2003 2003
[1667] 2006 2004 2003 2005 2009 2009 2003 2004 2004 2008 2006 2004 2006 2003
[1681] 2007 2009 2009 2008 2003 2005 2008 2009 2004 2009 2003 2005 2003 2003
[1695] 2008 2005 2009 2006 2007 2003 2005 2005 2004 2007 2007 2009 2009 2006
[1709] 2006 2004 2005 2007 2005 2005 2007 2007 2003 2009 2009 2004 2008 2005
[1723] 2008 2004 2003 2009 2004 2007 2004 2003 2005 2005 2004 2009 2005 2003
[1737] 2004 2009 2003 2005 2004 2007 2004 2006 2008 2007 2008 2004 2005 2007
[1751] 2009 2003 2008 2009 2005 2007 2003 2005 2003 2004 2004 2007 2003 2007
[1765] 2006 2006 2003 2009 2005 2003 2004 2008 2009 2007 2009 2003 2008 2005
[1779] 2006 2006 2004 2006 2004 2009 2005 2009 2003 2004 2006 2006 2005 2004
[1793] 2003 2004 2005 2009 2004 2006 2003 2007 2009 2003 2003 2005 2005 2008
[1807] 2005 2006 2006 2004 2005 2003 2006 2006 2009 2003 2006 2003 2006 2008
[1821] 2006 2006 2004 2006 2008 2008 2005 2003 2008 2003 2006 2007 2009 2008
[1835] 2007 2003 2005 2005 2008 2009 2004 2007 2003 2006 2008 2003 2003 2009
[1849] 2007 2007 2003 2005 2005 2005 2004 2003 2005 2005 2009 2008 2005 2005
[1863] 2006 2006 2003 2004 2008 2003 2005 2004 2007 2007 2009 2007 2006 2003
[1877] 2003 2005 2008 2009 2009 2009 2006 2009 2004 2004 2004 2004 2003 2009
[1891] 2009 2009 2009 2009 2005 2009 2006 2008 2009 2005 2004 2008 2007 2008
[1905] 2007 2009 2007 2004 2004 2004 2007 2008 2005 2005 2008 2004 2005 2009
[1919] 2005 2007 2007 2006 2005 2007 2004 2005 2007 2009 2007 2008 2003 2004
[1933] 2007 2008 2005 2005 2004 2003 2009 2005 2008 2003 2003 2008 2008 2009
[1947] 2004 2006 2005 2008 2006 2009 2004 2009 2009 2004 2004 2003 2004 2005
[1961] 2007 2008 2003 2006 2008 2007 2005 2003 2004 2006 2004 2004 2003 2008
[1975] 2009 2009 2004 2005 2006 2006 2006 2008 2008 2005 2008 2009 2004 2006
[1989] 2004 2007 2008 2005 2007 2003 2009 2004 2007 2005 2007 2009 2003 2005
[2003] 2003 2006 2004 2009 2005 2009 2005 2005 2007 2004 2003 2004 2006 2006
[2017] 2005 2008 2008 2006 2008 2004 2005 2005 2008 2004 2005 2007 2006 2004
[2031] 2006 2009 2004 2003 2004 2009 2008 2006 2008 2003 2005 2009 2008 2008
[2045] 2006 2005 2007 2009 2003 2003 2009 2006 2006 2003 2007 2008 2004 2005
[2059] 2004 2005 2004 2009 2006 2007 2009 2009 2006 2009 2005 2008 2007 2005
[2073] 2005 2006 2008 2005 2003 2003 2004 2008 2004 2004 2004 2004 2007 2006
[2087] 2006 2004 2003 2006 2009 2008 2004 2008 2009 2008 2006 2005 2003 2004
[2101] 2008 2008 2005 2007 2006 2005 2006 2005 2009 2005 2006 2006 2006 2003
[2115] 2004 2005 2004 2004 2008 2009 2003 2005 2004 2006 2009 2006 2003 2007
[2129] 2004 2005 2006 2008 2008 2008 2009 2007 2009 2008 2003 2008 2005 2006
[2143] 2003 2003 2007 2003 2008 2007 2003 2006 2008 2007 2005 2009 2007 2005
[2157] 2008 2007 2007 2005 2003 2005 2008 2006 2004 2004 2009 2003 2006 2007
[2171] 2003 2004 2003 2004 2004 2004 2007 2005 2007 2009 2005 2003 2008 2003
[2185] 2004 2003 2007 2007 2003 2004 2004 2005 2005 2008 2004 2007 2005 2009
[2199] 2007 2004 2004 2004 2006 2006 2003 2008 2009 2005 2008 2003 2005 2006
[2213] 2003 2004 2006 2008 2008 2006 2004 2004 2003 2004 2008 2009 2005 2004
[2227] 2004 2009 2006 2006 2003 2009 2007 2005 2009 2009 2007 2008 2007 2006
[2241] 2004 2005 2005 2005 2004 2009 2003 2008 2008 2009 2004 2008 2005 2008
[2255] 2005 2003 2007 2005 2003 2006 2009 2003 2008 2003 2008 2009 2007 2006
[2269] 2005 2003 2003 2004 2006 2009 2005 2009 2004 2003 2009 2006 2004 2003
[2283] 2007 2004 2003 2003 2007 2004 2007 2009 2007 2008 2004 2005 2009 2008
[2297] 2005 2006 2004 2004 2007 2008 2003 2009 2005 2005 2007 2006 2009 2004
[2311] 2005 2005 2008 2003 2008 2005 2009 2005 2009 2003 2005 2009 2007 2004
[2325] 2003 2003 2004 2004 2005 2008 2004 2009 2008 2003 2006 2006 2004 2003
[2339] 2003 2008 2003 2006 2003 2009 2003 2009 2004 2003 2004 2007 2008 2009
[2353] 2004 2007 2006 2008 2009 2003 2005 2006 2003 2007 2006 2006 2009 2009
[2367] 2008 2003 2008 2003 2004 2003 2004 2004 2007 2006 2009 2006 2006 2005
[2381] 2003 2008 2004 2004 2005 2009 2008 2009 2008 2006 2005 2008 2008 2008
[2395] 2005 2009 2006 2005 2008 2005 2008 2006 2009 2009 2007 2004 2004 2007
[2409] 2004 2003 2006 2003 2005 2004 2003 2008 2009 2006 2007 2004 2008 2004
[2423] 2006 2006 2003 2009 2008 2009 2006 2009 2003 2008 2007 2003 2004 2003
[2437] 2008 2007 2004 2008 2004 2007 2004 2009 2006 2009 2007 2007 2008 2006
[2451] 2004 2008 2003 2007 2006 2004 2008 2009 2009 2008 2007 2005 2005 2004
[2465] 2003 2008 2003 2003 2004 2008 2006 2004 2008 2009 2004 2008 2009 2003
[2479] 2005 2003 2005 2007 2004 2006 2008 2007 2006 2009 2004 2005 2004 2007
[2493] 2009 2008 2009 2005 2005 2009 2005 2007 2006 2008 2004 2008 2006 2004
[2507] 2005 2003 2005 2004 2008 2008 2003 2005 2007 2004 2003 2009 2005 2005
[2521] 2004 2004 2003 2009 2003 2003 2004 2004 2005 2006 2004 2008 2005 2004
[2535] 2004 2009 2003 2004 2003 2008 2007 2007 2005 2008 2003 2008 2006 2005
[2549] 2007 2009 2003 2003 2005 2009 2009 2004 2008 2005 2003 2003 2008 2003
[2563] 2005 2004 2005 2005 2005 2006 2005 2007 2007 2007 2005 2009 2008 2008
[2577] 2003 2003 2003 2008 2006 2003 2008 2004 2005 2007 2009 2007 2005 2003
[2591] 2009 2003 2009 2009 2006 2004 2005 2007 2009 2005 2006 2005 2008 2005
[2605] 2008 2008 2007 2007 2009 2006 2003 2006 2003 2008 2006 2007 2007 2003
[2619] 2008 2005 2007 2008 2003 2008 2005 2003 2003 2008 2003 2004 2005 2007
[2633] 2008 2009 2008 2005 2009 2006 2008 2005 2008 2009 2005 2004 2005 2009
[2647] 2004 2006 2003 2003 2006 2007 2004 2003 2005 2006 2008 2005 2007 2003
[2661] 2008 2006 2003 2006 2003 2005 2004 2006 2004 2009 2003 2007 2003 2007
[2675] 2003 2005 2009 2004 2006 2006 2006 2004 2006 2009 2006 2005 2006 2009
[2689] 2009 2003 2006 2005 2005 2006 2009 2006 2009 2008 2008 2003 2004 2004
[2703] 2006 2004 2005 2003 2005 2004 2004 2006 2008 2007 2006 2008 2003 2007
[2717] 2009 2007 2006 2004 2009 2003 2005 2008 2003 2007 2007 2008 2003 2003
[2731] 2005 2003 2005 2003 2004 2003 2006 2009 2007 2005 2009 2008 2008 2003
[2745] 2009 2005 2007 2006 2004 2004 2008 2006 2009 2009 2006 2007 2003 2004
[2759] 2006 2004 2003 2005 2003 2003 2005 2006 2003 2007 2003 2009 2007 2007
[2773] 2009 2003 2006 2008 2006 2005 2003 2008 2003 2004 2009 2006 2009 2005
[2787] 2008 2004 2008 2008 2006 2004 2004 2004 2005 2009 2003 2008 2006 2003
[2801] 2005 2007 2003 2005 2003 2004 2003 2005 2008 2004 2005 2005 2005 2004
[2815] 2004 2007 2005 2009 2004 2005 2003 2008 2003 2005 2005 2008 2004 2005
[2829] 2004 2008 2007 2005 2004 2003 2009 2006 2004 2009 2009 2004 2008 2009
[2843] 2006 2009 2007 2005 2005 2005 2004 2003 2005 2009 2003 2006 2009 2005
[2857] 2008 2003 2004 2007 2008 2009 2008 2009 2009 2003 2004 2008 2003 2004
[2871] 2004 2003 2009 2003 2008 2003 2004 2004 2006 2008 2007 2004 2007 2005
[2885] 2003 2006 2007 2004 2006 2007 2007 2009 2005 2003 2004 2006 2003 2005
[2899] 2003 2003 2009 2003 2005 2007 2009 2004 2006 2005 2005 2007 2007 2003
[2913] 2007 2003 2006 2004 2008 2003 2003 2009 2004 2006 2007 2007 2005 2005
[2927] 2009 2007 2003 2006 2003 2005 2005 2007 2007 2003 2007 2008 2008 2009
[2941] 2008 2006 2009 2005 2004 2004 2007 2007 2006 2009 2007 2009 2003 2004
[2955] 2003 2003 2003 2007 2007 2006 2004 2007 2009 2006 2004 2007 2003 2003
[2969] 2008 2009 2007 2003 2009 2005 2003 2008 2006 2007 2003 2009 2007 2003
[2983] 2005 2003 2008 2007 2003 2007 2007 2008 2009 2003 2007 2006 2009 2008
[2997] 2007 2005 2005 2009
Using the row and column notation just presented, the matrix \(\textbf{X}\) can be written:
\[\begin{equation} \textbf{X} = \left( \begin{array}{c} x_{1}^T \\ x_{2}^T \\ \vdots \\ x_{n}^T \end{array} \right) = (\textbf{x}_1, \textbf{x}_2, \dots, \textbf{x}_p) \end{equation}\]
The \({}^T\) notation denotes the transpose of a matrix or vector, e.g. \({x}_{i}^T =({x}_{i1}, {x}_{i2}, \dots, {x}_{ip})\).
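The two ways of writing \(\textbf{X}\) can be checked numerically. The sketch below uses a toy matrix with arbitrary entries; the object names (`x_2`, `xcol_3`, `X_rows`, `X_cols`) are hypothetical and chosen only for this example.

```r
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

x_2 <- X[2, ]     # row vector x_2: length p = 3
xcol_3 <- X[, 3]  # column vector (bold) x_3: length n = 4

# Stack the n rows x_i^T on top of each other
X_rows <- do.call(rbind, lapply(seq_len(nrow(X)), function(i) X[i, ]))

# Place the p columns x_j side by side
X_cols <- do.call(cbind, lapply(seq_len(ncol(X)), function(j) X[, j]))

# Both constructions recover the original matrix
all(X_rows == X)
all(X_cols == X)
```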
We use \(x\) to denote our input variable(s) and \(y\) to denote our output variable.
For instance, \(y_i\) may refer to the wage of the \(i\)th observation in the Wage data set, whose observed features are stored in \(x_i\).
The collection of all \(n\) observed outcomes forms the vector \(\textbf{y} = (y_1,y_2, \dots, y_n)^T\).
Our observed data consists of \(\{(x_1, y_1)\), \((x_2, y_2)\), \(\dots\), \((x_n, y_n)\}\), where each \(x_i\) is a vector of length \(p\).
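In tabular data, the pairs \((x_i, y_i)\) correspond to splitting each row into input columns and an output column. A minimal sketch, assuming a toy data frame whose values are made up (only the column names mimic Wage):

```r
# Toy data; the numbers are invented for illustration
toy <- data.frame(
  year = c(2003, 2005, 2007, 2009),
  age  = c(30, 41, 52, 25),
  wage = c(80, 95, 120, 70)
)

X <- as.matrix(toy[, c("year", "age")])  # n x p matrix of inputs
y <- toy$wage                            # vector of n outputs

i <- 3
x_i <- X[i, ]  # inputs for observation i (a vector of length p = 2)
y_i <- y[i]    # output for observation i
```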
Visualization of inputs and outputs in tabular data
Notation | Description | Representation |
---|---|---|
\(n\) | lower case “n” | number of samples |
\(\textbf{x}\), \(\textbf{y}\), \(\textbf{a}\) | lower-case bold | vectors of length \(n\) |
\(x\), \(x_i\), \(a\) | lower-case normal | vectors of length \(\neq n\) |
\(\textbf{X}\), \(\textbf{A}_{n\times p}\) | capital bold letters | matrix |
\(a\) | lower-case normal (beginning of alphabet) | scalars |
\(X\) | capital (end of alphabet) | random variables |
In the rare cases where the use of lower-case normal font leads to ambiguity, clarification will be provided using the following notation:
We will indicate an \(r \times s\) matrix using:
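For reference, ISLR2 indicates the dimension of an object as follows; this is assumed to be the notation intended here, with a scalar, a vector of length \(k\), and an \(r \times s\) matrix shown in turn:

\[a \in \mathbb{R}, \qquad a \in \mathbb{R}^{k}, \qquad \textbf{A} \in \mathbb{R}^{r \times s}\]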
\[Y = f(X) + \epsilon\]
We assume that data arises from the above formula where:
\(X=(X_1, X_2, \ldots, X_p)\) are inputs (also referred to as predictors, features, independent variables, among others)
\(Y\) is the output (also referred to as response, dependent variable, among others)
\(\epsilon\) is the error term (independent of \(X\) and with mean 0)
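To make the ingredients of this model concrete, data can be simulated from it. In the sketch below the "true" \(f\), the input range, and the error standard deviation are all arbitrary assumptions chosen for illustration.

```r
set.seed(311)
n <- 1000

f <- function(x) 1 + 2 * x         # hypothetical true f
X <- runif(n, min = 0, max = 10)   # inputs
eps <- rnorm(n, mean = 0, sd = 1)  # error term: mean 0, drawn independently of X
Y <- f(X) + eps                    # outputs generated by Y = f(X) + epsilon

mean(eps)    # should be close to 0
cor(X, eps)  # should be close to 0, since eps was generated independently of X
```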
Note that we are using capital letters to denote random variables in this context.
For ease of notation in this section, we use \(X\) to denote the input variable(s) which we distinguish using subscripts.
e.g. using the Wage data set we might consider the input variables \(X = (X_1, X_2)\), where \(X_1\) = year and \(X_2\) = age, and the response variable \(Y\) = wage.
Adapted from ISLR
The growth of bacteria in a population occurs in an exponential manner, as simulated here.
Important Note: For real (i.e. non-simulated) data, the function \(f\) is generally unknown and must be estimated based on the observed points.
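A rough sketch of estimating \(f\) from observed points, using simulated exponential growth. The true curve (starting value 10, growth rate 0.3) and the noise level are assumptions for this example, and the error is taken to be additive on the log scale so that an ordinary linear model can be used; this is a simplification, not necessarily how the figure was produced.

```r
set.seed(1)
time <- seq(0, 10, by = 0.5)

f <- function(t) 10 * exp(0.3 * t)  # assumed true f (unknown in practice)
# Observed counts: log(count) = log(f(time)) + error with mean 0
count <- exp(log(f(time)) + rnorm(length(time), mean = 0, sd = 0.1))

# Estimate f by fitting a straight line to log(count)
fit <- lm(log(count) ~ time)
f_hat <- function(t) exp(predict(fit, newdata = data.frame(time = t)))

coef(fit)[["time"]]  # estimated growth rate, should be near 0.3
f_hat(5)             # estimate of f(5); compare with f(5)
```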
Reasons for finding \(f\) fall into two primary categories:
Prediction: with inputs \(X\) available, our concern is predicting the output \(Y\).
Inference: we want to understand the relationship between \(X\) and \(Y\).
Often, we will be interested in both, perhaps to varying extent. Our goal will dictate what choices for \(f\) will be “better” for us.
Inference aims to answer how \(Y\) is affected by \(X\).
Since \(\hat f\) is used to model this relationship, we don’t want a black box; we want to understand its nuts and bolts.
Some related questions may include:
When choosing the “best” model for a problem, it is important to keep the underlying task in mind.
Generally speaking:
complicated models are often better at prediction but harder to understand;
simpler models tend to be easier to interpret but will not necessarily make accurate predictions.
In many cases a balancing act between the interpretation and accuracy of prediction is needed.
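The trade-off can be illustrated on simulated data. In this sketch the true relationship is assumed to be a sine curve, a straight line stands in for a "simple" model, and a degree-5 polynomial stands in for a "complicated" one; all of these are choices made for the example.

```r
set.seed(2)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.3)  # assumed nonlinear truth plus noise
dat <- data.frame(x = x, y = y)

# The rows are already in random order, so a fixed half/half split is fine here
train <- dat[1:100, ]
test <- dat[101:200, ]

fit_simple <- lm(y ~ x, data = train)             # two coefficients, easy to interpret
fit_flexible <- lm(y ~ poly(x, 5), data = train)  # six coefficients, harder to interpret

# Mean squared prediction error on data not used for fitting
mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)
mse_simple <- mse(fit_simple, test)
mse_flexible <- mse(fit_flexible, test)
c(simple = mse_simple, flexible = mse_flexible)
```

The slope of the simple model has a direct reading (change in \(Y\) per unit change in \(X\)), whereas the polynomial coefficients do not; in this simulation the flexible fit should nevertheless predict better because the truth is far from linear.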
Inspired by Fig 2.7 of ISLR2 (pg 25)
Source: Introduction to Machine Learning with the Tidyverse. by Dr. Alison Hill. rstudio::conf2020
If we do not have access to another “new” data set (as we often don’t), we can divide our data (randomly) into two non-overlapping sets:
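A random split into two non-overlapping sets could be sketched as follows. The placeholder data and the 70/30 proportion are arbitrary assumptions for the example, as are the names `train` and `test`.

```r
set.seed(3)
n <- 100
dat <- data.frame(x = rnorm(n), y = rnorm(n))  # placeholder data

# Randomly choose 70% of the row indices for the first set
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train <- dat[train_idx, ]  # rows used to fit the model
test <- dat[-train_idx, ]  # remaining rows, held out for evaluation

# No row appears in both sets
length(intersect(rownames(train), rownames(test)))
```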
Comments
A word of warning: