DATA 311: Machine Learning

Lecture 1: Introduction

Dr. Irene Vrbik

University of British Columbia Okanagan

Welcome!

Welcome to DATA 311: Machine Learning

DATA 311 (3) Machine Learning

Regression, classification, resampling, model selection and validation, fundamental properties of matrices, dimension reduction, tree-based methods, unsupervised learning. [3-2-0]

Prerequisite: Either (a) one of STAT 205, STAT 230 or (b) a score more than 75% in one of APSC 254, BIOL 202, PSYO 373; and one of COSC 111, APSC 177.

A little about me

I am currently a Tenure-track Assistant Professor of Teaching
I have taught a variety of courses (from introductory data science to graduate courses in statistics) at several institutions (Guelph, McGill, MDS Program)
I am currently the Data Science program advisor and the articulation and curriculum representative

Where can you find me?

Office: SCI 104
Email: irene.vrbik@ubc.ca

Websites: irene.quarto.pub, irene.vrbik.ok.ubc.ca

Educational Background

McMaster University, BSc (Mathematics & Statistics)
University of Guelph, MSc (Applied Statistics)

Thesis: Using Individual-level Models to Model Spatio-temporal Combustion Dynamics. This involved modelling the spatio-temporal combustion dynamics of fire in a Bayesian framework. Supervisors: Rob Deardon and Zeng Feng.
University of Guelph, PhD (Applied Statistics)

Thesis: Non-Elliptical and Fractionally-Supervised Classification. This involved model-based classification with a particular emphasis on non-elliptical distributions. Supervisor: Paul D. McNicholas.

Experience

Postdoctoral Fellow at McGill University: Under the supervision of Dr. David Stephens, this work focused on the statistical and computational challenges associated with analyzing genetic data. It involved clustering and modeling HIV DNA sequences.

Postdoctoral Fellow at UBCO: Awarded by NSERC (Natural Sciences and Engineering Research Council of Canada), this research involved collaborations with faculty from several disciplines (e.g. Medical Physics, Biology, and Chemistry) and was supervised by Dr. Jason Loeppky.

Instructor at UBCO: A three-year contract position in the Department of Computer Science, Mathematics, Physics, and Statistics.

Research Interests

Statistics and Machine Learning in Curriculum Design

e.g. topic modeling in Data Science course calendars

Curricular Analytics: the systematic analysis and evaluation of educational curricula to gain insights into various aspects of curriculum design, delivery, and assessment.

e.g. metric calculation for various pathways, curriculum visualization, course recommendation systems

Tools for teaching, learning, and technology

e.g. PrairieLearn: an online problem-driven learning system for creating homework and tests

Course Syllabus

The course syllabus is a dynamic document which has been posted to our Canvas shell. Many administrative questions can be answered there.

Flexible Assessment

A Flexible Assessment tool has been integrated into Canvas to allow students to choose how their final grades will be weighted (within predefined ranges).
This system was created for a Teaching & Learning Enhancement Fund (TLEF) project to streamline the processes involved in Flexible Assessment.
It is based on a Flexible Assessment approach devised by Dr. Candice Rideout, a Professor of Teaching in the Faculty of Land & Food Systems; this approach has been shown to increase student satisfaction and self-regulation [1].

Student View

Grading breakdown

Grading Item	Default	Min Weight	Max Weight	Desired
Assignments	20%
(asgn 1)		0	5	[0--5]
(asgn 2)		0	5	[0--5]
(asgn 3)		0	5	[0--5]
(asgn 4)		0	5	[0--5]
Midterms	40%
(mid 1)		0	20	[0--20]
(mid 2)		0	20	[0--20]
Final Exam	40%	40	100	[40--100]
Total	100%			must add to 100

Student View

You will select the desired weight of each grading item in the Flexible Assessment tab in Canvas (opens after class).

Example

For example, if you know that assignment 3 and midterm 2 fall in a heavy week for you, you might choose to move some or all of that weight to the final exam (see the next slide for what that would look like in the Flexible Assessment tool).
Notice that only the weight of the final exam can increase from the default.

Flexible Assessment Example

Grading Item	Default	Allowable Range	Selected Weight
Assignments
(asgn 1)	5%	[0--5]	5
(asgn 2)	5%	[0--5]	5
(asgn 3)	5%	[0--5]	0
(asgn 4)	5%	[0--5]	5
Midterms
(mid 1)	20%	[0--20]	20
(mid 2)	20%	[0--20]	10
Final Exam	40%	[40--100]	40+10+5 = 55
Total	100%		100
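The arithmetic in the table above can be checked with a few lines of R. This is a minimal sketch and not the Canvas tool itself: the vector names and the `check_weights()` helper are hypothetical, chosen only for illustration. The weights and ranges are the ones shown in the grading breakdown.

```r
# Default weights and allowable ranges from the grading breakdown (in percent).
default <- c(asgn1 = 5, asgn2 = 5, asgn3 = 5, asgn4 = 5,
             mid1 = 20, mid2 = 20, final = 40)
min_w <- c(asgn1 = 0, asgn2 = 0, asgn3 = 0, asgn4 = 0,
           mid1 = 0, mid2 = 0, final = 40)
max_w <- c(asgn1 = 5, asgn2 = 5, asgn3 = 5, asgn4 = 5,
           mid1 = 20, mid2 = 20, final = 100)

# Selected weights from the example: asgn 3 set to 0, mid 2 reduced to 10.
selected <- c(asgn1 = 5, asgn2 = 5, asgn3 = 0, asgn4 = 5,
              mid1 = 20, mid2 = 10, final = NA)

# Any weight removed from assignments and midterms moves to the final exam.
selected["final"] <- 100 - sum(selected, na.rm = TRUE)

# Hypothetical helper: TRUE if every weight is within its allowable range
# and the weights add to 100.
check_weights <- function(w, lo, hi) {
  all(w >= lo) && all(w <= hi) && isTRUE(all.equal(sum(w), 100))
}

print(selected["final"])                     # weight carried by the final exam
print(check_weights(selected, min_w, max_w)) # is the selection valid?
```

The final exam ends up carrying its default 40% plus the 5% taken from assignment 3 and the 10% taken from midterm 2, matching the 55 in the table.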

Flexible Assessment Availability

Students may enter their desired percentages and comments from September 5, 6:30 PM to September 18, 11:59 PM.
For students who do not enter desired percentages, final grades will be calculated using the default weighting scheme.

Course Tools

Canvas will be used for most course-related material:

Grades
Assignments (downloading/submitting)
Course announcements/discussions
Supplementary files (e.g. data sets, code, etc.)

Lectures will be posted at irene.quarto.pub/data311-2023/. Take time to learn how to navigate through the slides, how to annotate them, and how to export to PDF.

Programming Language

Any necessary coding will be done in R:
Relevant code will be posted to Canvas and embedded in the slides when necessary
It is also recommended that you complete assignments using R Markdown in RStudio.

Clipboard code

A clipboard button appears when you hover over code

head(iris)

Why R?

Pros

Exposure to R in the Statistics prerequisite courses
Rich Ecosystem
Reproducibility
Textbook

Cons

Steep learning curve
Performance
Package Quality
Limited Industry Adoption

R was developed by statisticians, so it has a strong foundation in statistical modeling and analysis.
R has a vast and comprehensive ecosystem of packages and libraries specifically designed for data analysis, visualization, and statistics.
While R’s primary strength is statistics, it also offers a growing number of machine learning packages, such as caret, randomForest, and xgboost, making it suitable for a variety of data science tasks.

While R is a powerful language for data science, it also has limitations that you should consider before choosing it for your own data science projects:

Learning Curve: R has a steeper learning curve compared to some other languages like Python, especially for those who don’t have a strong background in statistics. The syntax can be less intuitive for beginners.

Performance: R can be slower than languages like C++ or Python, particularly when dealing with large datasets or computationally intensive tasks. While there are ways to optimize R code, it may not be the best choice for high-performance computing.

Package Quality: While R has a rich ecosystem of packages, the quality and documentation of packages can vary widely. Some packages may be poorly maintained or lack support, which can lead to compatibility issues and frustration.

Limited Industry Adoption: While R is widely used in academia and certain industries like finance and healthcare, it may not be as prevalent in other sectors. This could impact job opportunities and collaboration with colleagues who use different tools.

Lab Delivery

Labs will be held in person.
Students must be enrolled in a lab (which cannot conflict with other courses).
TAs provide guidance on carrying out analyses in R for the techniques discussed in lecture.
Knowledge of commands and programming techniques will be evaluated throughout the course.
Follow the instructions carefully and practice skills by completing (and redoing!) labs and assignments

Textbook

The main textbook reference for this course is:

ISLR: An Introduction to Statistical Learning with Applications in R (Second Edition). By: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani.

This book is available for free at statlearning.com; see resources here.

A secondary (less referenced) textbook is:

ESL: The Elements of Statistical Learning: data mining, inference, and prediction, 2nd edition. By: Hastie, Tibshirani, Friedman.

Can be downloaded for free at: hastie.su.domains/ElemStatLearn.

Lecture format

Slides will sometimes be supplemented with handwritten material.
Aside from doodling, substantial written material will be done digitally (on my iPad) and uploaded to Canvas.
Lectures may also include discussions which you will only gain access to by attending class.
You will not get the whole story by reading the slides!

Class Etiquette

Please be respectful, especially to other students
Please be present. Attendance will not be taken, but you are encouraged to come and learn together.
Please restrict the use of electronic devices to course-related material; other content could be distracting.
Please be forgiving; instructors are people too, and we will make mistakes.

Course Questions

In class

If you are stuck on a concept during lecture, please feel free to raise your hand and ask for clarification.
If you need help understanding something, chances are other students do too!
I will do my best to answer questions on the fly or organize a more thoughtful answer to be presented first thing next class or posted to Canvas.

Course Questions

Outside of class

Outside of class, I suggest asking course-related questions in this general order:

Consult the course syllabus
Post your question on the public forum on Canvas
Come see me during student hours or visit your TA during lab (whichever comes first)
E-mail me (weekdays are best)

Machine Learning

Machine learning (ML) is a subfield of artificial intelligence (AI) that uses algorithms and statistical models to learn from data to perform complex tasks.

e.g. recommend TV shows you might like, determine if an e-mail is spam or not, predict the selling price of a home.

How does Machine Learning work?

Machine Learning is often described as:

“The field of study that makes computers capable of learning without being explicitly programmed.”

The above is attributed to Arthur Lee Samuel, an early American leader in AI, who also happened to coin the term “Machine Learning” in 1959 while at IBM.

[Images: cats and dogs]

Traditional Programming

Input:

e.g. If claws are sharp and nose is small …

Output: cat
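In code, the traditional approach amounts to a person writing the rule by hand. A minimal sketch in R follows. The function name, its arguments, and the fallback label "dog" are assumptions made for illustration, and the rule covers only the part of the example that is written out on the slide.

```r
# Hypothetical hand-written rule: the programmer, not the computer, decides
# which features matter and how they combine.
classify_pet <- function(sharp_claws, small_nose) {
  if (sharp_claws && small_nose) "cat" else "dog"
}

classify_pet(sharp_claws = TRUE, small_nose = TRUE)
classify_pet(sharp_claws = FALSE, small_nose = TRUE)
```

Every new exception (a cat with trimmed claws, a dog with a small nose) would require someone to edit the rule by hand.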

Machine Learning

In machine learning, the computer must learn these distinguishing patterns for itself in order to determine a set of rules by which future data will be sorted.
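As a rough illustration of a rule being learned rather than written, the sketch below searches for a single cut-off that separates two classes. It is only one simple way to learn a rule, under stated assumptions: two species from R's built-in `iris` data stand in for cats and dogs, the helper names `learn_threshold()` and `predict_species()` are hypothetical, and the first class is assumed to have the smaller measurements.

```r
# Two species from the built-in iris data stand in for the two classes.
two <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Hypothetical helper: try every observed value of x as a cut-off and keep
# the one that misclassifies the fewest training examples.
# Assumes the first class level tends to have the smaller values of x.
learn_threshold <- function(x, y) {
  classes <- levels(y)
  candidates <- sort(unique(x))
  errors <- sapply(candidates, function(cutoff) {
    predicted <- ifelse(x <= cutoff, classes[1], classes[2])
    mean(predicted != y)
  })
  candidates[which.min(errors)]
}

# The cut-off is found from the data, not typed in by a person.
threshold <- learn_threshold(two$Petal.Length, two$Species)

# The learned rule can now be applied to new measurements.
predict_species <- function(petal_length) {
  ifelse(petal_length <= threshold, "setosa", "versicolor")
}

print(threshold)
predict_species(c(1.4, 4.5))
```

The person supplies labelled examples and a search procedure; the specific rule is whatever the data supports.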

General concepts

To continue with the cats and dogs example: the more examples a human is given, the better they become at distinguishing between the two species.

The more variety in the samples, the easier it may become to detect patterns and ultimately predict the result.

This process is often iterative.

General examples

Prediction

determine if an email is spam or genuine (classification)
predict housing prices (regression)
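The difference between the two kinds of prediction shows up in the type of response being modelled. The sketch below uses R's built-in `mtcars` data as a stand-in, since no spam or housing data accompanies these slides; linear and logistic regression are used only as familiar examples of each kind of model, and the object names are hypothetical.

```r
# Regression: the response is a number (here fuel economy, mpg).
reg_fit <- lm(mpg ~ wt, data = mtcars)
pred_mpg <- predict(reg_fit, newdata = data.frame(wt = 3))
print(pred_mpg)  # a numeric prediction

# Classification: the response is a category (here transmission type, am:
# 0 = automatic, 1 = manual), modelled with logistic regression.
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob_manual <- predict(clf_fit, newdata = data.frame(wt = 3),
                       type = "response")

# Turn the estimated probability into a class label with a 0.5 cut-off.
pred_class <- ifelse(prob_manual > 0.5, "manual", "automatic")
print(pred_class)  # a predicted category
```

In both cases the model is fit to known examples and then asked about a new observation; only the form of the answer differs.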

We will mainly stay in the prediction landscape, but there are many other examples of ML:

e.g. Google Photos face/voice recognition, sentiment analysis, recommendation systems, self-driving cars …

Why is ML important?

Many of the statistical techniques you’ve (probably) learned thus far are either completely inapplicable to much of the data now being collected, or applicable only to a subset of it.

With growing access to data and computing power, models can be built faster than ever and used in countless fields to gain useful insights.

This course will guide you through a few of the classical approaches in Machine Learning (ML).