Presentation

Lecture 1: Introduction

Dr. Irene Vrbik

University of British Columbia Okanagan

Welcome!

Welcome to DATA 311: Machine Learning!

DATA 311 (3) Machine Learning

Regression, classification, resampling, model selection and validation, fundamental properties of matrices, dimension reduction, tree-based methods, unsupervised learning. [3-2-0]

Prerequisite: Either (a) one of STAT 205, STAT 230 or (b) a score more than 75% in one of APSC 254, BIOL 202, PSYO 373; and one of COSC 111, APSC 177.

A little about me

I am currently a Tenure-track Assistant Professor of Teaching
I have taught a variety of courses (from introductory data science to graduate courses in statistics) at several institutions (Guelph, McGill, MDS Program)
I am currently the Data Science program advisor and the articulation and curriculum representative

Where can you find me?

Office: SCI 104
Email: irene.vrbik@ubc.ca

Websites: irene.quarto.pub, irene.vrbik.ok.ubc.ca

Educational Background

McMaster University, BSc (Mathematics & Statistics)
University of Guelph, MSc (Applied Statistics)

Thesis: Using Individual-level Models to Model Spatio-temporal Combustion Dynamics. This involved modelling the spatio-temporal combustion dynamics of fire in a Bayesian framework. Supervisors: Rob Deardon and Zeng Feng.
University of Guelph, PhD (Applied Statistics)

Thesis: Non-Elliptical and Fractionally-Supervised Classification. This involved model-based classification with a particular emphasis on non-elliptical distributions. Supervisor: Paul D. McNicholas.

Experience

Postdoctoral Fellow at McGill University: Under the supervision of Dr. David Stephens, this work focused on the statistical and computational challenges associated with analyzing genetic data. It involved clustering and modeling HIV DNA sequences.

Postdoctoral Fellow at UBCO: Awarded by NSERC (Natural Sciences and Engineering Research Council of Canada), this research involved collaborations with faculty from several disciplines (e.g. Medical Physics, Biology, and Chemistry) and was supervised by Dr. Jason Loeppky.

Instructor at UBCO: A three-year contract position in the Department of Computer Science, Mathematics, Physics, and Statistics.

Research Interests

Statistics and Machine Learning in Curriculum Design

e.g. topic modeling in Data Science course calendars

Curricular Analytics: the systematic analysis and evaluation of educational curricula to gain insights into various aspects of curriculum design, delivery, and assessment.

e.g. metric calculation for various pathways, curriculum visualization, course recommendation systems

Tools for teaching, learning, and technology

e.g. PrairieLearn: an online problem-driven learning system for creating homework and tests

Course Syllabus

The course syllabus is a dynamic document which has been posted to Canvas and the course website. Many administrative questions can be answered there.

Course Tools

Canvas will be used for most course-related material:

Grades
Assignments (downloading/submitting)
Course announcements/discussions
Supplementary files (e.g. data sets, code, etc.)

Lectures

Lectures will be posted on our course webpage.
Take time to learn how to:
- navigate through the slides
- export to PDF (good for tablet annotation)
- use the clipboard (see the "Clipboard code" example; HTML only)

Programming Language

Any necessary coding will be done in R:
Relevant code will be posted to Supplementary Files on Canvas when necessary. Most relevant pieces will be embedded in the slides and/or included in Labs.
It is also recommended that you complete assignments using R Markdown in RStudio.

Clipboard code

How to use the clipboard

Hover over the code block below and you will see a copy icon in the top-right corner:

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Warning

This interactivity will only work on the HTML version of the slides, not PDF.

Why R?

Pros

Exposure to R in the Statistics prerequisite courses
Rich Ecosystem
Reproducibility
Textbook

Cons

Steep learning curve
Performance
Package Quality
Limited Industry Adoption

R was developed by statisticians, so it has a strong foundation in statistical modeling and analysis.
R has a vast and comprehensive ecosystem of packages and libraries specifically designed for data analysis, visualization, and statistics.
While R’s primary strength is statistics, it also offers a growing number of machine learning packages, such as caret, randomForest, and xgboost, making it suitable for a variety of data science tasks.

While R is a powerful language for data science, it also has some limitations and drawbacks that you should consider before choosing it for your data science projects. Some of the cons of using R for data science:

Learning Curve: R has a steeper learning curve compared to some other languages like Python, especially for those who don’t have a strong background in statistics. The syntax can be less intuitive for beginners.

Performance: R can be slower than languages like C++ or Python, particularly when dealing with large datasets or computationally intensive tasks. While there are ways to optimize R code, it may not be the best choice for high-performance computing.

Package Quality: While R has a rich ecosystem of packages, the quality and documentation of packages can vary widely. Some packages may be poorly maintained or lack support, which can lead to compatibility issues and frustration.

Limited Industry Adoption: While R is widely used in academia and certain industries like finance and healthcare, it may not be as prevalent in other sectors. This could impact job opportunities and collaboration with colleagues who use different tools.
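
As a rough illustration of the performance point above, the sketch below compares an explicit loop with its vectorized equivalent in base R. The function names `sum_squares_loop` and `sum_squares_vec` are hypothetical, chosen only for this example. Timings depend on your machine, so none are shown.

```r
# Hypothetical helper: sum of squares using an explicit loop
sum_squares_loop <- function(x) {
  total <- 0
  for (xi in x) {
    total <- total + xi^2
  }
  total
}

# Hypothetical helper: the same computation, vectorized
sum_squares_vec <- function(x) sum(x^2)

set.seed(311)
x <- rnorm(1e5)

# Both approaches agree up to floating-point error
all.equal(sum_squares_loop(x), sum_squares_vec(x))

# system.time() reports how long each version takes on your machine
system.time(sum_squares_loop(x))
system.time(sum_squares_vec(x))
```

Vectorized functions such as `sum()` push the loop into compiled code, which is one common way to speed up R without leaving the language.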

Lab Delivery

Labs will be held in person; students must be enrolled in a lab (which cannot conflict with other courses)
TAs provide guidance on carrying out analyses in R for the techniques discussed in lecture.
Knowledge of commands and programming techniques will be evaluated throughout the course.
Follow the instructions carefully and practice skills by completing (and redoing!) labs and assignments

Textbook

The main textbook reference for this course is:

ISLR: An Introduction to Statistical Learning with Applications in R (Second Edition). By: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani.

This book is available for free at statlearning.com, along with additional resources.

A secondary (less referenced) textbook is:

ESL: The Elements of Statistical Learning: data mining, inference, and prediction, 2nd edition. By: Hastie, Tibshirani, Friedman.

Can be downloaded for free at: hastie.su.domains/ElemStatLearn.

Lecture format

Slides will occasionally be supplemented with handwritten material.
Aside from doodling, substantial written material will be done digitally (on my iPad) and uploaded to Canvas.
Lectures may also include discussions which you will only gain access to by attending class.
You will not get the whole story by reading the slides!

Class Etiquette

Please be respectful, especially to other students.
Please be present. Attendance will not be taken, but you are encouraged to come and learn together.
Please restrict the use of electronic devices to course-related material; other content could be distracting.
Please be forgiving; instructors are people too, we will make mistakes.

Course Questions

In class

If you are stuck on a concept during lecture, please feel free to raise your hand and ask for clarification.
If you need help understanding something, chances are other students do too!
I will do my best to answer questions on the fly or organize a more thoughtful answer to be presented first thing next class or posted to Canvas.

Course Questions

Outside of class

Outside of class, I suggest asking course-related questions in the following order:

Consult the course syllabus
Post your question on the public forum on Canvas*
Come see me during student hours or visit your TA during lab (whichever comes first)
e-mail (weekdays are best)

Machine Learning

Machine learning (ML) is a subfield of artificial intelligence (AI) that uses algorithms and statistical models to learn from data to perform complex tasks.

e.g. recommend TV shows you might like, determine if an e-mail is spam or not, predict the selling price of a home.

How does Machine Learning work?

Machine Learning has been described¹ as:

“The field of study that makes computers capable of learning without being explicitly programmed.”

Cats

Dogs

Traditional Programming

Input:

e.g. If claws are sharp and nose is small …

Output: cat

Machine Learning

In supervised machine learning the computer must learn these distinguishing patterns for itself in order to determine a set of rules by which future data will be sorted.
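
The contrast between a hand-written rule and a learned rule can be sketched in base R. There is no cats-and-dogs data set built into R, so this sketch uses two species from the `iris` data as a stand-in. The hand-picked threshold of 5 and the helper names (`rule_by_hand`, `learn_threshold`, `rule_learned`) are assumptions made for illustration, and the "learning" step is a deliberately simplified single-threshold search rather than a method covered in this course.

```r
# Two-species subset of the built-in iris data (stand-in for cats vs. dogs)
flowers <- droplevels(subset(iris, Species != "setosa"))

# Traditional programming: a human writes the rule (threshold chosen by hand)
rule_by_hand <- function(petal_length) {
  ifelse(petal_length < 5, "versicolor", "virginica")
}

# Machine learning (very simplified): the threshold is learned from labeled data
learn_threshold <- function(x, y) {
  values <- sort(unique(x))
  candidates <- (head(values, -1) + tail(values, -1)) / 2  # midpoints
  accuracy <- sapply(candidates, function(threshold) {
    mean(ifelse(x < threshold, "versicolor", "virginica") == y)
  })
  candidates[which.max(accuracy)]
}

learned_cut <- learn_threshold(flowers$Petal.Length, flowers$Species)
rule_learned <- function(petal_length) {
  ifelse(petal_length < learned_cut, "versicolor", "virginica")
}

# Proportion of labeled examples each rule classifies correctly
acc_hand <- mean(rule_by_hand(flowers$Petal.Length) == flowers$Species)
acc_learned <- mean(rule_learned(flowers$Petal.Length) == flowers$Species)
c(by_hand = acc_hand, learned = acc_learned)
```

Both rules have the same form; the difference is who chooses the threshold. Note that accuracy here is measured on the same examples used for learning, which tends to be optimistic.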

General concepts

To continue with the cats and dogs, the more examples a human is given, the better they would become at distinguishing between the two species.

The more variety in the samples, the easier it may become to detect patterns and ultimately predict the result.

This process is often iterative.

Why is ML important?

Many of the statistical techniques you’ve (probably) learned thus far are either completely inapplicable to much of this data, or only applicable on a subset.

With growing access to data and computing power, models can be built faster than ever and used in countless fields to gain useful insights.

This course will guide you through a few of the classical approaches in Machine Learning (ML).

Supervised ML

Involves training a model on labeled examples. There are two main goals:

Classification: Assigning unseen examples to predefined categories (e.g., spam vs. not spam).
Regression: Predicting continuous values (e.g., predicting house prices based on features).
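
The two goals above can be sketched with base R and the built-in `mtcars` data. The choice of data set, predictor (`wt`), models (`lm` and logistic regression via `glm`), and the 0.5 cut-off are assumptions made for illustration, not the methods this course prescribes.

```r
# Regression: predict a continuous response (miles per gallon) from weight
reg_fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)   # hypothetical unseen car weighing 3000 lbs
pred_mpg <- predict(reg_fit, newdata = new_car)
pred_mpg

# Classification: predict a category (0 = automatic, 1 = manual) from weight
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob_manual <- predict(clf_fit, newdata = new_car, type = "response")
pred_class <- ifelse(prob_manual > 0.5, "manual", "automatic")
pred_class
```

In both cases the model is trained on examples where the answer (`mpg` or `am`) is known, then asked about a car it has not seen.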

Unsupervised ML

Involves training a model on unlabeled examples.

Clustering: Grouping similar data points into clusters (e.g., customer segmentation).
Dimensionality Reduction: Reducing the number of features while retaining the most important information (e.g., image compression).
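
Both unsupervised tasks can be sketched with base R by hiding the labels of the built-in `iris` data. The choice of k-means with three clusters and of principal component analysis (`prcomp`) is an assumption made for illustration.

```r
# Drop the Species column so the data are unlabeled
X <- iris[, 1:4]

# Clustering: group the flowers into 3 clusters (k = 3 is an assumption)
set.seed(311)
km <- kmeans(scale(X), centers = 3, nstart = 20)
table(km$cluster)   # number of flowers assigned to each cluster

# Dimensionality reduction: project 4 features onto 2 principal components
pca <- prcomp(X, scale. = TRUE)
X_reduced <- pca$x[, 1:2]
dim(X_reduced)      # 150 rows, 2 columns

# Share of the total variance retained by the first two components
summary(pca)$importance["Cumulative Proportion", 2]
```

Neither step uses `Species`; any structure found comes from the measurements alone.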

Supervised:

Data: a labeled dataset, where each input has a corresponding output label.
Goal: make accurate predictions or classifications on new, unseen data by learning from the provided examples.
Task: learn a mapping from inputs to outputs based on the training data, so the model can predict the labels for unseen data accurately.

Unsupervised

No “answers”: the dataset has no labeled responses.
An unsupervised model tries to learn the underlying patterns or structure from the data itself.
e.g. clustering (e.g. customer segmentation)
e.g. dimensionality reduction (e.g. image compression: reducing the size of high-resolution images for storage and transmission without losing significant details)
Also association: discovering rules that describe large portions of the data (e.g., market basket analysis); not covered in this course.

We will mainly stay in the prediction landscape, but there are many other examples of ML:

e.g. Google Photos face/voice recognition, sentiment analysis, recommendation systems, self-driving cars …
Medical diagnostics: e.g. detecting tumors in x-rays.
Sentiment analysis: a natural language processing (NLP) technique used to determine whether text is positive, negative, or neutral (e.g. classifying the tone of a tweet).
Self-driving cars: ML allows a car to collect data on its surroundings from cameras and other sensors, interpret it, and decide what actions to take.

iClicker

Students may choose to participate in iClicker questions by enrolling at: https://join.iclicker.com/BODT