CSMODEL_midterm_exam_megadoc


1a Introduction to Data 🟢

Syllabus

Examples of Data
Types of Data
Forms of Data
Data Modelling
Data Collection and Sampling

Types of Data

  1. Categorical (descriptive information, may have several categories)
    1. Ordinal (have a sense of ordering)
    2. Nominal (no order)
  2. Numerical (can objectively be measured)
    1. Discrete (specific value)
    2. Continuous (can take any value within a range)

Forms of Data

  1. Structured: adheres to some structured model
  2. Unstructured: does not adhere to a standardized format
  3. Graph-based: represented as pairwise relationships between objects
  4. Natural Language: information expressed in human language
  5. Audio & Video: not represented in text format
  6. Generated: recorded by applications without human intervention

Data Collection

Collected by other people

Correlation ≠ Causation

In an observational study, only correlation, i.e. a relationship between variables, can be concluded.
An experiment is needed to establish causation, by isolating all other variables aside from the one being treated.

More on Data Collection

Data may be collected not for any specific research question or purpose, but simply for future potential analysis
Data could already exist somewhere in the world, waiting to be collected for some research goal or purpose

Sampling

Sampling Bias

Simple Random Sampling

Stratified Random Sampling

SimplyPsychology: Stratified Sampling

Cluster Sampling

Multistage Sampling

Data Modelling

_attachments/Pasted image 20250619172606.png

  1. get the raw data
  2. load the data into a format understandable by computers
  3. estimate or represent the data using a systematic mathematical model
  4. then make conclusions, transformations, or generate insights from the model

Modelling should be backed up by a research question

Answering the research goal or question

  1. Data preprocessing
  2. Exploratory Data Analysis
  3. Data Modelling

1. Data Preprocessing

real world data is dirty

2. Exploratory Data Analysis

lets you build a general understanding of the data

data visualization is used to gain a better understanding of the data
_attachments/Pasted image 20250619174937.png

Mathematical Models

mathematical models allow us to estimate or represent the data systematically and to generate conclusions or insights from it

_attachments/Pasted image 20250619175212.png


1b Data Representation 🟢

Syllabus

Dataframe and Series
Basic Operations
Python Crash Course

Data must be represented in a data structure so that computer algorithms can be applied for analysis.

Data Matrix

Pandas (panel data)

DataFrame

Series

More on DataFrame

_attachments/Pasted image 20250619183035.png

import pandas as pd
?pd.read_csv  # view the documentation of read_csv (IPython/Jupyter syntax); read_csv is a function

Basic Operations

  1. View dataset info - view dataset information such as variables, datatype of each variable, number of observations, the number of missing values, etc.
  2. Select columns - select one column as a Series or select 2 or more columns as a DataFrame
  3. Select rows - select a single observation as a Series, select 2 or more observations as a DataFrame
  4. Filter rows - uses a conditional (like select observations where CSMODEL grade ≥ 3.0)
  5. Sort rows - ascending or descending
  6. Add new columns - add a new column to the DataFrame where (condition) (ex. Add a new column "Type" where the value is "Good" if CSMODEL grade ≥ 3.0 and "Okay" otherwise)
    _attachments/Pasted image 20250619183654.png
  7. Aggregate Data - get the average (or mean, median, etc.) of a column
    1. "aggregate": combining or summarizing multiple individual data points into a single, higher-level value or a smaller set of values. (common functions include βˆ‘, average/mean, count, min, max, median, mode)
      _attachments/Pasted image 20250619183648.png
  8. Group by variable - Group the data by section then get the min and max grade per section for CSMODEL
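
A minimal sketch of the operations above in pandas (the DataFrame, column names, and thresholds are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dan"],
    "Section": ["S17", "S17", "S18", "S18"],
    "CSMODEL": [3.5, 2.5, 4.0, 3.0],
})

df.info()                                    # 1. view dataset info
series = df["CSMODEL"]                       # 2. one column as a Series
cols = df[["Name", "CSMODEL"]]               # 2. two or more columns as a DataFrame
row = df.loc[0]                              # 3. a single observation as a Series
passed = df[df["CSMODEL"] >= 3.0]            # 4. filter rows with a condition
ranked = df.sort_values("CSMODEL", ascending=False)  # 5. sort rows
df["Type"] = df["CSMODEL"].apply(lambda g: "Good" if g >= 3.0 else "Okay")  # 6. add a new column
print(df["CSMODEL"].mean())                  # 7. aggregate a column
print(df.groupby("Section")["CSMODEL"].agg(["min", "max"]))  # 8. group by variable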

2a Data Preprocessing 🟢

Syllabus

Data Cleaning
Pre-processing Techniques

Data Cleaning

Separate Files

Multiple Representations

Incorrect Datatypes

Default Values

Missing Data

Duplicate Data

Inconsistent Format

Data Preprocessing

Querying

df = pd.DataFrame({
	"ID": [0, 1, 2, 3, 4],
	"CCPROG1": [3.0, 1.0, 3.5, 2.5, 4.0],
	"CCPROG2": [1.0, None, 4.0, 1.0, 4.0]
})

df.query("CCPROG1 >= 3.5") # give it a Boolean value
# we should be able to have row 2 and 4 selected
df.query('SECTION == "S17"') # query based on String value
# query rows based on multiple conditions
df.query("CCPROG1 >= 3.5 and Section == 'S17'")
df.query('index > 2') # returns rows 3 and 4, index is a reserved keyword
df.query('CCPROG1 > CCPROG2') # you would get index 0 and 3 (you cant compare a numerical value and none value)

Note: None is not necessarily 0

Looping

for cur_idx, cur_row in df.iterrows():
	print(cur_idx, cur_row["CCPROG1"]) # perform some operations on each row

Inplace parameter

Imputation

Method 1

Method 2

Numerical Imputation

Categorical Imputation
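
No snippets are attached for these; a minimal sketch in pandas, reusing the df from the Querying section (the mean/mode fill choices and the SECTION column are assumptions):

# numerical imputation: fill missing grades with the column mean
df["CCPROG2"] = df["CCPROG2"].fillna(df["CCPROG2"].mean())

# categorical imputation: fill missing categories with the most frequent value
df["SECTION"] = df["SECTION"].fillna(df["SECTION"].mode()[0])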

Binning

Numerical Binning

Categorical Binning
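
A sketch of numerical binning with pd.cut (the bin edges and labels are made up):

df["CCPROG1_level"] = pd.cut(df["CCPROG1"],
                             bins=[0.0, 2.0, 3.0, 4.0],
                             labels=["Low", "Mid", "High"])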

Outlier Detection

consideration:
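
The notes don't pin down a method here; one common sketch is the 1.5 × IQR rule:

q1, q3 = df["CCPROG1"].quantile([0.25, 0.75])
iqr = q3 - q1
# rows falling outside the 1.5*IQR fences are flagged as outliers
outliers = df[(df["CCPROG1"] < q1 - 1.5 * iqr) | (df["CCPROG1"] > q3 + 1.5 * iqr)]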

One-hot Encoding

_attachments/Pasted image 20250619185932.png
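
A minimal sketch with pd.get_dummies (the City column is hypothetical):

cities = pd.DataFrame({"City": ["Manila", "Makati", "Manila"]})
encoded = pd.get_dummies(cities, columns=["City"])  # one indicator column per category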

Log Transformation
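
A one-line sketch with NumPy (log1p is used so zero values don't break; assumes a skewed numeric column):

import numpy as np
df["CCPROG1_log"] = np.log1p(df["CCPROG1"])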

Aggregation

Numerical Aggregation

_attachments/Pasted image 20250516163858.png

Categorical Aggregation

_attachments/Pasted image 20250516163950.png

df_ncr = pd.DataFrame({
	"Group": ["A", "A", "B", "B", "B"],
	"City": ["Manila", "Manila", "Manila", "Makati", "Makati"]
})
df_ncr.groupby(by="Group").agg(pd.Series.mode)
df_ncr['City'].mode() # get the mode of the entire Series

Column Transformation

_attachments/Pasted image 20250516164207.png

Feature Scaling

Example
_attachments/Pasted image 20250516164824.png

1. Normalization (min-max normalization)

$$X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
$$\frac{163 - 162}{180.2 - 162} = \frac{1}{18.2} \approx 0.05 \qquad \frac{29k - 15k}{60k - 15k} = \frac{14k}{45k} \approx 0.31$$

2. Standardization (z-score)

$$z\text{-score} = \frac{x - \mu}{\sigma}$$

scikit-learn has built-in scalers so you don't have to rewrite these formulas over and over

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# df_scaling: a DataFrame of the numeric columns to scale
# normalization
normalizer = MinMaxScaler()  # scales each column to the range 0 to 1
normalizer.fit_transform(df_scaling)  # outputs a NumPy array

# standardization
standardizer = StandardScaler()  # scales each column to mean 0, std 1
standardizer.fit_transform(df_scaling)  # outputs a NumPy array

Feature Engineering

Examples

  1. From the height and weight, extract the body mass index (see the sketch below)
  2. From the date, determine whether it's a holiday or not
  3. From the location, determine the nearest hospital/school/etc.
  4. From the geo-coordinates (latitude), estimate the climate
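
A small sketch of the first example, with hypothetical height and weight columns:

people = pd.DataFrame({"height_m": [1.63, 1.80], "weight_kg": [55, 80]})
people["BMI"] = people["weight_kg"] / people["height_m"] ** 2  # kg / m^2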

2b Exploratory Data Analysis 🟠

Syllabus

Summary Statistics
Data Visualization

once we get raw data, we want to express it in a format that the computer can understand
then we do data preprocessing and cleaning
after those preps for the given domain, we can start preliminary EDA.

_attachments/Pasted image 20250620230539.png

Exploratory Data Analysis...

Some things to find out in the process of the EDA:

these are like guidelines or goals you might want to achieve in the process of the EDA

Regarding the format and the metadata

  1. What does each observation represent?
  2. What does each variable represent?
  3. Can I treat each row as an individual record? Are there duplicates?
  4. Are there missing data to consider?
  5. What is the unit of measure for each variable?

Regarding the domain

  1. What do the terms/jargons in the dataset mean?
    1. In Financial Trading data - bid volume, bid-ask volume misbalance, signed transaction volume, spread volatility, bid-ask spread
    2. In Epidemiology data - basic reproduction number, generation time, incidence, serial interval, vaccine efficacy

Regarding the Method of Collection

  1. Is the dataset a population or a sample in the context of the research question?
  2. Are there possible biases in the dataset?
  3. Is the data collection consistent across all observations?
  4. Is it grouped?
  5. Is it simulated?
  6. ...Are there underlying groups? Is the data simulated rather than drawn from real observations?

Regarding previous processing involved

  1. Are there previous processing performed on the data?
  2. Normalization - numerical values have been normalized to a set range?
  3. Discretization - numerical data has been grouped together into categories?
  4. Interpolation - are some values estimated from others?
  5. Truncation - some observations have been removed?

if you're working on large real world data, sometimes there are pre-processed datasets (cleaner) or raw datasets (if you have another cleaning method in mind)

Regarding the data itself

  1. Describe each variable in the dataset
    1. Summary statistics, frequency tables, histograms, etc.
  2. Describe the relationship between pairs of variables
    1. Scatterplots, correlation tables, etc.

Regarding each variable

  1. What is the measure of central tendency? (mean, median, mode, trimmed mean, etc.)
  2. What is the measure of dispersion? (range, IQR, standard deviation)
  3. What does the distribution look like? (symmetric, positively skewed, negatively skewed, etc.)

Regarding relationships between variables

  1. Is there a relationship between two variables?
    1. Measures of correlation (Pearson, Spearman, etc.)
    2. Scatterplot, correlation tables, etc.
  2. Use the corr() function in pandas for this (Pearson by default; check the documentation), as sketched below
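
The DataFrame here is made up:

scores = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "grade": [1.0, 2.0, 2.5, 3.5, 4.0]})
scores.corr()                   # pairwise correlation table, Pearson by default
scores.corr(method="spearman")  # rank-based alternative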

Regarding outliers and anomalies

  1. Are there outliers? Errors in encoding?
  2. Are there needed preprocessing steps to visualize and process the data properly?

Sample Data Visualization

_attachments/Pasted image 20250620232534.png
_attachments/Pasted image 20250620232542.png

Regarding additional Columns

  1. Are there additional columns to add to the dataset?
  2. Computed Index
    1. Poverty index
    2. H-index
    3. BMI
  3. Feature Pairs
    1. Male + old
    2. Cough + fever
  4. Generated Features
    1. Nearest grocery
    2. time duration
    3. date difference

Summary Statistics

Measure of Central Tendency

Example
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Mean

$$\mu = \bar{x} = \frac{\sum x}{n}$$

Trimmed Mean

same as mean but removes the n% of the highest and lowest values in the dataset

Median

the value separating the upper half and lower half of the sample/population

Mode

most frequently occurring value in the sample/population

Measures of Dispersion

Range

_attachments/Pasted image 20250620235647.png

Interquartile Range

_attachments/Pasted image 20250620235711.png

Standard Deviation

variance: sigma squared (σ²)

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$$

_attachments/Pasted image 20250516174129.png

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}\ \text{(population)} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}\ \text{(sample)}$$

Correlation

Pearson Correlation

assesses the linear relationship between two variables

$$r_{xy} = \frac{\text{cov}(x, y)}{SD_x \cdot SD_y}$$

Spearman Correlation

$$r_{xy} = \frac{\text{cov}(\text{rank}_x, \text{rank}_y)}{SD(\text{rank}_x) \cdot SD(\text{rank}_y)}$$

_attachments/Pasted image 20250621000327.png
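
The same two measures are available in scipy.stats (the arrays are made up):

from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
r, p = stats.pearsonr(x, y)      # linear correlation and its p-value
rho, p = stats.spearmanr(x, y)   # rank correlation and its p-value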

Data Visualization

_attachments/Pasted image 20250621000602.png

Types of Plots

Scatterplot

_attachments/Pasted image 20250621000630.png

Correlation Plot

_attachments/Pasted image 20250621000636.png

Dot Plot

_attachments/Pasted image 20250621000642.png

Histogram

_attachments/Pasted image 20250621000648.png

Bar Plot

_attachments/Pasted image 20250621000717.png

Box Plot

_attachments/Pasted image 20250621000757.png


3a Foundations for Inference 🔴

Syllabus

Normal Distribution
Point Estimate
Confidence Intervals
Hypothesis Testing

Probability Distribution

to get the probability distribution of the target variable, repeat the experiment numerous times, then record the number of times each outcome occurred.

_attachments/Pasted image 20250613100013.png

Normal Distribution

Parameters of a Normal Distribution

_attachments/Pasted image 20250613101545.png

_attachments/Pasted image 20250613101701.png

The z-score

$$Z = \frac{x - \mu}{\sigma}$$

_attachments/Pasted image 20250520172649.png

Percentile of a Data Point

_attachments/Pasted image 20250613102125.png

Computing for Percentile

_attachments/Pasted image 20250520173553.png

_attachments/Pasted image 20250520173701.png
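
A sketch of both directions with scipy (μ, σ, and the data point are made up):

from scipy import stats

mu, sigma = 70, 10
z = (85 - mu) / sigma               # z-score of the data point 85
percentile = stats.norm.cdf(z)      # area to the left ≈ 0.9332
cutoff = stats.norm.ppf(0.90, loc=mu, scale=sigma)  # value at the 90th percentile ≈ 82.8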

Normal Distribution Formula

_attachments/Pasted image 20250520173832.png
we use the normal distribution formula to approximate the data if the data is nearly normal.

Checking the Histogram

Checking the Q-Q Plot (Quantile-Quantile Plot)

_attachments/Pasted image 20250520174052.png

Point Estimate (p̂)

Example:
In the Dec2019 SWS survey, the PH president had an 82% satisfaction rating. This value is obtained from a sample.

_attachments/Pasted image 20250613130014.png

Types of Error

  1. Bias: introduced in the data collection process; an over- or underestimation of the true value
    1. Addressed through a thoughtful data collection process
  2. Sampling Error: the variation of the estimate between different samples (our focus)

_attachments/Pasted image 20250613130305.png
Suppose there is a very large population wherein all cannot be surveyed. We get a point estimate by surveying a small sample from the large population and get its approval rating (what we're trying to measure)

Repeat the sampling process multiple times and survey each sample to get its approval rating.

Central Limit Theorem (CLT)

_attachments/Pasted image 20250613135244.png

For proportion, the sampling distribution will be a normal distribution with:

Confidence Intervals (p49)

Example: In the December 2019 SWS survey, the satisfaction rating of Pres. Duterte is 82% ± 3% with a 95% confidence level

To construct a confidence interval with 95% confidence, we do

$$\hat{p} \pm 1.96 \times SE$$

Intuition

_attachments/Pasted image 20250613142605.png

_attachments/Pasted image 20250613142726.png

Increasing the confidence level to 99%:

$$\hat{p} \pm 2.58 \times SE$$
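
A worked sketch of the interval, assuming SE = sqrt(p̂(1 − p̂)/n) and a made-up sample size of 1200:

import math

p_hat, n = 0.82, 1200
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se                           # use 2.58 instead for a 99% confidence level
interval = (p_hat - margin, p_hat + margin)  # ≈ (0.798, 0.842) with these assumed numbers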

Hypothesis Testing

checks whether the hypotheses are supported in a statistically significant way based on the sample data

Process

  1. State null and alternative hypotheses
  2. Decide on test statistic and compute the p-value
  3. Decide based on the p-value

Example
Pew Research asked a random sample of 1,000 American adults whether they supported the increased usage of coal to produce energy. The sample shows that 37% of American adults support increased usage of coal.

Set up hypotheses to test if majority of American adults support or oppose the increased usage of coal.

Null and Alternative Hypotheses

Example Cont
Set up hypotheses to test if the majority of American adults support or oppose the increased usage of coal.

Null: There is no majority, i.e. p = 0.5
Alternative: There is a majority support or opposition (though we do not know which one) to expanding the use of coal

the sampling distribution should look like this if the null hypothesis is true:
_attachments/Pasted image 20250613143548.png

_attachments/Pasted image 20250613143828.png

Z-score and z-test

to perform hypothesis testing for proportion, compute the Z-score and use the z-test.

Under the null hypothesis p = 0.5:

$$\hat{p} = 0.37 \qquad SE = \sqrt{\frac{0.5(1 - 0.5)}{1000}} \approx 0.016 \qquad Z = \frac{0.37 - 0.5}{0.016} \approx -8.1$$

The significance level is the threshold on the p-value at which you decide whether to reject the null hypothesis.

Since the p-value (4.4 × 10^-16) is less than the significance level 0.05, we reject the null hypothesis. Thus, there is a majority support for or opposition to expanding the use of coal.
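
A sketch of the computation with scipy, using the null proportion 0.5 in the SE:

import math
from scipy import stats

z = (0.37 - 0.5) / math.sqrt(0.5 * 0.5 / 1000)  # ≈ -8.2
p_value = 2 * stats.norm.cdf(-abs(z))           # two-tailed; effectively zero, far below 0.05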

Summary of what we did

Types of Decision Errors

Statistical test can potentially make decision errors.

_attachments/Pasted image 20250613150127.png

Summary


3b Inference for Means 🔴

Syllabus

T Distribution
One Sample Mean
Paired Observations
Unpaired Observations
Multiple Means

Parametric Test

Prerequisite: Central Limit Theorem

If the sample size is small, this estimate for the standard error is problematic:

$$SE = \frac{\sigma}{\sqrt{n}}$$

TLDR; using standard error is problematic if the sample size is small so t-distribution is the alternative

GEMINI: Standard Deviation tells you how "diverse" the individuals are in your group (sample). Standard Error tells you how "diverse" the average of your group would be if you repeatedly picked different groups from the same population.

T-distribution

_attachments/Pasted image 20250621001737.png

$$T\text{-score} = \frac{x - \bar{x}}{s/\sqrt{n}} \qquad T\text{-score} = \frac{x - \mu}{\sigma/\sqrt{n}}$$

Degrees of Freedom

df=nβˆ’1

_attachments/Pasted image 20250621001702.png

Confidence Intervals

Example
Given n = 19 instances, sample mean x̄ = 4.4, sample standard deviation s = 2.3
df = n - 1 = 19 - 1 = 18

  1. To construct a confidence interval with 95% confidence, get the t-score of the point where 95% of the observations will fall (using the t-table)
  2. Multiply the t-score with the standard error s/sqrt(n)
  3. The confidence interval is x̄ ± that result, with a 95% confidence level

_attachments/Pasted image 20250621105532.png
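
The same interval via scipy (numbers from the example above):

import math
from scipy import stats

n, xbar, s = 19, 4.4, 2.3
lower, upper = stats.t.interval(0.95, n - 1, loc=xbar, scale=s / math.sqrt(n))
# ≈ (3.29, 5.51), i.e. xbar ± t* · s/sqrt(n) with t* ≈ 2.101 at df = 18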

One Sample Mean and One Sample T-test

To confirm if the hypotheses are true in a statistically significant way, we perform hypothesis testing.

Since observations are independent and n ≥ 30, use the t-distribution to estimate the sampling distribution of the mean where df = 99

To perform hypothesis testing for mean, compute the T-score and use the T-test

$$T\text{-score} = \frac{x - \bar{x}}{s/\sqrt{n}}$$

_attachments/Pasted image 20250621105758.png

Then, using the t-table, get the p-value based on the T-score (2.3733) and df = 99. The p-value is 0.0195.
Using significance level α = 0.05: since the p-value is less than 0.05, reject the null hypothesis.
Thus, the data provides strong evidence that the average run time for the Cherry Blossom Run in 2017 is different from the 2006 average at a significance level of 5%.
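
Getting that p-value with scipy instead of the t-table:

from scipy import stats

p_value = 2 * stats.t.sf(2.3733, 99)  # two-tailed p-value ≈ 0.0195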

Paired Observations

Perform analysis on the difference between paired observations.

Since the data is independent and n = 68, apply CLT even though the distribution is not normal.

To know if the difference between the paired observations is significant, set up the hypotheses as two-tailed tests.
_attachments/Pasted image 20250621111447.png

We're using the Standard Error for the mean of a single sample.

$$SE_{\text{single sample}} = \frac{\sigma}{\sqrt{n}}$$

To perform the hypotheses testing, compute the T-score of the observed mean of the differences and use the T-test

$$T\text{-score} = \frac{x - \mu}{SE_{\text{single sample}}}$$

_attachments/Pasted image 20250621111712.png

Then, using the t-table, get the p-value based on the T-score (2.20) and df (67). The p-value is 0.0312.
Using significance level α = 0.05, the p-value is statistically significant.
Since the p-value is less than the significance level, reject the null hypothesis.
Thus, the data provides strong evidence that the prices in the store are different from those on Amazon at a significance level of 5%.
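
A sketch with scipy's paired test (the two price lists are made up; ttest_rel works on the pairwise differences):

from scipy import stats

store = [27.0, 35.5, 19.9, 44.0, 15.5]
amazon = [25.0, 33.0, 20.5, 40.0, 14.0]
t_stat, p_value = stats.ttest_rel(store, amazon)  # two-tailed by default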

Unpaired Observations

Then setup the hypotheses
_attachments/Pasted image 20250621112251.png

Given 2 separate groups, consider the estimates for the mean difference sampling distribution

Here, we're using the Standard Error of the Difference Between Two Independent Means

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

set df to the lower sample size - 1

To perform hypothesis testing, compute the T-score of the point estimate and use the T-test

$$T\text{-score} = \frac{x - \mu}{SE}$$

_attachments/Pasted image 20250621113106.png

Then, using the t-table, get the p-value based on the T-score (1.54) and df (49). The p-value is 0.135.
Using significance level α = 0.05, the p-value (0.135) is not statistically significant. Since the p-value is greater than the significance level α, we fail to reject the null hypothesis.
Thus, the data does not provide strong evidence that the weights of newborns from smokers and non-smokers differ at a significance level of 5%.
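
A sketch with scipy's independent-samples test (arrays made up; equal_var=False gives the Welch version, matching the SE formula above):

from scipy import stats

smokers = [2.9, 3.1, 2.8, 3.3, 3.0]
nonsmokers = [3.2, 3.4, 3.1, 3.5, 3.3]
t_stat, p_value = stats.ttest_ind(smokers, nonsmokers, equal_var=False)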

Multiple Means and ANOVA F-statistic

ANOVA Core Concept: if the means are the same, then their variability should be low.

Setup the hypotheses
_attachments/Pasted image 20250621113434.png

ANOVA Computes for the F-statistic

$$F = \frac{MSG}{MSE}$$

F-Distribution (df1 and df2)

$$SSG = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \qquad SST = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad MSG = \frac{1}{df_1} \cdot SSG \qquad MSE = \frac{1}{df_2} \cdot (SST - SSG)$$

where $df_1 = k - 1$ and $df_2 = n - k$.

To perform hypothesis testing, use the F-statistic

$$F = \frac{MSG}{MSE} = 5.077$$

Get the p-value based on the F-statistic (5.077). The p-value is 0.006
Using significance level α = 0.05, the p-value 0.006 is statistically significant.
Since the p-value is less than the significance level, reject the null hypothesis.
Thus, the data provide strong evidence that at least one of the groups deviates from the others under a significance level of 5%

To find out which group, we can compare all pairwise means using the method previously shown (outfielder VS infielder, outfielder VS catcher, infielder VS catcher), but we need to apply a Bonferroni correction to account for increased chances of a Type 1 error.

$$\text{Bonferroni}\ \alpha^{*} = \frac{\alpha}{K}$$

(more comparisons mean higher chances of making mistakes; if α = 0.05, then there is a chance of making a Type 1 error 5% of the time)
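
A sketch of the F-test and the correction in scipy (the three groups are made up; K is the number of pairwise comparisons):

from scipy import stats

outfielders = [0.255, 0.260, 0.248, 0.275]
infielders = [0.250, 0.252, 0.249, 0.260]
catchers = [0.237, 0.239, 0.233, 0.262]
f_stat, p_value = stats.f_oneway(outfielders, infielders, catchers)

alpha, K = 0.05, 3
alpha_star = alpha / K  # Bonferroni-corrected level for the follow-up pairwise t-tests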

Use the intuitions behind CLT to establish statistical significance when comparing means.
Use the t-distribution to estimate the sampling distribution of the means taken from some population.
Determine the probability that a value as extreme as the observation can be observed from the sampling distribution of our null hypothesis to know its significance.
Use ANOVA to compare the means of multiple groups at once.


3c Inference for Categorical Data 🔴

Syllabus

Test of Goodness of Fit
Test of Independence

Recap of Statistical Testing

Example
Test the hypothesis that there is a difference in the average sleeping time of boys and girls. Take a sample of the population. Compute the desired statistic from the sample. In this case, the difference between the means/average.

Suppose that the difference between the average sleeping time of boys and girls is 1.6 hours. Find out if this observation is statistically significant.

Assume the null hypothesis is true: there is no difference in the mean sleeping time (mean difference = 0)

based on the Central Limit Theorem...
taking multiple samples repeatedly results in a sampling distribution that can be approximated with the t-distribution

_attachments/Pasted image 20250612233934.png
Compute the probability of observing a value as extreme as or more extreme than 1.6 in a sample, given that the null hypothesis is true.

If the probability (p-value) is small, then there is some evidence to believe that the null hypothesis is likely not true.

Otherwise, since the probability of observing that value in a sample is high, then we cannot say that the null hypothesis is likely to be false.

Steps in Statistical Testing:

  1. Compute some statistic from the sample.
  2. Determine the sampling distribution of that statistic if the null hypothesis is true.
  3. Check the probability that the sample statistic could be observed from the distribution.
  4. Reject the null hypothesis if the probability is low enough.

Test of Goodness of Fit (Chi-Square Test)

Example

| Race | White | Black | Hispanic | Other | Total |
| --- | --- | --- | --- | --- | --- |
| Juries | 206 | 26 | 25 | 19 | 275 |
| Voters | 72% = 0.72 | 7% = 0.07 | 12% = 0.12 | 9% = 0.09 | 100% |
| Expected | 275 × 0.72 = 198 | 275 × 0.07 = 19.25 | 275 × 0.12 = 33 | 275 × 0.09 = 24.75 | 275 |

For context:

Q: is the number of juries per race representative of the actual population?

Given the expected counts, check if the observed counts are statistically different from the expected distribution or is the difference likely a result of chance.

Define the null and alternative hypotheses:

Here, we use the Chi-Square test, if the following are met:

$$\chi^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$$

_attachments/Pasted image 20250612234225.png

Chi-Square Distribution

The Chi-Square distribution is the expected distribution of the Chi-Square statistic when you take repeated samples from a population

Parameters of Chi-Square Distribution

_attachments/Pasted image 20250612234357.png
Degrees of freedom is another parameter in the Chi2 Distribution

_attachments/Pasted image 20250612234425.png

Q: What is the probability of observing a value as extreme as x^2 = 5.8897 or more, given our expected distribution?

_attachments/Pasted image 20250527173502.png

Steps for performing a hypothesis test (specifically a chi-square goodness-of-fit or chi-square test of independence):

  1. Get the p-value based on the chi-square value of 5.8897 and df = 3. The p-value would be 0.1171.
  2. Using a significance level α = 0.05, the p-value is not statistically significant because it is higher (SimplyPsychology: a p-value less than or equal to your significance level IS statistically significant).
  3. Since the p-value 0.1171 is greater than the significance level, fail to reject the null hypothesis.
  4. Thus, the data does not provide strong evidence of bias in juror selection at a significance level of 5%; the observed counts are consistent with random sampling.

Another Example:

| Age Group | 18 to 39 | 40 to 59 | 60+ | Total |
| --- | --- | --- | --- | --- |
| Respondents | 702 | 398 | 100 | 1200 |

Q: Is the number of respondents per age group representative of the actual set of registered voters?

| Age Group | 18 to 39 | 40 to 59 | 60+ | Total |
| --- | --- | --- | --- | --- |
| Respondents | 702 | 398 | 100 | 1200 |
| Voters | 56% | 32% | 12% | 100% |
| Expected | 1200 × 0.56 = 672 | 1200 × 0.32 = 384 | 1200 × 0.12 = 144 | 1200 |
Given the expected counts, check if the observed counts are statistically different from the expected distribution or is the difference likely just a result of chance.

Determine the null and alternative hypotheses:

Compute x^2 for the Chi-Squared Test:
_attachments/Pasted image 20250613000212.png

  1. Get the p-value based on the chi-square value (15.2942) and df = 2. The p-value is 0.0012.
  2. Using significance level 0.05, the p-value 0.0012 is statistically significant.
  3. Since the p-value is less than the significance level, reject the null hypothesis.
  4. Thus, the data provides strong evidence that the respondents are not randomly sampled and are not representative of the current set of registered voters at a significance level of 5%
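
The same test in scipy (counts from the example above):

from scipy import stats

observed = [702, 398, 100]
expected = [672, 384, 144]
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)  # chi2 ≈ 15.29, df = k - 1 = 2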

Test of Independence

Example Context: Pulse Asia Commissioned Survey by Senator Gatchalian

Q: The ROTC or Reserved Officers' Training Corps is a program which aims to teach the youth about discipline and love of country through military training. How much do you agree or disagree with the proposal to implement ROTC to all students in SHS?

Revised Q based on RA 9163: ROTC is a program designed to provide military training to tertiary level students in order to motivate, train, organize, and mobilize them for national defense preparedness.
Q: How much do you agree or disagree with the proposal to implement ROTC to all students in SHS?

Some examples compare categorical responses between multiple groups. Group A got the original question, Group B got the negative version, Group C got the legal definition.

| Group | Group A | Group B | Group C |
| --- | --- | --- | --- |
| Agree | 23 | 2 | 36 |
| Disagree | 50 | 71 | 37 |

The test for independence checks whether two categorical variables are independent of each other (independence being the null hypothesis).

  1. Base the expected proportion on the totals

| Group | Group A | Group B | Group C | Total |
| --- | --- | --- | --- | --- |
| Agree | 23 | 2 | 36 | 61 |
| Disagree | 50 | 71 | 37 | 158 |
| Total | 73 | 73 | 73 | 219 |

Get the proportions:
Agree = 61/219 = 0.2785
Disagree = 158/219 = 0.7215

Get the expected frequencies by multiplying each proportion by a group's total (73):
Agree = 0.2785 × 73 = 20.33
Disagree = 0.7215 × 73 = 52.67

  2. Create the Null and Alternative Hypotheses

Null H_0: Responses are independent of the question, i.e. the phrasing did not affect the response
Alternative H_A: Responses are dependent on the question, i.e. the phrasing affects the response.

Chi-Square Test

  3. Conduct a Chi-Square Test to get the χ² statistic for addressing the null hypothesis
$$\chi^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$$

_attachments/Pasted image 20250613082803.png

Chi-Square Distribution

  4. Calculate the degrees of freedom for the Chi-Square Distribution.

df = (rows - 1) * (cols - 1) = (2-1) (3-1) = (1)(2) = 2

  5. Based on the Chi-Square Distribution, what is the probability of observing the χ² value (Chi-squared statistic) = 40.13? (i.e. we want to find the area under the curve)

_attachments/Pasted image 20250613083115.png
It's too small to be seen in the graph.

Test of independence steps:

  1. Get the p-value based on the chi-square value (40.13) and df = 2.
  2. The p-value is 0.000000002
  3. Using significance level α = 0.05, the p-value is statistically significant, so we reject the null hypothesis. (If the p-value < 0.05, reject the null hypothesis.)
  4. Thus, the data provides strong evidence that the answer of the respondents depends on the phrasing of the question.

In code (scipy.stats)
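
A sketch for the ROTC table above:

from scipy import stats

table = [[23, 2, 36],   # Agree per group A, B, C
         [50, 71, 37]]  # Disagree per group A, B, C
chi2, p_value, df, expected = stats.chi2_contingency(table)
# chi2 ≈ 40.13, df = 2; expected holds the expected frequencies per cell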

Example: Election results and Economic Class

| Group | Class ABC | Class D | Class E |
| --- | --- | --- | --- |
| More Credible | 24 | 32 | 90 |
| Less Credible | 45 | 93 | 152 |
| Just as Credible | 11 | 15 | 25 |

Q: is the answer of the respondent dependent on the economic class?

  1. Get the totals

| Group | Class ABC | Class D | Class E | Total |
| --- | --- | --- | --- | --- |
| More Credible | 24 | 32 | 90 | 146 |
| Less Credible | 45 | 93 | 152 | 290 |
| Just as Credible | 11 | 15 | 25 | 51 |
| Total | 80 | 140 | 267 | 487 |
  2. Calculate the proportions and the expected values.

Proportions:
More Credible = 146/487 ≈ 0.2998
Less Credible = 290/487 ≈ 0.5955
Just as Credible = 51/487 ≈ 0.1047

(Expected value is proportion × column total)
Expected Values for More Credible:
Class ABC = 0.2998 × 80 ≈ 23.98, Class D = 0.2998 × 140 ≈ 41.97, Class E = 0.2998 × 267 ≈ 80.04

Expected Values for Less Credible:
Class ABC = 0.5955 × 80 ≈ 47.64, Class D = 0.5955 × 140 ≈ 83.37, Class E = 0.5955 × 267 ≈ 159.00

Expected Values for Just as Credible:
Class ABC = 0.1047 × 80 ≈ 8.38, Class D = 0.1047 × 140 ≈ 14.66, Class E = 0.1047 × 267 ≈ 27.96

  3. Put them all together and create the Null and Alternative Hypotheses

_attachments/Pasted image 20250613084035.png

  4. Compute the Chi-square statistic to test the null hypothesis.
$$\chi^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$$

_attachments/Pasted image 20250613084122.png

Steps

Summary


3d Bayesian Inference 🟠

Syllabus

Conditional Probability
Bayes Theorem

Two kinds of statistics

Conditional Probability

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Example Scenario
Suppose out of all the 4 championship races between Niki and James, Niki won 3 times while James managed only 1. What is the probability that James will win?

Assume event B is that James won.
Total events = 4
P(B) = 1/4

_attachments/Pasted image 20250611212030.png

Suppose out of all the 4 championship races between Niki and James, Niki won 3 times while James managed only 1. Out of the 1 time James won, it was raining. Out of the 3 times Niki won, it was raining only 1 time. If it was raining in the next race, what is the probability that James will win?

Assume event A is that it was raining, event B is that James won.
_attachments/Pasted image 20250611212730.png
(Answer is correct)
The evidence of rain strengthened our belief that James will win the next race (doubled it.)

Bayes Theorem

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Bayesian Inference

$$1.\ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad\Rightarrow\quad 2.\ P(A \cap B) = P(A \mid B) \cdot P(B)$$
$$3.\ P(B \mid A) = \frac{P(A \cap B)}{P(A)} \quad\Rightarrow\quad 4.\ P(A \cap B) = P(B \mid A) \cdot P(A)$$
$$\text{given 2 and 4: } P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$
$$\text{so... } P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

Dr. Trefor Bazett: Bayes' Theorem - The Simplest Case

Terms needed in Bayesian Inference (or inference in general)
Models: mathematical formulations of the observed events

Example Scenario:
Consider the task of flipping a coin, where heads is considered as the successful case and tails is considered as the unsuccessful case.

In this case...
let θ: a parameter to the model representing the fairness of the coin
let D: the outcome of the events.

ex. D could be either heads or tails, and θ is the probability of getting heads or tails with our coin

Given an outcome D, what is the probability that the coin is fair, i.e. θ = 0.5? (50% chance of getting heads)

$$P(\theta \mid D) = \frac{P(D \mid \theta) \times P(\theta)}{P(D)}$$

Prior belief (before) --evidence→ Posterior belief (after)

Q: Is belief = confidence?
A: belief is whatever the conclusion is, confidence is how likely that belief would be true

The Bernoulli Function or Bernoulli Distribution Function

Bernoulli probability distribution function is one of the models

$$P(y \mid \theta) = \theta^{y}(1 - \theta)^{1 - y}$$

Example
If you're working with financial data, and there's a guy who applies for a loan application. You want to predict if the guy will pay his payments on time or will he fail to pay.

Example
If you're in healthcare, y=1 when the person is positive for a diagnosis, y=0 when negative.

The Bernoulli function computes the actual probability within 0 to 1; it is used when we have an assumed probability in mind.

Beta Distribution

$$\frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$

Beta distribution can be helpful when you don't know what the probability is, but you have the set of observations. So it's useful in estimating what the likely probability is.

_attachments/Pasted image 20250611235458.png

More examples
_attachments/Pasted image 20250612000802.png

We can also calculate the mean and standard deviation of a Beta distribution:

$$\text{mean}\ \mu = \frac{\alpha}{\alpha + \beta} \qquad \text{std dev}\ \sigma = \sqrt{\frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}}$$
$$\alpha = 11,\ \beta = 9:\quad \mu = \frac{11}{11 + 9} = \frac{11}{20} = 0.55$$

The purpose of Beta distribution is to represent the prior belief. The Beta distribution behaves nicely when multiplied with the likelihood function, since it yields a posterior distribution in a similar format as the prior.

Computing Posterior Belief

$$P(\theta \mid z, N) = \frac{P(z, N \mid \theta) \cdot P(\theta)}{P(z, N)} = \frac{\theta^{z}(1 - \theta)^{N - z} \cdot \dfrac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}}{P(z, N)} = \frac{\theta^{z + \alpha - 1}(1 - \theta)^{N - z + \beta - 1}}{B(z + \alpha,\ N - z + \beta)}$$

So the posterior belief (updated belief of our given probability value) becomes:

$$P(\theta \mid z + \alpha,\ N - z + \beta)$$

Example:
Suppose the prior belief, having not observed anything yet, is that the coin is most likely fair (μ = 0.5, σ = 0.1). Represent the belief using a Beta distribution P(θ|α, β) = P(θ|12.5, 12.5).

Let's have an arbitrary value for α and β. Say we do 25 trials, and α and β are 12.5 each.
α = 12.5
β = 12.5
P(θ|α, β) = P(θ|12.5, 12.5) is the initial assumption that the coin is fair.

Then, suppose the coin is flipped 10 times (N = 10) and we observe 8 heads (z = 8). That is our new data. Update the Beta distribution with
z = 8 (heads)
N − z = 2 (tails)
P(θ|z + α, N − z + β) = P(θ|8 + 12.5, 2 + 12.5) = P(θ|20.5, 14.5)

Flip the coin ten more times (N = 10) and observe 9 heads (z = 9).
z = 9 (heads)
N − z = 1 (tail)
P(θ|z + α, N − z + β) = P(θ|9 + 20.5, 1 + 14.5) = P(θ|29.5, 15.5)

The graph keeps shifting to the right, so there is a probability that our coin is not 50/50 and therefore, not fair.
_attachments/Pasted image 20250612012228.png
it becomes unlikely that the graph reaches a high point at the 0.5 mark.

In Bayesian Inference, the more data that becomes available, the more our beliefs get updated.
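
A sketch of the same updates with scipy's Beta distribution (numbers from the example):

from scipy import stats

a, b = 12.5, 12.5                  # prior: coin most likely fair
for z, N in [(8, 10), (9, 10)]:    # z heads observed out of N flips
    a, b = a + z, b + (N - z)      # posterior Beta parameters
print(a, b)                        # 29.5 15.5
print(stats.beta.mean(a, b))       # posterior mean ≈ 0.66, drifting away from 0.5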

Hypothesis Testing

Bayes Factors (the Bayesian counterpart of the p-value)

$$BF = \frac{P(M = \text{null} \mid z, N)}{P(M = \text{alternative} \mid z, N)} \Bigg/ \frac{P(M = \text{null})}{P(M = \text{alternative})}$$

the point of Bayesian Inference: Given new observations, how do we update our initial belief or initial assumption?

4a Association Rule Mining 🟢

Syllabus

Market-Basket Model
Frequent Itemset
Confidence

JcSites: Generating Association Rules

Market Basket Model

Example
If a person buys cereal, we can reasonably assume that they're also buying milk. If a person buys coffee, they might buy sugar as well.

Assumptions about the data

Practical Applications of the Market Basket Model

Frequent Itemset (definition of terms in Market Basket Model)

Example
All items I = {milk, coke, pepsi, juice, beer}

Baskets (we had 8 customers)
B1 = {milk, coke, beer}
B2 = {m, p, j}
B3 = {m, b}
B4 = {c, j}
B5 = {m, p, b}
B6 = {m, c, j, b}
B7 = {c, j, b}
B8 = {c, b}

  1. Compute the Support S_I: how many people bought milk, coke, etc. These are 1-itemsets (k = 1)
    S_{milk} = 5
    S_{coke} = 5
    S_{pepsi} = 2
    S_{juice} = 4
    S_{beer} = 6
  2. If our support threshold S_θ = 3...
    S_{milk} = 5
    S_{coke} = 5
    S_{juice} = 4
    S_{beer} = 6
    are frequent itemsets
  3. We look at the support of 2-itemsets where k=2... (pairs of items)
    S_{m, c} = 2
    S_{m, j} = 2
    S_{m, b} = 4
    S_{c, j} = 3
    S_{c, b} = 4
    S_{j, b} = 2
    S_{m, p} = 2
    S_{c, p} = 0
    S_{p, j} = 1
    S_{p, b} = 1
  4. If our support threshold S_θ = 3...
    S_{m, b} = 4
    S_{c, j} = 3
    S_{c, b} = 4
    are frequent itemsets
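
A small sketch that reproduces these counts in Python (baskets from the example, with B8 = {c, b}):

from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "j", "b"}, {"c", "j", "b"}, {"c", "b"},
]

def support(itemset):
    # number of baskets that contain every item in the itemset
    return sum(itemset <= basket for basket in baskets)

items = sorted(set().union(*baskets))
frequent_pairs = [pair for pair in combinations(items, 2) if support(set(pair)) >= 3]
print(frequent_pairs)  # [('b', 'c'), ('b', 'm'), ('c', 'j')]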

Confidence

Example
All items I = {milk, coke, pepsi, juice, beer}

Baskets (we had 8 customers)
B1 = {milk, coke, beer}
B2 = {m, p, j}
B3 = {m, b}
B4 = {c, j}
B5 = {m, p, b}
B6 = {m, c, j, b}
B7 = {c, j, b}
B8 = {c, b}

We want to measure the confidence of {milk, beer} → coke.
Confidence:

$$C_{\{m,b\} \to c} = \frac{x}{y}$$

where y is the number of baskets containing {milk, beer} and x is the number of those that also contain coke.

So for the confidence of {milk, beer} → coke:
y = 4 (baskets with milk and beer: B1, B3, B5, B6)
x = 2 (of those, B1 and B6 also contain coke)

$$C_{\{m,b\} \to c} = \frac{2}{4} = \frac{1}{2} = 0.5$$

Example
We want to measure the confidence of {milk, beer} → pepsi

$$C_{\{m,b\} \to p} = \frac{1}{4}\ \text{confidence that they will buy pepsi}$$

Example
We want to measure the confidence of {milk, pepsi} → beer

$$C_{\{m,p\} \to b} = \frac{1}{2}\ \text{confidence that they will buy beer}$$

Example
We want to measure the confidence of {coke, juice} → beer

$$C_{\{c,j\} \to b} = \frac{2}{3}\ \text{confidence that they will buy beer}$$

Example
We want to measure the confidence of {coke, juice} → milk

$$C_{\{c,j\} \to m} = \frac{1}{3}\ \text{confidence that they will buy milk}$$

When Association Rules are useful

Given an association rule {i_1, i_2, ..., i_k} → j, we can say that this association rule is useful iff:

Example
All Items = {1, 2, 3, 4, 5}

Baskets
B1 = {1, 2, 5}
B2 = {2, 4}
B3 = {2, 3}
B4 = {1, 2, 4}
B5 = {1, 3}
B6 = {2, 3}
B7 = {1, 2, 3}
B8 = {1, 2, 3, 5}

Compute the Support if S_θ = 2...
S_{1} = 5 (how many times does item 1 appear in a basket)
S_{2} = 7
S_{3} = 5
S_{4} = 2
S_{5} = 2
all of these are frequent itemsets

Example
If we're looking for the frequent 2-itemsets and S_θ = 2...
S_{1, 2} = 4
S_{1, 3} = 3
S_{1, 4} = 1
S_{1, 5} = 2
S_{2, 3} = 4
S_{2, 4} = 2
S_{2, 5} = 2
S_{3, 4} = 0
S_{3, 5} = 1
S_{4, 5} = 0

the 2-itemsets with support ≥ 2 ({1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 4}, {2, 5}) are frequent 2-itemsets

Example
If we're looking for the frequent 3-itemsets and S_θ = 2...
S_{1, 2, 3} = 2
S_{1, 2, 5} = 2
S_{1, 2, 4} = 1
S_{1, 3, 5} = 1
S_{2, 3, 4} = 0
S_{2, 3, 5} = 1
S_{2, 4, 5} = 0

the 3-itemsets with support ≥ 2 ({1, 2, 3} and {1, 2, 5}) are frequent 3-itemsets

Note: adding items to an itemset will not increase its support (a superset can never appear in more baskets than its subsets).

Focusing on the 3-itemsets...
{1, 2, 3}
Possible Rules that we generated...
{1} β†’ {2, 3}
{2} β†’ {1, 3}
{3} β†’ {1, 2}
{1, 2} β†’ {3}
{1, 3} β†’ {2}
{2, 3} β†’ 1

We want to measure the confidence of these rules

Example {1} → {2, 3}: if 1 is chosen, what's the probability that 2 and 3 will also be chosen?

$$C_{\{1\} \to \{2,3\}} = \frac{2}{5} \approx 0.40 = 40\%\ \text{confidence}$$

Example {2} → {1, 3}

$$C_{\{2\} \to \{1,3\}} = \frac{2}{7} \approx 0.29$$

Example {3} → {1, 2}

$$C_{\{3\} \to \{1,2\}} = \frac{2}{5} \approx 0.40$$

Example {1, 2} → {3}: if 1 and 2 are chosen, what's the probability that 3 will also be chosen?

$$C_{\{1,2\} \to \{3\}} = \frac{2}{4} = 0.50$$

Example {1, 3} → {2}

$$C_{\{1,3\} \to \{2\}} = \frac{2}{3} \approx 0.67$$

Example {2, 3} → {1}

$$C_{\{2,3\} \to \{1\}} = \frac{2}{4} = 0.50$$

If our confidence threshold was C_θ = 50%...
{1} → {2, 3} = 0.40
{2} → {1, 3} = 0.29
{3} → {1, 2} = 0.40
{1, 2} → {3} = 0.50
{1, 3} → {2} = 0.67
{2, 3} → {1} = 0.50
only {1, 2} → {3}, {1, 3} → {2}, and {2, 3} → {1} pass that threshold

Example with another set of rules:
_attachments/Pasted image 20250612152329.png

_attachments/Pasted image 20250612152345.png
_attachments/Pasted image 20250612152404.png
_attachments/Pasted image 20250612152412.png