In recent times, machine learning (ML) and data science have gained popularity like never before, and the field is expected to grow exponentially in the coming years. First of all, what is machine learning? And why should someone take the pains to understand its principles? Well, we have the answers for you. One simple example is book recommendations on e-commerce websites: when someone searches for a particular book, the site recommends other products that were frequently bought together, giving users an idea of what else they might like. Sounds like magic, right? In fact, machine learning can achieve much more than this.
Machine learning is a branch of study in which a model learns automatically from experience based on data, without the relationships being explicitly specified as they are in statistical models. Over time, and with more data, the model's predictions become better.
In this first chapter, we will introduce the basic concepts which are necessary to understand statistical learning and create a foundation for full-time statisticians or software engineers who would like to understand the statistical workings behind the ML methods.
Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of numerical data.
Statistics is mainly classified into two subbranches: descriptive statistics, which summarize and describe the data at hand (for example, mean, median, and variance), and inferential statistics, which draw conclusions about a population from a sample (for example, hypothesis testing).
Statistical modeling is the application of statistics to data in order to find underlying hidden relationships by analyzing the significance of the variables.
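As a small illustration of this (a hedged sketch on simulated data, using statsmodels, which is an assumption here rather than a library used later in this chapter), an ordinary least squares fit reports a p-value for each variable, which is exactly the kind of significance analysis meant above:
import numpy as np
import statsmodels.api as sm
# Simulated data (assumption for illustration): y depends on x1 but not on x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)                       # variable truly related to y
x2 = rng.normal(size=100)                       # pure noise variable
y = 2.0 * x1 + rng.normal(scale=0.5, size=100)
X = sm.add_constant(np.column_stack([x1, x2]))  # add an intercept column
result = sm.OLS(y, X).fit()
print(result.summary())                         # p-values show x1 is significant, x2 is not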
Machine learning is the branch of computer science that learns from past experience and uses that knowledge to make future decisions. Machine learning lies at the intersection of computer science, engineering, and statistics. The goal of machine learning is to generalize a detectable pattern or to create an unknown rule from given examples. An overview of the machine learning landscape is as follows:
Machine learning is broadly classified into three categories (supervised, unsupervised, and reinforcement learning); nonetheless, based on the situation, these categories can be combined to achieve the desired results for particular applications:
In some cases, when the number of variables is very high, we initially perform unsupervised learning to reduce the dimensions, followed by supervised learning. Similarly, in some artificial intelligence applications, supervised learning combined with reinforcement learning can be utilized to solve a problem; an example is self-driving cars in which, initially, images are converted to some numeric format using supervised learning and then combined with driving actions (left, forward, right, and backward).
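As a minimal sketch of the first combination (a hedged example; scikit-learn, its digits dataset, and the choice of 20 components are assumptions, not prescribed by this chapter), unsupervised PCA first reduces the number of features and a supervised classifier is then trained on the compressed representation:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = load_digits(return_X_y=True)                 # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([
    ("pca", PCA(n_components=20)),                  # unsupervised: reduce 64 -> 20 dimensions
    ("clf", LogisticRegression(max_iter=5000)),     # supervised: classify in the reduced space
])
pipe.fit(X_train, y_train)
print("Test accuracy:", round(pipe.score(X_test, y_test), 3))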
Statistics itself is a vast subject on which a complete book could be written; here, however, the attempt is to focus on the key concepts that are essential from a machine learning perspective. In this section a few fundamentals are covered, and the remaining concepts will be covered in later chapters wherever they are necessary to understand the statistical equivalents of machine learning.
Predictive analytics depends on one major assumption: that history repeats itself!
By fitting a predictive model on historical data and validating key measures, the same model is then used to predict future events based on the same explanatory variables (also called predictor or independent variables) that were significant on past data.
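A minimal sketch of that workflow (the toy data and the use of scikit-learn's LinearRegression are assumptions for illustration): fit a model on historical observations, then reuse the fitted model to predict outcomes for new values of the same explanatory variable:
import numpy as np
from sklearn.linear_model import LinearRegression
# "Historical" data: explanatory variable X and known outcomes y (assumed toy values)
X_hist = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_hist = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
model = LinearRegression().fit(X_hist, y_hist)   # learn the relationship from the past
# "Future" events: same explanatory variable, outcome not yet known
X_new = np.array([[6.0], [7.0]])
print("Predictions:", model.predict(X_new).round(2))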
########################
In regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict an outcome.
Note that in the field of machine learning, the predictor variables are commonly called "features," and the response variables are usually referred to as "target variables." We will adopt these conventions throughout this book.
In machine learning and deep learning applications, we can encounter various different types of features: continuous (e.g., house price), unordered categorical (nominal, e.g., t-shirt color), and ordered categorical (ordinal, e.g., t-shirt size, because we can define an order XL > L > M). You will recall that in cp4 Training Sets Preprocessing_StringIO_dropna_categorical_feature_Encode_Scale_L1_L2_bbox_to_anchor https://blog.csdn.net/Linli522362242/article/details/108230328, we covered the different types of features and learned how to handle each type. Note that while numeric data can be either continuous or discrete, in the context of the TensorFlow API, "numeric" data specifically refers to continuous data of the floating-point type.
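As a quick refresher (a minimal sketch with pandas; the column names and values are made up for illustration), each feature type is handled differently: ordinal features get an explicit order mapping, nominal features get one-hot encoding, and continuous features are simply kept as floats:
import pandas as pd
df = pd.DataFrame({
    "price": [10.1, 13.5, 15.3],        # continuous feature
    "color": ["green", "red", "blue"],  # nominal (unordered categorical) feature
    "size":  ["M", "L", "XL"],          # ordinal (ordered categorical) feature
})
size_mapping = {"M": 1, "L": 2, "XL": 3}            # ordinal: explicit order XL > L > M
df["size"] = df["size"].map(size_mapping)
df = pd.get_dummies(df, columns=["color"])          # nominal: one-hot encode, no order implied
print(df)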
######################## cp14_2_Layers_config_numeric_continuou_Feature Column_boosted tree_n_batches_per_layer_repeat_estima_Linli522362242的专栏-CSDN博客
The first movers among statistical model implementers were the banking and pharmaceutical industries; over time, analytics expanded to other industries as well.
Statistical models are a class of mathematical models, usually specified by mathematical equations that relate one or more variables so as to approximate reality. The assumptions embodied by a statistical model describe a set of probability distributions, which distinguishes it from non-statistical mathematical or machine learning models.
Statistical models always start with some underlying assumptions that all the variables should satisfy; only then is the performance provided by the model statistically significant. Hence, knowing the various bits and pieces involved in all the building blocks provides a strong foundation for being a successful statistician.
In the following section, we describe various fundamentals with the relevant code:
Usually, it is expensive to perform an analysis of an entire population; hence, most statistical methods are about drawing conclusions about a population by analyzing a sample.
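A quick sketch of this idea (the population is simulated here, purely for illustration): a statistic computed on a modest random sample closely approximates the corresponding population value at a fraction of the cost:
import numpy as np
rng = np.random.default_rng(42)
population = rng.normal(loc=100, scale=15, size=1_000_000)  # the (simulated) population
sample = rng.choice(population, size=500, replace=False)    # an affordable random sample
print("Population mean:", round(population.mean(), 2))
print("Sample mean    :", round(sample.mean(), 2))          # close to the population mean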
The Python code for the calculation of mean, median, and mode using a NumPy array and the stats package is as follows:
import numpy as np
from scipy import stats
data = np.array([4,5,1,2,7,2,6,9,3])
# Calculate Mean
dt_mean = np.mean(data);
print("Mean :", round(dt_mean,2))
# Calculate Median
dt_median = np.median(data);
print("Median :", dt_median)
# Calculate Mode: the most frequently occurring value in the data
dt_mode = stats.mode(data)
print("Mode :", dt_mode[0][0]) # dt_mode[0] is the array of modes (ModeResult.mode),
                               # dt_mode[0][0] is its first value (the most frequent number)
dt_mode
#######################################################
a = np.array([[2, 2, 2, 1],
              [1, 2, 2, 2],
              [1, 1, 3, 3]])
print("# Print mode(a):", stats.mode(a))                          # mode in each column
print("# Print mode(a.transpose()):", stats.mode(a.transpose()))  # mode in each row
# Most frequent element in each column of a and how many times each appears:
print("# Column modes: {}, counts: {}".format( stats.mode(a)[0][0], stats.mode(a)[1][0] ))
# Most frequent element in the first column of a and its count:
print("# First-column mode: {}, count: {}".format( stats.mode(a)[0][0][0], stats.mode(a)[1][0][0] ))
# Most frequent element in each row of a and how many times each appears:
print("# Row modes: {}, counts: {}".format( stats.mode(a.transpose())[0][0], stats.mode(a.transpose())[1][0] ))
# Most frequent element in the whole (flattened) array a and its count:
print("# Overall mode: {}, count: {}".format( stats.mode(a.reshape(-1))[0][0], stats.mode(a.reshape(-1))[1][0] ))
For example, the first column of a is [2, 1, 1], so its mode is 1, which appears twice.
When the data is grouped into class intervals, the mode is calculated from the class in which observations occur most of the time. First, the modal group (the class with the highest frequency) needs to be identified. If the intervals are not continuous, 0.5 should be subtracted from the lower limit and 0.5 added to the upper limit of each class to make them continuous. The grouped-data mode formula is
mode = L + h * (f1 - f0) / ( (f1 - f0) + (f1 - f2) )
where L is the lower boundary of the modal class, h is the size of the class interval (the class width), f1 is the frequency of the modal class, f0 the frequency of the preceding class, and f2 the frequency of the following class.
For the small example above, the value 1 (frequency = 2) becomes the class [1 - 0.5, 1 + 0.5] = [0.5, 1.5] and the value 2 (frequency = 1) becomes [2 - 0.5, 2 + 0.5] = [1.5, 2.5], so h = 1.5 - 0.5 = 2.5 - 1.5 = 1. The modal class is [0.5, 1.5], giving mode = 0.5 + 1 * (2 - 0) / ( (2 - 0) + (2 - 1) ) ≈ 1.17; since we only need the integer, we drop the decimal and obtain 1.
Reference: Mode Formula | Calculator (Examples with Excel Template). In that source's worked examples (the underlying frequency tables are not reproduced here), the modal group 60.5-65.5 yields a mode of 61.5, and for a distribution of student heights the modal group 165.5-168.5 yields a mode of 167.35.
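A small helper implementing the grouped-data formula above (a sketch; the function name and argument layout are assumptions, not from the original source):
def grouped_mode(lower_limits, frequencies, h):
    # lower_limits[i]: lower boundary of class i; frequencies[i]: its frequency; h: class width
    i = frequencies.index(max(frequencies))                      # modal class = highest frequency
    f1 = frequencies[i]                                          # frequency of the modal class
    f0 = frequencies[i - 1] if i > 0 else 0                      # frequency of the preceding class
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0   # frequency of the following class
    L = lower_limits[i]                                          # lower boundary of the modal class
    return L + h * (f1 - f0) / ((f1 - f0) + (f1 - f2))

# The tiny example above: classes [0.5, 1.5) and [1.5, 2.5) with frequencies 2 and 1
print(grouped_mode([0.5, 1.5], [2, 1], h=1.0))   # 0.5 + 1*(2-0)/((2-0)+(2-1)) ≈ 1.17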
########################################################
We have used a NumPy array instead of a basic list as the data structure; the reason is that the scikit-learn package is built on top of NumPy arrays, and all of its statistical models and machine learning algorithms operate on NumPy arrays. The mode function is not implemented in the numpy package, hence we have used SciPy's stats package; SciPy is also built on top of NumPy arrays.
The Python code is as follows:
import numpy as np
from statistics import variance, stdev
game_points = np.array( [35,56,43,59,63,79,35,41,64,43,93,60,77,24,82],
dtype=float
)
# Calculate Variance
dt_var = variance(game_points)
print( "Sample variance:", round(dt_var,2) )
# Calculate Standard Deviation
dt_std = stdev(game_points)
print( "Sample std.dev:", round(dt_std,2) )
# Calculate Range
dt_rng = np.max(game_points, axis=0) - np.min(game_points, axis=0)
print( "Range:",dt_rng )
# Calculate percentiles
print("Quantiles:")
for val in [20, 80, 100]:
    dt_qntls = np.percentile( game_points, val )
    print( str(val) + "%", round(dt_qntls, 1) )
# Calculate IQR (Interquartile range)
q75, q25 = np.percentile( game_points, [75 ,25] )
print ( "Inter quartile range:",q75-q25 )
The steps involved in hypothesis testing are as follows:
1. Assume a null hypothesis (usually the status quo or "no effect" statement) together with an alternative hypothesis.
2. Collect a sample of data.
3. Calculate the test statistic from the sample.
4. Compare the test statistic with the critical value from the relevant distribution, or equivalently compare the p-value with the significance level α.
5. Decide whether to reject or fail to reject the null hypothesis.
The Python code is as follows:
from scipy import stats
import numpy as np
xbar = 990
mu0 = 1000 #population's mean
s=12.5 #standard deviation
n=30
# Test Statistic(from the sample)
t_sample = (xbar - mu0) / ( s/np.sqrt(float(n)) )
print("Test Statistic: ", round(t_sample,2))
# Critical value from t-table
alpha = 0.05
t_alpha = stats.t.ppf(alpha, n-1)
print( "Critical value from t-table: ", round(t_alpha, 3) )
# Lower tail p-value from t-table: stats.t.sf is the survival function (1 - cdf) at x of the given RV.
p_val = stats.t.sf( np.abs(t_sample), n-1 )
print( "Lower tail p-value from t-table: ", p_val )
Type I and II errors: Hypothesis testing is usually done on samples rather than the entire population, due to the practical constraints of the resources available to collect all the data. However, performing inferences about the population from samples comes with its own costs, such as rejecting good results or accepting false results; both types of error can be reduced by increasing the sample size:
Type I error vs. Type II error:
Type I error: rejecting the null hypothesis H0 when it is true (H0 is true, but we reject it by mistake because we mistakenly think H0 is false; P = target class, i.e., the null hypothesis is False and we Reject it).
α = probability of a Type I error, known as a "false positive" (FP)
1 - α = probability of a "true negative" (TN), i.e., correctly not rejecting the null hypothesis, since H0 = True (non-false)
Type II error: accepting the null hypothesis H0 when it is false (H0 is false, but we fail to reject it).
β = probability of a Type II error, known as a "false negative" (FN), because we mistakenly think H0 is true (N = non-target class, i.e., the null hypothesis is True and we Accept it)
1 - β = probability of a "true positive" (TP), i.e., correctly rejecting the null hypothesis, since H0 = False
T: true prediction (classified correctly); F: false prediction (classified incorrectly)
P: the instance truly belongs to the target class (the null hypothesis = False, Reject) based on the fact;
N: the instance does not belong to the target class (the null hypothesis = True, Accept) based on the fact
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Precision = TP/(TP+FP) # correctly predicted positives / all predicted positives
Recall = TP/(TP+FN) # correctly predicted positives / all actual positives
https://blog.csdn.net/Linli522362242/article/details/103786116
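A tiny sketch exercising these formulas (the confusion-matrix counts are made-up numbers, purely for illustration):
# Assumed confusion-matrix counts
TP, FP, FN, TN = 40, 10, 5, 45
accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
print("Accuracy :", accuracy)          # 0.85
print("Precision:", precision)         # 0.8
print("Recall   :", round(recall, 3))  # 0.889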
The statistical power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true (since H0 = False, we reject it). It is commonly denoted by 1 - β (the probability of a "true positive" (TP), i.e., correctly rejecting the null hypothesis since H0 = False), and represents the chances of a "true positive" detection conditional on the actual existence of an effect to detect. Statistical power ranges from 0 to 1, and as the power of a test (1 - β) increases, the probability β of making a Type II error by wrongly failing to reject the null hypothesis decreases.
To measure the likelihood of making a Type I or Type II error, we define:
α = probability of a Type I error, known as a "false positive" (FP)
β = probability of a Type II error, known as a "false negative" (FN), because we mistakenly think H0 is true (N = non-target class, i.e., the null hypothesis is True and we Accept it)
Ideally, we want both α and β to be small. However, if we decrease α, we increase β, and vice versa.
If we make α small (increasing 1 - α, TN): when we reject H0, we are quite sure it is false; when we fail to reject H0, we do not claim that H0 is true, because we may have a big chance (large β) of making a Type II error when H1 is in fact true.
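A small sketch of computing power for an upper-tailed one-sample z-test (mu0, mu1, sigma, n, and alpha are assumed illustration values): 1 - β is the probability of rejecting H0 when the alternative is actually true:
from scipy import stats
import numpy as np
# Assumed values: H0: mu = 100 vs. H1: mu = 105, known sigma, upper-tailed z-test
mu0, mu1 = 100, 105
sigma, n, alpha = 15, 50, 0.05
se = sigma / np.sqrt(n)                        # standard error of the sample mean
cutoff = mu0 + stats.norm.ppf(1 - alpha) * se  # reject H0 if the sample mean exceeds this
beta = stats.norm.cdf((cutoff - mu1) / se)     # P(fail to reject H0 | H1 true) = Type II error
power = 1 - beta                               # P(correctly reject H0 | H1 true)
print("beta :", round(beta, 4))
print("power:", round(power, 4))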
Example: Assume that the test scores of an entrance exam fit a normal distribution. Furthermore, the mean test score is 52 and the standard deviation is 16.3. What is the percentage of students scoring 67 or more in the exam?
z = (xbar - mu0)/s = (67 - 52)/16.3 ≈ 0.92
The Python code is as follows:
from scipy import stats
xbar=67
mu0=52
s=16.3
# Calculating z-score
z = (xbar-mu0)/s # z = (67-52)/16.3
# Calculating probability under the curve
p_val = 1 - stats.norm.cdf(z)
print( "Prob. to score more than 67 is", round(p_val*100, 2), "%" )
import pandas as pd
from scipy import stats
survey = pd.read_csv("/content/drive/MyDrive/Numerical Computing/dataset/survey.csv")
survey.head(n=10)
# Tabulating 2 variables with row & column variables respectively
# Compute a simple cross-tabulation of two (or more) factors.
# By default computes a frequency table of the factors
# unless an array of values and an aggregation function are passed.
# If margins is True, will also normalize margin values.
survey_tab = pd.crosstab( survey.Smoke, survey.Exer, margins=True )
survey_tab
When creating a table using the crosstab function, we also obtain an extra field of row (survey.Smoke) and column (survey.Exer) totals (All). However, in order to create the observed table, we need to extract just the variable counts and ignore the totals:
# Creating observed table for analysis
observed = survey_tab.iloc[0:4, 0:3]
observed
The degrees of freedom (df) = (num_rows - 1) * (num_columns - 1) = (4 - 1) * (3 - 1) = 3 * 2 = 6
The chi2_contingency function in the stats package uses the observed table and subsequently calculates its expected table, followed by calculating the p-value in order to check whether two variables are dependent or not.
If the p-value < 0.05 (α), there is a strong dependency between the two variables, whereas
if the p-value > 0.05 (α), there is no dependency between the variables:
contg = stats.chi2_contingency( observed=observed )
# return
# chi2 : float
# The test statistic.
# p : float
# The p-value of the test
# dof : int
# Degrees of freedom
# expected : ndarray, same shape as observed
# The expected frequencies, based on the marginal sums of the table.
contg
Each expected count equals (row total * column total) / grand total; for example, 11 * 98 / 236 = 4.56779661:
# Find the expected values:
pd.DataFrame(contg[3], dtype=float)
The test statistic is 5.488545890584232.
p_value = round(contg[1],3)
print("P-value is: ", p_value)
The p-value is 0.483 > 0.05, which means there is no dependency between the smoking habit and exercise behavior.
Alternatively, check the chi-square table: for df = 6 and α = 0.05 the critical value is 12.592 > 5.488545890584232, and equivalently the p-value P(χ² > 5.488545890584232) is greater than 0.05, so the test statistic does not fall in the rejection region and we cannot reject H0 (i.e., there is no dependency between the variables).
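A brief sketch verifying both numbers used above (the chi-square critical value for df = 6 at α = 0.05, and the expected-count formula for one cell):
from scipy import stats
print(round(stats.chi2.ppf(0.95, df=6), 3))  # 12.592, critical value for df = 6 at alpha = 0.05
print(round(11 * 98 / 236, 6))               # 4.567797, expected count = row total * column total / grand total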
*********************************************************************************************************
Example: A fertilizer company developed three new types of universal fertilizer after research; these can be utilized to grow any type of crop. In order to find out whether all three have a similar crop yield, six crop types were randomly chosen for the study. In accordance with the randomized block design, each crop type will be tested with all three types of fertilizer separately. The following table represents the yields. At the 0.05 level of significance, test whether the mean yields for the three new types of fertilizer are all equal:
The Python code is as follows:
import pandas as pd
from scipy import stats
fetilizers = pd.read_csv( "/content/drive/MyDrive/Numerical Computing/dataset/fetilizers.csv")
fetilizers
# 6 crop types
Calculating one-way ANOVA using the stats package:
################################Business Statistics (2nd Edition) - 道客巴巴 Page: 760, 729
stats.f_oneway :
Perform one-way ANOVA.
The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.
Parameters
sample1, sample2, ... : array_like
The sample measurements for each group.
Returns
statistic : float
The computed F statistic of the test.
pvalue : float
The associated p-value from the F distribution.
Notes
The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid.
The samples are independent.
Each sample is from a normally distributed population.
The population standard deviations of the groups are all equal. This property is known as homoscedasticity.
################################
one_way_anova = stats.f_oneway( fetilizers["fertilizer1"], fetilizers["fertilizer2"], fetilizers.fertilizer3 )
one_way_anova
print("statistic: ", round( one_way_anova[0], 2 ),
", p-value:", round(one_way_anova[1], 3)
)
Result: The p-value came out greater than 0.05, hence we cannot reject the null hypothesis that the mean crop yields of the fertilizers are equal; the fertilizers do not make a statistically significant difference to the crops.
Note: the sample variance is sum((x_i - xbar)^2) / (n - 1). For example:
( (7-5.5)**2 + (3-5.5)**2 + (6-5.5)**2 + (6-5.5)**2 ) / (4-1) = 3.0 for the sample [7, 3, 6, 6] (mean 5.5)
( (6-6)**2 + (5-6)**2 + (5-6)**2 + (8-6)**2 ) / (4-1) = 2.0 for the sample [6, 5, 5, 8] (mean 6)
1 - 0.9991 = 0.0009 < alpha = 0.05
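The two samples in the note above can also serve as toy groups for a manual cross-check of what stats.f_oneway computes (the third group below is an assumed addition): the F statistic is the ratio of the between-group mean square to the within-group mean square:
import numpy as np
from scipy import stats
# Assumed toy groups (not the fertilizer data): k = 3 groups of n = 4 observations each
groups = [np.array([7., 3., 6., 6.]),
          np.array([6., 5., 5., 8.]),
          np.array([5., 6., 5., 7.])]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # variation between group means
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)              # variation within each group
F = (ss_between / (k - 1)) / (ss_within / (N - k))                        # ratio of mean squares
print("Manual F:", round(F, 4))
print("scipy   :", stats.f_oneway(*groups))                               # same F statistic, plus its p-value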
************************************************************************************************************************
Use the t-table with two tails.
************************************************************************************************************************