A1. Explain what Monte Carlo Simulation is and what it can be used for.
Monte Carlo simulation is a computerized mathematical technique that allows people to account for risk in quantitative analysis and decision making. The method uses repeated random sampling to generate simulated data, which is then used to evaluate a mathematical model or process.
• It is used in modelling where there is a high amount of uncertainty.
• e.g. forecasting, predictive modelling or decision modelling.
• It is used in many cases for assessing what-if scenarios (a small R sketch of this idea follows below).
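A minimal R sketch of the idea (every number here is invented for illustration): simulate the profit of a product with uncertain demand by repeated random sampling, then read the risk off the simulated distribution.
CODE (illustrative sketch):
set.seed(1)
n <- 10000                                  # number of simulated scenarios
demand <- rnorm(n, mean = 1000, sd = 150)   # assumed demand model
profit <- demand * 5 - 3000                 # assumed unit margin and fixed cost
mean(profit)                                # expected profit
quantile(profit, c(0.05, 0.95))             # 90% interval -- the risk view
mean(profit < 0)                            # what-if: probability of a loss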
A2. Explain the difference between structured and unstructured data.
Give examples in your answer.
Structured data is highly organized and formatted so that it is easily searchable, for example in relational databases. Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process and analyze.
• Most of the data we have looked at so far has been structured (data
tables, csv files, databases etc.), but most real-world data is unstructured.
• Unstructured data includes:
• Images, Audio files (including speech) and Videos.
• Text (emails, messages, webpages, news articles, social media,
logs, notes and transcripts, open-ended survey responses,
books, legal documents etc.)
A3. With regards to text analytics, explain the following terms:
B1. Honda has approached your consultancy company asking you to help it forecast the number of PES 125A motorbikes sold in the UK in the next year. It has provided you with the quarterly time series sales data from Q1 2010 to Q1 2016 (B1.csv) for the Honda PES 125A motorbike.
a. Using the read.csv, ts and plot functions in R, import the data, create a time series object and then plot the time series object. From looking at this plot, what can you say about the trend and seasonality of the data? Include the plot in your answer.
CODE:
# import dataset
data <- read.csv("2017B1.csv")
View(data)
# create a quarterly time series object from the sales column
honda <- ts(data = data$Sales, start = c(2010, 1), end = c(2016, 1), frequency = 4)
# plot the data
plot(honda)
From this plot, we can see that sales rise steadily up to 2014 and then decline over the following two years. Any seasonality is hard to make out from the raw plot.
b. Using the decompose function in R, decompose the data with additive decomposition and explain what is shown in the plot. Explain why you think the additive decomposition method was chosen.
# decompose the data with the additive method
de <- decompose(honda, type = "additive")
plot(de)
We choose the additive method because the seasonal fluctuation does not vary over time. In this plot, the first panel shows the original Honda sales time series. The following panels show the trend, seasonal and remainder components. Because the additive method is used, the sum of the last three panels equals the first.
The trend component has been obtained using moving averages, so the first few and last observations are missing. It looks very similar to the initial data, albeit slightly smoother, and it accounts for the largest part of the variation in the data.
The seasonal component is very small and is constant over time. The additive method fits because the seasonal effect for a given quarter is the same every year, i.e. the difference between any two quarters' seasonal values is identical from one year to the next.
The random (remainder) component is the left-over variation that cannot be explained by the trend or seasonal components. In the middle of the decomposition plot the remainder is mostly positive, while on the left and right it is mostly negative. This would indicate that there is some structural variation that is yet to be explained.
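As a quick numerical check of the additive identity described above (the NAs at each end come from the moving-average trend):
CODE (illustrative sketch):
recomposed <- de$trend + de$seasonal + de$random
max(abs(recomposed - honda), na.rm = TRUE)  # ~0: the components sum back to the data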
c. Using the ets() function in the forecast package in R, predict future sales using exponential smoothing for the next year. Set alpha so that you give less weight to more recent observations. Include an image of the forecast in your answer.
CODE:
# forecasting with ets method
library("forecast")
ny <- ets(honda, model = "ZZZ", alpha = 0.2)  # small alpha gives less weight to recent observations
pr <- predict(ny, h = 4)  # forecast the next four quarters (one year)
plot(pr)
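The numeric forecasts can also be read off the object returned by predict():
pr$mean    # point forecasts for the four quarters
pr$lower   # lower prediction interval bounds
pr$upper   # upper prediction interval bounds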
B2. A member of the sales team in the company you work for has asked for your help in working out the best route to travel to visit 6 of his clients. He wants to visit each client once but spend the least amount of time travelling. The matrix below shows the time in minutes it takes to travel between each of the six clients and your office location:
a. Using a genetic algorithm in Excel or R calculate the route with the shortest travelling time. Include your workings, the route and total time taken in your answer.
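No worked answer is recorded here, so the following is only a sketch of the R approach using the GA package. The travel-time matrix below is a placeholder: substitute the 7x7 matrix of minutes from the question (office plus six clients). The route is assumed to start and end at the office (location 1).
CODE (illustrative sketch):
library(GA)
set.seed(1)
# PLACEHOLDER travel times -- replace with the matrix from the question
times <- matrix(sample(10:60, 49, replace = TRUE), nrow = 7)
times[lower.tri(times)] <- t(times)[lower.tri(times)]  # make symmetric
diag(times) <- 0
# total time of a tour starting/ending at the office (location 1);
# perm is an ordering of the six clients (locations 2..7)
tourTime <- function(perm) {
  route <- c(1, perm + 1, 1)
  sum(times[cbind(route[-length(route)], route[-1])])
}
# GA over permutations; the fitness is negated because ga() maximizes
res <- ga(type = "permutation", fitness = function(p) -tourTime(p),
          lower = 1, upper = 6, popSize = 50, maxiter = 200)
best <- res@solution[1, ]
c(1, best + 1, 1)   # best route found
tourTime(best)      # total travelling time in minutes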
b. Explain the general procedure for a genetic algorithm giving examples of the crossover and mutation stages.
The general procedure of a Genetic Algorithm is as follows:
1. Define an end condition (time or number of iterations).
2. Generate a random population of chromosomes.
3. Evaluate fitness of each chromosome in the population.
4. Create a new population by repeating the following steps until a new
population is complete:
• Select two parent chromosomes from the population according to
their fitness.
• Crossover, also called recombination, is a genetic operator used to combine the genetic information of two parents to generate new offspring stochastically. There are a number of crossover techniques; the most common for binary-encoded chromosomes is single-point crossover (take the front segment from parent A and the back segment from parent B to form the offspring): select a random cut-off point and form a new offspring by joining the part of parent A before the cut point to the part of parent B after it. E.g. A: 10001|011 and B: 01101|110 produce the offspring 10001110 (see the short sketch after this list).
• Randomly mutate the offspring. The mutation stage consists of a small alteration to the new offspring, for example 10001110 mutates to 10101110. The probability of this happening to each individual bit of the chromosome is set by the decision-maker or analyst, and is generally fixed below 0.1 (< 10%). The mutation probability contributes to the stochastic nature of the algorithm.
• Place the offspring into the population.
5. Evaluate fitness of each chromosome in the population.
6. If the end condition is met, return the best solution(s) in the current
population.
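A minimal R sketch of the two operators on binary chromosomes (illustrative only; the 0.05 mutation rate is an assumed parameter):
CODE (illustrative sketch):
set.seed(42)
crossover <- function(a, b) {
  cut <- sample(1:(length(a) - 1), 1)    # random cut-off point
  c(a[1:cut], b[(cut + 1):length(b)])    # front of A joined to back of B
}
mutate <- function(chrom, p = 0.05) {    # per-bit probability, p < 0.1
  flips <- runif(length(chrom)) < p
  chrom[flips] <- 1 - chrom[flips]       # flip the selected bits
  chrom
}
A <- c(1, 0, 0, 0, 1, 0, 1, 1)
B <- c(0, 1, 1, 0, 1, 1, 1, 0)
offspring <- crossover(A, B)
mutate(offspring)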
B3. A medical company has approached your consultancy business and asked for help with predicting if pregnant women will give birth to low birth weight babies. It has collected the following data (B3-train.csv) on 150 women:
age: Age of the mother in years.
previousweight: Weight in pounds at the last menstrual period
smoke: Smoking status during pregnancy
(1 = Yes, 0 = No)
low: Low birth weight
(0 = Birth Weight >= 2500g,
1 = Birth Weight < 2500g)
a. Using read.csv, as.factor and the naiveBayes function from the e1071 package in R, import the data (B3-train.csv) and train a Naïve Bayes Classifier with low as x and age, previous_weight and smoke as y. Provide and explain the A-priori and conditional probabilities.
Code:
library("e1071")
train <- read.csv("2017B3-train.csv")
View(train)
# train a naive Bayes classifier with low as the class and the other variables as predictors
train$low <- as.factor(train$low)
train$smoke <- as.factor(train$smoke)
nb.cscore <- naiveBayes(low ~ age + previousweight + smoke, data = train)
# A-priori and conditional probabilities
nb.cscore[["apriori"]]
nb.cscore[["tables"]][["age"]]
nb.cscore[["tables"]][["smoke"]]
nb.cscore[["tables"]][["previousweight"]]
A-priori probabilities: in the training data, the probability of a baby's birth weight being < 2500 g is 0.2867, and the probability of it being >= 2500 g is 0.7133. These are simply the class proportions in the training set.
Conditional probabilities: for each numeric predictor, these tables show the mean and standard deviation within each class. For example, the mean age of women whose baby's birth weight is < 2500 g is 22.51 years with a standard deviation of 5.573 years, and so on. For the categorical predictor smoke, the table shows the proportion of smokers and non-smokers within each class.
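These values can be verified directly from the training data; a quick sanity check (assuming the column names used above):
CODE (illustrative sketch):
table(train$low) / nrow(train)         # a-priori class probabilities
tapply(train$age, train$low, mean)     # class-conditional means of age
tapply(train$age, train$low, sd)       # ...and standard deviations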
b. The medical company was able to get data for another 39 women (B3-test.csv).
Using your model from answer a, predict the value for low using the data in B3-test.csv. Use the read.csv and the predict functions in R to do this.
How do the predictions from the Naïve Bayes Classifier compare to the real results for low in B3-test.csv?
CODE:
#load the test data
test <- read.csv("2017B3-test.csv")
View(test)
# choose the variables we need to predict low
testdata <- subset(test, select = c("age","previousweight","smoke"))
# prediction (named pred so it does not mask the predict() function)
pred <- predict(nb.cscore, testdata)
# put the predicted and actual values side by side
data.frame(prediction = pred, actual = test$low)
# how many were predicted successfully?
sum(pred == test$low)
# calculate the accuracy
sum(pred == test$low) / 39
RESULTS:
The accuracy of this model on the test data is about 54%.
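Raw accuracy hides where the errors fall; a confusion matrix built from the objects above gives a fuller comparison:
CODE (illustrative sketch):
# rows = predicted class, columns = actual class
table(predicted = pred, actual = test$low)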