October 25, 2018
MLNC – Machine Learning Neural Computation – Dr Aldo Faisal
Coursework 1 - Grid World

To be returned via Blackboard as indicated online.

Your coursework should contain: your name, your CID and your degree course at the top of the first page. Your text should provide brief analytical derivations and calculations as necessary in-line, so that the markers can understand what you did. Please use succinct answers to the questions. Your final document should be submitted as a single .zip file, containing one single PDF file, in the format of CID FirstnameLastname.pdf (example: 012345678 JaneXu.pdf), and one single .m file, also in the format CID FirstnameLastname.m. Note that all code that you have written or modified must therefore be within that one Matlab file. Do not submit multiple Matlab files and do not modify other Matlab files. Your Matlab script should contain a function that takes no arguments and is called RunCoursework(), which should produce all the Matlab-based results of your coursework (in clearly labelled text and/or figure output). This function should be able to run on its own, in a clean Matlab installation and directory with only the code we provided for the coursework present.

Please additionally paste the same fully-commented Matlab source code in the appendix of your PDF submission. You are allowed to use all built-in Matlab functions and any Matlab functions supplied by the course or written by you.

The markers may subtract points for badly commented code, code that does not run and code that does not follow the specifications. Figures should be clearly readable, labelled and visible – poor quality or difficult to understand figures may result in a loss of points.

Your coursework should not be longer than 4 single sided pages with 2 centimetre margins all around and 12pt font. You are encouraged to discuss with other students, but your answers should be yours, i.e., written by you, in your own words, showing your own understanding.
You have to produce your own code. If you have questions about the coursework please make use of labs or Piazza, but note that GTAs cannot provide you with answers that directly solve the coursework.

Marks are shown next to each question. Note that the marks are only indicative.

This coursework uses the simple Grid World shown in Figure 1. There are 14 states, corresponding to locations on a grid – two cells (marked in grey) are walls and therefore cannot be occupied. This Grid World has two terminal states, s[...] probability of starting from any of these states).
• Possible actions in this world are N, E, S and W (North, East, South, West), which correspond to moving in the four cardinal directions of the compass.
• The effects of actions are not deterministic, and only succeed in moving in the desired direction with probability p. Alternatively, the agent will move perpendicular to its desired direction, in either adjacent direction, with probability (1 − p)/2 each.
• After the movement direction is determined, and if a wall blocks the agent's path, then the agent will stay where it is; otherwise it will move to the corresponding adjacent state. So for example, in the grid world where p = 0.8, an agent at state s5 which chooses to move north will move north to state s[...] with probability 0.8; will move east to state s[...] with probability 0.1; or will move west, staying in state s[...]

>> [NumStates, NumActions, TransitionMatrix, ...
RewardMatrix, StateNames, ActionNames, AbsorbingStates] ...
= PersonalisedGridWorld(p);

Here NumStates is the number of states in the Grid World, and NumActions is the number of actions the agent can take. The TransitionMatrix is a NumStates × NumStates × NumActions array of specified transition probabilities between (first dimension) successor state, (second dimension) prior state, and (third dimension) action.
RewardMatrix is the NumStates × NumStates × NumActions array of reward values between (first dimension) successor state, (second dimension) prior state, and (third dimension) action. StateNames is a NumStates × 1 matrix containing the name of each state. ActionNames is a NumActions × 1 matrix containing the name of each action. Finally, AbsorbingStates is a NumStates × 1 matrix specifying which states are terminal.

The coursework is personalised by your CID number. Throughout the exercise we set p = 0.5 + 0.5 × x/10 and γ = [...], where x is the penultimate digit of your College ID (CID), and y is the last digit of your CID. If your CID is 876543210 we have x = 1 and y = 0, resulting in p = 0.55 and γ = 0.2.

Questions

Points per question are indicative only. Questions become progressively more challenging.

1. (1 point) State your CID and personalised p and γ (no need to show derivation).
2. (15 points) Assume the MDP is operating under an unbiased policy π_u [...] by any dynamic programming method of your choice. Report your result in the following format: [...]
3. [...]
(a) What is the likelihood that the above observed 3 sequences were generated by an unbiased policy π_u? Report the value of the likelihood.
(b) Find a policy π_M for the observed 3 sequences that has higher likelihood than the likelihood of π_u to have generated these sequences. Report it in the following table format. Note that, as not all states are visited by these 3 sequences, you only have to report the policy for visited, non-transient states. Report your result using the following format: [...]
4. (39 points)
(a) Assume an unbiased policy π_u in this MDP. Generate 10 traces from this MDP and write them out. When writing them out use one line for each trace, use symbols S1, S4, ..., S14, actions N, E, S, W, and the rewards in the following format (please make sure we can easily copy and paste these values from the PDF in one go), e.g.
the output must be in the following format (so that we can copy and paste the text from your PDF into our automatic testing software).
S12,W,-1,S11,N,-1,S9,N,-1,S5,N,-1,S1,N,-1,S1,E,0
S14,E,-1,S10,E,-1,S8,W,-1,S7,S,-1,S6,N,0
(b) Apply First-Visit Batch Monte-Carlo Policy Evaluation to estimate the value function from these 10 traces alone. Report the value function for every non-terminal state (s[...]) in the format specified in Question 2.
(c) Quantify the difference between the value function estimate obtained from Q4.b and the value function V obtained from Q2 by defining a measure that reports in a single number how similar these two value functions are. Justify your choice of measure. Then, plot the value of the proposed similarity measure against the number of traces used. Start plotting the measure using the first trace, then the first and second trace, and so forth. Comment on how increasing the number of traces affects the similarity measure.
5. (20 points)
(a) Implement ε-greedy first-visit Monte Carlo control. Evaluate the learning and control for two settings of ε, namely 0.1 and 0.75. For each setting of ε, plot two types of learning curves:
i. Plot reward against episodes.
ii. Plot trace length per episode against episodes.
Note: An episode is one complete trace. A trial is many episodes starting from an initialisation of the agent. The learning curves are stochastic quantities; you may need to run a good number of repeated learning experiments to average out the variability across trials. Specify the number of trials and plot mean ± standard deviation of your learning curves.
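
The mean ± standard deviation learning curves requested in Question 5 could be assembled along the following lines. This is only a sketch of the summarising and plotting step, not of the learning algorithm itself. The variable rewardPerEpisode is a hypothetical numTrials × numEpisodes matrix (one row per trial, one column per episode), and the numbers in it are placeholders chosen so that the snippet runs on its own; they are not results. The axis label assumes that "reward" means the reward collected in each episode.

```matlab
% Hypothetical input: one row per trial, one column per episode.
% These values are placeholders only, NOT coursework results.
rewardPerEpisode = [-10  -8  -6;
                    -12  -8  -4;
                     -8  -8  -8];

% Mean and (sample) standard deviation across trials, per episode.
meanCurve = mean(rewardPerEpisode, 1);    % average over rows (trials)
stdCurve  = std(rewardPerEpisode, 0, 1);  % 0 selects the N-1 normalisation

episodes = 1:size(rewardPerEpisode, 2);

figure;
hold on;
% Shaded band spanning mean - std to mean + std.
fill([episodes, fliplr(episodes)], ...
     [meanCurve + stdCurve, fliplr(meanCurve - stdCurve)], ...
     [0.8 0.8 1.0], 'EdgeColor', 'none');
% Mean learning curve drawn on top of the band.
plot(episodes, meanCurve, 'b-', 'LineWidth', 1.5);
hold off;
xlabel('Episode');
ylabel('Reward per episode');
title(sprintf('Mean and standard deviation over %d trials', ...
              size(rewardPerEpisode, 1)));
legend('mean +/- std', 'mean');
```

The same pattern would apply to the trace length curves by swapping in a matrix of trace lengths; stating the number of rows (trials) in the title or caption covers the requirement to specify the number of trials.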