Consider playing Tic-Tac-Toe against an opponent who plays randomly.
Consider playing Tic-Tac-Toe against an opponent who plays randomly. In particular, assume the opponent chooses with uniform probability any open space, unless there is a forced move (in which case it makes the obvious correct move).

(a) Formulate the problem of learning an optimal Tic-Tac-Toe strategy in this case as a Q-learning task. What are the states, transitions, and rewards in this nondeterministic Markov decision process?

(b) Will your program succeed if the opponent plays optimally rather than randomly?
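To make the setup in part (a) concrete, here is a minimal Python sketch of one possible formulation. It is not a definitive answer. Everything in it is an assumption that goes beyond the question text: the learner plays X and moves first, a state is the board as seen on the learner's turn, a transition is the learner's move followed by the opponent's stochastic reply, rewards are +1 for a win, -1 for a loss and 0 otherwise, and a "forced move" is read as "complete your own line if you can, otherwise block the learner's line". The function names (`opponent_move`, `step`, `train`) and the hyperparameters are hypothetical. The learning rate decays with the visit count, which is the usual choice for a nondeterministic MDP.

```python
import random
from collections import defaultdict

# The eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that mark has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def open_cells(board):
    return [i for i, v in enumerate(board) if v == ' ']

def place(board, cell, mark):
    return board[:cell] + (mark,) + board[cell + 1:]

def completing_cell(board, mark):
    """Return a cell that would give `mark` three in a row, or None."""
    for cell in open_cells(board):
        if winner(place(board, cell, mark)) == mark:
            return cell
    return None

def opponent_move(board, rng):
    """Assumed opponent policy: win if possible, else block, else uniform random."""
    for mark in ('O', 'X'):
        cell = completing_cell(board, mark)
        if cell is not None:
            return cell
    return rng.choice(open_cells(board))

def step(board, action, rng):
    """Learner (X) moves, then the opponent (O) replies.

    Returns (next_state, reward, done). The opponent's random reply is
    what makes the transition nondeterministic from the learner's view.
    """
    board = place(board, action, 'X')
    if winner(board) == 'X':
        return board, 1.0, True
    if not open_cells(board):
        return board, 0.0, True  # draw
    board = place(board, opponent_move(board, rng), 'O')
    if winner(board) == 'O':
        return board, -1.0, True
    return board, 0.0, not open_cells(board)

def train(episodes=20000, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (assumed settings)."""
    rng = random.Random(seed)
    q = defaultdict(float)     # Q[(state, action)], initialised to 0
    visits = defaultdict(int)  # visit counts for the decaying learning rate
    for _ in range(episodes):
        state, done = (' ',) * 9, False
        while not done:
            actions = open_cells(state)
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action, rng)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in open_cells(nxt))
            visits[(state, action)] += 1
            alpha = 1.0 / (1 + visits[(state, action)])
            target = reward + gamma * best_next
            q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * target
    return q
```

Calling `q = train()` returns the learned table, and a greedy policy picks `max(open_cells(s), key=lambda a: q[(s, a)])` in each state `s`. To experiment with part (b), `opponent_move` is the only function that would need to change.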