This paper extends PMRL, a non-communicative, theoretically grounded method for two agents, and proposes PLA, a method that forces agents to learn cooperative behavior for any number of agents. In addition, the paper gives a theoretical explanation of why, under PLA, all agents achieve all purposes without spending the largest time. Concretely, PLA limits the purposes each agent can achieve so that the agent avoids the more difficult purposes, which take a long time to reach, and it forces the agents to learn a cooperative policy by having each one achieve an appropriate purpose among the limited ones. The experimental results show that (1) PLA enables the agents to learn a cooperative policy in two grid world problems with three and five agents, and (2) PLA can force all agents to achieve all purposes in these problems in the minimum time.
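The objective stated in the abstract, that all agents achieve all purposes in the minimum time, can be illustrated with a small sketch. This is not PLA itself, which is a learning method whose details the abstract does not give. It is a brute-force search for the agent-to-purpose assignment that minimizes the largest travel time. The grid coordinates, the use of Manhattan distance as the time to reach a purpose, and the function names are all assumptions made for illustration.

```python
from itertools import permutations


def manhattan(a, b):
    """Grid distance between two cells; assumed stand-in for travel time."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def min_makespan_assignment(agents, goals):
    """Return (makespan, assignment) minimising the largest agent-to-goal time.

    assignment[i] is the index of the goal given to agent i.
    Assumes len(agents) == len(goals). Brute force, so only for small cases.
    """
    best = None
    for perm in permutations(range(len(goals))):
        # Time until the slowest agent reaches its assigned goal.
        makespan = max(manhattan(agents[i], goals[g]) for i, g in enumerate(perm))
        if best is None or makespan < best[0]:
            best = (makespan, perm)
    return best


# Hypothetical three-agent grid world (positions are illustrative only).
agents = [(0, 0), (0, 4), (4, 0)]
goals = [(4, 4), (0, 1), (3, 0)]
makespan, assignment = min_makespan_assignment(agents, goals)
print(makespan, assignment)
```

In this toy layout, the nearest goal for the first two agents is the same one, so purely selfish choices would conflict. Restricting which purpose each agent may take is what resolves the conflict, which is the intuition the abstract attributes to PLA.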
IEEJ Transactions on Electronics, Information and Systems
Published - 2020