Hina Sakazaki, Software Engineer, Dialogflow NLU at Google presents: "Adapting ML Research to Make Training AI For Games Fun"
I started programming in middle school, if you count HTML on Neopets and other platforms. Serious coding started in college after playing a game called Portal. I graduated from UC Berkeley with a degree in Computer Science and Economics in 2015. Shortly after, I joined a mobile gaming company called Zynga, where I worked on titles such as Words With Friends and Dawn of Titans. I joined YouTube Trust and Safety as a full-stack engineer in 2018. I then joined Google Research as the fifth engineer on the Synthetic Players project, which is what I'm going to be talking about today. We wrapped up the project in July 2021. I currently work on natural-language understanding, on an AI quality team at Google Cloud Dialogflow, where I build conversational agents.
Game development is bigger and more accessible than ever, with free-to-use game engines and assets, technological advancements such as procedural generation of levels, indie-friendly publishers, and a large audience of gamers on various devices. Game developers around the world are building enormous game experiences that players explore through hundreds of hours of play. In addition to making the games, game developers must test them before they are released to the world, to iron out bugs before players find them. This is typically a manual QA process in which a large number of people play the game to find bugs. It gets more complicated as the game gains more elements, like multiplayer or a vast world with many combinations of environmental factors. Manual QA is a bottleneck that cannot scale with the complexity of modern games. It leads to delayed launches and lower-quality products, and it is a huge money sink.
We started our service using our open-source project. We started the game and then trained the AI. You can download our open-source project to play around on your own. Our team at Google Research had a goal of directly addressing this slowdown by creating a service that allows game developers to train an AI that can play and test their games at scale, quickly and easily. Game companies have strict requirements on where their binaries are run, whereas machine-learning researchers often run their target game and machine-learning code in the same data center. We had to build the service for game developers, who often don't have experience working with machine learning and expect us to provide APIs that are simulation-centric, not data-centric. We also had to build for an existing ecosystem. Our service had to fit the production requirements of game development, where a game developer could quickly retrain the AI iteratively as their game evolved.
Most research projects are centered around a core question and a hypothesis. We broke our problem down and figured out solutions piece by piece. The first thing we had to solve was that our target games would not run in our data centers, or even in any data center at all. Game developers are quite cautious about where their game binaries are run, and those locations are not guaranteed to have performant equipment. The games could be run on the developer's machine, the game company's cloud VMs, game consoles, or mobile phones, all with varying capabilities. The core requirement when an AI plays a game is that it must do so interactively. The game state is provided as an input to the AI. The AI calculates output actions that then feed into the game. In order to play the game interactively, this process must happen within 33 milliseconds, roughly one frame at 30 frames per second. At the same time, for an AI to learn to play the game, processing data and training the AI requires a substantial amount of computing power.
We implemented the actor-learner pattern, which splits the AI into two parts: the actor, which plays the game, and the learner, which figures out how to play the game. The actor runs the AI: it takes input from the game and generates output to play the game. We placed the actor on the client machine as an SDK (software development kit), which game developers integrate into their game to share the game state and feed the AI's actions back into the game. The SDK communicates with our service, the learner. Training the AI happens in the learner, where game data is processed through intensive mathematical computation to learn behaviors. We placed the learner in the cloud, where it can be scaled across multiple specialized, cost-efficient machines hidden away from game developers. The on-device model in the actor is updated to get better or try new things, while the actor continuously batches and sends data so the learner can learn from gameplay happening in real time. The actor-learner pattern is a proven and well-known architecture in applied machine learning.
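To make the split concrete, here is a minimal single-process sketch of the actor-learner pattern. It is not the project's SDK: the class names, the two-action toy game, and the "running mean reward" training rule are all assumptions made for illustration. The point is only the data flow: the actor acts from a local model copy, ships batches to the learner, and periodically pulls a newer model.

```python
import queue
import random


class Learner:
    """Stand-in for the cloud service: consumes batches, publishes new models."""

    def __init__(self, actions):
        self.inbox = queue.Queue()
        self.totals = {a: 0.0 for a in actions}
        self.counts = {a: 0 for a in actions}
        self.version = 0

    def train_on_pending(self):
        # Placeholder "training" (an assumption): running mean reward per action.
        while not self.inbox.empty():
            for action, reward in self.inbox.get():
                self.totals[action] += reward
                self.counts[action] += 1
            self.version += 1

    def latest_model(self):
        means = {
            a: (self.totals[a] / self.counts[a] if self.counts[a] else 0.0)
            for a in self.totals
        }
        return self.version, means


class Actor:
    """Stand-in for the in-game SDK: acts locally, sends gameplay data in batches."""

    def __init__(self, learner, batch_size=8, explore=0.2, seed=0):
        self.learner = learner
        self.batch_size = batch_size
        self.explore = explore
        self.rng = random.Random(seed)
        self.version, self.model = learner.latest_model()
        self.buffer = []

    def act(self):
        # Inference touches only the local model copy, never the network.
        actions = sorted(self.model)
        if self.rng.random() < self.explore:
            return self.rng.choice(actions)  # try new things
        return max(actions, key=self.model.get)

    def record(self, action, reward):
        self.buffer.append((action, reward))
        if len(self.buffer) >= self.batch_size:
            self.learner.inbox.put(self.buffer)
            self.buffer = []

    def sync(self):
        self.version, self.model = self.learner.latest_model()


def toy_game_reward(action):
    # Hypothetical game in which "jump" is always the right move.
    return 1.0 if action == "jump" else 0.0


def run(steps=200):
    learner = Learner(["walk", "jump"])
    actor = Actor(learner)
    for step in range(steps):
        action = actor.act()
        actor.record(action, toy_game_reward(action))
        if step % 16 == 15:  # in a real system this would run asynchronously
            learner.train_on_pending()
            actor.sync()
    return actor
```

In a real deployment the learner would run remotely and asynchronously; the synchronous loop in `run` just keeps the sketch deterministic and easy to follow.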
The actor-learner architecture allows our learners to train models from game data. We provide data about the game through a game-centric API that specifies three things. First, there are observations, which are the game state that the player sees at any given moment in time. Then there are actions, which are logical interactions that the player can perform, like walking or jumping. Finally, rewards are numerical values that tell the AI how well it's doing, like 'got three points' or 'lost two points'.
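As a rough illustration of what a game-centric description of observations, actions, and rewards might look like, here is a small sketch. The `GameSpec` and `Step` types and their field names are hypothetical and are not taken from the project's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One moment of gameplay as reported by the game (hypothetical shape)."""
    observation: Dict[str, float]  # what the player can see right now
    action: str                    # the logical interaction that was performed
    reward: float = 0.0            # e.g. +3.0 for "got three points"


@dataclass
class GameSpec:
    """Declares which observations and actions the game exposes."""
    observation_keys: List[str]
    actions: List[str]

    def validate(self, step: Step) -> None:
        missing = set(self.observation_keys) - set(step.observation)
        if missing:
            raise ValueError(f"missing observations: {sorted(missing)}")
        if step.action not in self.actions:
            raise ValueError(f"unknown action: {step.action!r}")


spec = GameSpec(
    observation_keys=["player_x", "distance_to_gap"],
    actions=["walk", "jump"],
)
good_step = Step({"player_x": 4.0, "distance_to_gap": 0.5}, "jump", reward=3.0)
spec.validate(good_step)  # passes silently
```

The description is phrased in terms a game developer already thinks in (what the player sees and does), rather than in terms of tensors or datasets.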
Reinforcement learning (RL), where the AI learns to play a game autonomously through reward and penalty signals, has been the go-to technique for training AI to play games. RL is basically how humans play games or do anything: we interact with the environment, get feedback, and keep trying things. The primary issue with RL for our requirements was that RL algorithms require orders of magnitude more data than our developers could provide, and they also have unpredictable outcomes. Another non-trivial issue is that reward shaping, or defining rewards and punishments, is pretty tricky to get right even for a machine-learning expert. A bad reward or punishment can easily confuse the AI during training.
We found more success with learning by watching a human play rather than autonomously learning from reward signals. We used a technique called behavioral cloning, a method that applies supervised learning to state-action pairs. Algorithmically, this is simpler than reinforcement learning. We know the desired state-action pairs and learn the pattern in them, rather than finding out whether a state-action pair was a good one by getting punished or rewarded for it. With this approach, game developers are able to generate an AI that acts similarly to the player. The demonstrated behavior provides the training data.
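Behavioral cloning as supervised learning on state-action pairs can be sketched in a few lines. This example uses a nearest-centroid classifier as a deliberately simple stand-in for whatever model the service actually trained; the one-dimensional `distance_to_gap` state and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def fit_behavioral_clone(demos):
    """Fit a policy from (state, action) pairs, where state is a tuple of floats.

    The "model" is the mean demonstrated state for each action.
    """
    sums = {}
    counts = defaultdict(int)
    for state, action in demos:
        if action not in sums:
            sums[action] = [0.0] * len(state)
        for i, value in enumerate(state):
            sums[action][i] += value
        counts[action] += 1
    return {a: tuple(v / counts[a] for v in sums[a]) for a in sums}


def predict(policy, state):
    """Pick the action whose demonstrated states are closest to this state."""
    return min(policy, key=lambda action: math.dist(policy[action], state))


# Hypothetical demonstrations: state is (distance_to_gap,).
demos = [
    ((0.9,), "walk"),
    ((0.8,), "walk"),
    ((0.2,), "jump"),
    ((0.1,), "jump"),
]
policy = fit_behavioral_clone(demos)
```

No reward signal appears anywhere: the only supervision is what the human did in each state, which is why this is simpler to set up than reinforcement learning.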
In human-in-the-loop training, the game developer watches the AI play the game, and when it starts behaving incorrectly, they can jump in and provide a demonstration of what to do in such situations. The additional data helps train a higher-quality AI that can handle similar situations in the future. This is a fun flow for game developers. They provide demonstrations, watch the AI play, correct it, and then see the AI succeed in situations similar to those where it failed before, all in real time and interactively. This is roughly what is called continuous imitation learning, and it's known for its sample efficiency, which is exactly what we wanted.
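The demonstrate, watch, correct loop can be sketched as follows. This is an illustration under simplifying assumptions: the `expert` callable stands in for the developer taking over the controls, states are plain strings, and the "learner" is a lookup table that retrains instantly.

```python
def train(demos):
    # Placeholder learner (an assumption): remember the demonstrated action per state.
    return dict(demos)


def human_in_the_loop(states, expert, default_action="walk", max_rounds=10):
    """Let the AI play; whenever it acts wrongly, record a human correction."""
    demos = []
    policy = train(demos)
    for _ in range(max_rounds):
        corrections = 0
        for state in states:
            ai_action = policy.get(state, default_action)
            wanted = expert(state)
            if ai_action != wanted:
                # The developer jumps in and demonstrates the right move.
                demos.append((state, wanted))
                corrections += 1
                policy = train(demos)  # retrain right away on the new data
        if corrections == 0:
            break  # the AI played the whole level without help
    return policy, demos


level = ["flat", "gap", "flat", "enemy", "gap"]
expert_moves = {"flat": "walk", "gap": "jump", "enemy": "attack"}
policy, demos = human_in_the_loop(level, expert_moves.get)
```

Note that the developer only supplies demonstrations for situations where the AI actually fails, which is what keeps the amount of required data small.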
Unfortunately, pure behavioral cloning only tells us how good the model is at predicting the state-action mapping, not how good the AI is at playing the game. We added an autonomous learning step that the game developer could trigger. The AI autonomously interacts with the environment, gathering success or failure metrics to learn which version of itself is best at playing the game. This takes the RL concept of making a decision based on a reward signal but applies it at a different layer: model selection. When all the training completes, we have a set of trained versions of the AI that are ready to evaluate. The final version of the AI can then be used to run tests and detect bugs in the game.
Once you see that the AI is pretty good at playing the game, the training phase is complete. Then you let it run in evaluation mode, where you have defined a binary success condition and a timeout. The AI plays the game autonomously using different versions of itself while you step away from your desk and go home for the day. When you come back the next morning, the best version of the model has been selected and output. You can use this version to run future tests, such as multiplayer testing.
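The evaluation step described above, picking the best model version by success rate under a timeout, can be sketched like this. The one-dimensional toy game, the step-count timeout, and the three named policy versions are all assumptions made for illustration; the real service evaluated trained models against the developer's actual game.

```python
import random

GOAL = 5  # hypothetical success condition: reach position 5


def step(position, action):
    return position + (1 if action == "forward" else -1)


def play_episode(policy, rng, timeout_steps=20):
    """Return True if the binary success condition is met before the timeout."""
    position = 0
    for _ in range(timeout_steps):
        position = step(position, policy(position, rng))
        if position >= GOAL:
            return True
    return False


def select_best(versions, episodes=50, seed=0):
    """Play every version autonomously and keep the one that succeeds most often."""
    rng = random.Random(seed)
    rates = {}
    for name, policy in versions.items():
        wins = sum(play_episode(policy, rng) for _ in range(episodes))
        rates[name] = wins / episodes
    best = max(rates, key=rates.get)
    return best, rates


# Three stand-in "model versions" of differing quality.
versions = {
    "always_back": lambda position, rng: "back",
    "coin_flip": lambda position, rng: rng.choice(["forward", "back"]),
    "always_forward": lambda position, rng: "forward",
}
best, rates = select_best(versions)
```

The success signal is only used to rank finished versions against each other, not to update any model's weights, which is the sense in which the reward idea is applied at the model-selection layer.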
After encountering issues while debugging with game developers, we decided we needed to implement an emergency feature. I implemented a global allocator override function. The game developer defines their own allocation functions, which report to their memory management system, and passes them in through our API. With this feature fully integrated into a game that used a custom memory management system, we were able to run the game without any crashes, which is crucial for getting anything working.
Building a machine-learning service can be intimidating given the sheer velocity of research in the field, especially when real-world requirements diverge from academic assumptions. We had the unique constraints of not having access to the game binary, needing to make the tool work for game developers with no ML experience, and actually training the AI to play the game in a way that works with the game development workflow. To build a product for a real audience, you can't simply copy what a research paper has done. You have to pick and choose different techniques to create an ideal experience for your audience. Finally, let's not forget that a real product needs to work for real customers.