Yesterday, California-based AI company Adept announced Action Transformer (ACT-1), an AI model that can perform actions in software like a human assistant when it receives high-level written or verbal commands. It can operate web applications and perform intelligent searches on websites while clicking, scrolling and typing in the right fields as if it were a person using the computer.
In a demo video tweeted by Adept, the company shows someone typing “Find me a house in Houston that’s suitable for a family of 4. My budget is $600,000” into a text entry box. Once the request is submitted, ACT-1 automatically navigates Redfin.com in a web browser, clicking on the appropriate regions of the website, typing a search entry, and adjusting the search parameters until a matching house appears on the screen.
1/7 We have built a new model! It’s called Action Transformer (ACT-1) and we taught it to use a bunch of software tools. In this first video, the user simply types in a high-level query and ACT-1 does the rest. Read on to see more examples ⬇️ pic.twitter.com/mq7c0Vyd7N
— Adept (@AdeptAILabs) September 14, 2022
Another demo video on Adept’s website shows ACT-1 using Salesforce with prompts such as “add Max Nye to Adept as a new prospect” and “record a call with James Veel saying he’s considering buying 100 widgets.” ACT-1 then clicks the correct buttons, scrolls, and fills out the appropriate forms to complete these tasks. Other demo videos show ACT-1 navigating Google Sheets, Craigslist, and Wikipedia in a browser.
How is this possible? Adept describes ACT-1 as a “large-scale Transformer.” In AI, a transformer model is a type of neural network that learns to perform a task by training on sample data, picking up the context and relationships between items in the dataset along the way. Transformers have been behind many recent innovations in AI, including language models like GPT-3 that can write at an almost human level.
In the case of ACT-1, the training data apparently came from humans using the software first, and the AI model learned from that. Someone who identified himself as a developer for ACT-1 on Hacker News wrote: “We used a combination of human demonstrations and feedback data! You need custom software both to record the demos and to represent the state of the tool in a way the model can consume.”
After training, the ACT-1 model interacts with a web browser through a Chrome extension that can “observe what is happening in the browser and perform certain actions, such as clicking, typing, and scrolling,” according to Adept. The company describes what ACT-1 observes as generalizing across all websites, so rules learned on one site can apply to others.
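Adept has not published how the extension is built, but the observe-then-act loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Action` record, the `FakePage` stand-in for a browser tab, and the hand-written `scripted_policy` that takes the place of the trained model are hypothetical names, not Adept’s code or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """One browser action (hypothetical format, not Adept's)."""
    kind: str          # "click", "type", or "scroll"
    target: str = ""   # illustrative element id
    text: str = ""     # text to enter for a "type" action
    amount: int = 0    # pixels to move for a "scroll" action


@dataclass
class FakePage:
    """In-memory stand-in for a browser tab, so the sketch runs anywhere."""
    fields: dict = field(default_factory=dict)
    clicked: list = field(default_factory=list)
    scroll_y: int = 0

    def observe(self) -> dict:
        # Return a snapshot of the page state for the policy to inspect.
        return {
            "fields": dict(self.fields),
            "clicked": list(self.clicked),
            "scroll_y": self.scroll_y,
        }

    def perform(self, action: Action) -> None:
        # Apply one of the three action types named in the article.
        if action.kind == "click":
            self.clicked.append(action.target)
        elif action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "scroll":
            self.scroll_y += action.amount
        else:
            raise ValueError(f"unknown action kind: {action.kind}")


def scripted_policy(observation: dict) -> Optional[Action]:
    """Hand-written rules standing in for the trained model (assumption)."""
    if "search" not in observation["fields"]:
        return Action("type", target="search", text="Houston")
    if "submit" not in observation["clicked"]:
        return Action("click", target="submit")
    if observation["scroll_y"] < 400:
        return Action("scroll", amount=400)
    return None  # nothing left to do


def run_episode(page: FakePage,
                policy: Callable[[dict], Optional[Action]],
                max_steps: int = 10) -> int:
    """Alternate observing and acting until the policy stops or steps run out."""
    steps = 0
    for _ in range(max_steps):
        action = policy(page.observe())
        if action is None:
            break
        page.perform(action)
        steps += 1
    return steps


if __name__ == "__main__":
    page = FakePage()
    print(run_episode(page, scripted_policy), page.observe())
```

In a real system, the policy would be a model choosing actions from what it observes rather than a fixed script, and the page would be a live browser tab instead of an in-memory object.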
While scripts that automate navigation already exist (and are often used to power bots with malicious intent), the powerful, general-purpose nature of ACT-1 shown in the demos seems to take machine automation to a new level. Already, people on Twitter are raising alarms, some seriously and some half-jokingly, about the potential for misuse that this technology could bring. Should we allow an intelligent system to have so much control over our computer interfaces?
While these concerns are purely hypothetical at this time, especially since ACT-1 does not operate autonomously, they should be kept in mind as we rush headlong towards generalized human-level AI that can interface with the outside world via the Internet. Adept even references this goal on its website, writing, “We believe the clearest framework of general intelligence is a system that can do anything a human can do in front of a computer.”