Amazon’s new Just Walk Out combines Transformers and Edge

We were among a select group of journalists invited into Amazon’s secret lab, located on the first floor of an industrial modern office building, to see their latest Just Walk Out (JWO) technology.

Currently used in over 170 retailers around the world, JWO allows customers to enter a store, select products and leave without having to pay at the register, streamlining the shopping experience.

Enter a new AI-based system developed by Amazon that uses multimodal foundation models and transformer-based machine learning to analyze data from the various sensors in a store simultaneously. Yes, this is the same underlying technology used in large language models such as GPT, but instead of generating text, these models generate receipts. This upgrade improves accuracy in complex shopping scenarios and makes it easier for retailers to adopt the technology.

Our host is John Jenkins (JJ), Amazon’s vice president of JWO, and he leads us past a small group of Amazon employees sipping coffee in the lobby, through a glass security gate, down a short, dark hallway, and to an inconspicuous door. Inside, we find ourselves standing in an exact replica of a local grocery store, complete with shelves of chips and candy, and refrigerators stocked with Coca-Cola, Vitamin Water, Orbit gum, and various other sundries.

Aside from the electronic gates and the overhead bars holding Amazon’s special 4-in-1 camera devices, the lab store looks like a completely normal retail store, minus the cashiers.

Photo: We weren’t able to take photos in the lab, but here’s the actual JWO store across the square

How JWO works

JWO (pronounced “Jay-Wo” inside Amazon) combines computer vision, sensor fusion and machine learning to track what shoppers take from and put back on store shelves. Setting up a store starts with creating a 3D map of the physical space using a regular iPhone or iPad.

The store is divided into product areas called “polygons,” discrete spaces that correspond to product inventory. Custom cameras are mounted on a rail system suspended from the ceiling, and weight sensors are placed at the front and back of each polygon.

Photo: In the actual JWO store, cameras and sensors hang from the ceiling.

JWO tracks the orientation of the head, left hand, and right hand to detect when the user interacts with a polygon. By fusing inputs from multiple cameras and weight sensors with object recognition, the model predicts with great accuracy whether a particular item has been held by a shopper.
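
A toy, rule-based illustration of the idea above: check whether a tracked hand falls inside a polygon, then use the change in shelf weight to estimate how many units were taken or put back. Everything here is hypothetical (the `Polygon` class, the 2D floor-plan coordinates, the per-unit weight, the function names). The production system uses a learned model over camera and sensor inputs, not hand-written rules like these.

```python
from dataclasses import dataclass


@dataclass
class Polygon:
    """A product area on the store map (hypothetical 2D simplification)."""
    product: str
    vertices: list[tuple[float, float]]  # corners in floor-plan coordinates
    unit_weight_g: float                 # assumed weight of one unit


def contains(polygon: Polygon, point: tuple[float, float]) -> bool:
    """Ray-casting point-in-polygon test."""
    x, y = point
    inside = False
    verts = polygon.vertices
    for i in range(len(verts)):
        x1, y1 = verts[i]
        x2, y2 = verts[(i + 1) % len(verts)]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def items_taken(polygon: Polygon, hand_xy: tuple[float, float],
                weight_delta_g: float) -> int:
    """Estimate units taken (positive) or put back (negative).

    Returns 0 when the hand is outside the polygon, so a weight change
    there is not attributed to this shopper.
    """
    if not contains(polygon, hand_xy):
        return 0
    return round(-weight_delta_g / polygon.unit_weight_g)
```

For example, a hand inside a polygon stocked with 50 g items, combined with a 100 g drop in shelf weight, yields an estimate of two units taken.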

JJ explains that the system previously used multiple models in a chain to handle different aspects of the shopping experience: “Previously, we would run these models in a chain: Did you interact with the product space? Yes. Does the product match what we expected? Yes. Did you take one, did you take two? Did you end up putting the product back? Running it in a chain is slower, less accurate, and more expensive.”
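
To see why chaining can hurt, consider a back-of-the-envelope sketch. If every stage in a chain must be right for the receipt to be right, and errors are independent, the per-stage accuracies multiply while the latencies add. The stage accuracies and latencies below are made-up placeholders, not Amazon figures, and the independence assumption is a simplification.

```python
import math


def chain_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent errors."""
    return math.prod(stage_accuracies)


def chain_latency_ms(stage_latencies_ms):
    """Stages run one after another, so their latencies add up."""
    return sum(stage_latencies_ms)


# Four hypothetical stages: interaction, product match, quantity, put-back.
stages = [0.98, 0.98, 0.98, 0.98]
latencies_ms = [20, 30, 10, 15]

print(f"end-to-end accuracy: {chain_accuracy(stages):.3f}")
print(f"end-to-end latency: {chain_latency_ms(latencies_ms)} ms")
```

Under these assumptions, four stages that are each 98% accurate are right end to end only about 92% of the time.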

Now, all of this information is processed by a single Transformer model: “Our model generates receipts instead of text. It does this by taking all the inputs and acting on them at the same time, spitting out a receipt in one fell swoop. Just like GPT, where you have the language in one model and the images all in one model, you can do the same thing. Instead of generating text, you generate receipts.”

Image: JWO Architecture provided by Amazon

By processing data from multiple sources at once, including cameras and weight sensors, the improved AI model can handle complex scenarios such as multiple shoppers interacting with items at the same time or camera views being blocked. This enhancement minimizes receipt delays and simplifies implementation for retailers.

The system’s self-learning capabilities reduce the need for manual retraining in unfamiliar situations. Trained on 3D store maps and product catalogs, the AI can adapt to changes in store layouts and accurately identify misplaced items. This advancement marks a major step forward in making frictionless shopping experiences more reliable and widely accessible.

JWO utilizes edge computing

One interesting thing we’ve seen is the productization of edge computing by Amazon. Amazon has confirmed that all model inference runs on compute hardware installed on-premises. As with all AWS services, this hardware is fully managed by Amazon and included in the total cost of the solution. In this respect, the service is still fully cloud-like for customers.

“We’ve built our own edge computing devices to deploy in these stores and run most of the inference on-site, because, number one, it’s faster if you can run it on-site. It also means that you need less bandwidth in and out of the store,” says JJ.

VentureBeat took a closer look at the new edge computing hardware: Each edge node is a rail-mounted enclosure measuring roughly 8x5x3, featuring very large air intakes, and is itself installed within a wall-mounted enclosure along with the network and other equipment.

Of course, Amazon has yet to comment on what’s inside these edge computing nodes, but because they’ll be used for AI inference, we speculate they contain Amazon-designed AI accelerators such as Trainium or Inferentia2, which AWS positions as a more affordable and accessible alternative to Nvidia’s GPUs.

You can see why edge computing is emerging as a critical layer for real-world AI inference use cases, as JWO requires information from multiple sensors to be processed and integrated in real time, and the data is too large to stream back to a cloud-hosted inference model.

Scaling up with RFID

Our next stop, down another long, dark corridor and behind another nondescript door, was another mock retail lab, this time inside what appears to be a clothing retail store: Long racks of sweatshirts, hoodies and sportswear line the walls, each item tagged with a unique RFID tag.

In the lab, Amazon is rapidly integrating RFID technology into JWO. The AI architecture remains the same, with a multimodal transformer that fuses sensor inputs, but without the complexity of multiple cameras or weight sensors. All a retailer needs to implement this type of JWO is an RFID gate and RFID tags to attach to the items. Many retail clothing items already come with RFID tags from the manufacturer, so they can get started right away.
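
To make the data flow concrete, here is a minimal sketch of turning the tag IDs seen at an RFID gate into a receipt. It is plain lookup logic, not Amazon’s transformer model, and every name in it (`build_receipt`, the tag IDs, the SKU and price tables) is hypothetical. It assumes a gate may report the same tag more than once, so reads are de-duplicated first.

```python
from collections import Counter


def build_receipt(tag_reads, tag_to_sku, prices_cents):
    """Turn raw gate reads into receipt lines and a total.

    tag_reads:     iterable of tag IDs reported by the gate (may repeat)
    tag_to_sku:    maps each known tag ID to a product SKU
    prices_cents:  maps each SKU to its price in cents
    """
    unique_tags = set(tag_reads)  # each physical item counts once
    # Tags that are not in the store catalog are ignored.
    counts = Counter(tag_to_sku[t] for t in unique_tags if t in tag_to_sku)
    lines = {sku: (qty, qty * prices_cents[sku]) for sku, qty in counts.items()}
    total = sum(cost for _, cost in lines.values())
    return lines, total


reads = ["A1", "A1", "B7", "C3", "ZZ"]  # "ZZ" is an unknown tag
catalog = {"A1": "hoodie", "B7": "hoodie", "C3": "tee"}
prices = {"hoodie": 4500, "tee": 1500}
lines, total = build_receipt(reads, catalog, prices)
print(lines, total)
```

In this example the shopper is charged for two hoodies and one tee; the duplicate read of tag `A1` and the unrecognized tag `ZZ` do not affect the total.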

The minimal infrastructure requirements here are a key advantage in terms of both cost and complexity. This type of JWO could also potentially be used for temporary retail outlets within fairgrounds, festivals, and similar locations.

What it took for Amazon to build JWO

The JWO project was made public in 2018, though research and development on the project likely dates back several years. JJ declined to comment on the exact size of the JWO product team or the total amount invested in the technology, but said that more than 90% of the JWO team are scientists, software engineers, and other technical staff.

But a quick look at LinkedIn suggests the JWO team has at least 250 full-time employees, and could number as many as 1,000. According to job listing site Comparably, the average salary at Amazon is $180,000 a year.

Assuming the cost breakdown for JWO’s development is similar to other software and hardware companies, and that Amazon started with its famous “two pizza team” of 10 full-time staff around 2015, I would guess that cumulative R&D costs are between $250M and $800M (what’s a few hundred million between friends?).
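
As a rough sanity check on that range, the sketch below multiplies headcount by the average salary quoted above, year by year. The linear headcount growth, the 2024 end year, and the salary-only scope (no hardware, facilities, or overhead) are assumptions made for illustration, not figures from Amazon.

```python
def cumulative_salary_cost(start_staff, end_staff, start_year, end_year, avg_salary):
    """Sum yearly salary cost, assuming headcount grows linearly (an assumption)."""
    years = end_year - start_year
    total = 0.0
    for i in range(years):
        # Headcount at the midpoint of year i on a straight line from start to end.
        staff = start_staff + (end_staff - start_staff) * (i + 0.5) / years
        total += staff * avg_salary
    return total


AVG_SALARY = 180_000  # average Amazon salary cited from Comparably

# Team of 10 in 2015 growing to 250 (low case) or 1,000 (high case) by 2024.
low = cumulative_salary_cost(10, 250, 2015, 2024, AVG_SALARY)
high = cumulative_salary_cost(10, 1_000, 2015, 2024, AVG_SALARY)
print(f"${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
```

Under these assumptions, the salary bill alone comes to roughly $210M to $820M, the same order of magnitude as the estimate above.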

The point is not to give an exact figure, but to provide a rough estimate of the research and development costs for a company looking to build a system like JWO from scratch. The takeaway is that you need to be prepared to spend several years and tens of millions of dollars to get there using the latest technology and hardware. But why build it when you can get it right now?

The build-versus-buy dilemma in AI

The estimated (speculative) costs of building a system like JWO show just how risky R&D is when it comes to enterprise AI, IoT, and integrating complex technologies. And it echoes what I heard from many enterprise decision makers at VB Transform in San Francisco a couple of weeks ago: “Investing heavily in hard tech AI only makes sense for companies like Amazon that can leverage platform effects to create economies of scale. It’s too risky to invest in infrastructure and R&D at this stage only to see it rapidly become obsolete.”

This dynamic is one of the reasons why hyperscale cloud providers are winning over in-house development in the AI space. The complexity and cost associated with AI development is a major barrier for most retailers. As these companies are focused on increasing efficiency and ROI, they are more likely to choose pre-integrated, ready-to-deploy systems like JWO and leave the technical heavy lifting to Amazon.

As for customization, AWS’ history suggests that we will increasingly see components of JWO appear as standalone cloud services. In fact, JJ revealed that this is already happening with Amazon Kinesis Video Streams, which grew out of the JWO project. When asked if the JWO model will be made available on Amazon Bedrock so that companies can innovate on their own, JJ replied, “Not really, but it’s an interesting question.”

Towards the widespread use of AI

The advancements in the JWO AI model demonstrate the continuing impact of the Transformer architecture across the AI landscape. This groundbreaking advancement in machine learning is not only revolutionizing natural language processing, but also the complex multi-modal tasks required for a frictionless retail experience. The Transformer model’s ability to efficiently process and fuse data from multiple sensors in real time is pushing the boundaries of what’s possible in AI-driven retail (and other IoT solutions).

Strategically, Amazon is tapping into a huge new source of potential revenue growth: third-party retailers. The move plays to Amazon’s strengths in productizing expertise and constantly expanding into adjacent markets. By offering JWO as a service through Amazon Web Services (AWS), Amazon is not only solving retailers’ pain points, but also expanding its dominance in the retail sector.

The integration of RFID technology into JWO, first announced in fall 2023, is an exciting development that could really bring the system to the mass market. With millions of retail stores around the world, it’s hard to overstate the size of the addressable market, if the price is right. With minimal infrastructure requirements and the potential for use in temporary retail environments, this RFID-based version of JWO could be the key to widespread adoption.

As AI and edge computing continue to evolve, Amazon’s JWO technology serves as a prime example of how hyperscalers are shaping the future of retail and other industries. Because they package complex AI as easily deployable services, the success of JWO and similar business models could determine how broadly AI is adopted in everyday operations.
