Well, this is the end of Marae. It feels so odd to say, but it's true.
Marae originated with the belief that AR will one day be all-encompassing and pervasive, seamlessly merging our physical and digital spaces. Today, we create and witness our digital reality through screens. Tomorrow, we'll effortlessly engage with a sea of content, apps, and tools via invisible intermediaries - smart glasses, contacts, and in the distant future maybe even optical implants. As the next paradigm of human-computer interaction, the emerging AR Cloud will fundamentally change the way we live, play, learn, and work. It was the magnitude of this shift, and its societal ramifications, which drew me in. In short, Marae was a moonshot effort to have substantial impact on an incredibly influential medium's rollout - one which will shape our reality in the next few decades.
In just the next 5-10 years though we'll start to get a taste of this future. There are countless glasses and headsets set to be released soon, with some powerful platforms coming out of their infancy as well. These advances, along with others still incubating in the minds of researchers the world over, will enable immersive experiences that are hard to fathom today - "Minecraft on the Boston Common, Fortnite on Central Park" as was my pitch. Marae's vision, distilled, was to become the dominant publishing, hosting, and distribution platform for this content - the "Operating System of Augmented Reality".
So why pack it in then? Well, as with most things, there are a handful of reasons. Some I optimistically pushed aside months ago (glad I did), and others have been on my mind for the past few weeks, but the coup de grâce landed May 19th. While pitching academics on a collaboration opportunity to test Marae's viability, I noticed that one of my prior assumptions needed a bit more interrogation. I had a rough belief that a popular computer vision dataset was similar enough to Marae's target spaces (public parks and college campuses) to be used. But under closer examination, my confidence shrank. The dataset might be similar enough, but that guess, compounded with other educated guesses downstream, would make the resulting research's value questionable at best. I've thought of some efficient ways to procure a relevant dataset to use, but they'll likely add a year or so to Marae's research lead time. Instead of being 2-3 years out from a prototype, we're looking at 3-4. This isn't a business, this is an academic endeavor - at least for now.
I've thought about pushing on to grad school - producing the needed dataset and validating my hypothesis, before getting back to the real world to build it all. But, there's something about a stark change of plans which causes one to soul search a bit. Like a compass that leads you a few miles in the wrong direction, you may tap and shake it, or ditch it entirely. Stepping back brought a few thoughts which had been pushed to my rearview into focus.
I started Marae with two major assumptions - that park-scale, multi-user experiences would be the driving force behind the AR Cloud's adoption, and that I would, as a founder, have significant stake in shaping this future. The first assumption was a critical part of our strategic plan, and the second tied directly to why I founded the company in the first place. I'll go into each of these disproven beliefs in their respective sections below, but want to more generally speak to my process of reckoning with them here.
Sometimes we cling to desirable notions in the face of overwhelming evidence. It can be hard to filter fact from fiction, or unwarranted pessimism from valid criticism when we're in the thick of it. While pitching, I had to navigate these murky waters. There were a lot of weakly-founded opinions to dredge through, but every once in a while a new bit of information would creep in that had merit. These insights gradually lowered the upper limit of Marae's aspirations, trimming and narrowing a grandiose vision I had created for myself. When it was all said and done, I had to see this opportunity for what it was - a path which wouldn't get me to my destination.
I'd like to take a moment to express my sincerest gratitude to all of those who have welcomed me with conversation and guidance over the past six months (plus one year of moonlighting) as we've explored the unbelievable future of Augmented Reality that's on our horizon. To those in academia and the professional world, thank you for your genuine enthusiasm and crisp critiques. To my friends and family, thank you for supporting my ambition and entertaining my obsessive curiosity (don't worry, it'll only shift elsewhere :) ).
I couldn't be happier to have taken the trip I did with Marae. I don't think I've ever learned, or grown, this much in a six-month period before. I'm also as convinced as ever that we'll one day see just how critical Marae's ethos was:
Foundational decisions made, while building Marae or other similar platforms, will have lasting effects on privacy and digital freedom for generations to come. This responsibility and purpose is the inspiration behind "Marae". In New Zealand and other parts of Polynesia, a Marae is a communal building where all kinds of critical social functions and rituals take place. Throughout history they've hosted birth rites, funerals, trade negotiations, religious sacraments, and even political deliberation. The word’s origin itself references an “open clearing without weeds”, which we can only imagine was carved out of the dense tropical jungle, consciously cultivated in order to lift humanity from chaotic obscurity to civilization. As the heart of a community, a Marae is a testament to what can be accomplished when concentrated efforts are made to procure and architect the spaces we live within.
I think that it's grown increasingly obvious over recent years that we're on the precipice of seismic change as a people. There's a feeling in the air, as if our collective "gut" has caught on to the rapid change around us. Black Mirror and "a boring dystopia" memes regularly ring true. Our largest digital platforms have upended age-old social contracts, and the boundary between public and private life has been blurred. That's because surveillance and censorship are not only expected, but sometimes even called for in these spaces. Heck, we just saw an ex-President forcibly taken off of one of the world's largest stages for what amounted to be a couple emotionally-charged tweets and a ton of hypotheticals. No judge, no jury, just the executioner. And I can't stand the guy, mind you, but wanton censorship is a much, much larger threat to all of us in the long run. When it comes to our digital reality, it often feels like we're living in a Banana Republic of sorts, where the largest Big Tech execs influenced by the latest PR/PC trends dictate the terms by which we are allowed to engage with one another (make sure you read the fine print). This is not how our most important communal spaces should be governed. How can we claim to be even remotely enlightened as a society, if we aren't comfortable letting each other communicate freely? I often wonder, if in the distant future we didn't physically talk to one another but communicated in some digital sense, how much of the richness and freedom of our thoughts would also be lost?
To date, every advancement in our methods of communication has been trailed by a wake of societal change. Our rich linguistic capacity is often credited with separating us from the other hominids which came before. The birth of writing is linked to the first states, as rulers and merchants were able to navigate ever-more-sophisticated networks of power and goods. When a little known priest in Germany by the name of Martin Luther used the printing press to distribute his 95 Theses and subsequent sermons, he became a bona fide celebrity in the region. This sparked a schism in Europe which the Catholic Church could not dampen like they had in the past.
It was because Luther was a celebrity that the Emperor could not follow the private advice of his advisors, and deal with Luther as the Council of Constance had dealt with Jan Hus: that is, withdraw his safe conduct, arrest and execute the heretic. - The Invention of News
The same stories of change follow with the radio, television, internet, and mobile. The radio brought the busy (and sometimes illiterate) masses into the fold, becoming a force for both social cohesion and nationalist propaganda. The television helped Martin Luther King Jr and the civil rights movement touch the hearts of Americans, as grotesque images and film of atrocities flashed across their evening screens. It's strange to consider all the different ways our history could have unraveled if these inventions were discovered at slightly different times.
Although I was born after the commercial internet began, I still had a moment, like so many of you, of wonder when its magnitude first hit me. With just a word and a touch of a button, I could find an image of anything I wanted. Better yet, Google Image Search produced a seemingly infinite variety of the thing!
Through all of these stages, the medium has affected the message. The AR Cloud will be no different. How, exactly, the "message" will change, I'm not so sure. I'd guess that recent trends from social media and our internet culture will probably prevail, but they don't need to. We can, with the right efforts, bring structural change to the digital ecosystems which we live an ever-increasing amount of our lives on. I hope we do this soon, since the stakes are only rising.
The AR Cloud, unfortunately, has the potential to be the world's most sophisticated, and intrusive, surveillance and reality distortion apparatus ever created. As I describe below in the technical section, AR experiences are reliant on a continuous video feed taken from a user's perspective. That, combined with audio mic features which some companies are excitedly planning, will put many of us in an unprecedented position of vulnerability. If there's anything the past decade of computer security should have taught us, it's that absolutely anything can happen to your data once it's no longer on your device. Leaks, and abuse, are sure to come about. It's only a matter of time. What we can control is how we use these new technologies, and how we hold the relevant authorities accountable when such issues do inevitably come about.
Writing this piece in NYC, I looked up to find Apple peeking back down at me. Are we ready for our world to become a stage, with cameras always rolling?
I think that the hyper-realism of this new, immersive reality will be shocking to some as well. We've never been face-to-face with artificial humans before, so I'm sure there'll be an uptick of double takes in the future. Maybe we'll even have a culturally acceptable "Turing test" of sorts, to verify each other's humanity. These digital people will be great substitutes for real ones in a number of sectors which are hard to visually automate. Kind of like a scarecrow in the field, virtual security guards may be published to stores across the globe to deter theft and mischief.
I think this shift will tell us a lot about our humanity, seeing just how easily we fall for these illusions of genuine social interaction. Or maybe we never cared in certain contexts? I've always found the classic Walmart greeter role to be strange. Doesn't everyone recognize that this person is being paid to smile, and say "Hi"? A friend's mother once tried explaining to me why she liked them anyways, but it was a perspective I couldn't grasp. I wonder if she'd care if they became virtual? These advancements of reality distortion will propel a fairly new trend of deepfakes and synthetic "people" to new heights which are hard to fathom today.
Marae was my attempt to set a desperately needed precedent - one which would respect individual thought and perspective. As we grow closer to the tech ecosystems we live our digital lives on, I think it's absolutely critical that we have a bastion for free expression, privacy, and authenticity. For that reason, Marae would be:
It would be a struggle establishing these principles which are often cast aside in the pursuit of profit. Sailing against the trade winds of commerce, I was extremely cautious about losing equity (ownership) in the platform I was creating. My aim was to bootstrap via grants in order to push off dilution for as long as possible. Eventually, we'd strike a strong product-market fit and leverage capital for the needed strategic growth. But along the way, my aim was to retain majority control for as long as possible. I've heard way too many stories of founders and inventors creating something with one purpose in mind, and getting ousted for the sake of the bottom line, or a board room power struggle. Even Thomas Edison, perhaps the most renowned American inventor and businessman - a guy who literally electrified the United States, had his own company sold out from under him after initially founding it with a 5/6ths stake worth $250k (that's $6.7m in today's dollars) [Empires of Light, 57 and 241].
My goal of independent control (for some significant amount of time) no longer looks realistic. Marae would only power a small fraction of the total AR Cloud, the research is out of range for grants, and prototyping will be much more costly than the typical lightweight tech startup. Turns out Computer Vision experts, while intrigued with sweat equity, have a pretty massive worth on the markets (FAANG) today, and I'd need to compensate accordingly.
That being said, there's still a business opportunity for hype+tech->acquisition here, but that's not me.
Marae's journey may be coming to an end, but I'm as convinced as ever that our technical vision was a promising blueprint for the incoming AR Cloud.
In the simplest of terms, our plan was to build a couple game engine SDKs, a handful of thin clients, and between them, a cluster of cloud localization, hosting, and streaming services. From the deck:
As a cross-platform solution, we'd connect content creators on the most popular ecosystems with owners of the best AR hardware. From the jump, this would put us in position to maximize both the reach of our creators and library for our users. As a streaming service, we could even further magnify this network effect. Instead of one single experience per app (like Pokémon Go), we'd surface a theoretically infinite library of content to the end user (like YouTube or Netflix). This was a key part of our broader strategy.
In the world of Computer Vision, SLAM, and AR, localization is the process of determining a device's six degrees of freedom (6 DoF) pose. That is, the position (X, Y, Z) and orientation (roll, pitch, yaw) combined. Having an accurate user pose is absolutely crucial to powering large scale AR experiences.
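To make the 6 DoF idea concrete, here is a minimal sketch of a pose as a data type. The class name `Pose6DoF`, its field units (metres and radians), and the two helper methods are my own illustrative assumptions, not part of any Marae code or of a particular AR SDK.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose6DoF:
    """Hypothetical pose type: position in metres, orientation in radians."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

    def distance_to(self, other: "Pose6DoF") -> float:
        """Euclidean distance between the two positions (orientation ignored)."""
        return math.sqrt(
            (self.x - other.x) ** 2
            + (self.y - other.y) ** 2
            + (self.z - other.z) ** 2
        )

    def yaw_error_to(self, other: "Pose6DoF") -> float:
        """Signed yaw difference to `other`, wrapped into [-pi, pi)."""
        delta = other.yaw - self.yaw
        return (delta + math.pi) % (2 * math.pi) - math.pi


estimated = Pose6DoF(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
actual = Pose6DoF(3.0, 4.0, 0.0, 0.0, 0.0, math.pi / 2)
print(estimated.distance_to(actual), estimated.yaw_error_to(actual))
```

A localization error then has two parts: how far off the position is, and how far off the orientation is. Both matter for keeping virtual content pinned to the physical world.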
To illustrate this, consider how a typical first person video game works - you provide physical input to a controller, which moves your character (altering their virtual "pose"), and then a new, appropriate perspective is shown/rendered to you. These steps are seamless and trivial in a pristine, virtual sandbox, but become much more complicated in AR as we interweave the digital with the physical. Tracking your perspective, AR platforms rely on various sensor inputs and state-of-the-art methods to keep the virtual world aligned with the physical in real-time. Errors, or slow or laggy computation here, can result in jitter at best, or a downright incoherent experience at worst.
Today, the problem of localization has been solved in some domains, but not others. As one might imagine, a space is easier to localize within if it's small, static, or "well-defined" (has distinct features which can be detected consistently). Because of this, techniques exist to localize indoors and in dense, urban environments, but remain elusive in large natural spaces, such as public parks and college quads. In these spaces, it's the non-descript natural shapes of trees and shrubs, as well as lighting and seasonal changes, which often cause difficulty.
These are actually powered by a weaker variant of localization - relative localization. That is, the detection of a user's 6 DoF pose relative to some arbitrary landmark or starting point. In these cases, the landmark is often some colorized point cloud of the ground. Such techniques are useful for short sessions of immersion, but won't last throughout the day as the sun moves across the sky casting new shadows and changing the colors of points on the ground.
Marae's aim was to build a platform powered by rich, global localization. Here, the X, Y, Z of a person's pose actually maps to something akin to a global location - like latitude, longitude, and altitude. With a stronger localization methodology, we could actually map a space once, surface it to developers via a simple SDK (similar to ARWAY's), and ultimately power a huge, long-lasting, immersive experience.
To do this, we'd pair off-the-shelf visual-inertial odometry with proprietary semantic re-localization. That's research-speak for "tracking movement via pictures and motion data" and "understanding the context of a picture to determine where you are". The first part, visual-inertial odometry (VIO), has been refined over many years, and is accessible to any app via Apple's ARKit or Google's ARCore. Operating in real-time (up to 120Hz), these libraries take imagery and motion data from a device to continually track where it is relative to some arbitrary starting point. Pretty incredible stuff. But, that alone is not enough to get us global localization. You see, the tracking of these systems is not perfect. In fact, "drift" (error) accumulates over time between a device's actual location and its perceived location at a rate of about 1% of the distance traveled. This means that, even for the world's best VIO systems, movement of 10m will result in a predicted pose error of 10cm. Enter the need for proprietary semantic re-localization - the cutting edge research which will one day enable Marae-esque experiences.
I'll save the details of this research for its respective section below, but will give a brief overview here to round out this section from an engineering perspective. Semantic re-localization is a technique where you take an input image, derive some semantic meaning about it, and determine where (in a pre-mapped space) you are. In our case, this means labeling each and every pixel in an image with its class of object. Here's an example image, which has been labeled with the classes "vehicle", "road", "sidewalk", "pole", "vegetation", etc. [Fully Convolutional Networks for Panoptic Segmentation]
Semantic-based approaches span the fast-approaching horizon of localization and mapping technologies. With resilience to the various superficial changes which can occur in a scene visually, they'll one day bring accuracy and robustness to localization in traditionally difficult spaces. So far, these techniques have only just started to be formalized and applied [Semantic SLAM] [Semantic Visual Localization]. Due to this, we have only scant empirical evidence to suggest exactly what sort of performance we should expect in the near future. My intuition though (reinforced by chats with experts), is that we could get re-localization accuracy down to 10 cm or less at a 2 Hz runtime.
This system, combined with VIO, would give us a nice upper-bound of our pose error over time.
Precisely, the upper-bound ATE (absolute trajectory error, in meters, with speed in m/s and frequency in Hz) is defined in this system as:
semantic_relocalization_accuracy + speed/100/semantic_relocalization_freq
Think this looks a little choppy though? Luckily, we could actually smooth the estimated pose in real-time to save the user from any jitter.
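To put numbers on the bound defined above, here is a small sketch. The function name `ate_upper_bound` and the 1.4 m/s walking speed are my own assumptions for illustration. The 1% drift rate, the 10 cm accuracy, and the 2 Hz runtime are the figures quoted earlier in this section.

```python
def ate_upper_bound(reloc_accuracy_m: float,
                    speed_m_per_s: float,
                    reloc_freq_hz: float,
                    drift_rate: float = 0.01) -> float:
    """Worst-case pose error (metres) just before the next re-localization fix.

    Error = accuracy of the last fix + VIO drift accumulated since that fix.
    drift_rate=0.01 encodes the ~1% drift quoted for VIO (the "/100" term).
    """
    if reloc_freq_hz <= 0:
        raise ValueError("re-localization frequency must be positive")
    distance_between_fixes_m = speed_m_per_s / reloc_freq_hz
    return reloc_accuracy_m + drift_rate * distance_between_fixes_m


# Assumed scenario: a user walking at roughly 1.4 m/s.
walking_bound_m = ate_upper_bound(reloc_accuracy_m=0.10,
                                  speed_m_per_s=1.4,
                                  reloc_freq_hz=2.0)
print(f"{walking_bound_m:.3f} m")
```

Under these assumptions the bound is dominated by the re-localization accuracy itself. The drift between fixes contributes well under a centimeter at walking pace.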
This is the closest I came to an architecture diagram :)
(1) Illustrates the local VIO process on device
(2) A request sent to our cloud localization solution (where the semantic re-localization would be performed). Data sent includes an image (with timestamp) and an estimated pose (based on some combination of historical data, VIO, and GPS if needed)
(3) Device receiving an accurate pose (from some ~400ms ago) which will then be incorporated into a smoothed estimate
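Step (3) can be sketched in a few lines. This is a simplified, position-only illustration with hypothetical function names. It assumes the VIO frame and the global frame share the same axes, which a real system would have to handle with a proper rigid-body transform, and it uses plain linear blending as a stand-in for whatever smoothing filter would actually be chosen.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def latency_compensated_position(cloud_fix: Vec3,
                                 vio_at_request: Vec3,
                                 vio_now: Vec3) -> Vec3:
    """Carry a stale cloud fix forward by the motion VIO measured since the request."""
    return tuple(c + (n - r) for c, r, n in zip(cloud_fix, vio_at_request, vio_now))


def smooth_towards(displayed: Vec3, target: Vec3, alpha: float) -> Vec3:
    """Move a fraction `alpha` of the way to `target`, so corrections don't pop."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return tuple(d + alpha * (t - d) for d, t in zip(displayed, target))


# The cloud says where the device was ~400ms ago; VIO says how far it moved since.
target = latency_compensated_position(cloud_fix=(10.0, 0.0, 5.0),
                                      vio_at_request=(1.0, 0.0, 1.0),
                                      vio_now=(1.5, 0.0, 1.2))
shown = smooth_towards(displayed=(10.0, 0.0, 5.0), target=target, alpha=0.5)
print(target, shown)
```

The device never waits on the cloud. VIO keeps the experience responsive, and each late-arriving fix only nudges the estimate back toward ground truth.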
Software aside, this would have actually been a bit pricey to run today. That's because state-of-the-art semantic re-localization pushes modern day GPUs to their absolute limits. Remember that 2Hz I mentioned earlier? That's 2Hz for one person on a single GPU. Shopping around AWS, we can see that these run at about $0.80/hr. That's a baseline cost of $0.80/person/hr. Within the next 3-4 years though, I expect that we'll see a trifecta of compounding factors that drop this cost 10x. These are:
Game Engine SDK
We'd need a couple SDKs to integrate our platform with the world's most popular 3D engines. Each of these would serve two core purposes. First, they'd give us programmatic means to adjust the virtual camera pose in real-time. Second, we'd surface our library of pre-mapped spaces through them. With this, developers and creators could design their virtual scenes atop the Boston Common, Central Park, college campuses, or any other popular outdoor space using a familiar workflow.
Taking Unity as an example, this first bit of functionality could be achieved by modifying the ARSessionOrigin's camera setter function. Specifically, we'd add a driver script to the component which would send requests to our cloud localization solution.
Actually creating and publishing an experience on Marae would have been much like the typical game dev workflow. Within the game engine IDE, we'd have a drop-down menu with all of our pre-mapped locations. Selecting one, we'd populate the workspace with a low-res asset of the space. As a symbolic reference point, this asset would serve as a transparent foundation for the experience to be built upon. Once published, it'd be removed entirely.
Cloud Hosting and Streaming
In order to maximize the scalability of our platform, we'd need to host and stream published experiences on cloud infrastructure. We considered using AWS Gamelift, or even spinning up a proprietary solution as some other companies have, to accomplish this.
The rollout of 5G was a key trend for us, since AR streaming has such massive network requirements. These needs are somewhat analogous to the needs of cloud gaming, which has only just recently begun flirting with mobile and requires a great connection.
At the forefront of this wave is Amazon's partnership with Verizon to bring 5G connectivity to AWS. Today, developers can rent location-specific instances (servers) on both the Local Zones and Wavelength AWS offerings. These use Verizon's Ultra Wideband coverage, which is coming along rapidly. New documentation claims 12 cities are expected to be set up (partially) by the end of 2021. Important for us, was the rollout on the Boston Common:
Around the country, we saw target spaces gaining coverage. Here's Los Angeles State Historic Park:
Quoted below is a brief excerpt of the STTR grant proposal I was pitching Computer Vision & SLAM experts. After reading through quite a bit of relevant literature and refining my perspective over hundreds of conversations, here's a promising methodology I landed on to produce a localization service capable of powering Marae:
Localization is critical for rich Augmented Reality experiences, as it enables the alignment of virtual and real-world coordinate systems. Today, there is no localization solution capable of powering scalable AR experiences in large outdoor spaces such as public parks and college campuses. This is primarily due to the difficulty keypoint-based approaches have when handling illumination and seasonal variance in texture-weak, feature-poor spaces. Semantic-based re-localization, however, has shown early promise and resilience to these common pitfalls. Our intuition is that recent works in panoptic segmentation as well as view synthesis have just now made a new methodology of semantic re-localization highly compelling.
Consider an extremely dense semantic keyframe graph of 360 degree images (1/.5m^3), a small (<2m^3) search space, and an accurate segmentation network. With these assumptions, we suspect that robust, performant re-localization of input images is possible. We wish to examine the real-world feasibility of this technique, by exploring the relationship between keyframe graph density and re-localization accuracy. Techniques to increase this density can include more thorough coverage of a test space, as well as view synthesis techniques akin to NeRF.
Visually, I was proposing a packed 3D grid of points atop the relevant space, with each point representing the position of a camera having full 360 degree coverage.
In actuality, there'd be a handful of separate images taken at each camera position. These might then be stitched together and manipulated to simulate any vantage point, like Google Maps. All of these photos would be passed through a SegNet, described above in the tech section, which would semantically annotate them.
With dense enough coverage, we would expect to be able to localize anywhere within a space. Here's where a crucial question of the research comes up though. How dense is dense enough? This is all dependent on a couple of factors:
When receiving a new input image, we need to have some sense of where in the space it was actually taken. Having a good estimate of this can reduce complexity and improve runtime. In our case, we'd use GPS and historical data to define our search space. This estimate also informs us of the relevant set of ground truth images (keyframes) to compare against. Passing our input image through the same SegNet we used on our keyframes, we're left with a handful of similarly labeled images taken at slightly different vantages. We can then perform a learned alignment procedure to determine how our new image fits in relatively with the keyframes, and globally within our coordinate system.
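The first step described above, narrowing the keyframe graph down to the ones worth comparing against, might look something like the sketch below. The `Keyframe` type, the `candidate_keyframes` function, and the spherical search radius are hypothetical stand-ins. The segmentation and learned alignment steps are deliberately left out, since those were the open research questions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Keyframe:
    """Hypothetical record: an ID plus the (x, y, z) position it was captured at, in metres."""
    keyframe_id: str
    position: Tuple[float, float, float]


def candidate_keyframes(keyframes: List[Keyframe],
                        estimated_position: Tuple[float, float, float],
                        search_radius_m: float) -> List[Keyframe]:
    """Return keyframes within the search radius of the estimate, nearest first."""
    scored = [(math.dist(kf.position, estimated_position), kf) for kf in keyframes]
    nearby = [(d, kf) for d, kf in scored if d <= search_radius_m]
    nearby.sort(key=lambda pair: pair[0])
    return [kf for _, kf in nearby]


graph = [
    Keyframe("a", (0.0, 0.0, 0.0)),
    Keyframe("b", (0.5, 0.0, 0.0)),
    Keyframe("c", (3.0, 0.0, 0.0)),
]
# Estimated position would come from GPS and historical data.
print([kf.keyframe_id for kf in candidate_keyframes(graph, (0.4, 0.0, 0.0), 1.0)])
```

A tighter position estimate means fewer candidates, which is where the complexity and runtime savings come from. A production system would use a spatial index rather than scanning every keyframe.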
Autonomous drones, and view synthesis techniques could help tremendously in increasing the density of this ground truth keyframe graph. Intuition being, that this would in turn boost overall system accuracy, as keyframes and new input images get even closer.
Semantic localization (and Computer Vision in general) is an incredibly hot area in academia today. These plans drew quite a bit of interest from top researchers around the globe, but I don't think the space is actually mature enough for a feasibility-testing prototype of Marae to be created. As mentioned, this work ties together many assumptions and unknowns, so I wouldn't be all too confident that the resulting prototype would translate nicely to the real world.
One major assumption for us was that we'd have decent performance from a pre-trained SegNet when capturing Marae's target spaces (0.8 mIoU or 0.6 PQ). Under scrutiny, this assumption was invalidated, and it became clear that we'd need to produce our own annotated image dataset before training a SegNet and getting on with the research planned. This added about a year onto our research lead time, and definitively put us out of range for an STTR grant.
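For anyone unfamiliar with the mIoU figure above: mean Intersection-over-Union scores a segmentation by comparing predicted and ground-truth labels class by class. Here's a minimal pure-Python sketch over flattened per-pixel labels. The function name and the choice to skip classes that appear in neither image are my own assumptions, and real evaluations typically aggregate over a whole dataset rather than a single image.

```python
from typing import List


def mean_iou(predicted: List[int], ground_truth: List[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label lists must be the same length")
    ious = []
    for cls in range(num_classes):
        pairs = list(zip(predicted, ground_truth))
        intersection = sum(1 for p, g in pairs if p == cls and g == cls)
        union = sum(1 for p, g in pairs if p == cls or g == cls)
        if union > 0:  # skip classes absent from both images
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no labeled classes to score")
    return sum(ious) / len(ious)


# Four "pixels", two classes (say 0 = vegetation, 1 = path).
truth = [0, 0, 1, 1]
guess = [0, 1, 1, 1]
print(mean_iou(guess, truth, num_classes=2))
```

A score of 1.0 means every pixel was labeled correctly. The 0.8 target meant the pre-trained network's labels needed to overlap the truth that well, on average, in parks and campuses it had never been trained on.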
Technical innovation is powerful, but doesn't always translate into profitable, or even sustainable, business. Without some kind of competitive moat, a service will eventually become commoditized (thin margins) or pushed out of the market entirely. In Marae's case, we had three guiding principles behind our competitive/growth strategy. I initially believed that these would give us serious, long-lived leverage in the space. Not only was this a path to short-term profitability (beating Big Tech R&D and getting bought out), but it laid the blueprint for a much larger goal of mine - sustained operational independence.
As stated above, I'm not nearly as bullish on the independence, or even significant impact, of Marae as I once was. But, I thought it'd be neat to share the rationale I once had.
The plan was to:
A content platform is a two-sided marketplace. Usually, two-sided marketplaces are hard to scale because of a chicken-and-egg/bootstrapping problem. Without users you'll attract no creators. Without creators, there will be no users. Once things start to click though, there's a virtuous cycle of growth to be tapped into. The value of an ecosystem like Marae is largely derived from its scale, and as it grows will further attract more creators and users in an exponential fashion. This is the making of a Super App.
Being a first mover here is valuable for two major reasons. First, as a truly novel development in an already excitable space, the launch of an AR Cloud platform will draw the attention needed to get over the initial hump mentioned above. In our case, the plan was to make a splash by partnering with ambitious game dev studios and powering their creations throughout weekend-long Betas on the Boston Common and Central Park. Second, and much more obvious, is that a first mover in a market conducive to network effects earns an advantage which compounds. All else being equal, a small lead time can result in a massive gap between ecosystem scale (and thus value) over time.
As illustrated in the tech section above, Marae had no hardware ambitions. This was by design. Without our own walled garden to tend to, we'd be free to cut across the many small ones which are only just starting to bud in XR. This flexibility would be critical once other serious competitors got off the ground. For example, if Apple wanted to prioritize their own in-house version of Marae, or outright block us from their app store, that'd be fine. We'd go on to support Facebook, Microsoft, Niantic, and every other hardware offering anyways. Each and every hardware platform would have the choice to either play ball and improve their market share, or get left behind with a less compelling product.
The same dynamic would hold for our upstream game engine partners too. In essence, we believed that Marae, once created, would be an invaluable strategic partner in the AR sphere - a lynchpin between intensely competitive game engines and hardware platforms.
There's a B-School concept which I learned back in Theory of Org that describes this position of leverage, but since I find historical analogues more interesting we'll go down that route. Enter the Play-off System, which the Seneca Native Americans employed to keep the French and British at bay for much longer than any direct conflict/contest would have -
"The maintenance of this balance of power, however, depended upon Iroquois capacity to make credible threats, or promises, of military action. The system did not require that both French and British be equally powerful in the area, but rather that the Iroquois should at any time be considered able, by shifting their weight, to make up the difference. Any actual or apparent rigidity (induced either by indecision or by insufficiency of resources) in Iroquois policy would threaten this system because it would permit one side or the other to anticipate decline to a point where it could no longer afford to pay the price for the necessary support. Such a point of critical imbalance could also be reached if, as a result of any outside circumstances, one side or the other thought that a power differential existed which no degree of shifting of Iroquois support could rectify." - Death and Rebirth of the Seneca
As long as the Seneca could convincingly tilt the scale, they'd be safe. Marae would have similar positioning, but with the crucial addition of a virtuous growth cycle.
This strategic vision, and in turn, Marae's ability to fulfill my why, fell apart under the growing suspicion that large, multi-user experiences very likely won't propel the AR Cloud's adoption, or even be a large portion of its applications. The vast majority of AR content will be much like the 2D apps we engage with today - private and location-agnostic (meaning that the experience is not affected by your exact location). These are the messengers, social feeds, browsers, and productivity and creative tools which have come to shape our digital reality today. These experiences have no dependency on a location-precise, multi-user AR platform. And as for the multi-user experiences which do emerge in AR, there's pretty good reason to think that many of them won't need park-scale location-precision either. Point being, that the long-term strategic value of Marae is a tiny fraction of what I initially believed.
If Marae isn't powering the lion's share of AR and becoming a Super App, it won't have the leverage needed to grow between behemoths. Eventually, it'll be lost to market forces - either via early Exit or getting pushed out. Neither of these outcomes excites me.