Stu (Michael) Stewart
Setting the Stage
At Opendoor, we buy and sell thousands of homes every year. We make offers to purchase even more homes: hundreds of thousands annually. To generate valuations at that scale, we need a system that combines human and machine intelligence. Enter AVMs.
AVMs
AVM, or automated valuation model, is the name given to a Machine Learning model that estimates the value of a property, usually by comparing that property to similar nearby properties that have recently sold (comparables or comps). Comps are key: an AVM evaluates a property relative to its comps, assimilating those data into a single number quantifying the property’s value.
The idea of using a model to predict home prices is hardly new. (Boston Housing Dataset, anyone?) So Opendoor’s use of an AVM — we have lovingly dubbed ours OVM — probably isn’t unusual. What might be unusual, though, is OVM’s centrality to our business. Many companies have an “AI strategy,” but at Opendoor (scare quotes) “AI” is the business. We don’t buy or sell a single home without consulting OVM — for a valuation, for information regarding comparable properties, or for both. Humans and models must work hand-in-hand for the business to flourish.
OVM has existed in some form for all but a few months of Opendoor’s history. Recently, we launched our latest-and-greatest model, which generated a step-function improvement in our predictive accuracy by melding human intuition about home valuations with deep learning algorithms. But in order to understand why this is significant, a brief history lesson is in order.
Our Previous Work
For the past several years, OVM has relied on a pipeline-style ML system in which separate models handle different aspects of the home valuation process. The old system looked a bit like this:
OVM (handcrafted model pipeline)
- Select comps
- Score said comps based on their “closeness” to the subject property
- “Weight” each comp relative to the others
- “Adjust” the (observed) prices of each comp
- Estimate “uncertainty” via multiple additional models
While the system reflects a natural human process for valuing homes (a desirable property), there are a few downsides to this approach. Namely, having so many separate models increases complexity and also prevents us from jointly optimizing things like mean and variance predictions, or comp weights and comp adjustments.
In short, the prior model effectively leveraged human intuition about home valuations, but was very complex and did not adequately share information between the various problems it sought to solve. In addition, it did not handle high-cardinality categorical information (such as postal codes) in a very thoughtful manner.
The Promise of Deep Learning
The deep learning revolution has been propelled forward by a battery of factors; few stand more prominently than the magic of Stochastic Gradient Descent (SGD). For our purposes, the ability to define an end-to-end system, in which gradients flow freely through all aspects of the valuation process, is key: for instance, attempting to assign weights to comps without also considering the requisite adjustments (a shortcoming of the prior algorithm) leaves useful information on the table. This shortcoming was top-of-mind as we built out our current framework.
The new system, written in PyTorch, looks a bit like this:
OVM (deep learning)
- Select comps
- Give everything to a neural net and hope it works!
Deep learning is often pilloried for an alleged overabundance of complexity. For our purposes, however, deep learning presented a much more straightforward solution. But we must ask: What are the characteristics of this new system that allow it to retain the positives from our earlier models while also addressing their shortcomings?
Trust the Process
In residential real estate, we know the causal mechanism by which homes are valued, a powerful backstop typically unavailable in computer vision or Natural Language Processing (NLP) applications. That is, a home is priced by a human real estate agent who consults comps (those same comps again) and defines/adjusts the listing’s price based on:
- the recency of the comps (which factors in home price fluctuations)
- how fancy/new/desirable the comps are relative to the subject property
List price is not the only input to the close price, to be sure, but it is by far the most important one. Comp prices are more than correlative: a nearby comp selling for more than its intrinsic value literally causes one’s home to be worth more, as no shopper will be able to perfectly parse the underlying “intrinsic value” from the “noise (error)” of previous home sales. This overvaluation propagates causally into the future prices of other nearby homes.
A model with an inductive bias that mirrors this data-generating process is well positioned to succeed as an AVM.
The Measure of a Map
Our old system did a good job synthesizing the aforementioned human intuition about real estate. Put differently, its inductive bias is well suited to the problem at hand given what we know about the data-generating process. It was less effective, however, at leveraging important categorical information about a home and its comps (such as postal code, county, etc.), which older ML algorithms handle less gracefully. It also failed to leverage data in an end-to-end manner, thereby unnecessarily restricting information flow between components of the system.
Let’s investigate the former weakness first: How should we structure our neural network to utilize categorical embeddings while not losing sight of the known data-generating process?
Embedding’s the word
Deep learning, through categorical feature embeddings, unlocks an extremely powerful treatment of high-cardinality categorical variables that is ill-approximated by ML stalwarts like linear models and Gradient Boosted Trees. The benefit of these embeddings is on full display in the NLP community, where embeddings have revolutionized the field.
Real estate has surprising similarities to NLP: high-cardinality features such as postal code, census block, school district, etc. are nearly as central to home valuations as words are to language. By providing access to best-in-class handling of categorical features, a deep learning based solution immediately resolved a primary flaw of our system. Better yet, we didn’t need to lift a finger, as embedding layers are a provided building-block in all modern deep learning frameworks.
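As a concrete sketch of what those building blocks look like in PyTorch, consider embedding postal codes and school districts for a single listing. The cardinalities, dimensions, and feature mix below are illustrative assumptions, not our production configuration:

```python
import torch
import torch.nn as nn

class ListingEncoder(nn.Module):
    """Minimal sketch: dense embeddings for high-cardinality categoricals."""

    def __init__(self, n_postal_codes=40_000, n_districts=14_000,
                 embed_dim=32, n_numeric=20):
        super().__init__()
        self.postal_embed = nn.Embedding(n_postal_codes, embed_dim)
        self.district_embed = nn.Embedding(n_districts, embed_dim)
        self.proj = nn.Linear(2 * embed_dim + n_numeric, 128)

    def forward(self, postal_idx, district_idx, numeric_feats):
        # Look up a learned dense vector for each categorical id, then
        # fuse it with the numeric features (beds, baths, sqft, ...).
        x = torch.cat([self.postal_embed(postal_idx),
                       self.district_embed(district_idx),
                       numeric_feats], dim=-1)
        return torch.relu(self.proj(x))

enc = ListingEncoder()
reps = enc(torch.randint(0, 40_000, (8,)),   # postal-code ids
           torch.randint(0, 14_000, (8,)),   # school-district ids
           torch.randn(8, 20))               # -> (8, 128) listing vectors
```

The embedding tables are trained jointly with the rest of the network, so the geography-shaped structure of home prices is learned rather than hand-encoded.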
Engineering a Network
The final unresolved defect of our prior algorithm was its inability to jointly optimize parameters across sub-model boundaries. For instance, the model that assigned comp weights did not “talk” to the model that predicted the dollar-value of the comp-adjustment that would bring said comp to parity with the subject listing.
A modular framework, such as PyTorch, cleanly resolves this fault, as well. We can define sub-modules of our network to tackle the adjustment and weighting schemes for each comp, and autograd will handle backward-pass gradient propagation within and between the sub-modules of the net.
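As a minimal illustration of that joint optimization (with hypothetical shapes and module names), both sub-problems can live under one parent module so that a single backward pass updates them together:

```python
import torch
import torch.nn as nn

class Valuer(nn.Module):
    """Sketch only: adjustment and weighting heads trained under one loss."""

    def __init__(self, dim=32):
        super().__init__()
        self.adjust = nn.Linear(dim, 1)  # per-comp dollar adjustment
        self.weigh = nn.Linear(dim, 1)   # per-comp un-normalized weight

    def forward(self, pair_feats, comp_prices):
        # pair_feats: (n_comps, dim); comp_prices: (n_comps,)
        adj = self.adjust(pair_feats).squeeze(-1)
        w = torch.softmax(self.weigh(pair_feats).squeeze(-1), dim=-1)
        return ((comp_prices + adj) * w).sum()

model = Valuer()
pred = model(torch.randn(5, 32), torch.full((5,), 300_000.0))
loss = (pred - 310_000.0) ** 2
loss.backward()  # gradients reach both sub-modules, unlike the old pipeline
```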
Yet, we must keep in mind a key constraint: The inductive bias of our network should hew closely to the causal pricing mechanism or else the human-interpretability of the algorithm will be compromised.
There are several approaches to modeling this process (while enabling joint optimization). We’ve had success with many model paradigms presently popular in NLP and/or image retrieval / visual search. These include:
- Transformer-style network architectures that accept a variable-length sequence of feature vectors (perhaps words or houses) and emit a sequence or a single number quantifying an output
- Siamese Networks that compare, for example, images or home listings and produce a number/vector quantifying the similarity between any two of them
- Triplet loss frameworks for similarity detection (and, more recently, contrastive-loss approaches spiritually similar to triplet loss); see the sketch after this list
- Embedding lookup schemes such as Locality Sensitive Hashing that efficiently search a vector-space for similar entities to a query-vector of interest
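To ground one of these paradigms, here is a small, hypothetical sketch of a triplet-loss setup for comp similarity. The encoder architecture, feature sizes, and the sampling of positives and negatives are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Shared encoder maps listing features into an embedding space.
encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

subject = encoder(torch.randn(8, 64))  # anchors: subject listings
pos = encoder(torch.randn(8, 64))      # positives: known-good comps
neg = encoder(torch.randn(8, 64))      # negatives: poor comps
loss = triplet(subject, pos, neg)      # pull good comps close, push poor ones away
loss.backward()
```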
The process of valuing a home is similar to NLP for one key reason: a home “lives” in a neighborhood just as a word “lives” in a sentence. Using the local context to understand a word works well; it is intuitive that a comparable method could succeed in real estate.
Image retrieval hinges on querying a massive database for images similar to the query image — a process quite aligned with the comparable-listing selection process.
Which model works best will depend on the specifics of the issue one is trying to solve. Building a world-class AVM involves geographical nuance as well: an ensemble of models stratified by region and/or urban/suburban/exurban localization may leverage many or all of the above methodologies.
The Problem at Hand
With our network(s) we must be able to answer two key questions:
- How much more (or less) expensive is some comp listing than the listing of interest?
- How much weight (if any) should be assigned to said comp, relative to the other comps?
Zooming In
Let’s make our problem more concrete: assume for the sake of argument that, after evaluating one’s problem, a transformer appears to be well suited to the project specifications.
We can define a module, then, that takes data from both the listing of interest (the subject listing) and from the comps selected for said listing.
The module might take in tabular data (about the listing’s features), photos, satellite imagery, text, etc. It may also use contextual information, including information about the other comps available for the given listing of interest — a transformer’s self-attention aligns well with this notion of contextual info. Said module is responsible for outputting two kinds of quantities:
1. An estimate of the relative price difference between a given (subject, comp) listing pair
2. A “logit” (un-normalized weight) characterizing the relative strength of a comp
Because the comp weights should sum to one, a normalization scheme (perhaps softmax, sparsemax, or a learned reduction-function) is employed after the weights are computed. Recall that the comparable properties have already recently sold (never mind active listings for now), so their close prices are known. That close price, augmented by the price delta computed in (1), is itself a powerful predictor of the close price of the subject listing.
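Assembled into code, such a module might look like the following sketch. Everything here (dimensions, layer counts, and the module name CompTransformer) is a hypothetical illustration of the design just described, not our production model:

```python
import torch
import torch.nn as nn

class CompTransformer(nn.Module):
    """Sketch: one token per comp; self-attention provides comp context."""

    def __init__(self, d_feat=64, d_model=128, nhead=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(d_feat, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.delta_head = nn.Linear(d_model, 1)  # output 1: relative price diff
        self.logit_head = nn.Linear(d_model, 1)  # output 2: comp-strength logit

    def forward(self, pair_feats, comp_prices):
        # pair_feats: (batch, n_comps, d_feat), built from (subject, comp)
        # pairwise features; comp_prices: (batch, n_comps), known close prices.
        h = self.encoder(self.embed(pair_feats))
        delta = self.delta_head(h).squeeze(-1)                      # adjustments
        w = torch.softmax(self.logit_head(h).squeeze(-1), dim=-1)   # weights sum to 1
        # Adjusted comp close prices, weight-averaged, give a point
        # estimate of the subject listing's close price.
        return ((comp_prices + delta) * w).sum(dim=-1)
```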
These transformer-based techniques from NLP work well because each comp can be viewed as a draw from a relatively homogeneous bag of possible comparable properties. In this capacity, comps are quite similar to words in the context of a language model: atomic units that together form a “sentence” that describes the subject listing and speculates regarding its worth.
Deciding which words (comps) to place in that sentence, though, is a tricky problem in its own right.
Capping the Pipe
Once the aforementioned quantities are in hand, the valuation process reduces immediately to a standard regression problem:
- The observed comp close prices are adjusted via the values proposed by our network
- These adjusted close prices are reduced, via a weighted-average-like procedure, to a point estimate of the subject’s close price
- Your favorite regression loss can then be employed, as usual, to train the model and learn the parameters of the network (see the sketch below)
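A hypothetical training step tying these three items together, reusing the CompTransformer sketch from earlier (batch shapes, prices, and optimizer settings are again illustrative):

```python
import torch
import torch.nn.functional as F

model = CompTransformer()                  # hypothetical module sketched above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

pair_feats = torch.randn(16, 10, 64)       # 16 subjects x 10 comps each
comp_prices = torch.rand(16, 10) * 500e3   # observed comp close prices
target_close = torch.rand(16) * 500e3      # observed subject close prices

pred = model(pair_feats, comp_prices)      # steps 1 and 2: adjust, then reduce
loss = F.mse_loss(pred, target_close)      # step 3: a standard regression loss
opt.zero_grad()
loss.backward()
opt.step()
```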
Reaping the Benefits
We measure the accuracy of our models on many cohorts and subsets of listings. Across all of them, the neural network based ensemble architecture outperformed heritage OVM. We were extremely pleased to see (relative) error rates decline by 10% or more in many cases. We also massively reduced the number of models that we have to train, host, and maintain in production.
Summary
At Opendoor, we’ve updated our core home valuation algorithm to incorporate advances in deep learning without sacrificing domain-specific knowledge about the mechanism by which residential properties are valued. We saw a step function improvement in accuracy after implementing these ideas; the bulk of the improvement can be attributed to (1) end-to-end learning and (2) efficient embeddings of high-cardinality categorical features.
If you are interested in building the next generation of machine learning applications for real estate, we’d love to hear from you. Opendoor is hiring (remotely!) as well as in our SF, LA, and Atlanta offices.
And What of Photos?
Selling is finance, buying is romance. ~ Opendoor Mantra
Humans interact (and fall in love) with homes through photos. It seems natural, then, that a deep learning model would leverage these images when comparing homes to one another during the appraisal process. After all, one of the great success stories of deep learning is the field of computer vision. Transitioning OVM to deep learning has the added benefit of making it much easier to incorporate mixed-media data, such as images and text (from listings, tax documents, etc.) into our core algorithm. But that, dear reader, is a topic for another blog post.
Until next time, may your GANs never suffer mode-collapse!
Originally published at https://medium.com/opendoor-labs/accurately-valuing-homes-with-deep-learning-and-structural-inductive-biases-18232ede1efd