Research: Estimating Family Income from Administrative Banking Data

A Machine Learning Approach

The JPMorgan Chase Institute was established to leverage the power of administrative banking data to deepen our understanding of critical economic issues and provide timely insights to decision makers. We recently developed a machine-learning-based estimate of family income to enable deeper insights and improved representation in our research. We describe our methodology and results in this new release.

Q&A on the JPMC Institute Income Estimate

We discussed the motivation for this new income estimate and the potential of applying machine learning approaches to administrative banking data.

What is the JPMC Institute Income Estimate (JPMC IIE) and why did the JPMorgan Chase Institute create it?

Simply put, the Institute Income Estimate is an estimate of gross family income for families who regularly use a Chase checking account. Analyzing and understanding the financial behavior of families, and how it varies across the income spectrum, is a central theme of the Institute's work. To better assess these dynamics, we needed a way to estimate gross family income across our data set.

Our ability to extend insights gained from the Chase portfolio to the US population relies on having or approximating a sample that is representative of the broader population and being able to differentiate results by key attributes, such as age, income, and geography. For example, if we want to measure growth in consumer spending in Houston, as we do with our Local Consumer Commerce Index, we want to make sure that the customers we observe in Houston are truly representative of that city, and we might also want to know who in Houston is contributing most of the growth.

We know that the Chase portfolio is not a perfect mirror of the US population and doesn't offer a perfect window into its customers' income. For example, it inherently excludes people without bank accounts, who tend to have lower incomes. Even for banked families, financial institutions might see payroll income arrive in a customer's account but not all of the deductions for taxes, insurance, and retirement arranged by the employer. And there might be other sources of income that aren't deposited into the customer's account.

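To make our sample more representative, we have to be able to re-weight our population to match the income distribution of the country. To study the financial behavior of low-income families, we want to define low income according to national benchmarks. So we need an income measure that is comparable to Census in order to re-weight and benchmark our sample. That is why we chose to create JPMC IIE.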
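To make the re-weighting idea concrete, here is a minimal sketch of post-stratification weights in Python. It is only an illustration under stated assumptions: the function name, the sample, and the use of income quintiles as the weighting groups are hypothetical and are not taken from the Institute's methodology.

```python
from collections import Counter

def poststratification_weights(sample_groups, benchmark_shares):
    """Return one weight per sample member so that the weighted share of
    each group matches its share in the benchmark population."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        # Under-represented groups get weights above 1, over-represented below 1.
        weights.append(benchmark_shares[group] / sample_share)
    return weights

# Hypothetical sample: income quintile (1 = lowest) of each observed family.
sample_quintiles = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
# By definition, each national income quintile holds 20 percent of families.
benchmark = {q: 0.2 for q in range(1, 6)}

weights = poststratification_weights(sample_quintiles, benchmark)

def weighted_share(group):
    """Share of total weight that falls in the given group."""
    total = sum(weights)
    in_group = sum(w for w, g in zip(weights, sample_quintiles) if g == group)
    return in_group / total
```

After weighting, every quintile in this toy sample carries the same share of total weight, even though the lowest quintile had only one observed family.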

At a high level, what is the methodology behind JPMC IIE?

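The idea behind JPMC IIE is quite simple in that it is a classic application of a “supervised learning” problem within machine learning. For some customers we actually know their gross family income, because they applied to us for a mortgage or credit card and we are required to ask for their income as part of the underwriting process. These customers represent our “truth set.” Among these customers we can then ascertain which of the characteristics we observe for all of our customers are highly predictive of gross family income. In this sense we can train a model to predict gross family income that uses features observable for everyone. Once we have tuned this model to predict the truth as closely as possible, we can then use it to generate predicted gross family income for everyone else.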
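The train-on-the-truth-set, predict-for-everyone-else pattern can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the Institute's model: it assumes a single hypothetical feature (annual deposits into the account) and ordinary least squares, whereas the text only says that a wide variety of features was used and does not name the algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical truth set: customers whose gross family income is known
# from a credit application, paired with a feature observed for everyone.
truth_deposits = [30000.0, 45000.0, 60000.0, 90000.0]
truth_income = [40000.0, 60000.0, 80000.0, 120000.0]

# Step 1: train on the truth set.
intercept, slope = fit_line(truth_deposits, truth_income)

def predict_income(deposits):
    """Predict gross family income from the observed feature."""
    return intercept + slope * deposits

# Step 2: predict for customers outside the truth set (hypothetical values).
other_deposits = [36000.0, 75000.0]
predicted = [predict_income(d) for d in other_deposits]
```

The key design point is that the model may only use features that exist for every customer; stated income is available only as the training label.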

How well does JPMC IIE actually predict family income?

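Our first version of JPMC IIE leveraged a wide variety of features to predict gross family income, including account information within the bank and publicly available data. It was able to predict gross family income, on average, within 41 percent of the truth. That is to say, on average, the estimate could be higher or lower than the actual by 41 percent. This is known as the “mean absolute error.”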
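As a rough illustration of this metric, the sketch below computes a mean absolute error expressed as a percentage of actual income. It assumes each family's error is divided by that family's actual income before averaging, which the wording above implies but does not spell out; the incomes are made up for the example.

```python
def mean_absolute_percent_error(actual, predicted):
    """Average of |predicted - actual| / actual, expressed in percent."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical truth-set incomes and model predictions.
actual_income = [50000.0, 80000.0, 120000.0]
predicted_income = [65000.0, 60000.0, 150000.0]

error_pct = mean_absolute_percent_error(actual_income, predicted_income)
```

Because the absolute value is taken before averaging, overestimates and underestimates do not cancel out.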

Because our primary concern is identifying a family's income quintile, we also assessed performance based on how often predicted income fell into the same quintile as the family's true income. On this measure, the predicted quintile matched the true quintile 55 percent of the time and was equal or adjacent to the true quintile roughly 90 percent of the time.
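The two quintile measures described above can be computed as follows. This is a minimal sketch with hypothetical data; it assumes quintiles are coded 1 (lowest) through 5 (highest) and that "adjacent" means differing by at most one quintile.

```python
def quintile_agreement(true_quintiles, predicted_quintiles):
    """Return (exact match rate, exact-or-adjacent match rate)."""
    pairs = list(zip(true_quintiles, predicted_quintiles))
    exact = sum(1 for t, p in pairs if t == p)
    # "Equal or adjacent" counts predictions at most one quintile away.
    near = sum(1 for t, p in pairs if abs(t - p) <= 1)
    return exact / len(pairs), near / len(pairs)

# Hypothetical true and predicted quintiles for eight families.
true_q = [1, 2, 3, 4, 5, 1, 3, 5]
pred_q = [1, 3, 3, 4, 3, 2, 3, 5]

exact_rate, near_rate = quintile_agreement(true_q, pred_q)
```

Reporting both rates separates outright misses (for example, a top-quintile family predicted in the middle) from near misses across a quintile boundary.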

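This certainly leaves room for improvement. But let's put these numbers in perspective. Had we simply guessed each family's income based on the average income among families living in their same zip code, according to tax records, our average error would have been 103 percent. This shows the value of leveraging administrative banking data to predict family income.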

We also took JPMC IIE on a test run by seeing how well it performed when we used it to weight the population in our Healthcare Out-of-pocket Spending Panel. Sure enough, weighting by age and JPMC IIE made our population more representative of the general population than if we had weighted by age alone.

How does this research differ from the JPMorgan Chase Institute's typical research, and what important lessons did we learn from it?

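This is the first application of machine learning in our work. In addition, whereas most of our research sets out to answer specific research questions, this publication describes the methodology behind a key data asset, JPMC IIE, which serves as the foundation for other research.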

We certainly learned a lot from this exercise. We will share one key highlight.

Early on it became clear to the team that our prediction was only going to be as strong as our truth set. And we needed to make sure that the truth set was representative of the larger universe of customers for whom we were trying to predict income. Because we relied on mortgage and credit card applicants as the basis for our truth, our truth set skewed toward higher-income families, so we had to oversample our truth set for mortgage and credit card applicants who had lower incomes. Stratifying the truth set by income yielded a 28 percentage point improvement in the quintile prediction for families in the lowest income quintile.
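One way to picture stratifying and oversampling a skewed truth set is shown below. The text does not describe the exact procedure, so this sketch rests on assumptions: the strata are income quintiles, every stratum is drawn to the same size, and sampling is done with replacement. The function and variable names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=0):
    """Draw the same number of records from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[stratum_of(record)].append(record)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        # Sampling with replacement lets small strata reach the target size,
        # which oversamples the under-represented lower-income groups.
        sample.extend(rng.choices(members, k=per_stratum))
    return sample

# Hypothetical truth set skewed toward higher incomes:
# each record is (family_id, income_quintile).
skewed_quintiles = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5]
truth_set = list(enumerate(skewed_quintiles))

balanced = stratified_sample(truth_set, stratum_of=lambda r: r[1], per_stratum=4)
counts = {q: sum(1 for _, s in balanced if s == q) for q in range(1, 6)}
```

In this toy example the lowest quintile has a single family, so it appears four times in the balanced sample; a model trained on the balanced set sees low-income cases as often as high-income ones.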

So what is next for JPMC IIE? Are there plans to continue to enhance or expand the scope of the income estimate and these approaches?

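As we mentioned earlier, with a mean absolute error of 41 percent, there is still plenty of room for improvement. We are taking the lessons from our initial effort and continuing to refine the model.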

We are busy refining and adding to our original features to see if we can improve the accuracy of the prediction. We are also trying to expand the size of our truth set by finding additional customers within the bank for whom we have gross family income.

We also see promise in expanding the scope of this income estimate beyond checking account customers to credit customers so that we have a uniform prediction estimate across our universe of customers.

We hope that publishing our initial methodology will not only teach the public about the power of leveraging administrative banking data for prediction but also generate a lot of feedback for future improvement. So please send us your ideas, and stay tuned!