How Python Novices Revolutionised Fave's Data Warehouse With On-the-fly Transformations
A user buying a Fave Deal or an eCard or using FavePay at Fave's merchant partners has the option of paying using credit/debit cards or using online banking. They can get further discounts by using credits they've accumulated as well as by applying promo codes. Malaysian users can redeem their AirAsia Big Points and pay from their Boost e-Wallet. Singaporean users can link up their GrabPay Wallet. More recently, we partnered with DBS and Singtel to also give Singaporeans the option of paying at our merchants via their apps while still being able to earn the cashback they would enjoy via the Fave app.
We need to consider 12 degrees of where, 3 degrees of who, 4 degrees of when, 16 degrees of how and 8 degrees of which.
With all this, it isn’t enough for us to know how much, where and when a user has transacted. We need to consider 12 degrees of where it happens, 3 degrees of who (anonymised, of course) is purchasing, 4 degrees of when purchasing/redeeming occurs, 16 degrees of how the payment is covered, 8 degrees of which Fave offerings apply and 6 degrees of the rewards a user can receive. For Fave’s earliest product, Fave Deals, this meant combining at least 15 tables, all with their own periodicity for data population, to end up with a comprehensive repository of information.
The Dilemma
Our existing reporting data warehouse was ageing, the strategy behind it unscalable, the problems ensuing from it untenable. A revitalisation was overdue. We wanted to be in a position where real-time reporting was available to support the business across 3 countries and potentially more countries in the future. But mainly we wanted a solid backbone on which to build our own data science projects. Fave’s Data Science team* has been teeing up for this for a long time and our delay to a full-on immersion has been in no small part due to an inherited extract, transform and load (ETL or data refresh) strategy that did its job well in its time but proved severely lacking for a ceaselessly growing company.
Delayed transformation means there's a hold-up to data availability, which in turn impedes discussions and tempers the ability to serve customers and our merchant partners.
That previous ETL depended on an absolute data refresh daily. Every day, in the predawn hours, a dump of ALL data since inception up to that point would be initiated into our reporting database. A cron job schedule would then trigger SQL-scripted transformation** jobs, which themselves took a healthy number of hours to wrap up. The next day, and indeed every day, rinse and repeat. The daily dump would naturally accumulate a larger load with each passing day, and so the delay to its completion snowballed. What was particularly painful about this strategy was that older, unchanging data would also be refreshed for absolutely no reason whatsoever. I have personally watched, and gradually reconfigured, the transformation schedule from 6 am when I started at Fave just under 2 years ago to its current 10:20 am.
Inherited ETL Setup

Delayed transformation means there's a hold-up to data availability which in turn impedes discussions such as the weekly top management meeting. It tempers the ability of our Customer Happiness and Partner Manager teams to serve customers and our merchant partners. It hampers Finance from completing their monthly account closings in a timely manner.
The Solution
One solution would have been to replicate data into our reporting database as it is generated. In other words, the dump would be spread out throughout the day and at no point recur for the same data point. That still meant hours transforming the data daily though, not to mention a minefield of read/write conflicts. Our lead, Jatin Solanki, proposed to spread out the transformations as well, i.e. transform the data on its way to being dumped. On top of that, we coupled this project with a migration to BigQuery to take advantage of its BI Engine and its table partitioning and clustering features for faster report load times.
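To sketch what the BigQuery side of that looks like (the project, dataset and column names below are invented purely for illustration, and the real schema is far wider), a reporting table can be declared as partitioned and clustered through the google-cloud-bigquery Python client, which is what keeps report-side scans small and fast:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials are configured

# Hypothetical reporting table used only as an example.
table = bigquery.Table(
    "our-project.reporting.fave_deals_transactions",
    schema=[
        bigquery.SchemaField("transaction_id", "STRING"),
        bigquery.SchemaField("country", "STRING"),
        bigquery.SchemaField("merchant_id", "INT64"),
        bigquery.SchemaField("gross_profit", "NUMERIC"),
        bigquery.SchemaField("transacted_at", "TIMESTAMP"),
    ],
)

# Partition by transaction time so reports only scan the days they ask for...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="transacted_at",
)
# ...and cluster on the columns most reports filter or group by.
table.clustering_fields = ["country", "merchant_id"]

client.create_table(table, exists_ok=True)
```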
Our on-the-fly transformations solution emphasises accuracy, foresight, modularisation and collaboration.
To be clear, there are ready-built ETL tools and SaaS products in the market that do exactly this. We had been using one such tool for a few use-cases. It could flatten JSONs and calculate new fields as streaming occurred, but joining to other streamed tables had to be done through a separate feature with its own schedule and pricing. Ultimately our issue with it was its cost, but also its reliability. Even while using it for only a subset of our data, we had been plagued with failing pipelines that confounded us, and efforts to seek assistance from their support team more often than not fell short of our expectations. Because its core was not under our jurisdiction, we didn't have the fullest flexibility to investigate, experiment and tinker. The decision was made to close that door for now, go the open-source route and chart our own way for all our pipelines.
Our on-the-fly transformations solution emphasised the following tenets:
Accuracy cannot be compromised. Period.
Get ahead: We didn't want just a faster database or a more up-to-date one. We wanted to improve anywhere we saw it was needed. That meant more efficient logic, and deprecating some tables while introducing more comprehensive columns. The 8 members of our Data team collectively have 13 years of dealing with every single other team in Fave. Our design must facilitate answering their needs as well as those of our own data science projects.
Modularisation: As much as possible, we should apply predefined libraries, functions and variables (whether our own or from elsewhere) to reduce complexity and increase code reuse. While there could still be improvements, what we've accomplished so far has already proven advantageous. In fact, our accuracy check module is already being applied in other projects, with turnaround dramatically shorter given we were not building from scratch.
Engineering, as the capturers of the data, needs to always be on the same page (or at least only one page away): A great many companies suffer the inefficiency of not having their engineers and various data people talk to each other. We have that relationship at Fave and this undertaking shouldn't dislodge it.
In brief terms, the system we conceived consisted of Kafka being made to tap into our production database. We then had Python scripts listen to a combination of Kafka topics in real time, where each topic corresponds to a raw table's events***, and utilise Dask for streamlined parallel computing. Everything from joins to flattening to decoding to enrichment to practically any beneficial computation would be applied at this point. This improves on our SQL-scripted transformations by being continuous: new data is created, immediately streamed and transformed. It improves on the SaaS product we were using by way of a more integrated and customisable answer to our woes. It also improves on both by making use of distributed computing, so the load involved in the transformations is shared among parallel-running workers, decreasing the operation time further. That same script would then insert or update into the final tables that plug into the data reports used across Fave as well as the reports we send out to merchants and partners.
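A heavily simplified skeleton of one such listener, just to give a flavour of the moving parts, might look like the sketch below. The topic names, the transformation body and the loader are placeholders; the real scripts additionally handle joins across topics, late events, offset management and retries.

```python
import json

from dask.distributed import Client, fire_and_forget
from kafka import KafkaConsumer  # kafka-python

dask_client = Client()  # local Dask cluster here; in production this would point at a scheduler


def transform(event: dict) -> dict:
    """Placeholder for the real logic: flattening, decoding, enrichment, joins."""
    row = dict(event)
    row["gross_profit"] = row.get("selling_price", 0) - row.get("cost_price", 0)
    return row


def load(row: dict) -> None:
    """Placeholder for the insert/update into the final reporting tables."""
    print(row)


# Each topic corresponds to one raw table's events.
consumer = KafkaConsumer(
    "production.public.purchases",
    "production.public.redemptions",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    # Transform on the Dask workers and hand the result straight to the loader,
    # so a slow event never blocks the consumer loop.
    transformed = dask_client.submit(transform, message.value)
    fire_and_forget(dask_client.submit(load, transformed))
```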
Pursuing any single milestone in this project was not easy for individuals proficient in Fave's existing data set-up. It was not easy for those who already had numerous Python projects and libraries under their belt. One challenge certainly was that those of us who knew Fave's data inside-out were not the same people within the team who had an established repertoire around Kafka and Dask. Coding transformation logic on top of events-based input and using distributed computing for efficiency, which by the way also inevitably separated linked events, was one thing. Having to contend with the sheer inherent complexity of getting numerous raw tables with distinct population timelines into a coherent unified table was its own beast. All of us took on numerous new skills while leveraging the skills, knowledge and business sense the data analysts had already accumulated. By the end, we had clocked in excess of 25,000 lines of Python.
Why transform the data at all?
As much as the transformations could be coded as part of a data science project, or run when somebody clicks submit in one of our reports to get results, that is not ideal because it takes up time. Take Fave Deals again. To generate the full range of transactional information from a SQL-based report, half of its script would be taken up just by the joins of at least 15 different raw tables. A run of such a report would correspond to a load time best described as the timer I could use when making roast potatoes.
Data availability lag is down from 36 hours to <8 minutes.
Transformations save time and effort for the really potent work to be done on top of the data. Data science work can be focussed on modelling. Analyses need not carry code sections simply to translate the European currency notation that Indonesia uses into the more typical format to allow for arithmetic.
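For instance, before this project every analysis touching Indonesian amounts carried its own snippet along these lines (the function name and the "Rp" prefix here are illustrative assumptions) before any arithmetic could happen. That conversion now lives once, inside the pipeline:

```python
def parse_european_amount(raw: str) -> float:
    """Convert a European-notation amount such as 'Rp1.234.567,89' to 1234567.89.

    '.' is the thousands separator and ',' the decimal separator.
    """
    amount = raw.strip().removeprefix("Rp")  # removeprefix needs Python 3.9+
    return float(amount.replace(".", "").replace(",", "."))


assert parse_european_amount("Rp1.234.567,89") == 1234567.89
```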
So does it work?
We brought the data availability lag from 36 hours down to <8 minutes. With this, our marketing team no longer needs a separate pipeline and report to be established and maintained just to watch costs as they happen. That same dataset can advise our Operations team on their turnaround times to create, quality-check and deploy new offerings. The Partnerships and various Sales teams can communicate with their stakeholders with a heightened level of informativeness. Product and Engineering benefit from our API serving our team's AI-generated content, built on that same dataset and created just hours prior, to both our merchant app (FaveBiz) and our consumer app. Similarly, CRM communications stay relevant and personalised to each Fave user. Everybody wins.
But...
While we agree it was necessary for the company, with its unending need for customisation and its propensity for prudent spending, the barrier of entry to our team has undoubtedly been raised. Custom-built Python libraries, including our own version of Dask (pending approval of our pull request), need to be maintained. Bugs ranging from a missing comma, to leaking Kafka offsets, to the more intertwined issues stemming from how Fave's raw tables are populated in production are all ours to own and ours to resolve.
As mentioned, it is elaborate even for Python experts and labyrinthine even for those who are able to navigate the maze that is Fave's relational database structure. That said, we welcome all attempts to scale this challenge, in both the literal and figurative senses. At the end of the day, we've built real-time Python processing for broad-scale utilisation, something other companies choose to pay for. More than anything, we're really quite excited by the opportunities this has given us.
25,000 lines of Python + 3 custom-built Python libraries, 250x faster data availability, a world-class low-cost event-streaming system, and development and delivery under global pandemic conditions.
Up until the last few weeks of this project, we were consistently in a one-step-forward, two-steps-back dance. Constant hypothesising and experimenting, with last week's breakthrough becoming this week's redundancy, at times left us dejected. A 3-month planned timeline stretched to a 4-, then 5-, then 6-, then 7-month execution as each key component had to be planned, coded, then stress-tested against a permutation and combination of scenarios. All this until we reached the point of a very slow and very cautious realisation of "I think... we've maybe kinda sorta did it".
25,000 lines of Python, not including 3 custom-built Python libraries; 250x faster data availability; a world-class low-cost event-streaming system; and development and delivery under global pandemic conditions. Not bad at all for a team that, while not ignorant of Python, was until a year ago working primarily in SQL on a day-to-day basis.
*Fave’s Data Science Team, at time of writing, consists of Husein Zolkepli (Creator of Malaya — THE Malay language toolkit library and our chief tamer of Kafka), Lin Cheun Hong (budding data engineer who always has a fresh thought to contribute), Evonne Soon, Zuhairi “Harry” Akshah and myself, Sarhan Abd. Samat (data analysts/analytics engineers who successfully detoured into Dask scripting with just the Python basics), Cheok Huei Keat (who made sure the company knew our team still existed but somehow managed to serve up some time-saving data science deliverables) and Faris Hassan (our only data scientist from the get-go and our generous teacher in the months to come). We are helmed by Jatin Solanki who, and this cannot be emphasised enough, always showed the right amount of guidance, patience, shielding and confidence in us throughout. A lesser leader would have binned this project long before it came to fruition. Also an unqualified special mention to Aiyas Aboobakar who helped get this going before abandoning us for greener pastures.
**Essentially what happens during transformations: 1) Different data sources are combined into a unitary data source. 2) The data is cleaned so your RM39.99 that was stored in a JSON just becomes 39.99 to facilitate arithmetic operations. 3) Timestamps are localised from UTC. 4) Some decoding of ids to their names as well as categorisation, among other things, also done to make the data more human-decipherable. 5) Some additional calculated fields like gross profit are created and stored so it’s lighter for reports to produce.
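As a toy illustration of these steps (the field names, the id mapping and the UTC+8 offset below are assumptions made up for the example, not our actual schema):

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))  # e.g. Malaysia/Singapore; Indonesia would differ
PAYMENT_METHODS = {1: "credit_card", 2: "online_banking", 3: "boost"}  # invented mapping


def transform_purchase(raw: dict) -> dict:
    """Toy version of the cleaning/enrichment applied to each purchase event."""
    flat = dict(raw)
    flat.update(flat.pop("payment_details", {}))                    # 1) combine/flatten sources
    flat["amount"] = float(str(flat["amount"]).removeprefix("RM"))  # 2) 'RM39.99' -> 39.99
    flat["purchased_at"] = (                                        # 3) localise from UTC
        datetime.fromisoformat(flat["purchased_at"])
        .replace(tzinfo=timezone.utc)
        .astimezone(LOCAL_TZ)
    )
    flat["payment_method"] = PAYMENT_METHODS.get(                   # 4) decode ids to names
        flat.pop("payment_method_id"), "unknown"
    )
    flat["gross_profit"] = flat["amount"] - flat["cost_price"]      # 5) pre-computed field
    return flat
```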
***Each version/snapshot of a table row at any given time, even a few milliseconds apart, is an event.
Translated from: https://medium.com/fave-engineering/how-python-novices-revolutionised-faves-data-warehouse-with-on-the-fly-transformations-1520c24c160c