Data Engineer 101

With the popularity of mobile devices and the pervasiveness of all kinds of information on the internet at a record high and still growing fast, "big data" has become an ever "sexier" concept. Given the huge and still-growing size of the data, as well as its dazzling complexity, in the past few years the "data engineer" role has been spun off from the general "software development engineer" role and is attracting more and more attention. However, to many people there is still a lot of mystery around this role. Here is some simple Q&A based on my past 7 years of experience working closely with data at a giant software company.

What is a Data Engineer?


A Data Engineer is an engineer who builds the systems that handle the data processing pipeline. Let's think about this from two perspectives.

(1) From a business and management perspective, there are many important questions that need to be answered, such as "which region generates the most revenue?", "how has my customer base grown over the past year?", and "which are the most important target customer segments that we should invest in?". All these questions (ideally) need solid data about the business to answer. When we have the data at hand to make the decision, it is called a "data-driven" decision.

(2) From a practical/technical perspective, where and how do we get this data? In the very old-fashioned way, in a small business, the owners and/or managers meticulously maintained the important numbers with pen and paper and did all the math on paper, as we all did at school. Then the calculator came in handy: we no longer needed to calculate by ourselves, but still needed the manual, literal book-keeping effort. Starting about 20-30 years ago, PCs started going into every office and home, and Excel became the primary go-to software for all kinds (big or small) of corporate data keeping and report generation. However, the data entry work was mostly still manual and manageable, and minimal engineering effort was involved at that time. Then, over the past 15-20 years, the internet and everything on the web went viral, collecting all kinds of data by human effort alone became impossible, and so the engineers stepped in. At the beginning, they were just part of the traditional software engineering team, a handful of people working with the data.

More recently, as the data scale has grown (from hundreds or thousands of entries a day to hundreds or even thousands of entries per second; think about the API calls of popular websites) and complexity has increased (the number of different types of data sources, the business relationships between different types of data, etc.), "ad-hoc" efforts from general software engineers are no longer enough to understand and build the more sophisticated data systems, hence the dedicated "data engineer" role emerged. In short, data engineers are people who possess the technical skill set to make the enormous task of collecting and processing data as automated as possible, in a timely fashion, so that the important business questions can be answered with the data presented.
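Once the data has been collected, a question such as "which region generates the most revenue?" usually comes down to a query over it. The following is only a minimal sketch under stated assumptions: the `orders` table, its `region` and `revenue` columns, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical order records: (region, revenue). Invented for illustration only.
ORDERS = [
    ("North", 120.0),
    ("South", 80.0),
    ("North", 45.5),
    ("East", 150.0),
    ("South", 30.0),
]


def revenue_by_region(orders):
    """Return (region, total_revenue) rows, highest revenue first."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    try:
        conn.execute("CREATE TABLE orders (region TEXT, revenue REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
        rows = conn.execute(
            "SELECT region, SUM(revenue) AS total "
            "FROM orders GROUP BY region ORDER BY total DESC, region"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    for region, total in revenue_by_region(ORDERS):
        print(f"{region}: {total:.2f}")
```

The query itself is the easy part; most of the engineering effort described above goes into getting trustworthy rows into a table like this automatically and on time.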

What is the difference between a Data Engineer and a Data Scientist?


Data Engineers are more focused on the data collection and data "massaging" (transformation) process, on the engineering side of building the data pipeline system. At the output end of the pipeline, the processed data most commonly lands in a centralized place called a "data warehouse". Data Scientists are usually more focused on working with the resulting data in the warehouse, on the analysis side. They look for patterns and insights and do more in-depth, ad-hoc, complex analysis (beyond straightforward aggregation and segmentation) to help answer those important questions. They are also the most likely to raise more questions about the systems and the business (when some "unexpected" observations happen) and to ask for more data to be collected (generating requirements to answer the "why" behind the "unexpected").

However, the above distinction describes an ideal, perfect world, which rarely exists in real life. In most organizations, and especially in the early stage when a "data team" is being formed, the initial hires are usually required to possess both skill sets and do it all: (a) collect/understand the requirements from the business side; (b) build the data pipeline systems; (c) generate automated reports; (d) do some ad-hoc analysis on the data to answer or raise questions.

What does a Data Engineer do in day-to-day work?


This is repetitive, but it is very true: a typical data engineer basically spends most of their time at work on one of 4 things: (a) collect/understand the requirements from the business side; (b) build the data pipeline systems; (c) generate automated reports; (d) do some ad-hoc analysis on the data to answer or raise questions.

How the work is split depends heavily on the stage of your organization and the readiness or maturity of the data team: the later (more mature) the stage, the more specialized the individual roles. Usually there is 1 person collecting the requirements (mostly a program manager), 2-3 people building the pipelines (could be more, depending on the complexity of the system and the business logic), 1-2 people generating the reports (such tasks can be tricky, far more complex than a one-button click in Excel to get a pie chart or a line trend graph; some rather sophisticated query-writing skills are required here), and 1 person doing the ad-hoc analysis work. This is just the rough ratio that I have observed so far across the different projects I have been on, and the team size can grow proportionally as the project reaches a more evolved stage. Sometimes (a) and (d) are put into a single pair of hands; such roles are called either "data program managers" or "data scientists", depending on which part, (a) or (d), is emphasized more. In a startup or any early-stage data team, the "data engineer" would most likely do all 4 by him/herself. On the other hand, in a more "traditional sense" or "strictly defined scope", a Data Engineer is focused on (b) and (c), or on just one of them (data pipeline building and report generation).

What is the desired skill set for a Data Engineer?


Let's discuss the desired skill set based on the 4 important things in the day-to-day work, and why we are looking for each skill.

(a) collect/understand the requirements from the business side: Communication, both oral (since you will talk to your stakeholders/managers to find out what they want, AND translate that "fuzzy" business language into "data models", and translate your model results back into business-meaningful insights/action items) and written (you will have numerous emails back and forth to clarify things/requirements, and sometimes formal design documents and progress reports). Please note that in the "oral" part, "listening" is sometimes more important than "speaking". Because you are mostly on the "receiver" side (especially at the initial stage of your career), only if you can "listen" well and "receive" the message can you start to "speak": to clarify requirements and restate your understanding, to make sure you got the information and requirements correctly. I cannot emphasize enough the importance of this "confirmation" stage, as issues introduced at this early stage can be extremely expensive to correct later.

(b) build the data pipeline systems: Database design, more specifically data warehouse design (that's the "house" of your data; a "good" or "bad" design makes the later query efforts of all downstream parties significantly different). Data pipeline building: (i) identify the data sources; (ii) understand their individual structures and the relationships between them (because you eventually need to "join or link" information across different data sources); (iii) design the pipeline; (iv) implement/build the pipeline (there is a lot of software available on the market to choose from for implementation, and prior knowledge of any of it would definitely be a plus; however, it is not uncommon for pipelines to be built on generic programming languages such as Java, Python, Perl, SQL, or a mix of several).

Most junior people think that the "implementation" part (iv) is the most important skill required. I would say that is only partially true. It is probably quite fundamental at the early stage, as it provides the basic "building bricks" for this job. However, the more your career advances (which probably also means that you have failed quite a few times, either by "implementing wrong" or by "implementing right" but not the "right things"), the more you will appreciate that "understanding what needs to be done" and "design" are far more important than "implementation" itself. It is like "do the right things" vs "do things right". Many junior people may wonder: "What if I do not have any experience? I do not even know how to do things right (because I have never done it before), so how could I know what the 'right' things to do are?" My suggestion is simple: "LOOK AT THE DATA, USE YOUR COMMON SENSE". However, just as Voltaire said, "common sense is not that common". It takes time to learn how to use "common sense" to look at the data, to observe the "unexpected" and to ask questions, but the earlier you start proactively looking at the data and using "common sense" to ask questions, the steeper your learning curve will be along the road.
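To make steps (i) to (iv) a little more concrete, here is a minimal sketch of a pipeline built on a generic programming language. Everything in it is an assumption made for illustration: the two CSV sources, their column names, and the function names are hypothetical, and a real pipeline would read from actual systems and load into a warehouse. Note that the order that cannot be linked to a customer is set aside for a human to look at rather than silently dropped, in the spirit of "look at the data".

```python
import csv
import io

# Hypothetical source 1: a customer export (CSV text).
CUSTOMERS_CSV = """customer_id,name,region
1,Alice,North
2,Bob,South
3,Carol,East
"""

# Hypothetical source 2: an order log (CSV text).
ORDERS_CSV = """order_id,customer_id,amount
100,1,20.00
101,2,35.50
102,1,14.50
103,9,99.00
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_order_facts(customers_text, orders_text):
    """Join orders to customers on customer_id.

    Returns (facts, rejected): joined rows ready to load, and orders whose
    customer_id has no match in the customer source.
    """
    customers = {row["customer_id"]: row for row in read_rows(customers_text)}
    facts, rejected = [], []
    for order in read_rows(orders_text):
        customer = customers.get(order["customer_id"])
        if customer is None:
            rejected.append(order)  # keep the "unexpected" rows visible
            continue
        facts.append(
            {
                "order_id": int(order["order_id"]),
                "customer": customer["name"],
                "region": customer["region"],
                "amount": float(order["amount"]),
            }
        )
    return facts, rejected


if __name__ == "__main__":
    facts, rejected = build_order_facts(CUSTOMERS_CSV, ORDERS_CSV)
    print(len(facts), "rows to load;", len(rejected), "rows to investigate")
```

The implementation is short; the decisions that matter are the design ones, such as which key links the sources and what should happen to rows that do not match.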

(c) generate automated reports: The "common sense" and "do the right things" points definitely apply to this stage as well. However, this stage can be a little too late to correct any significant errors/issues (which usually happened way earlier, up the processing pipeline or even at the design or requirement collection stage). From a purely technical point of view, experience with any data visualization software on the market would be great; on the other hand, the bar for query-writing skills is usually higher here, in order to handle complicated business scenarios. Some "visual sense and judgement" is also desirable to make the reports easy to understand, especially for non-technical audiences. If you have some talent in UX design to make the "look and feel" of the reports more attractive, that is definitely a plus, but rarely a requirement, not to mention that it is mostly constrained by the software/tools you pick.

From a totally different angle, I value "INTEGRITY" more highly than anything else. Here is why. When you have spent many hours, days or even weeks finishing some reports and find that the results look "ugly" and "unacceptable" (lots of data issues first bubble up in the report generation step, as a graphical representation is much more powerful than pure numbers in a database), it is quite hard to accept such "ugly" figures. They cause a lot of frustration and resentment (towards whoever potentially made this happen), and it is tempting to apply some simple "twist" to make them look much nicer. There are lots of judgement calls here. Sometimes such "twists" are necessary steps, if we have a sound understanding and sound reasons for them. But most of the time it is very dangerous, since your "twists" could mask important issues that are difficult to observe from other angles. As a baseline, NEVER FAKE DATA, especially when you are on your own (and you know nobody will double-check your queries). INTEGRITY about the data (and as a PERSON) comes above everything else.

(d) do some ad-hoc analysis on the data to answer or raise questions: More critical/advanced skills: critical thinking, asking good questions, understanding the bigger industry/business scenarios. Basic skills: query writing, some basic statistical knowledge, and machine learning knowledge (with less emphasis on the theoretical part and more on the application side, such as: Can I solve this problem without using any learning algorithm? What ML problem does this real-life problem translate to? Is it a classification problem or a clustering problem? Is it a supervised or an unsupervised learning problem? If supervised, where can I get the labeled training data? Is the labeled data trustworthy and of acceptable quality? Which methods should I use to address this issue? What in the nature of the data/problem leads me to believe so? How can I evaluate my results? If they are not good, what is the root cause? Is the data not clean, is the model not good, did we model it wrong, or should we not use a learning model at all and use a rule-based system instead? etc.)
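Two of these questions, "can I solve this problem without using any learning algorithm?" and "how can I evaluate my results?", can be sketched together. The example below is only an illustration under stated assumptions: the fraud-flagging scenario, the labeled examples, the threshold, and the function names are all hypothetical. It scores a simple rule against labeled data, which gives a baseline that any learned model would have to beat.

```python
# Hypothetical labeled examples: (order_amount, is_fraud). Invented for illustration.
LABELED = [
    (20.0, False),
    (5000.0, True),
    (35.0, False),
    (7200.0, True),
    (6100.0, False),
    (15.0, True),
]


def rule_based_flag(amount, threshold=1000.0):
    """Rule-based baseline: flag any order above a fixed amount (no learning)."""
    return amount > threshold


def evaluate(predict, labeled):
    """Compare predictions with the labels and return basic quality metrics."""
    tp = fp = fn = tn = 0
    for value, truth in labeled:
        guess = predict(value)
        if guess and truth:
            tp += 1  # flagged, and really positive
        elif guess and not truth:
            fp += 1  # flagged, but actually negative
        elif truth:
            fn += 1  # missed a real positive
        else:
            tn += 1  # correctly left alone
    return {
        "accuracy": (tp + tn) / len(labeled),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    print(evaluate(rule_based_flag, LABELED))
```

If the numbers are poor, the same questions from the list apply: is the labeled data trustworthy, is the rule (or model) wrong, or is the problem framed incorrectly?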

What makes you a great Data Engineer?


Besides all the basic technical requirements (some programming skills, query-writing skills, database design skills, statistical skills) and the more advanced soft skills (communication, critical thinking, curiosity and constant learning), the single most important factor that will make you a great Data Engineer is PASSION for and LOVE of data: you feel excited to work on the data, to look at it, to make sense of it and to use it to achieve something bigger (like answering those big business questions and helping to make very important business decisions). On the personality side, you need to be very logical and detail-oriented, which is super important when working with data, as data work is about arranging details together in a logical way. At the same time, you need to practice your "zooming" capability and flexibility when you examine the data, just like zooming your camera lens: from raw detail-level data, to different aggregation angles, to an even bigger/higher perspective, stepping out of the data itself to rethink what questions we intend to answer and how they can be answered by looking at this data. The more agile you are at "zooming" your data lens and your thinking process, the better you are as a Data Engineer, and the more ready you are to go to the next level of your career (on either the Data track or some other track).
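The "zooming" idea can be illustrated with the same raw data viewed at several aggregation levels. This is only a sketch under stated assumptions: the event records, the field names, and the `rollup` helper are hypothetical, and in practice this would more likely be a GROUP BY query against the warehouse.

```python
from collections import defaultdict

# Hypothetical raw events: (day, region, product, revenue). Invented for illustration.
EVENTS = [
    ("2024-01-01", "North", "A", 10.0),
    ("2024-01-01", "North", "B", 5.0),
    ("2024-01-01", "South", "A", 7.0),
    ("2024-01-02", "North", "A", 12.0),
    ("2024-01-02", "South", "B", 3.0),
]

# Names of the grouping fields, in the same order as in each event tuple.
FIELDS = ("day", "region", "product")


def rollup(events, keys):
    """Sum revenue grouped by the given field names; () gives the grand total."""
    positions = [FIELDS.index(key) for key in keys]
    totals = defaultdict(float)
    for event in events:
        group = tuple(event[i] for i in positions)
        totals[group] += event[3]  # revenue is the last element
    return dict(totals)


if __name__ == "__main__":
    print(rollup(EVENTS, ("day", "region")))  # zoomed in
    print(rollup(EVENTS, ("region",)))        # zoomed out one level
    print(rollup(EVENTS, ()))                 # the whole picture
```

Moving between these levels quickly is what makes it possible to notice that a strange total traces back to a handful of raw rows, or that a raw-level detail does not matter for the question being asked.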

What to learn and how to learn to be a Data Engineer  


I find it really difficult to give any general advice on this, as it really varies with the individual's background. I will put it into a few big, rough levels, mostly on the technical side, as the soft skills are much more general and can be polished in many more settings, not limited to the Data Engineer role.

a. Basics: college-level mathematics, statistics, computer programming (any language: Java, C, C++, C#, Python, Perl, etc.). You need to be capable of procedural programming (giving the computer instructions on what to do, like first get all the files, then open each of them one by one, then iterate over the file line by line, do something with each line, etc.) and basic SQL (you need to know how to write result-oriented queries).

b. Intermediate: database design (data warehouse design), object-oriented design (useful for the data pipeline design).

c. Advanced: understanding the usage and infrastructure of big data platforms (map-reduce, distributed file systems, big tables, etc.); application-level machine learning knowledge to solve real-life problems.
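Two items from the levels above can be sketched together: the procedural "get all the files, open each one, iterate line by line" loop from the basics, and the map-reduce idea from the advanced level. This is a single-machine toy under stated assumptions, not how a real big data platform is implemented: the function names, the `*.txt` pattern, and the sample files are hypothetical, and a real framework would run the map and reduce phases in parallel across many machines.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def read_lines(directory, pattern="*.txt"):
    """Procedural basics: get all the files, open each one, yield line by line."""
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(directory):
    """Count words across all matching files using map, shuffle and reduce."""
    pairs = (pair for line in read_lines(directory) for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text("big data\nBig pipelines\n", encoding="utf-8")
        Path(tmp, "b.txt").write_text("data data\n", encoding="utf-8")
        print(word_count(tmp))
```

The point of splitting the work into a map step and a reduce step is that each line can be mapped independently and each key can be reduced independently, which is what lets a platform spread the work over many machines.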

All of this content can be found in the numerous MOOC resources available online. However, even though most of them are designed to be "real-life" and "practical", they are after all still "courses", and need to be self-contained and "complete" in order to be solvable by students. The best way to learn is to get the chance to work on a real "real-life" problem (through an intern project, or by offering a hand to the neighboring "data team" if you are already in a software company or an IT team). You will get a sense of, and learn how to handle, the totally "imperfect" world: how "dirty" the data can be, how "incomplete" the information can be, and how you still need to "find a way" to do your work and produce results... Only then: welcome to the real data world, and enjoy the journey!
