Data engineer career planning
Data engineering is a fascinating field. You get to work with a variety of interesting data and cutting-edge technologies, alongside diverse teams of data professionals and domain experts. The field as a whole is still relatively new. As a data engineer, your role is crucial to the company’s success: many data professionals, including data analysts and data scientists, rely on you to do their work. You are responsible for equipping them with data that is always available, reliable, and properly structured.
Companies need you so that they can make informed decisions based on real data and the KPIs derived from it. And they are willing to pay you well if you are good at it! Let’s look at which skills are in high demand, which factors play a large role in your future career prospects, and how to approach the technical interview.
The skills in demand
It’s hard to give truly general advice, but below I summarize the skills that seem most relevant, based on how often I have seen them mentioned in job ads and on my own experience in the field.
1. Being a T-shaped professional
It’s considered best to aim at being a generalist (the horizontal bar of the T): you understand the general concepts of databases, cloud computing, data warehousing, and big data, and you know at least the basics of SQL, Python, Docker, and building ETL pipelines.
At the same time, you should have stronger skills in at least one particular area (the vertical bar of the T). For instance, you might be really good at writing Spark or Dask data manipulations, or you may have domain knowledge that the company you are applying to requires, which sets you apart from other applicants.
In many cases, knowing SQL well, plus the basics of Python, Linux, and AWS, can already get you a fairly paid junior position.
2. Cloud services for working with data
Cloud computing has revolutionized many industries. As a data engineer, you need to know the most important services for storage, compute, networking, and databases. If you don’t know much about those, I highly recommend learning Amazon Web Services. Even if you end up using Google Cloud Platform or Microsoft Azure, the concepts learned from AWS carry over easily, since many services across cloud vendors are analogous and their underlying concepts are virtually the same (e.g., block storage vs. object storage vs. NFS).
If you are new to AWS, you can find great free courses on it, all offered directly by AWS. You don’t need to pay for the extra certificate: in my experience, recruiters and engineering managers don’t care that much about certifications. They want people with hands-on experience who know a lot and can apply it to business problems.
The most important AWS services for a data engineering position are:
- S3: being able to interact with files on S3 programmatically (e.g., to download and upload a CSV or Parquet file)
- EC2: being able to spin up an EC2 instance and SSH into it, plus knowing enough Linux basics to interact with it from the CLI
- IAM: knowing how to create an IAM user, attach a policy for the relevant services, and use that user to configure programmatic access with the AWS CLI, plus the basics of how IAM roles work
- VPC: knowing what a VPC and a subnet are and the basics of how they work (e.g., a VPC exists in a specific AWS region, and a subnet in a specific Availability Zone within that region)
- RDS: knowing how to spin up, or at least interact with, a relational database such as Postgres
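The first item on the list, programmatic access to S3, can be sketched with boto3, the AWS SDK for Python. This is only a minimal sketch under stated assumptions: boto3 is installed, credentials are already configured (for example through the AWS CLI), and the bucket name, key, and file paths in the usage comments are hypothetical placeholders. The helpers take the client as an argument so they can be tried out with a stand-in object before touching a real bucket.

```python
from pathlib import Path


def make_s3_client():
    """Create an S3 client; assumes boto3 is installed and credentials are configured."""
    import boto3  # imported lazily so the helpers below work with any compatible client

    return boto3.client("s3")


def upload_file_to_s3(client, local_path, bucket, key):
    """Upload a local file (CSV, Parquet, ...) to s3://bucket/key."""
    # boto3 S3 client call: upload_file(Filename, Bucket, Key)
    client.upload_file(str(local_path), bucket, key)


def download_file_from_s3(client, bucket, key, local_path):
    """Download s3://bucket/key to local_path, creating parent folders first."""
    local_path = Path(local_path)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    # boto3 S3 client call: download_file(Bucket, Key, Filename)
    client.download_file(bucket, key, str(local_path))
    return local_path


# Usage (hypothetical bucket and key names, shown as comments so nothing runs by accident):
# s3 = make_s3_client()
# upload_file_to_s3(s3, "sales.csv", "my-example-bucket", "raw/sales.csv")
# download_file_from_s3(s3, "my-example-bucket", "raw/sales.csv", "downloads/sales.csv")
```

Passing the client in, rather than creating it inside each function, is a design choice for this sketch: it keeps the helpers easy to test and lets one client be reused across calls.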
Additionally, it’s good to know AWS Lambda (serverless Function as a Service), ECS and EKS (running containers at scale), Amazon Redshift (cloud data warehouse), Athena (serverless query engine for querying an S3 data lake), and Amazon Kinesis or Amazon MSK (both used for real-time streaming data). But you can focus on the ones in the bulleted list first. The courses on edX explain most of them. Plus, remember to practice: with the AWS Free Tier you get (limited) access to those basic services, so you can play around and learn by doing.
3. Building ETL pipelines
Much of a data engineer’s work is about integrating data from various sources, bringing it into a form appropriate for analysis, and then loading it into a data lake or data warehouse. You should have some experience in building ETL pipelines. That doesn’t mean you must have worked on a Big Data project at a large company: even self-driven projects shared on GitHub or in a blog post can get you far in the application process and make you stand out from the crowd.
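To make the extract, transform, and load steps concrete, here is a small self-contained sketch using only the Python standard library. The sample data, the `orders` table, and the in-memory SQLite database (standing in for a real data warehouse) are all made up for illustration; a real pipeline would read from actual sources and write to an actual warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second order has no amount.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and cast columns to proper types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 14.75
```

Keeping each step in its own function makes it easy to test the steps separately and to swap the source or the target later.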
4. Managing, monitoring, and scheduling ETL pipelines
One of the main responsibilities of data engineers is to ensure that the data is always available, reliable, and properly structured. To achieve this, you need to schedule and monitor your data pipelines. Many companies use workflow management systems such as Apache Airflow or Prefect for this purpose, so knowing one of them may significantly improve your chances of getting a great data engineering job. If you want to learn more about those, read my previous stories; in one of them, I demonstrate how to easily set up a workflow management system with a serverless Kubernetes cluster on AWS.
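To give a feel for what a workflow management system takes off your hands, the sketch below runs tasks in dependency order, retries failures, and logs every attempt. It is deliberately not the Airflow or Prefect API: it uses only the Python standard library (`graphlib` needs Python 3.9 or later), and the `run_pipeline` function and the task names are hypothetical.

```python
import logging
from graphlib import TopologicalSorter  # standard library, Python 3.9+

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task up to max_retries times.

    tasks: mapping of task name -> zero-argument callable
    dependencies: mapping of task name -> set of task names it depends on
    """
    finished = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.info("task %s succeeded on attempt %d", name, attempt)
                finished.append(name)
                break
            except Exception:
                log.exception("task %s failed on attempt %d", name, attempt)
        else:
            # Every attempt failed: stop so downstream tasks never run on bad data.
            raise RuntimeError(f"task {name} failed after {max_retries + 1} attempts")
    return finished


attempts = {"count": 0}


def flaky_extract():
    """Hypothetical task that fails once before succeeding."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("source temporarily unavailable")


order = run_pipeline(
    tasks={"extract": flaky_extract, "transform": lambda: None, "load": lambda: None},
    dependencies={"transform": {"extract"}, "load": {"transform"}},
)
print(order)  # ['extract', 'transform', 'load']
```

Real workflow managers add scheduling, a UI, alerting, and distributed execution on top of this basic idea.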
5. Ability to work with containers: Docker & Kubernetes
If you work with Python, you know that your code may suddenly stop working because you upgraded to a new pandas version. Containerization addresses this. Being able to work with containerized workloads is one of the most crucial and most in-demand skills in (any) engineering job: a container packages your code together with its dependencies, which makes it self-contained and lets you deploy it to virtually any environment.
6. Knowing basic concepts
This goes together with being a T-shaped professional: you should know the basics of data warehousing, data lakes, Big Data, REST APIs, and databases. It would be rather disappointing to fail to explain the 3Vs of Big Data (volume, velocity, and variety) or the characteristics of a data warehouse during your job interview. Additionally, it’s worth knowing the architectural components; in another post, I discuss data warehouse architectures and key considerations when migrating to the cloud.
7. Ability to work and learn independently
This goes without saying: with technologies evolving so fast, it’s crucial that you are a self-directed learner, willing to continuously learn and experiment with new tools. It doesn’t mean that you need to chase every hype, but rather that you stay open-minded.
8. Coding skills
Programming doesn’t mean that you must be a “hacker” who spends all day doing nothing but writing code. It’s rather about being able to learn quickly and knowing how to write good abstractions. In data engineering, this means knowing how to write code that is DRY (Don’t Repeat Yourself): you don’t copy and paste the same code from one script to another, but write functions or classes in a modular and reusable way. Clean code that can be reused, extended, and parametrized is easy to maintain and will save you and others time.
To give you an example: I once worked for a company where there was almost no modularity in place. In almost every Python project, people copied over the same code to set up logging, to connect to a data warehouse and load data to and from it, or to create an S3 client and download a CSV file from an S3 bucket. To improve this, I created a Python package:
- it included all the functions that were needed in almost every project, and I pushed it to a new GitHub repository.
This package could then be installed anywhere via:
pip install git+https://github.com/…/….git
This package saved us all a lot of time in the long run and made the codebase much cleaner.
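As an illustration of the kind of helper such a shared package might contain, here is a sketch of a reusable logging setup. It is not the package described above: the `get_logger` function, its log format, and the project name `sales_etl` are assumptions made up for this example.

```python
import logging


def get_logger(name, level=logging.INFO):
    """Return a logger with one consistent console format, configured only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


# Every project imports the helper instead of copying the setup code.
log = get_logger("sales_etl")  # "sales_etl" is a hypothetical project name
log.info("pipeline started")
```

With the helper in one place, a change to the log format is made once and picked up by every project that installs the package.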
If you are a Python beginner, you don’t need to learn how to create packages. At first, it may be enough if you can write good Python functions and know how to work with basic data manipulation packages such as pandas.
Many companies also look for data engineers who know Scala, Java, R, or C (or any other language you can think of). Regardless of the programming language, you can get a much better job if you understand the basic data types for working with data, as well as the principles of functional programming and modularity.
9. Command line
Being able to work with the Linux operating system and interact with it through bash commands is one of the most crucial skills, and it will make you much more efficient.
Many frameworks and cloud services work in such a way that we define our resources and services declaratively (for example in a Dockerfile or in Kubernetes YAML files) and then deploy them via a command-line interface (CLI). This paradigm is often known as Infrastructure as Code. For instance, the AWS CLI allows you to provision an entire cluster of resources simply by running commands in your shell, which it turns into requests to the AWS API. Other cloud providers (such as GCP or Azure) offer similar command-line interfaces.
10. Soft skills
Some may expect a data engineer to be a person who is doing nothing but writing ETL and crunching numbers. But in every job, it pays off to have skills that complement your profile. Imagine that you have two candidates:
- An excellent coder but a poor public speaker,
- An average coder but at the same time a great public speaker.
Which one would you hire? Many companies would pick the latter. Employers look for well-rounded individuals who also have important soft skills such as project management, public speaking, documenting, or moderating and organizing events.
Factors playing a large role in your career prospects
Salaries in data engineering jobs vary depending on the location, industry, required skills, and level of experience. Below, I list the 7 most important factors that determine salary and future growth. Some of them are obvious, but others may surprise you:
- Location: even if you apply for a remote job, chances are that the company will pay you based on the standards of the country you live in, to reflect the cost of living there.
- Industry: companies in the finance, automotive, tech, or pharmaceutical industries often pay much better than startups and e-commerce companies.
- Years of experience: recruiters are obsessed with them, even though the years themselves don’t tell much about how much you learned in your previous jobs.
- Expertise: years of experience are not equivalent to expertise (at least I think so). Some people are simply great at Spark, Linux, Dask, or advanced SQL. If you can prove that you really know one of these well, it may be worth more than 20 years of experience doing drag-and-drop ETL.
- Hands-on experience: nothing is worth more in engineering. Nobody can benefit from our knowledge if we can’t apply it in real life. Do personal projects and practice. Don’t just read something and think that you already know it; if you haven’t applied it, it’s all just theory that you will soon forget.
- Education: I personally found that recruiters don’t look at your education as much as I would have expected. Of course, they check whether you have a Bachelor’s or Master’s degree or even a Ph.D., but it often doesn’t matter much to them which university you attended or what your subject was. The same is true of certifications: many technical managers value your actual experience with specific tools or programming languages more highly than any official proof of your knowledge, and they may prefer to verify that knowledge themselves in the technical interview rather than rely on certificates.
- Special skills, domain knowledge, and soft skills (for instance, the ability to handle conflicts): these are more important than you might expect. Recruiters often reject someone because they feel that the person simply doesn’t fit into the team’s and company’s culture.
Interview prep
I have heard of cases where an applicant could not answer, in a phone interview, the question of what the company he or she had applied to actually does. Questions such as “Tell me about yourself” and “Why do you want to switch to a new company?” are also so common that it pays to think about them in advance.
Additionally, if you plan to apply, you should be prepared for some (basic) technical questions. Many data engineering managers ask you to design a star schema for a given scenario, or pose coding questions such as: what are SQL window functions; what are generators, broadcasting, or list comprehensions in Python; what is the difference between a Docker image and a Docker container; or how would you go about creating a Docker image and running a Docker container.
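As a quick refresher on two of the Python topics mentioned above, the following sketch contrasts a list comprehension with a generator expression and a generator function. The variable and function names are made up for illustration.

```python
numbers = range(1, 6)

# List comprehension: builds the whole list in memory right away.
squares_list = [n * n for n in numbers]

# Generator expression: produces values lazily, one at a time.
squares_gen = (n * n for n in numbers)


def even_squares(values):
    """Generator function: yields the square of each even value on demand."""
    for v in values:
        if v % 2 == 0:
            yield v * v


print(squares_list)                 # [1, 4, 9, 16, 25]
print(next(squares_gen))            # 1
print(list(even_squares(numbers)))  # [4, 16]
```

The practical difference in data work is memory: a generator can stream through a file far larger than RAM, while a list comprehension materializes every element at once.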
And lastly, believe in yourself and stay confident.
Conclusion
In this article, we discussed which data engineering skills are currently in high demand and what you need to know if you are looking for an entry-level position in this field. We also covered the factors that play a large role in data engineering career prospects and how to prepare for the interview. I hope you find some of this helpful. Thank you for reading!
Translated from: https://towardsdatascience.com/how-to-get-a-job-as-a-data-engineer-990e1cbbe192