Skills of a Data Engineer

Guha Ayan
Slalom Australia
Published Sep 13, 2021


Once upon a time, there were two species: software engineers and data management folks. They coexisted successfully and occasionally spoke to each other over strict protocols. Both areas had their own rituals, practices and culture. Data management was messy, monolithic and bulky, while software engineering was mature and maintainable. However, the success of software always depends on how well data is managed, which led data folks to start adopting software engineering practices. This is how a new function came along: the data engineer. Data engineers always have data at heart, but their craft encompasses software engineering.

Courtesy: Andrew Lamont— Slalom _build

This blog is my humble opinion about what skills a good data engineer should have.

Let’s explore each section.

What you cannot live without

SQL

SQL is undoubtedly the most widely known and probably one of the most important skills to have when it comes to data. SQL is used for data manipulation, data definition and data analysis, and it is also used for moving data from one place to another. It is a very simple declarative language with a handful of important reserved tokens and functions to remember. It is taught in schools, but you can always learn it from the wealth of resources available across the internet.
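To make this concrete, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The table and column names are made up for illustration, but the same definition, manipulation and analysis statements apply to any SQL engine.

    import sqlite3

    # In-memory database purely for illustration; any SQL engine works the same way.
    conn = sqlite3.connect(":memory:")

    # Data definition: create a table.
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")

    # Data manipulation: insert a few rows.
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "alice", 120.0), (2, "bob", 80.5), (3, "alice", 42.0)],
    )

    # Data analysis: aggregate spend per customer.
    for customer, total in conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY 2 DESC"
    ):
        print(customer, total)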

SQL remains a prominent skill in modern data architecture. Most common cloud MPP systems and modern distributed computing solutions offer SQL interfaces. In the last few years, a significant amount of work has gone into improving the SQL experience on larger data sets through better and smarter query optimisers.

Python

Python is by far the easiest programming language you can learn. It is intuitive, flexible and versatile, and it represents the “engineering” part of a data engineer. Python helps you deal with various types of data, interact with various types of resources and build systems. Most of the tools and technologies covered in the following sections are either written in Python or have Python bindings ingrained in their design. You can easily take advantage of modern software practices such as version control, CI/CD and containerisation. Finally, Python is widely used in data science thanks to its superior set of purpose-built libraries.

One word of caution, though, for those who are learning Python but have a past in server-side scripting: it is very easy to write Python scripts, but it takes effort and patience to learn the craft, and I learned this the hard way. Do not write Python code the way you would write shell scripts. Learn the Pythonic way.
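As a small, hypothetical illustration of that point: both functions below count error lines in a log file, but the first carries shell-scripting habits into Python while the second uses the language's own idioms.

    import subprocess

    # Shell-script habits ported to Python: shelling out and parsing text.
    def count_errors_shell(path: str) -> int:
        out = subprocess.run(["grep", "-c", "ERROR", path], capture_output=True, text=True)
        return int(out.stdout.strip() or 0)

    # The Pythonic version: a context manager and a generator expression.
    def count_errors_pythonic(path: str) -> int:
        with open(path) as f:
            return sum(1 for line in f if "ERROR" in line)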

Know your platform

Apache Spark

Let me be clear here: I am biased, and for a reason. Apache Spark has matured like no other system in the past five years. It is used heavily for batch and stream data processing, and all major cloud vendors have managed offerings. More interestingly, Apache Spark is also used in many novel ways, such as embedded edge processing. In short, Apache Spark is one of the most widely used and most versatile distributed compute systems available for commercial use. With the advancement of delta.io, mlflow, Kubernetes integration and various other ambitious improvements, Apache Spark is definitely a key skill that is here to stay.
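A minimal PySpark sketch, assuming pyspark is installed and using a hypothetical events.csv input; it shows the read, transform, write flow that most batch jobs boil down to.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

    # Hypothetical input: a CSV of raw events with a header row.
    events = spark.read.option("header", True).csv("data/events.csv")

    # A simple aggregation: events per user per day.
    daily = (
        events
        .withColumn("event_date", F.to_date("event_timestamp"))
        .groupBy("user_id", "event_date")
        .count()
    )

    # Write the result partitioned by date; Parquet is the usual choice.
    daily.write.mode("overwrite").partitionBy("event_date").parquet("output/daily_counts")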

Cloud Experience

I won’t spend too much time on this topic. Anyone who has been even remotely connected to the software industry in the last five to seven years will have had some exposure to the cloud.

What I will stress, though, is to invest time in learning the cloud. Develop a general understanding of cloud services: know which service does what and how they fit into a modern data architecture. As of today, all cloud offerings in the data space are very similar, with certain subtle differences. It is very important to know the general reference architecture, which can be translated to any cloud, and then build deep expertise in AWS, Azure or GCP.

My recommendation would be to learn cloud-specific data products in as much detail as possible, and to learn as much as possible (without losing generality) when it comes to non-data products. As an example, it is important to learn the finest details of the NoSQL or data warehousing product you are using, but knowing how virtual networks work in general is good enough.

Courtesy: https://www.reddit.com/r/AZURE/comments/mkfzww/big_data_pipeline_on_aws_microsoft_azure_and/

Data Warehousing

First of all, data warehousing is not dead. Some pronounced it dead in the wake of Hadoop, but they were wrong. It has remained one of the key components of data architecture even with its seemingly high cost and bulky operational nature. As it stands today, cloud offerings such as AWS Redshift, GCP BigQuery and Azure Synapse are all robust and mature data warehousing solutions, and it is helpful to learn one of them according to your cloud choice. Snowflake is definitely a cloud-born data warehouse, and if this is what your organisation uses then it is worth learning. I also mention it below in the “Something Extra” section.

Your Toolbox

Now let me talk about some of the engineering practices you need to add to your toolbox to build better data platforms and data products.

Automation

With the advent of cloud infrastructures and agile development practices, a lot of focus has shifted to automation.

Infrastructure Provisioning

In modern cloud architecture, it is important to decouple data and compute resources, and each compute resource has a cost defined by the time it is running. Hence it is critical to express every bit of required compute as code so that it can be turned on or off, or scaled automatically, to control cost. Depending on your cloud provider and organisational mandate, be familiar with either a cloud-native provisioning service (such as AWS CloudFormation or Azure Resource Manager) or a generic tool (such as Terraform or Ansible).
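As a hedged sketch of the principle rather than a recommendation of any one tool: the snippet below embeds a tiny, hypothetical CloudFormation template as code and deploys it with boto3, so the resource can be created, torn down or recreated on demand. Terraform or ARM express the same idea in their own syntax, and the same approach applies to clusters and other compute resources.

    import json
    import boto3

    # A deliberately tiny, hypothetical template: a single S3 bucket expressed as code.
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "RawDataBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": "my-team-raw-data-bucket"},
            }
        },
    }

    cfn = boto3.client("cloudformation")

    # Creating (and later deleting) the stack turns the infrastructure on and off.
    cfn.create_stack(
        StackName="raw-data-storage",
        TemplateBody=json.dumps(template),
    )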

CI/CD

While version control has long been a known practice, it is not enough anymore. With infrastructure being stateless and auto-scaled, it is important to be able to deploy code and configuration whenever a new piece of compute is provisioned. Automated deployment pipelines are not optional anymore; they are part of mature engineering practice. Git is by far the de-facto standard for version control. Learn Git. Additionally, familiarity with a specific platform such as Azure DevOps or GitLab is also good.

Operations

Most of the current cloud service providers have excellent monitoring and alerting services. They can be used across infrastructure, platform and application management. It is always good to be familiar with such offerings so that you can integrate your solution easily.

Quality Engineering tools

Data projects are notoriously difficult to test in the traditional way. There are real and practical challenges, such as the functional meaning of data changing over time, prohibitive volumes, and the lack of a representative distribution of data across environments.

However, a few simple tools in the modern data engineer’s toolbox can save time and ensure quality. Firstly, unit tests should be part of the design: write functional, modular code and use libraries like nose, pytest or unittest to build a unit test framework. Secondly, take steps to automate manual functional tests into a regression suite.
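A minimal pytest sketch: a pure transformation function and two unit tests for it. The function and values are made up, but business logic factored into small functions like this is trivially testable.

    # transformations.py (hypothetical module)
    def standardise_country(code: str) -> str:
        """Map free-text country values to ISO-style codes."""
        mapping = {"australia": "AU", "aus": "AU", "new zealand": "NZ"}
        return mapping.get(code.strip().lower(), "UNKNOWN")


    # test_transformations.py, discovered and run with `pytest`
    def test_standardise_country_known_values():
        assert standardise_country(" AUS ") == "AU"
        assert standardise_country("New Zealand") == "NZ"

    def test_standardise_country_unknown_value():
        assert standardise_country("atlantis") == "UNKNOWN"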

Data quality itself is a measure of platform maturity. While definitions and implementations vary a lot because data quality as a field is still evolving, specific tools such as pydeequ are available today to run basic data quality checks. A combination of Apache Spark and dbt (more about dbt below) can also be used to achieve the same.
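Since the exact API depends on the tool, rather than reproduce pydeequ here, this is a generic sketch of the kind of checks such tools run: completeness, uniqueness and range checks, expressed over a pandas DataFrame with made-up data.

    import pandas as pd

    df = pd.DataFrame(
        {"order_id": [1, 2, 3], "amount": [120.0, 80.5, 42.0], "customer": ["a", "b", None]}
    )

    checks = {
        # Completeness: no nulls allowed in the key column.
        "order_id_complete": df["order_id"].notna().all(),
        # Uniqueness: the key must not contain duplicates.
        "order_id_unique": df["order_id"].is_unique,
        # Range: amounts must be non-negative.
        "amount_non_negative": (df["amount"] >= 0).all(),
    }

    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")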

Agile Tool

My simple suggestion: be familiar with Jira or Azure DevOps. These tools have a broad set of options, but you should learn the Scrum and Kanban models at a minimum.

Data Modelling

Data modelling as a concept has been around for a very long time. In the last few years, though, we have seen a steady decline of focus in this area in favour of more dynamic semi-structured and unstructured data sets. In my experience, however, data modelling becomes absolutely necessary at the consumption layer. Most of today’s BI tools require a dimensional model. Many data-driven applications require NoSQL or 3NF modelling. And data scientists have started to rely more on feature stores.

I mentioned various types of data models, but it is not important to know all of them at once. The key is to be open to learning them whenever you get the chance. There is no better way to learn data modelling than studying well designed models themselves.
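As a tiny, hypothetical example of a dimensional model at the consumption layer: one fact table surrounded by two dimensions, created with sqlite3 so the snippet stays self-contained. Table and column names are illustrative only.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # A minimal star schema: one fact table and two dimension tables.
    conn.executescript(
        """
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY,
            customer_name TEXT,
            segment TEXT
        );

        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            calendar_date TEXT,
            month TEXT,
            year INTEGER
        );

        CREATE TABLE fact_sales (
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            date_key INTEGER REFERENCES dim_date(date_key),
            quantity INTEGER,
            amount REAL
        );
        """
    )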

Something Extra

This section talks about a few skills that are not mandatory to have at the beginning, but they will help as you mature.

Tool Specialisation

In the past few years, we have seen the rise of excellent data engineering platforms such as Snowflake and Databricks, and data engineering tools such as Fivetran and Matillion. All of them have their own place in data engineering design. Interestingly, they compete with and complement each other in many cases.

One open source project deserves a special mention: dbt. I strongly recommend familiarising yourself with it.

Data Visualisation

Data visualisation is a handy skill for a data engineer to grow. This space is super crowded, with many established players and many challengers. In general, I would recommend starting with either Power BI or Tableau.

Data visualisation in itself is a larger topic of interest, and there is no limit to what can be done with libraries like d3.js, Plotly and Seaborn. I am consciously not getting into more elaborate data exploration and analysis visualisation requirements in this article, as it is a topic of its own. You may want to check out this article if you are interested.
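A minimal sketch of what getting started looks like, assuming seaborn and matplotlib are installed; the data is made up purely for illustration.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Made-up monthly sales figures, purely for illustration.
    df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 95, 143]})

    sns.barplot(data=df, x="month", y="sales")
    plt.title("Monthly sales")
    plt.savefig("monthly_sales.png")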

Streaming Data Processing

Stream data processing is a niche area of data processing, and things are still evolving. Processing data as soon as it is available is an exciting idea. With organisations investing in IoT, real-time stream processing will be something to look forward to.

There are two parts to it: data streaming and stream processing. The first relates to how data is moved around; the second is about how data is processed.

For streaming data, there are traditional queues such as RabbitMQ, as well as cloud-backed ones like GCP Pub/Sub. There are also purpose-built distributed data streaming technologies such as Kafka, and cloud-backed ones like AWS Kinesis and Azure Event Hubs.

Until recently, the processing semantics for batches and streams were treated differently, and architecture patterns like Lambda and Kappa were used to design systems. Nowadays both have collapsed into a single paradigm with different contexts: bounded versus unbounded. There are many processing technologies available, many of them cloud native, such as Apache Beam, Apache Flink and Spark Structured Streaming, or cloud-backed, such as Azure Stream Analytics and AWS Kinesis Analytics.

My suggestion is to invest in learning Kafka for data streaming and Spark Structured Streaming for stream processing. It is important to understand processing semantics such as at-least-once, at-most-once, exactly-once and trigger-once (used for batch). Additionally, it also helps to learn the cloud-specific solutions.
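A hedged Spark Structured Streaming sketch, assuming a reachable Kafka broker, the Spark Kafka connector on the classpath, and a hypothetical orders topic. It reads an unbounded stream, counts events per one-minute window and writes the running result to the console sink for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-stream").getOrCreate()

    # Read an unbounded stream from Kafka (broker address and topic are hypothetical).
    orders = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "orders")
        .load()
    )

    # Kafka delivers raw bytes; cast the value and count events per one-minute window.
    counts = (
        orders.selectExpr("CAST(value AS STRING) AS value", "timestamp")
        .groupBy(F.window("timestamp", "1 minute"))
        .count()
    )

    # Console sink for illustration; a real job would use a durable sink with checkpointing.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()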

Application Development

Data-driven applications are one of the key consumption channels of data platforms. There can be many kinds: recommendation engines, computer vision, search-and-update applications, data APIs and many more.

While these applications are often seen as software engineering problems, in reality a large part of the solution is data engineering and requires some of the specialised skill sets we talked about in this article.

With data being front and centre of the digital transformation strategies of many organisations, data-driven applications will increase in volume, scope and variety. Hence, it only makes sense to invest time and effort in understanding foundational concepts that overlap heavily with modern data engineering. Let me list a few technologies you should have a fundamental understanding of:

  • API Backend: REST, GraphQL (see the sketch after this list)
  • Serverless Design: Function-As-A-Service offerings, Docker, Kubernetes
  • Databases: NoSQL in general
  • Search: ElasticSearch
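As a hedged sketch of the data API item above (the framework, route and data are illustrative, not prescriptive): a tiny read-only REST endpoint serving a pre-computed metric, which is the shape many data APIs take.

    from fastapi import FastAPI, HTTPException

    app = FastAPI(title="metrics-api")

    # In a real service this would come from a warehouse or NoSQL store;
    # a dictionary keeps the sketch self-contained.
    DAILY_ACTIVE_USERS = {"2021-09-01": 1523, "2021-09-02": 1610}

    @app.get("/metrics/daily-active-users/{date}")
    def daily_active_users(date: str):
        if date not in DAILY_ACTIVE_USERS:
            raise HTTPException(status_code=404, detail="No metric for that date")
        return {"date": date, "daily_active_users": DAILY_ACTIVE_USERS[date]}

Run it with an ASGI server such as uvicorn and the endpoint returns JSON that any application or BI tool can consume.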

Final Words

We’ve touched on a lot of tools and technologies. Even though I excluded machine learning engineering, data security and metadata technologies, it is still an overwhelming list. Unfortunately, this is the nature of today’s technology landscape. The question is: do you need to learn all of them? My suggestion would be to use a T-shaped model to structure your growth.

At Slalom, we follow a similar structure when it comes to Data Engineering.

We use what we call a Modern Culture of Data framework to help our clients enable everyone in their organisation to accelerate business outcomes with rapid insights and achieve the full potential of their investments in data and analytics.

Slalom Australia is hiring, so if this article resonates with your journey, let’s talk! You can reach the team at australia@slalom.com, or me directly via LinkedIn.
