Data Engineering Interview Series : SQL – Fundamentals and Basic Concepts
Written by Anju Mercian
“Data are just summaries of thousands of stories — tell a few of those stories to help make the data meaningful.” ~ Dan Heath, bestselling author
The stages of the data engineering lifecycle described in the book “Fundamentals of Data Engineering” by Joe Reis & Matt Housley is as follows:
Generation, Storage, Ingestion, Transformation, Serving.
There is also a notion of undercurrents; critical idea across the data engineering lifecycle. To work with Ingestion, Transformation and Serving of data; we need to first query the data at the ingestion phase. Queries are a fundamental part of data engineering. As a data engineer one will mostly be working on querying the data and transforming the data to deliver customer requests.
What is a Query? A query allows you to retrieve data and act on it. R in CRUD, referring to Read is what the select does.
The industry standard programming language used for querying is Structured Query Language (SQL). SQL has now become a very powerful tool in the day to day life of a data engineer. Thus, for a data engineer knowing SQL thoroughly becomes imperative. Hence, any interview for a data engineer or data science or a data analyst position will include probing questions on your knowledge of SQL.
In this series, I will try to provide a cheatsheet; not only for data interviews on SQL but also for your everyday work.
The flow of the series: I will go over the SQL commands and interject theory with few questions and solutions; for practice, I will also link the question so if you would like to try out the problem and warm up your SQL knowledge before looking at the solution.
Note: The SQL commands I share are the ones that have helped me in my interview process and my day to day life as a data engineer.
SQL Fundamentals
If Accounting is the language of money, biology is the language of life then SQL is the language of data.
- SQL being primarily a set-oriented query language, it focuses on the WHAT you want to do instead of the HOW like in procedural languages.
- SQL allows us to get the desired result through a sequence of well-defined operations.
- SQL works with relational databases; A relational database is a collection of related tables, each assigned a unique name. The tables are designed to keep track of some aspects of the real world and their relationship.
- SQL can help define the structure of data, modify data and specify constraints.
- SQL is a non-procedural query language — it can do much more than just query results.
According to ANSI, it is the standard language for relational database management systems. It finds a wide range of applications in almost all data roles.
In the field of Data Engineering, SQL is used widely for writing ETL pipelines, designing data models, querying underlying data etc.
Let’s get into the SQL concepts – Categorization of SQL Commands
We can categorize different SQL commands based on the functions they perform:
a) Data Definition Language (DDL): At a high level we need to first create the database objects before adding data and we use the Data Definition Language (DDL).
Includes commands for defining schema, deleting relations, modifying the schema and for specific integrity constraints.
examples: CREATE, DROP, TRUNCATE, ALTER, COMMENT, RENAME
b) Data Manipulation Language (DML): After creating the database using DDL, we need to define the database objects we need to add alter data within these objects which is why we use the data manipulation language.
Includes commands to manipulate data in the database
examples: Insert, Update, Delete, Copy, Merge.
c) Data Query Language (DQL): Commands to perform operations on data within the defined schema.
example: Select
d) Data Control Language (DCL): to limit access to the db objects and finely control who has access to what data.
Includes the command to control rights and permission of the database. Example: Grant, Revoke
e) Transaction control language: supports commands that control the details of the transaction With TCL define commit checkpoints conditions when actions can be rolled back examples : COMMIT Rollback.