When it comes to data engineering, the world of learning resources is comparatively small. "Data Engineering" probably just does not sound as sexy as Data Science. Young girls and boys would rather dream about creating and presenting beautiful graphs and charts where they explain how they magically made sense of some very complex data using machine learning and other fancy algorithms. They don't wake up in the morning and say: "Gosh, I want to build a data pipeline today, this will be so much fun." Data engineering is just not hyped enough. You can see this perfectly if you browse Udacity's nanodegree programs: as of this writing you won't find anything on data engineering there (and EVERY super hyped topic has got to be on Udacity). As a result, data engineering stays reserved for the more geeky folks and for people with some experience in computer science. Yet I found a platform which attempts to make data engineering accessible to everyone.
Sure, there are some data engineering courses floating around on Coursera, edX, Udemy, Lynda and the like. But most of the time those seem to have two major drawbacks:
1. Focusing on Technologies
These resources specialize in technologies rather than in the fundamentals. Indeed, most MOOCs and books focus on Hadoop, Spark, AWS, Google Cloud and the like to teach you data engineering and "Big Data". While these are important technologies to know about, the most important skill is an overall understanding of what is behind them. Once that fundamental understanding is in place, mastering a specific technology becomes just a matter of time.
2. Way of Teaching
I have a professional background in poker, and back then there were two major ways for me to learn and improve: playing and watching others play. I also read some books, planned out scenarios and strategized on a regular basis to deepen my theoretical understanding, but the first two methods were the bread and butter. When I first started out in computer science, I assumed watching videos would be a very effective method in this milieu too. And it might even be true for some people, but now I think it is best suited for real beginners, those who really need a step-by-step guide through the dark computer science world of so many variables and unknowns. After a few years of coding you get really accustomed to the documentation written by other developers. As your computer science vocabulary and understanding grow, you find yourself more and more in situations where videos and precomposed courses don't move at the pace you need them to.
Is Dataquest Different?
So far I have completed the first part of Dataquest's data engineering path, which was essentially about the basics of SQL databases using Postgres. As of this writing it seems to be the only interactive course for data engineers out there. Dataquest works like this: you are introduced to a topic in the form of text, then you get an interactive console where you can play around with things, and finally you solve some sort of task. When I first tried Dataquest I feared that it would end up like my experiments with Codecademy and DataCamp, where I found the tasks to be too strongly tied to the explanations and thus too easy to solve. And my first quests really did start like this: way too basic, just typing out what is written in the explanation. But after a short while the tasks got more challenging, inspiring and fun. I also peeked inside the next "quest", which will deal with a socially relevant dataset and with preparing a database so that data analysts can work more efficiently with faster queries (that's just the story, there are no analysts behind it in reality). Quite a motivating context for the datagoodies out there.
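One common way to prepare a database for faster queries is to add an index on the column that gets filtered most often. Here is a minimal sketch of that idea in Python. It is not taken from the Dataquest course: it assumes SQLite (from Python's standard library) as a stand-in for Postgres so that it runs without a database server, and the table incidents, its columns and the index name are made up for illustration.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); only the detail text is of interest.
    return " | ".join(row[3] for row in rows)


# In-memory database with a hypothetical table and 10,000 synthetic rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, city TEXT, severity INTEGER)"
)
conn.executemany(
    "INSERT INTO incidents (city, severity) VALUES (?, ?)",
    [("city_%d" % (i % 100), i % 5) for i in range(10000)],
)

sql = "SELECT * FROM incidents WHERE city = ?"
params = ("city_7",)

# Without an index the database has to look at every row of the table.
plan_before = query_plan(conn, sql, params)

# Add an index on the filtered column, then ask for the plan again.
conn.execute("CREATE INDEX idx_incidents_city ON incidents (city)")
plan_after = query_plan(conn, sql, params)

# The result of the query is the same either way; only the access path changes.
matching_rows = conn.execute(sql, params).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
print("rows:  ", len(matching_rows))
```

In Postgres the same experiment would use EXPLAIN (or EXPLAIN ANALYZE) in front of the query and the same CREATE INDEX statement; the exact plan text differs between the two databases.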
There were three points I wasn't very happy with. Firstly, the interactive IPython console was not well integrated into the frontend, to the point that it was unusable at times. Secondly, as somebody who really liked the platform, I wished there were a direct feedback function on every page, so that issues in the text or logic could be reported and fixed right away. You can shoot support an email and they are very kind, but I think they would get much more feedback with a direct feedback loop. Thirdly, on completion of the first "real world project" I got a certificate, which is very kind. However, to be honest, I didn't really complete this project at all. I derived my own mini projects from the course and I was very happy about that, but I just rushed through the last project's tasks, having decided that they were merely a repetition of the things I had learned in the course and not meaningful enough to me to be worth doing. By the way, it was a project which was supposed to be done on your own local machine, which I think is quite cool. At the end you get a hint that you might deploy your project to a cloud platform to share it with others. If Dataquest could manage to grade those deployments automatically and issue the certificate on the basis of that grading, the certificate would really gain in importance and meaning.
To sum it up, I think there is still a lot of untapped potential in the landscape of learning resources for data engineering. Dataquest has surprised me very positively and I learned a lot throughout the first part of the course. I also installed Postgres on my computer and was inspired to start one or two cool side projects of my own. I will share the most interesting bits with you in some of my next posts, so stay tuned for some code, benchmarks and SQL hustling 🤙