GPT, English SDK, LLMs and the future

It’s official, Databricks has released their English SDK for Apache Spark.In short, this tool looks like it makes writing Spark code as simple as writing plain English instructions.

I haven’t tested it yet, but I have been using ChatGPT for some time to help me write cover letters for job applications. I’m assuming that the fine folks at Databricks have released a product that will make a lot of data engineering tasks way too easy, and make it easier for folks without a coding background to create ETLs, or even whole data infrastructures, without the help of people experienced in writing code.

THIS IS A GOOD THING, and it highlights an important thing to be mindful of , whether you’re a data engineer, data scientist, or marketing executive.

Memorizing the use of a complex tool (like Spark or SQL) does not make you a good engineer. Knowing how to use a complex tool to get results that are asked for from your boss or a stakeholder doesn’t make you a good engineer.

Understanding relationships, whether mathematical, cause/effect, logical, or interpersonal, make you a good engineer. 

What do I mean by this? Let’s use this new Spark tool as an example. When it comes to data engineering, the important part isn’t understanding how to write the correct Spark. It’s understanding how the data you’re manipulating is being collected, how it will be used, what it represents in the real world, and how it can be efficiently stored, retrieved, transformed, and related to other data in your infrastructure.

This isn’t exactly a hot take. I’m sure something similar is being repeated on a thousand different blogs right now. But I wanted to weigh in because I also see a lot of doom-posting about the end of data engineering as a profession. On the contrary, tools like this (and Copilot, and ChatGPT, etc.) are just going to enable people to apply more computational power to their projects. Technical folks like myself will be necessary to help non-technical folks avoid mistakes, refine and optimize processes, and fine-tune plug-and-play LLMs for specific business uses. To me, this is the democratization of complex computation; folks with real expertise in analytics will still be necessary to act as subject matter experts. But we’ll have to spend less time memorizing syntax, and more time understanding what we’re trying to build.