Just came across this new, free book on the basics of enterprise data management. The book does a great job of covering the most important aspects of data management from a data science perspective.
I have a hard time fitting tool-focused topics such as Git, CI/CD, Docker, SSH, and OAuth into my course syllabi. These are things to know rather than to understand, and my courses focus on understanding. But they are absolute must-knows for data workers: this is how models get deployed and optimized for production, which is the only way to make a real impact. So the book will be a great quick reference to list there.
The book is a very practical introduction to (almost) everything that comes after modeling in data science. Here’s the intro to data connectors (no offense to data engineers):
Your job as a data scientist is to sift through a massive pile of data to extract nuggets of real information – and then use that information. Working at the end of an external process, you must meet the data where it lives.
CI/CD and Git:
The role of Git is to make your code promotion process happen. Git allows you to configure requirements for whatever approvals and testing you need. Your CI/CD tool sits on top of that so that all this merging and branching does something.
with the footnote:
Strictly speaking, this is not true. There are a lot of different ways to kick off CI/CD jobs. But, the right way to do it is to base it on Git operations.
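To make the "CI/CD sits on top of Git" idea concrete, here is a toy sketch of my own (not from the book). It shows Git operations deciding which pipeline jobs run. The `GitEvent` type, the `jobs_for` function, and the job names are all hypothetical; a real CI/CD tool expresses the same mapping in its own configuration format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GitEvent:
    """A simplified Git operation (hypothetical type, for illustration only)."""
    kind: str    # "pull_request" or "push"
    branch: str  # the branch the operation targets


def jobs_for(event: GitEvent) -> list[str]:
    """Return the (made-up) pipeline jobs a given Git operation would trigger."""
    if event.kind == "pull_request" and event.branch == "main":
        # A proposed merge into main: run the checks required before approval.
        return ["lint", "unit-tests"]
    if event.kind == "push" and event.branch == "main":
        # The merge has landed on main: test again, build, and promote the code.
        return ["unit-tests", "build-image", "deploy-staging"]
    # Work on any other branch triggers nothing in this sketch.
    return []


if __name__ == "__main__":
    for event in (
        GitEvent("pull_request", "main"),
        GitEvent("push", "main"),
        GitEvent("push", "feature/new-model"),
    ):
        print(event.kind, event.branch, "->", jobs_for(event))
```

The point is only that the merging and branching "does something": the Git event is the trigger, and the pipeline is whatever you attach to it.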
I like the tone here (and the drawings). You can check out the book here.