// python for geoscientists

Why Beginner Python Tutorials Skip the Geology

If you have ever started a beginner Python tutorial, you have almost certainly met the Iris dataset. Or the Titanic passenger list. Or Boston housing prices. These three datasets show up in tutorials so often that they are practically a genre. They are also, for most working geologists, not much use as a learning aid. Video 2 of this series is an attempt to fix that, because the choice of example dataset matters more than tutorial writers realise.

To be clear, I do not blame anyone for using Iris. It is small, it is clean, it has the right shape for almost any introductory concept, and it has been used so often that everyone knows what to expect. The same goes for Titanic. They are convenient. They were also chosen, decades ago, by people whose primary audience was computer scientists and statisticians, not domain practitioners trying to learn a new tool.

The issue is what happens in the head of a working geologist trying to follow along. Every line of code in a tutorial is doing two things at once: it is teaching you Python, and it is making you think about the example data. When the example data is unfamiliar, those two cognitive loads compete. You are translating from "what is a sepal width" to "what does this code do" in parallel, and the Python lesson is the one that gets crowded out.

Replace the flowers with drillholes and most of that competing load disappears. You already know what an interval is. You already know why a grade matters. You already know that averaging grades down a hole calls for a length-weighted mean rather than a simple one. So when the code does a for-loop over a list of grades, or defines a function that classifies an interval, the only thing you have to focus on is the Python. The geology is doing the work that the unfamiliar example was previously asking you to do.

This is, I think, the real reason a lot of geologists pick up Python and then put it down. It is not that the language is hard. It is that the path from tutorials to "I can read code on a real problem I care about" is much longer than it should be, and most of that extra length is the translation overhead.

What's in Video 2

The second video in the series covers the parts of Python you use constantly: variables, the four core data types, operations and logic, loops, functions, and a few habits for writing readable code. The structure is mostly conventional. The examples are not.

Variables hold things like hole_id = "DDH-001" and grade_gpt = 4.7. Lists hold grades from a real-shaped intercept. Dictionaries describe holes by their azimuth, dip, and length. The if-statements classify intervals as high grade, mineralized, or below cutoff. The first function in the video does exactly that classification. The second computes a length-weighted mean grade, because a thick low-grade interval is not the same as a thin high-grade one and Python should know that too.
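
To give a flavour of those pieces, here is a minimal sketch. It is not the notebook's actual code. The cutoffs (0.5 and 5.0 g/t), the interval values, the dictionary keys, and the function names classify_interval and length_weighted_mean are all assumptions made up for illustration.

```python
# Variables taken from the examples in the text.
hole_id = "DDH-001"
grade_gpt = 4.7  # grams per tonne

# Lists: grades (g/t) and matching interval lengths (m). Values are made up.
grades = [0.2, 1.5, 8.0]
lengths = [10.0, 5.0, 1.0]

# Dictionary describing one hole. Keys and values are hypothetical.
hole = {"hole_id": hole_id, "azimuth": 270.0, "dip": -60.0, "length": 150.0}


def classify_interval(grade, cutoff=0.5, high_grade=5.0):
    """Label a grade as 'high grade', 'mineralized', or 'below cutoff'.

    The default thresholds are illustrative assumptions, not values
    taken from the video.
    """
    if grade >= high_grade:
        return "high grade"
    if grade >= cutoff:
        return "mineralized"
    return "below cutoff"


def length_weighted_mean(grades, lengths):
    """Return the mean grade weighted by interval length."""
    total_length = sum(lengths)
    if total_length == 0:
        raise ValueError("total interval length must be greater than zero")
    return sum(g * l for g, l in zip(grades, lengths)) / total_length


# A for-loop over the intervals, classifying each one.
labels = []
for grade, length_m in zip(grades, lengths):
    label = classify_interval(grade)
    labels.append(label)
    print(f"{hole_id}: {length_m:.1f} m at {grade:.2f} g/t -> {label}")

simple_mean = sum(grades) / len(grades)
weighted_mean = length_weighted_mean(grades, lengths)
print(f"simple mean: {simple_mean:.2f} g/t")
print(f"length-weighted mean: {weighted_mean:.2f} g/t")
```

With these made-up intervals the simple mean comes out at about 3.23 g/t and the length-weighted mean at about 1.09 g/t, because the 10 m of low-grade rock dominates the column. That gap is the whole argument for weighting by length.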

Video 2, about 30 minutes. Builds on Video 1.

By the end of the video we bring back the drillhole dataset from Video 1 and apply the classify function to every row. It is a small thing, but it is the loop closing. You wrote a function. The function did real work on real-shaped data. That is the entire skill chain in miniature.
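
As a rough sketch of that last step, here is one way to apply a classify function to every row using only the standard library. The small inline table stands in for the Video 1 dataset. Its column names (hole_id, from_m, to_m, grade_gpt), its values, and the cutoffs are hypothetical, and the notebook itself may well do this differently.

```python
import csv
import io

# Stand-in for the drillhole dataset. Columns and values are hypothetical.
SAMPLE_CSV = """hole_id,from_m,to_m,grade_gpt
DDH-001,0.0,10.0,0.2
DDH-001,10.0,15.0,1.5
DDH-001,15.0,16.0,8.0
"""


def classify_interval(grade, cutoff=0.5, high_grade=5.0):
    """Label a grade using illustrative (assumed) thresholds."""
    if grade >= high_grade:
        return "high grade"
    if grade >= cutoff:
        return "mineralized"
    return "below cutoff"


# Read every row into a dictionary keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Apply the function to every row and store the result in a new field.
for row in rows:
    # csv returns strings, so convert the grade before classifying it.
    row["grade_class"] = classify_interval(float(row["grade_gpt"]))
    print(row["hole_id"], row["from_m"], row["to_m"], row["grade_class"])
```

If the data were in a pandas DataFrame, the same idea would usually be a one-liner along the lines of df["grade_gpt"].apply(classify_interval), where df is an assumed name for the loaded table.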

The notebook and dataset

Everything is in the public GitHub repository:

github.com/DHeasmanGDS/geodatascience-python-intro

One small improvement over Video 1: the notebook now loads the dataset directly from GitHub via URL, so following along in Colab is one click. Open the notebook through the badge in the repo, click Copy to Drive, run cells. No file uploads. The example data flows in automatically.

An aside on the philosophy

I do not think general-purpose tutorials are wrong. They serve a real audience, and that audience is much larger than the geoscience community. But every domain that wants to adopt Python (or any tool, really) needs its own bridge content. The flowers are not the right bridge for us. The drillholes are.

This is part of why I started Bots on the Ground and why I have been pushing on the intersection of geoscience and data science for a long time. Every general-purpose tool that gets adopted into a working geologist's day-to-day starts as bridge content like this. Someone writes the version of the tutorial that uses examples the audience already cares about, and the field becomes a tiny bit more accessible.

What's coming

Video 3 is the payoff for the first two videos. We use pandas and matplotlib to take a real geoscience workflow end to end: load a dataset, clean it, group it, summarize it, produce figures you would actually put in a report. By the end of three videos, the gap between "I have never coded" and "I can write a useful script" is closed.

For my regular readers, the harder material continues as normal. The statistical series I mentioned in the Video 1 post is still in active rotation, and the next post in that series is on its way.

The full series will keep coming as it is ready. The repo grows alongside the videos. Feedback and corrections are welcome via GitHub issues or directly through the blog.