Are you a scientist who has some experience with programming languages, but not much? Are you looking to get into the “data science” space, and have you gotten some recommendations about “Python” and “Pandas” from friends?

If so, this guide is for you. It will teach you how to explore a dataset quickly and efficiently. Here’s a demo of what I mean:

There are a few steps in being able to do the analysis above:

  1. Install the correct software environment.
  2. Learn Python, and the Pandas library.
  3. Solve a similar problem.

I think steps (2) and (3) go hand-in-hand. Let’s go through each step. Generally, this will be a guide to other resources. One resource is the post below (which is similar to this page):

Installing Python

I’d follow the first 5 lessons in this guide:

The first 5 lessons walk through the environment setup (feel free to skip lesson 3). Briefly, the core of what it mentions is the Anaconda distribution:

Anaconda includes the Jupyter notebook, which is a convenient web interface that makes sharing/plotting easy (and is what I do all my data science work in). It can be launched through the Anaconda Navigator, or this command:

$ jupyter notebook

This will launch a “Jupyter notebook,” an overview of which is available here:

For detailed documentation on getting Jupyter going, see these links:

Note: I prefer to use Jupyter Lab. It’s a lot more powerful (especially if you have multiple notebooks) but slightly more complicated. If you’re curious, there’s a Jupyter Lab overview available, and it can be launched with the command jupyter lab.
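Once the install is done, a quick sanity check can save some confusion later. Here is a minimal sketch that reports which of the packages mentioned in this guide Python can find. The package list is my own assumption about what you will want, and some of these (Altair in particular, I believe) may need a separate install.

```python
import importlib.util
import sys

# Packages this guide leans on; "notebook" is the package behind `jupyter notebook`.
# This list is an assumption, so edit it to match what you plan to use.
packages = ["pandas", "matplotlib", "seaborn", "altair", "notebook"]

# find_spec returns None when a top-level package cannot be found,
# and it does not import the package while checking.
status = {name: importlib.util.find_spec(name) is not None for name in packages}

print(f"Python {sys.version.split()[0]}")
for name, found in status.items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Run it in a notebook cell or save it to a file and run it with python. Anything marked MISSING can usually be installed through the Anaconda Navigator.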

Passive learning

Note: the resources below are ones I’ve found to be useful and high-quality; the list is not meant to be exhaustive or well organized. Please use these resources as a starting place.

I think this is the best place to start for a general overview of Python:

  • Whirlwind Tour of Python, which steps through basic Python syntax. It assumes the reader is familiar with programming but not Python.

The same author goes into more detail about data science specific topics in this book:

Also, I’ve found that RealPython has some good tutorials/videos:

Of particular interest might be the RealPython video on Visualizing Your Pandas DataFrame.

Practical learning

I think the best way to learn Python is to solve problems. To get started, I would try to follow and understand this guide:

For more independent practice, try the problems below with these two datasets.

  1. Kaggle’s “Titanic” Challenge. This is Kaggle’s first challenge: a dataset covering a subset of the passengers on the Titanic and some of their features (fare, sex, age, whether they survived, etc.).
    • How many adult men/women survived?
    • Were male or female children more likely to survive?
    • How does survival rate change with fare? Are survival rate and fare paid correlated? Did high-paying passengers survive more frequently? If so, what percentage of passengers survived if they paid less than $N$ dollars?
  2. Kaggle’s Iris dataset. This dataset records 150 observations of flowers, of 3 different species. Each observation has 4 attributes (e.g., petal width).
    • Can you manually find a set of rules to predict the species from the 4 attributes? (e.g., if a flower has petal width less than $X$ and sepal length greater than $Y$, then it is species “virginica”).
    • Deploy your rule. Code your set of rules up in a Python function, and predict the species of every flower in the training dataset. How accurate is your set of rules on the training set?

Try to make all predictions/estimates by hand – don’t use any “machine learning.” Visualize the relationships between the data and see if you can develop a set of hand-curated rules like “children younger than 12 on the Titanic almost certainly survived.” This will require visualizing the data to explore relationships.
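To make the Iris exercise concrete, here is a rough sketch of what a hand-written rule might look like as a Python function. The thresholds are placeholder guesses (you should pick your own from plots), the three rows are typed in by hand rather than loaded, and the column names are my assumption about the CSV, so check df.columns on the real data.

```python
import pandas as pd

# A few rows typed in by hand for illustration; the real dataset has 150 rows.
# Column names are assumed to match the Iris CSV (verify with df.columns).
iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3],
        "sepal_width": [3.5, 3.2, 3.3],
        "petal_length": [1.4, 4.7, 6.0],
        "petal_width": [0.2, 1.4, 2.5],
        "species": ["setosa", "versicolor", "virginica"],
    }
)


def predict_species(row):
    """Hand-written rule; the thresholds are guesses to refine from plots."""
    if row["petal_length"] < 2.5:
        return "setosa"
    if row["petal_width"] < 1.75:
        return "versicolor"
    return "virginica"


# Apply the rule to every row, then compare against the true labels.
iris["predicted"] = iris.apply(predict_species, axis=1)
accuracy = (iris["predicted"] == iris["species"]).mean()
print(f"Accuracy: {accuracy:.0%}")
```

The same pattern (a function applied row by row, then compared with the true column) works on the full dataset, where the accuracy will be more informative than on three rows.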

These datasets can also be obtained with this code:

import pandas as pd

# Titanic dataset
titanic = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Iris dataset (kept in its own variable so it doesn't overwrite the Titanic data)
iris = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")
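As a starting point for the Titanic questions, here is a small sketch of the kind of filtering and grouping involved. The rows are made up so the example runs offline (they are not real passenger records), the capitalized column names are my assumption about the CSV, and the age cutoff of 18 for “adult” is an arbitrary choice.

```python
import pandas as pd

# Made-up rows for illustration only; NOT real passenger records.
# Column names are assumed to match the Titanic CSV (verify with df.columns).
titanic = pd.DataFrame(
    {
        "Survived": [1, 0, 1, 0, 1, 0],
        "Sex": ["female", "male", "female", "male", "male", "female"],
        "Age": [30.0, 40.0, 8.0, 10.0, 25.0, 50.0],
        "Fare": [80.0, 8.0, 30.0, 7.0, 50.0, 10.0],
    }
)

# How many adult men/women survived? (18 is an assumed cutoff for "adult".)
adults = titanic[titanic["Age"] >= 18]
adult_survivors = adults.groupby("Sex")["Survived"].sum()
print(adult_survivors)


def survival_rate_below(df, n):
    """Fraction of passengers who survived among those who paid less than n."""
    cheap = df[df["Fare"] < n]
    return cheap["Survived"].mean()


print(survival_rate_below(titanic, 20))
```

One thing to watch for on the real data: rows with a missing age are silently dropped by a filter like Age >= 18, so it is worth counting the missing values first.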

Visualization libraries

Seaborn is a declarative wrapper around Matplotlib, which Pandas integrates with natively. Here’s how to make a scatter plot with Seaborn (using a dummy Titanic CSV):

import pandas as pd
df = pd.read_csv("titanic.csv")

import seaborn as sns
sns.scatterplot(
    data=df,
    x="fare",
    y="age",
    hue="survived",
)

Seaborn’s example gallery is really useful.
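The scatter plot code above expects lowercase column names like "fare" and "age". If your CSV has capitalized headers (I believe the Titanic CSV linked earlier does, but check df.columns), one option is to lowercase them right after loading. A minimal sketch, using a made-up two-row CSV as a stand-in:

```python
import io

import pandas as pd

# Stand-in for a CSV with capitalized headers (made-up values).
csv_text = "Survived,Age,Fare\n1,30,80.0\n0,40,8.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Lowercase every column name so "Fare" becomes "fare", and so on.
df.columns = df.columns.str.lower()
print(list(df.columns))
```

After this step, the same column names can be used across all the plotting code on this page.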

Altair is another declarative visualization library. Here’s how to create a scatter plot with Altair:

import pandas as pd
df = pd.read_csv("titanic.csv")

import altair as alt
alt.Chart(df).mark_circle().encode(
    x="fare",
    y="age",
    color="survived",
)

Altair’s example gallery is really useful.