Are you a scientist who has some experience with programming, but not much? Are you looking to get into the “data science” space, and have gotten some recommendations about “Python” and “Pandas” from friends?
If so, this guide is for you. It will teach you how to explore a dataset quickly and efficiently. Here’s a demo of what I mean:
There are a few steps to being able to do the analysis above:
1. Install the correct software environment.
2. Learn Python and the Pandas library.
3. Solve a similar problem.
Steps (2) and (3) go hand-in-hand, I think. Let’s go through each step. Generally, this will be a guide to other resources. One resource is the post below (which is similar to this page):
- Using Pandas and Python to Explore Your Dataset, which covers setting up the environment, playing with the data (through Pandas DataFrames), and visualizing some relationships. Particularly interesting is the section on “Visualizing Your Pandas DataFrame,” which shows how to actually visualize your data.
I’d follow the first 5 lessons in this guide:
The first 5 lessons walk through the environment setup (feel free to skip lesson 3). Briefly, the core of what they cover is the Anaconda distribution:
- To get started, download the Anaconda distribution: https://www.anaconda.com/products/individual.
Anaconda includes the Jupyter notebook, which is a convenient web interface that makes sharing/plotting easy (and is what I do all my data science work in). It can be launched through the Anaconda Navigator, or this command:
$ jupyter notebook
This will launch a “Jupyter notebook,” an overview of which is available here:
- Video: Jupyter notebook walkthrough.
- Blog post: https://realpython.com/jupyter-notebook-introduction/.
For detailed documentation on getting Jupyter going, see these links:
- A short and basic introduction to getting started with Jupyter.
- A more detailed introduction to Jupyter.
Note: I prefer to use Jupyter Lab. It’s a lot more powerful (especially if you have multiple notebooks) but slightly more complicated. If you’re curious, there’s a Jupyter Lab overview available, and it can be launched with the command:
$ jupyter lab
Note: the resources below are ones I’ve found to be useful and high-quality; the list is not meant to be exhaustive or well organized. Please use these resources as a starting place.
I think this is the best place to start for a general overview of Python:
- Whirlwind Tour of Python, which steps through basic Python syntax. It assumes the reader is familiar with programming but not Python.
The same author goes into more detail about data science specific topics in this book:
- The Python Data Science Handbook walks through the common data science tools. I’d pay special attention to the following:
Also, I’ve found that RealPython has some good tutorials/videos:
- A good overview of Python basics at RealPython’s Introduction to Python, including data structures and control flow.
- RealPython’s Pandas videos and basic Pandas blog posts, which go over everything from creation to plotting.
Of particular interest might be the RealPython video on Visualizing Your Pandas DataFrame.
I think the best way to learn Python is to solve problems. To get started, I would try to follow and understand this guide:
- Using Pandas and Python to Explore Your Dataset (the post mentioned above), which covers setting up the environment, playing with the data through Pandas DataFrames, and visualizing some relationships.
- A video on reading CSVs and other formats is available at Importing CSV Data into a Pandas DataFrame.
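As a taste of what these tutorials cover, here is a minimal exploration sketch. The tiny inline DataFrame (with made-up column names) stands in for a real CSV, which you would normally load with `pd.read_csv`:

```python
import pandas as pd

# A tiny stand-in for a real dataset; normally you'd use pd.read_csv("your_data.csv")
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.3, 5.1, 5.9],
})

print(df.head())                                      # first few rows
print(df.describe())                                  # summary statistics of numeric columns
print(df["species"].value_counts())                   # counts of each category
print(df.groupby("species")["petal_length"].mean())   # per-group means
```

These four calls (`head`, `describe`, `value_counts`, and `groupby`) are the bread and butter of quick exploration, and the tutorials above go into each in depth.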
A couple more independent problems use the two datasets below.
- Kaggle’s “Titanic” Challenge. The “Titanic” challenge is Kaggle’s introductory challenge: a subset of passengers on the Titanic and some features (their fare, sex, age, whether they survived, etc.).
- How many adult men/women survived?
- Were male or female children more likely to survive?
- How does survival rate change with fare? Are survival rate and fare paid correlated? Did high-paying passengers survive more frequently? What percentage of passengers survived among those who paid less than $N$ dollars?
- Kaggle’s Iris dataset. This dataset records 150 observations of flowers from 3 different species. Each observation has 4 attributes (petal length, petal width, sepal length, and sepal width).
- Can you manually find a set of rules to predict the species from the 4 attributes? (e.g., if a flower has petal width less than $X$ and sepal length greater than $Y$, then it is species “virginica”).
- Deploy your rule. Code your set of rules up in a Python function, and predict the species of every flower in the training dataset. How accurate is your set of rules on the training set?
Try to make all predictions/estimates by hand – don’t use any “machine learning.” Instead, visualize the data to explore its relationships, and see if you can develop a set of hand-curated rules like “children younger than 12 on the Titanic almost certainly survived.”
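For the “deploy your rule” step, a hand-curated rule can be wrapped in a plain Python function and checked against the data. A sketch of that idea is below; the 0.8 petal-width cutoff is just an illustrative guess (not a tuned value), and the tiny inline table stands in for the real Iris CSV:

```python
import pandas as pd

# Tiny stand-in for the Iris CSV; load the real one with pd.read_csv(...)
df = pd.DataFrame({
    "petal_width": [0.2, 0.4, 1.8, 2.1],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

def predict(petal_width):
    # Hand-curated rule: small petals -> setosa, otherwise virginica.
    # The 0.8 cutoff is an illustrative guess, not a fitted value.
    return "setosa" if petal_width < 0.8 else "virginica"

df["prediction"] = df["petal_width"].apply(predict)
accuracy = (df["prediction"] == df["species"]).mean()
print(f"Training accuracy: {accuracy:.0%}")
```

On the real dataset, you’d compare this training-set accuracy against your visualizations to refine the rule.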
These datasets can also be obtained with this code:

```python
import pandas as pd

# Titanic dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Iris dataset
df = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")
```
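Once the Titanic data is loaded, the first question above (“How many adult men/women survived?”) reduces to a filter plus a groupby. The column names (`Sex`, `Age`, `Survived`) match the Kaggle CSV; the inline sample below stands in for the full dataset so the snippet is self-contained:

```python
import pandas as pd

# Tiny sample with the Kaggle Titanic column names; in practice, use the
# full df loaded above.
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [40, 35, 8, 30],
    "Survived": [0, 1, 1, 1],
})

# How many adult men/women survived?
adults = df[df["Age"] >= 18]
print(adults.groupby("Sex")["Survived"].sum())
```

The other questions follow the same pattern: filter to the subgroup you care about, then aggregate `Survived` with `sum` or `mean`.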
Seaborn is a declarative wrapper around Matplotlib (which Pandas integrates with natively). Here’s how to make a scatter plot with Seaborn (using a dummy Titanic CSV):
```python
import pandas as pd
import seaborn as sns

df = pd.read_csv("titanic.csv")
sns.scatterplot(
    data=df,
    x="fare",
    y="age",
    hue="survived",
)
```
Seaborn’s example gallery is really useful.
Altair is another declarative visualization library. Here’s how to create a scatter plot with Altair:
```python
import pandas as pd
import altair as alt

df = pd.read_csv("titanic.csv")
alt.Chart(df).mark_circle().encode(
    x="fare",
    y="age",
    color="survived",
)
```
Altair’s example gallery is really useful.