Working With Datetime In Pandas
Following on from my blog Understanding NULLs in Pandas, this post discusses data types and, by extension, the data structures used by the Python Pandas library.
What are data types, I hear you ask? Well, they are how we categorise data.
If we lump all data together then how do we know if we can do mathematical calculations on the data?
While each software package has its own data types, there are really four broad types:
- Numerical
- String
- Date
- Boolean
Numerical and date are as they sound. String is text but can also hold numbers and symbols; no mathematical operations can be done on strings. Boolean is binary, e.g. True/False.
So now to the Pandas data types, sometimes called primitive types.
Starting with Numerical, Pandas offers two main numerical data types:-
- Integer (int8, int16, int32, int64)
- Float (float16, float32, float64)
Integers, as you might remember from maths class, are whole numbers.
Floats are numbers with decimal points.
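As a rough illustration of the integer and float types listed above, the sketch below builds small Series with explicit dtypes. The variable names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# Whole numbers stored with an explicit 64-bit integer dtype
ints = pd.Series([1, 2, 3], dtype="int64")

# Numbers with decimal points stored as 64-bit floats
floats = pd.Series([1.5, 2.25, 3.0], dtype="float64")

# A smaller integer type uses less memory but can hold a smaller range
small = ints.astype("int8")

print(ints.dtype, floats.dtype, small.dtype)
print(np.iinfo("int8").min, np.iinfo("int8").max)  # range an int8 can hold
print(ints.nbytes, small.nbytes)                   # bytes used by the values
```

Choosing a smaller type such as int8 only makes sense when every value fits inside its range.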
String¶
Pandas has historically only offered the Object data type for storing strings.
But from Pandas 1.0 there is also a dedicated String type (StringDtype), which is a true string data type.
- Object (mixed type)
- String (StringDtype)
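To make the difference between the two concrete, here is a minimal sketch that builds one Series of each kind. The sample values are invented for illustration.

```python
import pandas as pd

# Object dtype can hold anything, including a mix of text and numbers
mixed = pd.Series(["a", 1, 2.5], dtype=object)

# The dedicated string dtype holds text only; None becomes a missing value
text = pd.Series(["apple", "banana", None], dtype="string")

print(mixed.dtype)
print(text.dtype)

# String methods are available through the .str accessor
print(text.str.upper())
```

Asking for the dtype explicitly, as done here, avoids relying on whatever default your Pandas version picks for text.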
Date¶
With dates there are a few data types that can be used, but the main ones are built on NumPy.
- datetime64[ns]
- timedelta64[ns]
Datetime stores a single point in time in the format YYYY-MM-DD HH:MM:SS, for example 2012-05-01 00:00:00
Timedeltas are the difference between two dates or times, e.g. the 2 days between "2011-12-29" and "2011-12-31"
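The two date types can be seen side by side in a short sketch like the one below, which reuses the example dates from the text. The variable names are my own.

```python
import pandas as pd

# Parse text into datetimes; the result is a datetime64 Series
dates = pd.to_datetime(pd.Series(["2011-12-29", "2011-12-31"]))

# Subtracting one datetime from another gives a Timedelta (a duration)
gap = dates.iloc[1] - dates.iloc[0]

# Subtracting a single datetime from a whole Series gives a timedelta64 Series
since_start = dates - dates.iloc[0]

print(dates.dtype)
print(gap)
print(since_start.dtype)
```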
Boolean¶
In Pandas the boolean type is important because boolean values are used for filtering data.
To convert a column (series) to boolean, it should hold only two distinct values, e.g. 0 and 1 or yes and no.
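A minimal sketch of both ideas, converting a two-option column to boolean and then filtering with it, might look like this. The survey-style data is hypothetical.

```python
import pandas as pd

# Hypothetical column with exactly two possible answers
answers = pd.Series(["yes", "no", "yes", "no"])

# Map each answer to True/False explicitly, then make sure the dtype is bool
is_yes = answers.map({"yes": True, "no": False}).astype(bool)

# 0 and 1 can be converted directly
flags = pd.Series([0, 1, 1]).astype(bool)

# A boolean Series can be used as a mask to filter rows
only_yes = answers[is_yes]

print(is_yes.dtype)
print(flags.tolist())
print(only_yes)
```

Mapping the text values explicitly keeps the meaning of True and False obvious to anyone reading the code later.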
Category¶
Pandas also has a category data type, which is great for reducing the memory used by a column (series) with only a few distinct values.
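The size saving can be checked with a quick sketch along these lines. The repeated island names are made-up data, and the exact byte counts will vary between Pandas versions.

```python
import pandas as pd

# A long column with only three distinct values (made-up data)
islands = pd.Series(["Torgersen", "Biscoe", "Dream"] * 1000, dtype=object)

# Category stores each distinct value once, plus a small integer code per row
islands_cat = islands.astype("category")

print(islands.memory_usage(deep=True))      # bytes used as object
print(islands_cat.memory_usage(deep=True))  # bytes used as category
print(islands_cat.cat.categories)           # the distinct values
```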
Let's see how these data types work¶
import pandas as pd
import numpy as np
import seaborn as sns
Let's import the penguins dataset from the seaborn library for convenience
penguins = sns.load_dataset('penguins')
penguins.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Let's drop the nulls to start with
penguins = penguins.dropna()
penguins.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
Let's use the dtypes attribute to see the data types
penguins.dtypes
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object
- You can see that 'species' and 'island' have been imported as Object, which is fine for these.
- 'bill_length_mm' and 'bill_depth_mm' have been correctly imported as floats.
- 'flipper_length_mm' and 'body_mass_g' have been imported as floats due to missing data, but really we want them as ints.
- 'sex' has been imported as object because it holds text, but we want it as bool.
Convert data types¶
# flipper_length_mm and body_mass_g hold whole numbers, so convert them to ints
penguins[['flipper_length_mm', 'body_mass_g']] = penguins[['flipper_length_mm', 'body_mass_g']].astype('int64')
# astype('bool') would turn every non-empty string into True, so compare against 'Male' instead
penguins['sex'] = penguins['sex'] == 'Male'
penguins.dtypes
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm      int64
body_mass_g            int64
sex                     bool
dtype: object
Now look at the data again. In the sex column, True now means Male and False means Female.
penguins.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | True |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | False |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | False |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | False |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | True |
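Dropping the nulls first is one way to get integer columns. If the missing rows need to stay, another option worth knowing about is the nullable Int64 dtype (note the capital "I"). The sketch below uses a small made-up Series rather than the penguins data.

```python
import numpy as np
import pandas as pd

# Made-up flipper lengths with one missing value, stored as floats
flipper = pd.Series([181.0, np.nan, 195.0])

# The nullable Int64 dtype keeps whole numbers and marks the gap as <NA>
flipper_int = flipper.astype("Int64")

print(flipper.dtype)
print(flipper_int)
```

This conversion assumes every non-missing value is already a whole number.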
Data Structures (complex data types)¶
Data Structures are more complex data types.
Pandas only really has two of its own. These are:-
- Series
- DataFrames
But it also uses data structures from Numpy and Python including:-
- Numpy arrays
- Python lists
- Python dictionaries
Pandas data structures can be put into two categories: 1d (Series) and 2d (DataFrames).
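As a small sketch of the 1d and 2d split, the example below builds one of each from NumPy arrays and checks their dimensions. The column names "x" and "y" are made up.

```python
import numpy as np
import pandas as pd

# A 1d NumPy array becomes a Series
s = pd.Series(np.array([10, 20, 30]))

# A 2d NumPy array becomes a DataFrame
df = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), columns=["x", "y"])

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (3, 2)
```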
Pandas Series¶
Creating a series using a Python dictionary
data = {1:'a', 2:'b', 3:'c'}
s = pd.Series(data)
s
1    a
2    b
3    c
dtype: object
Creating a series using Python lists. Data and index must be the same length.
data = ['a','b','c']
index = [1,2,3]
s = pd.Series(data, index=index)
s
1    a
2    b
3    c
dtype: object
A few things to note about the Series data structure
- It has a data type; if mixed data is in the series it will be Object dtype
- It has an index
- It can be used like a dictionary to get and set values by index labels
- It can be vectorised, meaning looping is not necessary
- It can have a name attribute
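The points above can be sketched in a few lines. The Series contents and the name "values" are invented for this example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"], name="values")

# Dictionary-style access by index label
first = s["a"]   # get a value
s["c"] = 30.0    # set a value

# Vectorised arithmetic: every element is doubled without a loop
doubled = s * 2

print(s.dtype, s.name)
print(s.index.tolist())
print(doubled)
```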
Pandas DataFrame¶
Creating a DataFrame from a dict of Series
d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
df
| | one | two |
---|---|---|
a | 1.0 | 1.0 |
b | 2.0 | 2.0 |
c | 3.0 | 3.0 |
d | NaN | 4.0 |
Creating a DataFrame from a dict of lists
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)
| | one | two |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
Creating a DataFrame from a list of dictionaries
d = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(d)
| | a | b | c |
---|---|---|---|
0 | 1 | 2 | NaN |
1 | 5 | 10 | 20.0 |
A few things to note about the Dataframe data structure
- DataFrames are 2d data structures, much like spreadsheets or SQL tables
- DataFrame columns can have different data types
- Columns can be added to a DataFrame
- Pandas has methods for selecting rows and/or columns
- DataFrames align data on both column and row indexes
- DataFrames have an index attribute which gives access to the row labels
- DataFrames have a columns attribute for accessing the column headers
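Several of these points can be tried out with a short sketch like the one below. The column names and values are made up, and .loc and .iloc are just two of the selection methods Pandas offers.

```python
import pandas as pd

# Columns with different data types: floats and text
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": ["x", "y", "z"]},
    index=["a", "b", "c"],
)

# Add a new column computed from an existing one
df["three"] = df["one"] * 10

# Select by label with .loc and by position with .iloc
by_label = df.loc["b", "one"]
by_position = df.iloc[0, 1]

# Assigning a Series aligns on the row index; the missing label "c" becomes NaN
df["four"] = pd.Series([100, 200], index=["a", "b"])

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column headers
print(df)
```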
That's all for now.