Working With Strings In Pandas
In my last blog I discussed Data Types and Strutures in the Python Pandas library. This leads me on to focus on Strings.
As discussed before there are two string datatypes in Pandas:-
- object (is a Numpy array dtype)
- string (is a pandas extension type)
When creating a series the default will be object dtype.
import pandas as pd
import numpy as np
pd.Series(['a', 'b', np.nan])
0 a 1 b 2 NaN dtype: object
When creating a series you can explicitly request string dtype by using the dtype Parameter. Or you can use the astype function after the fact.
Note the string datatype has its down null value as I discussed in a past post.
pd.Series(['a', 'b', np.nan], dtype='string')
0 a 1 b 2 <NA> dtype: string
String methods¶
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
s = pd.Series(
["A", "B", "Aaba", np.nan, "CABA", "dog"], dtype="string"
)
.str.lower¶
Make all of the string lower case
s.str.lower()
0 a 1 b 2 aaba 3 <NA> 4 caba 5 dog dtype: string
.str.upper¶
Make all of the string upper case
s.str.upper()
0 A 1 B 2 AABA 3 <NA> 4 CABA 5 DOG dtype: string
len()¶
Get length of strings
s.str.len()
0 1 1 1 2 4 3 <NA> 4 4 5 3 dtype: Int64
s2 = pd.Series(
["A ", " B", " Aaba ", np.nan, "CABA ", " dog"], dtype="string"
)
.str.strip¶
Remove white space from start and end of strings
s2.str.strip().values
<StringArray> ['A', 'B', 'Aaba', <NA>, 'CABA', 'dog'] Length: 6, dtype: string
str.lstrip¶
Remove white space at left of string
s2.str.lstrip().values
<StringArray> ['A ', 'B', 'Aaba ', <NA>, 'CABA ', 'dog'] Length: 6, dtype: string
str.rstrip()¶
Remove white space at right of string
s2.str.rstrip().values
<StringArray> ['A', ' B', ' Aaba', <NA>, 'CABA', ' dog'] Length: 6, dtype: string
Using string methods on the columns attribute¶
Because df.columns is an index object we can use the .str accessor.
d = [{'Col 1 ':'a', 'col 2 ':'b'},{'Col 1 ':'c', 'col 2 ':'d'}]
df = pd.DataFrame(d)
df
Col 1 | col 2 | |
---|---|---|
0 | a | b |
1 | c | d |
We can in fact chain string methods together to act on the column headers.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
col_1 | col_2 | |
---|---|---|
0 | a | b |
1 | c | d |
Splitting and replacing strings¶
String come in all sorts and when cleaning data you will frequently need to use these methods.
str.split()¶
Use split to break up a strings using a delimiter of your choice. If no delimiter is given the default is a space. Out put will be a Python list. The expand=True parameter can be used to output to new columns.
s3 = pd.Series(['z y z', 'c d e', np.nan, 'f g z'], dtype="string")
s3.str.split()
0 [z, y, z] 1 [c, d, e] 2 <NA> 3 [f, g, z] dtype: object
s4 = s3.str.split(expand=True)
s4
0 | 1 | 2 | |
---|---|---|---|
0 | z | y | z |
1 | c | d | e |
2 | <NA> | <NA> | <NA> |
3 | f | g | z |
str.replace()¶
The replace method can be used in much the same way as replace in excel. The case and regex parameters are useful
s3.str.replace('z', 'a')
0 a y a 1 c d e 2 <NA> 3 f g a dtype: string
str.contains()¶
The contains method allows you to find a sub-string with in a string. Simlar to the Excel find() or search() functions.
s3.str.contains('z')
0 True 1 False 2 <NA> 3 True dtype: boolean
str.cat()¶
To concatenate strings use the str.cat method and specify the separator.
s4[3]= s4[0].str.cat(s4[1], sep =", ")
s4
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | z | y | z | z, y |
1 | c | d | e | c, d |
2 | <NA> | <NA> | <NA> | <NA> |
3 | f | g | z | f, g |
That should get you up and running with strings in Pandas. In my next post I will discuss working with dates.