Python pipes
Pipes is a lovely feature that makes your code more readable and less verbose
Pipes make code easy to read and maintain. They also make it easier to understand the flow of data through your code. See this mermaid flow, it is just like you would think of transforming a dataframe.
Replace the arrow with a “then”. Do this, then that, then this and finish off with action3. As you can see, we are not writing spaghetti code anymore!
Pipes in python
Code
import pandas as pd
=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")
df
def set_dtypes(dataf):
return (dataf
=lambda d: pd.to_datetime(d['date']))
.assign(date'currency_code', 'date']))
.sort_values([
def remove_outliers(dataf):
=32
min_row_country= (dataf
countries 'currency_code')
.groupby(=('name','count'))
.agg(nlambda d: d['n'] >= min_row_country]
.loc[
.index)return (dataf
lambda d: d['currency_code'].isin(countries)])
.loc[
df.pipe(set_dtypes).pipe(remove_outliers).head()
date | iso_a3 | currency_code | name | local_price | dollar_ex | dollar_price | USD | EUR | GBP | JPY | CNY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
197 | 2005-06-01 | ARE | AED | UAE | 9.0 | 3.67250 | 2.450647 | -0.05014 | -0.31606 | -0.28769 | 0.04608 | 0.93167 |
270 | 2006-05-01 | ARE | AED | UAE | 9.0 | 3.67260 | 2.450580 | -0.11850 | -0.34945 | -0.32824 | 0.09889 | 0.87318 |
311 | 2007-01-01 | ARE | AED | UAE | 10.0 | 3.67260 | 2.722867 | -0.05783 | -0.28519 | -0.28976 | 0.18236 | 0.92440 |
352 | 2007-06-01 | ARE | AED | UAE | 10.0 | 3.67285 | 2.722681 | -0.09244 | -0.34659 | -0.32055 | 0.18937 | 0.88229 |
393 | 2008-06-01 | ARE | AED | UAE | 10.0 | 3.67285 | 2.722681 | -0.15181 | -0.49039 | -0.40429 | 0.03904 | 0.48854 |
Amazing! Now we can see that the code is much more readable and easier to understand.
What if we did it in the old way and look at them both?
Code
import pandas as pd
=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")
df
= (df
df2 =lambda d: pd.to_datetime(d['date']))
.assign(date'currency_code', 'date'])
.sort_values(['currency_code')
.groupby(=('date', 'count')))
.agg(n
lambda d: d['currency_code'].isin(df2[df2['n'] >= 32].index)].head() df.loc[
date | iso_a3 | currency_code | name | local_price | dollar_ex | dollar_price | USD | EUR | GBP | JPY | CNY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000-04-01 | ARG | ARS | Argentina | 2.50 | 1.000000 | 2.500000 | 0.11607 | 0.05007 | -0.16722 | -0.09864 | 1.09091 |
1 | 2000-04-01 | AUS | AUD | Australia | 2.59 | 1.680000 | 1.541667 | -0.31176 | -0.35246 | -0.48645 | -0.44416 | 0.28939 |
2 | 2000-04-01 | BRA | BRL | Brazil | 2.95 | 1.790000 | 1.648045 | -0.26427 | -0.30778 | -0.45102 | -0.40581 | 0.37836 |
3 | 2000-04-01 | GBR | GBP | Britain | 1.90 | 0.632911 | 3.002000 | 0.34018 | 0.26092 | 0.00000 | 0.08235 | 1.51076 |
4 | 2000-04-01 | CAN | CAD | Canada | 2.85 | 1.470000 | 1.938776 | -0.13448 | -0.18566 | -0.35417 | -0.30099 | 0.62152 |
Not so tidy, harder to read and to grasp what is going on here, compared to the pipes. Also pytest is something you point towards functions, so it aint going to work in the bad code example right above here.
Transforming data
One of the most common tasks is to aggregate, filter, select, mutate and do conditional changes to the data.
Aggregating columns
df.groupby(‘name’).sum(‘dollar_price’)
Code
'name').sum('dollar_price')
df.groupby("name")["dollar_price"].sum()
df.groupby("name").agg(total=("dollar_price", "sum")).reset_index().head() df.groupby(
name | total | |
---|---|---|
0 | Argentina | 134.292819 |
1 | Australia | 160.908677 |
2 | Azerbaijan | 35.182460 |
3 | Bahrain | 51.703457 |
4 | Brazil | 170.168869 |
Mutating a column
df[‘mean_price_country’] = df.groupby(‘name’)[‘local_price’].transform(‘mean’)
Code
import pandas as pd
=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")
df
=df.assign(
df= lambda dataframe: dataframe['dollar_price'].map(lambda dollar_price: True if dollar_price <= 4 else False)
is_good_price
)
'eur_price'] = df.groupby('name')['EUR'].transform('sum')*df.groupby('name')['local_price'].transform('sum')
df['local_price', 'date', "name"]].set_index('date').head() df[[
local_price | name | |
---|---|---|
date | ||
2000-04-01 | 2.50 | Argentina |
2000-04-01 | 2.59 | Australia |
2000-04-01 | 2.95 | Brazil |
2000-04-01 | 1.90 | Britain |
2000-04-01 | 2.85 | Canada |
Dropping a column
Code
df
"date",axis=1).head(2) df.drop(
iso_a3 | currency_code | name | local_price | dollar_ex | dollar_price | USD | EUR | GBP | JPY | CNY | is_good_price | eur_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARG | ARS | Argentina | 2.50 | 1.00 | 2.500000 | 0.11607 | 0.05007 | -0.16722 | -0.09864 | 1.09091 | True | -161638.542969 |
1 | AUS | AUD | Australia | 2.59 | 1.68 | 1.541667 | -0.31176 | -0.35246 | -0.48645 | -0.44416 | 0.28939 | True | -1049.043475 |
df.drop(“date”,axis=1)
Replace blank space in columns
data.columns = [column.replace(” “,”_“) for column in data.columns]
Filter some data using query
Best solution:
df.query(“name == ‘United States’”)
Alternative solution:
df[df[‘name’]==‘United States’]
Code
=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")
df
"name == 'United States'").head(4) df.query(
date | iso_a3 | currency_code | name | local_price | dollar_ex | dollar_price | USD | EUR | GBP | JPY | CNY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | 2000-04-01 | USA | USD | United States | 2.24 | 1.0 | 2.24 | 0.0 | -0.05914 | -0.25383 | -0.19238 | 0.87345 |
55 | 2001-04-01 | USA | USD | United States | 2.24 | 1.0 | 2.24 | 0.0 | -0.00955 | -0.21285 | -0.05524 | 0.87345 |
87 | 2002-04-01 | USA | USD | United States | 2.35 | 1.0 | 2.35 | 0.0 | -0.01107 | -0.18558 | 0.16603 | 0.85314 |
119 | 2003-04-01 | USA | USD | United States | 2.46 | 1.0 | 2.46 | 0.0 | -0.17477 | -0.21761 | 0.12672 | 1.05745 |
Lambda function
This example is not functioning, just for a visual
Code
df.assign(= lambda dataframe: dataframe['name'].map(lambda name: name.upper()),
name_uppercase
).assign(= lambda dataframe: dataframe['name_uppercase'].map(lambda name: name.upper()+"-"+name.upper())
name_uppercase_double )
Case when
Code
import pandas as pd
list = ['Ana', 'Kevin', 'Wang', 'Tao', '']
print(pd.DataFrame([{'index': idx, 'name': char} for idx,char in enumerate(list)]))
index name
0 0 Ana
1 1 Kevin
2 2 Wang
3 3 Tao
4 4
What if you want to replace some values in a dataframe? You can do it using a function and then an apply fuction. I think this is pretty nice to do as you can do it row-wise. It is easy to read the code and understand the conditions while it is easy to update the conditons also.
This is also good as you can use pytest on the functions and therefore make sure that they work as expected.
Code
=pd.read_csv("https://raw.githubusercontent.com/TheEconomist/big-mac-data/master/output-data/big-mac-raw-index.csv")
df
# Define the custom function
def update_price(row):
if row['name'] == 'Argentina' and (row['local_price'] < 3 or row['local_price'] >= 10):
return 20
return row['local_price']
# Apply the function to the DataFrame row-wise
'local_price'] = df.apply(update_price, axis=1)
df[3) df.head(
date | iso_a3 | currency_code | name | local_price | dollar_ex | dollar_price | USD | EUR | GBP | JPY | CNY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000-04-01 | ARG | ARS | Argentina | 20.00 | 1.00 | 2.500000 | 0.11607 | 0.05007 | -0.16722 | -0.09864 | 1.09091 |
1 | 2000-04-01 | AUS | AUD | Australia | 2.59 | 1.68 | 1.541667 | -0.31176 | -0.35246 | -0.48645 | -0.44416 | 0.28939 |
2 | 2000-04-01 | BRA | BRL | Brazil | 2.95 | 1.79 | 1.648045 | -0.26427 | -0.30778 | -0.45102 | -0.40581 | 0.37836 |
Alternative solution, not as nice
Code
import pandas as pd
import numpy as np
= {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
data 'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
= pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
df
= [
conditions "age"].lt(10)),
(df["age"].ge(10) & df["age"].lt(20)),
(df["age"].ge(20) & df["age"].lt(30)),
(df["age"].ge(30) & df["age"].lt(50)),
(df["age"].ge(50)),
(df[
]= ["baby", "kid", "young", "mature", "grandpa"]
choices
"elderly"] = np.select(conditions, choices)
df[
3) df.head(
name | age | preTestScore | postTestScore | elderly | |
---|---|---|---|---|---|
0 | Jason | 42 | 4 | 25 | mature |
1 | Molly | 52 | 24 | 94 | grandpa |
2 | Tina | 36 | 31 | 57 | mature |