4.9. DataFrame Select¶
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(
columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
index = pd.date_range('1999-12-30', periods=7),
data = np.random.randn(7, 4))
df
# Morning Noon Evening Midnight
# 1999-12-30 1.764052 0.400157 0.978738 2.240893
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-02 0.761038 0.121675 0.443863 0.333674
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-04 -2.552990 0.653619 0.864436 -0.742165
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184

Figure 4.4. Pandas Select Methods¶
4.9.1. Query Data¶
df.where()
Works withinplace=True
df[df['Morning'] > 0.0]
# Morning Noon Evening Midnight
# 1999-12-30 1.764052 0.400157 0.978738 2.240893
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-02 0.761038 0.121675 0.443863 0.333674
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
query = df['Morning'] > 0.0
df[query]
# Morning Noon Evening Midnight
# 1999-12-30 1.764052 0.400157 0.978738 2.240893
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-02 0.761038 0.121675 0.443863 0.333674
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
query = df['Morning'] > 0.0
df.where(query)
# Morning Noon Evening Midnight
# 1999-12-30 1.764052 0.400157 0.978738 2.240893
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-01 NaN NaN NaN NaN
# 2000-01-02 0.761038 0.121675 0.443863 0.333674
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-04 NaN NaN NaN NaN
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
4.9.2. Logical NOT¶
query = df['Midnight'] < 0.0
df[~query]
# Morning Noon Evening Midnight
# 1999-12-30 1.764052 0.400157 0.978738 2.240893
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-02 0.761038 0.121675 0.443863 0.333674
df.where(~query)
# Morning Noon Evening Midnight
# 1999-12-30 1.764052 0.400157 0.978738 2.240893
# 1999-12-31 NaN NaN NaN NaN
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-02 0.761038 0.121675 0.443863 0.333674
# 2000-01-03 NaN NaN NaN NaN
# 2000-01-04 NaN NaN NaN NaN
# 2000-01-05 NaN NaN NaN NaN
4.9.3. Logical AND¶
In first and in second query
1 & 1 -> 1
1 & 0 -> 0
0 & 1 -> 0
0 & 0 -> 0
df[ (df['Morning']<0.0) & (df['Midnight']<0.0) ]
# Morning Noon Evening Midnight
# 2000-01-04 -2.55299 0.653619 0.864436 -0.742165
query = (df['Morning'] < 0.0) & (df['Midnight'] < 0.0)
df[query]
# Morning Noon Evening Midnight
# 2000-01-04 -2.55299 0.653619 0.864436 -0.742165
query1 = df['Morning'] < 0.0
query2 = df['Midnight'] < 0.0
df[query1 & query2]
# Morning Noon Evening Midnight
# 2000-01-04 -2.55299 0.653619 0.864436 -0.742165
df.where(query1 & query2)
# Morning Noon Evening Midnight
# 1999-12-30 NaN NaN NaN NaN
# 1999-12-31 NaN NaN NaN NaN
# 2000-01-01 NaN NaN NaN NaN
# 2000-01-02 NaN NaN NaN NaN
# 2000-01-03 NaN NaN NaN NaN
# 2000-01-04 -2.55299 0.653619 0.864436 -0.742165
# 2000-01-05 NaN NaN NaN NaN
4.9.4. Logical OR¶
In first or in second query
1 | 1 -> 1
1 | 0 -> 1
0 | 1 -> 1
0 | 0 -> 0
query1 = df['Morning'] < 0.0
query2 = df['Midnight'] < 0.0
df[query1 | query2]
# Morning Noon Evening Midnight
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-04 -2.552990 0.653619 0.864436 -0.742165
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
df.where(query1 | query2)
# Morning Noon Evening Midnight
# 1999-12-30 NaN NaN NaN NaN
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-02 NaN NaN NaN NaN
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-04 -2.552990 0.653619 0.864436 -0.742165
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
4.9.5. Logical XOR¶
In first or in second, but not in both queries
1 ^ 1 -> 0
1 ^ 0 -> 1
0 ^ 1 -> 1
0 ^ 0 -> 0
query1 = df['Morning'] < 0.0
query2 = df['Midnight'] < 0.0
df[query1 ^ query2]
# Morning Noon Evening Midnight
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
df.where(query1 ^ query2)
# Morning Noon Evening Midnight
# 1999-12-30 NaN NaN NaN NaN
# 1999-12-31 1.867558 -0.977278 0.950088 -0.151357
# 2000-01-01 -0.103219 0.410599 0.144044 1.454274
# 2000-01-02 NaN NaN NaN NaN
# 2000-01-03 1.494079 -0.205158 0.313068 -0.854096
# 2000-01-04 NaN NaN NaN NaN
# 2000-01-05 2.269755 -1.454366 0.045759 -0.187184
4.9.6. Assignments¶
"""
* Assignment: DataFrame Select
* Complexity: easy
* Lines of code: 5 lines
* Time: 8 min
English:
TODO: Translate to English
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Przefiltruj `inplace` kolumnę 'petal_length' i pozostaw wartości powyżej 2.0
4. Wyświetl 5 pierwszych wierszy
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> result # doctest: +NORMALIZE_WHITESPACE
sepal_length sepal_width petal_length petal_width species
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
3 7.3 2.9 6.3 1.8 virginica
4 5.6 2.5 3.9 1.1 versicolor
6 5.5 2.6 4.4 1.2 versicolor
"""
# Given
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-clean.csv'
result = ...