4.13. DataFrame NA¶
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})
df
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 2 NaN NaN NaN NaN
# 3 NaN NaN NaN NaN
# 4 3.0 3.3 c NaN
# 5 NaN NaN NaN NaN
# 6 4.0 4.4 d NaN
4.13.1. Select¶
Any:
df.any()
# A True
# B True
# C True
# D False
# dtype: bool
All:
df.all()
# A True
# B True
# C True
# D True
# dtype: bool
Is Null:
df.isnull()
# A B C D
# 0 False False False True
# 1 False False False True
# 2 True True True True
# 3 True True True True
# 4 False False False True
# 5 True True True True
# 6 False False False True
Is NA:
df.isna()
# A B C D
# 0 False False False True
# 1 False False False True
# 2 True True True True
# 3 True True True True
# 4 False False False True
# 5 True True True True
# 6 False False False True
4.13.2. Update¶
axis=0
- rowsaxis=1
- columns
Fill NA with scalar value:
df.fillna(0.0)
# A B C D
# 0 1.0 1.1 a 0.0
# 1 2.0 2.2 b 0.0
# 2 0.0 0.0 0 0.0
# 3 0.0 0.0 0 0.0
# 4 3.0 3.3 c 0.0
# 5 0.0 0.0 0 0.0
# 6 4.0 4.4 d 0.0
Fill NA with dict values:
df.fillna({
'A': 99,
'B': 88,
'C': 77
})
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 2 99.0 88.0 77 NaN
# 3 99.0 88.0 77 NaN
# 4 3.0 3.3 c NaN
# 5 99.0 88.0 77 NaN
# 6 4.0 4.4 d NaN
Forward Fill. ffill
: propagate last valid observation forward:
df.fillna(method='ffill')
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 2 2.0 2.2 b NaN
# 3 2.0 2.2 b NaN
# 4 3.0 3.3 c NaN
# 5 3.0 3.3 c NaN
# 6 4.0 4.4 d NaN
Backward Fill. bfill
: use NEXT valid observation to fill gap:
df.fillna(method='bfill')
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 2 3.0 3.3 c NaN
# 3 3.0 3.3 c NaN
# 4 3.0 3.3 c NaN
# 5 4.0 4.4 d NaN
# 6 4.0 4.4 d NaN
Interpolate:
df.interpolate()
# A B C D
# 0 1.000000 1.100000 a NaN
# 1 2.000000 2.200000 b NaN
# 2 2.333333 2.566667 NaN NaN
# 3 2.666667 2.933333 NaN NaN
# 4 3.000000 3.300000 c NaN
# 5 3.500000 3.850000 NaN NaN
# 6 4.000000 4.400000 d NaN
Method |
Description |
---|---|
|
Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes |
|
Works on daily and higher resolution data to interpolate given length of interval |
|
use the actual numerical values of the index. |
|
Fill in NA using existing values |
|
Passed to |
|
Wrappers around the SciPy interpolation methods of similar names |
|
Refers to |
4.13.3. Drop¶
axis=0
- rowsaxis=1
- columns
Drop Rows:
df.dropna(how='all')
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 4 3.0 3.3 c NaN
# 6 4.0 4.4 d NaN
df.dropna(how='all', axis='rows')
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 4 3.0 3.3 c NaN
# 6 4.0 4.4 d NaN
df.dropna(how='all', axis=0)
# A B C D
# 0 1.0 1.1 a NaN
# 1 2.0 2.2 b NaN
# 4 3.0 3.3 c NaN
# 6 4.0 4.4 d NaN
df.dropna(how='any')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []
df.dropna(how='any', axis=0)
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []
df.dropna(how='any', axis='rows')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []
Drop Columns:
df.dropna(how='all', axis='columns')
# A B C
# 0 1.0 1.1 a
# 1 2.0 2.2 b
# 2 NaN NaN NaN
# 3 NaN NaN NaN
# 4 3.0 3.3 c
# 5 NaN NaN NaN
# 6 4.0 4.4 d
df.dropna(how='all', axis=1)
# A B C
# 0 1.0 1.1 a
# 1 2.0 2.2 b
# 2 NaN NaN NaN
# 3 NaN NaN NaN
# 4 3.0 3.3 c
# 5 NaN NaN NaN
# 6 4.0 4.4 d
df.dropna(how='all', axis=-1)
# Traceback (most recent call last):
# ValueError: No axis named -1 for object type <class 'pandas.core.frame.DataFrame'>
df.dropna(how='any', axis='columns')
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]
df.dropna(how='any', axis=1)
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]
df.dropna(how='any', axis=-1)
# Traceback (most recent call last):
# ValueError: No axis named -1 for object type <class 'pandas.core.frame.DataFrame'>
4.13.4. Assignments¶
"""
* Assignment: DataFrame NaN
* Complexity: easy
* Lines of code: 10 lines
* Time: 8 min
English:
TODO: English Translation
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Pomiń pierwszą linię z metadanymi
4. Zmień nazwy kolumn na:
a. Sepal length
b. Sepal width
c. Petal length
d. Petal width
e. Species
5. Podmień wartości w kolumnie species
a. 0 -> 'setosa',
b. 1 -> 'versicolor',
c. 2 -> 'virginica'
6. Wybierz wartości w kolumnie 'Petal length' mniejsze od 4
7. Wybrane wartości ustaw na `NaN`
8. Interpoluj liniowo wszystkie wartości `NaN`
9. Usuń wiersze z pozostałymi wartościami `NaN`
10. Zdefiniuj `result` jako dwa pierwsze wiersze
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> result # doctest: +NORMALIZE_WHITESPACE
Sepal length Sepal width Petal length Petal width Species
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
"""
# Given
import pandas as pd
import numpy as np
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-dirty.csv'
COLUMNS = [
'Sepal length',
'Sepal width',
'Petal length',
'Petal width',
'Species']
result = ...