3.7. Series NA
3.7.1. Rationale
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na
Experimental: the behaviour of pd.NA can still change without warning.
None
float('nan')
np.nan
pd.NA
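As a quick orientation, all four objects listed above are reported as missing by pd.isna(). The snippet below is a minimal sketch; the names candidates and flags are illustrative only.

```python
import numpy as np
import pandas as pd

# The four objects commonly used to mark a missing value
candidates = [None, float('nan'), np.nan, pd.NA]

# pd.isna() reports each of them as missing
flags = [bool(pd.isna(value)) for value in candidates]
print(flags)
```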
3.7.2. Boolean Value
import pandas as pd
import numpy as np
bool(None)
# False
bool(float('nan'))
# True
bool(np.nan)
# True
bool(pd.NA)
# Traceback (most recent call last):
# TypeError: boolean value of NA is ambiguous
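Because bool(pd.NA) raises and NaN is truthy, a truth test on a possibly missing scalar is safer when guarded with pd.isna(). The helper below is a hypothetical sketch (is_truthy is not a pandas function) and assumes scalar inputs only.

```python
import pandas as pd


def is_truthy(value):
    """Return False for a missing scalar, otherwise its usual truth value."""
    if pd.isna(value):
        return False
    return bool(value)


# None, NaN and pd.NA all count as "not truthy" here
results = [is_truthy(v) for v in (None, float('nan'), pd.NA, 0, 1, 'a')]
print(results)
```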
3.7.3. Type
import pandas as pd
import numpy as np
pd.Series([1, None, 3]).dtype # dtype('float64')
pd.Series([1.0, None, 3.0]).dtype # dtype('float64')
pd.Series([True, None, False]).dtype # dtype('O')
pd.Series(['a', None, 'c']).dtype # dtype('O')
pd.Series([1, float('nan'), 3]).dtype # dtype('float64')
pd.Series([1.0, float('nan'), 3.0]).dtype # dtype('float64')
pd.Series([True, float('nan'), False]).dtype # dtype('O')
pd.Series(['a', float('nan'), 'c']).dtype # dtype('O')
pd.Series([1, np.nan, 3]).dtype # dtype('float64')
pd.Series([1.0, np.nan, 3.0]).dtype # dtype('float64')
pd.Series([True, np.nan, False]).dtype # dtype('O')
pd.Series(['a', np.nan, 'c']).dtype # dtype('O')
pd.Series([1, pd.NA, 3]).dtype # dtype('O')
pd.Series([1.0, pd.NA, 3.0]).dtype # dtype('O')
pd.Series([True, pd.NA, False]).dtype # dtype('O')
pd.Series(['a', pd.NA, 'c']).dtype # dtype('O')
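The object dtype in the pd.NA rows above can be avoided by requesting a nullable extension dtype explicitly. This is a minimal sketch, assuming a pandas version that ships the Int64, Float64, boolean and string dtypes; the variable names are illustrative.

```python
import pandas as pd

# Nullable extension dtypes keep pd.NA without falling back to object
ints = pd.Series([1, pd.NA, 3], dtype='Int64')
floats = pd.Series([1.0, pd.NA, 3.0], dtype='Float64')
bools = pd.Series([True, pd.NA, False], dtype='boolean')
strings = pd.Series(['a', pd.NA, 'c'], dtype='string')

print(ints.dtype, floats.dtype, bools.dtype)

# Reductions skip the missing entry by default
total = ints.sum()
```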
3.7.4. Comparison
import pandas as pd
import numpy as np
None == None # True
None == float('nan') # False
None == np.nan # False
None == pd.NA # False
float('nan') == None # False
float('nan') == float('nan') # False
float('nan') == np.nan # False
float('nan') == pd.NA # <NA>
np.nan == None # False
np.nan == float('nan') # False
np.nan == np.nan # False
np.nan == pd.NA # <NA>
pd.NA == None # False
pd.NA == float('nan') # <NA>
pd.NA == np.nan # <NA>
pd.NA == pd.NA # <NA>
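Since == never yields True for NaN and yields <NA> for pd.NA, equality is not a reliable way to detect a missing value. A small sketch of the usual alternatives, math.isnan() for floats and pd.isna() for any of the four objects; the variable names are illustrative.

```python
import math

import numpy as np
import pandas as pd

nan = float('nan')

# NaN is never equal to anything, including itself
equal_to_itself = (nan == nan)

# Dedicated checks detect missing values regardless of their kind
detected = [
    math.isnan(nan),
    bool(pd.isna(nan)),
    bool(pd.isna(np.nan)),
    bool(pd.isna(pd.NA)),
    bool(pd.isna(None)),
]
print(equal_to_itself, detected)
```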
3.7.5. Identity
import pandas as pd
import numpy as np
None is None # True
None is float('nan') # False
None is np.nan # False
None is pd.NA # False
float('nan') is None # False
float('nan') is float('nan') # False
float('nan') is np.nan # False
float('nan') is pd.NA # False
np.nan is None # False
np.nan is float('nan') # False
np.nan is np.nan # True
np.nan is pd.NA # False
pd.NA is None # False
pd.NA is float('nan') # False
pd.NA is np.nan # False
pd.NA is pd.NA # True
3.7.6. Check
any() and all() reduce a Series to a single boolean value, skipping missing values by default. The ~ operator negates the result.
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, 3.0])
s.any() # True
~s.any() # False
s.all() # True
~s.all() # False
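To check whether a Series contains missing values at all, the same reductions can be applied to the mask returned by isna(). A minimal sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

has_missing = s.isna().any()    # at least one value is missing
all_missing = s.isna().all()    # every value is missing
missing_count = s.isna().sum()  # number of missing values
shortcut = s.hasnans            # attribute equivalent to the first check

print(has_missing, all_missing, missing_count, shortcut)
```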
3.7.7. Select
s.isnull() and s.notnull()
s.isna() and s.notna()
Negated (~) versions of all the above methods
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, 3.0])
s.isnull()
# 0 False
# 1 True
# 2 False
# dtype: bool
~s.isnull()
# 0 True
# 1 False
# 2 True
# dtype: bool
s.notnull()
# 0 True
# 1 False
# 2 True
# dtype: bool
~s.notnull()
# 0 False
# 1 True
# 2 False
# dtype: bool
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, 3.0])
s.isna()
# 0 False
# 1 True
# 2 False
# dtype: bool
s.notna()
# 0 True
# 1 False
# 2 True
# dtype: bool
~s.isna()
# 0 True
# 1 False
# 2 True
# dtype: bool
~s.notna()
# 0 False
# 1 True
# 2 False
# dtype: bool
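The boolean masks above become useful for selection when passed back to the Series as an index. A short sketch of that pattern; present and missing are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Boolean indexing keeps only the rows where the mask is True
present = s[s.notna()]
missing = s[s.isna()]

print(present)
print(missing)
```

Note that the selected rows keep their original index labels.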
3.7.8. Update
fillna(), ffill() and bfill() return a new Series by default; pass inplace=True to modify the Series in place.
Fill NA with a scalar value:
import pandas as pd
s = pd.Series([1.0, None, None, 4.0, None, 6.0])
s.fillna(0.0)
# 0 1.0
# 1 0.0
# 2 0.0
# 3 4.0
# 4 0.0
# 5 6.0
# dtype: float64
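Two details of fillna() are easy to miss: the call returns a new Series and leaves the original untouched, and the limit parameter caps how many missing entries are filled. A minimal sketch, assuming the default inplace=False; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None, 6.0])

# fillna() returns a new object; s itself still contains the gaps
filled = s.fillna(0.0)

# limit=1 fills only the first missing entry along the axis
first_only = s.fillna(0.0, limit=1)

print(s.isna().sum(), filled.isna().sum())
print(first_only)
```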
Forward fill. ffill: propagate the last valid observation forward:
import pandas as pd
s = pd.Series([1.0, None, None, 4.0, None, 6.0])
s.ffill()
# 0 1.0
# 1 1.0
# 2 1.0
# 3 4.0
# 4 4.0
# 5 6.0
# dtype: float64
Backward fill. bfill: use the next valid observation to fill the gap:
import pandas as pd
s = pd.Series([1.0, None, None, 4.0, None, 6.0])
s.bfill()
# 0 1.0
# 1 4.0
# 2 4.0
# 3 4.0
# 4 6.0
# 5 6.0
# dtype: float64
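Forward fill cannot repair a gap at the very start of a Series, because there is no earlier observation to propagate. The sketch below illustrates this, along with chaining bfill() and restricting the fill length with limit; the data is made up for the example.

```python
import pandas as pd

s = pd.Series([None, 2.0, None, None, 5.0])

# The leading gap has no previous value, so it stays missing
forward = s.ffill()

# Chaining bfill() afterwards closes the remaining leading gap
both = s.ffill().bfill()

# limit=1 propagates each observation at most one step forward
limited = s.ffill(limit=1)

print(forward.tolist(), both.tolist(), limited.tolist())
```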
Interpolate. method: str, default linear. Also accepts inplace=True:
import pandas as pd
s = pd.Series([1.0, None, None, 4.0, None, 6.0])
s.interpolate()
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 5.0
# 5 6.0
# dtype: float64
s.interpolate('nearest') # requires installation of ``scipy`` library
# 0 1.0
# 1 1.0
# 2 4.0
# 3 4.0
# 4 4.0
# 5 6.0
# dtype: float64
s.interpolate('polynomial', order=2) # requires installation of ``scipy`` library
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 5.0
# 5 6.0
# dtype: float64
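Method | Description |
---|---|
linear | Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes |
time | Works on daily and higher resolution data to interpolate given length of interval |
index, values | Use the actual numerical values of the index |
pad | Fill in NA using existing values |
nearest, zero, slinear, quadratic, cubic, barycentric, polynomial | Passed to scipy.interpolate.interp1d |
krogh, piecewise_polynomial, spline, pchip, akima, cubicspline | Wrappers around the SciPy interpolation methods of similar names |
from_derivatives | Refers to scipy.interpolate.BPoly.from_derivatives |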
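The difference between ignoring the index and using its numerical values shows up only when the index is unevenly spaced. The sketch below uses a made-up three-point Series to compare the default linear method with method='index'; neither needs scipy.

```python
import numpy as np
import pandas as pd

# Index labels 0, 1, 3 are not equally spaced
s = pd.Series([1.0, np.nan, 4.0], index=[0, 1, 3])

# 'linear' treats the values as equally spaced: halfway between 1.0 and 4.0
linear = s.interpolate()

# 'index' weights by index distance: one third of the way from 1.0 to 4.0
by_index = s.interpolate(method='index')

print(linear[1], by_index[1])
```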
3.7.9. Drop
Drop rows with missing values. Has the inplace=True parameter:
import pandas as pd
s = pd.Series([1.0, None, None, 4.0, None, 6.0])
s.dropna()
# 0 1.0
# 3 4.0
# 5 6.0
# dtype: float64
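dropna() keeps the index labels of the surviving rows, which leaves holes in the numbering. A minimal sketch of renumbering afterwards with reset_index(drop=True), where drop=True discards the old labels instead of keeping them as a column; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None, 6.0])

# The surviving rows keep their original labels
dropped = s.dropna()

# drop=True discards the old labels and numbers the rows from zero
renumbered = dropped.reset_index(drop=True)

print(dropped.index.tolist(), renumbered.index.tolist())
```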
3.7.10. Conversion
If you have a DataFrame or Series using traditional types that have missing data represented using np.nan, the convert_dtypes() method, available on both Series and DataFrame, converts the data to the newer dtypes for integers, strings and booleans. This is especially helpful after reading in data sets when letting readers such as read_csv() and read_excel() infer default dtypes.
import pandas as pd
data = pd.read_csv('data/baseball.csv', index_col='id')
data[data.columns[:10]].dtypes
# player object
# year int64
# stint int64
# team object
# lg object
# g int64
# ab int64
# r int64
# h int64
# X2b int64
# dtype: object
import pandas as pd
data = pd.read_csv('data/baseball.csv', index_col='id')
data = data.convert_dtypes()
data[data.columns[:10]].dtypes
# player string
# year Int64
# stint Int64
# team string
# lg string
# g Int64
# ab Int64
# r Int64
# h Int64
# X2b Int64
# dtype: object
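The effect of convert_dtypes() can also be seen without a CSV file. The sketch below uses two small made-up Series that contain a missing value and therefore start out as float64 and object.

```python
import pandas as pd

# A missing value forces integers to float64 and booleans to object
numbers = pd.Series([1, None, 3])
flags = pd.Series([True, None, False])

# convert_dtypes() switches to the nullable extension dtypes
numbers_converted = numbers.convert_dtypes()
flags_converted = flags.convert_dtypes()

print(numbers.dtype, numbers_converted.dtype)
print(flags.dtype, flags_converted.dtype)
```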
3.7.11. Assignments
"""
* Assignment: Series NA
* Complexity: easy
* Lines of code: 10 lines
* Time: 5 min
English:
1. Use data from "Given" section (see below)
2. From input data create `pd.Series`
3. Fill first missing value with zero
4. Drop missing values
5. Reset the index (without keeping the old one)
6. Compare result with "Tests" section (see below)
Tests:
>>> type(result) is pd.Series
True
>>> result
0    1.0
1    0.0
2    5.0
3    1.0
4    2.0
5    1.0
dtype: float64
"""
# Given
import pandas as pd
DATA = [1, None, 5, None, 1, 2, 1]