4.16. DataFrame Mapping

import pandas as pd
import numpy as np
np.random.seed(0)

df = pd.DataFrame(
    columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
    index = pd.date_range('1999-12-30', periods=7),
    data = np.random.randn(7, 4))

df
#              Morning      Noon   Evening  Midnight
# 1999-12-30  1.764052  0.400157  0.978738  2.240893
# 1999-12-31  1.867558 -0.977278  0.950088 -0.151357
# 2000-01-01 -0.103219  0.410599  0.144044  1.454274
# 2000-01-02  0.761038  0.121675  0.443863  0.333674
# 2000-01-03  1.494079 -0.205158  0.313068 -0.854096
# 2000-01-04 -2.552990  0.653619  0.864436 -0.742165
# 2000-01-05  2.269755 -1.454366  0.045759 -0.187184

4.16.1. Map

  • .map() works element-wise on a Series

df['Morning'].map(lambda value: round(value, 2))
# 1999-12-30    1.76
# 1999-12-31    1.87
# 2000-01-01   -0.10
# 2000-01-02    0.76
# 2000-01-03    1.49
# 2000-01-04   -2.55
# 2000-01-05    2.27
# Freq: D, Name: Morning, dtype: float64
df['Morning'].map(int)
# 1999-12-30    1
# 1999-12-31    1
# 2000-01-01    0
# 2000-01-02    0
# 2000-01-03    1
# 2000-01-04   -2
# 2000-01-05    2
# Freq: D, Name: Morning, dtype: int64

4.16.2. Apply

  • .apply() works on a row / column basis of a DataFrame

df['Morning'].apply(int)
# 1999-12-30    1
# 1999-12-31    1
# 2000-01-01    0
# 2000-01-02    0
# 2000-01-03    1
# 2000-01-04   -2
# 2000-01-05    2
# Freq: D, Name: Morning, dtype: int64
df['Morning'].apply(lambda value: round(value, 2))
# 1999-12-30    1.76
# 1999-12-31    1.87
# 2000-01-01   -0.10
# 2000-01-02    0.76
# 2000-01-03    1.49
# 2000-01-04   -2.55
# 2000-01-05    2.27

4.16.3. Applymap

  • .applymap() works element-wise on a DataFrame

4.16.4. Summary

First major difference: DEFINITION

  • map is defined on Series ONLY

  • applymap is defined on DataFrames ONLY

  • apply is defined on BOTH

Second major difference: ARGUMENT TYPE

  • map accepts dict``s, ``Series, or callable

  • applymap and apply accept callables only

Third major difference: BEHAVIOR

  • map is elementwise for Series

  • applymap is elementwise for DataFrames

  • apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.

Fourth major difference (the most important one): USE CASE

  • map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))

  • applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))

  • apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))

Footnotes:

  • map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.

  • applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.

  • map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.

  • Series.apply returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.

../_images/pandas-dataframe-mapping.png

Source: https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas

4.16.5. Cleaning User Input

  • 80% of machine learning and data science is cleaning data

4.16.6. Is This the Same Address?

  • This is a dump of distinct records of a single address

  • Which one of the below is a true address?

'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'

'os. Jana III Sobieskiego'

'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego'  # three small letters 'L'

4.16.7. Spelling and Abbreviations

'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'
'os'
'os.'
'Os.'
'osiedle'

'oś'
'oś.'
'Oś.'
'ośedle'
'pl'
'pl.'
'Pl.'
'plac'
'al'
'al.'
'Al.'

'aleja'
'aleia'
'alei'
'aleii'
'aleji'

4.16.8. House and Apartment Number

'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'
'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'
'180f/8f'
'180f/8'
'180/8f'
'13d bud. A'

4.16.9. Phone Numbers

+48 (12) 355 5678
+48 123 555 678
123 555 678

+48 12 355 5678
+48 123-555-678
+48 123 555 6789

+1 (123) 555-6789
+1 (123).555.6789

+1 800-python
+48123555678

+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,2,3

4.16.10. Conversion

LETTERS_EN = 'abcdefghijklmnopqrstuvwxyz'
LETTERS_PL = 'aąbcćdeęfghijklłmnńoóprsśtuwyzżź'

LETTERS_PLEN = {'ą': 'a', 'ć': 'c', 'ę': 'e',
                'ł': 'l', 'ń': 'n', 'ó': 'o',
                'ś': 's', 'ż': 'z', 'ź': 'z'}
MONTHS_EN = ['January', 'February', 'March', 'April',
             'May', 'June', 'July', 'August', 'September',
             'October', 'November', 'December']

MONTHS_PL = ['styczeń', 'luty', 'marzec', 'kwiecień',
             'maj', 'czerwiec', 'lipiec', 'sierpień',
             'wrzesień', 'październik', 'listopad', 'grudzień']

MONTHS_PLEN = {'styczeń': 'January',
               'luty': 'February',
               'marzec': 'March',
               'kwiecień': 'April',
               'maj': 'May',
               'czerwiec': 'June',
               'lipiec': 'July',
               'sierpień': 'August',
               'wrzesień': 'September',
               'październik': 'October',
               'listopad': 'November',
               'grudzień': 'December'}

MONTHS_ENPL = {'January': 'styczeń',
               'February': 'luty',
               'March': 'marzec',
               'April': 'kwiecień',
               'May': 'maj',
               'June': 'czerwiec',
               'July': 'lipiec',
               'August': 'sierpień',
               'September': 'wrzesień',
               'October': 'październik',
               'November': 'listopad',
               'December': 'grudzień'}

4.16.11. Assignments

Code 4.6. Solution
"""
* Assignment: DataFrame Mapping Split
* Complexity: easy
* Lines of code: 5 lines
* Time: 13 min

English:
    1. Use data from "Given" section (see below)
    2. Read data from `DATA` as `df: pd.DataFrame`
    3. Parse data in `date` column as `datetime` object
    4. Split column `date` with into two separate: date and time columns

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    3. Sparsuj dane w kolumnie `date` jako obiekty `datetime`
    4. Podziel kolumnę z `date` na dwie osobne: datę i czas

Hints:
    * `pd.Series.dt.date`
    * `pd.Series.dt.time`

Tests:
    >>> type(result) is pd.DataFrame
    True
    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)
    >>> result  # doctest: +NORMALIZE_WHITESPACE
          id   period            datetime   network  item           type  duration        date      time
    0      0  1999-11 1999-10-15 06:58:00  T-Mobile  data           data      34.5  1999-10-15  06:58:00
    1      1  1999-11 1999-10-15 06:58:00    Orange  call         mobile      13.0  1999-10-15  06:58:00
    2      2  1999-11 1999-10-15 14:46:00      Play  call         mobile      23.0  1999-10-15  14:46:00
    3      3  1999-11 1999-10-15 14:48:00      Plus  call         mobile       4.0  1999-10-15  14:48:00
    4      4  1999-11 1999-10-15 17:27:00  T-Mobile  call         mobile       4.0  1999-10-15  17:27:00
    ..   ...      ...                 ...       ...   ...            ...       ...         ...       ...
    825  825  2000-03 2000-03-13 00:38:00      AT&T   sms  international       1.0  2000-03-13  00:38:00
    826  826  2000-03 2000-03-13 00:39:00    Orange   sms         mobile       1.0  2000-03-13  00:39:00
    827  827  2000-03 2000-03-13 06:58:00    Orange  data           data      34.5  2000-03-13  06:58:00
    828  828  2000-03 2000-03-14 00:13:00      AT&T   sms  international       1.0  2000-03-14  00:13:00
    829  829  2000-03 2000-03-14 00:16:00      AT&T   sms  international       1.0  2000-03-14  00:16:00
    <BLANKLINE>
    [830 rows x 9 columns]
"""


# Given
import pandas as pd


DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/phones-pl.csv'

result = ...


Code 4.7. Solution
"""
* Assignment: DataFrame Mapping Translate
* Complexity: easy
* Lines of code: 5 lines
* Time: 13 min

English:
    1. Use data from "Given" section (see below)
    2. Read data from `DATA` as `df: pd.DataFrame`
    3. Set header and index to data from file
    4. Convert Polish month names to English
    5. Parse dates to `datetime` objects

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    3. Ustaw nagłówek i index na dane zaczytane z pliku
    4. Przekonwertuj polskie nazwy miesięcy na angielskie
    5. Sparsuj daty do obiektów `datetime`

Hints:
    * `pd.Series.replace(regex=True)`
    * `pd.to_datetime()`

Tests:
    >>> type(result) is pd.DataFrame
    True
    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)
    >>> result  # doctest: +NORMALIZE_WHITESPACE
       id First Name   Last Name Mission Date
    0   1        Jan  Twardowski   1988-01-05
    1   2       Mark      Watney   1969-07-21
    2   3       Ivan   Ivanovich   1961-04-12
    3   4    Melissa       Lewis   1970-01-01
    4   5       Alex       Vogel   1968-12-25
"""


# Given
import pandas as pd


DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/martian-pl.csv'
MONTHS_PLEN = {'styczeń': 'January',
               'luty': 'February',
               'marzec': 'March',
               'kwiecień': 'April',
               'maj': 'May',
               'czerwiec': 'June',
               'lipiec': 'July',
               'sierpień': 'August',
               'wrzesień': 'September',
               'październik': 'October',
               'listopad': 'November',
               'grudzień': 'December'}

result = ...


Code 4.8. Solution
"""
* Assignment: DataFrame Mapping Month
* Complexity: easy
* Lines of code: 10 lines
* Time: 13 min

English:
    1. Use data from "Given" section (see below)
    2. Read data from `DATA` as `df: pd.DataFrame`
    3. Add column `year` and `month` by parsing `period` column
    4. Month name must be a string month name, not a number (i.e.: 'January', 'May')

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    3. Dodaj kolumnę `year` i `month` poprzez sparsowanie kolumny `period`
    4. Nazwa miesiąca musi być ciągiem znaków, a nie liczbą (i.e. 'January', 'May')

:Example:
    * if `period` column is "2015-01"
    * `year`: 2015
    * `month`: January

Hints:
    * `Series.str.split(expand=True)`
    * `df[ ['A', 'B'] ] = ...`

Tests:
    >>> type(result) is pd.DataFrame
    True
    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)
    >>> result  # doctest: +NORMALIZE_WHITESPACE
          period            datetime   network  item           type  duration  year     month
    id
    0    1999-11 1999-10-15 06:58:00  T-Mobile  data           data      34.5  1999  November
    1    1999-11 1999-10-15 06:58:00    Orange  call         mobile      13.0  1999  November
    2    1999-11 1999-10-15 14:46:00      Play  call         mobile      23.0  1999  November
    3    1999-11 1999-10-15 14:48:00      Plus  call         mobile       4.0  1999  November
    4    1999-11 1999-10-15 17:27:00  T-Mobile  call         mobile       4.0  1999  November
    ..       ...                 ...       ...   ...            ...       ...   ...       ...
    825  2000-03 2000-03-13 00:38:00      AT&T   sms  international       1.0  2000     March
    826  2000-03 2000-03-13 00:39:00    Orange   sms         mobile       1.0  2000     March
    827  2000-03 2000-03-13 06:58:00    Orange  data           data      34.5  2000     March
    828  2000-03 2000-03-14 00:13:00      AT&T   sms  international       1.0  2000     March
    829  2000-03 2000-03-14 00:16:00      AT&T   sms  international       1.0  2000     March
    <BLANKLINE>
    [830 rows x 8 columns]
"""


# Given
import pandas as pd


DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/phones-pl.csv'
MONTHS_EN = ['January', 'February', 'March', 'April',
             'May', 'June', 'July', 'August', 'September',
             'October', 'November', 'December']
MONTHS = dict(enumerate(MONTHS_EN, start=1))

result = ...


Code 4.9. Solution
"""
* Assignment: DataFrame Mapping Substitute
* Complexity: medium
* Lines of code: 10 lines
* Time: 13 min

English:
    1. Use data from "Given" section (see below)
    2. Read data from `DATA` as `df: pd.DataFrame`
    3. Select `Polish` spreadsheet
    4. Set header and index to data from file
    5. Mind the encoding
    6. Substitute Polish Diacritics to English alphabet letters
    7. Compare `df.replace(regex=True)` with `df.applymap()`

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    3. Wybierz arkusz `Polish`
    4. Ustaw nagłówek i index na dane zaczytane z pliku
    5. Zwróć uwagę na encoding
    6. Podmień polskie znaki diakrytyczne na litery z alfabetu angielskiego
    7. Porównaj `df.replace(regex=True)` z `df.applymap()`

Tests:
    >>> type(result) is pd.DataFrame
    True
    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 3)
    >>> pd.set_option('display.max_rows', 10)
    >>> result  # doctest: +NORMALIZE_WHITESPACE
                                                 Definicja  ...                                 Kryteria wyjsciowe
    TRL                                                     ...
    1    Zaobserwowanie i opisanie podstawowych zasad d...  ...  Zweryfikowane publikacja badania lezacych u po...
    2    Sformulowanie koncepcji technologicznej lub pr...  ...  Udokumentowany opis aplikacji / koncepcji, kto...
    3    Przeprowadzanie eksperymentalnie i analityczni...  ...  Udokumentowane wyniki analityczne / eksperymen...
    4    Przeprowadzenie weryfikacji komponentow techno...  ...  Udokumentowane wyniki testow potwierdzajace zg...
    5    Przeprowadzenie weryfikacji komponentow techno...  ...  Udokumentowane wyniki testow potwierdzajace zg...
    6    Dokonanie demonstracji technologii w srodowisk...  ...  Udokumentowane wyniki testow potwierdzajace zg...
    7    Dokonanie demonstracji prototypu systemu w oto...  ...  Udokumentowane wyniki testow potwierdzajace zg...
    8    Zakonczenie badan i demonstracja ostatecznej f...  ...  Udokumentowane wyniki testow weryfikujacych pr...
    9    Weryfikacja technologii w srodowisku operacyjn...  ...            Udokumentowane wyniki operacyjne misji.
    <BLANKLINE>
    [9 rows x 4 columns]
"""


# Given
import pandas as pd


DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/xlsx/astro-trl.xlsx'
LETTERS_PLEN = {'ą': 'a', 'ć': 'c', 'ę': 'e',
                'ł': 'l', 'ń': 'n', 'ó': 'o',
                'ś': 's', 'ż': 'z', 'ź': 'z'}

result = ...


Code 4.10. Solution
"""
* Assignment: Pandas Read JSON OpenAPI
* Complexity: easy
* Lines of code: 5 lines
* Time: 5 min

English:
    1. Use data from "Given" section (see below)
    2. Read data from `DATA` as `df: pd.DataFrame`
    3. Use `requests` library
    4. Transpose data
    5. If cell is a `dict`, then extract value for `summary`
    6. If cell is empty, leave `pd.NA`

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    3. Użyj biblioteki `requests`
    4. Transponuj dane
    5. Jeżeli komórka jest `dict`, to wyciągnij wartość dla `summary`
    6. Jeżeli komórka jest pusta, pozostaw `pd.NA`

Tests:
    >>> type(result) is pd.DataFrame
    True
    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)
    >>> len(result) > 0
    True
    >>> list(result.columns)
    ['put', 'post', 'get', 'delete']
    >>> list(result.index)  # doctest: +NORMALIZE_WHITESPACE
    ['/pet', '/pet/findByStatus', '/pet/findByTags', '/pet/{petId}', '/pet/{petId}/uploadImage',
     '/store/inventory', '/store/order', '/store/order/{orderId}',
     '/user', '/user/createWithList', '/user/login', '/user/logout', '/user/{username}']
"""


# Given
import pandas as pd
import requests


DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/json/openapi.json'

result = ...