2.2. Pandas Read CSV

2.2.1. Rationale

  • File paths works also with URLs

2.2.2. SetUp

>>> import pandas as pd

2.2.4. Content

>>> DATA = 'https://python.astrotech.io/_static/iris-clean.csv'
>>>
>>> df = pd.read_csv(DATA)
>>> df.head(3)
   sepal_length  sepal_width  petal_length  petal_width     species
0           5.4          3.9           1.3          0.4      setosa
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor

2.2.5. Rename Columns

>>> DATA = 'https://python.astrotech.io/_static/iris-dirty.csv'
>>>
>>> COLUMNS =  ['sepal_length', 'sepal_width',
...             'petal_length', 'petal_width', 'species']
>>>
>>> df = pd.read_csv(DATA)
>>> df.head(3)
   150    4  setosa  versicolor  virginica
0  5.4  3.9     1.3         0.4          0
1  5.9  3.0     5.1         1.8          2
2  6.0  3.4     4.5         1.6          1
>>>
>>> df = pd.read_csv(DATA, skiprows=1, names=COLUMNS)
>>> df.head(3)
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.4          3.9           1.3          0.4        0
1           5.9          3.0           5.1          1.8        2
2           6.0          3.4           4.5          1.6        1
>>>
>>> df['species'].replace({
...     0: 'setosa',
...     1: 'versicolor',
...     2: 'virginica',
... }, inplace=True)
>>>
>>> df.head(n=3)
   sepal_length  sepal_width  petal_length  petal_width     species
0           5.4          3.9           1.3          0.4      setosa
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor

2.2.6. Compressed

  • If the extension is .gz, .bz2, .zip, and .xz, the corresponding compression method is automatically selected

>>> df = pd.read_csv('sample_file.zip', compression='zip')  
>>> df = pd.read_csv('sample_file.gz', compression='infer')  

2.2.7. Use Case - 0x01

>>> DATA = 'https://python.astrotech.io/_static/iris-dirty.csv'
>>>
>>> COLUMNS =  ['sepal_length', 'sepal_width',
...             'petal_length', 'petal_width', 'species']
>>> header = pd.read_csv(DATA, nrows=0)
>>> nrows, ncols, *class_labels = header.columns
>>> label_encoder = dict(enumerate(class_labels))
>>>
>>> label_encoder
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
>>> df = pd.read_csv(DATA, names=COLUMNS, skiprows=1)
>>> df['species'].replace(label_encoder, inplace=True)
>>> df.head(n=5)
   sepal_length  sepal_width  petal_length  petal_width     species
0           5.4          3.9           1.3          0.4      setosa
1           5.9          3.0           5.1          1.8   virginica
2           6.0          3.4           4.5          1.6  versicolor
3           7.3          2.9           6.3          1.8   virginica
4           5.6          2.5           3.9          1.1  versicolor

2.2.8. Use Case - 0x02

>>> DATA = 'https://python.astrotech.io/_static/martian-en.csv'
>>> pd.read_csv(DATA)
   id First Name   Last Name       Mission Date
0   1        Jan  Twardowski     5 January 1988
1   2       Mark      Watney      July 21, 1969
2   3       Ivan   Ivanovich      Apr 12,  1961
3   4    Melissa       Lewis  January 1st, 1970
4   5       Alex       Vogel   1968 December 25
>>> pd.read_csv(DATA, parse_dates=['Mission Date'])
   id First Name   Last Name Mission Date
0   1        Jan  Twardowski   1988-01-05
1   2       Mark      Watney   1969-07-21
2   3       Ivan   Ivanovich   1961-04-12
3   4    Melissa       Lewis   1970-01-01
4   5       Alex       Vogel   1968-12-25

2.2.9. Assignments

Code 2.32. Solution
"""
* Assignment: Pandas Read CSV Dates
* Complexity: easy
* Lines of code: 1 lines
* Time: 3 min

English:
    1. Read data from `DATA` to `result: pd.DataFrame`
    2. Parse dates in "Mission Date" column
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
    2. Sparsuj daty w kolumnie "Mission Date"
    3. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `parse_dates`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result
       id First Name   Last Name Mission Date
    0   1        Jan  Twardowski   1988-01-05
    1   2       Mark      Watney   1969-07-21
    2   3       Ivan   Ivanovich   1961-04-12
    3   4    Melissa       Lewis   1970-01-01
    4   5       Alex       Vogel   1968-12-25
"""

import pandas as pd


DATA = 'https://python.astrotech.io/_static/martian-en.csv'


# pd.DataFrame: read DATA and parse dates in "Mission Date" column
result = ...


Code 2.33. Solution
"""
* Assignment: Pandas Read CSV Replace
* Complexity: easy
* Lines of code: 5 lines
* Time: 8 min

English:
    1. Read data from `DATA` to `result: pd.DataFrame`
    2. Use provided column names in `COLUMNS`
    3. Read labels from the first row
    4. Replace data in `label` column with values extracted above
    5. Define `result: pd.DataFrame` with 25 first rows
    6. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
    2. Użyj podanych w `COLUMNS` nazw kolumn
    3. Wczytaj nazwy labeli z pierwszego wiersza
    4. Podmień dane w kolumnie `label` na wartości wyciągnięte powyżej
    5. Zdefiniuj `result: pd.DataFrame` z 25 pierwszymi wierszami
    6. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `class_labels = pd.read_csv(DATA, nrows=0).columns[2:]`
    * `label_encoder = dict(enumerate(class_labels))`
    * `pd.Series.replace()`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'
    >>> assert len(result) == 25, \
    'Select only 25 first rows'

    >>> result.loc[[0,1,2,3,4,5], ['mean radius', 'mean texture', 'label']]
       mean radius  mean texture      label
    0        17.99         10.38  malignant
    1        20.57         17.77  malignant
    2        19.69         21.25  malignant
    3        11.42         20.38  malignant
    4        20.29         14.34  malignant
    5        12.45         15.70  malignant

    >>> result['label'].value_counts()
    malignant    22
    benign        3
    Name: label, dtype: int64
"""

import pandas as pd


DATA = 'https://python.astrotech.io/_static/breast-cancer.csv'

COLUMNS = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
           'mean smoothness', 'mean compactness', 'mean concavity',
           'mean concave points', 'mean symmetry', 'mean fractal dimension',
           'radius error', 'texture error', 'perimeter error', 'area error',
           'smoothness error', 'compactness error', 'concavity error',
           'concave points error', 'symmetry error',
           'fractal dimension error', 'worst radius', 'worst texture',
           'worst perimeter', 'worst area', 'worst smoothness',
           'worst compactness', 'worst concavity', 'worst concave points',
           'worst symmetry', 'worst fractal dimension', 'label']


# pd.DataFrame: read DATA, substitute column names, and labels, select 25 rows
result = ...