Préparation et analyse de données sous la forme ligne/colonne

Les biliothèques qui vont être utilisées relèvent de l'écosystème PyData (un groupe d'utilisateurs / de developpeurs et une série de conférences autour de l'analyse de données - ainsi que les bibliothèques comme pandas et xarray).

Pandas

  • Permet de manipuler des objets de type DataFrame.
  • Inspirée par les data.frame R.
  • API riche, bonne performance, compatibilité avec le reste de l'écosystème.
Exemple : prix de la weed

Les données proviennent de The Price of Weed website : http://www.priceofweed.com/

Un utilisateur a récupéré les données journalières (491 dates) pour les 51 états américains : https://github.com/frankbi/price-of-weed/

On va regarder comment ouvrir, nettoyer et manipuler ces données avec pandas

In [36]:
# Imports nécessaires 
import pandas as pd
import numpy as np
import os

# Lister les fichiers du dossier contenant les données :
files = os.listdir('data/prices')
files
Out[36]:
['weedprices13082014.csv',
 'weedprices02022015.csv',
 'weedprices25052014.csv',
 'weedprices08022015.csv',
 'weedprices20062015.csv',
 'weedprices27082014.csv',
 'weedprices22012015.csv',
 'weedprices13072014.csv',
 'weedprices08062014.csv',
 'weedprices10082014.csv',
 'weedprices09032014.csv',
 'weedprices03012014.csv',
 'weedprices31052015.csv',
 'weedprices21042014.csv',
 'weedprices08092014.csv',
 'weedprices12012014.csv',
 'weedprices06062014.csv',
 'weedprices04072014.csv',
 'weedprices10102014.csv',
 'weedprices29032014.csv',
 'weedprices11012015.csv',
 'weedprices18072014.csv',
 'weedprices03022014.csv',
 'weedprices20042014.csv',
 'weedprices29102014.csv',
 'weedprices03122014.csv',
 'weedprices02022014.csv',
 'weedprices18062014.csv',
 'weedprices19022014.csv',
 'weedprices13022014.csv',
 'weedprices07102014.csv',
 'weedprices03112014.csv',
 'weedprices14022015.csv',
 'weedprices27012015.csv',
 'weedprices16042014.csv',
 'weedprices24032014.csv',
 'weedprices11052014.csv',
 'weedprices11072014.csv',
 'weedprices08062015.csv',
 'weedprices04042014.csv',
 'weedprices26062014.csv',
 'weedprices09112014.csv',
 'weedprices10112014.csv',
 'weedprices21022014.csv',
 'weedprices04022014.csv',
 'weedprices06122014.csv',
 'weedprices09072015.csv',
 'weedprices28102014.csv',
 'weedprices25032014.csv',
 'weedprices01072014.csv',
 'weedprices11032014.csv',
 'weedprices09012014.csv',
 'weedprices06022014.csv',
 'weedprices28062014.csv',
 'weedprices23122014.csv',
 'weedprices15022015.csv',
 'weedprices07042014.csv',
 'weedprices17092014.csv',
 'weedprices16062015.csv',
 'weedprices21062015.csv',
 'weedprices17042014.csv',
 'weedprices18102014.csv',
 'weedprices31032014.csv',
 'weedprices07012015.csv',
 'weedprices11062014.csv',
 'weedprices02072015.csv',
 'weedprices14012015.csv',
 'weedprices08012015.csv',
 'weedprices17082014.csv',
 'weedprices18122014.csv',
 'weedprices29012014.csv',
 'weedprices25062014.csv',
 'weedprices11042014.csv',
 'weedprices05042014.csv',
 'weedprices28042014.csv',
 'weedprices11072015.csv',
 'weedprices01022014.csv',
 'weedprices21122014.csv',
 'weedprices15082014.csv',
 'weedprices27022014.csv',
 'weedprices06012014.csv',
 'weedprices25012014.csv',
 'weedprices30032014.csv',
 'weedprices22072014.csv',
 'weedprices22122014.csv',
 'weedprices22062015.csv',
 'weedprices07052014.csv',
 'weedprices24022014.csv',
 'weedprices25022014.csv',
 'weedprices26092014.csv',
 'weedprices16112014.csv',
 'weedprices27072014.csv',
 'weedprices28082014.csv',
 'weedprices19052014.csv',
 'weedprices10042014.csv',
 'weedprices20052015.csv',
 'weedprices22082014.csv',
 'weedprices19112014.csv',
 'weedprices26102014.csv',
 'weedprices24042014.csv',
 'weedprices27092014.csv',
 'weedprices11102014.csv',
 'weedprices28032014.csv',
 'weedprices09042014.csv',
 'weedprices19012015.csv',
 'weedprices04122014.csv',
 'weedprices19092014.csv',
 'weedprices14022014.csv',
 'weedprices26062015.csv',
 'weedprices30112014.csv',
 'weedprices14102014.csv',
 'weedprices10032014.csv',
 'weedprices27032014.csv',
 'weedprices03072014.csv',
 'weedprices06112014.csv',
 'weedprices20012015.csv',
 'weedprices24072014.csv',
 'weedprices27012014.csv',
 'weedprices30062014.csv',
 'weedprices30042014.csv',
 'weedprices21072015.csv',
 'weedprices06042014.csv',
 'weedprices18052014.csv',
 'weedprices14122014.csv',
 'weedprices29072014.csv',
 'weedprices22022014.csv',
 'weedprices18032014.csv',
 'weedprices31072014.csv',
 'weedprices18112014.csv',
 'weedprices29062014.csv',
 'weedprices25082014.csv',
 'weedprices24012014.csv',
 'weedprices25112014.csv',
 'weedprices06062015.csv',
 'weedprices13032014.csv',
 'weedprices05012014.csv',
 'weedprices29042014.csv',
 'weedprices09022014.csv',
 'weedprices17032014.csv',
 'weedprices03012015.csv',
 'weedprices14062014.csv',
 'weedprices25062015.csv',
 'weedprices01082014.csv',
 'weedprices21052014.csv',
 'weedprices21012014.csv',
 'weedprices19102014.csv',
 'weedprices13062015.csv',
 'weedprices16012014.csv',
 'weedprices03072015.csv',
 'weedprices10092014.csv',
 'weedprices20032014.csv',
 'weedprices18042014.csv',
 'weedprices07072015.csv',
 'weedprices01012014.csv',
 'weedprices27062014.csv',
 'weedprices07022015.csv',
 'weedprices07062015.csv',
 'weedprices27122013.csv',
 'weedprices01052014.csv',
 'weedprices08072014.csv',
 'weedprices02012015.csv',
 'weedprices10012014.csv',
 'weedprices04012015.csv',
 'weedprices12122014.csv',
 'weedprices29122013.csv',
 'weedprices08032014.csv',
 'weedprices15092014.csv',
 'weedprices31012014.csv',
 'weedprices06092014.csv',
 'weedprices27052015.csv',
 'weedprices24012015.csv',
 'weedprices29012015.csv',
 'weedprices19122014.csv',
 'weedprices31012015.csv',
 'weedprices11022014.csv',
 'weedprices14072015.csv',
 'weedprices19032014.csv',
 'weedprices10062015.csv',
 'weedprices23012015.csv',
 'weedprices01022015.csv',
 'weedprices30052015.csv',
 'weedprices05062014.csv',
 'weedprices24092014.csv',
 'weedprices28112014.csv',
 'weedprices08012014.csv',
 'weedprices09052014.csv',
 'weedprices12022015.csv',
 'weedprices06072015.csv',
 'weedprices28052015.csv',
 'weedprices16092014.csv',
 'weedprices26022014.csv',
 'weedprices20062014.csv',
 'weedprices25092014.csv',
 'weedprices29112014.csv',
 'weedprices08052014.csv',
 'weedprices19072015.csv',
 'weedprices13122014.csv',
 'weedprices05112014.csv',
 'weedprices12032014.csv',
 'weedprices25012015.csv',
 'weedprices15062015.csv',
 'weedprices07022014.csv',
 'weedprices05062015.csv',
 'weedprices12052015.csv',
 'weedprices03032014.csv',
 'weedprices15012015.csv',
 'weedprices10062014.csv',
 'weedprices28122014.csv',
 'weedprices22072015.csv',
 'weedprices03052014.csv',
 'weedprices01032014.csv',
 'weedprices20052014.csv',
 'weedprices09092014.csv',
 'weedprices28022014.csv',
 'weedprices30122013.csv',
 'weedprices18082014.csv',
 'weedprices29062015.csv',
 'weedprices27102014.csv',
 'weedprices12092014.csv',
 'weedprices29092014.csv',
 'weedprices12062014.csv',
 'weedprices21062014.csv',
 'weedprices22032014.csv',
 'weedprices12102014.csv',
 'weedprices11082014.csv',
 'weedprices04092014.csv',
 'weedprices30012014.csv',
 'weedprices12072015.csv',
 'weedprices14052015.csv',
 'weedprices01102014.csv',
 'weedprices17072015.csv',
 'weedprices08122014.csv',
 'weedprices13062014.csv',
 'weedprices10022014.csv',
 'weedprices02042014.csv',
 'weedprices30082014.csv',
 'weedprices16062014.csv',
 'weedprices27122014.csv',
 'weedprices29052015.csv',
 'weedprices23032014.csv',
 'weedprices09012015.csv',
 'weedprices23072015.csv',
 'weedprices05022015.csv',
 'weedprices19012014.csv',
 'weedprices18092014.csv',
 'weedprices17122014.csv',
 'weedprices02112014.csv',
 'weedprices18022014.csv',
 'weedprices01062014.csv',
 'weedprices05072014.csv',
 'weedprices12022014.csv',
 'weedprices24062014.csv',
 'weedprices15112014.csv',
 'weedprices05092014.csv',
 'weedprices13012014.csv',
 'weedprices11062015.csv',
 'weedprices08082014.csv',
 'weedprices05122014.csv',
 'weedprices04052014.csv',
 'weedprices22112014.csv',
 'weedprices05052014.csv',
 'weedprices01122014.csv',
 'weedprices20082014.csv',
 'weedprices13112014.csv',
 'weedprices28072014.csv',
 'weedprices23062015.csv',
 'weedprices09062014.csv',
 'weedprices23042014.csv',
 'weedprices06012015.csv',
 'weedprices06032014.csv',
 'weedprices13052014.csv',
 'weedprices31082014.csv',
 'weedprices08112014.csv',
 'weedprices02102014.csv',
 'weedprices13072015.csv',
 'weedprices04032014.csv',
 'weedprices04012014.csv',
 'weedprices13042014.csv',
 'weedprices27042014.csv',
 'weedprices23052015.csv',
 'weedprices24112014.csv',
 'weedprices27112014.csv',
 'weedprices22092014.csv',
 'weedprices08042014.csv',
 'weedprices08022014.csv',
 'weedprices28012014.csv',
 'weedprices16052015.csv',
 'weedprices21082014.csv',
 'weedprices22012014.csv',
 'weedprices15122014.csv',
 'weedprices15042014.csv',
 'weedprices05072015.csv',
 'weedprices26012015.csv',
 'weedprices16012015.csv',
 'weedprices07122014.csv',
 'weedprices03062014.csv',
 'weedprices22062014.csv',
 'weedprices17062014.csv',
 'weedprices15032014.csv',
 'weedprices02092014.csv',
 'weedprices13102014.csv',
 'weedprices14062015.csv',
 'weedprices21112014.csv',
 'weedprices23062014.csv',
 'weedprices25102014.csv',
 'weedprices29052014.csv',
 'weedprices05012015.csv',
 'weedprices06052014.csv',
 'weedprices22052014.csv',
 'weedprices30052014.csv',
 'weedprices01072015.csv',
 'weedprices21072014.csv',
 'weedprices30122014.csv',
 'weedprices01062015.csv',
 'weedprices27062015.csv',
 'weedprices07032014.csv',
 'weedprices13012015.csv',
 'weedprices09072014.csv',
 'weedprices07112014.csv',
 'weedprices06072014.csv',
 'weedprices11112014.csv',
 'weedprices28092014.csv',
 'weedprices26042014.csv',
 'weedprices21012015.csv',
 'weedprices02122014.csv',
 'weedprices04022015.csv',
 'weedprices19052015.csv',
 'weedprices26122014.csv',
 'weedprices21032014.csv',
 'weedprices16022015.csv',
 'weedprices07012014.csv',
 'weedprices10122014.csv',
 'weedprices14112014.csv',
 'weedprices11052015.csv',
 'weedprices06022015.csv',
 'weedprices23102014.csv',
 'weedprices31122013.csv',
 'weedprices25072014.csv',
 'weedprices17102014.csv',
 'weedprices30102014.csv',
 'weedprices15022014.csv',
 'weedprices19072014.csv',
 'weedprices16072014.csv',
 'weedprices17022014.csv',
 'weedprices14082014.csv',
 'weedprices14012014.csv',
 'weedprices30062015.csv',
 'weedprices13092014.csv',
 'weedprices20022014.csv',
 'weedprices06082014.csv',
 'weedprices02062014.csv',
 'weedprices10052014.csv',
 'weedprices14072014.csv',
 'weedprices07062014.csv',
 'weedprices02032014.csv',
 'weedprices22102014.csv',
 'weedprices31052014.csv',
 'weedprices16072015.csv',
 'weedprices04062014.csv',
 'weedprices07082014.csv',
 'weedprices24062015.csv',
 'weedprices20072014.csv',
 'weedprices12042014.csv',
 'weedprices04082014.csv',
 'weedprices12012015.csv',
 'weedprices09082014.csv',
 'weedprices23072014.csv',
 'weedprices21052015.csv',
 'weedprices08102014.csv',
 'weedprices18072015.csv',
 'weedprices12072014.csv',
 'weedprices10012015.csv',
 'weedprices15072014.csv',
 'weedprices16032014.csv',
 'weedprices30092014.csv',
 'weedprices06102014.csv',
 'weedprices25122014.csv',
 'weedprices09022015.csv',
 'weedprices05022014.csv',
 'weedprices21102014.csv',
 'weedprices12082014.csv',
 'weedprices23112014.csv',
 'weedprices10072014.csv',
 'weedprices16122014.csv',
 'weedprices14042014.csv',
 'weedprices28062015.csv',
 'weedprices21092014.csv',
 'weedprices13022015.csv',
 'weedprices23022014.csv',
 'weedprices26032014.csv',
 'weedprices17052015.csv',
 'weedprices02052014.csv',
 'weedprices26112014.csv',
 'weedprices15102014.csv',
 'weedprices28052014.csv',
 'weedprices29082014.csv',
 'weedprices10052015.csv',
 'weedprices23052014.csv',
 'weedprices03082014.csv',
 'weedprices19062014.csv',
 'weedprices28012015.csv',
 'weedprices04112014.csv',
 'weedprices13052015.csv',
 'weedprices12062015.csv',
 'weedprices16022014.csv',
 'weedprices11022015.csv',
 'weedprices31122014.csv',
 'weedprices12112014.csv',
 'weedprices24082014.csv',
 'weedprices11122014.csv',
 'weedprices15062014.csv',
 'weedprices16052014.csv',
 'weedprices25052015.csv',
 'weedprices01042014.csv',
 'weedprices09062015.csv',
 'weedprices30012015.csv',
 'weedprices17062015.csv',
 'weedprices03042014.csv',
 'weedprices20072015.csv',
 'weedprices08072015.csv',
 'weedprices20112014.csv',
 'weedprices15072015.csv',
 'weedprices03022015.csv',
 'weedprices30072014.csv',
 'weedprices18012014.csv',
 'weedprices22042014.csv',
 'weedprices05032014.csv',
 'weedprices24052014.csv',
 'weedprices19062015.csv',
 'weedprices05082014.csv',
 'weedprices20092014.csv',
 'weedprices18052015.csv',
 'weedprices04102014.csv',
 'weedprices04072015.csv',
 'weedprices07072014.csv',
 'weedprices03092014.csv',
 'weedprices22052015.csv',
 'weedprices02062015.csv',
 'weedprices09102014.csv',
 'weedprices11092014.csv',
 'weedprices20012014.csv',
 'weedprices18062015.csv',
 'weedprices28122013.csv',
 'weedprices01092014.csv',
 'weedprices17072014.csv',
 'weedprices09122014.csv',
 'weedprices29122014.csv',
 'weedprices10022015.csv',
 'weedprices03062015.csv',
 'weedprices10072015.csv',
 'weedprices23012014.csv',
 'weedprices24102014.csv',
 'weedprices17012015.csv',
 'weedprices26052014.csv',
 'weedprices26082014.csv',
 'weedprices17112014.csv',
 'weedprices01112014.csv',
 'weedprices23092014.csv',
 'weedprices17052014.csv',
 'weedprices31102014.csv',
 'weedprices05102014.csv',
 'weedprices15052015.csv',
 'weedprices07092014.csv',
 'weedprices11012014.csv',
 'weedprices24052015.csv',
 'weedprices25042014.csv',
 'weedprices19082014.csv',
 'weedprices16082014.csv',
 'weedprices26052015.csv',
 'weedprices04062015.csv',
 'weedprices20122014.csv',
 'weedprices01012015.csv',
 'weedprices14092014.csv',
 'weedprices19042014.csv',
 'weedprices12052014.csv',
 'weedprices02072014.csv',
 'weedprices17012014.csv',
 'weedprices27052014.csv',
 'weedprices15052014.csv',
 'weedprices14032014.csv',
 'weedprices16102014.csv',
 'weedprices14052014.csv',
 'weedprices18012015.csv',
 'weedprices20102014.csv',
 'weedprices26012014.csv',
 'weedprices23082014.csv',
 'weedprices26072014.csv',
 'weedprices02012014.csv',
 'weedprices24122014.csv',
 'weedprices15012014.csv',
 'weedprices03102014.csv']
In [37]:
# Ouverture du premier fichier pour voir sa structure :
df = pd.read_csv(os.path.join('data', 'prices', files[0]))

df.head()
Out[37]:
State HighQ HighQN MedQ MedQN LowQ LowQN
0 Alabama 340.21 1391 203.97 1254 145.79 160
1 Alaska 289.81 327 260.97 404 403.33 33
2 Arizona 300.50 2407 209.48 2139 186.92 281
3 Arkansas 347.19 752 180.98 722 126.20 134
4 California 245.33 15081 191.63 16853 187.83 987
In [38]:
# Avant de concaténer les différents fichiers, on devra ajouter la date à chaque de ces fichiers :
date_file = files[0].replace('weedprices', '').replace('.csv', '') # 'weedprices13082014.csv' -> '13082014'
date_file = '-'.join([date_file[:2], date_file[2:4], date_file[4:]]) # '13082014' -> '13-08-2014'
date_file = pd.to_datetime(date_file, dayfirst=True)
df['date'] = date_file
df.head(3)
Out[38]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
0 Alabama 340.21 1391 203.97 1254 145.79 160 2014-08-13
1 Alaska 289.81 327 260.97 404 403.33 33 2014-08-13
2 Arizona 300.50 2407 209.48 2139 186.92 281 2014-08-13
In [39]:
# On définit une function qui va s'occuper de convertir le nom de fichier en date
def datetime_from_filename(filename):
    date_file = filename.replace('weedprices', '').replace('.csv', '')
    return pd.to_datetime(
        '-'.join([date_file[:2], date_file[2:4], date_file[4:]]),
        dayfirst=True
        )
In [40]:
# On va concaténer ces DataFrame de manière séquentielle :
dfs = []
for name in files:
    df = pd.read_csv(os.path.join('data', 'prices', name))
    df['date'] = datetime_from_filename(name)
    dfs.append(df)

result = pd.concat(dfs, ignore_index=True)
In [41]:
result.head()
result.tail()
result.dtypes
Out[41]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
0 Alabama 340.21 1391 203.97 1254 145.79 160 2014-08-13
1 Alaska 289.81 327 260.97 404 403.33 33 2014-08-13
2 Arizona 300.5 2407 209.48 2139 186.92 281 2014-08-13
3 Arkansas 347.19 752 180.98 722 126.2 134 2014-08-13
4 California 245.33 15081 191.63 16853 187.83 987 2014-08-13
Out[41]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
24985 Virginia 366.49 3216 291.63 2785 I feel bad for these guys --> 262 2014-10-03
24986 Washington 233.19 3082 189.52 3238 I feel bad for these guys --> 145 2014-10-03
24987 West Virginia 354.97 506 216.72 494 I feel bad for these guys --> 57 2014-10-03
24988 Wisconsin 352.31 2073 272.68 1979 I feel bad for these guys --> 153 2014-10-03
24989 Wyoming 328.51 119 355.77 172 I feel bad for these guys --> 12 2014-10-03
Out[41]:
State             object
HighQ             object
HighQN            object
MedQ              object
MedQN             object
LowQ              object
LowQN             object
date      datetime64[ns]
dtype: object
Il va falloir nettoyer les données et effectuer des conversions de types pour les manipuler et les analyser par la suite
In [42]:
# Le type des éléments dans ces colones est hétérogène (float, str, d'où la notation 'object')
result['HighQ'] = result['HighQ'].apply(lambda x: float(x.replace('$', '')) if isinstance(x, str) else x)
result['MedQ'] = result['MedQ'].apply(lambda x: float(x.replace('$', '')) if isinstance(x, str) else x)

# Cette colonne contient des valeurs inutilisables, on va les convertir en une valeur
# numérique qui à ce sens : NaN
result['LowQ'] = result['LowQ'].apply(lambda x: x.replace('$', '') if isinstance(x, str) else x)
result['LowQ'] = pd.to_numeric(result['LowQ'], errors='coerce')

# Les autre colonnes contiennent des entiers et ne présentent pas de problèmes
result['HighQN'] = result['HighQN'].astype(int)
result['MedQN'] = result['MedQN'].astype(int)
result['LowQN'] = result['LowQN'].astype(int)
In [43]:
result.head()
result.tail()
result.dtypes
Out[43]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
0 Alabama 340.21 1391 203.97 1254 145.79 160 2014-08-13
1 Alaska 289.81 327 260.97 404 403.33 33 2014-08-13
2 Arizona 300.50 2407 209.48 2139 186.92 281 2014-08-13
3 Arkansas 347.19 752 180.98 722 126.20 134 2014-08-13
4 California 245.33 15081 191.63 16853 187.83 987 2014-08-13
Out[43]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
24985 Virginia 366.49 3216 291.63 2785 NaN 262 2014-10-03
24986 Washington 233.19 3082 189.52 3238 NaN 145 2014-10-03
24987 West Virginia 354.97 506 216.72 494 NaN 57 2014-10-03
24988 Wisconsin 352.31 2073 272.68 1979 NaN 153 2014-10-03
24989 Wyoming 328.51 119 355.77 172 NaN 12 2014-10-03
Out[43]:
State             object
HighQ            float64
HighQN             int64
MedQ             float64
MedQN              int64
LowQ             float64
LowQN              int64
date      datetime64[ns]
dtype: object

Quelques opérations de base sur notre table de données :

In [44]:
# Obtenir des lignes spécifiques :
result[2:4]
Out[44]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
2 Arizona 300.50 2407 209.48 2139 186.92 281 2014-08-13
3 Arkansas 347.19 752 180.98 722 126.20 134 2014-08-13
In [45]:
# Obtenir une colonne :
result['State']
Out[45]:
0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
24960                Nebraska
24961                  Nevada
24962           New Hampshire
24963              New Jersey
24964              New Mexico
24965                New York
24966          North Carolina
24967            North Dakota
24968                    Ohio
24969                Oklahoma
24970                  Oregon
24971                Maryland
24972           Massachusetts
24973                Michigan
24974               Minnesota
24975             Mississippi
24976                Missouri
24977            Pennsylvania
24978            Rhode Island
24979          South Carolina
24980            South Dakota
24981               Tennessee
24982                   Texas
24983                    Utah
24984                 Vermont
24985                Virginia
24986              Washington
24987           West Virginia
24988               Wisconsin
24989                 Wyoming
Name: State, Length: 24990, dtype: object
In [46]:
result.loc[0]
Out[46]:
State                 Alabama
HighQ                  340.21
HighQN                   1391
MedQ                   203.97
MedQN                    1254
LowQ                   145.79
LowQN                     160
date      2014-08-13 00:00:00
Name: 0, dtype: object
In [47]:
# Index et nom de colonne
result.loc[0, 'State']
Out[47]:
'Alabama'
In [48]:
# Index et index de colonne
result.iloc[0, 0]
Out[48]:
'Alabama'
In [49]:
# Sélection de lignes où le prix pour l'Alabama est inférieur à 340$ pour la plus haute qualité
# - en utilisant l'indexation booléenne et la combinaison des deux critères
result[(result.State == 'Alabama') & (result.HighQ < 340.0)]
Out[49]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
51 Alabama 335.86 1584 209.19 1525 NaN 193 2015-02-02
153 Alabama 335.45 1592 209.87 1542 NaN 195 2015-02-08
204 Alabama 337.40 1890 206.12 1808 NaN 218 2015-06-20
306 Alabama 336.04 1573 209.45 1506 NaN 188 2015-01-22
561 Alabama 339.20 1043 198.64 933 148.48 124 2014-01-03
612 Alabama 337.29 1888 206.17 1805 NaN 218 2015-05-31
918 Alabama 339.36 1430 204.52 1318 NaN 164 2014-10-10
1020 Alabama 336.91 1553 209.09 1478 NaN 184 2015-01-11
1122 Alabama 339.16 1100 200.16 987 152.02 129 2014-02-03
1224 Alabama 338.62 1448 205.01 1342 NaN 171 2014-10-29
1275 Alabama 339.55 1497 205.81 1401 NaN 179 2014-12-03
1326 Alabama 338.87 1097 199.89 985 152.02 129 2014-02-02
1479 Alabama 339.87 1119 201.16 1005 146.19 131 2014-02-13
1530 Alabama 339.36 1430 204.39 1315 NaN 164 2014-10-07
1581 Alabama 339.31 1455 204.56 1350 NaN 172 2014-11-03
1632 Alabama 335.45 1602 209.86 1553 NaN 197 2015-02-14
1683 Alabama 335.79 1578 209.00 1519 NaN 193 2015-01-27
1938 Alabama 337.32 1889 206.09 1807 NaN 218 2015-06-08
2091 Alabama 339.21 1464 204.36 1360 NaN 174 2014-11-09
2142 Alabama 339.42 1467 204.52 1362 NaN 174 2014-11-10
2244 Alabama 339.20 1102 201.10 991 152.02 129 2014-02-04
2295 Alabama 339.23 1501 205.62 1407 NaN 180 2014-12-06
2346 Alabama 337.78 1895 206.29 1812 NaN 218 2015-07-09
2397 Alabama 338.62 1448 204.95 1340 NaN 171 2014-10-28
2652 Alabama 339.81 1106 200.68 995 152.02 129 2014-02-06
2754 Alabama 339.06 1526 207.77 1443 NaN 181 2014-12-23
2805 Alabama 335.27 1606 209.74 1554 NaN 197 2015-02-15
2958 Alabama 337.32 1889 206.12 1808 NaN 218 2015-06-16
3009 Alabama 337.40 1890 206.12 1808 NaN 218 2015-06-21
3111 Alabama 338.86 1436 204.34 1327 NaN 165 2014-10-18
... ... ... ... ... ... ... ... ...
22185 Alabama 337.63 1873 206.38 1800 NaN 216 2015-05-22
22236 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-02
22287 Alabama 339.36 1430 204.39 1315 NaN 164 2014-10-09
22440 Alabama 337.32 1889 206.12 1808 NaN 218 2015-06-18
22491 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-28
22644 Alabama 339.28 1508 205.76 1415 NaN 180 2014-12-09
22695 Alabama 338.38 1534 207.75 1455 NaN 182 2014-12-29
22746 Alabama 335.48 1594 210.05 1546 NaN 196 2015-02-10
22797 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-03
22848 Alabama 337.90 1896 206.23 1813 NaN 218 2015-07-10
22899 Alabama 339.13 1076 199.38 971 152.02 129 2014-01-23
22950 Alabama 338.47 1444 204.63 1332 NaN 167 2014-10-24
23001 Alabama 336.41 1564 209.37 1495 NaN 186 2015-01-17
23154 Alabama 339.11 1472 205.37 1377 NaN 177 2014-11-17
23205 Alabama 339.13 1453 204.81 1348 NaN 172 2014-11-01
23358 Alabama 338.92 1451 204.95 1347 NaN 171 2014-10-31
23409 Alabama 339.52 1429 204.33 1314 NaN 163 2014-10-05
23460 Alabama 337.89 1836 207.01 1776 NaN 213 2015-05-15
23613 Alabama 337.49 1878 206.38 1800 NaN 216 2015-05-24
23817 Alabama 337.21 1882 206.26 1803 NaN 217 2015-05-26
23868 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-04
23919 Alabama 339.19 1520 207.10 1435 NaN 181 2014-12-20
23970 Alabama 337.54 1539 208.24 1463 NaN 182 2015-01-01
24429 Alabama 338.86 1436 204.34 1327 NaN 165 2014-10-16
24531 Alabama 336.50 1565 209.30 1496 NaN 187 2015-01-18
24582 Alabama 338.81 1439 204.29 1328 NaN 166 2014-10-20
24633 Alabama 338.95 1080 199.39 972 152.02 129 2014-01-26
24786 Alabama 339.20 1043 198.64 933 149.49 123 2014-01-02
24837 Alabama 339.03 1527 207.77 1444 NaN 182 2014-12-24
24939 Alabama 339.52 1429 204.33 1312 NaN 162 2014-10-03

256 rows × 8 columns

In [50]:
# Sélection de lignes où le prix pour l'Alabama est inférieur à 340$ pour la plus haute qualité
# - en utilisant l'indexation booléenne et la méthode .eval() pour
# combiner nos 2 critères 
result[result.eval('State == "Alabama" and HighQ < 340.0')]
Out[50]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
51 Alabama 335.86 1584 209.19 1525 NaN 193 2015-02-02
153 Alabama 335.45 1592 209.87 1542 NaN 195 2015-02-08
204 Alabama 337.40 1890 206.12 1808 NaN 218 2015-06-20
306 Alabama 336.04 1573 209.45 1506 NaN 188 2015-01-22
561 Alabama 339.20 1043 198.64 933 148.48 124 2014-01-03
612 Alabama 337.29 1888 206.17 1805 NaN 218 2015-05-31
918 Alabama 339.36 1430 204.52 1318 NaN 164 2014-10-10
1020 Alabama 336.91 1553 209.09 1478 NaN 184 2015-01-11
1122 Alabama 339.16 1100 200.16 987 152.02 129 2014-02-03
1224 Alabama 338.62 1448 205.01 1342 NaN 171 2014-10-29
1275 Alabama 339.55 1497 205.81 1401 NaN 179 2014-12-03
1326 Alabama 338.87 1097 199.89 985 152.02 129 2014-02-02
1479 Alabama 339.87 1119 201.16 1005 146.19 131 2014-02-13
1530 Alabama 339.36 1430 204.39 1315 NaN 164 2014-10-07
1581 Alabama 339.31 1455 204.56 1350 NaN 172 2014-11-03
1632 Alabama 335.45 1602 209.86 1553 NaN 197 2015-02-14
1683 Alabama 335.79 1578 209.00 1519 NaN 193 2015-01-27
1938 Alabama 337.32 1889 206.09 1807 NaN 218 2015-06-08
2091 Alabama 339.21 1464 204.36 1360 NaN 174 2014-11-09
2142 Alabama 339.42 1467 204.52 1362 NaN 174 2014-11-10
2244 Alabama 339.20 1102 201.10 991 152.02 129 2014-02-04
2295 Alabama 339.23 1501 205.62 1407 NaN 180 2014-12-06
2346 Alabama 337.78 1895 206.29 1812 NaN 218 2015-07-09
2397 Alabama 338.62 1448 204.95 1340 NaN 171 2014-10-28
2652 Alabama 339.81 1106 200.68 995 152.02 129 2014-02-06
2754 Alabama 339.06 1526 207.77 1443 NaN 181 2014-12-23
2805 Alabama 335.27 1606 209.74 1554 NaN 197 2015-02-15
2958 Alabama 337.32 1889 206.12 1808 NaN 218 2015-06-16
3009 Alabama 337.40 1890 206.12 1808 NaN 218 2015-06-21
3111 Alabama 338.86 1436 204.34 1327 NaN 165 2014-10-18
... ... ... ... ... ... ... ... ...
22185 Alabama 337.63 1873 206.38 1800 NaN 216 2015-05-22
22236 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-02
22287 Alabama 339.36 1430 204.39 1315 NaN 164 2014-10-09
22440 Alabama 337.32 1889 206.12 1808 NaN 218 2015-06-18
22491 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-28
22644 Alabama 339.28 1508 205.76 1415 NaN 180 2014-12-09
22695 Alabama 338.38 1534 207.75 1455 NaN 182 2014-12-29
22746 Alabama 335.48 1594 210.05 1546 NaN 196 2015-02-10
22797 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-03
22848 Alabama 337.90 1896 206.23 1813 NaN 218 2015-07-10
22899 Alabama 339.13 1076 199.38 971 152.02 129 2014-01-23
22950 Alabama 338.47 1444 204.63 1332 NaN 167 2014-10-24
23001 Alabama 336.41 1564 209.37 1495 NaN 186 2015-01-17
23154 Alabama 339.11 1472 205.37 1377 NaN 177 2014-11-17
23205 Alabama 339.13 1453 204.81 1348 NaN 172 2014-11-01
23358 Alabama 338.92 1451 204.95 1347 NaN 171 2014-10-31
23409 Alabama 339.52 1429 204.33 1314 NaN 163 2014-10-05
23460 Alabama 337.89 1836 207.01 1776 NaN 213 2015-05-15
23613 Alabama 337.49 1878 206.38 1800 NaN 216 2015-05-24
23817 Alabama 337.21 1882 206.26 1803 NaN 217 2015-05-26
23868 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-04
23919 Alabama 339.19 1520 207.10 1435 NaN 181 2014-12-20
23970 Alabama 337.54 1539 208.24 1463 NaN 182 2015-01-01
24429 Alabama 338.86 1436 204.34 1327 NaN 165 2014-10-16
24531 Alabama 336.50 1565 209.30 1496 NaN 187 2015-01-18
24582 Alabama 338.81 1439 204.29 1328 NaN 166 2014-10-20
24633 Alabama 338.95 1080 199.39 972 152.02 129 2014-01-26
24786 Alabama 339.20 1043 198.64 933 149.49 123 2014-01-02
24837 Alabama 339.03 1527 207.77 1444 NaN 182 2014-12-24
24939 Alabama 339.52 1429 204.33 1312 NaN 162 2014-10-03

256 rows × 8 columns

In [51]:
# Sélection de lignes où le prix pour l'Alabama est inférieur à 340$ pour la plus haute qualité
# - en la méthode .query() qui permet de chainer les critères
result.query('State == "Alabama"').query('HighQ < 340.0')
Out[51]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
51 Alabama 335.86 1584 209.19 1525 NaN 193 2015-02-02
153 Alabama 335.45 1592 209.87 1542 NaN 195 2015-02-08
204 Alabama 337.40 1890 206.12 1808 NaN 218 2015-06-20
306 Alabama 336.04 1573 209.45 1506 NaN 188 2015-01-22
561 Alabama 339.20 1043 198.64 933 148.48 124 2014-01-03
612 Alabama 337.29 1888 206.17 1805 NaN 218 2015-05-31
918 Alabama 339.36 1430 204.52 1318 NaN 164 2014-10-10
1020 Alabama 336.91 1553 209.09 1478 NaN 184 2015-01-11
1122 Alabama 339.16 1100 200.16 987 152.02 129 2014-02-03
1224 Alabama 338.62 1448 205.01 1342 NaN 171 2014-10-29
1275 Alabama 339.55 1497 205.81 1401 NaN 179 2014-12-03
1326 Alabama 338.87 1097 199.89 985 152.02 129 2014-02-02
1479 Alabama 339.87 1119 201.16 1005 146.19 131 2014-02-13
1530 Alabama 339.36 1430 204.39 1315 NaN 164 2014-10-07
1581 Alabama 339.31 1455 204.56 1350 NaN 172 2014-11-03
1632 Alabama 335.45 1602 209.86 1553 NaN 197 2015-02-14
1683 Alabama 335.79 1578 209.00 1519 NaN 193 2015-01-27
1938 Alabama 337.32 1889 206.09 1807 NaN 218 2015-06-08
2091 Alabama 339.21 1464 204.36 1360 NaN 174 2014-11-09
2142 Alabama 339.42 1467 204.52 1362 NaN 174 2014-11-10
2244 Alabama 339.20 1102 201.10 991 152.02 129 2014-02-04
2295 Alabama 339.23 1501 205.62 1407 NaN 180 2014-12-06
2346 Alabama 337.78 1895 206.29 1812 NaN 218 2015-07-09
2397 Alabama 338.62 1448 204.95 1340 NaN 171 2014-10-28
2652 Alabama 339.81 1106 200.68 995 152.02 129 2014-02-06
2754 Alabama 339.06 1526 207.77 1443 NaN 181 2014-12-23
2805 Alabama 335.27 1606 209.74 1554 NaN 197 2015-02-15
2958 Alabama 337.32 1889 206.12 1808 NaN 218 2015-06-16
3009 Alabama 337.40 1890 206.12 1808 NaN 218 2015-06-21
3111 Alabama 338.86 1436 204.34 1327 NaN 165 2014-10-18
... ... ... ... ... ... ... ... ...
22185 Alabama 337.63 1873 206.38 1800 NaN 216 2015-05-22
22236 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-02
22287 Alabama 339.36 1430 204.39 1315 NaN 164 2014-10-09
22440 Alabama 337.32 1889 206.12 1808 NaN 218 2015-06-18
22491 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-28
22644 Alabama 339.28 1508 205.76 1415 NaN 180 2014-12-09
22695 Alabama 338.38 1534 207.75 1455 NaN 182 2014-12-29
22746 Alabama 335.48 1594 210.05 1546 NaN 196 2015-02-10
22797 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-03
22848 Alabama 337.90 1896 206.23 1813 NaN 218 2015-07-10
22899 Alabama 339.13 1076 199.38 971 152.02 129 2014-01-23
22950 Alabama 338.47 1444 204.63 1332 NaN 167 2014-10-24
23001 Alabama 336.41 1564 209.37 1495 NaN 186 2015-01-17
23154 Alabama 339.11 1472 205.37 1377 NaN 177 2014-11-17
23205 Alabama 339.13 1453 204.81 1348 NaN 172 2014-11-01
23358 Alabama 338.92 1451 204.95 1347 NaN 171 2014-10-31
23409 Alabama 339.52 1429 204.33 1314 NaN 163 2014-10-05
23460 Alabama 337.89 1836 207.01 1776 NaN 213 2015-05-15
23613 Alabama 337.49 1878 206.38 1800 NaN 216 2015-05-24
23817 Alabama 337.21 1882 206.26 1803 NaN 217 2015-05-26
23868 Alabama 337.32 1889 206.14 1806 NaN 218 2015-06-04
23919 Alabama 339.19 1520 207.10 1435 NaN 181 2014-12-20
23970 Alabama 337.54 1539 208.24 1463 NaN 182 2015-01-01
24429 Alabama 338.86 1436 204.34 1327 NaN 165 2014-10-16
24531 Alabama 336.50 1565 209.30 1496 NaN 187 2015-01-18
24582 Alabama 338.81 1439 204.29 1328 NaN 166 2014-10-20
24633 Alabama 338.95 1080 199.39 972 152.02 129 2014-01-26
24786 Alabama 339.20 1043 198.64 933 149.49 123 2014-01-02
24837 Alabama 339.03 1527 207.77 1444 NaN 182 2014-12-24
24939 Alabama 339.52 1429 204.33 1312 NaN 162 2014-10-03

256 rows × 8 columns

In [52]:
# Export au format CSV pour le réutiliser par la suite :
result.to_csv('data/weed_prices.csv', index=False)

Seaborn

Seaborn propose une API de haut-niveau (basée sur matplotlib en interne) permettant d'utiliser directement des DataFrame de pandas.

In [53]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (12, 10)

# En continuant d'utiliser les données ouvertes précédemment..
# représentons l'évolution du prix en Californie
df_cal = result[result["State"] == "California"]
df_cal.head()
df_cal.plot(x = "date", y = "HighQ")  # plot est ici une méthode de l'objet DataFrame
Out[53]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
4 California 245.33 15081 191.63 16853 187.83 987 2014-08-13
55 California 243.29 16916 188.94 19780 NaN 1123 2015-02-02
106 California 245.76 14084 192.30 15463 190.09 918 2014-05-25
157 California 243.13 16995 188.86 19889 NaN 1133 2015-02-08
208 California 241.93 18504 187.89 22036 NaN 1232 2015-06-20
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29af44a6a0>
In [54]:
# Pour les 3 catégories ?
df_cal[["HighQ", "MedQ", "LowQ", "date"]].plot(x='date')
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29af475978>
In [55]:
# Sous forme de boite à moustache
df_cal[["HighQ", "MedQ", "LowQ"]].plot(kind = "box", figsize=(6,6))
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29af912f28>
In [56]:
# En utilisant directement l'API de Seaborn
sns.boxplot(data=df_cal[["HighQ", "MedQ", "LowQ"]], palette='Set3')
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29af7c0f60>
In [57]:
# Seaborn propose de tracer assez facilement d'autres types de graphiques:
sns.jointplot(data=df_cal, x='HighQ', y='MedQ', kind="reg")
/usr/local/lib/python3.6/dist-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[57]:
<seaborn.axisgrid.JointGrid at 0x7f29af5c4a58>
In [58]:
iris = sns.load_dataset("iris")
g = sns.jointplot("petal_length", "sepal_length", data=iris, color='g',
    marginal_kws=dict(bins=15, rug=True),
    s=40, edgecolor="w", linewidth=1, size=8)
/usr/local/lib/python3.6/dist-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

xarray

  • Permet de manipuler des jeux de données structurés de dimensions arbitraires (pandas sur N dimensions)
In [59]:
import numpy as np
import pandas as pd
import seaborn as sns

import xarray as xr


times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear.values / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

# Un `DataSet` de xarray :
ds = xr.Dataset({
        'tmin': (('time', 'location'), tmin_values),
        'tmax': (('time', 'location'), tmax_values)
    }, {
        'time': times, 'location': ['IA', 'IN', 'IL'],
    })
In [60]:
# Conversion d'une dimension en DataFrame
# pour afficher avec Seaborn :
ds.mean(dim='location').to_dataframe().plot()
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29af6e8780>
In [61]:
## Risque de gel selon le mois et la localité :
freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')
freeze.to_pandas().plot(title="Risque de gel selon le mois de l'année")
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f29af2c0c50>