Dataframe Analytics Reference Book

AAAS24

August 19, 2022

Coding

Project Description

This post documents basic functions for analyzing data with pandas DataFrames.

TABLE OF CONTENTS


- Data Exploration
- Operations with rows and columns
- Examples of groupby
- Useful code snippets
- Filtering Data Operations
- Date Formatting Data Operations
- Chaining Examples
- Dictionary Operations



DataFrame Operations

Data Exploration

| Command | Syntax |
| --- | --- |
| Describe | `df.describe()` |
| Number of rows and columns | `df.shape`, where `df.shape[0]` = rows & `df.shape[1]` = columns |
| Column names | `df.columns.values` |
| Column data types | `df.dtypes` |
| Min | `df['COLUMN_NAME'].min()` |
| Max | `df['COLUMN_NAME'].max()` |
| Number of unique values | `len(df['COLUMN_NAME'].unique())` |
| Sorted list of unique values | `sorted(df['COLUMN_NAME'].unique())` |
| Identify the type of a value in a column | `type(df['COLUMN_NAME'][0])` |
| Count per value in a column | `df['COLUMN_NAME'].value_counts(dropna=False)` |
| Transpose | `df.head().T` |
| View selected columns for rows with a missing value | `df.loc[df.COLUMN_NAME.isna(), ['COLUMN_NAME_1', 'COLUMN_NAME_2', 'COLUMN_NAME_3']]` |
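A minimal runnable sketch of a few of these commands, using a made-up toy dataframe (the `city`/`sales` columns are just for illustration):

import pandas as pd

# toy dataframe for illustration
df = pd.DataFrame({'city': ['Austin', 'Dallas', 'Austin', None],
                   'sales': [100, 250, 175, 90]})

print(df.shape)                               # (4, 2) -> 4 rows, 2 columns
print(df.dtypes)                              # data type of each column
print(df['sales'].min(), df['sales'].max())   # 90 250
print(len(df['city'].unique()))               # 3 unique values (None counts)
print(df['city'].value_counts(dropna=False))  # counts per value, incl. NaN
print(df.loc[df.city.isna(), ['sales']])      # rows where 'city' is missing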

Operations with rows and columns

| Command | Syntax |
| --- | --- |
| Dropping a column | `df = df.drop(['COLUMN_NAME'], axis=1)` |
| Adding a column | `df['NEW_COLUMN'] = VALUES` |
| Sum values per row | `df['NEW_COLUMN'] = df.sum(axis=1)` |
| Sum values per column | `df['COLUMN_NAME'].sum()` |
| Cut a dataframe from column A to column D | `new_df = df.loc[:, 'A':'D']` |
| Concatenate multiple dataframes side by side | `df = pd.concat([df1, df2, df3, df4, ..., dfn], axis=1)` |
| Renaming columns | 1) get the current names with `column_names = df.columns.values`; 2) modify the name values in the variable; 3) `df.columns = column_names` |
| Reordering columns | 1) get the current names with `column_names_reordered = df.columns.values`; 2) modify the order of the values in the variable; 3) `df = df[column_names_reordered]` |
| Convert string values in a column to timestamps. Ex: a column `Date` with strings such as `07/15/2018` | `pd.to_datetime(df['Date'], format='%m/%d/%Y')`. For the conventions of the `format` parameter, see the `strftime` documentation |
| Reclassify the values in a column. Ex: in a column with 4 distinct values, combine 2, 3 & 4 as `'B'` and re-label 1 as `'A'` | `df['COLUMN_NAME'] = df['COLUMN_NAME'].map({1: 'A', 2: 'B', 3: 'B', 4: 'B'})` |
| Slice data with `iloc` | `df.iloc[START_ROW:END_ROW, START_COLUMN:END_COLUMN]`. Ex: `df.iloc[:, :-1]` is all rows and all columns but the last one |
| Conditional assignment with `np.where` | `df['COLUMN_NAME'] = np.where(df['COLUMN_NAME'] > CONDITION, TRUE_VALUE, FALSE_VALUE)` |
| Filter data | `df[df['COLUMN_NAME'] == 'VALUE']` |
| Temporarily display more rows and columns | `with pd.option_context('display.max_columns', 22, 'display.min_rows', 10): display(df)` |
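A short sketch tying several of these row and column operations together (the dataframe, column names, and threshold are made up for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({'q1': [1, 2], 'q2': [3, 4], 'q3': [5, 6]})

df['total'] = df.sum(axis=1)   # sum values per row into a new column
print(df['q1'].sum())          # sum of a single column -> 3

# renaming columns: grab the current names, edit them, assign back
column_names = list(df.columns.values)
column_names[-1] = 'year_total'
df.columns = column_names

# reordering columns by indexing with the desired order
df = df[['year_total', 'q1', 'q2', 'q3']]

# np.where: flag rows where q1 exceeds a threshold
df['big_q1'] = np.where(df['q1'] > 1, 'yes', 'no')
print(df)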


Examples of groupby

Use the parameter 'as_index=False' in the groupby so the output keeps the grouping keys as regular columns, at the same level as the input df.

import pandas as pd

# initialize data of lists.
data = {'department': ['shoes', 'clothing', 'shoes', 'furniture']
        , 'product_type':['shoes', 'shirt', 'shoes', 'furniture']
        ,'order_amount': [20, 21, 19, 180]
        ,'taxes_paid':[1.5,2,1.2,10]
        }

# Create DataFrame
df = pd.DataFrame(data)
display(df)

df_result=(df.groupby(['department', 'product_type'], as_index=False)
           .agg({
               'order_amount':'sum'
               ,'taxes_paid':'sum'
           })
)

# Print the output.
display(df_result)

Useful code snippets

  • Find Missing data

For a DataFrame called ‘df’, this shows the percentage of missing values per column, limited to the columns where the ‘.isna()’ function actually detects missing values.

    ((df
    .isna()                           # True where a value is missing
    .mean()                           # fraction of missing values per column
    *100)                             # as a percentage
    .pipe(lambda ser: ser[ser > 0])   # keep only columns with missing values
    )
  • Change format of columns
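
A minimal sketch of changing column formats with .astype() (the dataframe and target dtypes below are made-up examples):

import pandas as pd

df = pd.DataFrame({'price': ['19.99', '5.50'], 'qty': [3, 7]})
print(df.dtypes)  # price is object (string), qty is int64

# cast the string column to float and the integer column to nullable Int64
df = df.astype({'price': 'float64', 'qty': 'Int64'})
print(df.dtypes)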


Filtering Data Operations

| Operation | Example Code |
| --- | --- |
| With AND condition | `df[(df.state == 'Texas') & (df.year == 1990)]` |
| With OR condition | `df[(df.state == 'Texas') \| (df.year == 1990)]` |
| Alternative column-name nomenclature | `df[df.state == 'Texas']` or `df[df['state'] == 'Texas']` |
| With a list of values | `data_filtered = df[df.column_name.isin(list_of_values)]` |
| Using `query` | `df.query('column_name >= 5 and column_name <= 10')` |
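A runnable sketch of these filters on a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'state': ['Texas', 'Ohio', 'Texas'],
                   'year': [1990, 1990, 2000]})

print(df[(df.state == 'Texas') & (df.year == 1990)])  # AND condition
print(df[(df.state == 'Texas') | (df.year == 1990)])  # OR condition
print(df[df.state.isin(['Texas', 'Ohio'])])           # membership in a list
print(df.query('year >= 1990 and year <= 1995'))      # same idea via query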


Date Formatting Data Operations

| Operation | Example Code |
| --- | --- |
| Converting a timestamp string into a date string | `data['dt'] = data['ts'].map(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d'))` |
| Loop to convert several columns to datetime | `for column in columns_to_date: df[column] = pd.to_datetime(df[column], format='%B %d, %Y', errors='coerce')` |
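A self-contained sketch of both conversions (the column names and sample values are made up; note that errors='coerce' turns unparseable values into NaT):

import pandas as pd
from datetime import datetime

data = pd.DataFrame({'ts': ['2018-07-15T08:30:00Z', '2019-01-02T17:45:00Z']})

# parse the ISO-like timestamp string and keep only the date part
data['dt'] = data['ts'].map(
    lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d'))

df = pd.DataFrame({'start_date': ['July 15, 2018', 'not a date'],
                   'end_date': ['August 01, 2018', 'January 02, 2019']})
for column in ['start_date', 'end_date']:
    df[column] = pd.to_datetime(df[column], format='%B %d, %Y', errors='coerce')
print(df.dtypes)  # both columns are now datetime64[ns]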


Chaining Examples

Groupby & Aggregations, Controlling Column Names by Using 'pd.NamedAgg'
See a code example in this workbook.
            
            
            data=(df2   
                [df2.date.dt.year==2017]
                .groupby(df2.state)
                .agg(
                    tot_victims=pd.NamedAgg(column="victims", aggfunc="sum"),
                    num_incidents=pd.NamedAgg(column="incident_id", aggfunc="count")
                )
                .sort_values(by='tot_victims', ascending=False)
                .reset_index()
            )
            
            
Renaming Columns With .set_axis()
            
        dict_sub_region=(
        data.loc[:,['state','subRegion']]
            .set_index('state')
            .set_axis(['sub_region'], axis = 1) #renaming columns
            .to_dict()
        )
            
            
Catching Current State of a DataFrame during Method Chaining
            
            def catchstate(df, var_name: str) -> 'pd.DataFrame':
                """
                Helper function that captures an intermediate DataFrame mid-chain.
                In the global namespace, creates a new variable called var_name
                and sets it to the dataframe.
                """
                globals()[var_name] = df
                return df

            data=(df2_incidents
                .assign(year=df2_incidents.date.dt.year, month=df2_incidents.date.dt.month)
                .assign(sub_region=df2_incidents.state.map(dict_sub_region))
                .pipe(catchstate, var_name="df2")
                [(df2.sub_region!='Mountain')&(df2.year!=2017)]  # filtering out the Las Vegas shooting outlier
                .groupby(by=['year','sub_region'])
                .agg(
                    tot_victims=pd.NamedAgg(column="tot_victims", aggfunc="sum"),
                    num_incidents=pd.NamedAgg(column="incident_id", aggfunc="count")
                ).reset_index()
            )
            data
            
            
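Note that the later reference to df2 inside the boolean mask works because Python evaluates the chained object first, including the .pipe(catchstate, ...) call that creates the global df2, before it evaluates the subscript expression that filters it.
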
Adding time-derived columns based on a date column 'time_col'. Ex: time_col='date_of_birth'
            
            df=df.assign(
                month=lambda df_: df_[time_col].dt.month,
                qtr=lambda df_: df_[time_col].dt.quarter,
                year=lambda df_: df_[time_col].dt.year,
                day=lambda df_: df_[time_col].dt.day_name()
            ).rename(columns={'month': ('').join([time_col, '_month']),
                                'qtr': ('').join([time_col, '_qtr']),
                                'year': ('').join([time_col, '_year']),
                                'day': ('').join([time_col, '_day'])
                        }
                    )
            df.head(5)
            
            
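With time_col='date_of_birth', the rename step turns the generic month/qtr/year/day columns into date_of_birth_month, date_of_birth_qtr, date_of_birth_year, and date_of_birth_day.
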
Cleaning with regex & formatting datetime columns
            
            # cleaning df
            df2 = df.copy()
            d1 = (df2.pop('country')
                .str.replace('\n', '')
                .str.replace(r"(\||\d)", '', regex=True)
                .str.replace('[', '', regex=False)
                .str.replace(']', '', regex=False)
            )
            date_time_cols = ['end_date', 'start_date']
            df2[date_time_cols] = df2[date_time_cols].apply(pd.to_datetime)
            df = pd.concat([df2, d1], axis=1)
            
            
Dictionary Operations
Removing the outer layer of a dictionary
            
            dict_sub_region=(
                data.loc[:,['state','subRegion']]
                .set_index('state')
                .set_axis(['sub_region'], axis=1)  # renaming columns
                .to_dict()
            )
            
            This prints a dictionary like this:
            
            {'sub_region': {'AK': 'Pacific',
            'AL': 'East South Central',
            'AR': 'West South Central',
            'AZ': 'Mountain',
            'CA': 'Pacific',
            'CO': 'Mountain',
            (...)}}
            
            To remove 'sub_region', we use: 
            
            from collections import ChainMap
            res = dict(ChainMap(*dict_sub_region.values()))  # removes the top level of the dictionary
            print(res)
            
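After flattening, the state codes become top-level keys, so res['AK'] returns 'Pacific' directly.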
            
