August 19, 2022
The intent of this post is to document basic pandas functions for analyzing data with DataFrames.
- Data Exploration
- Operations with rows and columns
- Filtering Data Operations
- Chaining examples
- Dictionary Operations
| Command | Syntax |
|---|---|
| Describe | df.describe() |
| Number of Rows and Columns | df.shape, where df.shape[0] = rows and df.shape[1] = columns |
| Names in Columns | df.columns.values |
| Column data types | df.dtypes |
| Min | df['COLUMN_NAME'].min() |
| Max | df['COLUMN_NAME'].max() |
| Number of unique values | len(df['COLUMN_NAME'].unique()) |
| Sorted list of unique values | sorted(df['COLUMN_NAME'].unique()) |
| Identify type of value in a column | type(df['COLUMN_NAME'][0]) |
| Count per value in column | df['COLUMN_NAME'].value_counts(dropna=False) |
| Transpose | df.head().T |
| View column with missing value in rows | df.loc[df.COLUMN_NAME.isna(), ['COLUMN_NAME_1', 'COLUMN_NAME_2', 'COLUMN_NAME_3']] |
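A quick sketch tying several of these exploration commands together. The DataFrame and its columns ('department', 'order_amount') are hypothetical toy data used only for illustration:

```python
import pandas as pd

# Hypothetical toy data, assumed purely for illustration.
df = pd.DataFrame({
    'department': ['shoes', 'clothing', 'shoes', 'furniture'],
    'order_amount': [20, 21, 19, 180],
})

print(df.shape)                               # (4, 2): df.shape[0] rows, df.shape[1] columns
print(df.columns.values)                      # column names
print(df.dtypes)                              # data type per column
print(df['order_amount'].min(), df['order_amount'].max())
print(len(df['department'].unique()))         # number of unique values
print(sorted(df['department'].unique()))      # sorted unique values
print(df['department'].value_counts(dropna=False))  # count per value, including NaN
print(df.head().T)                            # transposed preview
```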
| Command | Syntax |
|---|---|
| Dropping column | df=df.drop(['COLUMN_NAME'], axis=1) |
| Adding column | df['NEW_COLUMN'] = VALUES (a constant, a list, or an expression over other columns) |
| Sum values per row | df['NEW_COLUMN']=df.sum(axis=1) |
| Sum values per column | df['COLUMN_NAME'].sum(axis=0) |
| Cut dataframe from columns A to D | new_df = df.loc[:, 'A':'D'] (label-based; both endpoints included) |
| Concatenate multiple df | df= pd.concat([df1,df2,df3,df4,...,dfn], axis=1) |
| Renaming columns | 1) get the current names with column_names = df.columns.values; 2) modify the name value(s) in the variable; 3) df.columns = column_names |
| Reordering columns | 1) get the current names with column_names_reordered = df.columns.values; 2) modify the order of the values in the variable; 3) df = df[column_names_reordered] |
| Convert values in a column to datetime (e.g., a 'Date' column with strings such as 07/15/2018) | pd.to_datetime(df['Date'], format='%m/%d/%Y'); for the conventions of the format parameter, see the ‘strftime’ documentation |
| Reclassify the values in a column | In a column with 4 distinct values, where we wish to combine 2, 3 & 4 as 'B' and re-label 1 as 'A': df['COLUMN_NAME'] = df['COLUMN_NAME'].map({1:'A', 2:'B', 3:'B', 4:'B'}) |
| Slice data with iloc | df.iloc[START_ROW:END_ROW,START_COLUMN:END_COLUMN] Ex: df.iloc[:,:-1], which is all rows, all columns but the last one |
| Where | df['COLUMN_NAME'] = np.where(df['COLUMN_NAME'] > CONDITION, TRUE_VALUE, FALSE_VALUE) |
| Filter data | df[df['COLUMN_NAME']=='VALUE'] |
| Temporarily display more rows/columns | with pd.option_context('display.max_columns', 22, 'display.min_rows', 10): display(df) |
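A hedged sketch combining a few of the row and column operations above. The column names ('Date', 'Reason', 'Score') and values are assumptions made only for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, assumed purely for illustration.
df = pd.DataFrame({
    'Date': ['07/15/2018', '08/01/2018', '09/30/2018'],
    'Reason': [1, 2, 4],
    'Score': [10, 20, 30],
})

# Convert string dates to datetime (format assumed to be month/day/year).
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Reclassify values: keep 1 as 'A', combine 2, 3 & 4 as 'B'.
df['Reason'] = df['Reason'].map({1: 'A', 2: 'B', 3: 'B', 4: 'B'})

# Conditional column with np.where.
df['High_Score'] = np.where(df['Score'] > 15, 'yes', 'no')

# Add a column (here, a per-row sum of the numeric column), then drop it again.
df['Row_Total'] = df[['Score']].sum(axis=1)
df = df.drop(['Row_Total'], axis=1)

# Rename columns by reassigning df.columns, then reorder by label selection.
df.columns = ['date', 'reason', 'score', 'high_score']
df = df[['reason', 'date', 'score', 'high_score']]
print(df)
```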
Use the parameter ‘as_index=False’ in the groupby so that the grouping keys appear as regular columns in the output, at the same level as the input df, as in the example below.
```python
import pandas as pd

# Initialize data of lists.
data = {
    'department': ['shoes', 'clothing', 'shoes', 'furniture'],
    'product_type': ['shoes', 'shirt', 'shoes', 'furniture'],
    'order_amount': [20, 21, 19, 180],
    'taxes_paid': [1.5, 2, 1.2, 10],
}

# Create the DataFrame.
df = pd.DataFrame(data)
display(df)

# Group by both keys; as_index=False keeps them as regular columns.
df_result = (
    df.groupby(['department', 'product_type'], as_index=False)
      .agg({
          'order_amount': 'sum',
          'taxes_paid': 'sum',
      })
)

# Print the output.
display(df_result)
```
For a DataFrame called ‘df’, this chained expression shows the percentage of missing values per column, limited to the columns where ‘.isna()’ actually detects missing values.
```python
(df
 .isna()       # True where a value is missing
 .mean()       # fraction of missing values per column
 .mul(100)     # convert to a percentage
 .pipe(lambda ser: ser[ser > 0])  # keep only columns with missing values
)
```
| Operation | Example Code |
|---|---|
| With AND Condition | df[(df.state=='Texas') & (df.year==1990)] |
| With OR Condition | df[(df.state=='Texas') \| (df.year==1990)] |
| With either column-reference nomenclature | df[df.state=='Texas'] or df[df['state']=='Texas'] |
| With a list_of_values | data_filtered=df[df.column_Name.isin(list_of_values)] |
| Using ‘query’ | df.query('column_Name>=5 and column_Name<=10') |
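A small sketch of these filtering patterns, assuming hypothetical 'state' and 'year' columns:

```python
import pandas as pd

# Hypothetical toy data, assumed purely for illustration.
df = pd.DataFrame({
    'state': ['Texas', 'Texas', 'Ohio', 'Nevada'],
    'year': [1990, 2000, 1990, 1995],
})

# AND condition
print(df[(df.state == 'Texas') & (df.year == 1990)])

# OR condition
print(df[(df.state == 'Texas') | (df.year == 1990)])

# Membership in a list of values
list_of_values = ['Texas', 'Ohio']
print(df[df.state.isin(list_of_values)])

# The same kind of filter expressed with query()
print(df.query('year >= 1990 and year <= 1995'))
```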
| Operation | Example Code |
|---|---|
| Converting timestamp into datetime | data['dt'] = data['ts'].map(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').strftime("%Y-%m-%d")) |
| Function to convert columns | for column in columns_to_date: df[column] = pd.to_datetime(df[column], format='%B %d, %Y', errors='coerce') |
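A sketch combining both conversions; the column names ('ts', 'signup_date') and sample values are assumptions used only for illustration:

```python
from datetime import datetime

import pandas as pd

# Hypothetical toy data, assumed purely for illustration.
data = pd.DataFrame({
    'ts': ['2022-08-19T10:30:00Z', '2022-08-20T11:45:00Z'],
    'signup_date': ['August 19, 2022', 'not a date'],
})

# Timestamp string -> date string via strptime/strftime.
data['dt'] = data['ts'].map(
    lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d')
)

# Loop-based conversion; unparseable values become NaT because of errors='coerce'.
columns_to_date = ['signup_date']
for column in columns_to_date:
    data[column] = pd.to_datetime(data[column], format='%B %d, %Y', errors='coerce')

print(data.dtypes)
print(data)
```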
| Operation | Example Code |
|---|---|
| Groupby & aggregations, controlling output column names by using 'pd.NamedAgg' | See a code example in this workbook. |
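Since the workbook is external, here is a minimal sketch of the 'pd.NamedAgg' pattern, reusing the toy data from the chaining example above; the output column names here are my own choices, not taken from the workbook:

```python
import pandas as pd

# Same toy data as the chaining example above.
df = pd.DataFrame({
    'department': ['shoes', 'clothing', 'shoes', 'furniture'],
    'product_type': ['shoes', 'shirt', 'shoes', 'furniture'],
    'order_amount': [20, 21, 19, 180],
    'taxes_paid': [1.5, 2, 1.2, 10],
})

# pd.NamedAgg lets you set the output column names explicitly.
df_named = (
    df.groupby(['department', 'product_type'], as_index=False)
      .agg(
          total_order_amount=pd.NamedAgg(column='order_amount', aggfunc='sum'),
          avg_taxes_paid=pd.NamedAgg(column='taxes_paid', aggfunc='mean'),
      )
)
print(df_named)
```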