Mongo DF

class mongodf.DataFrame(_host, _database, _collection, _columns, list_columns=[], filter=None, array_expand=True, _meta_coll=None, _show_id=False)[source]

Bases: object

A class to represent a DataFrame for MongoDB collections with extended functionality for querying and metadata management.

Parameters:

_host : str

The MongoDB host.

_database : str

The name of the database.

_collection : str

The name of the collection.

_columns : list

The list of columns to include in the DataFrame.

list_columns : list or set, optional

A list or set of columns that are considered to be lists and need special handling. Default is an empty list.

filter : Filter, optional

A Filter object representing the query filter. Default is None.

array_expand : bool, optional

Whether to expand arrays into separate rows. Default is True.

Attributes:

_host : str

The MongoDB host.

_database : str

The name of the database.

_collection : str

The name of the collection.

columns : list

The list of columns included in the DataFrame.

_filter : Filter

The query filter for the DataFrame.

_array_expand : bool

Whether to expand arrays into separate rows.

list_columns : set

A set of columns that are considered to be lists and need special handling.

large_threshold : int

A threshold for determining when a categorical column is large. Default is 1000.

_update_col : str

The name of the column used for tracking updates. Default is “__UPDATED”.

_show_id : bool

Whether to show the document ID in the DataFrame. Default is False.
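
As a rough illustration of what expanding arrays into separate rows means for array_expand, the sketch below expands one list-valued column of plain Python dictionaries. The function name expand_array_rows and the sample rows are hypothetical, and the library's actual behavior (for example with empty lists or with several list columns) may differ.

```python
def expand_array_rows(rows, column):
    """Return a new list of rows where each element of a list-valued
    `column` gets its own row (hypothetical helper, not part of mongodf)."""
    expanded = []
    for row in rows:
        values = row.get(column)
        if isinstance(values, list):
            # One output row per list element; other fields are copied.
            for value in values:
                expanded.append({**row, column: value})
        else:
            # Non-list values are passed through unchanged.
            expanded.append(dict(row))
    return expanded


rows = [
    {"name": "a", "tags": ["x", "y"]},
    {"name": "b", "tags": "z"},
]
print(expand_array_rows(rows, "tags"))
```

In this sketch a row with an empty list produces no output rows; that is a choice made for the example, not documented behavior.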

compute(show_id=None, **kwargs)[source]

Compute the DataFrame by querying the MongoDB collection.

Parameters:

kwargs : dict

Additional parameters for the computation.

Returns:

pandas.DataFrame

The resulting DataFrame after querying the MongoDB collection.

property dtypes

Get the data types of the columns in the DataFrame.

Returns:

pandas.Series

A Series with the data types of the columns.

example(n=20)[source]

Retrieve an example of the DataFrame with a specified number of rows.

Parameters:

n : int, optional

The number of rows to retrieve. Default is 20.

Returns:

pandas.DataFrame

A DataFrame with example data.

get_meta()[source]

Get the metadata for the DataFrame.

Returns:

dict

A dictionary with metadata for each column.

update_meta_cache()[source]

Update the metadata cache for the DataFrame.

update_meta_cache_all()[source]

Update the entire metadata cache for the DataFrame.

class mongodf.Filter(dataframe, config, func=<function Filter.<lambda>>)[source]

Bases: object

A class to represent a filter for querying a DataFrame from a MongoDB collection.

Parameters:

dataframe : DataFrame

The DataFrame to which the filter is applied.

config : dict

The configuration of the filter, represented as a MongoDB query.

func : function, optional

A function to be applied to the filter results. Default is an identity function.

Methods:

__invert__():

Inverts the filter by swapping query operators with their opposites.

__and__(filter_b):

Combines two filters using a logical AND operation.

__or__(filter_b):

Combines two filters using a logical OR operation.

inversion_map = {'$eq': '$ne', '$gt': '$lte', '$gte': '$lt', '$in': '$nin', '$lt': '$gte', '$lte': '$gt', '$ne': '$eq', '$nin': '$in'}
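
To make the role of inversion_map more concrete, here is a minimal sketch of how such a map could be used to invert a simple query. It is not the library's implementation: the function invert_query is hypothetical, and it only handles flat queries of the form {field: {operator: value}}. Several conditions are treated as an implicit AND, so their negation becomes an $or of the inverted conditions (De Morgan's law).

```python
# Copy of the operator pairs listed in Filter.inversion_map.
INVERSION_MAP = {
    "$eq": "$ne", "$gt": "$lte", "$gte": "$lt", "$in": "$nin",
    "$lt": "$gte", "$lte": "$gt", "$ne": "$eq", "$nin": "$in",
}


def invert_query(query):
    """Invert a flat MongoDB-style query (hypothetical helper)."""
    inverted = []
    for field, condition in query.items():
        for operator, value in condition.items():
            # Swap each operator for its opposite, keeping the value.
            inverted.append({field: {INVERSION_MAP[operator]: value}})
    # A single condition inverts directly; several need an $or.
    return inverted[0] if len(inverted) == 1 else {"$or": inverted}


print(invert_query({"age": {"$gt": 30}}))
print(invert_query({"age": {"$gte": 18, "$lt": 65}}))
```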
class mongodf.Column(dataframe, name)[source]

Bases: object

A class representing a column in a DataFrame-like object, enabling operations and filters on MongoDB collections.

Parameters:

dataframe : DataFrame

The DataFrame-like object this column belongs to.

name : str

The name of the column.

Methods:

_query_value(qt, value):

Internal method to format a query value for MongoDB.

isin(array):

Returns a Filter object to check if column values are in the given array.

__eq__(value):

Returns a Filter object to check if column values are equal to the given value.

__ne__(value):

Returns a Filter object to check if column values are not equal to the given value.

__ge__(value):

Returns a Filter object to check if column values are greater than or equal to the given value.

__gt__(value):

Returns a Filter object to check if column values are greater than the given value.

__lt__(value):

Returns a Filter object to check if column values are less than the given value.

__le__(value):

Returns a Filter object to check if column values are less than or equal to the given value.

unique():

Returns an array of unique values in the column.

agg(types):

Returns a Pandas Series containing aggregate values (mean, median, min, max) for the column.
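
The comparison methods listed above follow a common operator-overloading pattern: each operator returns an object that wraps a MongoDB query. The sketch below imitates that pattern with two hypothetical stand-in classes, SketchColumn and SketchFilter, so it runs without a database. How mongodf builds and combines its queries internally is an assumption here; in particular, the use of $and and $or for combining is illustrative only.

```python
class SketchFilter:
    """Hypothetical stand-in for a filter: wraps a MongoDB query dict."""

    def __init__(self, config):
        self.config = config

    def __and__(self, other):
        return SketchFilter({"$and": [self.config, other.config]})

    def __or__(self, other):
        return SketchFilter({"$or": [self.config, other.config]})


class SketchColumn:
    """Hypothetical stand-in for a column that builds filters."""

    def __init__(self, name):
        self.name = name

    def _query(self, operator, value):
        return SketchFilter({self.name: {operator: value}})

    def isin(self, array):
        return self._query("$in", list(array))

    def __eq__(self, value):
        return self._query("$eq", value)

    def __ne__(self, value):
        return self._query("$ne", value)

    def __ge__(self, value):
        return self._query("$gte", value)

    def __gt__(self, value):
        return self._query("$gt", value)

    def __lt__(self, value):
        return self._query("$lt", value)

    def __le__(self, value):
        return self._query("$lte", value)


age = SketchColumn("age")
city = SketchColumn("city")
combined = (age >= 18) & city.isin(["Berlin", "Paris"])
print(combined.config)
```

The parentheses around each comparison matter, because & and | bind more tightly than comparison operators in Python.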

agg(types)[source]

Aggregate values in the column.

Parameters:

types : str or list

The types of aggregation to perform (‘mean’, ‘median’, ‘min’, ‘max’).

Returns:

pandas.Series

A Series containing the aggregated values for the column.
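
The following sketch shows one way the four aggregation names could map to computations, and how a single string can be normalized to a list. It works on an in-memory list and returns a plain dict as a stand-in for the pandas.Series described above; the helper name aggregate_values is hypothetical, and the library itself presumably computes these values against MongoDB.

```python
import statistics

# Aggregation names from the documentation mapped to stdlib functions.
AGGREGATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def aggregate_values(values, types):
    """Compute the requested aggregates over `values` (hypothetical helper)."""
    if isinstance(types, str):
        # Accept a single aggregation name as well as a list of names.
        types = [types]
    return {name: AGGREGATORS[name](values) for name in types}


print(aggregate_values([1, 2, 3, 10], ["mean", "median", "min", "max"]))
```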

isin(array)[source]

Create a Filter object to check if column values are in the given array.

Parameters:

array : list

The array of values to check against.

Returns:

Filter

A Filter object with the specified condition.

unique()[source]

Get the unique values in the column.

Returns:

numpy.ndarray

An array of unique values in the column.

mongodf.utils.flatten_dict(d, parent_key='', sep='.')[source]

Flatten a nested dictionary into dot notation.

Parameters:

d : dict

The nested dictionary to flatten.

parent_key : str, optional

The base key to use for the flattened dictionary. Default is an empty string.

sep : str, optional

The separator to use between keys. Default is ‘.’.

Returns:

dict

A dictionary with flattened keys using dot notation.

Notes:

This function recursively flattens a nested dictionary. If a value in the dictionary is another dictionary, it will concatenate the keys using the specified separator. If a value is a list, it will recursively flatten each element of the list.
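
The recursion described in the notes can be sketched as follows. This is an independent illustration under a different name (flatten_dict_sketch), not the source of mongodf.utils.flatten_dict. It covers nested dictionaries only; list handling is left out because the key format used for list elements is not specified here.

```python
def flatten_dict_sketch(d, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with joined keys."""
    items = {}
    for key, value in d.items():
        # Prefix the key with its parent's path, if there is one.
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            items.update(flatten_dict_sketch(value, new_key, sep))
        else:
            items[new_key] = value
    return items


nested = {"a": {"b": 1, "c": {"d": 2}}, "e": 3}
print(flatten_dict_sketch(nested))
```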

mongodf.utils.from_mongo(host, database, collection, columns=None, filter={}, array_expand=True, cached_meta=True, dict_expand_level=0, meta_collection=None, show_id=False)[source]

Fetch data from a MongoDB collection and return it as a DataFrame-like object.

Parameters:

host : str

The MongoDB host address.

database : str

The name of the MongoDB database.

collection : str

The name of the MongoDB collection.

columns : list, optional

A list of column names to include in the result. If None, columns will be inferred.

filter : dict, optional

The query filter to apply. Default is an empty dict, in which case no filter is applied.

array_expand : bool, optional

Whether to expand arrays found in the documents into separate rows. Default is True.

cached_meta : bool, optional

Whether to use cached metadata for inferring columns. Default is True.

dict_expand_level : int, optional

The level of dictionary expansion to perform. Default is 0.

meta_collection : str, optional

The name of the collection to use for cached metadata. If None, defaults to ‘__<collection>_meta’.

show_id : bool, optional

Whether to show the document ID in the DataFrame. Default is False.

Returns:

DataFrame

A DataFrame-like object containing the data from the MongoDB collection.

Notes:

If cached_meta is True and columns is None, the function will attempt to retrieve column names from a meta collection (either specified by meta_collection or defaulting to ‘__<collection>_meta’). If no columns are found in the meta collection, it will then infer the columns by analyzing the collection’s documents. The dict_expand_level parameter controls how deeply nested dictionaries are expanded into separate columns.
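
The fallback order described in these notes can be sketched as plain control flow. The helpers below are hypothetical: read_meta_columns and infer_columns stand in for whatever the library does to read the meta collection and to analyze documents, and only the default name pattern ‘__<collection>_meta’ is taken from the documentation.

```python
def default_meta_collection(collection, meta_collection=None):
    """Return the meta collection name, using the documented default."""
    return meta_collection if meta_collection is not None else f"__{collection}_meta"


def resolve_columns(columns, cached_meta, read_meta_columns, infer_columns):
    """Pick column names following the documented fallback order.

    `read_meta_columns` and `infer_columns` are caller-supplied callables
    (assumptions made for this sketch).
    """
    if columns is not None:
        # Explicit columns always win.
        return list(columns)
    if cached_meta:
        cached = read_meta_columns()
        if cached:
            return list(cached)
    # Nothing cached (or caching disabled): infer from the documents.
    return list(infer_columns())


print(default_meta_collection("orders"))
print(resolve_columns(None, True, lambda: [], lambda: ["a", "b"]))
```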

mongodf.utils.get_all_columns_of(coll, dict_expand_level=0)[source]

Retrieve all unique column names from a MongoDB collection, expanding nested fields up to a specified level.

Parameters:

coll : pymongo.collection.Collection

The MongoDB collection from which to retrieve column names.

dict_expand_level : int, optional

The level of nested dictionary expansion. Default is 0.

Returns:

list

A list of unique column names in dot notation.

Notes:

This function uses MongoDB’s aggregation framework to project and unwind nested fields up to the specified dict_expand_level. It then collects all unique keys from the documents in the collection.
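
As an in-memory analogue of the key collection described above, the sketch below gathers unique dot-notation keys from a list of dictionaries, expanding nested dictionaries up to a given level. It does not use the aggregation framework, the name collect_column_names is hypothetical, and the exact meaning of each dict_expand_level value in the library is an assumption (here, level 0 returns top-level keys only).

```python
def collect_column_names(documents, dict_expand_level=0):
    """Collect unique key paths from `documents`, expanding nested dicts
    up to `dict_expand_level` levels (hypothetical helper)."""
    names = set()

    def walk(doc, prefix, level):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict) and value and level < dict_expand_level:
                # Still allowed to expand: descend one level.
                walk(value, path, level + 1)
            else:
                names.add(path)

    for document in documents:
        walk(document, "", 0)
    return sorted(names)


docs = [{"a": 1, "b": {"c": 2}}, {"b": {"d": {"e": 3}}, "f": 4}]
print(collect_column_names(docs, dict_expand_level=1))
```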