Mongo DF
- class mongodf.DataFrame(_host, _database, _collection, _columns, list_columns=[], filter=None, array_expand=True, _meta_coll=None, _show_id=False)
Bases: object
A class to represent a DataFrame for MongoDB collections with extended functionality for querying and metadata management.
Parameters:
- _host : str
The MongoDB host.
- _database : str
The name of the database.
- _collection : str
The name of the collection.
- _columns : list
The list of columns to include in the DataFrame.
- list_columns : list or set, optional
A list or set of columns that are considered to be lists and need special handling. Default is an empty list.
- filter : Filter, optional
A Filter object representing the query filter. Default is None.
- array_expand : bool, optional
Whether to expand arrays into separate rows. Default is True.
Attributes:
- _host : str
The MongoDB host.
- _database : str
The name of the database.
- _collection : str
The name of the collection.
- columns : list
The list of columns included in the DataFrame.
- _filter : Filter
The query filter for the DataFrame.
- _array_expand : bool
Whether to expand arrays into separate rows.
- list_columns : set
A set of columns that are considered to be lists and need special handling.
- large_threshold : int
A threshold for determining when a categorical column is large. Default is 1000.
- _update_col : str
The name of the column used for tracking updates. Default is “__UPDATED”.
- _show_id : bool
Whether to show the document ID in the DataFrame. Default is False.
- compute(show_id=None, **kwargs)
Compute the DataFrame by querying the MongoDB collection.
Parameters:
- show_id : bool, optional
Whether to show the document ID in the resulting DataFrame. Default is None.
- kwargs : dict
Additional parameters for the computation.
Returns:
- pandas.DataFrame
The resulting DataFrame after querying the MongoDB collection.
- property dtypes
Get the data types of the columns in the DataFrame.
Returns:
- pandas.Series
A Series with the data types of the columns.
- example(n=20)
Retrieve an example of the DataFrame with a specified number of rows.
Parameters:
- n : int, optional
The number of rows to retrieve. Default is 20.
Returns:
- pandas.DataFrame
A DataFrame with example data.
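To picture what the array_expand and list_columns options describe, the following minimal sketch expands the list-valued fields of a single document into one row per element. It is plain Python and not mongodf's implementation; the helper name expand_arrays is hypothetical, and leaving empty or missing lists untouched is an assumption.

```python
def expand_arrays(document, list_columns):
    """Return one row per element of each list-valued column (illustrative only)."""
    rows = [dict(document)]
    for column in list_columns:
        expanded = []
        for row in rows:
            value = row.get(column)
            if isinstance(value, list) and value:
                # One output row per list element, all other fields copied.
                for item in value:
                    new_row = dict(row)
                    new_row[column] = item
                    expanded.append(new_row)
            else:
                # Assumption: empty or non-list values are kept as a single row.
                expanded.append(row)
        rows = expanded
    return rows


rows = expand_arrays({"name": "a", "tags": ["x", "y"]}, ["tags"])
```

Here rows holds two dictionaries, one for each entry of tags, which mirrors the "separate rows" wording used for array_expand.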
- class mongodf.Filter(dataframe, config, func=<function Filter.<lambda>>)
Bases: object
A class to represent a filter for querying a DataFrame from a MongoDB collection.
Parameters:
- dataframe : DataFrame
The DataFrame to which the filter is applied.
- config : dict
The configuration of the filter, represented as a MongoDB query.
- func : function, optional
A function to be applied to the filter results. Default is an identity function.
Methods:
- __invert__():
Inverts the filter by swapping query operators with their opposites.
- __and__(filter_b):
Combines two filters using a logical AND operation.
- __or__(filter_b):
Combines two filters using a logical OR operation.
- inversion_map = {'$eq': '$ne', '$gt': '$lte', '$gte': '$lt', '$in': '$nin', '$lt': '$gte', '$lte': '$gt', '$ne': '$eq', '$nin': '$in'}
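As an illustration of how an operator map like inversion_map could drive filter inversion and combination, here is a minimal sketch that works on plain MongoDB query dictionaries. The functions invert_query, and_queries, and or_queries are hypothetical names, not part of mongodf, and the real Filter class may build its queries differently.

```python
INVERSION_MAP = {
    "$eq": "$ne", "$gt": "$lte", "$gte": "$lt", "$in": "$nin",
    "$lt": "$gte", "$lte": "$gt", "$ne": "$eq", "$nin": "$in",
}


def invert_query(query):
    """Swap every operator in a flat {field: {operator: value}} query.

    Assumption: the query holds a single condition. Swapping operators in a
    multi-condition query would not yield its logical negation.
    """
    return {
        field: {INVERSION_MAP[op]: value for op, value in condition.items()}
        for field, condition in query.items()
    }


def and_queries(query_a, query_b):
    """Combine two queries with MongoDB's $and operator."""
    return {"$and": [query_a, query_b]}


def or_queries(query_a, query_b):
    """Combine two queries with MongoDB's $or operator."""
    return {"$or": [query_a, query_b]}


older_than_30 = {"age": {"$gt": 30}}
not_older = invert_query(older_than_30)
combined = and_queries(older_than_30, {"city": {"$in": ["Oslo", "Bergen"]}})
```

Applying invert_query twice returns the original query, because every entry in the map has its reverse entry.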
- class mongodf.Column(dataframe, name)
Bases: object
A class representing a column in a DataFrame-like object, enabling operations and filters on MongoDB collections.
Parameters:
- dataframe : DataFrame
The DataFrame-like object this column belongs to.
- name : str
The name of the column.
Methods:
- _query_value(qt, value):
Internal method to format a query value for MongoDB.
- isin(array):
Returns a Filter object to check if column values are in the given array.
- __eq__(value):
Returns a Filter object to check if column values are equal to the given value.
- __ne__(value):
Returns a Filter object to check if column values are not equal to the given value.
- __ge__(value):
Returns a Filter object to check if column values are greater than or equal to the given value.
- __gt__(value):
Returns a Filter object to check if column values are greater than the given value.
- __lt__(value):
Returns a Filter object to check if column values are less than the given value.
- __le__(value):
Returns a Filter object to check if column values are less than or equal to the given value.
- unique():
Returns an array of unique values in the column.
- agg(types):
Returns a Pandas Series containing aggregate values (mean, median, min, max) for the column.
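The comparison methods listed above can be pictured as thin wrappers that translate Python operators into MongoDB query operators. The sketch below is a hypothetical stand-in (SketchColumn is not a mongodf class) that returns plain query dictionaries, whereas the documented methods return Filter objects.

```python
class SketchColumn:
    """Illustrative column that maps Python comparisons to MongoDB operators."""

    def __init__(self, name):
        self.name = name

    def _query(self, operator, value):
        # Build a {field: {operator: value}} MongoDB-style condition.
        return {self.name: {operator: value}}

    def isin(self, values):
        return self._query("$in", list(values))

    def __eq__(self, value):
        return self._query("$eq", value)

    def __ne__(self, value):
        return self._query("$ne", value)

    def __ge__(self, value):
        return self._query("$gte", value)

    def __gt__(self, value):
        return self._query("$gt", value)

    def __lt__(self, value):
        return self._query("$lt", value)

    def __le__(self, value):
        return self._query("$lte", value)


age = SketchColumn("age")
adults = age >= 18
in_cities = SketchColumn("city").isin(["Oslo", "Bergen"])
```

Overloading the operators this way lets a filter be written as an ordinary Python expression while the query itself stays a MongoDB document.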
- agg(types)
Aggregate values in the column.
Parameters:
- types : str or list
The types of aggregation to perform (‘mean’, ‘median’, ‘min’, ‘max’).
Returns:
- pandas.Series
A Series containing the aggregated values for the column.
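The following sketch shows the shape of the computation that agg describes, using only the standard library on an in-memory list. It is an assumption-laden illustration: the aggregate function is hypothetical, it returns a dict rather than a pandas.Series, and the actual method presumably evaluates the aggregation against MongoDB.

```python
import statistics

# The four aggregation names listed in the documentation.
AGGREGATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def aggregate(values, types):
    """Compute the requested aggregates; accepts one name or a list of names."""
    if isinstance(types, str):
        types = [types]
    return {name: AGGREGATORS[name](values) for name in types}


summary = aggregate([1, 2, 3, 10], ["mean", "median", "min", "max"])
```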
- mongodf.utils.flatten_dict(d, parent_key='', sep='.')
Flatten a nested dictionary into dot notation.
Parameters:
- d : dict
The nested dictionary to flatten.
- parent_key : str, optional
The base key to use for the flattened dictionary. Default is an empty string.
- sep : str, optional
The separator to use between keys. Default is ‘.’.
Returns:
- dict
A dictionary with flattened keys using dot notation.
Notes:
This function recursively flattens a nested dictionary. If a value in the dictionary is another dictionary, it will concatenate the keys using the specified separator. If a value is a list, it will recursively flatten each element of the list.
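A minimal sketch of the dot-notation flattening described in the notes is given below. It is not the library's code: the name flatten_dict_sketch is hypothetical, and it handles nested dictionaries only, because the text above does not specify how flattened list elements are keyed.

```python
def flatten_dict_sketch(d, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with joined keys."""
    flat = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse, carrying the accumulated key as the new prefix.
            flat.update(flatten_dict_sketch(value, new_key, sep=sep))
        else:
            # Assumption: lists and scalars are stored as-is.
            flat[new_key] = value
    return flat


flat = flatten_dict_sketch({"a": {"b": 1, "c": {"d": 2}}, "e": 3})
```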
- mongodf.utils.from_mongo(host, database, collection, columns=None, filter={}, array_expand=True, cached_meta=True, dict_expand_level=0, meta_collection=None, show_id=False)
Fetch data from a MongoDB collection and return it as a DataFrame-like object.
Parameters:
- host : str
The MongoDB host address.
- database : str
The name of the MongoDB database.
- collection : str
The name of the MongoDB collection.
- columns : list, optional
A list of column names to include in the result. If None, columns will be inferred.
- filter : dict, optional
A mongodf.Filter object. Default is an empty dict ({}). If None, no filter will be applied.
- array_expand : bool, optional
Whether to expand arrays found in the documents into separate rows. Default is True.
- cached_meta : bool, optional
Whether to use cached metadata for inferring columns. Default is True.
- dict_expand_level : int, optional
The level of dictionary expansion to perform. Default is 0.
- meta_collection : str, optional
The name of the collection to use for cached metadata. If None, defaults to ‘__<collection>_meta’.
- show_id : bool, optional
Whether to show the document ID in the resulting DataFrame. Default is False.
Returns:
- DataFrame
A DataFrame-like object containing the data from the MongoDB collection.
Notes:
If cached_meta is True and columns is None, the function will attempt to retrieve column names from a meta collection (either specified by meta_collection or defaulting to ‘__<collection>_meta’). If no columns are found in the meta collection, it will then infer the columns by analyzing the collection’s documents. The dict_expand_level parameter controls how deeply nested dictionaries are expanded into separate columns.
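The column-resolution order described in these notes can be sketched as a small fallback chain. Everything here is illustrative: resolve_columns, default_meta_collection, and the two callables standing in for "read the meta collection" and "infer from documents" are hypothetical and not part of mongodf.

```python
def default_meta_collection(collection):
    """Default meta collection name following the '__<collection>_meta' pattern."""
    return f"__{collection}_meta"


def resolve_columns(columns, cached_meta, load_cached, infer):
    """Pick columns: explicit list first, then cached metadata, then inference.

    load_cached and infer are zero-argument callables supplied by the caller
    (assumed stand-ins for the meta-collection lookup and document analysis).
    """
    if columns is not None:
        return list(columns)
    if cached_meta:
        cached = load_cached()
        if cached:
            return list(cached)
    # Reached when caching is disabled or the meta collection had no columns.
    return list(infer())


meta_name = default_meta_collection("orders")
resolved = resolve_columns(None, True, lambda: [], lambda: ["id", "total"])
```

In this run the cached lookup returns nothing, so the columns come from the inference callable.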
- mongodf.utils.get_all_columns_of(coll, dict_expand_level=0)
Retrieve all unique column names from a MongoDB collection, expanding nested fields up to a specified level.
Parameters:
- coll : pymongo.collection.Collection
The MongoDB collection from which to retrieve column names.
- dict_expand_level : int, optional
The level of nested dictionary expansion. Default is 0.
Returns:
- list
A list of unique column names in dot notation.
Notes:
This function uses MongoDB’s aggregation framework to project and unwind nested fields up to the specified dict_expand_level. It then collects all unique keys from the documents in the collection.
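To make the effect of dict_expand_level concrete, the sketch below collects unique dot-notation keys from an in-memory list of documents. It is a pure-Python approximation, not the aggregation pipeline the function uses; collect_columns is a hypothetical name, and treating level 0 as "top-level keys only" is an assumption.

```python
def collect_columns(documents, dict_expand_level=0):
    """Return sorted unique keys, expanding nested dicts up to the given level."""
    columns = set()

    def walk(doc, prefix, level):
        for key, value in doc.items():
            name = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict) and level < dict_expand_level:
                # Still within the allowed depth: descend into the sub-document.
                walk(value, name, level + 1)
            else:
                columns.add(name)

    for doc in documents:
        walk(doc, "", 0)
    return sorted(columns)


docs = [{"a": 1, "b": {"c": 2, "d": {"e": 3}}}, {"f": 4}]
top_level = collect_columns(docs)
one_level = collect_columns(docs, dict_expand_level=1)
```

With level 0 the nested field b stays a single column, while level 1 splits it into b.c and b.d without descending further.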