API documentation

Core classes

Module: ianalyzer_readers.readers.core

This module defines the base classes on which all Readers are built.

The module defines two classes, Field and Reader.

`Document = Dict[str, Any]` `module-attribute`

Type definition for documents, defined for convenience.

Each document extracted by a Reader is a dictionary, where the keys are names of the Reader's fields, and the values are based on the extractor of each field.

`Source = Union[SourceData, Tuple[SourceData, Dict]]` `module-attribute`

Type definition for the source input to some Reader methods.

Sources are either:

a string with the path to a file
binary data with the file contents. This is not supported on all Reader subclasses.
a requests.Response
a tuple of one of the above, and a dictionary with metadata

`SourceData = Union[str, Response, bytes]` `module-attribute`

Type definition of the data types a Reader method can handle.
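
As a rough illustration of how these aliases fit together, the sketch below splits a source into its data and metadata parts, mirroring the tuple check in `data_and_metadata_from_source`. `split_source` is a hypothetical helper written for this example, and `requests.Response` is left out of the aliases so the snippet runs with the standard library only.

```python
from typing import Any, Dict, Tuple, Union

# Simplified aliases: the real SourceData also includes requests.Response.
SourceData = Union[str, bytes]
Source = Union[SourceData, Tuple[SourceData, Dict[str, Any]]]


def split_source(source: Source) -> Tuple[SourceData, Dict[str, Any]]:
    """Return (source_data, metadata); metadata is empty if none was given."""
    if isinstance(source, tuple) and len(source) == 2:
        source_data, metadata = source
        return source_data, metadata
    return source, {}


print(split_source('data/file.csv'))
print(split_source((b'raw bytes', {'year': 1900})))
```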

`Field`

Bases: object

Fields are the elements of information that you wish to extract from each document.

Parameters:

Name	Type	Description	Default
`name`	`str`	a shorthand name, which will be used as the field's key in the document	required
`extractor`	`Extractor`	an Extractor object that defines how this field's data can be extracted from source documents.	`Constant(None)`
`required`	`bool`	whether this field is required. The `Reader` class should skip the document if the value for this Field is `None`.	`False`
`skip`	`bool`	if `True`, this field will not be included in the results.	`False`
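
To make the `required` and `skip` flags concrete, here is a minimal sketch that does not import the library. `Constant` and `Field` below are simplified stand-ins written for this example; the only assumption about extractors is that they expose `apply(**kwargs)`, as used in `Reader.extract_document`.

```python
class Constant:
    """Stand-in extractor that always returns the same value."""

    def __init__(self, value):
        self.value = value

    def apply(self, **kwargs):
        return self.value


class Field:
    """Stand-in with the same four attributes as the documented Field."""

    def __init__(self, name, extractor=Constant(None), required=False, skip=False):
        self.name = name
        self.extractor = extractor
        self.required = required
        self.skip = skip


fields = [
    Field('title', Constant('Untitled'), required=True),
    Field('author'),  # default extractor yields None
    Field('internal_id', Constant(42), skip=True),
]

# Skipped fields are left out of the document entirely.
document = {f.name: f.extractor.apply() for f in fields if not f.skip}

# A document is kept only if every required field has a non-None value.
keep = all(document.get(f.name) is not None for f in fields if f.required)

print(document)  # {'title': 'Untitled', 'author': None}
print(keep)      # True
```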

Source code in ianalyzer_readers/readers/core.py

class Field(object):
    '''
    Fields are the elements of information that you wish to extract from each document.

    Parameters:
        name: a shorthand name, which will be used as the field's key in the document
        extractor: an Extractor object that defines how this field's data can be
            extracted from source documents.
        required: whether this field is required. The `Reader` class should skip the
            document if the value for this Field is `None`.
        skip: if `True`, this field will not be included in the results.
    '''

    def __init__(self,
                 name: str,
                 extractor: extract.Extractor = extract.Constant(None),
                 required: bool = False,
                 skip: bool = False,
                 **kwargs
                 ):

        self.name = name
        self.extractor = extractor
        self.required = required
        self.skip = skip

`Reader`

A base class for readers. Readers are objects that can generate documents from a source dataset.

Subclasses of Reader can be created to read specific data formats. In practice, you will probably work with a subclass of Reader like XMLReader, CSVReader, etc., that provides the core functionality for a file type, and create a subclass for a specific dataset.

Some methods of this class need to be implemented in child classes, and will raise NotImplementedError if you try to use Reader directly.

A fully implemented Reader subclass will define how to read a dataset by describing:

How to obtain its source files.
How to parse and iterate over source files.
What fields each document contains, and how to extract them from the source data.

This requires implementing the following attributes/methods:

fields: a list of Field instances that describe the fields that will appear in documents, and how to extract their value.
sources: a method that returns an iterable of sources (e.g. file paths), possibly with metadata for each.
data_directory (optional): a string with the path to the directory containing the source data. You can use this in the implementation of sources; it's not used elsewhere.
data_from_file, data_from_bytes, data_from_response: methods that respectively receive a file path, a byte sequence, or an HTTP response, and return a data object. (The type of the data will depend on how you implement your reader; this could be a parsed graph, a row iterator, etc.). You must implement at least one of these methods to have a functioning reader.
iterate_data: method that takes a data object (the output of data_from_file/data_from_bytes/data_from_response) and a metadata dictionary, iterates over the source data, and returns the data that should be passed on to extractors for each document.
validate (optional): a method that will check the reader configuration. This is useful for abstract readers like the XMLReader, CSVReader, etc., so they can verify a child class is implementing attributes correctly.

Abstract reader types like CSVReader usually leave fields and sources unimplemented.
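
The three responsibilities listed above can be sketched without the library. `LineReader`, `LineText` and `MetadataValue` are hypothetical stand-ins written for this example, and the `documents` driver is a simplified version of the pipeline in the source code on this page (it omits validation, required-field filtering, and bytes or Response sources).

```python
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator, List, Tuple


class LineText:
    """Hypothetical extractor: returns the `line` argument from iterate_data."""

    def apply(self, line: str = '', **kwargs) -> str:
        return line.strip()


class MetadataValue:
    """Hypothetical extractor: returns one key of the source metadata."""

    def __init__(self, key: str):
        self.key = key

    def apply(self, metadata: Dict = None, **kwargs) -> Any:
        return (metadata or {}).get(self.key)


class LineReader:
    """Stand-in for a Reader subclass: one document per line of a text file."""

    def __init__(self, directory: str):
        self.data_directory = directory
        # (field name, extractor) pairs stand in for Field instances.
        self.fields: List[Tuple[str, Any]] = [
            ('text', LineText()),
            ('filename', MetadataValue('filename')),
        ]

    def sources(self) -> Iterator[Tuple[str, Dict]]:
        # How to obtain source files, with metadata known before parsing.
        for name in sorted(os.listdir(self.data_directory)):
            yield os.path.join(self.data_directory, name), {'filename': name}

    def data_from_file(self, path: str) -> List[str]:
        # How to parse a source file into a data object.
        with open(path, encoding='utf-8') as f:
            return f.readlines()

    def iterate_data(self, data: List[str], metadata: Dict) -> Iterable[Dict]:
        # One dictionary of extractor arguments per document.
        for line in data:
            yield {'line': line}

    def documents(self) -> Iterator[Dict]:
        # Simplified driver (requires Python 3.9+ for the dict union).
        for path, metadata in self.sources():
            data = self.data_from_file(path)
            for index, extracted in enumerate(self.iterate_data(data, metadata)):
                kwargs = {'metadata': metadata, 'index': index} | extracted
                yield {name: ex.apply(**kwargs) for name, ex in self.fields}


with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, 'a.txt'), 'w', encoding='utf-8') as f:
        f.write('first\nsecond\n')
    docs = list(LineReader(directory).documents())

print(docs)
```

With the real library you would subclass `Reader` (or a format-specific reader) and define `fields` as a list of `Field` instances instead of the tuples used here.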

Source code in ianalyzer_readers/readers/core.py

class Reader:
    '''
    A base class for readers. Readers are objects that can generate documents
    from a source dataset.

    Subclasses of `Reader` can be created to read specific data formats. 
    In practice, you will probably work with a subclass of `Reader` like `XMLReader`,
    `CSVReader`, etc., that provides the core functionality for a file type, and create
    a subclass for a specific dataset.

    Some methods of this class need to be implemented in child classes, and will raise
    `NotImplementedError` if you try to use `Reader` directly.

    A fully implemented `Reader` subclass will define how to read a dataset by
    describing:

    - How to obtain its source files.
    - How to parse and iterate over source files.
    - What fields each document contains, and how to extract them from the source data.

    This requires implementing the following attributes/methods:

    - `fields`: a list of `Field` instances that describe the fields that will appear in
        documents, and how to extract their value.
    - `sources`: a method that returns an iterable of sources (e.g. file paths), possibly
        with metadata for each.
    - `data_directory` (optional): a string with the path to the directory containing
        the source data. You can use this in the implementation of `sources`; it's not
        used elsewhere.
    - `data_from_file`, `data_from_bytes`, `data_from_response`: methods that respectively
        receive a file path, a byte sequence, or an HTTP response, and return a data
        object. (The type of the data will depend on how you implement your reader; this
        could be a parsed graph, a row iterator, etc.). You must implement at least one of
        these methods to have a functioning reader.
    - `iterate_data`: method that takes a data object (the output of
        `data_from_file`/`data_from_bytes`/`data_from_response`) and a metadata dictionary,
        iterates over the source data, and returns the data that should be passed on to
        extractors for each document.
    - `validate` (optional): a method that will check the reader configuration. This is
        useful for abstract readers like the `XMLReader`, `CSVReader`, etc., so they
        can verify a child class is implementing attributes correctly.

    Abstract reader types like `CSVReader` usually leave `fields` and `sources`
    unimplemented.
    '''

    @property
    def data_directory(self) -> str:
        '''
        Path to source data directory.

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError('Reader missing data_directory')


    @property
    def fields(self) -> List[Field]:
        '''
        The list of fields that are extracted from documents.

        These should be instances of the `Field` class (or implement the same API).

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError('Reader missing fields implementation')

    @property
    def fieldnames(self) -> List[str]:
        '''
        A list containing the name of each field of this Reader
        '''
        return [field.name for field in self.fields]


    @property
    def _required_field_names(self) -> List[str]:
        '''
        A list of the names of all required fields
        '''
        return [field.name for field in self.fields if field.required]


    def sources(self, **kwargs) -> Iterable[Source]:
        '''
        Obtain source files for the Reader.

        Returns:
            an iterable of tuples that each contain a string path, and a dictionary
                with associated metadata. The metadata can contain any data that was
                extracted before reading the file itself, such as data based on the
                file path, or on a metadata file.

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError('Reader missing sources implementation')

    def source2dicts(self, source: Source, source_index=-1) -> Iterable[Document]:
        '''
        Given a source file, returns an iterable of extracted documents.

        Parameters:
            source: Source to extract.

        Returns:
            an iterable of document dictionaries. Each of these is a dictionary,
                where the keys are names of this Reader's `fields`, and the values
                are based on the extractor of each field.
        '''

        self.validate()

        data, metadata = self.data_and_metadata_from_source(source)

        if isinstance(data, AbstractContextManager):
            context_manager = data
        else:
            context_manager = nullcontext(data)

        with context_manager as data:
            for index, extracted_data in enumerate(self.iterate_data(data, metadata)):
                base_data = {
                    'metadata': metadata,
                    'index': index,
                    'source_index': source_index,
                }
                document_data = base_data | extracted_data
                document = self.extract_document(**document_data)
                if self._has_required_fields(document):
                    yield document


    def data_and_metadata_from_source(self, source: Source) -> Tuple[Any, Dict]:
        '''
        Extract the data and metadata object from a source.

        Parameters:
            source: Source to extract.

        Returns:
            A tuple with the parsed source data, and the metadata (empty if none was
                provided).
        '''
        if isinstance(source, tuple) and len(source) == 2:
            source_data, metadata = source
        else:
            source_data = source
            metadata = {}

        if isinstance(source_data, str):
            if not isfile(source_data):
                raise FileNotFoundError(f'Invalid file path: {source_data}')
            data = self.data_from_file(source_data)
        elif isinstance(source_data, bytes):
            data = self.data_from_bytes(source_data)
        elif isinstance(source_data, Response):
            data = self.data_from_response(source_data)
        else:
            raise TypeError(f'Unknown source type: {type(source_data)}')

        return data, metadata


    def data_from_file(self, path: str) -> Any:
        '''
        Extract source data from a filename.

        The return type depends on how the reader is implemented, but should be some kind
        of data structure from which documents can be extracted. It serves as the input
        to `self.iterate_data`.

        This method can also return a context manager. This is especially useful to
        iterate over large files in `iterate_data`, without loading the complete file
        contents in memory.

        Tip: if you have implemented `self.data_from_bytes`, this method can probably just
        read the binary contents of the file and call that method.

        Parameters:
            path: The path to a file.

        Returns:
            A data object. The type depends on the reader implementation.

        Raises:
            NotImplementedError: this method may be implemented on child classes, but
                has no default implementation.
        '''

        raise NotImplementedError('This reader does not support filename input')


    def data_from_bytes(self, bytes: bytes) -> Any:
        '''
        Extract source data from a bytes object. Like `data_from_file`, but with bytes
        input.

        Parameters:
            bytes: byte contents of the source

        Returns:
            A data object. The type depends on the reader implementation. This may also
                be a context manager.

        Raises:
            NotImplementedError: this method may be implemented on child classes, but
                has no default implementation.
        '''

        raise NotImplementedError('This reader does not support bytes input')


    def data_from_response(self, response: Response) -> Any:
        '''
        Extract data from an HTTP response. Like `data_from_file`, but with `Response`
        input.

        Parameters:
            response: HTTP response object

        Returns:
            A data object. The type depends on the reader implementation. This may also
                be a context manager.

        Raises:
            NotImplementedError: this method may be implemented on child classes, but has
                no default implementation.
        '''
        raise NotImplementedError('This reader does not support Response input')


    def iterate_data(self, data: Any, metadata: Dict) -> Iterable[Document]:
        '''
        Iterate parsed source data, return data for each document.

        This should return the arguments that are passed on to field extractors per
        document. These usually cater to a specific extractor type. For example, the
        `CSVReader` returns an argument `rows`, which is used by the `CSV` extractor.

        The core `source2dicts` method will also provide `metadata` and `index` arguments
        to extractors, which you may override by providing them here.

        Parameters:
            data: The data object from a source. The type depends on the reader
                implementation; this is the output of `self.data_from_file` or
                `self.data_from_bytes`.
            metadata: Dictionary containing metadata for the source.

        Returns:
            An iterable of dictionaries. Each iteration will be extracted as a single
            document. The items in the dictionary are given as arguments to field
            extractors.

        Raises:
            NotImplementedError: This method must be implemented on child classes. It
                will raise an error otherwise.
        '''
        raise NotImplementedError('Data iteration is not implemented')


    def extract_document(
            self,
            **kwargs
        ) -> Document:
        '''
        Extract each field of a document, based on the raw data for the document
        '''
        return {
            field.name: field.extractor.apply(**kwargs)
            for field in self.fields
            if not field.skip
        }

    def documents(self, sources:Iterable[Source] = None) -> Iterable[Document]:
        '''
        Returns an iterable of extracted documents from source files.

        Parameters:
            sources: an iterable of paths to source files. If omitted, the reader
                class will use the value of `self.sources()` instead.

        Returns:
            an iterable of document dictionaries. Each of these is a dictionary,
                where the keys are names of this Reader's `fields`, and the values
                are based on the extractor of each field.
        '''
        sources = sources or self.sources()
        return (
            document
            for i, source in enumerate(sources)
            for document in self.source2dicts(
                source, source_index=i
            )
        )

    def export_csv(self, path: str, sources: Optional[Iterable[Source]] = None) -> None:
        '''
        Extracts documents from sources and saves them in a CSV file.

        This will write a CSV file in the provided `path`. This method has no return
        value.

        Parameters:
            path: the path where the CSV file should be saved.
            sources: an iterable of paths to source files. If omitted, the reader class
                will use the value of `self.sources()` instead.
        '''
        documents = self.documents(sources)

        with open(path, 'w') as outfile:
            writer = csv.DictWriter(outfile, self.fieldnames)
            writer.writeheader()
            for doc in documents:
                writer.writerow(doc)


    def validate(self):
        '''
        Validate that the reader is configured properly.

        This is a good place to check parameters that are overridden in a child class. A
        common use case is to use self._reject_extractors to raise an error if any fields use
        unsupported extractor types.
        '''
        pass

    def _reject_extractors(self, *inapplicable_extractors: extract.Extractor):
        '''
        Raise errors if any fields use any of the given extractors.

        This can be used to check that fields use extractors that match
        the Reader subclass.

        Raises:
            RuntimeError: raised when a field uses an extractor that is provided
                in the input.
        '''
        for field in self.fields:
            if isinstance(field.extractor, inapplicable_extractors):
                raise RuntimeError(
                    "Specified extractor method cannot be used with this type of data")

    def _has_required_fields(self, document: Document) -> bool:
        '''
        Check whether a document has a value for all fields marked as required.
        '''

        has_field = lambda field_name: document.get(field_name, None) is not None
        return all(
            has_field(field_name) for field_name in self._required_field_names
        )

`data_directory` `property`

Path to source data directory.

Raises:

Type	Description
`NotImplementedError`	This method needs to be implemented on child classes. It will raise an error by default.

`fieldnames` `property`

A list containing the name of each field of this Reader

`fields` `property`

The list of fields that are extracted from documents.

These should be instances of the Field class (or implement the same API).

Raises:

Type	Description
`NotImplementedError`	This method needs to be implemented on child classes. It will raise an error by default.

`data_and_metadata_from_source(source)`

Extract the data and metadata object from a source.

Parameters:

Name	Type	Description	Default
`source`	`Source`	Source to extract.	required

Returns:

Type	Description
`Tuple[Any, Dict]`	A tuple with the parsed source data, and the metadata (empty if none was provided).

Source code in ianalyzer_readers/readers/core.py

def data_and_metadata_from_source(self, source: Source) -> Tuple[Any, Dict]:
    '''
    Extract the data and metadata object from a source.

    Parameters:
        source: Source to extract.

    Returns:
        A tuple with the parsed source data, and the metadata (empty if none was
            provided).
    '''
    if isinstance(source, tuple) and len(source) == 2:
        source_data, metadata = source
    else:
        source_data = source
        metadata = {}

    if isinstance(source_data, str):
        if not isfile(source_data):
            raise FileNotFoundError(f'Invalid file path: {source_data}')
        data = self.data_from_file(source_data)
    elif isinstance(source_data, bytes):
        data = self.data_from_bytes(source_data)
    elif isinstance(source_data, Response):
        data = self.data_from_response(source_data)
    else:
        raise TypeError(f'Unknown source type: {type(source_data)}')

    return data, metadata

`data_from_bytes(bytes)`

Extract source data from a bytes object. Like data_from_file, but with bytes input.

Parameters:

Name	Type	Description	Default
`bytes`	`bytes`	byte contents of the source	required

Returns:

Type	Description
`Any`	A data object. The type depends on the reader implementation. This may also be a context manager.

Raises:

Type	Description
`NotImplementedError`	this method may be implemented on child classes, but has no default implementation.

Source code in ianalyzer_readers/readers/core.py

def data_from_bytes(self, bytes: bytes) -> Any:
    '''
    Extract source data from a bytes object. Like `data_from_file`, but with bytes
    input.

    Parameters:
        bytes: byte contents of the source

    Returns:
        A data object. The type depends on the reader implementation. This may also
            be a context manager.

    Raises:
        NotImplementedError: this method may be implemented on child classes, but
            has no default implementation.
    '''

    raise NotImplementedError('This reader does not support bytes input')

`data_from_file(path)`

Extract source data from a filename.

The return type depends on how the reader is implemented, but should be some kind of data structure from which documents can be extracted. It serves as the input to self.iterate_data.

This method can also return a context manager. This is especially useful to iterate over large files in iterate_data, without loading the complete file contents in memory.

Tip: if you have implemented self.data_from_bytes, this method can probably just read the binary contents of the file and call that method.
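
Both suggestions above can be sketched as follows. `SketchReader` and `StreamingSketchReader` are hypothetical stand-ins that implement only the methods under discussion, and `iterate_rows` is an illustrative name rather than part of the library. The first class delegates to `data_from_bytes`; the second returns an open file, which is a context manager, so rows can be read lazily.

```python
import csv
import io
import os
import tempfile
from typing import Iterator, List


class SketchReader:
    """Hypothetical stand-in showing only the two data_from_* methods."""

    def data_from_bytes(self, data: bytes) -> List[List[str]]:
        # Parse the full byte content in memory.
        return list(csv.reader(io.StringIO(data.decode('utf-8'))))

    def data_from_file(self, path: str) -> List[List[str]]:
        # Following the tip: read the binary contents and delegate.
        with open(path, 'rb') as f:
            return self.data_from_bytes(f.read())


class StreamingSketchReader(SketchReader):
    def data_from_file(self, path: str):
        # Alternative: return the open file itself. The caller enters it as a
        # context manager, so the file is closed once iteration is done.
        return open(path, newline='', encoding='utf-8')

    def iterate_rows(self, data) -> Iterator[List[str]]:
        # Rows are read lazily from the open file handle.
        yield from csv.reader(data)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'example.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        f.write('a,b\n1,2\n')

    eager_rows = SketchReader().data_from_file(path)

    reader = StreamingSketchReader()
    with reader.data_from_file(path) as handle:
        lazy_rows = list(reader.iterate_rows(handle))

print(eager_rows)
print(lazy_rows)
```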

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to a file.	required

Returns:

Type	Description
`Any`	A data object. The type depends on the reader implementation.

Raises:

Type	Description
`NotImplementedError`	this method may be implemented on child classes, but has no default implementation.

Source code in ianalyzer_readers/readers/core.py

def data_from_file(self, path: str) -> Any:
    '''
    Extract source data from a filename.

    The return type depends on how the reader is implemented, but should be some kind
    of data structure from which documents can be extracted. It serves as the input
    to `self.iterate_data`.

    This method can also return a context manager. This is especially useful to
    iterate over large files in `iterate_data`, without loading the complete file
    contents in memory.

    Tip: if you have implemented `self.data_from_bytes`, this method can probably just
    read the binary contents of the file and call that method.

    Parameters:
        path: The path to a file.

    Returns:
        A data object. The type depends on the reader implementation.

    Raises:
        NotImplementedError: this method may be implemented on child classes, but
            has no default implementation.
    '''

    raise NotImplementedError('This reader does not support filename input')

`data_from_response(response)`

Extract data from an HTTP response. Like data_from_file, but with Response input.

Parameters:

Name	Type	Description	Default
`response`	`Response`	HTTP response object	required

Returns:

Type	Description
`Any`	A data object. The type depends on the reader implementation. This may also be a context manager.

Raises:

Type	Description
`NotImplementedError`	this method may be implemented on child classes, but has no default implementation.

Source code in ianalyzer_readers/readers/core.py

def data_from_response(self, response: Response) -> Any:
    '''
    Extract data from an HTTP response. Like `data_from_file`, but with `Response`
    input.

    Parameters:
        response: HTTP response object

    Returns:
        A data object. The type depends on the reader implementation. This may also
            be a context manager.

    Raises:
        NotImplementedError: this method may be implemented on child classes, but has
            no default implementation.
    '''
    raise NotImplementedError('This reader does not support Response input')

`documents(sources=None)`

Returns an iterable of extracted documents from source files.

Parameters:

Name	Type	Description	Default
`sources`	`Iterable[Source]`	an iterable of paths to source files. If omitted, the reader class will use the value of `self.sources()` instead.	`None`

Returns:

Type	Description
`Iterable[Document]`	an iterable of document dictionaries. Each of these is a dictionary, where the keys are names of this Reader's `fields`, and the values are based on the extractor of each field.
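
The sketch below mimics how `documents` flattens several sources into one lazy stream and numbers them with `source_index`. The `source2dicts` and `default_sources` functions here are simplified stand-ins, and each source is a plain list of strings, which is an assumption made for the example only.

```python
from typing import Dict, Iterable, Iterator, List


def source2dicts(source: List[str], source_index: int = -1) -> Iterator[Dict]:
    # Stand-in: each string in the source becomes one document.
    for index, text in enumerate(source):
        yield {'text': text, 'index': index, 'source_index': source_index}


def default_sources() -> List[List[str]]:
    # Stand-in for self.sources().
    return [['from default']]


def documents(sources: Iterable[List[str]] = None) -> Iterator[Dict]:
    # Same fallback expression as in the source code on this page.
    sources = sources or default_sources()
    return (
        document
        for i, source in enumerate(sources)
        for document in source2dicts(source, source_index=i)
    )


docs = list(documents([['a', 'b'], ['c']]))
fallback = list(documents([]))

print(docs)
print(fallback)
```

Because the fallback uses `or`, any falsy argument, including an empty list, is replaced by the default sources, as the second call shows.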

Source code in ianalyzer_readers/readers/core.py

def documents(self, sources:Iterable[Source] = None) -> Iterable[Document]:
    '''
    Returns an iterable of extracted documents from source files.

    Parameters:
        sources: an iterable of paths to source files. If omitted, the reader
            class will use the value of `self.sources()` instead.

    Returns:
        an iterable of document dictionaries. Each of these is a dictionary,
            where the keys are names of this Reader's `fields`, and the values
            are based on the extractor of each field.
    '''
    sources = sources or self.sources()
    return (
        document
        for i, source in enumerate(sources)
        for document in self.source2dicts(
            source, source_index=i
        )
    )

`export_csv(path, sources=None)`

Extracts documents from sources and saves them in a CSV file.

This will write a CSV file in the provided path. This method has no return value.

Parameters:

Name	Type	Description	Default
`path`	`str`	the path where the CSV file should be saved.	required
`sources`	`Optional[Iterable[Source]]`	an iterable of paths to source files. If omitted, the reader class will use the value of `self.sources()` instead.	`None`
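
For orientation, this is roughly what the writing step looks like with the standard `csv` module. The `fieldnames` and `documents` values are made up for the example, and the file is opened with `newline=''` as the `csv` documentation recommends, which differs slightly from the `open(path, 'w')` call in the source code on this page.

```python
import csv
import os
import tempfile

fieldnames = ['title', 'year']
documents = [
    {'title': 'First', 'year': 1900},
    {'title': 'Second', 'year': None},
]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'out.csv')
    with open(path, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.DictWriter(outfile, fieldnames)
        writer.writeheader()
        for doc in documents:
            writer.writerow(doc)

    # Read the file back to inspect what was written.
    with open(path, newline='', encoding='utf-8') as infile:
        rows = list(csv.DictReader(infile))

print(rows)
```

Values are written as strings, and `None` becomes an empty cell, so type information is not preserved in the exported file.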

Source code in ianalyzer_readers/readers/core.py

def export_csv(self, path: str, sources: Optional[Iterable[Source]] = None) -> None:
    '''
    Extracts documents from sources and saves them in a CSV file.

    This will write a CSV file in the provided `path`. This method has no return
    value.

    Parameters:
        path: the path where the CSV file should be saved.
        sources: an iterable of paths to source files. If omitted, the reader class
            will use the value of `self.sources()` instead.
    '''
    documents = self.documents(sources)

    with open(path, 'w') as outfile:
        writer = csv.DictWriter(outfile, self.fieldnames)
        writer.writeheader()
        for doc in documents:
            writer.writerow(doc)

`extract_document(**kwargs)`

Extract each field of a document, based on the raw data for the document

Source code in ianalyzer_readers/readers/core.py

def extract_document(
        self,
        **kwargs
    ) -> Document:
    '''
    Extract each field of a document, based on the raw data for the document
    '''
    return {
        field.name: field.extractor.apply(**kwargs)
        for field in self.fields
        if not field.skip
    }

`iterate_data(data, metadata)`

Iterate parsed source data, return data for each document.

This should return the arguments that are passed on to field extractors per document. These usually cater to a specific extractor type. For example, the CSVReader returns an argument rows, which is used by the CSV extractor.

The core source2dicts method will also provide metadata, index and source_index arguments to extractors, which you may override by providing them here.
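
The override works because `source2dicts` merges its own arguments with the dictionary yielded by `iterate_data` using the dict union operator, where the right-hand side wins. A minimal sketch, with made-up values and a hypothetical `merge` helper (Python 3.9+):

```python
metadata = {'filename': 'a.csv'}


def merge(index, extracted_data):
    base_data = {'metadata': metadata, 'index': index, 'source_index': 0}
    # Keys from the right-hand operand win on conflict.
    return base_data | extracted_data


plain = merge(3, {'rows': [{'id': '1'}]})
overridden = merge(3, {'rows': [{'id': '1'}], 'index': 10})

print(plain['index'])       # 3
print(overridden['index'])  # 10
```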

Parameters:

Name	Type	Description	Default
`data`	`Any`	The data object from a source. The type depends on the reader implementation; this is the output of `self.data_from_file`, `self.data_from_bytes` or `self.data_from_response`.	required
`metadata`	`Dict`	Dictionary containing metadata for the source.	required

Returns:

Type	Description
`Iterable[Document]`	An iterable of dictionaries. Each iteration will be extracted as a single document. The items in the dictionary are given as arguments to field extractors.

Raises:

Type	Description
`NotImplementedError`	This method must be implemented on child classes. It will raise an error otherwise.

Source code in ianalyzer_readers/readers/core.py

def iterate_data(self, data: Any, metadata: Dict) -> Iterable[Document]:
    '''
    Iterate parsed source data, return data for each document.

    This should return the arguments that are passed on to field extractors per
    document. These usually cater to a specific extractor type. For example, the
    `CSVReader` returns an argument `rows`, which is used by the `CSV` extractor.

    The core `source2dicts` method will also provide `metadata` and `index` arguments
    to extractors, which you may override by providing them here.

    Parameters:
        data: The data object from a source. The type depends on the reader
            implementation; this is the output of `self.data_from_file` or
            `self.data_from_bytes`.
        metadata: Dictionary containing metadata for the source.

    Returns:
        An iterable of dictionaries. Each iteration will be extracted as a single
        document. The items in the dictionary are given as arguments to field
        extractors.

    Raises:
        NotImplementedError: This method must be implemented on child classes. It
            will raise an error otherwise.
    '''
    raise NotImplementedError('Data iteration is not implemented')

`source2dicts(source, source_index=-1)`

Given a source file, returns an iterable of extracted documents.

Parameters:

Name	Type	Description	Default
`source`	`Source`	Source to extract.	required

Returns:

Type	Description
`Iterable[Document]`	an iterable of document dictionaries. Each of these is a dictionary, where the keys are names of this Reader's `fields`, and the values are based on the extractor of each field.
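
One detail of this method is that the data object may or may not be a context manager. The sketch below isolates that step; `as_context_manager` is a hypothetical helper name, but the `isinstance` check and the `nullcontext` fallback follow the source code on this page.

```python
import io
from contextlib import AbstractContextManager, nullcontext


def as_context_manager(data):
    # Objects with __enter__/__exit__ (such as open files) are used directly;
    # anything else is wrapped so a single `with` statement handles both.
    if isinstance(data, AbstractContextManager):
        return data
    return nullcontext(data)


stream = io.StringIO('one\ntwo\n')
with as_context_manager(stream) as data:
    lines_from_stream = [line.strip() for line in data]

with as_context_manager(['one', 'two']) as data:
    lines_from_list = list(data)

print(lines_from_stream, lines_from_list)
print(stream.closed)  # True: the stream was closed on leaving the block
```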

Source code in ianalyzer_readers/readers/core.py

def source2dicts(self, source: Source, source_index=-1) -> Iterable[Document]:
    '''
    Given a source file, returns an iterable of extracted documents.

    Parameters:
        source: Source to extract.

    Returns:
        an iterable of document dictionaries. Each of these is a dictionary,
            where the keys are names of this Reader's `fields`, and the values
            are based on the extractor of each field.
    '''

    self.validate()

    data, metadata = self.data_and_metadata_from_source(source)

    if isinstance(data, AbstractContextManager):
        context_manager = data
    else:
        context_manager = nullcontext(data)

    with context_manager as data:
        for index, extracted_data in enumerate(self.iterate_data(data, metadata)):
            base_data = {
                'metadata': metadata,
                'index': index,
                'source_index': source_index,
            }
            document_data = base_data | extracted_data
            document = self.extract_document(**document_data)
            if self._has_required_fields(document):
                yield document

`sources(**kwargs)`

Obtain source files for the Reader.

Returns:

Type	Description
`Iterable[Source]`	an iterable of tuples that each contain a string path, and a dictionary with associated metadata. The metadata can contain any data that was extracted before reading the file itself, such as data based on the file path, or on a metadata file.

Raises:

Type	Description
`NotImplementedError`	This method needs to be implemented on child classes. It will raise an error by default.

Source code in ianalyzer_readers/readers/core.py

def sources(self, **kwargs) -> Iterable[Source]:
    '''
    Obtain source files for the Reader.

    Returns:
        an iterable of tuples that each contain a string path, and a dictionary
            with associated metadata. The metadata can contain any data that was
            extracted before reading the file itself, such as data based on the
            file path, or on a metadata file.

    Raises:
        NotImplementedError: This method needs to be implemented on child
            classes. It will raise an error by default.
    '''
    raise NotImplementedError('Reader missing sources implementation')
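
As an illustration of the shape of the return value, here is a minimal stand-alone sketch that yields `(path, metadata)` tuples for the `.csv` files in a directory. The function name `iter_sources` and the metadata key `filename` are assumptions made for this example; in a child class the same loop would form the body of `sources(self, **kwargs)`.

```python
import os
import tempfile
from typing import Dict, Iterable, Tuple


def iter_sources(directory: str) -> Iterable[Tuple[str, Dict]]:
    # Yield (path, metadata) tuples; the metadata is derived from the file name only.
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.csv'):
            path = os.path.join(directory, filename)
            yield path, {'filename': filename}


# Demonstration on a temporary directory with two CSV files and one other file.
with tempfile.TemporaryDirectory() as directory:
    for name in ('b.csv', 'a.csv', 'notes.txt'):
        open(os.path.join(directory, name), 'w').close()
    found = [metadata['filename'] for _, metadata in iter_sources(directory)]
```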

`validate()`

Validate that the reader is configured properly.

This is a good place to check parameters that are overridden in a child class. A common use case is to use self._reject_extractors to raise an error if any fields use unsupported extractor types.

Source code in ianalyzer_readers/readers/core.py

def validate(self):
    '''
    Validate that the reader is configured properly.

    This is a good place to check parameters that are overridden in a child class. A
    common use case is to use self._reject_extractors to raise an error if any fields use
    unsupported extractor types.
    '''
    pass

CSV reader

Module: ianalyzer_readers.readers.csv

This module defines the CSV reader.

Extraction is based on python's csv library.

`CSVReader`

Bases: Reader

A base class for Readers of .csv (comma separated value) files.

The CSVReader is designed for .csv or .tsv files that have a header row, and where each file may list multiple documents.

The data should be structured in one of the following ways:

one document per row (this is the default)
each document spans a number of consecutive rows. In this case, there should be a column that indicates the identity of the document.
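
The two layouts can be illustrated without the library. The sketch below is a simplified stand-in, not the reader's own implementation: it uses `itertools.groupby` to merge consecutive rows that share a value in the identifying column. The function name `group_rows` and the sample data are made up for this example.

```python
import csv
import io
from itertools import groupby

CSV_TEXT = '''play,line
Hamlet,To be
Hamlet,or not to be
Macbeth,Out damned spot
'''


def group_rows(text, field_entry=None):
    rows = list(csv.DictReader(io.StringIO(text)))
    if field_entry is None:
        # Default layout: every row is its own document.
        return [[row] for row in rows]
    # Multi-row layout: consecutive rows with the same identifier form one document.
    return [list(group) for _, group in groupby(rows, key=lambda row: row[field_entry])]


per_row = group_rows(CSV_TEXT)
per_play = group_rows(CSV_TEXT, field_entry='play')
```

Here `per_row` holds three single-row documents, while `per_play` holds one document of two rows followed by one document of a single row.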

In addition to generic extractor classes, this reader supports the CSV extractor.

Source code in ianalyzer_readers/readers/csv.py

class CSVReader(Reader):
    '''
    A base class for Readers of .csv (comma separated value) files.

    The CSVReader is designed for .csv or .tsv files that have a header row, and where
    each file may list multiple documents.

    The data should be structured in one of the following ways:

    - one document per row (this is the default)
    - each document spans a number of consecutive rows. In this case, there should be a
        column that indicates the identity of the document.

    In addition to generic extractor classes, this reader supports the `CSV` extractor.
    '''

    field_entry = None
    '''
    If applicable, the name of the column that identifies entries. Subsequent rows with the
    same value for this column are treated as a single document. If left blank, each row
    is treated as a document.
    '''

    required_field = None
    '''
    Specifies the name of a required column in the CSV data, for example the main content.
    Rows with an empty value for `required_field` will be skipped.
    '''

    delimiter = ','
    '''
    The column delimiter used in the CSV data
    '''

    skip_lines = 0
    '''
    Number of lines in the file to skip before reading the header. Can be used when files
    use a fixed "preamble", e.g. to describe metadata or provenance.
    '''


    def validate(self):
        # make sure the field size is as big as the system permits
        csv.field_size_limit(sys.maxsize)
        self._reject_extractors(extract.XML)


    @contextmanager
    def data_from_file(self, path: str):
        with open(path, 'r') as f:
            logger.info('Reading CSV file {}...'.format(path))

            # skip first n lines
            for _ in range(self.skip_lines):
                next(f)

            reader = csv.DictReader(f, delimiter=self.delimiter)
            yield reader


    def iterate_data(self, data: csv.DictReader, metadata) -> Iterable[Document]:
        document_id = None
        rows = []
        for row in data:
            is_new_document = True

            if self.required_field and not row.get(self.required_field):  # skip row if required_field is empty
                continue

            if self.field_entry:
                identifier = row[self.field_entry]
                if identifier == document_id:
                    is_new_document = False
                else:
                    document_id = identifier

            if is_new_document and rows:
                yield {'rows': rows, 'metadata': metadata}
                rows = [row]
            else:
                rows.append(row)

        yield {'rows': rows}

`delimiter = ','` `class-attribute` `instance-attribute`

The column delimiter used in the CSV data

`field_entry = None` `class-attribute` `instance-attribute`

If applicable, the name of the column that identifies entries. Subsequent rows with the same value for this column are treated as a single document. If left blank, each row is treated as a document.

`required_field = None` `class-attribute` `instance-attribute`

Specifies the name of a required column in the CSV data, for example the main content. Rows with an empty value for required_field will be skipped.

`skip_lines = 0` `class-attribute` `instance-attribute`

Number of lines in the file to skip before reading the header. Can be used when files use a fixed "preamble", e.g. to describe metadata or provenance.
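
A minimal sketch of what skipping a preamble amounts to, using only the standard `csv` module. The sample text is invented for illustration.

```python
import csv
import io

# Two preamble lines, then the header row, then one data row.
RAW = 'generated by export tool\nsource: archive\nid,text\n1,hello\n'
skip_lines = 2

handle = io.StringIO(RAW)
for _ in range(skip_lines):
    next(handle)  # discard preamble lines before the header is read

rows = list(csv.DictReader(handle))
```

With `skip_lines = 0` the first preamble line would be taken as the header instead.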

XLSX reader

Module: ianalyzer_readers.readers.xlsx

`XLSXReader`

Bases: Reader

A base class for Readers that extract data from .xlsx spreadsheets

The XLSXReader is quite rudimentary, and is designed to extract data from spreadsheets that are formatted like a CSV table, with a clear column layout. The sheet should have a header row.

The data should be structured in one of the following ways:

one document per row (this is the default)
each document spans a number of consecutive rows. In this case, there should be a column that indicates the identity of the document.

The XLSXReader will only look at the first sheet in each file.

In addition to generic extractor classes, this reader supports the CSV extractor.

Source code in ianalyzer_readers/readers/xlsx.py

class XLSXReader(Reader):
    '''
    A base class for Readers that extract data from .xlsx spreadsheets

    The XLSXReader is quite rudimentary, and is designed to extract data from
    spreadsheets that are formatted like a CSV table, with a clear column layout. The
    sheet should have a header row.

    The data should be structured in one of the following ways:

    - one document per row (this is the default)
    - each document spans a number of consecutive rows. In this case, there should be a
        column that indicates the identity of the document.

    The XLSXReader will only look at the _first_ sheet in each file.

    In addition to generic extractor classes, this reader supports the `CSV` extractor.
    '''

    field_entry = None
    '''
    If applicable, the name of the column that identifies entries. Subsequent rows with the
    same value for this column are treated as a single document. If left blank, each row
    is treated as a document.
    '''

    required_field = None
    '''
    Specifies the name of a required column, for example the main content. Rows with
    an empty value for `required_field` will be skipped.
    '''

    skip_lines = 0
    '''
    Number of lines in the sheet to skip before reading the header. Can be used when files
    use a fixed "preamble", e.g. to describe metadata or provenance.
    '''


    def validate(self):
        self._reject_extractors(extract.XML)


    def data_from_file(self, path) -> Workbook:
        logger.info('Reading XLSX file {}...'.format(path))
        return openpyxl.load_workbook(path)


    def iterate_data(self, data: Workbook, metadata: Dict):
        sheets = data.sheetnames
        sheet = data[sheets[0]]
        return self._sheet2dicts(sheet, metadata)


    def _sheet2dicts(self, sheet: Worksheet, metadata):
        '''
        Extract documents from a single worksheet
        '''

        data = (row for row in sheet.values)

        for _ in range(self.skip_lines):
            next(data)

        header = list(next(data))

        document_id = None
        rows = []

        for row in data:
            values = {
                col: value
                for col, value in zip(header, row)
            }

            # skip row if required_field is empty
            if self.required_field and not values.get(self.required_field):
                continue

            identifier = values.get(self.field_entry, None)
            is_new_document = identifier == None or identifier != document_id
            document_id = identifier

            if is_new_document and rows:
                yield {'rows': rows}
                rows = [values]
            else:
                rows.append(values)

        if rows:
            yield {'rows': rows}

`field_entry = None` `class-attribute` `instance-attribute`

If applicable, the name of the column that identifies entries. Subsequent rows with the same value for this column are treated as a single document. If left blank, each row is treated as a document.

`required_field = None` `class-attribute` `instance-attribute`

Specifies the name of a required column, for example the main content. Rows with an empty value for required_field will be skipped.

`skip_lines = 0` `class-attribute` `instance-attribute`

Number of lines in the sheet to skip before reading the header. Can be used when files use a fixed "preamble", e.g. to describe metadata or provenance.
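
The way sheet rows become dictionaries can be sketched with plain tuples standing in for the rows of a worksheet. This is an illustration under that assumption, with invented sample values; it does not open a real .xlsx file.

```python
# Tuples stand in for worksheet rows; empty cells are represented by None.
sheet_values = [
    ('Exported from catalogue', None, None),  # preamble row
    ('id', 'title', 'text'),                  # header row
    (1, 'First', 'Lorem'),
    (2, 'Second', None),                      # empty required value
]
skip_lines = 1
required_field = 'text'

data = iter(sheet_values)
for _ in range(skip_lines):
    next(data)  # skip the preamble

header = list(next(data))
rows = [dict(zip(header, row)) for row in data]

# Rows with an empty value in the required column are dropped.
kept = [row for row in rows if row.get(required_field)]
```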

XML reader

Module: ianalyzer_readers.readers.xml

This module defines the XML Reader.

Extraction is based on BeautifulSoup.

`XMLReader`

Bases: Reader

A base class for Readers that extract data from XML files.

The built-in functionality of the XML reader is quite versatile, and can be further expanded by adding custom Tag classes or extraction functions that interact directly with BeautifulSoup nodes.

The Reader is suitable for datasets where each file should be extracted as a single document, or ones where each file contains multiple documents.

In addition to generic extractor classes, this reader supports the XML extractor.

Attributes:

Name	Type	Description
`tag_toplevel`	`TagSpecification`	the top-level tag to search from in source documents.
`tag_entry`	`TagSpecification`	the tag that corresponds to a single document entry in source documents.
`external_file_tag_toplevel`	`TagSpecification`	the top-level tag to search from in external documents (if that functionality is used)
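
To show how a top-level tag and an entry tag relate, here is a small sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup. The element names and sample document are invented, and the reader itself does not use ElementTree.

```python
import xml.etree.ElementTree as ET

XML_TEXT = '''
<archive>
  <header><title>Letters</title></header>
  <body>
    <letter><author>Ann</author></letter>
    <letter><author>Bob</author></letter>
  </body>
</archive>
'''.strip()

root = ET.fromstring(XML_TEXT)
top = root.find('body')          # plays the role of tag_toplevel
entries = top.findall('letter')  # plays the role of tag_entry: one per document
authors = [entry.findtext('author') for entry in entries]
```

Each `letter` element would correspond to one extracted document, searched for within the `body` element.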

Source code in ianalyzer_readers/readers/xml.py

class XMLReader(Reader):
    '''
    A base class for Readers that extract data from XML files.

    The built-in functionality of the XML reader is quite versatile, and can be further
    expanded by adding custom Tag classes or extraction functions that interact directly with
    BeautifulSoup nodes.

    The Reader is suitable for datasets where each file should be extracted as a single
    document, or ones where each file contains multiple documents.

    In addition to generic extractor classes, this reader supports the `XML` extractor.

    Attributes:
        tag_toplevel: the top-level tag to search from in source documents.
        tag_entry: the tag that corresponds to a single document entry in source
            documents.
        external_file_tag_toplevel: the top-level tag to search from in external
            documents (if that functionality is used)

    '''

    tag_toplevel: TagSpecification = CurrentTag()
    '''
    The top-level tag in the source documents.

    Can be:

    - An XMLTag
    - A callable that takes the metadata of the document as input and returns an
        XMLTag.
    '''

    tag_entry: TagSpecification = CurrentTag()
    '''
    The tag that corresponds to a single document entry.

    Can be:

    - An XMLTag
    - A callable that takes the metadata of the document as input and returns an
        XMLTag
    '''

    external_file_tag_toplevel: TagSpecification = CurrentTag()
    '''
    The toplevel tag in external files (if you are using that functionality).

    Can be:

    - An XMLTag
    - A callable that takes the metadata of the document as input and returns an
        XMLTag. The metadata dictionary includes the values of "regular" fields for
        the document.
    '''

    def validate(self):
        # Make sure that extractors are sensible
        self._reject_extractors(extract.CSV)

    def iterate_data(self, data: bs4.BeautifulSoup, metadata: Dict) -> Iterable[Document]:
        external_soup = self._external_soup(metadata)

        # iterate through entries
        top_tag = resolve_tag_specification(self.__class__.tag_toplevel, metadata)
        bowl = top_tag.find_next_in_soup(data)

        if bowl:
            entry_tag = resolve_tag_specification(self.__class__.tag_entry, metadata)
            spoonfuls = entry_tag.find_in_soup(bowl)
            for spoon in spoonfuls:
                yield {
                    'soup_top': bowl,
                    'soup_entry': spoon,
                    'external_soup': external_soup,
                }
        else:
            logger.warning('Top-level tag not found')

    def extract_document(self, **document_data) -> Document:
        external_fields = self._external_fields()
        # fields should have unique names, but may not have stable instantiations
        # if FieldDefinitions are created on the fly, for example with class methods or @propertys
        external_fields_names = set(field.name for field in external_fields)
        regular_fields = [
            field for field in self.fields
            if field.name not in external_fields_names
        ]

        field_dict = {
            field.name: field.extractor.apply(**document_data)
            for field in regular_fields if not field.skip
        }

        external_soup = document_data.get('external_soup', None)
        metadata = document_data.get('metadata')

        if external_fields and external_soup:
            external_dict = self._external_source2dict(
                external_soup, external_fields, metadata | field_dict)
        else:
            external_dict = {
                field.name: None for field in external_fields if not field.skip
            }

        # yield the union of external fields and document fields
        return field_dict | external_dict

    def _external_fields(self) -> List[Field]:
        '''
        Subset of the reader's fields that rely on an external XML file.
        '''
        return [field for field in self.fields if
            isinstance(field.extractor, extract.XML) and field.extractor.external_file
        ]

    def _external_soup(self, metadata: Dict) -> Optional[bs4.BeautifulSoup]:
        '''
        Returns parsed tree for the external XML file, if applicable
        '''
        if any(self._external_fields()):
            if metadata and 'external_file' in metadata:
                return self.data_from_file(metadata['external_file'])
            else:
                logger.warning(
                    'Some fields have external_file property, but no external file is '
                    'provided in the source metadata'
                )

    def _external_source2dict(self, soup, external_fields: List[Field], metadata: Dict):
        '''
        given an external xml file with metadata,
        return a dictionary with tags which were found in that metadata
        wrt to the current source.
        '''
        tag = resolve_tag_specification(self.__class__.external_file_tag_toplevel, metadata)
        bowl = tag.find_next_in_soup(soup)

        if not bowl:
            logger.warning(
                'Top-level tag not found in `{}`'.format(metadata['external_file']))
            return {field.name: None for field in external_fields if not field.skip}

        return {
            field.name: field.extractor.apply(
                soup_top=bowl, soup_entry=bowl, metadata=metadata
            )
            for field in external_fields if not field.skip
        }

    def data_from_file(self, filename: str) -> bs4.BeautifulSoup:
        '''
        Returns a BeautifulSoup object for a given XML file
        '''
        # Loading XML
        logger.info('Reading XML file {} ...'.format(filename))
        with open(filename, 'rb') as f:
            data = f.read()
        logger.info('Loaded {} into memory...'.format(filename))
        return self.data_from_bytes(data)

    def data_from_bytes(self, data: bytes) -> bs4.BeautifulSoup:
        '''
        Parses the content of an XML file
        '''
        return bs4.BeautifulSoup(data, 'lxml-xml')

    def data_from_response(self, data: Response) -> bs4.BeautifulSoup:
        return bs4.BeautifulSoup(data.content, 'lxml-xml')

`external_file_tag_toplevel = CurrentTag()` `class-attribute` `instance-attribute`

The toplevel tag in external files (if you are using that functionality).

Can be:

An XMLTag
A callable that takes the metadata of the document as input and returns an XMLTag. The metadata dictionary includes the values of "regular" fields for the document.

`tag_entry = CurrentTag()` `class-attribute` `instance-attribute`

The tag that corresponds to a single document entry.

Can be:

An XMLTag
A callable that takes the metadata of the document as input and returns an XMLTag

`tag_toplevel = CurrentTag()` `class-attribute` `instance-attribute`

The top-level tag in the source documents.

Can be:

An XMLTag
A callable that takes the metadata of the document as input and returns an XMLTag.
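
The idea of a tag specification that is either a fixed value or a callable can be sketched as follows. Plain strings stand in for XMLTag objects, and `resolve` and `entry_tag_for` are hypothetical names for this example; the library's own resolution function may behave differently.

```python
from typing import Callable, Dict, Union

# Strings stand in for XMLTag objects in this sketch.
TagSpec = Union[str, Callable[[Dict], str]]


def resolve(spec: TagSpec, metadata: Dict) -> str:
    # Call the specification with the metadata if it is callable, else use it as is.
    return spec(metadata) if callable(spec) else spec


def entry_tag_for(metadata: Dict) -> str:
    # Hypothetical rule: the entry tag depends on a 'format' key in the metadata.
    return 'article' if metadata.get('format') == 'news' else 'record'


fixed = resolve('record', {})
dynamic_news = resolve(entry_tag_for, {'format': 'news'})
dynamic_default = resolve(entry_tag_for, {})
```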

`data_from_bytes(data)`

Parses the content of an XML file

Source code in ianalyzer_readers/readers/xml.py

def data_from_bytes(self, data: bytes) -> bs4.BeautifulSoup:
    '''
    Parses the content of an XML file
    '''
    return bs4.BeautifulSoup(data, 'lxml-xml')

`data_from_file(filename)`

Returns a BeautifulSoup object for a given XML file

Source code in ianalyzer_readers/readers/xml.py

def data_from_file(self, filename: str) -> bs4.BeautifulSoup:
    '''
    Returns a BeautifulSoup object for a given XML file
    '''
    # Loading XML
    logger.info('Reading XML file {} ...'.format(filename))
    with open(filename, 'rb') as f:
        data = f.read()
    logger.info('Loaded {} into memory...'.format(filename))
    return self.data_from_bytes(data)

HTML reader

Module: ianalyzer_readers.readers.html

This module defines the HTML Reader.

The HTML reader is implemented as a subclass of the XML reader, and uses BeautifulSoup to parse files.

`HTMLReader`

Bases: XMLReader

An HTML reader extracts data from HTML sources.

It is based on the XMLReader and supports the same options (tag_toplevel and tag_entry).

In addition to generic extractor classes, this reader supports the XML extractor.

Source code in ianalyzer_readers/readers/html.py

class HTMLReader(XMLReader):
    '''
    An HTML reader extracts data from HTML sources.

    It is based on the XMLReader and supports the same options (`tag_toplevel` and
    `tag_entry`).

    In addition to generic extractor classes, this reader supports the `XML` extractor.
    '''

    def validate(self):
        self._reject_extractors(extract.CSV)


    def data_from_file(self, filename: str) -> bs4.BeautifulSoup:
        logger.info('Reading HTML file {} ...'.format(filename))
        with open(filename, 'rb') as f:
            data = f.read()
        # Parsing HTML
        soup = bs4.BeautifulSoup(data, 'html.parser')
        logger.info('Loaded {} into memory ...'.format(filename))
        return soup


    def iterate_data(self, data: bs4.BeautifulSoup, metadata: Dict) -> Iterable[Document]:
        # Extract fields from soup
        tag0 = self.tag_toplevel
        tag = self.tag_entry

        bowl = tag0.find_next_in_soup(data) if tag0 else data

        # if there is a entry level tag; with html this is not always the case
        if bowl and tag:
            for spoon in tag.find_in_soup(data):
                # yield
                yield {'soup_top': bowl, 'soup_entry': spoon}
        else:
            # yield all page content
            yield {'soup_entry': data}

RDF reader

Module: ianalyzer_readers.readers.rdf

This module defines a Resource Description Framework (RDF) reader.

Extraction is based on the rdflib library.

`RDFReader`

Bases: Reader

A base class for Readers of Resource Description Framework files. These could be in Turtle, JSON-LD, RDFXML or other formats, see rdflib parsers.

Source code in ianalyzer_readers/readers/rdf.py

class RDFReader(Reader):
    '''
    A base class for Readers of Resource Description Framework files.
    These could be in Turtle, JSON-LD, RDFXML or other formats,
    see [rdflib parsers](https://rdflib.readthedocs.io/en/stable/plugin_parsers.html).
    '''

    def validate(self):
        self._reject_extractors(extract.CSV, extract.XML, extract.JSON)


    # TODO: we could also allow Response as source data here, but that would mean the response would also need to include information of the data format, see [this example](https://github.com/RDFLib/rdflib/blob/4.1.2/rdflib/graph.py#L209)

    def data_from_file(self, path) -> Graph:
        ''' Read a RDF file as indicated by source, return a graph 
        Override this function to parse multiple source files into one graph

        Parameters:
            path: the name of the file to be parsed

        Returns:
            rdflib Graph object
        '''
        logger.info(f"parsing {path}")
        g = Graph()
        g.parse(path)
        return g


    def iterate_data(self, data: Graph, metadata: Dict) -> Iterable[Document]:
        document_subjects = self.document_subjects(data)
        for subject in document_subjects:
            yield {'graph': data, 'subject': subject}


    def document_subjects(self, graph: Graph) -> Iterable[Union[BNode, Literal, URIRef]]:
        ''' Override this function to return all subjects (i.e., first part of RDF triple) 
        with which to search for data in the RDF graph.
        Typically, such subjects are identifiers or urls.

        Parameters:
            graph: the graph to parse

        Returns:
            generator or list of nodes
        '''
        return graph.subjects()

`data_from_file(path)`

Read an RDF file as indicated by source and return a graph. Override this function to parse multiple source files into one graph.

Parameters:

Name	Type	Description	Default
`path`		the name of the file to be parsed	required

Returns:

Type	Description
`Graph`	rdflib Graph object

Source code in ianalyzer_readers/readers/rdf.py

def data_from_file(self, path) -> Graph:
    ''' Read a RDF file as indicated by source, return a graph 
    Override this function to parse multiple source files into one graph

    Parameters:
        path: the name of the file to be parsed

    Returns:
        rdflib Graph object
    '''
    logger.info(f"parsing {path}")
    g = Graph()
    g.parse(path)
    return g

`document_subjects(graph)`

Override this function to return all subjects (i.e., first part of RDF triple) with which to search for data in the RDF graph. Typically, such subjects are identifiers or urls.

Parameters:

Name	Type	Description	Default
`graph`	`Graph`	the graph to parse	required

Returns:

Type	Description
`Iterable[Union[BNode, Literal, URIRef]]`	generator or list of nodes

Source code in ianalyzer_readers/readers/rdf.py

def document_subjects(self, graph: Graph) -> Iterable[Union[BNode, Literal, URIRef]]:
    ''' Override this function to return all subjects (i.e., first part of RDF triple) 
    with which to search for data in the RDF graph.
    Typically, such subjects are identifiers or urls.

    Parameters:
        graph: the graph to parse

    Returns:
        generator or list of nodes
    '''
    return graph.subjects()

`get_uri_value(node)`

A utility function to extract the last part of a URI. For instance, if the input is URIRef('https://purl.org/mynamespace/ernie') or URIRef('https://purl.org/mynamespace#ernie'), the function will return 'ernie'.

Parameters:

Name	Type	Description	Default
`node`	`URIRef`	a URIRef input node	required

Returns:

Type	Description
`str`	a string with the last element of the uri

Source code in ianalyzer_readers/readers/rdf.py

def get_uri_value(node: URIRef) -> str:
    """a utility function to extract the last part of a uri
    For instance, if the input is URIRef('https://purl.org/mynamespace/ernie'),
    or URIRef('https://purl.org/mynamespace#ernie')
    the function will return 'ernie'

    Parameters:
        node: an URIRef input node

    Returns:
        a string with the last element of the uri
    """
    return node.fragment or node.defrag().split("/")[-1]
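
The same "fragment first, else last path segment" rule can be tried on plain strings with the standard library. This sketch uses `urllib.parse.urldefrag` instead of rdflib, and `last_uri_part` is a hypothetical name for this example.

```python
from urllib.parse import urldefrag


def last_uri_part(uri: str) -> str:
    # Split off the fragment; fall back to the last path segment if there is none.
    base, fragment = urldefrag(uri)
    return fragment or base.split('/')[-1]


from_path = last_uri_part('https://purl.org/mynamespace/ernie')
from_fragment = last_uri_part('https://purl.org/mynamespace#ernie')
```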

JSON reader

Module: ianalyzer_readers.readers.json

This module defines the JSONReader.

It can parse documents nested in one file, for which it uses the pandas library, or multiple files with one document each, which use the generic Python json parser.

`JSONReader`

Bases: Reader

A base class for Readers of JSON encoded data.

The reader can either be used on a collection of JSON files (single_document=True), in which each file represents a document, or for a JSON file containing lists of documents.

If the attributes record_path and meta are set, they are used as arguments to pandas.json_normalize to unnest the JSON data.

Attributes:

Name	Type	Description
`single_document`	`bool`	indicates whether the data is organized such that a file represents a single document
`record_path`	`Optional[List[str]]`	a path or list of paths by which a list of documents can be extracted from a large JSON file; irrelevant if `single_document = True`
`meta`	`Optional[List[Union[str, List[str]]]]`	a list of paths, or list of lists of paths, by which metadata common for all documents can be located; irrelevant if `single_document = True`


Examples:

Multiple documents in one file:

example_data = {
    'path': {
        'sketch': 'Hungarian Phrasebook',
        'episode': 25,
        'to': {
            'records':
                [
                    {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                    {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'}
                ]
        }
    }
}

class MyJSONReader(JSONReader):
    record_path = ['path', 'to', 'records']
    meta = [['path', 'sketch'], ['path', 'episode']]

    speech = Field('speech', JSON('speech'))
    character = Field('character', JSON('character'))
    sketch = Field('sketch', JSON('path.sketch'))
    episode = Field('episode', JSON('path.episode'))

To define the paths used to extract the field values, consider the data format that pandas.json_normalize creates: a table in which each row represents a document, and each column corresponds to a path. Paths are either relative to the documents within record_path, or relative to the top level (meta), in which case the keys of the path are joined by dots.

row,speech,character,path.sketch,path.episode
0,"I will not buy this record. It is scratched.","tourist","Hungarian Phrasebook",25
1,"No sir. This is a tobacconist's.","tobacconist","Hungarian Phrasebook",25
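
To make the table above concrete without requiring pandas, the sketch below builds the same rows with the standard library. It is a simplified stand-in for `pandas.json_normalize`: it does not flatten nested values inside the records themselves, and `walk` and `flatten_records` are hypothetical names for this example.

```python
from functools import reduce
from typing import Any, Dict, List


def walk(data: Any, path: List[str]) -> Any:
    # Follow a list of keys down into nested dictionaries.
    return reduce(lambda node, key: node[key], path, data)


def flatten_records(data: Dict, record_path: List[str], meta: List[List[str]]) -> List[Dict]:
    # Shared metadata columns are named by joining the keys of each path with dots.
    shared = {'.'.join(path): walk(data, path) for path in meta}
    return [{**record, **shared} for record in walk(data, record_path)]


example_data = {
    'path': {
        'sketch': 'Hungarian Phrasebook',
        'episode': 25,
        'to': {
            'records': [
                {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'},
            ]
        },
    }
}

rows = flatten_records(
    example_data,
    record_path=['path', 'to', 'records'],
    meta=[['path', 'sketch'], ['path', 'episode']],
)
```

Each dictionary in `rows` has the keys `speech`, `character`, `path.sketch` and `path.episode`, which are the names the `JSON` extractors in the example refer to.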

Single document per file:

example_data = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.'
    }
}

class MyJSONReader(JSONReader):
    single_document = True

    speech = Field('speech', JSON('scene', 'speech'))
    character = Field('character', JSON('scene', 'character'))
    sketch = Field('sketch', JSON('sketch'))
    episode = Field('episode', JSON('episode'))
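
In the single-document case, the keys passed to the extractor describe a path into the nested data. The sketch below shows that kind of lookup with a hypothetical helper named `lookup`; returning `None` for a missing key is an assumption of this example, not documented behaviour of the `JSON` extractor.

```python
from functools import reduce

example_data = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.',
    },
}


def lookup(data, *keys):
    # Follow the keys one level at a time; give up with None if a level is missing.
    def step(node, key):
        return node.get(key) if isinstance(node, dict) else None
    return reduce(step, keys, data)


speech = lookup(example_data, 'scene', 'speech')
episode = lookup(example_data, 'episode')
missing = lookup(example_data, 'scene', 'location')
```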

Source code in ianalyzer_readers/readers/json.py

class JSONReader(Reader):
    '''
    A base class for Readers of JSON encoded data.

    The reader can either be used on a collection of JSON files (`single_document=True`), in which each file represents a document,
    or for a JSON file containing lists of documents.

    If the attributes `record_path` and `meta` are set, they are used as arguments to [pandas.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to unnest the JSON data.

    Attributes:
        single_document: indicates whether the data is organized such that a file represents a single document
        record_path: a path or list of paths by which a list of documents can be extracted from a large JSON file; irrelevant if `single_document = True`
        meta: a list of paths, or list of lists of paths, by which metadata common for all documents can be located; irrelevant if `single_document = True`

    Examples:
        ### Multiple documents in one file:
        ```python
        example_data = {
            'path': {
                'sketch': 'Hungarian Phrasebook',
                'episode': 25,
                'to': {
                    'records':
                        [
                            {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                            {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'}
                        ]
                }
            }
        }

        class MyJSONReader(JSONReader):
            record_path = ['path', 'to', 'records']
            meta = [['path', 'sketch'], ['path', 'episode']]

            speech = Field('speech', JSON('speech'))
            character = Field('character', JSON('character'))
            sketch = Field('sketch', JSON('path.sketch'))
            episode = Field('episode', JSON('path.episode'))
        ```
        To define the paths used to extract the field values, consider the data format that `pandas.json_normalize` creates:
        a table in which each row represents a document, and each column corresponds to a path. Paths are either relative to the
        documents within `record_path`, or relative to the top level (`meta`), in which case the keys of the path are joined by dots.
        ```csv
        row,speech,character,path.sketch,path.episode
        0,"I will not buy this record. It is scratched.","tourist","Hungarian Phrasebook",25
        1,"No sir. This is a tobacconist's.","tobacconist","Hungarian Phrasebook",25
        ```

        ### Single document per file:
        ```python
        example_data = {
            'sketch': 'Hungarian Phrasebook',
            'episode': 25,
            'scene': {
                'character': 'tourist',
                'speech': 'I will not buy this record. It is scratched.'
            }
        }

        class MyJSONReader(JSONReader):
            single_document = True

            speech = Field('speech', JSON('scene', 'speech'))
            character = Field('character', JSON('scene', 'character'))
            sketch = Field('sketch', JSON('sketch'))
            episode = Field('episode', JSON('episode'))
        ```

    '''

    single_document: bool = False
    '''
    Set to `True` if the data is structured such that one document is encoded in one .json file.
    In that case, the reader assumes that there are no lists in such a file.
    '''

    record_path: Optional[List[str]] = None
    '''
    a keyword or list of keywords by which a list of documents can be extracted from a large JSON file.
    Only relevant if `single_document=False`.
    '''

    meta: Optional[List[Union[str, List[str]]]] = None
    '''
    a list of keywords, or list of lists of keywords, by which metadata for each document can be located,
    if it is in a different path than `record_path`. Only relevant if `single_document=False`.
    '''

    def validate(self):
        self._reject_extractors(extract.XML, extract.CSV, extract.RDF)


    def iterate_data(self, data, metadata):
        if not self.single_document:
            documents = json_normalize(
                data, record_path=self.record_path, meta=self.meta
            ).to_dict('records')
        else:
            documents = [data]

        for doc in documents:
            yield {'data': doc}


    def data_from_file(self, path):
        with open(path, "r") as f:
            data = json.load(f)
        return data


    def data_from_bytes(self, bytes):
        return json.loads(bytes)


    def data_from_response(self, response):
        return response.json()

`meta = None` `class-attribute` `instance-attribute`

a list of keywords, or list of lists of keywords, by which metadata for each document can be located, if it is in a different path than record_path. Only relevant if single_document=False.

`record_path = None` `class-attribute` `instance-attribute`

a keyword or list of keywords by which a list of documents can be extracted from a large JSON file. Only relevant if single_document=False.

`single_document = False` `class-attribute` `instance-attribute`

Set to True if the data is structured such that each .json file encodes exactly one document. In that case, the reader assumes that the file contains no lists.
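
The effect of `record_path` and `meta` when `single_document=False` can be approximated in plain Python. The sketch below is a simplified stand-in for the `json_normalize` call in `iterate_data`, not the library's code: the function name `iterate_records`, the `scenes` key and the second scene are hypothetical, and only flat records and top-level `meta` keys are handled.

```python
from typing import Any, Dict, Iterator, List


def iterate_records(
    data: Dict[str, Any], record_path: List[str], meta: List[str]
) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per record, with the metadata keys copied in."""
    records = data
    for key in record_path:  # walk down to the list of documents
        records = records[key]
    shared = {key: data[key] for key in meta}  # values shared by all records
    for record in records:
        yield {**record, **shared}


# Hypothetical multi-document file: one sketch with a list of scenes
example_data = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scenes': [
        {'character': 'tourist', 'speech': 'I will not buy this record. It is scratched.'},
        {'character': 'clerk', 'speech': 'Sorry?'},
    ],
}

documents = list(iterate_records(example_data, ['scenes'], ['sketch', 'episode']))
```

Each resulting document carries both its own keys (`character`, `speech`) and the shared metadata (`sketch`, `episode`), which is the shape the `JSON` extractor then reads from.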

Extractors

Module: ianalyzer_readers.extract

This module contains extractor classes that can be used to obtain values for each field in a Reader.

Some extractors are intended to work with specific Reader classes, while others are generic.

`Backup`

Bases: Extractor

Try all given extractors in order and return the first result that evaluates as true

This is a generic extractor that can be used in any Reader.

Example usage:

Backup(Constant(None), Constant('foo'))

Since the first extractor returns None, the second extractor will be used, and the extracted value would be 'foo'.

Note the difference with Choice: Backup is based on the extracted value, Choice on the applicable parameter of each extractor.

Parameters:

Name	Type	Description	Default
`*extractors`	`Extractor`	extractors to use. These should be listed in descending order of preference.	`()`
`**kwargs`		additional options to pass on to `Extractor`.	`{}`
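
Because `Backup` tests each result for truthiness, falsy values such as `0` or `''` are skipped in the same way as `None`. The sketch below mirrors that loop with hypothetical stand-in classes (`MiniConstant`, `MiniBackup`) instead of importing the library, so it illustrates the behaviour rather than reproducing the implementation.

```python
class MiniConstant:
    """Stand-in for an extractor that always returns the same value."""

    def __init__(self, value):
        self.value = value

    def apply(self, **kwargs):
        return self.value


class MiniBackup:
    """Return the first truthy result, like the loop in `Backup._apply`."""

    def __init__(self, *extractors):
        self.extractors = list(extractors)

    def apply(self, **kwargs):
        for extractor in self.extractors:
            result = extractor.apply(**kwargs)
            if result:  # None, 0, '' and [] all fall through
                return result
        return None


first_truthy = MiniBackup(MiniConstant(None), MiniConstant('foo')).apply()
zero_skipped = MiniBackup(MiniConstant(0), MiniConstant(5)).apply()
nothing_found = MiniBackup(MiniConstant(''), MiniConstant(None)).apply()
```

If `0` or an empty string is a meaningful value for a field, `Backup` is probably not the right tool for that field.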

Source code in ianalyzer_readers/extract.py

class Backup(Extractor):
    '''
    Try all given extractors in order and return the first result that evaluates as true

    This is a generic extractor that can be used in any `Reader`.

    Example usage:

        Backup(Constant(None), Constant('foo'))

    Since the first extractor returns `None`, the second extractor will be used, and the 
    extracted value would be `'foo'`.

    Note the difference with `Choice`: `Backup` is based on the _extracted value_,
    `Choice` on the `applicable` parameter of each extractor.

    Parameters:
        *extractors: extractors to use. These should be listed in descending order of
            preference.
        **kwargs: additional options to pass on to `Extractor`.
    '''
    def __init__(self, *extractors: Extractor, **kwargs):
        self.extractors = list(extractors)
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        for extractor in self.extractors:
            result = extractor.apply(*nargs, **kwargs)
            if result:
                return result
        return None

`CSV`

Bases: Extractor

This extractor extracts values from a list of CSV or spreadsheet rows.

It should be used in readers based on CSVReader or XLSXReader.

Parameters:

Name	Type	Description	Default
`column`	`str`	The name of the column from which to extract the value.	required
`multiple`	`bool`	If a document spans multiple rows, the extracted value for a field with `multiple = True` is a list of the value in each row. If `multiple = False` (default), only the value from the first row is extracted.	`False`
`convert_to_none`	`List[str]`	optional, default is `['']`. Listed values are converted to `None`. If `None`/`False`, nothing is converted.	`['']`
`**kwargs`		additional options to pass on to `Extractor`.	`{}`
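
The interaction between `multiple` and `convert_to_none` may be easier to see on concrete rows. The function below is a hypothetical, stand-alone approximation of the extraction logic (`extract_column` is not part of the library), assuming each row is a dictionary keyed by column name.

```python
from typing import Dict, List, Optional, Sequence, Union


def extract_column(
    rows: List[Dict[str, str]],
    column: str,
    multiple: bool = False,
    convert_to_none: Sequence[str] = ('',),
) -> Union[Optional[str], List[Optional[str]]]:
    """Approximate the CSV extraction for a document made up of one or more rows."""

    def clean(value):
        # Empty values and values listed in convert_to_none become None
        return value if value and value not in convert_to_none else None

    if column not in rows[0]:
        return None  # unknown column: nothing to extract
    if multiple:
        return [clean(row[column]) for row in rows]
    return clean(rows[0][column])


# Hypothetical document spanning two rows
rows = [
    {'character': 'tourist', 'line': 'I will not buy this record.'},
    {'character': 'tourist', 'line': 'N/A'},
]

first_line = extract_column(rows, 'line')
all_lines = extract_column(rows, 'line', multiple=True, convert_to_none=['N/A'])
missing = extract_column(rows, 'no_such_column')
```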

Source code in ianalyzer_readers/extract.py

class CSV(Extractor):
    '''
    This extractor extracts values from a list of CSV or spreadsheet rows.

    It should be used in readers based on `CSVReader` or `XLSXReader`.

    Parameters:
        column: The name of the column from which to extract the value.
        multiple: If a document spans multiple rows, the extracted value for a
            field with `multiple = True` is a list of the value in each row. If
            `multiple = False` (default), only the value from the first row is extracted.
        convert_to_none: optional, default is `['']`. Listed values are converted to
            `None`. If `None`/`False`, nothing is converted.
        **kwargs: additional options to pass on to `Extractor`.
    '''
    def __init__(self,
            column: str,
            multiple: bool = False,
            convert_to_none: List[str] = [''],
            *nargs, **kwargs):
        self.field = column
        self.multiple = multiple
        self.convert_to_none = convert_to_none or []
        super().__init__(*nargs, **kwargs)

    def _apply(self, rows, *nargs, **kwargs):
        if self.field in rows[0]:
            if self.multiple:
                return [self.format(row[self.field]) for row in rows]
            else:
                row = rows[0]
                return self.format(row[self.field])

    def format(self, value):
        if value and value not in self.convert_to_none:
            return value

`Cache`

Bases: Extractor

Can be wrapped around another extractor to prevent repeatedly extracting the same value.

Assumes that the value of the extractor will be the same within a document, a source file, or even across the whole dataset.

Parameters:

Name	Type	Description	Default
`extractor`	`Extractor`	Extractor of which the value is returned and cached.	required
`level`	`str`	The level at which values should be cached. Can be `'document'`, `'source'`, or `'reader'`.	`'document'`
`**kwargs`		additional options to pass on to `Extractor`	`{}`

Note: caching is based on the extractor instance and will not work across instances. For instance, in the example below, there would be no caching across fields.

fields = [
    Field(name='foo', extractor=Cache(XML('baz'))),
    Field(name='bar', extractor=Cache(XML('baz')))
]

You could rewrite this as follows, so the XML tree is only queried once per document:

_my_extractor = Cache(XML('baz'))

fields = [
    Field(name='foo', extractor=_my_extractor),
    Field(name='bar', extractor=_my_extractor)
]

There is a similar issue when you use @property to define the fields of the reader.
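
To see why the instance has to be shared, the following sketch counts how often a wrapped extractor is actually applied. `CountingExtractor` and `MiniCache` are hypothetical stand-ins: this simplified cache remembers only the most recent key per instance, which is an assumption made for illustration and not a copy of the library's implementation.

```python
class CountingExtractor:
    """Hypothetical extractor that counts how often it is applied."""

    def __init__(self):
        self.calls = 0

    def apply(self, **kwargs):
        self.calls += 1
        return 'baz'


class MiniCache:
    """Simplified cache that remembers only the most recent key, per instance."""

    def __init__(self, extractor, level='document'):
        self.extractor = extractor
        self.level = level
        self._last_key = None
        self._last_value = None

    def apply(self, **kwargs):
        if self.level == 'document':
            key = (kwargs['source_index'], kwargs['index'])
        elif self.level == 'source':
            key = (kwargs['source_index'],)
        else:  # 'reader': one value for the whole dataset
            key = ()
        if key != self._last_key:
            self._last_value = self.extractor.apply(**kwargs)
            self._last_key = key
        return self._last_value


inner = CountingExtractor()
shared = MiniCache(inner, level='document')

# Two fields using the same instance, applied to the same document
shared.apply(source_index=0, index=0)
shared.apply(source_index=0, index=0)
calls_same_document = inner.calls

# The next document has a different key, so the value is extracted again
shared.apply(source_index=0, index=1)
calls_next_document = inner.calls
```

Two separate `MiniCache` instances would each keep their own state, so the wrapped extractor would run once per instance for every document.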

Source code in ianalyzer_readers/extract.py

class Cache(Extractor):
    '''
    Can be wrapped around another extractor to prevent repeatedly extracting the same
    value. 

    Assumes that the value of the extractor will be the same within a document, a
    source file, or even across the whole dataset.

    Parameters:
        extractor: Extractor of which the value is returned and cached.
        level: The level at which values should be cached. Can be `'document'`,
            `'source'`, or `'reader'`.
        **kwargs: additional options to pass on to `Extractor`

    Note: caching is based on the extractor instance and will not work across instances.
    For instance, in the example below, there would be no caching across fields.

    ```python
    fields = [
        Field(name='foo', extractor=Cache(XML('baz'))),
        Field(name='bar', extractor=Cache(XML('baz')))
    ]
    ```

    You could rewrite this as follows, so the XML tree is only queried once per document:

    ```python
    _my_extractor = Cache(XML('baz'))

    fields = [
        Field(name='foo', extractor=_my_extractor),
        Field(name='bar', extractor=_my_extractor)
    ]
    ```

    There is a similar issue when you use `@property` to define the `fields` of the
    reader.
    '''

    def __init__(self, extractor: Extractor, level: str = 'document', **kwargs):
        self.extractor = extractor
        self.level = level
        self.kwargs = {}
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        self.kwargs = kwargs

        if self.level == 'document':
            cache_params = [kwargs['source_index'], kwargs['index']]
        if self.level == 'source':
            cache_params = [kwargs['source_index']]
        if self.level == 'reader':
            cache_params = []

        return self._apply_cached(*cache_params)

    @lru_cache(maxsize=1)
    def _apply_cached(self, *cache_parameters):
        return self.extractor.apply(**self.kwargs)

`Choice`

Bases: Extractor

Use the first applicable extractor from a list of extractors.

This is a generic extractor that can be used in any Reader.

The Choice extractor will use the applicable property of its provided extractors to check which applies.

Example usage:

Choice(Constant('foo', applicable=some_condition), Constant('bar'))

This would extract 'foo' if some_condition is met; otherwise, the extracted value will be 'bar'.

Note the difference with Backup: Backup will select the first truthy value from a list of extractors, but Choice only checks the applicable condition. For example:

Choice(
    CSV('foo', applicable=Metadata('bar')),
    CSV('baz'),
)

Backup(
    CSV('foo', applicable=Metadata('bar')),
    CSV('baz'),
)

These extractors behave differently if the "bar" condition holds, but the "foo" field is empty. Backup will try to extract the "baz" field, Choice will not.

Parameters:

Name	Type	Description	Default
`*extractors`	`Extractor`	extractors to choose from. These should be listed in descending order of preference.	`()`
`**kwargs`		additional options to pass on to `Extractor`.	`{}`
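
The difference between the two selection strategies can be reproduced with a few stand-in classes. Everything in this sketch (`MiniExtractor`, `MiniConstant`, `MiniChoice`, `MiniBackup`) is hypothetical and only mimics the relevant control flow; an empty string plays the role of the empty "foo" field.

```python
class MiniExtractor:
    """Stand-in base class: `applicable` is another extractor or None."""

    def __init__(self, applicable=None):
        self.applicable = applicable

    def is_applicable(self, **kwargs):
        return self.applicable is None or bool(self.applicable.apply(**kwargs))

    def apply(self, **kwargs):
        return self._apply(**kwargs) if self.is_applicable(**kwargs) else None


class MiniConstant(MiniExtractor):
    def __init__(self, value, **kwargs):
        self.value = value
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        return self.value


class MiniChoice(MiniExtractor):
    def __init__(self, *extractors, **kwargs):
        self.extractors = extractors
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        for extractor in self.extractors:
            if extractor.is_applicable(**kwargs):  # only the condition is checked
                return extractor.apply(**kwargs)
        return None


class MiniBackup(MiniExtractor):
    def __init__(self, *extractors, **kwargs):
        self.extractors = extractors
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        for extractor in self.extractors:
            result = extractor.apply(**kwargs)
            if result:  # the extracted value is checked
                return result
        return None


condition_holds = MiniConstant(True)
empty_but_applicable = MiniConstant('', applicable=condition_holds)
fallback = MiniConstant('baz')

choice_result = MiniChoice(empty_but_applicable, fallback).apply()
backup_result = MiniBackup(empty_but_applicable, fallback).apply()
```

Here the choice commits to the first extractor because its condition holds and returns the empty value, while the backup moves on to the fallback.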

Source code in ianalyzer_readers/extract.py

class Choice(Extractor):
    '''
    Use the first applicable extractor from a list of extractors.

    This is a generic extractor that can be used in any `Reader`.

    The Choice extractor will use the `applicable` property of its provided extractors
    to check which applies. 

    Example usage: 

        Choice(Constant('foo', applicable=some_condition), Constant('bar'))

    This would extract `'foo'` if `some_condition` is met; otherwise,
    the extracted value will be `'bar'`.

    Note the difference with `Backup`: `Backup` will select the first truthy value from a
    list of extractors, but `Choice` only checks the `applicable` condition. For example:

        Choice(
            CSV('foo', applicable=Metadata('bar')),
            CSV('baz'),
        )

        Backup(
            CSV('foo', applicable=Metadata('bar')),
            CSV('baz'),
        )

    These extractors behave differently if the "bar" condition holds, but the "foo" field
    is empty. `Backup` will try to extract the "baz" field, `Choice` will not.

    Parameters:
        *extractors: extractors to choose from. These should be listed in descending
            order of preference.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, *extractors: Extractor, **kwargs):
        self.extractors = list(extractors)
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        for extractor in self.extractors:
            if extractor._is_applicable(*nargs, **kwargs):
                return extractor.apply(*nargs, **kwargs)
        return None

`Combined`

Bases: Extractor

Apply all given extractors and return the results as a tuple.

This is a generic extractor that can be used in any Reader.

Example usage:

Combined(Constant('foo'), Constant('bar'))

This would extract ('foo', 'bar') for each document.

Parameters:

Name	Type	Description	Default
`*extractors`	`Extractor`	extractors to combine.	`()`
`**kwargs`		additional options to pass on to `Extractor`.	`{}`
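
A common reason to combine extractors is to merge their values with a `transform`. The sketch below illustrates that pattern with hypothetical stand-ins (`MiniConstant`, `MiniCombined`); it assumes the transform receives the whole tuple as a single argument.

```python
class MiniConstant:
    """Stand-in for an extractor that always returns the same value."""

    def __init__(self, value):
        self.value = value

    def apply(self, **kwargs):
        return self.value


class MiniCombined:
    """Apply every extractor and return the results as a tuple."""

    def __init__(self, *extractors, transform=None):
        self.extractors = list(extractors)
        self.transform = transform

    def apply(self, **kwargs):
        result = tuple(extractor.apply(**kwargs) for extractor in self.extractors)
        # The optional transform sees the tuple as one value
        return self.transform(result) if self.transform else result


pair = MiniCombined(MiniConstant('foo'), MiniConstant('bar')).apply()
joined = MiniCombined(
    MiniConstant('foo'), MiniConstant('bar'), transform=' '.join
).apply()
```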

Source code in ianalyzer_readers/extract.py

class Combined(Extractor):
    '''
    Apply all given extractors and return the results as a tuple.

    This is a generic extractor that can be used in any `Reader`.

    Example usage:

        Combined(Constant('foo'), Constant('bar'))

    This would extract `('foo', 'bar')` for each document.

    Parameters:
        *extractors: extractors to combine.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, *extractors: Extractor, **kwargs):
        self.extractors = list(extractors)
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        return tuple(
            extractor.apply(*nargs, **kwargs) for extractor in self.extractors
        )

`Constant`

Bases: Extractor

This extractor 'extracts' the same value every time, regardless of input.

This is a generic extractor that can be used in any Reader.

It is especially useful in combination with Backup or Choice.

Parameters:

Name	Type	Description	Default
`value`	`Any`	the value that should be "extracted".	required
`**kwargs`		additional options to pass on to `Extractor`.	`{}`

Source code in ianalyzer_readers/extract.py

class Constant(Extractor):
    '''
    This extractor 'extracts' the same value every time, regardless of input.

    This is a generic extractor that can be used in any `Reader`.

    It is especially useful in combination with `Backup` or `Choice`.

    Parameters:
        value: the value that should be "extracted".
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, value: Any, *nargs, **kwargs):
        self.value = value
        super().__init__(*nargs, **kwargs)

    def _apply(self, *nargs, **kwargs):
        return self.value

`ExternalFile`

Bases: Extractor

General-purpose external file extractor: it opens the file and passes the stream to stream_handler, which can do whatever is needed to extract data from it. Relies on associated_file being present in the metadata. Note that the XML extractor has a built-in option for extracting data from external files (setting external_file), so you probably want that if your external file is XML.

Parameters:

Name	Type	Description	Default
`stream_handler`	`Callable`	function that will handle the opened file.	required
`**kwargs`		additional options to pass on to `Extractor`.	`{}`
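
A stream handler is simply a function that receives an open text stream and returns the extracted value. The following sketch shows one possible handler; `first_line` and `apply_handler` are hypothetical names, and `apply_handler` only imitates how the extractor opens `metadata['associated_file']` (here with a `with` block so the file is closed afterwards).

```python
import os
import tempfile


def first_line(stream) -> str:
    """Hypothetical stream handler: return the first line, stripped."""
    return stream.readline().strip()


def apply_handler(stream_handler, metadata):
    """Open `metadata['associated_file']` and hand the stream to the handler."""
    with open(metadata['associated_file'], 'r') as stream:
        return stream_handler(stream)


# Write a small file so the example can run on its own
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('Hungarian Phrasebook\nsecond line\n')
    path = f.name

title = apply_handler(first_line, {'associated_file': path})
os.remove(path)  # clean up the temporary file
```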

Source code in ianalyzer_readers/extract.py

class ExternalFile(Extractor):
    '''
    General-purpose external file extractor: it opens the file and passes the stream to
    `stream_handler`, which can do whatever is needed to extract data from it. Relies on
    `associated_file` being present in the metadata. Note that the `XML` extractor has a
    built-in option for extracting data from external files (setting `external_file`), so
    you probably want that if your external file is XML.

    Parameters:
        stream_handler: function that will handle the opened file.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, stream_handler: Callable, *nargs, **kwargs):
        super().__init__(*nargs, **kwargs)
        self.stream_handler = stream_handler

    def _apply(self, metadata, *nargs, **kwargs):
        '''
        Extract `associated_file` from metadata and call `self.stream_handler` with file stream.
        '''
        return self.stream_handler(open(metadata['associated_file'], 'r'))

`Extractor`

Bases: object

Base class for extractors.

An extractor contains a method that can be used to gather data for a field.

Parameters:

Name	Type	Description	Default
`applicable`	`Union[Extractor, Callable[[Dict], bool], None]`	optional argument to check whether the extractor can be used. This should be another extractor, which is applied first; the containing extractor is only applied if the result is truthy. Any extractor can be used, as long as it's supported by the Reader in which it's used. If left as `None`, this extractor is always applicable.	`None`
`transform`	`Optional[Callable]`	optional function to transform or postprocess the extracted value.	`None`
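
The order in which `applicable`, the extraction itself and `transform` take effect can be sketched as follows. `MiniExtractor` and `MetadataKey` are hypothetical, simplified stand-ins written for this example; they assume keyword-only arguments and an extractor-valued `applicable`.

```python
import logging

logger = logging.getLogger(__name__)


class MiniExtractor:
    """Simplified stand-in for the `Extractor` base class."""

    def __init__(self, applicable=None, transform=None):
        self.applicable = applicable
        self.transform = transform

    def apply(self, **kwargs):
        # 1. Skip the extractor if its condition is not met
        if self.applicable is not None and not self.applicable.apply(**kwargs):
            return None
        # 2. Extract the raw value
        result = self._apply(**kwargs)
        # 3. Post-process; a failing transform is logged and yields None
        if self.transform:
            try:
                return self.transform(result)
            except Exception:
                logger.exception('Value %r could not be converted.', result)
                return None
        return result

    def _apply(self, **kwargs):
        raise NotImplementedError()


class MetadataKey(MiniExtractor):
    """Hypothetical subclass: look up a key in the `metadata` dict."""

    def __init__(self, key, **kwargs):
        self.key = key
        super().__init__(**kwargs)

    def _apply(self, metadata=None, **kwargs):
        return (metadata or {}).get(self.key)


metadata = {'year': '1969', 'title': 'Flying Circus'}

year = MetadataKey('year', transform=int).apply(metadata=metadata)
bad_year = MetadataKey('title', transform=int).apply(metadata=metadata)
skipped = MetadataKey('year', applicable=MetadataKey('missing')).apply(metadata=metadata)
```

A subclass only needs to implement `_apply`; the applicability check and the transform are handled by `apply`.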

Source code in ianalyzer_readers/extract.py

class Extractor(object):
    '''
    Base class for extractors.

    An extractor contains a method that can be used to gather data for a field. 

    Parameters:
        applicable: 
            optional argument to check whether the extractor can be used. This should
            be another extractor, which is applied first; the containing extractor
            is only applied if the result is truthy. Any extractor can be used, as long as
            it's supported by the Reader in which it's used. If left as `None`, this 
            extractor is always applicable.
        transform: optional function to transform or postprocess the extracted value.
    '''

    def __init__(self,
                 applicable: Union['Extractor', Callable[[Dict], bool], None] = None,
                 transform: Optional[Callable] = None
                 ):

        if callable(applicable):
            warnings.warn(
                'Using a callable as "applicable" argument is deprecated; provide an '
                'Extractor instead',
                DeprecationWarning,
            )

        self.transform = transform
        self.applicable = applicable


    def apply(self, *nargs, **kwargs):
        '''
        Test if the extractor is applicable to the given arguments and if so,
        try to extract the information.
        '''
        if self._is_applicable(*nargs, **kwargs):
            result = self._apply(*nargs, **kwargs)
            try:
                if self.transform:
                    return self.transform(result)
            except Exception:
                logger.error(traceback.format_exc())
                logger.critical("Value {v} could not be converted."
                                .format(v=result))
                return None
            else:
                return result
        else:
            return None

    def _apply(self, *nargs, **kwargs):
        '''
        Actual extractor method to be implemented in subclasses (assume that
        testing for applicability and post-processing is taken care of).

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError()


    def _is_applicable(self, *nargs, **kwargs) -> bool:
        '''
        Checks whether the extractor is applicable, based on the condition passed as the
        `applicable` parameter.

        If no condition is provided, this is always true. If the condition is an
        Extractor, this checks whether the result is truthy.

        If the condition is a callable, it will be called with the document metadata as
        an argument. This option is deprecated; you can use the Metadata extractor to
        replace it.

        Raises:
            TypeError: Raised if the applicable parameter is an unsupported type.
        '''
        if self.applicable is None:
            return True
        if isinstance(self.applicable, Extractor):
            return bool(self.applicable.apply(*nargs, **kwargs))
        if callable(self.applicable):
            return self.applicable(kwargs.get('metadata'))
        raise TypeError(
            f'Unsupported type for "applicable" parameter: {type(self.applicable)}'
        )

`apply(*nargs, **kwargs)`

Test if the extractor is applicable to the given arguments and if so, try to extract the information.

Source code in ianalyzer_readers/extract.py

def apply(self, *nargs, **kwargs):
    '''
    Test if the extractor is applicable to the given arguments and if so,
    try to extract the information.
    '''
    if self._is_applicable(*nargs, **kwargs):
        result = self._apply(*nargs, **kwargs)
        try:
            if self.transform:
                return self.transform(result)
        except Exception:
            logger.error(traceback.format_exc())
            logger.critical("Value {v} could not be converted."
                            .format(v=result))
            return None
        else:
            return result
    else:
        return None

`JSON`

Bases: Extractor

An extractor to extract data from JSON. This extractor assumes that each source is a dictionary without nested lists. When working with nested lists, use JSONReader to unnest.

Parameters:

Name	Type	Description	Default
`keys`	`Iterable[str]`	the keys with which to retrieve a field value from the source	`()`
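
Passing several keys walks down nested dictionaries one level at a time. The helper below is a hypothetical, iterative sketch of that lookup (`get_nested` is not a library function), applied to the sample data from the `JSONReader` documentation.

```python
from typing import Any, Dict


def get_nested(data: Dict[str, Any], *keys: str) -> Any:
    """Follow `keys` one level at a time through nested dictionaries."""
    value = data
    for key in keys:
        value = value.get(key)
    return value


example_data = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.',
    },
}

character = get_nested(example_data, 'scene', 'character')
episode = get_nested(example_data, 'episode')
```

In this sketch, a missing intermediate key makes the next lookup fail, because `None` has no `get` method; only a missing final key returns `None`.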

Source code in ianalyzer_readers/extract.py

class JSON(Extractor):
    '''
    An extractor to extract data from JSON.
    This extractor assumes that each source is a dictionary without nested lists.
    When working with nested lists, use JSONReader to unnest.

    Parameters:
        keys (Iterable[str]): the keys with which to retrieve a field value from the source
    '''

    def __init__(self, *keys, **kwargs):
        self.keys = list(keys)
        super().__init__(**kwargs)

    def _apply(self, data: Union[str, dict], key_index: int = 0, **kwargs) -> str:
        key = self.keys[key_index]
        data = data.get(key)
        if len(self.keys) > key_index + 1:
            key_index += 1
            return self._apply(data, key_index)
        return data

`Metadata`

Bases: Extractor

This extractor extracts a value from provided metadata.

This is a generic extractor that can be used in any Reader.

Parameters:

Name	Type	Description	Default
`key`	`str`	the key in the metadata dictionary that should be extracted.	required
`**kwargs`		additional options to pass on to `Extractor`.	`{}`

Source code in ianalyzer_readers/extract.py

class Metadata(Extractor):
    '''
    This extractor extracts a value from provided metadata.

    This is a generic extractor that can be used in any `Reader`.

    Parameters:
        key: the key in the metadata dictionary that should be
            extracted.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, key: str, *nargs, **kwargs):
        self.key = key
        super().__init__(*nargs, **kwargs)

    def _apply(self, metadata: Dict, *nargs, **kwargs):
        return metadata.get(self.key)

`Order`

Bases: Extractor

An extractor to keep track of the order of documents. By default, this is the order of documents within their source, but you can also track the order of sources.

Parameters:

Name	Type	Description	Default
`level`	`str`	Can be `'document'` or `'source'`. Whether to return the index of the source, or of the document within the source.	`'document'`
`**kwargs`		additional options to pass on to `Extractor`.	`{}`

Source code in ianalyzer_readers/extract.py

class Order(Extractor):
    '''
    An extractor to keep track of the order of documents. By default, this is the order
    of documents within their source, but you can also track the order of sources.

    Parameters:
        level: Can be `'document'` or `'source'`. Whether to return the index of the
            source, or of the document within the source.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, level: str = 'document', **kwargs):
        self.level = level
        super().__init__(**kwargs)

    def _apply(self, index: int = None, source_index: int = None, **kwargs):
        if self.level == 'document':
            return index
        if self.level == 'source':
            return source_index

`Pass`

Bases: Extractor

An extractor that just passes the value of another extractor.

This is a generic extractor that can be used in any Reader.

This is useful if you want to stack multiple transform arguments. For example:

Pass(Constant('foo  ', transform=str.upper), transform=str.strip)

This will extract str.strip(str.upper('foo  ')), i.e. 'FOO'.

Parameters:

Name	Type	Description	Default
`extractor`	`Extractor`	the extractor of which the value should be passed	required
`**kwargs`		additional options to pass on to `Extractor`.	`{}`
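
The stacking works because the inner extractor applies its own `transform` before the outer one sees the value. A stand-alone sketch with hypothetical classes (`MiniExtractor`, `MiniConstant`, `MiniPass`) that imitates this order:

```python
class MiniExtractor:
    """Stand-in base class that applies an optional transform."""

    def __init__(self, transform=None):
        self.transform = transform

    def apply(self, **kwargs):
        result = self._apply(**kwargs)
        return self.transform(result) if self.transform else result


class MiniConstant(MiniExtractor):
    def __init__(self, value, **kwargs):
        self.value = value
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        return self.value


class MiniPass(MiniExtractor):
    def __init__(self, extractor, **kwargs):
        self.extractor = extractor
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        # The inner extractor applies its own transform first
        return self.extractor.apply(**kwargs)


stacked = MiniPass(MiniConstant('foo  ', transform=str.upper), transform=str.strip)
value = stacked.apply()
```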

Source code in ianalyzer_readers/extract.py

class Pass(Extractor):
    '''
    An extractor that just passes the value of another extractor.

    This is a generic extractor that can be used in any `Reader`.

    This is useful if you want to stack multiple `transform` arguments. For example:

        Pass(Constant('foo  ', transform=str.upper), transform=str.strip)

    This will extract `str.strip(str.upper('foo  '))`, i.e. `'FOO'`.

    Parameters:
        extractor: the extractor of which the value should be passed
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, extractor: Extractor, *nargs, **kwargs):
        self.extractor = extractor
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        return self.extractor.apply(*nargs, **kwargs)

`RDF`

Bases: Extractor

An extractor to extract data from RDF triples

Parameters:

Name	Type	Description	Default
`predicates`	`Iterable[URIRef]`	an iterable of predicates (i.e., the middle part of an RDF triple) with which to query for objects. If no predicate is passed, the current subject is returned.	`()`
`multiple`	`bool`	if `True`, return a list of all nodes for which the query returns a result; if `False`, return the first node matching the query	`False`
`is_collection`	`bool`	specify whether the data of interest is a collection, i.e., sequential data. A collection is indicated by the predicates `rdf:first` and `rdf:rest`; see the rdflib documentation.	`False`

Source code in ianalyzer_readers/extract.py

class RDF(Extractor):
    """An extractor to extract data from RDF triples

    Parameters:
        predicates:
            an iterable of predicates (i.e., the middle part of an RDF triple) with which to query for objects.
            If no predicate is passed, the current subject is returned.
        multiple:
            if `True`, return a list of all nodes for which the query returns a result;
            if `False`, return the first node matching the query
        is_collection:
            specify whether the data of interest is a collection, i.e., sequential data.
            A collection is indicated by the predicates `rdf:first` and `rdf:rest`; see the [rdflib documentation](https://rdflib.readthedocs.io/en/stable/_modules/rdflib/collection.html)

    """

    def __init__(
        self,
        *predicates: Iterable[URIRef],
        multiple: bool = False,
        is_collection: bool = False,
        **kwargs,
    ):
        self.predicates = predicates
        self.multiple = multiple
        self.is_collection = is_collection
        super().__init__(**kwargs)

    def _apply(self, graph: Graph = None, subject: BNode = None, *nargs, **kwargs) -> Union[str, List[str]]:
        ''' apply a query to the RDFReader's graph, with one subject resulting from the `document_subjects` function

        Parameters:
            graph: a graph in which to query (set on RDFReader)
            subject: the subject with which to query

        Returns:
            a string or list of strings
        '''
        if self.is_collection:
            collection = Collection(graph, subject)
            return [self._get_node_value(node) for node in list(collection)]
        nodes = self._select(graph, subject, self.predicates)
        if len(nodes) == 0:
            return None
        if self.multiple:
            return [self._get_node_value(node) for node in nodes]
        return self._get_node_value(nodes[0])

    def _select(self, graph, subject, predicates: Iterable[URIRef]) -> List[Union[Literal, URIRef, BNode]]:
        ''' search in a graph with predicates
            if more than one predicate is passed, this is a recursive query:
            the first search result of the query is used as a subject in the next query

            Parameters:
                subject: the subject with which to query
                graph: the graph to search
                predicates: a list of predicates with which to query

            Returns:
                a list of nodes matching the query
        '''
        if not predicates:
            return [subject]
        nodes = list(graph.objects(subject, predicates[0]))
        if len(predicates) > 1:
            return self._select(graph, nodes[0], predicates[1:])
        else:
            return nodes

    def _get_node_value(self, node):
        ''' return a string value extracted from the node '''
        try:
            return node.value
        except:
            return node

`XML`

Bases: Extractor

Extractor for XML data. Searches through a BeautifulSoup document.

This extractor should be used in a Reader based on XMLReader. (Note that this includes the HTMLReader.)

The XML extractor has a lot of options. When deciding how to extract a value, it usually makes sense to determine them in this order:

Choose whether to use the source file (default), or use an external XML file by setting external_file.
Choose where to start searching. The default searching point is the entry tag for the document, but you can also start from the top of the document by setting toplevel.
Describe the tag(s) you're looking for as a Tag object. You can also provide multiple tags to chain queries.
If you need to return all matching tags, rather than the first match, set multiple=True.
Choose how to extract a value: set attribute, flatten, or extract_soup_func if needed.
The extracted value is a string, or the output of extract_soup_func. To further transform it, add a function for transform.

Parameters:

Name	Type	Description	Default
`tags`	`TagSpecification`	Tags to select. Each of these can be a `Tag` object, or a callable that takes the document metadata as input and returns a `Tag`. If no tags are provided, the extractor will work from the starting tag. Tags represent a query to select tags from the current tag (e.g. the entry tag of the document). If you provide multiple, they are chained: each Tag query is applied to the results from the previous one.	`()`
`attribute`	`Optional[str]`	By default, the extractor will extract the text content of the tag. Set this property to extract the value of an attribute instead.	`None`
`flatten`	`bool`	When extracting the text content of a tag, `flatten` determines whether the contents of non-text children are flattened. If `False`, only the direct text content of the tag is extracted. This parameter does nothing if `attribute` is set.	`False`
`toplevel`	`bool`	If `True`, the extractor will search from the toplevel tag of the XML document, rather than the entry tag for the document.	`False`
`multiple`	`bool`	If `False`, the extractor will extract the first matching element. If `True`, it will extract a list of all matching elements.	`False`
`external_file`	`bool`	If `True`, the extractor will look through a secondary XML file (usually one containing metadata). It requires that the passed metadata have an `'external_file'` key that specifies the path to the file. Note: this option is not supported when this extractor is nested in another extractor (like `Combined`).	`False`
`extract_soup_func`	`Optional[Callable]`	A function to extract a value directly from the soup element, instead of using the content string or an attribute. `attribute` and `flatten` will do nothing if this property is set.	`None`
`**kwargs`		additional options to pass on to `Extractor`.	`{}`

Source code in ianalyzer_readers/extract.py

class XML(Extractor):
    '''
    Extractor for XML data. Searches through a BeautifulSoup document.

    This extractor should be used in a `Reader` based on `XMLReader`. (Note that this
    includes the `HTMLReader`.)

    The XML extractor has a lot of options. When deciding how to extract a value, it
    usually makes sense to determine them in this order:

    - Choose whether to use the source file (default), or use an external XML file by
        setting `external_file`.
    - Choose where to start searching. The default searching point is the entry tag
        for the document, but you can also start from the top of the document by setting
        `toplevel`.
    - Describe the tag(s) you're looking for as a Tag object. You can also provide multiple
        tags to chain queries. 
    - If you need to return _all_ matching tags, rather than the first match, set
        `multiple=True`.
    - Choose how to extract a value: set `attribute`, `flatten`, or `extract_soup_func`
        if needed.
    - The extracted value is a string, or the output of `extract_soup_func`. To further
        transform it, add a function for `transform`.

    Parameters:
        tags:
            Tags to select. Each of these can be a `Tag` object, or a callable that
            takes the document metadata as input and returns a `Tag`.

            If no tags are provided, the extractor will work from the starting tag.

            Tags represent a query to select tags from current tag (e.g. the entry tag of
            the document). If you provide multiple, they are chained: each Tag query is
            applied to the results from the previous one.
        attribute:
            By default, the extractor will extract the text content of the tag. Set this
            property to extract the value of an _attribute_ instead.
        flatten:
            When extracting the text content of a tag, `flatten` determines whether
            the contents of non-text children are flattened. If `False`, only the direct
            text content of the tag is extracted.

            This parameter does nothing if `attribute` is set.
        toplevel:
            If `True`, the extractor will search from the toplevel tag of the XML
            document, rather than the entry tag for the document.
        multiple:
            If `False`, the extractor will extract the first matching element. If 
            `True`, it will extract a list of all matching elements.
        external_file:
            If `True`, the extractor will look through a secondary XML file (usually one
            containing metadata). It requires that the passed metadata have an
            `'external_file'` key that specifies the path to the file.

            Note: this option is not supported when this extractor is nested in another
            extractor (like `Combined`).
        extract_soup_func: A function to extract a value directly from the soup element,
            instead of using the content string or an attribute.
            `attribute` and `flatten` will do nothing if this property is set.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self,
                 *tags: TagSpecification,
                 attribute: Optional[str] = None,
                 flatten: bool = False,
                 toplevel: bool = False,
                 multiple: bool = False,
                 external_file: bool = False,
                 extract_soup_func: Optional[Callable] = None,
                 **kwargs
                 ):

        self.tags = tags
        self.attribute = attribute
        self.flatten = flatten
        self.toplevel = toplevel
        self.multiple = multiple
        self.external_file = external_file
        self.extract_soup_func = extract_soup_func
        super().__init__(**kwargs)

    def _select(self, tags: Iterable[TagSpecification], soup: bs4.PageElement, metadata=None):
        '''
        Return the BeautifulSoup element that matches the constraints of this
        extractor.
        '''

        if len(tags) > 1:
            tag = resolve_tag_specification(tags[0], metadata)
            for element in tag.find_in_soup(soup):
                for result in self._select(tags[1:], element, metadata):
                    yield result
        elif len(tags) == 1:
            tag = resolve_tag_specification(tags[0], metadata)
            for result in tag.find_in_soup(soup):
                yield result
        else:
            yield soup


    def _apply(self, soup_top=None, soup_entry=None, **kwargs):
        results_generator = self._select(
            self.tags,
            soup_top if self.toplevel else soup_entry,
            metadata=kwargs.get('metadata')
        )

        if self.multiple:
            results = list(results_generator)
            return list(map(self._extract, results))
        else:
            result = next(results_generator, None)
            return self._extract(result)

    def _extract(self, soup: Optional[bs4.PageElement]):
        if not soup:
            return None

        # Use appropriate extractor
        if self.extract_soup_func:
            return self.extract_soup_func(soup)
        elif self.attribute:
            return self._attr(soup)
        else:
            if self.flatten:
                return self._flatten(soup)
            else:
                return self._string(soup)    

    def _string(self, soup):
        '''
        Output direct text contents of a node.
        '''

        if isinstance(soup, bs4.element.Tag):
            return soup.string
        else:
            return [node.string for node in soup]

    def _flatten(self, soup):
        '''
        Output text content of node and descendant nodes, disregarding
        underlying XML structure.
        '''

        if isinstance(soup, bs4.element.Tag):
            text = soup.get_text()
        else:
            text = '\n\n'.join(node.get_text() for node in soup)

        _softbreak = re.compile(r'(?<=\S)\n(?=\S)| +')
        _newlines = re.compile('\n+')
        _tabs = re.compile('\t+')

        return html.unescape(
            _newlines.sub(
                '\n',
                _softbreak.sub(' ', _tabs.sub('', text))
            ).strip()
        )

    def _attr(self, soup):
        '''
        Output content of nodes' attribute.
        '''

        if isinstance(soup, bs4.element.Tag):
            if self.attribute == 'name':
                return soup.name
            return soup.attrs.get(self.attribute)
        else:
            if self.attribute == 'name':
                return [ node.name for node in soup]
            return [
                node.attrs.get(self.attribute)
                for node in soup if node.attrs.get(self.attribute) is not None
            ]

XML tags

Module: ianalyzer_readers.xml_tag

This module defines the Tag class (and various subclasses).

This class is used in the XML extractor to read XML/HTML documents.

Each Tag describes a query for one or more XML tags based on their characteristics. It implements a method find_in_soup that takes an element as input and iterates over matching tags.

`CurrentTag`

Bases: Tag

A Tag query that will return the current tag.

Primarily useful as a default option.

Source code in ianalyzer_readers/xml_tag.py

class CurrentTag(Tag):
    '''
    A Tag query that will return the current tag.

    Primarily useful as a default option.
    '''

    def __init__(self):
        pass

    def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
        return [soup]

`FindParentTag`

Bases: Tag

A Tag that will find a parent tag based on query arguments.

Unlike ParentTag, this searches for a tag with a query.

For example, FindParentTag('foo') will search for a <foo> ancestor of the current tag.

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_parents()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_parents()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class FindParentTag(Tag):
    '''
    A Tag that will find a parent tag based on query arguments.

    Unlike ParentTag, this searches for a tag with a query.

    For example, `FindParentTag('foo')` will search for a `<foo>` ancestor
    of the current tag.

    Parameters:
        *args: positional arguments to pass on to `soup.find_parents()`
        **kwargs: named arguments to pass on to `soup.find_parents()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_parents(*self.args, **self.kwargs)
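
To make the query-based ancestor search concrete, here is a minimal sketch with a hypothetical `Node` class standing in for a BeautifulSoup element. The `find_parents` function below is an illustration written for this example, not the BeautifulSoup method; it assumes that ancestors are visited from the nearest one outwards.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    '''Hypothetical stand-in for an XML element with a link to its parent.'''
    name: str
    parent: Optional['Node'] = None


def find_parents(node: Node, name: str) -> Iterator[Node]:
    '''Yield every ancestor with the given name, nearest ancestor first.'''
    current = node.parent
    while current is not None:
        if current.name == name:
            yield current
        current = current.parent


outer = Node('foo')
middle = Node('bar', parent=outer)
inner = Node('foo', parent=middle)
leaf = Node('p', parent=inner)

found = list(find_parents(leaf, 'foo'))
print(len(found))         # 2
print(found[0] is inner)  # True
print(found[1] is outer)  # True
```

The number of levels between the starting element and a match does not matter here, which is the difference with a fixed-level lookup.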

`NextSiblingTag`

Bases: Tag

A Tag that will look in an element's next siblings.

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_next_siblings()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_next_siblings()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class NextSiblingTag(Tag):
    '''
    A Tag that will look in an element's next siblings.

    Parameters:
        *args: positional arguments to pass on to `soup.find_next_siblings()`
        **kwargs: named arguments to pass on to `soup.find_next_siblings()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_next_siblings(*self.args, **self.kwargs)

`NextTag`

Bases: Tag

A Tag that will look through tags following the current element.

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_all_next()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_all_next()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class NextTag(Tag):
    '''
    A Tag that will look through tags following the current element.

    Parameters:
        *args: positional arguments to pass on to `soup.find_all_next()`
        **kwargs: named arguments to pass on to `soup.find_all_next()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_all_next(*self.args, **self.kwargs)

`ParentTag`

Bases: Tag

A Tag that will select a parent tag based on a fixed level.

For example, ParentTag(2) will always go up two steps in the tree and return that tag.

Parameters:

Name	Type	Description	Default
`level`	`int`	the number of steps to move up the tree.	`1`

Source code in ianalyzer_readers/xml_tag.py

class ParentTag(Tag):
    '''
    A Tag that will select a parent tag based on a fixed level.

    For example, `ParentTag(2)` will always go up two steps in the tree
    and return that tag.

    Parameters:
        level: the number of steps to move up the tree.
    '''

    def __init__(self, level: int = 1):
        self.level = level

    def find_in_soup(self, soup: bs4.PageElement):
        count = 0
        while count < self.level:
            soup = soup.parent if soup else None
            count += 1
        return [soup]
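
The loop in `find_in_soup` above can be sketched without BeautifulSoup. The `Node` class and the `climb` function are hypothetical stand-ins that follow the same steps: move up `level` times, stop changing once there is no parent left, and always return a list with exactly one item.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    '''Hypothetical stand-in for an XML element with a link to its parent.'''
    name: str
    parent: Optional['Node'] = None


def climb(node: Optional[Node], level: int = 1) -> List[Optional[Node]]:
    '''Move up a fixed number of steps; the result is None past the top of the tree.'''
    for _ in range(level):
        node = node.parent if node else None
    return [node]


root = Node('root')
section = Node('section', parent=root)
paragraph = Node('p', parent=section)

print(climb(paragraph, 2)[0] is root)  # True
print(climb(paragraph, 5))             # [None]
```

When the tree is shallower than `level`, the list contains `None` rather than being empty. The `_extract` method of the `XML` extractor shown earlier returns `None` for such a result.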

`PreviousSiblingTag`

Bases: Tag

A Tag that will look in an element's previous siblings.

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_previous_siblings()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_previous_siblings()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class PreviousSiblingTag(Tag):
    '''
    A Tag that will look in an element's previous siblings.

    Parameters:
        *args: positional arguments to pass on to `soup.find_previous_siblings()`
        **kwargs: named arguments to pass on to `soup.find_previous_siblings()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_previous_siblings(*self.args, **self.kwargs)

`PreviousTag`

Bases: Tag

A Tag that will look through tags preceding the current element.

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_all_previous()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_all_previous()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class PreviousTag(Tag):
    '''
    A Tag that will look through tags preceding the current element.

    Parameters:
        *args: positional arguments to pass on to `soup.find_all_previous()`
        **kwargs: named arguments to pass on to `soup.find_all_previous()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_all_previous(*self.args, **self.kwargs)

`SiblingTag`

Bases: Tag

A Tag that will look in an element's siblings.

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_previous_siblings()` and `soup.find_next_siblings()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_previous_siblings()` and `soup.find_next_siblings()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class SiblingTag(Tag):
    '''
    A Tag that will look in an element's siblings.

    Parameters:
        *args: positional arguments to pass on to `soup.find_previous_siblings()`
            and `soup.find_next_siblings()`
        **kwargs: named arguments to pass on to `soup.find_previous_siblings()`
            and `soup.find_next_siblings()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        for tag in soup.find_next_siblings(*self.args, **self.kwargs):
            yield tag

        for tag in soup.find_previous_siblings(*self.args, **self.kwargs):
            yield tag
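
The order of the results follows from the two loops above: following siblings come first, then preceding siblings. The sketch below models the children of one parent as a plain list. It assumes that the preceding siblings are visited from the nearest one backwards, which is how BeautifulSoup's `find_previous_siblings()` is understood to behave; the `siblings` function itself is hypothetical.

```python
from typing import Iterator, List


def siblings(children: List[str], index: int) -> Iterator[str]:
    '''Hypothetical model of the result order for the child at position `index`.'''
    # Following siblings, in document order.
    yield from children[index + 1:]
    # Preceding siblings, nearest first (assumption, see the text above).
    yield from reversed(children[:index])


children = ['a', 'b', 'c', 'd', 'e']
print(list(siblings(children, 2)))  # ['d', 'e', 'b', 'a']
```

If document order matters for a field, the results may need to be sorted afterwards, for example in a `transform` function.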

`Tag`

Describes a query for a tag in an XML tree.

This should be used as the base class for all other tags, which can override the __init__() and find_in_soup() methods.

Tag is the most straightforward case: all arguments passed in the constructor are passed on as-is to the find_all() method of the BeautifulSoup element, searching descendants of the input tag.

See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters for different ways of searching. This includes searching by:

- a tag name (possibly as a regular expression)
- attributes of the tag
- the string content of the tag
- a function

Parameters:

Name	Type	Description	Default
`*args`	`Any`	positional arguments to pass on to `soup.find_all()`	`()`
`**kwargs`	`Any`	named arguments to pass on to `soup.find_all()`	`{}`

Source code in ianalyzer_readers/xml_tag.py

class Tag:
    '''
    Describes a query for a tag in an XML tree.

    This should be used as the base class for all other tags, which can override
    the `__init__()` and `find_in_soup()` methods.

    `Tag` is the most straightforward case: all arguments passed in the constructor
    are passed on as-is to the `find_all()` method of the BeautifulSoup element, searching
    descendants of the input tag.

    See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters for
    different ways of searching. This includes searching by:
    - a tag name (possibly as a regular expression)
    - attributes of the tag
    - the string content of the tag
    - a function

    Parameters:
        *args: positional arguments to pass on to `soup.find_all()`
        **kwargs: named arguments to pass on to `soup.find_all()`
    '''

    def __init__(self, *args: Any, **kwargs: Any):
        self.args = args
        self.kwargs = kwargs

    def find_next_in_soup(self, soup: bs4.PageElement) -> Optional[bs4.PageElement]:
        '''
        Find the first match for the tag, if any.

        Parameters:
            soup: The element to search from.

        Returns:
            The first matching tag. Returns `None` if there is no match.
        '''
        return next((tag for tag in self.find_in_soup(soup)), None)

    def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
        '''
        Find all results for this tag.

        Parameters:
            soup: The element to search from.

        Returns:
            An iterable of matching tags. Note that it is not guaranteed that the iterable
                contains any elements.

        When subclassing Tag, you will usually want to replace this method. The result
        must be an iterable. (If only one result makes sense, it's an iterable with one
        element.) If the tag may find multiple matches, it's recommended that this method
        returns a generator or a `bs4.ResultSet` rather than collecting all results up
        front.
        '''
        pool = soup.descendants if self.kwargs.get('recursive', True) else soup.children

        def strainer_helper(name=None, attrs={}, string=None, **kwargs):
            return bs4.SoupStrainer(name, attrs, string, **kwargs)
        strainer = strainer_helper(*self.args, **self.kwargs)

        yield from strainer.filter(pool)

`find_in_soup(soup)`

Find all results for this tag.

Parameters:

Name	Type	Description	Default
`soup`	`PageElement`	The element to search from.	required

Returns:

Type	Description
`Iterable[PageElement]`	An iterable of matching tags. Note that it is not guaranteed that the iterable contains any elements.

When subclassing Tag, you will usually want to replace this method. The result must be an iterable. (If only one result makes sense, it's an iterable with one element.) If the tag may find multiple matches, it's recommended that this method returns a generator or a bs4.ResultSet rather than collecting all results up front.

Source code in ianalyzer_readers/xml_tag.py

def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
    '''
    Find all results for this tag.

    Parameters:
        soup: The element to search from.

    Returns:
        An iterable of matching tags. Note that it is not guaranteed that the iterable
            contains any elements.

    When subclassing Tag, you will usually want to replace this method. The result
    must be an iterable. (If only one result makes sense, it's an iterable with one
    element.) If the tag may find multiple matches, it's recommended that this method
    returns a generator or a `bs4.ResultSet` rather than collecting all results up
    front.
    '''
    pool = soup.descendants if self.kwargs.get('recursive', True) else soup.children

    def strainer_helper(name=None, attrs={}, string=None, **kwargs):
        return bs4.SoupStrainer(name, attrs, string, **kwargs)
    strainer = strainer_helper(*self.args, **self.kwargs)

    yield from strainer.filter(pool)

`find_next_in_soup(soup)`

Find the first match for the tag, if any.

Parameters:

Name	Type	Description	Default
`soup`	`PageElement`	The element to search from.	required

Returns:

Type	Description
`Optional[PageElement]`	The first matching tag. Returns `None` if there is no match.

Source code in ianalyzer_readers/xml_tag.py

def find_next_in_soup(self, soup: bs4.PageElement) -> Optional[bs4.PageElement]:
    '''
    Find the first match for the tag, if any.

    Parameters:
        soup: The element to search from.

    Returns:
        The first matching tag. Returns `None` if there is no match.
    '''
    return next((tag for tag in self.find_in_soup(soup)), None)
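
The `next(..., None)` pattern used above is also the reason a generator is a good return type for `find_in_soup`: the search stops as soon as the first match is found. The following sketch shows this with plain integers instead of XML elements. The `matches` function and its `log` argument are hypothetical and only record how far the search got.

```python
from typing import Iterator, List, Optional


def matches(candidates: List[int], log: List[int]) -> Iterator[int]:
    '''Hypothetical lazy search: yields even numbers and logs every candidate it inspects.'''
    for candidate in candidates:
        log.append(candidate)
        if candidate % 2 == 0:
            yield candidate


log: List[int] = []
first: Optional[int] = next(matches([1, 2, 3, 4, 5, 6], log), None)
print(first)  # 2
print(log)    # [1, 2]

# Without any match, the default value is returned instead of raising StopIteration.
nothing = next(matches([1, 3], []), None)
print(nothing)  # None
```

Only the first two candidates were inspected. A function that builds a complete list first would have inspected all six.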

`TransformTag`

Bases: Tag

A Tag that will perform a transformation function.

This Tag allows you to run arbitrary code to move to anywhere in the XML tree.

Parameters:

Name	Type	Description	Default
`transform`	`Callable[[PageElement], Iterable[PageElement]]`	a function that takes an XML element as input and returns an iterable of XML elements. (Note that you can return an iterable of one, or an empty iterable, if you don't have multiple results.)	required

Source code in ianalyzer_readers/xml_tag.py

class TransformTag(Tag):
    '''
    A Tag that will perform a transformation function.

    This Tag allows you to run arbitrary code to move to anywhere in the XML tree.

    Parameters:
        transform: a function that takes an XML element as input and returns an
            iterable of XML elements. (Note that you can return an iterable of
            one, or an empty iterable, if you don't have multiple results.)
    '''

    def __init__(
            self,
            transform: Callable[[bs4.PageElement], Iterable[bs4.PageElement]],
        ):
        self.transform = transform

    def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
        return self.transform(soup)
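
A transform function must return an iterable, even if it has one result or none. The sketch below shows such a function with the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup. `TransformQuery` is a hypothetical class that copies the delegation shown in the source above, and `last_child` is an example function made up for this illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable


class TransformQuery:
    '''Hypothetical stand-in for TransformTag: delegates the search to a function.'''

    def __init__(self, transform: Callable[[ET.Element], Iterable[ET.Element]]):
        self.transform = transform

    def find(self, element: ET.Element) -> Iterable[ET.Element]:
        return self.transform(element)


def last_child(element: ET.Element) -> Iterable[ET.Element]:
    '''Return a list with the last child, or an empty list if there are no children.'''
    return list(element)[-1:]


query = TransformQuery(last_child)

filled = ET.fromstring('<list><item>one</item><item>two</item></list>')
print([item.text for item in query.find(filled)])  # ['two']

empty = ET.fromstring('<list/>')
print(list(query.find(empty)))  # []
```

Slicing with `[-1:]` keeps the return value a list in both cases, so the caller never has to deal with a bare element or with `None`.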
