API documentation

Core classes

Module: ianalyzer_readers.readers.core

This module defines the base classes on which all Readers are built.

The module defines two classes, Field and Reader.

Document = Dict[str, Any] module-attribute

Type definition for documents, defined for convenience.

Each document extracted by a Reader is a dictionary, where the keys are names of the Reader's fields, and the values are based on the extractor of each field.

Source = Union[SourceData, Tuple[SourceData, Dict]] module-attribute

Type definition for the source input to some Reader methods.

Sources are either:

  • a string with the path to a file
  • binary data with the file contents. This is not supported on all Reader subclasses
  • a requests.Response
  • a tuple of one of the above, and a dictionary with metadata

SourceData = Union[str, Response, bytes] module-attribute

Type definition of the data types a Reader method can handle.
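As a rough illustration of these aliases, the sketch below unpacks a source into its data and metadata, following the tuple check used in data_and_metadata_from_source. The helper name split_source and the sample values are invented for this example, and requests.Response is left out of the simplified aliases so that the snippet needs no third-party imports.

```python
from typing import Any, Dict, Tuple, Union

# Simplified aliases: the library's SourceData also includes requests.Response.
SourceData = Union[str, bytes]
Source = Union[SourceData, Tuple[SourceData, Dict[str, Any]]]


def split_source(source: Source) -> Tuple[SourceData, Dict[str, Any]]:
    """Return (source_data, metadata); metadata is empty when none is given."""
    if isinstance(source, tuple) and len(source) == 2:
        source_data, metadata = source
        return source_data, metadata
    return source, {}


examples = [
    'data/plays.csv',                        # path only (hypothetical path)
    b'title,year\nHamlet,1603\n',            # raw file contents
    ('data/plays.csv', {'year': 1603}),      # path plus metadata
]
for example in examples:
    print(split_source(example))
```

A bare string or bytes value therefore behaves like the same value paired with an empty metadata dictionary.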

Field

Bases: object

Fields are the elements of information that you wish to extract from each document.

Parameters:

Name Type Description Default
name str

a shorthand name, which will be used as the field's key in the document

required
extractor Extractor

an Extractor object that defines how this field's data can be extracted from source documents.

Constant(None)
required bool

whether this field is required. The Reader class should skip the document if the value for this Field is None.

False
skip bool

if True, this field will not be included in the results.

False
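The following is a minimal sketch of how the four parameters interact. It does not import the library: the Constant and Field classes here are small stand-ins written for this example, with the same attribute names as documented above, and the extraction and completeness checks mirror the logic shown in the Reader source listing rather than calling it.

```python
from typing import Any, Dict, List


class Constant:
    """Stand-in extractor that always returns the same value."""

    def __init__(self, value: Any):
        self.value = value

    def apply(self, **kwargs) -> Any:
        return self.value


class Field:
    """Stand-in with the same four attributes as the documented Field."""

    def __init__(self, name, extractor=Constant(None), required=False, skip=False):
        self.name = name
        self.extractor = extractor
        self.required = required
        self.skip = skip


fields: List[Field] = [
    Field('title', Constant('Hamlet'), required=True),
    Field('notes'),                                  # default extractor yields None
    Field('internal_id', Constant(42), skip=True),   # left out of the results
]

# Skipped fields are not extracted into the document.
document: Dict[str, Any] = {
    field.name: field.extractor.apply() for field in fields if not field.skip
}

# A document is kept only if every required field has a non-None value.
is_complete = all(
    document.get(field.name) is not None for field in fields if field.required
)

print(document)
print(is_complete)
```

Here the optional notes field is allowed to be None, while a None value for title would cause the document to be skipped.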
Source code in ianalyzer_readers/readers/core.py
class Field(object):
    '''
    Fields are the elements of information that you wish to extract from each document.

    Parameters:
        name: a shorthand name, which will be used as the field's key in the document
        extractor: an Extractor object that defines how this field's data can be
            extracted from source documents.
        required: whether this field is required. The `Reader` class should skip the
            document if the value for this Field is `None`.
        skip: if `True`, this field will not be included in the results.
    '''

    def __init__(self,
                 name: str,
                 extractor: extract.Extractor = extract.Constant(None),
                 required: bool = False,
                 skip: bool = False,
                 **kwargs
                 ):

        self.name = name
        self.extractor = extractor
        self.required = required
        self.skip = skip

Reader

A base class for readers. Readers are objects that can generate documents from a source dataset.

Subclasses of Reader can be created to read specific data formats. In practice, you will probably work with a subclass of Reader like XMLReader, CSVReader, etc., that provides the core functionality for a file type, and create a subclass for a specific dataset.

Some methods of this class need to be implemented in child classes, and will raise NotImplementedError if you try to use Reader directly.

A fully implemented Reader subclass will define how to read a dataset by describing:

  • How to obtain its source files.
  • How to parse and iterate over source files.
  • What fields each document contains, and how to extract them from the source data.

This requires implementing the following attributes/methods:

  • fields: a list of Field instances that describe the fields that will appear in documents, and how to extract their value.
  • sources: a method that returns an iterable of sources (e.g. file paths), possibly with metadata for each.
  • data_directory (optional): a string with the path to the directory containing the source data. You can use this in the implementation of sources; it's not used elsewhere.
  • data_from_file, data_from_bytes, data_from_response: methods that respectively receive a file path, a byte sequence, or an HTTP response, and return a data object. (The type of the data will depend on how you implement your reader; this could be a parsed graph, a row iterator, etc.) You must implement at least one of these methods to have a functioning reader.
  • iterate_data: method that takes a data object (the output of data_from_file/data_from_bytes/data_from_response) and a metadata dictionary, iterates over the source data, and returns the data that should be passed on to extractors for each document.
  • validate (optional): a method that will check the reader configuration. This is useful for abstract readers like the XMLReader, CSVReader, etc., so they can verify a child class is implementing attributes correctly.

Abstract reader types like CSVReader usually leave fields and sources unimplemented.
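To make the division of responsibilities concrete, here is a self-contained sketch of the same contract. It deliberately avoids importing the library: PlaysReader and Column are hypothetical stand-ins, (name, extractor) pairs take the place of Field instances, and the small documents method only approximates what the base class does for you.

```python
import csv
import io
from typing import Any, Dict, Iterable, List, Tuple


class Column:
    """Stand-in extractor: picks one column from the `row` argument."""

    def __init__(self, column: str):
        self.column = column

    def apply(self, row: Dict[str, str], **kwargs) -> Any:
        return row.get(self.column)


class PlaysReader:
    """Stand-in reader: same method names as Reader, no library import."""

    # (field name, extractor) pairs stand in for Field instances.
    fields: List[Tuple[str, Column]] = [
        ('title', Column('title')),
        ('year', Column('year')),
    ]

    def sources(self) -> Iterable[Tuple[bytes, Dict]]:
        # One in-memory source with metadata; a real reader would list files.
        yield b'title,year\nHamlet,1603\nMacbeth,1623\n', {'author': 'Shakespeare'}

    def data_from_bytes(self, data: bytes) -> csv.DictReader:
        return csv.DictReader(io.StringIO(data.decode('utf-8')))

    def iterate_data(self, data: csv.DictReader, metadata: Dict) -> Iterable[Dict]:
        # One dictionary of extractor arguments per document.
        for row in data:
            yield {'row': row}

    def documents(self) -> Iterable[Dict[str, Any]]:
        for source_data, metadata in self.sources():
            data = self.data_from_bytes(source_data)
            for arguments in self.iterate_data(data, metadata):
                yield {
                    name: extractor.apply(metadata=metadata, **arguments)
                    for name, extractor in self.fields
                }


docs = list(PlaysReader().documents())
print(docs)
```

With the real base class you would only write fields, sources, one of the data_from_* methods and iterate_data; the documents loop is inherited.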

Source code in ianalyzer_readers/readers/core.py
class Reader:
    '''
    A base class for readers. Readers are objects that can generate documents
    from a source dataset.

    Subclasses of `Reader` can be created to read specific data formats. 
    In practice, you will probably work with a subclass of `Reader` like `XMLReader`,
    `CSVReader`, etc., that provides the core functionality for a file type, and create
    a subclass for a specific dataset.

    Some methods of this class need to be implemented in child classes, and will raise
    `NotImplementedError` if you try to use `Reader` directly.

    A fully implemented `Reader` subclass will define how to read a dataset by
    describing:

    - How to obtain its source files.
    - How to parse and iterate over source files.
    - What fields each document contains, and how to extract them from the source data.

    This requires implementing the following attributes/methods:

    - `fields`: a list of `Field` instances that describe the fields that will appear in
        documents, and how to extract their value.
    - `sources`: a method that returns an iterable of sources (e.g. file paths), possibly
        with metadata for each.
    - `data_directory` (optional): a string with the path to the directory containing
        the source data. You can use this in the implementation of `sources`; it's not
        used elsewhere.
    - `data_from_file`, `data_from_bytes`, `data_from_response`: methods that respectively
        receive a file path, a byte sequence, or an HTTP response, and return a data
        object. (The type of the data will depend on how you implement your reader; this
        could be a parsed graph, a row iterator, etc.). You must implement at least one of
        these methods to have a functioning reader.
    - `iterate_data`: method that takes a data object (the output of
        `data_from_file`/`data_from_bytes`/`data_from_response`) and a metadata dictionary,
        iterates over the source data, and returns the data that should be passed on to
        extractors for each document.
    - `validate` (optional): a method that will check the reader configuration. This is
        useful for abstract readers like the `XMLReader`, `CSVReader`, etc., so they
        can verify a child class is implementing attributes correctly.

    Abstract reader types like `CSVReader` usually leave `fields` and `sources`
    unimplemented.
    '''

    @property
    def data_directory(self) -> str:
        '''
        Path to source data directory.

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError('Reader missing data_directory')


    @property
    def fields(self) -> List[Field]:
        '''
        The list of fields that are extracted from documents.

        These should be instances of the `Field` class (or implement the same API).

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError('Reader missing fields implementation')

    @property
    def fieldnames(self) -> List[str]:
        '''
        A list containing the name of each field of this Reader
        '''
        return [field.name for field in self.fields]


    @property
    def _required_field_names(self) -> List[str]:
        '''
        A list of the names of all required fields
        '''
        return [field.name for field in self.fields if field.required]


    def sources(self, **kwargs) -> Iterable[Source]:
        '''
        Obtain source files for the Reader.

        Returns:
            an iterable of tuples that each contain a string path, and a dictionary
                with associated metadata. The metadata can contain any data that was
                extracted before reading the file itself, such as data based on the
                file path, or on a metadata file.

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError('Reader missing sources implementation')

    def source2dicts(self, source: Source, source_index=-1) -> Iterable[Document]:
        '''
        Given a source file, returns an iterable of extracted documents.

        Parameters:
            source: Source to extract.

        Returns:
            an iterable of document dictionaries. Each of these is a dictionary,
                where the keys are names of this Reader's `fields`, and the values
                are based on the extractor of each field.
        '''

        self.validate()

        data, metadata = self.data_and_metadata_from_source(source)

        if isinstance(data, AbstractContextManager):
            context_manager = data
        else:
            context_manager = nullcontext(data)

        with context_manager as data:
            for index, extracted_data in enumerate(self.iterate_data(data, metadata)):
                base_data = {
                    'metadata': metadata,
                    'index': index,
                    'source_index': source_index,
                }
                document_data = base_data | extracted_data
                document = self.extract_document(**document_data)
                if self._has_required_fields(document):
                    yield document


    def data_and_metadata_from_source(self, source: Source) -> Tuple[Any, Dict]:
        '''
        Extract the data and metadata object from a source.

        Parameters:
            source: Source to extract.

        Returns:
            A tuple with the parsed source data, and the metadata (empty if none was
                provided).
        '''
        if isinstance(source, tuple) and len(source) == 2:
            source_data, metadata = source
        else:
            source_data = source
            metadata = {}

        if isinstance(source_data, str):
            if not isfile(source_data):
                raise FileNotFoundError(f'Invalid file path: {source_data}')
            data = self.data_from_file(source_data)
        elif isinstance(source_data, bytes):
            data = self.data_from_bytes(source_data)
        elif isinstance(source_data, Response):
            data = self.data_from_response(source_data)
        else:
            raise TypeError(f'Unknown source type: {type(source_data)}')

        return data, metadata


    def data_from_file(self, path: str) -> Any:
        '''
        Extract source data from a filename.

        The return type depends on how the reader is implemented, but should be some kind
        of data structure from which documents can be extracted. It serves as the input
        to `self.iterate_data`.

        This method can also return a context manager. This is especially useful to
        iterate over large files in `iterate_data`, without loading the complete file
        contents in memory.

        Tip: if you have implemented `self.data_from_bytes`, this method can probably just
        read the binary contents of the file and call that method.

        Parameters:
            path: The path to a file.

        Returns:
            A data object. The type depends on the reader implementation.

        Raises:
            NotImplementedError: this method may be implemented on child classes, but
                has no default implementation.
        '''

        raise NotImplementedError('This reader does not support filename input')


    def data_from_bytes(self, bytes: bytes) -> Any:
        '''
        Extract source data from a bytes object. Like `data_from_file`, but with bytes
        input.

        Parameters:
            bytes: byte contents of the source

        Returns:
            A data object. The type depends on the reader implementation. This may also
                be a context manager.

        Raises:
            NotImplementedError: this method may be implemented on child classes, but
                has no default implementation.
        '''

        raise NotImplementedError('This reader does not support bytes input')


    def data_from_response(self, response: Response) -> Any:
        '''
        Extract data from an HTTP response. Like `data_from_file`, but with `Response`
        input.

        Parameters:
            response: HTTP response object

        Returns:
            A data object. The type depends on the reader implementation. This may also
                be a context manager.

        Raises:
            NotImplementedError: this method may be implemented on child classes, but has
                no default implementation.
        '''
        raise NotImplementedError('This reader does not support Response input')


    def iterate_data(self, data: Any, metadata: Dict) -> Iterable[Document]:
        '''
        Iterate parsed source data, return data for each document.

        This should return the arguments that are passed on to field extractors per
        document. These usually cater to a specific extractor type. For example, the
        `CSVReader` returns an argument `rows`, which is used by the `CSV` extractor.

        The core `source2dicts` method will also provide `metadata` and `index` arguments
        to extractors, which you may override by providing them here.

        Parameters:
            data: The data object from a source. The type depends on the reader
                implementation; this is the output of `self.data_from_file` or
                `self.data_from_bytes`.
            metadata: Dictionary containing metadata for the source.

        Returns:
            An iterable of dictionaries. Each iteration will be extracted as a single
            document. The items in the dictionary are given as arguments to field
            extractors.

        Raises:
            NotImplementedError: This method must be implemented on child classes. It
                will raise an error otherwise.
        '''
        raise NotImplementedError('Data iteration is not implemented')


    def extract_document(
            self,
            **kwargs
        ) -> Document:
        '''
        Extract each field of a document, based on the raw data for the document
        '''
        return {
            field.name: field.extractor.apply(**kwargs)
            for field in self.fields
            if not field.skip
        }

    def documents(self, sources:Iterable[Source] = None) -> Iterable[Document]:
        '''
        Returns an iterable of extracted documents from source files.

        Parameters:
            sources: an iterable of paths to source files. If omitted, the reader
                class will use the value of `self.sources()` instead.

        Returns:
            an iterable of document dictionaries. Each of these is a dictionary,
                where the keys are names of this Reader's `fields`, and the values
                are based on the extractor of each field.
        '''
        sources = sources or self.sources()
        return (
            document
            for i, source in enumerate(sources)
            for document in self.source2dicts(
                source, source_index=i
            )
        )

    def export_csv(self, path: str, sources: Optional[Iterable[Source]] = None) -> None:
        '''
        Extracts documents from sources and saves them in a CSV file.

        This will write a CSV file in the provided `path`. This method has no return
        value.

        Parameters:
            path: the path where the CSV file should be saved.
            sources: an iterable of paths to source files. If omitted, the reader class
                will use the value of `self.sources()` instead.
        '''
        documents = self.documents(sources)

        with open(path, 'w') as outfile:
            writer = csv.DictWriter(outfile, self.fieldnames)
            writer.writeheader()
            for doc in documents:
                writer.writerow(doc)


    def validate(self):
        '''
        Validate that the reader is configured properly.

        This is a good place to check parameters that are overridden in a child class. A
        common use case is to use `self._reject_extractors` to raise an error if any fields
        use unsupported extractor types.
        '''
        pass

    def _reject_extractors(self, *inapplicable_extractors: extract.Extractor):
        '''
        Raise errors if any fields use any of the given extractors.

        This can be used to check that fields use extractors that match
        the Reader subclass.

        Raises:
            RuntimeError: raised when a field uses an extractor that is provided
                in the input.
        '''
        for field in self.fields:
            if isinstance(field.extractor, inapplicable_extractors):
                raise RuntimeError(
                    "Specified extractor method cannot be used with this type of data")

    def _has_required_fields(self, document: Document) -> bool:
        '''
        Check whether a document has a value for all fields marked as required.
        '''

        has_field = lambda field_name: document.get(field_name, None) is not None
        return all(
            has_field(field_name) for field_name in self._required_field_names
        )

data_directory property

Path to source data directory.

Raises:

Type Description
NotImplementedError

This method needs to be implemented on child classes. It will raise an error by default.

fieldnames property

A list containing the name of each field of this Reader

fields property

The list of fields that are extracted from documents.

These should be instances of the Field class (or implement the same API).

Raises:

Type Description
NotImplementedError

This method needs to be implemented on child classes. It will raise an error by default.

data_and_metadata_from_source(source)

Extract the data and metadata object from a source.

Parameters:

Name Type Description Default
source Source

Source to extract.

required

Returns:

Type Description
Tuple[Any, Dict]

A tuple with the parsed source data, and the metadata (empty if none was provided).

Source code in ianalyzer_readers/readers/core.py
def data_and_metadata_from_source(self, source: Source) -> Tuple[Any, Dict]:
    '''
    Extract the data and metadata object from a source.

    Parameters:
        source: Source to extract.

    Returns:
        A tuple with the parsed source data, and the metadata (empty if none was
            provided).
    '''
    if isinstance(source, tuple) and len(source) == 2:
        source_data, metadata = source
    else:
        source_data = source
        metadata = {}

    if isinstance(source_data, str):
        if not isfile(source_data):
            raise FileNotFoundError(f'Invalid file path: {source_data}')
        data = self.data_from_file(source_data)
    elif isinstance(source_data, bytes):
        data = self.data_from_bytes(source_data)
    elif isinstance(source_data, Response):
        data = self.data_from_response(source_data)
    else:
        raise TypeError(f'Unknown source type: {type(source_data)}')

    return data, metadata

data_from_bytes(bytes)

Extract source data from a bytes object. Like data_from_file, but with bytes input.

Parameters:

Name Type Description Default
bytes bytes

byte contents of the source

required

Returns:

Type Description
Any

A data object. The type depends on the reader implementation. This may also be a context manager.

Raises:

Type Description
NotImplementedError

this method may be implemented on child classes, but has no default implementation.

Source code in ianalyzer_readers/readers/core.py
def data_from_bytes(self, bytes: bytes) -> Any:
    '''
    Extract source data from a bytes object. Like `data_from_file`, but with bytes
    input.

    Parameters:
        bytes: byte contents of the source

    Returns:
        A data object. The type depends on the reader implementation. This may also
            be a context manager.

    Raises:
        NotImplementedError: this method may be implemented on child classes, but
            has no default implementation.
    '''

    raise NotImplementedError('This reader does not support bytes input')

data_from_file(path)

Extract source data from a filename.

The return type depends on how the reader is implemented, but should be some kind of data structure from which documents can be extracted. It serves as the input to self.iterate_data.

This method can also return a context manager. This is especially useful to iterate over large files in iterate_data, without loading the complete file contents in memory.

Tip: if you have implemented self.data_from_bytes, this method can probably just read the binary contents of the file and call that method.
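The tip above can be sketched as follows. BytesFirstReader is a hypothetical stand-in that shows only the two methods involved, and parsing the bytes as CSV is an assumption made for the example rather than something the base class prescribes.

```python
import csv
import io
import os
import tempfile
from typing import Dict, List


class BytesFirstReader:
    """Stand-in class showing only the two data_from_* methods."""

    def data_from_bytes(self, data: bytes) -> List[Dict[str, str]]:
        # Parse CSV content held in memory into a list of row dictionaries.
        text = data.decode('utf-8')
        return list(csv.DictReader(io.StringIO(text)))

    def data_from_file(self, path: str) -> List[Dict[str, str]]:
        # Read the binary contents and reuse the bytes implementation.
        with open(path, 'rb') as infile:
            return self.data_from_bytes(infile.read())


# Usage: write a small temporary file and read it back through data_from_file.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'plays.csv')
    with open(path, 'wb') as outfile:
        outfile.write(b'title,year\nHamlet,1603\n')
    rows = BytesFirstReader().data_from_file(path)

print(rows)
```

This approach loads the whole file into memory, so for large files returning a context manager, as described above, may be the better fit.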

Parameters:

Name Type Description Default
path str

The path to a file.

required

Returns:

Type Description
Any

A data object. The type depends on the reader implementation.

Raises:

Type Description
NotImplementedError

this method may be implemented on child classes, but has no default implementation.

Source code in ianalyzer_readers/readers/core.py
def data_from_file(self, path: str) -> Any:
    '''
    Extract source data from a filename.

    The return type depends on how the reader is implemented, but should be some kind
    of data structure from which documents can be extracted. It serves as the input
    to `self.iterate_data`.

    This method can also return a context manager. This is especially useful to
    iterate over large files in `iterate_data`, without loading the complete file
    contents in memory.

    Tip: if you have implemented `self.data_from_bytes`, this method can probably just
    read the binary contents of the file and call that method.

    Parameters:
        path: The path to a file.

    Returns:
        A data object. The type depends on the reader implementation.

    Raises:
        NotImplementedError: this method may be implemented on child classes, but
            has no default implementation.
    '''

    raise NotImplementedError('This reader does not support filename input')

data_from_response(response)

Extract data from an HTTP response. Like data_from_file, but with Response input.

Parameters:

Name Type Description Default
response Response

HTTP response object

required

Returns:

Type Description
Any

A data object. The type depends on the reader implementation. This may also be a context manager.

Raises:

Type Description
NotImplementedError

this method may be implemented on child classes, but has no default implementation.

Source code in ianalyzer_readers/readers/core.py
def data_from_response(self, response: Response) -> Any:
    '''
    Extract data from an HTTP response. Like `data_from_file`, but with `Response`
    input.

    Parameters:
        response: HTTP response object

    Returns:
        A data object. The type depends on the reader implementation. This may also
            be a context manager.

    Raises:
        NotImplementedError: this method may be implemented on child classes, but has
            no default implementation.
    '''
    raise NotImplementedError('This reader does not support Response input')

documents(sources=None)

Returns an iterable of extracted documents from source files.

Parameters:

Name Type Description Default
sources Iterable[Source]

an iterable of paths to source files. If omitted, the reader class will use the value of self.sources() instead.

None

Returns:

Type Description
Iterable[Document]

an iterable of document dictionaries. Each of these is a dictionary, where the keys are names of this Reader's fields, and the values are based on the extractor of each field.
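The sketch below imitates the flattening that documents performs, to show two consequences of the listed source: each document carries the position of its source as source_index, and a falsy sources argument (such as an empty list) falls back to the default. The module-level source2dicts function and DEFAULT_SOURCES are hypothetical stand-ins for the reader's own method and self.sources().

```python
from typing import Dict, Iterable, Iterator, Optional

DEFAULT_SOURCES = ['a.txt', 'b.txt']   # stands in for self.sources()


def source2dicts(source: str, source_index: int = -1) -> Iterator[Dict]:
    # Hypothetical stand-in: pretend each source holds two documents.
    for index in range(2):
        yield {'source': source, 'index': index, 'source_index': source_index}


def documents(sources: Optional[Iterable[str]] = None) -> Iterator[Dict]:
    sources = sources or DEFAULT_SOURCES
    return (
        document
        for i, source in enumerate(sources)
        for document in source2dicts(source, source_index=i)
    )


explicit_docs = list(documents(['x.txt', 'y.txt']))
fallback_docs = list(documents([]))   # an empty list is falsy, so defaults apply

print([doc['source_index'] for doc in explicit_docs])
print(len(fallback_docs))
```

Because the result is a generator expression, no source is read until you start iterating over it.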

Source code in ianalyzer_readers/readers/core.py
def documents(self, sources:Iterable[Source] = None) -> Iterable[Document]:
    '''
    Returns an iterable of extracted documents from source files.

    Parameters:
        sources: an iterable of paths to source files. If omitted, the reader
            class will use the value of `self.sources()` instead.

    Returns:
        an iterable of document dictionaries. Each of these is a dictionary,
            where the keys are names of this Reader's `fields`, and the values
            are based on the extractor of each field.
    '''
    sources = sources or self.sources()
    return (
        document
        for i, source in enumerate(sources)
        for document in self.source2dicts(
            source, source_index=i
        )
    )

export_csv(path, sources=None)

Extracts documents from sources and saves them in a CSV file.

This will write a CSV file in the provided path. This method has no return value.

Parameters:

Name Type Description Default
path str

the path where the CSV file should be saved.

required
sources Optional[Iterable[Source]]

an iterable of paths to source files. If omitted, the reader class will use the value of self.sources() instead.

None
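As an illustration of the output format, the sketch below runs the same csv.DictWriter steps against an in-memory buffer. The field names and documents are invented. Judging from the source listing, fieldnames covers every field while extracted documents omit fields with skip=True, so a skipped field would presumably appear as an empty column; the example assumes that situation.

```python
import csv
import io

fieldnames = ['title', 'year', 'internal_id']   # includes a skipped field
documents = [
    {'title': 'Hamlet', 'year': 1603},           # no 'internal_id' key
    {'title': 'Macbeth', 'year': None},          # None is written as empty
]

outfile = io.StringIO()
writer = csv.DictWriter(outfile, fieldnames)
writer.writeheader()
for doc in documents:
    writer.writerow(doc)

output = outfile.getvalue()
print(output)
```

Missing keys and None values both end up as empty cells, so the CSV alone cannot distinguish between them.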
Source code in ianalyzer_readers/readers/core.py
def export_csv(self, path: str, sources: Optional[Iterable[Source]] = None) -> None:
    '''
    Extracts documents from sources and saves them in a CSV file.

    This will write a CSV file in the provided `path`. This method has no return
    value.

    Parameters:
        path: the path where the CSV file should be saved.
        sources: an iterable of paths to source files. If omitted, the reader class
            will use the value of `self.sources()` instead.
    '''
    documents = self.documents(sources)

    with open(path, 'w') as outfile:
        writer = csv.DictWriter(outfile, self.fieldnames)
        writer.writeheader()
        for doc in documents:
            writer.writerow(doc)

extract_document(**kwargs)

Extract each field of a document, based on the raw data for the document

Source code in ianalyzer_readers/readers/core.py
def extract_document(
        self,
        **kwargs
    ) -> Document:
    '''
    Extract each field of a document, based on the raw data for the document
    '''
    return {
        field.name: field.extractor.apply(**kwargs)
        for field in self.fields
        if not field.skip
    }

iterate_data(data, metadata)

Iterate parsed source data, return data for each document.

This should return the arguments that are passed on to field extractors per document. These usually cater to a specific extractor type. For example, the CSVReader returns an argument rows, which is used by the CSV extractor.

The core source2dicts method will also provide metadata and index arguments to extractors, which you may override by providing them here.

Parameters:

Name Type Description Default
data Any

The data object from a source. The type depends on the reader implementation; this is the output of self.data_from_file or self.data_from_bytes.

required
metadata Dict

Dictionary containing metadata for the source.

required

Returns:

Type Description
Iterable[Document]

An iterable of dictionaries. Each iteration will be extracted as a single document. The items in the dictionary are given as arguments to field extractors.

Raises:

Type Description
NotImplementedError

This method must be implemented on child classes. It will raise an error otherwise.
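A minimal sketch of this contract, assuming a reader whose data object is a list of rows with a header. The function names and the row argument are hypothetical; the merge step copies the dictionary union used in source2dicts, which requires Python 3.9 or later.

```python
from typing import Any, Dict, Iterable, Iterator, List


def iterate_data(data: List[List[str]], metadata: Dict) -> Iterable[Dict[str, Any]]:
    # Hypothetical implementation: one document per row, skipping the header.
    for row in data[1:]:
        yield {'row': row}


def extractor_arguments(
    data: List[List[str]], metadata: Dict, source_index: int = -1
) -> Iterator[Dict[str, Any]]:
    for index, extracted_data in enumerate(iterate_data(data, metadata)):
        base_data = {
            'metadata': metadata,
            'index': index,
            'source_index': source_index,
        }
        # Keys yielded by iterate_data win on conflict.
        yield base_data | extracted_data


table = [['title', 'year'], ['Hamlet', '1603'], ['Macbeth', '1623']]
arguments = list(extractor_arguments(table, {'author': 'Shakespeare'}, source_index=0))
print(arguments[1])
```

Each resulting dictionary is what the field extractors receive as keyword arguments for one document.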

Source code in ianalyzer_readers/readers/core.py
def iterate_data(self, data: Any, metadata: Dict) -> Iterable[Document]:
    '''
    Iterate parsed source data, return data for each document.

    This should return the arguments that are passed on to field extractors per
    document. These usually cater to a specific extractor type. For example, the
    `CSVReader` returns an argument `rows`, which is used by the `CSV` extractor.

    The core `source2dicts` method will also provide `metadata` and `index` arguments
    to extractors, which you may override by providing them here.

    Parameters:
        data: The data object from a source. The type depends on the reader
            implementation; this is the output of `self.data_from_file` or
            `self.data_from_bytes`.
        metadata: Dictionary containing metadata for the source.

    Returns:
        An iterable of dictionaries. Each iteration will be extracted as a single
        document. The items in the dictionary are given as arguments to field
        extractors.

    Raises:
        NotImplementedError: This method must be implemented on child classes. It
            will raise an error otherwise.
    '''
    raise NotImplementedError('Data iteration is not implemented')

source2dicts(source, source_index=-1)

Given a source file, returns an iterable of extracted documents.

Parameters:

Name Type Description Default
source Source

Source to extract.

required

Returns:

Type Description
Iterable[Document]

an iterable of document dictionaries. Each of these is a dictionary, where the keys are names of this Reader's fields, and the values are based on the extractor of each field.
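The source listing for this method wraps plain data in contextlib.nullcontext so that context managers and ordinary data objects share one code path. The standalone sketch below isolates that pattern; read_lines is a hypothetical helper, not part of the library.

```python
import os
import tempfile
from contextlib import AbstractContextManager, nullcontext
from typing import Any, List


def read_lines(data: Any) -> List[str]:
    """Wrap plain data in nullcontext so both cases share one `with` block."""
    if isinstance(data, AbstractContextManager):
        context_manager = data
    else:
        context_manager = nullcontext(data)

    with context_manager as opened:
        return [line.rstrip('\n') for line in opened]


# Case 1: plain data (a list) is passed through unchanged.
from_list = read_lines(['first', 'second'])

# Case 2: an open file is itself a context manager and is closed on exit.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'lines.txt')
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write('first\nsecond\n')
    handle = open(path, encoding='utf-8')
    from_file = read_lines(handle)
    was_closed = handle.closed

print(from_list, from_file, was_closed)
```

This is why a data_from_file implementation can return an open file handle: it stays open while documents are extracted and is closed afterwards.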

Source code in ianalyzer_readers/readers/core.py
def source2dicts(self, source: Source, source_index=-1) -> Iterable[Document]:
    '''
    Given a source file, returns an iterable of extracted documents.

    Parameters:
        source: Source to extract.

    Returns:
        an iterable of document dictionaries. Each of these is a dictionary,
            where the keys are names of this Reader's `fields`, and the values
            are based on the extractor of each field.
    '''

    self.validate()

    data, metadata = self.data_and_metadata_from_source(source)

    if isinstance(data, AbstractContextManager):
        context_manager = data
    else:
        context_manager = nullcontext(data)

    with context_manager as data:
        for index, extracted_data in enumerate(self.iterate_data(data, metadata)):
            base_data = {
                'metadata': metadata,
                'index': index,
                'source_index': source_index,
            }
            document_data = base_data | extracted_data
            document = self.extract_document(**document_data)
            if self._has_required_fields(document):
                yield document
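
The context manager handling in `source2dicts` can be illustrated in isolation. The sketch below uses only the standard library; `managed_data` and `iterate` are hypothetical names, and the list of strings stands in for whatever a reader's `data_from_file` returns.

```python
from contextlib import AbstractContextManager, contextmanager, nullcontext

closed = []


@contextmanager
def managed_data():
    # Hypothetical data source that needs cleanup, like an open file.
    try:
        yield ['row 1', 'row 2']
    finally:
        closed.append(True)


def iterate(data):
    # Wrap plain data in nullcontext so both cases share one `with` block.
    if isinstance(data, AbstractContextManager):
        context_manager = data
    else:
        context_manager = nullcontext(data)
    with context_manager as opened:
        for item in opened:
            yield item


print(list(iterate(managed_data())))      # ['row 1', 'row 2']
print(list(iterate(['plain', 'list'])))   # ['plain', 'list']
```

This pattern lets a reader such as the `CSVReader` keep a file open while documents are being generated, while readers that load everything into memory can simply return the data.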

sources(**kwargs)

Obtain source files for the Reader.

Returns:

Type Description
Iterable[Source]

an iterable of tuples that each contain a string path, and a dictionary with associated metadata. The metadata can contain any data that was extracted before reading the file itself, such as data based on the file path, or on a metadata file.

Raises:

Type Description
NotImplementedError

This method needs to be implemented on child classes. It will raise an error by default.

Source code in ianalyzer_readers/readers/core.py
def sources(self, **kwargs) -> Iterable[Source]:
    '''
    Obtain source files for the Reader.

    Returns:
        an iterable of tuples that each contain a string path, and a dictionary
            with associated metadata. The metadata can contain any data that was
            extracted before reading the file itself, such as data based on the
            file path, or on a metadata file.

    Raises:
        NotImplementedError: This method needs to be implemented on child
            classes. It will raise an error by default.
    '''
    raise NotImplementedError('Reader missing sources implementation')
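
A `sources` implementation typically walks a directory and pairs each path with metadata derived before the file is read. The following is a sketch under stated assumptions: it is written as a standalone function taking a hypothetical `directory` argument (in a Reader subclass it would be a method with the signature shown above), and the `name` metadata key is purely illustrative.

```python
import os
import tempfile
from typing import Dict, Iterable, Tuple


def sources(directory: str) -> Iterable[Tuple[str, Dict]]:
    # Yield (path, metadata) tuples; the metadata here is derived from the
    # file name alone, before the file itself is opened.
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.csv'):
            name, _ = os.path.splitext(filename)
            yield os.path.join(directory, filename), {'name': name}


with tempfile.TemporaryDirectory() as directory:
    for filename in ['b.csv', 'a.csv', 'notes.txt']:
        open(os.path.join(directory, filename), 'w').close()
    found = list(sources(directory))

print([metadata for _, metadata in found])   # [{'name': 'a'}, {'name': 'b'}]
```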

validate()

Validate that the reader is configured properly.

This is a good place to check parameters that are overridden in a child class. A common use case is to use self._reject_extractors to raise an error if any fields use unsupported extractor types.

Source code in ianalyzer_readers/readers/core.py
def validate(self):
    '''
    Validate that the reader is configured properly.

    This is a good place to check parameters that are overridden in a child class. A
    common use case is to use self._reject_extractors to raise an error if any fields use
    unsupported extractor types.
    '''
    pass
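
The idea behind rejecting unsupported extractor types can be sketched without the library. Everything below is a stand-in: `SimpleField`, `CSVExtractor`, `XMLExtractor` and `reject_extractors` are hypothetical names, and raising `TypeError` is an assumption, since this page does not show what `_reject_extractors` raises.

```python
class CSVExtractor:
    '''Stand-in for a supported extractor type; not the library class.'''


class XMLExtractor:
    '''Stand-in for an unsupported extractor type; not the library class.'''


class SimpleField:
    '''Stand-in for a field: a name plus an extractor instance.'''

    def __init__(self, name, extractor):
        self.name = name
        self.extractor = extractor


def reject_extractors(fields, *unsupported):
    # Raise if any field uses one of the unsupported extractor types.
    for field in fields:
        if isinstance(field.extractor, unsupported):
            raise TypeError(
                f'field {field.name!r} uses unsupported extractor '
                f'{type(field.extractor).__name__}'
            )


fields = [SimpleField('text', CSVExtractor()), SimpleField('title', XMLExtractor())]

try:
    reject_extractors(fields, XMLExtractor)
except TypeError as error:
    print(error)
```

Failing early in `validate` surfaces a misconfigured field before any source file is opened, rather than partway through extraction.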

CSV reader

Module: ianalyzer_readers.readers.csv

This module defines the CSV reader.

Extraction is based on Python's csv library.

CSVReader

Bases: Reader

A base class for Readers of .csv (comma separated value) files.

The CSVReader is designed for .csv or .tsv files that have a header row, and where each file may list multiple documents.

The data should be structured in one of the following ways:

  • one document per row (this is the default)
  • each document spans a number of consecutive rows. In this case, there should be a column that indicates the identity of the document.

In addition to generic extractor classes, this reader supports the CSV extractor.

Source code in ianalyzer_readers/readers/csv.py
class CSVReader(Reader):
    '''
    A base class for Readers of .csv (comma separated value) files.

    The CSVReader is designed for .csv or .tsv files that have a header row, and where
    each file may list multiple documents.

    The data should be structured in one of the following ways:

    - one document per row (this is the default)
    - each document spans a number of consecutive rows. In this case, there should be a
        column that indicates the identity of the document.

    In addition to generic extractor classes, this reader supports the `CSV` extractor.
    '''

    field_entry = None
    '''
    If applicable, the name of the column that identifies entries. Subsequent rows with the
    same value for this column are treated as a single document. If left blank, each row
    is treated as a document.
    '''

    required_field = None
    '''
    Specifies the name of a required column in the CSV data, for example the main content.
    Rows with an empty value for `required_field` will be skipped.
    '''

    delimiter = ','
    '''
    The column delimiter used in the CSV data
    '''

    skip_lines = 0
    '''
    Number of lines in the file to skip before reading the header. Can be used when files
    use a fixed "preamble", e.g. to describe metadata or provenance.
    '''


    def validate(self):
        # make sure the field size is as big as the system permits
        csv.field_size_limit(sys.maxsize)
        self._reject_extractors(extract.XML)


    @contextmanager
    def data_from_file(self, path: str):
        with open(path, 'r') as f:
            logger.info('Reading CSV file {}...'.format(path))

            # skip first n lines
            for _ in range(self.skip_lines):
                next(f)

            reader = csv.DictReader(f, delimiter=self.delimiter)
            yield reader


    def iterate_data(self, data: csv.DictReader, metadata) -> Iterable[Document]:
        document_id = None
        rows = []
        for row in data:
            is_new_document = True

            if self.required_field and not row.get(self.required_field):  # skip row if required_field is empty
                continue

            if self.field_entry:
                identifier = row[self.field_entry]
                if identifier == document_id:
                    is_new_document = False
                else:
                    document_id = identifier

            if is_new_document and rows:
                yield {'rows': rows, 'metadata': metadata}
                rows = [row]
            else:
                rows.append(row)

        yield {'rows': rows}

delimiter = ',' class-attribute instance-attribute

The column delimiter used in the CSV data

field_entry = None class-attribute instance-attribute

If applicable, the name of the column that identifies entries. Subsequent rows with the same value for this column are treated as a single document. If left blank, each row is treated as a document.

required_field = None class-attribute instance-attribute

Specifies the name of a required column in the CSV data, for example the main content. Rows with an empty value for required_field will be skipped.

skip_lines = 0 class-attribute instance-attribute

Number of lines in the file to skip before reading the header. Can be used when files use a fixed "preamble", e.g. to describe metadata or provenance.
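
The combined effect of `field_entry`, `required_field`, `delimiter` and `skip_lines` can be illustrated with a standalone sketch. This is not the library's implementation: it uses `itertools.groupby` instead of the loop in `iterate_data`, reads from an in-memory string, and the function name `read_documents` and the sample data are hypothetical.

```python
import csv
import io
from itertools import groupby

raw = (
    'exported 2024-01-01\n'   # preamble line, skipped via skip_lines
    'id,text\n'
    '1,first line\n'
    '1,second line\n'
    '2,\n'                    # empty required field: row is dropped
    '2,other document\n'
)


def read_documents(text, field_entry=None, required_field=None,
                   delimiter=',', skip_lines=0):
    lines = io.StringIO(text)
    for _ in range(skip_lines):
        next(lines)
    rows = csv.DictReader(lines, delimiter=delimiter)
    if required_field:
        rows = (row for row in rows if row.get(required_field))
    if field_entry:
        # Consecutive rows with the same identifier form one document.
        for _, group in groupby(rows, key=lambda row: row[field_entry]):
            yield {'rows': list(group)}
    else:
        # Without an identifier column, every row is its own document.
        for row in rows:
            yield {'rows': [row]}


documents = list(read_documents(raw, field_entry='id',
                                required_field='text', skip_lines=1))
print([len(document['rows']) for document in documents])   # [2, 1]
```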

XLSX reader

Module: ianalyzer_readers.readers.xlsx

XLSXReader

Bases: Reader

A base class for Readers that extract data from .xlsx spreadsheets

The XLSXReader is quite rudimentary, and is designed to extract data from spreadsheets that are formatted like a CSV table, with a clear column layout. The sheet should have a header row.

The data should be structured in one of the following ways:

  • one document per row (this is the default)
  • each document spans a number of consecutive rows. In this case, there should be a column that indicates the identity of the document.

The XLSXReader will only look at the first sheet in each file.

In addition to generic extractor classes, this reader supports the CSV extractor.

Source code in ianalyzer_readers/readers/xlsx.py
class XLSXReader(Reader):
    '''
    A base class for Readers that extract data from .xlsx spreadsheets

    The XLSXReader is quite rudimentary, and is designed to extract data from
    spreadsheets that are formatted like a CSV table, with a clear column layout. The
    sheet should have a header row.

    The data should be structured in one of the following ways:

    - one document per row (this is the default)
    - each document spans a number of consecutive rows. In this case, there should be a
        column that indicates the identity of the document.

    The XLSXReader will only look at the _first_ sheet in each file.

    In addition to generic extractor classes, this reader supports the `CSV` extractor.
    '''

    field_entry = None
    '''
    If applicable, the name of the column that identifies entries. Subsequent rows with the
    same value for this column are treated as a single document. If left blank, each row
    is treated as a document.
    '''

    required_field = None
    '''
    Specifies the name of a required column, for example the main content. Rows with
    an empty value for `required_field` will be skipped.
    '''

    skip_lines = 0
    '''
    Number of lines in the sheet to skip before reading the header. Can be used when files
    use a fixed "preamble", e.g. to describe metadata or provenance.
    '''


    def validate(self):
        self._reject_extractors(extract.XML)


    def data_from_file(self, path) -> Workbook:
        logger.info('Reading XLSX file {}...'.format(path))
        return openpyxl.load_workbook(path)


    def iterate_data(self, data: Workbook, metadata: Dict):
        sheets = data.sheetnames
        sheet = data[sheets[0]]
        return self._sheet2dicts(sheet, metadata)


    def _sheet2dicts(self, sheet: Worksheet, metadata):
        '''
        Extract documents from a single worksheet
        '''

        data = (row for row in sheet.values)

        for _ in range(self.skip_lines):
            next(data)

        header = list(next(data))

        document_id = None
        rows = []

        for row in data:
            values = {
                col: value
                for col, value in zip(header, row)
            }

            # skip row if required_field is empty
            if self.required_field and not values.get(self.required_field):
                continue

            identifier = values.get(self.field_entry, None)
            is_new_document = identifier == None or identifier != document_id
            document_id = identifier

            if is_new_document and rows:
                yield {'rows': rows}
                rows = [values]
            else:
                rows.append(values)

        if rows:
            yield {'rows': rows}

field_entry = None class-attribute instance-attribute

If applicable, the name of the column that identifies entries. Subsequent rows with the same value for this column are treated as a single document. If left blank, each row is treated as a document.

required_field = None class-attribute instance-attribute

Specifies the name of a required column, for example the main content. Rows with an empty value for required_field will be skipped.

skip_lines = 0 class-attribute instance-attribute

Number of lines in the sheet to skip before reading the header. Can be used when files use a fixed "preamble", e.g. to describe metadata or provenance.
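
Because rows are handed to the `CSV` extractor as dictionaries keyed by the header row, it can help to see that conversion on its own. The sketch below works on plain tuples shaped like the cell values of a worksheet, so it runs without a spreadsheet library; `rows_as_dicts` and the sample rows are hypothetical.

```python
# Rows as tuples, in the shape a worksheet yields its cell values.
sheet_values = [
    ('Exported by hand', None, None),   # preamble, skipped via skip_lines
    ('id', 'speaker', 'text'),          # header row
    (1, 'A', 'Hello'),
    (1, 'A', 'How are you?'),
    (2, 'B', 'Fine, thanks.'),
]


def rows_as_dicts(values, skip_lines=0):
    data = iter(values)
    for _ in range(skip_lines):
        next(data)
    header = list(next(data))
    for row in data:
        # Pair each cell with its column name from the header.
        yield dict(zip(header, row))


rows = list(rows_as_dicts(sheet_values, skip_lines=1))
print(rows[0])   # {'id': 1, 'speaker': 'A', 'text': 'Hello'}
```

Unlike CSV data, cell values keep their spreadsheet types (here the `id` column holds integers), which is worth keeping in mind when comparing identifiers.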

XML reader

Module: ianalyzer_readers.readers.xml

This module defines the XML Reader.

Extraction is based on BeautifulSoup.

XMLReader

Bases: Reader

A base class for Readers that extract data from XML files.

The built-in functionality of the XML reader is quite versatile, and can be further expanded by adding custom Tag classes or extraction functions that interact directly with BeautifulSoup nodes.

The Reader is suitable for datasets where each file should be extracted as a single document, or ones where each file contains multiple documents.

In addition to generic extractor classes, this reader supports the XML extractor.

Attributes:

Name Type Description
tag_toplevel TagSpecification

the top-level tag to search from in source documents.

tag_entry TagSpecification

the tag that corresponds to a single document entry in source documents.

external_file_tag_toplevel TagSpecification

the top-level tag to search from in external documents (if that functionality is used)

Source code in ianalyzer_readers/readers/xml.py
class XMLReader(Reader):
    '''
    A base class for Readers that extract data from XML files.

    The built-in functionality of the XML reader is quite versatile, and can be further
    expanded by adding custom Tag classes or extraction functions that interact directly with
    BeautifulSoup nodes.

    The Reader is suitable for datasets where each file should be extracted as a single
    document, or ones where each file contains multiple documents.

    In addition to generic extractor classes, this reader supports the `XML` extractor.

    Attributes:
        tag_toplevel: the top-level tag to search from in source documents.
        tag_entry: the tag that corresponds to a single document entry in source
            documents.
        external_file_tag_toplevel: the top-level tag to search from in external
            documents (if that functionality is used)

    '''

    tag_toplevel: TagSpecification = CurrentTag()
    '''
    The top-level tag in the source documents.

    Can be:

    - An XMLTag
    - A callable that takes the metadata of the document as input and returns an
        XMLTag.
    '''

    tag_entry: TagSpecification = CurrentTag()
    '''
    The tag that corresponds to a single document entry.

    Can be:

    - An XMLTag
    - A callable that takes the metadata of the document as input and returns an
        XMLTag
    '''

    external_file_tag_toplevel: TagSpecification = CurrentTag()
    '''
    The toplevel tag in external files (if you are using that functionality).

    Can be:

    - An XMLTag
    - A callable that takes the metadata of the document as input and returns an
        XMLTag. The metadata dictionary includes the values of "regular" fields for
        the document.
    '''

    def validate(self):
        # Make sure that extractors are sensible
        self._reject_extractors(extract.CSV)

    def iterate_data(self, data: bs4.BeautifulSoup, metadata: Dict) -> Iterable[Document]:
        external_soup = self._external_soup(metadata)

        # iterate through entries
        top_tag = resolve_tag_specification(self.__class__.tag_toplevel, metadata)
        bowl = top_tag.find_next_in_soup(data)

        if bowl:
            entry_tag = resolve_tag_specification(self.__class__.tag_entry, metadata)
            spoonfuls = entry_tag.find_in_soup(bowl)
            for spoon in spoonfuls:
                yield {
                    'soup_top': bowl,
                    'soup_entry': spoon,
                    'external_soup': external_soup,
                }
        else:
            logger.warning('Top-level tag not found')

    def extract_document(self, **document_data) -> Document:
        external_fields = self._external_fields()
        # fields should have unique names, but may not have stable instantiations
        # if FieldDefinitions are created on the fly, for example with class methods or @propertys
        external_fields_names = set(field.name for field in external_fields)
        regular_fields = [
            field for field in self.fields
            if field.name not in external_fields_names
        ]

        field_dict = {
            field.name: field.extractor.apply(**document_data)
            for field in regular_fields if not field.skip
        }

        external_soup = document_data.get('external_soup', None)
        metadata = document_data.get('metadata')

        if external_fields and external_soup:
            external_dict = self._external_source2dict(
                external_soup, external_fields, metadata | field_dict)
        else:
            external_dict = {
                field.name: None for field in external_fields if not field.skip
            }

        # yield the union of external fields and document fields
        return field_dict | external_dict

    def _external_fields(self) -> List[Field]:
        '''
        Subset of the reader's fields that rely on an external XML file.
        '''
        return [field for field in self.fields if
            isinstance(field.extractor, extract.XML) and field.extractor.external_file
        ]

    def _external_soup(self, metadata: Dict) -> Optional[bs4.BeautifulSoup]:
        '''
        Returns parsed tree for the external XML file, if applicable
        '''
        if any(self._external_fields()):
            if metadata and 'external_file' in metadata:
                return self.data_from_file(metadata['external_file'])
            else:
                logger.warning(
                    'Some fields have external_file property, but no external file is '
                    'provided in the source metadata'
                )

    def _external_source2dict(self, soup, external_fields: List[Field], metadata: Dict):
        '''
        given an external xml file with metadata,
        return a dictionary with tags which were found in that metadata
        wrt to the current source.
        '''
        tag = resolve_tag_specification(self.__class__.external_file_tag_toplevel, metadata)
        bowl = tag.find_next_in_soup(soup)

        if not bowl:
            logger.warning(
                'Top-level tag not found in `{}`'.format(metadata['external_file']))
            return {field.name: None for field in external_fields if not field.skip}

        return {
            field.name: field.extractor.apply(
                soup_top=bowl, soup_entry=bowl, metadata=metadata
            )
            for field in external_fields if not field.skip
        }

    def data_from_file(self, filename: str) -> bs4.BeautifulSoup:
        '''
        Returns a BeautifulSoup object for a given XML file
        '''
        # Loading XML
        logger.info('Reading XML file {} ...'.format(filename))
        with open(filename, 'rb') as f:
            data = f.read()
        logger.info('Loaded {} into memory...'.format(filename))
        return self.data_from_bytes(data)

    def data_from_bytes(self, data: bytes) -> bs4.BeautifulSoup:
        '''
        Parses the content of an XML file
        '''
        return bs4.BeautifulSoup(data, 'lxml-xml')

    def data_from_response(self, data: Response) -> bs4.BeautifulSoup:
        return bs4.BeautifulSoup(data.content, 'lxml-xml')

external_file_tag_toplevel = CurrentTag() class-attribute instance-attribute

The toplevel tag in external files (if you are using that functionality).

Can be:

  • An XMLTag
  • A callable that takes the metadata of the document as input and returns an XMLTag. The metadata dictionary includes the values of "regular" fields for the document.

tag_entry = CurrentTag() class-attribute instance-attribute

The tag that corresponds to a single document entry.

Can be:

  • An XMLTag
  • A callable that takes the metadata of the document as input and returns an XMLTag

tag_toplevel = CurrentTag() class-attribute instance-attribute

The top-level tag in the source documents.

Can be:

  • An XMLTag
  • A callable that takes the metadata of the document as input and returns an XMLTag.
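
The relationship between the top-level tag, the entry tag and metadata-dependent specifications can be sketched with the standard library. This is an illustration only: it uses `xml.etree.ElementTree` as a stand-in for BeautifulSoup, plain path strings as a stand-in for XMLTag objects, and a hypothetical `resolve` helper to mirror the idea of a specification that is either a fixed value or a callable.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Union

TagName = Union[str, Callable[[Dict], str]]


def resolve(spec: TagName, metadata: Dict) -> str:
    # A specification is either a fixed value or a function of the metadata.
    return spec(metadata) if callable(spec) else spec


document = '''
<root>
  <play lang="en"><speech>To be</speech><speech>or not to be</speech></play>
  <play lang="nl"><speech>Zijn</speech></play>
</root>
'''


def tag_toplevel(metadata: Dict) -> str:
    # Pick the top-level element based on the source's metadata.
    return f"play[@lang='{metadata['language']}']"


tag_entry = 'speech'

metadata = {'language': 'en'}
root = ET.fromstring(document.strip())
top = root.find(resolve(tag_toplevel, metadata))       # the top-level element
entries = top.findall(resolve(tag_entry, metadata))    # one element per document

print([entry.text for entry in entries])   # ['To be', 'or not to be']
```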

data_from_bytes(data)

Parses the content of an XML file

Source code in ianalyzer_readers/readers/xml.py
def data_from_bytes(self, data: bytes) -> bs4.BeautifulSoup:
    '''
    Parses the content of an XML file
    '''
    return bs4.BeautifulSoup(data, 'lxml-xml')

data_from_file(filename)

Returns a BeautifulSoup object for a given XML file

Source code in ianalyzer_readers/readers/xml.py
def data_from_file(self, filename: str) -> bs4.BeautifulSoup:
    '''
    Returns a BeautifulSoup object for a given XML file
    '''
    # Loading XML
    logger.info('Reading XML file {} ...'.format(filename))
    with open(filename, 'rb') as f:
        data = f.read()
    logger.info('Loaded {} into memory...'.format(filename))
    return self.data_from_bytes(data)

HTML reader

Module: ianalyzer_readers.readers.html

This module defines the HTML Reader.

The HTML reader is implemented as a subclass of the XML reader, and uses BeautifulSoup to parse files.

HTMLReader

Bases: XMLReader

An HTML reader extracts data from HTML sources.

It is based on the XMLReader and supports the same options (tag_toplevel and tag_entry).

In addition to generic extractor classes, this reader supports the XML extractor.

Source code in ianalyzer_readers/readers/html.py
class HTMLReader(XMLReader):
    '''
    An HTML reader extracts data from HTML sources.

    It is based on the XMLReader and supports the same options (`tag_toplevel` and
    `tag_entry`).

    In addition to generic extractor classes, this reader supports the `XML` extractor.
    '''

    def validate(self):
        self._reject_extractors(extract.CSV)


    def data_from_file(self, filename: str) -> bs4.BeautifulSoup:
        logger.info('Reading HTML file {} ...'.format(filename))
        with open(filename, 'rb') as f:
            data = f.read()
        # Parsing HTML
        soup = bs4.BeautifulSoup(data, 'html.parser')
        logger.info('Loaded {} into memory ...'.format(filename))
        return soup


    def iterate_data(self, data: bs4.BeautifulSoup, metadata: Dict) -> Iterable[Document]:
        # Extract fields from soup
        tag0 = self.tag_toplevel
        tag = self.tag_entry

        bowl = tag0.find_next_in_soup(data) if tag0 else data

        # if there is an entry-level tag; with HTML this is not always the case
        if bowl and tag:
            for spoon in tag.find_in_soup(data):
                # yield
                yield {'soup_top': bowl, 'soup_entry': spoon}
        else:
            # yield all page content
            yield {'soup_entry': data}

RDF reader

Module: ianalyzer_readers.readers.rdf

This module defines a Resource Description Framework (RDF) reader.

Extraction is based on the rdflib library.

RDFReader

Bases: Reader

A base class for Readers of Resource Description Framework files. These could be in Turtle, JSON-LD, RDFXML or other formats, see rdflib parsers.

Source code in ianalyzer_readers/readers/rdf.py
class RDFReader(Reader):
    '''
    A base class for Readers of Resource Description Framework files.
    These could be in Turtle, JSON-LD, RDFXML or other formats,
    see [rdflib parsers](https://rdflib.readthedocs.io/en/stable/plugin_parsers.html).
    '''

    def validate(self):
        self._reject_extractors(extract.CSV, extract.XML, extract.JSON)


    # TODO: we could also allow Response as source data here, but that would mean the response would also need to include information of the data format, see [this example](https://github.com/RDFLib/rdflib/blob/4.1.2/rdflib/graph.py#L209)

    def data_from_file(self, path) -> Graph:
        ''' Read an RDF file as indicated by source, and return a graph.
        Override this function to parse multiple source files into one graph.

        Parameters:
            path: the name of the file to be parsed

        Returns:
            rdflib Graph object
        '''
        logger.info(f"parsing {path}")
        g = Graph()
        g.parse(path)
        return g


    def iterate_data(self, data: Graph, metadata: Dict) -> Iterable[Document]:
        document_subjects = self.document_subjects(data)
        for subject in document_subjects:
            yield {'graph': data, 'subject': subject}


    def document_subjects(self, graph: Graph) -> Iterable[Union[BNode, Literal, URIRef]]:
        ''' Override this function to return all subjects (i.e., first part of RDF triple) 
        with which to search for data in the RDF graph.
        Typically, such subjects are identifiers or urls.

        Parameters:
            graph: the graph to parse

        Returns:
            generator or list of nodes
        '''
        return graph.subjects()

data_from_file(path)

Read an RDF file as indicated by source, and return a graph. Override this function to parse multiple source files into one graph.

Parameters:

Name Type Description Default
path

the name of the file to be parsed

required

Returns:

Type Description
Graph

rdflib Graph object

Source code in ianalyzer_readers/readers/rdf.py
def data_from_file(self, path) -> Graph:
    ''' Read an RDF file as indicated by source, and return a graph.
    Override this function to parse multiple source files into one graph.

    Parameters:
        path: the name of the file to be parsed

    Returns:
        rdflib Graph object
    '''
    logger.info(f"parsing {path}")
    g = Graph()
    g.parse(path)
    return g

document_subjects(graph)

Override this function to return all subjects (i.e., the first part of each RDF triple) with which to search for data in the RDF graph. Typically, such subjects are identifiers or URLs.

Parameters:

Name Type Description Default
graph Graph

the graph to parse

required

Returns:

Type Description
Iterable[Union[BNode, Literal, URIRef]]

generator or list of nodes

Source code in ianalyzer_readers/readers/rdf.py
def document_subjects(self, graph: Graph) -> Iterable[Union[BNode, Literal, URIRef]]:
    ''' Override this function to return all subjects (i.e., first part of RDF triple) 
    with which to search for data in the RDF graph.
    Typically, such subjects are identifiers or urls.

    Parameters:
        graph: the graph to parse

    Returns:
        generator or list of nodes
    '''
    return graph.subjects()
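
One reason to override `document_subjects` is to treat only some subjects as documents. The sketch below illustrates that idea with triples stored as plain tuples rather than an rdflib graph, so every name in it (`triples`, the `https://example.org/` namespace, the standalone `document_subjects` function) is hypothetical.

```python
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
EX = 'https://example.org/'

# Triples as plain (subject, predicate, object) tuples; a stand-in for a graph.
triples = [
    (EX + 'letter1', RDF_TYPE, EX + 'Letter'),
    (EX + 'letter1', EX + 'text', 'Dear Bert'),
    (EX + 'ernie', RDF_TYPE, EX + 'Person'),
    (EX + 'letter2', RDF_TYPE, EX + 'Letter'),
]


def document_subjects(triples, document_type):
    # Keep only subjects declared to be of the given type, in first-seen order.
    seen = []
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj == document_type and subject not in seen:
            seen.append(subject)
    return seen


print(document_subjects(triples, EX + 'Letter'))
# ['https://example.org/letter1', 'https://example.org/letter2']
```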

get_uri_value(node)

A utility function to extract the last part of a URI. For instance, if the input is URIRef('https://purl.org/mynamespace/ernie') or URIRef('https://purl.org/mynamespace#ernie'), the function will return 'ernie'.

Parameters:

Name Type Description Default
node URIRef

a URIRef input node

required

Returns:

Type Description
str

a string with the last element of the uri

Source code in ianalyzer_readers/readers/rdf.py
def get_uri_value(node: URIRef) -> str:
    """a utility function to extract the last part of a uri
    For instance, if the input is URIRef('https://purl.org/mynamespace/ernie'),
    or URIRef('https://purl.org/mynamespace#ernie')
    the function will return 'ernie'

    Parameters:
        node: a URIRef input node

    Returns:
        a string with the last element of the uri
    """
    return node.fragment or node.defrag().split("/")[-1]
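
The same "fragment first, otherwise last path segment" logic can be tried out on plain strings with the standard library. This is a sketch that assumes string input instead of a `URIRef`; `last_uri_part` is a hypothetical name.

```python
from urllib.parse import urldefrag


def last_uri_part(uri: str) -> str:
    # Prefer the fragment after '#'; otherwise take the last path segment.
    base, fragment = urldefrag(uri)
    return fragment or base.split('/')[-1]


print(last_uri_part('https://purl.org/mynamespace/ernie'))   # ernie
print(last_uri_part('https://purl.org/mynamespace#ernie'))   # ernie
```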

JSON reader

Module: ianalyzer_readers.readers.json

This module defines the JSONReader.

It can parse documents nested in one file, for which it uses the pandas library, or multiple files with one document each, for which it uses the generic Python json parser.

JSONReader

Bases: Reader

A base class for Readers of JSON encoded data.

The reader can either be used on a collection of JSON files (single_document=True), in which each file represents a document, or for a JSON file containing lists of documents.

If the attributes record_path and meta are set, they are used as arguments to pandas.json_normalize to unnest the JSON data.

Attributes:

Name Type Description
single_document bool

indicates whether the data is organized such that a file represents a single document

record_path Optional[List[str]]

a path or list of paths by which a list of documents can be extracted from a large JSON file; irrelevant if single_document = True

meta Optional[List[Union[str, List[str]]]]

a list of paths, or list of lists of paths, by which metadata common for all documents can be located; irrelevant if single_document = True

Examples:

Multiple documents in one file:
example_data = {
    'path': {
        'sketch': 'Hungarian Phrasebook',
        'episode': 25,
        'to': {
            'records':
                [
                    {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                    {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'}
                ]
        }
    }
}

class MyJSONReader(JSONReader):
    record_path = ['path', 'to', 'records']
    meta = [['path', 'sketch'], ['path', 'episode']]

    speech = Field('speech', JSON('speech'))
    character = Field('character', JSON('character'))
    sketch = Field('sketch', JSON('path.sketch'))
    episode = Field('episode', JSON('path.episode'))

To define the paths used to extract the field values, consider the data format that pandas.json_normalize creates: a table in which each row represents a document and each column corresponds to a path, either relative to the documents within record_path or relative to the top level (meta), with the elements of each path joined by dots.

row,speech,character,path.sketch,path.episode
0,"I will not buy this record. It is scratched.","tourist","Hungarian Phrasebook",25
1,"No sir. This is a tobacconist's.","tobacconist","Hungarian Phrasebook",25
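
To make the table above concrete without requiring pandas, the following sketch reproduces the same flattening with the standard library. It is a simplified stand-in for `pandas.json_normalize`, covering only the case shown here; the helper names `follow` and `flatten` are hypothetical.

```python
from typing import Any, Dict, List


def follow(data: Any, path: List[str]) -> Any:
    # Walk down nested dictionaries along a list of keys.
    for key in path:
        data = data[key]
    return data


def flatten(data: Dict, record_path: List[str], meta: List[List[str]]) -> List[Dict]:
    rows = []
    for record in follow(data, record_path):
        row = dict(record)
        for path in meta:
            # Top-level values are repeated on every row, keyed by dotted path.
            row['.'.join(path)] = follow(data, path)
        rows.append(row)
    return rows


example_data = {
    'path': {
        'sketch': 'Hungarian Phrasebook',
        'episode': 25,
        'to': {
            'records': [
                {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'},
            ]
        },
    }
}

rows = flatten(
    example_data,
    record_path=['path', 'to', 'records'],
    meta=[['path', 'sketch'], ['path', 'episode']],
)
print(rows[1]['character'], rows[1]['path.episode'])   # tobacconist 25
```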

Single document per file:
example_data = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.'
    }
}

class MyJSONReader(JSONReader):
    single_document = True

    speech = Field('speech', JSON('scene', 'speech'))
    character = Field('character', JSON('scene', 'character'))
    sketch = Field('sketch', JSON('sketch'))
    episode = Field('episode', JSON('episode'))

Source code in ianalyzer_readers/readers/json.py
class JSONReader(Reader):
    '''
    A base class for Readers of JSON encoded data.

    The reader can either be used on a collection of JSON files (`single_document=True`), in which each file represents a document,
    or for a JSON file containing lists of documents.

    If the attributes `record_path` and `meta` are set, they are used as arguments to [pandas.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to unnest the JSON data.

    Attributes:
        single_document: indicates whether the data is organized such that a file represents a single document
        record_path: a path or list of paths by which a list of documents can be extracted from a large JSON file; irrelevant if `single_document = True`
        meta: a list of paths, or list of lists of paths, by which metadata common for all documents can be located; irrelevant if `single_document = True`

    Examples:
        ### Multiple documents in one file:
        ```python
        example_data = {
            'path': {
                'sketch': 'Hungarian Phrasebook',
                'episode': 25,
                'to': {
                    'records':
                        [
                            {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                            {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'}
                        ]
                }
            }
        }

        class MyJSONReader(JSONReader):
            record_path = ['path', 'to', 'records']
            meta = [['path', 'sketch'], ['path', 'episode']]

            speech = Field('speech', JSON('speech'))
            character = Field('character', JSON('character'))
            sketch = Field('sketch', JSON('path.sketch'))
            episode = Field('episode', JSON('path.episode'))
        ```
        To define the paths used to extract the field values, consider the data format that `pandas.json_normalize` creates:
        a table in which each row represents a document and each column corresponds to a path, either relative to the documents within `record_path`
        or relative to the top level (`meta`), with each list of keys joined by dots.
        ```csv
        row,speech,character,path.sketch,path.episode
        0,"I will not buy this record. It is scratched.","tourist","Hungarian Phrasebook",25
        1,"No sir. This is a tobacconist's.","tobacconist","Hungarian Phrasebook",25
        ```

        ### Single document per file:
        ```python
        example_data = {
            'sketch': 'Hungarian Phrasebook',
            'episode': 25,
            'scene': {
                'character': 'tourist',
                'speech': 'I will not buy this record. It is scratched.'
            }
        }

        class MyJSONReader(JSONReader):
            single_document = True

            speech = Field('speech', JSON('scene', 'speech'))
            character = Field('character', JSON('scene', 'character'))
            sketch = Field('sketch', JSON('sketch'))
            episode = Field('episode', JSON('episode'))
        ```

    '''

    single_document: bool = False
    '''
    set to `True` if the data is structured such that one document is encoded in one .json file;
    in that case, the reader assumes that there are no lists in such a file
    '''

    record_path: Optional[List[str]] = None
    '''
    a keyword or list of keywords by which a list of documents can be extracted from a large JSON file.
    Only relevant if `single_document=False`.
    '''

    meta: Optional[List[Union[str, List[str]]]] = None
    '''
    a list of keywords, or list of lists of keywords, by which metadata for each document can be located,
    if it is in a different path than `record_path`. Only relevant if `single_document=False`.
    '''

    def validate(self):
        self._reject_extractors(extract.XML, extract.CSV, extract.RDF)


    def iterate_data(self, data, metadata):
        if not self.single_document:
            documents = json_normalize(
                data, record_path=self.record_path, meta=self.meta
            ).to_dict('records')
        else:
            documents = [data]

        for doc in documents:
            yield {'data': doc}


    def data_from_file(self, path):
        with open(path, "r") as f:
            data = json.load(f)
        return data


    def data_from_bytes(self, bytes):
        return json.loads(bytes)


    def data_from_response(self, response):
        return response.json()

meta = None class-attribute instance-attribute

a list of keywords, or list of lists of keywords, by which metadata for each document can be located, if it is in a different path than record_path. Only relevant if single_document=False.

record_path = None class-attribute instance-attribute

a keyword or list of keywords by which a list of documents can be extracted from a large JSON file. Only relevant if single_document=False.

single_document = False class-attribute instance-attribute

set to True if the data is structured such that one document is encoded in one .json file; in that case, the reader assumes that there are no lists in such a file

Extractors

Module: ianalyzer_readers.extract

This module contains extractor classes that can be used to obtain values for each field in a Reader.

Some extractors are intended to work with specific Reader classes, while others are generic.

Backup

Bases: Extractor

Try all given extractors in order and return the first result that evaluates as true

This is a generic extractor that can be used in any Reader.

Example usage:

Backup(Constant(None), Constant('foo'))

Since the first extractor returns None, the second extractor will be used, and the extracted value would be 'foo'.

Note the difference with Choice: Backup is based on the extracted value, Choice on the applicable parameter of each extractor.

Parameters:

Name Type Description Default
*extractors Extractor

extractors to use. These should be listed in descending order of preference.

()
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Backup(Extractor):
    '''
    Try all given extractors in order and return the first result that evaluates as true

    This is a generic extractor that can be used in any `Reader`.

    Example usage:

        Backup(Constant(None), Constant('foo'))

    Since the first extractor returns `None`, the second extractor will be used, and the 
    extracted value would be `'foo'`.

    Note the difference with `Choice`: `Backup` is based on the _extracted value_,
    `Choice` on the `applicable` parameter of each extractor.

    Parameters:
        *extractors: extractors to use. These should be listed in descending order of
            preference.
        **kwargs: additional options to pass on to `Extractor`.
    '''
    def __init__(self, *extractors: Extractor, **kwargs):
        self.extractors = list(extractors)
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        for extractor in self.extractors:
            result = extractor.apply(*nargs, **kwargs)
            if result:
                return result
        return None
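
Because `_apply` tests `if result`, falsy values such as `0` or `''` are skipped in the same way as `None`. The following is a minimal sketch of that rule; `backup` is a hypothetical stand-in function that takes already-extracted values instead of extractor objects, so it runs without the library.

```python
def backup(*extracted_values):
    # Stand-in for Backup._apply: return the first truthy value, else None.
    for value in extracted_values:
        if value:
            return value
    return None


first = backup(None, 'foo')            # None is skipped
skipped_zero = backup(0, 'fallback')   # 0 is falsy, so it is skipped too
nothing = backup('', None)             # no truthy value at all
```

If a legitimate value can be falsy (for example a count of `0`), `Backup` may not be the right tool, since it would move on to the next extractor.
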

CSV

Bases: Extractor

This extractor extracts values from a list of CSV or spreadsheet rows.

It should be used in readers based on CSVReader or XLSXReader.

Parameters:

Name Type Description Default
column str

The name of the column from which to extract the value.

required
multiple bool

If a document spans multiple rows, the extracted value for a field with multiple = True is a list of the value in each row. If multiple = False (default), only the value from the first row is extracted.

False
convert_to_none List[str]

optional, default is ['']. Listed values are converted to None. If None/False, nothing is converted.

['']
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class CSV(Extractor):
    '''
    This extractor extracts values from a list of CSV or spreadsheet rows.

    It should be used in readers based on `CSVReader` or `XLSXReader`.

    Parameters:
        column: The name of the column from which to extract the value.
        multiple: If a document spans multiple rows, the extracted value for a
            field with `multiple = True` is a list of the value in each row. If
            `multiple = False` (default), only the value from the first row is extracted.
        convert_to_none: optional, default is `['']`. Listed values are converted to
            `None`. If `None`/`False`, nothing is converted.
        **kwargs: additional options to pass on to `Extractor`.
    '''
    def __init__(self,
            column: str,
            multiple: bool = False,
            convert_to_none: List[str] = [''],
            *nargs, **kwargs):
        self.field = column
        self.multiple = multiple
        self.convert_to_none = convert_to_none or []
        super().__init__(*nargs, **kwargs)

    def _apply(self, rows, *nargs, **kwargs):
        if self.field in rows[0]:
            if self.multiple:
                return [self.format(row[self.field]) for row in rows]
            else:
                row = rows[0]
                return self.format(row[self.field])

    def format(self, value):
        if value and value not in self.convert_to_none:
            return value
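
The interaction between `multiple` and `convert_to_none` can be sketched with a stand-in function. `extract_column` below is a hypothetical name; it mirrors the logic of `_apply` and `format` in the listing above but is not the library's class, and the sample rows are invented for illustration.

```python
def extract_column(rows, column, multiple=False, convert_to_none=('',)):
    # Stand-in mirroring CSV._apply and CSV.format from the listing above.
    def fmt(value):
        return value if value and value not in convert_to_none else None

    if column not in rows[0]:
        return None
    if multiple:
        return [fmt(row[column]) for row in rows]
    return fmt(rows[0][column])


# One document that spans two rows.
rows = [
    {'character': 'tourist', 'line': 'I will not buy this record.'},
    {'character': 'tourist', 'line': ''},
]

first_line = extract_column(rows, 'line')
all_lines = extract_column(rows, 'line', multiple=True)
missing = extract_column(rows, 'episode')
```

With `multiple=True` the empty cell in the second row becomes `None` inside the list, and a column that does not exist yields `None` for the whole field.
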

Cache

Bases: Extractor

Can be wrapped around another extractor to prevent repeatedly extracting the same value.

Assumes that the value of the extractor stays the same within a document, a source file, or even across the whole dataset.

Parameters:

Name Type Description Default
extractor Extractor

Extractor of which the value is returned and cached.

required
level str

The level at which values should be cached. Can be 'document', 'source', or 'reader'.

'document'
**kwargs

additional options to pass on to Extractor

{}

Note: caching is based on the extractor instance and will not work across instances. For instance, in the example below, there would be no caching across fields.

fields = [
    Field(name='foo', extractor=Cache(XML('baz'))),
    Field(name='bar', extractor=Cache(XML('baz')))
]

You could rewrite this as follows, so the XML tree is only queried once per document:

_my_extractor = Cache(XML('baz'))

fields = [
    Field(name='foo', extractor=_my_extractor),
    Field(name='bar', extractor=_my_extractor)
]

There is a similar issue when you use @property to define the fields of the reader.

Source code in ianalyzer_readers/extract.py
class Cache(Extractor):
    '''
    Can be wrapped around another extractor to prevent repeatedly extracting the same
    value. 

    Assumes that the value of the extractor stays the same within a
    document, a source file, or even across the whole dataset.

    Parameters:
        extractor: Extractor of which the value is returned and cached.
        level: The level at which values should be cached. Can be `'document'`,
            `'source'`, or `'reader'`.
        **kwargs: additional options to pass on to `Extractor`

    Note: caching is based on the extractor instance and will not work across instances.
    For instance, in the example below, there would be no caching across fields.

    ```python
    fields = [
        Field(name='foo', extractor=Cache(XML('baz'))),
        Field(name='bar', extractor=Cache(XML('baz')))
    ]
    ```

    You could rewrite this as follows, so the XML tree is only queried once per document:

    ```python
    _my_extractor = Cache(XML('baz'))

    fields = [
        Field(name='foo', extractor=_my_extractor),
        Field(name='bar', extractor=_my_extractor)
    ]
    ```

    There is a similar issue when you use `@property` to define the `fields` of the
    reader.
    '''

    def __init__(self, extractor: Extractor, level: str = 'document', **kwargs):
        self.extractor = extractor
        self.level = level
        self.kwargs = {}
        super().__init__(**kwargs)

    def _apply(self, **kwargs):
        self.kwargs = kwargs

        if self.level == 'document':
            cache_params = [kwargs['source_index'], kwargs['index']]
        if self.level == 'source':
            cache_params = [kwargs['source_index']]
        if self.level == 'reader':
            cache_params = []

        return self._apply_cached(*cache_params)

    @lru_cache(maxsize=1)
    def _apply_cached(self, *cache_parameters):
        return self.extractor.apply(**self.kwargs)
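
The effect of sharing one cache instance between fields can be sketched without the library. Both classes below are hypothetical stand-ins: `CountingExtractor` plays the role of an expensive extractor, and `DocumentCache` is a simplified imitation of `Cache(level='document')` that remembers only the most recent document, not the library's implementation.

```python
class CountingExtractor:
    # Hypothetical stand-in for an expensive extractor; counts its calls.
    def __init__(self):
        self.calls = 0

    def apply(self, **kwargs):
        self.calls += 1
        return 'baz'


class DocumentCache:
    # Simplified stand-in for Cache(level='document'): remembers the last
    # (source_index, index) pair and the value extracted for it.
    def __init__(self, extractor):
        self.extractor = extractor
        self._key = None
        self._value = None

    def apply(self, **kwargs):
        key = (kwargs['source_index'], kwargs['index'])
        if key != self._key:
            self._key, self._value = key, self.extractor.apply(**kwargs)
        return self._value


inner = CountingExtractor()
shared = DocumentCache(inner)

# Two fields sharing one cache instance: the inner extractor runs once per document.
foo = shared.apply(source_index=0, index=0)
bar = shared.apply(source_index=0, index=0)
calls_after_first_document = inner.calls

# A new document index invalidates the cached value.
shared.apply(source_index=0, index=1)
calls_after_second_document = inner.calls
```

Had each field wrapped the inner extractor in its own cache object, the inner extractor would have run once per field, which is the situation the note above warns about.
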

Choice

Bases: Extractor

Use the first applicable extractor from a list of extractors.

This is a generic extractor that can be used in any Reader.

The Choice extractor will use the applicable property of its provided extractors to check which applies.

Example usage:

Choice(Constant('foo', applicable=some_condition), Constant('bar'))

This would extract 'foo' if some_condition is met; otherwise, the extracted value will be 'bar'.

Note the difference with Backup: Backup will select the first truthy value from a list of extractors, but Choice only checks the applicable condition. For example:

Choice(
    CSV('foo', applicable=Metadata('bar')),
    CSV('baz'),
)

Backup(
    CSV('foo', applicable=Metadata('bar')),
    CSV('baz'),
)

These extractors behave differently if the "bar" condition holds, but the "foo" field is empty. Backup will try to extract the "baz" field, Choice will not.

Parameters:

Name Type Description Default
*extractors Extractor

extractors to choose from. These should be listed in descending order of preference.

()
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Choice(Extractor):
    '''
    Use the first applicable extractor from a list of extractors.

    This is a generic extractor that can be used in any `Reader`.

    The Choice extractor will use the `applicable` property of its provided extractors
    to check which applies. 

    Example usage: 

        Choice(Constant('foo', applicable=some_condition), Constant('bar'))

    This would extract `'foo'` if `some_condition` is met; otherwise,
    the extracted value will be `'bar'`.

    Note the difference with `Backup`: `Backup` will select the first truthy value from a
    list of extractors, but `Choice` only checks the `applicable` condition. For example:

        Choice(
            CSV('foo', applicable=Metadata('bar')),
            CSV('baz'),
        )

        Backup(
            CSV('foo', applicable=Metadata('bar')),
            CSV('baz'),
        )

    These extractors behave differently if the "bar" condition holds, but the "foo" field
    is empty. `Backup` will try to extract the "baz" field, `Choice` will not.

    Parameters:
        *extractors: extractors to choose from. These should be listed in descending
            order of preference.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, *extractors: Extractor, **kwargs):
        self.extractors = list(extractors)
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        for extractor in self.extractors:
            if extractor._is_applicable(*nargs, **kwargs):
                return extractor.apply(*nargs, **kwargs)
        return None
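
The difference between `Choice` and `Backup` can be sketched with two stand-in functions. Both are hypothetical and operate on `(applicable, value)` pairs rather than extractor objects; they imitate the loops in the two `_apply` methods so that the example runs on its own.

```python
def choice(candidates):
    # Stand-in for Choice._apply: commit to the first applicable candidate,
    # whatever value it produces.
    for applicable, value in candidates:
        if applicable:
            return value
    return None


def backup(candidates):
    # Stand-in for Backup._apply: an inapplicable extractor yields None,
    # and the first truthy result wins.
    for applicable, value in candidates:
        result = value if applicable else None
        if result:
            return result
    return None


# The "bar" condition holds, but the "foo" column is empty (None).
candidates = [(True, None), (True, 'baz value')]

chosen = choice(candidates)
backed_up = backup(candidates)
```

Here `choice` stops at the first candidate because it is applicable and returns its empty value, while `backup` moves on and returns the value of the second candidate.
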

Combined

Bases: Extractor

Apply all given extractors and return the results as a tuple.

This is a generic extractor that can be used in any Reader.

Example usage:

Combined(Constant('foo'), Constant('bar'))

This would extract ('foo', 'bar') for each document.

Parameters:

Name Type Description Default
*extractors Extractor

extractors to combine.

()
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Combined(Extractor):
    '''
    Apply all given extractors and return the results as a tuple.

    This is a generic extractor that can be used in any `Reader`.

    Example usage:

        Combined(Constant('foo'), Constant('bar'))

    This would extract `('foo', 'bar')` for each document.

    Parameters:
        *extractors: extractors to combine.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, *extractors: Extractor, **kwargs):
        self.extractors = list(extractors)
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        return tuple(
            extractor.apply(*nargs, **kwargs) for extractor in self.extractors
        )

Constant

Bases: Extractor

This extractor 'extracts' the same value every time, regardless of input.

This is a generic extractor that can be used in any Reader.

It is especially useful in combination with Backup or Choice.

Parameters:

Name Type Description Default
value Any

the value that should be "extracted".

required
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Constant(Extractor):
    '''
    This extractor 'extracts' the same value every time, regardless of input.

    This is a generic extractor that can be used in any `Reader`.

    It is especially useful in combination with `Backup` or `Choice`.

    Parameters:
        value: the value that should be "extracted".
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, value: Any, *nargs, **kwargs):
        self.value = value
        super().__init__(*nargs, **kwargs)

    def _apply(self, *nargs, **kwargs):
        return self.value

ExternalFile

Bases: Extractor

General-purpose external file extractor that provides a stream to stream_handler, which can do whatever is needed to extract data from an external file. Relies on associated_file being present in the metadata. Note that the XML extractor has a built-in option for extracting data from external files (setting external_file), so you probably want that if your external file is XML.

Parameters:

Name Type Description Default
stream_handler Callable

function that will handle the opened file.

required
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class ExternalFile(Extractor):
    '''
    General-purpose external file extractor that provides a stream to `stream_handler`,
    which can do whatever is needed to extract data from an external file. Relies on
    `associated_file` being present in the metadata. Note that the XML extractor has a
    built-in option for extracting data from external files (setting `external_file`),
    so you probably want that if your external file is XML.

    Parameters:
        stream_handler: function that will handle the opened file.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, stream_handler: Callable, *nargs, **kwargs):
        super().__init__(*nargs, **kwargs)
        self.stream_handler = stream_handler

    def _apply(self, metadata, *nargs, **kwargs):
        '''
        Extract `associated_file` from metadata and call `self.stream_handler` with file stream.
        '''
        return self.stream_handler(open(metadata['associated_file'], 'r'))

Extractor

Bases: object

Base class for extractors.

An extractor contains a method that can be used to gather data for a field.

Parameters:

Name Type Description Default
applicable Union[Extractor, Callable[[Dict], bool], None]

optional argument to check whether the extractor can be used. This should be another extractor, which is applied first; the containing extractor is only applied if the result is truthy. Any extractor can be used, as long as it's supported by the Reader in which it's used. If left as None, this extractor is always applicable.

None
transform Optional[Callable]

optional function to transform or postprocess the extracted value.

None
Source code in ianalyzer_readers/extract.py
class Extractor(object):
    '''
    Base class for extractors.

    An extractor contains a method that can be used to gather data for a field. 

    Parameters:
        applicable: 
            optional argument to check whether the extractor can be used. This should
            be another extractor, which is applied first; the containing extractor
            is only applied if the result is truthy. Any extractor can be used, as long as
            it's supported by the Reader in which it's used. If left as `None`, this 
            extractor is always applicable.
        transform: optional function to transform or postprocess the extracted value.
    '''

    def __init__(self,
                 applicable: Union['Extractor', Callable[[Dict], bool], None] = None,
                 transform: Optional[Callable] = None
                 ):

        if callable(applicable):
            warnings.warn(
                'Using a callable as "applicable" argument is deprecated; provide an '
                'Extractor instead',
                DeprecationWarning,
            )

        self.transform = transform
        self.applicable = applicable


    def apply(self, *nargs, **kwargs):
        '''
        Test if the extractor is applicable to the given arguments and if so,
        try to extract the information.
        '''
        if self._is_applicable(*nargs, **kwargs):
            result = self._apply(*nargs, **kwargs)
            try:
                if self.transform:
                    return self.transform(result)
            except Exception:
                logger.error(traceback.format_exc())
                logger.critical("Value {v} could not be converted."
                                .format(v=result))
                return None
            else:
                return result
        else:
            return None

    def _apply(self, *nargs, **kwargs):
        '''
        Actual extractor method to be implemented in subclasses (assume that
        testing for applicability and post-processing is taken care of).

        Raises:
            NotImplementedError: This method needs to be implemented on child
                classes. It will raise an error by default.
        '''
        raise NotImplementedError()


    def _is_applicable(self, *nargs, **kwargs) -> bool:
        '''
        Checks whether the extractor is applicable, based on the condition passed as the
        `applicable` parameter.

        If no condition is provided, this is always true. If the condition is an
        Extractor, this checks whether the result is truthy.

        If the condition is a callable, it will be called with the document metadata as
        an argument. This option is deprecated; you can use the Metadata extractor to
        replace it.

        Raises:
            TypeError: Raised if the applicable parameter is an unsupported type.
        '''
        if self.applicable is None:
            return True
        if isinstance(self.applicable, Extractor):
            return bool(self.applicable.apply(*nargs, **kwargs))
        if callable(self.applicable):
            return self.applicable(kwargs.get('metadata'))
        raise TypeError(
            f'Unsupported type for "applicable" parameter: {type(self.applicable)}'
        )

apply(*nargs, **kwargs)

Test if the extractor is applicable to the given arguments and if so, try to extract the information.

Source code in ianalyzer_readers/extract.py
def apply(self, *nargs, **kwargs):
    '''
    Test if the extractor is applicable to the given arguments and if so,
    try to extract the information.
    '''
    if self._is_applicable(*nargs, **kwargs):
        result = self._apply(*nargs, **kwargs)
        try:
            if self.transform:
                return self.transform(result)
        except Exception:
            logger.error(traceback.format_exc())
            logger.critical("Value {v} could not be converted."
                            .format(v=result))
            return None
        else:
            return result
    else:
        return None
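
The control flow of `apply` can be sketched as a plain function. This is a simplified stand-in, not the library's method: the hypothetical `apply` below takes an already-extracted value and a boolean `applicable` flag, and it logs with `logger.exception` where the original logs an error and a critical message.

```python
import logging

logger = logging.getLogger(__name__)


def apply(extracted, applicable=True, transform=None):
    # Stand-in for Extractor.apply, taking an already-extracted value.
    if not applicable:
        return None
    if transform is None:
        return extracted
    try:
        return transform(extracted)
    except Exception:
        # A failing transform is logged and the value is dropped.
        logger.exception('Value %r could not be converted.', extracted)
        return None


year = apply('1969', transform=int)
failed = apply('unknown', transform=int)
skipped = apply('1969', applicable=False, transform=int)
untouched = apply('1969')
```

A transform that raises therefore produces `None` for the field without stopping the extraction, so a field that is marked as required may cause the document to be skipped.
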

JSON

Bases: Extractor

An extractor to extract data from JSON. This extractor assumes that each source is a dictionary without nested lists. When working with nested lists, use JSONReader to unnest.

Parameters:

Name Type Description Default
keys Iterable[str]

the keys with which to retrieve a field value from the source

()
Source code in ianalyzer_readers/extract.py
class JSON(Extractor):
    '''
    An extractor to extract data from JSON.
    This extractor assumes that each source is a dictionary without nested lists.
    When working with nested lists, use JSONReader to unnest.

    Parameters:
        keys (Iterable[str]): the keys with which to retrieve a field value from the source
    '''

    def __init__(self, *keys, **kwargs):
        self.keys = list(keys)
        super().__init__(**kwargs)

    def _apply(self, data: Union[str, dict], key_index: int = 0, **kwargs) -> str:
        key = self.keys[key_index]
        data = data.get(key)
        if len(self.keys) > key_index + 1:
            key_index += 1
            return self._apply(data, key_index)
        return data
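
The recursion in `_apply` amounts to following the keys one level at a time. A minimal sketch with a stand-in function follows; `lookup` is a hypothetical name, not part of the library.

```python
def lookup(data, *keys):
    # Follow each key in turn, as JSON('scene', 'speech') would.
    for key in keys:
        data = data.get(key)
    return data


document = {
    'sketch': 'Hungarian Phrasebook',
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.',
    },
}

character = lookup(document, 'scene', 'character')
sketch = lookup(document, 'sketch')
absent = lookup(document, 'episode')
```

A missing final key yields `None`. As in the listing above, a missing intermediate key would make the next `.get` call fail, because `None` has no `get` method.
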

Metadata

Bases: Extractor

This extractor extracts a value from provided metadata.

This is a generic extractor that can be used in any Reader.

Parameters:

Name Type Description Default
key str

the key in the metadata dictionary that should be extracted.

required
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Metadata(Extractor):
    '''
    This extractor extracts a value from provided metadata.

    This is a generic extractor that can be used in any `Reader`.

    Parameters:
        key: the key in the metadata dictionary that should be
            extracted.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, key: str, *nargs, **kwargs):
        self.key = key
        super().__init__(*nargs, **kwargs)

    def _apply(self, metadata: Dict, *nargs, **kwargs):
        return metadata.get(self.key)

Order

Bases: Extractor

An extractor to keep track of the order of documents. By default, this is the order of documents within their source, but you can also track the order of sources.

Parameters:

Name Type Description Default
level str

Can be 'document' or 'source'. Whether to return the index of the source, or of the document within the source.

'document'
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Order(Extractor):
    '''
    An extractor to keep track of the order of documents. By default, this is the order
    of documents within their source, but you can also track the order of sources.

    Parameters:
        level: Can be `'document'` or `'source'`. Whether to return the index of the
            source, or of the document within the source.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, level: str = 'document', **kwargs):
        self.level = level
        super().__init__(**kwargs)

    def _apply(self, index: int = None, source_index: int = None, **kwargs):
        if self.level == 'document':
            return index
        if self.level == 'source':
            return source_index

Pass

Bases: Extractor

An extractor that just passes the value of another extractor.

This is a generic extractor that can be used in any Reader.

This is useful if you want to stack multiple transform arguments. For example:

Pass(Constant('foo  ', transform=str.upper), transform=str.strip)

This will extract str.strip(str.upper('foo  ')), i.e. 'FOO'.

Parameters:

Name Type Description Default
extractor Extractor

the extractor of which the value should be passed

required
**kwargs

additional options to pass on to Extractor.

{}
Source code in ianalyzer_readers/extract.py
class Pass(Extractor):
    '''
    An extractor that just passes the value of another extractor.

    This is a generic extractor that can be used in any `Reader`.

    This is useful if you want to stack multiple `transform` arguments. For example:

        Pass(Constant('foo  ', transform=str.upper), transform=str.strip)

    This will extract `str.strip(str.upper('foo  '))`, i.e. `'FOO'`.

    Parameters:
        extractor: the extractor of which the value should be passed
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self, extractor: Extractor, *nargs, **kwargs):
        self.extractor = extractor
        super().__init__(**kwargs)

    def _apply(self, *nargs, **kwargs):
        return self.extractor.apply(*nargs, **kwargs)
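
The stacking of transforms can be traced step by step. The function `apply_with_transform` below is a hypothetical stand-in for the way an extractor applies its optional `transform`; it is a sketch, not the library's code.

```python
def apply_with_transform(value, transform=None):
    # Stand-in for applying an extractor's optional transform.
    return transform(value) if transform else value


# Inner step, as in Constant('foo  ', transform=str.upper)
inner = apply_with_transform('foo  ', transform=str.upper)

# Outer step, as in Pass(..., transform=str.strip)
outer = apply_with_transform(inner, transform=str.strip)
```

The inner transform runs first and the outer transform receives its result, so the order of nesting determines the order of processing.
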

RDF

Bases: Extractor

An extractor to extract data from RDF triples

Parameters:

Name Type Description Default
predicates Iterable[URIRef]

an iterable of predicates (i.e., the middle part of an RDF triple) with which to query for objects. When no predicate is passed, the current subject is returned.

()
multiple bool

if True, return a list of all nodes for which the query returns a result; if False, return the first node matching the query.

False
is_collection bool

specify whether the data of interest is a collection, i.e., sequential data. A collection is indicated by the predicates rdf:first and rdf:rest; see the rdflib documentation.

False
Source code in ianalyzer_readers/extract.py
class RDF(Extractor):
    """An extractor to extract data from RDF triples

    Parameters:
        predicates:
            an iterable of predicates (i.e., the middle part of an RDF triple) with which to query for objects.
            When no predicate is passed, the current subject is returned.
        multiple:
            if `True`, return a list of all nodes for which the query returns a result;
            if `False`, return the first node matching the query.
        is_collection:
            specify whether the data of interest is a collection, i.e., sequential data.
            A collection is indicated by the predicates `rdf:first` and `rdf:rest`; see the [rdflib documentation](https://rdflib.readthedocs.io/en/stable/_modules/rdflib/collection.html).

    """

    def __init__(
        self,
        *predicates: Iterable[URIRef],
        multiple: bool = False,
        is_collection: bool = False,
        **kwargs,
    ):
        self.predicates = predicates
        self.multiple = multiple
        self.is_collection = is_collection
        super().__init__(**kwargs)

    def _apply(self, graph: Graph = None, subject: BNode = None, *nargs, **kwargs) -> Union[str, List[str]]:
        ''' apply a query to the RDFReader's graph, with one subject resulting from the `document_subjects` function

        Parameters:
            graph: a graph in which to query (set on RDFReader)
            subject: the subject with which to query

        Returns:
            a string or list of strings
        '''
        if self.is_collection:
            collection = Collection(graph, subject)
            return [self._get_node_value(node) for node in list(collection)]
        nodes = self._select(graph, subject, self.predicates)
        if len(nodes) == 0:
            return None
        if self.multiple:
            return [self._get_node_value(node) for node in nodes]
        return self._get_node_value(nodes[0])

    def _select(self, graph, subject, predicates: Iterable[URIRef]) -> List[Union[Literal, URIRef, BNode]]:
        ''' search in a graph with predicates
            if more than one predicate is passed, this is a recursive query:
            the first search result of the query is used as a subject in the next query

            Parameters:
                subject: the subject with which to query
                graph: the graph to search
                predicates: a list of predicates with which to query

            Returns:
                a list of nodes matching the query
        '''
        if not predicates:
            return [subject]
        nodes = list(graph.objects(subject, predicates[0]))
        if len(predicates) > 1:
            # an intermediate step without results would make nodes[0] raise an IndexError
            if not nodes:
                return []
            return self._select(graph, nodes[0], predicates[1:])
        return nodes

    def _get_node_value(self, node):
        ''' return a string value extracted from the node '''
        try:
            return node.value
        except AttributeError:
            return node
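
The chained lookup described in `_select` can be illustrated without rdflib. In this sketch a graph is a plain list of (subject, predicate, object) tuples, and `objects` and `select` are hypothetical stand-ins for `graph.objects()` and `_select`. The sketch returns an empty list when an intermediate step finds nothing.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: a book with two authors, each of whom has a name.
GRAPH: List[Triple] = [
    ("book1", "hasAuthor", "person1"),
    ("book1", "hasAuthor", "person2"),
    ("person1", "hasName", "Ada"),
    ("person2", "hasName", "Grace"),
]


def objects(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def select(graph: List[Triple], subject: str, predicates: Sequence[str]) -> List[str]:
    """Query with the first predicate; recurse on the first result for the rest."""
    if not predicates:
        return [subject]
    nodes = objects(graph, subject, predicates[0])
    if len(predicates) > 1:
        if not nodes:
            return []
        return select(graph, nodes[0], predicates[1:])
    return nodes


no_predicate = select(GRAPH, "book1", [])
authors = select(GRAPH, "book1", ["hasAuthor"])
first_author_name = select(GRAPH, "book1", ["hasAuthor", "hasName"])
print(no_predicate, authors, first_author_name)
```

Only the first result of each intermediate step is followed, so the second author's name is not reached by the two-predicate query.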

XML

Bases: Extractor

Extractor for XML data. Searches through a BeautifulSoup document.

This extractor should be used in a Reader based on XMLReader. (Note that this includes the HTMLReader.)

The XML extractor has a lot of options. When deciding how to extract a value, it usually makes sense to determine them in this order:

  • Choose whether to use the source file (default), or use an external XML file by setting external_file.
  • Choose where to start searching. The default searching point is the entry tag for the document, but you can also start from the top of the document by setting toplevel.
  • Describe the tag(s) you're looking for as a Tag object. You can also provide multiple tags to chain queries.
  • If you need to return all matching tags, rather than the first match, set multiple=True.
  • Choose how to extract a value: set attribute, flatten, or extract_soup_func if needed.
  • The extracted value is a string, or the output of extract_soup_func. To further transform it, add a function for transform.

Parameters:

Name Type Description Default
tags TagSpecification

Tags to select. Each of these can be a Tag object, or a callable that takes the document metadata as input and returns a Tag.

If no tags are provided, the extractor will work from the starting tag.

Tags represent a query to select tags from current tag (e.g. the entry tag of the document). If you provide multiple, they are chained: each Tag query is applied to the results from the previous one.

()
attribute Optional[str]

By default, the extractor will extract the text content of the tag. Set this property to extract the value of an attribute instead.

None
flatten bool

When extracting the text content of a tag, flatten determines whether the contents of non-text children are flattened. If False, only the direct text content of the tag is extracted.

This parameter does nothing if attribute is set.

False
toplevel bool

If True, the extractor will search from the toplevel tag of the XML document, rather than the entry tag for the document.

False
multiple bool

If False, the extractor will extract the first matching element. If True, it will extract a list of all matching elements.

False
external_file bool

If True, the extractor will look through a secondary XML file (usually one containing metadata). It requires that the passed metadata have an 'external_file' key that specifies the path to the file.

Note: this option is not supported when this extractor is nested in another extractor (like Combined).

False
extract_soup_func Optional[Callable]

A function to extract a value directly from the soup element, instead of using the content string or an attribute. attribute and flatten will do nothing if this property is set.

None
**kwargs

additional options to pass on to Extractor.

{}
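
The whitespace handling behind flatten can be tried in isolation. The sketch below applies the same three regular expressions used in the `_flatten` method of the source listing to a plain string; the function name `normalise_whitespace` and the sample text are assumptions made for illustration.

```python
import html
import re

# The same patterns as in `_flatten`, written as raw strings.
_SOFTBREAK = re.compile(r'(?<=\S)\n(?=\S)| +')
_NEWLINES = re.compile(r'\n+')
_TABS = re.compile(r'\t+')


def normalise_whitespace(text: str) -> str:
    """Drop tabs, join soft-wrapped lines, collapse blank lines, unescape entities."""
    text = _TABS.sub('', text)
    text = _SOFTBREAK.sub(' ', text)   # single newlines inside text and runs of spaces
    text = _NEWLINES.sub('\n', text)   # any remaining run of newlines becomes one
    return html.unescape(text.strip())


sample = "  First line\ncontinues here.\n\n\nSecond   paragraph &amp; more.\t\n"
cleaned = normalise_whitespace(sample)
print(repr(cleaned))
```

A single newline between two non-space characters is treated as a soft line break and becomes a space, while a run of several newlines is kept as one paragraph break.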
Source code in ianalyzer_readers/extract.py
class XML(Extractor):
    '''
    Extractor for XML data. Searches through a BeautifulSoup document.

    This extractor should be used in a `Reader` based on `XMLReader`. (Note that this
    includes the `HTMLReader`.)

    The XML extractor has a lot of options. When deciding how to extract a value, it
    usually makes sense to determine them in this order:

    - Choose whether to use the source file (default), or use an external XML file by
        setting `external_file`.
    - Choose where to start searching. The default searching point is the entry tag
        for the document, but you can also start from the top of the document by setting
        `toplevel`.
    - Describe the tag(s) you're looking for as a Tag object. You can also provide multiple
        tags to chain queries. 
    - If you need to return _all_ matching tags, rather than the first match, set
        `multiple=True`.
    - Choose how to extract a value: set `attribute`, `flatten`, or `extract_soup_func`
        if needed.
    - The extracted value is a string, or the output of `extract_soup_func`. To further
        transform it, add a function for `transform`.

    Parameters:
        tags:
            Tags to select. Each of these can be a `Tag` object, or a callable that
            takes the document metadata as input and returns a `Tag`.

            If no tags are provided, the extractor will work from the starting tag.

            Tags represent a query to select tags from current tag (e.g. the entry tag of
            the document). If you provide multiple, they are chained: each Tag query is
            applied to the results from the previous one.
        attribute:
            By default, the extractor will extract the text content of the tag. Set this
            property to extract the value of an _attribute_ instead.
        flatten:
            When extracting the text content of a tag, `flatten` determines whether
            the contents of non-text children are flattened. If `False`, only the direct
            text content of the tag is extracted.

            This parameter does nothing if `attribute` is set.
        toplevel:
            If `True`, the extractor will search from the toplevel tag of the XML
            document, rather than the entry tag for the document.
        multiple:
            If `False`, the extractor will extract the first matching element. If 
            `True`, it will extract a list of all matching elements.
        external_file:
            If `True`, the extractor will look through a secondary XML file (usually one
            containing metadata). It requires that the passed metadata have an
            `'external_file'` key that specifies the path to the file.

            Note: this option is not supported when this extractor is nested in another
            extractor (like `Combined`).
        extract_soup_func: A function to extract a value directly from the soup element,
            instead of using the content string or an attribute.
            `attribute` and `flatten` will do nothing if this property is set.
        **kwargs: additional options to pass on to `Extractor`.
    '''

    def __init__(self,
                 *tags: TagSpecification,
                 attribute: Optional[str] = None,
                 flatten: bool = False,
                 toplevel: bool = False,
                 multiple: bool = False,
                 external_file: bool = False,
                 extract_soup_func: Optional[Callable] = None,
                 **kwargs
                 ):

        self.tags = tags
        self.attribute = attribute
        self.flatten = flatten
        self.toplevel = toplevel
        self.multiple = multiple
        self.external_file = external_file
        self.extract_soup_func = extract_soup_func
        super().__init__(**kwargs)

    def _select(self, tags: Iterable[TagSpecification], soup: bs4.PageElement, metadata=None):
        '''
        Return the BeautifulSoup element that matches the constraints of this
        extractor.
        '''

        if len(tags) > 1:
            tag = resolve_tag_specification(tags[0], metadata)
            for element in tag.find_in_soup(soup):
                for result in self._select(tags[1:], element, metadata):
                    yield result
        elif len(tags) == 1:
            tag = resolve_tag_specification(tags[0], metadata)
            for result in tag.find_in_soup(soup):
                yield result
        else:
            yield soup


    def _apply(self, soup_top=None, soup_entry=None, **kwargs):
        results_generator = self._select(
            self.tags,
            soup_top if self.toplevel else soup_entry,
            metadata=kwargs.get('metadata')
        )

        if self.multiple:
            results = list(results_generator)
            return list(map(self._extract, results))
        else:
            result = next(results_generator, None)
            return self._extract(result)

    def _extract(self, soup: Optional[bs4.PageElement]):
        if not soup:
            return None

        # Use appropriate extractor
        if self.extract_soup_func:
            return self.extract_soup_func(soup)
        elif self.attribute:
            return self._attr(soup)
        else:
            if self.flatten:
                return self._flatten(soup)
            else:
                return self._string(soup)    

    def _string(self, soup):
        '''
        Output direct text contents of a node.
        '''

        if isinstance(soup, bs4.element.Tag):
            return soup.string
        else:
            return [node.string for node in soup]

    def _flatten(self, soup):
        '''
        Output text content of node and descendant nodes, disregarding
        underlying XML structure.
        '''

        if isinstance(soup, bs4.element.Tag):
            text = soup.get_text()
        else:
            text = '\n\n'.join(node.get_text() for node in soup)

        _softbreak = re.compile(r'(?<=\S)\n(?=\S)| +')
        _newlines = re.compile('\n+')
        _tabs = re.compile('\t+')

        return html.unescape(
            _newlines.sub(
                '\n',
                _softbreak.sub(' ', _tabs.sub('', text))
            ).strip()
        )

    def _attr(self, soup):
        '''
        Output content of nodes' attribute.
        '''

        if isinstance(soup, bs4.element.Tag):
            if self.attribute == 'name':
                return soup.name
            return soup.attrs.get(self.attribute)
        else:
            if self.attribute == 'name':
                return [ node.name for node in soup]
            return [
                node.attrs.get(self.attribute)
                for node in soup if node.attrs.get(self.attribute) is not None
            ]
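
The chaining behaviour of `_select`, where each query is applied lazily to every result of the previous one, can be sketched with the standard library alone. Here `xml.etree.ElementTree` stands in for BeautifulSoup and plain tag names stand in for Tag objects; `select_chain` is a hypothetical helper, not part of the library.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Sequence

DOC = ET.fromstring(
    "<doc>"
    "<section><p>one</p><p>two</p></section>"
    "<section><p>three</p></section>"
    "</doc>"
)


def select_chain(names: Sequence[str], element: ET.Element) -> Iterator[ET.Element]:
    """Apply each tag name in turn to the descendants matched by the previous one."""
    if not names:
        # No queries left: the current element is the result.
        yield element
        return
    for match in element.iterfind('.//' + names[0]):
        yield from select_chain(names[1:], match)


# Comparable to multiple=True: consume the whole generator.
all_paragraphs = [p.text for p in select_chain(["section", "p"], DOC)]

# Comparable to multiple=False: take the first result only, or None.
first = next(select_chain(["section", "p"], DOC), None)
missing = next(select_chain(["table"], DOC), None)
print(all_paragraphs, first.text, missing)
```

Because the selection is a generator, taking only the first match does not require collecting every match in the document first.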

XML tags

Module: ianalyzer_readers.xml_tag

This module defines the Tag class (and various subclasses).

This class is used in the XML extractor to read XML/HTML documents.

Each Tag describes a query for one or more XML tags based on their characteristics. It implements a method find_in_soup that takes an element as input and iterates over matching tags.

CurrentTag

Bases: Tag

A Tag query that will return the current tag.

Primarily useful as a default option.

Source code in ianalyzer_readers/xml_tag.py
class CurrentTag(Tag):
    '''
    A Tag query that will return the current tag.

    Primarily useful as a default option.
    '''

    def __init__(self):
        pass

    def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
        return [soup]

FindParentTag

Bases: Tag

A Tag that will find a parent tag based on query arguments.

Unlike ParentTag, which moves up a fixed number of levels, this searches the ancestors of the current tag with a query.

For example, FindParentTag('foo') will search for a <foo> ancestor of the current tag.

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_parents()

()
**kwargs Any

named arguments to pass on to soup.find_parents()

{}
Source code in ianalyzer_readers/xml_tag.py
class FindParentTag(Tag):
    '''
    A Tag that will find a parent tag based on query arguments.

    Unlike ParentTag, which moves up a fixed number of levels, this searches
    the ancestors of the current tag with a query.

    For example, `FindParentTag('foo')` will search for a `<foo>` ancestor
    of the current tag.

    Parameters:
        *args: positional arguments to pass on to `soup.find_parents()`
        **kwargs: named arguments to pass on to `soup.find_parents()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_parents(*self.args, **self.kwargs)

NextSiblingTag

Bases: Tag

A Tag that will look in an element's next siblings.

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_next_siblings()

()
**kwargs Any

named arguments to pass on to soup.find_next_siblings()

{}
Source code in ianalyzer_readers/xml_tag.py
class NextSiblingTag(Tag):
    '''
    A Tag that will look in an element's next siblings.

    Parameters:
        *args: positional arguments to pass on to `soup.find_next_siblings()`
        **kwargs: named arguments to pass on to `soup.find_next_siblings()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_next_siblings(*self.args, **self.kwargs)

NextTag

Bases: Tag

A Tag that will look through tags following the current element.

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_all_next()

()
**kwargs Any

named arguments to pass on to soup.find_all_next()

{}
Source code in ianalyzer_readers/xml_tag.py
class NextTag(Tag):
    '''
    A Tag that will look through tags following the current element.

    Parameters:
        *args: positional arguments to pass on to `soup.find_all_next()`
        **kwargs: named arguments to pass on to `soup.find_all_next()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_all_next(*self.args, **self.kwargs)

ParentTag

Bases: Tag

A Tag that will select a parent tag based on a fixed level.

For example, ParentTag(2) will always go up two steps in the tree and return that tag.

Parameters:

Name Type Description Default
level int

the number of steps to move up the tree.

1
Source code in ianalyzer_readers/xml_tag.py
class ParentTag(Tag):
    '''
    A Tag that will select a parent tag based on a fixed level.

    For example, `ParentTag(2)` will always go up two steps in the tree
    and return that tag.

    Parameters:
        level: the number of steps to move up the tree.
    '''

    def __init__(self, level: int = 1):
        self.level = level

    def find_in_soup(self, soup: bs4.PageElement):
        count = 0
        while count < self.level:
            soup = soup.parent if soup else None
            count += 1
        return [soup]
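
The fixed-level climb can be illustrated with a minimal stand-in for an XML element. The `Node` class and the `climb` function below are assumptions made for this sketch; `climb` mirrors the loop in `find_in_soup` above but does not use BeautifulSoup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical element with a name and a link to its parent."""
    name: str
    parent: Optional["Node"] = None


def climb(node: Optional[Node], level: int = 1) -> List[Optional[Node]]:
    """Move up `level` steps; stay at None once the top has been passed."""
    count = 0
    while count < level:
        node = node.parent if node else None
        count += 1
    return [node]


root = Node("root")
chapter = Node("chapter", parent=root)
paragraph = Node("p", parent=chapter)

one_up = climb(paragraph)
two_up = climb(paragraph, 2)
too_far = climb(paragraph, 5)
print(one_up[0].name, two_up[0].name, too_far)
```

Climbing past the root yields a list containing None rather than raising an error, so the result is still an iterable with one element.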

PreviousSiblingTag

Bases: Tag

A Tag that will look in an element's previous siblings.

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_previous_siblings()

()
**kwargs Any

named arguments to pass on to soup.find_previous_siblings()

{}
Source code in ianalyzer_readers/xml_tag.py
class PreviousSiblingTag(Tag):
    '''
    A Tag that will look in an element's previous siblings.

    Parameters:
        *args: positional arguments to pass on to `soup.find_previous_siblings()`
        **kwargs: named arguments to pass on to `soup.find_previous_siblings()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_previous_siblings(*self.args, **self.kwargs)

PreviousTag

Bases: Tag

A Tag that will look through tags previous to the current element.

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_all_previous()

()
**kwargs Any

named arguments to pass on to soup.find_all_previous()

{}
Source code in ianalyzer_readers/xml_tag.py
class PreviousTag(Tag):
    '''
    A Tag that will look through tags previous to the current element.

    Parameters:
        *args: positional arguments to pass on to `soup.find_all_previous()`
        **kwargs: named arguments to pass on to `soup.find_all_previous()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        return soup.find_all_previous(*self.args, **self.kwargs)

SiblingTag

Bases: Tag

A Tag that will look in an element's siblings.

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_previous_siblings() and soup.find_next_siblings()

()
**kwargs Any

named arguments to pass on to soup.find_previous_siblings() and soup.find_next_siblings()

{}
Source code in ianalyzer_readers/xml_tag.py
class SiblingTag(Tag):
    '''
    A Tag that will look in an element's siblings.

    Parameters:
        *args: positional arguments to pass on to `soup.find_previous_siblings()`
            and `soup.find_next_siblings()`
        **kwargs: named arguments to pass on to `soup.find_previous_siblings()`
            and `soup.find_next_siblings()`
    '''

    def find_in_soup(self, soup: bs4.PageElement):
        for tag in soup.find_next_siblings(*self.args, **self.kwargs):
            yield tag

        for tag in soup.find_previous_siblings(*self.args, **self.kwargs):
            yield tag

Tag

Describes a query for a tag in an XML tree.

This should be used as the base class for all other tags, which can override the __init__() and find_in_soup() methods.

Tag is the most straightforward case: all arguments passed in the constructor are passed on as-is to the find_all() method of the BeautifulSoup element, searching descendants of the input tag.

See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters for different ways of searching. This includes searching by:

  • a tag name (possibly as a regular expression)
  • attributes of the tag
  • the string content of the tag
  • a function

Parameters:

Name Type Description Default
*args Any

positional arguments to pass on to soup.find_all()

()
**kwargs Any

named arguments to pass on to soup.find_all()

{}
Source code in ianalyzer_readers/xml_tag.py
class Tag:
    '''
    Describes a query for a tag in an XML tree.

    This should be used as the base class for all other tags, which can override
    the `__init__()` and `find_in_soup()` methods.

    `Tag` is the most straightforward case: all arguments passed in the constructor
    are passed on as-is to the `find_all()` method of the BeautifulSoup element, searching
    descendants of the input tag.

    See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters for
    different ways of searching. This includes searching by:
    - a tag name (possibly as a regular expression)
    - attributes of the tag
    - the string content of the tag
    - a function

    Parameters:
        *args: positional arguments to pass on to `soup.find_all()`
        **kwargs: named arguments to pass on to `soup.find_all()`
    '''

    def __init__(self, *args: Any, **kwargs: Any):
        self.args = args
        self.kwargs = kwargs

    def find_next_in_soup(self, soup: bs4.PageElement) -> Optional[bs4.PageElement]:
        '''
        Find the first match for the tag, if any.

        Parameters:
            soup: The element to search from.

        Returns:
            The first matching tag. Returns `None` if there is no match.
        '''
        return next((tag for tag in self.find_in_soup(soup)), None)

    def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
        '''
        Find all results for this tag.

        Parameters:
            soup: The element to search from.

        Returns:
            An iterable of matching tags. Note that it is not guaranteed that the iterable
                contains any elements.

        When subclassing Tag, you will usually want to replace this method. The result
        must be an iterable. (If only one result makes sense, it's an iterable with one
        element.) If the tag may find multiple matches, it's recommended that this method
        returns a generator or a `bs4.ResultSet` rather than collecting all results up
        front.
        '''
        pool = soup.descendants if self.kwargs.get('recursive', True) else soup.children

        def strainer_helper(name=None, attrs={}, string=None, **kwargs):
            return bs4.SoupStrainer(name, attrs, string, **kwargs)
        strainer = strainer_helper(*self.args, **self.kwargs)

        yield from strainer.filter(pool)

find_in_soup(soup)

Find all results for this tag.

Parameters:

Name Type Description Default
soup PageElement

The element to search from.

required

Returns:

Type Description
Iterable[PageElement]

An iterable of matching tags. Note that it is not guaranteed that the iterable contains any elements.

When subclassing Tag, you will usually want to replace this method. The result must be an iterable. (If only one result makes sense, it's an iterable with one element.) If the tag may find multiple matches, it's recommended that this method returns a generator or a bs4.ResultSet rather than collecting all results up front.

Source code in ianalyzer_readers/xml_tag.py
def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
    '''
    Find all results for this tag.

    Parameters:
        soup: The element to search from.

    Returns:
        An iterable of matching tags. Note that it is not guaranteed that the iterable
            contains any elements.

    When subclassing Tag, you will usually want to replace this method. The result
    must be an iterable. (If only one result makes sense, it's an iterable with one
    element.) If the tag may find multiple matches, it's recommended that this method
    returns a generator or a `bs4.ResultSet` rather than collecting all results up
    front.
    '''
    pool = soup.descendants if self.kwargs.get('recursive', True) else soup.children

    def strainer_helper(name=None, attrs={}, string=None, **kwargs):
        return bs4.SoupStrainer(name, attrs, string, **kwargs)
    strainer = strainer_helper(*self.args, **self.kwargs)

    yield from strainer.filter(pool)

find_next_in_soup(soup)

Find the first match for the tag, if any.

Parameters:

Name Type Description Default
soup PageElement

The element to search from.

required

Returns:

Type Description
Optional[PageElement]

The first matching tag. Returns None if there is no match.

Source code in ianalyzer_readers/xml_tag.py
def find_next_in_soup(self, soup: bs4.PageElement) -> Optional[bs4.PageElement]:
    '''
    Find the first match for the tag, if any.

    Parameters:
        soup: The element to search from.

    Returns:
        The first matching tag. Returns `None` if there is no match.
    '''
    return next((tag for tag in self.find_in_soup(soup)), None)
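
The `next(..., None)` idiom used here pairs well with a lazy `find_in_soup`: iteration stops at the first match, and an exhausted iterator yields the default instead of raising `StopIteration`. The sketch below shows this with plain integers; `find_matches` and the `calls` list are hypothetical names used only to make the laziness visible.

```python
from typing import Iterable, Iterator, List

calls: List[int] = []


def find_matches(values: Iterable[int]) -> Iterator[int]:
    """Lazily yield even numbers, recording each value that is inspected."""
    for value in values:
        calls.append(value)
        if value % 2 == 0:
            yield value


# Only the values up to and including the first match are inspected.
first = next(find_matches([1, 3, 4, 5, 6]), None)
inspected = list(calls)

# With no match at all, the default is returned.
missing = next(find_matches([1, 3]), None)
print(first, inspected, missing)
```

This is one reason a generator is a convenient return type when a query may match many elements but only the first is needed.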

TransformTag

Bases: Tag

A Tag that will perform a transformation function.

This Tag allows you to run arbitrary code to move to anywhere in the XML tree.

Parameters:

Name Type Description Default
transform Callable[[PageElement], Iterable[PageElement]]

a function that takes an XML element as input and returns an iterable of XML elements. (Note that you can return an iterable of one, or an empty iterable, if you don't have multiple results.)

required
Source code in ianalyzer_readers/xml_tag.py
class TransformTag(Tag):
    '''
    A Tag that will perform a transformation function.

    This Tag allows you to run arbitrary code to move to anywhere in the XML tree.

    Parameters:
        transform: a function that takes an XML element as input and returns an
            iterable of XML elements. (Note that you can return an iterable of
            one, or an empty iterable, if you don't have multiple results.)
    '''

    def __init__(
            self,
            transform: Callable[[bs4.PageElement], Iterable[bs4.PageElement]],
        ):
        self.transform = transform

    def find_in_soup(self, soup: bs4.PageElement) -> Iterable[bs4.PageElement]:
        return self.transform(soup)
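
As a rough illustration of the contract for transform (one element in, an iterable of elements out), the sketch below uses a stand-in class with the same shape as TransformTag and `xml.etree.ElementTree` elements instead of BeautifulSoup ones. `TransformTagSketch` and `last_child` are hypothetical names, not part of the library.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable


class TransformTagSketch:
    """Stand-in with the same shape as TransformTag (not the library class)."""

    def __init__(self, transform: Callable[[ET.Element], Iterable[ET.Element]]):
        self.transform = transform

    def find_in_soup(self, element: ET.Element) -> Iterable[ET.Element]:
        return self.transform(element)


def last_child(element: ET.Element) -> Iterable[ET.Element]:
    """Return the last child as a one-element list, or an empty list."""
    children = list(element)
    return children[-1:]


entry = ET.fromstring("<entry><title>A</title><note>B</note></entry>")
tag = TransformTagSketch(last_child)

found = [el.tag for el in tag.find_in_soup(entry)]
empty = list(tag.find_in_soup(ET.fromstring("<entry/>")))
print(found, empty)
```

Slicing with `[-1:]` keeps the return value an iterable in both cases: one element when a child exists, none when the entry is empty.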