Custom reader

ianalyzer_readers includes several subclasses of Reader to handle common data formats, such as the XMLReader and CSVReader. If you need to handle a new or unique data format, it may be useful to create a new Reader subclass. This example demonstrates how you could implement a custom reader.

Our dataset is a file library.txt, which contains bibliographical data for a collection of books. It looks like this:

Title: Pride and Prejudice
Author: Jane Austen
Year: 1813

Title: Frankenstein, or, the Modern Prometheus
Author: Mary Shelley
Year: 1818

Title: Moby Dick
Author: Herman Melville
Year: 1851

Title: Alice in Wonderland
Author: Lewis Carroll
Year: 1865

This data doesn't use a standardised format, but it's consistently structured. We start by creating a reader class:

from ianalyzer_readers.readers.core import Reader

class BibliographyReader(Reader):
    pass

File discovery

File discovery is normally implemented when you create the Reader class for a specific dataset. If you're creating an abstract class for a data format, like the CSVReader or XMLReader, you can skip this step, and leave it to the reader for each dataset.

In this case, our reader is meant to handle a single dataset, so we should describe how to find the data file by implementing sources(). This just needs to yield a single file.

from ianalyzer_readers.readers.core import Reader

class BibliographyReader(Reader):
    data_directory = '.'

    def sources(self, **kwargs):
        yield self.data_directory + '/library.txt'

There are several options for the output type of sources; in this case, we're providing a file path.

Extracting file contents

To extract documents from a source, a reader must implement two steps:

  • extract a data object from a source
  • iterate over the data object and yield extractor input for each document

The format of the source data object is up to you; what format makes sense here will depend on how the source data is structured. It could be a graph, an iterator, a dataframe, or something else entirely.

The second step is a method that iterates over this data format and identifies the entry point for each document. Per document, it should return the data that will be made available to field extractors.

This is quite an abstract description, so let's see how it works in practice.

First, we need to extract a data object from a source. There are several methods you can implement here (data_from_file, data_from_bytes, data_from_response), depending on what source types you wish to support. In this case, we know the output of sources is a file path, so we need to implement data_from_file; we can leave the others unimplemented.

The output of data_from_file should be some intermediate data format; we will just return the string contents of the file.

class BibliographyReader(Reader):
    # ...

    def data_from_file(self, path: str) -> str:
        f = open(path, 'r')
        content = f.read()
        f.close()
        return content

Iterating over file contents

We now need a method to iterate over the source data, i.e. the output of data_from_file. In our case, this data object is the file contents as a string. The iterate_data method must be implemented to split this into documents.

As input, it will receive the data object (the string content), and the metadata for the file. (Our reader does not provide metadata, so the metadata will be empty.) It should iterate over the documents we want to extract (in this case, over each book). Per document, it should return whatever data we want to provide to field extractors.

The data for field extractors can be of any format you want. Non-universal extractors like CSV, XML, etc., have specific arguments they expect, so you could tailor your output data to be compatible with a specific extractor class.

In this case, it doesn't really make sense to use one of the existing extractors, so we will make our own extractor class later on. At this step, we can choose what information we will provide to our extractor.

In this case, our data provides a few properties for each book: the title, author, and year. So we can parse the lines of text into a mapping of properties to values.

from typing import Iterable, Dict, List
from ianalyzer_readers.core import Document

class BibliographyReader(Reader):
    # ...

    def iterate_data(self, data: str, metadata: Dict) -> Iterable[Document]:
        # split into entries
        sections = data.split('\n\n')
        for section in sections:
            # get property mapping from each entry
            mapping = self._mapping_from_section(section)
            yield {'mapping': mapping}

    def _mapping_from_section(self, section: str):
        lines = section.split('\n')
        keys_values = (line.split(': ') for line in lines if len(line))
        return { key: value for key, value in keys_values }

Create custom extractor

Our reader provides a mapping for each document. We need to create an extractor class that can extract values from the mapping.

Our custom extractor is a subclass of Extractor. The base extractor class supports a few parameters on initialisation, such as transform. We will add one parameter of our own, key, which specifies the property to extract. Note that the initialiser must call super().__init__() to make sure inherited parameters are stored.

Our extractor also needs to implement a method _apply() which specifies how to extract a value from a document. Here, we expect the reader to provide a mapping, and extract the property matching the key.

from typing import Dict
from ianalyzer_readers.extract import Extractor

class BibliographyExtractor(Extractor):
    def __init__(self, key: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.key = key

    def _apply(self, mapping: Dict, **kwargs):
        return mapping.get(self.key, None)

Define fields

The last thing that is required for a functioning reader is a list of fields.

Note: if you are creating an abstract reader class like CSVReader, you should not implement a list of fields.

from ianalyzer_readers.core import Field
from ianalyzer_readers.extract import Order, Constant

    fields = [
        Field(
            name='title',
            extractor=BibliographyExtractor('Title'),
        ),
        Field(
            name='author',
            extractor=BibliographyExtractor('Author'),
        ),
        Field(
            name='year',
            extractor=BibliographyExtractor('Year', transform=int),
        ),
        Field(
            name='index',
            extractor=Order(),
        ),
        Field(
            name='file',
            extractor=Constant('library.txt'),
        ),
    ]

Note that we can use our custom-made BibliographyExtractor, but universal extractors like Order and Constant are also supported.

Complete example

from typing import Iterable, Dict
import os

from ianalyzer_readers.extract import Extractor
from ianalyzer_readers.readers.core import Reader, Document, Field
from ianalyzer_readers.extract import Order, Constant


class BibliographyExtractor(Extractor):
    def __init__(self, key: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.key = key

    def _apply(self, mapping: Dict, **kwargs):
        return mapping.get(self.key, None)


class BibliographyReader(Reader):
    data_directory = os.path.dirname(__file__)

    def sources(self, **kwargs):
        yield self.data_directory + '/library.txt'

    def data_from_file(self, path: str) -> str:
        f = open(path, 'r')
        content = f.read()
        f.close()
        return content

    def iterate_data(self, data: str, metadata: Dict) -> Iterable[Document]:
        sections = data.split('\n\n')
        for section in sections:
            mapping = self._mapping_from_section(section)
            yield {'mapping': mapping}

    def _mapping_from_section(self, section: str):
        lines = section.split('\n')
        keys_values = (line.split(': ') for line in lines if len(line))
        return { key: value for key, value in keys_values }

    fields = [
        Field(
            name='title',
            extractor=BibliographyExtractor('Title'),
        ),
        Field(
            name='author',
            extractor=BibliographyExtractor('Author'),
        ),
        Field(
            name='year',
            extractor=BibliographyExtractor('Year', transform=int),
        ),
        Field(
            name='index',
            extractor=Order(),
        ),
        Field(
            name='file',
            extractor=Constant('library.txt'),
        ),
    ]
Extracted documents The `documents()` method of our reader will now return the following output:
[
    {
        'title': 'Pride and Prejudice',
        'author': 'Jane Austen',
        'year': 1813,
        'index': 0,
        'file': 'library.txt',
    },
    {
        'title': 'Frankenstein, or, the Modern Prometheus',
        'author': 'Mary Shelley',
        'year': 1818,
        'index': 1,
        'file': 'library.txt',
    },
    {
        'title': 'Moby Dick',
        'author': 'Herman Melville',
        'year': 1851,
        'index': 2,
        'file': 'library.txt',
    },
    {
        'title': 'Alice in Wonderland',
        'author': 'Lewis Carroll',
        'year': 1865,
        'index': 3,
        'file': 'library.txt',
    },
]