API documentation
Core classes
Module: ianalyzer_readers.readers.core
This module defines the base classes on which all Readers are built.
The module defines two classes, Field and Reader.
Document = Dict[str, Any]
module-attribute
Type definition for documents, defined for convenience.
Each document extracted by a Reader is a dictionary, where the keys are names of
the Reader's fields, and the values are based on the extractor of each field.
Source = Union[SourceData, Tuple[SourceData, Dict]]
module-attribute
Type definition for the source input to some Reader methods.
Sources are either:
- a string with the path to a file
- binary data with the file contents. This is not supported on all Reader subclasses.
- a requests.Response
- a tuple of one of the above, and a dictionary with metadata
SourceData = Union[str, Response, bytes]
module-attribute
Type definition of the data types a Reader method can handle.
Field
Bases: object
Fields are the elements of information that you wish to extract from each document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | a shorthand name, which will be used as its key in the document | required |
| extractor | Extractor | an Extractor object that defines how this field's data can be extracted from source documents. | Constant(None) |
| required | bool | whether this field is required. The | False |
| skip | bool | if | False |
Reader
A base class for readers. Readers are objects that can generate documents from a source dataset.
Subclasses of Reader can be created to read specific data formats.
In practice, you will probably work with a subclass of Reader like XMLReader,
CSVReader, etc., that provides the core functionality for a file type, and create
a subclass for a specific dataset.
Some methods of this class need to be implemented in child classes, and will raise
NotImplementedError if you try to use Reader directly.
A fully implemented Reader subclass will define how to read a dataset by
describing:
- How to obtain its source files.
- How to parse and iterate over source files.
- What fields each document contains, and how to extract them from the source data.
This requires implementing the following attributes/methods:
- fields: a list of Field instances that describe the fields that will appear in documents, and how to extract their value.
- sources: a method that returns an iterable of sources (e.g. file paths), possibly with metadata for each.
- data_directory (optional): a string with the path to the directory containing the source data. You can use this in the implementation of sources; it's not used elsewhere.
- data_from_file, data_from_bytes, data_from_response: methods that respectively receive a file path, a byte sequence, or an HTTP response, and return a data object. (The type of the data will depend on how you implement your reader; this could be a parsed graph, a row iterator, etc.) You must implement at least one of these methods to have a functioning reader.
- iterate_data: a method that takes a data object (the output of data_from_file/data_from_bytes/data_from_response) and a metadata dictionary, iterates over the source data, and returns the data that should be passed on to extractors for each document.
- validate (optional): a method that will check the reader configuration. This is useful for abstract readers like the XMLReader, CSVReader, etc., so they can verify a child class is implementing attributes correctly.
Abstract reader types like CSVReader usually leave fields and sources
unimplemented.
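To make the division of labour between these methods concrete, here is a minimal standard-library sketch of the pipeline. It does not import ianalyzer_readers: the function names mirror the Reader methods listed above, but the in-memory FILES dictionary, the line-based parsing, and the lambda "extractors" are assumptions made purely for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical in-memory "files": path -> raw text, one document per line.
FILES = {
    "a.txt": "first line\nsecond line",
    "b.txt": "third line",
}


def sources() -> Iterable[Tuple[str, Dict]]:
    """Yield each source path together with a metadata dictionary."""
    for path in FILES:
        yield path, {"filename": path}


def data_from_file(path: str) -> List[str]:
    """Parse one source into a data object (here: a list of lines)."""
    return FILES[path].splitlines()


def iterate_data(data: List[str], metadata: Dict) -> Iterable[Dict[str, Any]]:
    """Yield the keyword arguments that extractors receive per document."""
    for line in data:
        yield {"line": line, "metadata": metadata}


# Stand-ins for fields: a name plus a function playing the extractor role.
FIELDS: Dict[str, Callable[..., Any]] = {
    "text": lambda line, metadata: line,
    "source": lambda line, metadata: metadata["filename"],
}


def documents() -> Iterable[Dict[str, Any]]:
    """Combine the steps: sources -> data -> per-document extraction."""
    for path, metadata in sources():
        for kwargs in iterate_data(data_from_file(path), metadata):
            yield {name: extract(**kwargs) for name, extract in FIELDS.items()}


docs = list(documents())
```

Each element of docs is a dictionary keyed by field name, which is the shape described for Document above.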
data_directory
property
Path to source data directory.
Raises:
| Type | Description |
|---|---|
| NotImplementedError | This method needs to be implemented on child classes. It will raise an error by default. |
fieldnames
property
A list containing the name of each field of this Reader
fields
property
The list of fields that are extracted from documents.
These should be instances of the Field class (or implement the same API).
Raises:
| Type | Description |
|---|---|
| NotImplementedError | This method needs to be implemented on child classes. It will raise an error by default. |
data_and_metadata_from_source(source)
Extract the data and metadata object from a source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Source | Source to extract. | required |
Returns:
| Type | Description |
|---|---|
| Tuple[Any, Dict] | A tuple with the parsed source data, and the metadata (empty if none was provided). |
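The Source type allows either bare data or a (data, metadata) tuple, so the first step is to tell the two apart. The sketch below shows only that unpacking step with a hypothetical helper named split_source; it is not the library's implementation, and the documented method additionally parses the data.

```python
from typing import Any, Dict, Tuple


def split_source(source: Any) -> Tuple[Any, Dict]:
    """Separate a source into its data part and its metadata dictionary."""
    if isinstance(source, tuple):
        data, metadata = source
        return data, metadata
    # No metadata was provided: fall back to an empty dictionary.
    return source, {}


path_only = split_source("data/file.csv")
with_meta = split_source((b"raw bytes", {"year": 1990}))
```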
data_from_bytes(bytes)
Extract source data from a bytes object. Like data_from_file, but with bytes
input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bytes | bytes | byte contents of the source | required |
Returns:
| Type | Description |
|---|---|
| Any | A data object. The type depends on the reader implementation. This may also be a context manager. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | this method may be implemented on child classes, but has no default implementation. |
data_from_file(path)
Extract source data from a filename.
The return type depends on how the reader is implemented, but should be some kind
of data structure from which documents can be extracted. It serves as the input
to self.iterate_data.
This method can also return a context manager. This is especially useful to
iterate over large files in iterate_data, without loading the complete file
contents in memory.
Tip: if you have implemented self.data_from_bytes, this method can probably just
read the binary contents of the file and call that method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The path to a file. | required |
Returns:
| Type | Description |
|---|---|
| Any | A data object. The type depends on the reader implementation. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | this method may be implemented on child classes, but has no default implementation. |
data_from_response(response)
Extract data from an HTTP response. Like data_from_file, but with Response
input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response | Response | HTTP response object | required |
Returns:
| Type | Description |
|---|---|
| Any | A data object. The type depends on the reader implementation. This may also be a context manager. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | this method may be implemented on child classes, but has no default implementation. |
documents(sources=None)
Returns an iterable of extracted documents from source files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sources | Iterable[Source] | an iterable of paths to source files. If omitted, the reader class will use the value of | None |
Returns:
| Type | Description |
|---|---|
| Iterable[Document] | an iterable of document dictionaries. Each of these is a dictionary, where the keys are names of this Reader's fields, and the values are based on the extractor of each field. |
export_csv(path, sources=None)
Extracts documents from sources and saves them in a CSV file.
This will write a CSV file in the provided path. This method has no return
value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | the path where the CSV file should be saved. | required |
| sources | Optional[Iterable[Source]] | an iterable of paths to source files. If omitted, the reader class will use the value of | None |
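As a rough picture of what exporting documents to CSV involves, the sketch below writes a list of document dictionaries with Python's csv.DictWriter, using the field names as the header. The helper write_documents and the sample documents are hypothetical, and whether the library writes its file this way is an assumption.

```python
import csv
import io
from typing import Any, Dict, Iterable, List, TextIO


def write_documents(
    documents: Iterable[Dict[str, Any]],
    fieldnames: List[str],
    stream: TextIO,
) -> None:
    """Write one CSV row per document, with the field names as header."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for document in documents:
        writer.writerow(document)


# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_documents(
    [{"title": "A", "year": 1990}, {"title": "B", "year": 1991}],
    fieldnames=["title", "year"],
    stream=buffer,
)
csv_text = buffer.getvalue()
```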
extract_document(**kwargs)
Extract each field of a document, based on the raw data for the document
iterate_data(data, metadata)
Iterate parsed source data, return data for each document.
This should return the arguments that are passed on to field extractors per
document. These usually cater to a specific extractor type. For example, the
CSVReader returns an argument rows, which is used by the CSV extractor.
The core source2dicts method will also provide metadata and index arguments
to extractors, which you may override by providing them here.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Any | The data object from a source. The type depends on the reader implementation; this is the output of data_from_file, data_from_bytes, or data_from_response. | required |
| metadata | Dict | Dictionary containing metadata for the source. | required |
Returns:
| Type | Description |
|---|---|
| Iterable[Document] | An iterable of dictionaries. Each iteration will be extracted as a single document. The items in the dictionary are given as arguments to field extractors. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | This method must be implemented on child classes. It will raise an error otherwise. |
source2dicts(source, source_index=-1)
Given a source file, returns an iterable of extracted documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Source | Source to extract. | required |
Returns:
| Type | Description |
|---|---|
| Iterable[Document] | an iterable of document dictionaries. Each of these is a dictionary, where the keys are names of this Reader's fields, and the values are based on the extractor of each field. |
sources(**kwargs)
Obtain source files for the Reader.
Returns:
| Type | Description |
|---|---|
| Iterable[Source] | an iterable of tuples that each contain a string path, and a dictionary with associated metadata. The metadata can contain any data that was extracted before reading the file itself, such as data based on the file path, or on a metadata file. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | This method needs to be implemented on child classes. It will raise an error by default. |
validate()
Validate that the reader is configured properly.
This is a good place to check parameters that are overridden in a child class. A common use case is to use self._reject_extractors to raise an error if any fields use unsupported extractor types.
CSV reader
Module: ianalyzer_readers.readers.csv
This module defines the CSV reader.
Extraction is based on Python's csv library.
CSVReader
Bases: Reader
A base class for Readers of .csv (comma separated value) files.
The CSVReader is designed for .csv or .tsv files that have a header row, and where each file may list multiple documents.
The data should be structured in one of the following ways:
- one document per row (this is the default)
- each document spans a number of consecutive rows. In this case, there should be a column that indicates the identity of the document.
In addition to generic extractor classes, this reader supports the CSV extractor.
delimiter = ','
class-attribute
instance-attribute
The column delimiter used in the CSV data
field_entry = None
class-attribute
instance-attribute
If applicable, the name of the column that identifies entries. Subsequent rows with the same value for this column are treated as a single document. If left blank, each row is treated as a document.
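The grouping rule for field_entry can be pictured with the standard library alone. The sketch below is not the CSVReader implementation; it assumes a hypothetical two-column file and uses itertools.groupby, which (like the behaviour described above) only merges rows that are consecutive.

```python
import csv
import io
from itertools import groupby
from typing import Dict, Iterable, List

# Hypothetical data: the "poem" column identifies the document.
CSV_TEXT = """poem,line
1,Roses are red
1,Violets are blue
2,Something else
"""


def group_rows(text: str, entry_column: str) -> Iterable[List[Dict[str, str]]]:
    """Yield lists of consecutive rows that share a value in entry_column."""
    reader = csv.DictReader(io.StringIO(text))
    for _, rows in groupby(reader, key=lambda row: row[entry_column]):
        yield list(rows)


grouped = list(group_rows(CSV_TEXT, "poem"))
```

Here grouped contains two documents: the first spans two rows, the second a single row.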
required_field = None
class-attribute
instance-attribute
Specifies the name of a required column in the CSV data, for example the main content.
Rows with an empty value for required_field will be skipped.
skip_lines = 0
class-attribute
instance-attribute
Number of lines in the file to skip before reading the header. Can be used when files use a fixed "preamble", e.g. to describe metadata or provenance.
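The combined effect of skip_lines, required_field and delimiter can be sketched with Python's csv module. This is an illustration under assumptions, not the reader's own code: read_rows is a hypothetical helper, and the sample file with a two-line preamble is made up.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Optional

# Hypothetical file: two preamble lines, then the header row.
RAW = """Exported by archive tool
Date: unknown
id,content
1,Hello
2,
3,World
"""


def read_rows(
    text: str,
    skip_lines: int = 0,
    required_field: Optional[str] = None,
    delimiter: str = ",",
) -> Iterable[Dict[str, str]]:
    """Skip the preamble, read the header, and drop rows missing the required value."""
    lines = islice(io.StringIO(text), skip_lines, None)
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        if required_field and not row[required_field]:
            continue
        yield row


rows = list(read_rows(RAW, skip_lines=2, required_field="content"))
```

The row with id 2 is dropped because its content column is empty.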
XLSX reader
Module: ianalyzer_readers.readers.xlsx
XLSXReader
Bases: Reader
A base class for Readers that extract data from .xlsx spreadsheets
The XLSXReader is quite rudimentary, and is designed to extract data from spreadsheets that are formatted like a CSV table, with a clear column layout. The sheet should have a header row.
The data should be structured in one of the following ways:
- one document per row (this is the default)
- each document spans a number of consecutive rows. In this case, there should be a column that indicates the identity of the document.
The XLSXReader will only look at the first sheet in each file.
In addition to generic extractor classes, this reader supports the CSV extractor.
field_entry = None
class-attribute
instance-attribute
If applicable, the name of the column that identifies entries. Subsequent rows with the same value for this column are treated as a single document. If left blank, each row is treated as a document.
required_field = None
class-attribute
instance-attribute
Specifies the name of a required column, for example the main content. Rows with
an empty value for required_field will be skipped.
skip_lines = 0
class-attribute
instance-attribute
Number of lines in the sheet to skip before reading the header. Can be used when files use a fixed "preamble", e.g. to describe metadata or provenance.
XML reader
Module: ianalyzer_readers.readers.xml
This module defines the XML Reader.
Extraction is based on BeautifulSoup.
XMLReader
Bases: Reader
A base class for Readers that extract data from XML files.
The built-in functionality of the XML reader is quite versatile, and can be further expanded by adding custom Tag classes or extraction functions that interact directly with BeautifulSoup nodes.
The Reader is suitable for datasets where each file should be extracted as a single document, or ones where each file contains multiple documents.
In addition to generic extractor classes, this reader supports the XML extractor.
Attributes:
| Name | Type | Description |
|---|---|---|
| tag_toplevel | TagSpecification | the top-level tag to search from in source documents. |
| tag_entry | TagSpecification | the tag that corresponds to a single document entry in source documents. |
| external_file_tag_toplevel | TagSpecification | the top-level tag to search from in external documents (if that functionality is used) |
external_file_tag_toplevel = CurrentTag()
class-attribute
instance-attribute
The toplevel tag in external files (if you are using that functionality).
Can be:
- An XMLTag
- A callable that takes the metadata of the document as input and returns an XMLTag. The metadata dictionary includes the values of "regular" fields for the document.
tag_entry = CurrentTag()
class-attribute
instance-attribute
The tag that corresponds to a single document entry.
Can be:
- An XMLTag
- A callable that takes the metadata of the document as input and returns an XMLTag
tag_toplevel = CurrentTag()
class-attribute
instance-attribute
The top-level tag in the source documents.
Can be:
- An XMLTag
- A callable that takes the metadata of the document as input and returns an XMLTag.
data_from_bytes(data)
Parses the content of an XML file.
data_from_file(filename)
Returns a BeautifulSoup object for a given XML file.
HTML reader
Module: ianalyzer_readers.readers.html
This module defines the HTML Reader.
The HTML reader is implemented as a subclass of the XML reader, and uses BeautifulSoup to parse files.
HTMLReader
Bases: XMLReader
An HTML reader extracts data from HTML sources.
It is based on the XMLReader and supports the same options (tag_toplevel and
tag_entry).
In addition to generic extractor classes, this reader supports the XML extractor.
RDF reader
Module: ianalyzer_readers.readers.rdf
This module defines a Resource Description Framework (RDF) reader.
Extraction is based on the rdflib library.
RDFReader
Bases: Reader
A base class for Readers of Resource Description Framework files. These could be in Turtle, JSON-LD, RDFXML or other formats, see rdflib parsers.
data_from_file(path)
Read the RDF file indicated by path and return a graph. Override this function to parse multiple source files into one graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | | the name of the file to be parsed | required |
Returns:
| Type | Description |
|---|---|
| Graph | rdflib Graph object |
document_subjects(graph)
Override this function to return all subjects (i.e., the first part of an RDF triple) with which to search for data in the RDF graph. Typically, such subjects are identifiers or URLs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| graph | Graph | the graph to parse | required |
Returns:
| Type | Description |
|---|---|
| Iterable[Union[BNode, Literal, URIRef]] | generator or list of nodes |
get_uri_value(node)
A utility function to extract the last part of a URI. For instance, if the input is URIRef('https://purl.org/mynamespace/ernie') or URIRef('https://purl.org/mynamespace#ernie'), the function will return 'ernie'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| node | URIRef | a URIRef input node | required |
Returns:
| Type | Description |
|---|---|
| str | a string with the last element of the URI |
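The behaviour described for get_uri_value can be approximated on plain strings. The sketch below is a stand-in rather than the library function: last_uri_part is a hypothetical name, it takes a str instead of an rdflib URIRef, and the rule of splitting on "/" and "#" is inferred from the two examples above.

```python
import re


def last_uri_part(uri: str) -> str:
    """Return the text after the final '/' or '#' in a URI string."""
    return re.split(r"[/#]", uri)[-1]


from_path = last_uri_part("https://purl.org/mynamespace/ernie")
from_fragment = last_uri_part("https://purl.org/mynamespace#ernie")
```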
JSON reader
Module: ianalyzer_readers.readers.json
This module defines the JSONReader.
It can parse documents nested in one file, for which it uses the pandas library, or multiple files with one document each, which use the generic Python json parser.
JSONReader
Bases: Reader
A base class for Readers of JSON encoded data.
The reader can either be used on a collection of JSON files (single_document=True), in which each file represents a document,
or for a JSON file containing lists of documents.
If the attributes record_path and meta are set, they are used as arguments to pandas.json_normalize to unnest the JSON data.
Attributes:
| Name | Type | Description |
|---|---|---|
| single_document | bool | indicates whether the data is organized such that a file represents a single document |
| record_path | Optional[List[str]] | a path or list of paths by which a list of documents can be extracted from a large JSON file; irrelevant if single_document=True |
| meta | Optional[List[Union[str, List[str]]]] | a list of paths, or list of lists of paths, by which metadata common for all documents can be located; irrelevant if single_document=True |
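Examples:
Multiple documents in one file:
from ianalyzer_readers.extract import JSON
from ianalyzer_readers.readers.core import Field
from ianalyzer_readers.readers.json import JSONReader

example_data = {
    'path': {
        'sketch': 'Hungarian Phrasebook',
        'episode': 25,
        'to': {
            'records': [
                {'speech': 'I will not buy this record. It is scratched.', 'character': 'tourist'},
                {'speech': "No sir. This is a tobacconist's.", 'character': 'tobacconist'}
            ]
        }
    }
}

class MyJSONReader(JSONReader):
    # where the list of documents lives, and where shared metadata lives
    record_path = ['path', 'to', 'records']
    meta = [['path', 'sketch'], ['path', 'episode']]
    speech = Field('speech', JSON('speech'))
    character = Field('character', JSON('character'))
    sketch = Field('sketch', JSON('path.sketch'))
    episode = Field('episode', JSON('path.episode'))
    # a Reader expects its Field instances in a list called fields
    fields = [speech, character, sketch, episode]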
To define the paths used to extract the field values, consider the data format that pandas.json_normalize creates:
a table in which each row represents a document and each column corresponds to a path, either relative to the documents within record_path
or relative to the top level (meta). A path given as a list of keys is written with dots between the keys.
row,speech,character,path.sketch,path.episode
0,"I will not buy this record. It is scratched.","tourist","Hungarian Phrasebook",25
1,"No sir. This is a tobacconist's.","tobacconist","Hungarian Phrasebook",25
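The table above can be reproduced without pandas to see where each column comes from. The sketch below is a simplified standard-library approximation of the unnesting step, not a replacement for pandas.json_normalize: the normalize function is hypothetical and only handles the case where record_path is a list of keys and meta is a list of key lists.

```python
from typing import Any, Dict, List

example_data = {
    "path": {
        "sketch": "Hungarian Phrasebook",
        "episode": 25,
        "to": {
            "records": [
                {"speech": "I will not buy this record. It is scratched.", "character": "tourist"},
                {"speech": "No sir. This is a tobacconist's.", "character": "tobacconist"},
            ]
        },
    }
}


def follow(data: Any, keys: List[str]) -> Any:
    """Walk down nested dictionaries along a list of keys."""
    for key in keys:
        data = data[key]
    return data


def normalize(data: Dict, record_path: List[str], meta: List[List[str]]) -> List[Dict]:
    """Build one flat row per record, adding dotted columns for metadata."""
    rows = []
    for record in follow(data, record_path):
        row = dict(record)
        for path in meta:
            row[".".join(path)] = follow(data, path)
        rows.append(row)
    return rows


rows = normalize(
    example_data,
    record_path=["path", "to", "records"],
    meta=[["path", "sketch"], ["path", "episode"]],
)
```

Each row now has the keys speech, character, path.sketch and path.episode, which are the names passed to the JSON extractor in the reader above.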
Single document per file:
example_data = {
    'sketch': 'Hungarian Phrasebook',
    'episode': 25,
    'scene': {
        'character': 'tourist',
        'speech': 'I will not buy this record. It is scratched.'
    }
}

class MyJSONReader(JSONReader):
    single_document = True
    # nested values are reached by passing several keys to the extractor
    speech = Field('speech', JSON('scene', 'speech'))
    character = Field('character', JSON('scene', 'character'))
    sketch = Field('sketch', JSON('sketch'))
    episode = Field('episode', JSON('episode'))
    # a Reader expects its Field instances in a list called fields
    fields = [speech, character, sketch, episode]
meta = None
class-attribute
instance-attribute
a list of keywords, or list of lists of keywords, by which metadata for each document can be located,
if it is in a different path than record_path. Only relevant if single_document=False.
record_path = None
class-attribute
instance-attribute
a keyword or list of keywords by which a list of documents can be extracted from a large JSON file.
Only relevant if single_document=False.
single_document = False
class-attribute
instance-attribute
Set to True if the data is structured such that one document is encoded in one .json file.
In that case, the reader assumes that there are no lists in such a file.
Extractors
Module: ianalyzer_readers.extract
This module contains extractor classes that can be used to obtain values for each field in a Reader.
Some extractors are intended to work with specific Reader classes, while others
are generic.
Backup
Bases: Extractor
Try all given extractors in order and return the first result that evaluates as true
This is a generic extractor that can be used in any Reader.
Example usage:
Backup(Constant(None), Constant('foo'))
Since the first extractor returns None, the second extractor will be used, and the
extracted value would be 'foo'.
Note the difference with Choice: Backup is based on the extracted value,
Choice on the applicable parameter of each extractor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *extractors | Extractor | extractors to use. These should be listed in descending order of preference. | () |
| **kwargs | | additional options to pass on to | {} |
CSV
Bases: Extractor
This extractor extracts values from a list of CSV or spreadsheet rows.
It should be used in readers based on CSVReader or XLSXReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| column | str | The name of the column from which to extract the value. | required |
| multiple | bool | If a document spans multiple rows, the extracted value for a field with | False |
| convert_to_none | List[str] | optional, default is | [''] |
| **kwargs | | additional options to pass on to | {} |
Cache
Bases: Extractor
Can be wrapped around another extractor to prevent repeatedly extracting the same value.
This assumes that the value of the extractor will be the same within a document, a source file, or even across the whole dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| extractor | Extractor | Extractor of which the value is returned and cached. | required |
| level | str | The level at which values should be cached. Can be | 'document' |
| **kwargs | | additional options to pass on to | {} |
Note: caching is based on the extractor instance and will not work across instances. For instance, in the example below, there would be no caching across fields.
fields = [
Field(name='foo', extractor=Cache(XML('baz'))),
Field(name='bar', extractor=Cache(XML('baz')))
]
You could rewrite this as follows, so the XML tree is only queried once per document:
_my_extractor = Cache(XML('baz'))
fields = [
Field(name='foo', extractor=_my_extractor),
Field(name='bar', extractor=_my_extractor)
]
There is a similar issue when you use @property to define the fields of the
reader.
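Why sharing one instance matters can be shown with a toy wrapper. The sketch below is not the Cache extractor: CachedCall is a hypothetical class that simply remembers its first result and never resets, so it ignores the document, source and dataset levels. It only demonstrates that a cache stored on an instance cannot be seen by a second instance.

```python
from typing import Any, Callable, List


class CachedCall:
    """Hypothetical wrapper that remembers the result of its first call."""

    def __init__(self, function: Callable[[], Any]) -> None:
        self.function = function
        self.has_value = False
        self.value = None

    def __call__(self) -> Any:
        if not self.has_value:
            self.value = self.function()
            self.has_value = True
        return self.value


calls: List[str] = []


def expensive_query() -> str:
    """Stand-in for an expensive lookup; records every real invocation."""
    calls.append("query")
    return "baz"


# Two separate wrappers: the underlying function runs once per wrapper.
first, second = CachedCall(expensive_query), CachedCall(expensive_query)
first()
second()
separate_calls = len(calls)

# One shared wrapper: the underlying function runs only once.
calls.clear()
shared = CachedCall(expensive_query)
shared()
shared()
shared_calls = len(calls)
```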
Choice
Bases: Extractor
Use the first applicable extractor from a list of extractors.
This is a generic extractor that can be used in any Reader.
The Choice extractor will use the applicable property of its provided extractors
to check which applies.
Example usage:
Choice(Constant('foo', applicable=some_condition), Constant('bar'))
This would extract 'foo' if some_condition is met; otherwise,
the extracted value will be 'bar'.
Note the difference with Backup: Backup will select the first truthy value from a
list of extractors, but Choice only checks the applicable condition. For example:
Choice(
CSV('foo', applicable=Metadata('bar')),
CSV('baz'),
)
Backup(
CSV('foo', applicable=Metadata('bar')),
CSV('baz'),
)
These extractors behave differently if the "bar" condition holds, but the "foo" field
is empty. Backup will try to extract the "baz" field, Choice will not.
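The difference can be traced with plain functions. In the sketch below, each stand-in extractor is a pair of an applicability check and an extraction function; choice and backup are hypothetical helpers, not the library classes. For simplicity the "bar" condition is read from the same dictionary as the values, and returning None when nothing matches is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

# Each stand-in extractor is a pair: (applicable check, extraction function).
StandIn = Tuple[Callable[[Dict], bool], Callable[[Dict], Any]]


def choice(extractors: List[StandIn], row: Dict) -> Any:
    """Use the first extractor whose condition holds, whatever it returns."""
    for applicable, extract in extractors:
        if applicable(row):
            return extract(row)
    return None


def backup(extractors: List[StandIn], row: Dict) -> Any:
    """Return the first truthy result among the applicable extractors."""
    for applicable, extract in extractors:
        if applicable(row):
            value = extract(row)
            if value:
                return value
    return None


extractors: List[StandIn] = [
    (lambda row: bool(row["bar"]), lambda row: row["foo"]),  # like CSV('foo', applicable=...)
    (lambda row: True, lambda row: row["baz"]),  # like CSV('baz')
]

# The "bar" condition holds, but "foo" is empty.
row = {"bar": "yes", "foo": "", "baz": "fallback"}
choice_result = choice(extractors, row)
backup_result = backup(extractors, row)
```

With this input, choice_result is the empty "foo" value, while backup_result falls through to "baz".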
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *extractors | Extractor | extractors to choose from. These should be listed in descending order of preference. | () |
| **kwargs | | additional options to pass on to | {} |
Combined
Bases: Extractor
Apply all given extractors and return the results as a tuple.
This is a generic extractor that can be used in any Reader.
Example usage:
Combined(Constant('foo'), Constant('bar'))
This would extract ('foo', 'bar') for each document.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *extractors | Extractor | extractors to combine. | () |
| **kwargs | | additional options to pass on to | {} |
Constant
Bases: Extractor
This extractor 'extracts' the same value every time, regardless of input.
This is a generic extractor that can be used in any Reader.
It is especially useful in combination with Backup or Choice.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | the value that should be "extracted". | required |
| **kwargs | | additional options to pass on to | {} |
ExternalFile
Bases: Extractor
A free-form external file extractor: it provides a stream to stream_handler,
which can do whatever is needed to extract data from an external file. It relies on associated_file
being present in the metadata. Note that the XML extractor has built-in support for extracting
data from external files (by setting external_file), so you will probably want to use that if your
external file is XML.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| stream_handler | Callable | function that will handle the opened file. | required |
| **kwargs | | additional options to pass on to | {} |
Extractor
Bases: object
Base class for extractors.
An extractor contains a method that can be used to gather data for a field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `applicable` | `Union[Extractor, Callable[[Dict], bool], None]` | optional argument to check whether the extractor can be used. This should be another extractor, which is applied first; the containing extractor is only applied if the result is truthy. Any extractor can be used, as long as it's supported by the Reader in which it's used. If left as | `None` |
| `transform` | `Optional[Callable]` | optional function to transform or postprocess the extracted value. | `None` |
Source code in ianalyzer_readers/extract.py
apply(*nargs, **kwargs)
Test if the extractor is applicable to the given arguments and if so, try to extract the information.
Source code in ianalyzer_readers/extract.py
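The interplay between applicable, the extraction itself, and transform can be sketched as follows. This is a minimal model and not the library's implementation: the `Sketch*` classes are hypothetical, a callable `applicable` is assumed to receive the metadata dictionary, and returning `None` for a non-applicable extractor is an assumption as well.

```python
from typing import Any, Callable, Dict, Optional, Union


class SketchExtractor:
    """Hypothetical stand-in for the base class; not the library's code."""

    def __init__(
        self,
        applicable: Union['SketchExtractor', Callable[[Dict], bool], None] = None,
        transform: Optional[Callable] = None,
    ):
        self.applicable = applicable
        self.transform = transform

    def _apply(self, metadata: Dict) -> Any:
        raise NotImplementedError()

    def apply(self, metadata: Dict) -> Any:
        # Step 1: test applicability. An extractor is applied first and its
        # result is checked for truthiness; a plain callable is assumed to
        # receive the metadata dictionary.
        if isinstance(self.applicable, SketchExtractor):
            if not self.applicable.apply(metadata):
                return None
        elif callable(self.applicable):
            if not self.applicable(metadata):
                return None
        # Step 2: extract, then post-process with `transform` if one was given.
        value = self._apply(metadata)
        return self.transform(value) if self.transform else value


class SketchConstant(SketchExtractor):
    def __init__(self, value: Any, **kwargs):
        super().__init__(**kwargs)
        self.value = value

    def _apply(self, metadata: Dict) -> Any:
        return self.value


class SketchMetadata(SketchExtractor):
    def __init__(self, key: str, **kwargs):
        super().__init__(**kwargs)
        self.key = key

    def _apply(self, metadata: Dict) -> Any:
        return metadata.get(self.key)


# The title is only extracted (and upper-cased) when 'has_title' is truthy.
title = SketchMetadata('title', transform=str.upper,
                       applicable=SketchMetadata('has_title'))
with_title = title.apply({'title': 'report', 'has_title': True})
without_title = title.apply({'title': 'report', 'has_title': False})

# A callable can serve as the applicability check as well.
fallback = SketchConstant('n/a', applicable=lambda metadata: 'year' in metadata)
```

Note that transform only runs after the applicability check has passed, so it never sees input from an extractor that was skipped.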
JSON
Bases: Extractor
An extractor to extract data from JSON. This extractor assumes that each source is a dictionary without nested lists. When working with nested lists, use JSONReader to unnest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `keys` | `Iterable[str]` | the keys with which to retrieve a field value from the source | `()` |
Source code in ianalyzer_readers/extract.py
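One way to picture retrieving a value by keys is to follow the keys one level at a time through nested dictionaries. The sketch below is an illustration only: `get_by_keys` is a hypothetical helper, and both the treatment of the keys as a nested path and the `None` result for a missing key are assumptions rather than documented behaviour.

```python
from typing import Any, Iterable


def get_by_keys(source: dict, keys: Iterable[str]) -> Any:
    """Follow `keys` one level at a time; return None if a key is missing."""
    value: Any = source
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


source = {'title': 'Hamlet', 'author': {'name': 'Shakespeare', 'born': 1564}}

title = get_by_keys(source, ['title'])
author_name = get_by_keys(source, ['author', 'name'])
missing = get_by_keys(source, ['author', 'died'])
```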
Metadata
Bases: Extractor
This extractor extracts a value from provided metadata.
This is a generic extractor that can be used in any Reader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `key` | `str` | the key in the metadata dictionary that should be extracted. | required |
| `**kwargs` | | additional options to pass on to | `{}` |
Source code in ianalyzer_readers/extract.py
Order
Bases: Extractor
An extractor to keep track of the order of documents. By default, this is the order of documents within their source, but you can also track the order of sources.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `level` | `str` | Can be | `'document'` |
| `**kwargs` | | additional options to pass on to | `{}` |
Source code in ianalyzer_readers/extract.py
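The difference between tracking the order of documents and the order of sources can be sketched in plain Python. Everything here is hypothetical: the `with_order` helper, the `'source'` value for `level`, and the zero-based counting are assumptions made for the illustration, not confirmed details of the library.

```python
from typing import Dict, Iterator, List


def with_order(sources: List[List[str]], level: str = 'document') -> Iterator[Dict]:
    """Yield documents with an 'order' value at the requested level."""
    for source_index, documents in enumerate(sources):
        for document_index, text in enumerate(documents):
            # 'document': position within the current source.
            # anything else (here: 'source'): position of the source itself.
            order = document_index if level == 'document' else source_index
            yield {'text': text, 'order': order}


sources = [['a', 'b', 'c'], ['d', 'e']]
by_document = [doc['order'] for doc in with_order(sources)]
by_source = [doc['order'] for doc in with_order(sources, level='source')]
print(by_document)  # [0, 1, 2, 0, 1]
print(by_source)    # [0, 0, 0, 1, 1]
```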
Pass
Bases: Extractor
An extractor that just passes the value of another extractor.
This is a generic extractor that can be used in any Reader.
This is useful if you want to stack multiple transform arguments. For example:
Pass(Constant('foo ', transform=str.upper), transform=str.strip)
This will extract str.strip(str.upper('foo ')), i.e. 'FOO'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `extractor` | `Extractor` | the extractor of which the value should be passed | required |
| `**kwargs` | | additional options to pass on to | `{}` |
Source code in ianalyzer_readers/extract.py
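The order in which stacked transforms run can be checked in plain Python, without the library: the transform of the inner extractor is applied first, and the transform of the outer Pass is applied to its result. The variable names below are only illustrative.

```python
inner_transform = str.upper   # transform on the inner Constant
outer_transform = str.strip   # transform on the outer Pass

value = 'foo '
result = outer_transform(inner_transform(value))
print(repr(result))  # 'FOO'
```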
RDF
Bases: Extractor
An extractor to extract data from RDF triples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `predicates` | `Iterable[URIRef]` | an iterable of predicates (i.e., the middle part of an RDF triple) with which to query for objects. When no predicate is passed, the current subject is returned. | `()` |
| `multiple` | `bool` | if | `False` |
| `is_collection` | `bool` | specify whether the data of interest is a collection, i.e., sequential data. A collection is indicated by the predicates | `False` |
Source code in ianalyzer_readers/extract.py
XML
Bases: Extractor
Extractor for XML data. Searches through a BeautifulSoup document.
This extractor should be used in a Reader based on XMLReader. (Note that this
includes the HTMLReader.)
The XML extractor has a lot of options. When deciding how to extract a value, it usually makes sense to determine them in this order:
- Choose whether to use the source file (default), or use an external XML file by setting external_file.
- Choose where to start searching. The default searching point is the entry tag for the document, but you can also start from the top of the document by setting toplevel.
- Describe the tag(s) you're looking for as a Tag object. You can also provide multiple tags to chain queries.
- If you need to return all matching tags, rather than the first match, set multiple=True.
- Choose how to extract a value: set attribute, flatten, or extract_soup_func if needed.
- The extracted value is a string, or the output of extract_soup_func. To further transform it, add a function for transform.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tags` | `TagSpecification` | Tags to select. Each of these can be a If no tags are provided, the extractor will work from the starting tag. Tags represent a query to select tags from the current tag (e.g. the entry tag of the document). If you provide multiple, they are chained: each Tag query is applied to the results from the previous one. | `()` |
| `attribute` | `Optional[str]` | By default, the extractor will extract the text content of the tag. Set this property to extract the value of an attribute instead. | `None` |
| `flatten` | `bool` | When extracting the text content of a tag, This parameter does nothing if | `False` |
| `toplevel` | `bool` | If | `False` |
| `multiple` | `bool` | If | `False` |
| `external_file` | `bool` | If Note: this option is not supported when this extractor is nested in another extractor (like | `False` |
| `extract_soup_func` | `Optional[Callable]` | A function to extract a value directly from the soup element, instead of using the content string or an attribute. | `None` |
| `**kwargs` | | additional options to pass on to | `{}` |
|
Source code in ianalyzer_readers/extract.py
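The chaining of tag queries, and the difference between taking the first match, all matches, or an attribute, can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under assumptions: the extractor itself works on BeautifulSoup elements, and the `chain` helper and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE = """
<entry>
  <section name="a"><p>one</p><p>two</p></section>
  <section name="b"><p>three</p></section>
</entry>
""".strip()


def chain(start: ET.Element, tag_names: List[str]) -> List[ET.Element]:
    """Apply each query to the results of the previous one.

    With no tag names, the starting element itself is the result.
    """
    results = [start]
    for name in tag_names:
        results = [match for element in results
                   for match in element.iter(name) if match is not element]
    return results


entry = ET.fromstring(SAMPLE)
matches = chain(entry, ['section', 'p'])

all_values = [m.text for m in matches]               # comparable to multiple=True
first_value = matches[0].text if matches else None   # comparable to the default
names = [s.get('name') for s in chain(entry, ['section'])]  # comparable to attribute='name'

print(all_values)   # ['one', 'two', 'three']
print(first_value)  # one
print(names)        # ['a', 'b']
```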
XML tags
Module: ianalyzer_readers.xml_tag
This module defines the Tag class (and various subclasses).
This class is used in the XML extractor to read XML/HTML documents.
Each Tag describes a query for one or more XML tags based on their
characteristics. It implements a method find_in_soup that takes an
element as input and iterates over matching tags.
CurrentTag
Bases: Tag
A Tag query that will return the current tag.
Primarily useful as a default option.
Source code in ianalyzer_readers/xml_tag.py
FindParentTag
Bases: Tag
A Tag that will find a parent tag based on query arguments.
Unlike ParentTag, this searches for a tag with a query.
For example, FindParentTag('foo') will search for a <foo> ancestor
of the current tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
NextSiblingTag
Bases: Tag
A Tag that will look in an element's next siblings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
NextTag
Bases: Tag
A Tag that will look through tags following the current element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
ParentTag
Bases: Tag
A Tag that will select a parent tag based on a fixed level.
For example, ParentTag(2) will always go up two steps in the tree
and return that tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `level` | `int` | the number of steps to move up the tree. | `1` |
Source code in ianalyzer_readers/xml_tag.py
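The contrast between moving up a fixed number of steps (ParentTag) and searching upwards for a matching ancestor (FindParentTag) can be sketched with the standard library's ElementTree. This is an illustration under assumptions, not the library's code: `parent_at_level` and `find_parent` are hypothetical helpers, and returning the nearest matching ancestor is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

root = ET.fromstring('<foo><bar><baz><leaf/></baz></bar></foo>')

# ElementTree has no parent pointers, so build a child -> parent lookup first.
parents: Dict[ET.Element, ET.Element] = {
    child: parent for parent in root.iter() for child in parent
}


def parent_at_level(element: ET.Element, level: int = 1) -> Optional[ET.Element]:
    """Go up a fixed number of steps in the tree (the ParentTag idea)."""
    current: Optional[ET.Element] = element
    for _ in range(level):
        current = parents.get(current)
        if current is None:
            return None
    return current


def find_parent(element: ET.Element, name: str) -> Optional[ET.Element]:
    """Search upwards for an ancestor named `name` (the FindParentTag idea)."""
    current = parents.get(element)
    while current is not None and current.tag != name:
        current = parents.get(current)
    return current


leaf = root.find('.//leaf')
two_up = parent_at_level(leaf, 2)
foo_ancestor = find_parent(leaf, 'foo')
print(two_up.tag)        # bar
print(foo_ancestor.tag)  # foo
```

A fixed level is convenient when the document structure is rigid; searching by query is more robust when the depth of nesting varies.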
PreviousSiblingTag
Bases: Tag
A Tag that will look in an element's previous siblings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
PreviousTag
Bases: Tag
A Tag that will look through tags previous to the current element.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
SiblingTag
Bases: Tag
A Tag that will look in an element's siblings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
Tag
Describes a query for a tag in an XML tree.
This should be used as the base class for all other tags, which can override
the __init__() and find_in_soup() methods.
Tag is the most straightforward case: all arguments passed in the constructor
are passed on as-is to the find_all() method of the BeautifulSoup element, searching
descendants of the input tag.
See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters for different ways of searching. This includes searching by:
- a tag name (possibly as a regular expression)
- attributes of the tag
- the string content of the tag
- a function
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | positional arguments to pass on to | `()` |
| `**kwargs` | `Any` | named arguments to pass on to | `{}` |
Source code in ianalyzer_readers/xml_tag.py
find_in_soup(soup)
Find all results for this tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `soup` | `PageElement` | The element to search from. | required |
Returns:
| Type | Description |
|---|---|
| `Iterable[PageElement]` | An iterable of matching tags. Note that it is not guaranteed that the iterable contains any elements. |
When subclassing Tag, you will usually want to replace this method. The result
must be an iterable. (If only one result makes sense, it's an iterable with one
element.) If the tag may find multiple matches, it's recommended that this method
returns a generator or a bs4.ResultSet rather than collecting all results up
front.
Source code in ianalyzer_readers/xml_tag.py
find_next_in_soup(soup)
Find the first match for the tag, if any.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `soup` | `PageElement` | The element to search from. | required |
Returns:
| Type | Description |
|---|---|
| `Optional[PageElement]` | The first matching tag. Returns |
Source code in ianalyzer_readers/xml_tag.py
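The relationship between the two methods can be sketched as follows: find_in_soup produces matches lazily, and find_next_in_soup takes the first one, if any. This is a hypothetical stand-in built on the standard library's ElementTree rather than BeautifulSoup; `SketchTag` is not the library's class, and returning `None` when nothing matches is an assumption based on the `Optional` return type.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional


class SketchTag:
    """Hypothetical stand-in for a Tag subclass, using ElementTree elements."""

    def __init__(self, name: str):
        self.name = name

    def find_in_soup(self, soup: ET.Element) -> Iterator[ET.Element]:
        # A generator: matching descendants are produced one at a time,
        # rather than being collected up front.
        for element in soup.iter(self.name):
            if element is not soup:
                yield element

    def find_next_in_soup(self, soup: ET.Element) -> Optional[ET.Element]:
        # Take only the first match; None when the iterable is empty.
        return next(iter(self.find_in_soup(soup)), None)


doc = ET.fromstring('<doc><p>one</p><div><p>two</p></div></doc>')

all_texts = [element.text for element in SketchTag('p').find_in_soup(doc)]
first = SketchTag('p').find_next_in_soup(doc)
nothing = SketchTag('h1').find_next_in_soup(doc)
```

Because find_in_soup is a generator here, asking for only the first match does not require walking the whole tree.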
TransformTag
Bases: Tag
A Tag that will perform a transformation function.
This Tag allows you to run arbitrary code to move to anywhere in the XML tree.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `transform` | `Callable[[PageElement], Iterable[PageElement]]` | a function that takes an XML element as input and returns an iterable of XML elements. (Note that you can return an iterable of one, or an empty iterable, if you don't have multiple results.) | required |
Source code in ianalyzer_readers/xml_tag.py
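A transform function of the shape described above might look like the following sketch. It uses the standard library's ElementTree in place of BeautifulSoup elements, and `last_child` is a hypothetical example function, so treat it as an illustration of the expected input and output rather than code to pass to the library as-is.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def last_child(element: ET.Element) -> Iterable[ET.Element]:
    """Return the last child as a one-item list, or an empty list if there is none."""
    children = list(element)
    return children[-1:]


doc = ET.fromstring('<list><item>a</item><item>b</item><item>c</item></list>')

with_children = [child.text for child in last_child(doc)]
without_children = list(last_child(doc[0]))
print(with_children)     # ['c']
print(without_children)  # []
```

Returning a (possibly empty) iterable in both cases keeps the function's output uniform, whether or not a match exists.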