Basic Concepts
The Airbyte Specification
As a quick recap, the Airbyte Specification requires an Airbyte Source to support 4 distinct operations:
-
Spec
- The required configuration in order to interact with the underlying technical system e.g. databaseinformation, authentication information etc.
-
Check
- Validate that the provided configuration is valid with sufficient permissions for one to perform allrequired operations on the Source.
-
Discover
- Discover the Source's schema. This let users select what a subset of the data to sync. Usefulif users require only a subset of the data.
-
Read
- Perform the actual syncing process. Data is read from the Source, parsed intoAirbyteRecordMessage
sand sent to the Airbyte Destination. Depending on how the Source is implemented, this sync can be incremental
or a full-refresh.
A core concept discussed here is the Source.
The Source contains one or more Streams (or Airbyte Streams). A Stream is the other concept key to understanding how Airbyte models the data syncing process. A Stream models the logical data groups that make up the larger Source. If the Source is a RDMS, each Stream is a table. In a REST API setting, each Stream corresponds to one resource within the API. e.g. a Stripe Source would have have one Stream for Transactions
, one for Charges
and so on.
The Source
class
Airbyte provides abstract base classes which make it much easier to perform certain categories of tasks e.g: HttpStream
makes it easy to create HTTP API-based streams. However, if those do not satisfy your use case (for example, if you're pulling data from a relational database), you can always directly implement the Airbyte Protocol by subclassing the CDK's Source
class.
The Source
class implements the Spec
operation by looking for a file named spec.yaml
(or spec.json
) in the module's root by default. This is expected to be a json schema file that specifies the required configuration. Here is an example from the Exchange Rates source.
Note that while this is the most flexible way to implement a source connector, it is also the most toilsome as you will be required to manually manage state, input validation, correctly conforming to the Airbyte Protocol message formats, and more. We recommend using a subclass of Source
unless you cannot fulfill your use case otherwise.
The AbstractSource
Object
AbstractSource
is a more opinionated implementation of Source
. It implements Source
's 4 methods as follows:
Check
delegates to the AbstractSource
's check_connection
function. The function's config
parameter contains the user-provided configuration, specified in the spec.yaml
returned by Spec
. check_connection
uses this configuration to validate access and permissioning. Here is an example from the same Exchange Rates API.
The Stream
Abstract Base Class
An AbstractSource
also owns a set of Stream
s. This is populated via the AbstractSource
's streams
function. Discover
and Read
rely on this populated set.
Discover
returns an AirbyteCatalog
representing all the distinct resources the underlying API supports. Here is the entrypoint for those interested in reading the code. See schemas for more information on how to declare the schema of a stream.
Read
creates an in-memory stream reading from each of the AbstractSource
's streams. Here is the entrypoint for those interested.
As the code examples show, the AbstractSource
delegates to the set of Stream
s it owns to fulfill both Discover
and Read
. Thus, implementing AbstractSource
's streams
function is required when using the CDK.
A summary of what we've covered so far on how to use the Airbyte CDK:
- A concrete implementation of the
AbstractSource
object is required. - This involves,
- implementing the
check_connection
function. - Creating the appropriate
Stream
classes and returning them in thestreams
function. - placing the above mentioned
spec.yaml
file in the right place.
- implementing the
HTTP Streams
We've covered how the AbstractSource
works with the Stream
interface in order to fulfill the Airbyte Specification. Although developers are welcome to implement their own object, the CDK saves developers the hassle of doing so in the case of HTTP APIs with the HTTPStream
object.