Python SDK

A Python library for working with data.world datasets.

This library simplifies the process of accessing and using data from data.world. It offers convenient wrappers for data.world's APIs, enabling you to create and update datasets, add and modify files, and more, making it possible to build entire applications on data.world's platform.

Install the Python library

You can install the library via pip, optionally with pandas support, or using the conda-forge channel if you use conda.

Method        Command
pip           pip install datadotworld
pip (pandas)  pip install datadotworld[pandas]
conda         conda install -c conda-forge datadotworld-py

Configure the SDK

You need a data.world API authentication token to use this library.

  1. Go to the Integrations tab and select Python.
  2. Enable the Python library.
  3. Select the Manage tab and copy the Client Token.

There are two ways to configure the SDK:

Using CLI

To configure the library, run the following command:

dw configure

Using environment variables

Alternatively, tokens can be provided via the DW_AUTH_TOKEN environment variable.
On macOS or Unix machines, run the following, replacing <YOUR_TOKEN> with the token obtained earlier:

export DW_AUTH_TOKEN=<YOUR_TOKEN>
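
On Windows, the equivalent Command Prompt command is:

set DW_AUTH_TOKEN=<YOUR_TOKEN>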

Connect to data.world

Connecting to the public API

To work with the data.world public API, simply use the default settings provided in the SDK. There's no need to change anything to start querying and retrieving data from the public data.world platform.

Connecting to a single-tenant installation

If you're working within a single-tenant installation of data.world (for example, your custom domain like your_org.data.world), you'll need to specify your environment settings to ensure your SDK connects correctly.

For example:

from os import environ

def create_url(subdomain, environment):
    # Prefix the subdomain with the environment name (e.g. 'api.your_org'),
    # so the hosts resolve to the single-tenant installation.
    if environment:
        subdomain = subdomain + '.' + environment
    return 'https://{}.data.world'.format(subdomain)

DW_ENVIRONMENT = environ.get('DW_ENVIRONMENT', '')
API_HOST = environ.get('DW_API_HOST', create_url('api', DW_ENVIRONMENT))
DOWNLOAD_HOST = environ.get('DW_DOWNLOAD_HOST', create_url('download', DW_ENVIRONMENT))
QUERY_HOST = environ.get('DW_QUERY_HOST', create_url('query', DW_ENVIRONMENT))
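
With the snippet above, setting a single variable is enough to point all three hosts at your installation. For example, for a hypothetical installation at your_org.data.world:

export DW_ENVIRONMENT=your_org

This resolves the hosts to https://api.your_org.data.world, https://download.your_org.data.world, and https://query.your_org.data.world.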

Import the datadotworld module

For example:

import datadotworld as dw

👍

You're done!

That's it! After this configuration, you're all set to use the SDK to query data.world in real time, as shown in the examples below! 🎉

Load a dataset

The load_dataset() function maintains copies of datasets on the local filesystem. It downloads a given dataset's datapackage and stores it under ~/.dw/cache. On subsequent calls, load_dataset() uses the copy stored on disk and works offline, unless it is called with force_update=True or auto_update=True. force_update=True overwrites your local copy unconditionally; auto_update=True overwrites your local copy only if a newer version of the dataset is available on data.world.

Once loaded, a dataset (data and metadata) can be conveniently accessed via the object returned by load_dataset().

load_dataset(dataset_key, force_update=False, auto_update=False)

Parameters

Parameter     Type  Description
dataset_key   str   Dataset identifier, in the form of owner/id or of a URL.
force_update  bool  Flag indicating whether a new copy of the dataset should be downloaded, replacing any previously downloaded copy (Default = False).
auto_update   bool  Flag indicating whether the dataset should be updated to the latest version, if available (Default = False).

Function Outcome

Category  Detail        Description
Returns   Object        The object representing the dataset.
Type      LocalDataset  Denotes the type of the returned object.
Raises    RestApiError  If a server error occurs.

To download a dataset and work with it locally, invoke the load_dataset() function.

For example:

intro_dataset = dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset')
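
To control the caching behavior described above, pass force_update or auto_update. For example, to refresh the local copy only when a newer version exists on data.world:

intro_dataset = dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset', auto_update=True)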

Access data via properties

Dataset objects allow access to data via three different properties: raw_data, tables, and dataframes. Each of these properties is a mapping (dict) whose values are of type bytes, list, and pandas.DataFrame, respectively. Values are lazy-loaded and cached once loaded. Their keys are the names of the files contained in the dataset.

For example:

>>> intro_dataset.dataframes
LazyLoadedDict({
    'changelog': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballstats': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballteam': LazyLoadedValue(<pandas.DataFrame>)})
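
Individual values are loaded on first access and cached thereafter. For instance, to get one of the dataframes listed above:

>>> stats_df = intro_dataset.dataframes['datadotworldbballstats']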

🚧

IMPORTANT

Not all files in a dataset are tabular; therefore, some will be exposed via raw_data only.

Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.

For example:

>>> stats_table = intro_dataset.tables['datadotworldbballstats']
>>> stats_table[0]
OrderedDict([('Name', 'Jon'),
             ('PointsPerGame', Decimal('20.4')),
             ('AssistsPerGame', Decimal('1.3'))])

You can also review the metadata associated with a file or with the entire dataset, using the describe() function.
For example:

>>> intro_dataset.describe()
{'homepage': 'https://data.world/jonloyens/an-intro-to-dataworld-dataset',
 'name': 'jonloyens_an-intro-to-dataworld-dataset',
 'resources': [{'format': 'csv',
   'name': 'changelog',
   'path': 'data/ChangeLog.csv'},
  {'format': 'csv',
   'name': 'datadotworldbballstats',
   'path': 'data/DataDotWorldBBallStats.csv'},
  {'format': 'csv',
   'name': 'datadotworldbballteam',
   'path': 'data/DataDotWorldBBallTeam.csv'}]}
>>> intro_dataset.describe('datadotworldbballstats')
{'format': 'csv',
 'name': 'datadotworldbballstats',
 'path': 'data/DataDotWorldBBallStats.csv',
 'schema': {'fields': [{'name': 'Name', 'title': 'Name', 'type': 'string'},
                       {'name': 'PointsPerGame',
                        'title': 'PointsPerGame',
                        'type': 'number'},
                       {'name': 'AssistsPerGame',
                        'title': 'AssistsPerGame',
                        'type': 'number'}]}}

Query a dataset

The query() function allows datasets to be queried live, using the SQL or SPARQL query languages.

To query a dataset, invoke the query() function.

query(dataset_key, query, query_type='sql', parameters=None)

Parameters

Parameter    Type               Description
dataset_key  str                Dataset identifier, in the form of owner/id or of a URL.
query        str                SQL or SPARQL query.
query_type   {'sql', 'sparql'}  The type of the query. Must be either 'sql' or 'sparql' (Default = 'sql').
parameters   dict or list       Parameters to the query: for a SPARQL query, a dict of named parameters; for a SQL query, a list of positional parameters. Boolean values are converted to xsd:boolean, integer values to xsd:integer, and other numeric values to xsd:decimal; anything else is treated as a string literal (Default = None).

Function Outcome

Category  Detail        Description
Returns   Object        Object containing the results of the query.
Type      QueryResults  Denotes the type of the returned object.
Raises    RuntimeError  If a server error occurs.

For example:

results = dw.query('jonloyens/an-intro-to-dataworld-dataset', 'SELECT * FROM DataDotWorldBBallStats')
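
Queries can also be parameterized, per the parameters argument described above. A minimal sketch, assuming data.world's SQL dialect accepts ? as a positional placeholder:

results = dw.query(
    'jonloyens/an-intro-to-dataworld-dataset',
    'SELECT * FROM DataDotWorldBBallStats WHERE PointsPerGame > ?',
    parameters=[15])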

Query result objects allow access to the data via the raw_data, table, and dataframe properties, of type json, list, and pandas.DataFrame, respectively.

For example:

>>> results.dataframe
      Name  PointsPerGame  AssistsPerGame
0      Jon           20.4             1.3
1      Rob           15.5             8.0
2   Sharon           30.1            11.2
3     Alex            8.2             0.5
4  Rebecca           12.3            17.0
5   Ariane           18.1             3.0
6    Bryon           16.0             8.5
7     Matt           13.0             2.1

Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.
For example:

>>> results.table[0]
OrderedDict([('Name', 'Jon'),
             ('PointsPerGame', Decimal('20.4')),
             ('AssistsPerGame', Decimal('1.3'))])

To query using SPARQL, invoke query() with query_type='sparql'; otherwise, the query is assumed to be SQL.
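
For example, running a generic SPARQL query (shown here for illustration only) against the same dataset:

results = dw.query(
    'jonloyens/an-intro-to-dataworld-dataset',
    'SELECT * WHERE { ?s ?p ?o. } LIMIT 10',
    query_type='sparql')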

Just like in the dataset case, you can view the metadata associated with a query result using the describe()
function.

For example:

>>> results.describe()
{'fields': [{'name': 'Name', 'type': 'string'},
            {'name': 'PointsPerGame', 'type': 'number'},
            {'name': 'AssistsPerGame', 'type': 'number'}]}

Work with files

The open_remote_file() function allows you to write data to or read data from a file in a
data.world dataset.

open_remote_file(dataset_key, file_name, mode='w', **kwargs)

Parameters

Parameter       Type  Description
dataset_key     str   Dataset identifier, in the form of owner/id.
file_name       str   The name of the file to open.
mode            str   The mode for the file. Must be 'w', 'wb', 'r', or 'rb', indicating read/write ('r'/'w') and, optionally, binary handling of the file data (Default = 'w').
chunk_size      int   Size of chunked bytes to return when reading streamed bytes in 'rb' mode (optional).
decode_unicode  bool  Whether to decode textual responses as unicode when returning streamed lines in 'r' mode (optional).
**kwargs        -     Additional keyword arguments.

Writing files

The object returned from the open_remote_file() call is similar to a file handle used to write to a local file: it has a write() method, and contents sent to that method are written to the file remotely.

    >>> import datadotworld as dw
    >>>
    >>> with dw.open_remote_file('username/test-dataset', 'test.txt') as w:
    ...   w.write("this is a test.")
    >>>

Of course, writing a text file isn't the primary use case for data.world - you want to write your
data! The object returned from open_remote_file() should be usable anywhere you could normally
use a local file handle in write mode - so you can use it to serialize the contents of a pandas
DataFrame to a CSV file...

    >>> import pandas as pd
    >>> df = pd.DataFrame({'foo':[1,2,3,4],'bar':['a','b','c','d']})
    >>> with dw.open_remote_file('username/test-dataset', 'dataframe.csv') as w:
    ...   df.to_csv(w, index=False)

Or, to write a series of dict objects as a JSON Lines file...

    >>> import json
    >>> with dw.open_remote_file('username/test-dataset', 'test.jsonl') as w:
    ...   json.dump({'foo':42, 'bar':"A"}, w)
    ...   w.write('\n')  # json.dump does not add the newline that JSON Lines requires
    ...   json.dump({'foo':13, 'bar':"B"}, w)
    ...   w.write('\n')
    >>>

Or to write a series of dict objects as a CSV...

    >>> import csv
    >>> with dw.open_remote_file('username/test-dataset', 'test.csv') as w:
    ...   csvw = csv.DictWriter(w, fieldnames=['foo', 'bar'])
    ...   csvw.writeheader()
    ...   csvw.writerow({'foo':42, 'bar':"A"})
    ...   csvw.writerow({'foo':13, 'bar':"B"})
    >>>

And finally, you can write binary data by streaming bytes or bytearray objects, if you open the
file in binary mode...

    >>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='wb') as w:
    ...   w.write(bytes([100,97,116,97,46,119,111,114,108,100]))  # the ASCII bytes for 'data.world'

Reading files

You can also read data from a file in a similar fashion:

    >>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='r') as r:
    ...   print(r.read())

Reading from the file into common parsing libraries works naturally, too - when opened in 'r' mode, the
file object acts as an iterator of the lines in the file:

    >>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='r') as r:
    ...   csvr = csv.DictReader(r)
    ...   for row in csvr:
    ...      print(row['column a'], row['column b'])

Reading binary files works naturally, too - when opened in 'rb' mode, read() returns the contents of
the file as a byte array, and the file object acts as an iterator of bytes:

    >>> with dw.open_remote_file('username/test-dataset', 'test', mode='rb') as r:
    ...   data = r.read()  # avoid naming the result 'bytes', which would shadow the built-in type
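
The chunk_size parameter from the table above can help when streaming larger files; a sketch, assuming that iteration in 'rb' mode yields chunks of roughly that size:

    >>> with dw.open_remote_file('username/test-dataset', 'test', mode='rb', chunk_size=1048576) as r:
    ...   for chunk in r:
    ...      pass  # process each chunk here (processing logic omitted)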

Append records to a stream

The append_records() function allows you to append JSON data to a data stream associated with a dataset. Streams do not need to be created in advance; they are created automatically the first time a stream_id is used in an append operation.

append_records(dataset_key, stream_id, body)

Parameters

Parameter    Type  Description
dataset_key  str   Dataset identifier, in the form of owner/id.
stream_id    str   Stream unique identifier.
body         obj   Object body.

Function Outcome

Category  Detail            Description
Raises    RestApiException  If a server error occurs.

For example:

>>> client = dw.api_client()
>>> client.append_records('username/test-dataset', 'streamId', {'data': 'data'})

Contents of a stream will appear as part of the respective dataset as a .jsonl file.
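
Because streams are created on first use, you can append successive records without any setup. A minimal sketch using a hypothetical sensor-readings stream:

>>> client = dw.api_client()
>>> for record in [{'sensor': 'a', 'value': 1}, {'sensor': 'a', 'value': 2}]:
...     client.append_records('username/test-dataset', 'sensor-readings', record)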

📘

What's next?

Go to the Python API Client Methods page to see a complete list of available SDK methods. You can also learn more about these methods using help(client).