Python SDK
A Python library for working with data.world datasets.
This library simplifies accessing and using data from data.world. It provides convenient wrappers for data.world's APIs, so you can create and update datasets, add and modify files, and more. You can even build full applications on data.world's platform.
Install the Python library
You can install the library via pip, optionally with pandas support, or using the conda-forge channel if you use conda.
Method | Command |
---|---|
pip | pip install datadotworld |
pip (pandas) | pip install datadotworld[pandas] |
conda | conda install -c conda-forge datadotworld-py |
Configure the SDK
You need a data.world API authentication token to use this library. To obtain one:
- Go to the Integrations tab and select Python.
- Enable the Python library.
- Select the Manage tab and copy the Client Token.
There are two ways to configure the SDK:
Using CLI
To configure the library, run the following command:
dw configure
Using environment variables
Alternatively, the token can be provided via the DW_AUTH_TOKEN environment variable.
On macOS or Unix machines, run the following command, replacing <YOUR_TOKEN> with the token obtained earlier:
export DW_AUTH_TOKEN=<YOUR_TOKEN>
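Before calling the SDK from a script, it can help to confirm that the token is actually visible to the Python process. The sketch below is one way to do that. `read_token` is a hypothetical helper and not part of the SDK; it only checks that `DW_AUTH_TOKEN` is set and non-empty.

```python
import os


def read_token(env=None):
    """Return the token stored in DW_AUTH_TOKEN, or raise if it is missing.

    Hypothetical helper for illustration only; it is not an SDK function.
    """
    env = os.environ if env is None else env
    token = env.get('DW_AUTH_TOKEN', '').strip()
    if not token:
        raise RuntimeError(
            'DW_AUTH_TOKEN is not set; export it or run "dw configure"')
    return token


# Demonstration with an explicit mapping instead of the real environment.
example_token = read_token({'DW_AUTH_TOKEN': 'example-token'})
```

Calling `read_token()` without arguments reads from the real process environment.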
Connect to data.world
Connecting to the public API
To work with the data.world public API, simply use the default settings provided in the SDK. There's no need to change anything to start querying and retrieving data from the public data.world platform.
Connecting to single-tenant installation
If you're working within a single-tenant installation of data.world (for example, a custom domain such as your_org.data.world), you'll need to specify your environment settings so that the SDK connects to the correct hosts.
For example:
from os import environ

def create_url(subdomain, environment):
    # Insert the environment name into the host when one is set.
    if environment:
        subdomain = subdomain + '.' + environment
    return 'https://{}.data.world'.format(subdomain)

# Each host can be overridden on its own; otherwise it is derived from DW_ENVIRONMENT.
DW_ENVIRONMENT = environ.get('DW_ENVIRONMENT', '')
API_HOST = environ.get('DW_API_HOST', create_url('api', DW_ENVIRONMENT))
DOWNLOAD_HOST = environ.get('DW_DOWNLOAD_HOST', create_url('download', DW_ENVIRONMENT))
QUERY_HOST = environ.get('DW_QUERY_HOST', create_url('query', DW_ENVIRONMENT))
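To see what these defaults resolve to, the following sketch repeats `create_url` from the example above and evaluates it for the public platform and for a single-tenant environment. The environment name `your_org` is a placeholder.

```python
def create_url(subdomain, environment):
    # Same logic as the example above, repeated so this block runs on its own.
    if environment:
        subdomain = subdomain + '.' + environment
    return 'https://{}.data.world'.format(subdomain)


# Empty environment: the public data.world hosts.
public_api = create_url('api', '')

# Placeholder environment name for a single-tenant installation.
tenant_api = create_url('api', 'your_org')
tenant_query = create_url('query', 'your_org')

print(public_api)    # https://api.data.world
print(tenant_api)    # https://api.your_org.data.world
print(tenant_query)  # https://query.your_org.data.world
```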
Import the datadotworld module
For example:
import datadotworld as dw
You're done!
That's it. With this configuration in place, you're all set to use the SDK to run live queries on data.world, as shown in the following examples! 🎉
Load a dataset
The load_dataset() function facilitates maintaining copies of datasets on the local filesystem. It downloads a given dataset's datapackage and stores it under ~/.dw/cache. On subsequent calls, load_dataset() uses the copy stored on disk and works offline, unless it is called with force_update=True or auto_update=True. force_update=True overwrites your local copy unconditionally. auto_update=True overwrites your local copy only if a newer version of the dataset is available on data.world.
Once loaded, a dataset (data and metadata) can be conveniently accessed via the object returned by load_dataset().
load_dataset(dataset_key, force_update=False, auto_update=False)
Parameters
Parameter | Type | Description |
---|---|---|
dataset_key | str | Dataset identifier, in the form of owner/id or of a URL. |
force_update | bool | Flag indicating if a new copy of the dataset should be downloaded, replacing any previously downloaded copy (Default = False). |
auto_update | bool | Flag indicating whether the dataset should be updated to the latest version, if a newer one is available (Default = False). |
Function Outcome
Category | Detail | Description |
---|---|---|
Returns | Object | The object representing the dataset. |
Type | LocalDataset | Denotes the type of the returned object. |
Raises | RestApiError | If a server error occurs. |
To download a dataset and work with it locally, invoke the load_dataset() function.
For example:
intro_dataset = dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset')
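The caching rules described above for force_update and auto_update can be summarized as a small decision function. This is only a model of the documented behavior, not the SDK's implementation; `needs_download` and its arguments are hypothetical names.

```python
def needs_download(has_local_copy, force_update=False, auto_update=False,
                   remote_is_newer=False):
    """Model of when load_dataset() would fetch a dataset from data.world.

    Illustrative only: the names and structure are assumptions, and the
    real SDK may decide differently in edge cases.
    """
    if not has_local_copy:
        return True   # nothing cached yet, so a download is required
    if force_update:
        return True   # overwrite the local copy unconditionally
    if auto_update and remote_is_newer:
        return True   # overwrite only when a newer version exists
    return False      # reuse the cached copy and work offline


print(needs_download(has_local_copy=True))                     # False
print(needs_download(has_local_copy=True, force_update=True))  # True
print(needs_download(has_local_copy=True, auto_update=True,
                     remote_is_newer=False))                   # False
```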
Access data via properties
Dataset objects allow access to data via three different properties: raw_data, tables and dataframes.
Each of these properties is a mapping (dict) whose values are of type bytes, list and pandas.DataFrame, respectively. Values are lazy loaded and cached once loaded. Their keys are the names of the files contained in the dataset.
For example:
>>> intro_dataset.dataframes
LazyLoadedDict({
'changelog': LazyLoadedValue(<pandas.DataFrame>),
'datadotworldbballstats': LazyLoadedValue(<pandas.DataFrame>),
'datadotworldbballteam': LazyLoadedValue(<pandas.DataFrame>)})
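"Lazy loaded and cached" means that a value is built the first time its key is accessed and reused afterwards. The sketch below illustrates that idea with a minimal mapping. `LazyDict` is a hypothetical class written for this example; it is not the SDK's `LazyLoadedDict`, and the loaded values are placeholders.

```python
class LazyDict:
    """Minimal illustration of lazy loading with caching."""

    def __init__(self, loaders):
        self._loaders = loaders  # maps a name to a zero-argument function
        self._cache = {}

    def __getitem__(self, name):
        # Run the loader only on first access, then reuse the stored value.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def keys(self):
        return self._loaders.keys()


load_count = {'changelog': 0}


def load_changelog():
    load_count['changelog'] += 1
    return ['placeholder row 1', 'placeholder row 2']


files = LazyDict({'changelog': load_changelog})
first = files['changelog']   # triggers the loader
second = files['changelog']  # served from the cache
```

After both accesses, `load_count['changelog']` is 1 and `first` and `second` are the same object.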
IMPORTANT
Not all files in a dataset are tabular, therefore some will be exposed via raw_data only.
Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.
For example:
>>> stats_table = intro_dataset.tables['datadotworldbballstats']
>>> stats_table[0]
OrderedDict([('Name', 'Jon'),
('PointsPerGame', Decimal('20.4')),
('AssistsPerGame', Decimal('1.3'))])
You can also review the metadata associated with a file or the entire dataset, using the describe() function.
For example:
>>> intro_dataset.describe()
{'homepage': 'https://data.world/jonloyens/an-intro-to-dataworld-dataset',
'name': 'jonloyens_an-intro-to-dataworld-dataset',
'resources': [{'format': 'csv',
'name': 'changelog',
'path': 'data/ChangeLog.csv'},
{'format': 'csv',
'name': 'datadotworldbballstats',
'path': 'data/DataDotWorldBBallStats.csv'},
{'format': 'csv',
'name': 'datadotworldbballteam',
'path': 'data/DataDotWorldBBallTeam.csv'}]}
>>> intro_dataset.describe('datadotworldbballstats')
{'format': 'csv',
'name': 'datadotworldbballstats',
'path': 'data/DataDotWorldBBallStats.csv',
'schema': {'fields': [{'name': 'Name', 'title': 'Name', 'type': 'string'},
{'name': 'PointsPerGame',
'title': 'PointsPerGame',
'type': 'number'},
{'name': 'AssistsPerGame',
'title': 'AssistsPerGame',
'type': 'number'}]}}
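Because describe() returns plain dictionaries, the schema can be inspected with ordinary Python. The following sketch copies the file metadata shown above into a literal and extracts the column types; `column_types` is a hypothetical helper name.

```python
# Metadata copied from the describe('datadotworldbballstats') output above.
resource = {
    'format': 'csv',
    'name': 'datadotworldbballstats',
    'path': 'data/DataDotWorldBBallStats.csv',
    'schema': {'fields': [
        {'name': 'Name', 'title': 'Name', 'type': 'string'},
        {'name': 'PointsPerGame', 'title': 'PointsPerGame', 'type': 'number'},
        {'name': 'AssistsPerGame', 'title': 'AssistsPerGame', 'type': 'number'},
    ]},
}


def column_types(resource_metadata):
    """Map each column name to its declared type (empty if no schema)."""
    fields = resource_metadata.get('schema', {}).get('fields', [])
    return {field['name']: field['type'] for field in fields}


types = column_types(resource)
numeric_columns = [name for name, kind in types.items() if kind == 'number']
print(numeric_columns)  # ['PointsPerGame', 'AssistsPerGame']
```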
Query a dataset
The query() function allows datasets to be queried live using the SQL or SPARQL query languages.
To query a dataset, invoke the query() function.
query(dataset_key, query, query_type='sql', parameters=None)
Parameters
Parameter | Type | Description |
---|---|---|
dataset_key | str | Dataset identifier, in the form of owner/id or of a URL. |
query | str | SQL or SPARQL query. |
query_type | {'sql', 'sparql'} | The type of the query. Must be either 'sql' or 'sparql'. (Default value = "sql") |
parameters | query parameters | Parameters to the query - if SPARQL query, this should be a dict containing named parameters; if SQL query, this should be a list containing positional parameters. Boolean values will be converted to xsd:boolean, Integer values to xsd:integer, and other Numeric values to xsd:decimal. Anything else is treated as a String literal. (Default = None) |
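The conversion rules in the parameters description can be modeled with a short function. This is a sketch of the documented mapping only; `xsd_type` is a hypothetical name, and the SDK's own conversion code may differ in detail.

```python
from decimal import Decimal
from numbers import Number


def xsd_type(value):
    """Return the literal type the documentation describes for a parameter."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return 'xsd:boolean'
    if isinstance(value, int):
        return 'xsd:integer'
    if isinstance(value, Number):
        return 'xsd:decimal'
    return 'string literal'


# A SQL query takes positional parameters as a list.
sql_parameters = ['Jon', 20, 1.3, True]
print([xsd_type(value) for value in sql_parameters])
# ['string literal', 'xsd:integer', 'xsd:decimal', 'xsd:boolean']
```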
Function Outcome
Category | Detail | Description |
---|---|---|
Returns | Object | Object containing the results of the query. |
Type | QueryResults | Denotes the type of the returned object. |
Raises | RuntimeError | If a server error occurs. |
For example:
results = dw.query('jonloyens/an-intro-to-dataworld-dataset', 'SELECT * FROM DataDotWorldBBallStats')
Query result objects allow access to the data via raw_data, table and dataframe properties, of type json, list and pandas.DataFrame, respectively.
For example:
>>> results.dataframe
Name PointsPerGame AssistsPerGame
0 Jon 20.4 1.3
1 Rob 15.5 8.0
2 Sharon 30.1 11.2
3 Alex 8.2 0.5
4 Rebecca 12.3 17.0
5 Ariane 18.1 3.0
6 Bryon 16.0 8.5
7 Matt 13.0 2.1
Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.
For example:
>>> results.table[0]
OrderedDict([('Name', 'Jon'),
('PointsPerGame', Decimal('20.4')),
('AssistsPerGame', Decimal('1.3'))])
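Since a table is a list of dict rows, standard Python tools are enough for simple analysis. The sketch below rebuilds the first three rows of the result shown above as literals, so it runs without a connection. Representing the values for Rob and Sharon as Decimal is an assumption based on the first row.

```python
from decimal import Decimal

# First three rows of the query result above, rebuilt by hand.
rows = [
    {'Name': 'Jon', 'PointsPerGame': Decimal('20.4'),
     'AssistsPerGame': Decimal('1.3')},
    {'Name': 'Rob', 'PointsPerGame': Decimal('15.5'),
     'AssistsPerGame': Decimal('8.0')},
    {'Name': 'Sharon', 'PointsPerGame': Decimal('30.1'),
     'AssistsPerGame': Decimal('11.2')},
]

# Find the row with the highest points per game.
top_scorer = max(rows, key=lambda row: row['PointsPerGame'])

# Decimal values add up exactly, without binary floating-point rounding.
total_points = sum(row['PointsPerGame'] for row in rows)

print(top_scorer['Name'])  # Sharon
print(total_points)        # 66.0
```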
To query using SPARQL, invoke query() with query_type='sparql'; otherwise, the query is assumed to be SQL.
Just like in the dataset case, you can view the metadata associated with a query result using the describe() function.
For example:
>>> results.describe()
{'fields': [{'name': 'Name', 'type': 'string'},
{'name': 'PointsPerGame', 'type': 'number'},
{'name': 'AssistsPerGame', 'type': 'number'}]}
Work with files
The open_remote_file() function allows you to write data to or read data from a file in a data.world dataset.
open_remote_file(dataset_key, file_name, mode='w', **kwargs)
Parameters
Parameter | Type | Description |
---|---|---|
dataset_key | str | Dataset identifier, in the form of owner/id. |
file_name | str | The name of the file to open. |
mode | str | The mode for the file. Must be 'w', 'wb', 'r', or 'rb', indicating read or write ('r'/'w') and, optionally, binary handling of the file data. (Default value = 'w') |
chunk_size | int | Size of chunked bytes to return when reading streamed bytes in 'rb' mode (optional). |
decode_unicode | bool | Whether to decode textual responses as unicode when returning streamed lines in 'r' mode (optional). |
**kwargs | - | Additional keyword arguments. |
Writing files
The object returned from the open_remote_file() call is similar to a file handle that would be used to write to a local file. It has a write() method, and contents sent to that method are written to the file remotely.
>>> import datadotworld as dw
>>>
>>> with dw.open_remote_file('username/test-dataset', 'test.txt') as w:
...     w.write("this is a test.")
>>>
Of course, writing a text file isn't the primary use case for data.world: you want to write your data! The object returned by open_remote_file() should be usable anywhere you could normally use a local file handle in write mode, so you can use it to serialize the contents of a pandas DataFrame to a CSV file...
>>> import pandas as pd
>>> df = pd.DataFrame({'foo':[1,2,3,4],'bar':['a','b','c','d']})
>>> with dw.open_remote_file('username/test-dataset', 'dataframe.csv') as w:
...     df.to_csv(w, index=False)
Or, to write a series of dict objects as a JSON Lines file...
>>> import json
>>> with dw.open_remote_file('username/test-dataset', 'test.jsonl') as w:
...     json.dump({'foo':42, 'bar':"A"}, w)
...     w.write('\n')  # JSON Lines expects one object per line
...     json.dump({'foo':13, 'bar':"B"}, w)
...     w.write('\n')
>>>
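For more than a couple of records, a small helper keeps the JSON Lines format consistent. The sketch below assumes only that the handle has a `write()` method that accepts text, which is what the examples above rely on. `write_jsonl` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the remote file so the block runs offline.

```python
import io
import json


def write_jsonl(handle, records):
    """Write each record as one JSON object per line."""
    for record in records:
        handle.write(json.dumps(record))
        handle.write('\n')


# io.StringIO stands in for the handle returned by open_remote_file().
buffer = io.StringIO()
write_jsonl(buffer, [{'foo': 42, 'bar': 'A'}, {'foo': 13, 'bar': 'B'}])
lines = buffer.getvalue().splitlines()
print(len(lines))  # 2
```

With the SDK, the same helper would be called inside a `with dw.open_remote_file(...) as w:` block, passing `w` as the handle.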
Or, to write a series of dict objects as a CSV...
>>> import csv
>>> with dw.open_remote_file('username/test-dataset', 'test.csv') as w:
...     csvw = csv.DictWriter(w, fieldnames=['foo', 'bar'])
...     csvw.writeheader()
...     csvw.writerow({'foo':42, 'bar':"A"})
...     csvw.writerow({'foo':13, 'bar':"B"})
>>>
And finally, you can write binary data by streaming bytes or bytearray objects, if you open the file in binary mode...
>>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='wb') as w:
...     w.write(bytes([100,97,116,97,46,119,111,114,108,100]))
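The byte values in the example above are the ASCII codes for the string `data.world`. The short check below shows the equivalence, which may be handy when preparing binary payloads from text.

```python
# The same byte values used in the binary write example above.
payload = bytes([100, 97, 116, 97, 46, 119, 111, 114, 108, 100])

# Decoding as ASCII recovers the text.
text = payload.decode('ascii')
print(text)  # data.world

# Encoding the text produces the identical payload.
same_payload = 'data.world'.encode('ascii')
```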
Reading files
You can also read data from a file in a similar fashion:
>>> with dw.open_remote_file('username/test-dataset', 'test.txt', mode='r') as r:
...     print(r.read())
Reading from the file into common parsing libraries works naturally, too. When opened in 'r' mode, the file object acts as an iterator of the lines in the file:
>>> with dw.open_remote_file('username/test-dataset', 'test.csv', mode='r') as r:
...     csvr = csv.DictReader(r)
...     for row in csvr:
...         print(row['column a'], row['column b'])
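`csv.DictReader` accepts any iterable of text lines, which is why an iterator of lines can be passed to it directly. The sketch below shows the same pattern with a plain list of lines in place of the remote file. `read_rows` is a hypothetical helper, and the sample lines mirror the `foo`/`bar` CSV written earlier.

```python
import csv


def read_rows(lines):
    """Parse an iterable of CSV text lines into a list of dicts."""
    return [dict(row) for row in csv.DictReader(lines)]


# A list of lines stands in for the file object opened in 'r' mode.
sample_lines = ['foo,bar\n', '42,A\n', '13,B\n']
rows = read_rows(sample_lines)
print(rows[0]['foo'], rows[0]['bar'])  # 42 A
```

Note that the csv module returns every value as a string, so `rows[0]['foo']` is `'42'`, not the integer 42.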
Reading binary files works naturally, too. When opened in 'rb' mode, read() returns the contents of the file as a byte array, and the file object acts as an iterator of bytes:
>>> with dw.open_remote_file('username/test-dataset', 'test', mode='rb') as r:
...     data = r.read()  # named "data" to avoid shadowing the built-in bytes
Append records to stream
The append_records() function allows you to append JSON data to a data stream associated with a dataset. Streams do not need to be created in advance; a stream is created automatically the first time its stream_id is used in an append operation.
append_records(dataset_key, stream_id, body)
Parameters
Parameter | Type | Description |
---|---|---|
dataset_key | str | Dataset identifier, in the form of owner/id. |
stream_id | str | Stream unique identifier. |
body | obj | Object body. |
Function Outcome
Category | Detail | Description |
---|---|---|
Raises | RestApiError | If a server error occurs. |
For example:
>>> client = dw.api_client()
>>> client.append_records('username/test-dataset','streamId', {'data': 'data'})
Contents of a stream will appear as part of the respective dataset as a .jsonl file.
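To send several records, one option is to call append_records() once per record. The sketch below assumes only the call shape shown in the example above. `append_all` and `RecordingClient` are hypothetical names; the recording client stands in for `dw.api_client()` so the block runs without a network connection.

```python
def append_all(client, dataset_key, stream_id, records):
    """Append each record to the stream and return how many were sent."""
    count = 0
    for record in records:
        client.append_records(dataset_key, stream_id, record)
        count += 1
    return count


class RecordingClient:
    """Stand-in client that records calls instead of contacting data.world."""

    def __init__(self):
        self.calls = []

    def append_records(self, dataset_key, stream_id, body):
        self.calls.append((dataset_key, stream_id, body))


fake_client = RecordingClient()
sent = append_all(fake_client, 'username/test-dataset', 'streamId',
                  [{'data': 'a'}, {'data': 'b'}])
print(sent)  # 2
```

With the real SDK, `fake_client` would be replaced by the object returned from `dw.api_client()`.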
What's next?
Go to the Python API Client Methods page to see a complete list of available SDK methods. You can also find out more about those functions using help(client).