Session 1: Single Track
What is gos 🦆?#
Motivation#
So far in the tutorial we have provided a general overview of the Gosling visualization grammar and introduced its key features:
Expressiveness
Data scalability
Encoding scalability
Coordinated interactivity
Although Gosling is extremely flexible, its JSON representation is cumbersome to construct natively in popular programming languages (like Python), and deploying a Gosling-based visualization requires administering a web server. Together these challenges are a barrier to entry for some, and we were motivated to design a simplified API for computational biologists to visualize their own datasets with Gosling.
Enter gos…
Overview#
gos is a declarative Python library designed to create interactive multi-scale visualizations of genomics and epigenomics data. Its main features include:
Authoring declarative genomics visualizations which adhere to the Gosling JSON Specification
Displaying Gosling visualizations directly in computational notebooks (Jupyter, JupyterLab, Google Colab)
Transparently hosting genomics datasets for visualizations (hiding web server complexities)
How it works#
The gos Python library exposes a simple API that maps directly to the formal Gosling JSON specification. Users write Python programs with gos which ultimately:
Emit JSON (the Gosling visualization)
Automatically render said JSON within computational notebooks
Note: You need not understand the low-level details of gos. The important thing to keep in mind is that gos (by design) is guaranteed to be consistent with the formal Gosling grammar, and therefore learning gos will teach you Gosling and vice versa.
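As a rough picture of the "Emit JSON" step, the sketch below builds a small Gosling-style specification by hand using only the standard library. The keys (`tracks`, `data`, `mark`, `x`, `width`, `height`) mirror the track output printed later in this notebook; the exact JSON that gos emits may differ, and the URL is a placeholder rather than a real dataset.

```python
import json

# Hand-written approximation of a single-track Gosling specification.
# "https://example.org/example.bed" is a placeholder, not a real dataset.
spec = {
    "tracks": [
        {
            "data": {
                "type": "csv",
                "url": "https://example.org/example.bed",
                "separator": "\t",
            },
            "mark": "point",
            "x": {"field": "chromStart", "type": "genomic"},
            "width": 800,
            "height": 180,
        }
    ]
}

# Serialize the nested dictionaries to a JSON string
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Writing nested dictionaries like this by hand is the part that gos is meant to take over.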
Getting started#
The remainder of this notebook will focus on introducing the Gosling grammar through the declarative Python API.
Start by importing gosling.
!pip install gosling==0.0.9
import gosling as gos
Note: it is a convention to import as gos and then access the API through this namespace.
Track fundamentals#
gos exposes two fundamental building blocks for genomics visualizations:
A gos.Track is the core component of a genomics visualization; it defines explicit transformations and mappings of genomics data to visual properties.
A gos.View is a grouping of one or more gos.Track objects that share the same linked genomic domain.
Depiction of a Gosling visualization. Distinct Views (light orange/blue/green) contain several Tracks (dark orange/blue/green).
We will start by loading a BED file containing UCSC hg38 cytoband information. In gos an abstract genomic data source is defined and bound to a Track directly through the Python API.
data_url = "https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed"
!curl -s {data_url} | head | column -t
chr1 0 2300000 p36.33 gneg
chr1 2300000 5300000 p36.32 gpos25
chr1 5300000 7100000 p36.31 gneg
chr1 7100000 9100000 p36.23 gpos25
chr1 9100000 12500000 p36.22 gneg
chr1 12500000 15900000 p36.21 gpos50
chr1 15900000 20100000 p36.13 gneg
chr1 20100000 23600000 p36.12 gpos25
chr1 23600000 27600000 p36.11 gneg
chr1 27600000 29900000 p35.3 gpos25
# The dataset is a BED4+1 file which can be read in Gosling as the CSV datatype
data = gos.csv(
    url=data_url,
    separator="\t",  # BED files are tab-delimited
    headerNames=['chrom', 'chromStart', 'chromEnd', 'name', 'stain'],  # the +1 field is stain
    chromosomeField="chrom",  # the column containing chromosome names
    genomicFields=["chromStart", "chromEnd"],  # fields with (relative) genomic coordinates
)
# bind the data to a track
gos.Track(data)
Track({
data: {'type': 'csv', 'url': 'https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed', 'separator': '\t', 'headerNames': ['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], 'chromosomeField': 'chrom', 'genomicFields': ['chromStart', 'chromEnd']},
height: 180,
mark: 'bar',
width: 800
})
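To make the role of `separator` and `headerNames` concrete, here is a small standard-library sketch, independent of gos, that reads the first three rows of the file shown above as a headerless tab-delimited table and assigns the same column names. It only illustrates the tabular view of the data and is not how Gosling itself loads the file.

```python
import csv
import io

# First three rows of the cytoband file shown above (tab-delimited, no header row)
bed_text = (
    "chr1\t0\t2300000\tp36.33\tgneg\n"
    "chr1\t2300000\t5300000\tp36.32\tgpos25\n"
    "chr1\t5300000\t7100000\tp36.31\tgneg\n"
)

header_names = ["chrom", "chromStart", "chromEnd", "name", "stain"]

# DictReader assigns our column names to each positional field
reader = csv.DictReader(io.StringIO(bed_text), fieldnames=header_names, delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["chrom"], row["chromStart"], row["chromEnd"], row["stain"])
```

Every value is read as a string here; declaring `genomicFields` is what tells Gosling which columns to treat as genomic coordinates.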
The Track above is now bound to the genomics data, but the Gosling grammar requires the root of every visualization to be a View, which may contain one or more Tracks.
In order to complete a Gosling specification for the track in isolation, we use the gos.Track.view() method to wrap the track in a gos.View. In Jupyter or Google Colab, the visualization is automatically rendered in the cell below rather than printed as a Python object like above.
track = gos.Track(data)
view = track.view()
print(view)
view
View({
tracks: [Track({
data: {'type': 'csv', 'url': 'https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed', 'separator': '\t', 'headerNames': ['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], 'chromosomeField': 'chrom', 'genomicFields': ['chromStart', 'chromEnd']},
height: 180,
mark: 'bar',
width: 800
})]
})
Something appeared on the screen but our visualization looks empty. What’s going on?
We haven’t declared how to map the dataset to any visual properties!
Use the gos.Track.mark_*() and gos.Track.encode() methods to specify a mark and what visual encodings to apply.
# explore available marks with `gos.Track(data).mark_*`
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
).view()
# Add another encoding for the `point` mark
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
    y=gos.Y("chromStart", type="genomic", axis="left"),  # y-position
).view()
# use `gos.value()` for a constant value rather than data-derived encoding
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
    color=gos.value("lightblue"),
)
Track({
color: ColorValue({
value: 'lightblue'
}),
data: {'type': 'csv', 'url': 'https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed', 'separator': '\t', 'headerNames': ['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], 'chromosomeField': 'chrom', 'genomicFields': ['chromStart', 'chromEnd']},
height: 180,
mark: 'point',
width: 800,
x: X({
shorthand: 'chromStart',
type: 'genomic'
})
})
You can read more about the specific visual channels which are supported by each mark type in the Gosling documentation.
Exercise#
Modify the visualization below with the following:
add a size encoding with the constant value 10
change the mark type from point to triangleRight
change the y encoding to use the "chromEnd" field instead of "chromStart"
change the width and height of the track to be 500
track = gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
    y=gos.Y("chromStart", type="genomic", axis="left"),
    # additional encodings ...
).properties(
    # track property overrides ...
)
track.view()
Data-types#
The specifics of an encoding depend on the type of the data. Gosling recognizes three datatypes:
| Data Type | Shorthand Code | Description |
|---|---|---|
| quantitative | Q | a continuous real-valued quantity |
| nominal | N | a discrete unordered category |
| genomic | G | a genomic base-pair position |
Data types can either be expressed in long-form like above, or a short-hand syntax can be used to remove boilerplate when exploring visual encodings.
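The shorthand is simply a field name followed by a colon and one of the type codes from the table above. The helper below is a hypothetical illustration of that convention written for this tutorial; `parse_shorthand` is not part of gos, and the library's own handling of shorthand strings may differ.

```python
# Hypothetical helper, not part of gos: split "field:CODE" into field and type
TYPE_CODES = {"Q": "quantitative", "N": "nominal", "G": "genomic"}


def parse_shorthand(shorthand: str) -> dict:
    field, sep, code = shorthand.rpartition(":")
    if not sep:
        # No type code given; return the field name only
        return {"field": shorthand}
    if code not in TYPE_CODES:
        raise ValueError(f"unknown type code: {code!r}")
    return {"field": field, "type": TYPE_CODES[code]}


print(parse_shorthand("chromStart:G"))
print(parse_shorthand("stain:N"))
```

Under this reading, `gos.X("chromStart:G")` and `gos.X("chromStart", type="genomic")` describe the same encoding.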
Genomic y-encoding#
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),  # can change to the 'shorthand' syntax
    y=gos.Y("chromStart:G", axis="left")
).view()
Quantitative y-encoding#
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("chromStart:Q", axis="left"),  # change y-encoding to quantitative
).view()
Nominal (categorical) y-encoding#
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N")  # change y-encoding to nominal field
).view()
Literal value color-encoding#
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
    color=gos.value("black"),  # no data-type, just a literal value!
).view()
Multiple encodings#
We can use both y position and color to encode the same field.
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
).view()
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
    # custom colormapping
    color=gos.Color(
        "stain:N",
        domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
        range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"],
        legend=True,
    ),
).view()
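The custom color mapping pairs each entry of `domain` with the entry of `range` at the same position. The plain-Python sketch below spells out that pairing, assuming positional matching as is usual for categorical color scales; it does not call gos.

```python
domain = ["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"]
color_range = ["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"]

# Pair each category with the color at the same position
stain_to_color = dict(zip(domain, color_range))

for stain in ["gneg", "gpos50", "gvar"]:
    print(stain, "->", stain_to_color[stain])
```

Keeping the two lists the same length and in the same order is what makes darker Giemsa stains map to darker grays.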
# change mark and add text-encoding
gos.Track(data).mark_text().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
    color=gos.Color(
        "stain:N",
        domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
        range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"],
        legend=True,
    ),
    text=gos.Text("stain:N")
).view()
Example: Simplified ideogram#
We can create a simplified ideogram using the concepts from above by changing the mark for our track and specifying additional encodings.
track = gos.Track(data).mark_rect().encode(
    # defines start and end of rectangle mark
    x=gos.X("chromStart:G", axis="top"),
    xe=gos.Xe("chromEnd:G"),
    # defines how to map Giemsa-stain factor to colors
    color=gos.Color(
        "stain:N",
        domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
        range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"]
    ),
    # customize the style of the visual marks.
    size=gos.value(20),
    stroke=gos.value("gray"),
    strokeWidth=gos.value(0.5)
)
track.view()
Additional parameters for the resulting gos.View can be passed in as well for convenience. We can easily set a title and xDomain for our visualization, setting the initial genomic region to display human chromosome 1.
track.view(
    title="Gos is awesome!",
    xDomain=gos.GenomicDomain(chromosome="chr1"),
)
Exercise#
Modify the existing track definition above to encode the nominal stain field with y position.
track.encode(
    # y=???
).view()
Track reuse and composition#
This section demonstrates how to build more complex, layered tracks and write functions to reuse visualizations for other data sources.
A common pattern in gos is the reuse of gos.Track instances to create new, modified gos.Track or gos.View objects. This feature allows users to be much more concise with the Python API compared to the Gosling JSON equivalent.
Note: in the code example above, we reuse track to create a new view (with title and xDomain) without redefining the visualization from scratch.
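This kind of reuse is safe when each method call hands back a new object instead of editing the one it was called on, which is what the examples in this notebook suggest. The toy class below sketches that copy-on-modify pattern with a frozen dataclass. `SketchTrack` and its methods are made up for illustration and are not the gos.Track implementation.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SketchTrack:
    """Toy stand-in for a track; NOT the gos.Track implementation."""

    mark: str = "bar"
    encodings: dict = field(default_factory=dict)

    def mark_as(self, mark: str) -> "SketchTrack":
        # Return a modified copy and leave the original untouched
        return replace(self, mark=mark)

    def encode(self, **channels: str) -> "SketchTrack":
        # Merge new channels over existing ones; later values win
        return replace(self, encodings={**self.encodings, **channels})


base = SketchTrack().encode(x="chromStart:G")
points = base.mark_as("point").encode(y="stain:N", color="black")
colored = points.encode(color="stain:N")  # overrides only the color channel

print(base)
print(points)
print(colored)
```

Because `base` never changes, any number of variants can be derived from it without one affecting another.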
The data#
In this section we will visualize several scATAC-seq “pseudobulk” tracks from the Corces et al. (Nature Genetics, 2020) multi-omic atlas of the human brain. Each scATAC-seq track is stored in a separate BigWig file and represents the normalized aggregate signal of all cells from a given cell type in the human brain.
urls = [
    f"https://s3.amazonaws.com/gosling-lang.org/data/{file}"
    for file in [
        "ExcitatoryNeurons-insertions_bin100_RIPnorm.bw",
        "InhibitoryNeurons-insertions_bin100_RIPnorm.bw",
        "Microglia-insertions_bin100_RIPnorm.bw",
        "Astrocytes-insertions_bin100_RIPnorm.bw",
    ]
]
urls
['https://s3.amazonaws.com/gosling-lang.org/data/ExcitatoryNeurons-insertions_bin100_RIPnorm.bw',
'https://s3.amazonaws.com/gosling-lang.org/data/InhibitoryNeurons-insertions_bin100_RIPnorm.bw',
'https://s3.amazonaws.com/gosling-lang.org/data/Microglia-insertions_bin100_RIPnorm.bw',
'https://s3.amazonaws.com/gosling-lang.org/data/Astrocytes-insertions_bin100_RIPnorm.bw']
In the previous section we visualized a text-based data format which was already tabular (BED). For non-tabular data formats supported by Gosling (e.g., BAM, BigWig) we need to be explicit about how to translate these files to a tabular representation in Gosling for which we can build our visualization.
data = gos.bigwig(urls[0], column="position", value="peak")
data
{'type': 'bigwig',
'url': 'https://s3.amazonaws.com/gosling-lang.org/data/ExcitatoryNeurons-insertions_bin100_RIPnorm.bw',
'column': 'position',
'value': 'peak'}
gos.Track(data).mark_bar().encode(
    x=gos.X("position:G"),
    y=gos.Y("peak:Q", axis="right"),
).view()
Overlay track#
With the quantitative data type for peak, we can experiment with different marks in our visualization. Rather than repeating the definition, we use a base track to derive other modified tracks.
base = gos.Track(data).encode(
    x=gos.X("position:G")
).properties(height=60)
heatmap = base.mark_rect().encode(
    color=gos.Color("peak:Q"),
)
heatmap.view()
line = base.mark_line().encode(
    y=gos.Y("peak:Q"),
    color=gos.value("gray"),
)
line.view()
points = line.mark_point()
points.view()
colored_points = points.encode(
    color=gos.Color("peak:Q"),  # overrides color encoding
    size=gos.Size("peak:Q"),
)
colored_points.view()
Since the tracks above share some of the same encodings, we can layer the tracks together to create a composite "overlay" which combines different marks.
gos.overlay(line, colored_points) # returns a View
While this example is fairly contrived, the ability to create overlay tracks is an essential feature in Gosling and allows much more complex visual encodings (e.g. genome annotations).
Track function#
The previous section demonstrated further how track definitions may be reused and extended. In this section we will show how a Python function can be defined to reuse a track definition for other data sources.
We can refactor this snippet from earlier so that we have a function that generates a barplot for any scATAC-seq track above.
from typing import Optional

def barplot(url: str, title: Optional[str] = None, color: Optional[str] = None) -> gos.Track:
    # translate the BigWig file to a tabular representation
    data = gos.bigwig(url, column="position", value="peak")
    track = gos.Track(data).mark_bar().encode(
        x=gos.X("position:G"),
        y=gos.Y("peak:Q", axis="right"),
    )
    # apply the optional overrides only when they are provided
    if color:
        track = track.encode(color=gos.value(color))
    if title:
        track = track.properties(title=title)
    return track.properties(height=40)
barplot(urls[0]).view()
We can then reuse this utility to create multiple tracks to combine together.
tracks = []
for url, color in zip(urls, ["#F29B67", "#3DC491", "#565C8B", "#77C0FA"]):
    # derive the title from the file name, e.g. "ExcitatoryNeurons"
    title = url.split("/")[-1].split("-")[0]
    track = barplot(url=url, title=title, color=color)
    tracks.append(track)
tracks
[Track({
color: ColorValue({
value: '#F29B67'
}),
data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/ExcitatoryNeurons-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
height: 40,
mark: 'bar',
title: 'ExcitatoryNeurons',
width: 800,
x: X({
shorthand: 'position:G'
}),
y: Y({
axis: 'right',
shorthand: 'peak:Q'
})
}),
Track({
color: ColorValue({
value: '#3DC491'
}),
data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/InhibitoryNeurons-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
height: 40,
mark: 'bar',
title: 'InhibitoryNeurons',
width: 800,
x: X({
shorthand: 'position:G'
}),
y: Y({
axis: 'right',
shorthand: 'peak:Q'
})
}),
Track({
color: ColorValue({
value: '#565C8B'
}),
data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/Microglia-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
height: 40,
mark: 'bar',
title: 'Microglia',
width: 800,
x: X({
shorthand: 'position:G'
}),
y: Y({
axis: 'right',
shorthand: 'peak:Q'
})
}),
Track({
color: ColorValue({
value: '#77C0FA'
}),
data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/Astrocytes-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
height: 40,
mark: 'bar',
title: 'Astrocytes',
width: 800,
x: X({
shorthand: 'position:G'
}),
y: Y({
axis: 'right',
shorthand: 'peak:Q'
})
})]
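The track titles in the output above come from the file name portion of each URL. As a slightly more defensive variant of the inline `split` calls, the sketch below uses the standard library to isolate the file name before taking the text ahead of the first hyphen. `title_from_url` is a name chosen for this example and is not part of gos.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def title_from_url(url: str) -> str:
    # Take the final path component, e.g. "Microglia-insertions_bin100_RIPnorm.bw"
    filename = PurePosixPath(urlparse(url).path).name
    # Keep the text before the first hyphen
    return filename.split("-", 1)[0]


url = "https://s3.amazonaws.com/gosling-lang.org/data/Microglia-insertions_bin100_RIPnorm.bw"
print(title_from_url(url))
```

Parsing the URL first means a query string or fragment would not leak into the title.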
tracks[1].view()
Finally, the tracks are combined into a single visualization with gos.stack.
# returns a `View` which shares one genomic domain across all child tracks
gos.stack(*tracks).properties(
    xDomain=gos.GenomicDomain(chromosome="3", interval=[52168000, 52890000]),
)