Session 1: Single Track#

What is gos 🦆?#

Motivation#

So far in the tutorial we have provided a general overview of the Gosling visualization grammar and introduced its key features:

  • Expressiveness

  • Data scalability

  • Encoding scalability

  • Coordinated interactivity

Although Gosling is extremely flexible, its JSON representation is cumbersome to construct directly from popular programming languages (like Python), and deploying a Gosling-based visualization requires administering a web server. Together these challenges are a barrier to entry for some users, which motivated us to design a simplified API that lets computational biologists visualize their own datasets with Gosling.

Enter gos…

Overview#

gos is a declarative Python library designed to create interactive multi-scale visualizations of genomics and epigenomics data. Its main features include:

  • Authoring declarative genomics visualizations which adhere to the Gosling JSON Specification

  • Displaying Gosling visualizations directly in computational notebooks (Jupyter, JupyterLab, Google Colab)

  • Transparently hosting genomics datasets for visualizations (hiding web server complexities)


How it works#

The gos Python library exposes a simple API that maps directly to the formal Gosling JSON specification. Users write Python programs with gos which ultimately:

  • Emit JSON (the Gosling visualization)

  • Automatically render said JSON within computational notebooks

Note You need not understand the low-level details of gos. The important thing to keep in mind is that gos is, by design, guaranteed to be consistent with the formal Gosling grammar, so learning gos teaches you Gosling and vice versa.
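
As a rough illustration of the "emit JSON" step, the sketch below assembles a Gosling-style specification as a plain Python dictionary and serializes it with the standard json module. It does not use gos itself: the helper make_track_spec and the example URL are hypothetical, and the keys simply mirror the fields that gos prints later in this notebook.

```python
import json


def make_track_spec(url: str, mark: str = "bar") -> dict:
    """Hypothetical helper: assemble a minimal track as a plain dict."""
    return {
        "data": {"type": "csv", "url": url},
        "mark": mark,
        "width": 800,
        "height": 180,
    }


# The root of the specification is a view holding one or more tracks.
# The URL is a placeholder, not a real dataset.
spec = {"tracks": [make_track_spec("https://example.com/cytoband.bed")]}

# Serialize the nested dictionaries to the JSON text a renderer would consume.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The point of the sketch is only that a specification is ordinary nested data; gos adds validation and notebook rendering on top of this idea.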

Getting started#

The remainder of this notebook will focus on introducing the Gosling grammar through the declarative Python API.

Start by installing and importing gosling.

!pip install gosling==0.0.9
import gosling as gos

Note By convention, gosling is imported as gos and the API is accessed through this namespace.

Track fundamentals#

gos exposes two fundamental building-blocks for genomics visualizations:

A gos.Track is the core component of a genomics visualization that defines explicit transformations and mappings of genomics data to visual properties.

A gos.View is a grouping of one or more gos.Track objects that share the same linked genomic domain.

Depiction of a Gosling visualization. Distinct Views (light orange/blue/green) contain several Tracks (dark orange/blue/green).

We will start by loading a BED file containing UCSC hg38 cytoband information. In gos an abstract genomic data source is defined and bound to a Track directly through the Python API.

data_url = "https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed"
!curl -s {data_url} | head | column -t
chr1  0         2300000   p36.33  gneg
chr1  2300000   5300000   p36.32  gpos25
chr1  5300000   7100000   p36.31  gneg
chr1  7100000   9100000   p36.23  gpos25
chr1  9100000   12500000  p36.22  gneg
chr1  12500000  15900000  p36.21  gpos50
chr1  15900000  20100000  p36.13  gneg
chr1  20100000  23600000  p36.12  gpos25
chr1  23600000  27600000  p36.11  gneg
chr1  27600000  29900000  p35.3   gpos25
# The dataset is a BED4+1 file which can be read in Gosling as the CSV datatype
data = gos.csv(
    url=data_url,
    separator="\t", # BED files are tab-delimited
    headerNames=['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], # the +1 field is stain
    chromosomeField="chrom", # the column containing chromosome names
    genomicFields=["chromStart", "chromEnd"], # fields with (relative) genomic coordinates
)

# bind the data to a track
gos.Track(data)
Track({
  data: {'type': 'csv', 'url': 'https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed', 'separator': '\t', 'headerNames': ['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], 'chromosomeField': 'chrom', 'genomicFields': ['chromStart', 'chromEnd']},
  height: 180,
  mark: 'bar',
  width: 800
})
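
To make the options passed to gos.csv more concrete, the following sketch reads the first two rows of the cytoband file with Python's standard csv module, using the same column names and tab separator. This only illustrates what the options describe; it is not how Gosling parses the file internally.

```python
import csv
import io

# Column names matching the `headerNames` passed to gos.csv above.
HEADER_NAMES = ["chrom", "chromStart", "chromEnd", "name", "stain"]
# Columns that hold base-pair coordinates (the `genomicFields` option).
GENOMIC_FIELDS = ["chromStart", "chromEnd"]

# First two rows of the cytoband file shown above, tab-delimited.
bed_text = (
    "chr1\t0\t2300000\tp36.33\tgneg\n"
    "chr1\t2300000\t5300000\tp36.32\tgpos25\n"
)

reader = csv.DictReader(io.StringIO(bed_text), fieldnames=HEADER_NAMES, delimiter="\t")
rows = []
for row in reader:
    # Convert coordinate columns from text to integers.
    for field in GENOMIC_FIELDS:
        row[field] = int(row[field])
    rows.append(row)

print(rows[0])
```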

The Track above is now bound to the genomics data, but the Gosling grammar requires the root of every visualization to be a View, which may contain one or more Tracks.

To complete a Gosling specification for the track in isolation, we use the gos.Track.view() method to wrap the track in a gos.View. In Jupyter or Google Colab, the visualization is automatically rendered in the cell below rather than printed as a Python object like above.

track = gos.Track(data)
view = track.view()
print(view)
view
View({
  tracks: [Track({
    data: {'type': 'csv', 'url': 'https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed', 'separator': '\t', 'headerNames': ['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], 'chromosomeField': 'chrom', 'genomicFields': ['chromStart', 'chromEnd']},
    height: 180,
    mark: 'bar',
    width: 800
  })]
})

Something appeared on the screen but our visualization looks empty. What’s going on?

We haven’t declared how to map the dataset to any visual properties!

Use the gos.Track.mark_*() and gos.Track.encode() methods to specify a mark and what visual encodings to apply.

# explore available marks with `gos.Track(data).mark_*`
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
).view()
# Add another encoding for the `point` mark
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
    y=gos.Y("chromStart", type="genomic", axis="left"), # y-position
).view()
# use `gos.value()` for a constant value rather than data-derived encoding
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
    color=gos.value("lightblue"),
)
Track({
  color: ColorValue({
    value: 'lightblue'
  }),
  data: {'type': 'csv', 'url': 'https://raw.githubusercontent.com/sehilyi/gemini-datasets/master/data/UCSC.HG38.Human.CytoBandIdeogram.bed', 'separator': '\t', 'headerNames': ['chrom', 'chromStart', 'chromEnd', 'name', 'stain'], 'chromosomeField': 'chrom', 'genomicFields': ['chromStart', 'chromEnd']},
  height: 180,
  mark: 'point',
  width: 800,
  x: X({
    shorthand: 'chromStart',
    type: 'genomic'
  })
})

You can read more about the specific visual channels which are supported by each mark type in the Gosling documentation.
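
The difference between a data-derived encoding and a constant one can be sketched with plain dictionaries. The helpers field_channel and value_channel are hypothetical names used for illustration only; the resulting layout is an approximation of how such channels appear in a JSON specification, not output produced by gos.

```python
def field_channel(field: str, type_: str) -> dict:
    """Hypothetical helper: a channel whose value is read from a data column."""
    return {"field": field, "type": type_}


def value_channel(value) -> dict:
    """Hypothetical helper: a channel fixed to one constant for every mark."""
    return {"value": value}


# Approximate equivalent of:
#   .mark_point().encode(x=gos.X("chromStart", type="genomic"),
#                        color=gos.value("lightblue"))
track_spec = {
    "mark": "point",
    "x": field_channel("chromStart", "genomic"),
    "color": value_channel("lightblue"),
}
print(track_spec)
```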

Exercise#

Modify the visualization below with the following:

  • add a size encoding with the constant value 10

  • change the mark type from point to triangleRight

  • change the y encoding to use the "chromEnd" field instead of "chromStart"

  • change the width and height of the track to be 500

track = gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"),
    y=gos.Y("chromStart", type="genomic", axis="left"),
    # additional encodings ...
).properties(
    # track property overrides ...
)

track.view()

Data-types#

The specifics of an encoding depend on the type of the data. Gosling recognizes three datatypes:

| Data Type | Shorthand Code | Description |
| --- | --- | --- |
| quantitative | Q | a continuous real-valued quantity |
| nominal | N | a discrete unordered category |
| genomic | G | a genomic base-pair position |

Data types can either be expressed in long-form like above, or a short-hand syntax can be used to remove boilerplate when exploring visual encodings.
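
A minimal sketch of how a shorthand string such as "chromStart:G" could be expanded into its long form is shown here. The function parse_shorthand is hypothetical and written only for illustration; gos performs its own parsing internally, which may differ in detail.

```python
# Shorthand codes from the table above.
TYPE_CODES = {"Q": "quantitative", "N": "nominal", "G": "genomic"}


def parse_shorthand(shorthand: str) -> dict:
    """Hypothetical helper: split 'field:CODE' into a field name and data type."""
    field, sep, code = shorthand.rpartition(":")
    if not sep:
        # No type code given, so only the field name is known.
        return {"field": shorthand}
    if code not in TYPE_CODES:
        raise ValueError(f"unknown type code {code!r} in {shorthand!r}")
    return {"field": field, "type": TYPE_CODES[code]}


print(parse_shorthand("chromStart:G"))
print(parse_shorthand("stain:N"))
```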

Genomic y-encoding#

gos.Track(data).mark_point().encode(
    x=gos.X("chromStart", type="genomic"), # can change to the 'shorthand' syntax
    y=gos.Y("chromStart:G", axis="left")
).view()

Quantitative y-encoding#

gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("chromStart:Q", axis="left"), # change y-encoding to quantitative
).view()

Nominal (categorical) y-encoding#

gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N") # change y-encoding to nominal field
).view()

Literal value color-encoding#

gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
    color=gos.value("black"), # no data-type, just a literal value!
).view()

Multiple encodings#

We can use both y position and color to encode the same field.

gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
).view()
gos.Track(data).mark_point().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
    # custom colormapping
    color=gos.Color(
        "stain:N", 
        domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
        range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"],
        legend=True,
    ),
).view()
# change mark and add text-encoding
gos.Track(data).mark_text().encode(
    x=gos.X("chromStart:G"),
    y=gos.Y("stain:N"),
    color=gos.Color(
        "stain:N", 
        domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
        range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"],
        legend=True,
    ),
    text=gos.Text("stain:N") 
).view()
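
The custom color mapping above lists categories in domain and colors in range. Assuming the two lists are paired by position (the usual convention for categorical scales), the lookup can be sketched as a dictionary. The stains list is a small excerpt of the values shown in the cytoband file earlier.

```python
domain = ["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"]
range_ = ["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"]

# Pair each category with the color at the same position.
color_for_stain = dict(zip(domain, range_))

# Stain values of the first four cytobands in the file shown earlier.
stains = ["gneg", "gpos25", "gneg", "gpos25"]
colors = [color_for_stain[stain] for stain in stains]
print(colors)
```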

Example: Simplified ideogram#

We can create a simplified ideogram using the concepts from above by changing the mark for our track and specifying additional encodings.

track = gos.Track(data).mark_rect().encode(
    # defines start and end of rectangle mark
    x=gos.X("chromStart:G", axis="top"),
    xe=gos.Xe("chromEnd:G"),
    # defines how to map Giemsa-stain factor to colors
    color=gos.Color(
        "stain:N", 
        domain=["gneg", "gpos25", "gpos50", "gpos75", "gpos100", "gvar"],
        range=["white", "#D9D9D9", "#979797", "#636363", "black", "#A0A0F2"]
    ),
    # customize the style of the visual marks. 
    size=gos.value(20),
    stroke=gos.value("gray"),
    strokeWidth=gos.value(0.5)
)

track.view()

Additional parameters for the resulting gos.View can be passed in as well for convenience. We can easily set a title and xDomain for our visualization, setting the initial genomic region to human chromosome 1.

track.view(
    title="Gos is awesome!",
    xDomain=gos.GenomicDomain(chromosome="chr1"),
)

Exercise#

Modify the existing track definition above to encode the nominal stain field with y position.

track.encode(
    # y=???
).view()

Track reuse and composition#

This section demonstrates how to build more complex, layered tracks and write functions to reuse visualizations for other data sources.

A common pattern in gos is the reuse of gos.Track instances to create new, modified gos.Track or gos.View objects. This feature allows users to be much more concise with the Python API compared to the Gosling JSON equivalent.

Note in the code example above, we reuse track to create a new view (with title and xDomain) without redefining the visualization from scratch.
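
This kind of reuse works when each method returns a new object instead of modifying the one it was called on. The sketch below illustrates that copy-on-modify pattern with a hypothetical TrackSketch class; it is not part of gos, and it assumes (rather than demonstrates) that gos follows the same convention.

```python
import copy


class TrackSketch:
    """Hypothetical stand-in for a track: every method returns a modified copy."""

    def __init__(self, **spec):
        self.spec = spec

    def _derive(self, **updates) -> "TrackSketch":
        # Copy the current spec so the original object is left untouched.
        new_spec = copy.deepcopy(self.spec)
        new_spec.update(updates)
        return TrackSketch(**new_spec)

    def mark(self, name: str) -> "TrackSketch":
        return self._derive(mark=name)

    def encode(self, **channels) -> "TrackSketch":
        return self._derive(**channels)

    def properties(self, **props) -> "TrackSketch":
        return self._derive(**props)


ideogram = TrackSketch(mark="rect").encode(x="chromStart:G")
tall = ideogram.properties(height=100)  # derived track with a new height
dots = ideogram.mark("point")           # derived track with a different mark

print(ideogram.spec)
print(tall.spec)
print(dots.spec)
```

Because ideogram is never mutated, any number of variants can be derived from it without interfering with one another.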

The data#

In this section we will visualize several scATAC-seq “pseudobulk” tracks from the Corces et al. (Nature Genetics, 2020) multi-omic atlas of the human brain. Each scATAC-seq track is stored in a separate BigWig file and represents the normalized aggregate signal of all cells from a given cell type in the human brain.

urls = [
    f"https://s3.amazonaws.com/gosling-lang.org/data/{file}"
    for file in [
        "ExcitatoryNeurons-insertions_bin100_RIPnorm.bw",
        "InhibitoryNeurons-insertions_bin100_RIPnorm.bw",
        "Microglia-insertions_bin100_RIPnorm.bw",
        "Astrocytes-insertions_bin100_RIPnorm.bw",
    ]
]
urls
['https://s3.amazonaws.com/gosling-lang.org/data/ExcitatoryNeurons-insertions_bin100_RIPnorm.bw',
 'https://s3.amazonaws.com/gosling-lang.org/data/InhibitoryNeurons-insertions_bin100_RIPnorm.bw',
 'https://s3.amazonaws.com/gosling-lang.org/data/Microglia-insertions_bin100_RIPnorm.bw',
 'https://s3.amazonaws.com/gosling-lang.org/data/Astrocytes-insertions_bin100_RIPnorm.bw']

In the previous section we visualized a text-based data format that was already tabular (BED). For non-tabular data formats supported by Gosling (e.g., BAM, BigWig) we need to be explicit about how these files are translated into a tabular representation on which we can build our visualization.

data = gos.bigwig(urls[0], column="position", value="peak")
data
{'type': 'bigwig',
 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/ExcitatoryNeurons-insertions_bin100_RIPnorm.bw',
 'column': 'position',
 'value': 'peak'}
gos.Track(data).mark_bar().encode(
    x=gos.X("position:G"),
    y=gos.Y("peak:Q", axis="right"),
).view()
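
One way to picture the tabular representation is as one row per signal bin, with the column and value arguments supplying the field names. The sketch below is a simplification under that assumption: bins_to_rows is a hypothetical helper, the signal values are made up for illustration, and the fields Gosling actually derives from a BigWig file may differ.

```python
def bins_to_rows(start, bin_size, signal, column="position", value="peak"):
    """Hypothetical helper: turn binned signal into rows keyed by the given field names."""
    return [
        {column: start + i * bin_size, value: v}
        for i, v in enumerate(signal)
    ]


# Made-up signal values for three consecutive 100 bp bins.
rows = bins_to_rows(start=52168000, bin_size=100, signal=[0.0, 1.5, 3.25])
for row in rows:
    print(row)
```

With rows shaped like this, the encodings gos.X("position:G") and gos.Y("peak:Q") refer to ordinary named fields, just as with the BED example.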

Overlay track#

With the quantitative data type for peak, we can experiment with different marks in our visualization. Rather than repeating the definition, we use a base track to derive other modified tracks.

base = gos.Track(data).encode(
    x=gos.X("position:G")
).properties(height=60)

heatmap = base.mark_rect().encode(
    color=gos.Color("peak:Q"),
)

heatmap.view()
line = base.mark_line().encode(
    y=gos.Y("peak:Q"),
    color=gos.value("gray"),
)
line.view()
points = line.mark_point()
points.view()
colored_points = points.encode(
    color=gos.Color("peak:Q"), # overrides color encoding
    size=gos.Size("peak:Q"),
)
colored_points.view()

Since the tracks above share some of the same encodings, we can layer the tracks together to create a composite "overlay" which combines different marks.

gos.overlay(line, colored_points) # returns a View

While this example is fairly contrived, the ability to create overlay tracks is an essential feature in Gosling and allows much more complex visual encodings (e.g. genome annotations).

Track function#

The previous section demonstrated further how track definitions may be reused and extended. In this section we will show how a Python function can be defined to reuse a track definition for other data sources.

We can refactor this snippet from earlier so that we have a function that generates a barplot for any scATAC-seq track above.

from typing import Optional

def barplot(url: str, title: Optional[str] = None, color: Optional[str] = None) -> gos.Track:
    # expose the BigWig signal as a table with `position` and `peak` fields
    data = gos.bigwig(url, column="position", value="peak")
    track = gos.Track(data).mark_bar().encode(
        x=gos.X("position:G"),
        y=gos.Y("peak:Q", axis="right"),
    )
    # only apply the optional overrides when they are provided
    if color:
        track = track.encode(color=gos.value(color))
    if title:
        track = track.properties(title=title)
    return track.properties(height=40)

barplot(urls[0]).view()

We can then reuse this utility to create multiple tracks to combine together.

tracks = []
for url, color in zip(urls, ["#F29B67", "#3DC491", "#565C8B",  "#77C0FA"]):
    title = url.split("/")[-1].split("-")[0]
    track = barplot(url=url, title=title, color=color)
    tracks.append(track)
tracks
[Track({
   color: ColorValue({
     value: '#F29B67'
   }),
   data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/ExcitatoryNeurons-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
   height: 40,
   mark: 'bar',
   title: 'ExcitatoryNeurons',
   width: 800,
   x: X({
     shorthand: 'position:G'
   }),
   y: Y({
     axis: 'right',
     shorthand: 'peak:Q'
   })
 }),
 Track({
   color: ColorValue({
     value: '#3DC491'
   }),
   data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/InhibitoryNeurons-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
   height: 40,
   mark: 'bar',
   title: 'InhibitoryNeurons',
   width: 800,
   x: X({
     shorthand: 'position:G'
   }),
   y: Y({
     axis: 'right',
     shorthand: 'peak:Q'
   })
 }),
 Track({
   color: ColorValue({
     value: '#565C8B'
   }),
   data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/Microglia-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
   height: 40,
   mark: 'bar',
   title: 'Microglia',
   width: 800,
   x: X({
     shorthand: 'position:G'
   }),
   y: Y({
     axis: 'right',
     shorthand: 'peak:Q'
   })
 }),
 Track({
   color: ColorValue({
     value: '#77C0FA'
   }),
   data: {'type': 'bigwig', 'url': 'https://s3.amazonaws.com/gosling-lang.org/data/Astrocytes-insertions_bin100_RIPnorm.bw', 'column': 'position', 'value': 'peak'},
   height: 40,
   mark: 'bar',
   title: 'Astrocytes',
   width: 800,
   x: X({
     shorthand: 'position:G'
   }),
   y: Y({
     axis: 'right',
     shorthand: 'peak:Q'
   })
 })]
tracks[1].view()
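
The title expression used in the loop above keeps the last path segment of each URL and then the part before the first hyphen. It can be checked on its own with plain Python:

```python
urls = [
    f"https://s3.amazonaws.com/gosling-lang.org/data/{file}"
    for file in [
        "ExcitatoryNeurons-insertions_bin100_RIPnorm.bw",
        "InhibitoryNeurons-insertions_bin100_RIPnorm.bw",
        "Microglia-insertions_bin100_RIPnorm.bw",
        "Astrocytes-insertions_bin100_RIPnorm.bw",
    ]
]

# Last path segment is the file name; the text before the first "-" is the cell type.
titles = [url.split("/")[-1].split("-")[0] for url in urls]
print(titles)
```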

Finally, we combine the tracks into a single visualization with gos.stack.

# returns a `View` with a shared genomic domain for all child tracks
gos.stack(*tracks).properties(
    xDomain=gos.GenomicDomain(chromosome="3", interval=[52168000, 52890000]),
)