
Building a content aggregator for fun and profits?

A full app that scrapes, processes and presents content from the web...on the web.


Information overload? It's pretty bad these days: there are a lot of low signal-to-noise sources of information, and narrowing down your "feeds" so that you don't get overwhelmed is hard. A tool that filters information and presents it in a format that is both easy and quick to digest would be very useful. This is why I consider content aggregation an evergreen field for disruption. It is, and will always be (as long as the internet is free and there is free speech), a good business opportunity. It is one of those instances where it's all about the execution (and not at all about the idea).

Managing expectations

Having said that, my app in the end doesn't do any actual filtering. In fact it simply aggregates content from the web. This is because I didn't build users into it, and there is little incentive to do filtering if it can't be tailored per user.

The architecture

The diagram of the architecture: content-aggregator-diagram There are indeed many circles in it!...you know...micro...services? The apps that I built are the "scraper" and the "server", while the "publisher" is just a routine embedded in the server. The "search" and "proxies" are external tools that do their own job. The "frontend" is nothing special, a mix of js and css bundled with webpack.

Since there are different moving parts I will go according to the flow of the content, starting from when the content is seen for the first time.

The Scraper

Scraping is done in...you guessed it, python. However no ad hoc "scraping" modules are used.

What do you scrape?

Deciding what to scrape depends on the category of the content. We call categories "topics".

Scraping happens continuously; it's a daemon. The main loop pseudocode looks like this, and it is configured per site:
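
Something in this spirit (a pseudocode sketch; every helper name here is made up, and the real loop is python):

while true:
  for src in sources(site, topic): # feeds, sitemaps, crawls...
    for url in discover(src):
      if not seen(url):
        # extract, process and stash the article for the publisher
        saveUnpublished(scrape(url))
  # back off when the backlog of unpublished articles is long enough
  sleep(idleInterval(backlogLen(topic)))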


Once we have content to publish, we have to decide what to publish and how often. I didn't come up with any tricks here because, as I mentioned previously, choosing what to display is user dependent. So we just publish from newest to oldest, with the reasoning that something scraped more recently is more relevant: it is a LIFO queue. Publishing, although quite reliant on python tools, happens in nim, because it runs adjacent to the server, which is also in nim.

The publishing logic

Publishing happens continuously and, like scraping, has idle intervals, but inversely to scraping: when we scrape, we slow down when we have a long enough cache of unpublished articles; with publishing, we slow down when our cache starts to shrink too much. In this way, coupled with scraping, there should always be some content to be published some time in the future. The actual publishing logic:
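
In rough strokes, it is the mirror image of the scraper loop (a sketch with hypothetical names, not the actual code):

proc pubTask(topic: string) {.async.} =
  while true:
    let backlog = unpublishedCount(topic)
    if backlog > 0:
      # LIFO: the most recently scraped article is published first
      await publish(topic, popNewest(topic))
    # inverse of scraping: the smaller the backlog, the longer we idle
    await sleepAsync(idleInterval(backlog))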


We set up jobs to both scrape and publish content; all that's left is serving it.

The web server

After having tried different web servers (because of different bugs), I settled on scorper.

Handling a request

The router

We use a nim pattern-matching implementation to match a tuple of regex captures. This is the regex, which is not RESTful at all:

  rxend = "(?=/+|(?=[?].*)|$)"
  rxAmp = fmt"(/+amp{rxend})"
  rxLang = "(/[a-z]{2}(?:-[A-Z]{2})?" & fmt"{rxend})" # split to avoid formatting regex `{}` usage
  rxTopic = fmt"(/+.*?{rxend})"
  rxPage = fmt"(/+(?:[0-9]+|s|g|feed\.xml|sitemap\.xml){rxend})"
  rxArt = fmt"(/+.*?{rxend})"
  rxPath = fmt"{rxAmp}?{rxLang}?{rxTopic}?{rxPage}?{rxArt}?"

rxPath shows all the possible nodes that a path can have. Then our routing looks like:

let capts = uriTuple(reqCtx.url.path)
case capts:
  of (topic: ""): # homepage...
  of (topic: "assets"): # assets
  of (topic: "i"): # images
  of (topic: "robots.txt"): # robots.txt
  of (page: "sitemap.xml"): # sitemap for topics
  of (art: "index.xml"): # sitemap index for topic pages

It's not pretty, but not relying on any particular router allowed me to swap the underlying web server without much fuss when testing. Is it performant? Unclear! I haven't done any benchmarks comparing it to something else. What smells, of course, is the regex, which can have bugs, and the fact that the ordering of the cases matters.

Wait a minute...

There is a bunch of stuff that we do on each request before actual page routing:

Initially we set up cleanup code (with defer:) which should ensure no leaks happen.

  # FIXME: is this cleanup required?
  var futs: seq[Future[void]]
  let resp =
    if ctx.response.issome: ctx.response.get
    else: nil
  if not resp.isnil and not resp.connection.isnil:
    futs.add resp.connection.closeWait()
  if not ctx.isnil:
    if not ctx.connection.isnil:
      futs.add ctx.connection.closeWait()
    futs.add ctx.closeWait()
  await allFutures(futs)

We check if the thread is initialized:
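
Presumably just a guard (a sketch; initThread stands for the initialization routine shown below):

if not threadInitialized:
  initThread()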


This should really only run once (it sets a global bool after initialization to check), and could be done outside the request handler. But what is actually initialized? Well...quite a lot of stuff! Basically we (ab)use global constants which require initialization. Also, some of this is not really thread related, since it initializes memory on the heap that is shared across threads.

if threadInitialized:
  debug "thread: already initialized"
  return
debug "thread: base"
debug "thread: sonic"
initSonic() # Must be on top
debug "thread: http"
debug "thread: html"
debug "thread: ldj"
debug "thread: feed"
debug "thread: img"
debug "thread: lsh"
debug "thread: mimes"
# ... the corresponding init calls are elided
# ... and other stuff
threadInitialized = true

Then we parse the parameters:

var
  relpath = ctx.rawPath
  page: string
  rqlocked: bool
debug "handling: {relpath:.120}"


What do we use parameters for? The ParamKey enum type describes it:

  ParamKey = enum
    q, p, # sonic
    c, # cache
    d, # delete
    t,  # translations
    u # imgUrls
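
For instance, handling the cache parameter might look like this (illustrative, not the actual handler):

# `c` asks to drop the cached page for this request
if $ParamKey.c in params:
  reqCtxCache.del(reqCacheKey)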

We do microcaching for requests, so every request is cached according to the tuple (path, query, acceptEncoding); the encoding is required because we can serve both compressed and non-compressed bodies. A request context looks like this:

let reqCtx = reqCtxCache.lcheckOrPut(reqCacheKey):
  let reqCtx {.gensym.} = new(ReqContext)
  reqCtx.lock = newAsyncLock()
  reqCtx.url = move url
  reqCtx.params = params
  reqCtx.file = reqCtx.url.path.fp
  reqCtx.key = hash(reqCtx.file)
  reqCtx.rq = initTable[ReqId, HttpRequestRef]()
  reqCtx

if handleTranslation(): # serve a (partially) translated page if requested
  return

Why is this also done before routing? By default we serve partially translated pages. We are poor :( and translations rely on free services, but we can't afford abysmal load times, so we run the translation deferred, while we serve the page translated with only the snippets that have already been cached in our translations database.
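
The gist of it (a sketch; everything besides the handleTranslation name is illustrative):

proc handleTranslation(): bool =
  let lang = capts.lang # the language capture from the url
  if lang != "":
    # serve now with the snippets already cached in the translations DB,
    # and queue the missing ones to be translated in the background
    asyncSpawn translateDeferred(reqCtx)
    return true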

At this point there is the routing, wrapped in an exception handler such that if serving the correct page fails, we issue a 503. Issuing a 503 implies that we tried to route a valid url but couldn't generate a page. For invalid urls we issue a 301 redirect, which implies that the url is not valid. We serve 11 different kinds of urls:

About the rendering

Page rendering is done on the nim side using karax.

General page layout

The website is composed of a top fixed bar which shows:

Since it is a responsive design, when the viewport is smaller, the top bar only holds the search box while the rest appears in a togglable sidebar.


This is an example function; it shows what we do to update the feed when a new post is published:

proc update*(tfeed: Feed, topic: string, newArts: seq[Article], dowrite = false) =
    ## Load existing feed for the given topic and update the feed (in-memory)
    ## with the new articles provided; only writes to storage when `dowrite` is true.
    checkNil tfeed
    let
        chann = tfeed.findel("channel")
        itms = chann.drainChannel
        arl = itms.len
        narl = newArts.len

    debug "rss: newArts: {narl}, previous: {arl}"
    let
        fill = RSS_N_ITEMS - arl
        rem = max(0, narl - fill)
        shrinked = if (rem > 0 and arl > 0):
                       itms[0..<(max(0, arl-rem))]
                   else: itms
    debug "rss: articles tail len {len(shrinked)}, newarts: {len(newArts)}"
    assert shrinked.len + narl <= RSS_N_ITEMS, fmt"shrinked: {shrinked.len}, newarticles: {narl}"
    for a in newArts:
        chann.add articleItem(a)
    for itm in shrinked:
        chann.add itm
    if dowrite:
        pageCache[][topic.feedKey] = tfeed.toXmlString


This is the core of adding urls to sitemaps:

template addUrlToFeed(getLoc, getLocLang) =
  if unlikely(nEntries > maxEntries):
      warn "Number of URLs for sitemap of topic: {topic} exceeds limit! {nEntries}/{maxEntries}"
  let
      url = newElement("url")
      loc = newElement("loc")
  loc.add getLoc().escape.newText
  url.add loc
  addLangs(url, getLocLang)
  result.add url

proc buildTopicPagesSitemap*(topic: string): Future[XmlNode] {.async.} =
    await syncTopics()
    var nEntries = 0
    let done = await topicDonePages(topic)
    template langUrl(lang): untyped {.dirty.} = $(WEBSITE_URL / lang / topic / pages[n])
    # add the most recent articles first (pages with higher idx)
    let pages = pybi[].list(done.keys()).to(seq[string])
    for n in countDown(pages.len - 1, 0):
      if not (await isEmptyPage(topic, pages[n].parseInt, false)):
        discard sitemapUrl(topic, pages[n]).sitemapEl

template addArticleToFeed() =
  template baseUrl(): untyped =
    getArticleUrl(a, topic)

  template langUrl(lang): untyped =
    getArticleUrl(a, topic, lang)

  if not a.isValidArticlePy:
    continue # skip articles that fail validation

  addUrlToFeed(baseUrl, langUrl)

proc buildTopicSitemap(topic: string): Future[XmlNode] {.async.} =
    await syncTopics()
    let done = await topicDonePages(topic)
    var nEntries = 0
    # add the most recent articles first (pages with higher idx)
    for pagenum in countDown(len(done) - 1, 0):
        if unlikely(nEntries > maxEntries):
            warn "Number of URLs for sitemap of topic: {topic} exceeds limit! {nEntries}/{maxEntries}"
            break
        checkTrue pagenum in done, "Mismatching number of pages"
        for a in done[pagenum]:
            addArticleToFeed()


We don't use a template engine, since most of the rendering is done with karax; but for pages like the ToS we use file templates, where we just replace a bunch of variables, like an envsubst command would.

proc pageFromTemplate*(tpl, lang, amp: string): Future[string] {.async.} =
  var txt = await readfileAsync(ASSETS_PATH / "templates" / tpl & ".html")
  let (vars, title, desc) =
    case tpl:
      of "dmca": (tplRep, "DMCA", fmt"dmca compliance for {WEBSITE_DOMAIN}")
      of "tos": (ppRep, "Terms of Service",
          fmt"Terms of Service for {WEBSITE_DOMAIN}")
      of "privacy-policy": (ppRep, "Privacy Policy",
          fmt"Privacy Policy for {WEBSITE_DOMAIN}")
      else: (tplRep, tpl, "")
  txt = multiReplace(txt, vars)
  let
    slug = slugify(title)
    page = await buildPage(title = title, content = txt, wrap = true)
  let processed = await processPage(lang, amp, page, relpath = tpl)
  checkNil(processed, fmt"failed to process template {tpl}, {lang}, {amp}"):
    return processed.asHtml(minify_css = (amp == ""))

Article pages

When we render pages like the home/topic and numbered pages, we have to show a list of articles; this function is called in a loop for as many articles as we want to show:

import htmlparser
proc articleEntry(ar: Article, topic = ""): Future[VNode] {.async.} =
  if ar.topic == "" and topic != "":
    ar.topic = topic
  let relpath = getArticlePath(ar)
  try:
    return buildHtml(article(class = "entry")):
      h2(class = "entry-title", id = ar.slug):
        a(href = relpath):
          text ar.title
      tdiv(class = "entry-info"):
        span(class = "entry-author"):
          text ar.getAuthor & ", "
        time(class = "entry-date", datetime = ($ar.pubDate)):
          text format(ar.pubDate, "dd/MMM")
      tdiv(class = "entry-tags"):
        if ar.tags.len == 0:
          span(class = "entry-tag-name"):
            a(href = (await nextAdsLink()), target = "_blank"):
              text "none"
        else:
          for t in ar.tags:
            if likely(t.isSomething):
              span(class = "entry-tag-name"):
                a(href = (await nextAdsLink()), target = "_blank"):
                  text t
      tdiv(class = "entry-content"):
        a(class = "entry-more", href = relpath):
          text "[continue]"
  except Exception as e:
    warn "articles: entry creation failed."
    raise e

proc buildShortPosts*(arts: seq[Article], topic = ""): Future[
    string] {.async.} =
  for a in arts:
    result.add $(await articleEntry(a, topic))

Note how in some lines "ads" creep in X)

Topics list

In the top bar we show the topics list; this is what prints it:

proc topicsList*(ucls: string; icls: string; small: static[
    bool] = true): Future[VNode] {.async.} =
  result = newVNode(VNodeKind.ul)
  result.setAttr("class", ucls)
  let topics = await loadTopics(-MENU_TOPICS) # NOTE: the sign is negative, we load the most recent N topics
  result.add buildHtml(tdiv(class = "topics-shadow"))
  var topic_slug, topic_name: string
  var isEmpty: bool
  for i in 0..<topics.len:
    (topic_slug, topic_name) = ($topics[i][0], $topics[i][1])
    isEmpty = isEmptyTopic(topic_slug)
    if isEmpty:
      continue
    let liNode = buildHtml(li(class = fmt"{icls}")):
      # tdiv(class = "mdc-icon-button__ripple") # not used without material icons
      a(href = ($(WEBSITE_URL / topic_slug)), title = topic_name,
          class = "mdc-ripple-button"):
        tdiv(class = "mdc-ripple-surface  mdc-ripple-upgraded")
        when small:
          # only use the first letter
          text $topic_name.runeAt(0).toUpper # loadTopics iterator returns pyobjects
        else:
          text topic_name
      when small:
        span(class = "separator")
    result.add liNode

There are some smelly hard-coded material design classes in here. Frankly, google material design components suck.

Post footer

The post footer appears at the bottom right of an article page (in ltr) and really only prints the published date.

proc postFooter(pubdate: Time): VNode =
  let dt = inZone(pubdate, utc())
  buildHtml(tdiv(class = "post-footer")):
    time(datetime = ($dt)):
      text "Published date: "
      text format(dt, "dd MMM yyyy")


When building article entries we might need excerpts if no summary is available.

proc articleExcerpt(a: Article): string =
  let alen = len(a.content) - 1
  let maxlen = min(alen, ARTICLE_EXCERPT_SIZE)
  if maxlen == alen:
    return a.content
  else:
    let runesize = runeLenAt(a.content, maxlen)
    # If article contains html tags, the excerpt might have broken html
    return parseHtml(a.content[0..maxlen+runesize]).innerText & "..."

Wtf is parseHtml doing here? It so happens that we allow html inside article contents (but only some tags); this is an option from the python module trafilatura that we keep enabled because it can affect article formatting. We also have to be careful about chunking utf-8 strings...


The last task after having built the karax VNode tree is to dump the bytes. The tree is prefixed with the html header, and optionally minified.

proc asHtml*(data: string; minify: static[bool] = true; minify_css: bool = true): string =
  let html = "<!DOCTYPE html>" & "\n" & data
  sdebug "html: raw size {len(html)}"
  result = when minify:
             html.minifyHtml(minify_css = false,
                             minify_js = false,
                             keep_closing_tags = true,
                             do_not_minify_doctype = true,
                             keep_spaces_between_attributes = true,
                             ensure_spec_compliant_unquoted_attribute_values = true)
           else:
             html
  sdebug "html: minified size {len(result)}"

The minification is handled by minify-html, which we have bound using c2nim; the binding file contains:

proc minify*(code: cstring,
             do_not_minify_doctype = false,
             ensure_spec_compliant_unquoted_attribute_values = false,
             keep_closing_tags = true,
             keep_comments = false,
             keep_html_and_head_opening_tags = true,
             keep_spaces_between_attributes = false,
             minify_css = true,
             minify_js = true,
             remove_bangs = false,
             remove_processing_instructions = true): cstring {.importc: "minify".}

proc minifyHtml*(tree: VNode): string = $minify(($tree).cstring)
proc minifyHtml*(data: string): string = $minify(data.cstring)
template minifyHtml*(data: string, args: varargs[untyped]): string =
    $minify(data.cstring, args)

But for building we have to provide the static library, adding this line to our nim.cfg:
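
Something along these lines, where the path and library name are placeholders for wherever the rust staticlib ends up:

passL:"/path/to/minify-html/target/release/libminify_html.a"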


I mean...that's the path where I built the minify library, which btw doesn't actually expose an extern C function that nim can consume, so we had to write it ourselves.

use minify_html::{Cfg, minify as minify_html_native};
use std::ffi::CStr;
use std::ffi::CString;
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn minify(
    code: *const c_char,
    do_not_minify_doctype: bool,
    ensure_spec_compliant_unquoted_attribute_values: bool,
    keep_closing_tags: bool,
    keep_comments: bool,
    keep_html_and_head_opening_tags: bool,
    keep_spaces_between_attributes: bool,
    minify_css: bool,
    minify_js: bool,
    remove_bangs: bool,
    remove_processing_instructions: bool,
) -> *const c_char {
    let code = unsafe { CStr::from_ptr(code) };
    let code_vec = code.to_bytes();

    // the Cfg field names mirror the exported parameters
    let cfg = Cfg {
        do_not_minify_doctype,
        ensure_spec_compliant_unquoted_attribute_values,
        keep_closing_tags,
        keep_comments,
        keep_html_and_head_opening_tags,
        keep_spaces_between_attributes,
        minify_css,
        minify_js,
        remove_bangs,
        remove_processing_instructions,
    };

    let minified = minify_html_native(code_vec, &cfg);

    unsafe { CString::from_vec_unchecked(minified).into_raw() }
}

Nimpy and the quest for crash free garbage deletion

Python bindings for nim have, of course, to free/discard python objects. The problem is that we have to control when nim does GC. The nimpy library assumes the GIL is always locked (it locks it at the beginning), so it is free to call into python whenever. But we unlock the gil in order to allow a python threadpool to run code while nim runs other stuff. If the python GIL were always locked by nim, the threadpool would be idle most of the time.

when defined(pyAsync):
  type
    PyGilObj = object
      lock: ThreadLock
      currentLockHolder: int
      state: PyGILState_STATE
    PyGil = ptr PyGilObj

  var pyGil*: PyGil
  var pyGilLock*: ThreadLock
  var pyMainThread: PyThreadState
  proc initPyGil*() =
    assert PyGILState_Check()
    pyGil = create(PyGilObj)
    pyGil.currentLockHolder = getThreadID()
    pyGil.lock = newThreadLock()
    pyGilLock = pyGil.lock
    pyMainThread = PyEval_SaveThread()

  proc acquire*(gil: PyGil): Future[void] {.async.} =
    await gil.lock.acquire
    let id = getThreadId()
    gil.currentLockHolder = id
    gil.state = PyGILState_Ensure()

  proc tryAcquire*(gil: PyGil): bool =
    if gil.lock.tryAcquire():
      let id = getThreadId()
      gil.currentLockHolder = id
      gil.state = PyGILState_Ensure()
      return true

  proc release*(gil: PyGil) {.inline.} =
    doassert gil.currentLockHolder == getThreadId(), "Can't release gil lock from a different thread."
    doassert gilLocked()
    # hand the GIL back, then release the thread lock
    PyGILState_Release(gil.state)
    gil.lock.release()

This shows how to execute python code while holding the GIL, but only on the current thread. The implementation for acquiring/releasing the GIL on different nim threads requires calling different python C ABI functions, because the GIL is a mutex. We then call python using this template:

template withPyLock*(code): untyped =
  {.locks: [pyGil].}:
    try:
      # echo getThreadId(), " -- ", getCurrentProcessId(), " -- ", procName()
      await pygil.acquire()
      code
    except:
      raise getCurrentException()
    finally:
      pygil.release()
      # echo getThreadId(), " -- ", getCurrentProcessId(),  " -- unlocked"

We make use of nim's locks and guards feature to ensure python types are only accessed when the GIL is held. However, this requires defining pyobjects with the guard:

macro pyObjPtr*(defs: varargs[untyped]): untyped =
  result = newNimNode(nnkStmtList)
  for d in defs:
    let
      name = d[0]
      def = d[1]
    result.add quote do:
      let `name` {.guard: pyGil.} = create(PyObject)
      `name`[] = `def`

So I can do:

pyObjPtr((myVar, pyimport("datetime").datetime))

And whenever I call myVar which holds the datetime object, I have to wrap it like this:
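
Roughly like this (the original snippet is elided; this is a sketch):

withPyLock:
  echo myVar[].now()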


Now we can lock the gil when we have to run the GC, overriding the nimpy PyObject destructor with this:

var garbage: seq[PPyObject]

proc `=destroy`*(p: var PyObject) =
  if pygil.tryAcquire:
    if not p.rawPyObj.isnil:
      decRef p.rawPyObj
      p.rawPyObj = nil
    # also collect previously delayed pointers
    while garbage.len > 0:
      var pp = garbage.pop() # TODO: Does this leak a pointer?
      if not pp.isnil:
        decRef pp
      pp = nil
    pygil.release()
  elif not p.rawPyObj.isnil:
    # couldn't lock the gil: delay the collection
    garbage.add p.rawPyObj
    p.rawPyObj = nil

The lock we use inside the destructor is not an AsyncLock, as that would be too expensive, and we don't always lock, as that would cause stalls! If we can't lock the gil, we delay the collection and keep the raw python pointer around for when we will be able to clear it. Honestly I don't know if this causes other forms of issues, but it seems to work well enough.

We have a nim module called pyutils.nim that does a bunch of nim<>python stuff, for example:

from utils import withLocks
proc pyhasAttr*(o: PyObject; a: string): bool {.withLocks: [pyGil].} = pybi[].hasattr(
    o, a).to(bool)

proc pyclass(py: PyObject): PyObject {.inline, withLocks: [pyGil].} =
  discard # body elided

proc pytype*(py: PyObject): string =
  discard # body elided

proc pyisbool*(py: PyObject): bool {.withLocks: [pyGil].} =
  return pybi[].isinstance(py, PyBoolClass[]).to(bool)

proc pyisnone*(py: PyObject): bool {.gcsafe, withLocks: [pyGil].} =
  return py.isnil or pybi[].isinstance(py, PyNoneClass[]).to(bool)

This one is used quite a lot:

proc pyget*[T](py: PyObject; k: string; def: T = ""): T =
  # assumed structure: fall back to the object itself when `get` is missing
  try:
    let v = py.callMethod("get", k)
    if pyisnone(v):
      return def
    else:
      return v.to(T)
  except:
    if pyisnone(py):
      return def
    else:
      return py.to(T)

This one is used when we have scheduled a python job, and we want to wait for it to finish asynchronously:

proc pywait*(j: PyObject): Future[PyObject] {.async, gcsafe.} =
  var rdy: bool
  var res: PyObject
  while true:
    withPyLock:
      rdy = j.callMethod("ready").to(bool)
    if rdy:
      withPyLock:
        res = j.callMethod("get")
      break
    await sleepAsync(250.milliseconds)
  if (not res.isnil) and (not pyisnone(res)) and (not pyErrOccurred()):
    return res
  else:
    raise newException(ValueError, "Python job failed.")

A proper python async binding would require completing a nim async future from python at the end of the python scheduled job, which we don't do because we haven't looked deep enough into handling nim objects from python.


We support google amp, so we generate somewhat amp-compliant pages. We don't aim for 1:1 support; in fact we nuke all the javascript we have and only serve html/css. Even then we have to be careful not to add custom attributes to html tags, or custom html tags altogether; amp is bad like that... For automatic amp page conversion we handle the head and the body tags differently.

proc processHead(inHead: VNode, outHead: VNode, level = 0) {.async.} =
  var canonicalUnset = level == 0
  debug "iterating over {inHead.kind}"
  for el in inHead.preorder(withStyles = true):
    case el.kind:
      of VNodeKind.text, skipNodes:
        continue
      of VNodeKind.style:
        if el.len > 0:
          discard # styles are merged into a single inline style (elided)
      of VNodeKind.link:
        if canonicalUnset and el.isLink(canonical):
          outHead.add el
          canonicalUnset = false
        elif el.isLink(stylesheet) and (not ("flags-sprite" in el.getattr("href"))):
          await el.fetchStyle()
        elif el.isLink(preload) and el.getattr("as") == "style":
          await el.fetchStyle()
          outHead.add el
      of VNodeKind.script:
        if el.getAttr("type") == $ldjson:
          outHead.add el
      of VNodeKind.meta:
        if (el.getAttr("name") == "viewport") or (el.getAttr("charset") != ""):
          outHead.add el
      of VNodeKind.verbatim:
        let data = el.toXmlNode
        if data.kind == xnElement:
          if data.tag == "noscript":
            discard # handling elided
          elif data.tag == "script":
            discard # handling elided
          elif data.tag == "style":
            if data.len > 0:
              discard # handling elided
            outHead.add el
      of VNodeKind.noscript:
        debug "amphead: adding element {el.kind} to outHead."
        outHead.add el
      else:
        discard

All styles are merged into a single inline style tag; what's kept are the link tags which are not style/script related (like lang), script tags for ldjson, and meta tags. Verbatim handles nodes which are literal strings: we have to convert them into an XmlNode (which means parsing) and handle them correctly. Processing the body is similar: we keep some tags, remove others, rename others:

template process(el: VNode, after: untyped): bool =
  var isprocessed = true
  case el.kind:
    of skipNodes: discard
    of VNodeKind.link:
      if el.isLink(stylesheet):
        await el.fetchStyle()
        outBody.add el
    of VNodeKind.style:
      el.text = ""
    of VNodeKind.script:
      if el.getAttr("type") == $ldjson:
        outHead.add el
      el.text = ""
    of VNodeKind.form:
      el.setAttr("amp-form", "")
      isprocessed = false
    else:
      isprocessed = false
  if isprocessed: # run the follow-up code only for handled nodes
    after
  isprocessed

The form tag is replaced with amp-form, amp has many of these tags...

We have to ensure that the inline styles are within the correct length:

styleStr = styleStr
  # .join("\n")
  # NOTE: the replacement should be ordered from most frequent to rarest
  # remove troublesome animations
  .replace(pre"""\s*?@(\-[a-zA-Z]+-)?keyframes\s+?.+?{\s*?.+?({.+?})+?\s*?}""", "")
  # remove !important hints
  .replace(pre"""!important""", "")
  # remove charset since not allowed
  .replace(pre"""@charset\s+\"utf-8\"\s*;?""", "")

if unlikely(styleStr.len > CSS_MAX_SIZE):
  raise newException(ValueError, fmt"Style size above limit for amp pages. {styleStr.len}")

Our amp generation doesn't cover the full amp spec, but it works for our content (through trial and error :S).

Whenever an article is published, it is ingested into the sonic database; sonic handles "collections", "buckets" and "objects". We define a collection as a website, so every website that wants to deploy the content aggregator has its own collection. We don't use buckets: although we could consider each topic a bucket, that would narrow the search too much, so every site has just one bucket, "default", and each object of the bucket is an article (which can be of different topics).

proc push*(capts: UriCaptures, content: string) {.async.} =
  ## Push the contents of an article page to the search database
  ## NOTE: NOT thread safe
  var ofs = 0
  while ofs <= content.len:
    let view = content[ofs..^1]
    let key = join([capts.topic, capts.page, capts.art], "/")
    let cnt = runeSubStr(view, 0, min(view.len, bufsize - key.len))
    ofs += cnt.len
    if cnt.len == 0:
      break
    try:
      let lang = await capts.lang.toISO3
      var pushed: bool
      var j: PyObject
      withPyLock:
        j = pySched[].apply(
          # ... the actual push callable and some of its arguments are elided
          "default", # TODO: Should we restrict search to `capts.topic`?
          lang = if capts.lang != "en": lang else: ""
        )
      j = await j.pywait()
      withPyLock:
        pushed = not pyisnone(j) and j.to(bool)
      when not defined(release):
        if not pushed:
          debug "sonic: content was not pushed"
    except Exception:
      debug "sonic: couldn't push content, \n {capts} \n {key} \n {cnt}"
      when not defined(release):
        # dump the failing chunk for inspection
        var f: File
        try:
          await pushLock[].acquire
          f = open("/tmp/sonic_debug.log", fmWrite)
          write(f, cnt)
        finally:
          if not f.isnil:
            f.close()
          pushLock[].release

When pushing content into sonic we have to split the data into chunks, whose max length is known upon connection. Ingesting data seems to be buggy sometimes, as sonic appears unable to handle some specific characters. In case the sonic server breaks somehow, we also have a function to reingest all the content:

proc pushAllSonic*() {.async.} =
  await syncTopics()
  var total, c, pagenum: int
  let pushLog = await readPushLog()
  if pushLog.len == 0:
    # no log means we start from scratch
    discard pySonic[].flush(WEBSITE_DOMAIN)
    discard pySonic[].consolidate()
  for (topic, state) in topicsCache:
    if topic notin pushLog:
      pushLog[topic] = %0
    await pygil.acquire
    defer: pygil.release
    let done = state.group[]["done"]
    for page in done:
      pagenum = ($page).parseint
      c = len(done[page])
      if pushLog[topic].to(int) >= pagenum:
        continue # already ingested
      var futs: seq[Future[void]]
      for n in 0..<c:
        let ar = done[page][n]
        if ar.isValidArticlePy:
          var relpath = getArticlePath(ar, topic)
          let
            capts = uriTuple(relpath)
            content = ar.pyget("content").sanitize
          echo "pushing ", relpath
          futs.add push(capts, content)
          total.inc
      pygil.release # don't hold the gil while waiting for the pushes
      await allFutures(futs)
      pushLog[topic] = %pagenum
      await writePushLog(pushLog)
      await pygil.acquire
  info "Indexed search for {WEBSITE_DOMAIN} with {total} objects."


Translation is quite a messy story. I am at my 4th (!) implementation of a translation wrapper: after having written one in php, go and julia, this one is written in nim. The php/go variants are a little bit rotten nowadays, while the julia variant is actively used for this blog. However, to achieve low delay for the web server, the way translation is implemented in julia is not a good fit for real time serving (it translates static files), and anyway adding julia as a dependency, when we already have python, would be too big of a requirement.

So I had to implement a new translation module in nim. In truth, the initial nim translation module looked a lot like the julia implementation, where we were translating static files [1]. Afterwards, when the web server started shaping up, I switched it to translate the karax nodes on demand. This allows translating each web page just in time for the request.

template translateVbtm(node: VNode, q: QueueDom) =
  assert node.kind == VNodeKind.verbatim
  let tree = ($node).parseHtml() # FIXME: this should be a conversion, but the conversion doesn't preserve whitespace??
  if tree.kind == xnElement and tree.tag == "document":
    tree.tag = "div"
  takeOverFields(tree.toVNode, node)
  translateIter(node, vbtm = false)

template translateIter(otree; vbtm: static[bool] = true) =
  for el in otree.preorder():
    case el.kind:
      of vdom.VNodeKind.text:
        if el.text.isEmptyOrWhitespace:
          continue
        if isTranslatable(el):
          translate(q.addr, el, srv)
      else: # non-text nodes
        let t = el.kind
        if t in tformsTags:
          getTForms(dom)[t](el, file_path, url_path, pair)
        if t == VNodeKind.a:
          if el.hasAttr("href"):
            rewriteUrl(el, rewrite_path, hostname)
        if t == VNodeKind.verbatim:
          when vbtm:
            debug "dom: translating verbatim", false
            translateVbtm(el, q)
        if (el.hasAttr("alt") and el.isTranslatable("alt")) or
           (el.hasAttr("title") and el.isTranslatable("title")):
          translate(q.addr, el, srv)

Above is the main iteration loop, translateIter.

Translation is applied to all text nodes and to the alt and title attributes.

proc translate*[T](q: ptr[QueueXml | QueueDom], el: T, srv: service) =
  if q.isnil:
    warn "translate: queue can't be nil"
    return
  let (success, length) = setFromDB(q[].pair, el)
  if not success:
    if length > q[].bufsize:
      debug "Translating element singularly since it is big"
      elUpdate(q[], el, srv)
    else:
      if reachedBufSize(length, q[]):
        discard # flush the queue (elided)
      q[].sz += length

proc translate*[T](q: ptr[QueueXml | QueueDom], el: T, srv: service,
    finish: bool): Future[bool] {.async.} =
  if finish:
    if q.isnil:
      return true
    let (success, _) = setFromDB(q[].pair, el)
    if not success:
      addJob(@[el], q[], el.getText)
      debug "translate: waiting for pair: {q[].pair}"
      await doTrans()
  return true

proc translate*(q: ptr[QueueXml | QueueDom], srv: service,
    finish: bool): Future[bool] {.async.} =
  if finish and q[].sz > 0:
    await doTrans()
    saveToDB(force = true)
  return true

Because we have to translate each text node separately (otherwise we can't render the html back), every node translation is a separate job. Since jobs can query translation services over the net, they have to be done asynchronously. We do splitting and merging of translation queries to spare api calls, but the internals of the translation engine are not important here. The only thing to note is that initially I was using a python wrapper (which I still use for translating scraped content), because self-managing wrappers for external apis is a pain, but I then switched to self-wrapped google and yandex translation services in nim, because python became a considerable bottleneck when handling many concurrent translations.

[1] In fact, originally the content aggregator was supposed to just generate static files for caddy to serve, but because of the amount of pages to generate (a matrix of n_lang(20) x amp(2) x page), lazy rendering was the better option.


Topic and article pages are tracked for hit counts.

proc updateHits*(capts: UriCaptures) =
  let ak = join([capts.topic, capts.art])
  let tk = capts.topic
  var
    artCount: int32 = statsDB[ak]
    topicCount: int32 = statsDB[tk]
  artCount += 1
  topicCount += 1
  statsDB[ak] = artCount
  statsDB[tk] = topicCount

We use hit counts to periodically clean up pages with a low count.

proc deleteLowTrafficArts*(topic: string): Future[void] {.gcsafe, async.} =
  let now = getTime()
  var
    pagenum: int
    pagesToReset: seq[int]
    pubTime: Time
    pubTimeTs: int
  var capts = mUriCaptures()
  capts.topic = topic
  for (art, _) in (await publishedArticles[string](topic, "")):
    if pyisnone(art):
      continue
    capts.art = pyget[string](art, "slug")
    pagenum = pyget(art, "page", 0)
    capts.page = pagenum.intToStr
    try:
      pubTimeTs = pyget(art, "pubTime", 0)
      pubTime = fromUnix(pubTimeTs)
    except:
      pubTime = default(Time)
    if pubTime == default(Time):
      if not (pagenum in pagesToReset):
        debug "tasks: resetting pubTime for page {pagenum}"
        pagesToReset.add pagenum
    # article is old enough
    elif inSeconds(now - pubTime) > cfg.CLEANUP_AGE:
      let hits = topic.getHits(capts.art)
      # article has low hit count
      if hits < cfg.CLEANUP_HITS:
        await deleteArt(capts)
  for n in pagesToReset:
    discard site[].update_pubtime(topic, n)


We use libmdbx through this lib. It probably is overkill, and using leveldb would have sufficed. We have a type LRUTrans, where the initial idea was to set up the database as an LRU cache, but it was considerably slower. The implementation can be found here.

type
    CollectionNotNil = ptr Collection not nil
    LRUTransObj = object
        db: nimdbx.Database.Database not nil
        coll: CollectionNotNil
        zstd_c: ptr ZSTD_CCtx
        zstd_d: ptr ZSTD_DCtx
    LRUTrans* = ptr LRUTransObj

proc getImpl(t: LRUTrans, k: int64, throw: static bool): string =
    var o: seq[byte]
    t.coll.inSnapshot do (cs: CollectionSnapshot):
        # debug "nimdbx: looking for key {k}, {v}"
        o.add cs[k.asData].asByteSeq
    if len(o) > 0:
        result = cast[string](decompress(t.zstd_d, o))
        # debug "nimdbx: got key {k}, with {o.len} bytes"
    elif throw:
        raise newException(KeyError, "nimdbx: key not found")

proc getImpl[T: not int64](t: LRUTrans, k: T, throw: static bool): string =
    getImpl(t, hash(k).int64, throw)

proc `[]`*[T](t: LRUTrans, k: T): auto = t.getImpl(k, false)
proc `get`*[K](t: LRUTrans, k: K): auto = t.getImpl(k, true)

proc `[]=`*(t: LRUTrans, k: int64, v: string) {.gcsafe.} =
    var o: seq[byte]
    if likely(v.len != 0):
        o = compress(t.zstd_c, v, cfg.ZSTD_COMPRESSION_LEVEL)
    logall "nimdbx: saving key {k}"
    t.coll.inTransaction do (ct: CollectionTransaction):
        ct[k] = o
        ct.commit # assumption: the transaction needs an explicit commit
    logall "nimdbx: committed key {k}"

proc `[]=`*[K: not int64](t: LRUTrans, k: K, v: string) = t[hash(k).int64] = v

This type is used for four separate databases:

The database type is implemented with getters and setters that do automatic de/compression on read/write. For this reason it shouldn't be used for images...but alas... There are also a bunch of small micro caches:

These are implemented as lru caches[2], more precisely as "locked" lru caches, where every get and set operation is wrapped in a (thread)lock. These locks can't cause stalls with the async runtime, because the lock is acquired and released without any yield statement in between, so they are atomic in that sense; they are still useful, however, since we use threads for different tasks.
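
A locked cache wrapper can be as small as this (a sketch assuming some underlying LruCache type; not the actual implementation):

import std/locks

type LockLruCache[K, V] = object
  lock: Lock
  cache: LruCache[K, V] # any plain lru cache implementation

template lcheckOrPut(c: var LockLruCache, k, body: untyped): untyped =
  ## return the cached value for `k`, or evaluate `body`, cache and return it;
  ## the whole operation runs while holding the lock, with no yields in between
  withLock(c.lock):
    if k in c.cache:
      c.cache[k]
    else:
      let v = body
      c.cache[k] = v
      v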

[2] However, nim stew has a simpler implementation of an lru cache, which I would have used had I found it sooner.

Background Jobs

A couple of tasks are CPU hungry, so we use a different thread for them:

Two more threads are used to update the assets file list and the ads; although these are not cpu hungry, a thread is required to avoid stalls caused by the file watcher.

We also have async long running tasks for:

Lsh, images, translation and http request jobs are handled using a producer/consumer setup. Except that we don't use channels, because channels block and we don't have an async implementation of them that is also threadsafe. We used an async implementation of this[1], and an async table, which works like an event bus:

type
  AsyncTableObj[K, V] = object
    lock: ThreadLock
    waiters: Table[K, seq[ptr Future[V]]]
    table: Table[K, V]
  AsyncTable*[K, V] = ptr AsyncTableObj[K, V]

proc pop*[K, V](t: AsyncTable[K, V], k: K): Future[V] {.async.} =
  var popped = false
  withLock(t.lock): # assumption: table access is guarded by the thread lock
    if k in t.table:
      popped = t.table.pop(k, result)
  if not popped:
    if k notin t.waiters:
      t.waiters[k] = newSeq[ptr Future[V]]()
    var fut = newFuture[V]("AsyncTable.pop")
    t.waiters[k].add fut.addr
    result = await fut

proc put*[K, V](t: AsyncTable[K, V], k: K, v: V) {.async.} =
  withLock(t.lock): # assumption: table access is guarded by the thread lock
    if k in t.waiters:
      var ws: seq[ptr Future[V]]
      doassert t.waiters.pop(k, ws)
      while ws.len > 0:
        let w = ws.pop()
        if not w.isnil and not w[].isnil and not w[].finished:
          w[].complete(v) # wake up the waiting poppers
    else:
      t.table[k] = v

The nim server also handles three async tasks:

  TaskKind = enum pub, cleanup, mem

proc scheduleTasks(): TaskTable =
  template addTask(t) =
    let fut = (selectTask t)()
    result[t] = fut
  # Publishes new articles for one topic every x seconds
  addTask pub
  # cleanup task for deleting low traffic articles
  addTask cleanup
  # quit when max memory usage reached
  addTask mem

The task that monitors mem usage is nice to have, to avoid OOM issues between the containerized process and docker: docker (or the kernel) doesn't kill the process immediately, and in that period of time the server can become unresponsive, so it is better to restart eagerly on our own.
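
The watchdog itself is tiny; a sketch (MAX_SERVER_MEM is an assumed config name):

proc memTask() {.async.} =
  while true:
    # total memory the process currently holds, according to the GC
    if getOccupiedMem() > cfg.MAX_SERVER_MEM:
      warn "mem: limit reached, quitting to get restarted."
      quit(1)
    await sleepAsync(10.seconds)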

[1] although wrapping a plain channel in async routines is probably better...alas


We leverage imageflow to resize and cache images locally. The bindings are simple, but the process is a little bit involved. With getImg we fetch the image data from a remote url:

proc getImg*(src: string, kind: Source): Future[string] {.async.} =
  return case kind:
    of urlsrc:
      (await get(src.parseUri, decode = false, proxied = false)).body
    else:
      if fileExists(src):
        await readFileAsync(src)
      else:
        ""

Then we have to add it to an imageflow context:

proc addImg*(img: string): bool =
  ## a lock should be held here throughout the `processImg` call.
  if img == "": return false
  doassert ctx.check
  let a = imageflow_context_add_input_buffer(
    # NOTE: The image is held in cache, but it might be collected
    cast[ptr uint8](img[0].unsafeAddr),
    # ... remaining arguments (context, io id, length) elided
  )
  if not a:
    doassert ctx.check
    return false
  cmdStr["decode"] = %inputIoId
  return true

If the image can't be added, it means imageflow failed to recognize the data as a valid image. After we have sent the data, we have to send a query to the context, then read the response, and get the output:

proc doProcessImg(input: string, mtd = execMethod): (string, string) =
  let c = $cmd
  # debug "{hash(c)} - {c}"
  let json_res = imageflow_context_send_json(
      cast[ptr uint8](c[0].unsafeAddr),
      # ... remaining arguments elided
  )
  discard imageflow_json_response_read(ctx.p, json_res,
      # ... out pointers (status, resPtr, resLen) elided
  )
  defer: doassert imageflow_json_response_destroy(ctx.p, json_res)

  var mime: string
  if status != 200:
    let msg = resPtr[].toString(resLen[].int)
    debug "imageflow: conversion failed {msg}"
    doassert ctx.check
  mime = getMime()
  discard imageflow_context_get_output_buffer_by_id(
      # ... arguments (context, io id, out buffer pointers) elided
  )
  doassert ctx.check
  result = (outputBuffer[].toString(outputBufferLen[].int), mime)

We get the mime type from the response; it will be forwarded in the response of the web server. On the server side, the translation from url path to imageflow query is handled like this:

proc processImgData(q: ptr ImgQuery) {.async.} =
  # push img to imageflow context
  initImageFlow() # NOTE: this initializes thread vars
  var acquired, submitted: bool
  let data = (await q.url.rawImg)
  defer:
    if acquired: imgLock[].release
    if not submitted:
      imgOut[q] = true
  if data.len > 0:
    try:
      await imgLock[].acquire
      acquired = true
      if addImg(data):
        let query = fmt"width={q.width}&height={q.height}&mode=max&format=webp"
        logall "ifl server: serving image hash: {hash(await q.url.rawImg)}, size: {q.width}x{q.height}"
        # process and send back
        (q.processed.data, q.processed.mime) = processImg(query)
        imgOut[q] = true
        submitted = true
    except Exception:
      discard # the deferred block still notifies the consumer

The image url is sent as a parameter, in zstd compressed form. The compression shortens the urls (most of the time). This is also how I found a bug in google chrome, where it couldn't handle urls whose query had url-encoded compressed data. Firefox was fine instead.
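
For illustration, building such an image url could look like this (the names, helpers and the compress signature are assumptions):

import std/uri
proc imgQueryUrl(src: string; width, height: int): string =
  # zstd-compress the source url, then url-encode it into the `u` param
  let u = encodeUrl(compress(src))
  $(WEBSITE_URL / "i") & fmt"?u={u}&w={width}&h={height}"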


We add ldjson scripts to each webpage.

proc jwebpage(id, title, url, mtime, selector, description: auto, keywords: seq[string], name = "", headline = "",
            image = "", entity = "Article", status = "Published", lang = "english", mentions: seq[
            string] = (@[]), access_mode = (@["textual", "visual"]), access_sufficient: seq[
            string] = @[], access_summary = "", created = "", published = "",
            props = default(JsonNode)): JsonNode =
    let
        d_mtime = coerce(mtime, "")
        s_created = created.toIsoDate
        description = coerce(description, to = title)
        prd = (v: seq[string]) => v.len == 0

    let data = %*{
        "@context": "https://schema.org",
        "@type": "https://schema.org/WebPage",
        "@id": id,
        "url": url,
        "lastReviewed": coerce(mtime, ""),
        "mainEntityOfPage": {
            "@type": entity,
            "@id": url},
        "mainContentOfPage": { # assumption: the key of this element got lost in the excerpt
            "@type": "WebPageElement", "cssSelector": selector},
        "accessMode": access_mode,
        "accessModeSufficient": {
            "@type": "itemList",
            "itemListElement": coercf(access_sufficient, prd, to = access_mode)},
        "creativeWorkStatus": status,
        # NOTE: datePublished should always be provided
        "datePublished": ensure_time(d_mtime.toIsoDate, s_created),
        "dateModified": d_mtime,
        "dateCreated": coerce(s_created, to = d_mtime),
        "name": coerce(name, to = title),
        "description": coerce(description, ""),
        "keywords": coerce(keywords, to = (@[]))
    }
    setArgs data, %*{"inLanguage": lang, "accessibilitySummary": access_summary,
                    "headline": coerce(headline, to = description), "image": image,
                    "mentions": mentions}

And for translated pages:

proc translation*(src_url, trg_url, lang, title, mtime, selector, description: auto, keywords: seq[string],
                     image = "", headline = "", props = default(JsonNode),
                     translator_name = "Google", translator_url = "https://translate.google.com/"): auto =
    ## file path must be relative to the project directory, assumes the published website is under '__site/'
    let data = jwebpage(id = trg_url, title, url = trg_url, mtime, selector, description,
                            keywords = keywords, image = image, headline = headline, lang = lang, props = props)
    data["translator"] = %*{"@type": "https://schema.org/Organization",
                             "name": translator_name,
                             "url": translator_url}
    data["translationOfWork"] = %*{"@id": src_url}
    data


Same as ldjson, we also provide opengraph meta tags:

proc opgBasic(title, tp, url, image: string, prefix = ""): seq[XmlNode] =
  if prefix != "":
    result.add metaTag(fmt"{prefix}:title", title)
    result.add metaTag(fmt"{prefix}:type", tp)
    result.add metaTag(fmt"{prefix}:url", url)
    result.add metaTag(fmt"{prefix}:image", image)
  else:
    result.add metaTag("title", title)
    result.add metaTag("type", tp)
    result.add metaTag("url", url)
    result.add metaTag("image", image)

proc opgTags(title, tp, url,
             image: string,
             description = "",
             siteName = "",
             locale = "",
             audio = "",
             video = "",
             determiner = "",
             prefix = ""): seq[XmlNode] {.gcsafe.} =
  ## Generates the opengraph meta tags for one item.
  result = opgBasic(title, tp, url, image, prefix)
  result.add opgOptional(description, siteName, locale, audio, video, determiner)

proc opgPage*(a: Article): seq[XmlNode] =
  let
    locale = static(DEFAULT_LOCALE)
    tp = static("article")
    url = getArticleUrl(a)
    siteName = static(WEBSITE_TITLE)
  result = opgTags(a.title, tp, url, a.imageUrl, a.desc, siteName, locale, prefix = "article")
  for t in a.tags:
    result.add metaTag("article:tag", t)
  result.add metaTag("article:author", a.author)
  result.add metaTag("article:published_time", $a.pubTime)
  result.add metaTag("article:section", a.desc)
  # result.add metaTag("article:modified_time", a.pubTime)
  # result.add metaTag("article:expiration_time", a.pubTime)
  result.add twitterMeta("card", "summary")
  result.add twitterMeta("creator", twitterUrl[])

Nim macros and templates come in handy when dealing with all this boilerplate heavy code.
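
For instance, the metaTag helper used above boils down to something like this (a sketch):

import std/xmltree

proc metaTag(name, content: string): XmlNode =
  # <meta property="..." content="...">
  result = newElement("meta")
  result.attrs = {"property": name, "content": content}.toXmlAttributes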

Server side http requests

There is another task that handles all the http requests (to fetch images, scripts, etc.) from the web server side. We use the chronos httpclient:

const proxiedFlags = {NoVerifyHost, NoVerifyServerName, NewConnectionAlways}
const sessionFlags = {NoVerifyHost, NoVerifyServerName, NoInet6Resolution}
proc requestTask(q: sink ptr Request) {.async.} =
  var trial = 0
  var
    sess: HttpSessionRef
    req: HttpClientRequestRef
    resp: HttpClientResponseRef
    cleanup: seq[Future[void]]
  while trial < q[].retries:
    try:
      sess = new(HttpSessionRef,
                proxyTimeout = 10.seconds.div(3),
                headersTimeout = 10.seconds.div(2),
                connectTimeout = 10.seconds,
                proxy = if q[].proxied: selectProxy(trial) else: "",
                flags = if q[].proxied: proxiedFlags else: sessionFlags)
      req = new(HttpClientRequestRef,
                # ... session/url arguments elided
                headers = q[].headers.toHeaderTuple,
                body = q[].body.tobytes)
      resp = await req.fetch(followRedirects = q[].redir, raw = true)
      q.response.code = httpcore.HttpCode(resp.status)
      q.response.body = bytesToString (await resp.getBodyBytes)
      q.response.headers = newHttpHeaders(cast[seq[(string, string)]](
        resp.headers.toList)) # assumption on the headers conversion
      cleanup.add resp.closeWait()
      resp = nil
      break
    except CatchableError:
      debug "cronhttp: request failed"
      trial.inc
    finally:
      if not req.isnil:
        cleanup.add req.closeWait()
      if not resp.isnil:
        cleanup.add resp.closeWait()
      if not sess.isnil:
        cleanup.add sess.closeWait()
  httpOut[q] = true
  await allFutures(cleanup)

I had to add support for https and socks5 proxies to the httpclient to be able to use translations effectively.


You might have noticed capitalized variables throughout the code. All of these are config variables, defined in a file which can be customized per website.

  BASE_URL* = Uri()
  ASSETS_PATH* = PROJECT_PATH / "src" / "assets"
  DEFAULT_IMAGE* = ASSETS_PATH / "empty.png"
  DEFAULT_IMAGE_MIME* = "image/png"
  CSS_BUN_URL* = $(SITE_ASSETS_PATH / "bundle.css")
  CSS_CRIT_PATH* = SITE_ASSETS_DIR / "bundle-crit.css"
  JS_REL_URL* = $(SITE_ASSETS_PATH / "bundle.js")
  LOGO_PATH* = BASE_URL / "assets" / "logo" / WEBSITE_NAME
  LOGO_URL* = $(LOGO_PATH / "logo.svg")
  LOGO_SMALL_URL* = $(LOGO_PATH / "logo-small.svg")
  LOGO_ICON_URL* = $(LOGO_PATH / "logo-icon.svg")
  LOGO_DARK_URL* = $(LOGO_PATH / "logo-dark.svg")
  LOGO_DARK_SMALL_URL* = $(LOGO_PATH / "logo-small-dark.svg")
  LOGO_DARK_ICON_URL* = $(LOGO_PATH / "logo-icon-dark.svg")
  FAVICON_PNG_URL* = $(LOGO_PATH / "logo-icon.png")
  FAVICON_SVG_URL* = $(LOGO_PATH / "logo-icon.svg")
  APPLE_PNG180_URL* = $(LOGO_PATH / "apple-touch-icon.png")
# ...


There are a bunch of things that I have not mentioned, since the devil is in the details...however, this is a rough tour of the whole code base, which amounts to:

What would I do differently?
