Explanations of Technologies
============================

What does Redis Do?
^^^^^^^^^^^^^^^^^^^

Redis is used to make the state of data collection jobs visible on an
external dashboard, like Flower. Internally, CollectOSS relies on Redis to
cache GitHub API Keys, and for OAuth Authentication. Redis is used to
maintain awareness of CollectOSS’s internal state.

What does RabbitMQ Do?
^^^^^^^^^^^^^^^^^^^^^^

CollectOSS is a distributed system. Even on one server, there are many
collection processes happening simultaneously. Each job to collect data
is put on the RabbitMQ Queue by CollectOSS’s “Main Brain”. Then independent
workers pop messages off the RabbitMQ Queue and go collect the data.
These tasks then become standalone processes that report their
completion or failure states back to the Redis server.

**Edit** the ``/etc/redis/redis.conf`` file to ensure these parameters
are configured in this way:

.. code:: shell

   supervised systemd
   databases 900
   maxmemory-samples 10
   maxmemory 20GB

**NOTE**: You may be able to have fewer databases and lower maxmemory
settings. This is a function of how many repositories you are collecting
data for at a given time. The more repositories you are managing data
for, the close to these settings you will need to be.

**Consequences** : If the settings are too low for Redis, CollectOSS’s
maintainer team has observed cases where collection appears to stall.
(TEAM: This is a working theory as of 3/10/2023 for Ubuntu 22.x, based
on EC2 experiments.)

Possible EC2 Configuration Requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With virtualization there may be issues associated with redis-server
connections exceeding available memory. In these cases, the following
workarounds help to resolve issues.

Specifically, you may find this error in your collectoss logs:

.. code:: shell

   redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.

**INSTALL** ``sudo apt install libhugetlbfs-bin``

**COMMAND**:

::

   hugeadm --thp-never` &&
   echo never > /sys/kernel/mm/transparent_hugepage/enabled

.. code:: shell

   sudo vi /etc/rc.local

**paste** into ``/etc/rc.local``

.. code:: shell

   if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
      echo never > /sys/kernel/mm/transparent_hugepage/enabled
   fi

**EDIT** : ``/etc/default/grub`` add the following line:

.. code:: shell

   GRUB_DISABLE_OS_PROBER=true

Postgresql Configuration
------------------------

Your postgresql instance should optimally allow 1,000 connections:

.. code:: shell

   max_connections = 1000                  # (change requires restart)
   shared_buffers = 8GB                    # min 128kB
   work_mem = 2GB                  # min 64kB

CollectOSS will generally hold up to 150 simultaneous connections while
collecting data. The 1,000 number is recommended to accommodate both
collection and analysis on the same database. Use of PGBouncer or other
utility may change these characteristics.

CollectOSS Commands
-------------------

To access command line options, use ``collectoss --help``. To load repos from
GitHub organizations prior to collection, or in other ways, the direct
route is ``collectoss db --help``.

Start a Flower Dashboard, which you can use to monitor progress, and
report any failed processes as issues on the CollectOSS GitHub site. The
error rate for tasks is currently 0.04%, and most errors involve
unhandled platform API timeouts. We continue to identify and add fixes
to handle these errors through additional retries. Starting Flower:
``(nohup celery -A collectoss.tasks.init.celery_app.celery_app flower --port=8400 --max-tasks=1000000 &)``
NOTE: You can use any open port on your server, and access the dashboard
in a browser with http://servername-or-ip:8400 in the example above
(assuming you have access to that port, and its open on your network.)

Starting your CollectOSS Instance
---------------------------------

Start CollectOSS: ``(nohup collectoss backend start &)``

When data collection is complete you will see only a single task running
in your flower Dashboard.

Accessing Repo Addition and Visualization Front End
---------------------------------------------------

Your CollectOSS instance will now be available at
http://servername-or-ip:port_number

Note: CollectOSS will run on port 5000 by default (you probably need to
change that in augur_operations.config for OSX)

Stopping your CollectOSS Instance
---------------------------------

You can stop collectoss with ``collectoss backend stop``, followed by
``collectoss backend kill``. We recommend waiting 5 minutes between commands
so CollectOSS can shutdown more gently. There is no issue with data integrity
if you issue them seconds apart, its just that stopping is nicer than
killing.