Stanford runs an entire AWS VPC devoted to analytics, which hosts the components described in this article.
Our data VPC also has a peering connection to our prod VPC, so that the EMR cluster machines can access our production RDS read replica, which some of the analytics tasks need.
Note that none of this particular topology is necessary. Everything will work fine as long as you can set up a cluster, the app machines, and the databases, and they can all connect to each other as needed.
In recent releases of edx-platform, tracking logs are typically located on each app server at /edx/var/log/tracking/tracking.log-+%Y%m%d-%s. At Stanford (and edX), the tracking logs from all our app servers get synced to a single bucket in S3 (Stanford uses rsync). Whether the logs are pushed by the app servers or periodically synced by some other process, make sure there are no duplicate or missing tracking log files in this bucket, as either will skew the statistical calculations.
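One way to sanity-check the bucket is to scan the key names for duplicate or missing days. The helper below is a hypothetical sketch (not part of edx-analytics-pipeline); it assumes key names contain the `tracking.log-YYYYMMDD` pattern shown above:

```python
import re
from collections import Counter
from datetime import date, timedelta

# Matches the date stamp in names like "tracking.log-20150301-1425173000.gz"
LOG_RE = re.compile(r"tracking\.log-(\d{8})")

def audit_dates(keys):
    """Given a list of tracking-log S3 key names, return (duplicates, gaps).

    duplicates: dates that appear in more than one key
    gaps: dates missing between the earliest and latest log seen
    """
    dates = []
    for key in keys:
        m = LOG_RE.search(key)
        if m:
            s = m.group(1)
            dates.append(date(int(s[:4]), int(s[4:6]), int(s[6:8])))
    counts = Counter(dates)
    duplicates = sorted(d for d, n in counts.items() if n > 1)
    gaps = []
    if dates:
        d, last = min(dates), max(dates)
        while d <= last:
            if d not in counts:
                gaps.append(d)
            d += timedelta(days=1)
    return duplicates, gaps
```

Since each app server rotates its own daily log, run the check per app-server prefix; across servers, one file per server per day is expected, not a duplicate.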
Stanford keeps a long-running cluster around (1 m3.medium master node and 1 m3.medium core node) and sizes the pool of task instances up or down with each task run. The article on creating an EMR cluster has more details.
Note that this is somewhat different from edx.org, which provisions a new EMR cluster for every task run using a custom ansible module driven by a shell script. Consult the edx-analytics-configuration repo if you are interested in this workflow.
It's pretty much standard RDS, but make sure the RDS security groups for the reports database (written to by the code in edx-analytics-pipeline and read by the code in edx-analytics-data-api) allow access from all the master and slave cluster machines (there are EMR-Master and EMR-Slave security groups that were created for us when we launched an EMR cluster) and from all the data API servers. The data API and dashboard (edx-analytics-dashboard) django apps also need databases of their own to function; we just use the same DB server for all three databases.
The reports DB is filled periodically by the luigi tasks, so a scheduler is needed. We set up a Jenkins box because it provides a nice interface that lets us schedule jobs periodically (and view their console output), but also run them on demand. We did a vanilla sudo apt-get install jenkins on an Ubuntu server. However, edx-analytics-pipeline needs to be checked out and installed on this Jenkins box, because the executable python script remote-task supplied by the install is what kicks off the luigi tasks on the EMR cluster.
Task parameters can be supplied in three ways: on the command line of the remote-task command; via an overrides.cfg file that lives on the filesystem of the scheduler Jenkins box and is pointed to by a command-line parameter to remote-task (this is what Stanford does currently); or via an overrides.cfg kept in another repo, with the repo location supplied by yet another command-line parameter to remote-task.
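For reference, overrides.cfg is an INI-style Luigi configuration file. The section and option names below are purely illustrative; check the default configuration shipped with edx-analytics-pipeline for the options your tasks actually read:

```ini
; hypothetical overrides.cfg -- section/option names are illustrative;
; consult edx-analytics-pipeline's default config for the real ones
[database-export]
database = reports
credentials = s3://my-bucket/reports-db-creds.json

[event-logs]
source = s3://my-tracking-logs-bucket/
pattern = .*tracking.log-(?P<date>\d{8}).*
```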
Sundry supporting files are mainly kept in S3, such as the mysql credentials file for the reports database and the .jar libraries needed by various tasks.
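The MySQL credentials file is, to the best of my knowledge, a small JSON document; the field names below are an assumption based on what recent versions of edx-analytics-pipeline expect, so verify them against your checkout:

```json
{
    "host": "reports-db.example.rds.amazonaws.com",
    "port": "3306",
    "username": "pipeline",
    "password": "..."
}
```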
Once you're able to launch tasks, have them run to completion, and confirm there's data in your reports MySQL DB, you need to set up the data API application servers to serve that data over a REST API. There are ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-api) for this, and even a playbook that runs the role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-api.yml), so you don't need to do much beyond editing the vars files used by the playbook.
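The vars you edit are mostly database connection settings. As a hedged illustration (variable names follow the analytics-api role's conventions, but confirm them against the role's defaults in your configuration repo checkout):

```yaml
# illustrative only -- check defaults/main.yml of the analytics-api role
ANALYTICS_API_DATABASES:
  default:
    ENGINE: 'django.db.backends.mysql'
    NAME: 'analytics-api'
    USER: 'api'
    PASSWORD: '...'
    HOST: 'reports-db.example.rds.amazonaws.com'
    PORT: '3306'
  reports:
    ENGINE: 'django.db.backends.mysql'
    NAME: 'reports'
    USER: 'reports'
    PASSWORD: '...'
    HOST: 'reports-db.example.rds.amazonaws.com'
    PORT: '3306'
```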
The data API app has a self-documenting front page (https:///docs/) that you can use to verify that the data is being correctly served.
Once you confirm that the data API is serving data over REST, you can set up the insights (dashboard) app, which is responsible for the UX / presentation of the analytics data. This app does not interact with the reports database directly; rather, it makes REST calls to the data API and interprets/displays the JSON returned.
There are ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-insights) for this, and even a playbook that runs the role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-insights.yml), so you don't need to do much beyond editing the vars files used by the playbook.
The insights app relies on the edx-platform instance for its authentication / authorization to create a more integrated user experience. In particular, when a user visits the insights app, the app uses the OpenID Connect protocol to seamlessly create an insights account linked to the user's edx-platform account. The user's course staff privileges are also propagated from edx-platform to insights, so users only see analytics data for courses in which they have staff privileges.
This means that some configuration is required in edx-platform to add insights as an OpenID Connect client, and that configuration needs to stay in sync with the configuration of the insights app. See the article for details.
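As an illustration of what must stay in sync: insights is configured with an OIDC client key/secret and the LMS URL root, and that same key/secret pair must be registered as a trusted OAuth2/OpenID Connect client in edx-platform (via Django admin or provisioning). The variable names below follow the insights ansible role's conventions but are assumptions; check them against your configuration repo revision:

```yaml
# insights side -- illustrative variable names
INSIGHTS_OAUTH2_KEY: 'insights-client-id'         # must match the client id registered in the LMS
INSIGHTS_OAUTH2_SECRET: 'insights-client-secret'  # must match the client secret in the LMS
INSIGHTS_OAUTH2_URL_ROOT: 'https://lms.example.edu/oauth2'
```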