[00:00:17] Right... I mean, my concern was that since Web requests are automatically wrapped in a transaction
[00:00:25] if you don't wait / care for slaves you will likely get your master into trouble, but that's not a common case on our setup
[00:02:01] fantastic...! thanks so much :)
[00:02:45] BTW... related question... do you know where all this is configured, and what bit of the code does the automatic scoping of a transaction into a web request? (or maybe I'm misunderstanding?)
[00:03:22] Sorry to be so bothersome, it'd just be nice to see in full detail how it all works 8p
[00:09:12] AndyRussG: https://www.mediawiki.org/wiki/Manual:$wgDBservers see flags there
[00:09:24] related code should be in the Database class
[00:10:23] hoo: gotcha... thanks a million, really appreciate it! :D
[00:14:09] :)
[00:40:43] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 337 MB (3% inode=72%):
[00:50:52] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[00:51:33] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 305 seconds
[00:52:02] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 341 seconds
[00:53:15] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:53:50] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[00:58:54] (CR) Legoktm: "Needs I2f53b23631aeeff91023ae8b44e2a4753c1f0ba3 to be deployed first." [mediawiki-config] - https://gerrit.wikimedia.org/r/167408 (owner: Legoktm)
[02:14:23] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-19 02:14:23+00:00
[02:14:34] Logged the message, Master
[02:25:44] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-19 02:25:44+00:00
[02:25:52] Logged the message, Master
[03:36:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Oct 19 03:36:03 UTC 2014 (duration 36m 2s)
[03:36:09] Logged the message, Master
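A note on the [00:00:17]–[00:14:09] exchange above: the per-request transaction behaviour AndyRussG asks about is controlled per server by the 'flags' field of $wgDBservers (the manual page hoo links); DBO_DEFAULT turns on DBO_TRX for web requests but not for maintenance scripts, and the handling lives in MediaWiki's Database class. A minimal sketch of such a master/slave configuration follows, with hypothetical host names, credentials and load weights:

    <?php
    // LocalSettings.php sketch (hypothetical hosts/credentials): one master plus
    // one replica, using the keys documented at Manual:$wgDBservers.
    $wgDBservers = array(
        array(
            'host'     => 'db-master.example.org', // master: all writes go here
            'dbname'   => 'wikidb',
            'user'     => 'wikiuser',
            'password' => 'secret',
            'type'     => 'mysql',
            'flags'    => DBO_DEFAULT, // includes DBO_TRX on web requests, so each
                                       // request runs inside one implicit transaction
            'load'     => 0,           // weight 0: no read traffic on the master
        ),
        array(
            'host'     => 'db-slave1.example.org', // replica: serves reads only
            'dbname'   => 'wikidb',
            'user'     => 'wikiuser',
            'password' => 'secret',
            'type'     => 'mysql',
            'flags'    => DBO_DEFAULT,
            'load'     => 1,           // relative read-load weight
            'max lag'  => 30,          // skip this replica if it lags more than 30 s
        ),
    );

Under a setup like this, a web request's writes are committed together when the request finishes; the "wait for slaves" concern from [00:00:25] is what long-running maintenance scripts handle explicitly (e.g. via wfWaitForSlaves() in MediaWiki of that era), since they do not get the implicit DBO_TRX wrapper.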
[03:42:33] (CR) Legoktm: "Actually I4eb6322183556b44bc748c24932892cb311880c0." [mediawiki-config] - https://gerrit.wikimedia.org/r/167408 (owner: Legoktm)
[05:58:19] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=72%):
[06:26:00] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[06:26:30] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[06:29:10] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:19] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:29] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:39] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:40] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:50] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:50] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:19] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:29] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:37:19] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:41] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:45:42] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:45:51] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:47:10] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:48:50] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:55:51] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:38:01] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 370326 msg: ocg_render_job_queue 662 msg (=500 critical)
[07:38:11] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 370475 msg: ocg_render_job_queue 769 msg (=500 critical)
[07:38:35] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 370795 msg: ocg_render_job_queue 981 msg (=500 critical)
[07:43:12] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 371094 msg: ocg_render_job_queue 36 msg
[07:43:21] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 371101 msg: ocg_render_job_queue 16 msg
[07:43:50] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 371129 msg: ocg_render_job_queue 0 msg
[08:35:17] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 709
[08:40:24] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1008
[08:45:20] RECOVERY - check_mysql on lutetium is OK: Uptime: 227561 Threads: 2 Questions: 1760156 Slow queries: 617 Opens: 2713 Flush tables: 2 Open tables: 64 Queries per second avg: 7.734 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:05:49] PROBLEM - Disk space on ms-be2007 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error
[09:06:29] PROBLEM - RAID on ms-be2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[09:13:00] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:02:02] RECOVERY - Disk space on ms-be2007 is OK: DISK OK
[10:08:20] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 347244 msg: ocg_render_job_queue 1115 msg (=500 critical)
[10:08:20] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 347245 msg: ocg_render_job_queue 1115 msg (=500 critical)
[10:08:45] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 347258 msg: ocg_render_job_queue 1091 msg (=500 critical)
[10:16:46] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 347958 msg: ocg_render_job_queue 72 msg
[10:16:46] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 347959 msg: ocg_render_job_queue 68 msg
[10:16:56] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 347980 msg: ocg_render_job_queue 53 msg
[11:14:46] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 354360 msg: ocg_render_job_queue 570 msg (=500 critical)
[11:14:59] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 354445 msg: ocg_render_job_queue 613 msg (=500 critical)
[11:15:05] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 354446 msg: ocg_render_job_queue 604 msg (=500 critical)
[11:20:05] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 354905 msg: ocg_render_job_queue 63 msg
[11:20:07] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 354912 msg: ocg_render_job_queue 51 msg
[11:20:15] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 354917 msg: ocg_render_job_queue 48 msg
[11:34:02] (PS8) Ricordisamoa: minor changes to InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/129464
[11:37:43] (CR) Ricordisamoa: "Rebased again. Guys, we'll get to 6 months soon!" [mediawiki-config] - https://gerrit.wikimedia.org/r/129464 (owner: Ricordisamoa)
[12:16:01] (PS1) Nemo bis: Enable uploads on Hungarian Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/167439 (https://bugzilla.wikimedia.org/72231)
[13:03:08] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:04:27] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:07:07] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:09:08] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:19:27] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[13:20:57] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[13:25:47] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[13:26:46] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[13:32:27] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:33:26] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:34:56] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:49:57] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[13:51:58] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[13:53:36] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[13:55:11] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:00:27] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:00:47] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:10:48] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[14:14:49] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[14:16:18] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[14:23:48] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 374151 msg: ocg_render_job_queue 647 msg (=500 critical)
[14:23:59] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 374166 msg: ocg_render_job_queue 652 msg (=500 critical)
[14:24:02] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 374167 msg: ocg_render_job_queue 650 msg (=500 critical)
[14:28:58] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 375190 msg: ocg_render_job_queue 991 msg (=500 critical)
[14:29:00] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 375243 msg: ocg_render_job_queue 1030 msg (=500 critical)
[14:29:08] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 375253 msg: ocg_render_job_queue 1037 msg (=500 critical)
[14:41:15] (Abandoned) Yuvipanda: icinga: Add a parameter to icinga::web to parameterize SSL [puppet] - https://gerrit.wikimedia.org/r/164103 (owner: Yuvipanda)
[14:48:52] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 377416 msg: ocg_render_job_queue 57 msg
[14:48:53] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 377422 msg: ocg_render_job_queue 52 msg
[14:49:02] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 377425 msg: ocg_render_job_queue 49 msg
[15:32:11] (PS3) Glaisher: Create Oriya Wikisource (orwikisource) [mediawiki-config] - https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875)
[15:32:48] (CR) Glaisher: "Rebased" [mediawiki-config] - https://gerrit.wikimedia.org/r/166186 (https://bugzilla.wikimedia.org/71875) (owner: Glaisher)
[16:48:20] (PS1) Glaisher: Rename project and project talk namespaces on mrwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/167447 (https://bugzilla.wikimedia.org/71774)
[17:19:35] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 351913 msg: ocg_render_job_queue 506 msg (=500 critical)
[17:26:45] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 352563 msg: ocg_render_job_queue 51 msg
[17:31:55] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 354015 msg: ocg_render_job_queue 645 msg (=500 critical)
[17:32:06] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 354034 msg: ocg_render_job_queue 619 msg (=500 critical)
[17:32:15] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 354039 msg: ocg_render_job_queue 612 msg (=500 critical)
[17:38:27] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 354504 msg: ocg_render_job_queue 98 msg
[17:39:06] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 354546 msg: ocg_render_job_queue 68 msg
[17:39:27] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 354566 msg: ocg_render_job_queue 45 msg
[18:22:56] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 359181 msg: ocg_render_job_queue 511 msg (=500 critical)
[18:23:06] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 359189 msg: ocg_render_job_queue 502 msg (=500 critical)
[18:27:25] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail
[18:27:46] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 359896 msg: ocg_render_job_queue 532 msg (=500 critical)
[18:28:11] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 359914 msg: ocg_render_job_queue 509 msg (=500 critical)
[18:28:15] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 359931 msg: ocg_render_job_queue 504 msg (=500 critical)
[18:41:16] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 360927 msg: ocg_render_job_queue 65 msg
[18:41:35] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 360950 msg: ocg_render_job_queue 58 msg
[18:41:36] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 360951 msg: ocg_render_job_queue 55 msg
[18:47:58] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[18:48:47] (PS1) MaxSem: Perform mobile redirect only for GET requests [puppet] - https://gerrit.wikimedia.org/r/167453 (https://bugzilla.wikimedia.org/72186)
[20:09:00] PROBLEM - check configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:00] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:01] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:01] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:01] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:21] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:31] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:32] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:32] PROBLEM - check if dhclient is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:41] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:09:50] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:00] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:52] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[20:11:11] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[20:11:11] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212
[20:11:12] RECOVERY - check configured eth on mw1114 is OK: NRPE: Unable to read output
[20:11:14] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.836 second response time
[20:11:14] RECOVERY - Disk space on mw1114 is OK: DISK OK
[20:11:14] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 66984 bytes in 7.968 second response time
[20:11:42] RECOVERY - check if dhclient is running on mw1114 is OK: PROCS OK: 0 processes with command name dhclient
[20:11:42] RECOVERY - DPKG on mw1114 is OK: All packages OK
[20:11:42] RECOVERY - check if salt-minion is running on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:11:52] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed
[20:25:16] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 370898 msg: ocg_render_job_queue 551 msg (=500 critical)
[20:25:16] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 370901 msg: ocg_render_job_queue 548 msg (=500 critical)
[20:25:26] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 370911 msg: ocg_render_job_queue 537 msg (=500 critical)
[20:26:51] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[20:27:40] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 371066 msg: ocg_render_job_queue 500 msg (=500 critical)
[20:29:40] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 371186 msg: ocg_render_job_queue 503 msg (=500 critical)
[20:29:50] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 371188 msg: ocg_render_job_queue 500 msg (=500 critical)
[20:30:50] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: puppet fail
[20:31:02] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 371262 msg: ocg_render_job_queue 528 msg (=500 critical)
[20:43:12] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 372027 msg: ocg_render_job_queue 89 msg
[20:43:21] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 372034 msg: ocg_render_job_queue 85 msg
[20:43:31] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 372053 msg: ocg_render_job_queue 90 msg
[20:50:51] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[21:03:23] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 374370 msg: ocg_render_job_queue 699 msg (=500 critical)
[21:03:23] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 374374 msg: ocg_render_job_queue 693 msg (=500 critical)
[21:03:33] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 374386 msg: ocg_render_job_queue 684 msg (=500 critical)
[21:11:03] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: puppet fail
[21:11:44] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 374933 msg: ocg_render_job_queue 69 msg
[21:11:53] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 374944 msg: ocg_render_job_queue 58 msg
[21:12:03] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 374962 msg: ocg_render_job_queue 50 msg
[21:29:54] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[21:43:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192
[21:44:12] RECOVERY - BGP status on cr1-codfw is OK: OK: host 208.80.153.192, sessions up: 6, down: 0, shutdown: 0
[21:56:34] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 329050 msg: ocg_render_job_queue 703 msg (=500 critical)
[21:56:34] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 329108 msg: ocg_render_job_queue 754 msg (=500 critical)
[21:56:43] PROBLEM - BGP status on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192
[21:57:23] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 329257 msg: ocg_render_job_queue 727 msg (=500 critical)
[21:57:34] RECOVERY - BGP status on cr1-codfw is OK: OK: host 208.80.153.192, sessions up: 6, down: 0, shutdown: 0
[22:05:34] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 329789 msg: ocg_render_job_queue 96 msg
[22:05:43] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 329797 msg: ocg_render_job_queue 84 msg
[22:05:44] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 329803 msg: ocg_render_job_queue 73 msg
[22:13:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[22:15:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0
[23:19:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.192 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2
[23:20:25] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0
[23:57:14] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 339230 msg: ocg_render_job_queue 786 msg (=500 critical)
[23:57:25] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 339240 msg: ocg_render_job_queue 785 msg (=500 critical)
[23:57:39] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 339247 msg: ocg_render_job_queue 785 msg (=500 critical)