[01:21:11] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1713 MB (3% inode=85%)
[03:11:59] !log removing files in /srv/deployment/ocg/postmortem on ocg1003, another case of T162780
[03:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:08] T162780: ocg1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780
[03:14:11] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[03:33:01] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:01] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:01] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:31] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:52:31] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[03:53:21] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[04:00:31] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[04:00:52] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:00:52] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[04:01:01] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[04:09:01] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4536.00 Read Requests/Sec=6146.30 Write Requests/Sec=0.20 KBytes Read/Sec=35383.60 KBytes_Written/Sec=1.60
[04:14:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=22.50 Read Requests/Sec=2.50 Write Requests/Sec=6.80 KBytes Read/Sec=10.00 KBytes_Written/Sec=175.20
[06:38:21] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
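The 01:21:11 "Disk space" alert above comes from a standard free-space threshold check. The sketch below is only an illustration of that logic in Python, not the actual Icinga plugin (which is the compiled Nagios check_disk); the 4%/6% thresholds and the output wording are assumptions for the example.

```python
#!/usr/bin/env python3
"""Illustrative free-space threshold check in the spirit of check_disk (not the real plugin)."""
import shutil
import sys

# Assumed thresholds for the example: warn below 6% free, critical below 4%.
WARN_PCT, CRIT_PCT = 6.0, 4.0

def check_disk(path="/"):
    usage = shutil.disk_usage(path)             # named tuple: total, used, free (bytes)
    free_mb = usage.free // (1024 * 1024)
    free_pct = 100.0 * usage.free / usage.total
    msg = f"free space: {path} {free_mb} MB ({free_pct:.0f}%)"
    if free_pct < CRIT_PCT:
        print(f"DISK CRITICAL - {msg}")
        return 2                                # Nagios/Icinga exit code for CRITICAL
    if free_pct < WARN_PCT:
        print(f"DISK WARNING - {msg}")
        return 1
    print(f"DISK OK - {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(check_disk(sys.argv[1] if len(sys.argv) > 1 else "/"))
```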
[07:07:21] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[07:52:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:54:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 44 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:04:05] (CR) Reedy: [C: -1] "Needs a comment referencing T123085 and that this should be removed when that bug is fixed" [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[09:48:54] (PS1) Volans: MediaWiki: reduce verbosity of the cache warmup [puppet] - https://gerrit.wikimedia.org/r/349787 (https://phabricator.wikimedia.org/T163369)
[10:04:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:07:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:11:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 38 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:14:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 32 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:29:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:31:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 17 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:39:18] hi :). Is a stuck global rename a task that operations may handle too? I'm not sure in the case of T163622
[12:39:19] T163622: Please unblock stuck global rename - https://phabricator.wikimedia.org/T163622
[12:51:51] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 665784
[13:17:51] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 791213
[14:33:11] PROBLEM - Check Varnish expiry mailbox lag on cp2020 is CRITICAL: CRITICAL: expiry mailbox lag is 710196
[14:37:51] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 6
[15:35:45] (Draft1) Paladox: Phabricator: Install php5-apcu on debian jessie [puppet] - https://gerrit.wikimedia.org/r/349793
[15:35:48] (PS2) Paladox: Phabricator: Install php5-apcu on debian jessie [puppet] - https://gerrit.wikimedia.org/r/349793
[15:36:29] (CR) Paladox: "Phabricator now supports apcu." [puppet] - https://gerrit.wikimedia.org/r/349793 (owner: Paladox)
[15:39:17] (CR) Paladox: "The apc package on debian is a dummy package, which means it doesn't work. Also looking at phab-01.wmflabs.org. It shows it uses the apcu ca" [puppet] - https://gerrit.wikimedia.org/r/349793 (owner: Paladox)
[16:11:51] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 15266
[17:08:06] Reedy: just to be clear on T163167, you're agreeing with my fix? I'm a bit confused about your comment
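The "Check Varnish expiry mailbox lag" alerts above (cp2017, cp2002, cp2020) track how far Varnish's expiry thread has fallen behind in processing objects handed to it by worker threads. The production check presumably lives in the operations/puppet repo; the Python sketch below is only an assumed reconstruction of the idea, reading the MAIN.exp_mailed and MAIN.exp_received counters from Varnish 4-style `varnishstat -j` output, with made-up warning/critical thresholds.

```python
#!/usr/bin/env python3
"""Assumed reconstruction of an expiry mailbox lag check (Varnish 4-style varnishstat -j output)."""
import json
import subprocess
import sys

# Assumed thresholds for the example; the production check may use different values.
WARN, CRIT = 100000, 500000

def mailbox_lag():
    out = subprocess.run(["varnishstat", "-j"], capture_output=True, check=True, text=True)
    stats = json.loads(out.stdout)
    mailed = stats["MAIN.exp_mailed"]["value"]      # objects mailed to the expiry thread
    received = stats["MAIN.exp_received"]["value"]  # objects the expiry thread has processed
    return mailed - received

if __name__ == "__main__":
    lag = mailbox_lag()
    if lag > CRIT:
        print(f"CRITICAL: expiry mailbox lag is {lag}")
        sys.exit(2)
    if lag > WARN:
        print(f"WARNING: expiry mailbox lag is {lag}")
        sys.exit(1)
    print(f"OK: expiry mailbox lag is {lag}")
    sys.exit(0)
```

As the log shows later (19:13:30), when the lag grows unbounded the usual remediation is to restart the affected varnish backend.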
[17:08:07] T163167: Sysops are no longer able to add education extension groups - https://phabricator.wikimedia.org/T163167
[17:10:29] (PS11) Zppix: Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167)
[17:21:40] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[17:31:01] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 605418
[17:40:51] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 609647
[17:49:47] !log disabling puppet on db2062 and upgrading MariaDB package to 10.1 T116557
[17:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:57] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557
[18:10:51] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 2244
[18:11:01] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 0
[18:34:18] (CR) EddieGP: [C: +1] Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[19:10:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:10:51] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[19:11:51] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[19:12:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:13:30] !log cp2020: restart varnish-be
[19:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:11] RECOVERY - Check Varnish expiry mailbox lag on cp2020 is OK: OK: expiry mailbox lag is 0
[19:26:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:56:51] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 798101
[20:05:08] mutante: “Oh shit”
[20:05:13] https://commons.wikimedia.org/wiki/File:HooJoome.pdf
[20:05:34] “This file is not used in any other project.
[20:05:35] A database query error has occurred. This may indicate a bug in the software. [WP0IwwrAIDsAAG4VS2UAAADF] 2017-04-23 20:04:35: Fatal exception of type "DBQueryError"”
[20:05:47] (when attempting to delete it)
[20:06:37] Then shows up as ‘deleted’ when attempting to delete it again.
[20:09:03] Revent: does that happen if you delete other files?
[20:09:46] Umm, sec
[20:11:35] paladox: Nope.
[20:11:41] Ok
[20:11:51] thanks. So it could be just that specific file
[20:12:22] Maybe lag, since it showed up as deleted when I tried again.
[20:12:57] Yep
[20:13:09] Sad that finding ‘something else to delete’ was so easy, lol.
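The "Upload HTTP 5xx reqs/min on graphite1001" alerts at 19:10 and 19:12 are the usual Graphite-backed threshold check: it fires when a given percentage of recent datapoints for a metric exceed a threshold. The sketch below only illustrates that pattern; the Graphite endpoint, metric name, and time window are hypothetical, and the real check's options certainly differ.

```python
#!/usr/bin/env python3
"""Rough sketch of a Graphite "percentage of datapoints above threshold" check.

The endpoint, metric name, and window below are assumptions, not the production values."""
import json
import sys
import urllib.request

GRAPHITE = "https://graphite.example.org"   # hypothetical Graphite endpoint
METRIC = "reqstats.upload.5xx.rate"         # hypothetical metric name
THRESHOLD = 1000.0                          # critical value seen in the log line
CRIT_PCT = 20.0                             # alert when >= 20% of datapoints exceed it

def percent_over(metric, minutes=10):
    url = f"{GRAPHITE}/render?target={metric}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)            # list of {"target": ..., "datapoints": [[value, ts], ...]}
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return 0.0
    return 100.0 * sum(1 for v in points if v > THRESHOLD) / len(points)

if __name__ == "__main__":
    pct = percent_over(METRIC)
    if pct >= CRIT_PCT:
        print(f"CRITICAL: {pct:.2f}% of data above the critical threshold [{THRESHOLD}]")
        sys.exit(2)
    print(f"OK: {pct:.2f}% of data above the threshold")
    sys.exit(0)
```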
[20:15:01] lol
[20:56:51] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 28
[22:28:37] Operations, ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163454#3204641 (Volans)
[22:28:39] Operations, ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3204643 (Volans)
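The two "Degraded RAID on restbase1018" tasks at the end are auto-filed from RAID health monitoring. The production check covers several controller types; as a minimal, assumption-laden sketch of the same idea, the snippet below handles only Linux software RAID (md) by scanning /proc/mdstat for arrays with missing members.

```python
#!/usr/bin/env python3
"""Minimal sketch of a degraded-RAID check for Linux software RAID (md) only.

Hardware controllers (as likely used on restbase hosts) need their own tooling;
this only illustrates the /proc/mdstat case."""
import re
import sys

def degraded_md_arrays(path="/proc/mdstat"):
    with open(path) as f:
        text = f.read()
    degraded = []
    # Status lines look like "... [2/2] [UU]"; an underscore marks a missing member.
    for array, status in re.findall(r"^(md\d+) :.*?\[([U_]+)\]", text, re.M | re.S):
        if "_" in status:
            degraded.append(array)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("CRITICAL: degraded arrays: " + ", ".join(bad))
        sys.exit(2)
    print("OK: all md arrays healthy")
    sys.exit(0)
```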