[01:21:11] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1713 MB (3% inode=85%)
[03:11:59] !log removing files in /srv/deployment/ocg/postmortem on ocg1003, another case of T162780
[03:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:08] T162780: ocg1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780
[03:14:11] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[03:33:01] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:01] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:01] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:31] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:52:31] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[03:53:21] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[04:00:31] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[04:00:52] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:00:52] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[04:01:01] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[04:09:01] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4536.00 Read Requests/Sec=6146.30 Write Requests/Sec=0.20 KBytes Read/Sec=35383.60 KBytes_Written/Sec=1.60
[04:14:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=22.50 Read Requests/Sec=2.50 Write Requests/Sec=6.80 KBytes Read/Sec=10.00 KBytes_Written/Sec=175.20
[06:38:21] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
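The 01:21:11 "Disk space" alert above comes from a standard free-space threshold check. The sketch below is only an illustration of that logic in Python, not the actual Icinga plugin (which is the compiled Nagios check_disk); the 4%/6% thresholds and the output wording are assumptions for the example.

```python
#!/usr/bin/env python3
"""Illustrative free-space threshold check in the spirit of check_disk (not the real plugin)."""
import shutil
import sys

# Assumed thresholds for the example: warn below 6% free, critical below 4%.
WARN_PCT, CRIT_PCT = 6.0, 4.0

def check_disk(path="/"):
    usage = shutil.disk_usage(path)             # named tuple: total, used, free (bytes)
    free_mb = usage.free // (1024 * 1024)
    free_pct = 100.0 * usage.free / usage.total
    msg = f"free space: {path} {free_mb} MB ({free_pct:.0f}%)"
    if free_pct < CRIT_PCT:
        print(f"DISK CRITICAL - {msg}")
        return 2                                # Nagios/Icinga exit code for CRITICAL
    if free_pct < WARN_PCT:
        print(f"DISK WARNING - {msg}")
        return 1
    print(f"DISK OK - {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(check_disk(sys.argv[1] if len(sys.argv) > 1 else "/"))
```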
[07:07:21] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[07:52:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:54:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 44 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:04:05] (CR) Reedy: [C: -1] "Needs a comment referencing T123085 and that this should be removed when that bug is fixed" [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[09:48:54] (PS1) Volans: MediaWiki: reduce verbosity of the cache warmup [puppet] - https://gerrit.wikimedia.org/r/349787 (https://phabricator.wikimedia.org/T163369)
[10:04:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:07:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:11:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 38 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:14:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 32 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:29:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 274 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:31:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 17 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:39:18] hi :). Is a stuck global rename a task that operations may handle too? I'm not sure in the case of T163622
[12:39:19] T163622: Please unblock stuck global rename - https://phabricator.wikimedia.org/T163622
[12:51:51] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 665784
[13:17:51] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 791213
[14:33:11] PROBLEM - Check Varnish expiry mailbox lag on cp2020 is CRITICAL: CRITICAL: expiry mailbox lag is 710196
[14:37:51] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 6
[15:35:45] (Draft1) Paladox: Phabricator: Install php5-apcu on debian jessie [puppet] - https://gerrit.wikimedia.org/r/349793
[15:35:48] (PS2) Paladox: Phabricator: Install php5-apcu on debian jessie [puppet] - https://gerrit.wikimedia.org/r/349793
[15:36:29] (CR) Paladox: "Phabricator now supports apcu." [puppet] - https://gerrit.wikimedia.org/r/349793 (owner: Paladox)
[15:39:17] (CR) Paladox: "The apc package on debian is a dummy package, which means it doesn't work. Also looking at phab-01.wmflabs.org. It shows it uses the apcu ca" [puppet] - https://gerrit.wikimedia.org/r/349793 (owner: Paladox)
[16:11:51] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 15266
[17:08:06] Reedy: just to be clear on T163167, you're agreeing with my fix? I'm a bit confused about your comment
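The "Check Varnish expiry mailbox lag" alerts above (cp2017, cp2002, cp2020) track how far Varnish's expiry thread has fallen behind in processing objects handed to it by worker threads. The production check presumably lives in the operations/puppet repo; the Python sketch below is only an assumed reconstruction of the idea, reading the MAIN.exp_mailed and MAIN.exp_received counters from Varnish 4-style `varnishstat -j` output, with made-up warning/critical thresholds.

```python
#!/usr/bin/env python3
"""Assumed reconstruction of an expiry mailbox lag check (Varnish 4-style varnishstat -j output)."""
import json
import subprocess
import sys

# Assumed thresholds for the example; the production check may use different values.
WARN, CRIT = 100000, 500000

def mailbox_lag():
    out = subprocess.run(["varnishstat", "-j"], capture_output=True, check=True, text=True)
    stats = json.loads(out.stdout)
    mailed = stats["MAIN.exp_mailed"]["value"]      # objects mailed to the expiry thread
    received = stats["MAIN.exp_received"]["value"]  # objects the expiry thread has processed
    return mailed - received

if __name__ == "__main__":
    lag = mailbox_lag()
    if lag > CRIT:
        print(f"CRITICAL: expiry mailbox lag is {lag}")
        sys.exit(2)
    if lag > WARN:
        print(f"WARNING: expiry mailbox lag is {lag}")
        sys.exit(1)
    print(f"OK: expiry mailbox lag is {lag}")
    sys.exit(0)
```

As the log shows later (19:13:30), when the lag grows unbounded the usual remediation is to restart the affected varnish backend.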
[17:08:07] T163167: Sysops are no longer able to add education extension groups - https://phabricator.wikimedia.org/T163167
[17:10:29] (PS11) Zppix: Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167)
[17:21:40] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[17:31:01] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 605418
[17:40:51] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 609647
[17:49:47] !log disabling puppet on db2062 and upgrading MariaDB package to 10.1 T116557
[17:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:57] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557
[18:10:51] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 2244
[18:11:01] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 0
[18:34:18] (CR) EddieGP: [C: +1] Fix EducationProgram user rights so that they can be assigned/removed by sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167) (owner: Zppix)
[19:10:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:10:51] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[19:11:51] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[19:12:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[19:13:30] !log cp2020: restart varnish-be
[19:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:11] RECOVERY - Check Varnish expiry mailbox lag on cp2020 is OK: OK: expiry mailbox lag is 0
[19:26:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:56:51] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 798101
[20:05:08] mutante: “Oh shit”
[20:05:13] https://commons.wikimedia.org/wiki/File:HooJoome.pdf
[20:05:34] “This file is not used in any other project.
[20:05:35] A database query error has occurred. This may indicate a bug in the software. [WP0IwwrAIDsAAG4VS2UAAADF] 2017-04-23 20:04:35: Fatal exception of type "DBQueryError"”
[20:05:47] (when attempting to delete it)
[20:06:37] Then shows up as ‘deleted’ when attempting to delete it again.
[20:09:03] Revent: does that happen if you delete other files?
[20:09:46] Umm, sec
[20:11:35] paladox: Nope.
[20:11:41] Ok
[20:11:51] thanks. So it could be just that specific file
[20:12:22] Maybe lag, since it showed up as deleted when I tried again.
[20:12:57] Yep
[20:13:09] Sad that finding ‘something else to delete’ was so easy, lol.
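The "Upload HTTP 5xx reqs/min on graphite1001" alerts at 19:10 and 19:12 are the usual Graphite-backed threshold check: it fires when a given percentage of recent datapoints for a metric exceed a threshold. The sketch below only illustrates that pattern; the Graphite endpoint, metric name, and time window are hypothetical, and the real check's options certainly differ.

```python
#!/usr/bin/env python3
"""Rough sketch of a Graphite "percentage of datapoints above threshold" check.

The endpoint, metric name, and window below are assumptions, not the production values."""
import json
import sys
import urllib.request

GRAPHITE = "https://graphite.example.org"   # hypothetical Graphite endpoint
METRIC = "reqstats.upload.5xx.rate"         # hypothetical metric name
THRESHOLD = 1000.0                          # critical value seen in the log line
CRIT_PCT = 20.0                             # alert when >= 20% of datapoints exceed it

def percent_over(metric, minutes=10):
    url = f"{GRAPHITE}/render?target={metric}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)            # list of {"target": ..., "datapoints": [[value, ts], ...]}
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return 0.0
    return 100.0 * sum(1 for v in points if v > THRESHOLD) / len(points)

if __name__ == "__main__":
    pct = percent_over(METRIC)
    if pct >= CRIT_PCT:
        print(f"CRITICAL: {pct:.2f}% of data above the critical threshold [{THRESHOLD}]")
        sys.exit(2)
    print(f"OK: {pct:.2f}% of data above the threshold")
    sys.exit(0)
```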
[20:15:01] lol
[20:56:51] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 28
[22:28:37] Operations, ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163454#3204641 (Volans)
[22:28:39] Operations, ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3204643 (Volans)
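The two "Degraded RAID on restbase1018" tasks at the end are auto-filed from RAID health monitoring. The production check covers several controller types; as a minimal, assumption-laden sketch of the same idea, the snippet below handles only Linux software RAID (md) by scanning /proc/mdstat for arrays with missing members.

```python
#!/usr/bin/env python3
"""Minimal sketch of a degraded-RAID check for Linux software RAID (md) only.

Hardware controllers (as likely used on restbase hosts) need their own tooling;
this only illustrates the /proc/mdstat case."""
import re
import sys

def degraded_md_arrays(path="/proc/mdstat"):
    with open(path) as f:
        text = f.read()
    degraded = []
    # Status lines look like "... [2/2] [UU]"; an underscore marks a missing member.
    for array, status in re.findall(r"^(md\d+) :.*?\[([U_]+)\]", text, re.M | re.S):
        if "_" in status:
            degraded.append(array)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    if bad:
        print("CRITICAL: degraded arrays: " + ", ".join(bad))
        sys.exit(2)
    print("OK: all md arrays healthy")
    sys.exit(0)
```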