[00:29:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [00:38:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:07:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [01:10:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:27:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [01:28:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [01:36:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [01:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [01:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:46:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [01:48:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:48:50] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:01:08] (03PS13) 10Tim Starling: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [02:03:12] (03PS14) 10Tim Starling: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) [02:05:11] (03CR) 10Tim Starling: "PS13: rebase" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [02:18:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [02:19:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [02:20:31] (03CR) 10Tim Starling: "Doesn't look cherry picked to me. I'm going to do it." 
[puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [02:27:50] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:29:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:38:09] (03CR) 10Tim Starling: [C: 032] Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [03:39:07] (03Merged) 10jenkins-bot: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [03:39:16] (03CR) 10jenkins-bot: Use EtcdConfig in beta cluster only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [03:44:38] !log tstarling@naos Synchronized wmf-config/etcd.php: https://gerrit.wikimedia.org/r/#/c/347537/ (duration: 02m 39s) [03:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:12] !log tstarling@naos Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/347537/ (duration: 01m 01s) [03:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:26] 06Operations, 10MediaWiki-General-or-Unknown, 06Multimedia: Segmentation fault creating thumbnail - https://phabricator.wikimedia.org/T159242#3061484 (10Jonesey95) This is also happening here: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/Map_of_Virginia_highlighting_Richmond_City.svg/258px-Map_... [04:17:20] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89934.36 seconds [04:20:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [04:21:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [04:22:10] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3224690 (10tstarling) Did the following testing: * cherry-picked the proposed conftool-data... 
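The "Synchronized wmf-config/etcd.php" and "Synchronized wmf-config" entries above are what scap logs to the Server Admin Log when a MediaWiki config change is pushed out. As a rough, hypothetical reconstruction only (the exact command line used is not shown in the log, and scap fills in the log message and duration itself):

```bash
# Hedged sketch: sync a single config file to the cluster with scap, citing the
# gerrit change in the log message. The directory-wide sync that follows in the
# log would be analogous.
scap sync-file wmf-config/etcd.php 'https://gerrit.wikimedia.org/r/#/c/347537/'
```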
[04:28:09] (03PS1) 10Tim Starling: Enable EtcdConfig in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351132 [04:29:15] (03PS2) 10Tim Starling: Enable EtcdConfig in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351132 (https://phabricator.wikimedia.org/T156924) [04:33:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=495.30 Read Requests/Sec=581.50 Write Requests/Sec=3.70 KBytes Read/Sec=37903.20 KBytes_Written/Sec=92.40 [04:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [04:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [04:41:11] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.20 Read Requests/Sec=0.00 Write Requests/Sec=0.60 KBytes Read/Sec=0.00 KBytes_Written/Sec=15.20 [05:24:38] 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3224700 (10Tbayer) >>! In T160941#3146572, @Milimetric wrote: > @Tbayer, I copied Zareen's notes in the description, please add anything else that you think is painful about SSH... [05:30:47] 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3224702 (10Tbayer) PS: Perhaps (I haven't checked in detail) there are also things that could be adapted from the SSH documentation changes @Cdentinger recently made to the Fun... [05:33:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [05:35:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [05:51:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:01:50] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:02:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:25:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [06:26:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [06:43:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [06:44:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:47:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [06:56:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:36:20] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [07:37:20] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [08:19:25] Hi. Is 2017-05-01 a working or holiday from SF point of view? 
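The recurring "nova instance creation test on labnet1001" PROBLEM/RECOVERY lines above come from a process-count check. A minimal sketch of such a check, assuming the stock monitoring-plugins check_procs at its usual Debian path (the exact arguments used on labnet1001 may differ):

```bash
# Count processes whose command name is "python" and whose args contain
# "nova-fullstack"; -c 1:1 makes anything other than exactly one match CRITICAL.
/usr/lib/nagios/plugins/check_procs -c 1:1 -C python -a nova-fullstack
# Failure output looks like the alerts above:
# "PROCS CRITICAL: 0 processes with command name 'python', args 'nova-fullstack'"
```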
[08:36:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [08:38:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [08:43:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:51:51] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 592462 [09:08:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:20:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:20:10] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:21:50] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 60 [09:50:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [09:51:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [09:52:23] 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224532 (10Volans) The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands i... [10:06:15] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3224786 (10Volans) Is there an easy way I could check which version and/or value of an Etcd-... 
[10:07:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [10:10:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [10:11:10] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 609790 [10:37:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [10:40:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [11:25:04] !log running alter table on enwiki.categorylinks on db1052 T164185 [11:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:15] T164185: Convert unique keys into primary keys for some wiki tables on s1, s2, s4 and s7 (eqiad) - https://phabricator.wikimedia.org/T164185 [11:31:10] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 1762 [11:31:15] !log running alter table on categorylinks on db1054, 68, 62 T164185 [11:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:26] T164185: Convert unique keys into primary keys for some wiki tables on s1, s2, s4 and s7 (eqiad) - https://phabricator.wikimedia.org/T164185 [11:40:03] eddiegp: so what exactly is the plan with the read-only msgs i saw the task comments and email but i am confused [11:53:20] PROBLEM - Check Varnish expiry mailbox lag on cp2022 is CRITICAL: CRITICAL: expiry mailbox lag is 734441 [12:11:54] 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224891 (10EddieGP) >>! In T164177#3224771, @Volans wrote: > The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it... [12:20:50] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 672631 [12:28:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [12:40:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:50:11] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), 13Patch-For-Review: Re-enable ORES data in action API - https://phabricator.wikimedia.org/T163687#3224914 (10Tgr) @Joe ORES functionality in api.php will be reenabled with the next train... 
[12:50:33] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 07Developer-notice, 05MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)): Re-enable ORES data in action API - https://phabricator.wikimedia.org/T163687#3224915 (10Tgr) [12:52:30] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 603815 msg: ocg_render_job_queue 3005 msg (=3000 critical) [12:52:30] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 603820 msg: ocg_render_job_queue 3009 msg (=3000 critical) [12:52:30] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 603820 msg: ocg_render_job_queue 3009 msg (=3000 critical) [12:58:33] !log cleaning ores_classification rows half an hour or so (T159753) [12:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:42] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [13:06:55] Amir1, may I ask you to pause ores_classification clean up until thursday? [13:07:03] jynus: sure [13:07:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:07:18] jynus: done [13:07:25] we have some alter tables that cannot be done after wednesday [13:07:37] and while it should not interact, they will be slower [13:08:01] and I think that is not that high priority now [13:08:12] I thought they won't affect each other otherwise I didn't do it [13:08:44] Nah, since it's already shrunk to 75% of its size it has some time to grow again [13:09:00] yes, but while an alter table is running, (long running query) UNDO entries are not cleaned up [13:09:01] (by some time I mean several months at least) [13:09:24] and that increases ibdata1 size and slows down queries in general [13:09:39] so I want to avoid non-essential writes during the alters [13:10:09] It's okay. I will continue on Thursday [13:10:10] I should be finished by tomorrow [13:10:10] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 614541 [13:10:31] but of course wednesday is failback, so I suggested thursday [13:10:43] Yeah [13:10:58] thank you for taking care of that, I appreciate it! [13:11:43] I should've done it way sooner [13:13:22] as a note- I asked on operations- list to be given priority on deployment-related tasks as some can only be done during eqiad-passive [13:13:52] I will not ask for priority any other week [13:25:16] Zppix: As I thought there is a general plan to move away from the need to merge things in gerrit first for switching DCs. So they're building switchdc upon conftool, which is way more dynamic and can serve a specific message instead of the general one. We're not sure if this will be done before the switchover on Wed, but if it isn't we'll do a workaround (changing both the switchdc script and mw-config) and look at a more general [13:25:16] solution afterwards.
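On jynus's point that a long-running ALTER TABLE keeps InnoDB UNDO entries from being purged and grows ibdata1: a generic way to watch that effect (illustrative only, not the exact commands used here) is to track the purge backlog and the shared tablespace size while the ALTERs run:

```bash
# "History list length" is the number of not-yet-purged undo records; it keeps
# growing while a long-running statement blocks the purge thread.
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep 'History list length'

# On default-configured installs the undo logs live in ibdata1, so its growth is
# visible directly (path assumed to be the stock Debian/MariaDB datadir).
ls -lh /var/lib/mysql/ibdata1
```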
[13:29:15] Ok [13:33:10] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:35:15] (03PS2) 10Ottomata: Update cron job copying mediawiki db into hdfs [puppet] - 10https://gerrit.wikimedia.org/r/350888 (https://phabricator.wikimedia.org/T163483) (owner: 10Joal) [13:35:20] (03CR) 10Ottomata: [V: 032 C: 032] Update cron job copying mediawiki db into hdfs [puppet] - 10https://gerrit.wikimedia.org/r/350888 (https://phabricator.wikimedia.org/T163483) (owner: 10Joal) [13:37:12] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224974 (10jcrespo) [13:42:21] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224978 (10jcrespo) https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-cli... [13:43:50] Hi, i think theres some code that has gone wrong in puppet/operations as i see this factpath = $vardir/lib/facter in the puppet config file [13:44:06] notice this $vardir. That looks like a variable and not an actual path [13:57:28] (03PS7) 10Andrew Bogott: Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) [13:59:14] (03CR) 10Andrew Bogott: [C: 032] Monitor wikitech-static mediawiki version [puppet] - 10https://gerrit.wikimedia.org/r/350920 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [14:03:26] (03PS2) 10Andrew Bogott: toollabs: ensure default hhvm service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/351124 (https://phabricator.wikimedia.org/T78783) (owner: 10BryanDavis) [14:05:35] (03CR) 10Andrew Bogott: [C: 032] toollabs: ensure default hhvm service is stopped [puppet] - 10https://gerrit.wikimedia.org/r/351124 (https://phabricator.wikimedia.org/T78783) (owner: 10BryanDavis) [14:07:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [14:08:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [14:27:12] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:29:02] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 608472 [14:36:56] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3225107 (10Andrew) The icinga alert warned me about slippage between 1.28.1 and 1.28.2. I updated wikitech-static, and now the... 
[14:39:02] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 4062 [14:52:41] (03CR) 10BBlack: [C: 031] Remove the citoid.wm.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/350748 (https://phabricator.wikimedia.org/T133001) (owner: 10Mobrovac) [14:52:59] (03CR) 10BBlack: [C: 031] Decom legacy citoid service hostname [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) (owner: 10Jforrester) [14:53:08] (03CR) 10Mobrovac: [C: 031] "GTG" [dns] - 10https://gerrit.wikimedia.org/r/350748 (https://phabricator.wikimedia.org/T133001) (owner: 10Mobrovac) [14:53:33] (03PS2) 10BBlack: Remove the citoid.wm.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/350748 (https://phabricator.wikimedia.org/T133001) (owner: 10Mobrovac) [14:55:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:56:17] (03CR) 10BBlack: [C: 032] Remove the citoid.wm.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/350748 (https://phabricator.wikimedia.org/T133001) (owner: 10Mobrovac) [14:56:56] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:58:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [15:01:04] (03PS1) 10Andrew Bogott: Wikitech-static: Include a doc link in the version warning. [puppet] - 10https://gerrit.wikimedia.org/r/351155 (https://phabricator.wikimedia.org/T163721) [15:01:56] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:06:47] 06Operations, 10ops-eqiad, 10DBA: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3225192 (10Cmjohnson) @marostegui I am not sure what to make of this...i know several servers have this issue but on the server itself ipmi works fine. [15:07:37] (03CR) 10Andrew Bogott: [C: 032] Wikitech-static: Include a doc link in the version warning. [puppet] - 10https://gerrit.wikimedia.org/r/351155 (https://phabricator.wikimedia.org/T163721) (owner: 10Andrew Bogott) [15:13:26] !log restarting varnish backend on cp2002 (mailbox issues) [15:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:12] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225208 (10Cmjohnson) This is what I found at http://www.dell.com/support/article/us/en/19/SLN285596/drac---how-to-set-fan-speed-offset-values-in-idrac7-witho... 
[15:15:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:16:56] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:21:19] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:20] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:20] PROBLEM - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:20] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 0 [15:27:29] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 639243 [15:27:39] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [15:29:29] PROBLEM - HP RAID on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:29:49] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:29:49] PROBLEM - configured eth on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:29:49] PROBLEM - DPKG on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:29:49] PROBLEM - dhclient process on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:29:49] PROBLEM - salt-minion processes on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:29:50] PROBLEM - Check size of conntrack table on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:30:09] PROBLEM - Elasticsearch HTTPS on elastic2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:30:29] PROBLEM - MD RAID on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:30:30] PROBLEM - Disk space on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:32:29] PROBLEM - SSH on bast3002 is CRITICAL: Server answer [15:32:39] PROBLEM - Check the NTP synchronisation status of timesyncd on elastic2020 is CRITICAL: Return code of 255 is out of bounds [15:32:46] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225272 (10jcrespo) ``` $ ssh db1059.eqiad.wmnet root@db1059:~$ cat /sys/class/thermal/thermal_zone*/temp 56000 49000 root@db1059:~$ megacli -AdpBbuCmd -GetB... 
[15:34:26] 06Operations, 10Citoid, 10ContentTranslation, 10ContentTranslation-CXserver, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#3225276 (10Jdforrester-WMF) [15:34:29] RECOVERY - SSH on bast3002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [15:39:19] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:39] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [15:40:09] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:42:19] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:56] !log shutting down db1063 for maintenance T164107 [15:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:06] T164107: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107 [15:51:19] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225325 (10Marostegui) I would support doing a switchover to db1049 just to be on the safe side for next week switchover [15:52:15] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3225326 (10Eevans) p:05Triage>03Normal Ping? [15:53:07] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225328 (10jcrespo) Manuel, I am looking at some options now with Chris (air flow, PSU, ...) , we (and I mean I) will think about that on Tuesday depending on... [16:03:48] (03PS3) 10BBlack: Decom legacy citoid service hostname [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) (owner: 10Jforrester) [16:05:56] (03CR) 10BBlack: [C: 032] Decom legacy citoid service hostname [puppet] - 10https://gerrit.wikimedia.org/r/350505 (https://phabricator.wikimedia.org/T133001) (owner: 10Jforrester) [16:07:09] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:07:29] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 1065 [16:10:09] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:12:59] PROBLEM - Check Varnish expiry mailbox lag on cp2026 is CRITICAL: CRITICAL: expiry mailbox lag is 663014 [16:17:01] !log mobrovac@naos Started deploy [citoid/deploy@747777f]: Remove mwDeprecated - T93514 [16:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:10] T93514: Update MAJOR version and remove mwDeprecated, url endpoint, and duplicate itemType: 'webpage' field publicationTitile - https://phabricator.wikimedia.org/T93514 [16:18:45] Hey, beta.wmflabs.org is "502 - Bad Gateway" [16:19:12] Hi, the shinken warnnings were sent to -releng [16:19:20] !log mobrovac@naos Finished deploy [citoid/deploy@747777f]: Remove mwDeprecated - T93514 (duration: 02m 19s) [16:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:32] PROBLEM - MariaDB Slave Lag: s5 on db1092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 732.04 seconds [16:21:32] PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 
800.22 seconds [16:21:32] PROBLEM - MariaDB Slave Lag: s5 on db1049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 731.25 seconds [16:21:32] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 727.95 seconds [16:21:33] PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 758.95 seconds [16:21:33] PROBLEM - MariaDB Slave Lag: s5 on db1071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 724.86 seconds [16:21:33] PROBLEM - MariaDB Slave Lag: s5 on db1063 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 715.08 seconds [16:21:44] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [16:21:54] PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 733.97 seconds [16:22:45] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:23:02] PROBLEM - MariaDB Slave Lag: s5 on db1087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 519.94 seconds [16:23:07] Yeah, I just saw. [16:23:18] oh, that should not happen [16:23:47] this is a monitoring glitch [16:23:53] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused [16:23:54] PROBLEM - Check systemd state on labsdb1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:24:06] downtimes have been wiped [16:24:13] PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused [16:24:35] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:24:36] PROBLEM - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:24:44] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:24:54] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused [16:25:12] PROBLEM - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (402304s 200000s) [16:25:14] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:25:15] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (402304s 200000s) [16:25:22] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [16:25:22] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 108, down: 1, dormant: 0, excluded: 3, unused: 0BRge-11/0/2: down - frdb1002BR [16:25:32] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 220.31 seconds [16:25:32] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused [16:25:52] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:26:02] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:26:02] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [16:26:02] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:26:02] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:26:22] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdc] [16:26:35] jynus: You are talking about the db lag, aren't you? [16:27:00] yes [16:27:02] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group [16:27:02] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f377cee1950: Failed to establish a new connection: [Errno 111] Connection refused,)) [16:27:02] RECOVERY - MariaDB Slave Lag: s5 on db1087 is OK: OK slave_sql_lag Replication lag: 7.48 seconds [16:27:03] I thought that phab2001 phd being stopped is expected? Why is icinga sending warnnings about that. 
[16:27:12] PROBLEM - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 [16:27:13] ACKNOWLEDGEMENT - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164202 [16:27:17] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164202#3225389 (10ops-monitoring-bot) [16:27:32] RECOVERY - MariaDB Slave Lag: s5 on db1092 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:27:32] RECOVERY - MariaDB Slave Lag: s5 on db1049 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [16:27:32] RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Replication lag: 9.37 seconds [16:27:33] RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:27:33] RECOVERY - MariaDB Slave Lag: s5 on db1071 is OK: OK slave_sql_lag Replication lag: 0.01 seconds [16:27:33] RECOVERY - MariaDB Slave Lag: s5 on db1063 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:28:02] RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:33:52] PROBLEM - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [16:34:09] 06Operations, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225440 (10jcrespo) [16:34:35] 06Operations, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225452 (10jcrespo) The issue seems to match the restart of the service. [16:38:41] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225462 (10jcrespo) ``` root@db1063:~$ megacli -AdpBbuCmd -GetBbuStatus -a0 | grep Temperature Temperature: 49 C Temperature :... [16:43:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:43:22] RECOVERY - Check Varnish expiry mailbox lag on cp2022 is OK: OK: expiry mailbox lag is 0 [16:46:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:48:14] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225472 (10jcrespo) 05Open>03Resolved a:03Cmjohnson Executed: ``` megacli -LDSetProp -NoCachedBadBBU -Immediate -Lall -aAll ``` ``` $ megacli -LDInf... 
[16:49:04] (03PS1) 10BBlack: Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/351161 [16:49:11] (03PS2) 10BBlack: Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/351161 [16:49:59] paladox: either the downtime expired or sometimes icinga forgets about them [16:50:25] oh [16:50:29] thanks [16:50:51] paladox: also, the best solution is we make sure it is not even adding it in the first place "if on passive server" [16:50:52] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 0 [16:50:57] like we did for other checks [16:51:04] Yep [16:51:20] even better than that would be if it can just run on both [16:51:30] the service [16:52:18] !log bblack@neodymium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=cache_upload,name=cp107[1234].eqiad.wmnet [16:52:24] yep [16:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:28] !log reverting inter-caching routing from codfw-switchover period: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Switchback [16:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:47] (03CR) 10BBlack: [C: 032] Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/351161 (owner: 10BBlack) [16:53:52] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3225479 (10Marostegui) Excellent news!! Thanks guys for working this out successfully!! [16:54:19] paladox: so.. the "is phd running" and " [16:54:31] oh [16:54:33] "is phd monitoring other processes" checks.. we already fixed that [16:54:50] Yep [16:54:51] they do not exist on phab2001 because Icinga does not add them because it knows phab2001 is not active [16:55:07] yep [16:55:13] the remaining problem is the generic "Check systemd state" check that is on everything [16:55:40] Ah, so that one needs the passive check? [16:55:44] that isn't specific to phd but it notices that there is a service in "degraded" state [16:56:10] it doesn't like that there is "degraded" in there.. as opposed to "stopped" ? [16:56:32] oh [16:56:40] ● phab2001 [16:56:40] State: degraded [16:56:46] Failed: 1 units [16:57:25] ah [16:57:36] which unit is failed? phd? [16:58:41] that is surprisingly non-obvious in the output of "systemctl status". but "systemctl status phd" says "failed". so ..yea [16:59:06] (code=exited, status=143) [16:59:09] ok thanks [16:59:15] it exited with code 143, whatever that is [16:59:36] and it has been in that state for 2 months and 12 days. so not new, just Icinga forgot downtime or it expired [17:00:51] I wonder how to get it to properly stop the service [17:01:20] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn phd not running on passive server, currently failed with status 143 [17:02:38] RECOVERY - Check systemd state on phab2001 is OK: OK - running: The system is fully operational [17:02:53] :p ^ [17:03:03] (03PS1) 10Joal: Update cron job copying mediawiki db into hdfs (2) [puppet] - 10https://gerrit.wikimedia.org/r/351162 (https://phabricator.wikimedia.org/T163483) [17:03:17] ottomata: --^ if you have time [17:03:50] !log phab2001 - start/stop phd service - that fixed "systemd state" icinga check, even though phd does not run just like before [17:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:10] paladox: somehow the good old "turn it on and off" again [17:05:07] the general systemctl status is now "Failed: 0 units" and State: running. the status of phd specifically is still "exited, fail, 143" , unchanged [17:05:46] and the reason for 143 seems to be that the "ExecStartPre" fails, which is "/bin/mkdir /var/run/phd" [17:07:05] https://phabricator.wikimedia.org/P5355 [17:07:07] (03PS1) 10Ayounsi: Add pfw-*:lo0 to Smokeping [puppet] - 10https://gerrit.wikimedia.org/r/351163 [17:08:31] paladox: anyways,Icinga is happy and we are aware of it and then it would be about becoming active/active [17:08:46] which needs more clustering config changes [17:09:39] (03CR) 10Ayounsi: [C: 032] Add pfw-*:lo0 to Smokeping [puppet] - 10https://gerrit.wikimedia.org/r/351163 (owner: 10Ayounsi) [17:10:09] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 0 [17:10:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:15:35] 06Operations, 06Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3225543 (10Dzahn) [17:16:01] 06Operations, 06Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3225556 (10Dzahn) [17:17:09] 06Operations, 06Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3225543 (10Dzahn) {F7853620} [17:18:02] 06Operations, 06Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3225561 (10Dzahn) [17:19:22] (03PS8) 10Ejegg: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [17:19:43] andrewbogott: labnet1001 - 0 processes with command name 'python', args 'nova-fullstack' - should that be started or acked/removed ? [17:20:17] (03PS9) 10Ejegg: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [17:21:18] (03CR) 10Ejegg: "ok, makes sense to keep the notification icon changes. 
Jforrester, they're back in PS8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [17:21:38] there are a couple other labs-related ones: labsdb1006/labstore2001 (systemd state), silver/labtestweb2001 (wt-static sync) (yea, that last one is because of work on wt-static, right) [17:24:30] (03PS1) 10Jdlrobson: Correction to config definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) [17:25:12] 10Blocked-on-Operations, 06Operations, 10Graphite, 06WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#947062 (10Dzahn) graphite1003 is alerting in Icinga because of disk space 99% used on /var/lib/carbon but because that's 1.6T , 99%... [17:25:33] ACKNOWLEDGEMENT - Disk space on graphite1003 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 19892 MB (1% inode=91%): daniel_zahn https://phabricator.wikimedia.org/T85451 [17:26:56] ACKNOWLEDGEMENT - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdc] daniel_zahn https://phabricator.wikimedia.org/T163690 [17:27:05] ACKNOWLEDGEMENT - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. daniel_zahn https://phabricator.wikimedia.org/T163690 [17:27:08] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:27:21] papaul: ^ :)) [17:29:08] Oh, elastic2020 is out of downtime? [17:29:18] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] ACKNOWLEDGEMENT - Check systemd state on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on elastic2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] ACKNOWLEDGEMENT - DPKG on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] ACKNOWLEDGEMENT - Disk space on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] ACKNOWLEDGEMENT - Elasticsearch HTTPS on elastic2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:19] mutante: can you please turn all the notifications off [17:29:20] ACKNOWLEDGEMENT - HP RAID on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:20] ACKNOWLEDGEMENT - MD RAID on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:21] ACKNOWLEDGEMENT - configured eth on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:21] ACKNOWLEDGEMENT - dhclient process on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:21] it seems like icinga forgot things [17:29:22] ACKNOWLEDGEMENT - puppet last run on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:22] ACKNOWLEDGEMENT - salt-minion processes on elastic2020 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T149006 [17:29:31] papaul: gehel: just did [17:29:54] it seems icinga forgot a few scheduled downtimes today [17:29:57] yes [17:29:59] I filed [17:30:04] ok, thanks [17:30:24] T164206 [17:30:25] T164206: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206 [17:31:00] so it forgets on service restarts.. but normally we never restart but just reload.. or so [17:31:04] it eems [17:31:06] seems [17:31:37] ok, I re-downtimed elastic2020... I'll check it in detail tomorrow and add it back to the cluster if it looks right [17:32:58] RECOVERY - Check Varnish expiry mailbox lag on cp2026 is OK: OK: expiry mailbox lag is 0 [17:33:26] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225607 (10Dzahn) [17:34:10] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225613 (10Dzahn) It seems a service restart causes this but usually we just do reloads on config changes (via puppet) and not hard restarts. 
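For reference, the phab2001 "Check systemd state" situation discussed above can be reproduced with plain systemd tooling; nothing here is WMF-specific beyond the phd unit name:

```bash
systemctl is-system-running   # prints "degraded" when one or more units have failed
systemctl --failed            # lists the failed unit(s), in this case phd
systemctl status phd          # shows "Active: failed" plus the exit status (143 = 128+15, i.e. SIGTERM)
journalctl -u phd -n 50       # recent unit logs, e.g. the failing ExecStartPre /bin/mkdir
```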
[17:34:25] (03Abandoned) 10BBlack: block ancient chrome [puppet] - 10https://gerrit.wikimedia.org/r/350098 (owner: 10BBlack) [17:36:56] ACKNOWLEDGEMENT - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T164011 [17:36:56] ACKNOWLEDGEMENT - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T164011 [17:38:09] mutante: thanks, I will look [17:38:13] (03PS1) 10Ayounsi: Remove asw-d-eqiad from monitoring - T148506 [puppet] - 10https://gerrit.wikimedia.org/r/351167 [17:38:24] ACKNOWLEDGEMENT - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T158196 [17:38:52] andrewbogott: cool:) also might be because icinga forgot downtimes [17:39:04] no, it's a real problem I think [17:39:12] If the test agent fails too many times it quits [17:39:26] ok, and that's why we want the cruft gone then :) [17:39:31] to see the signal [17:40:35] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T163346 [17:42:13] (03CR) 10Jforrester: [C: 031] Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [17:48:40] (03CR) 10Ayounsi: [C: 032] Remove asw-d-eqiad from monitoring - T148506 [puppet] - 10https://gerrit.wikimedia.org/r/351167 (owner: 10Ayounsi) [17:50:04] nice ^ [17:51:37] ACKNOWLEDGEMENT - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] daniel_zahn https://phabricator.wikimedia.org/T83811 [17:53:20] anything known about "mc1018". that's down since about 5 days but nothing in SAL or phab afaict [17:53:36] (last host that is still an alert now) [17:57:18] PROBLEM - Check correctness of the icinga configuration on tegmen is CRITICAL: Icinga configuration contains errors [17:57:53] who deployed to icinga? [17:58:45] jynus: probably the network-check stuff [17:59:32] but I donno [17:59:37] Error: 'asw-d-eqiad' is not a valid parent for host 'mc1018' (file '/etc/icinga/puppet_hosts.cfg', line 10478)! [17:59:41] Error: 'asw-d-eqiad' is not a valid parent for host 'restbase1018' (file '/etc/icinga/puppet_hosts.cfg', line 19413)! 
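When the "Check correctness of the icinga configuration" alert fires like this, the same errors can be reproduced and located by hand; a sketch using the validator that also appears later in this log (the grep is only illustrative):

```bash
# Re-run the configuration check that produced the "not a valid parent" errors.
sudo icinga -v /etc/icinga/icinga.cfg

# Locate the stale parent references in the puppet-generated host config.
grep -n 'asw-d-eqiad' /etc/icinga/puppet_hosts.cfg
```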
[17:59:51] yeah it is the dependencies alex added [17:59:59] ah, i know about the parent thing though [18:00:02] which are really nice [18:00:02] i can fix that [18:00:13] but makes things more complex :-) [18:00:50] also that responds to your question about mc1018 [18:01:19] BTW, Warning: Duplicate definition found for service 'keystone http' on host 'labtestcontrol2001' (config file '/etc/icinga/puppet_services.cfg', starting on line 214209) [18:01:26] Warning: Duplicate definition found for service 'keystone http' on host 'labcontrol1001' (config file '/etc/icinga/puppet_services.cfg', starting on line 206700) [18:01:36] also, it explains why you can see critical services on mc1018 but it's not spamming us here [18:01:43] because the "parent" thing works as we want it [18:01:45] heh [18:02:05] i added that stuff first avoid the alerts we had when dns-recursors were re-installed, bblack [18:02:07] andrewbogott, ^ recongnize some of those words and they look like something you would know [18:02:37] mutante, but that server is most likely not there, so the real reason why it is down is still unknown [18:02:42] jynus: I will look! [18:02:46] Probably my mess :) [18:02:51] not very important [18:02:55] only a warning [18:03:05] jynus: yes, i'll check that too [18:03:29] sorry, I didn't want to "tell you stuff", I just was checking the config [18:03:41] mutante: what controls adding switch dependencies? [18:03:45] !log restarting nova-fullstack tests but saving instance 2d60e8c5-fb2a-4681-ac0a-ae2162bb13fb for future research [18:03:52] I am not here, btw [18:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:12] jynus: can I reproduce that warning by forcing a puppet run on einsteinium? [18:04:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:04:19] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:04:35] andrewbogott, don't asm me, I said I am not here :-) [18:04:52] ok :) [18:05:13] bblack: i added parent depedency for dnsrec @ https://gerrit.wikimedia.org/r/#/c/347984/ but not for switches, just learned myself that Alex added those too.. 
looking where it is [18:06:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:06:10] * andrewbogott assumes all southern euro coworkers are wearing black bandanas, throwing bricks today [18:06:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:06:36] andrewbogott: sudo icinga -v /etc/icinga/icing.cfg [18:06:51] icinga.cfg [18:06:54] thanks [18:07:44] https://gerrit.wikimedia.org/r/#/c/343619/ [18:07:54] bblack: ^ i think that [18:08:06] or somewhere around there [18:08:41] https://gerrit.wikimedia.org/r/#/c/343619/3/modules/netops/manifests/check.pp [18:09:29] the "parents" option lets Icinga use the reachability logic https://docs.icinga.com/latest/en/networkreachability.html [18:10:15] well yeah but the change did remove asw-d-eqiad from that list [18:10:33] I was assuming we had some automagic somewhere that had added asw-d-eqiad as a parent of hosts on vlans that use it or something [18:10:38] it seems that problem was temp [18:10:43] i dont see that error anymore now [18:10:46] ok [18:11:10] well, maybe that automagic is a future task then :) [18:11:16] just the other issue with a duplicate defitioon on of "keyston http" [18:11:21] yea :) [18:11:34] as data: add a list of vlans to each switch, and get vlan => subnet mapping that we already have and mash it together :) [18:12:26] ( modules/network/data/data.yaml ) [18:12:43] then we can do a puppet-time lookup of hostip => supporting switch and mark that dependency in icinga [18:13:07] that sounds good:) yea [18:13:09] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:13:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:13:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:15:26] oh wait, error still there. i'll try manually fixing it and running puppet [18:15:46] and the duplicate definiton thing is just a warning, that doesnt keep icinga from restarting [18:17:18] RECOVERY - Check correctness of the icinga configuration on tegmen is OK: Icinga configuration is correct [18:17:45] (03PS1) 10Andrew Bogott: More detailed names for keystone api monitoring checks. [puppet] - 10https://gerrit.wikimedia.org/r/351171 [18:17:54] mutante: ^ should fix that warning I think [18:19:06] !log manually removed asw-d-eqiad remnants from /etc/icinga/puppet_hosts.cfg to fix icinga config after gerrit:351167 / T148506. fixes Icinga config error. then puppet adds it back [18:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:15] T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 [18:19:48] bblack: puppet actively adds that back: + parents asw-d-eqiad [18:20:08] (03CR) 10Dzahn: [C: 031] More detailed names for keystone api monitoring checks. [puppet] - 10https://gerrit.wikimedia.org/r/351171 (owner: 10Andrew Bogott) [18:20:38] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements responds with malformed body: list index out of range [18:20:42] (03CR) 10Andrew Bogott: [C: 032] More detailed names for keystone api monitoring checks. 
[puppet] - 10https://gerrit.wikimedia.org/r/351171 (owner: 10Andrew Bogott) [18:20:43] andrewbogott: seems good, and that was just a warning. it didn't break it [18:20:46] thanks [18:21:09] yeah, I realize this is a distraction from the actual problem you're working on :) [18:21:17] just reminds of a "duplicate definition" in puppet [18:21:38] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [18:21:51] no, no the overall problem was to clean up Icinga and that also helps :) [18:24:06] (03PS1) 10Ottomata: Keep 90 days of refined webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/351172 [18:27:18] (03PS10) 10Ejegg: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [18:27:22] my fault for not beeing clear [18:28:37] !log powercycling mc1018 [18:28:40] (03CR) 10Ejegg: "Votewiki update restored per dstrine's meeting with comms this morning." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [18:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:18] PROBLEM - Check correctness of the icinga configuration on tegmen is CRITICAL: Icinga configuration contains errors [18:30:09] mutante: I still don't really comprehend where puppet/icinga is getting asw-d-eqiad from, unless it's just temporary issues until puppet runs everywhere related in a certain order [18:30:48] RECOVERY - Host mc1018 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [18:31:12] oh, it's only mc1018 and restbase1018 [18:31:27] bblack: I was now suspecting something like the resource is a stored resource/ in puppetdb, and needs to expire or something [18:31:28] I think it's relying on those hosts running the agent to update the central stuff [18:31:31] right [18:31:37] i just got mc1018 back [18:31:45] simply powercycle and no error.... [18:32:08] running puppet on it and then on icinga again [18:33:10] restbase1018 is up but puppet is disabled because T163292 [18:33:11] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [18:33:33] - parents asw-d-eqiad [18:33:33] + parents asw2-d-eqiad [18:34:23] and now we have just 1 error instead of 2 .. yea [18:35:10] yeah something's broken about that whole data model (that puppet has to run on a host to generate updates to the host's monitoring metadata) [18:35:19] but I'm sure there's no easy fix and it's this way for lots of good reasons, too [18:35:34] (because of how puppet facts and collections work, etc) [18:35:42] !log brought mc1018 back up, ran puppet on it and then on Icinga. parent was adjusted from asw-d-eqiad to asw2-2-eqiad. reduced icinga config errors by 50% :p (1 of 2 left, restbase1018) [18:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:41] mobrovac: hi, i see you disabled puppet on restbase1018 because of the broken disk. i also see it's not pooled in confctl, would it be bad to enable puppet and run it just once? [18:44:59] (03CR) 10Ottomata: [C: 032] Keep 90 days of refined webrequest data [puppet] - 10https://gerrit.wikimedia.org/r/351172 (owner: 10Ottomata) [18:46:45] !log temp. re-enabling puppet on restbase1018 and running it once to fix icinga config syntax error. then disabling it again. restbase service stopped before and after. this box has a broken disk. 
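The one-off run described in the !log entry just above corresponds to the standard puppet agent workflow; a sketch only, since the exact commands and disable message used are not in the log:

```bash
sudo puppet agent --enable                              # lift the existing disable
sudo puppet agent --test                                # single foreground run, applies the catalog once
sudo puppet agent --disable "broken disk, see T163292"  # disable again, with a reason
```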
[18:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:59] that fixes Icinga. it's ok again [18:51:28] Thank you. [18:52:30] (03CR) 10Dzahn: "We noticed an issue today. when asw-d-eqiad was removed from monitoring in https://gerrit.wikimedia.org/r/#/c/351167/, Icinga failed to re" [puppet] - 10https://gerrit.wikimedia.org/r/343619 (owner: 10Alexandros Kosiaris) [18:54:51] (03PS2) 10Ottomata: Update cron job copying mediawiki db into hdfs (2) [puppet] - 10https://gerrit.wikimedia.org/r/351162 (https://phabricator.wikimedia.org/T163483) (owner: 10Joal) [18:54:57] (03CR) 10Ottomata: [V: 032 C: 032] Update cron job copying mediawiki db into hdfs (2) [puppet] - 10https://gerrit.wikimedia.org/r/351162 (https://phabricator.wikimedia.org/T163483) (owner: 10Joal) [18:59:15] RECOVERY - Check correctness of the icinga configuration on tegmen is OK: Icinga configuration is correct [19:00:59] mutante: no, the disk array still needs to be rebuilt [19:02:11] mobrovac: ok. yea. it was just.. i needed to run puppet once on it to fix an unrelated issue with icinga. it had to update which switch is its "parent" [19:02:21] oh ok [19:02:25] thnx mu [19:02:28] and it's now disabled again and the service was stopped before and after [19:02:29] mutante: thnx [19:02:33] sure, np [19:07:03] madhuvishy: does it seem right that people need the "researchers" admin group (shell) to get access to "SWAP" (i didnt realize SWAP is "PAWS internal", is it?). https://phabricator.wikimedia.org/T164060 [19:08:31] mutante: yes, researchers or analytics-privatedata-users depending on if they want mysql/hadoop or both [19:08:38] !log mobrovac@naos Started deploy [mobileapps/deploy@b5afcb8]: Forced deploy to bring the targets to the current version [19:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:45] madhuvishy: thanks! ok [19:09:37] mutante: also, yes I renamed it because there was a lot of confusion with PAWS. https://wikitech.wikimedia.org/wiki/SWAP. I'll try to get out some better docs/announcement soon [19:09:45] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:45] !log mobrovac@naos Finished deploy [mobileapps/deploy@b5afcb8]: Forced deploy to bring the targets to the current version (duration: 02m 08s) [19:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:45] (03PS1) 10Thcipriani: WIP: scap: Add a scap::master profile [puppet] - 10https://gerrit.wikimedia.org/r/351179 [19:13:08] madhuvishy: alright:) cool [19:15:04] (03PS1) 10Dzahn: admins: add phuedx to researchers for SWAP access [puppet] - 10https://gerrit.wikimedia.org/r/351180 (https://phabricator.wikimedia.org/T164060) [19:19:09] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3226013 (10ayounsi) @Cmjohnson you're free to decommission/unrack asw-d-eqiad. [19:38:46] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:10:16] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3226100 (10ayounsi) Diff from paste above pushed to mr1-codfw. @papaul, let me know when we can sync-up to configure the AP. 
[20:57:20] (03Abandoned) 10Dzahn: admins: add phuedx to researchers for SWAP access [puppet] - 10https://gerrit.wikimedia.org/r/351180 (https://phabricator.wikimedia.org/T164060) (owner: 10Dzahn) [20:58:50] (03PS1) 10Jcrespo: Add the possibility of filter root and dump user requests [software/tendril] - 10https://gerrit.wikimedia.org/r/351194 [21:01:29] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3226286 (10Ottomata) @Cmjohnson Cool, I've announced the date. Let's do it. [21:01:46] (03PS1) 10Niharika29: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 [21:16:21] (03PS1) 10Madhuvishy: swap: Add analytics-privatedata-users to allowed user groups for notebook access [puppet] - 10https://gerrit.wikimedia.org/r/351201 [21:17:09] mutante: https://gerrit.wikimedia.org/r/#/c/351201 will fix the issue [21:17:37] etoomanygroups, I had them wrong [21:18:48] (03CR) 10Madhuvishy: [C: 032] swap: Add analytics-privatedata-users to allowed user groups for notebook access [puppet] - 10https://gerrit.wikimedia.org/r/351201 (owner: 10Madhuvishy) [21:22:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3220328 (10RobH) So, it seems that this is more of a troubleshooting access, rather than requesting further access. The initial request states, and I can confirm,... [21:22:42] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3226320 (10jcrespo) This wednesday is the failover, do you really want to do it then? We may need Chris or me to put things down and we may be unavailable? [21:23:22] (03Restored) 10RobH: admins: add phuedx to researchers for SWAP access [puppet] - 10https://gerrit.wikimedia.org/r/351180 (https://phabricator.wikimedia.org/T164060) (owner: 10Dzahn) [21:23:58] the entire analytics-statistics-researchers group naming is confusing as hell [21:24:08] it is long overdue to be overhauled and replaced. [21:25:06] * robh isnt saying that in condemnation, cuz the analytics services of wmf have been growing over time for quite awhile, and organic growth results in these issues [21:26:13] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3226326 (10madhuvishy) @phuedx Can you check now? I had the groups on notebook100* wrong, but having analytics-privatedata-users access should get you into noteboo... [21:26:26] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 652702 [21:27:43] robh: yeah - I fixed the perms on the notebook servers, so access to researchers shouldn't be needed. I commented on task [21:28:01] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3226328 (10jcrespo) > That would put us into next FY Q1 for decom of db1047. Let's buy 2 new servers, and reassign them if the functionality is replaced. Knowing typical medi... [21:28:02] oh, so his access will fix itself now? 
[21:28:10] cuz thats easy on me ;D [21:28:17] robh: it will :) [21:28:38] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3226332 (10Mattflaschen-WMF) [21:28:56] I didn't wanna assume and just change the groups allowed on servers, but commented about it [21:28:58] thx for fixing =] [21:29:18] robh: there was a task about renaming those iirc [21:30:41] robh: np :) thanks for looking into it [21:30:59] 06Operations, 10Ops-Access-Requests, 10Deployment-Systems: Enable keyholder for ORES deployments - https://phabricator.wikimedia.org/T163939#3226339 (10RobH) 05Open>03Resolved a:03RobH It appears this is merely an issue on how to use the software, but not an access request to the software itself. @ako... [21:31:29] (03PS2) 10Jcrespo: Add the possibility of filtering root and dump user requests [software/tendril] - 10https://gerrit.wikimedia.org/r/351194 [21:33:13] (03CR) 10Jcrespo: [V: 032 C: 032] Add the possibility of filtering root and dump user requests [software/tendril] - 10https://gerrit.wikimedia.org/r/351194 (owner: 10Jcrespo) [21:35:40] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3226349 (10tstarling) >>! In T156924#3224786, @Volans wrote: > Is there an easy way I could... [21:45:11] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3226386 (10Ottomata) OO interesting. Yeah @cmjohnson that IS a bad time. When else do you wanna? [21:46:02] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3226387 (10Ottomata) Hm, ok, fine with me. Should we not do T159266 then and just wait for new boxes? [21:46:11] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3226388 (10Cmjohnson) @ottomata: is there a better time this week or do you push it out to next week? [21:49:34] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3226395 (10Krinkle) >>! In T156924#3224690, @tstarling wrote: > Did the following testing: [... [21:54:41] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3226400 (10jcrespo) No, let's do that still- assuming no parts will be bought. No harm on a quick reboot and we will not by anything until next fiscal year (months away). 
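
The notebook-access fix discussed above (https://gerrit.wikimedia.org/r/#/c/351201) works by adding analytics-privatedata-users to the set of groups allowed on the notebook hosts. Purely as an illustration of that kind of allowed-groups gate, and not the actual SWAP configuration, the check amounts to:

import grp
import pwd

# Group names taken from the discussion above; the gate itself is illustrative.
ALLOWED_GROUPS = {"researchers", "analytics-privatedata-users"}

def user_groups(username: str) -> set:
    """All POSIX groups the user belongs to (primary plus supplementary)."""
    primary = grp.getgrgid(pwd.getpwnam(username).pw_gid).gr_name
    supplementary = {g.gr_name for g in grp.getgrall() if username in g.gr_mem}
    return {primary} | supplementary

def may_use_notebooks(username: str) -> bool:
    """True if the user is in at least one of the allowed groups."""
    return bool(user_groups(username) & ALLOWED_GROUPS)
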
[21:57:05] (03Abandoned) 10RobH: admins: add phuedx to researchers for SWAP access [puppet] - 10https://gerrit.wikimedia.org/r/351180 (https://phabricator.wikimedia.org/T164060) (owner: 10Dzahn) [21:59:24] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3226404 (10chasemp) [22:03:56] (03PS2) 10Dzahn: List disabled user accounts with associated open tasks in weekly Phab email [puppet] - 10https://gerrit.wikimedia.org/r/351011 (https://phabricator.wikimedia.org/T157740) (owner: 10Aklapper) [22:04:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3226414 (10RobH) 05Open>03Resolved a:03RobH So this was resolved by the merge of https://gerrit.wikimedia.org/r/#/c/351201 [22:06:25] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 39 [22:27:55] (03CR) 10Dzahn: [C: 032] List disabled user accounts with associated open tasks in weekly Phab email [puppet] - 10https://gerrit.wikimedia.org/r/351011 (https://phabricator.wikimedia.org/T157740) (owner: 10Aklapper) [22:31:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3226479 (10Dzahn) thanks all :) @phuedx for the extra details, @madhuvishy and @robh for fixing that. [22:36:04] (03CR) 10Dzahn: "deployed. works. i ran it once manually and there is now "DISABLED USER ACCOUNTS WITH OPEN TASKS ASSIGNED:" info in it. result can be seen" [puppet] - 10https://gerrit.wikimedia.org/r/351011 (https://phabricator.wikimedia.org/T157740) (owner: 10Aklapper) [22:43:43] (03CR) 10Dzahn: [C: 031] "lgtm. i feel like i'd also move the CSS into a separate .css file." [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [22:47:23] (03CR) 10Krinkle: "Hm.. in that case we should probably create a function for this that can do the substitution of the css variable internally instead of req" [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [22:50:00] (03CR) 10Dzahn: [C: 031] "gotcha. ok. didn't mean to make it more complicated. seems alright as it is." [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [22:51:03] (03CR) 10Krinkle: "Yeah, I'll definitely consider it though. 
But until we have a function for this, I'd rather keep it all in one file so that I can prioriti" [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [23:01:34] (03PS1) 10Dzahn: install_server: add netmon1002 to DHCP, partman [puppet] - 10https://gerrit.wikimedia.org/r/351209 (https://phabricator.wikimedia.org/T159756) [23:03:03] (03PS2) 10Dzahn: install_server: add netmon1002 to DHCP, partman [puppet] - 10https://gerrit.wikimedia.org/r/351209 (https://phabricator.wikimedia.org/T159756) [23:05:23] (03PS3) 10Dzahn: install_server: add netmon1002 to DHCP, partman [puppet] - 10https://gerrit.wikimedia.org/r/351209 (https://phabricator.wikimedia.org/T159756) [23:05:27] (03CR) 10Dzahn: [C: 032] install_server: add netmon1002 to DHCP, partman [puppet] - 10https://gerrit.wikimedia.org/r/351209 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [23:06:00] !log Ran puppet cert clean striker-deploy03.striker.eqiad.wmflabs on labcontrol1001 [23:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:18] ACKNOWLEDGEMENT - are wikitech and wt-static in sync on labtestweb2001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (426703s 200000s) andrew bogott This is Andrew working on new Horizon features... puppet will be off for quite a while. [23:08:14] (03PS1) 10Madhuvishy: sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) [23:08:48] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3226606 (10Dzahn) We have "wikitech-static CRIT - wikitech and wikitech-static out of sync (426228s > 200000s)" alerts on silve... [23:10:03] 06Operations, 06Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1347928 (10Dzahn) 05Resolved>03Open re-using the ticket again for the same issue. we have an alert https://icinga.wikimedia.org/cgi-bin/i... [23:10:16] 06Operations, 06Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#3226610 (10Dzahn) p:05High>03Normal [23:10:39] (03PS2) 10Madhuvishy: sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) [23:15:02] !log netmon1002 - boot into PXE, initial OS install (T159756) [23:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:10] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [23:21:35] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 10231 MB (3% inode=98%) [23:34:34] ^ yea, but that's still almost 10GB [23:41:10] !log netmon1002 - signed puppet cert, initial puppet run, accept salt-key,.. 
(T159756) [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:20] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [23:42:20] (03PS7) 10Tim Starling: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [23:46:58] (03CR) 10Tim Starling: [C: 032] conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [23:59:55] (03PS1) 10Dzahn: add IPv6 for netmon1002, forward and reverse records [dns] - 10https://gerrit.wikimedia.org/r/351221 (https://phabricator.wikimedia.org/T159756)
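
The "wikitech and wt-static in sync" alert above compares the time since the last successful sync against a 200000s threshold. A minimal sketch of that kind of freshness check, assuming the last-sync time is read from a marker file (a hypothetical path, not necessarily how the production check obtains it):

import os
import sys
import time

THRESHOLD_SECONDS = 200000                           # threshold from the alert text above
MARKER_FILE = "/var/lib/wikitech-static/last-sync"   # hypothetical marker file

def main() -> int:
    try:
        age = time.time() - os.path.getmtime(MARKER_FILE)
    except OSError as exc:
        print("UNKNOWN - cannot read sync marker: %s" % exc)
        return 3
    if age > THRESHOLD_SECONDS:
        print("CRIT - wikitech and wikitech-static out of sync (%.0fs > %ds)" % (age, THRESHOLD_SECONDS))
        return 2
    print("OK - last sync %.0fs ago" % age)
    return 0

if __name__ == "__main__":
    sys.exit(main())
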