[00:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191204T0000).
[00:00:05] niedzielski: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:13] o/
[00:05:52] Hey Urbanecm! Are you conducting today?
[00:12:32] (PS1) Dzahn: phabricator: switch failover server to phab1001 [puppet] - https://gerrit.wikimedia.org/r/554390 (https://phabricator.wikimedia.org/T238956)
[00:13:48] (PS2) Dzahn: phabricator: switch failover server to phab1001 [puppet] - https://gerrit.wikimedia.org/r/554390 (https://phabricator.wikimedia.org/T238956)
[00:14:34] (CR) Dzahn: [C: +2] phabricator: switch failover server to phab1001 [puppet] - https://gerrit.wikimedia.org/r/554390 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[00:14:59] * niedzielski looks at deployers list
[00:15:33] Amir1 Lucas_WMDE_ awight any chance one of you are around and can do a SWAT?
[00:16:32] I don't think I can deploy atm. I'm too drunk
[00:16:46] It's 1am here anyway
[00:17:18] lol, ok maybe I'll have to reschedule :]
[00:19:07] Let me see if I can make it work on my private laptop
[00:19:13] niedzielski: ^
[00:20:23] Amir1, if you (and your laptop) are up for it, that'd be great!
[00:21:07] Amir1: BORING
[00:21:09] I can do it
[00:21:20] [✓] rsyncing /srv/repos from phab1003 to phab1001 (T238956)
[00:21:21] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[00:21:40] :)))
[00:21:57] Thanks, Reedy!
[00:22:09] I leave you to it, i have my yubikey, I just need to set it up in my private laptop
[00:27:32] [✓] disable puppet and "systemctl stop phd" on phab1001 and phab1003 (T238956)
[00:27:33] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[00:29:07] c'mon jerkins
[00:38:46] (PS3) Dzahn: dumps/phabricator: switch dumps host from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552593 (https://phabricator.wikimedia.org/T238956)
[00:40:34] (CR) Dzahn: [C: +2] dumps/phabricator: switch dumps host from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552593 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[00:40:54] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.8/skins/Vector/includes/templates/SearchComponent.mustache: I9776a3c355081dc5fec7753edf256f55dfe6045b (duration: 01m 01s)
[00:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:17] !log switching phabricator to read-only mode
[00:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:17] [✓] switching phabricator dumps host (on dumps servers) from phab1003 to phab1001 (T238956)
[00:42:18] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[00:47:27] Reedy: OK, I'm seeing the changes.
[00:47:56] Reedy: This looks good to me.
[00:49:47] coool
[00:50:14] (PS1) DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/554392
[00:51:06] (PS2) Dzahn: phabricator: switch "phabricator_active_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552591 (https://phabricator.wikimedia.org/T238956)
[00:51:55] (PS3) Dzahn: phabricator: switch "phabricator_active_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552591 (https://phabricator.wikimedia.org/T238956)
[00:53:13] (PS4) Dzahn: phabricator: switch "phabricator_active_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552591 (https://phabricator.wikimedia.org/T238956)
[00:55:21] Thank you so much for pinch-hitting, @Reedy!!
[00:59:03] (PS5) Dzahn: phabricator: switch "phabricator_active_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552591 (https://phabricator.wikimedia.org/T238956)
[00:59:29] (PS1) Dzahn: phabricator: switch "phabricator_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/554397 (https://phabricator.wikimedia.org/T238956)
[01:04:48] (PS2) Dzahn: phabricator: switch "phabricator_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/554397 (https://phabricator.wikimedia.org/T238956)
[01:06:04] (CR) Dzahn: [C: +2] phabricator: switch "phabricator_server" from phab1003 to phab1001 [puppet] - https://gerrit.wikimedia.org/r/554397 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[01:06:59] (CR) Dzahn: [C: +2] "puppet is currently disabled on both hosts" [puppet] - https://gerrit.wikimedia.org/r/552591 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[01:11:15] Improper Cluster Write
[01:11:15] Unable to establish a write-mode connection (to application database "phabricator_search") because Phabricator is in read-only mode.
Whatever you are trying to do does not function correctly in read-only mode.
[01:11:25] what on earth is phabricator doing writing when I just want to search...
[01:11:47] (PS2) DannyS712: InitialiseSettings - clean up groupOverrides layout / spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/554392 (https://phabricator.wikimedia.org/T231178)
[01:12:45] (CR) DannyS712: "Should be a no-op and have no impact on users" [mediawiki-config] - https://gerrit.wikimedia.org/r/554392 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[01:15:57] (PS2) Dzahn: switch discovery record for phabricator to 1001 for ATS [dns] - https://gerrit.wikimedia.org/r/552598 (https://phabricator.wikimedia.org/T238956)
[01:16:53] (CR) Dzahn: [C: +2] switch discovery record for phabricator to 1001 for ATS [dns] - https://gerrit.wikimedia.org/r/552598 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[01:17:32] [✓] switching DNS discovery record from phab1003 to phab1001 - switches backend in ATS (T238956)
[01:17:32] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[01:18:18] (CR) Dzahn: [C: +2] varnish: switch phabricator backend to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552595 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[01:20:58] !log running puppet on cp-eqiad
[01:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:50] (PS2) Dzahn: phabricator: switch mail destination to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552597 (https://phabricator.wikimedia.org/T238956)
[01:23:15] (CR) Dzahn: [C: +2] phabricator: switch mail destination to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552597 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[01:29:50] !log phabricator currently under maintenance - db connection error is known
[01:29:54] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:50] Krenair: CSRF tokens I believe.
[01:33:32] !log phab1001.eqiad.wmnet : sudo chown root.www-data /srv/phab/phabricator/conf/local/www.json
[01:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:27] (PS2) Dzahn: mtail: stop using phab1003 for tests, use phab1001 [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957)
[01:37:14] (CR) jerkins-bot: [V: -1] mtail: stop using phab1003 for tests, use phab1001 [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[01:37:14] !log re-enable phabricator writes (disable cluster.read-only)
[01:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:42] Krenair: write-mode is back
[01:40:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab1003-vcs.eqiad.wmnet
[01:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:29] twentyafterfour: depoling phab1003-vcs.. changing conftool-data... pooling phab1001-vcs
[01:40:39] or wait.. until rsync
[01:40:54] well.. i would say the first 2 steps but not the 3rd yet
[01:41:16] i hope the icinga pybal checks are not going crazy this time
[01:41:55] Unhandled Exception ("PhutilMissingSymbolException") - Failed to load class or interface "PhabricatorWorkboardViewState".
[01:41:58] at https://phabricator.wikimedia.org/tag/wikimedia-production-error/
[01:43:58] (CR) Krinkle: "If these don't result in live lookups in hiera or prod hosts, perhaps use mock names like phab1.example or some such, which should avoid s" [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[01:44:16] (PS2) Dzahn: phabricator/conftool: switch phab-vcs (git-ssh) service to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552589 (https://phabricator.wikimedia.org/T238956)
[01:45:02] (CR) Dzahn: [C: +2] phabricator/conftool: switch phab-vcs (git-ssh) service to phab1001 [puppet] - https://gerrit.wikimedia.org/r/552589 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[01:45:23] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=phab1003-vcs.eqiad.wmnet
[01:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:00] !log switching phab-vcs in conftool-data from phab1003 to phab1001, running puppet on conf*
[01:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:24] Krinkle: fixed
[01:48:42] mutante: yeah sounds right
[01:48:55] PROBLEM - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[01:49:05] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[01:49:11] (CR) Dzahn: "would you know how to fix "AssertionError: ('status=200,method=GET,backend=be_phab1001_eqiad_wmnet', 1.245995) not found in [(u'status=301" [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[01:49:13] hmm, pygmentize not found
[01:49:21] PROBLEM - Confd template for
/srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd
[01:49:25] twentyafterfour: and there are the pybal alerts again.. meh
[01:50:07] (CR) Krinkle: mtail: stop using phab1003 for tests, use phab1001 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[01:50:21] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd
[01:50:23] (CR) Krinkle: "probably the test fixture file. marked as 'input' with inline comment." [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[01:50:41] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd
[01:50:49] twentyafterfour: thanks
[01:51:01] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd
[01:51:30] runs puppet on puppetmasters
[01:52:28] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn git-ssh moved to a different backend https://wikitech.wikimedia.org/wiki/PyBal
[01:52:28] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn git-ssh moved to a different backend https://wikitech.wikimedia.org/wiki/PyBal
[01:52:28] ACKNOWLEDGEMENT - Confd template
for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn git-ssh moved to a different backend https://wikitech.wikimedia.org/wiki/Confd
[01:52:28] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn git-ssh moved to a different backend https://wikitech.wikimedia.org/wiki/Confd
[01:52:28] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn git-ssh moved to a different backend https://wikitech.wikimedia.org/wiki/Confd
[01:52:28] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn git-ssh moved to a different backend https://wikitech.wikimedia.org/wiki/Confd
[01:55:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab1001-vcs.eqiad.wmnet
[01:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:58:37] RECOVERY - PyBal IPVS diff check on lvs1014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[01:58:37] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[02:00:03] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:12] yay @ RECOVERY
[02:05:23] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:55] !log re-enabling puppet on phab1003 and phab1001.. switching active_server for puppet
[02:06:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (dbprov2001, ...), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[02:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:15:02] (PS1) 20after4: Phabricator: install python3-pygments instead of python-pygments [puppet] - https://gerrit.wikimedia.org/r/554401
[02:16:16] (PS3) Dzahn: mtail: stop using phab1003 for tests, use phab1001 [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957)
[02:16:24] (PS1) 20after4: Phabricator: remove validate_hash, it's deprecated [puppet] - https://gerrit.wikimedia.org/r/554402
[02:16:59] (CR) Dzahn: [C: +2] Phabricator: install python3-pygments instead of python-pygments [puppet] - https://gerrit.wikimedia.org/r/554401 (owner: 20after4)
[02:17:57] (CR) 20after4: "this was causing warnings in puppet lint output." [puppet] - https://gerrit.wikimedia.org/r/554402 (owner: 20after4)
[02:27:04] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:47] (PS1) Krinkle: mtail: Use mock hostnames in test fixtures [puppet] - https://gerrit.wikimedia.org/r/554403
[02:33:44] (CR) jerkins-bot: [V: -1] mtail: Use mock hostnames in test fixtures [puppet] - https://gerrit.wikimedia.org/r/554403 (owner: Krinkle)
[02:37:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:43:22] (PS2) Krinkle: mtail: Use mock hostnames in test fixtures [puppet] - https://gerrit.wikimedia.org/r/554403
[02:44:35] (PS3) Krinkle: mtail: Use mock hostnames in test fixtures [puppet] - https://gerrit.wikimedia.org/r/554403
[03:07:14] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:11] (CR) Andrew Bogott: [C: +2] wmfkeystonehooks: designate domain orig project when creating default domain [puppet] - https://gerrit.wikimedia.org/r/554289 (owner: Andrew Bogott)
[03:42:11] Operations, Traffic: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (colewhite) p:Triage→Normal
[03:42:11] Operations, Release-Engineering-Team-TODO, Jenkins, Release-Engineering-Team (CI & Testing services): Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) - https://phabricator.wikimedia.org/T239586 (colewhite) p:Triage→Normal
[03:42:11] Operations, Discovery-Search (Current work), Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (colewhite) p:Triage→Normal
[03:42:11] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494
(colewhite) a:colewhite
[03:42:28] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (colewhite) a:colewhite
[03:53:28] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:53:58] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:58:48] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:00:42] (PS1) Andrew Bogott: keystone ocata service: fix keystone-wdgi-admin.py filename [puppet] - https://gerrit.wikimedia.org/r/554407
[04:01:29] (CR) Andrew Bogott: [C: +2] keystone ocata service: fix keystone-wdgi-admin.py filename [puppet] - https://gerrit.wikimedia.org/r/554407 (owner: Andrew Bogott)
[04:02:26] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:10:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:15:33] (Abandoned) Dzahn: mtail: stop using phab1003 for tests, use phab1001 [puppet] - https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: Dzahn)
[04:39:11] !log phab1001 - rebooting for BIOS config change
[04:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:14] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled
https://wikitech.wikimedia.org/wiki/PyBal
[04:46:46] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:47:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn rebooted https://wikitech.wikimedia.org/wiki/PyBal
[04:47:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn rebooted https://wikitech.wikimedia.org/wiki/PyBal
[04:47:36] PROBLEM - TFTP service on install1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[04:48:34] ACKNOWLEDGEMENT - TFTP service on install1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* daniel_zahn manually stopped https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[04:48:52] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:49:24] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:54:42] RECOVERY - TFTP service on install1002 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[04:58:39] !log install1002 - restarted isc-dhcpd
[04:58:41] !log phabricator maintenance ended for today - now running on phab1001 (buster)
[04:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:46] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:53] !log removed downtime for phabricator.wikimedia.org meta service (paging)
[04:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:27] (Abandoned) Dzahn: re-add phabricator-new to point to caching layer [dns] - https://gerrit.wikimedia.org/r/551284 (https://phabricator.wikimedia.org/T137928) (owner: Dzahn)
[05:01:34] (Abandoned) Dzahn: ATS/varnish: add phabricator-new to point to phab1001 [puppet] - https://gerrit.wikimedia.org/r/551286 (https://phabricator.wikimedia.org/T190568) (owner: Dzahn)
[05:15:30] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[05:16:06] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) https://ticket.wikimedia.org (OTRS) has been switched to use https://ticket.discovery.wmnet (envoy on mendelevium).
[05:36:11] (PS1) Dzahn: phabricator: temp. disable automatic rsync of repo data [puppet] - https://gerrit.wikimedia.org/r/554411 (https://phabricator.wikimedia.org/T238956)
[05:37:07] (CR) Dzahn: [C: +2] phabricator: temp. disable automatic rsync of repo data [puppet] - https://gerrit.wikimedia.org/r/554411 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[05:49:48] (PS1) Dzahn: phabricator: temp. make phab1001 the "failover" again for rsync [puppet] - https://gerrit.wikimedia.org/r/554412
[05:51:48] !log Deploy schema change on s3 primary master (db1123)
[05:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:30] (PS2) Dzahn: phabricator: temp.
make phab1001 the "failover" again for rsync [puppet] - https://gerrit.wikimedia.org/r/554412 (https://phabricator.wikimedia.org/T238956)
[05:55:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:55:53] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/19769/" [puppet] - https://gerrit.wikimedia.org/r/554412 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[05:56:00] (CR) Dzahn: [C: +2] phabricator: temp. make phab1001 the "failover" again for rsync [puppet] - https://gerrit.wikimedia.org/r/554412 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[06:01:38] !log rsyncing /srv/repos data once again. pulling from phab1003 to phab1001 (T238956)
[06:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:43] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[06:03:02] (PS1) Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/554413
[06:04:28] (CR) Marostegui: [C: +2] dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/554413 (owner: Marostegui)
[06:04:43] !log Depool labsdb1011
[06:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:05] (PS1) Marostegui: db2070: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/554414 (https://phabricator.wikimedia.org/T239684)
[06:07:44] (CR) Marostegui: [C: +2] db2070: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/554414 (https://phabricator.wikimedia.org/T239684) (owner: Marostegui)
[06:09:38] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:56] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[06:11:28] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:39] !log phab1001 - running rsync of /srv/repos with --delete because it's larger than the source by about 5GB - deleting objects to match phab1003, former prod server. now both 50G (T238956)
[06:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:44] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[06:25:08] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=phab1001-vcs.eqiad.wmnet
[06:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:22] PROBLEM - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22]) https://wikitech.wikimedia.org/wiki/PyBal
[06:31:22] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22]) https://wikitech.wikimedia.org/wiki/PyBal
[06:36:10] !log removed LVS IP for git-ssh from interface on phab1003
[06:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:53] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1014 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22]) daniel_zahn phab git-ssh move https://wikitech.wikimedia.org/wiki/PyBal
[06:37:53] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22]) daniel_zahn phab git-ssh move https://wikitech.wikimedia.org/wiki/PyBal
[06:54:57] (CR) Giuseppe
Lavagetto: [C: +1] wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[06:55:36] (CR) Giuseppe Lavagetto: [C: +1] blubberoid: Harmonize eqiad limits/requests [deployment-charts] - https://gerrit.wikimedia.org/r/553469 (owner: Alexandros Kosiaris)
[06:55:48] (PS1) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/554416
[06:55:55] (PS2) Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/554416
[07:01:29] (CR) Marostegui: [C: +2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/554416 (owner: Marostegui)
[07:02:12] !log Repool labsdb1011
[07:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:07] (PS1) Marostegui: dbproxy1011: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/554420 (https://phabricator.wikimedia.org/T238399)
[07:08:38] (CR) Marostegui: [C: +2] dbproxy1011: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/554420 (https://phabricator.wikimedia.org/T238399) (owner: Marostegui)
[07:09:30] !log Depool labsdb1010 to reimport wikidatawiki.page - T238399
[07:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:35] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399
[07:13:43] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[07:22:16] there are some alerts about pybal/confctl for git-ssh on phab. i acked them and will fix that tomorrow. it was a long day and that is rarely used (ssh git on phab as opposed to Gerrit), https works fine.
[07:35:29] mutante: o/
[08:05:08] !log Restart php7-fpm on mw1348
[08:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:30] (PS1) Andrew Bogott: Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749)
[08:10:06] (CR) jerkins-bot: [V: -1] Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[08:12:01] Operations: Remove old builds on package builder - https://phabricator.wikimedia.org/T237713 (MoritzMuehlenhoff) The systemd time threw a number of errors on boron when trying to remove /var/cache/pbuilder/build/cow.6815/sys/devices* and proc/*, "Operation not permitted" Didn't look closer yet, probably the...
[08:13:45] (PS2) Andrew Bogott: Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749)
[08:14:21] (CR) jerkins-bot: [V: -1] Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[08:15:54] Operations, Gerrit-Privilege-Requests, Wikidata, Wikidata-Query-Service, Release-Engineering-Team (Unit & Int & System Tooling): Push rights on https://gerrit.wikimedia.org/r/admin/projects/wikidata/query/blazegraph for onimisionipe - https://phabricator.wikimedia.org/T238733 (hashar) We have...
[08:19:24] (PS3) Andrew Bogott: Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749)
[08:21:40] (CR) jerkins-bot: [V: -1] Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[08:24:37] (PS4) Andrew Bogott: Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749)
[08:25:00] Operations, Performance-Team, Traffic, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles) As expected the SSL cert rollback undoes all of the TLS handshake regression: {F31456216, size=full} https://grafana.wiki...
[08:25:10] (CR) jerkins-bot: [V: -1] Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[08:26:19] Operations, Performance-Team, Traffic, Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Gilles)
[08:27:25] (PS5) Andrew Bogott: Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749)
[08:29:32] (CR) jerkins-bot: [V: -1] Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[08:31:20] (PS6) Andrew Bogott: Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - https://gerrit.wikimedia.org/r/554423
(https://phabricator.wikimedia.org/T237749) [08:32:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554305 (owner: 10Jbond) [08:33:16] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10hashar) The way I understand the message: the virtualization servers in group `row_A` lack free memory to allocate a VM. But maybe another group would have memory available? You should be... [08:33:52] (03CR) 10Andrew Bogott: [C: 03+2] Neutron/ocata: update our l3 hacks using the latest ocata base files [puppet] - 10https://gerrit.wikimedia.org/r/554423 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [08:56:41] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10MoritzMuehlenhoff) The old puppetdb hosts (puppetdb1001) should be ready to go away, @jbond merged the patches to stop broadcasting to it last week. It also has 16G RAM, so those would be fr... [08:58:26] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/554470 [09:06:33] !log updated jenkins on apt.wikimedia.org to 2.190.3 (T239586) [09:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:40] T239586: Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) - https://phabricator.wikimedia.org/T239586 [09:07:49] jouncebot: next [09:07:50] In 2 hour(s) and 52 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191204T1200) [09:09:16] 10Operations, 10Release-Engineering-Team-TODO, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) - https://phabricator.wikimedia.org/T239586 (10MoritzMuehlenhoff) 05Open→03Resolved a:03Mori... 
[09:10:07] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/554470 (owner: 10Marostegui) [09:14:01] (03PS1) 10Filippo Giunchedi: logstash: set kafka consumer groups at the role level [puppet] - 10https://gerrit.wikimedia.org/r/554472 (https://phabricator.wikimedia.org/T234854) [09:15:33] !log Reload labsdb1010 after reimporting wikidatawiki.page - T238399 [09:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:39] T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 - https://phabricator.wikimedia.org/T238399 [09:18:20] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/19770/" [puppet] - 10https://gerrit.wikimedia.org/r/554472 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [09:19:06] (03PS1) 10Elukey: profile::refinery::job::spark_job: allow to pass a keytab [puppet] - 10https://gerrit.wikimedia.org/r/554473 (https://phabricator.wikimedia.org/T228291) [09:19:26] (03CR) 10Jbond: [C: 03+2] admin: add check to reject privileges which are to permissive [puppet] - 10https://gerrit.wikimedia.org/r/554317 (https://phabricator.wikimedia.org/T239070) (owner: 10Jbond) [09:20:09] (03PS3) 10Jbond: admin::tests: update tests to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/554305 [09:22:19] (03PS1) 10Andrew Bogott: neutron dhcp: reconcile dhcp_domain and dns_domain [puppet] - 10https://gerrit.wikimedia.org/r/554474 (https://phabricator.wikimedia.org/T237749) [09:22:23] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: set kafka consumer groups at the role level [puppet] - 10https://gerrit.wikimedia.org/r/554472 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [09:22:33] (03PS2) 10Elukey: profile::refinery::job::spark_job: allow to pass a keytab [puppet] - 10https://gerrit.wikimedia.org/r/554473 (https://phabricator.wikimedia.org/T228291) [09:23:43] (03CR) 10Jbond: [C: 03+2] 
admin::tests: update tests to use python3 [puppet] - 10https://gerrit.wikimedia.org/r/554305 (owner: 10Jbond) [09:24:12] (03CR) 10Marostegui: [C: 03+1] "Not much to say if it is a requirement for the installation of your tools :-)" [puppet] - 10https://gerrit.wikimedia.org/r/554354 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [09:27:51] (03CR) 10Jbond: "lgtm, but you missed one :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [09:29:48] !log roll-restart logstash7 in codfw/eqiad after https://gerrit.wikimedia.org/r/c/operations/puppet/+/554472 [09:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:27] (03PS1) 10Elukey: Add fake TLS certificates for the Hadoop Analytics cluster [labs/private] - 10https://gerrit.wikimedia.org/r/554475 [09:32:47] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake TLS certificates for the Hadoop Analytics cluster [labs/private] - 10https://gerrit.wikimedia.org/r/554475 (owner: 10Elukey) [09:33:42] (03PS2) 10Andrew Bogott: neutron dhcp: reconcile dhcp_domain and dns_domain [puppet] - 10https://gerrit.wikimedia.org/r/554474 (https://phabricator.wikimedia.org/T237749) [09:34:57] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [09:37:44] (03PS1) 10Elukey: Add missing fake Kerberos Keytab for an-master1001 [labs/private] - 10https://gerrit.wikimedia.org/r/554477 [09:39:13] (03PS3) 10Andrew Bogott: neutron dhcp: reconcile dhcp_domain and dns_domain [puppet] - 10https://gerrit.wikimedia.org/r/554474 (https://phabricator.wikimedia.org/T237749) [09:39:20] (03PS1) 10DCausse: Revert "[wdqs] enable asynchronous imports on wdqs1004" [puppet] - 10https://gerrit.wikimedia.org/r/554478 [09:39:46] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add missing fake Kerberos Keytab for an-master1001 [labs/private] - 
10https://gerrit.wikimedia.org/r/554477 (owner: 10Elukey) [09:42:49] (03CR) 10Jbond: "lgtm apart from volans comment regarding the ability to set CAS_CREATE_USER and CAS_VERSION" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [09:43:19] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19776/" [puppet] - 10https://gerrit.wikimedia.org/r/554473 (https://phabricator.wikimedia.org/T228291) (owner: 10Elukey) [09:43:28] (03CR) 10Andrew Bogott: [C: 03+2] neutron dhcp: reconcile dhcp_domain and dns_domain [puppet] - 10https://gerrit.wikimedia.org/r/554474 (https://phabricator.wikimedia.org/T237749) (owner: 10Andrew Bogott) [09:43:50] andrewbogott: ok to merge? [09:44:09] elukey: is it "neutron dhcp: reconcile dhcp_domain and dns_domain"? [09:44:11] if so yes [09:44:15] yep [09:44:18] thanks [09:44:40] hi all just a heads up that i plan to start the CA change in 15 minutes which will mean disabling puppet for some time [09:47:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks. I will allocate some time to extend this (deploy the base sbuild setup) when I find some spare brain CPU cycles." 
[puppet] - 10https://gerrit.wikimedia.org/r/549211 (owner: 10Bstorm) [09:51:03] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [09:52:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10jbond) [09:53:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10jbond) user is staff so i have updated the NDA section [09:53:46] (03CR) 10Jbond: [C: 03+2] admin: add mstyles as user [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239654) (owner: 10Mstyles) [09:54:37] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [09:56:54] (03CR) 10Jbond: [C: 03+2] wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/531893 (owner: 10Jbond) [09:57:53] (03PS2) 10Gehel: Revert "[wdqs] enable asynchronous imports on wdqs1004" [puppet] - 10https://gerrit.wikimedia.org/r/554478 (owner: 10DCausse) [10:00:12] (03CR) 10Gehel: [C: 03+2] Revert "[wdqs] enable asynchronous imports on wdqs1004" [puppet] - 10https://gerrit.wikimedia.org/r/554478 (owner: 10DCausse) [10:00:48] dcausse: ^ [10:00:59] gehel: thanks [10:01:45] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:02:27] !log disabling puppet accros the fleet to start CA update change 548241 [10:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:26] (03PS4) 10Jbond: puppet_ca: update puppet ca with a new certificate 
valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) [10:08:49] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:10:03] (03CR) 10Jbond: [C: 03+2] puppet_ca: update puppet ca with a new certificate valid for 10 years [puppet] - 10https://gerrit.wikimedia.org/r/548241 (https://phabricator.wikimedia.org/T237362) (owner: 10Jbond) [10:10:42] jbond42: LMK when done, I have a pending rolling restart via puppet to do [10:10:47] also good luck [10:10:54] will do and thanks [10:11:09] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:16:19] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: preform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10jbond) [10:18:36] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10jbond) p:05Triage→03Normal [10:21:25] !log stop replication and mysql on db2071 to test puppet CA changes [10:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:53] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:24:29] !log stop replication and mysql on db2107 (s2 codfw master) to test puppet CA changes [10:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:11] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: CRITICAL: nf_conntrack is 93 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:30:54] !log enable puppet in eqsin and deploy updated CA [10:31:06] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:10] !log rolling restart of wdqs for config change (event logging) - T101013 [10:31:12] !log gehel@cumin1001 START - Cookbook sre.wdqs.restart [10:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:15] T101013: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 [10:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:33] (03CR) 10Muehlenhoff: Add CAS authentication to debmonitor (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [10:31:42] (03PS5) 10Muehlenhoff: Add CAS authentication to debmonitor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 [10:40:57] PROBLEM - Check size of conntrack table on kubernetes1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:42:21] !log enable puppet in ulsfo and deploy updated CA [10:42:21] lemme acknowledge those 2 while debugging [10:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:31] (03PS1) 10Muehlenhoff: Drop grafana/jessie from reprepro sync [puppet] - 10https://gerrit.wikimedia.org/r/554481 [10:45:20] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Marostegui) Checks performed: On a standalone slave (db2071) with no action on its master: - Run puppet - Stop slave ; start slave; - Stop MySQL, start MySQL Repl... 
[10:46:53] !log enable puppet in esams and deploy updated CA [10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:11] !log enable puppet in codfw and deploy updated CA [10:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:05] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [10:58:39] (03CR) 10Elukey: "> Not much to say if it is a requirement for the installation of your" [puppet] - 10https://gerrit.wikimedia.org/r/554354 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [10:59:37] (03PS1) 10Urbanecm: Enable abusefilter blocking cap at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 [11:01:47] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [11:03:16] (03CR) 10Daimona Eaytoy: [C: 04-1] "Minor ordering nit." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 (owner: 10Urbanecm) [11:03:54] (03PS2) 10Urbanecm: Enable abusefilter blocking cap at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 [11:04:17] (03CR) 10Urbanecm: Enable abusefilter blocking cap at testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 (owner: 10Urbanecm) [11:04:49] (03PS3) 10Urbanecm: Enable abusefilter blocking cap at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 [11:04:52] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review-1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 (owner: 10Urbanecm) [11:09:52] (03PS1) 10Ema: ATS: add milestone timing to logs, update TTFetchHeaders [puppet] - 10https://gerrit.wikimedia.org/r/554483 (https://phabricator.wikimedia.org/T238494) [11:12:08] (03PS2) 10Ema: ATS: add milestone timing logs, update TTFetchHeaders [puppet] - 10https://gerrit.wikimedia.org/r/554483 (https://phabricator.wikimedia.org/T238494) [11:13:41] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:29] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10jcrespo) I wonder if some of these could be done on reimage, if/when there is one planned anyway. [11:14:38] (03CR) 10Elukey: [C: 03+2] "Going to merge and apply in the hadoop test cluster for a bit to see if anything comes up." 
[puppet] - 10https://gerrit.wikimedia.org/r/554354 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [11:15:48] (03PS1) 10Jcrespo: bacula: Increase production and Databases retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/554485 (https://phabricator.wikimedia.org/T238048) [11:16:57] ah snap jbond42 sorry I ran puppet-merge without thinking, I hope I didn't mess up anything [11:18:01] elukey: it should be fine, puppet is being enabled on codfw now, it's already enabled everywhere else but eqiad. I think 30 mins and it should be up everywhere [11:19:06] yep yep no rush, I can wait even two hours, it was for your ongoing work, sorry :( [11:19:22] np thanks for the heads up [11:19:26] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) Probably, some of them will probably be covered with the reimage to buster I would say. [11:19:48] (03CR) 10Ema: [C: 03+2] ATS: add milestone timing logs, update TTFetchHeaders [puppet] - 10https://gerrit.wikimedia.org/r/554483 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [11:23:36] !log enable puppet in eqiad and deploy updated CA [11:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:22] !log drain kubernetes1002 for test of nf_conntrack changes [11:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:49] RECOVERY - Check size of conntrack table on kubernetes1002 is OK: OK: nf_conntrack is 31 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:34:03] the rest of the hosts are probably gonna alert for the conntrack now that they have the load the kubernetes1002 was handling [11:37:15] PROBLEM - Check size of conntrack table on kubernetes1003 is CRITICAL: CRITICAL: nf_conntrack is 96 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:38:39] !log puppet
enabled accross the fleet and new CA certificate installed [11:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:14] godog: ^^^ i also notice that logstash[1023-1025] are disabled by you so they dont have the new cert just yet [11:40:23] jbond42: kk, let me reenable [11:40:54] jbond42: {{done}}, anything else I should do besides running puppet ? [11:41:03] no [11:41:16] ok thanks! [11:42:02] no let me know if you see anything unexpected [11:44:55] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) "Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking. [11:45:28] (03CR) 10Daimona Eaytoy: [C: 03+1] Enable abusefilter blocking cap at testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554482 (owner: 10Urbanecm) [11:49:33] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 25 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:49:49] RECOVERY - Check size of conntrack table on kubernetes1003 is OK: OK: nf_conntrack is 25 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [11:57:11] (03PS1) 10Alexandros Kosiaris: kubernetes::node: Enforce upstream nf_conntrack_max [puppet] - 10https://gerrit.wikimedia.org/r/554490 (https://phabricator.wikimedia.org/T239795) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191204T1200). [12:00:04] tim_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:21] tim_WMDE: I can SWAT for you if you want :) [12:00:31] Sure thing, thanks! 
[12:00:55] 10Operations, 10Puppet, 10cloud-services-team, 10User-jbond: cloudservices machines are currently failing puppet runs - https://phabricator.wikimedia.org/T239804 (10jbond) [12:01:40] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) The new certificate has been distributed ` % sudo cumin 'A:all' 'openssl x509 -in $(sudo puppet config print localcacert 2>/dev/null) -noout -enddate ' ====... [12:02:26] tim_WMDE: you want to swat https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/554296 to both wmf.8 and wmf.5? [12:03:08] (03CR) 10Muehlenhoff: kubernetes::node: Enforce upstream nf_conntrack_max (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554490 (https://phabricator.wikimedia.org/T239795) (owner: 10Alexandros Kosiaris) [12:03:15] I have to admit I don't know what the difference is, but I assume the answer is yes?! :O [12:03:45] tim_WMDE: deployed version of MediaWiki :-). wmf.8 is at test wikis, wmf.5 at the rest (as of now) [12:03:48] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) get the same results from the following `sudo cumin 'A:all' 'openssl x509 -in /etc/ssl/certs/Puppet_Internal_CA.pem -noout -enddate '` [12:04:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes::node: Enforce upstream nf_conntrack_max [puppet] - 10https://gerrit.wikimedia.org/r/554490 (https://phabricator.wikimedia.org/T239795) (owner: 10Alexandros Kosiaris) [12:04:35] Ah okay, both should be the way to go then! [12:04:54] Okay, will do! [12:05:37] tim_WMDE: I've +2'ed the backports (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/554493 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/554492), I'll ping you once the commit is ready to be tested. [12:05:53] Thank you! 
[12:10:12] tim_WMDE: please test your commit at mwdebug1001 (both at a testwiki and any other open wiki), so we can be sure it works :) [12:14:02] Let me see what I can do, I assume the only way I can test this is to fire a manual eventlogging event on one of the test wikis and then check hadoop if the event is there? [12:14:45] tim_WMDE: that would probably work. [12:17:43] PROBLEM - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:49] 10Operations, 10Maps: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (10Mathew.onipe) More details: At each DC (starting from eqiad) [] Manual re-import on master (maps1004). [] Tile generation on master [] Then Guillaume ru... [12:23:05] tim_WMDE: how's testing going? [12:23:46] I have sent an mw.track request on AA wiki via mwdebug1001 but now I am not sure how to see if the server has received that properly [12:24:47] tim_WMDE: hmm, it seems it at least didn't fail, so I'm going to presume the good and sync [12:25:45] Yeah, I assume it should be fine, it's not going to break anything on our end if it is not so I would also just go ahead and deploy it, we'll know if it works later today [12:26:45] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/WikimediaMessages/: SWAT: b3ef5cd: Change Schema Revision of WMDEBannerEvents (T239430) (duration: 01m 04s) [12:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:52] T239430: Fix tracking events for mobile banners - https://phabricator.wikimedia.org/T239430 [12:27:21] tim_WMDE: okay, ack [12:27:54] uh oh, I'll take a look at ms-fe2007 [12:28:48] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/WikimediaMessages/: SWAT: bbf2a33: Change Schema Revision of WMDEBannerEvents (T239430) (duration: 01m 02s) [12:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:01] tim_WMDE: done! 
[12:29:23] Thanks Martin! [12:29:33] happy to help, Tim! [12:29:36] !log EU SWAT done [12:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:47] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=ms-fe2007.codfw.wmnet [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:09] 10Operations, 10ops-codfw: ms-fe2007 nic failure - https://phabricator.wikimedia.org/T239805 (10fgiunchedi) [12:49:31] (03PS1) 10Ladsgroup: mediawiki: Stop rebuildItermTerms temporary [puppet] - 10https://gerrit.wikimedia.org/r/554496 (https://phabricator.wikimedia.org/T229407) [12:50:47] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:52:32] (03PS1) 10Arturo Borrero Gonzalez: role: toollabs: drop unused role::toollabs::k8s::bastion class [puppet] - 10https://gerrit.wikimedia.org/r/554497 [12:57:49] I'm going to mess with mwdebug1001 to debug something [13:00:27] (03PS1) 10Jbond: puppetdb: change old puppetdbs to spare role [puppet] - 10https://gerrit.wikimedia.org/r/554499 [13:02:35] There's a fatal there in logstash, I caused it [13:03:29] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10akosiaris) >>! In T238048#5701534, @jcrespo wrote: > Same for bast1001: > > > `lines=10 > 29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Jo... [13:05:10] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10akosiaris) >>! In T238048#5711820, @jcrespo wrote: > "Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs check... 
[13:05:27] (03CR) 10Muehlenhoff: "Looks good. We can also remove the puppetdb4_servers part in hieradata/regex.yaml now (or in a separate patch, fine either way)" [puppet] - 10https://gerrit.wikimedia.org/r/554499 (owner: 10Jbond) [13:07:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 (and yay!) but note this will require updating the pool manually." [puppet] - 10https://gerrit.wikimedia.org/r/554485 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [13:10:26] (03CR) 10Muehlenhoff: "As discussed on IRC, feel free to go ahead with merging, we'll address the debian/ bits in the forthcoming 101 session." [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554091 (owner: 10Ssingh) [13:18:08] (03CR) 10Marostegui: [C: 03+2] mediawiki: Stop rebuildItermTerms temporary [puppet] - 10https://gerrit.wikimedia.org/r/554496 (https://phabricator.wikimedia.org/T229407) (owner: 10Ladsgroup) [13:18:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [13:22:10] (03PS2) 10Jbond: puppetdb: change old puppetdbs to spare role [puppet] - 10https://gerrit.wikimedia.org/r/554499 [13:23:06] !log dns[345]001 - starting downtimes/etc for reimage to buster... [13:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/554499 (owner: 10Jbond) [13:24:18] !log downtimed maps1004 - T239728 [13:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:23] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 [13:24:38] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns[345]001.wikimedia.org [13:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:26] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns5001.wikimedia.org', 'dns3001.wikimedia.org', 'dns4001.wikimedia.... [13:26:16] (03PS1) 10Ema: ATS: add atsbackendttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/554503 (https://phabricator.wikimedia.org/T238494) [13:27:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={bird,pdnsrec} site={eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:28:01] (03CR) 10jerkins-bot: [V: 04-1] ATS: add atsbackendttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/554503 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [13:36:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one thought inline. And better doublecheck with PCC that [12]002 are unchanged :-)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554499 (owner: 10Jbond) [13:40:26] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rzl on cumin1001.eqiad.wmnet for hosts: ` ['mw2274.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912041329_rzl_100644.log`. 
[13:43:07] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [13:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:39] (03PS2) 10Arturo Borrero Gonzalez: role: toollabs: drop unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/554497 [13:45:17] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:09] (03PS1) 10Jbond: openstack::pdns::auth::db: ensure we manage the md cronjob correctly [puppet] - 10https://gerrit.wikimedia.org/r/554504 (https://phabricator.wikimedia.org/T239804) [13:48:45] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554499 (owner: 10Jbond) [13:50:26] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [13:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:03] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [13:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:09] (03PS2) 10Jbond: openstack::pdns::auth::db: ensure we manage the md cronjob correctly [puppet] - 10https://gerrit.wikimedia.org/r/554504 (https://phabricator.wikimedia.org/T239804) [13:52:21] (03Abandoned) 10Jbond: openstack::pdns::auth::db: ensure we manage the md cronjob correctly [puppet] - 10https://gerrit.wikimedia.org/r/554504 (https://phabricator.wikimedia.org/T239804) (owner: 10Jbond) [13:52:31] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:52:35] (03PS3) 10Arturo Borrero Gonzalez: role: toollabs: drop unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/554497 [13:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:06] (03PS4) 10Arturo Borrero Gonzalez: role: toollabs: drop unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/554497 [13:54:23] PROBLEM - Host 
2620:0:863:1:f6e9:d4ff:feba:f390 is DOWN: CRITICAL - Destination Unreachable (2620:0:863:1:f6e9:d4ff:feba:f390) [13:54:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] role: toollabs: drop unused role classes [puppet] - 10https://gerrit.wikimedia.org/r/554497 (owner: 10Arturo Borrero Gonzalez) [13:54:38] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554499 (owner: 10Jbond) [13:57:30] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [13:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:52] (03CR) 10Jbond: [C: 03+2] puppetdb: change old puppetdbs to spare role [puppet] - 10https://gerrit.wikimedia.org/r/554499 (owner: 10Jbond) [13:58:55] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [13:59:10] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [13:59:27] PROBLEM - Host 2620:0:862:1:b226:28ff:fe6e:cfe0 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:39] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] PROBLEM - Host 2001:df2:e500:1:f6e9:d4ff:fed0:ac00 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:45] (03PS1) 10Jbond: profile::puppetdb: update elk logging [puppet] - 10https://gerrit.wikimedia.org/r/554505 [14:02:48] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [14:04:30] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: 
name=dns[34]001.wikimedia.org [14:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:37] (03CR) 10Ottomata: [C: 03+1] profile::refinery::job::spark_job: allow to pass a keytab [puppet] - 10https://gerrit.wikimedia.org/r/554473 (https://phabricator.wikimedia.org/T228291) (owner: 10Elukey) [14:08:08] PROBLEM - Host 2001:df2:e500:1:f6e9:d4ff:fed0:ac00 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:44] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns5001.wikimedia.org [14:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:39] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3001.wikimedia.org'] ` Of which those **FAILED**: ` ['dns3001.wikimedia.org'] ` [14:12:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:15:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=jmx_puppetdb site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:17:31] (03PS1) 10Effie Mouzeli: php::admin Include APCu fragmentation percentage metric [puppet] - 10https://gerrit.wikimedia.org/r/554507 [14:21:44] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2274.codfw.wmnet'] ` and were **ALL** successful. 
[14:24:11] !log ns2 authdns: re-route from ganeti3003 to dns3001 - T236479 [14:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:17] T236479: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 [14:31:08] !log test ldap-corp2001 as LDAP server on mx2001 [14:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:34] 10Operations, 10DC-Ops, 10decommission, 10Discovery-Search (Current work): decommission elastic10[17-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (10Gehel) [14:32:12] 10Operations, 10DC-Ops, 10decommission, 10Discovery-Search (Current work): decommission elastic10[17-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (10Gehel) [14:32:33] (03PS2) 10Effie Mouzeli: php::admin Include APCu fragmentation percentage metric [puppet] - 10https://gerrit.wikimedia.org/r/554507 [14:35:23] (03PS2) 10Muehlenhoff: Switch ldap-corp.codfw.wikimedia.org to ldap-corp2001 [dns] - 10https://gerrit.wikimedia.org/r/553323 (https://phabricator.wikimedia.org/T224557) [14:40:39] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,cluster=appserver,dc=codfw,name=mw2274.codfw.wmnet [14:40:40] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: service=apache2,cluster=appserver,dc=codfw,name=mw2274.codfw.wmnet [14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:45] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10RLazarus) [14:42:05] 10Operations, 10DC-Ops, 10decommission, 10Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (10Gehel) [14:43:04] (03CR) 10CDanis: [C: 03+1] Drop grafana/jessie from reprepro sync [puppet] - 10https://gerrit.wikimedia.org/r/554481 (owner: 
10Muehlenhoff) [14:43:54] 10Operations, 10DC-Ops, 10decommission, 10Discovery-Search (Current work): decommission elastic10[18-31].eqiad.wmnet - https://phabricator.wikimedia.org/T239821 (10Gehel) [14:44:14] (03PS1) 10Jbond: profile::puppetdb: refactor [puppet] - 10https://gerrit.wikimedia.org/r/554516 [14:46:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:46:18] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:46:48] (03PS3) 10Effie Mouzeli: php::admin include APCu fragmentation percentage metrics [puppet] - 10https://gerrit.wikimedia.org/r/554507 [14:47:30] (03PS1) 10Gehel: search: decommission elastic10[18-31].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/554517 (https://phabricator.wikimedia.org/T239821) [14:48:36] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rzl on cumin1001.eqiad.wmnet for hosts: ` ['mw2273.codfw.wmnet', 'mw2272.codfw.wmnet', 'mw2267.codfw.wmnet'] ` The log can be found in `/var/log/w... 
[14:48:42] (03PS2) 10Jbond: profile::puppetdb: refactor [puppet] - 10https://gerrit.wikimedia.org/r/554516 [14:49:48] 10Operations, 10Analytics, 10Event-Platform, 10Wikimedia-Logstash, 10observability: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (10Ottomata) 05Open→03Resolved [14:49:51] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10Ottomata) [14:49:58] (03PS1) 10BBlack: Revert "authdns_servers: add ganeti3003" [puppet] - 10https://gerrit.wikimedia.org/r/554518 (https://phabricator.wikimedia.org/T236479) [14:50:00] (03PS1) 10BBlack: Revert "ganeti3003: provision as authdns::server" [puppet] - 10https://gerrit.wikimedia.org/r/554519 (https://phabricator.wikimedia.org/T236479) [14:51:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2135 in m5 codfw', diff saved to https://phabricator.wikimedia.org/P9805 and previous config saved to /var/cache/conftool/dbconfig/20191204-145145-marostegui.json [14:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2135 as master for s10 in codfw', diff saved to https://phabricator.wikimedia.org/P9806 and previous config saved to /var/cache/conftool/dbconfig/20191204-145349-marostegui.json [14:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:53] (03CR) 10BBlack: [C: 03+2] Revert "authdns_servers: add ganeti3003" [puppet] - 10https://gerrit.wikimedia.org/r/554518 (https://phabricator.wikimedia.org/T236479) (owner: 10BBlack) [14:56:33] (03PS1) 10Muehlenhoff: Add library hint for mariadb-10.3 (distro-packaged version for libmariadb, not the WMF ones) [puppet] - 10https://gerrit.wikimedia.org/r/554523 [14:58:17] (03CR) 10jerkins-bot: [V: 04-1] Add library hint for 
mariadb-10.3 (distro-packaged version for libmariadb, not the WMF ones) [puppet] - 10https://gerrit.wikimedia.org/r/554523 (owner: 10Muehlenhoff) [14:59:49] (03PS2) 10Muehlenhoff: Add library hint for mariadb-10.3 (distro-packaged version) [puppet] - 10https://gerrit.wikimedia.org/r/554523 [15:00:05] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:00:41] (03PS1) 10Jbond: netbox: create netbox_frontend global variables [puppet] - 10https://gerrit.wikimedia.org/r/554526 [15:00:45] (03CR) 10BBlack: [C: 03+2] Revert "ganeti3003: provision as authdns::server" [puppet] - 10https://gerrit.wikimedia.org/r/554519 (https://phabricator.wikimedia.org/T236479) (owner: 10BBlack) [15:01:17] (03PS2) 10Jbond: netbox: create netbox_frontend global variables [puppet] - 10https://gerrit.wikimedia.org/r/554526 [15:02:49] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [15:04:53] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for mariadb-10.3 (distro-packaged version) [puppet] - 10https://gerrit.wikimedia.org/r/554523 (owner: 10Muehlenhoff) [15:05:20] (03PS3) 10Jbond: netbox: create netbox_frontend global variables [puppet] - 10https://gerrit.wikimedia.org/r/554526 [15:05:36] !log rzl@cumin1001 START - Cookbook 
sre.hosts.downtime [15:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:45] 10Operations, 10Traffic, 10Patch-For-Review: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['ganeti3003.esams.wmnet'] ` The log can be found in `/var/log/wmf-aut... [15:06:43] (03PS1) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528 [15:06:59] (03CR) 10Muehlenhoff: [C: 03+2] Drop grafana/jessie from reprepro sync [puppet] - 10https://gerrit.wikimedia.org/r/554481 (owner: 10Muehlenhoff) [15:07:15] (03CR) 10jerkins-bot: [V: 04-1] statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528 (owner: 10Elukey) [15:07:32] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407 (duration: 01m 00s) [15:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:37] T229407: Spikes in DB traffic and rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [15:07:43] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:07:43] (03PS3) 10Bstorm: toolforge: set a new package builder role [puppet] - 10https://gerrit.wikimedia.org/r/549211 [15:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:54] (03CR) 10Jcrespo: [C: 03+2] bacula: Increase production and Databases retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/554485 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [15:09:06] (03PS4) 10Jbond: netbox: create netbox_frontend global variables [puppet] - 10https://gerrit.wikimedia.org/r/554526 [15:09:43] (03CR) 10Filippo Giunchedi: [C: 03+1] 
profile::puppetdb: update elk logging [puppet] - 10https://gerrit.wikimedia.org/r/554505 (owner: 10Jbond) [15:09:47] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407, part II (duration: 01m 02s) [15:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:08] (03CR) 10Bstorm: [C: 03+2] toolforge: set a new package builder role [puppet] - 10https://gerrit.wikimedia.org/r/549211 (owner: 10Bstorm) [15:11:12] 10Operations, 10ops-codfw: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10fgiunchedi) [15:11:28] jynus: bacula: Increase production and Databases retention to 90 days (9695ec8da6) ok to merge? [15:11:36] yes [15:11:42] Ok, doing it :) [15:11:44] (03PS5) 10Jbond: netbox: create netbox_frontend global variables [puppet] - 10https://gerrit.wikimedia.org/r/554526 [15:11:46] was on it, it took some time to merge [15:12:41] !log mobrovac@deploy1001 Started deploy [restbase/deploy@f4b752e]: Parsoid: Set title when sending html2html reqs; Mirror 6% of html2html reqs to Parsoid/PHP - T239768 T239643 [15:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:47] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [15:12:47] T239768: All RestBase mirrored html2html language conversion pages have page title set to Main Page - https://phabricator.wikimedia.org/T239768 [15:14:15] 10Operations, 10Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Hi, @akosiaris, thanks for the reviews and feedback. Could I have further your thoughts on T238048#5701519 and T238048#5701534. Normally I would just find a solution or wo... 
[15:14:54] (03PS2) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528 [15:14:56] (03PS2) 10Ema: ATS: add atsbackendttfb.mtail [puppet] - 10https://gerrit.wikimedia.org/r/554503 (https://phabricator.wikimedia.org/T238494) [15:15:30] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:17:55] 10Operations, 10Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) After update, the pools seem ok, although we probably should also increase the offsite one (creating patch). ` *list pool +--------+------------+---------+---------+------... 
[15:20:23] (03CR) 10Reedy: "recheck" [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553385 (owner: 10Ssingh) [15:21:14] (03CR) 10Reedy: "recheck" [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553407 (owner: 10Ssingh) [15:21:19] (03CR) 10Reedy: "recheck" [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554091 (owner: 10Ssingh) [15:21:39] (03CR) 10jerkins-bot: [V: 04-1] Replace the string "CAIDA" with "IODA" to maintain consistency [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553407 (owner: 10Ssingh) [15:21:46] (03CR) 10jerkins-bot: [V: 04-1] Add scripts for fetching data from OONI [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554091 (owner: 10Ssingh) [15:23:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [15:24:35] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [15:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:41] 10Operations, 10Puppet, 10serviceops, 10User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. 
- https://phabricator.wikimedia.org/T237362 (10jbond) The new CA has been distributed now so this can be started [15:26:44] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:26:47] (03CR) 10Marostegui: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/554485 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [15:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:35] 10Operations, 10Traffic: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] ` Of which those **FAILED**: ` ['ganeti3003.esams.wmnet'] ` [15:28:42] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@f4b752e]: Parsoid: Set title when sending html2html reqs; Mirror 6% of html2html reqs to Parsoid/PHP - T239768 T239643 (duration: 16m 02s) [15:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:48] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [15:28:48] T239768: All RestBase mirrored html2html language conversion pages have page title set to Main Page - https://phabricator.wikimedia.org/T239768 [15:29:12] !log installing mariadb 10.3 updates from Buster 10.2 point release (client libs/tools only) [15:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:59] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [15:30:15] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [15:30:50] 10Operations, 10Traffic: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10BBlack) 05Open→03Resolved a:03BBlack Our `ns2` service address is now re-routed to `dns3001`, and `ganeti3003` is 
reimaged back to `spare::system`. [15:34:49] (03PS1) 10BBlack: ganeti3003: buster installer [puppet] - 10https://gerrit.wikimedia.org/r/554533 (https://phabricator.wikimedia.org/T236479) [15:35:22] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10herron) [15:35:43] (03CR) 10BBlack: [C: 03+2] ganeti3003: buster installer [puppet] - 10https://gerrit.wikimedia.org/r/554533 (https://phabricator.wikimedia.org/T236479) (owner: 10BBlack) [15:36:54] 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10BBlack) [15:38:04] (03PS1) 10Jbond: profile::opennstack::services: Ensure the mdadm cron is not managed [puppet] - 10https://gerrit.wikimedia.org/r/554534 (https://phabricator.wikimedia.org/T224828) [15:38:50] 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10BBlack) With T236479 closed, ganeti3003 is no longer special and everyone can ignore the **IMPORTANT NOTE** earlier. 
[15:39:19] (03PS1) 10BBlack: Revert "Switch to digicert-2019a in esams, temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/554535 [15:39:48] (03CR) 10jerkins-bot: [V: 04-1] profile::opennstack::services: Ensure the mdadm cron is not managed [puppet] - 10https://gerrit.wikimedia.org/r/554534 (https://phabricator.wikimedia.org/T224828) (owner: 10Jbond) [15:41:13] (03PS1) 10Jcrespo: bacula: Increase offsite backup retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/554536 (https://phabricator.wikimedia.org/T238048) [15:41:17] (03PS2) 10Jbond: profile::opennstack::services: Ensure the mdadm cron is not managed [puppet] - 10https://gerrit.wikimedia.org/r/554534 (https://phabricator.wikimedia.org/T224828) [15:42:45] (03PS3) 10Jbond: profile::opennstack::services: Ensure the mdadm cron is not managed [puppet] - 10https://gerrit.wikimedia.org/r/554534 (https://phabricator.wikimedia.org/T224828) [15:45:53] (03CR) 10Volans: "I think the name needs to be changed in the two views, according to their docs. 
But if you tested that it works anyway glad to leave it as" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [15:47:03] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [15:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:39] (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554534 (https://phabricator.wikimedia.org/T224828) (owner: 10Jbond) [15:48:24] (03PS3) 10Ema: ATS: add client<->ats-be interactions metrics [puppet] - 10https://gerrit.wikimedia.org/r/554503 (https://phabricator.wikimedia.org/T238494) [15:48:51] (03CR) 10Jbond: [C: 03+2] profile::opennstack::services: Ensure the mdadm cron is not managed [puppet] - 10https://gerrit.wikimedia.org/r/554534 (https://phabricator.wikimedia.org/T224828) (owner: 10Jbond) [15:49:09] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:58] (03CR) 10Muehlenhoff: Add CAS authentication to debmonitor (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/554273 (owner: 10Muehlenhoff) [15:52:26] (03CR) 10CRusnov: [C: 03+1] "Typo inline. I'm not completely familiar with the alias() function, but otherwise lgtm." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554526 (owner: 10Jbond) [15:55:22] (03PS1) 10Jcrespo: bacula: Schedule hourly copies of production backups to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/554539 (https://phabricator.wikimedia.org/T238048) [15:56:01] (03CR) 10Jcrespo: [C: 03+2] bacula: Increase offsite backup retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/554536 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [15:56:17] (03PS2) 10Jcrespo: bacula: Increase offsite backup retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/554536 (https://phabricator.wikimedia.org/T238048) [15:56:40] (03CR) 10Jbond: "thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554526 (owner: 10Jbond) [15:57:03] !log rebooting ms-fe2007 for HW maintenance [15:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:22] (03CR) 10Filippo Giunchedi: [C: 03+1] ATS: add client<->ats-be interactions metrics [puppet] - 10https://gerrit.wikimedia.org/r/554503 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:59:11] 10Operations, 10serviceops: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 (10MoritzMuehlenhoff) [15:59:50] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (10Papaul) [16:01:16] RECOVERY - Host ms-fe2007 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [16:01:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch ldap-corp.codfw.wikimedia.org to ldap-corp2001 [dns] - 10https://gerrit.wikimedia.org/r/553323 (https://phabricator.wikimedia.org/T224557) (owner: 10Muehlenhoff) [16:04:43] 10Operations: Fix installation of Puppet 5 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832 (10MoritzMuehlenhoff) [16:04:45] 10Operations, 10observability: StatsD Exporter does 
not relay dropped metrics - https://phabricator.wikimedia.org/T239833 (10colewhite) [16:06:31] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Rentention change documented at: https://wikitech.wikimedia.org/wiki/Bacula#Modify_a_pool's_retention_(or_other_similar_properties) [16:06:48] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [16:07:50] (03CR) 10Ema: [C: 03+2] ATS: add client<->ats-be interactions metrics [puppet] - 10https://gerrit.wikimedia.org/r/554503 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [16:07:58] PROBLEM - Host ms-fe2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:08:19] jynus: ok to merge your change too? [16:08:30] ema: E: failed to lock, another puppet-merge running on this host? [16:08:37] yes [16:08:46] (03PS1) 10Volans: Initial setup of repo [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554542 [16:08:48] (03PS1) 10Volans: dns: generate DNS snippets from Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) [16:08:56] jynus: done! 
[16:09:00] thanks [16:09:32] (03PS1) 10Cwhite: when configured to relay statsd traffic, send the raw []byte recieved toward the configured statsd endpoint [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/554544 (https://phabricator.wikimedia.org/T239833) [16:10:23] (03CR) 10Volans: "I've sent an iteration of this starting form PS8 on Ia6f120884bc255d85cb0408eba3784b09bc04199" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:12:40] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554542 (owner: 10Volans) [16:13:04] (03PS6) 10Jbond: netbox: create netbox_frontend global variables [puppet] - 10https://gerrit.wikimedia.org/r/554526 [16:14:08] "The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. " :O [16:16:26] (03PS1) 10Ssingh: Update test/test_countrycodes.py to handle missing iso_3166-1.json file [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554548 [16:16:28] 10Operations: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832 (10MoritzMuehlenhoff) [16:16:48] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:18:21] (03PS1) 10Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) [16:18:32] RECOVERY - Host ms-fe2007.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 37.19 ms [16:19:32] (03CR) 10Jcrespo: [C: 03+1] "I think this is a safe change to test without a +1 and revert and rethink if it doesn't work as intended." [puppet] - 10https://gerrit.wikimedia.org/r/554539 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [16:19:39] (03CR) 10Jcrespo: [C: 03+2] bacula: Schedule hourly copies of production backups to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/554539 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [16:21:02] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:23:36] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10akosiaris) >>! In T238048#5712471, @jcrespo wrote: > Hi, @akosiaris, thanks for the reviews and feedback. Could I have further your thoughts on T238048#570151... 
[16:24:03] (03PS1) 10Muehlenhoff: Fix apt pinning for VP9-enabled ffmpeg build [puppet] - 10https://gerrit.wikimedia.org/r/554550 (https://phabricator.wikimedia.org/T239831) [16:25:00] !log disable puppet on mw1348 [16:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:00] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [16:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:10] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:28] 10Operations, 10vm-requests: Site: (QUANTITY) VM request for SERVICE[S] - https://phabricator.wikimedia.org/T239838 (10akosiaris) [16:31:29] PROBLEM - Host ms-fe2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:31:52] 10Operations, 10vm-requests: EQIAD+CODFW: (9) VM request for kubernetes etcd - https://phabricator.wikimedia.org/T239838 (10akosiaris) p:05Triage→03High a:03akosiaris [16:32:20] 10Operations, 10vm-requests: EQIAD+CODFW: (9) VM request for kubernetes etcd - https://phabricator.wikimedia.org/T239838 (10akosiaris) [16:32:48] !log enable puppet on mw1348 [16:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:53] !log enable puppet on mwdebug1001 [16:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:53] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) > I have provided them already in Indeed, sorry- mail client only showed you last comment. > Maybe they are encrypted with that key (which should... [16:33:58] (03CR) 10Volans: [V: 03+2 C: 03+2] "Merging so that we can enable CI in the repo to make it run on the next CR."
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554542 (owner: 10Volans) [16:35:01] (03PS1) 10Ema: ATS: allow to toggle request coalescing [puppet] - 10https://gerrit.wikimedia.org/r/554558 (https://phabricator.wikimedia.org/T238494) [16:35:10] 10Operations, 10serviceops, 10Patch-For-Review: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 (10jijiki) a:03jijiki [16:35:35] (03CR) 10Effie Mouzeli: [C: 03+1] Fix apt pinning for VP9-enabled ffmpeg build [puppet] - 10https://gerrit.wikimedia.org/r/554550 (https://phabricator.wikimedia.org/T239831) (owner: 10Muehlenhoff) [16:36:45] RECOVERY - Host ms-fe2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [16:37:36] (03PS1) 10Jcrespo: Revert "bacula: Schedule hourly copies of production backups to the offsite pool" [puppet] - 10https://gerrit.wikimedia.org/r/554559 [16:37:47] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "bacula: Schedule hourly copies of production backups to the offsite pool" [puppet] - 10https://gerrit.wikimedia.org/r/554559 (owner: 10Jcrespo) [16:38:26] (03CR) 10Ssingh: [C: 03+2] First commit of censorship monitoring project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553385 (owner: 10Ssingh) [16:38:54] (03Merged) 10jenkins-bot: First commit of censorship monitoring project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553385 (owner: 10Ssingh) [16:38:57] (03CR) 10Ssingh: [V: 03+2 C: 03+2] First commit of censorship monitoring project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553385 (owner: 10Ssingh) [16:39:31] (03PS1) 10Alexandros Kosiaris: WIP: Add kubetcd[12]00[456], kubestagetcd100[456] [dns] - 10https://gerrit.wikimedia.org/r/554561 (https://phabricator.wikimedia.org/T239838) [16:40:22] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 
(10jcrespo) I scheduled by accident the migration, not the copy. ` Incremental Migrate 20 04-Dec-19 17:00 Migrate Job ` I think it wouldn't have run a... [16:42:25] (03PS1) 10Jcrespo: bacula: Schedule hourly copies of production backups to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/554563 (https://phabricator.wikimedia.org/T238048) [16:43:24] (03PS2) 10Jcrespo: bacula: Schedule hourly copies of production backups to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/554563 (https://phabricator.wikimedia.org/T238048) [16:47:48] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [16:47:50] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2273.codfw.wmnet', 'mw2272.codfw.wmnet', 'mw2267.codfw.wmnet'] ` and were **ALL** successful. [16:48:16] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw2273.codfw.wmnet,dc=codfw,cluster=appserver,service=apache2 [16:48:17] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw2273.codfw.wmnet,dc=codfw,cluster=appserver,service=nginx [16:48:18] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw2272.codfw.wmnet,dc=codfw,service=apache2,cluster=appserver [16:48:19] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw2272.codfw.wmnet,dc=codfw,service=nginx,cluster=appserver [16:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:21] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=apache2,name=mw2267.codfw.wmnet,cluster=videoscaler [16:48:22] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=nginx,name=mw2267.codfw.wmnet,cluster=videoscaler [16:48:23] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: 
dc=codfw,service=apache2,name=mw2267.codfw.wmnet,cluster=jobrunner [16:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:24] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=nginx,name=mw2267.codfw.wmnet,cluster=jobrunner [16:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:57] !log dns[12]001 - reimaging to buster [16:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:01] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=dns[12]001.wikimedia.org [16:49:07] (03CR) 10Jcrespo: [C: 03+2] bacula: Schedule hourly copies of production backups to the offsite pool [puppet] - 10https://gerrit.wikimedia.org/r/554563 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [16:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:45] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns1001.wikimedia.org', 'dns2001.wikimedia.org'] ` The log can be fo... 
[16:53:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdnsrec site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:53:58] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Now it is ok: ` Scheduled Jobs: Level Type Pri Scheduled Job Name Volume =========================================... [16:55:09] PROBLEM - Host 2620:0:861:1:208:80:154:10 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:23] PROBLEM - Host 2620:0:860:3:208:80:153:77 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:27] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:55:28] (03PS2) 10Ema: ATS: allow to toggle request coalescing [puppet] - 10https://gerrit.wikimedia.org/r/554558 (https://phabricator.wikimedia.org/T238494) [16:55:29] I could've sworn I downtimed all those [16:55:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: 10Muehlenhoff) [16:55:42] I guess those both aren't services and don't go with the host, sorry [16:56:07] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:07] PROBLEM - recommendation_api endpoints health on scb2006 is 
CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [16:56:07] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:15] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic [16:56:15] ut before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:29] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:29] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} 
(Description addition suggestions) timed out before a response wa [16:56:29] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:56:45] ACKNOWLEDGEMENT - Host 2620:0:860:3:208:80:153:77 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black T239667 [16:56:45] ACKNOWLEDGEMENT - Host 2620:0:861:1:208:80:154:10 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black T239667 [16:56:45] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a re [16:56:45] ed https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:49] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [16:56:49] ponse 
was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:49] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7833 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:56:59] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:57:29] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [16:57:33] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:57:33] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa 
[16:57:33] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:57:51] mw api in trouble? [16:58:11] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:58:20] yep latencies are horrible [16:58:22] or something [16:58:28] 10Operations, 10Puppet, 10cloud-services-team, 10Patch-For-Review, 10User-jbond: cloudservices machines are currently failing puppet runs - https://phabricator.wikimedia.org/T239804 (10jbond) 05Open→03Resolved Fixed [16:58:35] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) [16:58:37] ES, recommendations, and API? [16:58:38] (03CR) 10Volans: "recheck" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:58:47] which underlies which in what way? 
[16:58:52] and restbase of course [16:59:10] (03CR) 10Jbond: [C: 03+2] profile::puppetdb: update elk logging [puppet] - 10https://gerrit.wikimedia.org/r/554505 (owner: 10Jbond) [16:59:23] I think that recommendation pulls data from the mw api or similar [16:59:29] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/19791/" [puppet] - 10https://gerrit.wikimedia.org/r/554558 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [16:59:37] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:59:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:59:45] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:59:59] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:06] there are also fatals/exceptions for mediawiki [17:00:11] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:16] (03PS1) 10Ema: ATS: add 
trafficserver_backend_client_requests_total to mtail [puppet] - 10https://gerrit.wikimedia.org/r/554570 (https://phabricator.wikimedia.org/T238494) [17:00:19] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:00:37] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:01:21] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:01:31] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:01:43] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:02:05] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:02:11] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:02:23] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:02:25] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test 
page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:02:27] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:03:59] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:04:07] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:07] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:04:07] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:05:03] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [17:05:45] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:06:25] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:06:57] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:06:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:08] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:07:09] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:11] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:21] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:08:45] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:08:55] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:09:01] (03PS2) 10Ssingh: Replace the string "CAIDA" with "IODA" to maintain consistency [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553407 [17:09:03] (03PS2) 10Ssingh: Add scripts for fetching data from OONI [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554548 [17:09:18] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:32] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:09:58] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:00] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:10:18] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:10:36] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:11:21] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:58] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:12:06] PROBLEM - Host ms-fe2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[17:12:54] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:13:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:13:26] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:13:29] (03CR) 10Ssingh: [C: 03+2] Replace the string "CAIDA" with "IODA" to maintain consistency [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/553407 (owner: 10Ssingh) [17:13:40] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:52] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:15:08] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:15:13] !log killing dump threads on db1118 T143870 [17:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:26] T143870: Some mw snapshot hosts are accessing main db servers - https://phabricator.wikimedia.org/T143870 [17:15:40] RECOVERY - recommendation_api 
endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:15:52] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:16:00] PROBLEM - Host 2620:0:861:1:d294:66ff:fe5f:5a1d is DOWN: PING CRITICAL - Packet loss = 100% [17:16:10] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [17:17:14] RECOVERY - Host ms-fe2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [17:17:14] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:17:28] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:17:46] (03CR) 10Ssingh: [C: 03+2] Add scripts for fetching data from OONI [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554548 (owner: 10Ssingh) [17:17:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:18:38] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:18:56] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:20:00] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:20:18] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:20:32] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:20:34] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:20:48] PROBLEM - Host 2620:0:860:3:d294:66ff:fe5f:6a40 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:3:d294:66ff:fe5f:6a40) [17:21:12] ^ that ipv6 alert is an artifact of reimaging [17:21:50] <_joe_> !log depooling mw1348 for debugging [17:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:04] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:23:08] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:23:20] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) 
Completed auto-reimage of hosts: ` ['dns1001.wikimedia.org', 'dns2001.wikimedia.org'] ` and were **ALL** successful. [17:25:09] <_joe_> !log repooling mw1348 [17:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:17] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns[12]001.wikimedia.org [17:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:32] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:29:33] 10Operations, 10ops-codfw: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10Papaul) @fgiunchedi the 10G NiC is dead 1- option replace the server with another server https://netbox.wikimedia.org/dcim/devices/1099/ 2- option Buy another 10G NIC [17:33:46] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:33:58] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:34:00] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:34:00] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:34:00] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:34:40] (03PS1) 10Holger Knust: T220399 Migrate cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [17:41:04] RECOVERY - Host mw2259 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [17:41:56] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 down and mgmt does not exist? - https://phabricator.wikimedia.org/T239758 (10Papaul) 05Open→03Resolved Reset the IDRAC server is back up [17:45:32] 10Operations, 10ops-codfw: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10fgiunchedi) >>! In T239805#5713046, @Papaul wrote: > @fgiunchedi the 10G NiC is dead > > 1- option replace the server with another server > https://netbox.wikimedia.org/dcim/devices/1099/ > 2- option Buy another... 
[17:47:34] 10Operations, 10ops-codfw: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10RobH) [17:50:00] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: T229407, part III (duration: 01m 01s) [17:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:05] T229407: Spikes in DB traffic and rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [17:51:43] !log dns1001: stopping recursive dns to test failure theory (same method as pre-reimaging earlier, intended to not cause impact) [17:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:54:31] !log dns1001: back to normal state [17:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:53] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:56:03] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:56:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:57:11] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:58:23] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:02:35] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:04:37] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) [18:05:06] !log dns1002: stopping recursive dns to test failure theory (same method as pre-reimaging earlier, intended to not cause impact) [18:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:15] !log dns1002: back to normal state [18:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:09] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:09:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:10:09] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:10:27] (03PS1) 10Ladsgroup: Revert "mediawiki: Stop rebuildItermTerms temporary" [puppet] - 10https://gerrit.wikimedia.org/r/554583 [18:11:39] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:11:58] (03Abandoned) 10Ssingh: Add scripts for fetching data from OONI [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554091 (owner: 10Ssingh) [18:12:53] we'll be doing a parsoid deploy shortly .. should reduce the language converter logspam from parsoid/php. [18:14:00] (03PS1) 10Jbond: eventgate-logging values: update puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/554584 (https://phabricator.wikimedia.org/T237259) [18:14:01] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:14:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:18:53] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 312.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [18:24:24] !log arlolra@deploy1001 Started deploy [parsoid/deploy@0910e18]: Updating Parsoid to b81bbf4 [18:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:35] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@0910e18]: Updating Parsoid to b81bbf4 (duration: 08m 11s) [18:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:55] !log dns1001: stopping just bird [18:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:50] (03PS4) 10Krinkle: mtail: Use mock hostnames in test fixtures [puppet] - 10https://gerrit.wikimedia.org/r/554403 [18:44:09] (03PS5) 10Krinkle: mtail: Use mock hostnames in test fixtures [puppet] - 10https://gerrit.wikimedia.org/r/554403 [18:44:21] (03CR) 10Krinkle: "Rebased to squash in Daniel's abandoned parent change." [puppet] - 10https://gerrit.wikimedia.org/r/554403 (owner: 10Krinkle) [18:44:24] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10Dzahn) Hi @Mstyles i see your user exists now for example on `bast1002`: ` [bast1002:~] $ id mstyles uid=22524(mstyles) gid=500(wikidev) groups=500(wikidev),600(all-use... 
[18:45:12] !log Updated Parsoid to b81bbf4 (T239643, T239830, T238456, T239841) [18:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:40] T239830: Add a couple missing performance metrics - https://phabricator.wikimedia.org/T239830 [18:45:41] T239841: Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow - https://phabricator.wikimedia.org/T239841 [18:45:42] T238456: Missing implementation to post Parsoid/PHP lints to production database - https://phabricator.wikimedia.org/T238456 [18:45:42] (03CR) 10Ottomata: [C: 03+2] eventgate-logging values: update puppet CA [deployment-charts] - 10https://gerrit.wikimedia.org/r/554584 (https://phabricator.wikimedia.org/T237259) (owner: 10Jbond) [18:45:43] T239643: Bugs in PHP port of LanguageConverter - https://phabricator.wikimedia.org/T239643 [18:46:05] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10RLazarus) [18:46:27] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:27] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:29] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:41] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is 
CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [18:46:49] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [18:46:49] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:57] !log dns1001: restart bird.service [18:47:00] (03CR) 10Dzahn: [C: 03+2] mtail: Use mock hostnames in test fixtures [puppet] - 10https://gerrit.wikimedia.org/r/554403 (owner: 10Krinkle) [18:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:15] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:15] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:15] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:47:21] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:47:25] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:47:33] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:47:47] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:11] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:27] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:48:31] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:55] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:57] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:57] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:49:18] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[18:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:25] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:55] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:57] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:50:02] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:51] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:50:59] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:52:06] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:23] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:54:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:54:36] !log dns1001: stop bird.service again, briefly [18:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:45] !log dns1001: back to normal again [18:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:59] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rzl on cumin1001.eqiad.wmnet for hosts: ` ['mw2266.codfw.wmnet', 'mw2265.codfw.wmnet', 'mw2264.codfw.wmnet', 'mw2263.codfw.wmnet'] ` The log can b... 
[18:59:39] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:59:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:00:05] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191204T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:52] (03Abandoned) 10CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [19:01:23] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:03:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:03:30] I'm deploying something for swat [19:06:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:07:16] (03CR) 10Dzahn: [C: 03+2] Phabricator: remove validate_hash, it's deprecated [puppet] - 10https://gerrit.wikimedia.org/r/554402 (owner: 1020after4) [19:07:24] (03PS2) 10Dzahn: Phabricator: remove validate_hash, it's deprecated [puppet] - 10https://gerrit.wikimedia.org/r/554402 (owner: 1020after4) [19:11:20] (03Abandoned) 10Anomie: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [19:16:01] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [19:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:12] (03PS1) 10Dzahn: Revert "switch discovery record for phabricator to 1001 for ATS" [dns] - 10https://gerrit.wikimedia.org/r/554589 [19:17:11] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Wikibase/repo/includes/ParserOutput/FullEntityParserOutputGenerator.php: SWAT: [[gerrit:554330|Remove no-op 'jquery.ui.core.styles' from FullEntityParserOutputGenerator]] (T219604 T239594) (duration: 01m 06s) [19:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:18] T219604: Remove unused jquery.ui.* and jquery.effects.* modules - https://phabricator.wikimedia.org/T219604 [19:17:18] T239594: Unexpected general module "jquery.ui.core.styles" in styles queue. 
- https://phabricator.wikimedia.org/T239594 [19:17:53] (03PS1) 10Dzahn: Revert "varnish: switch phabricator backend to phab1001" [puppet] - 10https://gerrit.wikimedia.org/r/554590 [19:18:19] (03PS1) 10Dzahn: Revert "phabricator: switch mail destination to phab1001" [puppet] - 10https://gerrit.wikimedia.org/r/554591 [19:18:47] (03PS1) 10Dzahn: Revert "dumps/phabricator: switch dumps host from phab1003 to phab1001" [puppet] - 10https://gerrit.wikimedia.org/r/554592 [19:19:23] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:06] !log morning SWAT is done [19:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:48] (03CR) 10BryanDavis: [C: 04-1] dynamicproxy: add backend information to access log entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554041 (https://phabricator.wikimedia.org/T238641) (owner: 10Arturo Borrero Gonzalez) [19:30:24] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Jgreen) >>! In T239733#5709767, @Papaul wrote: > @Jgreen I need some information for this server. > > - RAID information since we have 6x1.92TB disks > - Server name: I am using frdb2002 since we alr... [19:33:05] (03CR) 10Jgreen: [C: 03+2] frack: add missing asset tag management records [dns] - 10https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [19:33:17] (03CR) 10Jgreen: [C: 03+1] frack: add missing asset tag management records [dns] - 10https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [19:35:00] (03PS1) 10Volans: Setup default permissions [software/netbox-extras] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/554598 [19:35:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:35:56] (03PS1) 10Krinkle: mediawiki: Add reqId/file/line to fatal-error.php top-level message [puppet] - 10https://gerrit.wikimedia.org/r/554599 [19:37:39] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Add reqId/file/line to fatal-error.php top-level message [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [19:38:28] !log milimetric@deploy1001 Started deploy [analytics/refinery@c8de2ab]: Weekly train deploy [19:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:49] !log milimetric@deploy1001 deploy aborted: Weekly train deploy (duration: 00m 21s) [19:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:10] (03PS2) 10Krinkle: mediawiki: Add reqId/file/line to fatal-error.php top-level message [puppet] - 10https://gerrit.wikimedia.org/r/554599 [19:43:01] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy phabricator to phab2001.codfw.wmnet [19:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:32] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy phabricator to phab2001.codfw.wmnet (duration: 00m 31s) [19:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:46:27] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to 
stretch / buster - https://phabricator.wikimedia.org/T190568 (10mmodell) [19:46:35] 10Operations, 10Phabricator: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129 (10mmodell) [19:46:37] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10mmodell) [19:48:08] (03CR) 10Cwhite: [C: 03+1] "LGTM assuming these changes have been tested outside of production (e.g. deployment-prep)." [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [19:49:19] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:49:46] ^ that was me, dns1002 ethernet card blipped link :/ [19:50:01] on wikitech visual editor: Error loading data from server: apierror-visualeditor-docserver-http: Error contacting the Parsoid/RESTBase server (HTTP 404). Would you like to retry? [19:50:20] 404 from restbase? [19:50:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:50:46] 10Operations, 10CX-cxserver, 10Core Platform Team, 10Mobile-Content-Service, and 3 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10WDoranWMF) [19:53:39] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:54:29] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [19:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:24] 10Operations, 10CX-cxserver, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Anomie) [19:56:40] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] brennen and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191204T2000). [20:00:24] yay! 
[20:00:54] brennen: how is it going with the train, sorry I haven't been a very good caboose captain this week, but I'm here if you need me [20:01:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:02:00] lovely, error rate is way up [20:02:25] (03CR) 10Krinkle: "They have not. Do we run this in beta cluster with the same configuration?" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [20:03:28] twentyafterfour: we held for thanksgiving, so rolling wmf.8 -> group1 will be my first action for this week [20:04:49] i think things look basically status quo in errors, apart from a lot of "PHP Warning: Debugging T229407", which i believe is an intended signal. [20:04:49] T229407: Spikes in DB traffic and rows/s reads when reading from new terms store - https://phabricator.wikimedia.org/T229407 [20:06:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:06:53] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554608 [20:06:55] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554608 (owner: 10Brennen Bearnes) [20:07:55] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554608 (owner: 10Brennen Bearnes) [20:09:03] (03PS1) 10Dzahn: phabricator: simplify rsync setup, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/554610 [20:09:08] !log milimetric@deploy1001 Started deploy [analytics/refinery@fc710ec]: Weekly train deploy [20:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:12] (03CR) 10Krinkle: "I've cherry-picked it but it looks like Logstash has been broken for at least 7 days there. 
https://logstash-beta.wmflabs.org/app/kibana#/" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [20:11:01] (03CR) 10jerkins-bot: [V: 04-1] phabricator: simplify rsync setup, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/554610 (owner: 10Dzahn) [20:11:42] Getting an error: 'The storage backend "global-multiwrite" is currently read-only' [20:12:28] (03PS2) 10Dzahn: phabricator: simplify rsync setup, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/554610 [20:12:42] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.8 [20:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:29] (03PS3) 10Dzahn: phabricator: simplify rsync setup, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/554610 (https://phabricator.wikimedia.org/T238956) [20:14:12] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.8 (duration: 01m 29s) [20:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:19] (03CR) 10jerkins-bot: [V: 04-1] phabricator: simplify rsync setup, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/554610 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [20:15:45] !log milimetric@deploy1001 Finished deploy [analytics/refinery@fc710ec]: Weekly train deploy (duration: 06m 37s) [20:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:49] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [20:15:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash 
job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:16:03] (03PS4) 10Dzahn: phabricator: simplify rsync setup, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/554610 (https://phabricator.wikimedia.org/T238956) [20:16:53] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:17:40] 10Operations, 10observability, 10Patch-For-Review: StatsD Exporter does not relay dropped metrics - https://phabricator.wikimedia.org/T239833 (10colewhite) [20:17:54] 10Operations, 10observability, 10Patch-For-Review: StatsD Exporter does not relay dropped metrics - https://phabricator.wikimedia.org/T239833 (10colewhite) p:05Triage→03Normal [20:18:14] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:18:34] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:18:40] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph 
from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [20:18:52] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:18:56] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:18:58] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 138.3 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [20:19:00] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:19:11] (03CR) 1020after4: [C: 03+1] "+1 for simplification" [puppet] - 10https://gerrit.wikimedia.org/r/554610 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [20:19:22] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:19:53] hrm [20:19:54] PROBLEM - recommendation_api endpoints health on scb2006 is 
CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [20:19:54] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:20:00] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:20:28] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:21:30] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.7167 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:21:31] !log milimetric@deploy1001 Started deploy [analytics/refinery@fc710ec]: Weekly train deploy [20:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:36] (03CR) 10Dzahn: [C: 03+2] "http://puppet-compiler.wmflabs.org/19795/" [puppet] - 10https://gerrit.wikimedia.org/r/554610 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [20:21:38] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:21:41] Don't know what's going on, but this looks pretty bad: [20:21:43] Cannot access the database: Unknown error (10.64.48.153) [20:21:48] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:21:54] happening frequently and continues until now [20:22:02] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:22:06] now affects wmf.5 as well which is odd. [20:22:12] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:28] Krinkle: it spiked quite a bit and then seemed to tail off? [20:22:44] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:23:03] shall i roll wmf.8 back? [20:23:58] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:24:13] brennen: maybe roll back just to verify that it isn't related to the branch [20:24:22] twentyafterfour: ack, doing so. 
[20:24:30] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:24:36] ugh [20:24:44] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:24:54] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:24:55] yeah so definitely roll back [20:25:12] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:22] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:32] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:33] I was getting errors from parsoid on wikitech even before the deployment of train [20:25:34] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:25:42] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:42] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:46] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [20:25:54] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [20:26:00] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:26:02] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was rece [20:26:02] tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:26:08] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:26:24] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:26:36] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:26:40] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:26:40] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:26:44] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:26:44] I also see numerous other exceptions that probably need investigating [20:26:55] XegTJApAIDYAAIVbwqwAAADC] /w/api.php MediaWiki\Extension\MachineVision\MachineVisionEntitySaveException from line 150 of /srv/mediawiki/php-1.35.0-wmf.8/extensions/MachineVision/src/Handler/WikidataDepictsSetter.php [20:26:58] that's a new one [20:27:10] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:27:50] PROBLEM - restbase endpoints health on 
restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [20:27:50] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:27:52] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (R [20:27:52] d feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:28:27] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.35.0-wmf.5" [20:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:28:36] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:28:41] !log milimetric@deploy1001 Finished deploy [analytics/refinery@fc710ec]: Weekly train deploy (duration: 07m 09s) [20:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:47] (03PS1) 10BBlack: hardcode statsd IP via profile::base [puppet] 
- 10https://gerrit.wikimedia.org/r/554618 [20:28:56] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:28:58] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:29:04] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:10] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:16] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.598 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:29:26] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:26] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:28] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:28] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:34] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:38] just realized i typoed the version in my revert commit. [20:29:40] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was rece [20:29:40] tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:29:42] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:29:55] (er, sorry, misreading things, disregard that.) [20:30:12] so the recovery seems to indicate that wmf.8 is indeed broken [20:30:22] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [20:30:36] (03CR) 10jerkins-bot: [V: 04-1] hardcode statsd IP via profile::base [puppet] - 10https://gerrit.wikimedia.org/r/554618 (owner: 10BBlack) [20:31:08] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:31:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:31:22] (03PS2) 10BBlack: hardcode statsd IP via profile::base [puppet] - 10https://gerrit.wikimedia.org/r/554618 [20:31:33] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [20:31:36] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:31:36] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:14] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:32:50] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.35.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554619 [20:32:52] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.35.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554619 (owner: 10Brennen Bearnes) [20:33:16] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:37] (03CR) 10BBlack: [C: 03+2] hardcode statsd IP via profile::base [puppet] - 10https://gerrit.wikimedia.org/r/554618 (owner: 10BBlack) [20:33:42] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:59] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554619 (owner: 10Brennen Bearnes) [20:34:59] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:06] it looks like a lot of 
errors happening in parsoid which is now deployed by a separate scap deploy from the train? [20:35:24] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:35:26] like I see errors in /srv/deployment/parsoid/deploy-cache/revs/743efb032da50284c64698c114b23c91f411825f/src/src/Wt2Html/Grammar.php:7346 [20:35:52] * twentyafterfour doesn't know about the recommendation_api endpoints, seems to be flapping a lot and maybe unrelated to the train [20:36:05] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:36:07] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:03] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5458 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:37:29] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:30] what is wtp1039, lots of memory errors there [20:37:41] Probably. It's only used on Commons, so the train group1 change may well have altered it. [20:37:48] wtp1039 is one of the Parsoid boxes. 
[20:37:51] wtp is parsoid [20:38:15] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:38:16] ok so we've got parsoid on dedicated servers but we've switched to the php version now fully? [20:38:27] subbu/cscott: You around to confirm if we should worry about the Parsoid issues? [20:38:33] twentyafterfour: Not fully. [20:38:44] twentyafterfour: parsoid/PHP runs on the appservers [20:39:18] so wtp1NNNN is the old parsoid? [20:39:22] yes [20:39:25] parsoid/js [20:39:58] I don't remember seeing that in logstash fatalmonitor before. odd [20:40:01] So the PHP errors coming from wtp1039 are… test runs? [20:40:14] lots of Allowed memory size of 796917760 bytes exhausted (tried to allocate 20480 bytes) from parsoid servers [20:40:27] Yeah, ignore the parsoid servers for the train. [20:40:29] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:40:37] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:40:46] twentyafterfour: i should have said "wtp servers became like appservers" sorry [20:40:53] hard to ignore all those alerts [20:40:55] mediawiki stuff was added to them [20:40:59] Oooh, so those are real? [20:41:15] Yeah, this level of noise is not really acceptable. [20:41:41] something sure happened with the new branch, a lot of shit failed [20:41:45] or at least alerted [20:42:07] also, are we not doing a december code freeze? [20:42:27] twentyafterfour: Yes, read the other channel where you asked. On 23 Dec. [20:42:39] James_F: yes, they are. what happens is that a wtp server can either $use_php or not. 
and if it does..then it has mediawiki classes like an mw server..on wtp [20:43:03] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:43:06] mutante: Which MW classes? Are wtp* hosts now part of scap's deploy targets? [20:43:11] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:36] If yes, we need to monitor them closely. If not, how are they not going to cause production outages when MW changes? [20:43:39] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:51] James_F: mediawiki::scap_proxy, mediawiki::common, ::nutcracker, ::mcrouter, mediawiki::php, mediawiki::webserver ... [20:43:53] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:59] Oy. [20:44:09] James_F: looks like they deploy parsoid with a separate scap3 deployment in /srv/deployment/parsoid [20:44:15] In that case, yeah, all these errors are train-stoppers. [20:44:35] James_F: yes. {'cluster': 'parsoid', 'service': 'parsoid-php'} is in "mediawiki-installation" [20:44:43] what was formerly dsh group [20:44:45] Right. [20:45:05] I think anything that's triggering multiple icinga alerts is a train stopper even if it's a false alarm [20:45:30] But none of the wtp* servers are in the logstash dashboards that I looked at… [20:45:37] (blocks until the monitoring can be fixed) [20:45:44] Ah, it's in FFM. 
[20:46:00] James_F: they show up in my dashboard for whatever reason: https://logstash.wikimedia.org/app/kibana#/dashboard/77cc3e90-aa27-11e7-9109-51bd3197f7a9?_g=h@97fe121&_a=h@7a17cb3 [20:46:43] yeah, i was primarily watching mediawiki-new-errors and fatal monitor. [20:46:45] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:46:48] the cluster "parsoid" has 2 services, "parsoid" and "parsoid-php" [20:47:09] !log milimetric@deploy1001 Started deploy [analytics/refinery@fc710ec] (thin): Weekly train deploy to labs/notebooks [20:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:15] !log milimetric@deploy1001 Finished deploy [analytics/refinery@fc710ec] (thin): Weekly train deploy to labs/notebooks (duration: 00m 07s) [20:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:31] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:47:53] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:48:17] James_F, i just happened to jump in now. what is the issue? [20:48:29] i can read backlog .. but figured you might have a tldr [20:48:48] subbu: Huge amount of logspam from parsoid. [20:49:01] And it went into meltdown when the train tried to roll forward. [20:49:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:49:34] ok .. let me look in logstash. 
[20:49:37] and even with a rollback we still have a high error rate [20:50:11] parsoid does have a lot of ooms and timeouts since we deployed, yes. but, i can check if it spiked since the last hour. [20:50:50] fatalmonitor errors have been up since around 4:52 (~4 hours ago) [20:51:11] 16:52 that is [20:51:19] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:51:43] and I don't know what's up with the recommendation_api flapping [20:52:13] twentyafterfour: that one is not new. https://phabricator.wikimedia.org/T178445 [20:52:43] mutante: so we can safely disregard for purposes of train? [20:52:55] brennen: yes, i think so for "recommendation_api" things [20:52:57] Yeah, the scb stuff you can ignore. [20:52:57] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:53:00] looking at logstash, i don't see anything newly alarming. known ooms and timeouts. there has been a spike in timeouts over the last 24 hours .. and T239806 tracks that. [20:53:00] T239806: Parsoid/PHP errors - https://phabricator.wikimedia.org/T239806 [20:53:18] subbu: For anything else, we'd hold the train until this was fixed… [20:53:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:54:06] But is this "good enough"? If so, can we downgrade the errors to the level we actually treat them as? [20:54:09] understood. the ooms are just a transfer over from parsoid/js plus the fact that parsoid/php has lower memory limits (700M vs 2g) and lower timeouts (60s vs 120s). [20:54:19] And if it's not, what can we do to fix it? 
[20:54:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received daniel_zahn https://phabricator.wikimedia.org/T178445 https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:54:37] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received daniel_zahn https://phabricator.wikimedia.org/T178445 https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:54:39] these are exceptions raised outside the parsoid code. [20:54:45] Yeah, but the OOMs weren't PHP-land beforehand and so didn't page people. :-) [20:54:48] so, recommendations welcome for how to downgrade them. [20:55:25] subbu: maybe raise the memory limits instead? [20:55:33] and timeouts? [20:55:36] I don't know [20:55:38] we've been gradually raising it. but, we will still have ooms independent of that. [20:55:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:55:59] let me rephrase. [20:56:11] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:56:28] so we don't know the root cause of the OOMs? are they really non-critical when they happen? 
We can filter out ooms on those hosts in the dashboard but that seems like it's just papering over a problem [20:56:56] we've raised it once and will raise it again a few times, but i don't anticipate that to lower the oom volume substantially .. so, we'll need to investigate them separately .. tracked in T236833 [20:56:57] T236833: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 [20:57:10] twentyafterfour, yes .. parsoid/php processes every single edit event like parsoid/js. [20:57:34] they could lead to edit failures if someone tries to edit those pages .. but those are usually large pages and/or pages that will never be opened in VE. [20:57:36] twentyafterfour: AIUI fundamentally the problem is content that is too large. PEBCAK etc. [20:58:02] as part of our integration with the core parser, we'll investigate how to tackle these sensibly. [20:58:10] but, nothing that can be fixed on a short timeline. [20:58:29] My preference is to throttle down the wikitext maximum from 2MiB to 100KiB. [20:58:31] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:58:54] James_F, templates .. anyway, that is a different discussion. [20:59:11] but, for now, if there are ways we can lower the criticality of these logstash events, we will. [20:59:35] ideas / recommendations welcome since i don't know how to .. but i'll raise it with my team as well. [21:00:05] cscott, arlolra, subbu, halfak, and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191204T2100). [21:00:14] Perfect timing. 
[21:00:15] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:00:16] schedules downtime for the recommendation_api alerts to reduce noise [21:00:19] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:00:57] Content that is too large *to edit in visual editor*? Shouldn't there be something to detect that and prevent people from even loading VE on huge articles? [21:01:33] we see so many OOMs on a regular basis, the resources used just attempting to handle those requests must be something more than trivial [21:01:49] (03PS1) 10Dzahn: phabricator: simplify rsync setup for migration, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) [21:01:54] an ounce of prevention .... [21:02:18] twentyafterfour, parsoid pre-renders *all* pages on edit events. [21:02:23] so, this is not a ve-triggered event. [21:02:40] interesting [21:02:45] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:02:47] * twentyafterfour doesn't know how this works [21:02:53] subbu: thanks for the clarity [21:03:11] so it pre-renders the pages just to warm the cache for future views? [21:03:19] stored in restbase, yes. [21:03:19] what happens when it ooms? [21:03:35] how do huge pages ever get rendered at all? 
[21:03:37] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:50] it doesn't get stored there .. and if someone tries to edit the page or view the page, it fails to render, yes. [21:03:57] but, parsoid isn't used for reads right now. [21:04:01] (03CR) 10jerkins-bot: [V: 04-1] phabricator: simplify rsync setup for migration, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [21:04:28] anyway, i think for now, let us figure out practical steps for how to reduce criticality of these events. [21:04:29] so some pages exist that are simply too big to render? And the only fix is to break them up into smaller wikipages? [21:04:47] Pretty much [21:04:51] i don't think we can solve the bigger problem here now. [21:04:54] Or a lot more memory [21:04:56] sorry if my ignorance is distracting from the real problem at hand just trying to understand the whole picture [21:05:02] for procedural clarity: the train has been blocked for > 30 min; i'd like to have that clearly indicated in phabricator until we have this resolved but i'm not sure: a) whether T239806 models this correctly and b) whether the parsoid errors are the only blocker. [21:05:03] T239806: Parsoid/PHP errors - https://phabricator.wikimedia.org/T239806 [21:05:04] Reedy: nice :) [21:05:05] twentyafterfour: yes, for example "Syrian war template" needs > 2360 MB [21:05:07] or improving parsoid perforamnce, etc. 
[21:05:17] Wikidata items getting too big caused problems too at one point [21:05:24] So much so some were made uneditable :D [21:05:29] I think that issue is solved now though [21:05:33] brennen: I'm not sure that's the only blocker either [21:05:38] mutante: wow [21:05:52] twentyafterfour: https://phabricator.wikimedia.org/T236833#5617915 [21:06:05] * twentyafterfour feels like this might be a giant rabbit hole [21:06:17] It is a bit [21:06:28] For a while I was upping $wgMemoryLimit a few times a year... [21:06:34] Krinkle mentioned a MediaWiki\Extension\MachineVision\MachineVisionEntitySaveException although i'm not sure that wasn't cropping up in wmf.5; i'll investigate. [21:06:38] i think this other discussion is outside the scope of the deployment and yes, i think we should focus on what is needed to make sure deployment problems aren't obscured by parsoid's logspam. [21:07:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:08:05] seems like y'all might already be discussing it, but I'm getting a Parsoid/RestBase 404 when trying to use VE to edit a page on Wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history?veaction=edit [21:08:34] neilpquinn: unfortunately that is a separate issue from the others we are currently tackling [21:08:36] (03PS1) 10Mholloway: MachineVision: Update Beta settings to (mostly) match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554630 (https://phabricator.wikimedia.org/T230813) [21:08:45] neilpquinn: but you aren't alone, I had the same problem just a few minutes ago [21:08:56] https://usercontent.irccloud-cdn.com/file/ujy1Mj5l/Parsoid%2FRestBase%20404 [21:09:03] it may be related but it happened before the train rolled today [21:09:07] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime [21:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:34] (03PS2) 10Dzahn: phabricator: simplify rsync setup for migration, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) [21:10:07] twentyafterfour: okay, thank you for the details. It's not a critical problem :) [21:10:23] neilpquinn: looks like this is the task: https://phabricator.wikimedia.org/T236998 [21:11:12] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:24] That's about the Graph PNGs, not Parsoid HTML. Different APIs. 
[21:11:36] Unlikely to be due to the same cause, unless hte cause is Restbase overall not working for wikitech [21:12:20] (03PS3) 10Dzahn: phabricator: reduce confusion with multiple Hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) [21:12:30] neilpquinn: sorry for the misdirect :) [21:12:37] Krinkle: the task seems to say that restbase overall isn't available on wikitech [21:12:39] (03CR) 10Herron: "> I've cherry-picked it but it looks like Logstash has been broken" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [21:13:17] twentyafterfour: Yes, using /rest_v1 isn't available on wikitech, but htat's been the case for years. [21:13:20] VE worked until recently [21:13:26] twentyafterfour, James_F i am supposedly away :-) and would like to head out again ... but wanted to check if you are happy with the current resolution. [21:13:30] because it's not using Restbase presumably but Parsoid directly [21:13:35] and/or because we made Restbase work since then [21:13:36] rlazarus: no problem...this is the right place to report it in any case :) [21:13:47] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5042 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:14:02] or if you need anything you would like us to tackle immediately vs. flagging the logspam issue for us to address in some fashion in the coming weeks. [21:14:07] twentyafterfour's call. [21:14:24] Or rather, brennen and twentyafterfour's. [21:14:37] subbu: so the solution immediately is to filter the errors and move the train forward? [21:14:46] what about the icinga alerts? [21:14:49] yes, for today. [21:15:04] yes for *this week*. 
[21:15:12] (03PS4) 10Dzahn: phabricator: simplify rsync setup for migration, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) [21:15:16] as for icinga alerts .. i am less familiar with them. [21:15:31] so i cannot answer authoritatively there. [21:15:37] I'm fine with filtering in logstash but icinga alerts seem like a much bigger issue and that's SRE's call not mine to make [21:15:51] mutante, ^ [21:16:18] shdubsh is clinic duty, does that mean it's their call? [21:16:55] m_utante is busy working on phabricator stuff and I have to be helping him with that shortly [21:17:16] what's up? [21:17:26] since phabricator is in a degraded state and we need to restore it today asap [21:17:36] * shdubsh returns from regex land [21:17:40] shdubsh what's up is rolling the train forward triggered a bunch of icinga alerts [21:18:09] and subbu says we can ignore them and move forward with the train [21:18:40] but I don't know who decides the policy on silencing a bunch of icinga alerts for the rest of the week [21:19:00] * cscott checks in [21:19:01] or whether we should just hold the train until we can sort it all out [21:19:35] * shdubsh checking [21:20:08] I really don't think we can just silence the alerts because some of them are _mediawiki_ alerts not just restbase alerts [21:20:30] My gut feeling is that it's OK to ACK them for a week given we know about them and people are working on fixing them, but… [21:20:41] also the rate of errors was causing logstash to choke so that's also a problem [21:21:06] James_F: yeah but I'm worried about knock-on effects of the errors, like possibly killing logstash [21:21:22] Totally. 
[21:21:56] not to mention a high noise floor in general makes it harder to detect other errors or debug them when they are detected [21:22:20] killing logstash would be bad on multiple levels [21:22:23] we were also seeing a fair bit of `Cannot access the database: Unknown error (10.64.48.153)` a while ago; that was coming from wmf.5 code as well. i'm not clear whether it has anything to do with wmf.8. [21:22:24] subbu: re OOM being acceptable for parsing, is there not a way we can deterministically restrict parsing in a way that will fail due to those restrictions instead of letting it OOM? E.g. tune the maximum size of the wikitext post-template expansion, or depth, or template count, whichever is triggering it now. Making it fail based on how many bytes are used this week for a Title object (for example) is bad for users and developers [21:22:24] alike. And either way OOM means uncacheable content and HTTP 500 prompts. Which may be fine toward RESTBase, but not toward VE and readers (in the future), but that's a bit more long-term. [21:22:27] honestly if it were my call I'd say block the train [21:22:45] brennen: Just a load issue, maybe? [21:22:49] Krinkle: thanks, that's what I was trying to say earlier [21:23:27] also parsoid-php is configured to use a different syslog/type (it uses 'parsoid-php' instead of 'mediawiki'). [21:23:33] James_F: plausible - it seemed to spike and dissipate. [21:23:34] Are its OOMs showing up under type:mediawiki? [21:23:41] twentyafterfour, Krinkle, Parsoid output isn't used for read views. and this isn't substantially different from how it is with Parsoid/JS. [21:24:04] subbu: yeah, it's only different in that it now falls under a category of failures we care more about (prod MW) [21:24:11] which we can't ignore wholesale. [21:24:14] subbu: the difference is the way errors are logged, mostly. But even so it's a difference that has consequences for logstash and icinga [21:24:16] understood.
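(Editor's sketch, not from the channel.) The "fail deterministically instead of OOM" idea raised at 21:22:24 could look roughly like the following. This is only an illustration under stated assumptions: the names `ParseLimits`, `ParseLimitExceeded` and `check_limits` are hypothetical, the threshold values are placeholders (only the 2 MiB wikitext figure was mentioned in the discussion), and nothing here reflects how Parsoid or the legacy parser actually measure these quantities.

```python
from dataclasses import dataclass


class ParseLimitExceeded(Exception):
    """Raised before parsing begins, so the same input always fails the same way."""


@dataclass(frozen=True)
class ParseLimits:
    # Hypothetical thresholds, for illustration only.
    max_expanded_bytes: int = 2 * 1024 * 1024  # size after template expansion
    max_template_count: int = 10_000
    max_depth: int = 40


def check_limits(expanded_bytes: int, template_count: int, depth: int,
                 limits: ParseLimits = ParseLimits()) -> None:
    """Reject oversized input up front instead of running out of memory later."""
    if expanded_bytes > limits.max_expanded_bytes:
        raise ParseLimitExceeded(
            f"expanded size {expanded_bytes} > {limits.max_expanded_bytes} bytes")
    if template_count > limits.max_template_count:
        raise ParseLimitExceeded(
            f"template count {template_count} > {limits.max_template_count}")
    if depth > limits.max_depth:
        raise ParseLimitExceeded(f"nesting depth {depth} > {limits.max_depth}")
```

The point of the design is that the failure depends only on properties of the content, not on how much memory a given release happens to use, so the error can be reported (and cached) as a content problem rather than surfacing as a fatal.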
[21:24:17] Krinkle: I see OOMs and timeouts in mediawiki fatal log [21:24:32] that is all i am saying .. let usfigure out how to deal with the logspam in a way that doesn't obscure other issues. [21:24:41] shdubsh: that's always the case to a certain extent [21:24:45] we cannot solve the oom and large page and timeout problem in a week. [21:24:57] that is a 18 month project as part of the integration work. [21:25:02] Krinkle: the problem with deterministically restricting parsing is that the legacy parser and parsoid measure things in different ways. so there are pages which choke the legacy parser which don't choke parsoid, and vice-versa. [21:25:17] In my main dash, we've got 1,200 OOM errors from Parsoid in the past 15 mins; normally we'd not deploy further if there were > 20 errors total. [21:25:33] let us not get sidetracked by that unrelated problem here. [21:25:39] subbu: right, but if you are on vacation and we don't have anyone who can work on the logspam issue we probably should just block the train [21:26:04] ^ what James_F said [21:26:16] So the good thing is that because Parsoid has its own special memory threshold, it has its own OOM message, so we "can" just ignore it trivially. [21:26:32] 100 a minute is a lot but won't risk logstash. [21:26:33] I hven't yeard heard a "no" against fixing the logspam, subbu said no about fixing the root cause of the error itself [21:26:37] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2266.codfw.wmnet', 'mw2265.codfw.wmnet', 'mw2264.codfw.wmnet', 'mw2263.codfw.wmnet'] ` and were **ALL** successful. [21:26:41] PROBLEM - Blazegraph process -wdqs-categories- on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:26:43] I think making sure we don't classify them as 'mediawiki' is probably doable. 
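(Editor's sketch, not from the channel.) The deploy rule of thumb quoted at 21:25:17 ("> 20 errors total") combined with the proposal to stop counting Parsoid's own errors as 'mediawiki' can be illustrated as below. This is a minimal sketch assuming each log event is a dict with a `type` field; the function names and the event shape are hypothetical and do not describe the real dashboard or Scap logic.

```python
ERROR_THRESHOLD = 20             # the "> 20 errors total" figure from the discussion
IGNORED_TYPES = {"parsoid-php"}  # known noise, tracked separately


def blocking_error_count(events, ignored_types=IGNORED_TYPES):
    """Count only events whose type is not in the known-noise set."""
    return sum(1 for event in events if event.get("type") not in ignored_types)


def should_hold_train(events, threshold=ERROR_THRESHOLD):
    """True when the non-ignored error count exceeds the deploy threshold."""
    return blocking_error_count(events) > threshold
```

With this shape, 1,200 correctly typed Parsoid OOMs would not hold a deploy, but the same OOMs mislabelled as type 'mediawiki' would, which is the misclassification being discussed.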
[21:26:47] Krinkle: You heard a 'how?'. [21:26:51] Krinkle: right [21:26:56] it is tagged with a _type:'parsoid-php' [21:26:59] PROBLEM - puppet last run on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:27:05] I believe that was the intent and there is code for it already thats to do this already, which I guess is either broken or insufficient [21:27:11] 10Operations: unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10CDanis) [21:27:12] (arlolra and i are online to work on the issue) [21:27:15] PROBLEM - MD RAID on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:27:21] PROBLEM - Check size of conntrack table on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:27:23] i am inclined to block until the logspam situation is managed. 
[21:27:27] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:27:33] PROBLEM - Query Service HTTP Port on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:27:35] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:27:41] filtering the logspam in kibana solves part of it but we had a ton of icinga alerts which were not all parsoid or restbase [21:27:43] PROBLEM - DPKG on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:27:47] They aren't type:mediawiki. [21:27:47] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:28:21] so subbu I guess you're off the hook? thanks cscott [21:28:27] :) [21:28:28] sure. [21:28:30] TBF we already mute PHP memory limit in the filtered fatal monitor. [21:28:42] So filtering the Parsoid one is only not happening because of the different threshold. [21:28:46] right, I'm not at all against filtering [21:28:53] 10Operations: unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10Krinkle) If I recall correctly, HHVM had a dns cache. This is among the reasons that, over the years, we gradually adopted more use of hostnames in wmf-config for services instead of hardc... 
[21:28:58] (03PS1) 10CDanis: statsd: document Puppet /etc/hosts-ification [dns] - 10https://gerrit.wikimedia.org/r/554631 (https://phabricator.wikimedia.org/T239862) [21:28:59] i can be off the hook, but i just want there to be realism about what is actually solvable. :) [21:29:01] Is there an issue beyond the OOMs? [21:29:04] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10Krinkle) [21:29:07] as long as it isn't gonna kill logstash or make sre mad then I'm ok with it [21:29:13] If not, let's roll the train. [21:29:20] subbu: ack [21:29:42] The risk of not rolling (and making next week a five-weeks-of-dev deploy) is worse. [21:30:01] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:30:09] James_F: hard to know if there is an issue beyond the ooms until we try to roll again I guess. I got the impression there were other problems and it's hard to sort them out [21:30:20] Yeah. [21:30:30] subbu: James_F: wmf-config changes the logstash/type value by server name for parsoid-php vs 'mediawiki' [21:30:43] so I'm trying to figure out exactly which icinga alerts we need to look at [21:30:46] Krinkle: What? [21:31:15] (03PS1) 10CDanis: base: document statsd DNS kludge [puppet] - 10https://gerrit.wikimedia.org/r/554632 (https://phabricator.wikimedia.org/T239862) [21:31:37] it seems this is not sufficiently working because 1) there is also the raw php-fpm/syslog reporting the same OOMs outside MW as type:syslog/program:php72-fpm with no mention of MW which Icgina listens to and 2) there is php-wmerrors/fatala-errors.php which is hardcoded to type:mediawiki which Scap listens to [21:32:01] the latter is easily fixed to use the same varying 'type' value as we do in wmf-config already. 
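(Editor's sketch, not from the channel.) The "same varying 'type' value" described at 21:30:30 and 21:32:01 amounts to choosing the log type from the server's identity instead of hardcoding 'mediawiki'. A rough Python illustration follows; the real logic is PHP in wmf-config/logging.php and php7-fatal-error.php and may differ. The `wtp` hostname prefix and the 'parsoid' server group are the identifiers mentioned in this channel; the function name and signature are hypothetical.

```python
from typing import Optional

PARSOID_HOST_PREFIX = "wtp"   # Parsoid hosts are named wtpNNNN per the discussion
PARSOID_SERVERGROUP = "parsoid"


def log_type_for_host(hostname: str, servergroup: Optional[str] = None) -> str:
    """Return the Logstash 'type' a fatal from this server should carry."""
    if servergroup == PARSOID_SERVERGROUP:
        return "parsoid-php"
    if hostname.startswith(PARSOID_HOST_PREFIX):
        return "parsoid-php"
    return "mediawiki"
```

Checking the server group first and falling back to the hostname prefix mirrors the two approaches floated in review; which one is reliable in beta is left open in the conversation.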
[21:32:08] I looked at the logstash graphs a bit and there is some headroom on the logstash cluster. If the +100/min is an accurate estimate, logstash can handle it. [21:32:10] That was done by subbu on purpose so as to not spam mediawiki logs [21:32:29] It just didn't cover the other two ways in which fatals are logged. [21:32:51] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) ` Virtual Disk 0: RAID10, 5.237TB, Ready [21:33:14] Krinkle: for #1, is the hostname part of the log? [21:33:23] since parsoid has its own cluster, it may be possible to filter that way? [21:33:24] but all other exception/trace/debug/info/warning messages from MW from parsoid-php hosts are already categorised correctly as type:parsoid-php and not spamming the logs we monitor for deployments. [21:33:45] Yes, for #1 we'll need to manually filter out by host afterwards because we have no runtime influence on how fpm logs to syslog [21:33:53] for #2 we edit php7-fatal-errors.php in puppet. [21:34:06] so there are a ton of timeouts also right after the branch was deployed [21:34:24] cscott: yes the hostname could be used to filter [21:34:36] For #1 we'd need to alter the alert query that Icinga uses. Not sure exactly where that is. It might actually be querying Graphite instead of Logstash, in which case #2 suffices since that also is responsible for the Graphite metric.
[21:34:39] PROBLEM - Disk space on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1010&var-datasource=eqiad+prometheus/ops [21:35:32] I happen to have written about the three ways in which PHP fatals are logged earlier today at: [21:35:32] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554599/2//COMMIT_MSG [21:35:47] https://github.com/wikimedia/operations-mediawiki-config/blob/1782ce8ca17b7dbe424dea68079fb0d6d5d258b7/wmf-config/logging.php#L211 [21:36:15] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10BBlack) In general, usually applayer DNS caching is a Bad Idea unless it's done very carefully (e.g. cap it at something like 5s max, or... [21:36:32] https://github.com/wikimedia/puppet/blob/64dbfd86abc67d78e1fe6dca995973545f25b1b9/modules/profile/files/mediawiki/php/php7-fatal-error.php#L52 [21:37:45] https://github.com/wikimedia/puppet/blob/e10b88745ca02d5c459f26b09e48065b1128d25e/modules/profile/manifests/mediawiki/alerts.pp#L64 [21:38:38] cscott, arlolra thanks for taking this over from me .. :) I'll hang around in the background for the next hour .. feel free to ping me if anything is needed. 
[21:38:43] !log rzl@cumin1001 conftool action : set/pooled=yes; selector: name=mw226[3456]\.codfw\.wmnet [21:38:43] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [21:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:26] Looks like the Icginga alert was already fixed a while ago to not consider the raw syslog/php-fpm messages [21:39:34] so we only need to fix the fatals under type:mediawiki [21:40:24] cscott: want to give that a shot? It's not a hard patch, but for the sake of transferring knowledge, could you give it a try / can do together. [21:41:00] let me take a look [21:41:06] (03CR) 10CDanis: [C: 03+2] base: document statsd DNS kludge [puppet] - 10https://gerrit.wikimedia.org/r/554632 (https://phabricator.wikimedia.org/T239862) (owner: 10CDanis) [21:41:51] (03CR) 10CDanis: [C: 03+2] statsd: document Puppet /etc/hosts-ification [dns] - 10https://gerrit.wikimedia.org/r/554631 (https://phabricator.wikimedia.org/T239862) (owner: 10CDanis) [21:42:22] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10RLazarus) [21:42:49] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:42:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:43:55] PROBLEM - configured eth on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:44:05] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:44:39] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:45:45] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:46:07] PROBLEM - dhclient process on wdqs1010 is CRITICAL: connect to address 10.64.32.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:46:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:46:26] Krinkle: i'm a bit lost [21:46:44] is the alerts.pp file the thing icinga is using? [21:47:17] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range vlan-fundraising] member "ge-[0-1]/0/19" { ... } + member "ge-[0-1]/0/20"; [edit interfaces] + ge-0/0/20 { +... [21:47:44] brennen, i would recommend against making T239806 a train blocker .. let us separate immediately unsolvable problems from logspam problems. 
[21:47:45] T239806: Parsoid/PHP errors - https://phabricator.wikimedia.org/T239806 [21:47:47] cscott: Yes, that one queries Logstash (or rather a Prometheus metric that observes Logstash) for type:mediawiki channel:exception/fatal. I thought that that query was also looking at type:syslog/program:php72-fpm, but that was already fixed. So we don't need to alter this query. [21:48:22] Instead we need to fix the second producer of these messages to correctly use type:parsoid-php instead of type:mediawiki. The first producer of those is MW itself which Subbu fixed months ago via wmf-config/logging.php [21:48:31] second is https://github.com/wikimedia/puppet/blob/64dbfd86abc67d78e1fe6dca995973545f25b1b9/modules/profile/files/mediawiki/php/php7-fatal-error.php#L52 ? [21:48:37] That's right [21:48:38] subbu: ack - apologies for phab flailing. [21:48:57] You'll want to base on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554599/ because right now the logic for setting type:mediawiki on that is in logstash-syslog.conf [21:49:03] (03PS5) 10Dzahn: phabricator: simplify rsync setup for migration, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) [21:49:08] this commit moves that to the PHP file for easier maintenance. [21:49:19] 10Operations, 10ops-codfw: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) ` apaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/20 descriptions Interface Admin Link Description ge-0/0/20 up up frdb2002:eth0 ge-1/0/20 up up frdb2002:eth1 [21:49:27] Or edit filter-syslog.conf but I don't know if that is capable of varying by hostname prefix [21:49:30] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10colewhite) @Mstyles is now in the wmf ldap group. Please let me know if you encounter any related issue. 
[21:49:34] and we need something like https://github.com/wikimedia/operations-mediawiki-config/blob/1782ce8ca17b7dbe424dea68079fb0d6d5d258b7/wmf-config/logging.php#L183 ? [21:49:35] shdubsh might know [21:49:40] Krinkle, ty for the log filtering guidance. :) [21:49:40] cscott: Yeah [21:49:44] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10colewhite) 05Open→03Resolved [21:49:47] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Add Maryum to Puppet - https://phabricator.wikimedia.org/T239300 (10colewhite) [21:51:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:52:54] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:54:04] (03PS1) 10C. Scott Ananian: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs [puppet] - 10https://gerrit.wikimedia.org/r/554638 [21:54:18] Krinkle: ^ be gentle I have no idea what I'm doing. 
;) [21:54:36] RECOVERY - DPKG on wdqs1010 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:54:40] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:54:52] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:54:56] RECOVERY - MD RAID on wdqs1010 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:55:04] RECOVERY - Disk space on wdqs1010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1010&var-datasource=eqiad+prometheus/ops [21:55:06] RECOVERY - Check size of conntrack table on wdqs1010 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:55:10] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:55:18] RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1010 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:55:22] also: maybe we should have a separate phab task for this? "parsoid spams logs"? 
[21:55:28] RECOVERY - Query Service HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:55:28] RECOVERY - dhclient process on wdqs1010 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:55:34] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:56:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:56:38] brennen, you can make T239867 a train blocker if you wish instead of T239806 .. T239806 shouldn't be a train blocker. [21:56:39] T239806: Parsoid/PHP errors - https://phabricator.wikimedia.org/T239806 [21:56:39] T239867: Address Parsoid/PHP noise from cluttering mediawiki train deployments - https://phabricator.wikimedia.org/T239867 [21:57:04] (03PS2) 10Krinkle: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs [puppet] - 10https://gerrit.wikimedia.org/r/554638 (owner: 10C. Scott Ananian) [21:57:09] (03CR) 10Krinkle: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (owner: 10C. Scott Ananian) [21:57:44] (03CR) 10Subramanya Sastry: "Bug: T239867" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (owner: 10C. Scott Ananian) [21:57:53] cscott: are there wpt hosts in beta? [21:57:59] (03PS3) 10C. 
Scott Ananian: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) [21:58:32] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/19800/" [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [21:58:42] (03CR) 10Subramanya Sastry: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [21:59:10] (03CR) 10Dzahn: "if ( $_SERVER['SERVERGROUP'] === 'parsoid' ) {" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [21:59:16] RECOVERY - puppet last run on wdqs1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:59:44] Krinkle: we have wtp1025-wtp1048 in eqiad and wtp2001-2020 in codfw; I don't know if anyone else is using that prefix. [21:59:54] (03CR) 10Dzahn: [C: 03+2] "no change: https://puppet-compiler.wmflabs.org/compiler1003/19800/" [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [22:00:11] (03CR) 10Krinkle: [C: 03+1] "Be sure to check if it exists. Example at" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [22:00:12] Krinkle: i think there's only a single parsoid machine in beta [22:00:44] subbu: done; thanks for authoring. [22:01:04] cscott: hm.. well, the existing hostname check doesn't cover it for beta either, so I guess that's for another time. 
[22:01:13] although I imagine the ENV check will work there as well [22:01:21] assuming it has been puppetized in a way that affects beta too [22:01:21] (03CR) 10Subramanya Sastry: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [22:01:30] I don't know whether that apache config applies there or not [22:01:32] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10Jclark-ctr) Disk arrived [22:01:45] shdubsh: btw, did you find something about logstash-beta? [22:02:37] The fix for this issue depends on my patch which we're still testing. Alternatively, maybe there's a way in the logstash-conf syntax to condition by hostname prefix? (If we do it without my patch). I don't know how that works. [22:02:52] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:03:12] Krinkle: haven't looked yet. [22:05:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:05:38] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:05:41] (03PS4) 10C.
Scott Ananian: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) [22:06:56] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:07:08] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:07:09] (03CR) 10Dzahn: "noop on all servers but simpler now. thing was one was "hostname" and one was "fqdn". now just using fqdn for all things." [puppet] - 10https://gerrit.wikimedia.org/r/554628 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [22:08:32] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:08:44] (03PS1) 10C. Scott Ananian: Keep $isParsoidCluster test consistent w/ operations/puppet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) [22:09:00] shdubsh: cscott: Rather than waiting for my refactor, it might be easier to do this directly in the logstash-syslog.conf file for now. You can see where in LHS of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554599/2/modules/profile/files/logstash/filter-syslog.conf. I'm not familiar with the Grok syntax nor which version we use of it, but shdubsh would know I think. [22:09:10] (03PS5) 10C. 
Scott Ananian: Ensure fatal PHP7 errors from the Parsoid cluster don't spam mediawiki logs [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) [22:09:23] It says here it supports regex so presumably something like /^wpt/ could work but not sure which version we use of that thing [22:09:23] https://www.elastic.co/guide/en/logstash/current/config-examples.html [22:09:28] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Jclark-ctr) Dimm has arrived [22:09:52] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for frdb2002 [dns] - 10https://gerrit.wikimedia.org/r/554640 [22:09:52] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1010 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:09:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:10:02] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:10:24] (03CR) 10Jforrester: "Neat!" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [22:10:54] (03CR) 10Krinkle: [C: 03+1] Keep $isParsoidCluster test consistent w/ operations/puppet (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [22:11:32] (03CR) 10C. 
Scott Ananian: Keep $isParsoidCluster test consistent w/ operations/puppet (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [22:11:43] (03CR) 10Jforrester: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [22:12:29] (03PS2) 10C. Scott Ananian: Keep $isParsoidCluster test consistent w/ operations/puppet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) [22:12:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-Logstash: Logstash in Beta Cluster stopped ingesting messages from MediaWiki - https://phabricator.wikimedia.org/T239868 (10Krinkle) [22:12:52] shdubsh: filed now ^ [22:13:01] thanks [22:13:15] 10Operations, 10ops-codfw, 10Patch-For-Review: codfw:rack/setup/install frdb2002 - https://phabricator.wikimedia.org/T239733 (10Papaul) [22:13:18] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:13:32] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-Logstash: Logstash in Beta Cluster stopped ingesting messages from MediaWiki - https://phabricator.wikimedia.org/T239868 (10Krinkle) >>! @colewhite wrote at : > seeing rsyslog complaining about "omkafka: kafka del... [22:14:21] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-Logstash: Logstash in Beta Cluster stopped ingesting messages from MediaWiki - https://phabricator.wikimedia.org/T239868 (10Krinkle) [22:14:33] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-Logstash: Logstash in Beta Cluster stopped ingesting messages from MediaWiki - https://phabricator.wikimedia.org/T239868 (10Jdforrester-WMF) See also {T211984} and {T233134}. 
[22:14:54] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:14:58] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:15:00] (03PS1) 10Hashar: contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) [22:15:04] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:15:32] (03CR) 10Hashar: [C: 04-1] "Being tested on integration-agent-pkgbuilder-1001.integration.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar) [22:16:46] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:16:48] (03CR) 10jerkins-bot: [V: 04-1] contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar) [22:18:16] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:18:22] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:18:36] RECOVERY - configured eth on wdqs1010 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:18:39] (03PS1) 10Dzahn: phabricator: get rid of "failover" server Hiera key, further simplify [puppet] - 10https://gerrit.wikimedia.org/r/554643 (https://phabricator.wikimedia.org/T238956) [22:18:59] (03PS2) 10Hashar: contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) [22:19:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:20:52] (03CR) 10jerkins-bot: [V: 04-1] contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar) [22:21:02] (03CR) 10Krinkle: "Being able to do it this way depends my refactor (parent patch), which we'll want to test in beta first, which in turn is also broken (T23" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [22:21:24] !log T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php cswiki --cutoff 350 [22:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:30] T208369: Welcome survey: anonymize data after one year - https://phabricator.wikimedia.org/T208369 [22:21:35] James_F: as usual, things always break in pairs, or more commonly, triplets. 
[22:21:56] (03CR) 10Dzahn: [C: 03+2] "noop https://puppet-compiler.wmflabs.org/compiler1003/19801/" [puppet] - 10https://gerrit.wikimedia.org/r/554643 (https://phabricator.wikimedia.org/T238956) (owner: 10Dzahn) [22:21:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:23:00] (03PS3) 10Hashar: contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) [22:23:42] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:24:37] !log T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php kowiki --cutoff 350 [22:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:13] (03CR) 10Hashar: [C: 04-1] contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar) [22:26:16] (03PS1) 10Dzahn: phabricator: switch prod server to phab1003, enables dumps and ferm holes [puppet] - 10https://gerrit.wikimedia.org/r/554644 (https://phabricator.wikimedia.org/T238956) [22:27:01] Krinkle: :-) [22:28:16] krinkle is breaking in triplets? [22:28:24] why yes this is how rumours get started... 
[22:33:34] (03PS1) 10Cwhite: profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts [puppet] - 10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) [22:34:30] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:36:21] (03CR) 10Krinkle: [C: 03+1] profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts [puppet] - 10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) (owner: 10Cwhite) [22:37:40] (03PS1) 10Jforrester: Disable VisualEditor on Wikitech (and Labs Wikitech) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554646 [22:37:46] !log poweroff cloudstore1008 for memory module replacement [22:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:38:02] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:38:06] (03CR) 10Krinkle: [C: 03+1] profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) (owner: 10Cwhite) [22:38:29] brennen: If the train is stalled, can I deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/554646 (to disable VE/Parsoid on wikitech)? 
[22:39:32] !log powered off cloudstore1008, disabled sync from cloudstore1009, and downtimed both cloudstore1008 and cloudstore1009 for memory module replacement T239569 [22:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:37] T239569: cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 [22:39:38] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:39:43] James_F: yeah, seems fine. presently just awaiting a resolution on T239867. [22:39:43] T239867: Address Parsoid/PHP noise from cluttering mediawiki train deployments - https://phabricator.wikimedia.org/T239867 [22:39:50] Cool, thanks. [22:39:55] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@e6afe36]: Update mobileapps to 9e9b042 [22:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:07] (03CR) 10Jforrester: [C: 03+2] Disable VisualEditor on Wikitech (and Labs Wikitech) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554646 (owner: 10Jforrester) [22:40:59] (03Merged) 10jenkins-bot: Disable VisualEditor on Wikitech (and Labs Wikitech) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554646 (owner: 10Jforrester) [22:42:44] PROBLEM - Host cloudstore1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:42:59] (03PS2) 10Cwhite: profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts [puppet] - 10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) [22:43:24] (03CR) 10Cwhite: profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) (owner: 10Cwhite) [22:44:46] (03CR) 10Krinkle: [C: 03+1] profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts [puppet] - 
10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) (owner: 10Cwhite) [22:45:03] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Jclark-ctr) Finished replacement of DIMM_B2 [22:45:21] (03CR) 10Cwhite: [C: 03+2] profile: rewrite log type to parsoid-php if php7.2-fpm err from wtp hosts [puppet] - 10https://gerrit.wikimedia.org/r/554645 (https://phabricator.wikimedia.org/T239867) (owner: 10Cwhite) [22:45:43] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@e6afe36]: Update mobileapps to 9e9b042 (duration: 05m 48s) [22:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:36] RECOVERY - Host cloudstore1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [22:49:51] James_F: I've got a beta cluster config change teed up to deploy, too, when you're done. [22:49:54] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable VisualEditor on Wikitech (and Labs Wikitech) (duration: 01m 02s) [22:49:56] (03CR) 10Neil P. Quinn-WMF: [C: 03+1] "😢" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554646 (owner: 10Jforrester) [22:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:36] mdholloway: Just C+2 it, that's fine. [22:50:53] (03CR) 10Mholloway: [C: 03+2] MachineVision: Update Beta settings to (mostly) match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554630 (https://phabricator.wikimedia.org/T230813) (owner: 10Mholloway) [22:51:33] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) >>! In T239151#5711306, @hashar wrote: > The way I understand the message: the virtualization servers in group `row_A` lack free memory to allocate a VM. But maybe another group woul... 
[22:51:40] it's #deploy-every-service o'clock [22:51:40] (03Merged) 10jenkins-bot: MachineVision: Update Beta settings to (mostly) match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554630 (https://phabricator.wikimedia.org/T230813) (owner: 10Mholloway) [22:52:06] brennen: Prod is yours again. [22:53:16] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:53:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:54:15] shdubsh: I can verify in Logstash once it's applied to the relevant hosts. ping me then? [22:54:30] James_F: ack, ty. [22:54:33] Krinkle: it's applied now. [22:55:57] It's looking good to me. No more mediawiki fatals from wtp hosts afaict [22:55:59] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) >>! In T239151#5711346, @MoritzMuehlenhoff wrote: > The old puppetdb hosts (puppetdb1001) should be ready to go away, @jbond merged the patches to stop broadcasting to it last week. I... [22:56:34] https://logstash.wikimedia.org/goto/add6d26f6cd84bd4fa123091114d0a4b [22:56:36] yeah, lgtm. [22:56:41] dropped off 4-5min ago [22:57:31] shdubsh, Krinkle, cscott: Thank you! [22:58:10] rad. all right, once more unto the breach... [22:58:54] Did we want to deploy the other changes? 
[22:58:56] * Krinkle unbreaks puppet master in beta to undo my earlier patch so as to make shdubsh's patch apply cleanly [22:59:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:59:16] * brennen pauses [22:59:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/554599/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/554638/ ? [22:59:41] (And https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/554639 ?) [23:00:46] (03CR) 10Krinkle: [C: 04-1] "This is now superseded by https://gerrit.wikimedia.org/r/554645. I'll make sure that when I rebase my parent patch to use this logic as it" [puppet] - 10https://gerrit.wikimedia.org/r/554638 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [23:00:58] James_F: Not the puppet ones. that's only cleanup/refactor now. [23:01:02] the wmf-config one sure :) [23:01:27] (03CR) 10Jforrester: [C: 03+1] Keep $isParsoidCluster test consistent w/ operations/puppet (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [23:02:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [23:04:39] (03CR) 10Krinkle: "This no longer applies cleanly to latest HEAD, and is the reason beta puppetmaster is now stalled."
[puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [23:06:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:07:32] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [23:11:30] (03PS3) 10Krinkle: mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 [23:11:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:11:55] (03CR) 10Krinkle: [C: 04-1] "Testing this in beta is blocked on T233134." [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [23:16:28] getServerStates: host db1062 is unreachable/ Error connecting to db1062 as user wikiuser: :real_connect(): (HY000/2002): Connection refused [23:16:34] seeing the same issue pop back up again [23:16:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:17:05] but.. 
seems to have happened throughout the non-deployed window just now as well so not train induced I suppose [23:17:10] worrying nonetheless but will file separate [23:18:19] (03CR) 10Jforrester: [C: 03+2] Keep $isParsoidCluster test consistent w/ operations/puppet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [23:19:06] (03Merged) 10jenkins-bot: Keep $isParsoidCluster test consistent w/ operations/puppet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554639 (https://phabricator.wikimedia.org/T239867) (owner: 10C. Scott Ananian) [23:21:05] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Keep test consistent w/ operations/puppet, for CS (duration: 01m 03s) [23:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:11] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Bstorm) The server shows the right amount of RAM, so that's a good start. Checking logs on the web console [23:22:32] !log jforrester@deploy1001 Synchronized wmf-config/logging.php: Keep test consistent w/ operations/puppet, for logging (duration: 01m 02s) [23:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:38] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Bstorm) [23:23:46] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) [23:23:55] brennen: Go for it. [23:24:04] James_F: ta. rolling.
[23:24:17] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Bstorm) 05Open→03Resolved No errors for now. Hopefully, we don't see more pop up. Thanks! [23:24:23] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) [23:25:48] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554650 [23:25:50] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554650 (owner: 10Brennen Bearnes) [23:26:46] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554650 (owner: 10Brennen Bearnes) [23:29:04] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.8 [23:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:06] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.8 (duration: 01m 01s) [23:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:35] another spike in database connection errors there [23:32:25] Dropping back though? [23:32:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:33:08] seemed like it but now cropping up again. 
[23:33:34] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:34] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:35] rolling back. [23:33:36] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:36] PROBLEM - Nginx local proxy to apache on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:36] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:36] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:38] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:38] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:40] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:40] PROBLEM - Nginx local proxy to apache on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:40] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:42] -,- [23:33:43] Nooo [23:33:44] PROBLEM - PHP7 rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:44] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:46] PROBLEM - PHP7 rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:46] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:46] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:50] PROBLEM - PHP7 rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:52] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:54] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:54] PROBLEM - PHP7 rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:56] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:56] PROBLEM - Nginx local proxy to apache on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:58] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:58] PROBLEM - Nginx local proxy to 
apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:33:58] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:00] PROBLEM - phpfpm_up reduced availability on icinga1001 is CRITICAL: 0.6487 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:34:00] PROBLEM - Apache HTTP on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:01] Yeah, I think it's real, not just a blip. [23:34:02] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:02] PROBLEM - Apache HTTP on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:02] PROBLEM - Apache HTTP on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:02] PROBLEM - Nginx local proxy to apache on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:02] PROBLEM - PHP7 rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:03] PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:03] PROBLEM - PHP7 rendering on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:04] PROBLEM - Nginx local 
proxy to apache on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:04] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:06] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:06] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:08] PROBLEM - Nginx local proxy to apache on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:08] PROBLEM - Apache HTTP on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:08] PROBLEM - Nginx local proxy to apache on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:08] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:09] :( [23:34:10] PROBLEM - PHP7 rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:10] PROBLEM - Apache HTTP on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:10] PROBLEM - Nginx local proxy to apache on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:10] PROBLEM - PHP7 rendering on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:10] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:11] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:11] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:11] No idea how, though. [23:34:12] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:12] PROBLEM - Apache HTTP on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:13] PROBLEM - Nginx local proxy to apache on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:13] PROBLEM - PHP7 rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:14] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:14] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:15] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:15] PROBLEM - Nginx local proxy to apache on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:16] PROBLEM - 
Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:16] PROBLEM - Apache HTTP on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:17] PROBLEM - PHP7 rendering on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:17] PROBLEM - Apache HTTP on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:18] PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:18] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:19] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:19] PROBLEM - PHP7 rendering on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:20] PROBLEM - Nginx local proxy to apache on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:20] PROBLEM - PHP7 rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:31] Always a good sign when icinga quits. [23:34:36] Haha [23:34:48] ::sigh:: [23:35:27] If it's DB related, I'm going to blame A.aron as a first approximation. 
;-) [23:35:27] !log brennen@deploy1001 Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org) [23:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:51] James_F / anyone: scap just failed hard, i think checking canaries [23:35:52] brennen: Failure to rollout or rollback? [23:35:54] Yeah. [23:35:59] Just --force it. [23:36:17] scap sync-wikiversions --force 'Revert "group1 wikis to 1.35.0-wmf.5"' [23:36:19] ? [23:36:33] rolling with that. [23:36:35] Yes [23:37:01] PROBLEM - LVS HTTPS IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:37:03] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:37:07] PROBLEM - Nginx local proxy to apache on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:37:07] PROBLEM - Apache HTTP on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:37:07] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:37:09] PROBLEM - Apache HTTP on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:37:09] PROBLEM - Nginx local proxy to apache on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:37:13] Welcome back icinga-wm -,- [23:37:17] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1344.eqiad.wmnet, mw1272.eqiad.wmnet, mw1320.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1223.eqiad.wmnet, mw1282.eqiad.wmnet, mw1333.eqiad.wmnet, 
mw1241.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1224.eqiad.wmnet, mw1316.eqiad.wmnet, mw1325.eqiad.wmnet, mw1312.eqiad.wmnet, mw1347.eqiad.wmnet, mw1342 [23:37:17] 270.eqiad.wmnet, mw1273.eqiad.wmnet, mw1341.eqiad.wmnet, mw1332.eqiad.wmnet, mw1313.eqiad.wmnet, mw1229.eqiad.wmnet, mw1246.eqiad.wmnet, mw1322.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1323.eqiad.wmnet, mw1227.eqiad.wmnet, mw1233.eqiad.wmnet, mw1327.eqiad.wmnet, mw1245.eqiad.wmnet, mw1340.eqiad.wmnet, mw1258.eqiad.wmnet, mw1225.eqiad.wmnet, mw1264.eqiad.wmnet, mw1255.eqiad.wmnet, mw1257.eqiad.wmn [23:37:17] wmnet, mw1238.eqiad.wmnet, mw1234.eqiad.wmnet, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [23:37:17] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:37:29] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:37:40] PROBLEM - LVS HTTPS IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:38:11] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [image] https://wikitech.wikime [23:38:11] Base [23:38:26] icinga-wm: it's all about that? 
[23:38:27] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.826 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:38] * Vermont blames the bot [23:38:43] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.727 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:47] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.359 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:49] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.730 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:49] RECOVERY - Nginx local proxy to apache on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.142 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:50] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.35.0-wmf.5" [23:38:51] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 75129 bytes in 5.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:38:52] RECOVERY - LVS HTTPS IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15472 bytes in 5.545 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:38:52] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:52] RECOVERY - Nginx local proxy to apache on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.316 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:53] RECOVERY - 
Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.244 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:53] RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.293 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:53] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.903 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:53] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:55] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.226 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:55] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 4.285 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:55] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:55] here [23:38:57] RECOVERY - Nginx local proxy to apache on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.728 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:57] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.264 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:57] RECOVERY - Nginx local proxy to apache on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:57] RECOVERY - Nginx local proxy to apache 
on mw1244 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:57] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.860 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:59] RECOVERY - Apache HTTP on mw1273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:59] RECOVERY - Nginx local proxy to apache on mw1273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.152 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:59] RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.412 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:38:59] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:00] RECOVERY - Nginx local proxy to apache on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:00] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.342 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:01] RECOVERY - PHP7 rendering on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 75128 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:39:01] RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.445 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:02] RECOVERY - Apache HTTP on mw1265 is 
OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:02] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.414 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:03] RECOVERY - Nginx local proxy to apache on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:03] RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 75129 bytes in 6.570 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:39:04] RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 75129 bytes in 3.545 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:39:04] RECOVERY - PHP7 rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 75129 bytes in 1.275 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:39:05] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:05] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.642 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:25] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:41:25] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:41:33] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on 
www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1973 bytes in 0.103 second response time https://phabricator.wikimedia.org/project/view/71/ [23:41:35] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [23:41:47] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:41:55] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [23:41:57] RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 0.9534 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:42:01] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:42:03] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:42:17] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [23:42:17] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:42:27] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:42:31] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:42:45] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.470 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:42:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:42:45] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [23:42:57] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [23:43:05] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:43:11] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:43:15] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:43:23] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:44:03] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:44:03] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 71.37 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:44:57] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:44:59] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:45:11] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation 
suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:45:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:45:25] I do love having to `/ignore icinga-wm`. [23:45:39] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:45:50] (PS3) DannyS712: InitialiseSettings - clean up groupOverrides layout / spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/554392 (https://phabricator.wikimedia.org/T231178) [23:46:05] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:46:05] (PS4) DannyS712: InitialiseSettings - clean up groupOverrides layout / spacing [mediawiki-config] - https://gerrit.wikimedia.org/r/554392 (https://phabricator.wikimedia.org/T231178) [23:46:11] James_F: some serious custom botspam filtering for these channels is on my personal todo list for near-term future. [23:46:13] My theory is that something is failing to release connections so the appservers get overwhelmed. 
[23:46:20] brennen: +1 [23:46:41] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:47:01] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:47:05] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [23:48:13] James_F: that tracks: https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=1575492481559&to=1575503281559&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1251 [23:48:31] (grain of salt, I'm still learning my way around both the stack and the dashboards) [23:48:41] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:49:39] oh, pulling that graph out means you can't see the header -- that's from the Nutcracker section [23:49:57] Particularly, note that the s1 wiki (enwiki) had errors even though we didn't deploy to that wiki with the wmf.8 code. [23:50:20] Yeah, the Apache workers graph falling through the floor is a bad look. 
[23:50:33] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:50:36] (PS1) Brennen Bearnes: Revert "group1 wikis to 1.35.0-wmf.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/554651 [23:50:38] (CR) Brennen Bearnes: [C: +2] Revert "group1 wikis to 1.35.0-wmf.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/554651 (owner: Brennen Bearnes) [23:50:39] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:51:30] (Merged) jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.8" [mediawiki-config] - https://gerrit.wikimedia.org/r/554651 (owner: Brennen Bearnes) [23:52:07] Jobrunners were unaffected though? [23:52:11] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:52:21] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:52:27] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:52:48] I feel this deserves an incident report. [23:52:56] +1 [23:53:03] i would concur. 
[23:53:51] https://wikitech.wikimedia.org/wiki/Incident_documentation/20191204-MediaWiki [23:54:11] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:54:25] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:54:29] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1575492814576&to=1575503614577&fullscreen&panelId=43&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 "worst scraping time" was up for s7 only [23:54:41] (s3 too but within normal variation, it looks like) [23:55:45] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [23:56:11] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api