[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T0000). [00:06:23] (03CR) 10CDanis: [C: 03+1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [00:10:15] (03PS1) 10Dzahn: add metafo record for parsoid-php [dns] - 10https://gerrit.wikimedia.org/r/543737 [00:10:48] (03CR) 10jerkins-bot: [V: 04-1] add metafo record for parsoid-php [dns] - 10https://gerrit.wikimedia.org/r/543737 (owner: 10Dzahn) [00:11:19] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/542572 for the discovery.yaml entry" [dns] - 10https://gerrit.wikimedia.org/r/543737 (owner: 10Dzahn) [00:11:42] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/c/operations/dns/+/543737 for the DNS metafo record" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [00:13:23] (03CR) 10Dzahn: ""error: Name 'parsoid-php.discovery.wmnet.': resolver plugin 'metafo' rejected resource name 'disc-parsoid-php'" would this be fixed by fi" [dns] - 10https://gerrit.wikimedia.org/r/543737 (owner: 10Dzahn) [00:34:55] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Papaul) a:03Papaul [00:35:46] 10Operations, 10ops-esams, 10decommission: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518 (10Papaul) a:05mark→03Papaul [00:37:22] 10Operations, 10ops-esams, 10decommission: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376 (10Papaul) a:03Papaul [00:40:39] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [00:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:00] 10Operations, 
10ops-esams, 10decommission: Decommission bast3001 - https://phabricator.wikimedia.org/T159480 (10Papaul) a:05mark→03Papaul [00:41:08] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [00:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:06] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [00:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:10] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [00:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:04] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [00:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:08] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [00:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:05] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [00:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [00:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:27] (03PS1) 10BBlack: Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/543740 [02:04:00] (03CR) 10BBlack: [C: 03+2] Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/543740 (owner: 10BBlack) [02:04:19] !log repooling eqsin [02:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:17] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 54.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [02:46:17] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: 
(C)60 le (W)70 le 74.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:04:28] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10Krinkle) >>! In T229792#5574497, @gerritbot wrote: > Change 539128 had a related patch set uploaded (by Effie Mouzeli; owner: Giuseppe Lavagetto):... [03:33:22] (03CR) 10Krinkle: "If we need more libs in the future, we can add a tiny shim here included by prod as well if needed. Yaml parsing should be pretty simple f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542658 (owner: 10Krinkle) [03:33:56] (03PS15) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [03:33:58] (03PS24) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [03:34:00] (03PS21) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [03:34:02] (03PS16) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [03:34:04] (03PS16) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [03:34:06] (03PS16) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [03:36:05] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename wdqs module to query_service [puppet] - 
10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [03:36:25] (03CR) 10jerkins-bot: [V: 04-1] query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [03:37:44] (03CR) 10jerkins-bot: [V: 04-1] query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [04:01:16] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.5-wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/543025 (https://phabricator.wikimedia.org/T234011) (owner: 10Vgutierrez) [04:42:45] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-upload site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:43:09] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:43:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_upload site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:44:19] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:44:41] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [04:45:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:00:25] !log uploaded trafficserver 8.0.5-1wm9 to apt.wikimedia.org (stretch) - T234011 [05:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:31] T234011: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 [05:01:12] !log upgrading ATS to 8.0.5-1wm9 on cp5001 - T234011 [05:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:17] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:04:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:06:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3312 and db1094', diff saved to https://phabricator.wikimedia.org/P9367 and previous config saved to /var/cache/conftool/dbconfig/20191017-050614-marostegui.json [05:06:17] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:16] 10Operations, 10Traffic, 10Patch-For-Review: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 (10Vgutierrez) After upgrading to 8.0.5-1wm9 cp5001 reports properly the EC used on reused sessions: ` - ReqHeader X-CP-TLS-Version: TLSv1.2... [05:08:30] \o/ [05:09:29] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T235695 (10Marostegui) a:03Papaul @Papaul can we replace the failed disk with a brand new one? This host is scheduled for decommission, but we have to buy its replacement first (T234608) [05:09:39] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T235695 (10Marostegui) p:05Triage→03Normal [05:10:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312 and db1136 for schema change', diff saved to https://phabricator.wikimedia.org/P9368 and previous config saved to /var/cache/conftool/dbconfig/20191017-051055-marostegui.json [05:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:30] !log Deploy schema change on db1095:3312 [05:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:59] (03PS1) 10Marostegui: site.pp: Provision new hosts as spare [puppet] - 10https://gerrit.wikimedia.org/r/543748 [05:19:47] (03CR) 10Marostegui: [C: 03+2] site.pp: Provision new hosts as spare [puppet] - 10https://gerrit.wikimedia.org/r/543748 (owner: 10Marostegui) [05:30:16] !log Deploy schema change on labtestwiki and labswiki [05:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:57] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Joe) Heh yes sorry, I forgot to tell you yesterday - you need to use 
`helmfile destroy` in newer versions of helmfile. [05:33:06] (03PS1) 10Mforns: ::reportupdater::jobs::hadoop.pp: Add wmcs job [puppet] - 10https://gerrit.wikimedia.org/r/543749 (https://phabricator.wikimedia.org/T235718) [05:33:46] (03CR) 10Marostegui: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [05:35:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [05:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:01] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2052.codfw.wmnet` - db2052.codfw.wmnet (**PASS**) - Downtimed host on Ic... 
[05:36:55] (03PS1) 10Marostegui: site.pp: Remove puppet references for db2052 [puppet] - 10https://gerrit.wikimedia.org/r/543750 (https://phabricator.wikimedia.org/T230883) [05:37:17] (03PS1) 10Marostegui: wmnet: Remove production entries for db2052 [dns] - 10https://gerrit.wikimedia.org/r/543751 (https://phabricator.wikimedia.org/T230883) [05:37:45] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove puppet references for db2052 [puppet] - 10https://gerrit.wikimedia.org/r/543750 (https://phabricator.wikimedia.org/T230883) (owner: 10Marostegui) [05:38:12] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production entries for db2052 [dns] - 10https://gerrit.wikimedia.org/r/543751 (https://phabricator.wikimedia.org/T230883) (owner: 10Marostegui) [05:39:23] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:39:43] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:40:35] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) a:05RobH→03Papaul [05:40:49] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Marostegui) Host ready for switch disablement + onsite steps [05:44:39] (03CR) 10CDanis: [C: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [05:59:20] (03PS2) 10Elukey: ::reportupdater::jobs::hadoop.pp: Add wmcs job [puppet] - 10https://gerrit.wikimedia.org/r/543749 (https://phabricator.wikimedia.org/T235718) (owner: 10Mforns) [06:00:27] (03PS3) 10Elukey: 
reportupdater::jobs::hadoop.pp: Add wmcs job [puppet] - 10https://gerrit.wikimedia.org/r/543749 (https://phabricator.wikimedia.org/T235718) (owner: 10Mforns) [06:00:42] (03PS4) 10Elukey: reportupdater::jobs::hadoop: Add wmcs job [puppet] - 10https://gerrit.wikimedia.org/r/543749 (https://phabricator.wikimedia.org/T235718) (owner: 10Mforns) [06:02:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change special weights from x to x100 on s5 - T231018', diff saved to https://phabricator.wikimedia.org/P9369 and previous config saved to /var/cache/conftool/dbconfig/20191017-060251-marostegui.json [06:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:56] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [06:04:56] (03CR) 10Elukey: [C: 03+2] reportupdater::jobs::hadoop: Add wmcs job [puppet] - 10https://gerrit.wikimedia.org/r/543749 (https://phabricator.wikimedia.org/T235718) (owner: 10Mforns) [06:06:12] !log upgrade archiva on archiva1001 to 2.2.4 - T222595 [06:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:43] of course in labs worked, in prod the landing page doesn't load sigh [06:08:54] checking what is wrong, archiva might alert [06:19:02] (03PS1) 10Marostegui: mariadb: Allow installation of es1020-es1025 [puppet] - 10https://gerrit.wikimedia.org/r/543752 (https://phabricator.wikimedia.org/T235659) [06:20:34] (03PS2) 10Marostegui: mariadb: Allow installation of es1020-es1025 [puppet] - 10https://gerrit.wikimedia.org/r/543752 (https://phabricator.wikimedia.org/T235659) [06:22:41] (03CR) 10Marostegui: [C: 03+2] mariadb: Allow installation of es1020-es1025 [puppet] - 10https://gerrit.wikimedia.org/r/543752 (https://phabricator.wikimedia.org/T235659) (owner: 10Marostegui) [06:23:10] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - 
https://phabricator.wikimedia.org/T235250 (10MoritzMuehlenhoff) @papaul: Thanks, that fixed the microcode loading for puppetmaster2002 as well. [06:35:58] (03PS1) 10Muehlenhoff: Revert "puppet/config-master: migrate puppet and config-master to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/543753 (https://phabricator.wikimedia.org/T235250) [06:49:09] (03PS2) 10Muehlenhoff: Revert "puppet/config-master: migrate puppet and config-master to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/543753 (https://phabricator.wikimedia.org/T235250) [06:52:39] (03CR) 10Muehlenhoff: [C: 03+2] Revert "puppet/config-master: migrate puppet and config-master to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/543753 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [06:53:32] the archiva UI now loads, it was a new setting to add, but now archiva doesn't load its config (Repos, etc..) [06:53:39] that is very strange [06:54:07] at this point a rollback may help [06:55:52] (03PS1) 10Muehlenhoff: Revert "puppet/config-master: migrate puppet and config-master to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/543755 (https://phabricator.wikimedia.org/T235250) [07:02:40] ok now I get what happened. It seems that archiva stores its config in archiva.xml, which is not managed by puppet, and it got wiped [07:03:02] the last upgrade was a vm change, so it was not clear [07:03:43] gehel: --^ sorry for the trouble, working on it [07:13:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1130 (non partitioned host) into s5 special group with low weight - T223151', diff saved to https://phabricator.wikimedia.org/P9370 and previous config saved to /var/cache/conftool/dbconfig/20191017-071308-marostegui.json [07:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:14] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [07:17:31] elukey: no worries! 
[07:18:25] gehel: I think it is now a matter of 1) re-adding the ldap config 2) add the repos back to the config [07:18:52] I'll make sure at the end that everything is properly puppetized [07:19:04] elukey: no emergency on our side, we might have a few failing builds in the meantime, but that's not blocking anything urgent [07:19:17] dcausse, onimisionipe: ^ [07:20:43] I rarely (never?) rely on archiva, perhaps wdqs release process does? [07:22:38] I don't have any build in the pipeline now. So good from my side [07:23:58] thanks! [07:38:18] (03PS1) 10KartikMistry: Enable CX out of beta in Malayalam/Bengali/Mongolian WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543764 (https://phabricator.wikimedia.org/T233008) [07:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1136', diff saved to https://phabricator.wikimedia.org/P9371 and previous config saved to /var/cache/conftool/dbconfig/20191017-074658-marostegui.json [07:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:17] (03CR) 10Ayounsi: [C: 03+1] Remove zookeeper terms from the Analytics filters [homer/public] - 10https://gerrit.wikimedia.org/r/543183 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [07:51:13] (03PS2) 10Muehlenhoff: Revert "puppet/config-master: migrate puppet and config-master to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/543755 (https://phabricator.wikimedia.org/T235250) [07:56:15] (03CR) 10Muehlenhoff: [C: 03+2] Revert "puppet/config-master: migrate puppet and config-master to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/543755 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [08:00:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3317 pool db1136 temporarily into vslow,dump', diff saved to https://phabricator.wikimedia.org/P9372 and previous config saved to /var/cache/conftool/dbconfig/20191017-080026-marostegui.json [08:00:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fix db1136 weight', diff saved to https://phabricator.wikimedia.org/P9373 and previous config saved to /var/cache/conftool/dbconfig/20191017-080157-marostegui.json [08:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:40] !log Deploy schema change on db1090:3317 [08:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:56] !log upgrading ATS on eqsin nodes to 8.0.5-1wm9 - T234011 [08:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:07] T234011: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 [08:08:23] (03CR) 10Elukey: "Next question is - how do I deploy this?" [homer/public] - 10https://gerrit.wikimedia.org/r/543183 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [08:26:01] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [08:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:10] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/542139 (owner: 10Ayounsi) [08:26:32] gehel: update - I realized that /var/lib/archiva is in bacula, so I am trying to retrieve the archiva.xml now [08:26:35] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:05] elukey: I'm done with my interviews, ping me if you need any help! 
[08:27:43] I am currently waiting for bacula to restore the file [08:28:38] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [08:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:54] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [08:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:29] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Move users to YAML common instead of templates [homer/public] - 10https://gerrit.wikimedia.org/r/542139 (owner: 10Ayounsi) [08:29:59] (03PS1) 10Ayounsi: Add OSPF support [homer/public] - 10https://gerrit.wikimedia.org/r/543795 [08:36:22] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10faidon) Thanks for the update! >>! In T223450#5582390, @RobH wrote: > On accounting: > > https://netbox.wikimedia.org/extras/reports/accou... [08:38:56] (03PS1) 10Jon Harald Søby: Allow sysops to add transwiki on nnwiki, and add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543796 (https://phabricator.wikimedia.org/T231761) [08:39:29] (03PS1) 10Muehlenhoff: labpuppetmaster: Remove remaining puppet references [puppet] - 10https://gerrit.wikimedia.org/r/543797 (https://phabricator.wikimedia.org/T234462) [08:41:32] (03PS1) 10Elukey: role::analytics_test_cluster::client: allow ssh access to analytics users [puppet] - 10https://gerrit.wikimedia.org/r/543798 (https://phabricator.wikimedia.org/T212258) [08:41:48] (03CR) 10Muehlenhoff: [C: 03+2] labpuppetmaster: Remove remaining puppet references [puppet] - 10https://gerrit.wikimedia.org/r/543797 (https://phabricator.wikimedia.org/T234462) (owner: 10Muehlenhoff) [08:42:28] (03PS2) 10Elukey: role::analytics_test_cluster::client: allow ssh access to analytics users [puppet] - 10https://gerrit.wikimedia.org/r/543798 (https://phabricator.wikimedia.org/T212258) [08:42:49] 
10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [08:44:22] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: allow ssh access to analytics users [puppet] - 10https://gerrit.wikimedia.org/r/543798 (https://phabricator.wikimedia.org/T212258) (owner: 10Elukey) [08:44:34] Is anyone else having trouble with logging into wikitech? I have to reset my password every time I'm logged out [08:50:41] (03PS1) 10Muehlenhoff: Remove DNS entries for labpuppetmaster [dns] - 10https://gerrit.wikimedia.org/r/543799 (https://phabricator.wikimedia.org/T234462) [08:56:39] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10faidon) 05Stalled→03Open p:05Normal→03High What's the status of this? It seems like this migration is in some limbo state :) As far as I understand it: - Ol... [08:57:54] 10Operations, 10Puppet, 10Patch-For-Review: puppet failing on puppetmaster2001 - https://phabricator.wikimedia.org/T235060 (10MoritzMuehlenhoff) 05Open→03Resolved This got resolved with the cergen build for buster. [08:58:01] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] "Chatted over IRC, will write doc soon on how other people can deploy it." [homer/public] - 10https://gerrit.wikimedia.org/r/543183 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [08:59:27] (03PS2) 10Arturo Borrero Gonzalez: openstack: drop jessie code [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) [09:01:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for labpuppetmaster [dns] - 10https://gerrit.wikimedia.org/r/543799 (https://phabricator.wikimedia.org/T234462) (owner: 10Muehlenhoff) [09:04:32] gehel: archiva should be ok now :) [09:04:39] \o/ [09:05:15] elukey: I'm doing a recheck of https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/543004, let's see if it works out [09:05:33] thanks! 
[09:06:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [09:18:11] 10Operations, 10Traffic: cp3032 and cp3040 occasional failed fetches - https://phabricator.wikimedia.org/T235736 (10jijiki) [09:22:57] elukey: that build passed, I'd say we're good for now! [09:23:22] gehel: thanks! \o/ [09:23:38] will open a task to follow up on puppetization [09:23:40] elukey: we'll need to test uploading a release, but I don't have one ready atm. onimisionipe will probably have one release to do fairly soon [09:23:49] ack [09:26:32] !log Stop MySQL on db1117 this will generate some haproxy alerts - T227133 [09:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:37] T227133: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 [09:27:43] !log upload archiva 2.2.4-1 to stretch-wikimedia - T222595 [09:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:37] PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:30:48] ^ expected as I said [09:30:55] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:31:11] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:31:13] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:31:15] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:31:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL 
check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:19] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:19] ACKNOWLEDGEMENT - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:20] ACKNOWLEDGEMENT - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:20] ACKNOWLEDGEMENT - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:31:21] ACKNOWLEDGEMENT - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui PDU work https://wikitech.wikimedia.org/wiki/HAProxy [09:34:01] (03PS1) 10Muehlenhoff: dnsrecursor: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/543803 [09:37:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 for PDU work, give some traffic to db1090:3312 meanwhile 
T227133', diff saved to https://phabricator.wikimedia.org/P9374 and previous config saved to /var/cache/conftool/dbconfig/20191017-093753-marostegui.json [09:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:58] T227133: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 [09:38:08] !log Stop MySQL on db1129 for PDU work [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:13] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10Marostegui) db1129 and db1117 are good to go. [09:39:29] si'!log swift eqiad-prod: add weight to ms-be105[1-6] - T232367 [09:39:32] T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 [09:39:38] !log swift eqiad-prod: add weight to ms-be105[1-6] - T232367 [09:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:02] (03PS5) 10Arturo Borrero Gonzalez: toolforge: k8s: adjust ports in the ingress setup [puppet] - 10https://gerrit.wikimedia.org/r/543137 (https://phabricator.wikimedia.org/T234037) [09:40:34] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Sjoerddebruin) I received about 1000 mails in two days, glad that I set up a filter... 
[09:49:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: adjust ports in the ingress setup [puppet] - 10https://gerrit.wikimedia.org/r/543137 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [09:53:40] (03CR) 10Volans: "some comment inline on the jinja part, I cannot validate all the values in the yaml" (035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/543795 (owner: 10Ayounsi) [09:57:28] !log swift codfw-prod: more weight to ms-be205[1-6] - T233638 [09:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:33] T233638: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 [10:02:17] (03PS1) 10Ema: cache: reimage cp4027 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/543807 (https://phabricator.wikimedia.org/T227432) [10:05:42] (03PS1) 10Ema: ATS: include tls profile in cache::text_ats role [puppet] - 10https://gerrit.wikimedia.org/r/543809 (https://phabricator.wikimedia.org/T227432) [10:05:44] (03PS2) 10Ayounsi: Add OSPF support [homer/public] - 10https://gerrit.wikimedia.org/r/543795 [10:07:09] (03CR) 10Ayounsi: "Thx!" 
(035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/543795 (owner: 10Ayounsi) [10:09:34] (03PS2) 10Ema: ATS: include tls profile in cache::text_ats role [puppet] - 10https://gerrit.wikimedia.org/r/543809 (https://phabricator.wikimedia.org/T227432) [10:10:18] 10Operations, 10observability: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) [10:10:27] 10Operations, 10observability, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) [10:11:05] (03CR) 10Vgutierrez: [C: 03+1] ATS: include tls profile in cache::text_ats role [puppet] - 10https://gerrit.wikimedia.org/r/543809 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:13:53] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp4027 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/543807 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:16:05] (03CR) 10Ema: [C: 03+2] ATS: include tls profile in cache::text_ats role [puppet] - 10https://gerrit.wikimedia.org/r/543809 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:16:31] 10Operations, 10observability, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) a:03fgiunchedi [10:17:41] !log elukey@deploy1001 Started deploy [eventlogging/analytics@0f0a1aa]: Move codebase to Python3 [10:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:46] !log elukey@deploy1001 Finished deploy [eventlogging/analytics@0f0a1aa]: Move codebase to Python3 (duration: 00m 05s) [10:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:27] !log Move eventlogging on eventlog1002 to Python3 [10:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:32] \o/ [10:19:43] (03PS1) 10Arturo 
Borrero Gonzalez: toolforge: refresh puppet code for the new k8s [puppet] - 10https://gerrit.wikimedia.org/r/543815 (https://phabricator.wikimedia.org/T215531) [10:20:08] (03PS1) 10Muehlenhoff: Remove late-install hack for puppet 4 installation [puppet] - 10https://gerrit.wikimedia.org/r/543816 (https://phabricator.wikimedia.org/T228657) [10:24:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/543465 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [10:24:26] !log elukey@deploy1001 Started deploy [eventlogging/analytics@0f0a1aa]: Rollback move codebase to Python3 [10:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:29] !log elukey@deploy1001 Finished deploy [eventlogging/analytics@0f0a1aa]: Rollback move codebase to Python3 (duration: 00m 03s) [10:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:25] !log rollback eventlogging back to Python 2, some errors (unseen in tests) logged by the processors [10:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:30] :( [10:30:40] (03PS2) 10Muehlenhoff: Extend wmf-userschema for additional MFA options: [puppet] - 10https://gerrit.wikimedia.org/r/543402 [10:31:18] (03PS2) 10Ema: cache: reimage cp4027 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/543807 (https://phabricator.wikimedia.org/T227432) [10:31:59] !log depool mw1333 [10:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:12] !log depool cp4027 and reimage as text_ats T227432 [10:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:16] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:32:45] (03CR) 10Ema: [C: 03+2] cache: reimage cp4027 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/543807 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) 
[10:33:11] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for mnwwiki - https://phabricator.wikimedia.org/T235743 (10jhsoby) [10:35:40] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4027.ulsfo.wmnet'] ` The log can be found in `/var/log/wm... [10:38:09] (03PS1) 10Mobrovac: RESTRouter: Sort the list of domains in config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/543818 [10:39:18] (03PS1) 10Ema: cache_text ulsfo: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/543819 (https://phabricator.wikimedia.org/T227432) [10:39:52] (03CR) 10Mobrovac: [C: 03+2] RESTRouter: Sort the list of domains in config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/543818 (owner: 10Mobrovac) [10:40:08] (03Merged) 10jenkins-bot: RESTRouter: Sort the list of domains in config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/543818 (owner: 10Mobrovac) [10:43:13] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1083:9536,cp1085:9536,cp1089:9536} site=eqiad tunnel={cp4027_v4,cp4027_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:43:33] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2001:9536,cp2004:9536,cp2007:9536,cp2010:9536,cp2012:9536,cp2023:9536} site=codfw tunnel={cp4027_v4,cp4027_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [10:44:11] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:45:23] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:45:27] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:45:39] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:45:39] (03PS1) 10Mobrovac: CXServer: downcase the header name used for rate-limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/543820 [10:45:53] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:45:57] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:46:09] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:46:21] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:46:25] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:46:31] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:46:31] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [10:46:58] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp4027_v4, cp4027_v6 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:47:14] PROBLEM - Host bast4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:16] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:16] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:17] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:17] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:17] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:27] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for mnwwiki - https://phabricator.wikimedia.org/T235743 (10Marostegui) Let us know when the database is created so we can sanitize its tables and hand over to WMCS for views creation. [10:47:30] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:47:50] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [10:47:50] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [10:48:10] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:48:44] PROBLEM - Host lvs4006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:48:52] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:48:52] PROBLEM - Host dns4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:48:58] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 50 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:49:00] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:04] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:04] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:12] PROBLEM - Host ganeti4001.mgmt 
is DOWN: PING CRITICAL - Packet loss = 100% [10:49:12] PROBLEM - Host ganeti4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:30] PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:30] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:32] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:49:34] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:50:18] PROBLEM - Host mr1-ulsfo.oob is DOWN: CRITICAL - Network Unreachable (198.24.47.102) [10:50:46] PROBLEM - Host ganeti4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:50:47] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:56] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [10:52:25] (03CR) 10KartikMistry: [C: 03+1] "Also: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/543822 done for cxserver." [deployment-charts] - 10https://gerrit.wikimedia.org/r/543820 (owner: 10Mobrovac) [10:52:48] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:53:48] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 50 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:53:50] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 56 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:55:34] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 50 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:56:31] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [10:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:40] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 50 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [10:56:52] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:57:33] (03PS1) 10Mobrovac: RESTRouter: Add mnwwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/543824 (https://phabricator.wikimedia.org/T235744) [10:58:34] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T1100). Please do the needful. [11:00:05] matthiasmullie and Jhs: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:19] (03CR) 10Mobrovac: [C: 03+2] RESTRouter: Add mnwwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/543824 (https://phabricator.wikimedia.org/T235744) (owner: 10Mobrovac) [11:00:28] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:00:31] o/ [11:00:31] (03Merged) 10jenkins-bot: RESTRouter: Add mnwwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/543824 (https://phabricator.wikimedia.org/T235744) (owner: 10Mobrovac) [11:01:03] I can SWAT today! [11:01:37] matthiasmullie: or are you doing it yourself? [11:01:52] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 56 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:01:54] that... would be convenient - seem to be stuck in a restricted network here & can't seem to ssh :D [11:01:58] ACKNOWLEDGEMENT - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:30. 
[11:01:58] ACKNOWLEDGEMENT - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:30. [11:01:58] ACKNOWLEDGEMENT - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:30. [11:01:58] ACKNOWLEDGEMENT - Host mr1-ulsfo.oob is DOWN: CRITICAL - Network Unreachable (198.24.47.102) Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:30. [11:01:58] ACKNOWLEDGEMENT - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:30. [11:02:26] Urbanecm, i'm here :) [11:02:32] (03CR) 10Mobrovac: [C: 03+2] CXServer: downcase the header name used for rate-limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/543820 (owner: 10Mobrovac) [11:02:39] Urbanecm: would be great if you could do it! otherwise I'll go set up tethering :p [11:02:41] Jhs: ack :) [11:02:44] matthiasmullie: sure! [11:02:49] are you able to test the change? [11:02:52] thanks :) [11:03:28] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 56 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:03:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:03:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4023 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:03:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4028 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:03:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4030 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:03:36] ACKNOWLEDGEMENT - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:03:36] ACKNOWLEDGEMENT - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:03:36] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:03:37] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs4006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:18. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:03:48] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 50 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:03:49] Urbanecm, was that question for me? If so, yes by checking Special:UserGroupRights [11:03:54] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:03:57] Jhs: no, for matthiasmullie [11:04:18] yeah, am able to test! [11:04:43] 👍 [11:04:56] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4027.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4027.ulsfo.wmnet'] ` [11:05:10] matthiasmullie: cool! [11:05:19] matthiasmullie: please test at mwdebug1001 [11:05:44] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 50 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:06:18] Urbanecm: works! 
[11:06:21] (03CR) 10Urbanecm: [C: 03+2] Allow sysops to add transwiki on nnwiki, and add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543796 (https://phabricator.wikimedia.org/T231761) (owner: 10Jon Harald Søby) [11:06:24] ok to scap [11:06:24] thanks matthiasmullie [11:07:09] syncing [11:07:11] (03Merged) 10jenkins-bot: Allow sysops to add transwiki on nnwiki, and add import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543796 (https://phabricator.wikimedia.org/T231761) (owner: 10Jon Harald Søby) [11:08:03] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.2/extensions/WikibaseMediaInfo: SWAT: 5a67011: Keep track of assigned nodes in both old & new DOM (T235236) (duration: 01m 03s) [11:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:07] T235236: [Bug] File Page statement panels are not behaving properly in master branch and production - https://phabricator.wikimedia.org/T235236 [11:08:08] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4022 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:50. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:08:08] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4025 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:50. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:08:08] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:50. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:08:08] ACKNOWLEDGEMENT - IPMI Sensor Status on dns4002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:50. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:08:09] matthiasmullie: synced! [11:08:10] (03PS5) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [11:08:15] Jhs: you're next [11:08:26] great [11:08:33] Jhs: please test your patch at mwdebug1001 and let me know [11:08:49] Urbanecm: thanks a million! [11:08:55] yw! [11:09:30] Urbanecm, looks right to me 👍 [11:09:43] !log upgrading ATS on ulsfo nodes to 8.0.5-1wm9 - T234011 [11:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:47] T234011: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 [11:09:56] RECOVERY - Check the Netbox report puppetdb for fail status. 
on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:10:02] Jhs: syncing [11:10:23] (03CR) 10jerkins-bot: [V: 04-1] ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [11:11:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 36d4612: Allow sysops to add transwiki on nnwiki, and add import sources (T231761) (duration: 00m 59s) [11:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:04] T231761: Allow sysop to grant transwiki-rights on nnwiki - https://phabricator.wikimedia.org/T231761 [11:11:06] !log failover vrrp from cr2-eqiad to cr1-eqiad - T227133 [11:11:09] Jhs: sznced" [11:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:10] T227133: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 [11:11:14] *synced [11:11:16] Urbanecm, thank you very much [11:11:18] !log EU SWAT done [11:11:21] you're welcome :) [11:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:36] ACKNOWLEDGEMENT - IPMI Sensor Status on bast4002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:17. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:12:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4029 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:17. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:12:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4031 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:17. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:12:36] ACKNOWLEDGEMENT - IPMI Sensor Status on dns4001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:17. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:12:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:14:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:14:47] XioNoX: are you doing things in ulsfo? [11:15:03] (03PS2) 10Mobrovac: CXServer: downcase the header name used for rate-limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/543820 [11:15:33] ema: ulsfo is doing work on the B side power feed [11:15:50] so we don't have any power redundancy until 13:30 utc [11:16:28] XioNoX: ok, thanks! 
[11:16:40] I'm ACK/downtiming everything so it doesn't make more power related alerts in parallel to the upcoming eqiad PDU work :) [11:16:45] !log upgrading ATS on esams nodes to 8.0.5-1wm9 - T234011 [11:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:49] T234011: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 [11:17:02] RECOVERY - Host ganeti4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [11:17:42] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:21:10] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 56 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:22:44] (03CR) 10Ema: [C: 03+2] cache_text ulsfo: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/543819 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:22:59] (03CR) 10Jbond: [C: 03+1] Remove late-install hack for puppet 4 installation [puppet] - 10https://gerrit.wikimedia.org/r/543816 (https://phabricator.wikimedia.org/T228657) (owner: 10Muehlenhoff) [11:23:32] (03CR) 10Jbond: [C: 03+2] puppetdb: remove activerecord db settings from servers using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/543465 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [11:23:43] (03PS2) 10Jbond: puppetdb: remove activerecord db settings from servers using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/543465 (https://phabricator.wikimedia.org/T235655) [11:26:55] (03PS6) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [11:27:05] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4027.ulsfo.wmnet,service=ats-be 
[11:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:19] !log upgrading ATS on codfw nodes to 8.0.5-1wm9 - T234011 [11:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:24] T234011: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 [11:31:01] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4024 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:42. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:31:01] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4026 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:42. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:31:01] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs4005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:42. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:31:01] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs4007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ayounsi ulsfo power work - The acknowledgement expires at: 2019-10-18 13:30:42. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:32:03] 10Operations, 10ops-eqiad: rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10ArielGlenn) As far as the base install, I'd like buster on it, and just one interface active. It's more important to get this done soon(and with buster) than to have bonding worked out. The r... [11:32:46] PROBLEM - IPMI Sensor Status on cp4027 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:33:04] PROBLEM - traffic_server backend process restarted on cp4027 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4027&var-layer=backend [11:33:20] the cp4027 alerts above are due to my reimage [11:35:56] RECOVERY - traffic_server backend process restarted on cp4027 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4027&var-layer=backend [11:36:07] !log upgrading ATS on eqiad nodes to 8.0.5-1wm9 - T234011 [11:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:11] T234011: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 [11:36:36] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4027 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema Known, ongoing power work @ulsfo https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:39:28] PROBLEM - Host 
an-worker1095.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:39:28] PROBLEM - Host db1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:39:43] !log pool cp4027 with ATS backend T227432 [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [11:40:30] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] PROBLEM - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] PROBLEM - Host ps1-d7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] PROBLEM - Host ps1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] PROBLEM - Host ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:40:54] PROBLEM - Host elastic1039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host restbase1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host an-worker1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host an-worker1091.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host an-presto1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host an-worker1094.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host an-worker1081.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:56] PROBLEM - Host an-worker1087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:57] PROBLEM - Host an-worker1083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:40:57] PROBLEM - Host an-worker1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:41:19] 10Operations, 10Traffic: Improve ATS prometheus metrics - https://phabricator.wikimedia.org/T231533 (10Vgutierrez) 
[11:41:22] 10Operations, 10Traffic: ATS fails to log the used SSLCurve when the SSL session is being reused - https://phabricator.wikimedia.org/T234011 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [11:42:34] PROBLEM - Host mc1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:34] PROBLEM - Host db1074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:34] PROBLEM - Host ms-be1045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:34] PROBLEM - Host puppetmaster1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:35] PROBLEM - Host wtp1044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:35] PROBLEM - Host mw1258.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:36] PROBLEM - Host ms-be1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:37] PROBLEM - Host wdqs1010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:37] PROBLEM - Host ms-be1043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:38] PROBLEM - Host ms-be1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:38] PROBLEM - Host ms-be1042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:39] PROBLEM - Host kubestage1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:39] PROBLEM - Host wtp1039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:40] PROBLEM - Host mw1329.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:40] PROBLEM - Host ms-be1054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:41] PROBLEM - Host ms-be1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:41] PROBLEM - Host mw1308.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:42] PROBLEM - Host db1139.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:42] PROBLEM - Host ms-be1031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:42:57] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) Setup raid to 
raid10 and 2 spare disks. [11:44:20] PROBLEM - Host bast1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:44:20] PROBLEM - Host mw1316.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:44:20] PROBLEM - Host aqs1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:44:20] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:44:21] PROBLEM - Host cloudvirt1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:44:22] PROBLEM - Host cloudvirt1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:34] PROBLEM - Host notebook1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:34] PROBLEM - Host ores1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:34] PROBLEM - Host oresrdb1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:34] PROBLEM - Host db1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:34] PROBLEM - Host db1077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:35] PROBLEM - Host wtp1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:36] PROBLEM - Host oresrdb1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:36] PROBLEM - Host mw1330.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:37] PROBLEM - Host mw1346.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:37] PROBLEM - Host wtp1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:38] PROBLEM - Host mw1326.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:38] PROBLEM - Host wtp1042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:39] PROBLEM - Host mw1341.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:39] PROBLEM - Host mw1325.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:39] PROBLEM - Host mw1324.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:40] PROBLEM - Host analytics1060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:40] PROBLEM - Host mw1345.mgmt is DOWN: 
PING CRITICAL - Packet loss = 100% [11:45:44] PROBLEM - Host wtp1043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:50] PROBLEM - Host mwlog1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:54] PROBLEM - Host db1102.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:56] PROBLEM - Host restbase1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:56] PROBLEM - Host scandium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:56] PROBLEM - Host restbase1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:45:56] PROBLEM - Host sessionstore1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:02] PROBLEM - Host db1118.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:04] PROBLEM - Host db1132.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:04] PROBLEM - Host dbproxy1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:16] PROBLEM - Host dbstore1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:18] PROBLEM - Host an-coord1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:18] PROBLEM - Host cloudvirt1025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:19] PROBLEM - Host cloudvirt1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:19] PROBLEM - Host cloudvirt1024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:20] PROBLEM - Host cloudvirt1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:21] PROBLEM - Host cloudvirt1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:22] PROBLEM - Host aqs1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:22] PROBLEM - Host an-presto1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:22] PROBLEM - Host conf1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:23] PROBLEM - Host an-presto1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:24] PROBLEM - Host an-presto1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:24] PROBLEM - Host an-worker1090.mgmt is DOWN: 
PING CRITICAL - Packet loss = 100% [11:46:25] PROBLEM - Host db1088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:46:25] PROBLEM - Host db1075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:52:50] RECOVERY - Host mw1233.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 12.19 ms [11:52:50] RECOVERY - Host cp1077.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 78.54 ms [11:52:51] RECOVERY - Host cloudservices1003.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 77.78 ms [11:52:51] RECOVERY - Host cloudvirt1003.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 93.48 ms [11:52:51] RECOVERY - Host cloudvirt1008.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 111.00 ms [11:52:52] RECOVERY - Host cp1079.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 63.94 ms [11:52:52] RECOVERY - Host cloudelastic1004.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 98.46 ms [11:52:53] RECOVERY - Host cp1080.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 65.18 ms [11:52:53] RECOVERY - Host cp1081.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 98.83 ms [11:52:54] RECOVERY - Host cloudnet1004.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 62.29 ms [11:55:13] RECOVERY - Host mw1278.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [11:55:13] RECOVERY - Host mw1226.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.42 ms [11:55:14] RECOVERY - Host ores1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [11:55:14] RECOVERY - Host ores1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [11:55:14] RECOVERY - Host netmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [11:55:14] RECOVERY - Host ores1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.09 ms [11:55:14] RECOVERY - Host ores1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [11:55:15] RECOVERY - Host cp1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.31 ms [11:55:16] RECOVERY - Host pc1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms 
[11:55:16] RECOVERY - Host pc1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [11:55:16] RECOVERY - Host pc1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [11:55:17] RECOVERY - Host cp1076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.82 ms [11:55:17] RECOVERY - Host db1133.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.76 ms [11:55:18] RECOVERY - Host db1135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.22 ms [11:55:18] RECOVERY - Host sessionstore1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.36 ms [11:55:19] RECOVERY - Host db1130.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.92 ms [11:55:19] RECOVERY - Host thumbor1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.00 ms [11:55:20] RECOVERY - Host thumbor1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 9.83 ms [11:55:20] RECOVERY - Host wtp1036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.23 ms [11:55:21] RECOVERY - Host wtp1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.66 ms [11:55:22] RECOVERY - Host wtp1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [11:55:22] RECOVERY - Host wtp1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.67 ms [11:55:22] RECOVERY - Host wtp1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.95 ms [11:55:23] RECOVERY - Host wtp1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.95 ms [11:55:23] RECOVERY - Host wtp1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 9.82 ms [11:55:25] RECOVERY - Host wdqs1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [11:55:25] RECOVERY - Host wtp1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 8.01 ms [11:55:25] RECOVERY - Host elastic1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.39 ms [11:55:28] RECOVERY - Host cp1087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.38 ms [11:55:28] RECOVERY - Host dbproxy1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.35 ms [11:55:28] RECOVERY - Host db1138.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.00 ms [11:55:32] RECOVERY - 
Host mc1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [11:55:32] RECOVERY - Host dbprov1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 11.61 ms [11:55:32] RECOVERY - Host dbprov1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.54 ms [11:55:32] RECOVERY - Host maps1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [11:55:32] RECOVERY - Host cloudvirt1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 14.65 ms [11:55:33] RECOVERY - Host analytics1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.99 ms [11:55:34] RECOVERY - Host lvs1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [11:55:36] RECOVERY - Host mc1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [11:55:36] RECOVERY - Host analytics1068.mgmt is UP: PING OK - Packet loss = 0%, RTA = 7.24 ms [11:55:36] RECOVERY - Host aqs1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [11:55:36] RECOVERY - Host maps1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [11:55:36] RECOVERY - Host maps1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.71 ms [11:55:36] RECOVERY - Host logstash1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms [11:55:37] RECOVERY - Host analytics1062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.18 ms [11:55:38] RECOVERY - Host db1083.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.32 ms [11:55:42] RECOVERY - Host db1126.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.64 ms [11:55:48] RECOVERY - Host db1062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [11:55:51] (03CR) 10Ema: "LGTM, one typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [11:55:56] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms [11:56:00] RECOVERY - Host aqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.80 ms [11:56:00] RECOVERY - Host bast1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.43 ms [11:56:00] RECOVERY - Host 
mw1316.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [11:56:02] RECOVERY - Host mw1297.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [11:56:02] RECOVERY - Host cloudvirt1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [11:56:03] RECOVERY - Host cloudvirt1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.65 ms [11:56:03] RECOVERY - Host cloudvirt1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.56 ms [11:56:04] RECOVERY - Host cloudvirt1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.95 ms [11:56:04] RECOVERY - Host cloudvirt1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.24 ms [11:56:05] RECOVERY - Host cloudvirt1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [11:57:28] RECOVERY - Host sessionstore1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [11:57:28] RECOVERY - Host wtp1043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [11:57:32] RECOVERY - Host mwlog1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [11:57:38] RECOVERY - Host rdb1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [11:57:38] RECOVERY - Host restbase1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [11:57:38] RECOVERY - Host scandium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [11:57:38] RECOVERY - Host restbase1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [11:57:44] RECOVERY - Host db1124.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [11:57:44] RECOVERY - Host db1118.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [11:57:46] RECOVERY - Host db1132.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [11:57:46] RECOVERY - Host dbproxy1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [11:57:52] RECOVERY - Host dbstore1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [11:57:55] RECOVERY - Host authdns1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [11:57:55] RECOVERY - Host cloudvirt1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 
ms [11:57:56] RECOVERY - Host an-coord1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [11:57:56] RECOVERY - Host cloudvirt1023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [11:57:57] RECOVERY - Host cloudvirt1024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [11:57:57] RECOVERY - Host cloudvirt1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [11:57:58] RECOVERY - Host cloudvirt1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [11:57:59] RECOVERY - Host aqs1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [11:57:59] RECOVERY - Host an-presto1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [11:58:00] RECOVERY - Host an-presto1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [11:58:00] RECOVERY - Host conf1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [11:58:00] RECOVERY - Host an-presto1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [11:58:01] RECOVERY - Host db1075.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [11:58:02] RECOVERY - Host cp1072.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [11:58:03] RECOVERY - Host cloudstore1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [11:58:03] RECOVERY - Host db1088.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [11:58:03] RECOVERY - Host db1067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [11:58:04] RECOVERY - Host an-worker1080.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [11:58:04] RECOVERY - Host db1087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [11:58:05] RECOVERY - Host db1099.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [11:58:05] RECOVERY - Host db1076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [11:58:06] RECOVERY - Host db1094.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [12:03:42] PROBLEM - Host msw1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:03:46] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL 
- 6 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:04:04] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:04:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:04:56] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:05:44] PROBLEM - Host db1117.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:20] PROBLEM - Host db1129.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:26] PROBLEM - Host torrelay1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:06:54] PROBLEM - Host re0.cr2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:08:02] PROBLEM - Host helium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:11:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312 after schema change', diff saved to https://phabricator.wikimedia.org/P9375 and previous config saved to /var/cache/conftool/dbconfig/20191017-121106-marostegui.json [12:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:15] 04Critical Alert for device mr1-eqiad.wikimedia.org - Juniper alarm active [12:13:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore db1090:3312 original weight', diff saved to https://phabricator.wikimedia.org/P9376 and previous config saved to /var/cache/conftool/dbconfig/20191017-121330-marostegui.json [12:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:05] I think we lost wikibugs [12:15:30] 10Operations, 
10Performance-Team, 10serviceops: Increased latency in POST requests - https://phabricator.wikimedia.org/T235755 (10jijiki) [12:15:32] §22 [12:15:34] 10Operations, 10Performance-Team, 10serviceops: Increased latency in POST requests - https://phabricator.wikimedia.org/T235755 (10jijiki) [12:15:38] 10Operations, 10Performance-Team, 10serviceops: Increased latency in POST requests - https://phabricator.wikimedia.org/T235755 (10jijiki) [12:15:40] /22/10 [12:15:42] oh it is here [12:17:34] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:18:15] 04Critical Alert for device cr2-eqiad.wikimedia.org - Juniper alarm active [12:19:32] RECOVERY - Host helium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.63 ms [12:20:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:21:00] RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [12:21:02] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:21:06] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.89 ms [12:21:06] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [12:21:32] RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.70 ms [12:21:32] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 76.73 ms [12:21:38] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:22:15] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [12:22:34] RECOVERY - Juniper alarms on 
cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:22:50] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.32 ms [12:23:04] RECOVERY - Host db1117.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [12:23:34] RECOVERY - Host bast4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.23 ms [12:23:34] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.99 ms [12:23:34] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.41 ms [12:23:34] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 91.76 ms [12:23:35] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 92.83 ms [12:23:35] RECOVERY - Host cp4030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 90.65 ms [12:23:38] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 94.89 ms [12:23:42] RECOVERY - Host db1129.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [12:23:46] RECOVERY - Host torrelay1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [12:23:54] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 76.73 ms [12:24:02] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.29 ms [12:24:14] RECOVERY - Host re0.cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [12:25:24] RECOVERY - Host lvs4006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.16 ms [12:26:14] RECOVERY - Host cp4022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.90 ms [12:26:14] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [12:26:14] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [12:26:15] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.88 ms [12:26:15] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [12:26:24] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, 
RTA = 75.07 ms [12:26:24] RECOVERY - Host ganeti4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms [12:26:24] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.86 ms [12:26:24] RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.88 ms [12:26:25] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 94.81 ms [12:27:15] 04Critical Alert for device mr1-ulsfo.wikimedia.org - Device rebooted [12:27:44] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:28:39] (03PS1) 10Ayounsi: Add security zones for MRs [homer/public] - 10https://gerrit.wikimedia.org/r/543840 [12:30:52] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:30:52] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.04935 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:31:02] (03PS2) 10Ayounsi: Add security zones for MRs [homer/public] - 10https://gerrit.wikimedia.org/r/543840 [12:31:10] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:31:18] RECOVERY - haproxy failover on dbproxy1001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:31:22] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:31:48] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:32:40] PROBLEM - DNS ganeti4002.mgmt on ganeti4002.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.128.129.18 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:33:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-ulsfo.wikimedia.org recovered from Device rebooted [12:33:54] RECOVERY - IPMI Sensor Status on cp4027 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:34:37] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10Jclark-ctr) pdu swap completed all host online netbox update [12:34:59] (03PS3) 10Ayounsi: Add security zones for MRs [homer/public] - 10https://gerrit.wikimedia.org/r/543840 [12:38:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Juniper alarm active [12:39:22] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01016 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:42:32] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01105 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1129 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9377 and previous config saved to /var/cache/conftool/dbconfig/20191017-124503-marostegui.json [12:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:36] (03PS1) 10Jon Harald Søby: Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) [12:48:20] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) 
[12:49:37] (03CR) 10Jon Harald Søby: [C: 04-1] Initial configuration for mnwwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [12:50:02] !log Compress tables on db2088:3312 - T235599 [12:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:07] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 [12:50:34] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:51:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1129 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9378 and previous config saved to /var/cache/conftool/dbconfig/20191017-125154-marostegui.json [12:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2088:3312 for compression T235599', diff saved to https://phabricator.wikimedia.org/P9379 and previous config saved to /var/cache/conftool/dbconfig/20191017-125248-marostegui.json [12:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:45] (03CR) 10Urbanecm: [C: 04-2] "Per above" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [12:54:22] !log downtiming all mgmt host for 30min (mr1-eqiad needs to be rebooted) [12:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:28] (03PS2) 10Jon Harald Søby: Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) [12:55:15] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for mnwwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) (owner: 10Jon Harald Søby) [12:56:19] !log restart mr1-eqiad [12:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:42] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153) [12:56:54] PROBLEM - Host ps1-b1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:56:54] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:56:54] PROBLEM - Host ps1-b6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:56:54] PROBLEM - Host ps1-b8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:06] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:18] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:26] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:26] PROBLEM - Host mr1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:36] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:50] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:54] PROBLEM - Host ps1-a2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:54] PROBLEM - Host ps1-b2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:54] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:54] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:54] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:54] PROBLEM - Host ps1-a5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:56] PROBLEM - Host ps1-c5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:56] PROBLEM - Host ps1-c1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:56] PROBLEM - Host ps1-b7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:56] PROBLEM - Host ps1-c7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% 
[12:57:56] PROBLEM - Host ps1-c3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:58] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:58] PROBLEM - Host ps1-a1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:58] PROBLEM - Host ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:57:58] PROBLEM - Host ps1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:00] PROBLEM - Host ps1-c2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:00] PROBLEM - Host ps1-d7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:02] PROBLEM - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:02] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:04] PROBLEM - Host ps1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:06] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:06] PROBLEM - Host ps1-c6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:08] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:08] PROBLEM - Host ps1-c4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:10] PROBLEM - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:14] PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:58:20] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:59:04] PROBLEM - Host mr1-eqiad IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:ffff::6) [12:59:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:59:46] RECOVERY - Host ps1-d7-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 1.72 ms [12:59:46] RECOVERY - Host ps1-a4-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 2.15 ms [12:59:48] RECOVERY - Host asw2-b-eqiad is UP: PING WARNING - Packet loss = 54%, RTA = 1.13 ms [12:59:48] RECOVERY - Host 
ps1-c7-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 3.68 ms [12:59:48] RECOVERY - Host ps1-b6-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 3.63 ms [12:59:48] RECOVERY - Host ps1-d2-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 4.73 ms [12:59:48] RECOVERY - Host ps1-a7-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 4.06 ms [12:59:48] RECOVERY - Host ps1-c5-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 3.31 ms [12:59:48] RECOVERY - Host ps1-d3-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 2.92 ms [12:59:49] RECOVERY - Host ps1-b8-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 2.25 ms [12:59:49] RECOVERY - Host ps1-d5-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 3.15 ms [12:59:50] RECOVERY - Host ps1-d1-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 4.22 ms [12:59:50] RECOVERY - Host ps1-c2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [12:59:51] RECOVERY - Host ps1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.74 ms [12:59:51] RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [12:59:52] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [12:59:52] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [12:59:53] RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [12:59:53] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms [12:59:54] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [12:59:54] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms [12:59:55] RECOVERY - Host ps1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms [12:59:55] RECOVERY - Host ps1-c4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [12:59:56] RECOVERY - Host ps1-d4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.59 ms [12:59:56] RECOVERY - Host ps1-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.02 ms [12:59:57] 
RECOVERY - Host ps1-c3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [12:59:57] RECOVERY - Host ps1-a5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [12:59:58] RECOVERY - Host ps1-b5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [12:59:58] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.90 ms [12:59:59] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [12:59:59] RECOVERY - Host ps1-c1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [13:00:00] RECOVERY - Host ps1-b4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [13:00:00] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [13:00:06] RECOVERY - Host ps1-b1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [13:00:30] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:00:46] RECOVERY - Host mr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [13:00:50] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [13:02:04] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 5.82 ms [13:03:10] 10Puppet, 10Patch-For-Review: Populate puppetdb1002 with live data - https://phabricator.wikimedia.org/T235655 (10jbond) p:05Triage→03Normal [13:03:22] RECOVERY - Host ps1-a1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [13:04:32] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms [13:06:36] !log rollback failover vrrp from cr2-eqiad to cr1-eqiad - T227133 [13:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:40] T227133: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 [13:06:52] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active [13:09:08] (03PS3) 10Jon Harald Søby: Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) [13:11:38] (03PS1) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [13:11:58] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.001451 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:13:08] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003676 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:13:26] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:17:47] (03PS2) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [13:18:00] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:18:16] 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-eqiad.wikimedia.org recovered from Juniper alarm active [13:18:56] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) The copy finished correctly and actually found a bug on transfer.py: ` ERROR: Original checksum c89dcd766fa3072718753b9ab0bdfb7d baculasd... 
[13:19:30] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: apply throttle filter to all log levels [puppet] - 10https://gerrit.wikimedia.org/r/543708 (owner: 10Herron) [13:22:05] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10Cmjohnson) [13:22:15] (03PS3) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [13:22:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [13:23:03] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:24:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM as one of the mitigations, thanks Antoine!" [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [13:24:33] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:24:35] (03PS4) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [13:25:37] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:45] (03PS5) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [13:26:01] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:28:05] (03CR) 10jerkins-bot: [V: 04-1] 
puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:29:59] RECOVERY - DNS ganeti4002.mgmt on ganeti4002.mgmt is OK: DNS OK: 0.014 seconds response time. ganeti4002.mgmt.ulsfo.wmnet returns 10.128.129.18 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:30:43] (03CR) 10Volans: "minor formatting nits inline" (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/543840 (owner: 10Ayounsi) [13:30:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1129 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9380 and previous config saved to /var/cache/conftool/dbconfig/20191017-133047-marostegui.json [13:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:18] (03PS1) 10Faidon Liambotis: librenms: exclude another PDU secondary model [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543848 (https://phabricator.wikimedia.org/T223450) [13:32:28] (03PS6) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [13:32:40] (03PS1) 10Faidon Liambotis: accounting: map the Date field as well [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543849 [13:32:57] (03PS7) 10Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) [13:33:05] (03CR) 10jerkins-bot: [V: 04-1] librenms: exclude another PDU secondary model [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543848 (https://phabricator.wikimedia.org/T223450) (owner: 10Faidon Liambotis) [13:33:17] (03CR) 10Vgutierrez: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [13:33:35] (03CR) 10jerkins-bot: [V: 04-1] accounting: map the Date field as well [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543849 (owner: 10Faidon Liambotis) [13:34:29] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [13:35:27] (03PS2) 10Faidon Liambotis: librenms: exclude another PDU secondary model [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543848 (https://phabricator.wikimedia.org/T223450) [13:38:24] (03CR) 10Faidon Liambotis: [C: 03+2] librenms: exclude another PDU secondary model [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543848 (https://phabricator.wikimedia.org/T223450) (owner: 10Faidon Liambotis) [13:38:43] (03PS2) 10Faidon Liambotis: accounting: map the Date field as well [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543849 [13:41:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1129 after PDU maintenance', diff saved to https://phabricator.wikimedia.org/P9381 and previous config saved to /var/cache/conftool/dbconfig/20191017-134112-marostegui.json [13:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:48] (03PS1) 10Giuseppe Lavagetto: scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 [13:48:01] PROBLEM - Old JVM GC check - elastic1050-production-search-psi-eqiad on elastic1050 is CRITICAL: 160.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [13:49:25] looking ^ [13:58:45] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 3 others: Provision Kask for Echo timestamp storage 
in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) >>! In T234376#5582893, @Joe wrote: > Heh yes sorry, I forgot to tell you yesterday - you need to use `helmfile destroy` in newer... [14:00:01] dcausse: thanks! [14:01:56] (03PS1) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) [14:04:08] (03CR) 10jerkins-bot: [V: 04-1] Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [14:04:17] (03PS8) 10BBlack: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [14:05:08] (03PS1) 10Muehlenhoff: Bump meta package for new ABI in 4.9.189 for jessie [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/543859 [14:07:22] (03PS2) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) [14:07:34] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump meta package for new ABI in 4.9.189 for jessie [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/543859 (owner: 10Muehlenhoff) [14:09:42] !log ✔️ cdanis@install1002.wikimedia.org ~ 🕙☕ sudo -E reprepro --restrict grafana update buster-wikimedia [14:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:45] 10Operations, 10Traffic, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10BBlack) Notes from IRC, etc: The current patch (merging shortly: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541220/ ) gets us... 
[14:14:18] (03PS3) 10Muehlenhoff: Add Icinga check for monitoring correct application of CPU microcode updates [puppet] - 10https://gerrit.wikimedia.org/r/543858 (https://phabricator.wikimedia.org/T235250) [14:15:11] 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10BBlack) >>! In T230687#5422646, @BBlack wrote: > @Vgutierrez may have some ideas about how to tackle these, but it's behind othe... [14:15:24] 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10BBlack) [14:15:27] 10Operations, 10Traffic, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10BBlack) [14:16:53] 10Operations, 10netbox: update librenms report - https://phabricator.wikimedia.org/T235716 (10faidon) 05Open→03Resolved Fixed. 
[14:16:55] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10faidon) [14:20:27] (03CR) 10Faidon Liambotis: [C: 03+2] accounting: map the Date field as well [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/543849 (owner: 10Faidon Liambotis) [14:22:39] (03PS1) 10CDanis: grafana1002: answer for grafana-beta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/543861 (https://phabricator.wikimedia.org/T220838) [14:23:51] (03CR) 10CDanis: [C: 03+2] grafana1002: answer for grafana-beta.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/543861 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [14:27:38] (03PS1) 10CDanis: grafana-beta.wikimedia.org: point to grafana1002 [puppet] - 10https://gerrit.wikimedia.org/r/543862 (https://phabricator.wikimedia.org/T220838) [14:29:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/542922 (https://phabricator.wikimedia.org/T235427) (owner: 10Jbond) [14:29:28] (03PS1) 10Alexandros Kosiaris: Add IPv6 RRs for backup hosts [dns] - 10https://gerrit.wikimedia.org/r/543863 [14:31:53] (03CR) 10CDanis: [C: 03+1] prometheus: turn puppet failed run into a boolean (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542251 (owner: 10Filippo Giunchedi) [14:32:11] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) [14:32:23] !log disable puppet on cache fleet (cp*) ahead of cert deployment refactoring - T234803 [14:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:27] T234803: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 [14:32:56] !log uploaded linux-meta 1.22 for jessie-wikimedia [14:32:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:44] (03PS6) 10Filippo Giunchedi: profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) [14:35:31] (03PS9) 10BBlack: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [14:35:48] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: turn puppet failed run into a boolean (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542251 (owner: 10Filippo Giunchedi) [14:35:58] (03PS4) 10Filippo Giunchedi: prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 [14:36:07] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18904/" [puppet] - 10https://gerrit.wikimedia.org/r/543862 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [14:37:08] (03PS3) 10Filippo Giunchedi: DNM Revert "hieradata: add acmechief cluster" [puppet] - 10https://gerrit.wikimedia.org/r/540246 [14:37:38] !log banning elastic1050:psi to investigate gc issues [14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:06] (03CR) 10Jcrespo: [C: 03+1] "Ok idea and syntax, but I didn't check the ips were free." 
[dns] - 10https://gerrit.wikimedia.org/r/543863 (owner: 10Alexandros Kosiaris) [14:39:01] 10Operations, 10DC-Ops, 10decommission: decommission eeden - https://phabricator.wikimedia.org/T235770 (10MoritzMuehlenhoff) [14:39:13] (03PS10) 10BBlack: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [14:39:21] (03PS5) 10Filippo Giunchedi: prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 [14:39:23] (03PS4) 10Jon Harald Søby: Initial configuration for mnwwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543842 (https://phabricator.wikimedia.org/T235739) [14:39:25] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: turn puppet failed run into a boolean [puppet] - 10https://gerrit.wikimedia.org/r/542251 (owner: 10Filippo Giunchedi) [14:39:47] waiting for CI and rebases and all of that on puppet: https://media.giphy.com/media/5Scemv9W86owg/source.gif [14:41:10] (03PS11) 10BBlack: ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [14:43:26] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10BBlack) IRC says the meeting was mostly consumed by OKR discussion, it may ha... 
[14:44:33] (03PS1) 10Jcrespo: Redirect stderr to /dev/null to prevent cronspam [puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) [14:45:03] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10crusnov) p:05Triage→03Normal a:03Dzahn [14:45:22] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (10crusnov) p:05Triage→03Normal [14:45:30] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10BBlack) 05Open→03Resolved a:03BBlack It's switched to `Rebase-if-necess... [14:45:49] (03CR) 10BBlack: [C: 03+2] ATS: Pick the unified cert using hiera key public_tls_unified_cert_vendor [puppet] - 10https://gerrit.wikimedia.org/r/541220 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [14:46:07] !log installing 4.9.189 Linux update on jessie hosts (no reboots, deploying the package only at this point) [14:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:13] 10Operations, 10Icinga, 10fundraising-tech-ops: dwisehaupt needs access to iginca for frack hosts - https://phabricator.wikimedia.org/T235676 (10crusnov) p:05Triage→03Normal [14:46:37] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Joe) One detail I want to understand about the Redis hypothesis: - What happens if the rate-limiting... 
[14:46:42] (03CR) 10jerkins-bot: [V: 04-1] Redirect stderr to /dev/null to prevent cronspam [puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [14:46:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [14:48:36] (03PS3) 10Jhedden: openstack: update eqiad1 clients to wikimediacloud auth url [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) [14:50:02] (03PS2) 10Jcrespo: Redirect stderr to /dev/null to prevent cronspam [puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) [14:50:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add IPv6 RRs for backup hosts [dns] - 10https://gerrit.wikimedia.org/r/543863 (owner: 10Alexandros Kosiaris) [14:50:45] (03CR) 10Jhedden: [C: 03+2] openstack: update eqiad1 clients to wikimediacloud auth url [puppet] - 10https://gerrit.wikimedia.org/r/542452 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:51:11] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) > What happens if the rate-limiting service is unavailable or is lagging? In Change-Prop, w... 
[14:51:23] (03PS7) 10Filippo Giunchedi: profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) [14:51:26] (03PS4) 10Filippo Giunchedi: DNM Revert "hieradata: add acmechief cluster" [puppet] - 10https://gerrit.wikimedia.org/r/540246 [14:52:07] (03CR) 10Filippo Giunchedi: profile: sanity checks for cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [14:52:09] (03CR) 10jerkins-bot: [V: 04-1] Redirect stderr to /dev/null to prevent cronspam [puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [14:54:03] (03PS3) 10Jcrespo: swap: Redirect stderr to /dev/null to prevent cronspam [puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) [14:54:06] (03PS1) 10Ema: cache: reimage cp4028 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/543867 (https://phabricator.wikimedia.org/T227432) [14:55:04] (03CR) 10Filippo Giunchedi: [C: 03+2] "LGTM, fails on undefined cluster on a dummy revert: https://puppet-compiler.wmflabs.org/compiler1001/18906/acmechief1001.eqiad.wmnet/chang" [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [14:57:01] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T235695 (10Papaul) a:05Papaul→03Marostegui Disk replaced [14:57:49] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T235695 (10Marostegui) Thanks! 
` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding) ` [15:00:15] (03PS1) 10BBlack: Add local flock to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/543869 [15:01:18] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Papaul) ` [edit interfaces interface-range vlan-private1-d-codfw] - member ge-6/0/0; [edit interfaces interface-range disabled] member ge-6/0/14 { ... } + member ge... [15:01:20] (03CR) 10CDanis: [C: 03+1] Add local flock to puppet-merge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543869 (owner: 10BBlack) [15:01:39] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2052.codfw.wmnet - https://phabricator.wikimedia.org/T230883 (10Papaul) [15:01:46] !log dumping jvm heap on elastic1050:psi to investigate gc issues [15:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] (03PS2) 10BBlack: Add local flock to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/543869 [15:02:51] (03CR) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [15:03:05] (03PS14) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [15:03:20] (03PS4) 10Ayounsi: Add security zones for MRs [homer/public] - 10https://gerrit.wikimedia.org/r/543840 [15:03:22] (03PS1) 10Ayounsi: Add LLDP and IGMP snooping support [homer/public] - 10https://gerrit.wikimedia.org/r/543870 [15:05:16] (03CR) 10Jcrespo: "Check if this make sense to avoid one cronspam message per day (or other method is preferred). Also if nothing breaks on deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [15:06:54] (03CR) 10Ottomata: [C: 03+1] swap: Redirect stderr to /dev/null to prevent cronspam [puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [15:07:18] (03CR) 10Filippo Giunchedi: prometheus global: add rules for correct global HTTP avail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567) (owner: 10CDanis) [15:07:37] !log unbanning elastic1050:psi [15:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] (03PS3) 10BBlack: Add local flock to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/543869 [15:14:39] (03PS1) 10Filippo Giunchedi: hieradata: bump kafka-logging default partitions [puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) [15:15:49] (03CR) 10Ayounsi: "Thx!" 
(033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/543840 (owner: 10Ayounsi) [15:16:45] (03PS1) 10BBlack: Fixup flock for authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/543874 [15:18:33] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: add swiftrepl [puppet] - 10https://gerrit.wikimedia.org/r/536586 (https://phabricator.wikimedia.org/T162123) (owner: 10Filippo Giunchedi) [15:19:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1090:3317 and remove db1136 from its temporary vslow,dump role', diff saved to https://phabricator.wikimedia.org/P9382 and previous config saved to /var/cache/conftool/dbconfig/20191017-151952-marostegui.json [15:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:20] (03CR) 10Filippo Giunchedi: [C: 04-1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [15:22:28] (03CR) 10Elukey: "Thanks for the comments! I wasn't aware of it, will try to add :)" [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [15:25:08] (03PS4) 10Elukey: aqs: replace logstash host/port with rsyslog localhost/port [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) [15:25:58] (03PS1) 10CRusnov: netbox: add max-requests parameter to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/543876 [15:29:41] (03PS2) 10Andrew Bogott: Revert "cloudweb2001-dev: remove wikitech profiles" [puppet] - 10https://gerrit.wikimedia.org/r/537683 [15:30:20] (03CR) 10Elukey: "Thanks a lot for the code change, I wanted to fix it but got sidetracked." 
[puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [15:30:38] (03PS1) 10Alexandros Kosiaris: profile::backup::host: Add the ability to configure ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) [15:32:08] (03CR) 10CRusnov: "Compiler looks good." [puppet] - 10https://gerrit.wikimedia.org/r/543876 (owner: 10CRusnov) [15:32:46] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloudweb2001-dev: remove wikitech profiles" [puppet] - 10https://gerrit.wikimedia.org/r/537683 (owner: 10Andrew Bogott) [15:34:37] RECOVERY - Old JVM GC check - elastic1050-production-search-psi-eqiad on elastic1050 is OK: (C)100 gt (W)80 gt 8.136 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [15:34:55] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18910/aqs1008.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [15:36:37] (03PS2) 10Alexandros Kosiaris: profile::backup::host: Add the ability to configure ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) [15:40:40] (03PS3) 10Herron: logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 [15:42:19] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - Certificate *.wmflabs.org valid until 2019-11-16 15:41:05 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/toolforge/ [15:44:04] (03CR) 10Herron: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [15:46:37] (03CR) 10Jcrespo: "Please do take over!" 
[puppet] - 10https://gerrit.wikimedia.org/r/543866 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [15:48:22] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Aklapper) [15:52:14] (03CR) 10Herron: logstash: apply truncate filter to all fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [15:52:22] (03CR) 10Ema: [C: 03+1] Add local flock to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/543869 (owner: 10BBlack) [15:53:33] (03CR) 10Jcrespo: "Running puppet compiler: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18913/" [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) (owner: 10Alexandros Kosiaris) [15:55:06] (03PS2) 10Herron: logstash: apply throttle filter to all log levels [puppet] - 10https://gerrit.wikimedia.org/r/543708 [15:57:31] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to adjust the number as we see fit based on the results." [puppet] - 10https://gerrit.wikimedia.org/r/543876 (owner: 10CRusnov) [15:57:49] (03CR) 10Jcrespo: [C: 03+1] "Seems correct: https://puppet-compiler.wmflabs.org/compiler1001/18913/" [puppet] - 10https://gerrit.wikimedia.org/r/543877 (https://phabricator.wikimedia.org/T229209) (owner: 10Alexandros Kosiaris) [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:09] (03CR) 10Herron: [C: 03+2] logstash: apply throttle filter to all log levels [puppet] - 10https://gerrit.wikimedia.org/r/543708 (owner: 10Herron) [16:05:22] (03CR) 10Cwhite: [C: 03+1] logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [16:05:41] (03CR) 10CRusnov: [C: 03+2] netbox: add max-requests parameter to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/543876 (owner: 10CRusnov) [16:06:16] (03PS1) 10Volans: devices: fix behaviour according to docstring [software/homer] - 10https://gerrit.wikimedia.org/r/543886 [16:06:18] (03PS1) 10Volans: typing: use Mapping instead of Dict for arguments [software/homer] - 10https://gerrit.wikimedia.org/r/543887 [16:06:20] (03PS1) 10Volans: tests: increase coverage for transports.junos [software/homer] - 10https://gerrit.wikimedia.org/r/543888 [16:06:22] (03PS1) 10Volans: devices: refactor signature [software/homer] - 10https://gerrit.wikimedia.org/r/543889 [16:06:24] (03PS1) 10Volans: netbox: allow to select the devices from Netbox [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388) [16:13:25] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/543795 (owner: 10Ayounsi) [16:16:09] (03CR) 10Volans: [C: 03+1] "LGTM, minor optional nit inline" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/543840 (owner: 10Ayounsi) [16:16:58] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/543870 (owner: 10Ayounsi) [16:17:04] (03CR) 10Bstorm: [C: 03+1] "Can't seem to get the compiler working, but it looks good! Looks less like a "testing" setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/543815 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:17:50] (03CR) 10Volans: "This code would require that we import into Netbox the FQDN of the virtual chassis into the domain property and the FQDN of the other devi" [software/homer] - 10https://gerrit.wikimedia.org/r/543890 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [16:23:45] 10Operations, 10DC-Ops, 10decommission: decommission eeden - https://phabricator.wikimedia.org/T235770 (10wiki_willy) [16:24:56] 10Operations, 10DC-Ops, 10decommission: decommission eeden - https://phabricator.wikimedia.org/T235770 (10wiki_willy) I just made one slight change - changed the point person to @Jclark-ctr for assigning eqiad decom tasks [16:25:54] (03PS2) 10Herron: logstash: raise elasticsearch mapping limit [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [16:26:14] (03CR) 10BBlack: [C: 03+2] Add local flock to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/543869 (owner: 10BBlack) [16:28:54] (03CR) 10Herron: [C: 03+2] logstash: raise elasticsearch mapping limit [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [16:30:12] (03PS3) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [16:32:31] (03CR) 10jerkins-bot: [V: 04-1] lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:35:38] (03CR) 10BBlack: [C: 03+2] Fixup flock for authdns-local-update [puppet] - 10https://gerrit.wikimedia.org/r/543874 (owner: 10BBlack) [16:37:21] 10Operations, 10Puppet, 10Patch-For-Review: occasional puppet errors: Error 500 on SERVER: Server Error: Unsupported facts format - 
https://phabricator.wikimedia.org/T233643 (10Aklapper) [16:44:29] (03PS4) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [16:46:22] (03PS1) 10Andrew Bogott: cloudweb2001-dev: move to php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/543898 [16:47:33] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb2001-dev: move to php7.2 [puppet] - 10https://gerrit.wikimedia.org/r/543898 (owner: 10Andrew Bogott) [16:47:53] (03PS4) 10Herron: logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 [16:52:09] (03CR) 10Herron: [C: 03+2] logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [16:57:46] (03PS1) 10BBlack: Test digicert-2019 on cp3030 and cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/543900 (https://phabricator.wikimedia.org/T209515) [16:58:00] (03PS1) 10BBlack: Deploy digicert-2019 to esams [puppet] - 10https://gerrit.wikimedia.org/r/543901 (https://phabricator.wikimedia.org/T209515) [16:58:46] 10Operations, 10DC-Ops, 10decommission: decommission - https://phabricator.wikimedia.org/T235785 (10RobH) [16:58:57] 10Operations, 10DC-Ops, 10decommission: decommission - https://phabricator.wikimedia.org/T235785 (10RobH) 05Open→03Declined [17:00:04] cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T1700). 
[17:00:15] no parsoid deploy today [17:00:21] (03CR) 10BBlack: [C: 03+2] Test digicert-2019 on cp3030 and cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/543900 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [17:02:40] (03PS1) 10Herron: logstash: increase filter truncate max_count to 125000 [puppet] - 10https://gerrit.wikimedia.org/r/543904 [17:03:07] RECOVERY - Memcached on cloudweb2001-dev is OK: TCP OK - 0.036 second response time on 208.80.153.60 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [17:03:13] (03PS2) 10Herron: logstash: increase filter truncate max_counters to 125000 [puppet] - 10https://gerrit.wikimedia.org/r/543904 [17:16:08] (03CR) 10BBlack: [C: 03+2] Deploy digicert-2019 to esams [puppet] - 10https://gerrit.wikimedia.org/r/543901 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [17:36:00] 10Operations, 10Traffic, 10Patch-For-Review: Renew Digicert Unified in 2019 - https://phabricator.wikimedia.org/T209515 (10BBlack) 05Open→03Resolved Digicert-2019 is now in live use at the `esams` edge and we have full normal redundancy (for now) among commercial cert vendors. Random status update on ot... [17:36:25] (03CR) 10Filippo Giunchedi: [C: 04-1] "A good step in the right direction! See inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [17:49:35] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: increase filter truncate max_counters to 125000 [puppet] - 10https://gerrit.wikimedia.org/r/543904 (owner: 10Herron) [17:51:31] (03PS3) 10Herron: logstash: increase filter truncate max_counters to 125000 [puppet] - 10https://gerrit.wikimedia.org/r/543904 [17:53:45] (03CR) 10Herron: [C: 03+2] logstash: increase filter truncate max_counters to 125000 [puppet] - 10https://gerrit.wikimedia.org/r/543904 (owner: 10Herron) [17:57:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [18:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:13] 04Critical Alert for device scs-a8-eqiad.mgmt.eqiad.wmnet - Device rebooted [18:00:15] PROBLEM - Host ps1-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:01:01] (03PS4) 10Cwhite: profile, prometheus: install swagger exporter on icinga [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [18:01:01] (03PS5) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [18:01:07] RECOVERY - Host ps1-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.32 ms [18:01:20] !log update librdkafka on eventlog1002 and restart eventlogging [18:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:49] (03CR) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:01:56] <_joe_> !log depooled wtp1025 from parsoid, parsoid-php to allow running benchmarks there [18:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:17] (03PS1) 10Andrew Bogott: cloudweb2001: set an lvs IP [puppet] - 10https://gerrit.wikimedia.org/r/543910 [18:04:33] PROBLEM - ps1-a8-eqiad-infeed-load-tower-B-phase-X on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:37] PROBLEM - ps1-a8-eqiad-infeed-load-tower-B-phase-Z 
on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:47] PROBLEM - ps1-a8-eqiad-infeed-load-tower-A-phase-Z on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:59] PROBLEM - ps1-a8-eqiad-infeed-load-tower-A-phase-X on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:13] PROBLEM - ps1-a8-eqiad-infeed-load-tower-B-phase-Y on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:16] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb2001: set an lvs IP [puppet] - 10https://gerrit.wikimedia.org/r/543910 (owner: 10Andrew Bogott) [18:05:29] PROBLEM - ps1-a8-eqiad-infeed-load-tower-A-phase-Y on ps1-a8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:07:31] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) a:05Cmjohnson→03RobH [18:09:13] 04̶C̶r̶i̶t̶i̶c̶a̶l Device scs-a8-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [18:09:37] (03CR) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:10:40] (03PS1) 10RobH: updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) [18:11:03] 10Operations, 10observability, 10Availability, 10Goal: Setup bacula backup monitoring - 
https://phabricator.wikimedia.org/T234900 (10jcrespo) I've created a quick and dirty script that extracts backup job status without needing to query the database. For example, in 6 lines it can get the backup host whose... [18:11:19] (03CR) 10RobH: [C: 03+2] updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) (owner: 10RobH) [18:11:22] (03CR) 10jerkins-bot: [V: 04-1] updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) (owner: 10RobH) [18:12:00] grr [18:12:09] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10srodlund) Hey all -- I am currently seeking some answers to some basic infrastructure questions. Unfortunat... [18:13:11] (03PS2) 10RobH: updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) [18:13:12] 04Critical Alert for device ps1-a8-eqiad.mgmt.eqiad.wmnet - Device rebooted [18:13:24] yes yes librenms we know =] [18:13:48] (03PS5) 10Cwhite: profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) [18:13:56] (03CR) 10jerkins-bot: [V: 04-1] updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) (owner: 10RobH) [18:14:10] (03PS1) 10Bstorm: host monitoring: add optional contact group for mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) [18:15:05] (03PS6) 10Cwhite: lvs, prometheus, profile: add swagger exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/542472 (https://phabricator.wikimedia.org/T205870) [18:15:26] wtf [18:15:29] the 
top line is a subject [18:15:39] then blank, then details, then blank, then bug: task number [18:15:49] i don't get what i did wrong and why it's bitching that line 6 shouldn't be blank [18:15:51] (03CR) 10jerkins-bot: [V: 04-1] profile, prometheus, role: install swagger exporter on prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/541619 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [18:16:10] robh that's because Bug: needs to be on top of Change-Id [18:16:29] with no space between them? [18:16:33] yup [18:16:50] i think that's just how my commit message id was inserted by git [18:17:00] i'll have to find where it's set and change it manually for future updates, retrying with the space removed [18:17:08] now it's bug: blah blah and changeid directly on the line below [18:17:12] (03PS3) 10RobH: updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) [18:18:13] (03CR) 10jerkins-bot: [V: 04-1] updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) (owner: 10RobH) [18:18:18] arghsfkasdlkjfasd [18:18:24] bleh [18:20:19] i don't get why it failed [18:20:37] the details say it passed [18:20:42] do i have to remove the -1 before i rebase? 
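Note on the commit message exchange above (18:15 to 18:17): the rule described is that the Bug: line has to sit directly above Change-Id:, in the same final paragraph, with no blank line between them. A minimal sketch of such a check follows. It is not the linter CI actually runs; the function name `footer_problems` and the Change-Id value in the usage note are made up for illustration, and only the task number comes from the log.

```python
def footer_problems(message: str) -> list[str]:
    """Return problems found in the footer (last paragraph) of a commit message.

    Hypothetical helper, not the real CI check: it only tests that any
    Bug: line shares the final paragraph with Change-Id:.
    """
    lines = message.rstrip("\n").split("\n")
    problems = []

    # The footer is the last run of non-blank lines in the message.
    footer_start = len(lines)
    while footer_start > 0 and lines[footer_start - 1].strip():
        footer_start -= 1
    footer = lines[footer_start:]

    if not any(line.startswith("Change-Id:") for line in footer):
        problems.append("Change-Id: is missing from the final paragraph")

    # A Bug: line found above the footer is cut off from Change-Id:
    # by at least one blank line.
    for number, line in enumerate(lines[:footer_start], start=1):
        if line.startswith("Bug:"):
            problems.append(
                f"line {number}: Bug: is separated from Change-Id: by a blank line"
            )
    return problems
```

Usage, with a placeholder Change-Id: `footer_problems("subject\n\ndetails\n\nBug: T233129\n\nChange-Id: I0000\n")` reports the stranded Bug: line, while the same message with the blank line removed returns an empty list.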
[18:21:52] (03PS4) 10RobH: updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) [18:22:35] (03CR) 10jerkins-bot: [V: 04-1] updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) (owner: 10RobH) [18:22:38] robh: it looks like it wants an extra space before the => on lines 71-73, so that they line up vertically with the one you added on 74 [18:23:13] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-a8-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [18:23:28] but I'm new, might be misreading :) take with a grain of salt [18:23:33] ahh, ok [18:23:36] i see [18:23:41] rlazarus that's correct. [18:24:02] it wants it to align => [18:24:19] so you need to adjust 71-73 [18:24:41] yeah doin [18:24:44] thx =] [18:25:00] (03PS5) 10RobH: updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) [18:25:08] so many patchsets for such an easy thing [18:25:10] heh [18:26:05] (03CR) 10RobH: [C: 03+2] updating pdu model for ps1-a8-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543913 (https://phabricator.wikimedia.org/T233129) (owner: 10RobH) [18:26:32] \o/ [18:26:41] !log wtp1025 - cd /srv/deployment/parsoid/deploy/src ; sudo -u deploy-service ln -s ../vendor (for benchmarking test) [18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:41] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) [18:34:56] (03PS1) 10Bstorm: wiki replicas: Add the labsdb1012 replica to maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/543924 (https://phabricator.wikimedia.org/T235791) [18:37:52] (03CR) 10Jhedden: [C: 03+1] wiki replicas: Add the labsdb1012 replica to maintain_dbusers [puppet] - 10https://gerrit.wikimedia.org/r/543924 
(https://phabricator.wikimedia.org/T235791) (owner: 10Bstorm) [18:39:08] (03CR) 10ArielGlenn: [C: 03+1] "All the moving parts look right to me but I would love it if someone with more expertise on db configs gave the final signoff." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [18:40:32] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: pull asset tag info off msw1-eqiad - https://phabricator.wikimedia.org/T235793 (10RobH) [18:42:50] (03CR) 10Dzahn: [C: 03+1] "nothing breaks in compiler https://puppet-compiler.wmflabs.org/compiler1002/18918/ and seems fine to me. let me add some more 'observabili" [puppet] - 10https://gerrit.wikimedia.org/r/543916 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [18:47:06] RECOVERY - ps1-a8-eqiad-infeed-load-tower-B-phase-X on ps1-a8-eqiad is OK: SNMP OK - ps1-a8-eqiad-infeed-load-tower-B-phase-X 181 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:10] RECOVERY - ps1-a8-eqiad-infeed-load-tower-B-phase-Z on ps1-a8-eqiad is OK: SNMP OK - ps1-a8-eqiad-infeed-load-tower-B-phase-Z 443 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:24] RECOVERY - ps1-a8-eqiad-infeed-load-tower-A-phase-Z on ps1-a8-eqiad is OK: SNMP OK - ps1-a8-eqiad-infeed-load-tower-A-phase-Z 400 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:32] RECOVERY - ps1-a8-eqiad-infeed-load-tower-A-phase-X on ps1-a8-eqiad is OK: SNMP OK - ps1-a8-eqiad-infeed-load-tower-A-phase-X 111 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:31] (03PS1) 10Andrew Bogott: enable enable_fpm on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/543925 [18:52:33] (03PS3) 10Dzahn: discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 
(https://phabricator.wikimedia.org/T233654) [18:53:15] (03PS2) 10Andrew Bogott: enable_fpm=true on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/543925 [18:54:58] (03CR) 10Andrew Bogott: [C: 03+2] enable_fpm=true on cloudweb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/543925 (owner: 10Andrew Bogott) [18:55:11] (03CR) 10CDanis: [C: 03+1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [18:55:15] andrewbogott: you will need more in Hiera [18:55:28] mutante: what else? [18:55:34] profile::mediawiki::php::fpm_config: at least [18:55:40] take a look at ..eh... [18:55:50] you mean like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/543925/ [18:55:51] ? [18:55:55] hieradata/role/common/parsoid/testing.yaml for example [18:56:15] see it has profile::mediawiki::php::enable_fpm: true [18:56:20] but also the lines after that [18:57:33] 29 to 32 and maybe 34-37 [18:59:50] mutante: I don't see those things set specifically for the other wikitech hosts… maybe we're getting good defaults from an included profile? [19:00:01] anyway, with enable_fpm:true it seems to be loading [19:00:05] longma: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T1900). [19:00:19] (complaining about database access, which I think means that mediawiki is loading) [19:00:24] andrewbogott: oh! 
ok, well good then [19:04:16] RECOVERY - ps1-a8-eqiad-infeed-load-tower-A-phase-Y on ps1-a8-eqiad is OK: SNMP OK - ps1-a8-eqiad-infeed-load-tower-A-phase-Y 485 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:06:08] RECOVERY - ps1-a8-eqiad-infeed-load-tower-B-phase-Y on ps1-a8-eqiad is OK: SNMP OK - ps1-a8-eqiad-infeed-load-tower-B-phase-Y 361 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:59] (03PS1) 10Aaron Schulz: Reduce arclamp .log file retention from 90 to 45 days [puppet] - 10https://gerrit.wikimedia.org/r/543931 (https://phabricator.wikimedia.org/T235455) [19:37:36] (03PS1) 10Andrew Bogott: clouddb2001-dev: open firewall for cloudweb to access the database [puppet] - 10https://gerrit.wikimedia.org/r/543936 [19:39:22] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:39:44] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:39:52] (03PS1) 10Dzahn: install_server: update DHCP config for moscovium [puppet] - 10https://gerrit.wikimedia.org/r/543937 (https://phabricator.wikimedia.org/T232077) [19:40:19] (03CR) 10Andrew Bogott: [C: 03+2] clouddb2001-dev: open firewall for cloudweb to access the database [puppet] - 10https://gerrit.wikimedia.org/r/543936 (owner: 10Andrew Bogott) [19:42:07] (03CR) 10Dzahn: [C: 03+2] install_server: update DHCP config for moscovium [puppet] - 10https://gerrit.wikimedia.org/r/543937 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn) [19:42:16] (03PS2) 10Dzahn: install_server: update DHCP config for moscovium [puppet] - 10https://gerrit.wikimedia.org/r/543937 (https://phabricator.wikimedia.org/T232077) [19:46:58] (03CR) 10Andrew Bogott: "> please do test on mwdebug in eqiad and codfw with some requests for production wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [19:49:30] (03PS1) 10Paladox: Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) [19:50:09] (03CR) 10jerkins-bot: [V: 04-1] Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) (owner: 10Paladox) [19:51:31] (03CR) 10BryanDavis: [C: 03+1] "If I understand the maintain-dbusers.py script correctly, I think that manually running `sudo maintain-dbusers harvest-replicas` manually " [puppet] - 10https://gerrit.wikimedia.org/r/543924 (https://phabricator.wikimedia.org/T235791) (owner: 10Bstorm) [19:51:38] (03PS2) 10Paladox: Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) [19:52:14] (03CR) 
10jerkins-bot: [V: 04-1] Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) (owner: 10Paladox) [19:53:58] (03Abandoned) 10Paladox: Merge tag 'v2.16.11' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/533350 (owner: 10Paladox) [19:54:27] (03Restored) 10Paladox: Merge tag 'v2.16.11' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/533350 (owner: 10Paladox) [19:54:30] (03CR) 10Paladox: [C: 03+2] Merge tag 'v2.16.11' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/533350 (owner: 10Paladox) [19:55:01] (03Abandoned) 10Paladox: Merge tag 'v2.15.16' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/533332 (owner: 10Paladox) [19:55:19] (03PS3) 10Paladox: Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) [19:56:24] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:56:25] (03PS4) 10Paladox: Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) [19:57:15] (03PS1) 10Andrew Bogott: wikitech: use the new labtest ldap server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 [19:57:45] (03PS1) 10Dzahn: site: add moscovium.eqiad.wmnet as a spare system [puppet] - 10https://gerrit.wikimedia.org/r/543944 (https://phabricator.wikimedia.org/T232077) [19:59:56] (03PS1) 10Andrew Bogott: codfw1dev: correct lookup for the nova proxy ldap password [puppet] - 10https://gerrit.wikimedia.org/r/543946 [20:01:44] (03Merged) 10jenkins-bot: Merge tag 'v2.16.11' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/533350 (owner: 10Paladox) [20:03:50] (03PS2) 10Andrew Bogott: codfw1dev: correct lookup for the nova proxy ldap password [puppet] - 10https://gerrit.wikimedia.org/r/543946 [20:04:26] (03PS2) 10Dzahn: site: add moscovium.eqiad.wmnet as a spare system [puppet] - 10https://gerrit.wikimedia.org/r/543944 (https://phabricator.wikimedia.org/T232077) [20:05:32] (03CR) 10Dzahn: [C: 03+2] site: add moscovium.eqiad.wmnet as a spare system [puppet] - 10https://gerrit.wikimedia.org/r/543944 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn) [20:05:39] (03PS1) 10Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/543947 [20:05:42] (03PS3) 10Dzahn: site: add moscovium.eqiad.wmnet as a spare system [puppet] - 10https://gerrit.wikimedia.org/r/543944 (https://phabricator.wikimedia.org/T232077) [20:06:53] (03PS3) 10Andrew Bogott: codfw1dev: correct lookup for the nova proxy ldap password [puppet] - 10https://gerrit.wikimedia.org/r/543946 [20:07:32] RECOVERY - Check the Netbox report puppetdb for fail status. 
on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [20:07:51] (03CR) 10Paladox: [C: 03+2] Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/543947 (owner: 10Paladox) [20:09:55] (03PS1) 10Dzahn: site: add requesttracker role on moscovium on buster [puppet] - 10https://gerrit.wikimedia.org/r/543954 (https://phabricator.wikimedia.org/T180641) [20:10:35] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: correct lookup for the nova proxy ldap password [puppet] - 10https://gerrit.wikimedia.org/r/543946 (owner: 10Andrew Bogott) [20:14:15] (03PS2) 10Andrew Bogott: wikitech: use the new codfw1-dev servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543943 [20:19:06] (03Merged) 10jenkins-bot: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/543947 (owner: 10Paladox) [20:20:00] (03PS1) 10Andrew Bogott: m5 grants: remove grants for 'labtestwiki' database [puppet] - 10https://gerrit.wikimedia.org/r/543955 (https://phabricator.wikimedia.org/T233236) [20:34:13] (03CR) 10Krinkle: [C: 03+1] Reduce arclamp .log file retention from 90 to 45 days [puppet] - 10https://gerrit.wikimedia.org/r/543931 (https://phabricator.wikimedia.org/T235455) (owner: 10Aaron Schulz) [20:47:26] (03PS1) 10Volans: doc: update requests doc link [software/spicerack] - 10https://gerrit.wikimedia.org/r/543963 [20:49:09] (03PS1) 10Volans: doc: update requests doc link [software/cumin] - 10https://gerrit.wikimedia.org/r/543964 [20:51:08] (03PS1) 10Volans: setup.py: remove unused test dependency [software/homer] - 10https://gerrit.wikimedia.org/r/543965 [20:51:54] (03CR) 10jerkins-bot: [V: 04-1] doc: update requests doc link [software/spicerack] - 10https://gerrit.wikimedia.org/r/543963 (owner: 10Volans) [20:55:16] (03PS2) 10Volans: doc: update requests doc link [software/spicerack] - 10https://gerrit.wikimedia.org/r/543963 
[20:55:18] (03PS1) 10Volans: dns: remove unused type ignore comment [software/spicerack] - 10https://gerrit.wikimedia.org/r/543975 [20:56:07] (03CR) 10Volans: [C: 03+2] doc: update requests doc link [software/cumin] - 10https://gerrit.wikimedia.org/r/543964 (owner: 10Volans) [20:59:57] (03CR) 10jerkins-bot: [V: 04-1] dns: remove unused type ignore comment [software/spicerack] - 10https://gerrit.wikimedia.org/r/543975 (owner: 10Volans) [21:00:41] (03CR) 10Krinkle: [C: 03+1] "LGTM. At current rate I expect the disk space alert to fire again in maybe 2-4 days, T235425. This should hopefully prevent that by not ha" [puppet] - 10https://gerrit.wikimedia.org/r/543931 (https://phabricator.wikimedia.org/T235455) (owner: 10Aaron Schulz) [21:02:34] (03CR) 10Volans: [V: 03+2 C: 03+2] "The sphinx failures are fixed in the next CR" [software/spicerack] - 10https://gerrit.wikimedia.org/r/543975 (owner: 10Volans) [21:02:50] (03CR) 10Volans: [C: 03+2] doc: update requests doc link [software/spicerack] - 10https://gerrit.wikimedia.org/r/543963 (owner: 10Volans) [21:03:21] (03Merged) 10jenkins-bot: doc: update requests doc link [software/cumin] - 10https://gerrit.wikimedia.org/r/543964 (owner: 10Volans) [21:04:28] (03PS1) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) [21:09:19] (03Merged) 10jenkins-bot: doc: update requests doc link [software/spicerack] - 10https://gerrit.wikimedia.org/r/543963 (owner: 10Volans) [21:10:58] (03PS2) 10Dzahn: site: add requesttracker role on moscovium on buster [puppet] - 10https://gerrit.wikimedia.org/r/543954 (https://phabricator.wikimedia.org/T180641) [21:11:30] (03CR) 10Jhedden: [C: 03+1] cloud: Replace SSHSessions diamond collector with prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543268 (https://phabricator.wikimedia.org/T210993) (owner: 10BryanDavis) [21:11:34] (03CR) 10Dzahn: [C: 03+2] site: add requesttracker 
role on moscovium on buster [puppet] - 10https://gerrit.wikimedia.org/r/543954 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [21:12:18] (03PS1) 10CDanis: trafficserver: actually map grafana-beta to grafana1002 [puppet] - 10https://gerrit.wikimedia.org/r/544023 (https://phabricator.wikimedia.org/T220838) [21:14:44] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18924/cp1075.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/544023 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [21:29:22] (03PS1) 10CDanis: trafficserver: add never-cache for grafana1002 [puppet] - 10https://gerrit.wikimedia.org/r/544029 (https://phabricator.wikimedia.org/T220838) [21:31:06] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18925/cp1075.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/544029 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [21:32:24] (03PS1) 10Dzahn: requesttracker: use perl2 Apache module if on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 [21:33:00] (03CR) 10jerkins-bot: [V: 04-1] requesttracker: use perl2 Apache module if on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (owner: 10Dzahn) [21:36:31] (03CR) 10Cwhite: [C: 03+1] "Looks good to me assuming there are no concerns with additional memory usage." 
[puppet] - 10https://gerrit.wikimedia.org/r/543873 (https://phabricator.wikimedia.org/T215904) (owner: 10Filippo Giunchedi) [21:36:40] (03PS2) 10Dzahn: requesttracker: use perl2 Apache module if on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 [21:37:14] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@d663006]: Update mobileapps to f345673 [21:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:33] (03CR) 10Cwhite: [C: 03+1] profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [21:42:51] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@d663006]: Update mobileapps to f345673 (duration: 05m 38s) [21:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:28] (03PS2) 10DannyS712: Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) [21:48:02] (03CR) 10jerkins-bot: [V: 04-1] Partial cleanup of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [21:52:44] (03CR) 10DannyS712: "The failure is that "[FATAL tini (7)] exec /srv/composer/composer-install-dev-only failed: Permission denied" - doesn't appear to be relat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [21:52:52] (03PS3) 10Dzahn: requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) [21:53:27] (03CR) 10jerkins-bot: [V: 04-1] requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [21:55:52] (03CR) 10DannyS712: "Recheck" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/544001 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [21:56:27] (03PS4) 10Dzahn: requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) [22:06:49] (03PS1) 10Thcipriani: beta: keyholder: don't require encrypted keys [puppet] - 10https://gerrit.wikimedia.org/r/544064 (https://phabricator.wikimedia.org/T235674) [22:09:14] (03PS5) 10Dzahn: requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) [22:09:48] (03CR) 10jerkins-bot: [V: 04-1] requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [22:12:29] (03PS6) 10Dzahn: requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) [22:14:46] (03CR) 10jerkins-bot: [V: 04-1] requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [22:15:58] (03PS7) 10Dzahn: requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) [22:16:33] (03CR) 10jerkins-bot: [V: 04-1] requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [22:19:42] (03PS8) 10Dzahn: requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 (https://phabricator.wikimedia.org/T180641) [22:22:29] (03CR) 10Dzahn: [C: 03+2] requesttracker: install libapache2-mod-perl2 on buster [puppet] - 10https://gerrit.wikimedia.org/r/544031 
(https://phabricator.wikimedia.org/T180641) (owner: Dzahn) [22:24:55] (CR) Catrope: [WIP] Config changes for Echo kask migration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope) [22:55:02] (PS1) Dzahn: requesttracker: use mod_scgi instead of mod_fastcgi on buster [puppet] - https://gerrit.wikimedia.org/r/544069 (https://phabricator.wikimedia.org/T180641) [22:57:18] (CR) Dzahn: [C: +2] requesttracker: use mod_scgi instead of mod_fastcgi on buster [puppet] - https://gerrit.wikimedia.org/r/544069 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn) [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191017T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:09:46] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 36.9 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:10:02] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 57.12 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:11:36] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.53 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:14:28] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 109.8
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:18:42] (PS1) Papaul: DNS: Remove mgmt DNS for labsdb100[4-5] [dns] - https://gerrit.wikimedia.org/r/544074 [23:20:50] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for labsdb100[4-5] [dns] - https://gerrit.wikimedia.org/r/544074 (owner: Papaul) [23:30:28] (PS1) Papaul: DNS: Remove mgmt DNS for labcontrol100[1-2] [dns] - https://gerrit.wikimedia.org/r/544075 [23:32:11] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for labcontrol100[1-2] [dns] - https://gerrit.wikimedia.org/r/544075 (owner: Papaul) [23:58:24] (CR) Alex Monk: [C: -1] "There is no need for the keys we use to be unencrypted." [puppet] - https://gerrit.wikimedia.org/r/544064 (https://phabricator.wikimedia.org/T235674) (owner: Thcipriani) [23:59:31] (PS2) Dzahn: add discovery name for RT, point to moscovium [dns] - https://gerrit.wikimedia.org/r/534129 (https://phabricator.wikimedia.org/T180641)