[00:02:00] (PS3) Bstorm: newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202)
[00:30:01] (PS1) Alex Monk: Copy cloud-puppetmaster hiera to new puppetmasters [puppet] - https://gerrit.wikimedia.org/r/547691 (https://phabricator.wikimedia.org/T235218)
[00:45:53] (PS2) Dzahn: scapify design/style-guide microsite (2 of 2) [puppet] - https://gerrit.wikimedia.org/r/546254 (owner: 20after4)
[00:47:06] (CR) Dzahn: [C: +2] scapify design/style-guide microsite (2 of 2) [puppet] - https://gerrit.wikimedia.org/r/546254 (owner: 20after4)
[00:50:34] (CR) Dzahn: "Error: Could not remove existing file" [puppet] - https://gerrit.wikimedia.org/r/546254 (owner: 20after4)
[00:53:39] (CR) Dzahn: "manually moved the previous dir to /root/backup on both servers" [puppet] - https://gerrit.wikimedia.org/r/546254 (owner: 20after4)
[01:04:07] Backporting a quick fix to prod.
[01:04:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:05:25] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:09:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:10:17] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:11:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:11:51] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:21:31] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.4/resources/src/mediawiki.widgets/mw.widgets.UsersMultiselectWidget.js: T236460 mw.widgets.UsersMultiselectWidget: Fix property name (duration: 00m 54s) [01:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:38] T236460: Regression: Changes to email blacklist or muted users do not activate Save button in Preferences - https://phabricator.wikimedia.org/T236460 [01:22:29] (03PS2) 10Ammarpad: Add localized Minerva wordmark for Sindhi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547061 (https://phabricator.wikimedia.org/T200870) [01:27:53] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [01:38:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:39:03] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [01:39:13] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:43:53] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:43:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:58:48] !log volker-e@deploy1001 Started deploy [design/style-guide@4d8d085]: deploying design/style-guide with mobile layout improvements [01:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:53] !log volker-e@deploy1001 Finished deploy [design/style-guide@4d8d085]: deploying design/style-guide with mobile layout improvements (duration: 00m 05s) [01:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:57] (03PS1) 10BryanDavis: toolforge: Update toolviews.py for ldap3 v2.4.1 [puppet] - 10https://gerrit.wikimedia.org/r/547700 (https://phabricator.wikimedia.org/T214541) [03:23:28] (03CR) 10BryanDavis: "Tested by copying toolviews.py to tools-proxy-05 and running manually" [puppet] - 10https://gerrit.wikimedia.org/r/547700 (https://phabricator.wikimedia.org/T214541) (owner: 10BryanDavis) [03:28:39] (03PS2) 10Catrope: GrowthExperiments (beta-only): make GE 
use local search on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547648 (https://phabricator.wikimedia.org/T236823) (owner: 10Gergő Tisza) [03:28:44] (03CR) 10Catrope: [C: 03+2] GrowthExperiments (beta-only): make GE use local search on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547648 (https://phabricator.wikimedia.org/T236823) (owner: 10Gergő Tisza) [03:30:28] (03Merged) 10jenkins-bot: GrowthExperiments (beta-only): make GE use local search on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547648 (https://phabricator.wikimedia.org/T236823) (owner: 10Gergő Tisza) [03:40:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:05] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:50] (03Abandoned) 10Tim Starling: Do a 301 redirect for wiki requests to URLs starting with /? [puppet] - 10https://gerrit.wikimedia.org/r/411522 (owner: 10Tim Starling) [05:55:09] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:55:15] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:06:31] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:19] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:21:53] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27218 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [06:39:25] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [07:14:31] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27690 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [07:30:25] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [08:17:18] !log installing libarchive security updates [08:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:19] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547144 
(https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [08:45:37] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 25967 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [08:57:11] !log installing ruby-loofah security updates [08:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:22] (03CR) 10Gehel: Introduce Elastic 7 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [09:03:11] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [09:19:00] !log installing golang-1.11 security updates [09:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:35] !log depool mw1317 [09:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:08] !log installing file security updates on jessie [09:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:37] (03PS1) 10Effie Mouzeli: mediawiki: remove hhvm from stop_cronjobs() [software/spicerack] - 10https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) [09:42:48] (03PS1) 10Awight: Install a cron job to produce Reference Previews metrics [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) [09:43:05] (03CR) 10Volans: "Nit inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [09:44:18] (03PS1) 10Ema: vcl: focus on frontends for X-Cache-Status miss and pass [puppet] - 
10https://gerrit.wikimedia.org/r/547716 [09:44:40] (03CR) 10jerkins-bot: [V: 04-1] Install a cron job to produce Reference Previews metrics [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [09:45:53] (03CR) 10Awight: "what. > 10:44:36 mv: cannot stat '/srv/workspace/puppet/.tox/log/*': No such file or directory" [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [09:45:57] (03CR) 10Awight: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) (owner: 10Awight) [09:50:02] (03PS2) 10Awight: Install a cron job to produce Reference Previews metrics [puppet] - 10https://gerrit.wikimedia.org/r/547715 (https://phabricator.wikimedia.org/T233108) [09:55:51] re Mastodon support for logs: https://gerrit.wikimedia.org/r/c/labs/tools/stashbot/+/547717 is my first attempt [09:56:00] bd808, ^ [10:00:18] (03CR) 10ArielGlenn: "This looks good to me. You'll (probably) need to manually clean up labtestwiki.dblist after these changes are merged. We can get a dba's o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [10:15:25] (03PS1) 10Brian Wolff: Set CSP on doc.wikimedia.org to enforce. 
[puppet] - 10https://gerrit.wikimedia.org/r/547718 (https://phabricator.wikimedia.org/T213223) [10:15:44] (03PS2) 10Ema: vcl: focus on frontends for X-Cache-Status miss and pass [puppet] - 10https://gerrit.wikimedia.org/r/547716 [10:23:53] (03PS3) 10Ema: vcl: focus on frontends for X-Cache-Status miss and pass [puppet] - 10https://gerrit.wikimedia.org/r/547716 [10:24:31] (03CR) 10Effie Mouzeli: mediawiki: remove hhvm from stop_cronjobs() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [10:25:20] (03CR) 10Ema: [C: 03+2] vcl: focus on frontends for X-Cache-Status miss and pass [puppet] - 10https://gerrit.wikimedia.org/r/547716 (owner: 10Ema) [10:31:43] (03PS5) 10Effie Mouzeli: prometheus: remove hhvm stats gathering and stop exporters [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) [10:33:08] !log Disable puppet on mediawiki and prometheus servers to remove hhvm exporters - T229792 [10:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:15] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [10:37:13] !log installing clamav security updates on mendelevium [10:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:20] (03CR) 10Effie Mouzeli: prometheus: remove hhvm stats gathering and stop exporters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [10:41:27] (03PS6) 10Effie Mouzeli: prometheus: remove hhvm stats gathering and stop exporters [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) [10:45:23] (03CR) 10Effie Mouzeli: [C: 03+2] prometheus: remove hhvm stats gathering and stop exporters [puppet] - 10https://gerrit.wikimedia.org/r/547144 (https://phabricator.wikimedia.org/T229792) (owner: 
10Effie Mouzeli) [10:54:40] !log remove prometheus-hhvm-exporter package from mw* servers - T229792 [10:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:45] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [11:02:03] (03PS1) 10Effie Mouzeli: hhvm: minor fix in hhvm exporter removal [puppet] - 10https://gerrit.wikimedia.org/r/547722 [11:04:25] (03CR) 10Effie Mouzeli: "LGTM https://puppet-compiler.wmflabs.org/compiler1001/19225/" [puppet] - 10https://gerrit.wikimedia.org/r/547722 (owner: 10Effie Mouzeli) [11:04:47] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: minor fix in hhvm exporter removal [puppet] - 10https://gerrit.wikimedia.org/r/547722 (owner: 10Effie Mouzeli) [11:08:50] !log enable puppet mediawiki and prometheus servers [11:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:57] (03PS1) 10Ema: cache: reimage cp5010 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/547723 (https://phabricator.wikimedia.org/T227432) [11:15:30] (03PS5) 10Effie Mouzeli: logging: remove hhvm references [puppet] - 10https://gerrit.wikimedia.org/r/547489 (https://phabricator.wikimedia.org/T229792) [11:17:41] gehel: FYI elastic1025 alerted a few times earlier today due to disk space issues [11:19:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547723 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:19:38] gehel: see https://grafana.wikimedia.org/d/000000377/host-overview?var-server=elastic1025&var-datasource=eqiad%20prometheus%2Fops&panelId=12&fullscreen&orgId=1&from=1572574575956&to=1572603627478 [11:21:03] !log depool cp5010 and reimage as text_ats T227432 [11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:10] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [11:26:56] (03CR) 10Ema: [C: 03+2] cache: reimage cp5010 as 
text_ats [puppet] - 10https://gerrit.wikimedia.org/r/547723 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [11:42:21] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:44:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor inline comment, other LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547568 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [11:47:36] akosiaris: any idea if that icinga alert thing for grafana can also be pointed at a dashboard on labs grafana? [11:50:11] I wouldn't do that [11:50:20] I don't know if it's possible, probably yes? [11:50:48] it's just a URL that is checked, but having production alerting checking a different realms infrastructure is a recipe for pain [11:56:55] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [11:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:01] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:46] (03PS1) 10Ladsgroup: labs: Make wmgWikibaseClientRepositories override all production values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547725 (https://phabricator.wikimedia.org/T235970) [12:00:55] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547725 (https://phabricator.wikimedia.org/T235970) (owner: 10Ladsgroup) [12:01:30] (03Merged) 10jenkins-bot: labs: Make wmgWikibaseClientRepositories override all production values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547725 (https://phabricator.wikimedia.org/T235970) (owner: 10Ladsgroup) [12:02:14] ^ rebased [12:04:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice start! 
I guess we should also move director.pp ($bconsolepassword) and storage.pp (same thing)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [12:08:02] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:12:33] addshore: you should talk to go.dog when he's back re: the alerting roadmap [12:17:38] (03PS7) 10Phamhi: Docker-images: create new docker images based on buster. [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) [12:18:36] !log pool cp5010 with ATS backend T227432 [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:41] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:18:57] (03CR) 10Phamhi: Docker-images: create new docker images based on buster. 
(033 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) (owner: 10Phamhi) [12:19:44] (03PS1) 10Ladsgroup: labs: More override of Wikibase client configs on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547726 (https://phabricator.wikimedia.org/T235970) [12:20:18] (03PS2) 10Ladsgroup: labs: More override of Wikibase client configs on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547726 (https://phabricator.wikimedia.org/T235970) [12:21:50] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547726 (https://phabricator.wikimedia.org/T235970) (owner: 10Ladsgroup) [12:22:33] (03Merged) 10jenkins-bot: labs: More override of Wikibase client configs on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547726 (https://phabricator.wikimedia.org/T235970) (owner: 10Ladsgroup) [12:22:51] rebased ^ [12:26:17] Ema: thanks, will have a look [12:31:33] (03PS1) 10CDanis: fastnetmon: improve recovery email [puppet] - 10https://gerrit.wikimedia.org/r/547727 [12:32:15] (03CR) 10Muehlenhoff: mediawiki: remove hhvm from stop_cronjobs() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/547714 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:46:03] (03CR) 10RLazarus: [C: 03+1] fastnetmon: improve recovery email [puppet] - 10https://gerrit.wikimedia.org/r/547727 (owner: 10CDanis) [12:56:22] !log upgrading mwdebug2002 to PHP 7.2.24 for some smoke tests with the new build [12:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:01] (03CR) 10CDanis: [C: 03+2] fastnetmon: improve recovery email [puppet] - 10https://gerrit.wikimedia.org/r/547727 (owner: 10CDanis) [13:20:27] 10Operations, 10Wikimedia-Incident: September 2019 DoS attacks [Public] - https://phabricator.wikimedia.org/T232224 (10Aklapper) If Heather says so then I guess that WMF 
Communications might publish something... However that likely will not be an "incident report" (in its technical meaning). [13:21:45] 10Operations, 10Wikimedia-Incident: September 2019 DoS attacks [Public] - https://phabricator.wikimedia.org/T232224 (10Aklapper) (Not sure why this task was moved to "Follow-up/Actionables" as I don't see any open followup tasks left here. I think this task could be closed.) [13:40:53] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:10] 10Operations: Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - https://phabricator.wikimedia.org/T220389 (10herron) [13:42:14] 10Operations, 10Analytics, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10herron) [13:46:06] 10Operations, 10Analytics, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10herron) 05Open→03Resolved a:03herron To circle back on this, we moved forward with... [13:46:09] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [13:46:16] 10Operations, 10Analytics, 10Event-Platform, 10CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. 
- https://phabricator.wikimedia.org/T217359 (10herron) [13:46:20] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [13:46:51] 10Operations, 10User-herron: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10herron) 05Open→03Resolved [13:48:25] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [13:54:32] Not sure if this is the place to ask, but will there still be SWAT deploys during the November deploy break? We're trying to plan rollout of something that will require a config change. [13:58:03] bpirkle, I don't see why not, but you could check with t [13:58:14] er, thcipriani I mean [13:58:33] liw: thank you [14:00:05] 10Operations: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10herron) The document https://docs.google.com/document/d/1mr217D6eyoGvGUG31M-FVMve9MCOCRKZxUWFZOZirQw/edit#heading=h.mbousz3hsm22 was created & shared via the g... [14:00:17] 10Operations: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10herron) 05Open→03Stalled [14:00:19] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [14:01:27] 10Operations: Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) 
- https://phabricator.wikimedia.org/T220391 (10herron) [14:01:30] 10Operations: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10herron) [14:02:12] 10Operations: Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) - https://phabricator.wikimedia.org/T220391 (10herron) Essentially a duplicate of T220390 where audit work has gone into documenting clusters and use cases [14:02:35] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:02:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:49] (03PS1) 10Ema: cache: reimage cp5011 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/547731 (https://phabricator.wikimedia.org/T227432) [14:02:53] !log rebooting kafka-main1004 for microcode tests [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:28] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [14:05:02] !log depool cp5011 and reimage as text_ats T227432 [14:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:05:51] (03CR) 10Ema: [C: 03+2] cache: reimage cp5011 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/547731 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:06:21] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:21] 10Operations, 10Traffic, 
10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5011.eqsin.wmnet'] ` The log can be found in `/var/log/wm... [14:26:36] (03PS2) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [14:26:44] (03PS3) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [14:27:52] (03PS1) 10Muehlenhoff: Enable idp2001 as second identity provider [puppet] - 10https://gerrit.wikimedia.org/r/547735 [14:34:09] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [14:34:09] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:11] (03PS1) 10Ema: 8.0.5-1wm10: fix #4635 with upstream patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/547736 [14:37:49] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10faidon) a:05faidon→03RobH Ping :) [14:39:49] 10Operations, 10hardware-requests: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10CDanis) [14:49:04] PROBLEM - Host cp5011 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:44] cp5011 is me ^ [14:51:01] (03PS1) 10Muehlenhoff: Enable new adduser base class on spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/547738 (https://phabricator.wikimedia.org/T235162) [14:51:49] James_F: release tagger bot seems to be down [14:51:49] 10Operations, 10ops-esams, 10DC-Ops: Update DNS/NTP servers on the esams PDUs/SCS - https://phabricator.wikimedia.org/T237011 (10faidon) 05Open→03Resolved Anycasting NTP sounds a good idea in general, but a) should be 
kept in a separate task b) it doesn't sound like a priority IMHO at this time. Things w... [14:53:34] 10Operations, 10ops-esams, 10netops: setup new MX204 in knams - https://phabricator.wikimedia.org/T237030 (10faidon) [14:53:36] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10faidon) [14:54:59] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/19226/" [puppet] - 10https://gerrit.wikimedia.org/r/547738 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [14:58:22] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10faidon) [14:59:23] 10Operations, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10faidon) [14:59:25] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10faidon) [15:00:24] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5011.eqsin.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019110... [15:04:28] RECOVERY - Host cp5011 is UP: PING OK - Packet loss = 0%, RTA = 235.88 ms [15:06:02] (03CR) 10Faidon Liambotis: Add term vmhost to cr loopback4 filter (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/547647 (https://phabricator.wikimedia.org/T236598) (owner: 10Ayounsi) [15:07:16] (03PS1) 10Mholloway: MachineVision: Use an HTTP proxy in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547741 (https://phabricator.wikimedia.org/T236797) [15:09:50] (03CR) 10Ema: "Tested on traffic-cache-atstext.traffic.eqiad.wmflabs, the patch does seem to work as advertised. 
Specifically, I've tried changing loggin" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/547736 (owner: 10Ema) [15:10:50] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:11:05] !log installing python-ecdsa security updates [15:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) >Hmm maybe it could make sense to store some TLS data like the ciphersuite, version or elliptic curve as integers assuming th... [15:25:05] !log installing libpcap security updates [15:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:23] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [15:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:27] (03Abandoned) 10Herron: Add clamav to lists for malware scanning [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: 10Herron) [15:30:27] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:36] (03PS1) 10Muehlenhoff: Add library hint for libpcap [puppet] - 10https://gerrit.wikimedia.org/r/547744 [15:32:04] (03PS4) 10Andrew Bogott: db::views: Bring back abuse_filter_history table [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [15:32:52] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) @Nuria - what you're asking for is something like a combined TLS field with 
separators? e.g. we contruct a 4-part semicolon-... [15:36:53] (03CR) 10Andrew Bogott: [C: 03+2] db::views: Bring back abuse_filter_history table [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [15:39:05] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:39:17] !log installing libonig security updates [15:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:07] (03CR) 10BryanDavis: "Minor comments inline. There is one slightly misleading comment block in the php 7.3 stuff, but looking good otherwise. I have not tried t" (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) (owner: 10Phamhi) [15:43:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) @BBlack Varnish would send items to varnishkafka similar to how it is done for x-nalytics: https://github.com/wikimedia/puppe... [15:44:24] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10Papaul) Power cords label information for servers in rack OE14 Server ps1 ps2 lvs3005 20080 20088 ganeti3001 20081 20089 dns3001 20082 20090 cp3050 20083 20091 c... 
[15:47:53] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10Papaul) [15:55:18] (03PS1) 10Muehlenhoff: Add Icinga check for correct application of microcode mitigations [puppet] - 10https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) [15:56:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10BBlack) We probably don't need to send the reused value (it's not that useful for analysis at this level, IMHO), and we don't need t... [15:57:48] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5011.eqsin.wmnet'] ` and were **ALL** successful. [15:58:52] (03PS9) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) [15:59:09] (03CR) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. (036 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [16:08:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) >Also I'm assuming from how X-Analytics is set up that the format is k1=v1;k2=v2;.... (equal sign rather than colon). yes, co... 
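Note on the X-Analytics format discussed above: the thread settles on `k1=v1;k2=v2;...` (equal sign rather than colon). A minimal Python sketch of parsing such a string follows. The function name and the TLS key names in the usage note are hypothetical and are not taken from the task.

```python
def parse_kv_header(value):
    """Parse a 'k1=v1;k2=v2' string into a dict.

    Empty segments (for example after a trailing ';') and segments
    without an '=' are skipped rather than raising an error.
    """
    pairs = {}
    for segment in value.split(";"):
        segment = segment.strip()
        if not segment or "=" not in segment:
            continue
        # partition splits on the first '=' only, so values may contain '='
        key, _, val = segment.partition("=")
        pairs[key.strip()] = val.strip()
    return pairs
```

Usage, with made-up key names: `parse_kv_header("vers=TLSv1.3;ciph=AES-256-GCM")` gives a dict with the two keys `vers` and `ciph`.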
[16:09:57] PROBLEM - eventlogging Varnishkafka log producer on cp5011 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:10:27] PROBLEM - statsv Varnishkafka log producer on cp5011 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:10:54] cp5011 is still me ^ [16:11:09] PROBLEM - Webrequests Varnishkafka log producer on cp5011 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:11:49] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (10Nuria) [16:12:47] RECOVERY - Webrequests Varnishkafka log producer on cp5011 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:13:13] RECOVERY - eventlogging Varnishkafka log producer on cp5011 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:13:41] RECOVERY - statsv Varnishkafka log producer on cp5011 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [16:14:47] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) A lot of back and forth with Juniper, current status is: test_consistency fails 70 test_missing_device_from_installed_base 4 test_missing_inventory_from_installed_base 183 Email... 
[16:15:15] 10Operations, 10Security-Team: Offboard Raz Shuty from Wikimedia Tools, Etc. - https://phabricator.wikimedia.org/T237118 (10sbassett) [16:15:40] 10Operations, 10Security-Team: Offboard Raz Shuty from Wikimedia Tools, Etc. - https://phabricator.wikimedia.org/T237118 (10sbassett) p:05Triage→03Normal [16:16:10] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (10EBernhardson) All we can really do is wait for the servers that are replacing t... [16:16:12] !log asw2-a-eqiad# run request system license add terminal [16:16:13] 10Operations, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10sbassett) [16:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:07] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) [16:23:18] (03CR) 10Phamhi: [C: 03+2] toolforge: Update toolviews.py for ldap3 v2.4.1 [puppet] - 10https://gerrit.wikimedia.org/r/547700 (https://phabricator.wikimedia.org/T214541) (owner: 10BryanDavis) [16:23:53] (03CR) 10Jhedden: [C: 03+1] toolforge: Update toolviews.py for ldap3 v2.4.1 [puppet] - 10https://gerrit.wikimedia.org/r/547700 (https://phabricator.wikimedia.org/T214541) (owner: 10BryanDavis) [16:28:53] (03PS8) 10Phamhi: Docker-images: create new docker images based on buster. [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) [16:30:06] (03CR) 10Phamhi: Docker-images: create new docker images based on buster. 
(032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) (owner: 10Phamhi) [16:30:22] PROBLEM - traffic_server backend process restarted on cp5011 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5011&var-layer=backend [16:31:40] RECOVERY - traffic_server backend process restarted on cp5011 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5011&var-layer=backend [16:32:49] (03PS2) 10Ayounsi: Add term vmhost to cr loopback4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/547647 (https://phabricator.wikimedia.org/T236598) [16:33:44] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10Papaul) [16:33:59] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add term vmhost to cr loopback4 filter (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/547647 (https://phabricator.wikimedia.org/T236598) (owner: 10Ayounsi) [16:34:07] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10Papaul) [16:34:24] 10Operations, 10ops-esams, 10DC-Ops: Add missing labels for equipment and cables - https://phabricator.wikimedia.org/T237009 (10Papaul) [16:37:25] !log pool cp5011 with ATS backend T227432 [16:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:30] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:37:44] (03PS2) 10Ayounsi: Add BGP_from_core_LVS policy [homer/public] - 10https://gerrit.wikimedia.org/r/547678 (https://phabricator.wikimedia.org/T167841) [16:37:57] 
(03PS3) 10Ayounsi: Add BGP_from_LVS policy [homer/public] - 10https://gerrit.wikimedia.org/r/547678 (https://phabricator.wikimedia.org/T167841) [16:38:13] 10Operations, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10sbassett) [16:38:34] (03PS1) 10Herron: netops: add host monitoring for scs systems (serial console servers) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) [16:38:36] (03PS1) 10Herron: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/547753 [16:38:38] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add BGP_from_LVS policy [homer/public] - 10https://gerrit.wikimedia.org/r/547678 (https://phabricator.wikimedia.org/T167841) (owner: 10Ayounsi) [16:39:15] (03Abandoned) 10Herron: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - 10https://gerrit.wikimedia.org/r/547753 (owner: 10Herron) [16:39:35] !log push Add BGP_from_LVS policy and term vmhost to loopback4 filter to CRs [16:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:37] (03CR) 10Herron: netops: add host monitoring for scs systems (serial console servers) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [16:49:27] (03CR) 10Ayounsi: "Overall LGTM, their parent should be set to their local mr1-site routers." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [16:57:24] (03PS2) 10Herron: netops: add host monitoring for scs systems (serial console servers) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) [16:58:43] (03CR) 10Herron: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [16:58:57] (03CR) 10BryanDavis: [C: 03+1] "Untested, but reading review looks good" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) (owner: 10Phamhi) [17:21:42] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (10Dzahn) We can just set a really long downtime on them if we don't want to see t... [17:24:05] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10Dzahn) [17:28:11] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Dzahn) 05Resolved→03Open Access works but a fix is still needed in the puppet repo. [17:28:22] 10Operations, 10LDAP-Access-Requests: Add bawolff to either NDA or WMF ldap group - https://phabricator.wikimedia.org/T236636 (10Dzahn) p:05High→03Normal [17:36:41] 10Operations, 10serviceops, 10Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (10RhinosF1) Is there any difference between this and T236747? [17:37:33] 10Operations, 10Wikimedia-Incident: September 2019 DoS attacks [Public] - https://phabricator.wikimedia.org/T232224 (10RhinosF1) >>! 
In T232224#5626422, @Aklapper wrote: > (Not sure why this task was moved to "Follow-up/Actionables" as I don't see any open followup tasks left here. I think this task could be c... [17:37:43] 10Operations, 10serviceops, 10Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) [17:37:46] 10Operations, 10DC-Ops, 10decommission: decommission cobalt.wikimedia.org - https://phabricator.wikimedia.org/T236747 (10Dzahn) [17:37:59] 10Operations, 10serviceops, 10Patch-For-Review: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) >>! In T236187#5627285, @RhinosF1 wrote: > Is there any difference between this and T236747? No, there isn't. Thanks for spotting the duplicate. Merged. [17:40:30] (03CR) 10MSantos: [C: 03+1] MachineVision: Use an HTTP proxy in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547741 (https://phabricator.wikimedia.org/T236797) (owner: 10Mholloway) [17:44:25] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [18:02:08] (03PS4) 10Bstorm: newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) [18:02:37] (03CR) 10jerkins-bot: [V: 04-1] newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [18:07:30] (03PS5) 10Bstorm: newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) [18:10:15] (03PS6) 10Bstorm: newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) 
[18:10:24] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10WMDE-leszek) Thanks @sbassett, I was just about to submit LDAP group removal request on WMDE behalf. Sorry for being late with, you shouldn't have relied on... [18:11:39] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10WMDE-leszek) [18:13:12] (03PS7) 10Bstorm: newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) [18:20:55] 10Operations, 10LDAP-Access-Requests, 10Security-Team: Offboard Raz Shuty from various Wikimedia systems - https://phabricator.wikimedia.org/T237118 (10sbassett) @WMDE-leszek - no problem, I was just trying to be proactive :) [18:26:04] 10Operations, 10serviceops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Dzahn) 05Open→03Stalled [18:26:07] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) [18:26:11] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 (10Dzahn) [18:27:43] (03CR) 10Volans: [C: 03+1] "Thanks for all the fixes. LGTM as a first version." 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [18:29:50] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [18:36:01] (03CR) 10Volans: [C: 03+1] "LGTM, one optional note/nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [18:36:51] (03CR) 10RLazarus: "Thanks! I don't think we've finalized a deployment plan yet -- I'll start to follow up on that Monday. I'll also leave this review open un" [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [18:42:01] (03CR) 10Muehlenhoff: Add Icinga check for correct application of microcode mitigations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [18:42:42] (03PS1) 10Ayounsi: Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 [18:44:40] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [18:45:46] (03PS2) 10Ayounsi: Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 [18:46:59] (03CR) 10Volans: [C: 03+1] Add Icinga check for correct application of microcode mitigations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547747 (https://phabricator.wikimedia.org/T235250) (owner: 10Muehlenhoff) [18:47:47] (03CR) 10jerkins-bot: [V: 04-1] Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [18:49:03] (03PS1) 10BBlack: DNS cleanup: leading whitespace on NS records [dns] - 10https://gerrit.wikimedia.org/r/547769 [18:49:05] (03PS1) 10BBlack: DNS Cleanup: Leading ws for other zone records [dns] - 10https://gerrit.wikimedia.org/r/547770 
[18:49:07] (03PS1) 10BBlack: DNS Cleanup: Leading whitespace on AAAAs [dns] - 10https://gerrit.wikimedia.org/r/547771 [18:49:09] (03PS1) 10BBlack: DNS Cleanup: single line SOA with comment split [dns] - 10https://gerrit.wikimedia.org/r/547772 [18:49:11] (03PS1) 10BBlack: zone_validator: disallow leading whitespace [dns] - 10https://gerrit.wikimedia.org/r/547773 [18:53:56] (03PS2) 10BBlack: DNS cleanup: leading whitespace on NS records [dns] - 10https://gerrit.wikimedia.org/r/547769 [18:53:58] (03PS2) 10BBlack: DNS Cleanup: Leading ws for other zone records [dns] - 10https://gerrit.wikimedia.org/r/547770 [18:54:00] (03PS2) 10BBlack: DNS Cleanup: Leading whitespace on AAAAs [dns] - 10https://gerrit.wikimedia.org/r/547771 [18:54:02] (03PS2) 10BBlack: DNS Cleanup: single line SOA with comment split [dns] - 10https://gerrit.wikimedia.org/r/547772 [18:54:04] (03PS2) 10BBlack: zone_validator: disallow leading whitespace [dns] - 10https://gerrit.wikimedia.org/r/547773 [18:55:11] (03PS1) 10Dzahn: install_server: switch gerrit2001 to buster and same partman as 1001 [puppet] - 10https://gerrit.wikimedia.org/r/547774 (https://phabricator.wikimedia.org/T176774) [18:56:59] (03PS3) 10Ayounsi: Icinga: add parents to mgmt devices [puppet] - 10https://gerrit.wikimedia.org/r/547767 [18:57:42] (03CR) 10Dzahn: [C: 03+2] install_server: switch gerrit2001 to buster and same partman as 1001 [puppet] - 10https://gerrit.wikimedia.org/r/547774 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [18:57:54] (03PS2) 10Dzahn: install_server: switch gerrit2001 to buster and same partman as 1001 [puppet] - 10https://gerrit.wikimedia.org/r/547774 (https://phabricator.wikimedia.org/T176774) [19:02:38] (03CR) 10Volans: [C: 03+1] "LGTM, glad to see this go away." 
[dns] - 10https://gerrit.wikimedia.org/r/547773 (owner: 10BBlack) [19:03:41] !log volker-e@deploy1001 Started deploy [design/style-guide@4abbc70]: Add wikimedia deployment (scap) configuration [19:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:52] !log volker-e@deploy1001 Finished deploy [design/style-guide@4abbc70]: Add wikimedia deployment (scap) configuration (duration: 00m 11s) [19:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:16] (03CR) 10Dzahn: Icinga: add parents to mgmt devices (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [19:07:38] (03CR) 10Ayounsi: Icinga: add parents to mgmt devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [19:09:44] (03CR) 10BBlack: [C: 03+2] DNS cleanup: leading whitespace on NS records [dns] - 10https://gerrit.wikimedia.org/r/547769 (owner: 10BBlack) [19:09:46] (03CR) 10BBlack: [C: 03+2] DNS Cleanup: Leading ws for other zone records [dns] - 10https://gerrit.wikimedia.org/r/547770 (owner: 10BBlack) [19:09:53] (03CR) 10BBlack: [C: 03+2] DNS Cleanup: Leading whitespace on AAAAs [dns] - 10https://gerrit.wikimedia.org/r/547771 (owner: 10BBlack) [19:09:56] (03CR) 10BBlack: [C: 03+2] DNS Cleanup: single line SOA with comment split [dns] - 10https://gerrit.wikimedia.org/r/547772 (owner: 10BBlack) [19:10:01] (03CR) 10BBlack: [C: 03+2] zone_validator: disallow leading whitespace [dns] - 10https://gerrit.wikimedia.org/r/547773 (owner: 10BBlack) [19:10:24] (03CR) 10Dzahn: Icinga: add parents to mgmt devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [19:13:44] (03CR) 10Ayounsi: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [19:15:37] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/19227/icinga1001.wikimedia.org/" 
[puppet] - 10https://gerrit.wikimedia.org/r/547767 (owner: 10Ayounsi) [19:16:03] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen due to missing git-lfs – new deployment env - https://phabricator.wikimedia.org/T235677 (10Volker_E) [19:19:19] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Dzahn) Still has alert: "Host cloudweb2001-dev is not in mediawiki-installation dsh group". That would mean it need... [19:19:41] ACKNOWLEDGEMENT - mediawiki-installation DSH group on cloudweb2001-dev is CRITICAL: Host cloudweb2001-dev is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T220426 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:23:29] (03PS1) 10Brennen Bearnes: logging: add logspam utilities [puppet] - 10https://gerrit.wikimedia.org/r/547777 [19:24:15] (03PS3) 10Ayounsi: Add the ability to ignore some or all Junos warnings [software/homer] - 10https://gerrit.wikimedia.org/r/547523 [19:25:22] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Dzahn) [19:25:27] (03CR) 10jerkins-bot: [V: 04-1] logging: add logspam utilities [puppet] - 10https://gerrit.wikimedia.org/r/547777 (owner: 10Brennen Bearnes) [19:25:41] ACKNOWLEDGEMENT - IPMI Sensor Status on analytics1062 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] daniel_zahn https://phabricator.wikimedia.org/T237133 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:26:02] (03CR) 10Ayounsi: Add the ability to ignore some or all Junos warnings (032 comments) [software/homer] - 
10https://gerrit.wikimedia.org/r/547523 (owner: 10Ayounsi) [19:35:53] (03PS1) 1020after4: Add git::lfs on design/style-guide targets [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) [19:36:39] (03CR) 1020after4: "This just installs the git-lfs debian package on targets so that scap can do git-lfs stuffs." [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) (owner: 1020after4) [19:36:45] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` gerrit20... [19:36:59] !log gerrit2001 - reinstalling with buster [19:37:04] (03PS3) 10Herron: netops: add host monitoring for scs systems (serial console servers) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) [19:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:14] (03CR) 10Herron: "> > Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [19:39:08] (03CR) 10Dzahn: [C: 04-1] "We just made the entire switch to scap to not use git-lfs. Why would we install that now that the scap setup finally works?" [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) (owner: 1020after4) [19:44:22] thcipriani i wonder should we enable jgit gc or wait till next week? [19:46:21] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) I spoke to the recycle project manager today, the pickup is now setup for November 7th between 12 pm and 3pm. I will have to pull all those servers on pallets and secured them. 
[19:49:14] paladox: probably fine to re-enable at this point; but let's wait until early next week so my Saturday-self doesn't hate my Friday-self [19:49:21] ok [19:49:28] sure :) [19:49:31] :) [19:52:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:48] (03PS1) 10Ayounsi: Rename site to metadata['site'] [homer/public] - 10https://gerrit.wikimedia.org/r/547780 [19:54:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:49] (03CR) 10Ayounsi: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/19228/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/547752 (https://phabricator.wikimedia.org/T233318) (owner: 10Herron) [19:59:58] (03CR) 10VolkerE: "@Dzahn How does scap resolve the issue with the Git repo growing?" [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) (owner: 1020after4) [20:00:07] Hey folks. Looks like ORES is getting hammered all of a sudden. What's the best way to track which IPs we're getting hammered from? [20:00:28] I'm digging in the analytics webrequest table but it doesn't appear to have recent enough data. This event started about an hour ago. [20:01:21] halfak: dumb question, what do ORES URLs look like? [20:01:31] https://ores.wikimedia.org [20:01:34] ack [20:01:46] (03CR) 10Ayounsi: "LGTM." [software/homer] - 10https://gerrit.wikimedia.org/r/547638 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:02:00] I can see some of the request in logstash/kibana, but I can't see the client IPs. [20:03:39] halfak: I'm working on it [20:03:44] Thanks for your help. 
[20:03:58] It's all hitting codfw it looks like FWIW [20:04:32] (03CR) 10Ayounsi: [C: 03+2] devices: allow to expose arbitrary metadata [software/homer] - 10https://gerrit.wikimedia.org/r/547638 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:05:33] is this hurting ores for other users, halfak? [20:05:41] Yes. [20:06:08] Not a total degradation but will be frustrating and slower than usual for anyone who hits codfw. [20:06:18] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Rename site to metadata['site'] [homer/public] - 10https://gerrit.wikimedia.org/r/547780 (owner: 10Ayounsi) [20:07:11] (03Merged) 10jenkins-bot: devices: allow to expose arbitrary metadata [software/homer] - 10https://gerrit.wikimedia.org/r/547638 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:09:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:43] halfak: okay, I have top requestors for the past few hours [20:09:56] Awesome. [20:10:23] (sampled 1/1000th, so it's hard to be certain, but a few addresses -- Amazon EC2 instances -- stand out) [20:11:15] Damn it. [20:11:26] Was hoping it was a few computers at some university. [20:12:16] Anything useful in the User-agent? [20:12:19] it is blank. [20:13:38] Gotcha. I do see a bunch of requests in kibana with "python-requests/2.4.3 CPython/" [20:14:07] yeah, there's a lower volume of those AFAICT from this sample of the data [20:14:24] Gotcha. If you multiply their request rate by 50, do they compare? [20:14:51] by 50? yes [20:15:09] are they sending more expensive queries? [20:15:55] very expensive [20:16:27] also EC2 instances [20:16:49] I thought there's a policy on blank user agents? [20:16:54] there is, unenforced [20:16:54] is https://meta.wikimedia.org/wiki/User-Agent_policy still authoritative? 
[20:17:00] haha beaten to it :) [20:17:22] it is something I would like to start enforcing, honestly [20:17:26] Me too :) [20:18:47] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) (owner: 1020after4) [20:19:54] (03PS2) 10Brennen Bearnes: logging: add logspam utilities [puppet] - 10https://gerrit.wikimedia.org/r/547777 [20:20:09] halfak: I am going to see what other kinds of traffic are coming from these instances, and then think about blocking them, if that sounds good to you [20:21:31] (03CR) 1020after4: "> I'm not sure this is needed but I might be wrong. The way scap works is that we load the thing in deployment node (deploy1001) and scap " [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) (owner: 1020after4) [20:21:48] (03CR) 10jerkins-bot: [V: 04-1] logging: add logspam utilities [puppet] - 10https://gerrit.wikimedia.org/r/547777 (owner: 10Brennen Bearnes) [20:22:35] cdanis, +1 [20:23:19] it is authoritative, but unless we seriously publicize it everywhere we can't just drop all script requests on the floor that don't comply [20:23:33] because there has not been a history of enforcement for [20:23:43] well, since whenever that page was written.
[20:23:56] except this year in a few exceptional circumstances [20:24:53] to be clear, I'm fine with publicizing it everywhere, giving a 3 month window or whatever it is, and then actually dropping all noncompliant requests, just not overnight [20:24:58] one of those instances has also scraped the recentchanges API [20:25:05] apergos: oh, yes, that's what I was thinking as well [20:25:10] but not much other traffic [20:25:30] (03CR) 10Ayounsi: [C: 04-1] Netbox: expose additional metadata (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/547639 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:25:42] possibly including a temporary enforcement window somewhere in there to get attention [20:25:45] so... is there a ticket? [20:25:56] +1 for the temp widow [20:25:59] *window [20:26:44] (03PS2) 10Ayounsi: Initial forwarding-options templating [homer/public] - 10https://gerrit.wikimedia.org/r/547586 [20:26:45] I do support enforcing it without warning in cases where it's causing problems though [20:26:59] I'm sympathetic to the "Beware of the leopard" situation, but [20:27:00] apergos: no, I've just been grumbling about wanting to do this in cases where I've done what rlazarus just said ;) [20:27:21] also the error page served in that case, when we block a particular IP for doing this, points to the policy [20:27:32] 👍 I was about to ask that [20:27:34] so as to avoid potential leopard-ing [20:28:08] where it's causing problems we block, that's never been a question [20:28:16] and notify if at all possible [20:30:45] bblack: are you around? [20:31:56] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['gerrit2001.wikimedia.org'] ` and were **ALL** successf...
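Note on the sampled-log analysis from earlier in this incident ("top requestors", "sampled 1/1000th"): a minimal Python sketch of how such a ranking could be computed. It assumes the sampled log has already been reduced to a list of client IP strings. The function name and the addresses in the usage note are illustrative only and are not the tooling or data used in the channel.

```python
from collections import Counter

# One in every 1000 requests is assumed to be present in the sample,
# matching the "sampled 1/1000th" remark in the chat.
SAMPLE_RATE = 1000


def top_requestors(sampled_ips, n=3, sample_rate=SAMPLE_RATE):
    """Return the n most frequent IPs with estimated total request counts.

    The estimate is simply the sampled count scaled by the sampling rate,
    so small counts carry a large relative error.
    """
    counts = Counter(sampled_ips)
    return [(ip, seen * sample_rate) for ip, seen in counts.most_common(n)]
```

Usage, with documentation-range addresses: `top_requestors(["192.0.2.1"] * 3 + ["198.51.100.7"], n=1)` ranks `192.0.2.1` first with an estimated 3000 requests. As noted in the chat, with this sampling it is hard to be certain about low-volume clients.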
[20:40:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) @Nuria going to get a Mac computer so I am going to need to generate a new SSH key. Deleting the previous key that w... [20:41:26] cdanis, BTW, I started notes here: https://phabricator.wikimedia.org/T237134 [20:41:55] halfak: thanks. if I need to I'll put IP addresses or the like in NDA'd pastes [20:42:06] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) [20:42:19] halfak: it will be easy to block the python-requests/ U-As [20:42:21] I'll do that now [20:43:09] Oh yeah. Good idea. +1 [20:46:39] !log add to bot_blocked_nets the IPs of several EC2 instances sending expensive requests to ORES T237134 [20:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:45] T237134: Heavy traffic on ORES CODFW (2019-11-01 @ 1900 UTC) - https://phabricator.wikimedia.org/T237134 [20:53:16] cdanis: ? [20:53:52] I see [20:53:56] bblack: was hoping you could validate an assumption of mine. when the logs on weblog1001 say that a request had a UA of '-', that means "the header was empty or not set", right? [20:54:28] cdanis: I would assume so, but I'm not 100% sure [20:54:37] if there are no empty ones, then yes [20:54:44] second question is, how does one express that in VCL? does req.http.User-Agent == "" evaluate to true in either case? [20:54:46] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:03] !req.http.User-Agent [20:55:10] thanks! [20:56:02] (03CR) 10Jhedden: [C: 03+2] deployment-prep: Fix deploy-service access [puppet] - 10https://gerrit.wikimedia.org/r/545067 (https://phabricator.wikimedia.org/T236103) (owner: 10Alex Monk) [20:56:23] (03CR) 10Jhedden: [C: 03+2] deployment-prep: Fix ATS upload domain handling [puppet] - 10https://gerrit.wikimedia.org/r/546332 (owner: 10Alex Monk) [20:56:50] (03PS3) 10Jhedden: deployment-prep: Fix deploy-service access [puppet] - 10https://gerrit.wikimedia.org/r/545067 (https://phabricator.wikimedia.org/T236103) (owner: 10Alex Monk) [20:57:09] (03PS1) 10CDanis: bot_blocked_nets: also block blank/unset UA [puppet] - 10https://gerrit.wikimedia.org/r/547792 [20:57:33] bblack: ^ ptal :) [20:58:02] (03PS2) 10CDanis: bot_blocked_nets: also block blank/unset UA [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) [21:00:13] (03PS2) 10Jhedden: deployment-prep: Fix ATS upload domain handling [puppet] - 10https://gerrit.wikimedia.org/r/546332 (owner: 10Alex Monk) [21:00:26] halfak: looks like ORES load is back to normal? [21:00:42] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:00:56] halfak: lower than normal, even? 
[21:06:10] !log scp /usr/share/java/mysql-connector-java.jar from gerrit1001 to gerrit2001 (T176774) [21:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:16] T176774: Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 [21:06:51] (03CR) 10Jhedden: [C: 03+2] Remove old absented star.tools.wmflabs.org cert [puppet] - 10https://gerrit.wikimedia.org/r/547357 (https://phabricator.wikimedia.org/T236962) (owner: 10Alex Monk) [21:07:02] (03PS2) 10Jhedden: Remove old absented star.tools.wmflabs.org cert [puppet] - 10https://gerrit.wikimedia.org/r/547357 (https://phabricator.wikimedia.org/T236962) (owner: 10Alex Monk) [21:07:34] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:46] halfak: ping :) [21:08:56] Looks great now [21:09:30] halfak: it is lower than before, which makes me mildly worried that this was previously a 'good' user who just had a change in their traffic pattern [21:09:49] but, eh. 
[21:11:18] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:14:14] (03PS1) 10Dzahn: gerrit: update ssh host key of gerrit2001 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/547796 (https://phabricator.wikimedia.org/T176774) [21:15:00] (03CR) 10Dzahn: [C: 03+2] gerrit: update ssh host key of gerrit2001 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/547796 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [21:15:29] 10Operations, 10Wikimedia-Mailing-lists, 10Wiktionary-fr: Add a mailing list for fr.wiktionary - https://phabricator.wikimedia.org/T26851 (10Aklapper) [21:15:40] 10Operations, 10Wikimedia-Site-requests, 10Wiktionary-fr, 10Wikimedia-maintenance-script-run: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112 (10Aklapper) [21:20:35] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js: T237126 Fixing DOM in upload interface of UploadWizard (duration: 00m 56s) [21:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:40] T237126: UploadWizard Upload interface messed up - https://phabricator.wikimedia.org/T237126 [21:20:46] (03CR) 10Paladox: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/547796 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [21:26:45] (03PS1) 10CDanis: cumin: aliases: cache::text_ats is a thing now [puppet] - 10https://gerrit.wikimedia.org/r/547800 (https://phabricator.wikimedia.org/T227432) [21:28:58] (03PS2) 10Andrew Bogott: Copy cloud-puppetmaster hiera to new puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/547691 (https://phabricator.wikimedia.org/T235218) (owner: 10Alex Monk) [21:29:47] (03CR) 10Andrew Bogott: [C: 03+2] Copy cloud-puppetmaster hiera to new puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/547691 (https://phabricator.wikimedia.org/T235218) (owner: 10Alex Monk) [21:35:49] (03CR) 10Phamhi: [C: 03+2] Docker-images: create new docker images based on buster. [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) (owner: 10Phamhi) [21:36:21] (03CR) 10Phamhi: [V: 03+2 C: 03+2] Docker-images: create new docker images based on buster. 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 (https://phabricator.wikimedia.org/T230961) (owner: 10Phamhi) [21:48:23] (03PS1) 10Alex Monk: cloud: encapi stuff for new puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/547807 [21:48:47] (03PS2) 10Alex Monk: cloud: encapi stuff for new puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/547807 (https://phabricator.wikimedia.org/T235218) [21:49:34] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 4 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10WDoranWMF) [21:49:35] (03CR) 10Alex Monk: [C: 04-1] "oops, hang on" [puppet] - 10https://gerrit.wikimedia.org/r/547807 (https://phabricator.wikimedia.org/T235218) (owner: 10Alex Monk) [21:49:58] 10Operations, 10serviceops, 10CPT Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10WDoranWMF) [21:50:15] (03PS3) 10Alex Monk: cloud: encapi stuff for new puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/547807 (https://phabricator.wikimedia.org/T235218) [21:50:37] (03CR) 10VolkerE: "Just to clear some possible confusion, my latest comment should have mentioned, that I or any Design Style Guide deployer need LFS on the " [puppet] - 10https://gerrit.wikimedia.org/r/547778 (https://phabricator.wikimedia.org/T235013) (owner: 1020after4) [21:53:08] (03CR) 10Jhedden: [C: 03+2] cloud: encapi stuff for new puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/547807 (https://phabricator.wikimedia.org/T235218) (owner: 10Alex Monk) [21:54:10] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Core Platform Team Workboards (Clinic Duty Team): Requests to MW 404 when on HTTPS - https://phabricator.wikimedia.org/T202982 (10Pchelolo) Seems like it's been fixed, the only thing left to be done is to remove the hacky line from puppet. 
[21:57:52] (03CR) 10Volans: Netbox: expose additional metadata (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/547639 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [22:00:21] !log gerrit - repo sync between gerrit and gerrit-replica in progress .. if you can't clone from replica you can use main gerrit and replica will come back [22:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:24] cdanis, I'll be on the lookout for concerns being raised. The nice thing is that I can tell people the way to get around the block is to just set a valid user-agent. (Sorry for the delay. Was in back to back meetings and lost focus when the fire was out) [22:01:39] I appreciate you working on this. :) [22:29:18] 10Operations, 10ops-eqiad, 10decommission: Decommission ms-be1027 - https://phabricator.wikimedia.org/T233289 (10Jclark-ctr) removed all drives to be degaussed hardware failure will not boot to wipe drives [22:30:49] 10Operations, 10ops-eqiad, 10decommission: Decommission ms-be1027 - https://phabricator.wikimedia.org/T233289 (10Jclark-ctr) [22:32:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (10Jclark-ctr) [22:33:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Jclark-ctr) [22:34:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1071.eqiad.wmnet - https://phabricator.wikimedia.org/T229381 (10Jclark-ctr) wiped drives added to google decom sheet [22:34:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1065 - https://phabricator.wikimedia.org/T227560 (10Jclark-ctr) [22:35:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1068 - https://phabricator.wikimedia.org/T226689 (10Jclark-ctr) [22:35:51] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost 
one of its power supplies - https://phabricator.wikimedia.org/T237133 (10wiki_willy) a:03Jclark-ctr @Jclark-ctr - looks like this one is from last Thursday's PDU upgrade. Can you check if it's maybe a loose cord? If not, we... [22:37:13] 10Operations, 10ops-eqiad, 10decommission: Decommission db1064 - https://phabricator.wikimedia.org/T223217 (10Jclark-ctr) [22:37:47] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10Patch-For-Review: Decommission dbstore1002 - https://phabricator.wikimedia.org/T216491 (10Jclark-ctr) [22:38:19] 10Operations, 10ops-eqiad, 10decommission: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Jclark-ctr) [22:38:50] 10Operations, 10ops-eqiad, 10decommission, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Jclark-ctr) [22:40:35] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission analytics1003 - https://phabricator.wikimedia.org/T206524 (10Jclark-ctr) [22:40:58] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Jclark-ctr) [22:41:33] (03PS2) 10Volans: Netbox: expose additional metadata [software/homer] - 10https://gerrit.wikimedia.org/r/547639 (https://phabricator.wikimedia.org/T228388) [22:42:22] (03CR) 10Volans: "Addressed issue" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/547639 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [22:49:26] (03PS1) 10Dzahn: gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 [22:51:25] (03CR) 10jerkins-bot: [V: 04-1] gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 (owner: 10Dzahn) [22:52:22] (03PS2) 10Dzahn: gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 [22:54:34] PROBLEM - Check systemd state on 
netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:11] (03CR) 10Paladox: [C: 03+1] gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 (owner: 10Dzahn) [22:57:08] (03PS8) 10Bstorm: newk8s: adjust things to be compatible with migration to the new cluster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) [22:57:38] (03PS3) 10Dzahn: gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 [22:57:58] (03CR) 10Bstorm: "This change is ready for review." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/547676 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [22:58:50] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:05:08] (03PS4) 10Dzahn: gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 (https://phabricator.wikimedia.org/T176774) [23:05:52] (03PS5) 10Dzahn: gerrit: fix migration class, allow syncing data to gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 (https://phabricator.wikimedia.org/T176774) [23:07:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:26] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:17:00] (03CR) 10Dzahn: [C: 03+2] gerrit: fix migration class, allow syncing data to 
gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/547818 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:25:19] (03CR) 10Dzahn: "removed the rsync from gerrit1001 (good) but did not add it on gerrit2001 (where is the issue?)" [puppet] - 10https://gerrit.wikimedia.org/r/547818 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:31:25] (03PS1) 10Dzahn: gerrit: always include migration profile [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) [23:35:38] (03PS2) 10Dzahn: gerrit: always include migration profile [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) [23:35:54] (03CR) 10jerkins-bot: [V: 04-1] gerrit: always include migration profile [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:36:43] (03PS3) 10Dzahn: gerrit: always include migration profile [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) [23:36:54] (03CR) 10Paladox: gerrit: always include migration profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:41:21] (03CR) 10Dzahn: gerrit: always include migration profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:41:31] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19231/ works now" [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:43:06] (03CR) 10Paladox: [C: 03+1] "Well done! (also inline comment can be ignored, the change i was thinking about wasn't merged)." [puppet] - 10https://gerrit.wikimedia.org/r/547827 (https://phabricator.wikimedia.org/T176774) (owner: 10Dzahn) [23:45:28] !log rsyncing gerrit git data from gerrit1001 to gerrit2001 (using --delete too!) 
T176774 [23:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:35] T176774: Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 [23:46:16] (03PS1) 10Gergő Tisza: Make beta wikis use the corresponding prod wiki for pageview info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547830 [23:47:29] (03PS2) 10Gergő Tisza: Make beta wikis use the corresponding prod wiki for pageview info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547830 [23:55:43] (03PS1) 10Catrope: GrowthExperiments: Configure intro links for suggested edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547831 (https://phabricator.wikimedia.org/T235723) [23:58:45] (03PS3) 10Catrope: Make beta wikis use the corresponding prod wiki for pageview info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547830 (owner: 10Gergő Tisza) [23:58:53] (03CR) 10Catrope: [C: 03+2] Make beta wikis use the corresponding prod wiki for pageview info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547830 (owner: 10Gergő Tisza) [23:59:40] urandom: are you around?