[00:00:04] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:02:00] (CR) Bstorm: wikireplicas: add wikireplica cookbook to add a wiki (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[00:05:55] jouncebot: now
[00:05:55] No deployments scheduled for the next 10 hour(s) and 54 minute(s)
[00:07:28] (PS1) Dave Pifke: xhgui: enable database access for admins [puppet] - https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640)
[00:15:36] !log wkandek@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[00:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:34] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:18:36] (CR) BryanDavis: [C: -1] wikireplicas: create cumin aliases for wikireplica servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[00:26:07] Operations, Machine Learning Platform, ORES, Patch-For-Review: ORES icinga alerts - https://phabricator.wikimedia.org/T260732 (Dzahn) Open→Resolved a:Dzahn {F32188041}
[00:29:22] (CR) Krinkle: [C: +2] xhgui: remove MongoDB backend [mediawiki-config] - https://gerrit.wikimedia.org/r/620142 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[00:30:07] (Merged) jenkins-bot: xhgui: remove MongoDB backend [mediawiki-config] - https://gerrit.wikimedia.org/r/620142
(https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[00:31:23] dpifke: let's test on mwdebug1001 instead, it seems 002 is receiving quite a lot of traffic
[00:31:43] OK.
[00:31:51] easier to isolate in logstash
[00:32:04] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 228.49 ms
[00:33:53] OK, it's live on 1001.
[00:36:14] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:37:49] Operations, serviceops, Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)
[00:41:04] Test profile made it through, and don't see any new logstash errors, any objection to continuing on?
[00:42:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:43:48] dpifke: nope, lgtm
[00:44:13] (PS4) Krinkle: profiler: Update XHGui SERVER/GET key filter [mediawiki-config] - https://gerrit.wikimedia.org/r/620139
[00:44:16] (CR) Krinkle: [C: +2] profiler: Update XHGui SERVER/GET key filter [mediawiki-config] - https://gerrit.wikimedia.org/r/620139 (owner: Krinkle)
[00:45:06] (Merged) jenkins-bot: profiler: Update XHGui SERVER/GET key filter [mediawiki-config] - https://gerrit.wikimedia.org/r/620139 (owner: Krinkle)
[00:45:44] RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.58 ms
[00:48:20] PROBLEM - Host mw1308 is DOWN: PING CRITICAL - Packet loss = 100%
[00:49:22] !log dpifke@deploy1001 Synchronized
wmf-config/ProductionServices.php: Disabling old XHGui backend (T180761) (duration: 05m 13s)
[00:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:25] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761
[00:50:13] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on cloudvirt1024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.43: Connection reset by peer andrew bogott work in progress https://wikitech.wikimedia.org/wiki/Microcode
[00:50:16] dpifke: wanna do the next one as well?
[00:50:42] Yup, grabbing it now.
[00:52:02] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[00:54:16] Operations, ops-eqiad, Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (Jclark-ctr) @elukey. ram has arrived how long will it take to drain host? and how long can they be down?
[00:54:42] Live on mwdebug1001.
Something's not quite right with the change to add HOSTNAME: https://performance.wikimedia.org/xhgui/run/view?id=5f3c77b95dde7fd7f5d47622
[00:55:06] dpifke: fix incoming for uname()
[00:55:07] my bad
[00:55:18] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:55:38] (PS1) Andrew Bogott: wmcs-backup-instances.py: replace a local hack that crept into a recent patch [puppet] - https://gerrit.wikimedia.org/r/621105
[00:56:17] Ack.
[00:56:25] (PS1) Krinkle: profiler: Fix bad php_uname() call [mediawiki-config] - https://gerrit.wikimedia.org/r/621106
[00:56:45] (CR) Andrew Bogott: [C: +2] wmcs-backup-instances.py: replace a local hack that crept into a recent patch [puppet] - https://gerrit.wikimedia.org/r/621105 (owner: Andrew Bogott)
[00:57:06] dpifke: It is of course beyond ridicule that 'Linux … … … … '['n'] yields "L"
[00:57:32] (int cast yada yada)
[00:58:17] (CR) Krinkle: [C: +2] profiler: Fix bad php_uname() call [mediawiki-config] - https://gerrit.wikimedia.org/r/621106 (owner: Krinkle)
[00:59:02] (Merged) jenkins-bot: profiler: Fix bad php_uname() call [mediawiki-config] - https://gerrit.wikimedia.org/r/621106 (owner: Krinkle)
[01:00:59] Fix is live, looks good now: https://performance.wikimedia.org/xhgui/run/view?id=5f3c79b4300dddb134045e92
[01:01:24] (live on mwdebug1001, that is)
[01:02:31] ack
[01:02:36] also made a comparison
[01:02:37] before - https://performance.wikimedia.org/xhgui/run/view?id=5f3c7635362770b63583fa10
[01:03:02] after - https://performance.wikimedia.org/xhgui/run/view?id=5f3c7a1e03b0dd68b71c09b6
[01:03:35] The UNIQUE_ID is no longer a lie
[01:03:39] and other changes also LGTM
[01:03:42] good to go from me
[01:03:46] PROBLEM - Rate of JVM GC Old
generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 141.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[01:04:57] Onwards it goes...
[01:05:43] !log dpifke@deploy1001 Synchronized wmf-config/profiler.php: Deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/620139 (duration: 01m 18s)
[01:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:46] RECOVERY - Host mw1308 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[01:06:43] (PS1) Andrew Bogott: cloudvirt1006.eqiad.wmnet: move to role::spare::system [puppet] - https://gerrit.wikimedia.org/r/621107 (https://phabricator.wikimedia.org/T259192)
[01:07:03] Do we have to do anything with mw1308 because it missed those deploys? The change shouldn't affect it, but do we care if it's out of sync with the others?
[01:07:32] (It timed out on the first deploy, got connection refused on the second.)
[01:08:01] (CR) Andrew Bogott: [C: +2] cloudvirt1006.eqiad.wmnet: move to role::spare::system [puppet] - https://gerrit.wikimedia.org/r/621107 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott)
[01:08:39] I can do a scap pull on it, if that's the correct fix.
[01:08:39] RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 72.00 ms
[01:10:27] dpifke: Hm.. checking SAL
[01:11:03] nothing about mw1308 since Nov 2019
[01:11:10] yeah, try ssh-ing there and scap pull
[01:11:46] this might be one of those things that mw-deploy grants, but try :)
[01:11:47] we are rebooting eqiad jobrunner servers to pick up a new kernel parameter.
mw1308 did not come back after the reboot command and I powercycled it through the mgmt interface
[01:11:48] Uptime is 6 minutes, so someone/something rebooted it while we were deploying.
[01:12:03] wkandek: ah okay
[01:12:32] scap pull worked, and I confirmed it now has the correct copies of the two files.
[01:12:46] :)
[01:13:06] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:21:58] (PS2) Dave Pifke: [WIP] profiler: remove MongoDB client [mediawiki-config] - https://gerrit.wikimedia.org/r/621095 (https://phabricator.wikimedia.org/T180761)
[01:24:40] (PS1) Catrope: Enable GrowthExperiments on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/621108 (https://phabricator.wikimedia.org/T257490)
[01:30:46] (CR) Krinkle: "Overall LGTM.
Using MongoDate is fine for now, unless upstream's version no longer uses it either in which case we can match its current l" [mediawiki-config] - https://gerrit.wikimedia.org/r/621095 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[01:34:15] (CR) Dave Pifke: "The collector just cares that we pass it an object which has ->sec and ->usec:" [mediawiki-config] - https://gerrit.wikimedia.org/r/621095 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[01:45:36] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[01:57:04] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:00:20] !log wkandek@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[02:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:25] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:11:14] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:44:11] PROBLEM
- Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,2,3} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[02:46:49] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:48:16] (PS1) Andrew Bogott: update partman recipes for cloudvirt1004 and 1006 [puppet] - https://gerrit.wikimedia.org/r/621118
[02:49:17] (CR) Andrew Bogott: [C: +2] update partman recipes for cloudvirt1004 and 1006 [puppet] - https://gerrit.wikimedia.org/r/621118 (owner: Andrew Bogott)
[02:50:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 53 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:56:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:56:41] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:02:40] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:12:35] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:24:31] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:40:14] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[03:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:50:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:51:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:04:33] (CR) Cwhite: prometheus: use aggs to consolidate mediawiki logging metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[04:10:54] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[04:12:14] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:27:26] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:12:08] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:24:04] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:31:55] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single
[05:31:57] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:36] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[05:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:27] (PS1) Volans: doc: improve logging documentation [software/spicerack] - https://gerrit.wikimedia.org/r/621128
[05:47:05] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single
[05:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:44] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:49:01] (PS2) Volans: doc: improve logging documentation [software/spicerack] - https://gerrit.wikimedia.org/r/621128
[05:53:20] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[05:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:54] RECOVERY - Stale file for node-exporter textfile in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[05:55:24] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single
[05:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:36] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:00:50] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[06:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:03:27] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single
[06:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:49] (CR) Volans: [C: +2] "doc improvements." [software/spicerack] - https://gerrit.wikimedia.org/r/621128 (owner: Volans)
[06:04:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:04:28] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[06:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:47] (Merged) jenkins-bot: doc: improve logging documentation [software/spicerack] - https://gerrit.wikimedia.org/r/621128 (owner: Volans)
[06:07:29] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[06:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:09:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:11:24] (CR) Volans: "Thanks a lot for writing a new cookbook!" (8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[06:12:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:15:36] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:35:05] (CR) Volans: "addendum" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[06:39:05] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single
[06:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:47:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:01:16] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:04:00] <_joe_> effie: I reenabled puppet on mwdebug1002
[07:04:16] <_joe_> sorry but was needed in order to fix the memleak
[07:04:32] <_joe_> I'll restore whatever you modified there later
[07:10:55] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:12] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:15:38] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[07:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:20:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:23:28] (CR) JMeybohm: [C: -1] "> Patch Set 4:" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[07:38:58] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:39:32] (CR) Filippo Giunchedi: [C: +2] profile: switch Grafana plugins to Debian package [puppet] - https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[07:39:43] (PS3) Filippo Giunchedi: profile: switch Grafana plugins to Debian package [puppet] - https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143)
[07:44:54] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:46:17] !log upgrade to grafana 7 on cloudmetrics hosts - T259143
[07:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:21] T259143: Upgrade to Grafana 7 - https://phabricator.wikimedia.org/T259143
[07:47:59] (PS1) JMeybohm: mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - https://gerrit.wikimedia.org/r/621196 (https://phabricator.wikimedia.org/T260329)
[07:49:59] (CR) Giuseppe Lavagetto: [C: +2] Switch all charts from "stable" to "wmf-stable" [deployment-charts] - https://gerrit.wikimedia.org/r/620935 (https://phabricator.wikimedia.org/T258572) (owner:
Giuseppe Lavagetto)
[07:50:13] (PS3) Giuseppe Lavagetto: Switch all charts from "stable" to "wmf-stable" [deployment-charts] - https://gerrit.wikimedia.org/r/620935 (https://phabricator.wikimedia.org/T258572)
[07:50:15] (PS3) Giuseppe Lavagetto: Test deployments with helmfile lint [deployment-charts] - https://gerrit.wikimedia.org/r/620934 (https://phabricator.wikimedia.org/T258572)
[07:50:17] (PS5) Giuseppe Lavagetto: helmfile.d: refactor eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572)
[07:50:48] (CR) JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24580/mw1381.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/621196 (https://phabricator.wikimedia.org/T260329) (owner: JMeybohm)
[07:52:05] (CR) Giuseppe Lavagetto: [C: +1] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - https://gerrit.wikimedia.org/r/621196 (https://phabricator.wikimedia.org/T260329) (owner: JMeybohm)
[07:54:06] (CR) JMeybohm: [C: +2] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - https://gerrit.wikimedia.org/r/621196 (https://phabricator.wikimedia.org/T260329) (owner: JMeybohm)
[07:54:44] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:00:42] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:02:19] (PS2) Filippo Giunchedi: hieradata: switch grafana.w.o to Grafana 7 [puppet] -
https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143)
[08:02:21] (PS1) Filippo Giunchedi: hieradata: disable panel html sanitization for grafana-labs [puppet] - https://gerrit.wikimedia.org/r/621197 (https://phabricator.wikimedia.org/T259143)
[08:02:41] (CR) jerkins-bot: [V: -1] hieradata: switch grafana.w.o to Grafana 7 [puppet] - https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[08:03:26] (CR) Filippo Giunchedi: "Sending this out for completeness, although not compulsory" [puppet] - https://gerrit.wikimedia.org/r/621197 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[08:04:59] (PS3) Filippo Giunchedi: hieradata: switch grafana.w.o to Grafana 7 [puppet] - https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143)
[08:05:01] (PS2) Filippo Giunchedi: hieradata: disable panel html sanitization for grafana-labs [puppet] - https://gerrit.wikimedia.org/r/621197 (https://phabricator.wikimedia.org/T259143)
[08:05:07] (CR) Filippo Giunchedi: hieradata: switch grafana.w.o to Grafana 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[08:06:42] !log running puppet on A:all-mw-eqiad
[08:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:58] (PS1) JMeybohm: mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - https://gerrit.wikimedia.org/r/621198 (https://phabricator.wikimedia.org/T260329)
[08:13:40] (CR) jerkins-bot: [V: -1] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - https://gerrit.wikimedia.org/r/621198 (https://phabricator.wikimedia.org/T260329) (owner: JMeybohm)
[08:14:25] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:14:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:34] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:18:39] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:09] (CR) Filippo Giunchedi: [C: +2] hieradata: switch grafana.w.o to Grafana 7 [puppet] - https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi)
[08:19:20] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:29] !log switch grafana.w.o to grafana 7 in codfw - T259143
[08:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:34] T259143: Upgrade to Grafana 7 - https://phabricator.wikimedia.org/T259143
[08:21:35] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:22:45] (PS2) JMeybohm: mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - https://gerrit.wikimedia.org/r/621198 (https://phabricator.wikimedia.org/T260329)
[08:23:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:39] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:59] (CR)
10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1002/24581/mw1381.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621198 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:28:30] (03PS1) 10Giuseppe Lavagetto: profile::service_proxy::envoy: inject XFP in all calls to mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/621199 [08:28:32] (03PS1) 10Giuseppe Lavagetto: role::ores: enable the service proxy to the MediaWiki api [puppet] - 10https://gerrit.wikimedia.org/r/621200 [08:28:58] 10Operations, 10ops-codfw: backup2001 RAID controller failure - https://phabricator.wikimedia.org/T260764 (10jcrespo) [08:33:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - 10https://gerrit.wikimedia.org/r/621198 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:33:45] (03CR) 10JMeybohm: [C: 03+2] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - 10https://gerrit.wikimedia.org/r/621198 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:33:49] 10Operations, 10ops-codfw: backup2001 RAID controller failure - https://phabricator.wikimedia.org/T260764 (10jcrespo) [08:34:17] (03PS1) 10JMeybohm: mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - 10https://gerrit.wikimedia.org/r/621201 (https://phabricator.wikimedia.org/T260329) [08:35:00] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) a:05Marostegui→03Kormat [08:35:07] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) I'll take this over from the DBA side as manuel is on vacation. 
[08:35:32] !log running puppet on A:all-mw-eqiad [08:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:01] !log update firewall policies on pfw - T260585 [08:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:29] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24582/mw1381.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621201 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:38:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [08:40:38] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10Joe) a:03Joe [08:41:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:41:04] (03CR) 10Jcrespo: [C: 04-1] "Not yet." [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [08:43:08] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [08:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:18] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [08:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:49] 10Operations, 10fundraising-tech-ops, 10netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (10ayounsi) The former, instead of (or in addition to) copying the file over, it would display the diff, and could prompt the user with a "Commit? (y/N)" It could also do a "commi... 
[08:46:21] (03CR) 10Jcrespo: [C: 04-1] "It requires more work, modifying backup director code." [puppet] - 10https://gerrit.wikimedia.org/r/621038 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [08:48:34] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:51] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [08:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:10] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] "Hugh will most likely be happy with this as well ;-) - merging." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620317 (owner: 10Addshore) [08:52:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - 10https://gerrit.wikimedia.org/r/621201 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:52:50] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:55:24] (03PS1) 10Jbond: admin: add jan-e to ldap group [puppet] - 10https://gerrit.wikimedia.org/r/621204 (https://phabricator.wikimedia.org/T260555) [08:56:18] (03CR) 10Jbond: [C: 03+2] "LDAP users also need to have an entry in admin/data/data.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/621204 (https://phabricator.wikimedia.org/T260555) (owner: 10Jbond) [09:00:15] jbond42: thanks for ^, completely forgot of course [09:00:24] no probs [09:01:41] (03CR) 10JMeybohm: [C: 03+2] mcrouter_wancache: Temporarily remove two codfw proxies for reboot [puppet] - 10https://gerrit.wikimedia.org/r/621201 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [09:04:22] 10Operations, 
10SRE-Access-Requests: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T260450 (10fgiunchedi) (Nuria is back next week FYI) [09:05:50] 10Operations, 10fundraising-tech-ops, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10fgiunchedi) p:05Triage→03Medium [09:06:00] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10fgiunchedi) p:05Triage→03Medium [09:06:08] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for applying an apache config change safely - https://phabricator.wikimedia.org/T260664 (10fgiunchedi) p:05Triage→03Medium [09:06:18] 10Operations, 10User-Kormat: generate-debdeploy-spec breaks when trying to use the transition feature - https://phabricator.wikimedia.org/T260680 (10fgiunchedi) p:05Triage→03Medium [09:06:28] 10Operations, 10DC-Ops, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10fgiunchedi) p:05Triage→03Medium [09:07:11] 10Operations, 10ops-codfw: backup2001 RAID controller failure - https://phabricator.wikimedia.org/T260764 (10fgiunchedi) p:05Triage→03Medium [09:07:20] 10Operations, 10ops-codfw: backup2001 RAID controller failure - https://phabricator.wikimedia.org/T260764 (10jcrespo) a:03Papaul It doesn't post after restart, I tried twice, it gets stuck after "initializing devices", even after a powerdown and a powerup: ` F2 = System Setup F10 = Lifecycle Controller F... 
[09:11:00] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: refactor logmsgbot to follow Icinga failover [puppet] - 10https://gerrit.wikimedia.org/r/620701 (https://phabricator.wikimedia.org/T247966) (owner: 10Filippo Giunchedi) [09:11:09] (03PS2) 10Filippo Giunchedi: profile: refactor logmsgbot to follow Icinga failover [puppet] - 10https://gerrit.wikimedia.org/r/620701 (https://phabricator.wikimedia.org/T247966) [09:12:20] (03PS2) 10Kormat: pontoon: stack-specific hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [09:12:40] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:14:35] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: ensure tmpfs cleanup [puppet] - 10https://gerrit.wikimedia.org/r/620710 (https://phabricator.wikimedia.org/T260521) (owner: 10Filippo Giunchedi) [09:18:37] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:19:08] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: make sure update-etcd-mw-config-lastindex is enabled [puppet] - 10https://gerrit.wikimedia.org/r/620929 (https://phabricator.wikimedia.org/T247966) (owner: 10Filippo Giunchedi) [09:19:10] (03PS1) 10Ayounsi: Move HE to Transit BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/621207 [09:19:17] (03PS2) 10Filippo Giunchedi: icinga: make sure update-etcd-mw-config-lastindex is enabled [puppet] - 10https://gerrit.wikimedia.org/r/620929 (https://phabricator.wikimedia.org/T247966) [09:20:40] (03CR) 10Ayounsi: [C: 03+2] Move HE to Transit BGP group [homer/public] - 
10https://gerrit.wikimedia.org/r/621207 (owner: 10Ayounsi) [09:20:50] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: stack-specific hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [09:21:05] (03Merged) 10jenkins-bot: Move HE to Transit BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/621207 (owner: 10Ayounsi) [09:22:59] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [09:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:26] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add alertmanagers variable [puppet] - 10https://gerrit.wikimedia.org/r/620922 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:26:52] (03PS3) 10Filippo Giunchedi: hieradata: add alertmanagers variable [puppet] - 10https://gerrit.wikimedia.org/r/620922 (https://phabricator.wikimedia.org/T258948) [09:26:54] (03CR) 10Jbond: [C: 03+2] netmon2001: update netmon librenms to use apereo cas [puppet] - 10https://gerrit.wikimedia.org/r/620910 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [09:34:24] (03PS2) 10Jbond: netmon - librenms: make apereo cass sso authentication the default [puppet] - 10https://gerrit.wikimedia.org/r/620911 (https://phabricator.wikimedia.org/T256958) [09:35:04] (03CR) 10Jbond: [C: 03+2] netmon - librenms: make apereo cass sso authentication the default [puppet] - 10https://gerrit.wikimedia.org/r/620911 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [09:45:48] (03PS1) 10Jbond: librenms: use the ldap uid to map users [puppet] - 10https://gerrit.wikimedia.org/r/621209 [09:50:23] (03PS3) 10Kormat: pontoon: stack-specific hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [09:54:16] (03PS2) 10Filippo Giunchedi: prometheus: add alerts to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/620956 (https://phabricator.wikimedia.org/T258948) [09:54:18] 
(03PS2) 10Filippo Giunchedi: prometheus: export icinga service problems as metrics [puppet] - 10https://gerrit.wikimedia.org/r/620957 (https://phabricator.wikimedia.org/T258948) [09:54:20] (03PS1) 10Filippo Giunchedi: alertmanager: set tab title for karma [puppet] - 10https://gerrit.wikimedia.org/r/621211 (https://phabricator.wikimedia.org/T258948) [09:54:22] (03PS1) 10Filippo Giunchedi: prometheus: drop 'replica' label in alerts [puppet] - 10https://gerrit.wikimedia.org/r/621212 (https://phabricator.wikimedia.org/T258948) [09:57:10] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: set tab title for karma [puppet] - 10https://gerrit.wikimedia.org/r/621211 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:58:49] (03PS2) 10Jbond: librenms: use the ldap uid to map users [puppet] - 10https://gerrit.wikimedia.org/r/621209 [09:59:43] jbond42: I think the sso change to librenms broke the icinga check for librenms (UNKNOWN now in icinga) [10:01:05] godog: ack thanks yes taking a look will acknowledge [10:01:21] (03CR) 10Jbond: [C: 03+2] librenms: use the ldap uid to map users [puppet] - 10https://gerrit.wikimedia.org/r/621209 (owner: 10Jbond) [10:01:42] nice, thank you! 
LMK if I can help [10:02:32] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/24583/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621212 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:03:42] will do thanks [10:07:59] (03PS1) 10Jbond: librenms: use uid fro realname [puppet] - 10https://gerrit.wikimedia.org/r/621216 [10:09:05] (03PS1) 10Filippo Giunchedi: prometheus: use 'regex' to drop alert labels [puppet] - 10https://gerrit.wikimedia.org/r/621217 (https://phabricator.wikimedia.org/T258948) [10:10:17] (03PS2) 10Jbond: librenms: use CN for realname which is case preserving [puppet] - 10https://gerrit.wikimedia.org/r/621216 [10:10:48] (03CR) 10Jbond: [V: 03+2 C: 03+2] librenms: use CN for realname which is case preserving [puppet] - 10https://gerrit.wikimedia.org/r/621216 (owner: 10Jbond) [10:11:21] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use 'regex' to drop alert labels [puppet] - 10https://gerrit.wikimedia.org/r/621217 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [10:19:13] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Enable CAS authentications on librenms - https://phabricator.wikimedia.org/T256958 (10jbond) SSO enabled however icinga is now failing with the following ` ERROR:check_librenms.py:Encountered exception JSONDecodeError('Expecting value: line 1 col... [10:24:20] (03CR) 10Kormat: [C: 03+1] "Tested with my pontoon env, and it works. Nice work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [10:26:17] (03PS4) 10Kormat: pontoon: stack-specific hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [10:26:19] (03PS1) 10Kormat: pontoon: Add a hiera file for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/621220 [10:27:07] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: stack-specific hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/620894 (owner: 10Filippo Giunchedi) [10:36:37] PROBLEM - dhclient process on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [10:37:45] PROBLEM - Check size of conntrack table on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [10:37:45] PROBLEM - Check systemd state on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:49] PROBLEM - bacula sd process on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Bacula [10:38:03] PROBLEM - SSH on backup2001 is CRITICAL: connect to address 10.192.48.116 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:42:15] PROBLEM - puppet last run on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:42:45] PROBLEM - MegaRAID on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:42:55] PROBLEM - MD RAID on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:43:07] jynus: is that expected? [10:43:17] PROBLEM - configured eth on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [10:44:18] vgutierrez: nope [10:44:22] we are having an issue [10:44:27] a hw one [10:44:31] I will extend the downtime [10:44:33] PROBLEM - Check whether ferm is active by checking the default input chain on backup2001 is CRITICAL: connect to address 10.192.48.116 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:45:34] a very ugly hw raid issue [10:45:54] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post - https://phabricator.wikimedia.org/T260764 (10jcrespo) [10:46:05] ^ vgutierrez :-( [10:46:32] yup... that doesn't look good :( [10:46:59] thanks for pinging, vgutierrez! [10:47:06] np [10:47:23] I will wait for onsite help, we have backup redundancy precisely for this [10:58:38] (03PS1) 10DannyS712: Fix typos in flaggedrevs comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621227 [10:59:49] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post - https://phabricator.wikimedia.org/T260764 (10jcrespo) backup2001 has been unstable for a while- and now it got extra load from database backups from eqiad. Previous crashes: * 2019-11-08 T237730#5648203 * 2019-12-08 T240177#5789805 [11:00:00] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10jcrespo) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200819T1100). 
[11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:00:31] Amir1 Lucas_WMDE awight Urbanecm I'd like to add https://gerrit.wikimedia.org/r/621227 for this window once gerrit loads enough to let me finish it [11:01:23] DannyS712: sure, I can deploy whenever you're ready. [11:01:37] gerrit struggles with InitialiseSettings [11:02:23] * awight suspects you're making the edit in Gerrit :-D [11:02:33] (03PS2) 10DannyS712: Fix typos in flaggedrevs comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621227 [11:02:45] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621227 (owner: 10DannyS712) [11:02:54] awight is correct [11:04:15] (03CR) 10Awight: [C: 03+2] "ock, stock, and barrel!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621227 (owner: 10DannyS712) [11:05:05] (03Merged) 10jenkins-bot: Fix typos in flaggedrevs comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621227 (owner: 10DannyS712) [11:08:40] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:621227|Fix typos in flaggedrevs comments ()]] (duration: 01m 19s) [11:08:40] DannyS712: thanks for all the cleanup! 
[11:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:21] (oops, my summary script fails when the Bug header is missing) [11:21:05] DannyS712: gerrit UI has been getting way too slow [11:21:58] (03CR) 10JMeybohm: [C: 03+1] "LGTM apart from the typo in the commit message" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621199 (owner: 10Giuseppe Lavagetto) [11:28:34] !log restart mwdebug* servers [11:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:54] !log EU bacon finished [11:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:30] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) I started mariadb, and started replication. Mariadb crashed after about 10 minutes. There's nothing in the idrac logs, so it could well be unrelated to the... [12:08:48] (03PS2) 10Kormat: pontoon: Add a hiera file for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/621220 [12:10:07] (03PS1) 10Kormat: pontoon: Add mariadb104-test stack [puppet] - 10https://gerrit.wikimedia.org/r/621248 [12:15:07] (03PS1) 10Jbond: librenms: Bypass CAS for api endpoints [puppet] - 10https://gerrit.wikimedia.org/r/621251 (https://phabricator.wikimedia.org/T256958) [12:18:21] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [12:19:01] (03CR) 10Jbond: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24584/" [puppet] - 10https://gerrit.wikimedia.org/r/621251 (https://phabricator.wikimedia.org/T256958) (owner: 10Jbond) [12:23:42] (03CR) 
10JMeybohm: [C: 03+1] "I'm curious: Did `{{ if and (.Values.main_app.access_log) (eq .Values.main_app.access_log.type "eventgate") -}}` not work for you? Because" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [12:24:40] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Enable CAS authentications on librenms - https://phabricator.wikimedia.org/T256958 (10jbond) librenms is now using Apereo CAS apart from api calls which authenticate directly with librenms using an api token [12:24:48] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Enable CAS authentications on librenms - https://phabricator.wikimedia.org/T256958 (10jbond) 05Open→03Resolved [12:30:18] (03CR) 10JMeybohm: [C: 03+1] Add jwt and ratelimiter fixtures to gateway for more validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/620992 (owner: 10Ppchelko) [12:31:42] (03CR) 10JMeybohm: [C: 03+1] ratelimit: crash on startup if config is invalid (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 (owner: 10Ppchelko) [12:33:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/621220 (owner: 10Kormat) [12:33:59] (03CR) 10Kormat: [C: 03+2] pontoon: Add a hiera file for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/621220 (owner: 10Kormat) [12:35:12] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10kostajh) This might substantially complicate the implementation, but it would be nice if you could do something like: `lang=php $result... 
[12:37:35] (03PS1) 10Filippo Giunchedi: Add grafana-piechart-panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/621255 (https://phabricator.wikimedia.org/T259143) [12:40:25] (03CR) 10Ppchelko: "> Patch Set 18: Code-Review+1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [12:41:54] (03PS2) 10Kormat: pontoon: Add mariadb104-test stack [puppet] - 10https://gerrit.wikimedia.org/r/621248 [12:42:36] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add grafana-piechart-panel [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/621255 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [12:43:02] (03CR) 10Kormat: [C: 03+2] pontoon: Add mariadb104-test stack [puppet] - 10https://gerrit.wikimedia.org/r/621248 (owner: 10Kormat) [12:47:40] (03CR) 10Filippo Giunchedi: "post-facto but see inline! LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621248 (owner: 10Kormat) [12:49:57] (03PS1) 10Jbond: admin: remove lulu and extend access for Roldolfo [puppet] - 10https://gerrit.wikimedia.org/r/621257 [12:50:16] (03PS6) 10Ppchelko: ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 [12:50:50] (03CR) 10jerkins-bot: [V: 04-1] admin: remove lulu and extend access for Roldolfo [puppet] - 10https://gerrit.wikimedia.org/r/621257 (owner: 10Jbond) [12:52:04] (03PS2) 10Jbond: admin: remove lulu and extend access for Roldolfo [puppet] - 10https://gerrit.wikimedia.org/r/621257 [12:52:38] (03CR) 10Giuseppe Lavagetto: Resurrect fluent-bit image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [12:53:22] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - 
https://phabricator.wikimedia.org/T256973 (10MSantos) [12:53:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:53:28] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10MSantos) [12:53:30] (03CR) 10Jbond: [C: 03+2] admin: remove lulu and extend access for Roldolfo [puppet] - 10https://gerrit.wikimedia.org/r/621257 (owner: 10Jbond) [12:53:32] (03CR) 10Ppchelko: Resurrect fluent-bit image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [12:55:41] (03PS7) 10Ppchelko: Resurrect fluent-bit image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [12:56:03] (03CR) 10Ppchelko: Resurrect fluent-bit image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [12:56:08] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 (owner: 10Ppchelko) [12:57:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:57:46] 10Operations, 10LDAP-Access-Requests: LDAP Request Astinson - https://phabricator.wikimedia.org/T260791 (10Astinson) [12:57:49] <_joe_> !log building a new version of the base docker images [12:57:52] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) >>! In T260330#6396222, @kostajh wrote: > Then, if the command execution exceeds the value set with `setTimeLimit()`, the com... [12:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:55] (03CR) 10Ottomata: "> It's not completely ideal that staging is used as release and environment name I guess.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:01:31] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [13:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:35] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Resurrect fluent-bit image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [13:03:22] <_joe_> !log building and uploading fluent-bit, ratelimit images [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:03] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 58.98 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [13:10:23] !log pt1979@cumin2001 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [13:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:42] 10Operations, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10kostajh) >>! In T260330#6396259, @tstarling wrote: > This already exists, and will continue to exist in the new system. You can use Shel... [13:14:10] 10Operations, 10netops: Make eqord its own AS - https://phabricator.wikimedia.org/T259593 (10ayounsi) To be pushed, I'm only converting the existing sessions for now (not deleting/creating new ones). `lang=diff,name=cr2-eqord [edit routing-options] - autonomous-system 65001; + autonomous-system 65020; [ed... [13:14:18] (03PS2) 10Giuseppe Lavagetto: profile::service_proxy::envoy: inject XFP in all calls to mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/621199 [13:14:20] (03PS2) 10Giuseppe Lavagetto: role::ores: enable the service proxy to the MediaWiki api [puppet] - 10https://gerrit.wikimedia.org/r/621200 [13:14:50] (03CR) 10Filippo Giunchedi: prometheus: use aggs to consolidate mediawiki logging metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [13:14:52] (03CR) 10Giuseppe Lavagetto: profile::service_proxy::envoy: inject XFP in all calls to mediawiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621199 (owner: 10Giuseppe Lavagetto) [13:14:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::service_proxy::envoy: inject XFP in all calls to mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/621199 (owner: 10Giuseppe Lavagetto) [13:16:09] (03PS1) 10Papaul: DNS: Add mgmt DNS for mc2037 [dns] - 10https://gerrit.wikimedia.org/r/621259 [13:17:34] (03PS2) 10Papaul: DNS: Add production DNS for mc2037 [dns] - 10https://gerrit.wikimedia.org/r/621259 [13:18:38] (03CR) 10Matthias Mullie: [C: 03+1] Correct 
CirrusSearchUserTesting configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621099 (https://phabricator.wikimedia.org/T254388) (owner: 10Ebernhardson) [13:18:52] (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for mc2037 [dns] - 10https://gerrit.wikimedia.org/r/621259 (owner: 10Papaul) [13:20:15] (03PS1) 10Ppchelko: Crash ratelimit container on startup for invalid config [deployment-charts] - 10https://gerrit.wikimedia.org/r/621260 [13:21:50] (03CR) 10Ppchelko: [C: 03+2] Crash ratelimit container on startup for invalid config [deployment-charts] - 10https://gerrit.wikimedia.org/r/621260 (owner: 10Ppchelko) [13:23:53] (03Merged) 10jenkins-bot: Crash ratelimit container on startup for invalid config [deployment-charts] - 10https://gerrit.wikimedia.org/r/621260 (owner: 10Ppchelko) [13:25:23] 10Operations, 10ops-codfw, 10Patch-For-Review: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) @RLazarus what OS for mc2037? [13:25:34] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:48] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:51] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[13:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:05] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:29:09] (03PS1) 10Giuseppe Lavagetto: Revert "profile::service_proxy::envoy: inject XFP in all calls to mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/621233 [13:29:14] (03PS19) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [13:29:31] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "profile::service_proxy::envoy: inject XFP in all calls to mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/621233 (owner: 10Giuseppe Lavagetto) [13:29:58] <_joe_> !log depooling and disabling puppet on restbase1024 for further investigation [13:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:14] (03PS20) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [13:30:33] 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Kormat) Here's the log from the crash: {P12301} This looks similar to {T249188} [13:32:05] (03PS21) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [13:34:18] (03CR) 10Ppchelko: [C: 03+2] Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [13:34:30] !log depooling wdqs1007 and restarting blazegraph [13:34:31] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:39] ryankemper: cc ^ [13:34:55] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:35:42] (03Merged) 10jenkins-bot: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [13:35:49] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 9.001e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:37:05] (03PS1) 10Vgutierrez: Remove unnecessary patches for Varnish 6 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621265 (https://phabricator.wikimedia.org/T260702) [13:38:06] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [13:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:09] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1007 is CRITICAL: 8.98e+04 ge 4.32e+04 Gehel server depooled and catching up on lag after a restart https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:42:38] 10Operations, 10LDAP-Access-Requests: LDAP Request Astinson - https://phabricator.wikimedia.org/T260791 (10fgiunchedi) Certainly, for this access to the `wmf` LDAP group is needed. I'll followup with the required changes [13:44:12] 10Operations, 10ops-codfw, 10Patch-For-Review: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10RLazarus) I know it's 2020, but Jessie for uniformity please. Upgrading those hosts is a work in progress. 
If we weren't about to do the switchover, I'd say install som... [13:44:53] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpe [13:44:54] expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:47:00] (03PS1) 10Filippo Giunchedi: admin: add astinson to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/621268 (https://phabricator.wikimedia.org/T260791) [13:47:13] seeking volunteers for a quick +1 on ^ [13:48:40] mm. a vim user, i dunno [13:48:58] (03CR) 10Kormat: [C: 03+1] admin: add astinson to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/621268 (https://phabricator.wikimedia.org/T260791) (owner: 10Filippo Giunchedi) [13:49:49] I know you are a secret viper-mode user kormat [13:49:59] aka "emacs for rebels" [13:50:09] also thank you [13:50:51] i'm an ex-vim vscode convert [13:51:49] +1 [13:52:13] ah, I'm using vscode from time to time myself and found the "vim emulation" to be quite accurate, I'm impressed [13:52:28] Hi. Can someone check the stack trace for `379e779f-8fa6-495c-b1da-c447aed4f159` please? I got a DBQueryError [13:52:59] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add astinson to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/621268 (https://phabricator.wikimedia.org/T260791) (owner: 10Filippo Giunchedi) [13:53:07] (03PS2) 10Filippo Giunchedi: admin: add astinson to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/621268 (https://phabricator.wikimedia.org/T260791) [13:54:04] DannyS712: I'll take a look, when did you get the error ? 
[13:54:28] a few minutes ago 2020-08-19 13:50:57 [13:54:37] wanted to know if I should file a phab task or not [13:54:41] (03PS1) 10Papaul: DHCP: Add MAC address for mc2037 [puppet] - 10https://gerrit.wikimedia.org/r/621270 (https://phabricator.wikimedia.org/T260694) [13:55:16] not sure, it is a lock timeout [13:55:27] Expectation (writeQueryTime <= 1) by RollbackAction::enableTransactionalTimelimit not met (actual: 16.724849224091): [13:55:39] ACKNOWLEDGEMENT - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned th [13:55:39] us 503 (expecting: 200) JMeybohm Depooled for investigation, also ask Joe if in doubt. https://gerrit.wikimedia.org/r/q/663c6b081c651063f1d54b3e12e6041d4397f145 https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:57:07] godog better safe than sorry - I'll file one [13:57:48] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for mc2037 [puppet] - 10https://gerrit.wikimedia.org/r/621270 (https://phabricator.wikimedia.org/T260694) (owner: 10Papaul) [13:57:57] (03CR) 10jerkins-bot: [V: 04-1] Remove unnecessary patches for Varnish 6 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621265 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [13:58:54] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Request Astinson - https://phabricator.wikimedia.org/T260791 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi @Astinson you are part of the `wmf` LDAP group now and therefore should have access to superset. Resolving, please reopen is someth... 
[13:59:40] merging a bunch of changes papaul _joe_ [13:59:58] <_joe_> godog: sigh sorry, a meeting occurred [14:00:12] _joe_: np, I forgot myself and just got back to it [14:02:16] godog: thanks [14:03:31] @godog filed T260798 if you'd be willing to include the stack trace there [14:03:32] T260798: DBQueryError when using rollback on incubatorwiki - https://phabricator.wikimedia.org/T260798 [14:08:15] (03PS1) 10RLazarus: sre.switchdc.mediawiki: Add -ro services and handle parsoid-php specially. [cookbooks] - 10https://gerrit.wikimedia.org/r/621272 [14:09:40] (03CR) 10Volans: [C: 03+1] "Ok, let's try this way, but keep an eye on the double DNS check, if it takes too long we can refactor." [cookbooks] - 10https://gerrit.wikimedia.org/r/621272 (owner: 10RLazarus) [14:10:09] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:10:39] (03CR) 10RLazarus: "> Ok, let's try this way, but keep an eye on the double DNS check, if it takes too long we can refactor." [cookbooks] - 10https://gerrit.wikimedia.org/r/621272 (owner: 10RLazarus) [14:12:11] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:12:42] (03CR) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [14:13:24] (03PS1) 10Papaul: Add mc2037 to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/621275 (https://phabricator.wikimedia.org/T260694) [14:14:06] (03CR) 10RLazarus: [C: 03+2] sre.switchdc.mediawiki: Add -ro services and handle parsoid-php specially. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/621272 (owner: 10RLazarus) [14:15:16] (03CR) 10Papaul: [C: 03+2] Add mc2037 to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/621275 (https://phabricator.wikimedia.org/T260694) (owner: 10Papaul) [14:15:30] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Add -ro services and handle parsoid-php specially. [cookbooks] - 10https://gerrit.wikimedia.org/r/621272 (owner: 10RLazarus) [14:18:11] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:18:23] 10Operations, 10ops-codfw, 10Patch-For-Review: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mc2037.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [14:19:48] 10Operations, 10DC-Ops, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) [14:22:41] jynus: thank God Heze is finally going away [14:22:43] 10Operations, 10DC-Ops, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) In addition to the data that I am unable to recover some data using the new hardware, replacement backup2001 has been crashing frequently since setup T260764#6396032, which does not give... 
[14:22:52] (03PS1) 10Ppchelko: Fix the api-gateway request log stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/621276 [14:23:10] (03CR) 10Ppchelko: [C: 03+2] Fix the api-gateway request log stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/621276 (owner: 10Ppchelko) [14:23:11] 10Operations, 10DC-Ops, 10Patch-For-Review: decom helium and heze - https://phabricator.wikimedia.org/T260717 (10jcrespo) [14:23:15] 10Operations, 10Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) [14:24:02] papaul not if backup2001 keeps crashing... :-( [14:24:30] (03Merged) 10jenkins-bot: Fix the api-gateway request log stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/621276 (owner: 10Ppchelko) [14:26:09] 10Operations, 10ops-codfw: backup2001 RAID controller failure, unable to post 2020-08-19 - https://phabricator.wikimedia.org/T260764 (10jcrespo) To counter that maybe the setup was not ideal/hw configuration was chosen wrongly (e.g. too many disks/arrays for a single host), backup1001 had none of the crashes b... [14:27:08] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[14:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:25] jynus: yes i saw the tsk this AM and look into the HW log nothing helpful for now [14:27:35] DannyS712: logs will be purged after 90d so I'd rather let whoever's triaging the errors take a look [14:27:53] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 89951776 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:29:10] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10Gehel) [14:29:49] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 832 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:31:43] 10Operations, 10ops-codfw, 10Patch-For-Review: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2037.codfw.wmnet'] ` Of which those **FAILED**: ` ['mc2037.codfw.wmnet'] ` [14:33:17] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [14:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:20] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:30] (03PS1) 10Andrew Bogott: ceph backups: exclude some VMs from backups [puppet] - 10https://gerrit.wikimedia.org/r/621281 [14:38:32] (03PS1) 10Andrew Bogott: ceph backups: change backup lifespan to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/621282 [14:38:35] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: completely overwrite X-Forwarded-Proto if needed [puppet] - 10https://gerrit.wikimedia.org/r/621283 [14:39:17] (03PS1) 
10Vgutierrez: Update 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 [14:39:38] (03CR) 10jerkins-bot: [V: 04-1] Update 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 (owner: 10Vgutierrez) [14:41:06] (03CR) 10Andrew Bogott: [C: 03+2] ceph backups: exclude some VMs from backups [puppet] - 10https://gerrit.wikimedia.org/r/621281 (owner: 10Andrew Bogott) [14:41:10] !log disable puppet on cumin1001 for switchdc testing [14:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:14] (03CR) 10Andrew Bogott: [C: 03+2] ceph backups: change backup lifespan to 7 days [puppet] - 10https://gerrit.wikimedia.org/r/621282 (owner: 10Andrew Bogott) [14:44:05] FYI: starting a test of the switchdc cookbooks shortly, effectively simulating a switch to eqiad (where we're already running) -- no effect on production is expected but there'll be some noise in the SAL, and I'll have an eye on this channel in case of any issues [14:47:33] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [14:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:36] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [14:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:16] 10Operations, 10ops-codfw, 10Patch-For-Review: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mc2037.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... 
[14:49:56] (03PS2) 10Vgutierrez: Update 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 (https://phabricator.wikimedia.org/T260702) [14:50:05] !log running the switchdc cookbooks with --live-test, simulating a switch to eqiad where we're already running, no production impact is expected [14:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] (03CR) 10jerkins-bot: [V: 04-1] Update 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [14:50:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert to anaconda 2020.02, also some activation improvements [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/620144 (owner: 10Ottomata) [14:50:35] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [14:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:45] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99) [14:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:11] (03PS1) 10JMeybohm: helmfile: refactor eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) [15:02:30] (03PS2) 10Giuseppe Lavagetto: services_proxy::envoy: completely overwrite X-Forwarded-Proto if needed [puppet] - 10https://gerrit.wikimedia.org/r/621283 [15:04:12] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] (03CR) 10JMeybohm: [C: 03+1] "love it! 
👍" [puppet] - 10https://gerrit.wikimedia.org/r/621283 (owner: 10Giuseppe Lavagetto) [15:06:15] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:39] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [15:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:48] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99) [15:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:28] (03PS1) 10Hashar: Merge 'doxygen 1.8.19 release' from Debian [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/621291 [15:12:47] (03PS2) 10Hashar: Merge 'doxygen 1.8.19 release' from Debian [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/621291 (https://phabricator.wikimedia.org/T254465) [15:14:40] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:52] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [15:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:05] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [15:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:14] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99) [15:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:10] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [15:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [15:17:19] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:19] !log prometheus codfw lvextend --resizefs --size +80G /dev/mapper/vg--ssd-prometheus--ops [15:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:07] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [15:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:16] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99) [15:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:23] (03CR) 10jerkins-bot: [V: 04-1] Merge 'doxygen 1.8.19 release' from Debian [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/621291 (https://phabricator.wikimedia.org/T254465) (owner: 10Hashar) [15:26:17] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [15:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:27] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=99) [15:26:27] (03CR) 10Jbond: [C: 03+2] haveged: install haveged on VM'si debian < buster by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 (owner: 10Jbond) [15:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:12] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2037.codfw.wmnet'] ` and were **ALL** successful. 
[15:29:14] (03PS4) 10Jbond: java: update java.security [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) [15:29:31] (03CR) 10Hashar: "FTBS: ld: cannot find -lclang-cpp" [debs/doxygen] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/621291 (https://phabricator.wikimedia.org/T254465) (owner: 10Hashar) [15:30:12] !log oblivian@cumin1001 conftool action : set/ttl=300; selector: dnsdisc=api-rw [15:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:18] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) [15:30:52] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) [15:31:27] !log update java.security https://gerrit.wikimedia.org/r/c/operations/puppet/+/593467 [15:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:32] (03CR) 10Jbond: [C: 03+2] java: update java.security [puppet] - 10https://gerrit.wikimedia.org/r/593467 (https://phabricator.wikimedia.org/T251493) (owner: 10Jbond) [15:32:00] 10Operations, 10ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10Papaul) [15:32:08] 10Operations, 10ops-codfw: Replace mc2028 with wmf6413 and name it mc2037 - https://phabricator.wikimedia.org/T260694 (10Papaul) 05Open→03Resolved This is complete. 
Please open a decommission sub-task to T260224 for mc2028 [15:32:47] (03PS1) 10Jdlrobson: Drop namespace special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) [15:33:23] (03Abandoned) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [15:33:24] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [15:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:34] !log rzl@cumin1001 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=99) [15:33:35] (03PS2) 10Jdlrobson: Drop namespace special casing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) [15:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:37] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:34:10] !log rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-restore-ttl [15:34:11] (03PS1) 10Ashot1997: Don't index Draft (118) and Draft talk (119) on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621296 (https://phabricator.wikimedia.org/T260804) [15:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:21] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) [15:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:05] (03PS1) 10Jdlrobson: Enable $wgMFNoindexPages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621298 (https://phabricator.wikimedia.org/T255458) [15:37:04] !log 
rzl@cumin1001 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [15:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:35] !log rzl@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [15:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:11] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:41:37] !log finished exercising the switchdc cookbooks with --live-test for now, all changes reverted including re-enabling puppet on cumin1001 [15:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:28] (03CR) 10D3r1ck01: [C: 03+1] "lgtm!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) (owner: 10Jdlrobson) [15:48:25] (03PS3) 10Andrew Bogott: Mark cloudvirt1015 as spare [puppet] - 10https://gerrit.wikimedia.org/r/619562 (https://phabricator.wikimedia.org/T257366) [15:49:06] (03PS3) 10Jforrester: Drop namespace special casing on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) (owner: 10Jdlrobson) [15:49:40] (03PS1) 10Jdlrobson: Update 6 project wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621300 (https://phabricator.wikimedia.org/T254788) [15:49:46] Jdlrobson: Want me to deploy that now? 
[15:50:13] (03PS2) 10Jdlrobson: Update 5 project wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621300 (https://phabricator.wikimedia.org/T254788) [15:50:36] (03PS4) 10Jforrester: [Beta Cluster] Drop mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) (owner: 10Jdlrobson) [15:50:51] (03CR) 10Andrew Bogott: [C: 03+2] Mark cloudvirt1015 as spare [puppet] - 10https://gerrit.wikimedia.org/r/619562 (https://phabricator.wikimedia.org/T257366) (owner: 10Andrew Bogott) [15:52:15] (03PS1) 10Filippo Giunchedi: pontoon: add netmon-01 to observability [puppet] - 10https://gerrit.wikimedia.org/r/621301 [15:52:17] (03PS1) 10Filippo Giunchedi: pontoon: add observability stack hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/621302 [15:53:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 3 others: decom cloudvirt1015 - https://phabricator.wikimedia.org/T257366 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1015.eqiad.wmnet'] ` The log can be found in `/v... [15:53:08] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add netmon-01 to observability [puppet] - 10https://gerrit.wikimedia.org/r/621301 (owner: 10Filippo Giunchedi) [15:59:32] (03PS1) 10Jdlrobson: WIP: Make it easier to configure wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621303 [15:59:52] (03PS1) 10RLazarus: sre.switchdc.mediawiki: Add -ro targets to the TTL steps also. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/621304 [16:00:31] (03CR) 10jerkins-bot: [V: 04-1] WIP: Make it easier to configure wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621303 (owner: 10Jdlrobson) [16:02:48] (03PS1) 10Jdlrobson: Configure namespaces on commons to include categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621306 (https://phabricator.wikimedia.org/T198716) [16:03:26] (03PS2) 10Jdlrobson: Configure namespaces on commons to include categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621306 (https://phabricator.wikimedia.org/T198716) [16:06:29] James_F: and yes to the early request to deploy. I've put it in the 11am backport window if not - i have a few other things to get deployed today [16:07:24] (03CR) 10Jforrester: "This file is meant to be static code only. Adding functions like this is absolutely not going to be OK. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621303 (owner: 10Jdlrobson) [16:09:02] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Drop mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) (owner: 10Jdlrobson) [16:09:04] (03Abandoned) 10Dzahn: decom helium and heze [puppet] - 10https://gerrit.wikimedia.org/r/621038 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [16:09:23] (03Abandoned) 10Dzahn: profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [16:09:58] (03Merged) 10jenkins-bot: [Beta Cluster] Drop mainpage special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621295 (https://phabricator.wikimedia.org/T257953) (owner: 10Jdlrobson) [16:11:02] (03CR) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:12:10] (03CR) 
10Dzahn: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [16:12:24] (03PS2) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) [16:12:53] thx James_F :) [16:13:09] (03CR) 10Bstorm: "I think it'll be ok to log this to -operations for now. I figure a new socket service or something would be needed on icinga servers to lo" [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:14:24] (03CR) 10Jdlrobson: [C: 04-1] "Got it. I was wondering about that. Do you have any alternate ideas with how we could clean this file up and make it more programattic? We" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621303 (owner: 10Jdlrobson) [16:15:49] (03PS3) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) [16:16:07] (03CR) 10Dzahn: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [16:17:22] (03CR) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [16:18:18] (03PS4) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) [16:19:41] (03CR) 10Dzahn: "I don't know what to say about this change, it still seems wrong that the user running the webserver owns the config file. Can we add more" [puppet] - 10https://gerrit.wikimedia.org/r/606824 (owner: 10Ladsgroup) [16:20:46] (03CR) 10Dzahn: [C: 03+1] "@hashar I am not sure now about the status of this patch. 
Your latest comment on IRC was that doc.wm.org can't be switched but integration" [puppet] - 10https://gerrit.wikimedia.org/r/611369 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [16:22:29] (03PS5) 10Bstorm: wikireplicas: add wikireplica cookbook to add a wiki [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) [16:22:56] (03CR) 10Dzahn: [C: 04-1] "Yea, agree, the cert should be added before this gets merged. You will need 3 merges. 1) private puppet repo after creating the cert 2) " [puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron) [16:23:55] (03CR) 10Dzahn: [C: 04-1] "Besides the cert comment I can't say that much about the actual logstash switch. And since this is stalled for now i'll remove myself for " [puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron) [16:25:15] (03PS2) 10Majavah: Don't index Draft (118) and Draft talk (119) on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621296 (https://phabricator.wikimedia.org/T260804) (owner: 10Ashot1997) [16:27:02] (03PS2) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) [16:27:15] (03CR) 10Dzahn: "We worked around the issue for Gerrit in cloud by just using certbot since this is not looking like there is willingness to merge. 
The re" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [16:28:02] (03CR) 10Cwhite: prometheus: use aggs to consolidate mediawiki logging metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621098 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [16:28:42] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 56 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:30:10] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Bstorm) >>! In T260441#6392164, @Marostegui wrote: > @Bstorm once the hosts are installed we can create a different task to talk about th... 
[16:32:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:43] (03PS1) 10Ppchelko: Enhancements to access logging for api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/621308 (https://phabricator.wikimedia.org/T251812) [16:34:09] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10RobH) [16:34:14] (03PS1) 10Cwhite: prometheus: cleanup count of all logs [puppet] - 10https://gerrit.wikimedia.org/r/621309 (https://phabricator.wikimedia.org/T256418) [16:34:16] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10RobH) [16:34:19] (03CR) 10Ppchelko: [C: 03+2] Enhancements to access logging for api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/621308 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [16:35:48] (03Merged) 10jenkins-bot: Enhancements to access logging for api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/621308 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [16:37:06] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:19] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[16:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:31] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error - https://phabricator.wikimedia.org/T260520 (10AntiCompositeNumber) This is due to mitigation from T244278/https://wikitech.wikimedia.org/wiki/Incident_documentation/20200204-maps. The toolf... [16:43:57] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10RobH) [16:44:21] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10RobH) [16:44:26] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:44:31] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10RobH) a:03Papaul [16:48:04] (03CR) 10Dzahn: "It's too early to add me as reviewer when this is still a WIP. Also per my last comment this needs a reason or problem statement why we wa" [puppet] - 10https://gerrit.wikimedia.org/r/509542 (owner: 10Paladox) [16:49:07] (03PS1) 10Urbanecm: Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621312 (https://phabricator.wikimedia.org/T259927) [16:49:43] (03CR) 10Dzahn: "@paladox This has a jerkins-bot -1 but the link to the details is already 404 because it's been months ago. 
Are you planning to amend/reba" [puppet] - 10https://gerrit.wikimedia.org/r/492518 (owner: 10Paladox) [16:51:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:51:40] (03CR) 10Dzahn: "I am removing myself because I need to keep my review queues from filling up and not a good reviewer for the actual ruby code." [puppet] - 10https://gerrit.wikimedia.org/r/492518 (owner: 10Paladox) [16:52:26] (03CR) 10Krinkle: "A unit test might be all we need here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621303 (owner: 10Jdlrobson) [16:53:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decom cloudvirt1015 - https://phabricator.wikimedia.org/T257366 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1015.eqiad.wmnet'] ` and were **ALL** successful. 
[16:57:27] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (add it to varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10Dzahn) [17:01:07] (03PS1) 10Dzahn: decom releases2001 [puppet] - 10https://gerrit.wikimedia.org/r/621314 (https://phabricator.wikimedia.org/T260742) [17:01:34] (03CR) 10jerkins-bot: [V: 04-1] decom releases2001 [puppet] - 10https://gerrit.wikimedia.org/r/621314 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [17:03:31] (03PS1) 10Ottomata: jupyterhub - sort profiles by user name; increase spawner timeout [puppet] - 10https://gerrit.wikimedia.org/r/621315 (https://phabricator.wikimedia.org/T224658) [17:04:25] (03CR) 10Ottomata: [C: 03+2] jupyterhub - sort profiles by user name; increase spawner timeout [puppet] - 10https://gerrit.wikimedia.org/r/621315 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [17:04:59] (03PS2) 10Dzahn: decom releases2001 [puppet] - 10https://gerrit.wikimedia.org/r/621314 (https://phabricator.wikimedia.org/T260742) [17:07:02] (03Abandoned) 10Dzahn: parsoid: remove vd_server and vd_client from parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/615831 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [17:07:10] (03PS1) 10Cwhite: prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) [17:08:02] (03CR) 10jerkins-bot: [V: 04-1] prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:08:06] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:09:28] (03CR) 10Cwhite: "expected build failure -- all current configuration does not follow the namespace convention introduced here" [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:10:07] (03PS2) 10Cwhite: prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) [17:10:58] (03CR) 10jerkins-bot: [V: 04-1] prometheus: update prometheus-es-exporter config test to enforce namespace [puppet] - 10https://gerrit.wikimedia.org/r/621318 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [17:11:25] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (add it to varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10Dzahn) Looks like the current code can take exactly one "maps_domain" but not a list or array of multiple ones... 
[17:15:55] (03PS2) 10Catrope: Enable GrowthExperiments on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621108 (https://phabricator.wikimedia.org/T257490) [17:16:57] (03CR) 10Nskaggs: wikireplicas: create cumin aliases for wikireplica servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:20:35] (03CR) 10Dzahn: [C: 03+2] decom releases2001 [puppet] - 10https://gerrit.wikimedia.org/r/621314 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [17:20:41] (03PS3) 10Dzahn: decom releases2001 [puppet] - 10https://gerrit.wikimedia.org/r/621314 (https://phabricator.wikimedia.org/T260742) [17:29:14] (03CR) 10BryanDavis: wikireplicas: create cumin aliases for wikireplica servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:35:03] (03Abandoned) 10Alex Monk: Fix hostname for labs ORES monitoring [puppet] - 10https://gerrit.wikimedia.org/r/621080 (https://phabricator.wikimedia.org/T260732) (owner: 10Alex Monk) [17:37:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [17:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:32] !log decom'ing releases2001.codfw.wmnet ( [17:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:56] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `releases2001.codfw.wmnet` - releases2001.codfw.wmnet (**PASS**) - Downtimed host on... 
[17:41:51] (03PS3) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) [17:43:14] (03CR) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:44:27] (03PS4) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) [17:45:24] (03CR) 10Bstorm: wikireplicas: create cumin aliases for wikireplica servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:46:50] (03CR) 10BryanDavis: [C: 03+1] wikireplicas: create cumin aliases for wikireplica servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [17:48:08] (03PS3) 10Jdlrobson: Update project wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621300 (https://phabricator.wikimedia.org/T254788) [17:53:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10wiki_willy) a:03Jclark-ctr [17:54:20] (03PS1) 10Jdlrobson: Update taglines [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621328 (https://phabricator.wikimedia.org/T258552) [17:55:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [17:58:45] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [18:00:04] twentyafterfour and marxarelli: 
Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200819T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200819T1800). [18:00:04] Jdlrobson and Ashot1997: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:39] I can deploy today! [18:00:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [18:01:03] hello [18:01:23] hi Jdlrobson [18:01:28] Ashot1997: hello, are you around? [18:01:33] yes [18:01:59] (03PS2) 10Urbanecm: Enable $wgMFNoindexPages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621298 (https://phabricator.wikimedia.org/T255458) (owner: 10Jdlrobson) [18:02:10] (03CR) 10Urbanecm: [C: 03+2] Enable $wgMFNoindexPages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621298 (https://phabricator.wikimedia.org/T255458) (owner: 10Jdlrobson) [18:02:27] Ashot1997: thanks, I will ping you once your patch is ready [18:02:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10wiki_willy) a:03Jclark-ctr [18:02:59] (03Merged) 10jenkins-bot: Enable $wgMFNoindexPages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621298 (https://phabricator.wikimedia.org/T255458) (owner: 10Jdlrobson) [18:03:27] Jdlrobson: ^^is at mwdebug1001, could you test, please? 
[18:03:45] testing [18:03:57] (03PS2) 10Jdlrobson: Update taglines for various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621328 (https://phabricator.wikimedia.org/T258552) [18:03:58] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-40] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [18:04:29] (03PS4) 10Urbanecm: Update project wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621300 (https://phabricator.wikimedia.org/T254788) (owner: 10Jdlrobson) [18:04:41] (03CR) 10Urbanecm: [C: 03+2] Update project wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621300 (https://phabricator.wikimedia.org/T254788) (owner: 10Jdlrobson) [18:04:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10wiki_willy) a:03Jclark-ctr [18:04:55] (03CR) 10Alex Monk: "we should probably do T252199" [puppet] - 10https://gerrit.wikimedia.org/r/602722 (owner: 10Paladox) [18:05:29] (03Merged) 10jenkins-bot: Update project wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621300 (https://phabricator.wikimedia.org/T254788) (owner: 10Jdlrobson) [18:07:04] (03PS1) 10Ppchelko: Filter out null values using fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621329 (https://phabricator.wikimedia.org/T251812) [18:07:17] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10wiki_willy) a:03Cmjohnson [18:07:34] (03CR) 10Ppchelko: [C: 03+2] Filter out null values using fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621329 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [18:07:38] Urbanecm: sorry im having trouble testing that on mwdebug1001 [18:07:52] Jdlrobson: could you please clarify the trouble? 
[18:08:23] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10wiki_willy) @Cmjohnson to provide proposed schedule of all affected racks for upgrades, on Thursday, to send out to Service Owners. [18:08:30] im just not seeing what i expected to happen, so just need a bit more time to understand why [18:08:46] Jdlrobson: okay, that's fine, I'm waiting. [18:08:48] (03Merged) 10jenkins-bot: Filter out null values using fluent-bit [deployment-charts] - 10https://gerrit.wikimedia.org/r/621329 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [18:09:31] Jdlrobson: let me know if I can help in any way [18:12:12] Urbanecm: i think it's fine to sync. I think the issue I am seeing might not show up on debug1001.eqiad.wmnet for some reason [18:12:27] PROBLEM - DPKG on stat1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:12:40] Jdlrobson: okay, syncing [18:12:48] On ar.wikipedia I see !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [18:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:13:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a6f8354e7599a5e92bea060807065f5b42c540e5: Enable $wgMFNoindexPages for all wikis (T255458) (duration: 01m 07s) [18:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:09] T255458: Enable $wgMFNoindexPages for all wikis - https://phabricator.wikimedia.org/T255458 [18:15:22] Jdlrobson: synced. 
Is it possible to test it now? [18:15:42] !log rebooting webperf2002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM (T260192) [18:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:45] T260192: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 [18:15:53] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10wiki_willy) a:03Papaul [18:15:57] Urbanecm: now it's showing up great [18:16:03] okay, cool! [18:16:26] Jdlrobson: `Update project wordmarks ` is available at mwdebug1001 now [18:17:11] testing [18:17:51] LGTM [18:17:55] (03PS3) 10Urbanecm: Configure namespaces on commons to include categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621306 (https://phabricator.wikimedia.org/T198716) (owner: 10Jdlrobson) [18:18:00] thanks, syncing [18:19:15] (03CR) 10Urbanecm: [C: 03+2] Configure namespaces on commons to include categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621306 (https://phabricator.wikimedia.org/T198716) (owner: 10Jdlrobson) [18:19:50] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: b9043331c1c1b352256cffd471b9ff128806607c: Update project wordmarks (T254788; sync 1/2) (duration: 01m 06s) [18:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:54] T254788: Update 5 wordmarks - https://phabricator.wikimedia.org/T254788 [18:20:05] (03Merged) 10jenkins-bot: Configure namespaces on commons to include categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621306 (https://phabricator.wikimedia.org/T198716) (owner: 10Jdlrobson) [18:21:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: b9043331c1c1b352256cffd471b9ff128806607c: Update project wordmarks (T254788; sync 2/2) (duration: 01m 04s) [18:21:23] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:32] Jdlrobson: should be done [18:22:03] Jdlrobson: ` Configure namespaces on commons to include categories` is now available at mwdebug1001, could you test, please? [18:22:14] hurrah [18:23:03] (03PS3) 10Urbanecm: Update taglines for various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621328 (https://phabricator.wikimedia.org/T258552) (owner: 10Jdlrobson) [18:23:08] Urbanecm: cannot test this one until its synced unfortunately [18:23:09] (03CR) 10Urbanecm: [C: 03+2] Update taglines for various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621328 (https://phabricator.wikimedia.org/T258552) (owner: 10Jdlrobson) [18:23:16] Jdlrobson: okay, will sync then [18:23:58] (03Merged) 10jenkins-bot: Update taglines for various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621328 (https://phabricator.wikimedia.org/T258552) (owner: 10Jdlrobson) [18:24:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decom cloudvirt1015 - https://phabricator.wikimedia.org/T257366 (10Andrew) [18:24:47] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bb4aa44b0bd5b2b33d190d3af81e038e5fc55e3f: Configure namespaces on commons to include categories (T198716) (duration: 01m 04s) [18:24:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:24:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:51] T198716: Enable PageImages on Commons categories namespace - https://phabricator.wikimedia.org/T198716 [18:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:56] Jdlrobson: should be live! 
[18:25:10] !log rebooting webperf1002 VM on ganeti level (outside OS) to upgrade from 8 to 16GB RAM (T260192) [18:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:13] T260192: More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 [18:26:40] Urbanecm: great. just taglines to do! [18:27:02] Jdlrobson: taglines pulled onto mwdebug1001 now! [18:29:00] perfect Urbanecm ! please sync! [18:29:14] Jdlrobson: syncing! [18:30:45] (03PS3) 10Urbanecm: Don't index Draft (118) and Draft talk (119) on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621296 (https://phabricator.wikimedia.org/T260804) (owner: 10Ashot1997) [18:30:54] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: 803cb1a0d2c8cc6df8e4e88ab3c4d27eb71d01b3: Update taglines for various projects (T258552) (duration: 01m 06s) [18:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:58] T258552: Add wordmarks and taglines for 26 more Wikipedias - https://phabricator.wikimedia.org/T258552 [18:31:05] (03CR) 10Urbanecm: [C: 03+2] Don't index Draft (118) and Draft talk (119) on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621296 (https://phabricator.wikimedia.org/T260804) (owner: 10Ashot1997) [18:31:55] (03Merged) 10jenkins-bot: Don't index Draft (118) and Draft talk (119) on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621296 (https://phabricator.wikimedia.org/T260804) (owner: 10Ashot1997) [18:32:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 803cb1a0d2c8cc6df8e4e88ab3c4d27eb71d01b3: Update taglines for various projects (T258552) (duration: 01m 04s) [18:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:32] Jdlrobson: should be all live! [18:32:53] Ashot1997: hello, your patch is at mwdebug1001, could you test, please? 
[18:33:09] hurrah thanks for all the help Urbanecm [18:33:19] happy to help! [18:34:18] 10Operations, 10vm-requests, 10Performance-Team (Radar): More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (10Dzahn) 05Open→03Resolved a:03Dzahn ` 17:54 < mutante> hi all, I could use some input on determining where we want to draw the line between ganeti VM an... [18:34:51] Urbanecm sorry but I don't know how or what is mwdebug1001 :/ [18:35:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:50] Ashot1997: ah, sure. mwdebug1001 is a special debug server that is used to make sure a patch is working, before it is synced to the whole fleet. That's to reduce impact of a mistake a human can make. To be able to test at mwdebug1001, you need to install a browser extension from https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions [18:36:27] Thanks, just a moment [18:36:56] once it is installed, you need to enable it, and pick the right debug server (in this case, I've pulled the change to mwdebug1001). Then, you just make sure the noindex patch works as intended (ie. mediawiki should propagate the noindex tag in the source code). [18:37:04] Ashot1997: sure, let me know if there are any issues. [18:37:50] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) This is stalled on T260627 but otherwise should be good to go. 
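Editor's note: the check described above (confirming that the page source served by the debug server carries a noindex directive) can be sketched as follows. This is only an illustration, not the tooling used in the channel: the names `RobotsMetaParser` and `has_noindex` are hypothetical, and it assumes the directive appears as a `<meta name="robots">` tag in page HTML that has already been saved, for example from a browser with the WikimediaDebug extension enabled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # A content value such as "noindex,nofollow" holds several directives.
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML declares a robots 'noindex' directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Running `has_noindex` on the saved source of a Draft namespace page, once with the debug server selected and once without, would show whether the change is visible only on the debug host before the sync.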
[18:37:53] (03PS3) 10Urbanecm: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) [18:37:59] (03CR) 10Urbanecm: [C: 03+2] ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm) [18:38:01] Urbanecm yes, it works as intended [18:38:08] Ashot1997: cool, I'll sync it then! [18:38:37] Urbanecm and for syncing and specially for teaching me! :) [18:38:44] (03Merged) 10jenkins-bot: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm) [18:39:38] Ashot1997: I'm glad you learned a new thing :-). Sorry, I just assumed you did this before. [18:39:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:39:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 95d45f6e002df78d4860a711042d77a6b0bdecb9: Dont index Draft (118) and Draft talk (119) on hywiki (T260804) (duration: 01m 04s) [18:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:00] T260804: Prevent search engines from indexing Draft namespace pages on hywiki - https://phabricator.wikimedia.org/T260804 [18:40:03] Ashot1997: should be all live now! 
[18:40:34] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) [18:40:50] 10Operations, 10DBA, 10Parsoid, 10serviceops, and 2 others: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) 05Open→03Resolved >>! In T260627#6392112, @Kormat wrote: > Hi, i've created the new grants. Please test and let me know if there are any issues. Cheers.... [18:40:59] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Stalled→03Open [18:41:22] Last time you did it for me (T259987) and gave me the https://wikitech.wikimedia.org/wiki/Deployments link :) [18:41:22] T259987: Two new namespaces for hy.wikipedia.org - https://phabricator.wikimedia.org/T259987 [18:41:46] Ashot1997: cool! Looking forward to seeing you again here :) [18:42:16] Thanks, I checked and it works now ^_^ [18:42:27] (03PS1) 10Ppchelko: Enable TLS for fluent-bit -> eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/621332 (https://phabricator.wikimedia.org/T260626) [18:42:53] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) @ssastry The `parsoid-rt` service is now running on testreduce1001 and does not stop anymore because it can t... 
[18:42:58] (03CR) 10Ppchelko: "I couldn't test it locally at all 😞" [deployment-charts] - 10https://gerrit.wikimedia.org/r/621332 (https://phabricator.wikimedia.org/T260626) (owner: 10Ppchelko) [18:43:45] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) @ssastry Please let me know what else you need on the new instance. [18:43:47] Ashot1997: I'm happy to hear it! [18:44:44] (03PS2) 10Urbanecm: Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621312 (https://phabricator.wikimedia.org/T259927) [18:44:50] (03CR) 10Urbanecm: [C: 03+2] Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621312 (https://phabricator.wikimedia.org/T259927) (owner: 10Urbanecm) [18:45:13] subbu: parsoid-rt is now running on testreduce1001 because it can now talk to the testreduce db [18:45:21] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 83b34e1bd1ed804a70f67e089580e082f89e2a0f: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication (T258695) (duration: 01m 04s) [18:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:25] T258695: Investigate why ClosedWikiProvider doesn't work - https://phabricator.wikimedia.org/T258695 [18:45:39] mutante, thanks! will take a look this week. 
[18:45:40] (03Merged) 10jenkins-bot: Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621312 (https://phabricator.wikimedia.org/T259927) (owner: 10Urbanecm) [18:45:40] subbu: just not sure if you want it to be running on both scandium and testreduce1001 [18:45:52] subbu: ok, cool [18:46:01] mutante, well, lets keep it turned off on testreduce1001 [18:46:54] subbu: that needs new code but i will do that just like i did for the -vd service recently [18:47:06] thx. [18:47:13] then they can be both controlled with a one-line in hiera [18:47:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 924a03bd624d6750a7e776e09713056cc45e5cc5: Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T259927) (duration: 01m 04s) [18:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:52] T259927: Add clinton.presidentiallibraries.us to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T259927 [18:49:10] subbu: until i have that merged puppet is disabled and the service is manually stopped now. let's talk later what is left [18:49:41] sounds good. in a meeting now. will take a look after meeting + lunch. 
[18:49:50] ack [18:52:53] !log testreduce1001 - disable puppet; stop parsoid-rt service [18:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:46] (03PS1) 10Urbanecm: Add autopatrolled group at arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621336 (https://phabricator.wikimedia.org/T260761) [18:58:23] (03PS2) 10Urbanecm: Add autopatrolled group at arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621336 (https://phabricator.wikimedia.org/T260761) [18:58:31] (03CR) 10Urbanecm: [C: 03+2] Add autopatrolled group at arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621336 (https://phabricator.wikimedia.org/T260761) (owner: 10Urbanecm) [18:59:13] (03PS1) 10Dzahn: parsoid/testreduce: add paramater to control parsoid-rt service [puppet] - 10https://gerrit.wikimedia.org/r/621337 (https://phabricator.wikimedia.org/T257906) [18:59:25] (03Merged) 10jenkins-bot: Add autopatrolled group at arzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621336 (https://phabricator.wikimedia.org/T260761) (owner: 10Urbanecm) [19:00:04] twentyafterfour and marxarelli: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200819T1900). [19:00:08] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (add it to varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10TheDJ) No, it’s the line below that. Maps domain is what u give access to, referrer is what is used to check w... [19:00:17] o/ [19:00:42] marxarelli: twentyafterfour: let me finish deployment from B&C window [19:00:56] k [19:01:09] 10Operations, 10fundraising-tech-ops, 10netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (10Jgreen) Ok, happy to work this into our deployment-prep script. 
[19:02:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 60af096b80a8ef7bc94ec40ce203fd27b0c97f26: Add autopatrolled group at arzwiki (T260761) (duration: 01m 04s) [19:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:10] T260761: Creation of Autopatrolled group on arz.wikipedia - https://phabricator.wikimedia.org/T260761 [19:02:21] (03PS1) 10Dzahn: testreduce::server: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621338 [19:02:38] (03PS2) 10Dzahn: parsoid/testreduce: add parameter to control parsoid-rt service [puppet] - 10https://gerrit.wikimedia.org/r/621337 (https://phabricator.wikimedia.org/T257906) [19:02:39] marxarelli: should be all done now, thanks for your patience! [19:06:36] (03PS2) 10Dzahn: testreduce::server: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621338 [19:07:09] ok deploying [19:07:28] (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621339 [19:07:30] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621339 (owner: 1020after4) [19:07:47] (03CR) 10Dzahn: [C: 03+2] "this shows how it keeps running on scandium but gets stopped on testreduce1001 - https://puppet-compiler.wmflabs.org/compiler1001/24589/" [puppet] - 10https://gerrit.wikimedia.org/r/621337 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [19:08:17] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.5 refs T257973 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621339 (owner: 1020after4) [19:09:20] (03CR) 10Dzahn: [C: 03+2] "fyi, this is how you can now control where the service is running. just set it to 'running' or 'stopped' in Hiera for the old or new role." 
[puppet] - 10https://gerrit.wikimedia.org/r/621337 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [19:09:31] (03PS3) 10Dzahn: parsoid/testreduce: add parameter to control parsoid-rt service [puppet] - 10https://gerrit.wikimedia.org/r/621337 (https://phabricator.wikimedia.org/T257906) [19:14:19] RECOVERY - DPKG on stat1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:14:20] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.5 refs T257973 [19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:24] T257973: 1.36.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T257973 [19:15:25] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.5 refs T257973 (duration: 01m 04s) [19:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:28] twentyafterfour: looks good from here [19:19:43] just that one eventbus enqueue failure [19:20:11] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [19:20:39] (03CR) 10Bstorm: "Quick sanity check before merge https://puppet-compiler.wmflabs.org/compiler1001/24588/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [19:20:40] !log testreduce1001 - re-enabled puppet, confirmed parsoid-rt service was now stopped properly by puppet while it runs as before on scandium, the previous parsoid-testing host. switching it over is now a Hiera one-liner. 
(T257906) [19:20:43] (03CR) 10Bstorm: [C: 03+2] wikireplicas: create cumin aliases for wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/621067 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [19:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:44] T257906: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 [19:21:26] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (add it to varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10Zache) So it would be just to add `wikilovesmonuments.org` to regexp below? ` if (req.http.Host == "<%= @vcl... [19:26:10] 10Operations, 10Maps, 10Traffic, 10Wiki-Loves-Monuments (2020): maps.wikilovesmonuments.org returns a HTTP 429 error (let it access varnish maps_domains) - https://phabricator.wikimedia.org/T260520 (10Dzahn) [19:33:41] (03PS1) 10Dzahn: varnish: add wikimedialovesmonuments to domains allowed to access maps servers [puppet] - 10https://gerrit.wikimedia.org/r/621342 (https://phabricator.wikimedia.org/T260520) [19:38:22] (03PS2) 10Dzahn: varnish: add wikilovesmonuments.org to domains allowed to access maps servers [puppet] - 10https://gerrit.wikimedia.org/r/621342 (https://phabricator.wikimedia.org/T260520) [19:43:19] !log restart mjolnir-kafka-bulk-daemon on search-loader2001 with debug logging [19:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:04] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24590/" [puppet] - 10https://gerrit.wikimedia.org/r/621338 (owner: 10Dzahn) [19:44:10] (03PS3) 10Dzahn: testreduce::server: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/621338 [19:56:14] (03CR) 10CDanis: [C: 03+1] "If you have the time, please also (as a followup, once this is rolling out) also update 
modules/varnish/files/tests/upload/21-maps.vtc and" [puppet] - 10https://gerrit.wikimedia.org/r/621342 (https://phabricator.wikimedia.org/T260520) (owner: 10Dzahn) [19:56:53] (03PS1) 10Bstorm: cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) [19:59:55] (03PS2) 10Dzahn: decom releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) [20:00:04] halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200819T2000). [20:00:50] (03CR) 10jerkins-bot: [V: 04-1] decom releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) (owner: 10Dzahn) [20:04:10] (03CR) 10BryanDavis: [C: 03+1] "Not sure that giving a +1 to a patch that grants me new hats is a good thing. :)" [puppet] - 10https://gerrit.wikimedia.org/r/621343 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [20:06:08] (03PS1) 10Andrew Bogott: wmcs-openstack.sh: don't overwrite things already set in the environment [puppet] - 10https://gerrit.wikimedia.org/r/621344 [20:06:30] (03CR) 10jerkins-bot: [V: 04-1] wmcs-openstack.sh: don't overwrite things already set in the environment [puppet] - 10https://gerrit.wikimedia.org/r/621344 (owner: 10Andrew Bogott) [20:08:11] (03PS2) 10Andrew Bogott: wmcs-openstack.sh: don't overwrite things already set in the environment [puppet] - 10https://gerrit.wikimedia.org/r/621344 [20:09:35] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-openstack.sh: don't overwrite things already set in the environment [puppet] - 10https://gerrit.wikimedia.org/r/621344 (owner: 10Andrew Bogott) [20:09:56] (03PS1) 10Ashot1997: Enable VisualEditor in namespaces Draft and Wikiproject on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621346 
(https://phabricator.wikimedia.org/T260825) [20:12:01] (03PS1) 10Dzahn: varnish: add wikilovesmonuments to tests for maps access [puppet] - 10https://gerrit.wikimedia.org/r/621347 (https://phabricator.wikimedia.org/T260520) [20:14:42] (03PS3) 10Dzahn: decom releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/621090 (https://phabricator.wikimedia.org/T260742) [20:18:41] 10Operations, 10ops-eqiad, 10DC-Ops: Check samarium status in Netbox - https://phabricator.wikimedia.org/T260772 (10wiki_willy) a:03Jclark-ctr [20:19:36] 10Operations, 10ops-eqiad, 10DC-Ops: Check samarium status in Netbox - https://phabricator.wikimedia.org/T260772 (10wiki_willy) Here's the Netbox error: https://netbox.wikimedia.org/extras/reports/coherence.Rack/ [20:29:53] (03CR) 10BryanDavis: wikireplicas: add wikireplica cookbook to add a wiki (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/621088 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [20:39:14] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@2ef1af7]: Deploy fixes for notifications and OOM prevention (T259167) [20:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:18] T259167: Truncated ArcLamp output files - https://phabricator.wikimedia.org/T259167 [20:39:20] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@2ef1af7]: Deploy fixes for notifications and OOM prevention (T259167) (duration: 00m 06s) [20:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:42] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 60.74 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [21:17:34] !log cdanis@cumin1001 START - Cookbook sre.network.cf [21:17:34] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [21:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:38] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:13] (03CR) 10CDanis: varnish: add wikilovesmonuments to tests for maps access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621347 (https://phabricator.wikimedia.org/T260520) (owner: 10Dzahn) [22:16:10] (03PS1) 10Bstorm: dumps: Start blocking by useragent [puppet] - 10https://gerrit.wikimedia.org/r/621359 [22:18:55] (03CR) 10Bstorm: "That did something https://puppet-compiler.wmflabs.org/compiler1002/24591/labstore1006.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/621359 (owner: 10Bstorm) [22:18:57] (03CR) 10CDanis: [C: 03+1] dumps: Start blocking by useragent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621359 (owner: 10Bstorm) [22:19:32] (03CR) 10Bstorm: [C: 03+2] dumps: Start blocking by useragent [puppet] - 10https://gerrit.wikimedia.org/r/621359 (owner: 10Bstorm) [22:20:23] (03CR) 10Bstorm: [C: 03+2] dumps: Start blocking by useragent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/621359 (owner: 10Bstorm) [22:23:02] (03PS1) 10Bstorm: Revert "dumps: Start blocking by useragent" [puppet] - 10https://gerrit.wikimedia.org/r/621241 [22:23:21] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10hashar) [22:23:25] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 4.625 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [22:24:39] (03CR) 10Bstorm: [C: 03+2] Revert "dumps: Start blocking by useragent" [puppet] - 10https://gerrit.wikimedia.org/r/621241 (owner: 10Bstorm) [22:24:58] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10hashar) I am guessing this task is purely for tracking purpose. 
In case you are seeking any blessing, given releases.wm.o and the releases-jenkins.wm.o are properly working out o... [22:29:19] (03PS1) 10Bstorm: dumps: Start blocking by useragent [puppet] - 10https://gerrit.wikimedia.org/r/621360 [22:30:11] (03CR) 10CDanis: [C: 03+1] dumps: Start blocking by useragent [puppet] - 10https://gerrit.wikimedia.org/r/621360 (owner: 10Bstorm) [22:30:42] (03CR) 10Bstorm: [C: 03+2] dumps: Start blocking by useragent [puppet] - 10https://gerrit.wikimedia.org/r/621360 (owner: 10Bstorm) [22:52:32] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:58:27] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200819T2300). [23:00:05] Ashot1997: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:08] I can deploy today! [23:01:19] Ashot1997: hi, are you around? [23:01:39] Yes, hi! 
[23:01:58] (03PS2) 10Urbanecm: Enable VisualEditor in namespaces Draft and Wikiproject on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621346 (https://phabricator.wikimedia.org/T260825) (owner: 10Ashot1997) [23:02:16] (03CR) 10Urbanecm: [C: 03+2] Enable VisualEditor in namespaces Draft and Wikiproject on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621346 (https://phabricator.wikimedia.org/T260825) (owner: 10Ashot1997) [23:03:03] (03Merged) 10jenkins-bot: Enable VisualEditor in namespaces Draft and Wikiproject on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621346 (https://phabricator.wikimedia.org/T260825) (owner: 10Ashot1997) [23:03:18] Ashot1997: cool! If I can ask you, could you please say something like "present", "here" or "around" when jouncebot pings you? That would indicate to me you're around, and would save some time :). Thanks! [23:03:31] 10Operations, 10serviceops, 10Patch-For-Review: decom releases1001 and releases2001 - https://phabricator.wikimedia.org/T260742 (10Dzahn) >>! In T260742#6398400, @hashar wrote: > I am guessing this task is purely for tracking purpose. In case you are seeking any blessing, given releases.wm.o and the releases... [23:03:49] Sure [23:03:55] thank you! [23:04:19] Ashot1997: pulled onto mwdebug1001, could you test, please? [23:04:29] now [23:04:45] yup, the change should be there Ashot1997 :) [23:06:26] it works [23:06:46] I think you can sync it [23:06:57] thanks, syncing! 
[23:08:19] Thank you very much ^_^ [23:08:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a80899948c26ca36b970b80fbad07600fe4ce92c: Enable VisualEditor in namespaces Draft and Wikiproject on hywiki (T260825) (duration: 01m 05s) [23:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:58] T260825: Enable VisualEditor in namespaces Draft and Wikiproject on Armenian Wikipedia - https://phabricator.wikimedia.org/T260825 [23:09:02] Ashot1997: should be live! [23:10:34] Yes :) I don't know why the "Edit" button appears a second late but it works fine :D [23:11:02] No it is okay now ^_^ [23:11:32] cool! [23:11:38] happy to hear that [23:13:40] (03PS1) 10Dzahn: graphite: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621364 [23:15:57] (03PS1) 10Dzahn: apt: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621365 [23:20:16] !log Evening B&C window closed [23:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:24] (03PS1) 10Dzahn: otrs: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621368 [23:32:18] (03PS1) 10Dzahn: openstack::cumin: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621369 [23:37:19] (03PS1) 10Dzahn: tendril: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621370 [23:40:04] (03PS1) 10Dzahn: ldap: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621372 [23:44:15] (03PS1) 10Dzahn: mediawiki::fonts: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621374 [23:44:17] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:45:10] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::fonts: remove jessie support 
[puppet] - 10https://gerrit.wikimedia.org/r/621374 (owner: 10Dzahn) [23:47:38] (03PS2) 10Dzahn: varnish: add wikilovesmonuments to tests for maps access [puppet] - 10https://gerrit.wikimedia.org/r/621347 (https://phabricator.wikimedia.org/T260520) [23:54:47] (03PS2) 10Dzahn: mediawiki::fonts: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621374 [23:56:15] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas