[12:24:08] (CR) ArielGlenn: [C: +2] get revision info from stubs file and use to generate page range info [dumps] - https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) (owner: ArielGlenn)
[12:24:10] Test message
[12:24:30] (Merged) jenkins-bot: get revision info from stubs file and use to generate page range info [dumps] - https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) (owner: ArielGlenn)
[12:24:31] volans: it's logging and back fine now
[12:25:19] (PS2) ArielGlenn: update sql/xml dump config settings for stashing revision info [puppet] - https://gerrit.wikimedia.org/r/636026 (https://phabricator.wikimedia.org/T263319)
[12:25:26] Spookreeeno: thanks!
[12:25:38] No problem
[12:26:05] Might be worth when you find one of the 3 I mentioned asking them to add a few more to the admin list
[12:26:37] (CR) ArielGlenn: [C: +2] update sql/xml dump config settings for stashing revision info [puppet] - https://gerrit.wikimedia.org/r/636026 (https://phabricator.wikimedia.org/T263319) (owner: ArielGlenn)
[12:29:38] Operations, DBA, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (Marostegui) Thanks @RLazarus. pc1 is a bit different than the rest, as it has 2 hosts rather than the normal pc1008->pc2008 (pc2) or pc1009 -> pc2009. pc1 has the follo...
[12:31:08] (CR) ArielGlenn: [C: +2] Stash list of known tables once per run per wiki and re-use [dumps] - https://gerrit.wikimedia.org/r/636013 (https://phabricator.wikimedia.org/T266333) (owner: ArielGlenn)
[12:31:36] (Merged) jenkins-bot: Stash list of known tables once per run per wiki and re-use [dumps] - https://gerrit.wikimedia.org/r/636013 (https://phabricator.wikimedia.org/T266333) (owner: ArielGlenn)
[12:33:54] !log ariel@deploy1001 Started deploy [dumps/dumps@4ed2cb9]: revinfo for page content jobs, tableinfo for list of known tables
[12:33:58] !log ariel@deploy1001 Finished deploy [dumps/dumps@4ed2cb9]: revinfo for page content jobs, tableinfo for list of known tables (duration: 00m 05s)
[12:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:00] !log restart idp-test
[12:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:19] !log upgrade idp-test* hosts to latest Java securiy updates
[12:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:47] Operations: Updated java security policy in OpenJDK 11.9 - https://phabricator.wikimedia.org/T266782 (MoritzMuehlenhoff)
[12:43:57] (CR) Marostegui: [C: +2] orchestrator.conf: Add query to detect alias [puppet] - https://gerrit.wikimedia.org/r/636617 (https://phabricator.wikimedia.org/T266485) (owner: Marostegui)
[12:44:31] !log Deploy grants for cluster alias on pc1 T266485
[12:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:37] T266485: Populating orchestrator metadata on a per-server basis - https://phabricator.wikimedia.org/T266485
[12:47:20] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:40] PROBLEM - DPKG on idp-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:51:06] (PS1) Marostegui: orchestrator.conf.json: Fix missing coma [puppet] - https://gerrit.wikimedia.org/r/637462
[12:51:50] (CR) Kormat: [C: +1] orchestrator.conf.json: Fix missing coma [puppet] - https://gerrit.wikimedia.org/r/637462 (owner: Marostegui)
[12:51:57] (CR) Marostegui: [C: +2] orchestrator.conf.json: Fix missing coma [puppet] - https://gerrit.wikimedia.org/r/637462 (owner: Marostegui)
[12:55:24] !log Deploy orchestrator grants on pc2 T266485
[12:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:30] T266485: Populating orchestrator metadata on a per-server basis - https://phabricator.wikimedia.org/T266485
[12:56:43] !log Make orchestrator discover pc2 T266485
[12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:15] (PS1) JMeybohm: aptrepo: add component for future kubernetes packages [puppet] - https://gerrit.wikimedia.org/r/637463 (https://phabricator.wikimedia.org/T266766)
[13:03:07] Operations: move tunnelencabulator's repo to a Wikimedia-owned space - https://phabricator.wikimedia.org/T266783 (CDanis)
[13:03:13] Operations: move tunnelencabulator's repo to a Wikimedia-owned space - https://phabricator.wikimedia.org/T266783 (CDanis)
[13:03:15] Operations: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (CDanis)
[13:03:40] Operations: distribute tunnelencabulator in wmf-sre-laptop - https://phabricator.wikimedia.org/T266784 (CDanis)
[13:04:20] (PS1) Muehlenhoff: Update puppetised java.security file from 11.9 [puppet] - https://gerrit.wikimedia.org/r/637464 (https://phabricator.wikimedia.org/T266782)
[13:04:30] RECOVERY - Check systemd state on
ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:26] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/637463 (https://phabricator.wikimedia.org/T266766) (owner: JMeybohm)
[13:08:31] (CR) Giuseppe Lavagetto: [C: +2] Add --force flag to safe-service-restart.py [puppet] - https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) (owner: Ahmon Dancy)
[13:10:42] (CR) Jbond: "testing" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:10:49] (CR) Jbond: "testing:" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:10:59] Operations, DBA, Datacenter-Switchover: When switching DCs, update pc hosts in tendril - https://phabricator.wikimedia.org/T266723 (jcrespo) If I can provide more background, unless normal circumstances, pc* hosts are active-active, and no change should happen on them (no read only changes, etc.). Th...
[13:12:36] (CR) Jbond: "test foo" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:12:56] (CR) Jbond: "PCC Started: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26210" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:14:01] (CR) Jbond: "PCC Started: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26212" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:18:12] (CR) Arturo Borrero Gonzalez: [C: +1] toolforge-k8s: Let hiera set the replicas value for ingress controllers [puppet] - https://gerrit.wikimedia.org/r/637072 (https://phabricator.wikimedia.org/T266506) (owner: Bstorm)
[13:20:14] RECOVERY - DPKG on idp-test2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:20:42] (CR) Herron: [C: +1] "LGTM!
https://puppet-compiler.wmflabs.org/compiler1001/26211/" [puppet] - https://gerrit.wikimedia.org/r/636362 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[13:21:55] !log installing bluez security updates on stretch
[13:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:14] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session updateVarDumps at mwmaint2001 (wiki=idwiki; T246539)
[13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:19] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539
[13:25:20] !log Correction: Obviously 1002 (T246539)
[13:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:41] (CR) Jbond: "PCC Started: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26213" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:25:48] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26213" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:27:22] (Abandoned) Arturo Borrero Gonzalez: wmcs: kubeadm: introduce different haproxy port frontend/backend [puppet] - https://gerrit.wikimedia.org/r/604782 (https://phabricator.wikimedia.org/T195217) (owner: Arturo Borrero Gonzalez)
[13:28:04] (PS5) Giuseppe Lavagetto: safe-service-restart: add optional poolcounter support [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055)
[13:29:32] !log staggered restart of gdnsd on dns[12345]002 (1/2 recursors in each DC) - T266746
[13:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:38] T266746: TCP traffic increase for DNS over TLS
breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746
[13:30:17] (PS2) Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652
[13:30:45] (CR) jerkins-bot: [V: -1] pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652 (owner: Jbond)
[13:31:26] (PS1) Muehlenhoff: Add library hint for bluez [puppet] - https://gerrit.wikimedia.org/r/637480
[13:33:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdnsrec site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:34:13] (CR) Muehlenhoff: [C: +2] Add library hint for bluez [puppet] - https://gerrit.wikimedia.org/r/637480 (owner: Muehlenhoff)
[13:34:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:34:54] (CR) Jbond: [V: +1] "PCC Started: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26214" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:35:02] (CR) Jbond: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26214" [puppet] - https://gerrit.wikimedia.org/r/636641 (owner: Jbond)
[13:35:56] (PS3) Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652
[13:36:22] (CR) jerkins-bot: [V: -1] pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652 (owner: Jbond)
[13:38:03] !log staggered restart of gdnsd on dns[12345]001 (1/2 recursors in each DC) - T266746
[13:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:10] T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746
[13:41:16] (CR) Elukey: [C: +2] profile::java: add support for Jessie [puppet] - https://gerrit.wikimedia.org/r/636865 (owner: Elukey)
[13:43:05] Operations, Wikidata, Wikidata Query UI, User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (BBlack) We can route different URI subspaces differently at the edge layer, based on URI regexes, as shown here for the split of the API namespace of the primary wiki si...
[13:45:16] (PS4) Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652
[13:45:31] (CR) Elukey: [C: +2] zookeeper: use profile::java [puppet] - https://gerrit.wikimedia.org/r/636864 (https://phabricator.wikimedia.org/T264176) (owner: Elukey)
[13:46:06] !log authdns2001 - restart gdnsd - T266746
[13:46:09] I am disabling puppet on all zookeeper nodes for --^
[13:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:12] T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746
[13:52:04] !log authdns1001 - restart gdnsd - T266746
[13:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:10] T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746
[13:53:53] !log installing Java 11 security updates
[13:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:42] !log roll out profile::java on all zookeeper instances
[13:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:20] (CR) Jbond: "Ready for review" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636652 (owner: Jbond)
[13:57:27] (Abandoned) Giuseppe Lavagetto: Switch restbase to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[13:58:15] (PS1) Elukey: java::cacert: add support for java 7 [puppet] - https://gerrit.wikimedia.org/r/637490
[13:58:24] jbond42: for your amusement --^
[13:58:41] :D
[13:59:15] (I have a failure on conf2001, jessie node)
[14:01:00] lol
[14:01:17] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/26215/conf2001.codfw.wmnet/index.html" [puppet] -
https://gerrit.wikimedia.org/r/637490 (owner: Elukey)
[14:01:33] Operations, SRE-Access-Requests: Requesting access to production shell groups for DNdubane - https://phabricator.wikimedia.org/T266791 (DNdubane_WMF)
[14:02:00] (CR) Muehlenhoff: [C: +1] "Looks "good"" [puppet] - https://gerrit.wikimedia.org/r/637490 (owner: Elukey)
[14:02:13] ahahahha
[14:02:21] (CR) Elukey: [C: +2] java::cacert: add support for java 7 [puppet] - https://gerrit.wikimedia.org/r/637490 (owner: Elukey)
[14:04:00] (CR) Jbond: java::cacert: add support for java 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637490 (owner: Elukey)
[14:04:00] <_joe_> jouncebot: next
[14:04:00] In 1 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201029T1600)
[14:04:46] * elukey wants to know all jbond42's puppet tricks
[14:04:49] :D
[14:04:57] :D lol
[14:05:16] that was way better!
[14:05:21] <_joe_> elukey: it's a secret technique
[14:05:28] <_joe_> it's called "read the manual"
[14:05:36] <_joe_> I hope jbond42 can teach it
[14:05:45] * elukey falling and crawiling after an headshot
[14:05:49] <_joe_> well if we deem you worthy
[14:05:59] <_joe_> :*
[14:07:03] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Papaul) This is taking place today and no update yet from service owners.
[14:08:28] !log bump FS for prometheus codfw global instance
[14:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:17] Operations, Analytics-Clusters, Analytics-Kanban: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (elukey) Change rolled out!
[14:09:27] (PS3) Filippo Giunchedi: grafana: make grafana-rw dashboards link work for anonymous users [puppet] - https://gerrit.wikimedia.org/r/636927 (https://phabricator.wikimedia.org/T265712)
[14:09:46] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Joe) As far as mc2029 is concerned, you can just proceed without any impact. Not sure about sessionstore2002. @hnowlan @Eevans can you please adv...
[14:10:29] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Papaul) @Joe thanks
[14:11:16] (CR) Filippo Giunchedi: [C: +2] grafana: make grafana-rw dashboards link work for anonymous users [puppet] - https://gerrit.wikimedia.org/r/636927 (https://phabricator.wikimedia.org/T265712) (owner: Filippo Giunchedi)
[14:19:25] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (hnowlan) All sessionstore2002 will need is a drain before the host is to be moved - I will be on hand for this.
[14:19:27] Operations, Traffic: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (BBlack) All the authdns are restarted with the infinite limit applied. There's been some IRC discussion about a few possible spinoff tickets...
[14:20:38] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission
[14:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:18] (CR) Giuseppe Lavagetto: Add apache httpd base image (8 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[14:24:19] Operations, Wikidata, Wikidata Query UI, User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (Addshore) >>! In T266702#6588396, @BBlack wrote: > We can route different URI subspaces differently at the edge layer, based on URI regexes, as shown here for the split...
[14:24:22] !log restart zookeeper on an-conf1001 for openjdk upgrades
[14:24:22] (PS5) Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324)
[14:24:24] (PS3) Giuseppe Lavagetto: Add an httpd-fcgi image [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324)
[14:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:01] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[14:25:02] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:56] (CR) Ppchelko: [C: +2] JobQueue: Increase concurrency for cdnPurge jobs.
[deployment-charts] - https://gerrit.wikimedia.org/r/637031 (owner: Ppchelko)
[14:26:48] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:26:52] Operations, Patch-For-Review: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `ldap-eqiad-replica01.wikimedia.org` - ldap-eqiad-replica01.wikimedia.org (**PASS**) - Downtimed host on I...
[14:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:13] (PS1) Urbanecm: Set wgDLPQueryCacheTime to 120 at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/637499 (https://phabricator.wikimedia.org/T263220)
[14:29:15] (PS1) Muehlenhoff: Remove ldap-eqiad-replica0[12] from acmechief config [puppet] - https://gerrit.wikimedia.org/r/637500 (https://phabricator.wikimedia.org/T264388)
[14:29:15] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission
[14:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:41] (Merged) jenkins-bot: JobQueue: Increase concurrency for cdnPurge jobs. [deployment-charts] - https://gerrit.wikimedia.org/r/637031 (owner: Ppchelko)
[14:33:23] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:27] Operations, Patch-For-Review: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `ldap-eqiad-replica02.wikimedia.org` - ldap-eqiad-replica02.wikimedia.org (**PASS**) - Downtimed host on I...
[14:34:07] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[14:34:11] (PS1) Elukey: Review settings for the new Druid test cluster [puppet] - https://gerrit.wikimedia.org/r/637502 (https://phabricator.wikimedia.org/T255139)
[14:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:05] (PS1) Muehlenhoff: Remove Puppet references for ldap-eqiad* [puppet] - https://gerrit.wikimedia.org/r/637503 (https://phabricator.wikimedia.org/T264388)
[14:35:09] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:06] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[14:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:22] (CR) Elukey: [C: +2] Review settings for the new Druid test cluster [puppet] - https://gerrit.wikimedia.org/r/637502 (https://phabricator.wikimedia.org/T255139) (owner: Elukey)
[14:38:24] (CR) JMeybohm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/634934 (owner: Giuseppe Lavagetto)
[14:39:09] (PS2) Muehlenhoff: Remove Puppet references for ldap-eqiad* [puppet] - https://gerrit.wikimedia.org/r/637503 (https://phabricator.wikimedia.org/T264388)
[14:40:03] Operations: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (MoritzMuehlenhoff)
[14:40:05] Operations, Analytics-Clusters, Analytics-Kanban: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (MoritzMuehlenhoff) Open→Resolved
[14:43:56] (CR) RLazarus: safe-service-restart: add optional poolcounter support (5 comments) [puppet] - https://gerrit.wikimedia.org/r/635991 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[14:45:17] (PS1) Klausman: install-server: Add DHCP entry for
an-test-druid1001 [puppet] - https://gerrit.wikimedia.org/r/637504 (https://phabricator.wikimedia.org/T266771)
[14:48:51] (CR) Muehlenhoff: [C: +2] Remove Puppet references for ldap-eqiad* [puppet] - https://gerrit.wikimedia.org/r/637503 (https://phabricator.wikimedia.org/T264388) (owner: Muehlenhoff)
[14:49:05] Operations, Wikidata, Wikidata Query UI, User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (Addshore) From #wikimedia-traffic > 2:21 PM addshore: https://phabricator.wikimedia.org/T266702#6588396 > 2:23 PM bblack: <3 ty > 2:42 PM ad...
[14:54:44] (PS2) Klausman: install-server: Add DHCP entry for an-test-druid1001 [puppet] - https://gerrit.wikimedia.org/r/637504 (https://phabricator.wikimedia.org/T266771)
[14:56:32] (CR) Giuseppe Lavagetto: [C: +2] Switch restbase calls to be channeled via envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/634934 (owner: Giuseppe Lavagetto)
[14:56:41] <_joe_> jayme: ok let's see if we manage to break anything
[14:57:13] <_joe_> I'll deploy first on mwdebug, then on one appserver to check for restbase calls failing
[14:57:27] <_joe_> then on all machines
[14:58:29] (PS1) Urbanecm: Attempt to add a query cache to DPL [extensions/intersection] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/637059 (https://phabricator.wikimedia.org/T262391)
[14:59:16] !log poweroff sessionstore2002 for relocation
[14:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:47] _joe_: are you deploying right now?
[15:00:07] I've a fix for an UBN task (T263220)
[15:00:08] T263220: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220
[15:00:34] <_joe_> Urbanecm: yeah I wanted to
[15:00:43] <_joe_> I've merged already sorry
[15:00:49] _joe_: no problem, seems to be a config patch anyway
[15:00:53] (CR) Klausman: [C: +2] install-server: Add DHCP entry for an-test-druid1001 [puppet] - https://gerrit.wikimedia.org/r/637504 (https://phabricator.wikimedia.org/T266771) (owner: Klausman)
[15:00:54] <_joe_> yes
[15:00:59] (Merged) jenkins-bot: Switch restbase calls to be channeled via envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/634934 (owner: Giuseppe Lavagetto)
[15:01:05] I'll +2 my backport, it shouldn't affect you
[15:01:07] <_joe_> ok lemme start now
[15:01:08] <_joe_> yes
[15:01:15] <_joe_> let's just sync before deploying
[15:01:15] (PS3) Klausman: install-server: Add DHCP entry for an-test-druid1001 [puppet] - https://gerrit.wikimedia.org/r/637504 (https://phabricator.wikimedia.org/T266771)
[15:01:23] (CR) Urbanecm: [C: +2] "fix for an UBN task: T263220" [extensions/intersection] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/637059 (https://phabricator.wikimedia.org/T262391) (owner: Urbanecm)
[15:01:27] (CR) Klausman: [V: +2 C: +2] install-server: Add DHCP entry for an-test-druid1001 [puppet] - https://gerrit.wikimedia.org/r/637504 (https://phabricator.wikimedia.org/T266771) (owner: Klausman)
[15:02:19] <_joe_> I'm pulling to mwdebug1001
[15:04:07] PROBLEM - Host sessionstore2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:05:19] <_joe_> I'm now pulling to mw1331
[15:06:36] !log rolling restart of ATS to upgrade to trafficserver 8.0.8-1wm3 - T265911
[15:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:42] T265911: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911
[15:08:59] <_joe_> ok Urbanecm I'm
~ done
[15:09:06] thanks
[15:09:35] RECOVERY - Host sessionstore2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.86 ms
[15:09:42] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: Switch restbase to use envoy, https (duration: 00m 57s)
[15:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:42] (CR) Urbanecm: [C: +2] "UBN fix: noop, used by 1ce83f59133723d0580ff68091cb719f5fd1fdbc (soon to be deployed)" [mediawiki-config] - https://gerrit.wikimedia.org/r/637499 (https://phabricator.wikimedia.org/T263220) (owner: Urbanecm)
[15:10:57] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Papaul) @hnowlan sessionstore2002 has been move and back up online. All yours. Thanks
[15:12:00] (PS1) Andrew Bogott: puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp [puppet] - https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793)
[15:12:52] Operations, Traffic: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (Vgutierrez) p:High→Medium
[15:12:55] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Papaul)
[15:12:57] (PS1) Klausman: site: add an-test-druid1001 in setup stage [puppet] - https://gerrit.wikimedia.org/r/637511
[15:13:07] (Merged) jenkins-bot: Set wgDLPQueryCacheTime to 120 at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/637499 (https://phabricator.wikimedia.org/T263220) (owner: Urbanecm)
[15:13:33] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (hnowlan) sessionstore2002 looks
good on my end, thanks!
[15:14:25] (CR) Klausman: [C: +2] site: add an-test-druid1001 in setup stage [puppet] - https://gerrit.wikimedia.org/r/637511 (owner: Klausman)
[15:15:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 19c5aff02c20812c56b8abdcc0ed530393010193: Set wgDLPQueryCacheTime to 120 at all wikis (T263220) (duration: 00m 59s)
[15:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:08] T263220: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220
[15:15:22] Operations, ops-codfw, serviceops, User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (Papaul)
[15:15:53] (CR) Cwhite: [C: +2] Accurately evaluate request results and add response headroom [debs/prometheus-swagger-exporter] - https://gerrit.wikimedia.org/r/637071 (owner: Cwhite)
[15:16:25] !log poweroff mc2029 for relocation
[15:16:25] (CR) Bstorm: [C: +2] toolforge: script to make long-running processes on bastions less good [puppet] - https://gerrit.wikimedia.org/r/635888 (https://phabricator.wikimedia.org/T266300) (owner: Bstorm)
[15:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:46] (PS2) Andrew Bogott: cloud-vps instances: include bsd-mailx on all hosts [puppet] - https://gerrit.wikimedia.org/r/637018
[15:16:48] (PS2) Andrew Bogott: puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp [puppet] - https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793)
[15:16:50] (PS1) Andrew Bogott: cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - https://gerrit.wikimedia.org/r/637513
[15:17:19] (Merged) jenkins-bot: Attempt to add a query cache to DPL [extensions/intersection] (wmf/1.36.0-wmf.14) - https://gerrit.wikimedia.org/r/637059
(https://phabricator.wikimedia.org/T262391) (owner: Urbanecm)
[15:17:48] marostegui: jynus: fyi: I'm trying to fix the DPL issue now.
[15:18:00] (PS3) Andrew Bogott: puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp [puppet] - https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793)
[15:18:02] (PS2) Andrew Bogott: cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - https://gerrit.wikimedia.org/r/637513
[15:18:11] Urbanecm: thank you very much!
[15:20:03] (CR) Bstorm: puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793) (owner: Andrew Bogott)
[15:21:35] (PS4) Andrew Bogott: puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp [puppet] - https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793)
[15:21:37] (PS3) Andrew Bogott: cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - https://gerrit.wikimedia.org/r/637513
[15:21:54] (CR) Andrew Bogott: [C: +2] cloud-vps instances: include bsd-mailx on all hosts [puppet] - https://gerrit.wikimedia.org/r/637018 (owner: Andrew Bogott)
[15:22:00] doesn't break DPL at mwdebug, syncing to all fleet
[15:22:42] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/intersection/: 483c3bceb926ac6a2cfc40112fb9b4f0671fef72: Attempt to add a query cache to DPL (T263220) (duration: 00m 58s)
[15:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:48] T263220: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220
[15:22:55] !log installing bacula updates from Buster point release
[15:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:51] jynus: marostegui: The patch is live.
Hopefully that sufficiently decreases the number of DPL queries. [15:24:46] 10Operations, 10SRE-Access-Requests: New prod ssh key for calbon - https://phabricator.wikimedia.org/T266498 (10ema) @calbon: please let me know if you now have access and we can close this. Thanks! [15:25:00] (03CR) 10Bstorm: [C: 03+1] "Unless you have a great idea about exception handling for that file, this should be much better!" [puppet] - 10https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793) (owner: 10Andrew Bogott) [15:25:33] PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1029 site=eqiad tunnel=mc2029_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:25:45] ^ that is fine [15:25:50] I will downtime [15:26:09] urm, I mean I will ack [15:26:34] <_joe_> jayme: uhm something's not right with the move of restbase for mediawiki to envoy [15:26:46] effie: mc2029 should be coming up soon [15:26:55] <_joe_> https://grafana.wikimedia.org/d/7mUxtYVGk/jayme-ipvs_backend_connections?viewPanel=5&orgId=1&var-datasource=thanos&var-port=7231&var-port=7443&var-address=All&from=now-6h&to=now [15:26:58] <_joe_> I [15:27:02] great, thank you papaul ! 
[15:27:05] <_joe_> I'll revert :/ [15:27:26] (03PS5) 10Andrew Bogott: puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp [puppet] - 10https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793) [15:27:29] (03PS4) 10Andrew Bogott: cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - 10https://gerrit.wikimedia.org/r/637513 [15:27:43] (03PS1) 10Giuseppe Lavagetto: Revert "Switch restbase calls to be channeled via envoy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637061 [15:27:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Switch restbase calls to be channeled via envoy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637061 (owner: 10Giuseppe Lavagetto) [15:28:05] <_joe_> Urbanecm: are you done with your patch? [15:28:10] _joe_: yes, go ahead [15:28:43] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1029 site=eqiad tunnel=mc2029_v4 Effie Mouzeli T266577 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:28:59] _joe_: uh...that's a lot [15:29:01] RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:29:15] <_joe_> jayme: 105 established connections per appserver, more or less [15:29:35] <_joe_> netstat -tunap | fgrep 10.2.2.17 | wc -l [15:29:46] <_joe_> why is this happening this way, I have no idea [15:29:51] <_joe_> we'll need to run more tests [15:31:20] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) [15:31:58] (03PS1) 10Klausman: install_server: fix wrong TLD on an-test-druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/637516 [15:32:19] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 - https://phabricator.wikimedia.org/T266577 (10Papaul) 05Open→03Resolved This is complete. Thanks to all [15:32:25] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:32:43] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "Switch restbase calls to be channeled via envoy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637061 (owner: 10Giuseppe Lavagetto) [15:32:43] _joe_: but cxserver was fine? 
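Note on the `netstat -tunap | fgrep 10.2.2.17 | wc -l` check quoted above: a plain substring match also counts sockets in any state and addresses such as 10.2.2.170. A minimal sketch of a stricter count follows, assuming the usual Linux `netstat -tunap` column order (proto, recv-q, send-q, local address, foreign address, state). The function name is hypothetical and this is not a tool used in the channel.

```python
def count_connections(netstat_output, remote_ip, state="ESTABLISHED"):
    """Count TCP sockets whose foreign address is exactly remote_ip.

    Assumes `netstat -tunap` style text: proto, recv-q, send-q,
    local address, foreign address, state, then optional pid/program.
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP sockets (no state column) and short lines.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        # Drop the port so 10.2.2.17 does not also match 10.2.2.170.
        foreign_host = fields[4].rsplit(":", 1)[0]
        if foreign_host == remote_ip and fields[5] == state:
            count += 1
    return count
```

On a host this could be fed from `subprocess.run(["netstat", "-tunap"], capture_output=True, text=True).stdout`; the per-appserver figure of roughly 105 mentioned above would then be the return value for `remote_ip="10.2.2.17"`.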
[15:32:53] <_joe_> jayme: later [15:32:58] (03CR) 10Klausman: [C: 03+2] install_server: fix wrong TLD on an-test-druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/637516 (owner: 10Klausman) [15:33:02] ok [15:33:02] <_joe_> but yes, this is specific to restbase, we need to figure it out [15:33:58] <_joe_> I suspect that pointing envoy to the http endpoint would not cause this [15:34:09] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:34:20] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [15:34:33] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: Revert: switch restbase to use envoy, https (duration: 00m 57s) [15:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:56] 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) [15:39:51] godog: FYI I got VO alert just now [15:40:06] yeah same here volans (in a meeting) [15:40:10] 5.5 minutes after the IRC one [15:40:17] ditto [15:40:22] 5m lag is not good [15:40:25] <_joe_> everyone [15:40:34] we sould investigate if it was on our side or their side [15:41:01] <_joe_> should we maybe first fix maps? [15:41:11] <_joe_> dunno, maybe it's tangentially relevant [15:41:48] maybe.. 
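Note on the "5.5 minutes" figure above: it matches the gap between the Kartotherian page on IRC at 15:34:20 and the first VO report in channel at 15:39:51. A small sketch of that arithmetic, assuming both timestamps fall on the same day; the helper name is hypothetical.

```python
from datetime import datetime


def lag_seconds(first, second):
    """Seconds between two same-day HH:MM:SS timestamps, as used in this log."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())


# IRC page at 15:34:20, VO alert reported at 15:39:51.
lag = lag_seconds("15:34:20", "15:39:51")
print(f"{lag} s = {lag / 60:.1f} min")
```

The second timestamp is when the alert was mentioned in channel, not when VictorOps sent it, so the real delay may be slightly smaller.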
[15:41:54] kartotherian alert is probably related to my work [15:41:57] <_joe_> [2020-10-29T15:41:41.610Z] ERROR: kartotherian/8296 on maps1003: Bad geojson - unknown type ExternalData (err.levelPath=error) [15:41:58] not certain [15:42:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:42:07] <_joe_> hnowlan: ok should we at least depool eqiad? [15:42:12] <_joe_> because maps are down [15:42:21] IIRC there was some reload of osm db those days [15:42:43] huh, that is strange - nothing has changed in maps since yesterday but an import is still underway [15:42:44] hnowlan: was it you yesterday? [15:42:48] volans: yes [15:43:19] huh, that is strange - nothing has changed in maps since yesterday but an import is still underway [15:46:25] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:46:40] <_joe_> hnowlan: are you trying to debug the problem? [15:46:49] yes [15:47:03] 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10Cmjohnson) @elukey I have the 2 DIMM on-site. Does this need to be scheduled? If so can we schedule this for Tuesday 3 November? If not, let me know if I can take it down anytime. [15:47:06] 10Operations, 10Traffic: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez `vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'apt-cache policy trafficserver|grep Installed' 72 hosts will be targeted: cp[2027-2042].cod... 
[15:47:17] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:48:09] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:05] 10Operations, 10SRE-Access-Requests: New prod ssh key for calbon - https://phabricator.wikimedia.org/T266498 (10calbon) yep it works, thanks all! [15:50:44] 10Operations, 10SRE-Access-Requests: New prod ssh key for calbon - https://phabricator.wikimedia.org/T266498 (10ema) 05Open→03Resolved [15:53:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_kartotherian_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:55:51] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:57:35] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.445 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:57:43] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:58:20] (03CR) 10Andrew Bogott: [C: 03+2] puppet_alert.py: don't rely on last_run_summary.yaml for last success timestamp [puppet] - 10https://gerrit.wikimedia.org/r/637510 (https://phabricator.wikimedia.org/T266793) (owner: 10Andrew Bogott) [15:58:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:30] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - 10https://gerrit.wikimedia.org/r/637513 (owner: 10Andrew Bogott) [15:58:31] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:58:42] (03PS5) 10Andrew Bogott: cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - 10https://gerrit.wikimedia.org/r/637513 [15:59:02] (03PS6) 10Andrew Bogott: cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - 10https://gerrit.wikimedia.org/r/637513 [15:59:09] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] cloud-vps: reformat notify_maintainers.py and puppet_alert.py with black [puppet] - 10https://gerrit.wikimedia.org/r/637513 (owner: 10Andrew Bogott) [16:00:04] jbond42 and cdanis: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201029T1600). Please do the needful. [16:01:25] hnowlan: do you happen to know when we started seeing that error? [16:02:05] (03CR) 10Muehlenhoff: "As mentioned in the IF meeting; I like the change, but the aspect about the thousands separator feels a little too much." [software/cumin] - 10https://gerrit.wikimedia.org/r/636729 (owner: 10Volans) [16:02:24] 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10elukey) @Cmjohnson yep definitely I'd need to schedule this, Tuesday is ok! What time would you be able to start? 
(I'd need an hour of drain time before that) [16:03:42] cdanis: it seems to have been happening intermittently (nowhere near as commonly) since 06:53 or so [16:03:49] might be a red herring though [16:04:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:04:59] are we still in an outage? [16:04:59] hnowlan: hm okay, I *think* (all of this is vague handwavy speculation) in the past that error has been correlated with the call that Maps does back into Mediawiki to fetch some data (location markers? boundaries?) for certain flavors of rendering calls [16:05:08] XioNoX: yes but believed maps-only [16:06:38] cdanis: oh, interesting - would this happen on a cron or something by any chance? [16:07:17] I'm not sure why this surge is happening *now* rather than earlier, we've been at this level of capacity since this about time yesterday [16:07:21] 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10Cmjohnson) @elukey great, I usually get to the data center around 1500UTC [16:07:30] I don't think so? I don't really understand things here (maybe gehel is around if msantos isn't?) but AIUI there's calls where... Mediawiki includes a Maps URL; the processing of said Maps URL causes Maps to do a fetch against the MW API to get some addl data? 
[16:07:44] yeah, so, it could be a change in traffic pattern, or it could be a change in mediawiki [16:07:45] <_joe_> traffic goes up during this time usually [16:07:46] g.ehel is going on be on in a few minutes [16:07:48] k [16:08:11] load on the maps servers in eqiad is very high atm [16:09:22] * gehel is back [16:09:56] (03PS1) 10Elukey: Add role::druid::test_analytics::worker to an-test-druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/637526 (https://phabricator.wikimedia.org/T255139) [16:10:16] the bad geojson is a red herring (the logging should be fixed) [16:10:22] ohh it looks like there are timeouts posting back to MW maybe? [16:11:18] <_joe_> Ihave no evidence of slowness on the mw api [16:12:05] gehel: is seeing "GroupId not available" in logs normal? [16:12:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_kartotherian_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:12:31] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:32] for some definition of normal, yes [16:12:35] <_joe_> so stracing one process, it seems to spends a lot of time in futexes [16:12:53] we do have additional server capacity coming, but not ready yet: T260269 [16:12:53] <_joe_> which is consistent with node managing more requests than it can handle [16:12:53] T260269: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 [16:13:26] I assume we can't add additional servers when the master is resyncing anyway? [16:13:45] <_joe_> should we try to move the load to codfw alone? 
but I see the cpu there is already elevated too [16:13:45] we could probably manually sync data from another slave [16:14:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:16] codfw did not survive the DC switch alone, so it's unlikely to survive here [16:14:18] _joe_: I don't know how much things have changed since we stopped disallowing 3rd party traffic, but in the past, when we did that, we just further exacerbated the capacity crunch, since maps still isn't N+1 [16:14:23] can we move part of the load to codfw? [16:14:25] no [16:14:28] (03PS1) 10Andrew Bogott: cloud-vps notify_maintainers.py: encode mail body as utf8 [puppet] - 10https://gerrit.wikimedia.org/r/637529 [16:14:36] we still have no proportional-weight geodns [16:14:52] or proportional-weight dns discovery [16:15:07] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:15:10] <_joe_> we could change how we map geodns so that esams => codfw [16:15:23] use goe zones to switch part of the traffic ? [16:15:39] <_joe_> yeah but it would be ineffective and coarse [16:15:46] <_joe_> and introduce latency for other services [16:16:05] we can't do it jsut for maps? 
[16:16:09] <_joe_> no [16:16:11] no, we can [16:16:12] ok [16:16:16] it'd just be annoying [16:16:22] right now maps is a cname to upload but it could be a separate dyna [16:16:28] <_joe_> it's error-prone and annoying [16:16:33] understood [16:16:35] it is not something I would like to maintain for longer than a day [16:16:36] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps notify_maintainers.py: encode mail body as utf8 [puppet] - 10https://gerrit.wikimedia.org/r/637529 (owner: 10Andrew Bogott) [16:16:38] <_joe_> cdanis: oh you mean external dns [16:16:40] yes [16:16:46] <_joe_> I was thinking internally [16:16:50] for internal the situation is hopeless [16:16:56] but we could steer external traffic [16:17:15] we could stop the data reload on maps1004, copy the data over from one of the slaves [16:17:31] <_joe_> ottomata: we have a huge amount of 5xx to intake-analytics [16:17:37] <_joe_> for both events and NEL reports [16:17:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [16:17:48] that's still a 900GB data transfer, but probably faster than the data reload from OSN [16:17:54] _joe_: it's been ~1rps for a while, did it jump up? [16:17:54] s/OSM/OSM/ [16:18:17] excuse me, 1rpm _joe_ https://phabricator.wikimedia.org/T264021 [16:18:59] we use debian here [16:19:01] * vgutierrez goes away [16:19:55] hnowlan: any estimate on the data reload ETA ? [16:20:47] just finished planet_osm_polygon which is one of the last operations but no way of knowing how much longer it'll be - planet_osm_ways indexes need to be built and afaik are the last ones as part of the osmimport [16:21:16] even if that finishes, will there not be a significant delay in resyncing the replicas? 
[16:21:37] we don't need to wait for the replicas to resync to start serving traffic from maps1004 [16:21:52] ah [16:22:18] tiles are only served from cassandra, not directly from psotgres [16:22:40] so we might actually be able to repool 1004 already [16:22:44] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:22:46] so judging by the 5xx we're serving to users, the problem started at about 15:15, and ramped up to full pain at 15:30 UTC [16:22:54] https://logstash.wikimedia.org/goto/8611faa22c936038da1a2049f7eb24d6 [16:23:42] that makes me potentially suspicious that Urbanecm's deployment of https://sal.toolforge.org/log/AkntdHUBhxWNv8gILin_ might have had unintended consequences? [16:24:04] (03CR) 10Klausman: [C: 03+2] Add role::druid::test_analytics::worker to an-test-druid1001 [puppet] - 10https://gerrit.wikimedia.org/r/637526 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey) [16:24:05] please let me know when things are in control and we can start the WMCS maintenance in eqiad [16:24:21] quick check on maps1004: kartotherian seems to be serving tiles just fine, [16:24:33] we could repool it already to regain some more capacity [16:24:42] I can repool if needs be [16:24:48] cdanis: that deployment had no effect until https://sal.toolforge.org/log/ch30dHUBgTbpqNOmNLXn got synced [16:24:53] ack [16:25:20] <_joe_> cdanis: is ti normal that referers have no url there? [16:25:26] <_joe_> I don't think it is [16:25:34] hnowlan: can you try repooling maps1004? [16:25:36] _joe_: looking [16:25:39] gehel: will do [16:25:44] thanks! 
[16:25:50] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:25:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=maps,service=kartotherian-ssl,name=maps1004.eqiad.wmnet [16:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:59] <_joe_> ottomata: chris is probably right, it's just a ton of errors over some hours [16:26:04] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1004.eqiad.wmnet [16:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] OTRS: replace cron with systemd timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/637038 (owner: 10Dzahn) [16:26:18] what we don't want to do is restart tilerator before the slaves are synced [16:26:31] my deployment also has only effect on wikis that use dynamic page list (ie. wmgUseDynamicPageList is true for them) [16:27:36] ok, i'm seeing in logstash at most 10ish per minute [16:27:47] the 500s i see there are [16:27:48] request aborted [16:28:02] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:28:24] _joe_: empty referer is allowed [16:28:55] 10Operations, 10observability: VictorOps ~5min delay from email received to incident paging - https://phabricator.wikimedia.org/T266800 (10fgiunchedi) [16:29:00] assuming that's just some clients disconnecting before finishing the req? [16:29:42] that endpoint is doing arouund 500 reqs/second [16:30:05] so perhaps a few 500s per minute of client disconnects is expected? not sure. 
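Note on the numbers above: about 10 errors per minute against an endpoint handling about 500 requests per second is a very small fraction of traffic. A rough sketch of the ratio, using only the two approximate figures quoted in the discussion; the function name is hypothetical.

```python
def error_ratio(errors_per_minute, requests_per_second):
    """Fraction of requests that fail, given errors/min and requests/s."""
    requests_per_minute = requests_per_second * 60
    return errors_per_minute / requests_per_minute


# ~10 "request aborted" 5xx per minute at ~500 requests per second.
ratio = error_ratio(10, 500)
print(f"{ratio:.4%} of requests")
```

At that level, client disconnects are a plausible explanation, as suggested in the discussion, but the figures are eyeballed from logstash and not exact.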
[16:30:32] https://logstash-next.wikimedia.org/goto/43135fde569b3efe67007912623bb8e1 [16:31:20] hm you are seeing it for NEL? thta is going to a different endpoint [16:32:04] i see them there too in eventgate-logging-external, but also request aborted [16:34:26] 10Operations, 10ops-eqiad, 10Reading Epics (Analytics): an-coord1001 ram upgrade - https://phabricator.wikimedia.org/T266709 (10elukey) @Cmjohnson deal then, thanks! [16:34:49] hmm cdanis you might want to add ?hasty=true [16:34:56] !log force VRRP master on cr1-eqiad - T265288 [16:34:57] to your NEL endpoint url [16:35:01] ottomata: I don't, I want the browsers to retry if it doesn't succeed [16:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:03] T265288: Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 [16:35:21] ottomata: they're doing requests on a background thread, with exponential backoff [16:36:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:36:16] right but unrelated [16:36:29] hasty=true will tell eventgate not to wait for the kafka produce to ACK before closing the http request [16:36:37] aka 'fire and forget' [16:37:05] might not matter, but probably a bit better for the browser and perf [16:38:35] !log Move cr2-eqiad:ae2.1120 to cloudsw1-d5:irb.1120 - T265288 [16:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:43] (03PS1) 10Alexandros Kosiaris: sretest: Add missing colon [puppet] - 10https://gerrit.wikimedia.org/r/637532 [16:42:34] (03CR) 10Filippo Giunchedi: [C: 03+1] webperf: move navtiming monitoring back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/637007 (owner: 10Dave Pifke) [16:42:57] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: move navtiming monitoring back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/637007 (owner: 10Dave Pifke) 
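Note on the retry behaviour described above: the browsers are said to resend reports in the background with exponential backoff, which is why waiting for the produce ACK (no `hasty=true`) is acceptable for NEL. A minimal sketch of such a retry loop follows. The names, base delay, factor and cap are assumptions for illustration and are not taken from any browser's or eventgate's implementation.

```python
import time


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0):
    """Delay before each retry: base, base*factor, ... capped at cap seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def send_with_retry(send, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call send() until it returns True, sleeping longer after each failure.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    `send` stands in for a hypothetical function that POSTs one report.
    """
    delays = backoff_delays(attempts, **backoff_kwargs)
    for i, delay in enumerate(delays):
        if send():
            return i + 1
        if i < attempts - 1:  # no point sleeping after the final attempt
            sleep(delay)
    return None
```

With `hasty=true` the server would answer before the Kafka produce is acknowledged, so a client like this would see success even if the event were later lost; without it, a failed produce surfaces as an error and triggers the retry.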
[16:43:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_kartotherian_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:45:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:45:07] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:45:37] 10Operations, 10Analytics-Clusters, 10vm-requests: Create a ganeti VM in eqiad: an-test-ui1001.eqiad.wmnet - https://phabricator.wikimedia.org/T266648 (10elukey) 05Open→03Resolved [16:46:13] <_joe_> !log restarted kartotherian on all servers in eqiad at the same time [16:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:22] (03PS1) 10Bstorm: toolforge bastion: fix the wmcs_wheel_of_misfortune script for py3.5 [puppet] - 10https://gerrit.wikimedia.org/r/637535 (https://phabricator.wikimedia.org/T266300) [16:49:24] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:26] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={4,5} prometheus=ops site=eqiad topic={udp_localhost-err,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=th [16:49:26] logging-eqiad&var-topic=All&var-consumer_group=All [16:50:21] (03CR) 10Bstorm: "Tested this via livehack. I should have known that there as more than just LDAP that would require changes for python 3.5 and Debian oldst" [puppet] - 10https://gerrit.wikimedia.org/r/637535 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [16:50:26] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 1 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [16:51:05] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) Spoke with Dell tech, Chris Bennet today. The ball was dropped by Dell, nobody ordered the new part and our case was left open and not owned by anyone. Today a new case for the backpl... 
[16:51:24] (03PS1) 10CDanis: temporarily failoid kartotherian in eqiad [dns] - 10https://gerrit.wikimedia.org/r/637536 (https://phabricator.wikimedia.org/T266807) [16:51:28] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10aborrero) [16:52:24] (03CR) 10jerkins-bot: [V: 04-1] temporarily failoid kartotherian in eqiad [dns] - 10https://gerrit.wikimedia.org/r/637536 (https://phabricator.wikimedia.org/T266807) (owner: 10CDanis) [16:53:03] (03CR) 10Bstorm: toolforge bastion: fix the wmcs_wheel_of_misfortune script for py3.5 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/637535 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [16:54:09] (03PS2) 10CDanis: temporarily failoid kartotherian in eqiad [dns] - 10https://gerrit.wikimedia.org/r/637536 (https://phabricator.wikimedia.org/T266807) [16:54:52] (03CR) 10jerkins-bot: [V: 04-1] temporarily failoid kartotherian in eqiad [dns] - 10https://gerrit.wikimedia.org/r/637536 (https://phabricator.wikimedia.org/T266807) (owner: 10CDanis) [16:55:21] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) Called to open a ticket with Dell, they received the information and the TSR and are sending a new part [16:58:42] 10Operations, 10fundraising-tech-ops: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841 (10Jgreen) [16:58:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] sretest: Add missing colon [puppet] - 10https://gerrit.wikimedia.org/r/637532 (owner: 10Alexandros Kosiaris) [16:59:14] !log Delete cr1-eqiad:ae2.1120 and related static routes - T265288 [16:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] chrisalbon and accraze: It is that lovely time of the day again! 
You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201029T1700). [17:00:08] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [17:02:59] @seen crusnov [17:02:59] mutante: I have never seen crusnov [17:05:00] (they're c.haomodus here fwiw) [17:06:55] rzl: ack, brain fart :) [17:08:06] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:10:56] Jenkins has some issue somehow [17:11:11] it has lost access to wmcs [17:11:55] we just had a network outage in wmcs [17:12:02] ahhh that explains it [17:12:23] had to restart it anyway [17:12:29] so that sounds like a good opportunity ;) [17:12:36] !log Stopping CI Jenkins [17:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:50] win 31 [17:15:21] !log CI: killed all java agents (java upgrade) [17:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:08] (03PS1) 10Jason Linehan: sessionTick: Add event stream and enable on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637539 (https://phabricator.wikimedia.org/T248987) [17:20:14] hashar: BTW, are you subscribed to cloud-announces@l.w.o? 
[17:20:43] arturo: I guess not :) [17:21:01] you should, this kind of operations are announced there, also CEPH migrations [17:22:50] arturo: I have did, will poke the rest of the team they probably to subscribe too [17:29:16] !log Restarted CI Jenkins a bit ago [17:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:50] (03Abandoned) 10CDanis: temporarily failoid kartotherian in eqiad [dns] - 10https://gerrit.wikimedia.org/r/637536 (https://phabricator.wikimedia.org/T266807) (owner: 10CDanis) [17:39:43] (03CR) 10Bstorm: [C: 03+2] toolforge bastion: fix the wmcs_wheel_of_misfortune script for py3.5 [puppet] - 10https://gerrit.wikimedia.org/r/637535 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [17:48:09] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [17:48:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201029T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:01:03] (03PS2) 10Urbanecm: [cswiki] Set wgGEHomepageManualAssignmentMentorsList to Wikipedie:Potřebuji pomoc/Mentoři/Manuální [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636882 (https://phabricator.wikimedia.org/T245639) [18:01:05] * Urbanecm takes the window [18:01:08] (03CR) 10Urbanecm: [C: 03+2] [cswiki] Set wgGEHomepageManualAssignmentMentorsList to Wikipedie:Potřebuji pomoc/Mentoři/Manuální [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636882 (https://phabricator.wikimedia.org/T245639) (owner: 10Urbanecm) [18:02:31] (03Merged) 10jenkins-bot: [cswiki] Set wgGEHomepageManualAssignmentMentorsList to Wikipedie:Potřebuji pomoc/Mentoři/Manuální [mediawiki-config] - 10https://gerrit.wikimedia.org/r/636882 (https://phabricator.wikimedia.org/T245639) (owner: 10Urbanecm) [18:04:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: b7eaaab81e1665c478f5dc1fdb495e36c53e7863: [cswiki] Set wgGEHomepageManualAssignmentMentorsList to Wikipedie:Potřebuji pomoc/Mentoři/Manuální (T245639) (duration: 00m 57s) [18:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:19] T245639: Allow anyone to claim a mentee - https://phabricator.wikimedia.org/T245639 [18:05:30] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=hewikiquote wikilove # T266744 [18:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:37] T266744: Install wikilove in hewikiquote - https://phabricator.wikimedia.org/T266744 [18:06:15] !log [urbanecm@deploy1001 /srv/mediawiki-staging (master * u=)]$ sudo /usr/local/sbin/fix-staging-perms [18:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:21] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [18:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:30] !log herron@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm 
(exit_code=97) [18:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:40] (03PS1) 10Urbanecm: Enable WikiLove on hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637547 (https://phabricator.wikimedia.org/T266744) [18:06:42] (03CR) 10Urbanecm: [C: 03+2] Enable WikiLove on hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637547 (https://phabricator.wikimedia.org/T266744) (owner: 10Urbanecm) [18:07:04] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [18:07:07] !log herron@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [18:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:35] (03Merged) 10jenkins-bot: Enable WikiLove on hewikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637547 (https://phabricator.wikimedia.org/T266744) (owner: 10Urbanecm) [18:09:24] !log herron@cumin1001 START - Cookbook sre.ganeti.makevm [18:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:25] (03PS1) 10Urbanecm: Enable WikiLove on hewikiquote #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637548 (https://phabricator.wikimedia.org/T266744) [18:10:27] (03CR) 10Urbanecm: [C: 03+2] Enable WikiLove on hewikiquote #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637548 (https://phabricator.wikimedia.org/T266744) (owner: 10Urbanecm) [18:11:21] (03Merged) 10jenkins-bot: Enable WikiLove on hewikiquote #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637548 (https://phabricator.wikimedia.org/T266744) (owner: 10Urbanecm) [18:13:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WikiLove on hewikiquote (T266744) (duration: 00m 57s) [18:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:36] T266744: Install wikilove in hewikiquote - 
https://phabricator.wikimedia.org/T266744 [18:13:44] * Urbanecm done [18:16:42] !log herron@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [18:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:28] (03PS1) 10Elukey: Fix settings for the Druid test cluster [puppet] - 10https://gerrit.wikimedia.org/r/637549 [18:20:08] (03CR) 10Elukey: [C: 03+2] Fix settings for the Druid test cluster [puppet] - 10https://gerrit.wikimedia.org/r/637549 (owner: 10Elukey) [18:25:11] (03PS1) 10Herron: logstash: add dhcp/netboot entries for logstash1032 [puppet] - 10https://gerrit.wikimedia.org/r/637550 [18:29:31] (03CR) 10Herron: [C: 03+2] logstash: add dhcp/netboot entries for logstash1032 [puppet] - 10https://gerrit.wikimedia.org/r/637550 (owner: 10Herron) [18:43:14] (03CR) 10Dzahn: OTRS: replace cron with systemd timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/637038 (owner: 10Dzahn) [18:47:27] (03PS2) 10Dzahn: OTRS: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/637038 [18:58:24] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [18:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:42] (03PS3) 10Dzahn: OTRS: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/637038 (https://phabricator.wikimedia.org/T265138) [19:00:02] (03PS1) 10Dzahn: create microsite for WDQS UI [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) [19:00:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:24] !log rolling restart of ores uwsgi [19:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:30] (03PS2) 10Dzahn: create microsite for WDQS UI [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) [19:22:29] 
!log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session on mwmaint1002 (wiki=ukwiki; T246539) [19:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:35] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [19:22:38] (03PS3) 10Dzahn: create microsite for WDQS UI [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) [19:25:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [19:25:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:12] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/26217/" [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn) [19:31:47] !log cdanis@cumin1001 START - Cookbook sre.network.cf [19:31:47] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [19:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:34] (03PS1) 10Hnowlan: maps: add maps(200[5-9]|2010) as maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/637554 (https://phabricator.wikimedia.org/T266820) [19:36:28] (03PS2) 10Kosta Harlan: linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) [19:36:30] (03CR) 10Kosta Harlan: linkrecommendation: Add deployment chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 
10Kosta Harlan) [19:37:13] (03CR) 10Addshore: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/637552 (https://phabricator.wikimedia.org/T266702) (owner: 10Dzahn) [19:37:43] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/636916 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [19:45:15] (03CR) 10Bstorm: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26218/" [puppet] - 10https://gerrit.wikimedia.org/r/636087 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:47:27] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: Let hiera set the replicas value for ingress controllers [puppet] - 10https://gerrit.wikimedia.org/r/637072 (https://phabricator.wikimedia.org/T266506) (owner: 10Bstorm) [19:49:21] (03PS1) 10Ladsgroup: ores: Stop memory reporting [puppet] - 10https://gerrit.wikimedia.org/r/637557 [19:51:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:55:12] (03CR) 10Dzahn: "thank you 😊" [puppet] - 10https://gerrit.wikimedia.org/r/636087 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [19:57:22] wow, that's quite the task looking at the list! 
[20:00:06] (03PS1) 10Mforns: Add ::profile::analytics::refinery::network_infra_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) [20:03:17] (03PS2) 10Mforns: Add ::profile::analytics::refinery::network_infra_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) [20:04:36] (03CR) 10jerkins-bot: [V: 04-1] Add ::profile::analytics::refinery::network_infra_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [20:06:07] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:06:08] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:06:09] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:06:10] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:23] (03PS3) 10Mforns: Add ::profile::analytics::refinery::network_infra_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) [20:07:29] (03CR) 10Gehel: [C: 04-1] "A few missing pieces:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637554 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [20:08:08] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:54] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:57] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - 
degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:03] !log herron@cumin1001 START - Cookbook sre.dns.netbox [20:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:22] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:19] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:45:39] (03CR) 10Ottomata: [C: 03+1] geoip: cleanup having moved archiving to launcher [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [20:48:01] (03CR) 10Razzi: [C: 03+2] geoip: cleanup having moved archiving to launcher [puppet] - 10https://gerrit.wikimedia.org/r/636517 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [20:50:11] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:26] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:11] (03CR) 10Ottomata: [C: 03+1] Add ::profile::analytics::refinery::network_infra_config [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [21:22:16] (03CR) 10Ottomata: [C: 03+1] Add ::profile::analytics::refinery::network_infra_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [21:31:46] 
(03CR) 10AndyRussG: "Thanks so much for digging in!!! Maybe the version of python-debian in setup.py should also be updated? I couldn't find the documentation " [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/637082 (https://phabricator.wikimedia.org/T266730) (owner: 10Ejegg) [21:33:46] (03CR) 10Bstorm: "I don't see a lot of benefit for this just now. I'll abandon this and deal with it later if we need it." [puppet] - 10https://gerrit.wikimedia.org/r/637017 (owner: 10Bstorm) [21:33:52] (03Abandoned) 10Bstorm: paws-k8s: switch the ingress for https to http logging [puppet] - 10https://gerrit.wikimedia.org/r/637017 (owner: 10Bstorm) [21:40:47] (03CR) 10Ottomata: [C: 03+1] Add ::profile::analytics::refinery::network_infra_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637559 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [22:03:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:03:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:55] !log scandium - puppet disabled again (but only until tomorrow), downtimed in Icinga, for ongoing parsoid tests from testreduce1001 [22:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1267.eqiad.wmnet [22:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:40] !log depooled mw1267 (T266164) [22:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:46] T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 [22:07:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:07:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) [22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:08] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime [22:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:06] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:18:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:19:13] (03PS1) 10Dzahn: scap: replace proxy for eqiad A7, mw1268->mw1269 [puppet] - 10https://gerrit.wikimedia.org/r/637572 (https://phabricator.wikimedia.org/T266164) [22:19:31] jouncebot: next [22:19:31] In 0 hour(s) and 40 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201029T2300) [22:21:29] !log updated packages for thirdparty/kubeadm-k8s-1-17 to prepare for install T263284 [22:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:36] T263284: Upgrade Toolforge K8s to 1.17 - https://phabricator.wikimedia.org/T263284 [22:21:36] !log replacing scap proxy for rack A7 eqiad because mw1268 needs to move physically (T266164) [22:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:44] T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 [22:22:34] (03CR) 10Dzahn: [C: 03+2] scap: replace proxy for eqiad A7, 
mw1268->mw1269 [puppet] - 10https://gerrit.wikimedia.org/r/637572 (https://phabricator.wikimedia.org/T266164) (owner: 10Dzahn) [22:23:16] ^ doing this right around 38 minutes before deployment. 30 mins max waiting time for puppet [22:24:14] because changing a scap proxy means a change on all appservers [22:32:23] !log mw1269 rsyncd/ferm for scap proxy was enabled - mw1268 rsyncd/ferm for scap proxy was removed - deploy1001 scap-proxies dsh group was adjusted [22:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:33:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1268.eqiad.wmnet [22:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:20] !log mw1268 - depooled for T266164 [22:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:26] T266164: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 [22:51:26] (03PS1) 10Dzahn: site: move mw1267,mw1268 from rack A7 to rack A8 [puppet] - 10https://gerrit.wikimedia.org/r/637576 (https://phabricator.wikimedia.org/T266164) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201029T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:02:48] (03PS1) 10Dzahn: site/appservers: cleanup comments about appserver rack locations [puppet] - 10https://gerrit.wikimedia.org/r/637577 [23:02:55] (03PS1) 10Urbanecm: Add namespace aliases to Turkish Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637578 (https://phabricator.wikimedia.org/T266609) [23:03:45] (03PS1) 10Urbanecm: Add namespace aliases to Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637579 (https://phabricator.wikimedia.org/T266608) [23:04:53] (03PS1) 10Urbanecm: Add namespace aliases to Turkish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637580 (https://phabricator.wikimedia.org/T266606) [23:06:04] mutante: I see you did some scap changes, can I deploy those patches now please? [23:09:41] (03PS2) 10Dzahn: site/appservers: cleanup comments about appserver rack locations [puppet] - 10https://gerrit.wikimedia.org/r/637577 [23:10:08] Urbanecm: yes, half an hour has passed so puppet should have run everywhere [23:10:17] okay, thanks [23:10:22] the change is that one scap proxy moved to another host [23:10:28] in one specific rack [23:11:04] (03PS2) 10Urbanecm: Add namespace aliases to Turkish Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637578 (https://phabricator.wikimedia.org/T266609) [23:11:09] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases to Turkish Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637578 (https://phabricator.wikimedia.org/T266609) (owner: 10Urbanecm) [23:11:13] mw1268/mw1269 messages would be related.. 
but you should not see any [23:11:41] ack [23:11:42] I already made sure puppet changed the rsyncd and firewall stuff there [23:11:52] which turns it into a proxy [23:11:57] (03Merged) 10jenkins-bot: Add namespace aliases to Turkish Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637578 (https://phabricator.wikimedia.org/T266609) (owner: 10Urbanecm) [23:16:59] (03PS2) 10Urbanecm: Add namespace aliases to Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637579 (https://phabricator.wikimedia.org/T266608) [23:17:38] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 090f75730727e7a3ca5a85af0ff9071213dd047f: Add namespace aliases to Turkish Wiktionary (T266609) (duration: 00m 58s) [23:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:45] T266609: Add namespace aliases to Turkish Wiktionary - https://phabricator.wikimedia.org/T266609 [23:17:54] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases to Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637579 (https://phabricator.wikimedia.org/T266608) (owner: 10Urbanecm) [23:18:46] (03Merged) 10jenkins-bot: Add namespace aliases to Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637579 (https://phabricator.wikimedia.org/T266608) (owner: 10Urbanecm) [23:18:49] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=trwiktionary --fix # T266609 [23:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:15] (03PS2) 10Urbanecm: Add namespace aliases to Turkish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637580 (https://phabricator.wikimedia.org/T266606) [23:21:19] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases to Turkish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637580 (https://phabricator.wikimedia.org/T266606) (owner: 10Urbanecm) [23:22:03] !log urbanecm@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: 1800d11ec8c07ff6ccffe0fd03ce11e6786f8a6e: Add namespace aliases to Turkish Wikibooks (T266608) (duration: 00m 57s) [23:22:06] (03Merged) 10jenkins-bot: Add namespace aliases to Turkish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637580 (https://phabricator.wikimedia.org/T266606) (owner: 10Urbanecm) [23:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:14] T266608: Add namespace aliases to Turkish Wikibooks - https://phabricator.wikimedia.org/T266608 [23:23:31] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=trwikibooks --fix # T266608 [23:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:50] (03PS3) 10Dzahn: site/appservers: cleanup comments about appserver rack locations [puppet] - 10https://gerrit.wikimedia.org/r/637577 [23:27:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c3a8555154673c4c5a65f6ec2a1219d0832f48e0: Add namespace aliases to Turkish Wikisource (T266606) (duration: 00m 56s) [23:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:05] T266606: Add namespace aliases to Turkish Wikisource - https://phabricator.wikimedia.org/T266606 [23:29:53] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=trwikisource --add-prefix=BROKEN --fix # T266606 # P13111 [23:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:17] ^ when stashbot adds logs to SAL, it links to phab tasks automatically, but not pastes - feature request for whoever is in charge of it [23:32:05] file a task, DannyS712 [23:32:12] Should be trivial [23:32:24] I would but I don't know the tag to use [23:32:40] I could probably write a patch though if I knew where the code is? 
[23:32:47] don't worry about it, just create the task anyways and others will add tags [23:32:56] I'll put a patch up [23:33:43] DannyS712: I think stashbot's project is called stashbot [23:33:45] (03PS1) 10Urbanecm: Add namespace aliases to Turkish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637581 (https://phabricator.wikimedia.org/T266605) [23:33:47] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases to Turkish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637581 (https://phabricator.wikimedia.org/T266605) (owner: 10Urbanecm) [23:33:53] https://gerrit.wikimedia.org/r/c/labs/tools/stashbot/+/637065/2/stashbot/bot.py [23:33:55] simple patch is simple [23:34:38] :) [23:34:42] (03Merged) 10jenkins-bot: Add namespace aliases to Turkish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/637581 (https://phabricator.wikimedia.org/T266605) (owner: 10Urbanecm) [23:34:52] cool, thanks - https://phabricator.wikimedia.org/T266848 [23:36:23] what does M even represent in Phab? 
[23:36:33] mock [23:36:53] M284 [23:36:54] M284: setcontentmodel mockup - https://phabricator.wikimedia.org/M284 [23:37:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ddb7e08e9c1d07f704c9f7585d8b6089f1895b5c: Add namespace aliases to Turkish Wikiquote (T266605) (duration: 00m 57s) [23:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:09] T266605: Add namespace aliases to Turkish Wikiquote - https://phabricator.wikimedia.org/T266605 [23:38:43] !log urbanecm@mwmaint1002:~$ mwscript namespaceDupes.php --wiki=trwikiquote --add-prefix=BROKEN --fix # T266605 # P13112 [23:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:00] !log Evening B&C window done [23:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:22] !bash DannyS712: I think stashbot's project is called stashbot [23:41:22] legoktm: Stored quip at https://bash.toolforge.org/quip/SEu8dnUBhxWNv8gIvX6y [23:41:45] oh no [23:41:59] I definitely should have at least tried to search for a project tag before posting that comment [23:42:04] xD [23:42:33] heh [23:44:00] I was about to ask if there was a view on bash.toolforge.org to see the quips in chronological order, since there was no link for "recent" or something, but then wanted to make sure I wasn't posting another dumb comment - I checked, apparently https://bash.toolforge.org/search lists them in order? But it's the search page? [23:44:50] also, is there a way to see when quips were added? 
I'm curious about the context of some (eg https://bash.toolforge.org/quip/w4pm8HQBgTbpqNOmGirT) [23:46:16] the /search page shows them in order [23:46:46] I don't think it exposes the date [23:49:11] * DannyS712 goes to file a task, but doesn't remember the relevant phab project to tag [23:49:23] Try bash :P [23:49:36] the bug wrangler will fix it [23:49:52] According to https://wikitech.wikimedia.org/wiki/Tool:Bash, it's not in Phab at all! [23:51:15] I'm going to stay away from doing anything given how tired I am [23:52:21] "the bug wrangler will fix it" or I could create a new phab project just for lost and found? [23:53:34] well, you would have to request the new project with a ... ticket [23:53:35] DannyS712: it was said on October 3rd 2020 [23:53:58] yea, others will quickly fix tags for incoming new stuff [23:54:24] the ones without tags will be noticed [23:54:37] mutante I can create projects myself [23:55:24] I think that would cost you more time in the follow-up discussion about that project not being like others [23:55:45] I meant the proposal as a joke :)