[02:36:21] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20432624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:37:59] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 25248 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:19:47] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [03:32:43] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [05:15:01] RECOVERY - snapshot of s6 in eqiad on db1115 is OK: snapshot for s6 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-09-30 04:17:00 from db1139.eqiad.wmnet:3316 (503 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [05:57:13] (03PS1) 10Elukey: matomo: disable tracking failure notifications [puppet] - 10https://gerrit.wikimedia.org/r/539730 [05:57:16] 10Operations, 10DBA: snapshot for s6/s7 at eqiad taken more than 4 days ago - https://phabricator.wikimedia.org/T234152 (10Marostegui) ` 05:15:01 <+icinga-wm> RECOVERY - snapshot of s6 in eqiad on db1115 is OK: snapshot for s6 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-09-30 04:17... 
[05:58:29] 10Operations, 10DBA: snapshot for s6/s7 at eqiad taken more than 4 days ago - https://phabricator.wikimedia.org/T234152 (10Marostegui) p:05Triage→03Normal [05:58:33] (03CR) 10Elukey: [C: 03+2] matomo: disable tracking failure notifications [puppet] - 10https://gerrit.wikimedia.org/r/539730 (owner: 10Elukey) [05:58:58] 10Operations, 10DBA: snapshot for s6/s7 at eqiad taken more than 4 days ago - https://phabricator.wikimedia.org/T234152 (10jcrespo) a:03jcrespo [06:05:35] (03PS3) 10Elukey: profile::swap: automatically remove jupyter trash files [puppet] - 10https://gerrit.wikimedia.org/r/539694 [06:13:32] (03CR) 10Elukey: [C: 03+2] profile::swap: automatically remove jupyter trash files [puppet] - 10https://gerrit.wikimedia.org/r/539694 (owner: 10Elukey) [06:25:36] 10Operations, 10serviceops: Deploy wikidiff2 v1.9.0 - https://phabricator.wikimedia.org/T234175 (10jijiki) [06:25:55] 10Operations, 10DBA: snapshot for s6/s7 at eqiad taken more than 4 days ago - https://phabricator.wikimedia.org/T234152 (10jcrespo) There is some amount of "self healing", but I will rerun manually some backups so not to miss the window. Thanks @jijiki for the report. These alerts are like RAID alerts- they ar... 
[06:47:49] !log installing exim security updates on buster [06:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:37] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-09-30 04:15:08 from db1116.eqiad.wmnet:3317 (863 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [07:12:01] (03PS1) 10Marostegui: check_candidates.sh: Quick script to check candidate masters [software] - 10https://gerrit.wikimedia.org/r/539739 (https://phabricator.wikimedia.org/T234039) [07:12:52] (03CR) 10Marostegui: [C: 03+2] check_candidates.sh: Quick script to check candidate masters [software] - 10https://gerrit.wikimedia.org/r/539739 (https://phabricator.wikimedia.org/T234039) (owner: 10Marostegui) [07:13:21] (03Merged) 10jenkins-bot: check_candidates.sh: Quick script to check candidate masters [software] - 10https://gerrit.wikimedia.org/r/539739 (https://phabricator.wikimedia.org/T234039) (owner: 10Marostegui) [07:24:10] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) ping @Cmjohnson :) [07:41:41] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10Vgutierrez) The issue is also happening on acmechief-test1001, so I decided to hack acme-chief code a little bit to add [[ https://mg.pov.lt/objgraph/objgraph.html | objgraph ]] reports, I've... 
[07:56:36] !log Stop dbstore1003:3311 for troubleshooting [07:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:55] !log installing e2fsprogs security updates on Stretch/Buster [08:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:59] 10Operations, 10Mail: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10MoritzMuehlenhoff) [08:08:40] 10Operations, 10Wikimedia-Logstash, 10observability, 10serviceops, 10Patch-For-Review: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash - https://phabricator.wikimedia.org/T233828 (10Joe) >>! In T233828#5531056, @Krinkle wrote: > I think this should be fixed at the source in... [08:16:42] (03PS6) 10Volans: Homer: setup private repo [puppet] - 10https://gerrit.wikimedia.org/r/539453 (https://phabricator.wikimedia.org/T228388) [08:21:00] (03CR) 10Volans: [C: 03+2] "Compiler partially passing, seems related to a compiler data issue. Going ahead and testing it live:" [puppet] - 10https://gerrit.wikimedia.org/r/539453 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:23:45] (03PS2) 10Giuseppe Lavagetto: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) [08:23:52] <_joe_> volans: what compiler issue? 
[08:25:56] didn't match a puppetdb query, only partially, although a few days have passed so I thought that the crontab to populate the compiler puppetdb should have run [08:26:19] I know that Xio.NoX had pinged her.ron on the matter on Friday, but I don't know the outcome of that chat ;) [08:26:27] anyway all works fine in prod as expected [08:31:36] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10Vgutierrez) I've hacked an ACMEChief subclass that only performs OCSP response checks/updates and it still leaks memory: `lang=python class OCSPChecker(ACMEChief): def certificate_manageme... [08:32:47] (03PS1) 10Volans: homer: cleanup absented resource [puppet] - 10https://gerrit.wikimedia.org/r/539837 (https://phabricator.wikimedia.org/T228388) [08:33:03] 10Operations, 10Wikimedia-Logstash, 10observability, 10serviceops, 10Patch-For-Review: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash - https://phabricator.wikimedia.org/T233828 (10Joe) >>! In T233828#5532958, @Joe wrote: >> The json fields here were modelled after MediaWi... [08:35:16] (03CR) 10Volans: [C: 03+2] "just cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/539837 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:43:11] (03PS4) 10Arturo Borrero Gonzalez: Remove some old Trusty/Jessie stuff [puppet] - 10https://gerrit.wikimedia.org/r/539681 (owner: 10Alex Monk) [08:46:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Remove some old Trusty/Jessie stuff [puppet] - 10https://gerrit.wikimedia.org/r/539681 (owner: 10Alex Monk) [08:55:47] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10Vgutierrez) Using objgraph to track leaking objects doesn't show anything wrong either: ` dict 723 set 248 CTypeDescr 101 tuple 39 list 7 SignalDict 6 weakref 1 meth... 
[08:57:44] (03PS3) 10Giuseppe Lavagetto: mediawiki: remove the PHP/HHVM conditionals from the code [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) [09:10:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2091:3314 for a schema change - T233625', diff saved to https://phabricator.wikimedia.org/P9217 and previous config saved to /var/cache/conftool/dbconfig/20190930-091043-marostegui.json [09:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:49] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [09:11:45] !log draining ganeti2002 for upcoming reboot (combined kernel/qemu security updates) [09:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:58] (03CR) 10Volans: [C: 04-1] "Final pass, apart the queryfilter order, all the rest is very minor/nitpick." (039 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [09:19:03] (03PS1) 10Marostegui: dumps-misc.sh.erb: Remove designate_pool_manager from backups [puppet] - 10https://gerrit.wikimedia.org/r/539839 (https://phabricator.wikimedia.org/T233978) [09:21:33] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/18675/ seems to do the right thing. I think this change can be merged once we're 100% sur" [puppet] - 10https://gerrit.wikimedia.org/r/539326 (https://phabricator.wikimedia.org/T192166) (owner: 10Giuseppe Lavagetto) [09:23:17] (03PS1) 10Marostegui: dump-misc.sh.erb: Remove puppet database from the backups [puppet] - 10https://gerrit.wikimedia.org/r/539840 (https://phabricator.wikimedia.org/T231539) [09:23:47] (03CR) 10Giuseppe Lavagetto: "This would change how we record messages from mediawiki exceptions, AIUI. 
I'm not sure if this would have unintended consequences with our" [puppet] - 10https://gerrit.wikimedia.org/r/539621 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [09:25:06] (03Abandoned) 10Gilles: Gzip SVGs served by MediaWiki [puppet] - 10https://gerrit.wikimedia.org/r/535860 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [09:27:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I prefer this approach compared to the alternative proposal as it's less invasive." [puppet] - 10https://gerrit.wikimedia.org/r/539623 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [09:30:48] !log uploading ferm 2.4.1+wmf2+deb9u1 for stretch-wikimedia, fixes AAAA lookups (T153468) [09:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:53] T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 [09:31:48] (03PS1) 10Gilles: Document Apache gzip sidestepping [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) [09:31:57] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10Vgutierrez) Using valgrind against a simplified version that only runs _fetch_ocsp_response() against `unified / rsa-2048` shows a leak on: `==25358== 8,888 (1,536 direct, 7,352 indirect) byte... [10:30:01] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:30:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T1030). 
[10:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:40] (03CR) 10Daimona Eaytoy: "> Can you enable it for -labs as well (use the '-' prefix to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [10:30:50] (03PS2) 10Daimona Eaytoy: Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) [10:31:00] (03CR) 10jerkins-bot: [V: 04-1] Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [10:32:08] (03PS2) 10Pmiazga: Enable alternate mobile link for it, nl, ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) [10:32:20] (03CR) 10jerkins-bot: [V: 04-1] Enable alternate mobile link for it, nl, ko wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [10:32:22] (03PS1) 10Elukey: profile::kerberos::kadminserver: add motd [puppet] - 10https://gerrit.wikimedia.org/r/539847 [10:32:24] (03CR) 10jerkins-bot: [V: 04-1] profile::kerberos::kadminserver: add motd [puppet] - 10https://gerrit.wikimedia.org/r/539847 (owner: 10Elukey) [10:32:32] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/539847 (owner: 10Elukey) [10:32:40] (03CR) 10jerkins-bot: [V: 04-1] profile::kerberos::kadminserver: add motd [puppet] - 10https://gerrit.wikimedia.org/r/539847 (owner: 10Elukey) [10:32:44] (03CR) 10Daimona Eaytoy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [10:32:54] (03CR) 10jerkins-bot: [V: 04-1] Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [10:32:58] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539849 (https://phabricator.wikimedia.org/T128546) [10:33:00] (03CR) 10jerkins-bot: [V: 04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539849 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:02] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539849 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:33] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539850 (https://phabricator.wikimedia.org/T128546) [10:34:26] (03CR) 10jerkins-bot: [V: 04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539850 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:49] PROBLEM - 
Host kubetcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:35:33] ^ kubetcd2003 is expected, reboot of ganeti2002 [10:35:51] RECOVERY - Host kubetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [10:37:51] (03CR) 10Jbond: [V: 03+2 C: 03+2] remove TOTP support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/539512 (owner: 10Jbond) [10:39:55] (03CR) 10Jdrewniak: [V: 03+2 C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539850 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:40:19] I dunno what's happening with CI right now [10:40:19] `fatal: Unable to look up contint2001.wikimedia.org (port 9418) (Temporary failure in name resolution)` [10:40:33] !log CI down due to some DNS related failure on the hosts :-\ [10:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:48] (03CR) 10jerkins-bot: [V: 04-1] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539850 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:42:32] !log draining ganeti2004 for upcoming reboot (combined kernel/qemu security updates) [10:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:35] (03PS3) 10Jbond: apereo_cas: configure to used MFA based on ldap group membership [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) [10:42:48] (03PS4) 10Jbond: apereo_cas: configure to used MFA based on ldap group membership [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) [10:43:28] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: configure to used MFA based on ldap group membership [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [10:44:04] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served 
by MediaWiki - https://phabricator.wikimedia.org/T232615 (10alaa_wmde) hey @Gilles We encountered an incident related to this change {T234183#5533440}. I suspect either there's somethin... [10:44:17] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [10:44:55] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: configure to used MFA based on ldap group membership [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [10:46:22] Planning to do deploy cxserver before SWAT. Is CI OK? [10:46:25] hashar: ^ [10:47:11] hashar: i see the following error 'fatal: Unable to look up contint1001.wikimedia.org (port 9418) (Temporary failure in name resolution)' [10:47:17] https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/22668/console [10:47:48] 10Operations, 10Acme-chief, 10Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (10Vgutierrez) Reported to debian package maintainer on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941413 [10:48:40] !log CI outage is tracked in https://phabricator.wikimedia.org/T234197 [10:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:50] thx [10:49:37] jbond42: kart_ : no CI is broken entirely (logged about ~ 10 minutes ago). Tracking that via T234197 [10:49:38] T234197: CI docker container fails to resolve DNS name: fatal: Unable to look up contint1001.wikimedia.org (port 9418) (Temporary failure in name resolution) - https://phabricator.wikimedia.org/T234197 [10:52:28] hashar: OK. Checking. 
[10:52:53] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/539462 (https://phabricator.wikimedia.org/T233906) (owner: 10Alexandros Kosiaris) [10:58:18] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Gilles) Indeed, it seems like Varnish claims that the content is gzipped but it actually isn't. I think this is a Varnish bug... [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T1100). [11:00:04] kart_ and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:50] I'm here. [11:02:05] Let's see if CI works :) [11:02:20] (03CR) 10Pmiazga: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:02:46] hello [11:03:15] (03CR) 10jerkins-bot: [V: 04-1] Enable alternate mobile link for it, nl, ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:03:20] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539517 (https://phabricator.wikimedia.org/T233006) (owner: 10KartikMistry) [11:03:45] raynor: CI is broken. 
[11:03:52] sorry, I was in the #wikimedia-ops channel :) [11:03:58] yeah, I know, just wanted to say I'm here [11:04:15] (03CR) 10jerkins-bot: [V: 04-1] Enable CX out of beta in Tagalog and Central Bikol Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539517 (https://phabricator.wikimedia.org/T233006) (owner: 10KartikMistry) [11:04:27] :) [11:04:43] raynor: we can wait. And see. [11:06:58] (03Abandoned) 10Hashar: contint: add puppetmaster CA cert [puppet] - 10https://gerrit.wikimedia.org/r/539301 (https://phabricator.wikimedia.org/T152941) (owner: 10Hashar) [11:08:02] !log Restarting Docker on CI agents to clear out some docker/iptables oddity # T234197 [11:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:05] T234197: CI docker container fails to resolve DNS name: fatal: Unable to look up contint1001.wikimedia.org (port 9418) (Temporary failure in name resolution) - https://phabricator.wikimedia.org/T234197 [11:09:14] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Vgutierrez) @Gilles, yeah, with varnishadm ban like it's documented on https://wikitech.wikimedia.org/wiki/Varnish#One-off_purg... 
[11:10:15] (03PS7) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [11:10:17] (03PS10) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [11:10:19] (03PS7) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [11:10:22] (03PS2) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [11:10:24] (03PS2) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [11:10:58] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:11:17] (03CR) 10jerkins-bot: [V: 04-1] query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:11:42] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:12:02] (03CR) 10jerkins-bot: [V: 04-1] query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:12:08] (03CR) 10KartikMistry: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/539520 (https://phabricator.wikimedia.org/T233834) (owner: 10KartikMistry) [11:12:34] I think 
it might be fixed [11:12:40] (03CR) 10jerkins-bot: [V: 04-1] query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:12:47] checking still [11:12:49] hashar: OK. Let me try. [11:12:54] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Gilles) Yep, exactly. In fact you can go as far as only banning content-length between 150 and 860 (inclusive). [11:12:59] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539517 (https://phabricator.wikimedia.org/T233006) (owner: 10KartikMistry) [11:13:17] (03CR) 10Daimona Eaytoy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [11:15:13] hashar: works. Going ahead with SWAT now. [11:15:24] cool [11:15:30] (03CR) 10Alexandros Kosiaris: "> thanks for the updates on this. 
For now i dont see any clean fix, although it sounds like the priority for fixing this just went down a" [puppet] - 10https://gerrit.wikimedia.org/r/539462 (https://phabricator.wikimedia.org/T233906) (owner: 10Alexandros Kosiaris) [11:15:32] (03CR) 10KartikMistry: [C: 03+2] Enable CX out of beta in Tagalog and Central Bikol Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539517 (https://phabricator.wikimedia.org/T233006) (owner: 10KartikMistry) [11:16:31] (03Merged) 10jenkins-bot: Enable CX out of beta in Tagalog and Central Bikol Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539517 (https://phabricator.wikimedia.org/T233006) (owner: 10KartikMistry) [11:16:53] (03CR) 10Hashar: "recheck T234197" [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:16:56] (03CR) 10Hashar: "recheck T234197" [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:17:07] (03CR) 10Hashar: "recheck T234197" [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:17:16] (03CR) 10Hashar: "recheck T234197" [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:17:32] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:17:47] (03CR) 10jerkins-bot: [V: 04-1] query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:18:00] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 
10Mathew.onipe) [11:18:13] (03CR) 10Hashar: "recheck T234197" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:18:19] hashar: CI having issues? `fatal: Unable to look up contint2001.wikimedia.org (port 9418) (Temporary failure in name resolution)` [11:18:34] (03CR) 10jerkins-bot: [V: 04-1] query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:19:19] onimisionipe: see https://tools.wmflabs.org/sal/production , but should be fixed now [11:19:24] well hopefully :-\ [11:19:40] I am still verifying [11:19:51] hashar: one of change merged int. [11:19:53] in* [11:19:54] Oh.. I see [11:20:01] ahhh [11:20:07] I missed the host for operations/puppet bah ;:- _ [11:20:17] Ok [11:20:38] mwdebug1002 has any issue? scap pull seems stuck. [11:20:42] looks like verification step for https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/538296/ went through, it got +2 \o/ finally [11:20:47] !log Restarting Docker on integration-agent-puppet-docker-1001 # T234197 [11:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:51] T234197: CI docker container fails to resolve DNS name: fatal: Unable to look up contint1001.wikimedia.org (port 9418) (Temporary failure in name resolution) - https://phabricator.wikimedia.org/T234197 [11:21:15] raynor: I'll ping when change is deployed.. [11:22:10] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10jbond) [[ https://www.ripe.net/publications/docs/ripe-580#recommendations | RIPE routing-WG recommends a suppress-value of 6000 ]] if we go to that we may want to also increase the reuse but i coul... 
[11:22:13] (03CR) 10Hashar: "recheck T234197, after restarting Docker on integration-agent-puppet-docker-1001" [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [11:22:33] kart_, thx, I'm waiting [11:22:56] eh. scap pull on mwmaint1002 seems taking forever. How can I debug it? [11:22:58] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [11:23:13] 11:21:29 Started rsync common [11:23:16] here. [11:23:23] onimisionipe: this time it is fixed for operations/puppet , the job for that repo runs on a different instance which I have missed :-\ [11:23:24] @hashar ^ [11:23:45] kart_: I am on the CI docker issue sorry [11:24:00] but maybe logstash debug logs for scap has more details [11:24:04] OK. Anyone else who can look at scap? [11:24:08] Checking.. [11:24:09] I can't look it up right now [11:24:12] hashar: thanks [11:25:03] Looks like it is done. Let me test and verify. [11:27:56] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|539517|Enable CX out of beta in Tagalog and Central Bikol WPs (T233006, T233007)]] (duration: 00m 59s) [11:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:01] T233007: Enable Content Translation in Central Bikol Wikipedia as a default tool - https://phabricator.wikimedia.org/T233007 [11:28:02] T233006: Enable Content Translation in Tagalog Wikipedia as a default tool - https://phabricator.wikimedia.org/T233006 [11:28:09] raynor: I'm done. 
[11:34:57] kart_, checking [11:37:30] (03CR) 10Jbond: "Hi paladox," [puppet] - 10https://gerrit.wikimedia.org/r/539625 (owner: 10Paladox) [11:37:46] (03CR) 10Jbond: [C: 03+2] apereo_cas: configure to used MFA based on ldap group membership [puppet] - 10https://gerrit.wikimedia.org/r/539515 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [11:39:27] kart_, ah, sorry, you didn't deploy my patch, no worries, I'm on it [11:39:53] (03PS3) 10Pmiazga: Enable alternate mobile link for it, nl, ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) [11:40:00] (03CR) 10Pmiazga: [C: 03+2] Enable alternate mobile link for it, nl, ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:40:53] (03Merged) 10jenkins-bot: Enable alternate mobile link for it, nl, ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) (owner: 10Pmiazga) [11:41:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) [11:42:10] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [11:42:34] jbond42: which comment inline? :) [11:42:38] I see no inline comment [11:42:54] paladox: one sec, although it was basically the same comment alex made [11:42:58] why the join? 
[11:44:24] Oh, I saw another piece of code do that [11:44:25] * paladox finds it [11:44:41] (03CR) 10Jbond: letsencrypt: Fix acme-setup script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539625 (owner: 10Paladox) [11:45:15] volans: is probably the best person to say which is most pythonic though [11:45:44] Jbond42: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/67c61acb00e343c190311c31a9418df92d3f20f4/modules/letsencrypt/files/acme-setup.py#324 [11:45:47] Ok [11:47:01] well both work lets see what volans says :), but if its urgent let me know and i will apply it [11:47:31] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) [11:48:08] not urgent per se as renewals will still work I think. But any new installs will fail. [11:48:13] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [11:48:18] Though should be fixed as soon as possible [11:48:33] (03PS1) 10Jbond: profile::idp use correct variable names [puppet] - 10https://gerrit.wikimedia.org/r/539854 [11:48:47] ack thanks suspect vol.ans is just on lunch so will look later today [11:49:52] raynor: ouch. I thought we are self-deploying!! [11:50:11] nah, no worries, I can deploy [11:50:23] self note: will ask about it next time! [11:50:43] it was just small misunderstanding, I should ask also [11:51:56] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Gilles) @Vgutierrez purged SVGs whose content-length is > 100 and <= 899 in both cache layers for text and upload. 
[11:54:03] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:538296|Enable alternate mobile link for it, nl, ko wikis. (T206497)]] (duration: 00m 57s) [11:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:07] T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias - https://phabricator.wikimedia.org/T206497 [11:54:45] !log EU SWAT finished [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:12] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (10Vgutierrez) Actually I'm missing the backend layer on the upload cluster.. it's powered by ATS and the procedure is different [11:57:13] !log Reopen EU SWAT to deploy throttle rule for October 02 (T234113) [11:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:17] T234113: Lift IP cap on 2019-10-02 for Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T234113 [11:58:11] (03CR) 10Urbanecm: [C: 03+2] New throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539661 (https://phabricator.wikimedia.org/T234113) (owner: 10Urbanecm) [11:58:24] (03Merged) 10jenkins-bot: New throttle rule for Czech wiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539661 (https://phabricator.wikimedia.org/T234113) (owner: 10Urbanecm) [11:58:56] 10Operations, 10Puppet, 10netops: Investigate improvements to how puppet manages interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) [11:59:46] (03CR) 10Jbond: "> I don't see a quick and clean fix either. 
This is for now fixed so indeed lower priority, but we should probably start discussing about " [puppet] - 10https://gerrit.wikimedia.org/r/539462 (https://phabricator.wikimedia.org/T233906) (owner: 10Alexandros Kosiaris) [11:59:53] 10Operations, 10Wikidata, 10Wikidata-Campsite: mobile termbox missed icons - https://phabricator.wikimedia.org/T234183 (10alaa_wmde) [12:00:13] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 3f4f242: New throttle rule for Czech wiki course (T234113) (duration: 00m 56s) [12:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:36] !log EU SWAT done #2 [12:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539843 (owner: 10Elukey) [12:01:13] (03CR) 10Jbond: [C: 03+2] profile::idp use correct variable names [puppet] - 10https://gerrit.wikimedia.org/r/539854 (owner: 10Jbond) [12:02:45] (03CR) 10Muehlenhoff: [C: 03+1] "I think that's fine, if it turns out to be an issue going forward we can revisit this." [puppet] - 10https://gerrit.wikimedia.org/r/539847 (owner: 10Elukey) [12:03:52] 10Operations, 10Wikidata, 10Wikidata-Campsite: mobile termbox missed icons - https://phabricator.wikimedia.org/T234183 (10Gilles) Check again, all affected SVGs should be fixed, we've purged them, no hash change required. 
[12:09:25] (03PS3) 10Arturo Borrero Gonzalez: openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) [12:09:27] (03PS1) 10Arturo Borrero Gonzalez: openstack: newton: neutron: introduce WMF patches [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) [12:10:09] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [12:10:25] (03CR) 10jerkins-bot: [V: 04-1] openstack: newton: neutron: introduce WMF patches [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [12:11:16] 10Operations, 10Wikidata, 10Wikidata-Campsite: mobile termbox missed icons - https://phabricator.wikimedia.org/T234183 (10alaa_wmde) >>! In T234183#5533729, @Gilles wrote: > Should be fixed now. Thanks @Gilles For the record, was it fixed by purging the cache for these svgs? [12:14:52] (03PS2) 10Arturo Borrero Gonzalez: openstack: newton: neutron: introduce WMF patches [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) [12:21:18] I'm updating cxserver, unless there is something else going on deploy1001. [12:21:52] Checked calendar & topic, so not, I guess! [12:22:27] 10Operations, 10Wikidata, 10Wikidata-Campsite: mobile termbox missed icons - https://phabricator.wikimedia.org/T234183 (10Gilles) Yes, exactly, the Varnish caches for these SVGs were purged. That was the fix. We should have done that when the VCL change was rolled out, we just didn't anticipated that Varnish... 
[12:23:08] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2019-09-26-034732-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/539520 (https://phabricator.wikimedia.org/T233834) (owner: 10KartikMistry) [12:23:19] (03Merged) 10jenkins-bot: Update cxserver to 2019-09-26-034732-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/539520 (https://phabricator.wikimedia.org/T233834) (owner: 10KartikMistry) [12:24:53] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [12:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:28] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [12:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:09] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10WMDE-leszek) [12:28:30] (03PS2) 10Jbond: puppetmaster2002: Offline puppetmaster2002 to upgrade [puppet] - 10https://gerrit.wikimedia.org/r/539322 (https://phabricator.wikimedia.org/T233915) [12:28:43] 10Operations, 10Wikidata, 10Wikidata-Campsite (Wikidata-Campsite-Iteration-∞): mobile termbox missed icons - https://phabricator.wikimedia.org/T234183 (10alaa_wmde) [12:29:02] !log offline puppetmaster2002 to reimage https://gerrit.wikimedia.org/r/c/operations/puppet/+/539322 [12:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:35] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[12:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:11] (03CR) 10Jbond: [C: 03+2] puppetmaster2002: Offline puppetmaster2002 to upgrade [puppet] - 10https://gerrit.wikimedia.org/r/539322 (https://phabricator.wikimedia.org/T233915) (owner: 10Jbond) [12:33:32] !log Update cxserver to 2019-09-26-034732-production (T233834, T232674, T233085) [12:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:39] T233834: Add nqowiki to cxserver - https://phabricator.wikimedia.org/T233834 [12:33:39] T232674: Content Translation substituted a template with TemplateStyles - https://phabricator.wikimedia.org/T232674 [12:33:39] T233085: Language templates causing problems in both the source and translation - https://phabricator.wikimedia.org/T233085 [12:38:05] 10Operations, 10Puppet, 10Patch-For-Review: Rebuild puppet master backends - https://phabricator.wikimedia.org/T233915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['puppetmaster2002.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reima... [12:43:22] (03CR) 10Krinkle: [C: 04-1] "Looks like this would discard the other fields, and rename exception.message to message? If I understand correctly." [puppet] - 10https://gerrit.wikimedia.org/r/539621 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [12:46:35] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10EYener) Hi @Nuria I have created a Wikitech account. My username is my full name, Erin Yener. Please le... 
[12:47:51] (03CR) 10Elukey: profile::kerberos::replication: whitelist the correct krb host in ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539843 (owner: 10Elukey) [12:48:10] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10jkumalah) HI @Nuria my wikitech is aslo my full name, Jerrie Kumalah [12:51:22] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Andrew-WMDE) @Dzahn Sure, the SSH public key is this one P9126 (created exclusively for production): ` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC+POWSebb+Fiyxuw2qqMhMHt7DW5i6ZJ7h4H0... [12:51:26] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:56] (03PS1) 10CDanis: dbctl schema: disallow 's3' value when appropriate [software/conftool] - 10https://gerrit.wikimedia.org/r/539859 (https://phabricator.wikimedia.org/T233679) [12:53:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:05] (03CR) 10Filippo Giunchedi: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [12:57:44] 10Operations, 10LDAP-Access-Requests, 10Scoring-platform-team: Grant LDAP to Kevinbazira - https://phabricator.wikimedia.org/T234209 (10Halfak) [12:58:39] 10Operations, 10LDAP-Access-Requests, 10Scoring-platform-team: Grant LDAP to Kevinbazira - https://phabricator.wikimedia.org/T234209 (10Halfak) [13:01:34] 10Operations, 10Puppet, 10Patch-For-Review: Rebuild puppet master backends - https://phabricator.wikimedia.org/T233915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['puppetmaster2002.codfw.wmnet'] ` and were **ALL** successful. [13:03:09] (03CR) 10Elukey: [C: 03+2] profile::kerberos::replication: fix replicate_krb_database script [puppet] - 10https://gerrit.wikimedia.org/r/539580 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [13:03:18] (03PS4) 10Elukey: profile::kerberos::replication: fix replicate_krb_database script [puppet] - 10https://gerrit.wikimedia.org/r/539580 (https://phabricator.wikimedia.org/T226089) [13:06:22] 10Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (10MoritzMuehlenhoff) [13:15:11] (03PS2) 10Elukey: profile::kerberos::replication: whitelist the correct krb host in ferm [puppet] - 10https://gerrit.wikimedia.org/r/539843 [13:15:13] (03PS2) 10Elukey: profile::kerberos::kadminserver: add motd [puppet] - 10https://gerrit.wikimedia.org/r/539847 [13:15:24] (03PS1) 10Ottomata: Blacklist MessageGroupStatsRebuildJob from Hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/539860 [13:15:29] (03CR) 10Elukey: [C: 03+2] profile::kerberos::replication: whitelist the correct krb host in ferm [puppet] - 10https://gerrit.wikimedia.org/r/539843 (owner: 10Elukey) [13:17:53] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add motd [puppet] - 10https://gerrit.wikimedia.org/r/539847 (owner: 10Elukey) [13:19:37] (03CR) 10Ottomata: [C: 03+2] Blacklist MessageGroupStatsRebuildJob from Hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/539860 (owner: 10Ottomata) [13:19:44] (03PS2) 10Ottomata: Blacklist MessageGroupStatsRebuildJob from Hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/539860 [13:20:15] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Blacklist MessageGroupStatsRebuildJob from Hive refinement [puppet] - 10https://gerrit.wikimedia.org/r/539860 (owner: 10Ottomata) [13:21:15] (03PS1) 10Jbond: puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539862 
(https://phabricator.wikimedia.org/T233915) [13:21:57] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539862 (https://phabricator.wikimedia.org/T233915) (owner: 10Jbond) [13:26:35] (03PS1) 10Elukey: profile::kerberos::kadminserver: fix motd [puppet] - 10https://gerrit.wikimedia.org/r/539863 [13:27:00] (03CR) 10Jhedden: [C: 03+1] "Looks good overall, non-blocking question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [13:27:57] 10Operations, 10Wikimedia-Logstash, 10observability, 10serviceops, 10Patch-For-Review: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash - https://phabricator.wikimedia.org/T233828 (10Krinkle) p:05Triage→03High >>! In T233828#5532983, @Joe wrote: >[…] After looking furthe... [13:28:48] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10fgiunchedi) a:05fgiunchedi→03Papaul >>! In T233638#5531188, @Papaul wrote: > @fgiunchedi all yours Thanks @papaul ! I see two hosts in row A+D per netbox, and one in B... [13:30:25] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: fix motd [puppet] - 10https://gerrit.wikimedia.org/r/539863 (owner: 10Elukey) [13:30:54] 10Operations, 10ops-eqiad: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 (10fgiunchedi) >>! In T232367#5529697, @Jclark-ctr wrote: > @fgiunchedi 3 backend systems to replace ms-be101[6-8] Thanks @jclark-ctr, I wanted to make sure the 3x replacement sys...
[13:36:31] (03PS1) 10Elukey: profile::kerberos::kadminserver: fix motd - part 2 [puppet] - 10https://gerrit.wikimedia.org/r/539864 [13:38:18] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: fix motd - part 2 [puppet] - 10https://gerrit.wikimedia.org/r/539864 (owner: 10Elukey) [13:46:09] (03CR) 10Mathew.onipe: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:48:00] (03CR) 10Mathew.onipe: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:49:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:49:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:04] (03CR) 10Volans: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/539182 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [13:52:01] (03CR) 10Mathew.onipe: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:54:28] !log draining ganeti2005 for upcoming reboot (combined kernel/qemu security updates) [13:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:55] (03CR) 10Mathew.onipe: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [14:00:57] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:00:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:06] (03PS2) 10Herron: logstash: if php7.2-fpm message field is 
empty, use exception.message [puppet] - 10https://gerrit.wikimedia.org/r/539623 (https://phabricator.wikimedia.org/T233828) [14:03:44] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:01] (03CR) 10Herron: [C: 03+2] logstash: if php7.2-fpm message field is empty, use exception.message [puppet] - 10https://gerrit.wikimedia.org/r/539623 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [14:07:53] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) @dr0ptp4kt and how will this work? The graphs will start out as some placeholder image as the... [14:08:35] !log draining ganeti2006 for upcoming reboot (combined kernel/qemu security updates) [14:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:23] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) @sorry I fogot to mentioned that on the task, It was the faster and easier way to rack those servers. We can still move 1 host from D to C but it will take a while f... 
[14:16:56] krb1001's failures are know [14:16:58] *known [14:17:11] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:17:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:47] (03CR) 10Herron: [C: 03+2] logstash: if php7.2-fpm message field is empty, use exception.message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539623 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [14:18:05] (03PS1) 10Herron: Revert "logstash: if php7.2-fpm message field is empty, use exception.message" [puppet] - 10https://gerrit.wikimedia.org/r/539874 [14:20:39] (03CR) 10Herron: [C: 03+2] Revert "logstash: if php7.2-fpm message field is empty, use exception.message" [puppet] - 10https://gerrit.wikimedia.org/r/539874 (owner: 10Herron) [14:21:15] 10Operations, 10LDAP-Access-Requests, 10Scoring-platform-team: Grant LDAP to Kevinbazira - https://phabricator.wikimedia.org/T234209 (10Halfak) [14:22:02] (03PS1) 10Herron: logstash: if php7.2-fpm message field is empty, use exception.message [puppet] - 10https://gerrit.wikimedia.org/r/539875 (https://phabricator.wikimedia.org/T233828) [14:24:36] (03PS2) 10Herron: logstash: if php7.2-fpm message field is empty, use exception.message [puppet] - 10https://gerrit.wikimedia.org/r/539875 (https://phabricator.wikimedia.org/T233828) [14:27:34] (03CR) 10Herron: [C: 03+2] logstash: if php7.2-fpm message field is empty, use exception.message [puppet] - 10https://gerrit.wikimedia.org/r/539875 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [14:28:53] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) [14:29:31] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission 
db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) [14:29:51] !log draining ganeti2007 for upcoming reboot (combined kernel/qemu security updates) [14:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:58] (03PS2) 10Jbond: puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539862 (https://phabricator.wikimedia.org/T233915) [14:36:48] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Papaul) [14:37:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539862 (https://phabricator.wikimedia.org/T233915) (owner: 10Jbond) [14:37:57] (03CR) 10Jbond: letsencrypt: Fix acme-setup script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539625 (owner: 10Paladox) [14:38:06] (03PS4) 10Jbond: letsencrypt: Fix acme-setup script [puppet] - 10https://gerrit.wikimedia.org/r/539625 (owner: 10Paladox) [14:40:18] (03CR) 10Jbond: [C: 03+2] letsencrypt: Fix acme-setup script [puppet] - 10https://gerrit.wikimedia.org/r/539625 (owner: 10Paladox) [14:40:33] (03PS3) 10Jbond: puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539862 (https://phabricator.wikimedia.org/T233915) [14:41:07] paladox: mertging the letsencrypt change now [14:41:13] jbond42 thanks!! [14:41:26] thank you :) [14:41:56] (03CR) 10Jbond: [C: 03+2] puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539862 (https://phabricator.wikimedia.org/T233915) (owner: 10Jbond) [14:43:47] 10Operations, 10Maps, 10Wikimedia-Incident: Review sizing of maps cluster - https://phabricator.wikimedia.org/T228497 (10Joe) Hi @Gehel any news on this? 
[14:44:14] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:44:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:32] (03PS1) 10Jbond: fix yaml config [puppet] - 10https://gerrit.wikimedia.org/r/539880 [14:47:27] (03PS2) 10Jbond: puppetmaster2002: fix yaml config [puppet] - 10https://gerrit.wikimedia.org/r/539880 [14:47:36] PROBLEM - Host kubetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:12] (03CR) 10Jbond: [C: 03+2] puppetmaster2002: fix yaml config [puppet] - 10https://gerrit.wikimedia.org/r/539880 (owner: 10Jbond) [14:50:06] ^ kubetcd2003 is expected, reboot of ganeti2007 [14:50:15] (03PS1) 10Thcipriani: scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T233828) [14:50:32] RECOVERY - Host kubetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [14:54:24] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:55] jbond42: ^ [14:56:19] !log shutting down scs-c1-codfw for replacement [14:56:27] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10daniel) p:05Triage→03Normal [14:57:07] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10Nuria) ping @herron I guess jerrie can be added to wmf and Erin to nda groups in ldap?
[14:57:10] 10Operations, 10Wikimedia-Logstash, 10observability, 10serviceops, 10Patch-For-Review: Errors managed by php-wmerrors (like OOMs) lack normalized_message on logstash - https://phabricator.wikimedia.org/T233828 (10herron) Copying `excepetion.message` to `message` looks to have made an improvement here. I... [15:01:21] (03PS1) 10Jbond: puppetmaster2002: use correct hiera key [puppet] - 10https://gerrit.wikimedia.org/r/539882 [15:04:31] (03CR) 10Jbond: [C: 03+2] puppetmaster2002: use correct hiera key [puppet] - 10https://gerrit.wikimedia.org/r/539882 (owner: 10Jbond) [15:05:48] 10Operations, 10Puppet, 10netops: Investigate improvements to how puppet manages interfaces - https://phabricator.wikimedia.org/T234207 (10akosiaris) p:05Triage→03Lowest [15:06:50] (03CR) 10Krinkle: [C: 03+1] scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T233828) (owner: 10Thcipriani) [15:07:44] (03CR) 10Krinkle: [C: 03+1] Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [15:07:53] (03CR) 10jerkins-bot: [V: 04-1] scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T233828) (owner: 10Thcipriani) [15:07:55] (03CR) 10Arturo Borrero Gonzalez: openstack: newton: neutron: introduce WMF patches (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [15:08:11] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10EYener) Hi @Nuria I do have an LDAP account: eyener-ctr I'm not certain if/what groups I belong to, how...
[15:09:01] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:09:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:50] (03PS2) 10Thcipriani: scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T233828) [15:10:48] PROBLEM - Host kubetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:06] (03CR) 10CRusnov: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/539182 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [15:11:44] (03PS3) 10Thcipriani: scap: mediawiki logstash_checker [puppet] - 10https://gerrit.wikimedia.org/r/539881 (https://phabricator.wikimedia.org/T233828) [15:13:34] (03CR) 10Jhedden: [C: 03+1] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [15:13:52] ^ kubetcd2001 is expected, reboot of ganeti2008 [15:15:34] RECOVERY - Host kubetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 36.35 ms [15:16:43] (03PS1) 10Jbond: puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539885 [15:21:58] (03CR) 10Jbond: [C: 03+2] puppetmaster2002: bring puppetmaster2002 back online [puppet] - 10https://gerrit.wikimedia.org/r/539885 (owner: 10Jbond) [15:36:09] 10Operations, 10observability: acme-chief hosts not in Prometheus - https://phabricator.wikimedia.org/T234232 (10fgiunchedi) [15:39:05] (03PS1) 10Jhedden: openstack: add designate to eqiad1 haproxy [puppet] - 10https://gerrit.wikimedia.org/r/539889 (https://phabricator.wikimedia.org/T223907) [15:42:01] !log failover Ganeti master in codfw to ganeti2001 [15:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:42:16] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18679/" [puppet] - 10https://gerrit.wikimedia.org/r/539889 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:46:12] !log cdanis@cumin1001 START - Cookbook sre.ganeti.makevm [15:46:12] !log cdanis@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [15:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:25] 10Operations, 10Analytics, 10Code-Stewardship-Reviews, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (10Nuria) [15:49:17] !log installing console-setup bugfixes from Buster 10.1 point release [15:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] 10Operations, 10Maps, 10Wikimedia-Incident: Review sizing of maps cluster - https://phabricator.wikimedia.org/T228497 (10Gehel) >>! In T228497#5534268, @Joe wrote: > Hi @Gehel any news on this? 
Sadly, no news yet :( [15:53:03] (03PS1) 10Herron: admin: add eyener to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539891 (https://phabricator.wikimedia.org/T233636) [15:53:05] (03PS1) 10Herron: admin: add jkumalah to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539892 (https://phabricator.wikimedia.org/T233636) [15:54:23] (03PS1) 10CDanis: grafana1002: new VM for grafana 6.x [dns] - 10https://gerrit.wikimedia.org/r/539894 (https://phabricator.wikimedia.org/T220838) [15:54:59] (03CR) 10jerkins-bot: [V: 04-1] grafana1002: new VM for grafana 6.x [dns] - 10https://gerrit.wikimedia.org/r/539894 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [15:55:21] 10Operations, 10serviceops, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Mholloway) [15:56:25] (03PS2) 10CDanis: grafana1002: new VM for grafana 6.x [dns] - 10https://gerrit.wikimedia.org/r/539894 (https://phabricator.wikimedia.org/T220838) [15:57:05] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Papaul) [16:00:17] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana1002: new VM for grafana 6.x [dns] - 10https://gerrit.wikimedia.org/r/539894 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [16:00:54] (03PS12) 10Alexandros Kosiaris: rsyslog: populate kubernetes configuration [puppet] - 10https://gerrit.wikimedia.org/r/538627 (https://phabricator.wikimedia.org/T207200) [16:00:56] (03PS1) 10Phamhi: tools-webservice: Disallow restart unless webservice type is defined in advance. 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 [16:00:58] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] rsyslog: populate kubernetes configuration [puppet] - 10https://gerrit.wikimedia.org/r/538627 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [16:03:05] 10Operations, 10serviceops, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Mholloway) [16:03:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/539889 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [16:03:41] (03CR) 10Jhedden: [C: 03+2] openstack: add designate to eqiad1 haproxy [puppet] - 10https://gerrit.wikimedia.org/r/539889 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [16:03:55] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.x - https://phabricator.wikimedia.org/T220838 (10Krinkle) [16:03:57] (03PS2) 10Jhedden: openstack: add designate to eqiad1 haproxy [puppet] - 10https://gerrit.wikimedia.org/r/539889 (https://phabricator.wikimedia.org/T223907) [16:05:01] (03CR) 10Andrew Bogott: [C: 03+1] openstack: add designate to eqiad1 haproxy [puppet] - 10https://gerrit.wikimedia.org/r/539889 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [16:07:13] (03PS2) 10Phamhi: tools-webservice: Disallow restart unless webservice type is defined in advance. 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) [16:08:17] (03CR) 10Krinkle: [C: 03+2] Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:08:24] * Krinkle staging on mwdebug1002 [16:10:09] Daimona: (continuing chat here in public) [16:10:35] Will stage on mwdebug1002 once it merges, then after some testing, I'll pull it to mwmaint1002 for a maintenance script sanity test [16:11:05] Once we're comfortable with it working well on test wikis / mediawiki.org on mwdebug1002 can create graphs and then roll out for real to group0 [16:12:23] Fine to me [16:14:11] (03PS3) 10Daimona Eaytoy: Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) [16:14:22] Sounds like it didn't merge [16:14:51] (03CR) 10Krinkle: [C: 03+2] Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:15:07] Yeah, it's ff-only [16:15:12] fast-forward merge [16:15:17] needed rebase [16:15:22] stricter repo settings for this one [16:15:51] (03Merged) 10jenkins-bot: Use AbuseFilterCachingParser for group0 and deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [16:16:06] Merged now [16:17:32] Yep, staging now [16:20:02] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) [16:20:48] Daimona: should be live [16:21:35] Cool, gonna play around a little bit [16:21:49] (03PS4) 10BryanDavis: 
openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:22:50] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={rsyslog-info,rsyslog-notice} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+p [16:22:50] -cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:23:53] 10Operations, 10Product-Analytics, 10Wikidata, 10Wikidata-Query-Service, and 4 others: MIgrate WDQS to new logging pipeline - https://phabricator.wikimedia.org/T232184 (10debt) 05Open→03Resolved [16:24:22] Nothing weird [16:24:29] (03CR) 10CDanis: [C: 03+2] grafana1002: new VM for grafana 6.x [dns] - 10https://gerrit.wikimedia.org/r/539894 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [16:24:38] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:24:50] mark [16:24:53] 10Operations, 10MediaWiki-Vagrant, 10phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (10Mainframe98) [16:25:24] ma.rk sorry ignore that [16:25:26] !log cdanis@cumin1001 START - Cookbook sre.ganeti.makevm [16:25:26] !log cdanis@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [16:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:48] Krinkle: green light [16:25:50] !log cdanis@cumin1001 START - Cookbook 
sre.ganeti.makevm [16:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:55] Tested on testwiki and mw.o [16:26:10] Daimona: https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002? [16:26:25] those warnings are ok, right? [16:26:31] I mean, they're non-fatal, but also, not regression? [16:26:48] Ahah, I'm using a raw search for host+channel, but well... Yes, they're not regressions [16:26:56] It's https://phabricator.wikimedia.org/T230256 [16:27:10] ok, now, for graphs [16:27:25] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) Logs are now making it to logstash so I am gonna boldly resolve this. That being... [16:27:27] these few test edits should have lazy-created the graphite metrics [16:27:48] Confirming, I see them on graphite [16:27:55] (03PS3) 10Lucas Werkmeister (WMDE): fatalmonitor: exec watch [puppet] - 10https://gerrit.wikimedia.org/r/499761 [16:28:11] Based on what I tried out on grafana-labs, I think we should have the following [16:28:54] 1 - "Individual filter runs" within "parser profiling" should also show the sample_rate for cachingparser_full [16:29:09] And possibly the sum of oldparser and cachingparser [16:29:34] 2-the "per filter runtime" graph under "parser profiling" should also have the p95 for cachingparser_full [16:30:01] 10Operations, 10Analytics, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Milimetric) p:05Normal→03High [16:30:06] 3-Ideally also a graph with cachingparser_full, cachingparser_evaltree and cachingparser_buildtree, although that'll be especially useful later on [16:31:06] OK. I'm duplicating the old parser timing one for caching new [16:31:14] and adding cache parser to the counter one as stack? 
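[editor's note] The graph requests above (a p95 line per parser metric, plus a total of the old and new parser counters) can be illustrated with a minimal sketch. It is not the Grafana or Graphite implementation: the function names and sample values are hypothetical, and nearest-rank is only one of several percentile definitions.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of timings."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def total_runs(old_counts, new_counts):
    """Point-by-point sum of two counter series (old parser + caching parser)."""
    return [old + new for old, new in zip(old_counts, new_counts)]


# Hypothetical data: 100 timings and two short counter series.
timings = list(range(1, 101))
print(p95(timings))
print(total_runs([10, 12, 9], [1, 3, 2]))
```

The total is kept as a separate series so it can be drawn unstacked next to the stacked per-parser counters.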
[16:32:27] So, based on what I see right now on grafana [16:32:32] (03PS5) 10BryanDavis: openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:32:40] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@79db711]: Take job domain into account for deduplication T234226 [16:32:45] The counter one should have a sum of both parsers [16:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:46] T234226: MassMessage not delivering messages - https://phabricator.wikimedia.org/T234226 [16:32:51] !log krinkle@deploy1001 Synchronized wmf-config/abusefilter.php: 0aa4b4b5ab9a2e4 (duration: 00m 57s) [16:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:58] The rest is fine [16:33:31] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:33:50] The only thing left is (3), which should have cachingParser_full, cachingParser_buildtree and cachingParser_eval, all p95 I'd say [16:33:58] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@79db711]: Take job domain into account for deduplication T234226 (duration: 01m 17s) [16:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:18] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0aa4b4b5ab9a2e4 (duration: 00m 57s) [16:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:57] Daimona: added total as well [16:35:21] stack(old, new) and Y-axis 2 non-stack total [16:35:28] !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [16:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:35] Looking good [16:35:50] BTW 
you can erase my experiments in grafana-labs [16:36:06] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:36:23] (03PS6) 10BryanDavis: openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:36:33] I'll delete the dashboard for now, unless you find some use in it? [16:36:41] I imported it only recently as copy from prod. [16:36:50] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:36:54] No use [16:36:57] Feel free to delete [16:37:00] If we need it again, can re-import current [16:37:01] ok :) [16:37:27] Daimona: btw, to copy, I use "settings > json" and then copy that manually into the other one [16:37:43] Thanks, could come up handy [16:38:12] Daimona: btw, this will run "global" filters with the local parser setting, right? [16:38:14] that's ok, just checkin [16:38:25] Yes [16:38:35] cool, that's good actually, then we also get test coverage from that [16:38:41] That should be fine tho [16:39:25] I'm gonna try to add a graph :D [16:40:03] (03PS7) 10Andrew Bogott: openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:41:11] Daimona: ok, go ahead. 
I've saved my edits now [16:43:00] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [16:43:04] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10serviceops, and 4 others: Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (10akosiaris) 05Open→03Resolved [16:43:23] (03CR) 10Andrew Bogott: [C: 03+2] openstack: neutron: newton: refresh original files [puppet] - 10https://gerrit.wikimedia.org/r/539853 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:43:25] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Cmjohnson) From time to time the mgmt/idrac becomes unresponsive, we will need to power off the host for 10-30secs. Please depool this server and we'll take care of it. [16:43:41] (03PS5) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [16:43:58] (03PS6) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [16:44:02] OK, doing it now [16:44:15] BTW, is the change in prod already? [16:44:19] (03CR) 10BryanDavis: tools-webservice: Disallow restart unless webservice type is defined in advance. (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) (owner: 10Phamhi) [16:44:24] (03PS7) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [16:44:33] XioNoX: OSPF alerts --^ [16:44:59] elukey: this, again... 
https://phabricator.wikimedia.org/T228827 [16:45:05] 10Operations, 10ops-eqiad, 10serviceops: mw1286.mgmt is down - https://phabricator.wikimedia.org/T234009 (10Cmjohnson) a:03Jclark-ctr John, please reseat the green mgmt cable [16:45:08] is it a tunnel down over telia [16:45:08] ? [16:45:10] ah ok reading [16:45:14] (03CR) 10Volans: "REplies inline, no changes yet." (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (owner: 10Volans) [16:45:42] (03PS3) 10Andrew Bogott: openstack: newton: neutron: introduce WMF patches [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:45:46] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:45:48] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10Cmjohnson) a:03Jclark-ctr @Jclark-ctr Can you please do this. 
let me know if you have questions [16:45:59] (03PS7) 10Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359 [16:46:08] elukey: actually yeah it's the tunnel [16:46:27] that link has been going doing often so it's my guess before I actually look at logs [16:46:27] (03PS4) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [16:46:28] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:46:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:46:52] Krinkle: done editing [16:46:55] but in this case it was gr-4/3/0.1 [16:47:08] (03CR) 10Andrew Bogott: [C: 03+2] openstack: newton: neutron: introduce WMF patches [puppet] - 10https://gerrit.wikimedia.org/r/539856 (https://phabricator.wikimedia.org/T233665) (owner: 10Arturo Borrero Gonzalez) [16:47:11] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Cmjohnson) @marostegui can you please depool this. I will do around 1400UTC on Thursday [16:48:22] Daimona: nice. set unit to "seconds" as well? [16:48:34] Then it will render "4 ms" instead of 0.004 [16:48:34] Done, right now [16:48:45] cool [16:48:53] So are we done for now? [16:49:08] Daimona: technically these should be multiplied by 1000 in PHP to log ms only, but ok for now. [16:49:24] for consistency, but can do another time. [16:49:33] I missed it in CR [16:49:36] Yeah, we're done :) [16:49:47] Let's revisit in 2-3 days or maybe next Monday? [16:49:49] Sure [16:50:02] Well, I'd say 2-3 days should already give enough data, but let's see [16:50:22] I'll keep an eye on the graph and will ping you back around Wed. 
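[editor's note] The unit discussion above (values logged in seconds, displayed as "4 ms" instead of 0.004, with a later option to multiply by 1000 and log milliseconds) can be sketched as follows. This is an illustration only: the helper names are hypothetical, the formatting merely approximates what a "seconds" display unit does, and it is not the AbuseFilter or Grafana code.

```python
def seconds_to_ms(seconds: float) -> float:
    """Convert a duration in seconds to milliseconds (the 'multiply by 1000' step)."""
    return seconds * 1000.0


def render_duration(seconds: float) -> str:
    """Show sub-second values in ms and everything else in seconds."""
    if seconds < 1:
        return f"{seconds_to_ms(seconds):g} ms"
    return f"{seconds:g} s"


print(render_duration(0.004))  # a raw value of 0.004 seconds is shown as "4 ms"
```

Whichever side does the conversion, doing it in exactly one place avoids a value being scaled twice.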
[16:50:47] (03PS6) 10Andrew Bogott: wmcs postgres: make 'includes' an array [puppet] - 10https://gerrit.wikimedia.org/r/539554 [16:50:47] If we don't find any show stopper we can do group1 as well [16:51:09] Which is where we'll start getting some big data, and it would be a good idea to wait at least a week at that point [16:51:26] agreed [16:54:34] Good, thanks :) [16:57:37] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) >>! In T233698#5534931, @Cmjohnson wrote: > @marostegui can you please depool this. I will do around 1400UTC on Thursday Will have it ready by Thurs... [17:00:04] gehel and onimisionipe: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T1700). [17:00:14] (03PS1) 10Alexandros Kosiaris: Revert "rsyslog: populate kubernetes configuration" [puppet] - 10https://gerrit.wikimedia.org/r/539911 [17:00:36] (03PS2) 10Alexandros Kosiaris: Revert "rsyslog: populate kubernetes configuration" [puppet] - 10https://gerrit.wikimedia.org/r/539911 [17:00:42] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "rsyslog: populate kubernetes configuration" [puppet] - 10https://gerrit.wikimedia.org/r/539911 (owner: 10Alexandros Kosiaris) [17:04:48] (03CR) 10Jcrespo: "Np, I was also quite busy. Will deploy ASAP and will report how it goes." 
[puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [17:05:05] (03PS1) 10Alexandros Kosiaris: Revert "Revert "rsyslog: populate kubernetes configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/539912 [17:05:22] 10Operations, 10serviceops, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [17:05:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "To be merged once we have something for tools and zotero extraneous logging" [puppet] - 10https://gerrit.wikimedia.org/r/539912 (owner: 10Alexandros Kosiaris) [17:05:52] (03PS1) 10CDanis: grafana1002: autoinstall/netboot/role data [puppet] - 10https://gerrit.wikimedia.org/r/539913 (https://phabricator.wikimedia.org/T220838) [17:05:53] jouncebot: no deploy today [17:07:44] in that case, I'd like to deploy phatality [17:08:32] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "rsyslog: populate kubernetes configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/539912 (owner: 10Alexandros Kosiaris) [17:08:36] !log deploying minor update to phatality to fix T234223 [17:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:40] T234223: Single row view in Logstash hidden by Phatality - https://phabricator.wikimedia.org/T234223 [17:09:15] (03CR) 10CDanis: [C: 03+2] grafana1002: autoinstall/netboot/role data [puppet] - 10https://gerrit.wikimedia.org/r/539913 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [17:09:18] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@62e2870]: fix T234223 [17:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:26] (03PS2) 10CDanis: grafana1002: autoinstall/netboot/role data [puppet] - 10https://gerrit.wikimedia.org/r/539913 (https://phabricator.wikimedia.org/T220838) [17:10:25] !log deploy failed [17:10:27] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:40] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Papaul) [17:11:45] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [17:13:20] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC - https://phabricator.wikimedia.org/T226782 (10wiki_willy) [17:13:29] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) [17:14:00] (03PS1) 10Jforrester: Add the beta REL1_34 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539916 [17:15:19] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) New date for upgrading the remaining PDU on the network rack A1 will be targeting Tuesday, 10/15 at 11am UTC. 
Thanks, Willy [17:15:43] !log twentyafterfour@deploy1001 deploy aborted: fix T234223 (duration: 06m 24s) [17:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:47] T234223: Single row view in Logstash hidden by Phatality - https://phabricator.wikimedia.org/T234223 [17:18:21] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10wiki_willy) [17:18:36] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@62e2870]: fix T234223 [17:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:41] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@62e2870]: fix T234223 (duration: 00m 05s) [17:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:25] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@62e2870]: fix T234223 [17:22:33] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@62e2870]: fix T234223 (duration: 00m 07s) [17:25:34] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:33:27] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@62e2870]: fix T234223 [17:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:31] T234223: Single row view in Logstash hidden by Phatality - https://phabricator.wikimedia.org/T234223 [17:36:31] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@62e2870]: fix T234223 (duration: 03m 03s) [17:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:31] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T234018 (10Bstorm) Apparently, updating the firmware didn't do it. This system has something wrong with it's RAID system that likes to fail out lots of disks at a time (occasionally... 
[17:39:11] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T234018 (10Bstorm) [17:39:14] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) [17:40:32] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230442 (10Bstorm) [17:40:35] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) [17:41:06] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) p:05Triage→03High [17:41:34] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) [17:41:40] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Bstorm) [17:42:11] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) Apparently, unfortunately, this is still misbehaving. Will gather more details shortly. [17:43:19] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) T234018 <-- if this turns out to be a normal failed disk, that'd be great. [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor I ♥ Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T1800). 
[18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:11:05] (03CR) 10Muehlenhoff: [C: 03+1] "One comment inline, but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:11:46] (03PS8) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [18:12:00] (03CR) 10Paladox: Gerrit: Support java 8 under buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:19:39] (03PS3) 10Phamhi: tools-webservice: Run update_manifest() on restart. [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) [18:20:08] (03PS8) 10Dzahn: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359 (owner: 10Paladox) [18:21:05] (03CR) 10Phamhi: "> Patch Set 2:" (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) (owner: 10Phamhi) [18:21:36] (03CR) 10Dzahn: [C: 03+2] gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/536359 (owner: 10Paladox) [18:21:55] (03PS9) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [18:21:59] mutante thanks! [18:24:37] (03PS1) 10Filippo Giunchedi: hieradata: add acmechief cluster [puppet] - 10https://gerrit.wikimedia.org/r/539927 (https://phabricator.wikimedia.org/T234232) [18:27:43] (03CR) 10Bstorm: "Looks good! 
https://puppet-compiler.wmflabs.org/compiler1002/18680/tools-sgebastion-07.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [18:28:06] (03PS5) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [18:28:12] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [18:29:52] (03PS7) 10Bstorm: dologmsg: add manpage [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [18:31:05] (03PS1) 10Ottomata: Set up Presto on an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/539930 [18:32:30] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add acmechief cluster [puppet] - 10https://gerrit.wikimedia.org/r/539927 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [18:33:19] (03PS2) 10Filippo Giunchedi: hieradata: add acmechief cluster [puppet] - 10https://gerrit.wikimedia.org/r/539927 (https://phabricator.wikimedia.org/T234232) [18:34:51] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:36:27] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:36:53] (03CR) 10Thcipriani: [C: 03+1] "LGTM, left a rambling inline comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:37:06] paladox: per other venue, yes, let's change that in puppet to have a clean parameter for it instead of comparing the host name [18:37:20] that "trick" works when you have 2 servers but with a 3rd in the mix not so much [18:37:23] yup! 
[18:37:29] * paladox will create that change [18:37:37] very nice, thanks paladox [18:38:43] (03PS1) 10Paladox: Gerrit: Introduce 'is_slave' paramater [puppet] - 10https://gerrit.wikimedia.org/r/539932 [18:39:07] (03PS10) 10Dzahn: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:40:51] * paladox renames that param [18:43:05] 10Operations, 10Traffic: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (10Andrew) [18:44:32] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Andrew) Is T210008.wikistats.eqiad.wmflabs associated with this bug? It has had broken puppet for many weeks -- perhaps it can be deleted? [18:45:34] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) @Andrew Yea, it is. I will look into it later today. [18:47:18] (03PS8) 10Bstorm: dologmsg: add manpage [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [18:47:52] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) This ticket is superseded by T224247. krypton has meanwhile been replaced by miscweb1001/2001 (stretch). 
[18:48:10] (03PS2) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [18:48:20] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [18:48:25] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:48:54] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) [18:48:57] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [18:48:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:49:18] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [18:49:21] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) [18:49:23] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:50:17] (03PS3) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [18:50:23] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [18:50:25] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [18:50:58] (03PS2) 10Ottomata: Set up Presto on an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/539930 [18:51:01] (03CR) 10Dzahn: [C: 04-1] "Evaluation Error: Error creating type specialization of an Enum-Type, Cannot use Integer where 
String is expected at /srv/jenkins-workspac" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:51:03] (03CR) 10Bstorm: [C: 03+2] dologmsg: add manpage [puppet] - 10https://gerrit.wikimedia.org/r/513759 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [18:51:34] (03CR) 10Dzahn: [C: 04-1] "paladox: https://puppet-compiler.wmflabs.org/compiler1002/18682/cobalt.wikimedia.org/change.cobalt.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:52:22] (03PS1) 10Filippo Giunchedi: WIP profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 [18:52:45] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [18:53:27] (03PS4) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [18:53:33] ACKNOWLEDGEMENT - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP Ayounsi known - The acknowledgement expires at: 2019-10-01 20:52:58. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:33] ACKNOWLEDGEMENT - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 Ayounsi known - The acknowledgement expires at: 2019-10-01 20:52:58. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:53:33] ACKNOWLEDGEMENT - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP Ayounsi known - The acknowledgement expires at: 2019-10-01 20:52:58. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:53:58] (03PS5) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [18:54:04] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [18:57:01] (03PS2) 10Herron: admin: add eyener to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539891 (https://phabricator.wikimedia.org/T233636) [18:57:08] (03PS2) 10Filippo Giunchedi: WIP profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) [18:57:55] (03PS3) 10Ottomata: Set up Presto on an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/539930 [18:58:32] (03PS11) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [18:58:38] (03PS1) 10Jhedden: openstack: move haproxy exporter to controller role [puppet] - 10https://gerrit.wikimedia.org/r/539936 (https://phabricator.wikimedia.org/T223907) [18:58:43] (03PS12) 10Paladox: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 [18:58:49] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [18:59:19] (03CR) 10Filippo Giunchedi: [C: 03+1] WIP profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [18:59:30] (03CR) 10Ottomata: [C: 03+2] "Looking good!" [puppet] - 10https://gerrit.wikimedia.org/r/539930 (owner: 10Ottomata) [18:59:50] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) >>! In T210008#5535469, @Andrew wrote: > Is T210008.wikistats.eqiad.wmflabs associated with this bug? It has had broken puppet for many weeks -- perhaps it can be deleted? Per T2... 
[18:59:55] (03PS3) 10Filippo Giunchedi: profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) [19:00:28] (03PS4) 10Ottomata: Set up Presto on an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/539930 [19:00:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Set up Presto on an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/539930 (owner: 10Ottomata) [19:01:15] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [19:01:41] mutante i think i've fixed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538925/ [19:01:41] (03PS3) 10Herron: admin: add eyener to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539891 (https://phabricator.wikimedia.org/T233636) [19:01:57] also did the replica change in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539932/ [19:02:11] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [19:02:56] (03PS6) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [19:03:06] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:03:35] (03CR) 10Herron: [C: 03+2] admin: add eyener to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539891 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron) [19:04:25] (03PS7) 10Paladox: Gerrit: Introduce 'is_replica' 
paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [19:04:49] (03PS2) 10Jhedden: openstack: move haproxy exporter to controller role [puppet] - 10https://gerrit.wikimedia.org/r/539936 (https://phabricator.wikimedia.org/T223907) [19:05:56] (03PS8) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [19:06:17] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:06:32] (03PS2) 10Herron: admin: add jkumalah to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539892 (https://phabricator.wikimedia.org/T233636) [19:07:38] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/18686/" [puppet] - 10https://gerrit.wikimedia.org/r/539936 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [19:08:28] (03CR) 10Paladox: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1002/280/" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [19:08:45] (03CR) 10Herron: [C: 03+2] admin: add jkumalah to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/539892 (https://phabricator.wikimedia.org/T233636) (owner: 10Herron) [19:12:55] 10Operations, 10MobileFrontend, 10Traffic, 10Readers-Web-Backlog (Tracking): Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (10Jdlrobson) 05Open→03Resolved a:03Jdlrobson I'm considering this resolved. The cache was flushed for all wikis as part of T233095. 
[19:13:15] (03PS9) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [19:13:21] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:16:44] (03Abandoned) 10Herron: logstash: parse nested JSON in php7.2-fpm exception field [puppet] - 10https://gerrit.wikimedia.org/r/539621 (https://phabricator.wikimedia.org/T233828) (owner: 10Herron) [19:17:18] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:22] PROBLEM - Check systemd state on an-presto1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:18:04] ^^^ is me [19:18:06] new service [19:18:25] (03PS10) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [19:18:38] (03PS1) 10Ottomata: Fix settings discrepency between presto coordinator and workers [puppet] - 10https://gerrit.wikimedia.org/r/539943 [19:18:50] PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:00] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:37] (03CR) 10Ottomata: [C: 03+2] Fix settings discrepency between presto coordinator and workers [puppet] - 10https://gerrit.wikimedia.org/r/539943 (owner: 10Ottomata) [19:20:30] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) 05Open→03Resolved a:03herron `jkumalah` has been added to ldap group `wmf`, `eyener` has... [19:20:49] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:21:28] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10wiki_willy) a:03Cmjohnson [19:21:58] (03PS1) 10Ottomata: Remove api-request and cirrussearch-request from low volume mediawiki_events camus [puppet] - 10https://gerrit.wikimedia.org/r/539945 (https://phabricator.wikimedia.org/T233718) [19:22:07] 10Operations, 10LDAP-Access-Requests: Turnilo access for Jerrie Kumalah and Erin Yener (fundraising analysts) - https://phabricator.wikimedia.org/T233780 (10herron) 05Open→03Resolved a:03herron The resolution of parent task T233636 should address this ask as well. Please re-open if any follow up is neede... 
[19:22:10] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) [19:22:38] (03CR) 10jerkins-bot: [V: 04-1] Remove api-request and cirrussearch-request from low volume mediawiki_events camus [puppet] - 10https://gerrit.wikimedia.org/r/539945 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [19:23:11] (03PS2) 10Ottomata: Remove api-request and cirrussearch-request from mediawiki_events camus [puppet] - 10https://gerrit.wikimedia.org/r/539945 (https://phabricator.wikimedia.org/T233718) [19:23:51] (03CR) 10Paladox: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1002/283/" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:23:54] (03CR) 10Ottomata: [C: 03+2] Remove api-request and cirrussearch-request from mediawiki_events camus [puppet] - 10https://gerrit.wikimedia.org/r/539945 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [19:24:04] (03PS3) 10Ottomata: Remove api-request and cirrussearch-request from mediawiki_events camus [puppet] - 10https://gerrit.wikimedia.org/r/539945 (https://phabricator.wikimedia.org/T233718) [19:24:06] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove api-request and cirrussearch-request from mediawiki_events camus [puppet] - 10https://gerrit.wikimedia.org/r/539945 (https://phabricator.wikimedia.org/T233718) (owner: 10Ottomata) [19:25:12] (03CR) 10Jhedden: [C: 03+2] openstack: move haproxy exporter to controller role [puppet] - 10https://gerrit.wikimedia.org/r/539936 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [19:25:24] (03PS3) 10Jhedden: openstack: move haproxy exporter to controller role [puppet] - 10https://gerrit.wikimedia.org/r/539936 (https://phabricator.wikimedia.org/T223907) [19:26:22] PROBLEM - Check systemd state on an-presto1005 is CRITICAL: CRITICAL - degraded: The system is operational but 
one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:54] (03PS1) 10Ottomata: Fix connecttor name for presto hive [puppet] - 10https://gerrit.wikimedia.org/r/539946 [19:27:19] (03PS13) 10Dzahn: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [19:27:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18687/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [19:27:29] \o/ [19:27:49] (03CR) 10Ottomata: [C: 03+2] Fix connecttor name for presto hive [puppet] - 10https://gerrit.wikimedia.org/r/539946 (owner: 10Ottomata) [19:28:26] (03PS4) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [19:29:23] (03CR) 10BryanDavis: [C: 03+1] "LGTM. Not sure if you want to amend here to change the debian/* control files or do that as a separate commit, but either way this will ma" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) (owner: 10Phamhi) [19:29:36] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:20] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:31:49] (03PS14) 10Dzahn: Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [19:31:51] (03CR) 10jerkins-bot: [V: 04-1] logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [19:31:59] (03PS1) 10Ottomata: Fix hive connector name in presto coodinator hiera [puppet] - 10https://gerrit.wikimedia.org/r/539947 [19:32:21] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Gerrit: Support java 8 under buster [puppet] - 10https://gerrit.wikimedia.org/r/538925 (owner: 10Paladox) [19:32:23] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix hive connector name in presto coodinator hiera [puppet] - 10https://gerrit.wikimedia.org/r/539947 (owner: 10Ottomata) [19:33:47] (03PS2) 10Ottomata: Fix hive connector name in presto coodinator hiera [puppet] - 10https://gerrit.wikimedia.org/r/539947 [19:33:49] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix hive connector name in presto coodinator hiera [puppet] - 10https://gerrit.wikimedia.org/r/539947 (owner: 10Ottomata) [19:35:23] (03PS11) 10Paladox: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 [19:35:54] PROBLEM - Check systemd state on an-presto1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:16] (03PS4) 10Jhedden: openstack: move haproxy exporter to controller role [puppet] - 10https://gerrit.wikimedia.org/r/539936 (https://phabricator.wikimedia.org/T223907) [19:39:22] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:38] (03PS5) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [19:50:19] (03PS1) 10Ottomata: Add ferm rules for presto ports [puppet] - 10https://gerrit.wikimedia.org/r/539952 [19:50:53] (03PS1) 10Dzahn: wikistats (cloud): set host_name to $::fqdn by default [puppet] - 10https://gerrit.wikimedia.org/r/539953 [19:51:29] (03PS12) 10Dzahn: Gerrit: Introduce 'is_replica' paramater and rename slave_hosts [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:51:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18688/" [puppet] - 10https://gerrit.wikimedia.org/r/539932 (owner: 10Paladox) [19:53:03] (03CR) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [19:53:07] (03CR) 10jerkins-bot: [V: 04-1] wikistats (cloud): set host_name to $::fqdn by default [puppet] - 10https://gerrit.wikimedia.org/r/539953 (owner: 10Dzahn) [19:55:33] (03PS2) 10Ottomata: Add ferm rules for presto ports [puppet] - 10https://gerrit.wikimedia.org/r/539952 [19:57:45] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18690/" [puppet] - 10https://gerrit.wikimedia.org/r/539952 (owner: 10Ottomata) [19:57:53] (03PS3) 10Ottomata: Add ferm rules for presto ports [puppet] - 10https://gerrit.wikimedia.org/r/539952 [19:58:00] (03CR) 
10Ottomata: [V: 03+2 C: 03+2] Add ferm rules for presto ports [puppet] - 10https://gerrit.wikimedia.org/r/539952 (owner: 10Ottomata) [19:59:08] 10Operations, 10netops: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (10ayounsi) Great doc, thanks! We can use 2000 for reuse, the following will happen: Flaps up to 6000, then gets stable: 15 min -> 3000 30 min -> 1500 (unblocked as < 2000 ) Accepting a prefix 30m... [19:59:50] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10Nuria) Please make sure you can access https://turnilo.wikimedia.org [19:59:52] (03PS2) 10Dzahn: wikistats (cloud): set host_name to $::fqdn by default [puppet] - 10https://gerrit.wikimedia.org/r/539953 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: Time to snap out of that daydream and deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T2000). [20:00:08] (03PS4) 10Phamhi: tools-webservice: Run update_manifest() on restart. [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) [20:00:24] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@1f9fedd]: Update mobileapps to 131b83f [20:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:02] 10Operations, 10DBA: snapshot for s6/s7 at eqiad taken more than 4 days ago - https://phabricator.wikimedia.org/T234152 (10jcrespo) 05Open→03Resolved ` root@db1115.eqiad.wmnet[zarcillo]> SELECT * FROM backups WHERE section in ('s6', 's7') ORDER BY id desc LIMIT 5; +------+----------------------------------... [20:01:05] (03CR) 10BryanDavis: [C: 03+1] tools-webservice: Run update_manifest() on restart. 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) (owner: 10Phamhi) [20:01:24] (03CR) 10CDanis: [C: 03+1] profile: sanity checks for cluster [puppet] - 10https://gerrit.wikimedia.org/r/539934 (https://phabricator.wikimedia.org/T234232) (owner: 10Filippo Giunchedi) [20:01:31] (03CR) 10Phamhi: [C: 03+2] tools-webservice: Run update_manifest() on restart. [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) (owner: 10Phamhi) [20:02:07] (03Merged) 10jenkins-bot: tools-webservice: Run update_manifest() on restart. [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/539895 (https://phabricator.wikimedia.org/T218461) (owner: 10Phamhi) [20:02:10] (03CR) 10jerkins-bot: [V: 04-1] wikistats (cloud): set host_name to $::fqdn by default [puppet] - 10https://gerrit.wikimedia.org/r/539953 (owner: 10Dzahn) [20:03:13] 10Operations, 10ops-codfw, 10decommission: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Papaul) [20:03:41] 10Operations, 10ops-codfw, 10decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (10Papaul) [20:04:15] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Papaul) [20:04:24] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Papaul) [20:05:02] (03PS1) 10Paladox: test [puppet] - 10https://gerrit.wikimedia.org/r/539956 [20:05:24] (03PS2) 10Paladox: test [puppet] - 10https://gerrit.wikimedia.org/r/539956 [20:06:19] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@1f9fedd]: Update mobileapps to 131b83f (duration: 05m 55s) [20:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:38] (03CR) 10jerkins-bot: [V: 04-1] test [puppet] - 10https://gerrit.wikimedia.org/r/539956 (owner: 10Paladox) [20:07:57] 
(03PS3) 10Dzahn: wikistats (cloud): set host_name to $::fqdn [puppet] - 10https://gerrit.wikimedia.org/r/539953 [20:08:19] !sal [20:08:19] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [20:11:17] (03PS1) 10Ottomata: Release 0.266-2 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/539957 [20:14:25] (03CR) 10Ottomata: [C: 03+2] Release 0.266-2 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/539957 (owner: 10Ottomata) [20:14:41] (03CR) 10Dzahn: "is this applied on a labs instance?" [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [20:15:16] (03CR) 10Paladox: "> is this applied on a labs instance?" [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [20:16:14] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): set host_name to $::fqdn [puppet] - 10https://gerrit.wikimedia.org/r/539953 (owner: 10Dzahn) [20:16:23] (03PS4) 10Dzahn: wikistats (cloud): set host_name to $::fqdn [puppet] - 10https://gerrit.wikimedia.org/r/539953 [20:17:42] 10Operations, 10ops-codfw, 10decommission: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (10Papaul) [20:18:16] 10Operations, 10ops-codfw, 10decommission: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (10Papaul) [20:19:21] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Papaul) [20:20:11] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) [20:23:20] RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:57] (03PS1) 10Jhedden: openstack: haproxy fix stats url for 
exporter [puppet] - 10https://gerrit.wikimedia.org/r/539959 (https://phabricator.wikimedia.org/T223907) [20:24:33] (03PS2) 10Cwhite: hiera: update ores to pass statsd through statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) [20:24:48] (03PS6) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [20:24:54] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [20:25:37] !log arlolra@deploy1001 Started deploy [parsoid/deploy@a6da34c]: Updating Parsoid to 1922eb6 [20:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:23] (03CR) 10Jhedden: [C: 03+2] openstack: haproxy fix stats url for exporter [puppet] - 10https://gerrit.wikimedia.org/r/539959 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [20:27:21] (03PS1) 10Andrew Bogott: openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 [20:27:23] (03CR) 10Paladox: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1002/284/" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [20:27:54] (03PS2) 10Andrew Bogott: openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 [20:28:02] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 (owner: 10Andrew Bogott) [20:28:36] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 (owner: 10Andrew Bogott) [20:29:54] (03PS3) 10Andrew Bogott: openstack: neutron: newton: 
customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 [20:30:38] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 (owner: 10Andrew Bogott) [20:31:10] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10jkumalah) @Nuria I have access. Thank you! [20:31:45] (03PS4) 10Andrew Bogott: openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 [20:34:16] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@a6da34c]: Updating Parsoid to 1922eb6 (duration: 08m 39s) [20:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:04] (03CR) 10Andrew Bogott: [C: 03+2] openstack: neutron: newton: customize l3_agent_hack manifest for newton [puppet] - 10https://gerrit.wikimedia.org/r/539960 (owner: 10Andrew Bogott) [20:43:20] (03PS1) 10Dzahn: wikistats (cloud): support buster with PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/539966 [20:43:21] Updated Parsoid to 1922eb6 (T233459, T230359, T208070) [20:43:22] T230359: Create N'Ko Wikipedia - https://phabricator.wikimedia.org/T230359 [20:43:22] T233459: Undefined property: stdClass::$src in TokenUtils.php::tokensToString - https://phabricator.wikimedia.org/T233459 [20:43:23] T208070: Parse requests return gratuitous newlines - https://phabricator.wikimedia.org/T208070 [20:43:26] !log T208070 [20:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:35] !log Updated Parsoid to 1922eb6 (T233459, T230359, T208070) [20:43:39] ugh, sorry [20:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:31] (03PS2) 10Dzahn: wikistats (cloud): support buster with PHP 7.3 [puppet] - 
10https://gerrit.wikimedia.org/r/539966 [20:45:49] 10Operations, 10ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) [20:46:13] (03CR) 10Paladox: [C: 03+1] wikistats (cloud): support buster with PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/539966 (owner: 10Dzahn) [20:46:31] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): support buster with PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/539966 (owner: 10Dzahn) [20:46:56] 10Operations, 10ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) add to netbox as 'future-scs-a8-eqiad' [20:53:09] 10Operations, 10ops-eqiad, 10serviceops: mw1286.mgmt is down - https://phabricator.wikimedia.org/T234009 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson @Cmjohnson Reseated green mgmt cable [20:54:28] 10Operations, 10ops-eqiad: replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson [20:58:55] (03PS1) 10Dzahn: wikistats (cloud): support buster with PHP7.3, httpd module [puppet] - 10https://gerrit.wikimedia.org/r/539968 [21:00:01] (03CR) 10Paladox: [C: 03+1] wikistats (cloud): support buster with PHP7.3, httpd module [puppet] - 10https://gerrit.wikimedia.org/r/539968 (owner: 10Dzahn) [21:00:04] Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T2100). [21:00:58] (03CR) 10jerkins-bot: [V: 04-1] wikistats (cloud): support buster with PHP7.3, httpd module [puppet] - 10https://gerrit.wikimedia.org/r/539968 (owner: 10Dzahn) [21:01:37] 10Operations, 10ops-eqiad, 10serviceops: mw1286.mgmt is down - https://phabricator.wikimedia.org/T234009 (10Dzahn) 05Open→03Resolved Thanks @Jclark-ctr @Cmjohnson It seems to be working fine right now. If it comes back we will just reopen this ticket. Calling it resolved tentatively. 
[21:02:18] (03PS1) 10Ottomata: Allow presto to proxy as hdfs users when running queries [puppet] - 10https://gerrit.wikimedia.org/r/539969 [21:03:12] (03CR) 10Ottomata: [C: 03+2] Allow presto to proxy as hdfs users when running queries [puppet] - 10https://gerrit.wikimedia.org/r/539969 (owner: 10Ottomata) [21:03:58] (03PS2) 10Dzahn: wikistats (cloud): support buster with PHP7.3, httpd module [puppet] - 10https://gerrit.wikimedia.org/r/539968 [21:06:41] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): support buster with PHP7.3, httpd module [puppet] - 10https://gerrit.wikimedia.org/r/539968 (owner: 10Dzahn) [21:06:50] (03PS3) 10Dzahn: wikistats (cloud): support buster with PHP7.3, httpd module [puppet] - 10https://gerrit.wikimedia.org/r/539968 [21:08:27] !log delete BGP to AS131285 on cr1-eqsin [21:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:23] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet [21:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:45] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) depooled 17:09 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet [21:09:57] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) p:05Triageβ†’03Normal [21:10:21] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) a:03Jclark-ctr [21:10:26] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 266, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:15:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frban1001.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jclark-ctr) [21:17:00] (03PS1) 10Dzahn: mariadb::packages: support buster with libmariadb3 [puppet] - 
10https://gerrit.wikimedia.org/r/539973 [21:17:45] ACKNOWLEDGEMENT - SSH mw1290.mgmt on mw1290.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T234153 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb::packages: support buster with libmariadb3 [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [21:19:21] (03PS2) 10Dzahn: mariadb::packages: support buster with libmariadb3 [puppet] - 10https://gerrit.wikimedia.org/r/539973 [21:19:41] (03CR) 10Dzahn: [C: 03+1] "per comment "These are not used on production"" [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [21:20:12] (03CR) 10jerkins-bot: [V: 04-1] mariadb::packages: support buster with libmariadb3 [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [21:23:07] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (10Dzahn) From the paste above it looks like this was already set to the default a... 
[21:26:09] !log mw1290 - downtimed for onsite work on mgmt, depooled earlier [21:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:26] (03PS6) 10Jforrester: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [21:30:59] (03CR) 10Jforrester: [C: 03+2] Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [21:31:53] (03Merged) 10jenkins-bot: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [21:32:54] (03PS2) 10Jforrester: robots.txt: Remove old and disabled archive.org_bot rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358171 (https://phabricator.wikimedia.org/T7582) (owner: 10Framawiki) [21:33:10] (03CR) 10Jforrester: [C: 03+2] robots.txt: Remove old and disabled archive.org_bot rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358171 (https://phabricator.wikimedia.org/T7582) (owner: 10Framawiki) [21:34:06] (03Merged) 10jenkins-bot: robots.txt: Remove old and disabled archive.org_bot rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358171 (https://phabricator.wikimedia.org/T7582) (owner: 10Framawiki) [21:36:05] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) downtimed in Icinga for an hour.. mgmt and server and all services on them. first attempt the server came back just fine but mgmt was not fixed yet. we are trying a second time and leave it off longer. [21:36:45] (03PS3) 10Dzahn: mariadb::packages: support buster with libmariadb3 [puppet] - 10https://gerrit.wikimedia.org/r/539973 [21:37:02] (03CR) 10Jforrester: "This still hasn't been done, 8 months later. 
:-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484633 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [21:37:26] (03CR) 10jerkins-bot: [V: 04-1] mariadb::packages: support buster with libmariadb3 [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [21:38:29] !log sync failure on mw1290.eqiad.wmnet – Connection timed out [21:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:23] James_F: it's depooled and onsite is working on it. we did not remove it from scap hosts, the chance seemed pretty low ..but we hit it anyways [21:39:37] Ha, whoops. [21:39:43] it's because mgmt interface is broken [21:39:46] OK, if it's know, I'll just ignore. [21:39:47] the server itself is ok [21:39:52] yes please, thanks [21:39:55] Ah, yeah, I saw that task fly past. [21:40:08] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T222539 Drop no-op hacky disablement of MessageBlobStore::clear() (duration: 05m 13s) [21:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:12] T222539: Scap deployments are not purging MessageBlobStore (was: Stale localized messages) - https://phabricator.wikimedia.org/T222539 [21:40:26] Oy, 5 mins not 45 seconds. 
[21:42:06] !log jforrester@deploy1001 Synchronized robots.txt: Remove old InternetArchive bot rule that's been disabled since 2008 T7582 (duration: 00m 57s) [21:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:10] T7582: Disallow ia_archiver on user and user_talk pages (robots.txt) - https://phabricator.wikimedia.org/T7582 [21:43:59] (03PS4) 10Paladox: mariadb::packages: support buster with libmariadb3 [puppet] - 10https://gerrit.wikimedia.org/r/539973 (owner: 10Dzahn) [21:47:55] !log mw1290 - scap pull to get it in sync with latest deployment - it was down during scap run for T234153 [21:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:59] T234153: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 [21:48:28] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1290.eqiad.wmnet [21:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:11] (03PS1) 10Filippo Giunchedi: logstash: parse nested json from mmkubernetes [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) [21:50:36] (03PS7) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [21:50:46] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) rebooting it unfortunately did not fix mgmt yet. currently pooled again. [21:51:35] (03CR) 10Filippo Giunchedi: rsyslog: Correctly parse docker logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539519 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [21:57:48] (03CR) 10Filippo Giunchedi: "I went through some logs but couldn't find additional metadata like namespace or labels to expand into fields (e.g. 
to get the service/dep" [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [22:12:38] (03PS1) 10Filippo Giunchedi: Move ms-be2055 to row C [dns] - 10https://gerrit.wikimedia.org/r/539982 (https://phabricator.wikimedia.org/T233638) [22:12:49] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2036 [dns] - 10https://gerrit.wikimedia.org/r/539983 [22:13:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Move ms-be2055 to row C [dns] - 10https://gerrit.wikimedia.org/r/539982 (https://phabricator.wikimedia.org/T233638) (owner: 10Filippo Giunchedi) [22:13:40] (03CR) 10Filippo Giunchedi: [C: 03+2] Move ms-be2055 to row C [dns] - 10https://gerrit.wikimedia.org/r/539982 (https://phabricator.wikimedia.org/T233638) (owner: 10Filippo Giunchedi) [22:15:32] PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100% [22:16:17] 10Operations, 10ops-codfw, 10media-storage, 10Patch-For-Review: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10fgiunchedi) >>! In T233638#5534157, @Papaul wrote: > @sorry I fogot to mentioned that on the task, It was the faster and easier way to rack those serve... [22:44:45] (03PS2) 10Aaron Schulz: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 [22:51:18] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [22:52:04] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [22:54:53] (03PS1) 10Aaron Schulz: Configure allow_tcp_nagle_delay for mcrouter cache in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539985 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor I οΏ½ Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T2300). [23:00:05] AndyRussG: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:53] (03PS8) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) [23:02:48] o/ [23:03:21] (03CR) 10Bstorm: toolforge-kubernetes: restructure pod security policies (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [23:03:24] (03PS2) 10CRusnov: netbox: Add SCRIPTS_ROOT configuration [puppet] - 10https://gerrit.wikimedia.org/r/537243 [23:04:39] Anyone around for SWAT? [23:05:56] jouncebot: now [23:05:56] For the next 0 hour(s) and 54 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190930T2300) [23:08:26] (03CR) 10CRusnov: [C: 03+2] netbox: Add SCRIPTS_ROOT configuration [puppet] - 10https://gerrit.wikimedia.org/r/537243 (owner: 10CRusnov) [23:11:20] (03CR) 10Dzahn: "the link in the commit message links back to itself" [puppet] - 10https://gerrit.wikimedia.org/r/539676 (owner: 10MarcoAurelio) [23:12:35] AndyRussG: I can swat. [23:13:41] AndyRussG: You'll need to cherry pick the patch. [23:13:58] Niharika: hi! thanks! [23:14:19] Niharika: you mean for the submodule update for core? 
[23:15:03] AndyRussG: I assume you want to deploy to wmf.24? https://tools.wmflabs.org/versions/ [23:15:21] (03PS2) 10CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) [23:15:24] Niharika: yes, as indicate on the Deployments page [23:15:29] *indicated [23:15:46] Thanks so much in advance [23:16:15] AndyRussG: Patches need to be cherry-picked to the branch they need to be deployed to. I can do it for you. It can be done by using the 'cherry-pick' button on the gerrit page. [23:16:24] (03PS2) 10CRusnov: Add script to rotate backup dumps, and dump with timestamp [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) [23:16:46] AndyRussG: So the patch that will be deployed now becomes https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/539990/ [23:17:32] Niharika: ah ok thanks. Hmmm recent SWAT deploys I haven't done that, but yes, looks good! [23:17:42] or maybe I'm remembering wrong 8p [23:17:54] CN deploy procedures changed not that long ago [23:18:32] AndyRussG: This is the process for all extensions, afaik. [23:18:40] (03CR) 10CRusnov: Add script to rotate backup dumps, and dump with timestamp (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [23:18:42] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add script to rotate backup dumps, and dump with timestamp [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [23:18:48] Niharika: CN is a special snowflake but recently became less of one [23:19:09] We have a special deploy branch, wmf_deploy [23:19:25] Anyway, all good! [23:27:58] AndyRussG: You're right CN was a special snowflake but isn't one anymore and follows the same deploy process as other extensions. 
Good thing I found JamesF. :) [23:28:12] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/CentralNotice/resources/infrastructure/: CentralNotice: Replace deprecated editToken with csrfToken - T233538 (duration: 00m 57s) [23:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:15] T233538: CentralNotice needs 'editToken' replaced with 'csrfToken' - https://phabricator.wikimedia.org/T233538 [23:28:16] AndyRussG: Your change should be deployed now. [23:28:30] Let me know if it looks okay. [23:28:39] Niharika: ok so deployed everywhere [23:28:41] ? [23:28:50] I realize that in my conversation I forgot to do the testing step. [23:28:54] AndyRussG: Yeah. [23:29:01] Please check though. [23:29:02] ok one sec [23:29:05] yes of course [23:30:10] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) Nope. It is only showing 6 disks instead of the 10 it has on board. It is definitely malfunctioning. [23:30:46] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [23:31:40] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [23:34:23] Niharika: looks fine! [23:34:40] AndyRussG: Great. :) [23:35:41] Niharika: thanks so much! 
[23:38:06] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) The pattern I'm seeing is that it complains that a disk isn't functioning correctly, it resets it and then it is logged as removed. It is, n... [23:43:56] (03Abandoned) 10Dzahn: add mwmaint.discovery and point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/539635 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [23:47:10] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Bstorm) I don't see a record of whether they are the same four disks that were removed by the controller. However, I did record that it removed four... [23:48:02] (03Restored) 10Dzahn: add mwmaint.discovery and point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/539635 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [23:49:15] (03PS2) 10Dzahn: mediawiki::maintenance: add envoy for TLS termination for noc.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/539633 (https://phabricator.wikimedia.org/T210411) [23:51:16] (03PS2) 10Dzahn: add maintenance.discovery.wmnet and point to mwmaint1002 [dns] - 10https://gerrit.wikimedia.org/r/539635 (https://phabricator.wikimedia.org/T210411) [23:57:18] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:59:11] (03PS1) 10Dzahn: misc webservices: style cleanup and comment [dns] - 10https://gerrit.wikimedia.org/r/539992 [23:59:38] (03CR) 10jerkins-bot: [V: 04-1] misc webservices: style cleanup and comment [dns] - 10https://gerrit.wikimedia.org/r/539992 (owner: 10Dzahn)